[-] th3raid0r@programming.dev 3 points 3 months ago* (last edited 3 months ago)

> TPMs can be extracted with physical access

Sure, but IIRC, they'd still need my PIN (for TPM+PIN through cryptenroll). I don't think it's possible to do TPM backed encryption without a PIN on Linux.

EDIT: Oh wait, you can... Why anyone would is beyond me though.

[-] th3raid0r@programming.dev 2 points 3 months ago* (last edited 3 months ago)

This sounds like a Lenovo machine, or something with a similar MOK enrollment process.

I forget the exact process, but I recall needing to reset the Secure Boot keys in "setup mode" or something; then it would let me perform the MOK enrollment. If Secure Boot is greyed out in the BIOS, that's never Linux's fault. That's a manufacturer issue.

Apparently, some models of Lenovo don't even enable MOK enrollment and lock it down entirely, meaning you'd need binaries signed with Microsoft's keys, not your own. The only ways to get that are to be a high-up Microsoft employee OR to use a pre-signed shim from the distribution.

https://wiki.archlinux.org/title/Unified_Extensible_Firmware_Interface/Secure_Boot#Using_a_signed_boot_loader

For that case, Ubuntu and Fedora are better because, per the Ubuntu documentation, they do this by default:

> On Ubuntu, all pre-built binaries intended to be loaded as part of the boot process, with the exception of the initrd image, are signed by Canonical's UEFI certificate, which itself is implicitly trusted by being embedded in the shim loader, itself signed by Microsoft.

Once you have Secure Boot working on Ubuntu or Fedora, you could likely follow these steps to enable TPM+PIN: https://wiki.archlinux.org/title/Systemd-cryptenroll#Trusted_Platform_Module

There might be some differences as far as kernel module loading and ensuring you're using the right tooling for your distro, but most importantly, the bones of the process are the same.
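A rough sketch of that flow, on a distro where you control Secure Boot with sbctl. This is a sketch only, not a tested recipe: the device path is a placeholder, and the exact steps vary per distro.

```shell
# Sketch only. Run as root on the target machine; /dev/nvme0n1p2 is a
# placeholder for your LUKS partition.

# 1. With the firmware in Secure Boot "setup mode", create and enroll your keys.
#    (-m keeps Microsoft's certs enrolled too, which some hardware needs.)
sbctl create-keys
sbctl enroll-keys -m
sbctl sign-all                # sign the kernel/bootloader files sbctl tracks

# 2. Bind the LUKS volume to the TPM, gated behind a PIN (needs a recent systemd).
systemd-cryptenroll --tpm2-device=auto --tpm2-with-pin=yes /dev/nvme0n1p2

# 3. Let the initrd attempt TPM unlock, e.g. in crypttab:
#    cryptroot  UUID=...  none  tpm2-device=auto
```

On shim-based distros like Ubuntu or Fedora you'd skip the sbctl steps and only do the cryptenroll part, since the boot chain is already signed for you.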

OH! And if you aren't getting the Secure Boot option in the installer UI, that could be because the install media was booted in "legacy"/"MBR" mode. Gotta make sure it's booted in UEFI mode.

EDIT: One more important bit: you'll need to be on the latest NVIDIA drivers with the nvidia-open kernel modules. Otherwise you'll need to additionally sign your driver blobs and taint your kernel. nvidia-open is finally the default as of the latest driver release, but this might differ on a per-distro basis.

[-] th3raid0r@programming.dev 5 points 3 months ago

Yeah, no kidding. The same systemd that enables the very things OP is trying to enable...

systemd-boot + sbctl + systemd-cryptenroll, and voilà: TPM-backed disk encryption with a PIN or FIDO2 token.
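For the FIDO2-token variant, enrollment is a single command (device path is a placeholder; the token needs the hmac-secret extension):

```shell
# Placeholder device path; enrolls a FIDO2 token as an unlock factor for the volume
systemd-cryptenroll --fido2-device=auto /dev/nvme0n1p2
```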

AFAIK this should be doable in Ubuntu, it just requires some command-line-fu.

Last I heard the Fedora installer was aiming to better support this type of thing - not so sure about Ubuntu.

[-] th3raid0r@programming.dev 1 points 4 months ago

Hahah, good luck. Proton Drive is really terrible. I can't even upload a single 1GB file through the service.

[-] th3raid0r@programming.dev 6 points 4 months ago

Well, I mean, most corps trying to shoehorn AI into things are using Cloud implementations of the various "AI" solutions.

What, pay for our own datacenter? Nah.

Just import openai and add "the AI" that way. 🤦‍♂️


A coworker sent me this fantastic piece on getting Linux to boot off of Google Drive (and S3). Definitely a fun read!

(I'm not the author of this article)

[-] th3raid0r@programming.dev 1 points 5 months ago

It's not even a steaming pile of crap or anything. Since it's basically a managed distributed-database solution, there are limits to what we can do while maintaining strong consistency. Things generally take a long time and are very sequentially dependent. So we have automation, of course! Buuuut there's very little comfort or trust in what is by now very well-exercised automation, which is the number one barrier to removing many sources of toil. Too many human "check this thing visually before proceeding" steps blocking an otherwise well-automated process.

We are so damn close, but some key stakeholders keep wanting just one more thing in our platform support (we need ARM support, we need customer-managed PKI support, etc.), and we just don't get the latitude we need to actually make things reliable. It's like we're Cloud Platform, DevOps, QA, and SRE rolled into one, and they can't seem to make up their damn minds about which rubric to grade us on.

Hell, they keep asking us to cut back our testing-environment costs while demanding new platform features tested at scale. We could solve that with a set of automated, standardized QA environments, but it's almost impossible to get that type of work prioritized.

My direct manager is actually pretty great, but she found herself completely powerless after a recent reorg changed the director she reports to. So all the organizational progress we'd made was completely reset, and we're back to square one of having to explain what we want, except now we're having "Kubernetes!" shouted at us while we try to chart a path.

I'm already brushing up my resume, but I must say, the new gen-AI-dominated hiring landscape is weird and bad. Until then, I just have to do the best I can in this business-politics hell.

[-] th3raid0r@programming.dev 0 points 5 months ago* (last edited 5 months ago)
  1. They want to be notified of anything that could potentially slow down their system, so any anomaly. The catch is that they constantly change patterns because they introduce new workloads weekly, which wouldn't be a problem if they could better communicate their forecasts. And that's just one of a few dozen customers, again all with unique cluster configurations and needs.

  2. Yeah, it sucks. The first year was pretty great: we had a fully integrated, unified managed-services team and were getting some great automation done. Then they split the team in half to focus on a different flavor of our product (with an entirely new backend) and left the newer folks (myself included) maintaining the old product. We were even told to do minimal maintenance on the thing, since the new product would be the new norm. Then, once upper management remembered how contracts work, they decided we needed to support 3 new platforms without growing the team, all while onboarding new customers and growing the environment count. We're now in operational overload after some turnover that was backfilled with offshore support that has a very minimal presence.

  3. I have tried championing this, but I don't expect an ableist, masculinity-shaming person like you to understand a call for social pointers on how to "manage up".

"Man Up" - good lord, way to be an ass.

[-] th3raid0r@programming.dev 2 points 5 months ago

> I'd suggest to just set up automations to fix those things automatically. Let's say 80% CPU for 5 minutes is too high. OK, add an auto-scale rule at 65% CPU for 3 minutes to add an extra node to the cluster to balance the CPU load.

Sure, if it were a normal service and not a distributed database that takes days to scale. Days. It's not "add one node" and we're good; it's add node, migrate data, add node, migrate data... And in many cases, we have explicit instructions NOT to scale the customer, because they won't be able to afford the larger cluster.

Also, would you auto-scale for a 5-minute blip that goes away on its own and doesn't consistently recur? I certainly wouldn't. The customer might not be able to pay for the size we'd put them on.

Our customers can simultaneously demand that we respond to every alert AND that we not scale their cluster. Whose fuckin' idea this was, I've no clue.
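A back-of-envelope sketch of why sequential scale-out takes days (every number here is hypothetical, just to show the shape of the problem):

```shell
data_per_node_gb=4000   # hypothetical data held per existing node
rebalance_mb_s=100      # hypothetical sustained migration throughput
nodes_to_add=3

# Each new node must finish streaming its share before the next one starts.
per_node_s=$(( data_per_node_gb * 1024 / rebalance_mb_s ))
total_h=$(( per_node_s * nodes_to_add / 3600 ))
echo "~${total_h}h of back-to-back rebalancing before the cluster settles"
```

Even with generous throughput, that's well over a day of babysitting for three nodes, which is why "just add a node" isn't an answer here.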

> Like it sounds like you're saying the issues are caused by systems not being robust and a lack of automation… If they're this scared of outages and breaking SLAs, they should work on having fewer outages, or having fallbacks for when they occur.

No, that's reading far more into my statement than I'd hoped. The reliability is indeed there; it's VERY unlikely our managed database goes down due to a technology issue within our control. If it does, it's usually operator error. However, if it were down to just operator-error alerts and things actually impacting end users, my job would be a dream!

Automation is somewhat there, but a few stakeholders insist on human-validated steps. So, while I have an Ansible playbook for most issues, operating that playbook takes hours.

> But it could get pretty difficult to get management to do this kind of thing from random suggestions from some SRE. I'd probably talk with the team lead about this, and with other people on your team, 'cause you're probably not the only one with these issues. And then have a meeting with the entire dev/SRE team and management to point out that it's not sustainable the way it's going, along with suggestions to improve it.

Sure, if it were technical. But this is largely not a technical issue, as you had assumed. The issue is that someone with power gets to say that we must follow unreasonable customer requests to the letter, even when those requests run counter to our sustainability.

[-] th3raid0r@programming.dev 6 points 5 months ago

In the most recent case, perhaps that could be a fix, sure. But that won't work for every customer we have. It's certainly an idea worth bringing up with the team, provided management doesn't shoot it down. Thanks!

submitted 5 months ago* (last edited 5 months ago) by th3raid0r@programming.dev to c/ask_experienced_devs@programming.dev

I'm just so exhausted these days. We have formal SLAs, but it's not like they're ever followed. After all, Customer X needs to be notified within 5 minutes of any anomalous event in their cluster, and Customer Y is our biggest customer, so we give them the white-glove treatment.

Yadda yadda, bla bla. So on and so forth, almost every customer has some exception/difference in SLAs.

I was hired on as an SRE, but I'm just a professional dashboard-starer at this point. The number of times I've been alerted in the middle of the night because CPU ran high for 5 minutes is too damn high. Just so I can apologize to Mr. Customer that they maybe had a teensy slowdown during that window.

If I try to get us back to fundamentals and suggest we only alert on impact, not on short-lived anomalies, there's some surface-level agreement, but everyone seems to think "well, we might miss something, so we need to keep it".

It's like we're trying to prevent outages by monitoring for potential issues rather than actually making our system more robust and automatable.
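To make the "alert on impact, not blips" argument concrete, here's a minimal persistence-gated check: the kind of debounce that lets a 5-minute blip pass quietly while still paging on sustained load. All thresholds and samples are hypothetical.

```shell
threshold=80   # CPU % considered a breach (hypothetical)
required=3     # consecutive breaches before anyone gets paged
consecutive=0
paged=0

# Hypothetical probe samples: a two-sample blip, then a sustained breach.
for cpu in 85 90 70 85 88 92; do
    if [ "$cpu" -ge "$threshold" ]; then
        consecutive=$(( consecutive + 1 ))
    else
        consecutive=0          # blip recovered on its own; reset the counter
    fi
    [ "$consecutive" -ge "$required" ] && paged=1
done
echo "paged=$paged"
```

The first two breaches recover on their own and never page; only the sustained run at the end does. The same persistence idea exists natively in most alerting stacks (e.g. a "for:" duration on a rule) without any custom scripting.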

How do I convince these people that this isn't sustainable? That trying to "catch" incidents before they happen is a fool's errand? It's like that chart about the "war on drugs" showing exponential cost growth as you try to prevent ALL drug usage (which is impossible). Yet this tech company seems to think we should be trying to prevent all outages with excessive monitoring.

And that doesn't even get into the bonkers agreements we make with customers, like committing to deep-dive research on why two different environments have response times that differ by 1 ms.

Or the agreements that force us to complete customer-provided training without assessing how much training we've already committed to. It's entirely normal here to do 3-4x the HIPAA/PCI/compliance trainings when everyone else in the org only has to do one set.

I'm at a point where I'm considering moving on. This job just isn't sustainable and there's no interest in the org to make it sustainable.

But perhaps one of y'all managed to fix something similar in your org with a few key conversations and some effort? What other things could I try as a sort of final "Hail Mary" before heading for greener pastures?

th3raid0r

joined 8 months ago