this post was submitted on 18 Feb 2024

174 points (95.3% liked)

Ask Lemmy


What's the worst way you ever broke production? (lemm.ee)

submitted 9 months ago by [email protected] to c/[email protected]

104 comments

Fess up. You know it was you.

[–] [email protected] 5 points 9 months ago

Set off cascading event bus loops that ran out of control. Friends don’t let friends allow events to spawn more events.
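One common way to keep events from spawning events forever is a hop limit. This is a minimal sketch, not the system from the story: the `EventBus` class, the `MAX_HOPS` value, and the convention that a handler returns its follow-up events are all assumptions made up for illustration.

```python
from collections import defaultdict, deque

MAX_HOPS = 3  # hypothetical limit; pick one that fits the system


class EventBus:
    def __init__(self, max_hops=MAX_HOPS):
        self.handlers = defaultdict(list)
        self.max_hops = max_hops
        self.dropped = []  # events refused because they exceeded the hop limit

    def subscribe(self, name, handler):
        self.handlers[name].append(handler)

    def publish(self, name, payload=None):
        # Work through a queue instead of recursing, so a loop cannot
        # exhaust the call stack before the hop limit stops it.
        queue = deque([(name, payload, 0)])
        delivered = 0
        while queue:
            event, data, hops = queue.popleft()
            if hops > self.max_hops:
                self.dropped.append(event)
                continue
            for handler in self.handlers[event]:
                delivered += 1
                # A handler may return a list of (name, payload) follow-ups.
                for child, child_data in handler(data) or []:
                    queue.append((child, child_data, hops + 1))
        return delivered


bus = EventBus()
# Two handlers that would trigger each other forever without the guard.
bus.subscribe("a", lambda data: [("b", data)])
bus.subscribe("b", lambda data: [("a", data)])
delivered = bus.publish("a")
```

With the limit in place the ping-pong stops after a few deliveries, and the refused event ends up in `bus.dropped` where it can be logged or alerted on.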

[–] [email protected] 4 points 9 months ago* (last edited 9 months ago)

I seriously never had a major gaffe.

My buddy Donny, however, repartitioned and overwrote the wrong hard drive... Destroying video that took in the neighborhood of about 9,000 hours to render.

This was in ~~1996~~ 1997 so you can only imagine how devastating that was when our rendering farm was 10 machines with Pentium III's.

Seems trivial now when we have so much computing power at our fingertips, but 10 computers as a dedicated rendering farm was considered insane at that time.

[–] [email protected] 17 points 9 months ago (1 children)

Two things pop up

I once left an alert() in asking "what the fuck?". That was mostly laughed at, so no worries.
I accidentally dropped the production database and replaced it with the staging one. That was not laughed at.

[–] [email protected] 9 points 9 months ago

I once dropped a table in the production database. I did not replace it with the same table from staging.

On the bright side, we discovered our vendor wasn't doing daily backups.

[–] [email protected] 21 points 9 months ago (2 children)

Early in my career as a cloud sysadmin, shut down the production database server of a public website for a couple of minutes accidentally. Not that bad and most users probably just got a little annoyed, but it didn't go unnoticed by management 😬 had to come up with a BS excuse that it was a false alarm.

Because of the legacy OS image of the server, simply changing the disk size in the cloud management portal wasn't enough and it was necessary to make changes to the partition table via command line. I did my research, planned the procedure and fallback process, then spun up a new VM to test it out before trying it on prod. Everything went smoothly, except that at the moment I had to shut down and delete the newly created VM, I instead shut down the original prod VM because they had similar names.

Put everything back in place, and eventually resized the original prod VM, but not without almost suffering a heart attack. At least I didn't go as far as deleting the actual database server :D

[–] [email protected] 8 points 9 months ago

I tried to change ONE record in the production db but I forgot the WHERE clause and ended up changing over 2 MILLION records instead. Three hour production shutdown. Fun times.
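A cheap guard against that kind of mistake is to run the UPDATE inside a transaction and commit only if the affected row count is what you expected. This is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the `guarded_update` helper and the `records` table are hypothetical, and a real production database would need its own driver and transaction settings.

```python
import sqlite3


def guarded_update(conn, sql, params=(), expected_rows=1):
    """Run an UPDATE and commit only if it touched the expected number of rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            conn.rollback()  # wrong blast radius: undo everything
            return False
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO records (status) VALUES (?)", [("old",)] * 5)
conn.commit()

# Forgotten WHERE clause: would touch all 5 rows, so it is rolled back.
bad = guarded_update(conn, "UPDATE records SET status = 'new'")
# Intended statement: touches exactly one row, so it is committed.
good = guarded_update(conn, "UPDATE records SET status = 'new' WHERE id = ?", (3,))
changed = conn.execute(
    "SELECT COUNT(*) FROM records WHERE status = 'new'"
).fetchone()[0]
```

The same idea works by hand in most SQL consoles: open a transaction, run the statement, read the reported row count, and only then decide between commit and rollback.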

[–] [email protected] 4 points 9 months ago

I did my research, planned the procedure and fallback process, then spun up a new VM to test it out before trying it on prod

Went through a similar process when I was resizing some partitions on my media server. On the test run I forgot to specify G on the new size, so it defaulted to MB when I resized it, resulting in a 450GB partition going down to 400MB. I was real glad I tested that out first.
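If you script resizes yourself, one way to avoid the silent default is to refuse any size that has no unit. A minimal sketch, assuming binary units; `parse_size` and the `UNITS` table are hypothetical helpers and not part of any partitioning tool.

```python
import re

# Binary multipliers; an assumption, some tools use powers of 1000 instead.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '400G' to bytes, rejecting bare numbers."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT])B?\s*", text.upper())
    if match is None:
        raise ValueError(f"size needs an explicit unit (K, M, G or T): {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]
```

Here `parse_size("400G")` returns a byte count, while `parse_size("400")` raises instead of guessing a unit.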

[–] [email protected] 11 points 9 months ago (1 children)

Well, first off, in a properly managed environment/team there's never a single point of failure... *ahem*... that being said...

The worst I ever did was lose a whole bunch of irreplaceable data because of... things. I can't go into detail on that one. I did have a backup plan for this kind of thing, but it was never implemented because my teammates thought it was a waste of time to cover for such a minuscule chance of a screw-up. I guess they didn't know me too well back then :)

[–] [email protected] 2 points 9 months ago (1 children)

"properly managed" is carrying a whole lotta weight in that first sentence.

[–] [email protected] 3 points 9 months ago

I was purely talking in hypotheticals, I've never seen such a thing with my own eyes :)

[–] [email protected] 23 points 9 months ago (1 children)

It wasn't me personally but I was working as a temp at one of the world's biggest shoe distribution centers when a guy accidentally made all of the size 10 shoes start coming out onto the conveyor belts. Apparently it wasn't a simple thing to stop it and for three days we basically just stood around while engineers were flown in from China and the Netherlands to try and sort it out. The guy who made the fuckup happen looked totally destroyed. On the last day I remember a group of guys in suits coming down and walking over to him in the warehouse and then he didn't work there any more. It must have cost them an absolute fortune.

[–] [email protected] 10 points 9 months ago (1 children)

How can a guy accidentally order all size 10 shoes to come out, without there being any way to stop it?

[–] [email protected] 7 points 9 months ago

No idea. It was a new facility, so maybe it was a bug in their new system preventing them stopping it! I was 18 at the time and found it hilarious. They kept us there the whole time because they thought it would be quick to sort out. We shot each other down roller conveyors, rode the pallet trucks around like scooters and smoked cigarettes inside big cardboard boxes while we were waiting. Good times.

[–] [email protected] 10 points 9 months ago

A then-colleague upgraded glibc by copying it in via scp. Then we couldn't ssh in anymore. :) Not sure how important that server was. I think it was reinstalled soon-ish.

[–] [email protected] 14 points 9 months ago

Crashed an important server because it didn't have room for the update I was trying to install. Love old Windows servers.

[–] [email protected] 28 points 9 months ago* (last edited 9 months ago) (1 children)

Worked for an MSP, we had a large storage array which was our cloud backup repository for all of our clients. It locked up and was doing this semi-regularly, so we decided to run an "OS reinstall". Basically these things install the OS across all of the disks, on a separate partition to where the data lives. "OS Reinstall" clones the OS from the flash drive plugged into the mainboard back to all the disks and retains all configuration and data. "Factory default", however, does not.

This array was particularly... special... In that you booted it up, held a paperclip into the reset pin, and the LEDs would flash a pattern to let you know you're in the boot menu. You click the pin to move through the boot menu options, each time you click it the lights flash a different pattern to tell you which option is selected. First option was normal boot, second or third was OS reinstall, the very next option was factory default.

I head into the data centre. I had the manual, I watched those lights like a hawk and verified the "OS reinstall" LED flash pattern matched up, then I held the pin in for a few seconds to select the option.

All the disks lit up, away we go. 10 minutes pass. Nothing. Not responding on its interface. 15 minutes. 20 minutes, I start sweating. I plug directly into the NIC and head to the default IP filled with dread. It loads. I enter the default password, it works.

There staring back at me: "0B of 45TB used".

Fuck.

This was in the days where 50M fibre was rare and most clients had 1-20M ADSL. Yes, asymmetric. We had to send guys out as far as 3 hour trips with portable hard disks to re-seed the backups over a painful 30ish days of re-ingesting them into the NAS.

The worst part? Years later I discovered that, completely undocumented, you can plug a VGA cable in and you get a text menu on the screen that shows you which option you have selected.

I (somehow) did not get fired.

[–] [email protected] 4 points 9 months ago

You still remember it. That means you learned and probably won't do it again.

[–] [email protected] 22 points 9 months ago

My first time shutting down a factory at the end of second shift for the weekend. I shut down the compressors first, and that hard stopped a bunch of other equipment that relied on the air pressure. Lessons learned. I spent another hour restarting then properly shutting down everything. Never did that again.

[–] [email protected] 19 points 9 months ago (1 children)

UPDATE articles SET status = 0 WHERE body LIKE '%...%'

On master production server, running myisam, against a text column, millions of rows.

This caused queries to stack up because of table locks.

Rather than waiting for the query to finish, a slave was promoted to master.

Lesson: don't trust mysqladmin to not do something bad.

[–] [email protected] 2 points 9 months ago

Table locks can be a real pain. You know you need to make the change, but the system is constantly running queries against the table. Nowadays it's a bit easier with algorithm=inplace and lock=none, but in the good old days you were on your own. Your only friend was luck. Large migrations like that still give me shivers.

[–] [email protected] 25 points 9 months ago

I fixed a bug and gave everyone administrator access once. I didn’t know that bug was… in use (is that the right way to put it?) by the authentication library. So every successful login request, instead of being returned the user who just logged in, was returned the first user in the DB, “admin”.

Had to take down prod for that one. In my four years there, that was the only time we ever took down prod without an announcement.

[–] [email protected] 27 points 9 months ago (3 children)

Plugged a serial cable into a UPS that was not expecting RS232. Took down the entire server room. Beyoop.

[–] [email protected] 2 points 9 months ago (1 children)

You don't have two unrelated power inputs? (UPS and regular power)

[–] [email protected] 4 points 9 months ago

This was 2001 at a shoestring dialup ISP that also did consulting and had a couple small software products. So no.

[–] [email protected] 2 points 9 months ago

Took down the entire server room

ow, goddamn...

[–] [email protected] 20 points 9 months ago

That's a common one I have seen on r/sysadmin.

I think APC is the company with the stupid issue.

[–] [email protected] 22 points 9 months ago

Broke teller machines at a bank by accidentally renaming the server all the machines were pointed to. Took an hour to bring back up.

[–] [email protected] 28 points 9 months ago (1 children)

I spent over 20 years in the military in IT. I took down the network at every base I was ever at, each time finding a new way to do it. Sometimes, but rarely, intentionally.

[–] [email protected] 9 points 9 months ago

took out a node center by applying the patches gd recommended.... took an entire weekend to restore all the shots and my ass got fed 3/4ths into the woodchipper before it came out that the vendor was at fault for this debacle.

[–] [email protected] 1 points 9 months ago

me when im QA : ))

[–] [email protected] 23 points 9 months ago

"Acknowledge all" used to behave a bit differently in Cisco UCS Manager. Well, at least the notifications of pending actions all went away... because they were no longer pending.

[–] [email protected] 14 points 9 months ago* (last edited 9 months ago)

Extracted a sizeable archive to a pretty small root/OS volume
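A quick pre-flight check can catch this before the volume fills up. This is a minimal sketch for tar archives using only the Python standard library; the `archive_fits` helper and the 1 GiB default headroom are assumptions for illustration, and the estimate ignores filesystem overhead.

```python
import shutil
import tarfile


def archive_fits(archive_path, dest_dir, reserve_bytes=1024**3):
    """Return True if the unpacked archive fits in dest_dir with some headroom."""
    with tarfile.open(archive_path) as tar:
        # Sum the uncompressed size recorded for every member.
        unpacked = sum(member.size for member in tar.getmembers())
    free = shutil.disk_usage(dest_dir).free
    return unpacked + reserve_bytes <= free
```

Calling it before extraction, for example `archive_fits("backup.tar.gz", "/")` with a hypothetical archive name, gives a chance to pick a bigger volume first.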

[–] [email protected] 40 points 9 months ago* (last edited 9 months ago)

It was the bad old days of sysadmin, where literally every critical service ran on an iron box in the basement.

I was on my first oncall rotation. Got my first call from helpdesk: Exchange was down, it was 3 AM, and the oncall backup and Exchange SMEs weren't responding to pages.

Now I knew Exchange well enough, but I was new to this role and this architecture. I knew the system was clustered, so I quickly pulled the documentation and logged into the cluster manager.

I reviewed the docs several times, we had Exchange server 1 named something thoughtful like exh-001 and server 2 named exh-002 or something.

Well, I'd reviewed the docs, and helpdesk and stakeholders were desperate to move forward, so I initiated a failover from clustered mode with 001 as the primary to unclustered mode, pointing directly at server 10.x.x.xx2.

What's that you ask? Why did I suddenly switch to the IP address rather than the DNS name? Well that's how the servers were registered in the cluster manager. Nothing to worry about.

Well... Anyone want to guess which DNS name 10.x.x.xx2 was registered to?

Yeah. Not exh-002. For some crazy legacy reason the DNS names had been remapped in the distant past.

So anyway that's how I made a 15 minute outage into a 5 hour one.

On the plus side, I learned a lot and didn't get fired.
