Unscheduled Backup Testing

This incident summary was originally posted on The Blower’s Patreon page and is being copied here to guarantee future visibility.


Hi folks! We had an outage this morning during a planned maintenance window and feel like it’s worth documenting what happened. It turned out that the backup systems we’ve had in place for a long time worked just as intended, so while it was a bit of an inconvenience it’s nice to know our plans were solid there.

Background

A couple of weeks ago, we reshuffled The Blower’s hardware. The main point was to reduce the footprint, and get everything onto one physical server – one that we had spare parts for and could maintain for a while to come. Our previous setup was very resilient, balanced across several redundant machines, but with the increased cost of hardware lately it wouldn’t really be affordable to replace some of the components like hard drives if they should fail. Instead, we now have spares on standby if there’s an issue. A simpler setup, but with a need to restart things for maintenance a little more often, as we can’t just shift running services around on the fly.

One Week Earlier

As part of the shuffle, lots of data got moved to temporary storage and copied back again a few times. One of those jobs didn’t complete successfully: the data drive for the primary database server didn’t completely migrate and logged an error. The database was still available and running, but it was stuck on a slow hard drive. To clean up the config and retry the copy we’d need to reboot it – and this would require The Blower to be offline for a short time.

Rather than have an unscheduled outage, we decided to give it a week, declare a proper maintenance window, let people know, and do some other software upgrades at the same time. This was expected to take 10, maybe 20 minutes.

What Went Wrong

All the upgrades and reboots went well. Then, with only the database server disk cleanup left to do, we shut the database server down and edited its configuration. The failed copy of the database volume was removed and the server restarted.

Well … here’s the key to the issue. It wasn’t the failed copy that was removed, but the disk volume that had been running for the last week since the migration was attempted. The database server had just been restarted with the entire last week of all posts and activity completely missing.

What Should Have Happened

The secondary database server was now logging errors indicating that the primary was not where the secondary expected it to be – the primary should always be ahead of the secondary, and right now it was significantly behind. At this point, we should have promoted the secondary to primary – that’s what it’s for! In the event of a failure of the primary, we usually have a live copy of it, right up to the last second. We can carry on from there, no problem, then rebuild a new secondary.
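For anyone who likes to see that decision written down, here’s a minimal Python sketch of the comparison described above. It is an illustration only: the function name, the integer “log positions” and the action strings are all made up for this example and aren’t tied to any particular database software or to what The Blower actually runs.

```python
def choose_action(primary_pos: int, secondary_pos: int) -> str:
    """Pick a next step from the two servers' log positions.

    The positions are hypothetical, ever-increasing counters of applied
    transactions: a higher number means more recent data.
    """
    if secondary_pos > primary_pos:
        # The secondary holds data the primary lacks, so the primary has
        # gone back in time. The secondary is the copy worth keeping.
        return "promote-secondary"
    if secondary_pos < primary_pos:
        # Ordinary replication lag: the secondary only needs to catch up.
        return "let-secondary-catch-up"
    return "in-sync"


# The situation in this post: the primary had lost a week of activity.
print(choose_action(primary_pos=1_000, secondary_pos=5_000))
```

The point of the sketch is just that “secondary ahead of primary” is a different signal from “secondary behind primary”, and only the second one suggests the secondary is the broken copy.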

Unfortunately, last week we’d had some different issues with the secondary – another migration glitch got it out of sync and we ended up restoring it from a backup at that time. When we had issues again today, it looked like the same thing all over again. Assuming the secondary was going to need another restore, we simply removed it from the configs, started restoring it again, and brought the instance back online with just the primary.

It was rapidly apparent something was wrong when the newest post on the timeline was from May 30.

Everybody’s Dead Dave

At this point, everything was rapidly shut down again. Since a restore had already been started on the secondary database, it was now effectively corrupt until that completed, and was no longer a working, up-to-date copy. The primary was a week out of date and also unusable. We’d broken both copies of our database. This is … not good.

Fortunately, there’s another safety net. Every update to the databases is logged and archived to another storage location. This is how the secondary was restored the week before, and what we’d kicked off already (unnecessarily) today. We’ve never had to restore anything with such high stakes before, but because everything had been cleanly shut down around 9:30am, each and every transaction should be in that storage.

Failing that, the next option was restoring the full server overnight backups – this would roll all activity back to about 2am though, losing everyone’s posts about their breakfast in the process. It’s not a great option.
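To put rough numbers on those two options, here’s a small Python sketch that compares how much activity each one would lose. The times of day (about 2am and around 9:30am) come from this post; the calendar date is an arbitrary placeholder, the names are invented for the example, and it assumes the archived transaction log really is complete up to the shutdown.

```python
from datetime import datetime, timedelta


def data_loss(shutdown: datetime, restore_point: datetime) -> timedelta:
    """Activity lost when restoring to restore_point (never negative)."""
    return max(shutdown - restore_point, timedelta(0))


# Placeholder date; only the times of day are taken from the post.
shutdown = datetime(2000, 1, 1, 9, 30)

options = {
    # Assumes every transaction up to the clean shutdown was archived.
    "replay archived transaction log": shutdown,
    "overnight full-server backup": datetime(2000, 1, 1, 2, 0),
}

# List the options from least to most data lost.
for name, point in sorted(options.items(),
                          key=lambda kv: data_loss(shutdown, kv[1])):
    print(f"{name}: loses about {data_loss(shutdown, point)}")
```

Under those assumptions the log replay loses nothing and the overnight backup loses roughly seven and a half hours, which is why the backup was only ever the fallback.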

Happily, the remainder of the story is really just time spent nervously watching a progress bar go from 0% to 100% while hoping like crazy that we ended up with both database servers once again at the point they’d been at over an hour before.

The End

Around 11am, the restore of the primary completed (the restore of the secondary was cancelled as we didn’t need that to get online and it’d be quicker to do one at a time). Everything came up! After some nervous checking, the instance caught up with what was going on in the world and things have been ticking along since.

As a bonus, the storage copy to the faster drive that kicked this whole thing off has also now been done, so things may be a little more responsive from here!

Anyway folks – the moral of the story is that the system works, probably. Or maybe that it’s best to have two coffees before doing any maintenance. Maybe three.

Definitely more than one.


If you’re a user or fan of The Blower, you can support it via Patreon, or many other methods.
