Automatic Railway Postgres update wiped the database volumes

woeful ferry Jun 18, 2025, 11:01 PM

#

Around 50 minutes ago we saw an automatic Railway deployment of 2 Postgres instances in our environment that completely wiped the volumes. Our service immediately started alerting us with errors that table_xyz does not exist.

This was our staging environment so only our dev teams were impacted but we are very concerned about this happening in production.

Project id: 979d8e07-4c99-427d-a247-04265f375550
EnvironmentId c39c1bd0-9903-48d6-a25f-1f1d02a5eda7

woeful ferry Jun 19, 2025, 12:17 AM

#

I would like to get some sense of what went on here for me to know how to prevent this in prod

civic smelt Jun 19, 2025, 12:54 AM

#

woeful ferry I would like to get some sense of what went on here for me to know how to preven...

Hey, is your database still wiped?

woeful ferry Jun 19, 2025, 12:55 AM

#

yes it never recovered. a new migration executed on service restart

civic smelt Jun 19, 2025, 12:55 AM

#

And yes, unfortunately this was Railway's fault.

We've identified that a component in our deployment infrastructure became unresponsive, which caused your deployment to hang before eventually failing. We sincerely apologize for the frustration this caused.

We've now implemented additional monitoring to detect this type of issue immediately, so our team can resolve it much faster if it happens again.

civic smelt Jun 19, 2025, 12:56 AM

#

woeful ferry yes it never recovered. a new migration executed on service restart

Do you want to recover it? I can raise it to the team.

woeful ferry Jun 19, 2025, 12:58 AM

#

yes is that possible. its unclear to me how hang in the deployment causes a disk wipe. is there more i should be doing in production environment to prevent against this. this would have been very costly to us in production.

civic smelt Jun 19, 2025, 12:59 AM

#

woeful ferry yes is that possible. its unclear to me how hang in the deployment causes a disk...

I'll raise this to the team. Unfortunately, I do not know the answers to most of your questions.
cc @median sinew

median sinew Jun 19, 2025, 1:01 AM

#

Hello,

The issue the two other users faced would not cause any data loss.

civic smelt Jun 19, 2025, 1:02 AM

#

Oh my bad then, ignore what I said.

median sinew Jun 19, 2025, 1:03 AM

#

Should I be looking at tax-postgres?

woeful ferry Jun 19, 2025, 1:04 AM

#

yes, same thing happened to ponder-db in same env

#

we did not see the hung deployment. deployments were successful but just all the data was gone

median sinew Jun 19, 2025, 1:11 AM

#

Looks like ponder-db was last deployed around 2 months ago? should I be looking at another environment? I am looking at staging right now.

woeful ferry Jun 19, 2025, 1:11 AM

#

just tax-postgres. ignore previous message about ponder-db

median sinew Jun 19, 2025, 1:44 AM

#

The team is looking into this now.

lament cloak Jun 19, 2025, 1:57 AM

#

woeful ferry just tax-postgres. ignore previous message about ponder-db

Would you mind checking now?

#

I believe it's been re-initialized correctly (happy to cover what happened after and what we'll do to fix it in a sec)

woeful ferry Jun 19, 2025, 1:59 AM

#

yes looks like data has been restored. reminder to encrypt at rest seeing u guys poke around in the table 😂

#

looking forward to full diagnostic and some advice on protecting our production services from such issues

lament cloak Jun 19, 2025, 2:00 AM

#

woeful ferry yes looks like data has been restored. reminder to encrypt at rest seeing u guys...

Everything is encrypted at rest on our side!

woeful ferry Jun 19, 2025, 2:01 AM

#

but presumably you can view tables the same way we can through the webapp

lament cloak Jun 19, 2025, 2:02 AM

#

woeful ferry but presumably you can view tables the same way we can through the webapp

Yes. We have a BAA for enterprise that will make it so that someone internally needs your consent

But, in this case, I actually just went off metrics

woeful ferry Jun 19, 2025, 2:02 AM

#

this

lament cloak Jun 19, 2025, 2:02 AM

#

What happened (and I still have to figure out why) is that you had 2 identical volumes (as in, they had the same unique identifier).

However, one of them was older. I think it might have picked an older snapshot. Why? I'll dig in.

lament cloak Jun 19, 2025, 2:02 AM

#

woeful ferry this

Yup

woeful ferry Jun 19, 2025, 3:49 PM

#

would love a full post mortem here and some advice on protecting ourselves from such events in prod. thanks

lament cloak Jun 20, 2025, 6:46 PM

#

woeful ferry would love a full post mortem here and some advice on protecting ourselves from ...

Yup. We believe this occurs when you have backups and a migration

The backup fires, the migration grabs the new partial as gospel, and then migrates that

#

Will get confirmation for this today

#

!remind me to follow up in 4 hours

twin skyBOT Jun 20, 2025, 6:46 PM

#

lament cloak !remind me to follow up in 4 hours

Got it, I will remind you to follow up at Fri, 20 Jun 2025 22:46:37 GMT

lament cloak Jun 20, 2025, 6:46 PM

#

(Latest)

civic smelt Jun 20, 2025, 7:44 PM

#

just a heads up that he left the server

twin skyBOT Jun 20, 2025, 10:46 PM

#

Hey @lament cloak, remember to follow up - #1385031788786618368 message

woeful ferry Jul 14, 2025, 6:35 PM

#

im back

#

this is still pretty insane to us that we had a volume just bricked. would like some further explanation and some credits

lament cloak Jul 17, 2025, 1:11 AM

#

woeful ferry im back

Oh, interesting. I didn't follow up

lament cloak Jul 17, 2025, 1:12 AM

#

lament cloak Yup. We believe this occurs when you have backups and a migration The backup fi...

This is 100% what happened here. We patched it right after you reported it. It's definitely obscene: a 1-in-1B+ collision race case.
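
On the customer side, the thread notes that a new migration executed on service restart against the wiped volume, which left the service running on an empty schema. One possible mitigation is a startup guard that refuses to auto-migrate when an expected table is missing. This is a minimal sketch, not something Railway recommended in the thread: `guarded_migrate`, `MissingSchemaError`, `example_migrations`, and the `allow_fresh` flag are hypothetical names, and SQLite stands in for Postgres only so the example runs on its own. It would not have prevented the volume problem itself; it would only make the service fail loudly instead of rebuilding empty tables.

```python
import sqlite3


class MissingSchemaError(RuntimeError):
    """Raised when the sentinel table is absent and a fresh start was not allowed."""


def table_exists(conn, table):
    # sqlite_master is SQLite-specific and is used here only to keep the
    # sketch runnable. On Postgres the analogous check would query
    # information_schema.tables (assumption: adapt to your driver).
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def guarded_migrate(conn, sentinel_table, run_migrations, allow_fresh=False):
    """Run migrations only if the database looks like the one we expect.

    sentinel_table is a table that must already exist in any non-fresh
    database (table_xyz in the thread). allow_fresh must be set explicitly
    for a first-time deployment.
    """
    if not allow_fresh and not table_exists(conn, sentinel_table):
        raise MissingSchemaError(
            f"{sentinel_table!r} is missing; refusing to migrate an "
            "unexpectedly empty database"
        )
    run_migrations(conn)


def example_migrations(conn):
    # Hypothetical stand-in for a real migration tool.
    conn.execute("CREATE TABLE IF NOT EXISTS table_xyz (id INTEGER PRIMARY KEY)")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # First deployment: the operator opts in to creating the schema.
    guarded_migrate(conn, "table_xyz", example_migrations, allow_fresh=True)
    # Later restarts: the guard passes because the sentinel table exists.
    guarded_migrate(conn, "table_xyz", example_migrations)
```

With a guard like this, a restart against a wiped volume raises `MissingSchemaError` instead of silently recreating the schema, so the empty database shows up as a failed deployment rather than as missing-table errors later.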
