# Automatic Railway Postgres update wiped the database volumes

woeful ferry
#

We saw an automatic Railway deployment of 2 Postgres instances in our environment around 50 minutes ago that completely wiped the volumes. Our service immediately started alerting us with errors that table_xyz does not exist.

This was our staging environment, so only our dev teams were impacted, but we are very concerned about this happening in production.

Project id: 979d8e07-4c99-427d-a247-04265f375550
Environment id: c39c1bd0-9903-48d6-a25f-1f1d02a5eda7

verbal onyxBOT
#

Project ID: 979d8e07-4c99-427d-a247-04265f375550,c39c1bd0-9903-48d6-a25f-1f1d02a5eda7

woeful ferry
#

I would like to get some sense of what went on here so I know how to prevent this in prod

civic smelt
woeful ferry
#

yes it never recovered. a new migration executed on service restart

civic smelt
#

And yes, unfortunately this was Railway's fault.

We've identified that a component in our deployment infrastructure became unresponsive, which caused your deployment to hang before eventually failing. We sincerely apologize for the frustration this caused.

We've now implemented additional monitoring to detect this type of issue immediately, so our team can resolve it much faster if it happens again.
civic smelt
woeful ferry
#

yes, is that possible? it's unclear to me how a hang in the deployment causes a disk wipe. is there more I should be doing in our production environment to protect against this? this would have been very costly to us in production.

civic smelt
median sinew
#

Hello,

The issue the two other users faced would not cause any data loss.

civic smelt
#

Oh my bad then, ignore what I said.

median sinew
#

Should I be looking at tax-postgres?

woeful ferry
#

yes, same thing happened to ponder-db in same env

#

we did not see a hung deployment. the deployments were successful, but all the data was gone

median sinew
#

Looks like ponder-db was last deployed around 2 months ago? Should I be looking at another environment? I am looking at staging right now.

woeful ferry
#

just tax-postgres. ignore previous message about ponder-db

median sinew
#

The team is looking into this now.

lament cloak
#

I believe it's been re-initialized correctly (happy to cover what happened after and what we'll do to fix it in a sec)

woeful ferry
#

yes looks like data has been restored. reminder to encrypt at rest seeing u guys poke around in the table 😂

#

looking forward to full diagnostic and some advice on protecting our production services from such issues

lament cloak
woeful ferry
#

but presumably you can view tables the same way we can through the webapp

lament cloak
woeful ferry
lament cloak
#

What happened (and I still have to figure out why) is that you had 2 identical volumes (as in, they had the same unique identifier)

However, one of them was older. I think it might have picked an older snapshot. Why? I'll dig in

lament cloak
woeful ferry
#

would love a full post mortem here and some advice on protecting ourselves from such events in prod. thanks

lament cloak
#

Will get confirmation for this today

#

!remind me to follow up in 4 hours

twin skyBOT
lament cloak
#

(Latest)

civic smelt
#

just a heads up that he left the server

twin skyBOT
woeful ferry
#

im back

#

it's still pretty insane to us that we had a volume just bricked. we would like some further explanation and some credits
