#Postgres is slow

20 messages · Page 1 of 1 (latest)

torpid basalt
#

I'm currently testing Railway as a possible host for my existing app that has both a high write and read volume.

I made a copy of my production database set up and the results are VERY worrisome. I’ve run VACUUM ANALYZE on the whole DB right after the DB transfer.

Right now I have 0% of my write load and less than 1% of my read load but requests often time out (especially after a few hours of site inactivity). Even with no hot cache, the DB should be able to fullfil requests in time. Especially since it has about double (8GB) the max RAM (4GB) and more CPU than the database I currently have at Heroku (Standard 0).

wraith gazelleBOT
#

Project ID: 715a154f-c769-440d-9dad-373a1d7efafe

torpid basalt
#

715a154f-c769-440d-9dad-373a1d7efafe

torpid basalt
#

At this point it's clear something is just wrong with the volume because disk I/O is super problematic.

torpid basalt
#

Can this be moved to central station? I see @noble plover has moved at least one client three days ago to the legacy system which isn't very reassuring but at least things will function:
https://station.railway.com/questions/urgent-production-instance-very-sl-d5482327

Railway Help Station

We spent a lot of time diagnosing the issues, due to Postgres and Redis Logs:Postgres logs very long sync times : 2025-08-04 15:21:40.040 UTC [26] LOG: checkpoint complete: wrote 26 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=195.859 s, sync=234.474 s, total=590.264 s; sync files=21, longest=42.797 s, average=11.166 s; dis...

rough badgeBOT
#

New reply sent from Help Station thread:

OK, I found it's already on Central Station. Things I did (some multiple times):

  • VACUUM FULL
  • ANALYZE
  • Tuned settings
  • Tuned
    shm_size
    Nothing seems to have helped so far.

You're seeing this because this thread has been automatically linked to the Help Station thread.

#

New reply sent from Help Station thread:

I'm not a big Postgres expert but I don't think this is good news:
2025-08-09 00:38:05.568 UTC [27] LOG: checkpoint complete: wrote 93 buffers (0.1%); 0 WAL file(s) added, 0 removed, 67 recycled; write=8.964 s, sync=6.407 s, total=33.174 s; sync files=43, longest=5.783 s, average=0.149 s; distance=1106118 kB, estimate=1106118 kB; lsn=10/97001A50, redo lsn=10/85FFF3F0
2025-08-09 00:43:08.495 UTC [27] LOG: checkpoint complete: wrote 347 buffers (0.3%); 0 WAL file(s) added, 0 removed, 63 recycled; write=35.078 s, sync=0.025 s, total=35.827 s; sync files=43, longest=0.010 s, average=0.001 s; distance=1026903 kB, estimate=1098196 kB; lsn=10/C4AD5348, redo lsn=10/C4AD5310
Is it normal for a WAL checkpoint to take > 35s writing 0 files?

You're seeing this because this thread has been automatically linked to the Help Station thread.

young rune
#

I have moved your service to a different stacker with a different drive setup, can you let me know if you continue to see Disk I/O issues?

And just a friendly reminder that we don't ping the team members 🙂

rough badgeBOT
#

New reply sent from Help Station thread:

That didn’t help at all. Site is extremely slow. It’s not networking because query times are similar when ran in psql.

You're seeing this because this thread has been automatically linked to the Help Station thread.

young rune
#

Gotcha, thank you for letting me know.

Would you happen to have any tracing that involves the slow database queries? that would be most helpful as we debug this further.

rough badgeBOT
#

New reply sent from Help Station thread:

except for information containing actual queries, I’m willing to provide any info that’s needed. I honestly think the volume is totally botched, I cannot believe an SSD will ever behave this way. (Also, see my other thread regarding volume usage discrepancies).
Could it be that the volume didn’t get moved when I moved the Postgres deployment to us-east-1?
Is there an option to redeploy me on the “legacy” (non-metal) stack just so we confirm this is the issue?

You're seeing this because this thread has been automatically linked to the Help Station thread.

young rune
#

I've moved the database back to the Legacy region, let me know if that solves anything.

rough badgeBOT
#

New reply sent from Help Station thread:

As a start, cold cache queries are not timing out. They’re slow (I guess that’s somewhat expected).
I will wait a few more hours and then query update again.

You're seeing this because this thread has been automatically linked to the Help Station thread.

rough badgeBOT
#

New reply sent from Help Station thread:

Intermediate update:
I haven't had a single request time out since brody moved the database to the "legacy" region. This points to a serious issue I/O in metal regions, especially when it comes to Postgres.
cold-cache queries are still extra slow and surprising for an SSD. This makes me wonder: From what I've seen, at least where I looked, Railway does not mention IOPS anywhere in their pricing or documentation. Maybe this is the hidden problem? Everywhere there's disk, there's disk IOPS but I don't see what's my dedicated IOPS for the Postgres volume for example.
I'll keep updating here as I had to spin up some monitoring resources to benchmark things a bit more carefully.
I might spin up a DB at Neon to compare the results.

You're seeing this because this thread has been automatically linked to the Help Station thread.

rough badgeBOT
#

New reply sent from Help Station thread:

So, anything that has not been very recently queried still yields an abysmal 9s-25s time to execute.
I'm going to test Neon now in North Virginia now because those results (with 0% write load and less than 1% of the read load) are just not acceptable.

You're seeing this because this thread has been automatically linked to the Help Station thread.

rough badgeBOT
#

New reply sent from Help Station thread:

I deployed a comparable instance on Neon — although this wasn't my original plan — and the difference is night and day despite the "unfair" advantage the Railway Postgres deployment has (private networking).
It could be argued that this is expected given that Neon is specialized in Postgres but I don't think the differences are in tuning. They are just too BIG.
This has got to be one of two things:
Some disk I/O quirk that Railway ought to figure out.
Disk writes have already been acknowledged by an employee as something that's being worked on
, this is great. But my Postgres instance couldn't handle
0 writes
and single non-concurrent reads when hosted on metal, and was still painfully slow when it was moved to "Legacy".
Something I screwed up big time with the deployment.
I have a few days before I have to make up my mind about a new host for the workloads but I'm definitely not hosting my DB at Railway. The deployment at 8GB memory either couldn't really keep its cache, or (most probably) is disastrous with disk I/O.

You're seeing this because this thread has been automatically linked to the Help Station thread.

rough badgeBOT
#

New reply sent from Help Station thread:

Can this be escalated to the engineering team, or is presumed that this is a an application level issue?

You're seeing this because this thread has been automatically linked to the Help Station thread.

rough badgeBOT
#

New reply sent from Help Station thread:

Hi I experience the same for Mysql and MongoDB, very slow despite the load is not high, below 100QPS.. Moving same same workload to toher provider has immediately 5-10x improvement in I/O speed

You're seeing this because this thread has been automatically linked to the Help Station thread.

rough badgeBOT
#

New reply sent from Help Station thread:

I've had something similar on my pg instance, my app is still in dev so I'm the only one using it, the metrics are almost flat and very very low, I have an e2e test suite I run to be sure all is working fine, locally all good, online I can see that sometime I get timeout issues and when this happens I get these logs in the db:

2025-08-13 16:49:14.715 UTC [74] LOG: checkpoint complete: wrote 21 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.027 s, sync=0.021 s, total=2.098 s; sync files=9, longest=0.015 s, average=0.003 s; distance=108 kB, estimate=108 kB; lsn=0/2E19A18, redo lsn=0/2E199E0

You're seeing this because this thread has been automatically linked to the Help Station thread.

rough badgeBOT
#

New reply sent from Help Station thread:

⬆️ This thread has been escalated to the Railway team.

You're seeing this because this thread has been automatically linked to the Help Station thread.