#Critical problem with Arch Linux

sonic meteor
#

Hi, I've been using Arch as my daily driver distro for over a year. About 3 months ago I built my own PC and had no major problems with Arch on it until around 2 weeks ago.

What started happening is random PC freezes. All of a sudden the screen freezes on one frame, all audio stops, and nothing responds. The only option is a hard reboot of the entire PC, because the freeze never clears on its own. Sometimes it happens 8 hours after starting the PC, sometimes after only 30 minutes.

Some of the things I've tried:

  • Checking journalctl. The logs cut off at the moment of the freeze and show nothing specific until the next boot.
  • Reinstalling Arch from scratch. It still happened.
  • Installing a different distro (specifically EndeavourOS). I got one day of freedom, then it started happening again. This is what I'm currently on, but it's basically just Arch with some extra stuff installed.
  • Running memtest86+ overnight. 6 passes, 0 errors.
  • Running checks on my SSD. Everything's showing it as operational.
    • I doubt it's the SSD but I think I could still test more if this seems likely.
  • Replacing the PSU. It still happened, and my current PSU should have no problem powering my entire PC.
  • Force reinstalling every package with something like yay -Syyu --overwrite='*' $(yay -Qq).
  • Running different benchmarks to check what could be malfunctioning.
    • The only thing that looked abnormal was unix-bench: the Dhrystone 2 test shows a very anomalous value, 0.1 lps on the single-threaded test and 3.2 lps on the multi-threaded test. This is especially weird given that the same test run from the EndeavourOS installation medium (not the fully installed OS) gives a result on the scale of 1910805895.4. The discrepancy makes me suspect it might be related to this problem.

I'm running out of characters already so I'll provide more context under this post.

#

What I didn't try:

  • Running a non-Arch-based distro. EndeavourOS was giving me a green-ish light with unix-bench, so I thought it would be fine. Today it turned out not to be, so I haven't had the opportunity to test this yet. I also don't want to give up on Arch just yet.
  • Replacing other PC components. I don't have a GPU or CPU that I could test this with, and it's not feasible for me to acquire some.
  • Possibly a bunch of other troubleshooting steps, including hardware and software. There might be some extra stuff that I did but didn't specify here because I forgot them, so if you have any ideas for what to check, let me know, maybe I did try them.
distant cedar
#

Logs cutting off on the freeze suggests a hardware fault

#

Check a bunch of previous boots, you might get lucky and find a log that snuck in before the crash via a race condition
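
For anyone following along, a rough sketch of what "check a bunch of previous boots" could look like. It assumes the journal is persistent (otherwise only the current boot exists), and `tail_of_boots` is just a made-up helper name; `-b`, `-n` and `--no-pager` are standard journalctl options.

```shell
# Print the last few journal lines of each of the most recent previous boots,
# to spot anything that made it to disk right before a freeze.
tail_of_boots() {
    count="${1:-5}"    # how many previous boots to inspect
    lines="${2:-20}"   # how many trailing lines to show per boot
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== boot -$i ==="
        # -b -1 is the previous boot, -b -2 the one before that, and so on
        journalctl -b "-$i" -n "$lines" --no-pager
        i=$((i + 1))
    done
}

# Example call (commented out so sourcing this file has no side effects):
#   tail_of_boots 5 30
# journalctl --list-boots shows which boot offsets actually exist.
```

If a boot offset doesn't exist, journalctl just reports an error for that iteration and the loop moves on.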

sonic meteor
#

yeah that's what I'm thinking, but then why would the EndeavourOS live installation medium not have the problems

distant cedar
#

Hmm well you've done everything I would suggest really :(

sonic meteor
#

yeah it feels like I've tried everything over the weekend

#

and each thing i've tried either shows nothing or points in a completely different direction than everything else

distant cedar
#

Checking another distro would be a good datapoint. If that also breaks then hardware gets more likely

#

Or even another arch kernel, like -lts or -zen

sonic meteor
#

oh yeah, that's another thing i tried, the lts kernel

distant cedar
#

Hmmmmmm thonk

sonic meteor
#

I also tried just keeping a terminal open with a bunch of monitoring software (btop and amdgpu_top specifically), but no component there shows any abnormal peak usage before the freeze

#

what I'm trying now is dumping dmesg to a file each second to be able to review it once it crashes again.

#

because otherwise I'm only accessing dmesg from the current boot, not the one that crashed
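
A possible variation on the dump-every-second idea, as a sketch only: append each kernel message as it arrives and flush it to disk, instead of rewriting the whole file each second (a file that is being rewritten at the moment of the freeze might end up empty or stale). `log_synced` and the log path are made-up names, and the `stdbuf -oL` part is an assumption in case dmesg block-buffers its output when piped.

```shell
# Append every line from stdin to a log file and flush it to disk right away,
# so the last messages before a freeze have a better chance of surviving.
log_synced() {
    out="$1"    # destination file (appended to, never truncated)
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$out"
        sync "$out"    # GNU coreutils sync accepts file arguments
    done
}

# Intended use (not run here; needs root and runs until interrupted):
#   sudo stdbuf -oL dmesg --follow | log_synced "$HOME/dmesg-live.log"
# With a persistent journal, the previous boot's kernel messages can also be
# read after the fact:
#   journalctl -k -b -1 --no-pager
```

Syncing after every line is slow, but kernel messages are usually infrequent enough that it shouldn't matter much.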

wraith dust
#

@sonic meteor Just a rather random thought, but which CPU do you have? My old machine with an i5-2500 CPU started doing exactly this, and I ended up just throwing the machine away after trying a lot of things like replacing the GFX card and memory (and also changing the BIOS battery).

sonic meteor
#

I don't think I want to throw this one out

wraith dust
#

yeah... that should not happen. Perhaps a BIOS update?

sonic meteor
#

I can try that, though my BIOS is from January of this year so it's not that outdated

#

but yeah, I'll try that later

wraith dust
#

Well, it's worth at least checking the changelog of the BIOS updates to see if they mention instability

sonic meteor
#

oh I'm checking that right now, apparently one of the updates since has been for support of my specific CPU

#

I'll definitely try that later today

#

idk why it didn't even run through my head that my CPU was released after my BIOS version

#

still, it surprised me then that the PC ran fine for almost 3 months

wraith dust
#

Let's hope it is not because your motherboard has cooked your CPU with too much power 😛

sonic meteor
#

to be fair, before it wouldn't even have that much power to draw lmao

#

before I was running on a PSU that was barely enough for my PC to run somewhat stable

sonic meteor
#

I just updated the BIOS, hopefully this whole thing stops now

#

I ran Dhrystone in unix-bench again and I'm still getting 3.2, which scares me: if it was the mobo BIOS blasting my CPU with too much power, then the CPU might just be damaged

#

Another thing i just tried - kdump also had no logs.

I set up kdump a few boots ago, forgot to check it. now I checked it and it produced no logs at all.

I assume this means that it's almost certainly a hardware issue

sonic meteor
#

it just froze after the bios update

#

fuck my life i guess

sonic meteor
#

Someone online had a similar issue with the same CPU and apparently they had their integrated GPU disabled in BIOS, I'll check if that's the case for me if I crash again.

sonic meteor
#

OKAY something really weird just happened.

On my previous boot I had a loop of watch -n 1 'sudo dmesg > dmesg.out' running in the background. I just got a crash and I got a really weird output in the journal:

-- Boot 20493c5d97e746838628ce984d744c06 --
lip 30 00:04:41 eosbtw kernel: Linux version 6.15.8-arch1-1 (linux@archlinux) (gcc (GCC) 15.1.1 20250425, GNU ld (GNU Binutils) 2.44.0) #1 SMP PREEMPT_DYNAMIC Thu, 24 Jul 2025 18:18:11 +0000
-- Boot ae6388bbff824c3989ef68b52131df78 --
lip 29 23:46:15 eosbtw sudo[226334]: pam_unix(sudo:session): session opened for user root(uid=0) by bsawko(uid=1000)
-- Boot 20493c5d97e746838628ce984d744c06 --
lip 30 00:04:41 eosbtw kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=97781601-7dee-4a37-be8a-b265068d28e3 rw nowatchdog nvme_load=YES loglevel=3
-- Boot ae6388bbff824c3989ef68b52131df78 --
lip 29 23:46:15 eosbtw sudo[226334]: pam_unix(sudo:session): session closed for user root
-- Boot 20493c5d97e746838628ce984d744c06 --
lip 30 00:04:41 eosbtw kernel: x86/split lock detection: #DB: warning on user-space bus_locks
-- Boot ae6388bbff824c3989ef68b52131df78 --
lip 29 23:46:16 eosbtw sudo[226340]:   bsawko : TTY=pts/3 ; PWD=/home/bsawko ; USER=root ; COMMAND=/usr/bin/dmesg
-- Boot 20493c5d97e746838628ce984d744c06 --
lip 30 00:04:41 eosbtw kernel: BIOS-provided physical RAM map:
-- Boot ae6388bbff824c3989ef68b52131df78 --
lip 29 23:46:16 eosbtw sudo[226340]: pam_unix(sudo:session): session opened for user root(uid=0) by bsawko(uid=1000)
-- Boot 20493c5d97e746838628ce984d744c06 --
...

This switch between two boots keeps going for almost 700 lines.

#

this looks like the previous boot literally got hibernated for some reason, and when booting again the system was running twice for a brief moment.
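
To read output like this more easily, the two boots can be pulled apart again. A minimal sketch, assuming the journal text was saved to a file with the `-- Boot <id> --` marker lines intact; `lines_for_boot` and `journal.txt` are made-up names.

```shell
# Print only the lines that belong to one boot from journal text in which
# "-- Boot <id> --" markers switch back and forth between boots.
lines_for_boot() {
    # $1: file containing the saved journal output
    # $2: boot ID to keep
    awk -v want="$2" '
        /^-- Boot [0-9a-f]+ --$/ { current = $3; next }   # remember the active boot
        current == want { print }                         # keep lines of the wanted boot
    ' "$1"
}

# Example (commented out; the boot IDs are the ones from the output above):
#   journalctl --no-pager > journal.txt
#   lines_for_boot journal.txt ae6388bbff824c3989ef68b52131df78
# journalctl can also filter by boot ID directly:
#   journalctl -b 20493c5d97e746838628ce984d744c06 --no-pager
```

This only untangles the text; it doesn't say anything about why the two boots ended up interleaved in the first place.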

sonic meteor
#

ok so i know it's been almost a month but for some godforsaken reason the update to kernel version 6.16 fixed the problem????

#

which is strange because 6.12 and 6.15 both were broken

#

just wanted to throw in this update

#

going back to 6.12 still gives me the same problem

distant cedar
#

Arch do be rolling