#HA crashes repeatedly due to some host issue (I think), need some help debugging


carmine wraith
#

Yesterday night I noticed my HAOS server was down, weird. I hadn't used it or done any changes. Browser didn't load, observer didn't load. Ping worked, host was ON. Plugged HDMI, and there were tons of errors.

Left it for this morning, rebooted host, it happened again. After a 2nd reboot, it booted fine and HA was accessible. It has worked for 1h or so, and now I've noticed plenty of errors again. It's still accessible, but lots of things are not working. Seems like a host level issue, but I can't decipher the logs.

System info dump

prime epoch
#

What hardware is this?

carmine wraith
# prime epoch What hardware is this?

one sec, rebooting so I can get the details.

MiniPC with like N100 or similar. 16GB RAM, 256GB SSD, and then an additional SSD 2TB used for media, automounted on boot via udev rule.

prime epoch
#

This looks like a dying/dead SSD or filesystem corruption at the minimum.

carmine wraith
#

it has 1.5 years

prime epoch
#

I'd recommend a re-flash and restore. You might need to replace the disk. You can try if ha os datadisk wipe works for a factory reset.
But before that we could look at which disk this even is.
Enter login at the CLI and run this

lsblk -o+FSTYPE,LABEL,MODEL
carmine wraith
#

I guess the culprit would be sdb? The system SSD, right?

#

any way to confirm it?

prime epoch
#

Yep. Looks like it's the main OS/boot drive. Well the logs state sdb but you can check again. Try something like this

journalctl -b0 -krg "sd[a-z]"
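
In that command, `-b0` limits output to the current boot, `-k` to kernel messages, `-r` shows newest first, and `-g` filters by pattern. As a rough sketch of how to see which disk the kernel complains about most, the helper below tallies `sdX` mentions in whatever log text it is fed. The function name `count_disk_mentions` is made up for illustration, and a high count only hints at the faulty disk since healthy disks are mentioned too.

```shell
# Tally mentions of SCSI/SATA disk names (sda, sdb, ...) in log text on stdin.
# Prints "<count> <disk>" per disk, most-mentioned first.
count_disk_mentions() {
    grep -oE 'sd[a-z]' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Possible usage on the host (not run here):
#   journalctl -b0 -k | count_disk_mentions
```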
#

How exactly did you mount the external drive? Udev?

carmine wraith
prime epoch
#

Logs containing sda/sdb. If the errors say sdb it's sdb and vice versa.

prime epoch
#

If you didn't move any physical connection it's likely the SSD dying. The 256GB SSD model sounds pretty no name.

carmine wraith
#

yeah, the miniPC is no-name, and it's the one that came with it

#

it looks truncated too

#

are there any checks I can do before going to buy a new SSD in a rush?

prime epoch
#

As said, I'd recommend you try the factory reset first. HAOS is a bit too bare bones to do many diagnostic things. At least not in this state. You'd need smartctl.

carmine wraith
#

factory reset, and restore backup then?

prime epoch
#

You need to use the Advanced Addon with disabled protection mode and I doubt that will work in your case though: #1435280112684896427 message

#

Yeah.

carmine wraith
#

come on, it's a renowned AZW

prime epoch
#

A SATA M.2 even. A bit niche.

carmine wraith
#

don't want to go for branded tho, I prefer cheap

carmine wraith
# prime epoch You need to use the Advanced Addon with disabled protection mode and I doubt tha...

And the last bit

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more
prime epoch
# carmine wraith should I look for a different type? I'm not an expert

Depends what your MiniPC supports. Maybe you can take a picture of the M.2 socket. I don't really have a problem with cheap SSDs for this use case if the price is right.
The main reason I'm not a fan of SATA M.2s is that they can only be used for very specific machines, they are niche and thus there's fewer options and higher prices. The drive has been running for at least 2.5 years, not 1.5. Other than that I don't see much.
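
The 2.5-year figure can be checked against the Power_On_Hours raw value of 23175 in the SMART attributes posted further down. A quick sketch of the arithmetic, assuming the drive ran continuously (any powered-off time would make its calendar age even longer); `hours_to_years` is just an illustrative helper name:

```shell
# Convert a SMART Power_On_Hours raw value into years of continuous runtime.
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.2f\n", h / 24 / 365 }'
}

hours_to_years 23175   # roughly 2.6 years, so more than the 1.5 mentioned earlier
```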

#

If you run update-smart-drivedb first these Unknown_Attributes might become readable, and this line might look different:

Device is:        Not in smartctl database 7.5/5706
carmine wraith
carmine wraith
# prime epoch If you run `update-smart-drivedb` first these `Unknown_Attribute`s might become ...
SMART Attributes Data Structure revision number: 20
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 New_Bad_Blk_Count       0x1300   100   100   010    Old_age   Offline      -       0
  9 Power_On_Hours          0x1200   100   100   000    Old_age   Offline      -       23175
 12 Power_Cycle_Count       0x1200   100   100   000    Old_age   Offline      -       194
164 Erase_Count             0x0000   100   100   000    Old_age   Offline      -       378 1182 1054
165 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       1182
166 Min_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       378
167 Average_Erase_Count     0x2200   100   100   000    Old_age   Offline      -       1054
194 Temperature_Celsius     0x2200   032   032   000    Old_age   Offline      -       32 (Min/Max 24/40)
199 UDMA_CRC_Error_Count    0x1200   100   100   000    Old_age   Offline      -       0
241 Host_Writes_GiB         0x3200   100   100   000    Old_age   Offline      -       73765
242 Host_Reads_GiB          0x3200   100   100   000    Old_age   Offline      -       31813
prime epoch
#

Alright so right now I'd recommend these steps in order until something is solved

  • The fsck.mode=force boot arg
  • ha os datadisk wipe and restore
  • Full disk re-flash and restore
  • Send me a picture of the M.2 slot so we can see if it supports NVMe or just SATA and then get a replacement drive
carmine wraith
carmine wraith
prime epoch
#

Hmm. So a NVMe should fit there but I can't tell 100% if the slot supports that. I was hoping for some silkscreen markings on the PCB.

#

During boot there is a slot A/B selection. If you press e there you can change the kernel args.

#

Give me a second I'll show you.

#

Press e here.

#

Looks like this then

#

Now add your kernel args. Without the typo, of course

#

Then press F10 or CTRL+X as it says at the bottom.

#

But to be clear. I do not expect a fsck to help here.

carmine wraith
#

trying again now

#

it has to be slot B, right? I just pressed e and it got into slot B, but I did not choose it myself

#

same thing happened

prime epoch
#

It's just what was selected for me. You can try another one too.

#

If you also add debug you might get more information. I.e otherargshere... fsck.mode=force debug.

carmine wraith
#

trying that in slot A now, but so far the same

#

even with debug

prime epoch
#

:<

#

I guess it's ubuntu live booting and re-flash time then.

#

Well, or the ha os datadisk wipe first.

carmine wraith
#

oh, there was a message now, but it went away and I couldn't read it

#

let me try again

prime epoch
#

I'd video record such things.

carmine wraith
prime epoch
#

Weird. Maybe you pressed e too often and inserted it into the boot config or something? Just adding some args like this shouldn't cause this.

carmine wraith
prime epoch
#

Looks correct to me 🤷

carmine wraith
#

anyway, if you say it has little chance to work, let's move onto the wipe step

#

let me try and get the last local backup

prime epoch
#

Also about that NVMe support. I guess we could look at lspci | grep -Ei "sata|nvme". If there's a NVMe mention you can likely use a NVMe too.

#

Ideally you have your backups off-device. Your backup should be usable even if the device explodes. I store mine on my NAS but my HA is also running on proxmox VE so I also have backups of the whole system.

carmine wraith
#

yeah, I should have in the cloud too

prime epoch
#

Might be an idea for you too. Run HAOS in a VM if you re-install anyways?

carmine wraith
#

but it's easier if I can grab the local copy

#

I tried proxmox last year, but it's just not for me. Too much overhead in terms of complexity, for no gain for me

#

I remember we had a discussion about it

prime epoch
#

It certainly has a high ceiling but just for a basic HAOS VM you don't need to learn too much and have the snapshots and backups. Probably not enough of an incentive. I remember your picture but my memory is not good enough to remember all my discussions in detail. I have a lot and they tend to be very similar so it becomes kind of a slurry in my mind.

carmine wraith
#

is that confirmation?

prime epoch
#

No. I thought you'd run this via the login way again.

carmine wraith
prime epoch
#

Hmm.

#

Oh. Right. The system lspci is 💩

# lspci
01:01.0 Class 0100: 1af4:1004
00:08.0 Class 0780: 1af4:1003
00:01.2 Class 0c03: 8086:7020
00:1f.0 Class 0604: 1b36:0001
00:01.0 Class 0601: 8086:7000
00:1e.0 Class 0604: 1b36:0001
00:00.0 Class 0600: 8086:1237
00:01.3 Class 0680: 8086:7113
00:12.0 Class 0200: 1af4:1000
00:03.0 Class 00ff: 1af4:1002
00:01.1 Class 0101: 8086:7010
01:02.0 Class 0100: 1af4:1004
00:02.0 Class 0300: 1234:1111
00:05.0 Class 0604: 1b36:0001
#

Try this in the addon then

apk add pciutils
lspci -k | less
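
Even the bare BusyBox listing can be read without installing anything, because the four-digit class code identifies the device type: in the PCI class scheme, 01 is mass storage, with subclass 00 for SCSI, 01 for IDE, 06 for SATA and 08 for non-volatile memory (NVMe). The sketch below translates those codes; the function name `storage_controllers` is hypothetical. Note that an NVMe controller sits on the drive itself, so it only shows up when an NVMe drive is actually installed; a missing entry does not by itself prove the slot lacks NVMe support.

```shell
# Translate BusyBox-style lspci lines ("00:01.1 Class 0101: 8086:7010")
# into readable names for the storage controller classes.
storage_controllers() {
    awk '$2 == "Class" {
        code = substr($3, 1, 4)          # strip the trailing colon
        if      (code == "0100") name = "SCSI"
        else if (code == "0101") name = "IDE"
        else if (code == "0106") name = "SATA"
        else if (code == "0108") name = "NVMe"
        else next                        # not a storage class we care about
        print $1, name
    }'
}

# Possible usage (not run here):
#   lspci | storage_controllers
```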
carmine wraith
#

I see sata, but no nvme mention

prime epoch
#

I just didn't expect the addon to work. Especially not installing packages (AKA writing data). Yeah seems it can't do NVMe. Or maybe it's just disabled in the UEFI for some reason. Weird not to use NVMe for an Alder Lake machine.

carmine wraith
#

I can check into the UEFI, but I don't think so

#

it's a cheap miniPC, I guess those are the cuts

prime epoch
#

Check df -hT and see how much space you need. I'd probably recommend a 250G~ one again.

carmine wraith
#

tbh I got it in a rush when the previous one died, similar to today 😂 One can't just live without HA for too many days

prime epoch
#

I'm just confused why the addon even works with all these errors.

carmine wraith
#

I keep 24h of recordings

carmine wraith
#

"while" being a random amount of time, till it explodes

#

in the meantime, it throws no errors

prime epoch
carmine wraith
prime epoch
#

Is it possible the drive is getting too hot?

carmine wraith
#

hmm it could be, the other SSD is right on top of it tbh, and space is tight

prime epoch
#

Maybe keep watch smartctl ... running for a bit.
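
One way to do that is to poll only the temperature attribute. This is a minimal sketch, assuming `smartctl` is available in the addon, that the system disk is `/dev/sdb` as the logs in this thread suggest, and that the drive reports temperature as attribute 194 like in the output posted earlier; `smart_temp` is an illustrative name.

```shell
# Print the raw value of SMART attribute 194 (temperature)
# from `smartctl -A` output supplied on stdin.
smart_temp() {
    awk '$1 == 194 { print $10 }'
}

# Possible polling loop, one reading every 5 seconds (not run here):
#   while true; do smartctl -A /dev/sdb | smart_temp; sleep 5; done
```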

carmine wraith
#

rn it's open and the large drive is outside hanging

prime epoch
#

Earlier it was 32°C which is totally fine but they can get hot very fast on certain actions. There isn't a lot that can soak up the heat. Ideally it just throttles but who knows.

carmine wraith
prime epoch
#

How did you run it before?

carmine wraith
#

I think so, in the addon. Can't remember

prime epoch
#

Hmm. Maybe the temperature is not reported correctly. Worst is the same value. Or I'm interpreting it wrong.

carmine wraith
#

sh: smartctl: not found and can't remember how I did before

prime epoch
#

See link.

#

How much does a suitable drive cost you?

carmine wraith
#

£40 is the cheapest I can see in a quick search

#

£45 probably

prime epoch
#

I can get a used one for 28~ here. New about 50~.

#

What's interesting is that some of these advertise just 50TBW and yours has reached > 70TBW.
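
The "> 70TBW" estimate comes from the Host_Writes_GiB raw value of 73765 in the SMART attributes. A sketch of the conversion, assuming the attribute really counts GiB as its name says and that advertised TBW ratings use decimal terabytes; `gib_to_tb` is an illustrative helper name:

```shell
# Convert GiB (2^30 bytes) to decimal TB (10^12 bytes).
gib_to_tb() {
    awk -v g="$1" 'BEGIN { printf "%.1f\n", g * 1024 * 1024 * 1024 / 1000000000000 }'
}

gib_to_tb 73765   # roughly 79 TB written
```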

carmine wraith
#

haven't looked at used, but amazon gets me today/tomorrow. I'd have to wait some more days for a used one. And you always test your luck

carmine wraith
#

I'm doing the watch smartctl, should I close it again to reproduce the normal conditions and see if temp rises?

prime epoch
#

Perhaps. I don't know the exact specifications of the AZW one and they can usually write more than their rating.

prime epoch
carmine wraith
#

now it's holding without errors for a good while tbh, better than ever before

carmine wraith
prime epoch
#

Ah you mean that. Yeah.

carmine wraith
#

it's hovering 30-32C anyway

prime epoch
#

Yeah put the physical stuff back into the original problem orientation/situation. Up to 65°C or so should be unproblematic for the M.2 one.

#

There's usually two temperatures. Controller and flash. Not sure which ones yours shows exactly.

carmine wraith
#

nah it's closed and it's not moving from 30 tbh. If it would have had so much of an impact it would have changed quickly

carmine wraith
#

that's worrying. How do laptops survive their life use then?

#

actually my current laptop is older I think

prime epoch
#

If the temperature doesn't change when you do things then perhaps that's a bad omen.
Laptops don't tend to run 24/7.

carmine wraith
#

yeah, fair enough

carmine wraith
#

I could try and use Immich a bit, but it's not like I've been putting HA under load these days.

#

at all

prime epoch
#

About that. Just yesterday I exchanged the S4500 datacenter SSD in my firewall with this specimen to save 1w~. I'm economic like that

carmine wraith
#

holy crap 😂

#

1W? Was it really worth it? Or was it free?

prime epoch
#

I have a lot of drives. It was absolutely not worth my time though. I was just bored and obsessed with saving power.

carmine wraith
#

now it crashed

prime epoch
#

Here's most of the SATA ones.

#

:<

carmine wraith
#

the SSH addon was getting stuck in the watch, I restarted the app, and noticed supervisor was crashing

#

and then the error chain reaction

prime epoch
#

Can you check the host logs again?

carmine wraith
#

sorry it's like this, but I don't know how to get them after reboot

carmine wraith
carmine wraith
prime epoch
#

Hmm so the erofs error is not so good. ro stands for read only so this is not just writes.

#

Some errors can be printed to the screen directly. You can change that via dmesg -n.

#

Next steps then. Wipe time.

carmine wraith
#

ok

#

can you remind me the commands and from where to run them? I got the backup ready

prime epoch
prime epoch
#

Run the wipe one via keyboard and monitor. IIRC it is guarded to not work via the addon. At least not directly.

carmine wraith
prime epoch
#

Good question. Under normal (supported) circumstances it should not. But under normal circumstances the second SSD would be the data disk so this is a bit confusing. Let me check something first.

carmine wraith
#

I already started it xD

#

I have backups anyway

#

we'll see

prime epoch
#

Technically it's more of an overwrite than a wipe but eh.

#

Your external disk has the STORAGE label so it shouldn't touch it but 🤷

carmine wraith
#

well, not sure if the wipe completed, but it went straight into the error chain

#

erofs

#

lots of them

prime epoch
#

You likely have to flash via ubuntu then. Shouldn't be too hard.
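
For reference, flashing from a live Linux session can also be done on the command line. The sketch below only builds and prints the command rather than running it, and refuses anything that is not a whole-disk device, since writing to the wrong disk destroys its contents. The image filename and target device are placeholders, not values from this thread, and `flash_command` is a made-up helper.

```shell
# Print (do not execute) a command that writes an xz-compressed disk image
# to a whole-disk device. Both arguments are supplied by the caller.
flash_command() {
    image=$1
    target=$2
    case "$target" in
        /dev/sd[a-z]|/dev/nvme[0-9]n[0-9]) ;;   # whole disks only, no partitions
        *) echo "refusing: $target is not a whole-disk device" >&2; return 1 ;;
    esac
    printf 'xz -dc %s | sudo dd of=%s bs=4M conv=fsync status=progress\n' "$image" "$target"
}

# Example with placeholder names; check the target with lsblk first:
#   flash_command haos-image.img.xz /dev/sdX
```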

carmine wraith
#

let me fetch the tutorial to jog my memory

carmine wraith
#

any way I can "refresh" them? Or just reboot and launch again and cross fingers?

prime epoch
#

That's nigh unreadable mate.

#

Run this in the CLI.

sudo lsblk -o+FSTYPE,LABEL,MODEL
#

Sorry for the delay, by the way.

carmine wraith
#

what cli, it's wiped now xD Not sure it can boot at all

#

sorry, it's the banding of the TV

prime epoch
#

In the ubuntu terminal.

carmine wraith
#

ahh lol

prime epoch
#

We're done with HA for now.

carmine wraith
#

just a bunch of loops, the USB drive, and the 2TB SSD

prime epoch
#

Hmm that's not good. Power it off and see if it's seated properly. Then try again.

carmine wraith
#

the wipe destroyed the drive it seems xD

#

oh now I got into UEFI

#

it just took like 4min

prime epoch
#

I usually disable PXE boot. The UEFI does not see it either?

carmine wraith
#

nope

#

I disconnected and connected it again just in case

#

:S

#

what a way to die

carmine wraith
#

I find it very strange that it sits on the boot logo for so long now

prime epoch
#

Every UEFI is a bit different. I disable that logo and fast boot too. I want to see all.

carmine wraith
#

but no, after 2 boots, it won't detect the drive :S

#

the USB gets detected fine

carmine wraith
prime epoch
#

Looks like it.

carmine wraith
prime epoch
#

No name and Frequently returned item. At least it's TLC.

carmine wraith
#

what's TLC?

prime epoch
#

But yeah if you buy on amazon you likely have some safety if it dies within 2 years. I can't find many in this budget range.

carmine wraith
#

doesn't mention TLC tho

prime epoch
#

This has less TBW too. 80 vs 160.

carmine wraith
#

I just saw that

#

I'll go with the famous ediloca then

prime epoch
#

Yep.

carmine wraith
#

plus it delivers tonight, so I can get things working before bed hopefully

#

thanks for the help, I'll ping you if I get stuck or when things get working again! Thanks a lot, as always!

carmine wraith
#

@prime epoch new drive received. Flashed, backup restored and working.

prime epoch
#

Yay.

carmine wraith
#

or should the backup have restored it too?

#

just saying as there's nothing in the large drive mapped folder

prime epoch
#

Not everything is backed up. Especially OS tweaks.

carmine wraith
#

yeah, makes sense

prime epoch
#

I think I read that there's work on that though.

carmine wraith
#

ughh I'll have to jog my memory hard. That was well outside my comfort zone

prime epoch
#

Might have just been about network configs. I don't fully remember.

prime epoch
carmine wraith
prime epoch
#

It's an OS-side tweak too.

prime epoch
#

You can put all your tweaks on a USB stick as per docs and they will just be reapplied automatically.

carmine wraith
#

ough, is there no way for backups to keep your local addons? I just realized I lost them 🙂

#

there's no "addons" folder in the backup settings