Linux Suggestions to fix odd Linux Raspberry Pi 4 problem - system up ssh down only some services running
Update - this is an old solved question. I've added what I've recently discovered on page 3.

 

I have an odd problem with the Raspberry Pi 4 + M.2 SSD on Raspbian that runs all kinds of things for me - PiHole including DHCP, Home Assistant, and PostgreSQL are the key ones. When this server goes down my network goes down because of lack of DHCP - until I switch over to router DHCP, which mucks up IP address allocation. Home Assistant does all kinds of air conditioning automation, including actively changing how the ducted system works so it doesn't overheat or overcool rooms, by directly adjusting damper positions.

 

Current state is everything is running in docker containers. I have a six month old system image backup including the OS and the whole file system, and I have nightly backups of key parts of the file system pushed out to S3, including the Docker folder, HA, key OS config files, etc. I turned off syslog to reduce disk usage, which I'm regretting now. At least Pi Hole is working so the network is up.

 

Last night I modified the hosts file, adding an extra domain to the end of an existing entry. I rebooted to make sure all caches were emptied. When it came up I was getting 10 emails per minute from Home Assistant telling me it couldn't connect to PostgreSQL. I couldn't connect to the server using SSH. The server is in a cupboard in my son's room so I couldn't access it, but I rebooted it by turning power to the cupboard off then on.

 

At that point it came up, the emails stopped, and I could SSH in. I stopped all the docker containers except PiHole, started the PostgreSQL container, and then started Home Assistant. The machine felt very slow: typing was instant but running commands took much longer than usual. top said CPU usage was minimal. Home Assistant came up. Around ten minutes later it stopped responding to SSH again, but Pi Hole is working, and I got an email overnight from apt regarding updates. Backups didn't run last night though; they're restic pushing to S3. In the ten minutes I could access the server I got access to the kernel log, which suggested there's some kind of disk issue (log below).

 

Question: how would you address this? Any suggestions for how to restore the OS so that I don't have to reinstall from scratch?

 

My plan is to approach things somewhat like this, obviously stopping if any step works:

 

  • Get the Pi into my office and reboot it to see if it's magically started working
  • Plug the SSD into my main Windows PC to check disk health with the manufacturer tool and Hard Disk Sentinel
  • Plug the SSD into my Ubuntu PC and run fsck
  • Try it again

If that fails to fix things I guess I'll reinstall the OS from scratch, then restore my various files from the last nightly backup.

 

 

 

Feb 22 20:52:38 pi4server kernel: [   47.037040] br-eace4cfeb730: port 1(veth3249e98) entered blocking state
Feb 22 20:52:38 pi4server kernel: [   47.037065] br-eace4cfeb730: port 1(veth3249e98) entered forwarding state
Feb 22 20:52:38 pi4server kernel: [   47.769010] Bluetooth: hci1: Opcode 0x c03 failed: -4
Feb 22 20:52:44 pi4server kernel: [   53.100876] sd 0:0:0:0: [sda] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=3s
Feb 22 20:52:44 pi4server kernel: [   53.100908] sd 0:0:0:0: [sda] tag#13 Sense Key : 0x4 [current]
Feb 22 20:52:44 pi4server kernel: [   53.100916] sd 0:0:0:0: [sda] tag#13 ASC=0x44 ASCQ=0x0
Feb 22 20:52:44 pi4server kernel: [   53.100926] sd 0:0:0:0: [sda] tag#13 CDB: opcode=0x28 28 00 23 4c 40 e8 00 02 80 00
Feb 22 20:52:44 pi4server kernel: [   53.100933] critical target error, dev sda, sector 592199912 op 0x0:(READ) flags 0x80700 phys_seg 80 prio class 2
Feb 22 20:52:44 pi4server kernel: [   53.101151] sd 0:0:0:0: [sda] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=3s
Feb 22 20:52:44 pi4server kernel: [   53.101156] sd 0:0:0:0: [sda] tag#15 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=3s
Feb 22 20:52:44 pi4server kernel: [   53.101162] sd 0:0:0:0: [sda] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=3s
Feb 22 20:52:44 pi4server kernel: [   53.101166] sd 0:0:0:0: [sda] tag#14 Sense Key : 0x4 [current]
Feb 22 20:52:44 pi4server kernel: [   53.101167] sd 0:0:0:0: [sda] tag#15 Sense Key : 0x4 [current]
Feb 22 20:52:44 pi4server kernel: [   53.101170] sd 0:0:0:0: [sda] tag#16 Sense Key : 0x4 [current]
Feb 22 20:52:44 pi4server kernel: [   53.101175] sd 0:0:0:0: [sda] tag#15 ASC=0x44 ASCQ=0x0
Feb 22 20:52:44 pi4server kernel: [   53.101175] sd 0:0:0:0: [sda] tag#16 ASC=0x44 ASCQ=0x0
Feb 22 20:52:44 pi4server kernel: [   53.101176] sd 0:0:0:0: [sda] tag#14 ASC=0x44 ASCQ=0x0
Feb 22 20:52:44 pi4server kernel: [   53.101181] sd 0:0:0:0: [sda] tag#16 CDB: opcode=0x28 28 00 02 e5 ae b0 00 00 18 00
Feb 22 20:52:44 pi4server kernel: [   53.101184] sd 0:0:0:0: [sda] tag#15 CDB: opcode=0x28 28 00 02 8e 56 c0 00 00 38 00
Feb 22 20:52:44 pi4server kernel: [   53.101184] sd 0:0:0:0: [sda] tag#14 CDB: opcode=0x28 28 00 00 c5 e8 70 00 00 40 00
Feb 22 20:52:44 pi4server kernel: [   53.101186] critical target error, dev sda, sector 48606896 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
Feb 22 20:52:44 pi4server kernel: [   53.101190] critical target error, dev sda, sector 12970096 op 0x0:(READ) flags 0x80700 phys_seg 8 prio class 2
Feb 22 20:52:44 pi4server kernel: [   53.101191] critical target error, dev sda, sector 42882752 op 0x0:(READ) flags 0x80700 phys_seg 7 prio class 2
Feb 22 20:52:44 pi4server kernel: [   53.101218] sd 0:0:0:0: [sda] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=3s
Feb 22 20:52:44 pi4server kernel: [   53.101227] sd 0:0:0:0: [sda] tag#12 Sense Key : 0x4 [current]
Feb 22 20:52:44 pi4server kernel: [   53.101229] sd 0:0:0:0: [sda] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=3s
Feb 22 20:52:44 pi4server kernel: [   53.101234] sd 0:0:0:0: [sda] tag#12 ASC=0x44 ASCQ=0x0
Feb 22 20:52:44 pi4server kernel: [   53.101235] sd 0:0:0:0: [sda] tag#17 Sense Key : 0x4 [current]
Feb 22 20:52:44 pi4server kernel: [   53.101241] sd 0:0:0:0: [sda] tag#17 ASC=0x44 ASCQ=0x0
Feb 22 20:52:44 pi4server kernel: [   53.101242] sd 0:0:0:0: [sda] tag#12 CDB: opcode=0x28 28 00 04 0c 2b 00 00 00 20 00
Feb 22 20:52:44 pi4server kernel: [   53.101248] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x28 28 00 00 09 8c 48 00 01 a8 00
Feb 22 20:52:44 pi4server kernel: [   53.101249] critical target error, dev sda, sector 67906304 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
Feb 22 20:52:44 pi4server kernel: [   53.101252] critical target error, dev sda, sector 625736 op 0x0:(READ) flags 0x80700 phys_seg 53 prio class 2
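
For anyone reading the log above: in the SCSI spec, sense key 0x4 is HARDWARE ERROR and ASC 0x44 / ASCQ 0x00 is INTERNAL TARGET FAILURE, and opcode 0x28 is a READ(10). So the device (or the USB bridge in front of it) is reporting an internal fault on reads, not the file system. Below is a minimal Python sketch for decoding the CDB lines; `decode_read10` is a made-up helper name, and the lookup tables only cover the codes that appear in this log.

```python
# Decode the READ(10) CDB bytes printed in the kernel log lines above.
# READ(10) layout: opcode(1) flags(1) LBA(4, big-endian) group(1)
#                  transfer length(2, big-endian) control(1)

# Only the codes seen in this log; not a complete table.
SENSE_KEYS = {0x4: "HARDWARE ERROR"}
ASC_ASCQ = {(0x44, 0x00): "INTERNAL TARGET FAILURE"}


def decode_read10(cdb_hex: str) -> dict:
    """Return the starting LBA and block count of a READ(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between bytes are ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


if __name__ == "__main__":
    # Two CDBs copied from the log above.
    for cdb in ("28 00 23 4c 40 e8 00 02 80 00",
                "28 00 00 09 8c 48 00 01 a8 00"):
        print(cdb, "->", decode_read10(cdb))
    print("sense key 0x4:", SENSE_KEYS[0x4])
    print("ASC/ASCQ 0x44/0x00:", ASC_ASCQ[(0x44, 0x00)])
```

The decoded LBAs (592199912 and 625736) match the sector numbers in the corresponding "critical target error" lines, so the failing reads are spread right across the disk rather than clustered in one spot.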

It’s a long read. Is everything installed on a compact flash drive? My guess is that the card has had too many writes and is failing. Time to replace it.
Move the Home Assistant database somewhere else so the compact flash will last longer.
Move DHCP back to your router.

 
 
 
 

Try an sdcard like https://www.pbtech.co.nz/product/MEMSDK32210/SanDisk-High-Endurance-64GB-Micro-SDXC-UHS-I-C10-U

Reading the description my thoughts were heading to "hmm...this might be a corrupt disk" and then you literally said "and then there are disk errors".

 

So that's where I would be concentrating. And getting prepared to replace the disk and rebuild.

 

I'd be doing the fsck first myself, but the steps you've listed seem sensible enough.

 

If everything is running in docker containers you could copy the volumes (or mounted directories) somewhere else and then rebuild and copy back. When I ran Pihole I had everything mounted to a directory and _most_ of the config defined in a docker-compose.yml. I think it was just my static reservations and hosts entries that I had in a file. I pretty much built it with ansible too, so I could recover that easily. Basically it's not too hard. For PostgreSQL you could use pg_dump to export your databases (perhaps you do that for backups already). Then recovery for that is mostly trivial. I don't know Home Assistant stuff so can't help there. Of course, this assumes you can do those exports with a corrupt disk and that they're not corrupted at all. Fingers crossed for you there.
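
As a rough illustration of the pg_dump idea, here is a minimal Python sketch that runs pg_dump inside a container through `docker exec` and writes the dump to a file on the host. The container, user, and database names are placeholders I made up, not anything from this setup, and the helper names are hypothetical.

```python
import subprocess


def pg_dump_command(container: str, user: str, database: str) -> list[str]:
    """Build the argv that runs pg_dump inside a running container."""
    return ["docker", "exec", container, "pg_dump", "-U", user, database]


def dump_to_file(container: str, user: str, database: str, out_path: str) -> None:
    """Stream the plain-SQL dump from the container into out_path on the host."""
    cmd = pg_dump_command(container, user, database)
    with open(out_path, "wb") as fh:
        # check=True raises CalledProcessError if pg_dump exits non-zero,
        # which matters when the underlying disk is throwing read errors.
        subprocess.run(cmd, stdout=fh, check=True)
```

Something like `dump_to_file("postgres", "hass", "homeassistant", "/mnt/other-disk/ha.sql")` would be the call, with all four arguments swapped for the real values. Writing the dump to a different disk is the point here, given the state of this one.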



Thanks @nzkc. The M.2 SSD is only six months old so it's under warranty, but not in stock locally. I'm hoping that it's a file system error rather than a physical disk error. I'm attaching it to my Windows PC to get a good look at the SMART data and the manufacturer info about the disk before I bother with an fsck, as there's not much point trying to recover it if the disk is borked. I can probably take a full system image more easily from Windows as well. Because fsck won't run on a mounted disk I figure I'll do that from another machine. The problem could be with the USB to M.2 interface, which is inside my Argon M.2 case.
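
On the mounted-disk point, a small Python sketch of the check worth doing on the other machine before running fsck: confirm that none of the SSD's partitions got auto-mounted. It parses text in the /proc/mounts format; `mounted_partitions` is a hypothetical helper, and /dev/sda is only an example device name.

```python
import re
from pathlib import Path


def mounted_partitions(device: str, mounts_text: str) -> list[str]:
    """Return mounted sources that are `device` itself or one of its partitions.

    mounts_text is expected in /proc/mounts format: source, mount point,
    file system type, options, and two numeric fields per line.
    """
    # Match /dev/sda, /dev/sda2, /dev/nvme0n1p2, but not /dev/sdaa1.
    pattern = re.compile(re.escape(device) + r"(p?\d+)?")
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and pattern.fullmatch(fields[0]):
            hits.append(fields[0])
    return hits


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        busy = mounted_partitions("/dev/sda", mounts.read_text())
        print("unmount these before fsck:" if busy else "nothing mounted", busy)
```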

 

I don't bother backing up the database, I only keep a week of HA history and I don't really care if I lose it. I have everything else backed up.

 

If the disk is borked I'll install on my spare Pi4 with a high endurance SD card while I wait. That will let me test my recovery process either way. Hopefully with docker and good backups it shouldn't take toooo long, which is why I set it up with Docker in the first place.

 

 

 

@fearandloathing thanks for trying to help, but I suspect commenting without reading the question is rarely going to be particularly helpful.

 

 

 

Any other thoughts or suggestions are welcome. I've used Linux a bit for 20 years but I'm nowhere near an expert.

  #3198955 23-Feb-2024 08:57

fearandloathing: I read the whole post, it’s early and my brain didn’t process that you had an SSD. Go figure


It is a chunky block of text with a lot of information. I do tend to be somewhat verbose when I ask technical questions.

Where I live, it is not unusual to get a power cut at random, so I have the following options in /boot/cmdline.txt (or /boot/firmware/cmdline.txt in the latest OS):

 

fsck.mode=force fsck.repair=yes

 

Not a solution, but does help fix future drive corruption errors on reboot.
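
A note on the option names: as I understand systemd-fsck, the kernel command-line switches are `fsck.mode=` (auto, force or skip) and `fsck.repair=` (preen, yes or no), and cmdline.txt has to stay on a single line. Here is a rough Python sketch for adding them without creating duplicates; `ensure_options` is a made-up helper and the sample command line in the demo is invented for illustration.

```python
def ensure_options(cmdline: str, wanted: dict[str, str]) -> str:
    """Return a single-line kernel command line with `wanted` options set.

    Existing tokens with the same key are dropped first, so running this
    twice gives the same result as running it once.
    """
    tokens = cmdline.split()
    # The key is everything before the first "=", so root=PARTUUID=... is safe.
    kept = [t for t in tokens if t.split("=", 1)[0] not in wanted]
    kept += [f"{key}={value}" for key, value in wanted.items()]
    return " ".join(kept)


if __name__ == "__main__":
    # Invented example line, not taken from a real cmdline.txt.
    current = "console=tty1 root=PARTUUID=abcd-02 rootfstype=ext4 rootwait"
    print(ensure_options(current, {"fsck.mode": "force", "fsck.repair": "yes"}))
```

Keep a copy of the original file before writing the result back, since a broken cmdline.txt stops the Pi from booting.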



Thanks, that will be useful in future. I considered getting some kind of UPS for this but haven't gotten around to it. Possibly just a small external battery pack.

Is it a decent SSD? Some of them seem to have the stupidity of an SD card when it comes to wear levelling, and die just as quickly when used on a Pi. Looking at you, KingSpec.




Richard rich.ms

richms:

 

Is it a decent SSD? Some of them seem to have the stupidity of an SD card when it comes to wear levelling, and die just as quickly when used on a Pi. Looking at you, KingSpec.

 

 

It's an AData SU650 512GB SSD. You tell me if it's decent or not. The main requirement was to get one that worked with the R.Pi4 and my case, Argon One M.2.

As others have alluded to, it's best to run SMART tools on the Pi and see what the disk is showing.

Use the usual Linux tools to scan and repair the disk.

One question I have: have you made sure you have enabled TRIM?

I'd recommend a quick Google for all of these, and running the repair as already mentioned only after performing a SMART scan. There's not much point trying to repair if the disk is done for...
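
For the SMART part, a minimal Python sketch, assuming smartmontools 7.x or later (which added JSON output via `smartctl -j`) and that the USB bridge passes SMART through with `-d sat`. The JSON key names used here are how I understand smartctl's ATA output, so treat them as an assumption and check against a real report. `read_smart` and `summarize` are made-up helper names.

```python
import json
import subprocess

# ATA attribute IDs that usually matter for a failing disk:
# 5 reallocated sectors, 197 pending sectors, 198 offline uncorrectable,
# 199 interface CRC errors (often cabling or the USB bridge).
WATCH_IDS = {5, 197, 198, 199}


def read_smart(device: str, device_type: str = "sat") -> dict:
    """Run smartctl and return its JSON report as a dict (needs root)."""
    # No check=True: smartctl's exit status is a bitmask and can be
    # non-zero even when it produced a usable report.
    result = subprocess.run(
        ["smartctl", "-a", "-j", "-d", device_type, device],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def summarize(report: dict) -> dict:
    """Pull the overall health flag and the watched raw values from a report."""
    passed = report.get("smart_status", {}).get("passed")
    table = report.get("ata_smart_attributes", {}).get("table", [])
    attributes = {
        row["name"]: row["raw"]["value"]
        for row in table
        if row.get("id") in WATCH_IDS
    }
    return {"passed": passed, "attributes": attributes}
```

Calling `summarize(read_smart("/dev/sda"))` gives a short dict to eyeball. A rising CRC error count with clean sector counts would point at the adapter rather than the SSD.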

 

 

If you put it into something so you can see it on Windows, I find HD Sentinel gives a decent amount of data on how stuffed an SSD is.

 

The pictures make it look DRAM-less, which is worse for the lots of small writes a Pi generates. That puts it at the same level as an SD card, writing far more to the flash than the amount you actually write to it, whereas the DRAM-cached ones leave things in DRAM until there is enough to bother erasing and re-writing a block. That's bad if you have unexpected power-offs, but good for lifespan and performance.

 

I have found that SSDs are probably the only PC part where I have seen a significant deterioration in performance and lifespan since they were introduced, with all the MLC and no-DRAM stuff being put out now that works well enough for someone booting a PC and running Chrome but is crap for any real work.

 

The KingSpecs that I had issues with would just get slower and slower to access until Windows gave up on them, and they stayed that way until fully power cycled. The stats were still fine for lifetime, amount of writes etc., but something was clearly not right. Then one day they just started to go red in HD Sentinel with a remaining life of 0%.

 

It wasn't running in a Pi but had similar symptoms - very, very slow, random corruption happening (probably from the power cycling while it was still doing slow housekeeping on stuff in the SLC area) and just unusable. It was still working enough to clone onto a Samsung 970 at 4x the price, and that has run for a few years now with the same workload and no slowdown. Even though HD Sentinel is warning me it's at the end of life because of the amount written to it, it is still working fine.





HDSentinel says the disk is ok, just a bit of use.

 

I plugged it into my i2600K Ubuntu Linux box via the USB adapter and ran fsck. It found an error and fixed it.

 

I then booted the disk. It did disk checks and rebooted a few times. The Pi is in "emergency mode", says "press enter" but won't go far. The primary set of error messages look like this and the images below.

 

Based on this it looks like file system corruption, rather than a physical disk issue. The disk could have caused the corruption, I guess. The Pi was working fine, I did a reboot and it stopped responding.

 

My plan of action is to get HDSentinel to do a disk test, then reinstall Raspbian and try to restore from backup. I don't think I can recover this at all.

 

 

 

EXT4-fs error (device SDA2). ext4_get_inode.... unable to read itable block (about 8 lines of this)

 

EXT4-fs error (device SDA2). reading directory iblock0 (about ten lines of this)

 

 

 

 

Running the HDSentinel "quick scan", which is estimated at 2 minutes, has taken about 10 minutes so far... and it says 37 seconds elapsed. I'm thinking the SSD is borked, either that or the USB to M.2 interface in the Argon One case. Hard to tell without another M.2 to USB adapter... I'll try to find one tomorrow, suggestions welcome. This UGreen looks to do NVMe and SATA SSDs but isn't in stock locally, but I think it looks like the best option as it does the B+M key that I need, even if it does take a few days to arrive.

 

Finding the right kind of SSD is somewhat tricky. B key or B+M key, SATA rather than NVMe, 2280. PBTech has this range, none of which seem great quality. This Transcend range has some described as industrial / embedded. This Transcend 64GB costs about the same as the AData 512GB but has a DRAM cache and 3520 TBW. This one is about the same price, also has DRAM, and has 9680 TBW.
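
To put those TBW figures in perspective, here is a quick back-of-envelope calculation in Python. The 20 GB/day write volume is purely an assumption for illustration (I haven't measured what this server writes), and `years_of_writes` is a made-up helper. It treats the TBW rating as total host writes in decimal terabytes.

```python
def years_of_writes(tbw_terabytes: float, host_gb_per_day: float) -> float:
    """Years until the rated TBW is reached at a constant daily write volume."""
    if host_gb_per_day <= 0:
        raise ValueError("daily write volume must be positive")
    total_gb = tbw_terabytes * 1000  # decimal TB -> GB
    return total_gb / (host_gb_per_day * 365)


if __name__ == "__main__":
    assumed_gb_per_day = 20  # assumption, not a measured figure
    for tbw in (3520, 9680):  # the two ratings quoted above
        years = years_of_writes(tbw, assumed_gb_per_day)
        print(f"{tbw} TBW at {assumed_gb_per_day} GB/day: about {years:.0f} years")
```

Either rating is far beyond what a Pi would write in its lifetime at that rate, so between those two the choice comes down to price and availability rather than endurance.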

 

I think I'll return this disk as faulty and get the second Transcend, but I'll search around a bit first. The 2 minute test is up to about 15 minutes elapsed now, I don't want to risk it. I'll set up my other R.Pi with an SD card until I can get a replacement, which might take a week or so since it's not store stock.

I wouldn't leave such existential network functions to a Raspberry Pi without redundancy - but I've already said that. 😉





