this post was submitted on 19 Oct 2023

Two separate Hard Drives corrupted in as many days... User error? (midwest.social)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]

Edit- I set the machine to work last night running memtester and badblocks (read-only). Both tests came back clean, so I assumed I was in the clear. Today, wanting to be extra sure, I ran a read-write badblocks test and watched dmesg while it worked. I got the same errors, this time on ata3.00. Given that the memory test and smartctl both came back clean, I can only assume the problem is with the ATA module, or somewhere between the CPU and the ATA bus. I'll be doing a BIOS update this morning and then trying again, but it seems to me like this machine was a bad purchase. I'll see what options I have for a replacement.
Edit-2- I retract my last statement. It appears that only one of the drives is still having issues, which is the SSD from the original build. All write interactions with the SSD produce I/O errors (including re-partitioning the drive), while there appear to be no errors reading or writing to the HDD. Still unsure what caused the issue on the HDD. Still conducting testing (running badblocks rw on the HDD, might try seeing if I can reproduce the issue under heavy load). Safe to say the SSD needs repair or to be pitched. I'm curious if the SSD got damaged, which would explain why the issue remains after it was zeroed out and re-written, and why the HDD now seems fine. Or maybe multiple SATA ports have failed now?

I have no idea if this is the forum to ask these types of questions, but it felt a little like a murder mystery that would be a little fun to solve. Please let me know if this type of post is unwelcome and I will immediately take it down and return to lurking.

Background:

I am very new to Linux. Last week I purchased a cheap refurbished headless desktop so I could build a home media server, as well as play around with VMs and programming projects. This is my first ever exposure to Linux, but I consider myself otherwise pretty tech-savvy (I dabble in Python scripting in my spare time, but not much beyond that).

This week, I finally got around to getting the server software installed and operating (see details of the build below). Plex was successfully pulling from my media storage and streaming with no problems. As I was getting the docker containers up, I started getting "not enough storage" errors for new installs. Tried purging docker a couple of times, still couldn't proceed, so I attempted to expand the virtual storage in the VM. Definitely messed this up: Plex immediately stopped working, and no files were visible on the share anymore. To me, it looked as if it had attempted to take storage from the SMB share to add to the system files partition. I/O errors on the OMV virtual machine for days.

Take two.

I got a new HDD (so I could keep working as I tried recovery on the SSD). I got everything back up (created a whole new VM for docker and OMV). Gave the docker VM more storage this time (I think I was just reckless with my package downloads anyway), and made sure that the SMB share was properly mounted. As I got the download client running (it made a few downloads), I noticed the OMV virtual machine redlining on memory in the Proxmox window. Thought, "uh oh, I should fix that." Tried taking everything down so I could reboot the OMV VM with more memory allocated, but the shutdown process hung on the OMV VM. Made sure all my devices on the network were disconnected, then stopped the VM from the Proxmox window.

On OMV reboot, I noticed all kinds of I/O errors on both the virtual boot drive and the mounted SSD. I could still see files in the share on my LAN devices, but any attempt to interact with the folder stalled and would error out.

I powered down all the VMs and now I'm trying to figure out where I went wrong. I'm tempted to just abandon the VMs and install it all on an Ubuntu OS, but I like the flexibility of having the VMs to spin up new OSes and try things out. The added complexity is obviously over my head, but if I can understand it better I'll give it another go.

Here's the build info:

Build:

HP ProDesk 600 G1
Intel i5
upgraded 32 GB after-market DDR3 1600 MHz Patriot RAM
KingFlash 250 GB SSD
WD 4 TB SSD (originally an NTFS drive from my Windows PC with ~2 TB of existing data)
WD 4 TB HDD (bought this after the SSD corrupted, so I could get the server back up while I dealt with the SSD)
500 Mbps ethernet connection

Hypervisor

Proxmox (latest), Ubuntu kernel
VM110: Ubuntu-22.04.3-live server amd64, OpenMediaVault 6.5.0
VM130: Ubuntu-22.04.3-live, docker engine, portainer
- Containers: Gluetun, qBittorrent, Sonarr, Radarr, Prowlarr
LXC101: Ubuntu-22.04.3, Plex Server

Allocations

VM110: 4 GB memory, 2 cores (ballooning and swap ON)
VM130: 30 GB memory, 4 cores (ballooning and swap ON)

Shared Media Architecture (attempt 1)

Direct-mounted the WD SSD to VM110. Partitioned and formatted the file system inside the GUI, created a folder share, set permissions for my share user. Shared it over SMB/CIFS.
Bind-mounted the shared folder to a local folder in VM130 (/media/data).
Passed the mounted folder to the necessary docker containers as volumes in the docker-compose file (e.g. - volumes: /media/data:/data, etc.).
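
One thing worth ruling out with a layout like this is the share silently not being mounted inside VM130. If nothing is mounted at /media/data, anything the containers write there lands on the VM's own disk instead of the share. A minimal Python sketch of that check follows; `is_mounted` and `check_media_mount` are hypothetical helper names, and the only path taken from the post is /media/data.

```python
import os


def is_mounted(path: str) -> bool:
    """Return True if `path` is itself a mount point, not just a plain directory."""
    return os.path.ismount(path)


def check_media_mount(path: str = "/media/data") -> str:
    """Describe the state of the directory the containers use as a volume."""
    if not os.path.isdir(path):
        return f"{path}: missing"
    if not is_mounted(path):
        # Writes here would go to the local disk, not to the SMB share.
        return f"{path}: exists but nothing is mounted there"
    return f"{path}: mounted"


if __name__ == "__main__":
    print(check_media_mount())
```

Running this before bringing the containers up would at least separate "the share is not mounted" from "the share is mounted but the disk behind it is throwing errors".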

No shame in being told I did something incredibly dumb; I'm here to learn, anyway. Maybe just not learn in a way that destroys 6 months of DVD rips in the process.

[–] [email protected] 25 points 1 year ago (1 children)

I had horrible errors in my ZFS pools until I did a memtest. Fixing my RAM eliminated all the errors.

[–] [email protected] 8 points 1 year ago

this is looking like a likely scenario, thanks for the suggestion!

[–] [email protected] 23 points 1 year ago (2 children)

If you are getting actual hardware/sata errors on the host (not sure if that's exactly what's happening from your description), and multiple drives have had a similar problem, I'd suspect the sata cable or controller/mobo. Intel had a lot of weird sata issues on their older chipsets, so I'd also recommend making sure it has the latest bios update. Could you be more specific on what kind of hardware errors are showing up? Like, maybe parts of the logs.

[–] humancrayon 3 points 1 year ago

Came here to say I had something similar happen with my NAS a year back. Thought it was the drives, then the controller it was attached to. Turns out it was some crappy blue breakout cables causing the drives to error out and disconnect.

Ordered new breakout cables of a different brand and have zero errors since.

[–] [email protected] 2 points 1 year ago (1 children)

I'm going back and looking, but I may have deleted the logs for the VMs when I deleted the VM and started repair.

Here's a readout of one of the instances of trying to shut down the VM and having to ssh in and 'force' a shutdown (didn't think I was forcing it from the terminal window, but maybe I did?). Doesn't give much more information.

/var/log/pve/tasks/D/UPID:pve1:00000AD5:0000AEFC:652DD26D:qmshutdown:110:root@pam::TASK ERROR: VM quit/powerdown failed - got timeout

I'm still looking for more detailed logs, but I'm starting to wonder if you're right. This makes me more sad than having messed something up myself, because fixing it would involve buying more hardware :(

Oop, just found some better ones in journalctl. These were happening way earlier than I thought:

Oct 13 17:28:42 pve1 kernel: SMB2_read: 36 callbacks suppressed
Oct 13 17:28:42 pve1 kernel: CIFS: VFS: Send error in read = -5
Oct 13 17:28:42 pve1 kernel: CIFS: Status code returned 0xc0000185 STATUS_IO_DEVICE_ERROR
Oct 13 17:28:42 pve1 kernel: CIFS: VFS: Send error in read = -5
Oct 13 17:28:42 pve1 kernel: CIFS: Status code returned 0xc0000185 STATUS_IO_DEVICE_ERROR

there's a million of these.

And here are some of the ones I was seeing when I popped open the console while it was happening. pve1 was the mount device to the VM running the OMV server, I think:

Oct 14 23:14:14 pve1 kernel: I/O error, dev sdb, sector 1801348384 op 0x0:(READ) flags 0x1000000 phys_seg >

and a bunch of these, looks like they happen after a lot of I/O errors happen and the system can't reach the smb server anymore:

Oct 16 13:25:04 pve1 kernel: CIFS: VFS: \\192.168.0.135\plex_media BAD_NETWORK_NAME: \\192.168.0.135\plex_media

Here are the ones from yesterday, probably around the time I was getting the new HDD back up again. These call out the SATA port specifically, and they repeat in a loop:

Oct 18 21:52:22 pve1 kernel: ata4.00: configured for UDMA/133
Oct 18 21:52:22 pve1 kernel: ata4: EH complete
Oct 18 21:52:22 pve1 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 18 21:52:22 pve1 kernel: ata4.00: irq_stat 0x40000001
Oct 18 21:52:22 pve1 kernel: ata4.00: failed command: WRITE DMA EXT
Oct 18 21:52:22 pve1 kernel: ata4.00: cmd 35/00:a8:00:38:ff/00:00:27:01:00/e0 tag 18 dma 86016 out
                                      res 51/04:a8:00:38:ff/00:00:27:01:00/e0 Emask 0x1 (device error)
Oct 18 21:52:22 pve1 kernel: ata4.00: status: { DRDY ERR }
Oct 18 21:52:22 pve1 kernel: ata4.00: error: { ABRT }

and here's some more implicating the sdc device:

Oct 18 21:57:23 pve1 kernel: sd 3:0:0:0: [sdc] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Oct 18 21:57:23 pve1 kernel: sd 3:0:0:0: [sdc] tag#25 Sense Key : Illegal Request [current]
Oct 18 21:57:23 pve1 kernel: sd 3:0:0:0: [sdc] tag#25 Add. Sense: Unaligned write command
Oct 18 21:57:23 pve1 kernel: sd 3:0:0:0: [sdc] tag#25 CDB: Write(16) 8a 00 00 00 00 00 e8 c4 08 00 00 00 00 08 00 00
Oct 18 21:57:23 pve1 kernel: I/O error, dev sdc, sector 3905161216 op 0x1:(WRITE) flags 0x1008800 phys_seg 1 prio class 2

These are actually kind of painting a picture. The culprit looks like that SATA port. I'll see if I can switch the drive to another port and do some test writes; maybe that'll fix it.
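
To check whether one port really is the only one complaining, the "failed command" lines can be tallied per ATA port. This is a minimal Python sketch, assuming the kernel lines have the same shape as the excerpts above; `tally_failed_commands` is a hypothetical helper name and the sample is copied from the Oct 18 log.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... pve1 kernel: ata4.00: failed command: WRITE DMA EXT"
FAILED_CMD = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+)$")


def tally_failed_commands(lines):
    """Count failed ATA commands, keyed by (port, command name)."""
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample lines taken from the log excerpt in this comment.
SAMPLE = """\
Oct 18 21:52:22 pve1 kernel: ata4.00: configured for UDMA/133
Oct 18 21:52:22 pve1 kernel: ata4: EH complete
Oct 18 21:52:22 pve1 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 18 21:52:22 pve1 kernel: ata4.00: failed command: WRITE DMA EXT
Oct 18 21:52:22 pve1 kernel: ata4.00: status: { DRDY ERR }
""".splitlines()

if __name__ == "__main__":
    for (port, command), n in tally_failed_commands(SAMPLE).most_common():
        print(f"{port}  {command}  x{n}")
```

To run it against the real log, the output of `journalctl -k` could be saved to a file and its lines passed in instead of SAMPLE. If only one port ever shows up, that points at that port, cable, or drive; errors spread across several ports point further upstream.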

[–] [email protected] 2 points 1 year ago (2 children)

The CIFS errors and logs inside the VMs are rather uninteresting as they're just passing through the underlying HW's issue.

These logs presented here definitely indicate an issue between CPU and drives. Could also be RAM but I'd check SATA cables and controllers first.

[–] [email protected] 4 points 1 year ago* (last edited 1 year ago) (1 children)

Yup, after scrubbing the log file, the problem port is ONLY ATA port 4.00. No other ports have thrown errors, BUT I just did a block check on all the boot drive partitions, and it looks like they all have bad superblocks... Not sure whether the issue is with the specific SATA port, whether it originates in the memory, or whether the bad blocks get propagated to the other drives. Unclear.

Oct 19 09:59:17 pve1 kernel: ata4.00: cmd 35/00:08:00:08:c4/00:00:e8:00:00/e0 tag 15 dma 4096 out
Oct 19 09:59:17 pve1 kernel: ata4.00: status: { DRDY ERR }
Oct 19 09:59:17 pve1 kernel: ata4.00: error: { ABRT }
Oct 19 09:59:17 pve1 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 19 09:59:17 pve1 kernel: ata4.00: irq_stat 0x40000001
Oct 19 09:59:17 pve1 kernel: ata4.00: failed command: WRITE DMA EXT

i'll do:

a memory test
swap the ports of the HD's to specifically avoid port4.00
do a read-write test to make sure the issue doesn't re-appear.

If none of the above solves the mystery, I suppose I can splurge on another junker and see if I have better luck on the next one. I just have to decide if I wait for ddrescue to finish, or just start it now... Probably start it now, on the off-chance I'm just creating more bad blocks on the backup.

[–] [email protected] 1 points 1 year ago (1 children)

do a read-write test to make sure the issue doesn’t re-appear

I can recommend https://wiki.archlinux.org/title/Badblocks

[–] [email protected] 1 points 1 year ago (1 children)

fuck me, that test was damning. Read test was fine, but starting a read-write test revealed all the same I/O errors as before, this time on a different port.

[–] [email protected] 1 points 1 year ago

Have you tried running a different distro live from USB or something like that? Doesn't seem likely that it would help, but who knows...

It's unlikely the kernel or other low-level code is the problem on 10 year old Intel hardware, though. I've run numerous distros on numerous different machines, many of which were Intel-based, over the last couple decades, and never had this kind of basic, low-level problem with SATA before without it being the cable or controller. Oh, I just remembered: check the PSU as well if you can. A faulty PSU could have a bad rail or wire or something that leads to these problems. If you have a known-good one lying around, depending on the motherboard, you could try temporarily hooking it up to the board and drive and see if it changes anything.

To eliminate Linux as a potential culprit, you could try to install Windows (7, 8, 10, whatever) and see if it exhibits similar problems.

[–] [email protected] 1 points 1 year ago (1 children)

Well shit. Looks like the other sata ports are having the same problem.

Trying to get a hardware probe running, but what are the chances I need to replace the motherboard or the whole machine? It's looking likely the problem is upstream from the SATA drives themselves; I just don't know if it's worth trying to swap the CPU before ditching the machine entirely. I don't have a CPU lying around to test it. memtester came back clean after 5 passes.

[–] [email protected] 1 points 1 year ago

I'd honestly just abandon the hardware. It's not worth your time to deal with that.

[–] [email protected] 18 points 1 year ago* (last edited 1 year ago) (1 children)

This has the indications of a hardware error, tbh. Nothing you're doing is out of the ordinary as far as maintenance goes.

Have you run a memtest on the machine to verify that your memory is fine? You mentioned that it was a refurb machine, but is it possible that they didn't test all of the functionality and maybe just did a cursory pass through before testing if it was completely stable?

I've also had issues like this when I had a failing southbridge which controlled my SATA ports long ago. So honestly, this sounds like your refurb machine wasn't tested well and may still have issues.

[–] [email protected] 1 points 1 year ago (1 children)

I think this is it. Checked memory with memtester, came back clean. Switched to a different ATA port, started a read-write test using badblocks, and I was flooded with I/O errors, even on the new port and with a fresh disk wipe.

Looks like I'll be looking for a replacement. Thanks for the suggestion, as painful as it was!

[–] [email protected] 1 points 1 year ago

Yeah, sometimes when learning something new it's a 'forest for the trees' situation, and you've gotta fall back on what you DO know. I'm glad the Linux community here on Lemmy could help you out. It looks like everyone else covered the stuff I missed (bad SATA cable, power supply issues, etc.), so good luck! :)

[–] [email protected] 11 points 1 year ago (1 children)

It’s been roughly 20 years now but my employer at the time had a number of servers that started having odd drive failures at similar times. Long story short we eventually discovered that it was the power supplies that were starting to fail.

These servers had something like 6 hard drives in them, and while troubleshooting we started seeing a pattern where any 5 would work, but as soon as the 6th was reconnected then drives would randomly fail. We eventually replaced the power supply and all 6 drives were happy again.

[–] [email protected] 3 points 1 year ago

It's something I've had with my Raspberry Pi as well. The USB ports on the Pi only supply a limited current, and I was exceeding that with two SSDs, causing one drive to spam I/O errors and drop into read-only mode.

Solved with an external hub so all good now :)

[–] [email protected] 9 points 1 year ago (1 children)

How do you have all these drives attached physically? Are they connected through USB, and is there anything else also attached?

I had a similar issue of constantly corrupting USB HDDs, turns out the mini pc I was using couldn't draw enough power to keep them all going simultaneously so they would fail and become corrupt.

[–] [email protected] 4 points 1 year ago* (last edited 1 year ago)

they are all sata 3 internal hard drives. I just swapped the sata port that was having the problems, and so far the testing has been looking good. I have been worried, though, that the psu isn't big enough, or that the cabling is too old (they are old-school color-labeled and relatively thin compared to my desktop's). it should be fine though, since the specs say it supports up to 4 drives (including optical)

edit - the tests did NOT go well. read-write on a different sata port produced the same errors.

[–] [email protected] 8 points 1 year ago (1 children)

I recommend using smartctl to check the drive health.

[–] [email protected] 2 points 1 year ago (2 children)

Yes this should be the first thing. Run smartctl -a /dev/sda (replace with your actual hdd device) and look at the attributes. You can copy it here so we can advise. Typical failure indicators are:

Attribute 5 (reallocated sector count)
10 (spin retry count)
184 (end to end error)
187 (reported uncorrectable errors)
188 (command timeout)
197 (current pending sector count)
198 (offline uncorrectable sector count)
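
The attribute check above can be scripted. This is a minimal Python sketch, assuming the attribute table has the usual ten-column `smartctl -A` layout with the raw value in the last column; `flag_attributes` is a hypothetical helper, and the sample rows are shaped like the readout posted further down the thread, with the THRESH column shown as a `---` placeholder.

```python
# Attribute IDs listed in this comment as typical failure indicators.
WATCHED = {5, 10, 184, 187, 188, 197, 198}


def flag_attributes(smartctl_output: str):
    """Return (id, name, raw_value) for watched attributes with a non-zero raw value."""
    flagged = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged.append((attr_id, name, int(raw)))
    return flagged


# Sample rows; THRESH is a placeholder, the raw values are the ones that matter here.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       105
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for attr_id, name, raw in flag_attributes(SAMPLE):
        print(f"attribute {attr_id} ({name}) has raw value {raw}")
```

Raw values are vendor-specific, so a non-zero number is a prompt to look closer rather than proof of a failing drive.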

[–] [email protected] 1 points 1 year ago (1 children)

this came back clean, though the drives did not have smart reporting enabled. looks like the ata controller or some component between the cpu and sata bus is fucked.

[–] [email protected] 1 points 1 year ago

Actually, I think I've identified an issue with the original SSD. Here's the readout for sdb, which I was just having more issues with:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       2050
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       11
165 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       4194345
166 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
167 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       159
168 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       1
169 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       1859
170 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
174 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       6
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       105
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   074   049   ---    Old_age   Always       -       26 (Min/Max 22/49)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
230 Unknown_SSD_Attribute   0x0032   001   001   ---    Old_age   Always       -       34359738376
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Media_Wearout_Indicator 0x0032   100   100   ---    Old_age   Always       -       1773
234 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       1852
241 Total_LBAs_Written      0x0030   253   253   ---    Old_age   Offline      -       1787
242 Total_LBAs_Read         0x0030   253   253   ---    Old_age   Offline      -       9876
244 Unknown_Attribute       0x0032   000   100   ---    Old_age   Always       -       0

[–] [email protected] 8 points 1 year ago (1 children)

You've gotten some really good replies already (it's likely the cable or port issues). I just want to point out that the KingFast brand is dollar-store quality memory and storage. There are many online posts discussing their drives failing or corrupting after power interruptions, etc.

I know you said it's been replaced already, so just a caution against saving it for another rainy day project. I binned mine upon receipt of a refurbished PC.

[–] [email protected] 2 points 1 year ago

that's good to know, thanks.

I figured it was bottom-barrel. I hadn't really heard of the brand, and it was a bottom-barrel-priced refurb.

[–] [email protected] 5 points 1 year ago

If the disks are of the same type, check their serial numbers.

Once I set up a RAID with four 120GB disks. Back then, they were about as close to cutting-edge technology as a 16TB drive would be today, and expensive as f-ck. Within a week, two disks failed, bringing the RAID down. One failed in the evening, the other in the morning. When I called about warranty, I noticed that all four disks were within +-20 of each other in their serial numbers, and got suspicious. I got the two drives replaced (with different, widely spread serial numbers), set up the RAID again, only to have another failure within less than ten days: another one of the original set dead. This time I asked not only for a replacement of the newly dead one, but also of the fourth, which was declined. I cut my losses and set up a much smaller RAID with only three disks. The fourth is in a drawer somewhere, with a big red warning sticker.

[–] [email protected] 2 points 1 year ago

Try to avoid badblocks if you can. It is really hard on storage devices.

[–] [email protected] 2 points 1 year ago

Which brand are they?