this post was submitted on 11 Jul 2023

ZFS says drive is faulted, does that always mean it needs replacing? (lemmy.world)

submitted 1 year ago by [email protected] to c/[email protected]

My weekly zpool scrub came back with this:

  pool: blackhole
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 0B in 02:01:59 with 0 errors on Tue Jul 11 04:02:09 2023
config:

    NAME                                    STATE     READ WRITE CKSUM
    blackhole                               DEGRADED     0     0     0
      raidz1-0                              DEGRADED     0     0     0
        ata-WDC_WD120EDAZ-11F3RA0_5PG8DYKC  ONLINE       0     0     0
        ata-WDC_WD120EFBX-68B0EN0_5QKJ6M8B  ONLINE       0     0     0
        ata-WDC_WD120EFBX-68B0EN0_5QKJTT8B  FAULTED     51     0     0  too many errors

errors: No known data errors

I only got the drive 6 months ago, well within WD's 3-year warranty, so I opened a support case. But do errors like this basically always mean the drive is on its way out, or is it possible to have false positives?
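For anyone who wants to check output like the above from a script, here is a minimal Python sketch that pulls the non-healthy rows out of `zpool status` text. The `Vdev` tuple and the `unhealthy_vdevs` helper are hypothetical names made up for illustration, and the regex assumes plain integer error counts (large counts may be abbreviated in the default output; as far as I know `zpool status -p` prints exact values).

```python
import re
from typing import NamedTuple


class Vdev(NamedTuple):
    name: str
    state: str
    read: int
    write: int
    cksum: int


# Matches indented config rows such as
# "    ata-WDC_...  FAULTED     51     0     0  too many errors".
# The NAME/STATE header row does not match because READ is not a number.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def unhealthy_vdevs(status_text: str) -> list[Vdev]:
    """Return every config row that is not ONLINE or has a non-zero error count."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        name, state = m.group(1), m.group(2)
        read, write, cksum = (int(m.group(i)) for i in (3, 4, 5))
        if state != "ONLINE" or read or write or cksum:
            rows.append(Vdev(name, state, read, write, cksum))
    return rows
```

Fed the output in this post, it would report the pool, the raidz1 vdev (both DEGRADED) and the FAULTED drive with its 51 read errors.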

[–] [email protected] 10 points 1 year ago (4 children)

Typically, yes. But it could also be due to a flaky SATA cable, connection, or controller, so you might try moving it to a different port if you are able, clearing the error, and seeing if it recurs.

Regardless, just make sure you have a good backup of the data or are confident in the other two disks.

[–] [email protected] 4 points 1 year ago

Changing the cable or reseating the SATA connector, clearing the errors, and starting a scrub is what I always do.
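Roughly, the clear-then-scrub routine looks like the sketch below, written in Python so it can be dropped into a script. `zpool clear`, `zpool scrub` and `zpool status -v` are standard subcommands (the `action:` line in the original post even suggests `zpool clear`), but the helper names are hypothetical, and the pool and device names are whatever your own `zpool status` shows.

```python
import subprocess


def recheck_commands(pool: str, device: str) -> list[list[str]]:
    """Commands to reset one device's error counters and re-verify the pool."""
    return [
        ["zpool", "clear", pool, device],  # reset the error counts on that device
        ["zpool", "scrub", pool],          # start re-reading and verifying every block
        ["zpool", "status", "-v", pool],   # show scrub progress and any new errors
    ]


def run_recheck(pool: str, device: str) -> None:
    """Run the commands in order, stopping at the first failure (needs root)."""
    for cmd in recheck_commands(pool, device):
        subprocess.run(cmd, check=True)
```

For the pool in this thread that would be something like `run_recheck("blackhole", "ata-WDC_WD120EFBX-68B0EN0_5QKJTT8B")`. The scrub runs in the background, so the status printed right afterwards only shows that it has started.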

[–] [email protected] 2 points 1 year ago

I've been suuuuper lazy troubleshooting this so it's been a few weeks, but I talked to WD support. They said to run a full extended S.M.A.R.T. test on the drive, and it passed with no issues.

Reconnected it to my server using a different SATA cable on a different port on the motherboard, with a different power connector. It resilvered with no problems, and a zpool scrub returned no errors this time so hopefully I'm in the clear!

I have a script that runs once a week that does a scrub then sends the output of zpool status to a Discord channel. When this first started it had read errors (as mentioned in the post), then checksum errors two weeks later. With there being a couple different errors before troubleshooting, and now with no errors after a scrub I'm hoping this means everything's fine now.
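This is not my actual script, just a minimal Python sketch of the same idea for anyone who wants one. Assumptions: `zpool scrub -w` (wait for the scrub to finish) needs OpenZFS 2.0 or newer, the webhook URL is one you create yourself in the Discord channel settings, and the function names are made up. Discord caps message content at 2000 characters, so the status text is trimmed to fit.

```python
import json
import subprocess
import urllib.request

DISCORD_LIMIT = 2000  # maximum length of a message's "content" field
FENCE = "`" * 3       # code fence so Discord keeps the column alignment


def build_payload(status_text: str) -> bytes:
    """Wrap zpool status output in a code fence and trim it to one message."""
    room = DISCORD_LIMIT - 2 * len(FENCE) - 2  # two fences plus two newlines
    body = status_text.strip("\n")[:room]
    return json.dumps({"content": f"{FENCE}\n{body}\n{FENCE}"}).encode("utf-8")


def scrub_and_report(pool: str, webhook_url: str) -> None:
    """Scrub the pool, wait for it to finish, then post the status (needs root)."""
    subprocess.run(["zpool", "scrub", "-w", pool], check=True)
    status = subprocess.run(
        ["zpool", "status", pool], check=True, capture_output=True, text=True
    ).stdout
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(status),
        headers={
            "Content-Type": "application/json",
            # Assumption: the default urllib agent is reportedly rejected
            # by Discord in some setups, so send an explicit one.
            "User-Agent": "zpool-report/1.0",
        },
        method="POST",
    )
    urllib.request.urlopen(request, timeout=30)
```

Calling `scrub_and_report("blackhole", webhook_url)` from a weekly cron job or systemd timer would give roughly the behaviour described above.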

[–] [email protected] 2 points 1 year ago

Thanks, I will start a backup now. I don't have any extra automated backups so I guess this is my wake-up call to figure something out.

[–] [email protected] 0 points 1 year ago

This is the way!

[–] [email protected] 4 points 1 year ago (1 children)

Look at your SMART data and run some tests. Reseat the SATA cables too; I’ve had that cause problems. I've even had a SATA cable go bad a few times over the years, and that will cause these kinds of errors too. ZFS is pretty paranoid about data loss, so if it gets even a small hint that something is wrong, it does this.

At this point it really comes down to how valuable the data is to you. Most of the time when I see this error I’m not seeing anything on SMART that would lead me to believe that there’s a problem. So I’ll clear the error and watch it. If I start to get the same problem with the same drive I’ll usually replace it when I can. That being said I have pretty good backups so it would inconvenience me a lot but it’s most likely not going to be the end of the world if my drive dies on me. YMMV
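For the "look at your SMART data" part, here is a hedged Python sketch of what I'd check. It assumes smartmontools 7.0 or newer (for `smartctl --json`) and an ATA drive. The attribute IDs are the ones commonly watched, not an official list, and `smart_warnings` / `read_smart` are hypothetical helper names. A rising UDMA CRC error count (ID 199) usually points at the cable or connector rather than the disk itself.

```python
import json
import subprocess

# Commonly watched ATA attribute IDs; a non-zero raw value deserves a look.
WATCHED = {
    5: "reallocated sectors",
    197: "pending sectors",
    198: "offline uncorrectable",
    199: "UDMA CRC errors",  # typically a cabling or connector problem
}


def smart_warnings(report: dict) -> list[str]:
    """Summarise worrying fields from a parsed `smartctl --json -a` report."""
    warnings = []
    if not report.get("smart_status", {}).get("passed", True):
        warnings.append("overall SMART health check failed")
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        label = WATCHED.get(attr.get("id"))
        raw = attr.get("raw", {}).get("value", 0)
        if label and raw > 0:
            warnings.append(f"{label}: {raw}")
    return warnings


def read_smart(device: str) -> dict:
    """Run smartctl on a device such as /dev/sda and parse its JSON output."""
    # smartctl uses non-zero exit codes as status bits, so no check=True here.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device], capture_output=True, text=True
    )
    return json.loads(result.stdout)
```

An empty list from `smart_warnings(read_smart("/dev/sdX"))` matches the "nothing on SMART" situation described above, in which case clearing the error and watching the drive seems reasonable.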

[–] [email protected] 2 points 1 year ago (1 children)

Thanks, I'll try some of those things out and see if a second scrub says the same.

That being said I have pretty good backups

Out of curiosity, what do you do for backups? The initial cost of 3x12TB drives was enough to make me not want to spend a bunch more money on backup stuff at the time, but now that I'm seeing errors I'm willing to spend a bit of money again and should look into my options.

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago) (1 children)

I use Backblaze B2 and restic. Just a simple systemd unit & timer setup to kick off the backup. I also have a restic repo set up on an external drive for my most important things, e.g. family photos. I try to follow the 3-2-1 rule as much as possible. Fedora Magazine has two articles on the site about setting up restic like that. They’re pretty helpful if you need pointers.
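As a rough illustration of what the timer ends up running, here is a minimal Python sketch. The bucket name, repository path and source directories are placeholders, and the helper names are made up. The `b2:bucket:path` repository syntax and the `B2_ACCOUNT_ID` / `B2_ACCOUNT_KEY` environment variables are what restic documents for B2; the repository password also has to come from somewhere, e.g. `RESTIC_PASSWORD_FILE`.

```python
import os
import subprocess


def restic_backup_command(bucket: str, repo_path: str, sources: list[str]) -> list[str]:
    """Build a `restic backup` invocation against a Backblaze B2 repository."""
    return ["restic", "-r", f"b2:{bucket}:{repo_path}", "backup", *sources]


def missing_credentials(env: dict) -> list[str]:
    """List the B2 credential variables restic expects that are not set."""
    return [key for key in ("B2_ACCOUNT_ID", "B2_ACCOUNT_KEY") if key not in env]


def run_backup(bucket: str, repo_path: str, sources: list[str]) -> None:
    """Run the backup, failing early if the B2 credentials are missing."""
    missing = missing_credentials(dict(os.environ))
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    subprocess.run(restic_backup_command(bucket, repo_path, sources), check=True)
```

A systemd service could call something like `run_backup("my-bucket", "nas", ["/tank/photos"])`, with the credentials supplied through an `EnvironmentFile=` that only root can read.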

If there’s something I want to share or don’t need/want encrypted I’ll just use rclone to sync it to B2.

Backups are very cheap with B2; restores can be costly, though still cheaper than something like AWS Glacier. So it’s my last resort for restoring things. Mostly I rely on snapshots in case I delete something by accident. (Check out SANOID)

I also have all my other systems using B2. I have a bucket set up for all of my family’s laptops that they back up to as well. Keeps everyone’s data safe.

If you have some data that required putting on an eyepatch & tricorn (Yarr me matey) to acquire and you don’t care about losing it, then don’t back that up.

Edit: I also keep my important data on a set of mirrored pairs. It’s not space efficient but it does the job of keeping things performant and safe. Eventually I’ll expand that past 3 pairs, but for right now it’s 3 pairs (6 drives) of 10TB disks.

Anything else that isn’t important is just on a small Z1 array. I put all my older drives on that array because they would just be on a shelf doing nothing otherwise so I don’t care about wasting storage on that array. Not a recommended practice at all. So do as I say not as I do kinda thing.

[–] [email protected] 2 points 1 year ago

Wow that's pretty substantial, thanks for the tips! Wow yeah Backblaze does seem pretty affordable.

[–] [email protected] 3 points 1 year ago

I think it's possible to have false positives. Like [email protected] said above, do a clear and scrub to see if that helps. It happened to me last month after some really intensive disk I/O and AI stuff; I did that and the drive hasn't had an issue since.

Additionally, I plugged in one of my old, supposedly faulted drives from last year as an external drive on my desktop to test it out, and it is still working fine months later, so yeah, it appears that there is some possibility of false positives.

Like another person said, make sure you have good backups and that the other drives are solid, but I'd take a wait-and-see approach.

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago)

I had the same problem months ago and I simply did a clear and scrub. It never occurred again, although I noticed that the drive that failed is slower than the others (its average access time is double). I'll change it in a couple of months, hoping it lasts until then.

[–] [email protected] 1 points 1 year ago

I've had this happen when I had RAM issues. You can try doing a memory test if you want to take that out of the equation.
