ELI5: a tool to check my disks while mounted? (lemmy.world)

submitted 2 years ago by PlutoniumAcid@lemmy.world to c/linux@lemmy.world


On Windows, we've had the defrag tool and others, which happily work on a drive even while it is in use, even the OS disk.

On Linux, I know of the fsck command but that requires the drive in question to be unmounted. Not great when you want to check a running server. I do not want to stop my server and boot it from USB, just to run a disk check. I can't imagine that's what the data centers are doing, either!

Surely some Linux tool exists that can do some basic checks on a running system?


[-] vividspecter@lemm.ee 4 points 2 years ago

Use btrfs or zfs and you can do a scrub online; offline maintenance isn't recommended for them except in extreme cases. In other words, use a better filesystem and offline maintenance isn't necessary.
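A minimal sketch of what an online scrub looks like in practice. The Python below only builds the command line and leaves it to the caller to run it as root. `btrfs scrub start -B` and `zpool scrub` are the usual commands; the helper names `scrub_command` and `run_scrub`, the mount point `/mnt/data`, and the pool name `tank` are placeholders made up for illustration.

```python
import subprocess


def scrub_command(fs_type: str, target: str) -> list:
    """Return the command that starts an online scrub.

    target is a mount point for btrfs and a pool name for zfs.
    """
    if fs_type == "btrfs":
        # -B keeps the scrub in the foreground so the exit status is meaningful
        return ["btrfs", "scrub", "start", "-B", target]
    if fs_type == "zfs":
        # zpool scrub returns right away; progress is shown by `zpool status`
        return ["zpool", "scrub", target]
    raise ValueError(f"no online scrub known for {fs_type!r}")


def run_scrub(fs_type: str, target: str) -> int:
    """Run the scrub (needs root) and return the exit code."""
    return subprocess.run(scrub_command(fs_type, target), check=False).returncode
```

For example, `run_scrub("btrfs", "/mnt/data")` would scrub a mounted btrfs filesystem without taking it offline.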

[-] mhzawadi@lemmy.horwood.cloud 1 points 2 years ago

Fsck can be set to run at boot, so you just need to wait for the check to finish.
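As a rough illustration, assuming an ext2/3/4 filesystem: `tune2fs -c N` sets how many mounts may pass before a check is forced at boot, and `tune2fs -l` prints the current counters. The helper names and the device `/dev/sdX1` below are placeholders, not anything from the thread.

```python
import re


def mount_counts(tune2fs_listing: str) -> tuple:
    """Extract (mount count, maximum mount count) from `tune2fs -l` text."""
    count = re.search(r"^Mount count:\s+(-?\d+)", tune2fs_listing, re.M)
    limit = re.search(r"^Maximum mount count:\s+(-?\d+)", tune2fs_listing, re.M)
    if not (count and limit):
        raise ValueError("mount count fields not found")
    return int(count.group(1)), int(limit.group(1))


def boot_check_command(device: str, every_n_mounts: int = 1) -> list:
    """Command asking for a check once every N mounts (ext2/3/4 only)."""
    return ["tune2fs", "-c", str(every_n_mounts), device]
```

A maximum mount count of 0 or -1 means mount-count-based checking is disabled, so the filesystem would never be checked at boot on that basis.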

[-] teft@startrek.website -2 points 2 years ago* (last edited 2 years ago)

Maybe smartctl or hdparm. Both can check drives for errors I believe. You’ll still have to unmount to correct the errors though.

Also those data centers are probably using raid 1+0 so they can just unmount one of the drives since drives in raid1+0 can be hot swappable.
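For the smartctl suggestion, a minimal sketch of reading a drive's health verdict while it stays mounted. It assumes smartmontools is installed and relies on the summary lines that `smartctl -H` prints for ATA and SCSI drives; the function names are hypothetical. Note that this reports on the drive hardware, not on filesystem consistency.

```python
import re
import subprocess


def smart_health(report: str):
    """Return True/False for a healthy/failing drive, None if undetermined."""
    # ATA/SATA drives report "... self-assessment test result: PASSED"
    ata = re.search(r"self-assessment test result:\s*(\w+)", report)
    if ata:
        return ata.group(1) == "PASSED"
    # SCSI/SAS drives report "SMART Health Status: OK"
    scsi = re.search(r"SMART Health Status:\s*(\w+)", report)
    if scsi:
        return scsi.group(1) == "OK"
    return None


def read_health(device: str):
    """Query a drive in place (needs root); the filesystem can stay mounted."""
    out = subprocess.run(["smartctl", "-H", device],
                         capture_output=True, text=True, check=False).stdout
    return smart_health(out)
```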

[-] JWBananas@startrek.website 10 points 2 years ago

> Also those data centers are probably using raid 1+0 so they can just unmount one of the drives since drives in raid1+0 can be hot swappable.

Wat.

I can assure you that's not what data centers are doing, for numerous reasons.

[-] PlutoniumAcid@lemmy.world 1 points 2 years ago

Then what are they doing? It seems very cumbersome to have to take a drive offline for routine maintenance.

[-] chiisana@lemmy.chiisana.net 5 points 2 years ago

They don’t do anything.

They have lots and lots of redundancy, and when enough drives fail, they decommission the entire server and/or rack.

Them big players play at a very different scale than the rest of us.

[-] Kangie@lemmy.srcfiles.zip 3 points 2 years ago

We don't do maintenance, we just have redundancy, and backups, then replace failed components.

[-] JWBananas@startrek.website 3 points 2 years ago

Hardware-backed RAID, with error monitoring and patrol read. iSCSI or similar to present that to a virtualization layer. VMFS or similar atop that. Files atop that to represent virtual drives. Virtual machines atop that.

Patrol read starts catching errors long before SMART will. Those drives get replicated to (and replaced by) hot spares, online. Failing drives then get replaced with new hot spares.
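Patrol read is a hardware controller feature, but Linux software RAID (md) has a comparable online check, which may be closer to what a home server can use. The sketch below assumes an md array such as `md0` and uses the kernel's `sync_action` and `mismatch_cnt` sysfs files; the function names and the `sysfs` parameter are made up for illustration.

```python
from pathlib import Path


def start_md_check(array: str, sysfs: str = "/sys/block") -> None:
    """Ask the kernel to read and compare the array's members while online."""
    (Path(sysfs) / array / "md" / "sync_action").write_text("check\n")


def md_mismatches(array: str, sysfs: str = "/sys/block") -> int:
    """Mismatch counter reported after the most recent check."""
    return int((Path(sysfs) / array / "md" / "mismatch_cnt").read_text())
```

Calling `start_md_check("md0")` as root would start the check in the background while the array stays in use; `md_mismatches("md0")` can be read once it finishes.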

But all of that is irrelevant, because at the enterprise level, they are scaling their applications horizontally, with distributed containers. So even if they needed to do fsck at the guest filesystem level (or even if they weren't using virtualization) they would just redeploy the containers to a different node and then direct traffic away from the one that needs the maintenance.

[-] teft@startrek.website 1 points 2 years ago

Why wouldn't a data center use raid? Seems silly not to. They may not hot swap the drives to do file checks but it's totally doable.

[-] JWBananas@startrek.website 4 points 2 years ago

They almost undoubtedly would. That wasn't the problematic statement.

Let's go over some fundamentals here.

fsck is a utility for checking and repairing filesystem errors. Some filesystems do not support doing so when they are mounted.

Why? At a high level, because:

The utility needs the filesystem to be in a consistent state on disk. If the filesystem is mounted and in-use, that will not be so: The utility might come across data affected by an in-flight operation. In its state at that exact moment, the utility might think there is corruption and might attempt to repair it.

But in doing so, it might actually cause corruption once the in-flight operation is complete. That is because the mounted filesystem also expects the disks to be in a consistent state.
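A toy model can make that failure mode concrete. This is not how any real filesystem or fsck is implemented; `ToyFS` and `naive_check` are invented for illustration, and the "write" is reduced to two steps so a checker can run between them.

```python
class ToyFS:
    """Toy on-disk state: a block bitmap plus one file's block list."""

    def __init__(self):
        self.bitmap = set()        # blocks marked as allocated
        self.inode_blocks = set()  # blocks the file actually references

    def begin_append(self, block):
        # Step 1 of a write: mark the block allocated
        self.bitmap.add(block)

    def finish_append(self, block):
        # Step 2 of the same write: link the block into the file
        self.inode_blocks.add(block)


def naive_check(fs, repair=False):
    """Report blocks that are allocated but unreferenced; optionally free them."""
    orphans = fs.bitmap - fs.inode_blocks
    if repair:
        fs.bitmap -= orphans
    return orphans


fs = ToyFS()
fs.begin_append(7)                       # a write is in flight
orphans = naive_check(fs, repair=True)   # checker sees an "orphan" and frees it
fs.finish_append(7)                      # the write completes anyway
corrupted = fs.inode_blocks - fs.bitmap  # file now points at a "free" block
```

Before the check there was no corruption at all, only an unfinished write. After the "repair", block 7 is both in use by the file and marked free, which is real corruption.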

Some filesystems are designed to support online fsck. But for OP's purposes, I assume that the ones they are using do not (hence the reason for the post).

"I know!" said the other commenter. "RAID uses mirroring! So why not just take the mirror offline and do it there?"

Well, for the exact same reasons as above, and then some additional ones.

Offlining a mirror like that while the filesystem is in use is still going to result in the data on the drive being in an inconsistent state. And then, as a bonus, if you tried to online it again, that leaves the mirrors inconsistent with each other too.

Even if you wanted to offline a mirror to check for errors, and even if you were doing a read-only check (thus not actually repairing any errors, thus not actually changing anything on that particular drive), and even if you didn't have to worry about the data on disk being inconsistent... the filesystem is still in use. So data on the still-online drive has undoubtedly changed, meaning you can't just online the other one again (since they are now inconsistent with each other).
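The divergence can be sketched with a toy two-way mirror. This is purely illustrative: the dictionaries stand in for drives, and the function names are hypothetical, not part of any RAID tool.

```python
def detach_and_reattach(writes_while_detached):
    """Toy RAID 1: return (online copy, detached copy) after a detach window."""
    online = {0: "a", 1: "b"}   # block number -> contents
    detached = dict(online)     # identical at the moment of detaching
    for block, data in writes_while_detached:
        online[block] = data    # only the online member sees new writes
    return online, detached


def stale_blocks(online, detached):
    """Blocks that differ and would need a resync before the mirror is whole."""
    return sorted(b for b in online if detached.get(b) != online[b])
```

Even a read-only check of the detached copy leaves it stale as soon as a single write lands on the online member, so it has to be resynced rather than simply brought back.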

[-] teft@startrek.website -2 points 2 years ago

So they swap the drives like I said? I never mentioned them correcting them online or checking them online or any of that mess. I just said they run 1+0 so they can pull a drive and pop a new one in without shutting down. I have two different statements in my comment. I'll add a paragraph break to make it clearer that they aren't related.

[-] nomecks@lemmy.world 2 points 2 years ago* (last edited 2 years ago)

Nearly all systems have some sort of background error checking which periodically reads all data and validates it hasn't changed. They also watch for SMART errors and pre-fail disks before they die entirely.
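The idea behind that kind of background scrubbing can be sketched in a few lines: record a checksum per block, then later re-read everything and compare. This is a simplified illustration, not any vendor's implementation; SHA-256, the 4096-byte block size, and the function names are assumptions chosen for the example.

```python
import hashlib


def block_checksums(path, block_size=4096):
    """Read a file (or device node) end to end and hash each block."""
    sums = []
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            sums.append(hashlib.sha256(chunk).hexdigest())
    return sums


def scrub(path, expected, block_size=4096):
    """Return indexes of blocks whose current hash differs from the recorded one.

    Length changes are ignored in this sketch; only blocks present in both
    the recorded list and the current read are compared.
    """
    current = block_checksums(path, block_size)
    return [i for i, (old, new) in enumerate(zip(expected, current)) if old != new]
```

Because it only reads, a pass like this can run while the data is in use. Filesystems such as btrfs and zfs keep the checksums alongside the data, which is what lets their scrub tell silent corruption apart from a legitimate write.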

They primarily use all forms of RAID (NetApp's is a weird dual-parity RAID 4, for example) and erasure coding.

[-] teft@startrek.website 0 points 2 years ago

How does that invalidate anything I said?

[-] Sethayy@sh.itjust.works 1 points 2 years ago

Dude, I think this is more a case of you not being smart enough to know that you're not smart... just take the L

this post was submitted on 01 Sep 2023

15 points (100.0% liked)
