Btrfs on Zoned Block Devices (lwn.net)
84 points by fomine3 on April 22, 2021 | hide | past | favorite | 62 comments


Are these host-managed SMR HDDs available for individual purchase, or are they enterprise-only? Last time I looked it seemed like the latter. I'd be curious to know if anyone has found a way to purchase them individually, and if they're significantly cheaper or higher-capacity than the closest equivalent PMR drive.

I've played with device-managed SMR drives and decided I hate them. Too much I/O happening behind my back, and only a low workload they can sustain before hitting a horrible performance cliff. Also, in my very limited sample, poor reliability even if you stay within the rated workload. (2/2 drives that were in continuous use started accumulating bad sectors within a couple of years.)

In theory, host-managed SMR drives with the right software might have much less write multiplication. The NVR software I'm working on more or less uses them as a big ring buffer (or several big ring buffers, one per stream) which seems ideal. But I suspect getting host-managed SMR to work optimally isn't worth the effort unless you have datacenters full of drives.
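The ring-buffer idea maps neatly onto zones, since host-managed SMR only allows appends at each zone's write pointer and reclaims space by resetting a whole zone at once. A toy in-memory sketch of the idea (zone counts/sizes made up, not a real SMR API):

```python
# Sketch: a per-stream ring buffer over sequential-write-only zones.
# Appends go at the current zone's write pointer; reclaiming space is
# one cheap "zone reset" of the oldest zone, so write amplification ~1.
class ZoneRingBuffer:
    def __init__(self, num_zones, zone_size):
        self.zone_size = zone_size
        self.zones = [[] for _ in range(num_zones)]  # chunks per zone
        self.head = 0  # zone currently being appended to
        self.tail = 0  # oldest zone still holding data

    def append(self, chunk):
        zone = self.zones[self.head]
        if sum(len(c) for c in zone) + len(chunk) > self.zone_size:
            # Zone full: move to the next zone, resetting it first.
            self.head = (self.head + 1) % len(self.zones)
            if self.head == self.tail:
                # Ring wrapped: the oldest zone's data is dropped.
                self.tail = (self.tail + 1) % len(self.zones)
            self.zones[self.head] = []  # the "zone reset"
            zone = self.zones[self.head]
        zone.append(chunk)  # sequential append at the write pointer

buf = ZoneRingBuffer(num_zones=4, zone_size=100)
for _ in range(50):
    buf.append(b"x" * 30)  # steady video stream
```

The nice property for NVR workloads is that nothing is ever rewritten in place, so there's no garbage collection competing with the recording streams.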


It would be enough if there was a way to disable device managed mode. Then the drive would still work on legacy machines, but if the host supports it, it could perform better.


> It would be enough if there was a way to disable device managed mode.

They exist, and they are called host-aware SMR drives. Sadly, device-managed drives exist because drive manufacturers want both cost-cutting and increased profits (and the practice was relatively unknown until the whole WD Red (and companions) DMSMR brouhaha last year forced every hard disk manufacturer to state whether a drive is DMSMR or not).


Do you have an example model available for purchase by individuals?


Sorry for the late reply, but as far as I know it is exclusively an enterprise product (I know, it should also be available to prosumers).


Host-managed SMR is indeed not available in consumer drives yet, AFAIK. They're all drive-managed SMR, which is the worst of both worlds.


I wonder if a single bad bit could permanently destroy a whole ~16MiB sector? So SMR drives would need significantly more reserves to compensate for that.


Shingled drives can still be read sector-by-sector. The shingling just changes how the data is written; it doesn't change how the data is read.

So no, a single bad bit would only affect a single sector. And in fact, a single bad bit wouldn't do any harm, as all drives use error correcting codes that can handle a lot more than a single bit going bad. You'd actually lose a whole sector at a time (512 bytes or 4096 bytes, depending on the drive).
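To make the ECC point concrete, here's a toy single-error-correcting Hamming(7,4) code. Real drives use far stronger codes (LDPC/Reed-Solomon) over whole sectors, but the principle is the same: a single flipped bit is located and corrected transparently.

```python
# Toy Hamming(7,4): 4 data bits protected by 3 parity bits; any single
# bit flip in the 7-bit codeword can be located and corrected.
def hamming74_encode(d):               # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]  # positions 1..7

def hamming74_decode(c):               # c: 7 received bits, <= 1 flipped
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 1-based index of the bad bit
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1           # correct the single bad bit
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
code = hamming74_encode(data)
code[4] ^= 1                           # flip one bit "on the platter"
assert hamming74_decode(code) == data  # corrected transparently
```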


The difference between a regular and an SMR drive is that SMR drives have a narrower read head compared to the write head.

This means that when writing data the wide head overwrites neighboring tracks, and hence writing is partially overlapped, like shingles.

Apart from that they're more or less the same as normal drives as far as I know, firmware aside. So a single bad bit shouldn't affect more than in a non-SMR drive.
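To illustrate the geometry I described, here's a toy model (overlap of one track is an assumption; real heads may smear several tracks):

```python
# Sketch: why shingled writes must be sequential. Writing track i with
# the wide head also smears track i+1; reads use the narrow head and
# are unaffected anywhere else.
OVERLAP = 1  # assumed: a real head may overlap several downstream tracks

def smr_write(tracks, i, data):
    tracks[i] = data
    for j in range(i + 1, min(i + 1 + OVERLAP, len(tracks))):
        tracks[j] = None  # neighbor clobbered by the wide write head

zone = [f"t{i}" for i in range(5)]
smr_write(zone, 1, "new")      # in-place rewrite of track 1...
assert zone[2] is None         # ...destroys track 2
assert zone[0] == "t0" and zone[3] == "t3"  # reads elsewhere are fine
```

Writing tracks strictly in order is safe, because each clobbered neighbor gets written next anyway; that's the sequential-write constraint host-managed zones expose.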


Am I mistaken? If so I would appreciate a correction.

I based my post primarily on what I recalled from this talk[1] by a HGST engineer.

[1]: https://youtu.be/a2lnMxMUxyc?t=556


Can someone knowledgeable explain why Btrfs isn't better than ZFS at this point in terms of stability and feature set? Is it just designed poorly, or neglected, or what? It seems like it's been many years and it's still having a ton of trouble.


Btrfs is actually in use by some big companies like Facebook, but the initial issues seem to linger in people's memory, and thus everyone and their cat avoids btrfs like the plague. It reminds me of systemd for some reason.

For the record I'm using btrfs on Arch (so recent kernel) for years with no issues (including LUKS encrypted root filesystem and RAID1 arrays for backups).


> initial issues

You mean like the current advice not to use anything except mirroring and striping (RAID-0/1/10)?

> Parity may be inconsistent after a crash (the "write hole"). The problem born when after "an unclean shutdown" a disk failure happens. But these are two distinct failures. These together break the BTRFS raid5 redundancy. If you run a scrub process after "an unclean shutdown" (with no disk failure in between) those data which match their checksum can still be read out while the mismatched data are lost forever.

* https://btrfs.wiki.kernel.org/index.php/RAID56

I've been using ZFS since it came out on Solaris 10 over a decade ago and it was specifically designed not to have a write hole due to its COW/ACID nature.
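The write hole is easy to demonstrate with toy XOR parity: update one data block, crash before the matching parity write, then lose a different disk, and reconstruction from the stale parity silently returns garbage. ZFS avoids this by never updating a stripe in place. (A toy model, not real RAID code:)

```python
# Toy RAID5 write-hole demo: 3 data "disks" plus XOR parity.
from functools import reduce

def parity(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]
par = parity(data)

# Unclean shutdown: disk 0 got its new block, the parity update never
# landed -- `par` still describes the old stripe.
data[0] = b"XXXX"

# Later, disk 1 fails; rebuild it from the survivors plus stale parity.
rebuilt = parity([data[0], data[2], par])
assert rebuilt != b"BBBB"  # silent corruption: not the real data
```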

See this 2008 SNIA presentation from Bonwick and Moore, the creators of ZFS talking about not having a write hole:

* ACID/COW: https://www.youtube.com/watch?v=NRoUC9P1PmA&t=24m

* Integrity: https://www.youtube.com/watch?v=NRoUC9P1PmA&t=55m20s


(N=1) I have a single-disk laptop running opensuse (tumbleweed) on btrfs. It's the only machine I've ever owned to corrupt its root filesystem beyond repair, and it's done so twice IIRC (definitely 2, maybe 3), within the last few years. It's not just the initial issues.


(N=1) I've had ZFS on an opensolaris system and it got corrupted, and since the ZFS engineers think they are gods who don't make mistakes there was no fsck that would even attempt to repair it. It was a perfectly repairable corruption which I fixed myself with a bit of googling and dd to copy some bytes from one location on the disk to another (ZFS apparently keeps multiple copies of some of the important data structures that describe the pool, one at the beginning of the block device and one towards the end, kind of as a backup I guess). For btrfs you at least have a decent working fsck, if shit hits the fan. ZFS is like, fuck you, we won't even try.


I want to like ZFS, and am using it, but however good it is at not losing data while in operation, the UI feels like it’s designed to make you wreck your data. I guess I just need more practice, but not being able to just rip a drive out and mount it on another machine in a pinch makes me damn nervous. Something about how it’s managed makes the whole file system feel ephemeral, just one bad-but-not-obviously-so command away from being destroyed, and I’m nowhere near being comfortable with that yet (and don’t really see a path to getting to comfort)


> but not being able to just rip a drive out and mount it on another machine in a pinch makes me damn nervous

Why can't you? Granted, you need enough disks to actually have all the data - so ex. if you did RAID0 then yes you need all disks, but say if you did a mirror you can totally just yank a disk out, attach it to another machine, and `zpool import` it.


Can you? I was under the impression that without an “export” beforehand, you can’t.


See the "split" command with OpenZFS 0.8.0+:

* https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSSplitPoolE...

Only with mirrored drives.

Any RAID-Z level would need a full export/import as data is striped, but hot-swap drives can be pulled once things are unmounted.


I recently did just this. I had to use the -f flag but it imported just fine on a different computer.

I agree that it can be a bit daunting to operate; there are a few footguns that, while they might not lead to data loss, can lead to unfortunate situations.

Just the other day someone on the mailing list had managed to add a single drive as a new top-level vdev to a petabyte pool, rather than adding it as a new spare drive, simply by omitting the word "spare" from the "zpool add" command...

That said, I've been using ZFS at home here with 6+ disks for almost a decade now, and I've never lost data despite lots of various incidents, including lots of power losses and various hardware failures (like disks, mobo and PSU). So overall I'm very happy with it.


I guess it just seems like there’s a lot more state than with file systems I’m used to dealing with. In fact, since journaling became normal, most just have two states (from the user’s perspective) whether powered on or off, mounted or unmounted, whatever—broken, or OK.

ZFS has... a lot more. It’s just very different and the way these states fit together, and worrying about how to operate on them safely, makes me more nervous, in many ways, than less-safe file systems do. I’m sure that will pass, but it’s still not fun.


Yeah since it replaces "the whole stack" it's daunting and a lot of terms to, well, come to terms with.

For me I found it beneficial to watch the videos on how ZFS is built up, like this one[1]. Helped putting the pieces together.

[1]: https://www.youtube.com/watch?v=MsY-BafQgj4


Oh no, that'd be a terrible design :) AFAIK the most difficulty you'll run into is that you might have to use `zpool import -f` to force it to ignore the pool not having been cleanly exported.

EDIT: It'd look like this: https://serverfault.com/questions/964075/how-can-i-recover-m...


I've had btrfsck segfault on me a couple of times.


scrub didn't help? I thought scrub was like fsck for ZFS.


> For btrfs you at least have a decent working fsck

When I say "corrupted beyond repair", I mean "the btrfs tools were not actually helpful".


Another (n=1) anecdote: similarly for me, ~4 years ago my raid-1 workstation OS drive which was, at the time, using btrfs nuked itself without warning or repair. Trying to recover any data was likewise an exercise in rapidly learning about FS internals.

I use zfs on everything now. I am sure at some point it will die horribly, but for now I haven't had a single problem in ~60 managed drives across 3 machines.


The other way to interpret this would be "it was the only filesystem to detect corruption on my malfunctioning hardware", because that's what has usually happened over the last few years.

Or have you been using ZFS on the same hardware?


I haven't used ZFS on the same hardware, but during the time when BTRFS ate itself multiple times my home partition on the same drive also on BTRFS was perfectly fine, so it'd have to be an awfully specific hardware failure. Also, it would have had to hit the metadata both times since we're talking "pool wouldn't import" not "it gave me data checksum errors". Which again, is possible, but on an SSD with wear-leveling seems a tad unlikely.


He said "corrupt its root filesystem beyond repair", not "detect checksum errors"

Btw raid5/6 is still broken on btrfs which makes it a hard sell for any system with more than 2 disks. cf. raidz on ZFS


So? Do you think filesystem metadata is stored in a magical pixie cloud, or on the same unreliable physical hardware where it can easily get corrupted, especially after a crash or an unexpected power loss?

I posted this link here already:

https://www.usenix.org/conference/atc19/presentation/jaffer

f2fs (at least in its state a couple of years ago) is/was a prime example of how a filesystem can get into a barely working state with massive amounts of data and metadata corruption, and not even notice it.

God I love this site. In case of a minor disagreement with someone don't even bother to think, just press "downvote".


Corrupt data should be corrected by checksummed btrfs, shouldn't it?


Only if you have more than one copy of the data


RAID 5 has been generally advised against for years, due to performance issues and the effect of unrecoverable errors during rebuilds.

Btrfs RAID1 works perfectly, and RAID1c3/RAID1c4 provides additional redundancy. In place of RAID5, use RAID10 instead.


raidz2 is only advised against if your arrays are so small that you don't care about the price of storage.

If you want more IOPS, add more raidz2 (raid6) stripes to the pool. In practice, spinning rust is the new tape. Trying to do random access under 1MB is just silly

I don't stress over rebuilds. 2 more disks failing during a rebuild is incredibly unlikely compared to everything else that might force me to restore a backup (software bugs, data center flooding, etc).


I've been running opensuse tumbleweed for the past couple of years on my laptop & desktop, with btrfs as the root FS. No issues. I also run the btrfsmaintenance script every month, maybe that helps.


SLES is a recommended platform for SAP, and that has had btrfs as the default filesystem since version 12 in 2014.

I'm using btrfs on several systems, laptop, desktop and server, on various configurations of disks.

It has served me well for years, on the server it helped me detect a bad SATA controller. It would work perfectly in light usage, but start introducing errors in heavy usage, which made one disk inconsistent with the others in the storage pool.

Btrfs alerted me to this and after moving the disk to a good controller, I ran btrfs check --repair on the unmounted disk (after reading the warnings), which got the FS back to a consistent state, remounted the whole pool and ran a btrfs scrub to get everything back in line with itself. The whole process did take a while, but I had backups and wanted to try out the tools. In the end there was no data loss, and the pool is still running perfectly today.


>SLES is a recommended platform for SAP, and that has had btrfs as the default filesystem since version 12 in 2014.

And they specifically tell you to use XFS for any production deployments.


Another anecdote - on working Xeon E3 hardware (so ECC, etc) I have had btrfs corrupt itself as recently as 5.x kernels from just normal use with compression on a single root device. ext4, xfs and zfs work flawlessly on the same hardware.

Furthermore - btrfs feels excessively complicated for simple workflows - if I want to snapshot a btrfs volume without exposing the snapshot to the machine’s view of the file system, I have to do a bunch of volume layout setup first. With ZFS I can just snapshot.


The problems weren't 'initial'. I had to abandon a BTRFS partition that could only mount read-only just two years ago. It wasn't all that long ago that I had two separate installs experience ridiculously bad performance issues just because they had rolling hourly snapshots happening in the background for a few months, but I suppose I could be convinced to restart that test...


The problem is they weren't just "initial issues". Btrfs was ~8 years old when I started using it as a backend for a production Ceph cluster, on 48 disks (no fancy btrfs redundancy features used, just plain single-disk filesystems with snapshots). I encountered at least two critical issues: filesystems that die on hard power downs (simply issuing a hard shutdown of all machines ended up with two filesystems corrupted beyond repair after booting again), and a snapshot reference leak issue that meant data for deleted snapshots was never freed until a reboot - and even more scarily, fsck reported tons of unfixable errors on those filesystems, although they magically went away after a clean reboot.

Btrfs has not done a good job of inspiring any confidence, many years into development. Thankfully, Ceph has moved on from FS-backed storage to its own implementation on top of raw block devices, and I no longer have Btrfs anywhere in production.

Mind you, I don't trust ZFS either; it does seem to be stabler than Btrfs, but it still suffers from the fundamental issue that all of these "fancy" filesystems do: the fsck/repair tools are never up to par, and there is next to no chance of disaster recovery (with the added drawback that ZFS is not in-tree).

My first experience with one of these "if anything fails, all your data is gone" filesystems was ReiserFS many years ago - 8 bad sectors on a disk killed my home directory and all my data was gone. Since then, I've had rather complex accidents with ext4 and XFS* where I could do manual and automated surgery and recover ~100% of my data. Btrfs and ZFS are in the same class as ReiserFS here. The repair tools just aren't there. Sure, they handle redundancy at the device level like a fancy RAID for "well-behaved" failures like devices just disappearing, but anything outside of their model, or anything that tickles a bug, and you can well kiss your data goodbye.

Just to give an example: I once recovered an XFS filesystem that was built on top of a RAID6 array which, due to an unfortunate sequence of events, had one drive too many drop out during a replacement, which resulted in me manually stitching together an array where one drive had out-of-date data (i.e. every block out of N was from an earlier point-in-time from the others). Fsck fixed everything, high-level checksums took care of the few files that were being written to and had become corrupted, and I lost nothing of value. On a good filesystem, fsck does its best to recover all existing data and guarantee the result is consistent.

Yes, I know, backups. I have backups. That's not a reason to neglect repair tools. Backups are one layer of defense that can also fail; they are no excuse to neglect FS-level robustness. For example, my off-site backups are bottlenecked on my 1G internet connection, which means that if I have a weird but largely recoverable soft failure, it is much more efficient to rsync data back from the backup, using checksums to avoid data transfer, rather than copy everything again.

And this is why I use CephFS as my "smart" single-host storage solution these days. It has overhead, but it works well, is much more introspectable than ZFS/Btrfs (you can dig through the stack layers if you understand how it works very easily), and I trust its ability to recover from weird failures and device states much more than any RAID solution or fancy multi-device filesystem. It is extremely well engineered.

* I don't recommend XFS either due to kernel implementation performance issues around allocations and such; it was the cause of massive latency issues on my home server for years until I discovered its antics. But at least I've never lost data to XFS. So yeah, just use ext4 if you need a normal filesystem.


It would be interesting to take each of these filesystems in a simulated environment and zero out stripes of data to see what it would take to kill the disk. All of these fancy filesystems are supposed to have redundancy and error detection in their core structures. But I wonder how well that’s tested - if you simulate single block read failures, are there any blocks that would totally corrupt btrfs or zfs? How about adjacent block pairs?

Seems like a pretty easy test to run and if it found problems, they’d be well worth fixing. (And you could do the test itself pretty efficiently on a ramdisk).
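The detection half of the proposed test is easy to prototype above the filesystem: checksum every block, zero out a stripe, and check which corruptions get caught. (A toy model, not a simulation of btrfs/ZFS internals; the real question is how the FS metadata trees survive, which needs the actual filesystems.)

```python
# Sketch of the proposed test: keep a checksum per block, zero a
# contiguous stripe of blocks, then see what the checksums catch.
import hashlib, os

BLOCK = 4096
NBLOCKS = 64
disk = bytearray(os.urandom(BLOCK * NBLOCKS))
sums = [hashlib.sha256(disk[i * BLOCK:(i + 1) * BLOCK]).digest()
        for i in range(NBLOCKS)]

# Inject the failure: zero out a stripe of three adjacent blocks.
disk[BLOCK * 10:BLOCK * 13] = bytes(BLOCK * 3)

bad = [i for i in range(NBLOCKS)
       if hashlib.sha256(disk[i * BLOCK:(i + 1) * BLOCK]).digest() != sums[i]]
assert bad == [10, 11, 12]  # every zeroed block detected, nothing else
```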



The ZFS project has a test suite[1], and as far as I can determine, nuking data is part of it. See for example the raidz_test tool[2], which seems to do what you suggest.

[1]: https://github.com/openzfs/zfs/tree/master/tests/zfs-tests

[2]: https://github.com/openzfs/zfs/tree/master/cmd/raidz_test (run_rec_check_impl etc)


I would say poorly designed.

The typical btrfs fix is (slightly exaggerated) "When doing an A while a B is pending, re-enumerate the Cs for the purpose of D unless the E is locked, in which case, reschedule the F". Where A...F are all not trivial like snapshot, out of disk space, device rebalancing, and so on.

If that level of thinking of all the things at once is required to write correct code, well, it's evidently not happening.


Being "better than ZFS" is not trivial. ZFS is an excellent and very stable file system with many features.


What kind of trouble? I use it on all of my systems and my only complaint is I have to scrub and balance regularly.


It is kind of humorous having to balance regularly. I had hit a problem where a statfs() call returned { .f_bavail = 0 }, which made some tool complain; funnier still, the other statfs fields had their expected values and one could calculate bavail correctly from them. I didn't even notice it until then. The solution unfortunately was a full fs rebalance, and for some reason the rebalance tool will usually fail with some unhelpful and scary warning.
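For reference, these are the statfs() fields in question, visible through Python's statvfs wrapper; the invariant below is what tools assume, and the bug was f_bavail reading 0 while the other fields still implied a sane value:

```python
# The statfs()/statvfs() fields df-style tools read.
import os

st = os.statvfs("/")
total = st.f_blocks * st.f_frsize   # filesystem size in bytes
free = st.f_bfree * st.f_frsize     # free blocks (incl. root-reserved)
avail = st.f_bavail * st.f_frsize   # what df reports as "Avail"
assert 0 <= avail <= free <= total  # the expectation the tool relied on
```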

Either way, my story is the same as yours. I have no troubles, and I have at least ~10 daily snapshots of many subvolumes I can pull from if I accidentally `rm -rf /usr` or do something silly like that. It's a great FS and, unlike ZFS, is actually upstream in the Linux kernel.


The RAID 5/6 striping feature is still incomplete after all these years. There's still a risk of data loss in certain cases. Seems like a feature companies would be very interested in so I wonder why nobody's ever fixed it.


> There's still a risk of data loss in certain cases

If you need this functionality and want to mitigate the write hole, get a UPS.

> Seems like a feature companies would be very interested in so I wonder why nobody's ever fixed it.

Because those driving the development have no interest in RAID5/6; they do not need it. RAID5/6 is an enthusiast/SOHO feature, and these groups do not take part in development, so no wonder it is neglected.

In fact, I will combine both points: it is more effective/economical for everyone who needs RAID5/6 with btrfs to get a UPS than to pay the manhours for the development. Additionally, it would be an expense incurred by those who need it, not by those who currently do the development and do not need the feature.


How about fixing RAID5 first please? At least for non-metadata blocks ("-draid5 -mraid1" or "-draid6 -mraid1c3")?

https://phoronix.com/scan.php?page=news_item&px=Btrfs-Warnin...

It's kind of insane that two companies now sell a fixed btrfs-RAID5 as a proprietary commercial product (yes, in spite of the GPL). They do it by using the mdadm RAID5 code, which works, and add proprietary hooks to allow btrfs to use its checksums to figure out which spindle is corrupt in a parity mismatch situation. This (a) closes the write hole without a journal doubling the I/O load and (b) detects silent corruption, neither of which mdadm can do.
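The checksum trick is worth spelling out: on a parity mismatch, try reconstructing each spindle from the others plus parity, and keep the candidate whose checksum matches. Plain mdadm can't do this because it has no checksums. Toy sketch (XOR parity, SHA-256 standing in for the btrfs checksums):

```python
# Sketch: per-block checksums pinpoint the corrupt spindle in a RAID5
# parity mismatch -- the capability the proprietary btrfs hooks add.
import hashlib
from functools import reduce

def xor(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"block-disk0", b"block-disk1", b"block-disk2"]
sums = [hashlib.sha256(b).digest() for b in data]  # stored in metadata
par = xor(data)

data[1] = b"corrupted!!"  # silent bit rot on one spindle (same length)

# Parity no longer matches; rebuild each spindle from the others and
# keep the only candidate whose checksum verifies.
for i in range(len(data)):
    candidate = xor([b for j, b in enumerate(data) if j != i] + [par])
    if hashlib.sha256(candidate).digest() == sums[i]:
        data[i] = candidate  # checksum identified the bad spindle
        break

assert data[1] == b"block-disk1" and xor(data) == par
```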

Or just upstream this patch set:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...

...so we can do this ourselves with LVM (split each spindle into a small metadata device and a large device, stitch the large data devices together with mdadm-raid5/6, then add that and all the small devices to a btrfs filesystem with the small devices marked "metadata only" and -dsingle -mraid1c3).

I really feel like the btrfs guys are stuck in "shiny new thing" mode here. Last time I checked host-managed SMR devices were only available in engineering sample quantities. Even if that's changed they surely are still very rare, a tiny minority of worldwide storage device sales. The fact that btrfs-RAID5 scrubs take something like O(num_spindles^2) seek latencies is crazy... I have an 8-spindle array that scrubs in 12 hours with "-draid0" but takes 10 days with "-draid5".


I am sure that Western Digital (whom the author of this patchset works for) will be very interested in your opinion. He implements whatever makes business sense for WD, and not what some random forum commenter wants. I am not at all surprised that a storage vendor would want upstream Linux support for its upcoming devices before they're actually being sold in stores.



I think it's simply not of high importance to the commercial backers of the project. It's not a hobby; those people get paid for it.

RAID5/6 seems to me to be more popular in home use.


What's sad is that there is clearly commercial value to the feature, but that commercial value is captured by proprietary implementations of it (e.g. Synology's btrfs-on-mdraid).


Exactly this.

The fact that Synology makes money selling their proprietary btrfs-RAID5 is incontrovertible proof of the commercial relevance.


F2FS - I'd currently rather use this, if I had to use a Zoned device (Assuming it's under 16TB in size).

https://manpages.ubuntu.com/manpages/bionic/man8/mkfs.f2fs.8...

-m -z #-of-sections-per-zone

https://zonedstorage.io/linux/fs/

The f2fs section says "Zoned block device support was added to f2fs with kernel 4.10. Since f2fs uses a metadata block on-disk format with fixed block location, only zoned block devices which include conventional zones can be supported. Zoned devices composed entirely of sequential zones cannot be used with f2fs as a standalone device and require a multi-device setup to place metadata blocks on a randomly writable storage."


Has f2fs reliability changed much since this publication?

https://www.usenix.org/conference/atc19/presentation/jaffer

I'd be afraid to use it for anything other than throwaway test data.


Is this supported on top of dm-crypt's zoned support, or is that planned?


Nitpick: the device mapper zoned-device support isn't built on top of the device mapper encrypted-device support; both dm-crypt and dm-zoned are independent device mapper targets.

This is native support for btrfs-on-SMR, without the dm-zoned layer in between. dm-zoned is meant for filesystems that are zone-unaware, and it works by batching writes and redirecting blocks into appropriate areas. Having the filesystem allocator be aware of the underlying device's zones allows for more efficient/performant use of the zones.


Would it be possible to define a "zone" that GRUB_SAVEDEFAULT can overwrite?

I reverted from btrfs to ext4 on my main desktop a few years ago, because grub couldn't remember the last selected menu item (error: sparse file not allowed), and I didn't feel like creating a dedicated /boot partition.


Booting from a zoned device on btrfs is not supported: the 0th zone contains the superblock, and the boot loader data live before any superblock, so updating them would require a reset and complete rewrite of the live data in the zone.

The 'sparse file not allowed' error is caused by grub: it would try to overwrite file blocks directly, but on btrfs that would cause a checksum mismatch. This has been solved by storing the env block outside of the filesystem at 256K and syncing it back and forth once the system is booted. 256K is fine because btrfs does not use the first 1M of any device, leaving it for bootloaders.
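The mechanism is just a raw read/write at a fixed byte offset, bypassing the filesystem entirely. A file-backed sketch (the offset and the 1 KiB, '#'-padded grubenv block size are from the grub scheme; the helper names are made up):

```python
# Sketch: an env block at a fixed 256 KiB offset, inside the first 1 MiB
# that btrfs leaves free for bootloaders. A plain file stands in for the
# block device here; a real setup would open the device node instead.
import os, tempfile

ENV_OFFSET = 256 * 1024
ENV_SIZE = 1024  # grubenv is a fixed 1 KiB block, padded with '#'

def write_env(dev, text):
    blob = text.encode().ljust(ENV_SIZE, b"#")
    with open(dev, "r+b") as f:
        f.seek(ENV_OFFSET)
        f.write(blob)  # in-place overwrite - no btrfs checksums to break

def read_env(dev):
    with open(dev, "rb") as f:
        f.seek(ENV_OFFSET)
        return f.read(ENV_SIZE).rstrip(b"#").decode()

fd, path = tempfile.mkstemp()   # stand-in for the block device
os.close(fd)
os.truncate(path, 1024 * 1024)  # 1 MiB "device"
write_env(path, "saved_entry=2\n")
assert read_env(path) == "saved_entry=2\n"
```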



