Discussion:
Getting started with ZFS
Colin Fee
2013-07-12 02:06:20 UTC
Permalink
Last night after some updates on my media server, one of the disks in a
mirror set failed a SMART check and was kicked out (fortunately I have good
backups). It's failing on the bad sector count but I haven't done a deeper
analysis yet.

However it's time to bite the bullet and do some upgrade work. I'm
intending to replace all of the disks with larger ones to increase the
storage size and when I do, use ZFS.

The box is based around a Gigabyte GA-880GM-USB3 mobo, with an AMD Phenom
II X6 1055T CPU and buckets of RAM and the OS on a 240GB SSD.

So I'm looking for a strategy re the implementation of ZFS. I can install
up to 4 SATA disks onto the mobo (5 in total with one slot used by the SSD)
--
Colin Fee
***@gmail.com
Colin Fee
2013-07-12 04:30:29 UTC
Permalink
I should add to that I've begun reading through necessary literature and
sites, and a colleague at work who uses ZFS extensively on his home Mac
server stuff has given me some intro videos to watch.

I guess what I'm asking is for pointers to the gotchas that most people
overlook or those gems that people have gleaned from experience.
--
Colin Fee
***@gmail.com
Kevin
2013-07-12 04:43:59 UTC
Permalink
Make sure you have lots of RAM.

If you are using raidz or raidz2 you will need to ensure your vdevs are
designed correctly from the start, as they cannot be changed later.
Pools are cool
snapshots are good

check out btrfs.
Kevin
2013-07-12 05:29:04 UTC
Permalink
Lots is certainly the wrong word now that I think about it.

You need 4GB min for a reasonable zfs server plus more if you use dedupe
Tim Connors
2013-07-12 06:21:32 UTC
Permalink
I'd say don't. It's nowhere near ready for production yet.

Don't plan on doing anything else with the server. zfsonlinux is still
not integrated with the page cache and Linux memory management (due for
0.7 maybe, if we're lucky, but I see no one working on the code). It does
fragment memory, and even if you limit it to half your RAM or even far
smaller, it will still use more than you allow it to, and crash your
machine after a couple of weeks of uptime (mind you, it's never lost data
for me, and now that I have a hardware watchdog, I just put up with losing
my NFS shares for 10s of minutes every few weeks as it detects a
softlockup and reboots).

It also goes on frequent go-slows, stopworks and strikes, before marching
down Spring St (seriously, it goes into complete meltdown above about 92%
disk usage, but even when not nearly full, will frequently get bogged down
so much that NFS autofs mounts time out before successfully mounting a
share).

rsync, hardlink and metadata heavy workloads are particularly bad for it.

snapshots don't give you anything at all over LVM, and just introduce a
different set of commands. zfs send/recv is seriously overrated compared
to dd (the data is fragmented enough on read, because of COW, that
incremental sends are frequently as slow as just sending the whole device
in the first place).

raidz is seriously inflexible compared to mdadm. Doesn't yet balance IO
to different speed devices, but a patch has just been committed to HEAD
(so might make it out to 0.6.2 - but I haven't yet tested it and have
serious doubts).

Think of zvols as lvm logical volumes. With different syntax (and
invented by people that have never heard of LVM). You can't yet swap to
zvols without having hard lockups every day or so (after dipping into
about 200MB of a swap device), whereas I've never ever had swapping to lvm
cause a problem (and we have that set up for hundreds of VMs here at
$ORK).

In short, overrated. I wouldn't do it again if I didn't already have a
seriously large (for the hardware) amount of files and hardlinks that take
about a month to migrate over to a new filesystem. ext4 on bcache on lvm
(or vice versa) on mdadm sounds like a much more viable way to go in the
future. bcache isn't mature yet, but given the simplicity of it, I don't
think it will be long before it is. Separation of mature layers is
good. Whereas zfs is a massive spaghetti of new code that hooks into too
many layers and takes half an hour to build on my machine (but at least
it's now integrated into dkms on debian).
--
Tim Connors
Craig Sanders
2013-07-12 06:11:26 UTC
Permalink
Post by Colin Fee
The box is based around a Gigabyte GA-880GM-USB3 mobo, with an AMD Phenom
II X6 1055T CPU and buckets of RAM and the OS on a 240GB SSD.
more than adequate.

my main zfs box here is a Phenom II x6 1090T with 16GB RAM and a pair
of 240GB SSDs for OS/L2ARC/ZIL. it's also my desktop machine, internet
gateway, firewall, virtualisation server, and a box for everything and
anything I want to run or experiment with.

so a 1055T dedicated to just a ZFS fileserver will have no problem
coping with the load.
Post by Colin Fee
So I'm looking for a strategy re the implementation of ZFS.
0. zfsonlinux is pretty easy to work with, easy to learn and to use.

i'd recommend playing around with it to get a feel for how it works and
to experiment with some features before putting it into 'production' use
- otherwise you may find later on that there was a more optimal way of
doing what you want.

e.g. once you've got a pool set up, get into the habit of creating
subvolumes on the pool for different purposes rather than just
sub-directories. You can enable different quotas (soft quota
or hard-limit) on each volume, and have different attributes
(e.g. compression makes sense for text documents, but not for
already-compressed video files).

my main pool is mounted as /export (a nice, traditional *generic*
name), and i have /export/home, /export/src, /export/ftp, /export/www,
/export/tftp and several other subvols on it. as well as zvols for VMs.
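
to make that concrete, setting up a couple of subvolumes with different
attributes is only a handful of commands - a minimal sketch, with the pool
and dataset names as placeholders:

    # create datasets under an existing pool called "export"
    zfs create export/home
    zfs create export/video

    # compression makes sense for home, not for already-compressed video
    zfs set compression=on export/home
    zfs set compression=off export/video

    # cap how much export/home can use from the shared pool space
    zfs set quota=200G export/home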

if performance is important to you then do some benchmarking on your own
hardware with various configurations. Russell's bonnie++ is a great
tool for this.



1. if your disks are new and have 4K sectors OR if you're likely to add
4K-sector drives in future, then create your pool with 'ashift=12'

4K sector drives, if not quite the current standard right now, are
pretty close to the current standard and will inevitably replace the old
512-byte sector standard in a very short time.

(btw, ashift=12 works even with old 512-byte sector drives, because 4096
is a multiple of 512. there is an insignificantly tiny reduction in
usable space when using ashift=12 on 512 sector drives).
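
for illustration, ashift is set at pool creation time and can't be changed
afterwards - a minimal sketch, with made-up device names:

    zpool create -o ashift=12 export raidz1 sdb sdc sdd sde

    # the ashift each vdev ended up with shows up in the cached pool config
    zdb -C export | grep ashift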


2. The SSD can be partitioned so that some of it is for the OS (50-100GB
should be plenty), some for a small ZIL write intent cache (4 or 8GB is
heaps), and the remainder for L2ARC cache.


3. if you're intending to use some of that SSD for ZIL (write intent
cache) then it's safer to have two ZIL partitions on two separate SSDs
in a mirror configuration, so that if one SSD dies, you don't lose
recent unflushed writes. this is one of those things that is low risk
but high damage potential.

in fact, an mdadm raid-1 for the OS, two non-raid L2ARC cache
partitions, and mirrored ZIL is, IMO, an ideal configuration.
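
a sketch of that layout, assuming the two SSDs are already partitioned and
the pool is called 'export' (the device names are placeholders):

    # mirrored ZIL (slog) across a small partition on each SSD
    zpool add export log mirror \
        /dev/disk/by-id/ata-SSD_ONE-part5 /dev/disk/by-id/ata-SSD_TWO-part5

    # two independent L2ARC devices (cache devices can't be mirrored, and
    # don't need to be - losing one just loses cached data)
    zpool add export cache \
        /dev/disk/by-id/ata-SSD_ONE-part6 /dev/disk/by-id/ata-SSD_TWO-part6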

if you can only fit one SSD in the machine, then obviously you can't do
this.


4. if performance is far more important than size, then create your pool
with two mirrored pairs (i.e. the equivalent of RAID-10) rather than
RAID-Z1. This will give you the capacity of two drives, whereas RAID-Z1
would give you the capacity of 3 of your 4 drives.

It also has the advantage of being relatively easier/cheaper to expand,
just add another mirrored pair of drives to the pool. expanding a RAID-Z
involves either adding another complete RAID-Z to the pool (i.e. 4
drives, so that you have a pool consisting of two RAID-Zs) or replacing
each individual drive one after the other.

e.g. 4 x 2TB drives would give 4TB total in RAID-10, compared to 6TB if
you used RAID-Z1.
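
to make the trade-off concrete, the two layouts for four disks are created
roughly like this (device names are placeholders), and the mirrored layout
can be grown a pair at a time later:

    # two mirrored pairs (RAID-10 equivalent, capacity of 2 of the 4 drives)
    zpool create export mirror disk1 disk2 mirror disk3 disk4

    # versus one RAID-Z1 (capacity of 3 of the 4 drives)
    zpool create export raidz1 disk1 disk2 disk3 disk4

    # expanding the mirrored pool later is just another pair
    zpool add export mirror disk5 disk6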

I use RAID-Z1 on my home file server and find the performance to be
acceptable. The only time I really find it slow is when a 'zpool scrub'
is running (i have a weekly cron job on my box)...my 4TB zpools are now
about 70% full, so it takes about 6-8 hours for scrub to run. It's only
a problem if i'm using the machine late at night.
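
that sort of weekly scrub can be a one-line cron entry - a sketch, assuming
the pool is called 'export' and zpool lives in /sbin:

    # /etc/cron.d/zfs-scrub -- scrub every Saturday at 1am
    0 1 * * 6  root  /sbin/zpool scrub export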

I also use RAID-Z1 on my mythtv box. I've never had an issue with
performance on that, although I do notice if I transcode/cut-ads more
than 3 or 4 recordings at once (but that's a big IO load for a home
server, read 2GB or more - much more for HD recordings - and write
1GB or so for each transcode, all simultaneous). It's mostly large
sequential files, which is a best-case scenario for disk performance.
Post by Colin Fee
I can install up to 4 SATA disks onto the mobo (5 in total with one
slot used by the SSD)
if you ever plan to add more drives AND have a case that can physically
fit them, then I highly recommend LSI 8-port SAS PCI-e 8x controllers.
They're dirt cheap (e.g. IBM M1015 can be bought new off ebay for under
$100), high performance, and can take 8 SAS or SATA drives. They're
much better and far cheaper than any available SATA expansion cards.

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #188:

..disk or the processor is on fire.
Tim Connors
2013-07-12 06:31:09 UTC
Permalink
Post by Craig Sanders
e.g. once you've got a pool set up, get into the habit of creating
subvolumes on the pool for different purposes rather than just
sub-directories. You can enable different quotas (soft quota
or hard-limit) on each volume, and have different attributes
(e.g. compression makes sense for text documents, but not for
already-compressed video files).
Note that `mv <tank/subvol1>/blah <tank/subvol2>/foo` to different
subvols on the same device still copies-then-deletes the data as if it's a
separate filesystem (it is - think of it just like lvm) despite the fact
that it's all just shared storage in the one pool. Slightly inconvenient
and surprising, but makes sense in a way (even if zfs could be convinced
to mark a set of blocks from a file as changing its owner from subvol1 to
subvol2, how do you teach `mv` that this inter-mount move is special?)
Post by Craig Sanders
1. if your disks are new and have 4K sectors OR if you're likely to add
4K-sector drives in future, then create your pool with 'ashift=12'
Do it anyway, out of habit. You will one day have 4k sectors, and
migrating is a bitch (when you've maxed out the number of drives in your 4
bay NAS).
Post by Craig Sanders
3. if you're intending to use some of that SSD for ZIL (write intent
cache) then it's safer to have two ZIL partitions on two separate SSDs
in a mirror configuration, so that if one SSD dies, you don't lose
recent unflushed writes. this is one of those things that is low risk
but high damage potential.
Terribly terribly low risk. It has to die, undetected, at the same time
that the machine suffers from a power failure (ok, so these tend to be
correlated failures). (I say undetected, because zpool status will warn
you when any of its devices are unhealthy - set up a cron.hourly to
monitor zpool status). And if it does die, then COW will mean you'll get
a consistent 5-second-old copy of your data. Which for non-database,
non-mailserver workloads (and database workloads for those of us who pay
zillions of dollars for Oracle, and then don't use it to its full
potential), doesn't actually matter (5 seconds old data vs
non-transactional data that would be 5 seconds old if the power failed 5
seconds ago? Who cares?)
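
A minimal version of that hourly check - a sketch only, assuming a working
local 'mail' command - just stays quiet while everything is healthy:

    #!/bin/sh
    # /etc/cron.hourly/zpool-check
    # 'zpool status -x' prints "all pools are healthy" when there's nothing
    # to report, so only send mail when it says anything else.
    STATUS=$(/sbin/zpool status -x)
    echo "$STATUS" | grep -q 'all pools are healthy' || \
        echo "$STATUS" | mail -s "zpool problem on $(hostname)" root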

For me, SLOG just means that hardlinks over NFS don't take quite so long.
Sometimes I even see SLOG usage grow up to 10MB(!) for a few seconds at a
time!
--
Tim Connors
Trent W. Buck
2013-07-12 14:01:03 UTC
Permalink
Post by Tim Connors
Note that `mv <tank/subvol1>/blbah <tank/subvol2>/foo` to different
subvols on the same device still copies-then-deletes the data as if its a
separate filesystem (it is - think of it just like lvms) despite the fact
that it's all just shared storage in the one pool. Slightly inconvenient
and surprising, but makes sense in a way (even if zfs could be convinced
to mark a set of blocks from a file as changing its owner from subvol1 to
subvol2, how do you teach `mv` that this inter-mount move is special?
GNU cp has --reflink which AIUI is extra magic for ZFS & btrfs users.
It wouldn't really surprise me if there was extra magic for mv as well,
though I doubt it would apply in this case (as you say, inter-device
move).

Under btrfs, if / is btrfs root and you make a subvol "opt", it'll be
visible at /opt even if you forget to mount it. Maybe in that case, mv
*is* that clever, if only by accident.

Incidentally, I have this in my .bashrc:

grep -q btrfs /proc/mounts && alias cp='cp --reflink=auto'

(The grep is just because it's shorter than checking if cp is new enough
to have --reflink.)
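
As a quick demonstration of what --reflink buys you on btrfs (the filenames
are just placeholders):

    # near-instant, space-sharing copy on btrfs; 'auto' falls back to a
    # normal copy on filesystems that can't do it
    cp --reflink=auto big-video.iso big-video-copy.iso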
Russell Coker
2013-07-12 15:43:02 UTC
Permalink
Post by Tim Connors
Note that `mv <tank/subvol1>/blbah <tank/subvol2>/foo` to different
subvols on the same device still copies-then-deletes the data as if its a
separate filesystem (it is - think of it just like lvms) despite the fact
that it's all just shared storage in the one pool. Slightly inconvenient
and surprising, but makes sense in a way (even if zfs could be convinced
to mark a set of blocks from a file as changing its owner from subvol1 to
subvol2, how do you teach `mv` that this inter-mount move is special?
cp has the --reflink option, mv could use the same mechanism if it was
supported across subvols. I believe that there is work on supporting reflink
across subvols in BTRFS and there's no technical reason why ZFS couldn't do it
too.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Chris Samuel
2013-07-14 08:55:18 UTC
Permalink
Post by Russell Coker
cp has the --reflink option, mv could use the same mechanism if it was
supported across subvols. I believe that there is work on supporting
reflink across subvols in BTRFS and there's no technical reason why ZFS
couldn't do it too.
Cross-subvol reflinks were merged for the 3.6 kernel.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=362a20c5e27614739c4

You can't reflink across mount points though, it was mentioned by Christoph
Hellwig back in 2011 when he first NAK'd the cross-subvol patches
http://www.spinics.net/lists/linux-btrfs/msg09229.html .

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
Russell Coker
2013-07-12 15:43:40 UTC
Permalink
Post by Colin Fee
So I'm looking for a strategy re the implementation of ZFS. I can install
up to 4 SATA disks onto the mobo (5 in total with one slot used by the SSD)
Firstly plan what you are doing especially regarding boot.

Do you want to have /boot be a RAID-1 across all 4 of the disks? Do you want
it just on the SSD?

My home server uses BTRFS (which is similar to ZFS in many ways) for mass
storage and has a SSD for the root filesystem. That means that when some
server task uses a lot of IO capacity on the mass storage it won't slow down
local workstation access. If you have a shared workstation/server plan then
this might be worth considering.

Don't bother with ZIL or L2ARC, most home use has no need for more performance
than modern hard drives can provide and it's best to avoid the complexity.
Post by Kevin
You need 4GB min for a reasonable zfs server plus more if you use dedupe
I've had a server with 4G have repeated kernel OOM failures running ZFS even
though dedupe wasn't enabled. I suggest that 8G is the bare minimum.
Post by Craig Sanders
0. zfsonlinux is pretty easy to work with, easy to learn and to use.
Actually it's a massive PITA compared to every filesystem that most Linux
users have ever used.

You need DKMS to build the kernel modules and then the way ZFS operates is
very different from traditional Linux filesystems.
Post by Craig Sanders
1. if your disks are new and have 4K sectors OR if you're likely to add
4K-sector drives in future, then create your pool with 'ashift=12'
4K sector drives, if not quite the current standard right now, are
pretty close to current standard and will inevitably replace the old
512-byte sector standard in a very short time.
If you are using disks that don't use 4K sectors then they are probably small
by modern standards in which case you don't have such a need for ZFS.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Craig Sanders
2013-07-13 01:45:50 UTC
Permalink
Post by Russell Coker
Post by Colin Fee
So I'm looking for a strategy re the implementation of ZFS. I can
install up to 4 SATA disks onto the mobo (5 in total with one slot
used by the SSD)
Firstly plan what you are doing especially regarding boot.
Do you want to have /boot be a RAID-1 across all 4 of the disks?
not a good idea with ZFS. Don't give partitions to ZFS, give it entire disks.

from the zfsonlinux faq:

http://zfsonlinux.org/faq.html#PerformanceConsideration

"Create your pool using whole disks: When running zpool create use
whole disk names. This will allow ZFS to automatically partition
the disk to ensure correct alignment. It will also improve
interoperability with other ZFS implementations which honor the
wholedisk property."
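
in practice that just means handing zpool the bare by-id disk names with no
partition suffix - for example (the serial numbers here are made up):

    zpool create -o ashift=12 export raidz1 \
        /dev/disk/by-id/ata-ST31000528AS_AAAAAAAA \
        /dev/disk/by-id/ata-ST31000528AS_BBBBBBBB \
        /dev/disk/by-id/ata-ST31000528AS_CCCCCCCC \
        /dev/disk/by-id/ata-ST31000528AS_DDDDDDDD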
Post by Russell Coker
Do you want it just on the SSD?
much better to do it this way, which is one of the reasons I suggested a
second SSD to have mdadm raid-1 for / and /boot.
Post by Russell Coker
Don't bother with ZIL or L2ARC, most home use has no need for more performance
than modern hard drives can provide and it's best to avoid the complexity.
if he's got an SSD then partitioning it to give some L2ARC and ZIL is
easy enough, and both of them will provide noticeable benefits even for
home use.

it takes a little bit of advance planning to set up (i.e. partitioning
the SSD and then issuing 'zpool add' commands), but it's set-and-forget.
it doesn't add any ongoing maintenance complexity.

e.g. here's the 'zpool history' from when I installed my old SSD in my
myth box after upgrading the SSD on my main system. it had previously
been running for almost a year without L2ARC or ZIL (but with 16GB RAM,
just because I had some spare RAM lying around)

2013-04-09.09:59:01 zpool add export log scsi-SATA_Patriot_WildfirPT1131A00006353-part5
2013-04-09.09:59:07 zpool add export cache scsi-SATA_Patriot_WildfirPT1131A00006353-part6
Post by Russell Coker
Post by Kevin
You need 4GB min for a reasonable zfs server plus more if you use dedupe
I've had a server with 4G have repeated kernel OOM failures running ZFS even
though dedupe wasn't enabled. I suggest that 8G is the bare minimum.
i'd agree with that. I haven't had the OOM failures even on 4GB
(possibly because i've always used ZFS with either 8+GB or with some SSD
L2ARC) but ZFS does benefit from more RAM.
Post by Russell Coker
Post by Craig Sanders
0. zfsonlinux is pretty easy to work with, easy to learn and to use.
Actually it's a massive PITA compared to every filesystem that most
Linux users have ever used.
yeah, well, it's a bit more complicated than mkfs. but a *lot* less
complicated than mdadm and lvm.

and gaining the benefits of sub-volumes or logical volumes of any kind
is going to add some management complexity whether you use btrfs(8),
zfs(8), or (worse) lvcreate/lvextend/lvresize/lvwhatever.
Post by Russell Coker
You need DKMS to build the kernel modules and then the way ZFS operates is
very different from traditional Linux filesystems.
dkms automates and simplifies the entire compilation process, so it just
takes time. it's not complicated or difficult.

or you could just use the zfsonlinux binary repo for the distribution of
your choice and install a pre-compiled binary. for debian, instructions
are here:

http://zfsonlinux.org/debian.html

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #54:

Evil dogs hypnotised the night shift
Russell Coker
2013-07-13 04:42:22 UTC
Permalink
Post by Craig Sanders
Post by Russell Coker
Firstly plan what you are doing especially regarding boot.
Do you want to have /boot be a RAID-1 across all 4 of the disks?
not a good idea with ZFS. Don't give partitions to ZFS, give it entire disks.
http://zfsonlinux.org/faq.html#PerformanceConsideration
"Create your pool using whole disks: When running zpool create use
whole disk names. This will allow ZFS to automatically partition
the disk to ensure correct alignment. It will also improve
interoperability with other ZFS implementations which honor the
wholedisk property."
Who's going to transfer a zpool of disks from a Linux box to a *BSD or Solaris
system? Almost no-one.

If Solaris is even a consideration then just use it right from the start, ZFS
is going to work better on Solaris anyway. If you have other reasons for
choosing the OS (such as Linux being better for pretty much everything other
than ZFS) then you're probably not going to change.
Post by Craig Sanders
Post by Russell Coker
Don't bother with ZIL or L2ARC, most home use has no need for more
performance than modern hard drives can provide and it's best to avoid
the complexity.
if he's got an SSD then partitioning it to give some L2ARC and ZIL is
easy enough, and both of them will provide noticable benefits even for
home use.
it takes a little bit of advance planning to set up (i.e. partitioning
the SSD and then issuing 'zfs add' commands, but it's set-and-forget.
doesn't add any ongoing maintainence complexity.
But it does involve more data transfer. Modern SSDs shouldn't wear out, but
I'm not so keen on testing that theory. For a system with a single SSD you
will probably have something important on it. Using it for nothing but
ZIL/L2ARC might be a good option, but also using it for boot probably
wouldn't be.
Post by Craig Sanders
Post by Russell Coker
Post by Craig Sanders
0. zfsonlinux is pretty easy to work with, easy to learn and to use.
Actually it's a massive PITA compared to every filesystem that most
Linux users have ever used.
yeah, well, it's a bit more complicated than mkfs. but a *lot* less
complicated than mdadm and lvm.
I doubt that claim. It's very difficult to compare complexity, but the
layered design of mdadm and lvm makes it easier to determine what's going on
IMHO.
Post by Craig Sanders
and gaining the benefits of sub-volumes or logical volumes of any kind
is going to add some management complexity whether you use btrfs(8),
zfs(8), or (worse) lvcreate/lvextend/lvresize/lvwhatever.
It's true that some degree of complexity is inherent in solving more complex
problems. That doesn't change the fact that it's difficult to work with.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Craig Sanders
2013-07-13 15:53:32 UTC
Permalink
Post by Russell Coker
Post by Craig Sanders
http://zfsonlinux.org/faq.html#PerformanceConsideration
"Create your pool using whole disks: When running zpool create use
whole disk names. This will allow ZFS to automatically partition
the disk to ensure correct alignment. It will also improve
interoperability with other ZFS implementations which honor the
wholedisk property."
Who's going to transfer a zpool of disks from a Linux box to a *BSD or Solaris
system? Almost no-one.
read the first reason again, automatic alignment of the partitions
increases performance - and it's the sort of thing that it's a PITA
to do manually, calculating the correct starting locations for all
partitions. it'll be a lot more significant to most linux users than the
second reason...as you say, that's irrelevant to most people.
Post by Russell Coker
But it does involve more data transfer. Modern SSDs shouldn't wear
out, but I'm not so keen on testing that theory. For a system with a
single SSD you will probably have something important on it. Using it
for nothing but ZIL/L2ARC might be a good option, but also using it
for boot probably wouldn't.
if you're that paranoid about SSDs wearing out then just don't use them
for anything except a read-cache. or better yet, not at all.

it really isn't worth worrying about, though. current generation SSDs
are at least as reliable as hard disks, and with longevity to match.
they're not going to wear out any faster than a hard disk.
Post by Russell Coker
Post by Craig Sanders
Post by Russell Coker
Post by Craig Sanders
0. zfsonlinux is pretty easy to work with, easy to learn and to use.
Actually it's a massive PITA compared to every filesystem that most
Linux users have ever used.
yeah, well, it's a bit more complicated than mkfs. but a *lot* less
complicated than mdadm and lvm.
I doubt that claim. It's very difficult to compare complexity, but the
layered design of mdadm and lvm makes it easier to determine what's going on
IMHO.
mdadm alone is simple enough and well documented, with great built-in
help - about on a par with zfs. they're both very discoverable, which is
particularly important for tools that you don't use daily - you typically
only use them when adding or replacing hardware or when something has
gone wrong. the btrfs tools are similarly easy to learn.


lvm is the odd one out - it's reasonably well documented, but the built-in
help is crap, and it is not easy to discover how to use the
various lv tools or, indeed, which of the tools is the appropriate one
to use - "is it lvresize or lvextend to resize a partition, or both but
in what order"?

and, unlike lvm, with both btrfs and zfs, once you've created the pool
you don't need to remember or specify the device names of the disks in
the pool unless you're replacing one of them - most operations (like
creating, deleting, changing attributes of a volume or zvol) are done
with just the name of the pool and volume. this leads to short, simple,
easily understood command lines.

btrfs and zfs are at a higher abstraction level, allowing you to work
with the big picture of volumes and pools rather than get bogged down in
little details of disks and partitions.

the 'zpool history' command is also invaluable - you can easily see
*exactly* which commands you used to make changes to the pool or a file
system - built-in, time-stamped cheat-notes of everything you've
done to the pool. this makes it trivially easy to remember how you did
something six months ago.
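
for example (the pool name is assumed to be 'export'):

    # every change ever made to the pool, with timestamps
    zpool history export

    # add internal events plus the user and host that ran each command
    zpool history -il export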
Post by Russell Coker
Post by Craig Sanders
and gaining the benefits of sub-volumes or logical volumes of any
kind is going to add some management complexity whether you use
btrfs(8), zfs(8), or (worse) lvcreate/lvextend/lvresize/lvwhatever.
It's true that some degree of complexity is inherent in solving more
complex problems. That doesn't change the fact that it's difficult to
work with.
if you mean 'zfs is difficult to work with', then you must be using some
definition of 'difficult' of which i was previously unaware.

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #347:

The rubber band broke
Jeremy Visser
2013-07-14 02:15:59 UTC
Permalink
Post by Craig Sanders
and, unlike lvm, with both btrfs and zfs, once you've created the pool
you don't need to remember or specify the device names of the disks in
the pool unless you're replacing one of them
Can you clarify what you mean by this?

My volume group is /dev/vg_glimfeather. I haven’t interacted with the
raw PV, /dev/sda6, in over a year.

All modern distros autodetect all VGs upon boot.
Craig Sanders
2013-07-14 06:54:22 UTC
Permalink
Post by Jeremy Visser
Post by Craig Sanders
and, unlike lvm, with both btrfs and zfs, once you've created the
pool you don't need to remember or specify the device names of the
disks in the pool unless you're replacing one of them
Can you clarify what you mean by this?
My volume group is /dev/vg_glimfeather. I haven't interacted with the
raw PV, /dev/sda6, in over a year.
here's an example from the lvextend man page:

Extends the size of the logical volume "vg01/lvol10" by 54MiB on physi-
cal volume /dev/sdk3. This is only possible if /dev/sdk3 is a member of
volume group vg01 and there are enough free physical extents in it:

lvextend -L +54 /dev/vg01/lvol10 /dev/sdk3

and another from lvresize:

Extend a logical volume vg1/lv1 by 16MB using physical extents
/dev/sda:0-1 and /dev/sdb:0-1 for allocation of extents:

lvresize -L+16M vg1/lv1 /dev/sda:0-1 /dev/sdb:0-1

to extend or resize the volume, you've got to tell it which disk(s) to
allocate the free space from.


with zfs, you'd just do 'zfs set quota=nnnn pool/volume', and not have
to give a damn about which particular disks in the pool the extra space
came from.

actually, 'set reservation' would be closer in meaning to the lvm
commands - 'quota' is taken from shared space used by all volumes on that
pool, so overcommit is possible (even the default). e.g. on a 1TB pool, you
can create 5 volumes each with quotas of 1TB.

'reservation' actually reserves that amount of space for that volume
from the pool. no other volume gets to use that reserved space.
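
concretely, the two properties look like this (dataset names and sizes are
placeholders):

    # cap how much export/home may consume from the shared pool space
    zfs set quota=500G export/home

    # guarantee export/vm at least 100G, taken out of what everything
    # else in the pool can allocate
    zfs set reservation=100G export/vm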
Post by Jeremy Visser
All modern distros autodetect all VGs upon boot.
craig
--
craig sanders <***@taz.net.au>

BOFH excuse #57:

Groundskeepers stole the root password
Brett Pemberton
2013-07-14 07:25:29 UTC
Permalink
Post by Jeremy Visser
My volume group is /dev/vg_glimfeather. I haven't interacted with the
raw PV, /dev/sda6, in over a year.
Post by Craig Sanders
Extends the size of the logical volume "vg01/lvol10" by 54MiB on physi-
cal volume /dev/sdk3. This is only possible if /dev/sdk3 is a member of
volume group vg01 and there are enough free physical extents in it:
lvextend -L +54 /dev/vg01/lvol10 /dev/sdk3
If you look at the synopsis for the man page, you'll realise that the
physical device is completely optional. It's for when you might have
multiple PVs and wish to force this extend to happen on one in particular.
For most use-cases you don't need to specify it at all.
Post by Craig Sanders
Extend a logical volume vg1/lv1 by 16MB using physical extents
/dev/sda:0-1 and /dev/sdb:0-1 for allocation of extents:
lvresize -L+16M vg1/lv1 /dev/sda:0-1 /dev/sdb:0-1
to extend or resize the volume, you've got to tell it which disk(s) to
allocate the free space from.
Nope. You can, but you don't have to.

/ Brett
Craig Sanders
2013-07-15 04:30:43 UTC
Permalink
Post by Brett Pemberton
Post by Craig Sanders
to extend or resize the volume, you've got to tell it which disk(s) to
allocate the free space from.
Nope. You can, but you don't have to.
oh well, consider that an example of lvm's difficulty to learn and
remember. I've been using LVM for a lot longer than ZFS but I still
find it difficult and complicated to work with, and the tools are clumsy
and awkward.

nowadays, i see them as first generation pool & volume management tools,
functional but inelegant. later generation tools like btrfs and ZFS
not only add more features (like compression, de-duping, etc etc) by
blurring the boundaries between block devices and file systems but more
importantly they add simplicity and elegance that is missing from the
first generation.



and as for lvextend, lvreduce, and lvresize - it's completely insane
for there to be three tools that do basically the same job. sure, i get
that there are historical and compatibility reasons to keep lvextend
and lvreduce around, but they should clearly be marked in the man page
as "deprecated, use lvresize instead" (or even just "same as lvresize")
because for a newcomer to LVM or even a long term but infrequent user
(as is normal, lvm isn't exactly daily-usage) they just add confusion -
i have to remind myself every time I use them that extend doesn't mean
anything special in LVM, it's not something extra i have to do before or
after a resize, it's just a resize.

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #144:

Too few computrons available.
Russell Coker
2013-07-15 05:25:29 UTC
Permalink
On Mon, 15 Jul 2013 14:30:43 +1000
Post by Craig Sanders
oh well, consider that an example of lvm's difficulty to learn and
remember. I've been using LVM for a lot longer than ZFS but I still
find it difficult and complicated to work with, and the tools are
clumsy and awkward.
They are designed for first-generation filesystems, where fsck is a
standard utility that's regularly used, as opposed to more modern
filesystems where there is no regularly used fsck because they are just
expected to work all the time.

The ratio of block device size to RAM for affordable desktop systems is
now around 500:1 (4T:8G) where it was about 35:1 in the mid 90's (I'm
thinking 70M disk and 2M of RAM). The time taken to linearly read an
entire block device is now around 10 hours when it used to be about 2
minutes in 1990 (I'm thinking of a 1:1 interleave 70MB MFM disk doing
500KB/s).

Those sort of numbers change everything about filesystems, and they
aren't big filesystems. The people who designed ZFS were thinking of
arrays of dozens of disks as a small filesystem.

Ext4 and LVM still have their place. I'm a little nervous about the
extra complexity of ZFS and BTRFS when combined with a system that
lacks ECC RAM. I've already had two BTRFS problems that required
backup/format/restore due to memory errors.
Post by Craig Sanders
nowadays, i see them as first generation pool & volume management
tools, functional but inelegant. later generation tools like btrfs
and ZFS not only add more features (like compression, de-duping, etc
etc) by blurring the boundaries between block devices and file
systems but more importantly they add simplicity and elegance that is
missing from the first generation.
True, but they also add complexity. Did you ever run debugfs on an
ext* filesystem? It's not that hard to do. I really don't want to
imagine running something similar on BTRFS or ZFS.
Post by Craig Sanders
and as for lvextend, lvreduce, and lvresize - it's completely insane
for there to be three tools that do basically the same job.
It's pretty much standard in Unix to have multiple ways of doing it.

gzip -d / zcat
cp / tar / cat / mv all have overlapping functionality
tar / cpio
Post by Craig Sanders
sure, i
get that there are historical and compatibility reasons to keep
lvextend and lvreduce around, but they should clearly be marked in
the man page as "deprecated, use lvresize instead" (or even just
"same as lvresize") because for a newcomer to LVM or even a long term
So it's a man page issue.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Craig Sanders
2013-07-16 02:15:11 UTC
Permalink
Post by Russell Coker
They are designed for first-generation filesystems where fsck is a
standard utility that's regularly used as opposed to more modern
filesystems where there is no regularly used fsck because they are
just expected to work all the time.
and as you point out, disks are getting so large that fsck is just
impractical. fsck on boot even more so - 10 hours downtime just to
reboot?

(i particularly hate ext2/3/4's default mount-count and interval
settings for auto-fsck on boot...i don't need an fsck just because this
is the first time i've rebooted the server in 6 months)
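
for the record, both of those triggers can be switched off per filesystem
with tune2fs (the device name below is a placeholder):

    # disable the mount-count and time-interval based checks on an ext2/3/4 fs
    tune2fs -c 0 -i 0 /dev/sdXN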
Post by Russell Coker
Ext4 and LVM still have their place.
ext4, yes - e.g. on laptops and in VM images. not so sure about lvm,
there isn't anything it does that zfs and btrfs don't do better.
Post by Russell Coker
I'm a little nervous about the extra complexity of ZFS and BTRFS when
combined with a system that lacks ECC RAM. I've already had two BTRFS
problems that required backup/format/restore due to memory errors.
i'm far more nervous about undetected errors in current size hard disks
using traditional filesystems like ext4 and xfs.

RAM is cheap, and the price difference for ECC is not prohibitive.

e.g. without spending too much time shopping around for best prices, I
found Kingston 8GB DDR3-1333 non-ECC being sold pretty much everywhere
for around $75. ECC is a bit harder to find but i spotted that megabuy
is selling Kingston 8GB DDR3-1333 ECC RAM for $97.

http://www.megabuy.com.au/kingston-kvr13e98i-8gb-1333mhz-ddr3-ecc-cl9-dimm-intel-p336271.html

they also have DDR3-1600 ECC for the same price.

that's about 30% more for ECC, but still adds up to a difference of less
than $50 for a 16GB RAM system.


oddly, mwave is currently selling 4GB ECC Kingston RAM for $22.29

http://www.mwave.com.au/product/sku-aa64505-kingston_4gb_memory_ddr31333mhz_240pin_reg_ecc_kaca

that price is so low I suspect a sale to get rid of an overstocked item,
or maybe an error.


BTW, all AMD CPU motherboards (even on home/hobbyist m/bs like from
asus, asrock, gigabyte etc) support ECC and have done for several years.

Some Intel home/hobbyist motherboards do too (until recently they
disabled ECC on non-server CPUs as part of their usual artificial market
segmentation practices).

All server boards support ECC, of course.
Post by Russell Coker
Post by Craig Sanders
nowadays, i see them as first generation pool & volume management
[...]
True, but they also add complexity. Did you ever run debugfs on an
ext* filesystem? It's not that hard to do. I really don't want to
imagine running something similar on BTRFS or ZFS.
zfs should prevent (detect and correct) the kind of small errors that
are fixable manually with debugfs, and for anything more serious
restoring from backup is the only practical solution.
Post by Russell Coker
Post by Craig Sanders
and as for lvextend, lvreduce, and lvresize - it's completely insane
for there to be three tools that do basically the same job.
It's pretty much standard in Unix to have multiple ways of doing it.
yeah, from multiple alternative implementations, or by using variations
of multiple small tools together in a pipeline. not generally in the
same software package.

when it is within the one package, it's generally for legacy reasons
(i.e. backwards-compatibility, so as to not break scripts that were
written to use the old tool) and clearly documented as such.
Post by Russell Coker
So it's a man page issue.
documentation in general but yes, that's what i said.

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #448:

vi needs to be upgraded to vii
Colin Fee
2013-07-15 05:29:00 UTC
Permalink
A most enlightening thread so far. I've begun playing with ZFS within a VM
to get used to the various commands and their outcomes etc.
Tim Connors
2013-07-15 05:38:07 UTC
Permalink
Post by Colin Fee
A most enlightening thread so far. I've begun playing with ZFS within a VM
to get used to the various commands and their outcomes etc.
Craig Sanders
2013-07-16 02:28:41 UTC
Permalink
Post by Colin Fee
One thing I read yesterday was that zfs can be prone to issues from
fragmentation - is there a preventive strategy or remedial measure to take
into account?
Post by Tim Connors
Don't ever use more than 80% of your file system?
btrfs has a 'balance' command to rebalance storage of files across the
disks (e.g. after you've added or replaced disks in the pool). That has
the side-effect of de-fragging files.

of course, on a large pool, it takes ages to run.

zfs doesn't have this.
Post by Tim Connors
Yeah, I know that's not a very acceptable alternative.
it's not that bad. you're supposed to keep an eye on your systems and
add more disks to the pool (or replace current disks with larger ones)
as it starts getting full.

my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably should
start thinking about replacing the drives with 4x2TB soon...or deleting
some crap.
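
checking how full things are is trivial, e.g.:

    # overall pool capacity and usage
    zpool list

    # per-dataset usage
    zfs list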


craig
--
craig sanders <***@taz.net.au>

BOFH excuse #108:

The air conditioning water supply pipe ruptured over the machine room
Kevin
2013-07-16 02:33:28 UTC
Permalink
Post by Craig Sanders
One thing I read yesterday was that zfs can be prone to issues from
fragmentation - is there a preventive strategy or remedial measure to take
into account?
Don't ever use more than 80% of your file system?
btrfs has a 'balance' command to rebalance storage of files across the
disks (e.g. after you've added or replaced disks in the pool. That has
the side-effect of de-fragging files.
of course, on a large pool, it takes ages to run.
zfs doesn't have this.
balance in btrfs is also used to change the pool replication profile (single
to raid1 (one replica) to raid10 (one replica with striping)), and I would
imagine for moving in and out of parity-based replication (erasure codes).
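
for reference, that kind of profile change is driven with balance convert
filters - roughly like this, with the mount point as a placeholder (needs a
reasonably recent kernel and btrfs-progs):

    # convert data and metadata from single to raid1 across the fs's devices
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/pool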
Jason White
2013-07-17 08:03:13 UTC
Permalink
Post by Craig Sanders
btrfs has a 'balance' command to rebalance storage of files across the
disks (e.g. after you've added or replaced disks in the pool. That has
the side-effect of de-fragging files.
of course, on a large pool, it takes ages to run.
Btrfs also has online de-fragmentation, performed while the file system is
otherwise idle. In addition, there's a defragment command that the
administrator can invoke on files and directories.
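
The manual form is along these lines (the path is a placeholder):

    # recursively defragment a directory tree
    btrfs filesystem defragment -r /path/to/dir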
Chris Samuel
2013-07-17 12:01:09 UTC
Permalink
Post by Jason White
Btrfs also has online de-fragmentation, performed while the file system is
otherwise idle. In addition, there's a defragment command that the
administrator can invoke on files and directories.
But be aware that snapshot-aware defrag is only in kernel v3.9 and later; in
other words, earlier kernels will (if you defrag) possibly end up duplicating
blocks as part of the defrag.

See commit 38c227d87c49ad5d173cb5d4374d49acec6a495d for more.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
Russell Coker
2013-07-25 09:13:25 UTC
Permalink
Post by Craig Sanders
my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably should
start thinking about replacing the drives with 4x2TB soon...or deleting
some crap.
2*4TB would give you twice the storage you've currently got with less noise
and power use. 3*4TB in a RAID-Z1 will give you 3* the storage you've
currently got.

How do you replace a zpool? I've got a system with 4 small disks in a zpool
that I plan to replace with 2 large disks. I'd like to create a new zpool
with the 2 large disks, do an online migration of all the data, and then
remove the 4 old disks.

Also are there any issues with device renaming? For example if I have sdc,
sdd, sde, sdf used for a RAID-Z1 and I migrate the data to a RAID-1 on sdg and
sdh, when I reboot the RAID-1 will be sdc and sdd, will that cause problems?
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Tom Robinson
2013-07-25 23:00:42 UTC
Permalink
Happy SysAdmin Day

http://sysadminday.com/
--
Tom Robinson

19 Thomas Road Mobile: +61 4 3268 7026
Healesville, VIC 3777 Home: +61 3 5962 4543
Australia GPG Key: 8A4CB7A7

CONFIDENTIALITY: Copyright (C). This message with any appended or
attached material is intended for addressees only and may not be
copied or forwarded to or used by other parties without permission.
Craig Sanders
2013-07-26 01:37:38 UTC
Permalink
Post by Russell Coker
Post by Craig Sanders
my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably
should start thinking about replacing the drives with 4x2TB
soon...or deleting some crap.
2*4TB would give you twice the storage you've currently got with less
noise and power use.
and faster than raidz or raid5 too, but it would only give me a total of
4TB in a mirrored/raid-1 pair. seagate ST4000DM000 drives seem to be the
cheapest at the moment at $195...so a total of $390.

noise isn't an issue (the system fans are louder than the drives, and
they're low noise fans), and power isn't much of a concern (the drives
use 4.5W each idle, 6W in use)

i've been reading up on drive reviews recently - advance planning for
the upgrade - and while the ST4000DM000 has got good reviews, the WD
RED drives seem better. 4TB RED drives aren't available yet, and 3TB
WD30EFRX drives cost $158 each.
Post by Russell Coker
3*4TB in a RAID-Z1 will give you 3* the storage you've currently got.
my current plan is to replace my 4x1TB raidz backup pool with one or
two mirrored pairs of either 3TB or 4TB drives. i'll wait a while and
hope the 4TB WD REDs are reasonably priced. if they're cheap enough
i'll buy four. or i'll get seagates. otherwise i'll get either 2 x 4TB
drives or 4 x 3TB. most likely the former as I can always add
another pair later (which is one of my reasons for switching from raidz
to mirrored pairs - cheaper upgrades. i've always preferred raid1 or
mirrors anyway).

i'm in no hurry, i can wait to see what happens with pricing. and i've
read rumours that WD are likely to be releasing 5TB drives around
the same time as their 4TB drives too, which should make things
"interesting".


my current 1TB drives can also be recycled into two more 2x1TB pairs
to put some or all of them back into the backup pool - giving e.g. a
total of 2x4TB + 2x1TB + 2x1TB or 6TB usable space. Fortunately, I have
enough SAS/SATA ports and hot-swap drive bays to do that. I'll probably
just use two of them with the idea of replacing them with 4TB or larger
drives in a year or three.

i don't really need more space in my main pool. growth in usage is slow
and I can easily delete a lot of stuff (i'd get back two hundred GB or
more if i moved my ripped CD and DVD collection to the zpool on my myth
box which is only about 60% full - and i could easily bring that back to
30 or 40% by deleting old recordings i've already watched)
Post by Russell Coker
How do you replace a zpool? I've got a system with 4 small disks in a
zpool that I plan to replace with 2 large disks. I'd like to create
a new zpool with the 2 large disks, do an online migration of all the
data, and then remove the 4 old disks.
there are two ways.

1. replace each drive individually with 'zpool replace olddrive
newdrive' commands. this takes quite a while (depending on the amount of
data that needs to be resilvered). when every drive in a vdev (a single
drive, a mirrored pair, a raidz) in the pool is replaced with a larger
one, the extra space is immediately available. for a 4-drive raidz it
would take a very long time to replace each one individually.

IMO this is only worthwhile for replacing a failed or failing drive.

2. create a new pool, 'zfs send' the old pool to the new one and destroy
the old pool. this is much faster, but you need enough sata/sas ports to
have both pools active at the same time. it also has the advantage of
rewriting the data according to the attributes of the new pool (e.g. if
you had gzip compression on the old pool and lz4[1] compression on the
new your data will be recompressed with lz4 - highly recommended BTW)

if you're converting from a 4-drive raidz to a 2-drive mirrored pair,
then this is the only way to do it.



[1] http://wiki.illumos.org/display/illumos/LZ4+Compression
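
sketches of both approaches, with placeholder pool and device names (and
using 'zfs send -R' so snapshots and properties come across too):

    # method 1: swap the members out one at a time, waiting for each resilver
    zpool replace export \
        /dev/disk/by-id/ata-OLD_DRIVE /dev/disk/by-id/ata-NEW_DRIVE

    # method 2: replicate everything to a new pool, then retire the old one
    zfs snapshot -r oldpool@migrate
    zfs send -R oldpool@migrate | zfs receive -F newpool
    zpool destroy oldpool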
Post by Russell Coker
Also are there any issues with device renaming? For example if I have
sdc, sdd, sde, sdf used for a RAID-Z1 and I migrate the data to a
RAID-1 on sdg and sdh, when I reboot the RAID-1 will be sdc and sdd,
will that cause problems?
http://zfsonlinux.org/faq.html#WhatDevNamesShouldIUseWhenCreatingMyPool

using sda/sdb/etc has the usual problems of drives being renamed if
the kernel detects them in a different order (e.g. new kernel version,
different load order for driver modules, variations in drive spin up
time, etc) and is not recommended except for testing/experimental pools.

it's best to use the /dev/disk/by-id device names. they're based on the
model and serial number of the drive so are guaranteed unique and will
never change.

e.g. my backup pool currently looks like this (i haven't bothered
running zpool upgrade on it yet to take advantage of lz4 and other
improvements)

  pool: backup
 state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: scrub repaired 160K in 4h21m with 0 errors on Sat Jul 20 06:03:58 2013
config:

        NAME                           STATE     READ WRITE CKSUM
        backup                         ONLINE       0     0     0
          raidz1-0                     ONLINE       0     0     0
            ata-ST31000528AS_6VP3FWAG  ONLINE       0     0     0
            ata-ST31000528AS_9VP4RPXK  ONLINE       0     0     0
            ata-ST31000528AS_9VP509T5  ONLINE       0     0     0
            ata-ST31000528AS_9VP4P4LN  ONLINE       0     0     0

errors: No known data errors
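
and to see which by-id names a box has in the first place, a plain
directory listing is enough - the links ending in -partN are partitions,
the rest are whole disks:

    ls -l /dev/disk/by-id/ | grep -v part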

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #193:

Did you pay the new Support Fee?
Kevin
2013-07-26 01:53:10 UTC
Permalink
Post by Craig Sanders
Post by Russell Coker
Post by Craig Sanders
my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably
should start thinking about replacing the drives with 4x2TB
soon...or deleting some crap.
2*4TB would give you twice the storage you've currently got with less
noise and power use.
and faster than raidz or raid5 too, but it would only give me a total of
4TB in a mirrored/raid-1 pair. seagate ST4000DM000 drives seem to be the
cheapest the moment at $195...so a total of $390.
noise isn't an issue (the system fans are louder than the drives, and
they're low noise fans), and power isn't much of a concern (the drives
use 4.5W each idle, 6W in use)
i've been reading up on drive reviews recently - advance planning for
the upgrade - and while the ST4000DM000 has got good reviews, the WD
RED drives seem better. 4TB RED drives aren't available yet, and 3TB
WD30EFRX drives cost $158 each.
I would not give the red drives much consideration as they are
mostly a marketing change over the blacks. Based on reports from companies
like Backblaze I would consider Hitachi Deskstars if the budget permitted,
else go with the Seagate DM series.
Anything advertised for enterprise purposes is really pointless.

WD blacks or reds are also appropriate but avoid the greens (stupid idle
timeout.)
Post by Craig Sanders
Post by Russell Coker
How do you replace a zpool? I've got a system with 4 small disks in a
zpool that I plan to replace with 2 large disks. I'd like to create
a new zpool with the 2 large disks, do an online migration of all the
data, and then remove the 4 old disks.
*snipping text*
if you're converting from a 4-drive raidz to a 2-drive mirrored pair,
then this is the only way to do it.
the static vdev sizing is truly one of the few annoying things about zfs
Tom Robinson
2013-07-26 02:20:57 UTC
Permalink
Post by Kevin
I would not give the red drives much consideration as they are
mostly a marketing change over the blacks. Based on reports from companies
like backblaze I would consider hitachi deskstars if the budget permitted
else go with seagate DM series
Anything advertised for enterprise purposes is really pointless.
WD blacks or reds are also appropriate but avoid the greens (stupid idle timeout.)
Have I done something wrong? I've been running the WD Greens for nearly two years without any issues
(raidz mirror). There's a flash tool (wdidle3) that allows you to change the LCC timeout to a
maximum of five minutes or disable it completely. Here's at least one reference

http://alfredoblogspage.blogspot.com.au/2013/01/western-digital-wd20earx-load-cycle.html
Post by Kevin
Post by Craig Sanders
Post by Russell Coker
How do you replace a zpool? I've got a system with 4 small disks in a
zpool that I plan to replace with 2 large disks. I'd like to create
a new zpool with the 2 large disks, do an online migration of all the
data, and then remove the 4 old disks.
*snipping text*
if you're converting from a 4-drive raidz to a 2-drive mirrored pair,
then this is the only way to do it.
the static vdev sizing is truly one of the few annoying things about zfs
--
Tom Robinson

19 Thomas Road Mobile: +61 4 3268 7026
Healesville, VIC 3777 Home: +61 3 5962 4543
Australia GPG Key: 8A4CB7A7

CONFIDENTIALITY: Copyright (C). This message with any appended or
attached material is intended for addressees only and may not be
copied or forwarded to or used by other parties without permission.
Craig Sanders
2013-07-26 06:41:17 UTC
Permalink
Post by Tom Robinson
Have I done something wrong? I've been running the WD Greens for
nearly two years without any issues (raidz mirror).
if they're working without error for you, i wouldn't worry about it.
Post by Tom Robinson
There's a flash tool (wdidle3) that allows you to change the LCC
timeout to a maximum of five minutes or disable it completely.
i didn't reflash my WD Greens, I reflashed my SAS cards to "IT" mode
instead...haven't had a problem with them since then.
Post by Tom Robinson
Here's at least one reference
http://alfredoblogspage.blogspot.com.au/2013/01/western-digital-wd20earx-load-cycle.html
too bad that's on blogspot.com.au

i really hate blogspot pages, crappy (and really bloody irritating)
javascript and they don't work at all with NoScript. I don't even
bother clicking on a link to an interesting-sounding article if i notice
the URL is blogspot.

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #173:

Recursive traversal of loopback mount points
Craig Sanders
2013-07-26 06:36:06 UTC
Permalink
Post by Kevin
I would not give the red drives much consideration as they are
mostly a marketing change over the blacks.
given that 3TB WD Black drives are about $70 more than 3TB WD RED, that
would make the RED drives a bargain.
Post by Kevin
Based on reports from companies like backblaze I would consider
hitachi deskstars if the budget permitted else go with seagate DM
series
hitachi 4TB Deskstar 0S03358 is about $270 at the moment, much more
expensive than the seagate ST4000DM000 at $195.
Post by Kevin
Anything advertised for enterprise purposes is really pointless.
true.
Post by Kevin
WD blacks or reds are also appropriate but avoid the greens (stupid idle
timeout.)
i use WD Greens in my myth box. my main /export pool is made up of a
mixture of WD Green 1TB and seagate ST31000528AS.

i had to reflash my LSI-based SAS card to "IT" mode to avoid idle
timeout problems, but since I did that they've been working flawlessly.
Post by Kevin
the static vdev sizing is truly one of the few annoying things about zfs
yep. it would be nice to do stuff like just add a fifth drive to a 4-drive
raidz and have it rebalance all the data. but you can't do that (and zfs
doesn't do rebalancing like btrfs either).

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #111:

The salesman drove over the CPU board.
Russell Coker
2013-08-10 10:11:32 UTC
Permalink
Post by Craig Sanders
noise isn't an issue (the system fans are louder than the drives, and
they're low noise fans), and power isn't much of a concern (the drives
use 4.5W each idle, 6W in use)
With noise it's not simply a case of only hearing the loudest source. The
noise from different sources adds up, and you can get harmonics too.

As an aside, I've got a 6yo Dell PowerEdge T105 whose CPU fan is becoming
noisy. Any suggestions on how to find a replacement?
Post by Craig Sanders
Post by Russell Coker
How do you replace a zpool? I've got a system with 4 small disks in a
zpool that I plan to replace with 2 large disks. I'd like to create
a new zpool with the 2 large disks, do an online migration of all the
data, and then remove the 4 old disks.
there are two ways.
1. replace each drive individually with 'zpool replace olddrive
newdrive' commands. this takes quite a while (depending on the amount of
data that needs to be resilvered). when every drive in a vdev (a single
drive, a mirrored pair, a raidz) in the pool is replaced with a larger
one, the extra space is immediately available. for a 4-drive raidz it
would take a very long time to replace each one individually.
IMO this is only worthwhile for replacing a failed or failing drive.
2. create a new pool, 'zfs send' the old pool to the new one and destroy
the old pool. this is much faster, but you need enough sata/sas ports to
have both pools active at the same time. it also has the advantage of
rewriting the data according to the attributes of the new pool (e.g. if
you had gzip compression on the old pool and lz4[1] compression on the
new your data will be recompressed with lz4 - highly recommended BTW)
if you're converting from a 4-drive raidz to a 2-drive mirrored pair,
then this is the only way to do it.
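A minimal sketch of option 2, assuming the old pool is called 'tank' and the
new one 'newtank' (names are only examples - check the exact send/receive
flags against your zfs version):

# create the new pool on the two large disks (by-id names recommended)
zpool create newtank mirror ata-NEWDISK1-SERIAL ata-NEWDISK2-SERIAL
# snapshot everything on the old pool, then send it with properties intact
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F -d newtank
# once you're satisfied the data has all arrived, retire the old pool
zpool destroy tank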
According to the documentation you can add a new vdev to an existing pool.

So if you have a pool named "tank" you can do the following to add a new
RAID-1:
zpool add tank mirror /dev/sde2 /dev/sdf2

Then if you could remove a vdev you could easily migrate a pool. However it
doesn't appear possible to remove a vdev; you can remove a device that
contains mirrored data, but nothing else.

Using a pair of mirror vdevs would allow you to easily upgrade the filesystem
while only replacing two disks. The down-side is that two disk failures could
still lose most of your data, whereas a RAID-Z2 over 4 disks would give the
same capacity as 2*RAID-1 but cover you in the case of any two failures.
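For the record, the two layouts compare like this (device names are
placeholders):

# two mirrored pairs: 2 disks of usable capacity, easy to grow two disks
# at a time, but loses data if both disks in the same mirror fail
zpool create tank mirror ata-DISK1 ata-DISK2 mirror ata-DISK3 ata-DISK4

# raidz2 over the same four disks: same usable capacity, survives any two
# disk failures, but harder to grow later
zpool create tank raidz2 ata-DISK1 ata-DISK2 ata-DISK3 ata-DISK4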
Post by Craig Sanders
Post by Russell Coker
Also are there any issues with device renaming? For example if I have
sdc, sdd, sde, sdf used for a RAID-Z1 and I migrate the data to a
RAID-1 on sdg and sdh, when I reboot the RAID-1 will be sdc and sdd,
will that cause problems?
http://zfsonlinux.org/faq.html#WhatDevNamesShouldIUseWhenCreatingMyPool
using sda/sdb/etc has the usual problems of drives being renamed if
the kernel detects them in a different order (e.g. new kernel version,
different load order for driver modules, variations in drive spin up
time, etc) and is not recommended except for testing/experimental pools.
it's best to use the /dev/disk/by-id device names. they're based on the
model and serial number of the drive so are guaranteed unique and will
never change.
zpool replace tank sdd /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF-part2

That doesn't seem to work. I used the above command to replace a disk and
then after that process was complete I rebooted the system and saw the
following:

# zpool status
pool: tank
state: ONLINE
scan: scrub repaired 0 in 11h38m with 0 errors on Thu Aug 1 11:38:57 2013
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd2    ONLINE       0     0     0

It appears that ZFS is scanning /dev for the first device node with a suitable
UUID.
Post by Craig Sanders
e.g. my backup pool currently looks like this (i haven't bothered
running zpool upgrade on it yet to take advantage of lz4 and other
improvements)
pool: backup
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support
feature flags.
scan: scrub repaired 160K in 4h21m with 0 errors on Sat Jul 20 06:03:58
    NAME                           STATE     READ WRITE CKSUM
    backup                         ONLINE       0     0     0
      raidz1-0                     ONLINE       0     0     0
        ata-ST31000528AS_6VP3FWAG  ONLINE       0     0     0
        ata-ST31000528AS_9VP4RPXK  ONLINE       0     0     0
        ata-ST31000528AS_9VP509T5  ONLINE       0     0     0
        ata-ST31000528AS_9VP4P4LN  ONLINE       0     0     0
errors: No known data errors
Strangely that's not the way it works for me.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Craig Sanders
2013-08-10 15:14:51 UTC
Permalink
Post by Russell Coker
As an aside, I've got a 6yo Dell PowerEdge T105 that's CPU fan is becoming
noisy. Any suggestions on how to find a replacement?
sorry, no idea.
Post by Russell Coker
Post by Craig Sanders
How do you replace a zpool? [...]
zpool that I plan to replace with 2 large disks. I'd like to create
a new zpool with the 2 large disks, do an online migration of all the
data, and then remove the 4 old disks.
there are two ways.
1. replace each drive individually [...]
2. create a new pool, 'zfs send' the old pool to the new one [...]
According to the documentation you can add a new vdev to an existing pool.
yep, but as you note, you can't remove a vdev from a pool...so it's not
much use for replacing a zpool.
Post by Russell Coker
Using a pair of mirror vdevs would allow you to easily upgrade the filesystem
while only replacing two disks. But the down-side to that is that if two
disks fail that could still lose most of your data while a RAID-Z2 over 4
disks would give the same capacity as 2*RAID-1 but cover you in the case of
two failures.
yep, but you've got the performance of raidz2 rather than mirrored pairs
(which, overall, is not as bad as raid5/raid6 but still isn't great).
that may be exactly what you want, but you need to know the tradeoffs
for what you're choosing.
Post by Russell Coker
zpool replace tank sdd /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF-part2
That doesn't seem to work. I used the above command to replace a disk and
then after that process was complete I rebooted the system and saw the
firstly, you don't need to type the full /dev/disk/by-id/ path. just
ata-ST4000DM000-1F2168_Z300MHWF-part2 would do. typing the full path
isn't wrong - works just as well, it just takes longer to type and
uglifies the output of zpool status and iostat.

secondly, why add a partition and not a whole disk?

third, is sdd2 the same drive/partition as ata-ST4000DM000-1F2168_Z300MHWF-part2?
if so, then it added the correct drive.

to get the naming "correct"/"consistent", did you try 'zpool export
tank' and then 'zpool import -d /dev/disk/by-id' ?
Post by Russell Coker
It appears that ZFS is scanning /dev for the first device node with a
suitable UUID.
sort of. it remembers (in /etc/zfs/zpool.cache) what drives were in each
pool and only scans if you tell it to with zpool import.

if zpool.cache lists sda, sdb, sdc etc then that's how they'll appear
in 'zpool status'. and if zpool.cache lists /dev/disk/by-id names then
that's what it'll show.

try hexdumping zpool.cache to see what i mean.
Post by Russell Coker
Strangely that's not the way it works for me.
i created mine with the ata-ST31000528AS_* names, so that's what's in my
zpool.cache file.

to fix on your system, export and import as mentioned above.
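for example, with a pool called 'tank' (these are the stock commands, run
with nothing using the pool):

zpool export tank
# re-import, telling zfs to look devices up via the persistent by-id names
zpool import -d /dev/disk/by-id tank
# zpool.cache (and 'zpool status') should now show the ata-* names
zpool status tank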

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #108:

The air conditioning water supply pipe ruptured over the machine room
Tim Connors
2013-07-15 05:33:09 UTC
Permalink
Post by Craig Sanders
Post by Brett Pemberton
Post by Craig Sanders
to extend or resize the volume, you've got to tell it which disk(s) to
allocate the free space from.
Nope. You can, but you don't have to.
oh well, consider that an example of lvm's difficulty to learn and
remember. I've been using LVM for a lot longer than ZFS but I still
find it difficult and complicated to work with, and the tools are clumsy
and awkward.
Really? You haven't come across HP-UX's LVM or veritas cluster storage's
lvm then :)

I find zfs obtuse, awkward, inflexible, prone to failure and unreliable.

One day when I got sick of the linux kernel's braindead virtual memory
management, I tried to install debian kfreebsd, but gave up before I
finished installation because not having lvm seemed so primitive. I was
probably just trying to use the wrong tool for the job.
Post by Craig Sanders
nowadays, i see them as first generation pool & volume management tools,
functional but inelegant. later generation tools like btrfs and ZFS
not only add more features (like compression, de-duping, etc etc) by
blurring the boundaries between block devices and file systems but more
importantly they add simplicity and elegance that is missing from the
first generation.
Does anyone use zfs's dedup in practice? Completely useless. Disk is a
heck of a lot cheaper than RAM. Easier to add more disk too compared to
maxing out a server and having to replace *everything* to add more RAM
(disappointingly, I didn't pay $250 to put 32GB in my new laptop. I hoped
to replace the 16GB with 32GB when it becomes more economical, but when I
opened it up, I found there were only 2 sockets. According to the manual
that wasn't supplied with the laptop, the other two sockets are under the
heatsink wedged underneath the keyboard, and not user-replaceable.
According to ebay, within the past couple of weeks, they did start selling
16GB sodimms, but no idea whether my haswell motherboard will be able to
take them when it comes time to upgrade (which is probably very soon,
judging by how iceape's memory usage just bloated even further beyond
reasonableness)).
Post by Craig Sanders
and as for lvextend, lvreduce, and lvresize - it's completely insane
for there to be three tools that do basically the same job. sure, i get
that there are historical and compatibility reasons to keep lvextend
and lvreduce around, but they should clearly be marked in the man page
as "deprecated, use lvresize instead" (or even just "same as lvresize")
Au contraire. If you use lvresize habitually, one day you're going to
accidentally shrink your LV instead of expanding it, and the filesystem below
it will then start accessing beyond the end of the device, with predictably
catastrophic results. Use lvextend prior to resize2fs, and resize2fs
shrink prior to lvreduce, and you'll be right.
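In concrete terms, with made-up sizes and an ext4 filesystem (which has to be
unmounted and fsck'd before it can be shrunk):

# growing: extend the LV first, then the filesystem
lvextend -L +10G /dev/vg01/home
resize2fs /dev/vg01/home

# shrinking: shrink the filesystem first, then the LV - never the other way
# around. the fs is deliberately left smaller than the LV as a safety margin.
umount /home
e2fsck -f /dev/vg01/home
resize2fs /dev/vg01/home 40G
lvreduce -L 45G /dev/vg01/home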
--
Tim Connors
Craig Sanders
2013-07-16 02:47:19 UTC
Permalink
Post by Tim Connors
I've been using LVM for a lot longer than ZFS but I still find it
difficult and complicated to work with, and the tools are clumsy and
awkward.
Really? You haven't come across HP-UX's LVM or veritas cluster
storage's lvm then :)
they're not that much worse than lvm.

OK, so call them 1st generation, lvm gen 1.5 and lvm2 gen 1.75
Post by Tim Connors
I find zfs obtuse, awkward, inflexible, prone to failure and unreliable.
you must be using a different zfs than the one i'm using because that
description is the exact opposite in every respect of my experience with
zfs.
Post by Tim Connors
One day when I got sick of the linux kernel's braindead virtual memory
management, I tried to install debian kfreebsd, but gave up before I
finished installation because not having lvm seemed so primitive. I
was probably just trying to use the wrong tool for the job.
probably. the freebsd kernel has zfs built in, so zfs would be right
tool there.
Post by Tim Connors
Does anyone use zfs's dedup in practice? Completely useless.
yes, people do. it's very heavily used on virtualisation servers, where
there are numerous almost-identical copies of the same VM with minor
variations.

it's also useful on backup servers where you end up with dozens or
hundreds of copies of the same files (esp. if you're backing up entire
systems, including OS)
Post by Tim Connors
Disk is a heck of a lot cheaper than RAM. Easier to add more disk too
compared to
yep, that fits my usage pattern too...i don't have that much duplicate
data. i'm probably wasting less than 200GB or so on backups of linux
systems in my backup pool including snapshots, so it's not worth it to
me to enable de-duping.

other people have different requirements.
Post by Tim Connors
maxing out a server and having to replace *everything* to add more RAM
(disappointingly, I didn't pay $250 to put 32GB in my new laptop. I hoped
to replace the 16GB with 32GB when it becomes more economical, but when I
opened it up, I found there were only 2 sockets. According to the manual
that wasn't supplied with the laptop, the other two sockets are under the
heatsink wedged underneath the keyboard, and not user-replaceable.
According to ebay, within the past couple of weeks, they did start selling
16GB sodimms, but no idea whether my haswell motherboard will be able to
take them when it comes time to upgrade (which is probably very soon,
judging by how iceape's memory usage just bloated even further beyond
reasonableness)).
i've found 16GB more than adequate to run 2 4TB zpools, a normal desktop
with-the-lot (including firefox with dozens of windows and hundreds of
tabs open at the same time) and numerous daemons (squid, apache, named,
samba, and many others).

of course, a desktop system is a lot easier to upgrade than a laptop if
and when 32GB becomes both affordable and essential.


FYI: NoScript and AdBlock Plus stop javascript from using too much RAM as
well as spying on you.
Post by Tim Connors
Au contraire. If you use lvresize habitually, one day you're going to
accidentally shrink your LV instead of expand it, and the filesystem below
it will then start accessing beyond end of device, with predictably
catastrophic results. Use lvextend prior to resize2fs, and resize2fs
shrink prior to lvreduce, and you'll be right.
the risk of typing '-' rather than '+' does not scare me all that much.

i tend to check and double-check potentially dangerous command lines
before i hit enter, anyway.

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #286:

Telecommunications is downgrading.
Matthew Cengia
2013-07-16 07:49:49 UTC
Permalink
[...]
Post by Craig Sanders
Post by Tim Connors
Does anyone use zfs's dedup in practice? Completely useless.
yes, people do. it's very heavily used on virtualisation servers, where
there are numerous almost-identical copies of the same VM with minor
variations.
it's also useful on backup servers where you end up with dozens or
hundreds of copies of the same files (esp. if you're backing up entire
systems, including OS)
Are you actually talking about retroactive deduplication here, or just COW? IMHO taking a little extra care when copying VM images or taking backups and ensuring use of snapshots and/or --reflink are usually good enough as opposed to going back and hunting for duplicate data.
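For example, on zfs the cheap copy-on-write copy of a VM image is a snapshot
plus clone (dataset names are made up):

# a writable, copy-on-write copy of a VM template, no dedup required
zfs snapshot tank/vm/template@gold
zfs clone tank/vm/template@gold tank/vm/webserver01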

[...]
Post by Craig Sanders
Post by Tim Connors
Au contraire. If you use lvresize habitually, one day you're going
to
Post by Tim Connors
accidentally shrink your LV instead of expand it, and the filesystem
below
Post by Tim Connors
it will then start accessing beyond end of device, with predictably
catastrophic results. Use lvextend prior to resize2fs, and resize2fs
shrink prior to lvreduce, and you'll be right.
the risk of typing '-' rather than '+' does not scare me all that much.
That of course assumes you're using a relative size. If you're using an absolute size this is far less obvious. That's the other thing: lvextend -L 32G on a 64G LV will do nothing, as would lvreduce -L 64G on a 32G LV. This is useful when ensuring an LV meets minimum size requirements and saves significant (potentially buggy) testing code.
Post by Craig Sanders
i tend to check and double-check potentially dangerous command lines
before i hit enter, anyway.
craig
--
Sent from my phone. Please excuse my brevity.
Regards,
Matthew Cengia
Craig Sanders
2013-07-16 12:16:52 UTC
Permalink
Post by Matthew Cengia
Post by Craig Sanders
it's also useful on backup servers where you end up with dozens or
hundreds of copies of the same files (esp. if you're backing up entire
systems, including OS)
Are you actually talking about retroactive deduplication here, or just
COW? IMHO taking a little extra care when copying VM images or taking
backups and ensuring use of snapshots and/or --reflink are usually
good enough as opposed to going back and hunting for duplicate data.
neither. zfs has a de-dupe attribute which can be set. it then keeps a
table of block hashes so that blocks about to be written which have a
hash already in the table get substituted with a pointer to the original
block.

Like the compression attribute, it only affects writes performed after
the attribute has been set so it isn't retroactive.

NOTE: de-duplication takes enormous amounts of RAM, and slows down write
performance significantly. it's really not worth doing unless you are
certain that you have a very large proportion of duplicate data....and
once it has been turned on for a pool, it can't be turned off without
destroying, re-creating, and restoring the pool (technically, you can
turn off the dedup attribute but you'll still be paying the overhead
price for it but without any benefit).
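for anyone who wants to experiment anyway, turning it on and measuring the
benefit looks roughly like this (the dataset name is an example; zdb -S only
simulates a dedup table, so it's a safe way to check the likely ratio before
committing):

# enable dedup on one dataset only - affects new writes only, as above
zfs set dedup=on tank/vmimages
# overall dedup ratio achieved so far on the pool
zpool get dedupratio tank
# estimate how well dedup *would* work on existing data, without enabling it
zdb -S tank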

Here's the original post announcing de-duplication in zfs, from 2009
(still Sun, but old Sun webpages are now on oracle.com)

https://blogs.oracle.com/bonwick/entry/zfs_dedup

for more details, see http://zfsonlinux.org/docs.html which contains
links to:


. Illumos ZFS Documentation
. ZFS on Linux User Guide
. Solaris ZFS Administration Guide

there are also numerous blog posts (of varying quality and cluefulness)
describing how to do stuff with zfs or how something in zfs works, just
a google search away.
Post by Matthew Cengia
That of course assumes you're using a relative size. If you're using
lvextend -L 32G on a 64G LV will do nothing, as would lvreduce -L 64G
on a 32G LV. This is useful when ensuring an LV meets minimum size
requirements and saves significant (potentially buggy) testing code.
or i could just use zfs where there is no risk of making a mistake like
that. IIRC, zfs will refuse to set the quota below current usage, but
even if you could accidentally set the quota on a subvol to less than
its current usage, zfs isn't going to start deleting or truncating files
to make them fit.

it's a different conceptual model to LVM, anyway. With lvm you allocate
chunks of disk to be dedicated to specific LVs. if you allocate too much
to one LV, you waste space. if you allocate too little, you fill up the
fs too quickly. detailed forward planning is almost essential if you
don't want to be micro-managing storage (as in "i'm running out of space
on /home so take 50GB from /usr and add it to /home") all the time.

with zfs, it's a global pool of space usable by all filesystems created
on it - the default quota is no quota, so any and all sub-volumes can
use up all available space. you can set quotas so that subvolumes (and
children of that subvolume) are limited to using no more than X amount
of the total pool (but there's still no guaranteed/reserved space - it's
all shared space from the pool).

with zfs, if /home is running out of quota then I can just do something
like 'zfs set quota=xxxx export/home' as long as there's still free
space available in the pool.

you can also set a 'reservation' where a certain amount of storage in
the pool is reserved for just one subvolume (and its descendants).
reserved storage still comes from the pool, but no other subvolume or
zvol can use any of it.

both reservations and quotas are soft - you can change them at any time,
very easily and without any hassle. reservations are more similar to lvm
space allocations than are quotas, but even reservations are flexible -
you can increase, decrease, or unset them at whim.
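for example (names and sizes are made up):

# cap /home and everything under it at 500G of the pool
zfs set quota=500G export/home
# guarantee backups at least 200G of the pool, whatever else happens
zfs set reservation=200G export/backup
# either can be cleared again at any time
zfs set quota=none export/home
zfs get quota,reservation export/home export/backup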

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #67:

descramble code needed from software company
Tim Connors
2013-07-17 02:17:18 UTC
Permalink
Post by Craig Sanders
Post by Tim Connors
I find zfs obtuse, awkward, inflexible, prone to failure and unreliable.
you must be using a different zfs than the one i'm using because that
description is the exact opposite in every respect of my experience with
zfs.
I've got one for you; just came in on the mailing list (someone got stale
mounts when their kernel nfs server restarted):

ZFS and the NFS kernel server are not that tightly integrated.

When you do a 'zfs set sharenfs="foo,bar" pool/vol' or a 'zfs share
pool/vol', the zfs tools just give a call to the NFS kernel server
saying 'Hey I want you to share this over NFS'.

If the NFS kernel server is restarted it unshares everything and only
reads back whatever is in /etc/exports. This is actually expected as
NFS doesn't know anything about ZFS. Doing a
'zfs share -a' exports all your NFS/SMB shares again.

When you don't use your system native tools, or when someone tries to
solve something at the wrong layer (zfs trying to mess around with NFS?
That's almost as bad as the GUI at my previous workplace that tried to
keep track of the state of something and then changing that state through
a different layer to usual and then remembering the old state instead of
directly querying it), you've got to expect problems.

This particular problem I get around by just ignoring ZFS's NFS settings.
I have no idea what value they're meant to add.
Post by Craig Sanders
Post by Tim Connors
One day when I got sick of the linux kernel's braindead virtual memory
management, I tried to install debian kfreebsd, but gave up before I
finished installation because not having lvm seemed so primitive. I
was probably just trying to use the wrong tool for the job.
probably. the freebsd kernel has zfs built in, so zfs would be right
tool there.
Not when I tried it, mind you.
Post by Craig Sanders
Post by Tim Connors
Does anyone use zfs's dedup in practice? Completely useless.
yes, people do. it's very heavily used on virtualisation servers, where
there are numerous almost-identical copies of the same VM with minor
variations.
Even then, for it to be worthwhile, when you have 800TB of VMs deployed,
you can't easily dedup that (although from memory, a TB of RAM only costs
about $100K).

For any VMs I've ever seen, the identical shared data isn't all that much
(our templates are 40GB in size) compared to the half TB deployed on
average per VM. Hardly seems worth all the pain.
Post by Craig Sanders
it's also useful on backup servers where you end up with dozens or
hundreds of copies of the same files (esp. if you're backing up entire
systems, including OS)
On my home backup server, the backup software dedups at the file level
(but shared between VMs - achieved by hardlinking all files detected to be
the same, comparing actual content rather than hash collisions). It does
a very good job according to its own stats. Block level dedup is a bit
overkill except if you're backing up raw VM snapshot images.
Post by Craig Sanders
yep, that fits my usage pattern too...i don't have that much duplicate
data. i'm probably wasting less than 200GB or so on backups of linux
systems in my backup pool including snapshots, so it's not worth it to
me to enable de-duping.
other people have different requirements.
I acknowledge that there are some uninteresting systems out there that are
massively duplicated SOEs with bugger-all storage. Might fit that
pattern. And yet I believe our VDI appliances that they're trying to roll
out at work *still* won't be backed by ZFS with dedup.
Post by Craig Sanders
i've found 16GB more than adequate to run 2 4TB zpools, a normal desktop
with-the-lot (including firefox with dozens of windows and hundreds of
tabs open at the same time) and numerous daemons (squid, apache, named,
samba, and many others).
of course, a desktop system is a lot easier to upgrade than a laptop if
and when 32GB becomes both affordable and essential.
Unfortunately, little Atom NASes seem to max out at 4GB.
Post by Craig Sanders
Post by Tim Connors
Au contraire. If you use lvresize habitually, one day you're going to
accidentally shrink your LV instead of expand it, and the filesystem below
it will then start accessing beyond end of device, with predictably
catastrophic results. Use lvextend prior to resize2fs, and resize2fs
shrink prior to lvreduce, and you'll be right.
the risk of typing '-' rather than '+' does not scare me all that much.
Like Matthew said, the issue is when you provide absolute size, and might
get the units wrong. Woops, just shrank it to 1000th of its original
size!
Post by Craig Sanders
i tend to check and double-check potentially dangerous command lines
before i hit enter, anyway.
No <up>-<up>-<enter> sudo reboots? ;P
--
Tim Connors
Craig Sanders
2013-07-17 03:53:53 UTC
Permalink
Post by Tim Connors
Post by Craig Sanders
Post by Tim Connors
I find zfs obtuse, awkward, inflexible, prone to failure and unreliable.
you must be using a different zfs than the one i'm using because
that description is the exact opposite in every respect of my
experience with zfs.
I've got one for you; just came in on the mailing list (someone got
ZFS and the NFS kernel server are not that tightly integrated.
yep, that's a known issue - the integration with NFS and CIFS was
designed for solaris, and it's a low-priority issue to get it as tightly
integrated for linux.

it also doesn't qualify as either "prone to failure" or "unreliable"


personally, i prefer defining NFS & CIFS exports in /etc/exports
and smb.conf anyway. if i wanted to make use of the zfs attributes,
it wouldn't be at all difficult to write a script to generate the
appropriate config fragments for samba and exports.

i'd start with a loop like this:

for fs in $(zfs get -H -o name -t filesystem sharesmb,sharenfs | sort -u) ; do
[...]
done


i've already got code to generate smb.conf stanzas for websites in my
vhosting system (so that web sites can be shared to appropriate users on
the LAN or via VPN) - it's trivial. and generating /etc/exports is even
easier.
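fleshed out a bit, an untested sketch for the exports side might look like
this (it assumes the sharenfs value holds a linux export option string rather
than just 'on', and that you'd then pull the generated file into /etc/exports
however you like):

# build an exports fragment from locally-set sharenfs attributes
zfs get -H -o name,value -t filesystem -s local sharenfs |
while read fs opts ; do
    [ "$opts" = "off" ] && continue
    mnt=$(zfs get -H -o value mountpoint "$fs")
    echo "$mnt *(${opts})"
done > /etc/exports.zfs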

BTW, speaking of samba, zfs snapshots also work nicely with shadow copy
if you're backing up windows PCs.

http://forums.freebsd.org/showthread.php?t=32282
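e.g. a share stanza along these lines - the path and the snapshot name format
are assumptions, so check the shadow_copy2 manpage for your samba version and
match shadow:format to however your snapshots are actually named:

[projects]
    path = /export/projects
    vfs objects = shadow_copy2
    # zfs exposes snapshots under .zfs/snapshot relative to the dataset
    shadow:snapdir = .zfs/snapshot
    shadow:sort = desc
    # e.g. for zfs-auto-snapshot style names
    shadow:format = zfs-auto-snap_daily-%Y-%m-%d-%H%M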
Post by Tim Connors
Even then, for it to be worthwhile, when you have 800TB of VMs deployed,
you can't easily dedup that (although from memory, a TB of RAM only costs
about $100K).
no, de-duping would not be practical at that scale.
Post by Tim Connors
For any VMs I've ever seen, the identical shared data isn't all that
much (our templates are 40GB in size) compared to the half TB deployed
on average per VM. Hardly seems worth all the pain.
your usage obviously can't make use of de-duping - but for lots
of virtualisation scenarios (e.g. web-hosting with lots of little
web-servers) it can make sense.
Post by Tim Connors
Post by Craig Sanders
it's also useful on backup servers where you end up with dozens or
hundreds of copies of the same files (esp. if you're backing up
entire systems, including OS)
On my home backup server, the backup software dedups at the file level
(but shared between VMs - achieved by hardlinking all files detected
to be the same, comparing actual content rather than hash collisions).
It does a very good job according to its own stats.
yeah, well, i hate backuppc - every time i've tried to use it, it's been a
disaster. rsync/zfs send and snapshots work far better for me.

somebody, might have been you, mentioned in this thread that rsync is a
worst-case load for zfs....not in my experience. but the link farms in
backuppc absolutely kill performance on every filesystem i've tried it
on, including zfs.
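for reference, the rsync-plus-snapshots approach is basically just this
(paths and dataset names are made up, assuming backup/client is mounted at
/backup/client):

# pull the client's root fs into a dataset (-x stays on one filesystem,
# so /proc, /sys etc are skipped), then freeze it with a dated snapshot
rsync -aHAXx --delete root@client:/ /backup/client/
zfs snapshot backup/client@$(date +%Y-%m-%d)
# old snapshots get pruned later with 'zfs destroy backup/client@<date>'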
Post by Tim Connors
I acknowledge that there are some uninteresting systems out there that
are massively duplicated SOEs with bugger-all storage. Might fit that
pattern. And yet I believe our VDI appliances that they're trying to
roll out at work *still* won't be backed by ZFS with dedup.
i agree - there are only very limited circumstances where de-duping is
worthwhile. in most cases, it's just too expensive - the cost of ram vs
disk or ssd is prohibitive. but there *are* some scenarios where it is
the right solution, or at least a good solution.
Post by Tim Connors
Post by Craig Sanders
i've found 16GB more than adequate to run 2 4TB zpools, [...]
Unfortunately, little Atom NASes seem to max out at 4GB.
"Don't do that then"

this kind of idiotic arbitrary limitation is the main reason i prefer to
build my own than to buy an appliance. not only is it usually cheaper
(sometimes far cheaper - half or a third of the total cost because
appliance NASes are absurdly over-priced for what you get), but you get
much better hardware with much better specs. and upgradability and
repairitude.

(i also really like that I'm the one making the design trade-off
decisions when building my own, rather than accepting someone else's
choices. i can design the system to suit my exact needs rather than
adapt my needs to fit the capabilities of a generic appliance)


for qnap and similar appliances you're looking at over $500 for a 2 bay
NAS - two drives, that's crap. for 5 or 6 drive bays, it's $800 - $1000.
8 bays, around $1300. and that's with no drives.

Who in their right mind would pay that when you can build your own 8 or
even 12 bay (or more if you use 2.5" drives) ZFS system with a great
motherboard and CPU and bucketloads of ram for between $500 and $800?

you can even still use an atom or fusion cpu if low-power is a
requirement (but, really, with that many drives the CPU power usage
isn't that big a deal, and a better CPU helps with compression)


for HP microservers and the like, i posted about that in luv-talk a few
days ago, with rough costings for a slightly cheaper and significantly
better AMD Fusion based system.
Post by Tim Connors
Post by Craig Sanders
i tend to check and double-check potentially dangerous command lines
before i hit enter, anyway.
No <up>-<up>-<enter> sudo reboots? ;P
nope, i have shutdown and reboot excluded from bash history:

export HISTIGNORE='shutdown*:reboot*'

having to retype shutdown commands saves me from the
aaaaaaararggggghhhhhh! feeling of accidentally rebooting when i didn't mean
to...i've done that too many times in the past.


craig
--
craig sanders <***@taz.net.au>

BOFH excuse #179:

multicasts on broken packets
Tim Connors
2013-07-15 02:02:34 UTC
Permalink
Post by Craig Sanders
Post by Jeremy Visser
Post by Craig Sanders
and, unlike lvm, with both btrfs and zfs, once you've created the
pool you don't need to remember or specify the device names of the
disks in the pool unless you're replacing one of them
Can you clarify what you mean by this?
My volume group is /dev/vg_glimfeather. I haven't interacted with the
raw PV, /dev/sda6, in over a year.
Extends the size of the logical volume "vg01/lvol10" by 54MiB on physical
volume /dev/sdk3. This is only possible if /dev/sdk3 is a member of volume
group vg01 and has enough free physical extents:
lvextend -L +54 /dev/vg01/lvol10 /dev/sdk3
Yes, but why would you do that instead of just letting lvm pick where to
put its extents? If I add a disk in lvm or zfs, I have to either
zfs add tank <dev-id> (or is it zpool?) or vgextend <vg> <dev-node>. Same
difference. In both cases, I have to read the manpage to remember quite
how to use it.
Post by Craig Sanders
Extend a logical volume vg1/lv1 by 16MB using physical extents
lvresize -L+16M vg1/lv1 /dev/sda:0-1 /dev/sdb:0-1
Wow, damn. Maybe the manpage is deficient, or maybe you're reading the
wrong section. It's *definitely* not that complicated.

Just cargo cult off the net like everyone else!
--
Tim Connors
Craig Sanders
2013-07-15 04:34:50 UTC
Permalink
Post by Tim Connors
Wow, damn. Maybe the manpage is deficient, or maybe you're reading the
wrong section. It's *definitely* not that complicated.
Just cargo cult off the net like everyone else!
cargo-culting from the man-page obviously doesn't work - as Brett
highlighted, the examples given are way more complicated than
they need to be.

*all* of the examples in the lvextend & lvresize man pages have the
PhysicalVolumePath on the command line.

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #316:

Elves on strike. (Why do they call EMAG Elf Magic)
Tim Connors
2013-07-15 04:58:37 UTC
Permalink
Post by Craig Sanders
Post by Tim Connors
Wow, damn. Maybe the manpage is deficient, or maybe you're reading the
wrong section. It's *definitely* not that complicated.
Just cargo cult off the net like everyone else!
cargo-culting from the man-page obviously doesn't work - as Brett
highlighted, the examples given are way more complicated than
they need to be.
*all* of the examples in the lvextend & lvresize man pages have the
PhysicalVolumePath on the command line.
Ah so they do. But the EBNF usage gives PhysicalVolumePath in optional
square brackets.

PhysicalVolumePath is of course useful if you want to move /usr mostly to
ssd but keep the rest preferentially going to your 3TB spinning rust, but
have it all in the same VG anyway.
--
Tim Connors
Kevin
2013-07-15 05:25:45 UTC
Permalink
One small problem with LVM (and a reason not to see it as a valid
comparison) that ZFS and BTRFS solve is that lvm snapshots suck.
http://johnleach.co.uk/words/613/lvm-snapshot-performance
TL;DR: while an lvm snapshot exists, every write to the origin first copies
the original data into the snapshot area before writing the change, which
hurts write performance badly.
Craig Sanders
2013-07-16 03:45:03 UTC
Permalink
Post by Tim Connors
Ah so they do. But the EBNF usage gives PhysicalVolumePath in optional
square brackets.
true, but when you're operating in full-on cargo-cult mode you skip all
that verbose waffle and jump straight to the Examples section :-)

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #152:

My pony-tail hit the on/off switch on the power strip.
Tim Connors
2013-07-14 04:07:41 UTC
Permalink
Post by Craig Sanders
Post by Russell Coker
Post by Craig Sanders
http://zfsonlinux.org/faq.html#PerformanceConsideration
"Create your pool using whole disks: When running zpool create use
whole disk names. This will allow ZFS to automatically partition
the disk to ensure correct alignment. It will also improve
interoperability with other ZFS implementations which honor the
wholedisk property."
Other than the compatibility reason stated there, this FAQ, and the
differing behaviour of the scheduling elevator dependent upon whether it's
a disk or a partition, has always *smelt* to me. If dealing with a whole
disk, it still creates -part1 & -part9 anyway.
Post by Craig Sanders
Post by Russell Coker
Who's going to transfer a zpool of disks from a Linux box to a *BSD or Solaris
system? Almost no-one.
read the first reason again, automatic alignment of the partitions
increases performance - and it's the sort of thing that it's a PITA
to do manually, calculating the correct starting locations for all
partitions. it'll be a lot more significant to most linux users than the
second reason...as you say, that's irrelevant to most people.
eh? parted has done proper alignment by default since just after the
dinosaurs were wiped out. I was actually surprised the other day when
ext4.ko issued an alignment warning about a filesystem on one of the new
Oracle database VMs at work, and I looked at the many year old template it
was deployed from, and alignment was already correct. The Oracle forums
(I was surprised that there was actually an Oracle community who talk out
in the open on web fora - I thought it was all enterprisey type people who
don't like to talk) mentioned this was a known issue with how Oracle
does redo log IO, and it was just a cosmetic issue.
Post by Craig Sanders
and, unlike lvm, with both btrfs and zfs, once you've created the pool
you don't need to remember or specify the device names of the disks in
the pool unless you're replacing one of them - most operations (like
creating, deleting, changing attributes of a volume or zvol) are done
with just the name of the pool and volume. this leads to short, simple,
easily understood command lines.
eh? That's pretty close to the occasions where you need to specify raw
device names in lvm too.
--
Tim Connors
Craig Sanders
2013-07-16 03:16:02 UTC
Permalink
Post by Tim Connors
Other than the compatibility reason stated there, this FAQ, and the
differing behaviour of the scheduling elevator dependent upon whether
it's a disk or a partition, has always *smelt* to me.
it's not about whether it's a disk or a partition, it's about the
alignment of the partition with a multiple of the sector size.

4K alignment works for both 512-byte sector drives and 4K or "advanced
format" drives
Post by Tim Connors
If dealing with a whole disk, it still creates -part1 & -part9 anyway.
yep, it creates a GPT partition table so that the data partiton is
4K-aligned.

interesting discussion here:

https://github.com/zfsonlinux/zfs/issues/94



in theory, zfs could use the raw disk and start at sector 0, but one
of the risks there is that careless use of tools like grub-install (or
fdisk/gdisk/parted etc) could bugger up your pool....and there's really
no good reason not to use a partition table, at worst you lose the first
1MB (2048 x 512 bytes) of each disk.
Post by Tim Connors
eh? parted has done proper alignment by default since just after the
dinosaurs were wiped out.
it's still quite common for drives to be partitioned so that the first
partition starts on sector 63 rather than 2048 - which works fine for
512-byte sectors but not so great for 4K-sector drives.

and i know i've spent significant amounts of time in the past, tediously
calculating the optimum offsets for partitions to use with mdadm.
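for the manual/mdadm case, parted can do the arithmetic for you these days -
a rough example (the device name is a placeholder):

# GPT label with a single partition starting at 1MiB, which is aligned
# for both 512-byte and 4K-sector drives
parted -s /dev/sdx mklabel gpt mkpart primary 1MiB 100%
# sanity-check the alignment of partition 1
parted /dev/sdx align-check optimal 1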

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #424:

operation failed because: there is no message for this error (#1014)
Tim Connors
2013-07-14 03:57:16 UTC
Permalink
Post by Craig Sanders
Post by Russell Coker
You need DKMS to build the kernel modules and then the way ZFS operates is
very different from traditional Linux filesystems.
dkms automates and simplifies the entire compilation process, so it just
takes time. it's not complicated or difficult.
It's unreliable. I've had to rescue my headless NAS more than once when
zfs failed to build. In-kernel ext4 has the benefits of being a little
more mature, and in-tree.
--
Tim Connors
Craig Sanders
2013-07-16 03:17:55 UTC
Permalink
Post by Tim Connors
[...dkms...]
It's unreliable. I've had to rescue my headless NAS more than once when
zfs failed to build.
what, you rebooted without checking that the compile had completed
successfully and the modules were correctly installed?

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #442:

Trojan horse ran out of hay
Petros
2013-07-15 23:45:00 UTC
Permalink
Don't ever use more than 80% of your file system? Yeah, I know that's not
a very acceptable alternative.
The 80% figure is a bit of an "old time myth"; I was running ZFS with higher
usage under FreeBSD until I hit the "slowness".

http://portrix-systems.de/blog/brost/oracle-databases-on-zfs/
[About Solaris]

"zfs actually switches to a different algorithm to find free blocks at
a certain level and that this level is configurable with
metaslab_df_free_pct. Older releases switch at about 80% full and try
to find the “best fit” for new data which slows things down even more."

More about the tuning here:

http://blogs.everycity.co.uk/alasdair/2010/07/zfs-runs-really-slowly-when-free-disk-usage-goes-above-80/

To be honest, I don't know the exact FreeBSD threshold. At the
moment, Nagios is giving me a warning at 80% and then I start thinking
about remedies (without feeling too much pressure at this point).
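The usage figure is easy to pull out for Nagios or a cron job, e.g.:

# prints pool name and percentage used, e.g. "tank 81%"
zpool list -H -o name,capacity
# crude warning threshold for a cron job
zpool list -H -o name,capacity | awk '{ gsub("%","",$2); if ($2+0 > 80) print $1 " is " $2 "% full" }'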

Other filesystems, untuned, reserve 5% of the space; non-privileged
users cannot write once that threshold is reached.

For my server usage, I accepted that I need some RAM (I usually
reserve 4GB for the ARC) and some "wasted" disk space. Given that both
are cheap I don't bother.

I accepted it because it makes my life much easier in many regards. It
saves me time I would spend to fiddle around with other solutions.

Weren't computer invented to save time and do the boring stuff for you?

BTW: I don't use dedup. Firstly because I use cloning a lot, and after
that the changes are unique to the ZFS filesystem in question.

I have trouble coming up with a scenario where I'd use it. But I am pretty
sure someone asked for it. Maybe someone running a big big server farm
and distributing copies of many many Gigabytes of data to many VMs on
the same box?

Or something like that. I mean, not every feature needs to be used by
me. And it's not on by default so it does not bother me.

Regards
Peter
Toby Corkindale
2013-07-16 01:02:16 UTC
Permalink
Post by Petros
Don't ever use more than 80% of your file system? Yeah, I know that's not
a very acceptable alternative.
The 80% is a bit of an "old time myth", I was running ZFS with higher
usage under FreeBSD until I hit the "slowness".
Chipping in with a bit of real-world info for you here.
Yes, ZFS on Linux still suffers from massive slowdowns once you use up
most of the free space. The threshold is higher than 80%, I'll grant, but
not by a whole lot - maybe 90%?
Post by Petros
BTW: I don't use dedup. Firstly because I use cloning many times and
after that: Well, that are changes unique to the ZFS in question.
I have problems to come up with a scenario to use it. But I am pretty
sure someone asked for it. Maybe someone running a big big server farm
and distributing copies of many many Gigabytes of data to many VMs on
the same box?
Dedup is really painful on zfs. Attempting to use it over a
multi-terabyte pool ended in failure; using it on just a subset of data
worked out OK, but we ended up with >50GB of memory in the server to cope.
Fine for a big server, but how many of you are running little
fileservers at home with that much?

You do save on a fair bit of disk space if you're storing a pile of
virtual machine images that contain a lot of similarities.

-Toby
Craig Sanders
2013-07-16 03:37:41 UTC
Permalink
Post by Petros
The 80% is a bit of an "old time myth", I was running ZFS with
higher usage under FreeBSD until I hit the "slowness".
Chipping in with a bit of real-world info for you here. Yes, ZFS on
Linux still suffers from massive slowdowns once you run out of most of
the free space. It's more than 80%, I'll grant, but not a whole lot -
maybe 90%?
traditional filesystems like ext4 reserve 5% for root anyway, so there's
not much difference between 90 and 95%
Dedup is really painful on zfs. Attempting to use it over a
multi-terabyte pool ended in failure; using it just a subset of data
worked out OK, but ended up with >50GB of memory in the server to cope.
IIRC, de-duping can be enabled on a per subvolume/zvol basis, but the
RAM/L2ARC requirement still depends on the size of the pool and NOT the
volume.

otherwise, i'd enable it on some of the subvolumes on my own pools.


so if you want de-duping with small memory foot-print, make a small
zpool dedicated just for that. could be a good way to get economical
use of a bunch of, say, 250 or 500GB SSDs for VM images.

hmmm, now that i think of it, that's actually a scenario where the
cost-benefit of de-duping would really be worthwhile. and no need for
L2ARC or ZIL on that pool, they'd just be a bottleneck (e.g. 4 SSDs
accessed directly have more available IOPS than 4 SSDs accessed through
1 SSD as "cache")

use a pool of small, fast SSDs as de-duped "tier-1" storage for VM rootfs
images, and mechanical disk pool for bulk data storage.
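a rough sketch of that idea (pool and device names are made up):

# small all-SSD pool just for VM root images, with dedup and compression on
zpool create vmpool mirror ata-SSD1-SERIAL ata-SSD2-SERIAL
zfs set dedup=on vmpool
zfs set compression=lz4 vmpool
zfs create vmpool/images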
You do save on a fair bit of disk space if you're storing a pile of
virtual machine images that contain a lot of similarities.
it also improves performance, as the shared blocks for common programs
like sed, grep and many others are more likely to be cached in ARC or
L2ARC.

craig
--
craig sanders <***@taz.net.au>

BOFH excuse #382:

Someone was smoking in the computer room and set off the halon systems.
Petros
2013-07-17 05:02:41 UTC
Permalink
Quoting Tim Connors
Post by Tim Connors
I've got one for you; just came in on the mailing list (someone got stale
mounts when their kernel nfs server restarted):
Post by Tim Connors
ZFS and the NFS kernel server are not that tightly integrated.
When you do a 'zfs set sharenfs="foo,bar" pool/vol' or a 'zfs share
pool/vol', the zfs tools just give a call to the NFS kernel server
saying 'Hey I want you to share this over NFS'.
That's what it should do. Under FreeBSD it sends a HUP to mountd - that's it.
Post by Tim Connors
Post by Tim Connors
If the NFS kernel server is restarted it unshares everything and only
reads back whatever is in /etc/exports. This is actually expected as
NFS doesn't know anything about ZFS. Doing a
'zfs share -a' exports all your NFS/SMB shares again.
Yes, it is a dependency. In general /etc/init.d or /etc/rc.d ..
scripts take care of it so there is room for improvement if it is not
done yet.
Post by Tim Connors
Post by Tim Connors
When you don't use your system native tools, or when someone tries to
solve something at the wrong layer (zfs trying to mess around with
NFS?
It does use them - it just keeps another list alongside /etc/exports. It's
not a big deal, I would think.

ZFS changes the way you work with filesystems in many ways. Who would
create/clone/destroy/move around filesystems the way I do with ZFS now? I
have 164 ZFS filesystems on the system I just queried.

It makes sense to keep the list of exports here.

On a ZFS system, /etc/exports is a "legacy file".

BTW: the ZFS hierarchy gives you finer control over access to parts of the
directory tree than other filesystems do, because it is easy to create a
new ZFS filesystem for a directory tree and give it different export options.

Regards
Peter