Important: due to a bug in my benchmark program, the tps numbers in this post are incorrect. See here for the correct numbers.
In the previous post, I discussed Western Digital’s “Advanced Format” drives and the problems caused by their misreporting their real, physical sector size.
I wrote a benchmark utility to demonstrate the performance penalty of unaligned accesses and uncover a drive’s physical sector size. What it does is write blocks of zeroes of varying size at regular intervals. For each block size, it writes a total of 128 MB at intervals of four times the block size, and at an offset that varies from 512 bytes up to half of the block size.
With the default settings, the first pass will write 131,072 1,024-byte blocks at n × 4,096, and the second pass will do the same at n × 4,096 + 512. The third, fourth and fifth passes will write 65,536 2,048-byte blocks each at n × 8,192, n × 8,192 + 512 and n × 8,192 + 1,024. It will make four more passes with 4,096-byte blocks and five with 8,192-byte blocks.
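For the curious, here is roughly what the access pattern boils down to in C (a minimal sketch, not the actual phybs source; the TOTAL, MINSIZE and MAXSIZE constants are assumptions matching the defaults described above, and all timing and reporting is left out):

#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define TOTAL	(128 * 1024 * 1024)	/* data written per pass */
#define MINSIZE	1024			/* smallest block size */
#define MAXSIZE	8192			/* largest block size */

/* write TOTAL bytes of zeroes in size-byte blocks placed step bytes apart */
static void
pass(int fd, size_t size, off_t offset, off_t step)
{
	char *buf = calloc(1, size);
	size_t count = TOTAL / size;

	for (size_t i = 0; i < count; i++)
		pwrite(fd, buf, size, (off_t)i * step + offset);
	free(buf);
}

int
main(int argc, char *argv[])
{
	int fd = open(argv[1], O_RDWR);

	for (size_t size = MINSIZE; size <= MAXSIZE; size *= 2) {
		/* one aligned pass, then offsets doubling from 512 up to size / 2 */
		pass(fd, size, 0, 4 * size);
		for (off_t offset = 512; offset <= (off_t)(size / 2); offset *= 2)
			pass(fd, size, offset, 4 * size);
	}
	close(fd);
	return (0);
}

With the default sizes this gives 2 + 3 + 4 + 5 = 14 passes, which is where the pass numbers below come from.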
Here’s the idea: most passes will be very slow (up to half an hour per pass), but when we hit the right block size and alignment, performance will skyrocket; so on—let’s say—a WD20EARS with factory settings, passes 6 (4,096 bytes at offset 0), 10 (8,192 bytes at offset 0) and 14 (8,192 bytes at offset 4,096) should stand out from the crowd. In fact, here are the results for passes 6 through 9:
count size offset step msec tps kBps
32768 4096 0 16384 19503 138 6720
32768 4096 512 16384 1216537 2 107
32768 4096 1024 16384 1213479 2 108
32768 4096 2048 16384 1214623 2 107
Pass 6 takes 20 seconds, while passes 7, 8 and 9 take 20 minutes.
Let me rephrase that: properly aligned non-sequential writes are faster than misaligned ones by a factor of sixty.
Sixty. Six zero.
We really, really need to get that fixed somehow.
That’s not the whole story, though. Let’s see how it compares to a 7,200 rpm, 2 TB Hitachi Deskstar (HDS722020ALA330) with 512-byte physical sectors:
count size offset step msec tps kBps
32768 4096 0 16384 8803 307 14889
32768 4096 512 16384 8701 310 15063
32768 4096 1024 16384 8735 309 15004
32768 4096 2048 16384 8705 310 15056
The Hitachi blows through the test so fast you don’t even have time to make yourself a cup of coffee, let alone drink it.
This is a 7,200 rpm, 400 GB Caviar SE16 (WD4000AAKS)—more than three years old, so don’t expect too much:
count size offset step msec tps kBps
32768 4096 0 16384 21348 126 6139
32768 4096 512 16384 21674 124 6047
32768 4096 1024 16384 20799 129 6301
32768 4096 2048 16384 21031 128 6232
So, about the same as we get from the WD20EARS with aligned writes.
Now, here’s the kicker. The last drive in my test lineup is a WD20EADS—almost the same as the WD20EARS, but with 512-byte sectors and only 32 MB cache (although cache doesn’t mean anything here—I made sure my test program writes enough data to blow through the cache on every pass).
count size offset step msec tps kBps
32768 4096 0 16384 22811 118 5745
32768 4096 512 16384 19552 138 6703
32768 4096 1024 16384 36945 73 3547
32768 4096 2048 16384 50102 53 2616
Ouch. It’s not just slow, it’s also very inconsistent. I have no idea what to make of that.
Note 1: I did not mention rotational speed for the WD Green disks, because Western Digital themselves do not specify one; the spec sheet just says “IntelliPower”. Not sure what to make of that, either. Tom’s Hardware contradict themselves, saying in one review that it means 5,400 rpm, and in another that it means it varies. Meanwhile, my supplier claims the WD20EARS rotates at 7,200 rpm. Go figure.
Note 2: I also have a 1 TB WD10EARS, but I haven’t tested it yet. I expect it to perform pretty much as well (or as poorly, depending on your perspective) as the WD20EARS.
Update: the results for the WD10EARS are in. Strangely, it is much faster at unaligned writes than the WD20EARS, although it’s a little slower at aligned writes.
count size offset step msec tps kBps
32768 4096 0 16384 23105 116 5672
32768 4096 512 16384 79285 34 1653
32768 4096 1024 16384 75814 35 1728
32768 4096 2048 16384 79920 33 1640
A naïve sequential-write benchmark (diskinfo -t) suggests that it’s about 20% slower overall. It is possible that both disks use a striped layout internally, so the WD20EARS gets better results because it has more platters. If that is the case, it should be possible to modify phybs to detect the stripe size.
so… don’t buy huge WD disks yet, then.
That depends. If you need large amounts of storage for backup or archival purposes, and performance is not a big issue, they have a better dollar-per-terabyte ratio than pretty much any other disk on the market. For instance, my regular supplier charges NOK 895 for a WD20EARS, NOK 995 for a 2 TB Samsung SpinPoint F3 EcoGreen, and NOK 999 for a 2 TB Hitachi Deskstar 7K2000 (same model that I tested). That’s a 10% price advantage in WD’s favor.
(BTW, if you or anyone else is in a giving mood, I’d love to test the SpinPoint…)
interesting series of posts, thanks for sharing your data, DES.
Very interesting. And shame on them for reporting the wrong blocksize!
I just bought an SSD to put in my laptop, and I guess I’ve got to test that for sector size as well.
My plan is to have a ZFS-only setup with GPT on it; do you have any recommendations?
SSDs are complicated. You can flip any bit anywhere from 1 to 0 at any time, but you can only flip it back to 1 by erasing the whole block. The “erase block size” varies; my impression is that it tends to be large (on the order of 128 kB).
I wouldn’t run ZFS on a single-disk system, BTW. A single-vdev ZFS pool is significantly slower than UFS on the same device. Where ZFS really shines is large RAID setups with multiple raidz or raidz2 vdevs in the same pool.
Well, considering that I now run a ZFS-only setup on a 2.5-inch SATA 5,400 rpm drive, I think it will feel snappier with an SSD :)
Speed is not my main concern when choosing between ZFS and UFS; I consider the features of ZFS to far outweigh the slowdown. Which is not to say that I like to throw away performance needlessly.
Thanks for your input though :)
Good read. However, the TPS metric is incorrect because the calculation overflows. Fix (assuming sizeof(long) >= 5):
--- phybs.c	(revision 212043)
+++ phybs.c	(working copy)
@@ -102,7 +102,7 @@
 	usec = t1.tv_sec * 1000000 + t1.tv_usec;
 	usec -= t0.tv_sec * 1000000 + t0.tv_usec;
 	printf("%10lu%8lu%8lu\n", usec / 1000,
-	    count * 1000000 / usec,
+	    count * 1000000UL / usec,
 	    count * size * 1000000 / 1024 / usec);
 	free(buf);
 }
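To spell the overflow out: if count is held in a 32-bit integer (which is what the published numbers suggest), count * 1000000 wraps around before the division. Here is a standalone sketch, not taken from phybs.c, using the values from the first table above:

#include <stdio.h>

int
main(void)
{
	unsigned int count = 32768;	/* 4,096-byte passes */
	unsigned long usec = 19503000;	/* pass 6 above: 19,503 ms */

	/*
	 * count * 1000000 is evaluated in 32 bits; 32,768,000,000 wraps
	 * to 2,703,228,928, so this prints 138 (the bogus figure in the
	 * first table) instead of the correct 1680.
	 */
	printf("broken: %lu tps\n", count * 1000000 / usec);

	/* the UL suffix forces the multiplication into unsigned long */
	printf("fixed:  %lu tps\n", count * 1000000UL / usec);
	return (0);
}

On a platform where long is only 32 bits the UL suffix would not be enough, hence the sizeof(long) >= 5 caveat above.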
Andreas and Pieter, sorry for forgetting to check my moderation queue…
Thanks for the fix, Pieter. I’ll try to re-run the benchmarks and publish corrected figures. Luckily, the kBps figures were correct.
New WD15EARX, FreeBSD 8.2-RELEASE amd64, 4k alignment.
FS formatted with
newfs -S 4096 -b 32768 -f 4096 -U /dev/ada1p2
[root@timp ~/phybs]# ./phybs /dev/ada1p2
count size offset step msec tps kBps
262144 512 0 2048 37585 6974 3487
131072 1024 0 4096 20415 6420 6420
131072 1024 512 4096 20353 6439 6439
65536 2048 0 8192 12144 5396 10792
65536 2048 512 8192 12179 5380 10761
65536 2048 1024 8192 12166 5386 10773
32768 4096 0 16384 7490 4374 17499
32768 4096 512 16384 6565 4990 19963
32768 4096 1024 16384 7881 4157 16631
32768 4096 2048 16384 8652 3787 15149
No, that’s not right.
First of all, formatting the partition has no effect, since phybs operates directly on the device, not on the filesystem. If you want to test the filesystem, you have to mount it, create a large file, and run phybs on that file.
Secondly, you ran phybs on a partition instead of the whole disk. This means the results will be skewed unless the partition is aligned, and you didn’t show what gpart commands you used to create it.
Finally, you ran phybs in read mode (the default), so the drive’s prefetch cache is masking the effects of unaligned accesses.
From the data sheet, it looks like the EARX drives are identical to the EARS drives except for the SATA interface (6 Gbps instead of 3 Gbps), so the results should be pretty much the same. Try this on a scratch disk:
# phybs -w -l 1024 /dev/ada1
Thank you
[root@timp ~/phybs]# ./phybs -w -l 1024 /dev/ada1
count size offset step msec tps kBps
131072 1024 0 4096 132845 986 986
131072 1024 512 4096 128295 1021 1021
65536 2048 0 8192 73596 890 1780
65536 2048 512 8192 67192 975 1950
65536 2048 1024 8192 67330 973 1946
32768 4096 0 16384 16911 1937 7750
32768 4096 512 16384 51732 633 2533
32768 4096 1024 16384 51460 636 2547
32768 4096 2048 16384 52040 629 2518
16384 8192 0 32768 11221 1460 11680
16384 8192 512 32768 48505 337 2702
16384 8192 1024 32768 50572 323 2591
16384 8192 2048 32768 49215 332 2663
16384 8192 4096 32768 11123 1472 11783
Just got a server back from service, and they had replaced the Seagate drives with 1.5TB WD Green Power drives.
root#nfs1004 [/local/src/phybs] dmesg | grep da0
da0 at mpt0 bus 0 scbus0 target 0 lun 0
da0: Fixed Direct Access SCSI-5 device
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)
root#nfs1004 [/local/src/phybs] ./phybs -rw /dev/da0
count size offset step msec tps kBps
262144 512 0 2048 546207 479 239
131072 1024 0 4096 450918 290 290
131072 1024 512 4096 416217 314 314
65536 2048 0 8192 378569 173 346
65536 2048 512 8192 327592 200 400
65536 2048 1024 8192 338574 193 387
32768 4096 0 16384 229636 142 570
32768 4096 512 16384 276989 118 473
32768 4096 1024 16384 257056 127 509
32768 4096 2048 16384 260296 125 503
Am I right in concluding that I should create 4k-aligned devices for these before creating a zpool?
Cheers
No, you should send the disks back and ask for WD Caviar Blacks or Samsung Spinpoint F4s (not F4EG) instead.
Seriously, I cannot overemphasize how crappy these drives are. Feel free to refer them to me if they object.
Thanks, I’ll return the disks and swap them for WD Caviar Blacks now. (The supplier agreed fully that those WD GP disks were unsuitable for ZFS, and couldn’t quite understand why they had sent us those when they were aware that they would be used with ZFS.)