Hi! Recently I came across an interesting task: setting up a storage server for backing up a large number of block devices.
Every week we back up all virtual machines in our cloud, so we need to be able to handle thousands of backups, and to do it as quickly and efficiently as possible.
Unfortunately, the standard RAID5 and RAID6 levels are not suitable here: on disks as large as ours, the recovery process would be painfully long and would most likely never finish successfully.
Let's consider the alternatives:
Erasure Coding — an analogue of RAID5 and RAID6, but with a configurable parity level. Fault tolerance is provided not for whole block devices, but for each object separately. The easiest way to try Erasure Coding is to deploy minio.
DRAID is a currently unreleased feature of ZFS. Unlike RAIDZ, DRAID distributes the parity blocks and uses all the disks in the array during recovery, which makes it more resilient to disk failures and provides faster recovery than the standard RAID levels.
For this setup I had a Fujitsu Primergy RX300 S7 server with an Intel Xeon CPU E5-2650L 0 @ 1.80GHz processor, nine Samsung DDR3-1333 8GB PC3L-10600R ECC Registered RAM modules (M393B1K70DH0-YH9), a Supermicro SuperChassis 847E26-RJBOD1 disk shelf connected via a Dual LSI SAS2X36 Expander, and 45 Seagate ST6000NM0115-1YZ110 disks, 6TB each.
Before making any decisions, we first need to test everything properly.
To do this I prepared and tested various configurations. I used minio as an S3 gateway, starting it in different modes with a different number of targets.
Basically I was choosing between minio with erasure coding and software RAID configurations with the same number of disks and the same parity level: RAID6, RAIDZ2 and DRAID2.
For reference: when you run minio with just one target, it works simply as an S3 gateway, representing your local file system as S3 storage. If you run minio with several targets, Erasure Coding mode is turned on automatically; in this case minio spreads the data between your targets and provides fault tolerance for your objects. By default, minio divides the targets into groups of 16 disks, where each group has 2 parity disks. Thus two disks can fail at the same time without data loss.
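To illustrate the difference between the two modes, here is a rough sketch of the minio invocations (the mount paths are hypothetical, and the exact defaults depend on the minio version):

```shell
# One target: minio acts as a plain S3 gateway over the local filesystem.
minio server /data

# Several targets: Erasure Coding mode is enabled automatically,
# and objects are spread across the drives with parity.
minio server /mnt/disk{1...16}

# The parity level per erasure set can be tuned via the storage class,
# e.g. 2 parity drives per set:
MINIO_STORAGE_CLASS_STANDARD=EC:2 minio server /mnt/disk{1...16}
```

The `{1...16}` ellipsis syntax is minio's own drive-pattern expansion, not shell brace expansion.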
For the benchmark I used 16 disks, 6TB each, and wrote small objects of 1MB. This quite accurately describes our future load, since all modern backup tools split data into blocks of several megabytes and write them this way.
I used the s3bench utility running on a remote server: it sent tens of thousands of such objects to minio in a hundred streams, and afterwards read them back the same way.
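A run along those lines might look like the following sketch (the endpoint, credentials and exact sample count are hypothetical, and flag names follow the s3bench README):

```shell
# Write and then read back ~50k objects of 1MB in 100 parallel streams
# against the minio endpoint under test.
s3bench \
  -accessKey=minioadmin \
  -accessSecret=minioadmin \
  -endpoint=http://storage-server:9000 \
  -bucket=benchmark \
  -numClients=100 \
  -numSamples=50000 \
  -objectSize=1048576
```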
The benchmark results are shown in the following table:
As we can see, minio with erasure coding mode works much worse for writing than minio running on top of software RAID6, RAIDZ2 and DRAID2 in the same configuration.
Additionally, a test of minio on ext4 vs XFS was requested. Surprisingly, XFS was significantly slower than ext4 for this type of load.
In the first batch of tests, mdadm showed superiority over ZFS, but later George Melikov suggested a few options to me that significantly improved ZFS performance:
xattr=sa atime=off recordsize=1M
and after applying them, tests with ZFS got a lot better.
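Applying those three properties to a pool might look like this (the pool name `data` is just an example; `recordsize` only affects newly written files):

```shell
# Store extended attributes in the dnode instead of hidden directories,
# saving extra I/O per file.
zfs set xattr=sa data

# Do not update access times on every read.
zfs set atime=off data

# Match the record size to the ~1MB objects of this workload.
zfs set recordsize=1M data
```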
In the last two tests I also tried to move the metadata (special) and the ZIL (log) to a mirror of SSDs. Moving the metadata didn't give much gain in write speed, and when the ZIL was moved, my SSDSC2KI128G8 drives choked everything at 100% utilization, so I considered this test a failure. However, I don't exclude that faster SSDs could have greatly improved the results; unfortunately, I didn't have any.
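For reference, attaching such vdevs to an existing pool can be sketched as follows (the device names are hypothetical):

```shell
# Add a mirrored special vdev: metadata (and optionally small blocks)
# will be allocated on these SSDs instead of the main disks.
zpool add data special mirror /dev/sdX /dev/sdY

# Add a mirrored log vdev: the ZIL moves to these SSDs,
# accelerating synchronous writes.
zpool add data log mirror /dev/sdZ /dev/sdW
```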
Finally, I decided to settle on DRAID: despite its beta status, it is the fastest and most efficient storage solution in our case.
I created a simple DRAID2 configuration with three groups and two distributed spares:
```
# zpool status data
  pool: data
 state: ONLINE
  scan: none requested
config:

	NAME                   STATE     READ WRITE CKSUM
	data                   ONLINE       0     0     0
	  draid2:3g:2s-0       ONLINE       0     0     0
	    sdy                ONLINE       0     0     0
	    sdam               ONLINE       0     0     0
	    sdf                ONLINE       0     0     0
	    sdau               ONLINE       0     0     0
	    sdab               ONLINE       0     0     0
	    sdo                ONLINE       0     0     0
	    sdw                ONLINE       0     0     0
	    sdak               ONLINE       0     0     0
	    sdd                ONLINE       0     0     0
	    sdas               ONLINE       0     0     0
	    sdm                ONLINE       0     0     0
	    sdu                ONLINE       0     0     0
	    sdai               ONLINE       0     0     0
	    sdaq               ONLINE       0     0     0
	    sdk                ONLINE       0     0     0
	    sds                ONLINE       0     0     0
	    sdag               ONLINE       0     0     0
	    sdi                ONLINE       0     0     0
	    sdq                ONLINE       0     0     0
	    sdae               ONLINE       0     0     0
	    sdz                ONLINE       0     0     0
	    sdan               ONLINE       0     0     0
	    sdg                ONLINE       0     0     0
	    sdac               ONLINE       0     0     0
	    sdx                ONLINE       0     0     0
	    sdal               ONLINE       0     0     0
	    sde                ONLINE       0     0     0
	    sdat               ONLINE       0     0     0
	    sdaa               ONLINE       0     0     0
	    sdn                ONLINE       0     0     0
	    sdv                ONLINE       0     0     0
	    sdaj               ONLINE       0     0     0
	    sdc                ONLINE       0     0     0
	    sdar               ONLINE       0     0     0
	    sdl                ONLINE       0     0     0
	    sdt                ONLINE       0     0     0
	    sdah               ONLINE       0     0     0
	    sdap               ONLINE       0     0     0
	    sdj                ONLINE       0     0     0
	    sdr                ONLINE       0     0     0
	    sdaf               ONLINE       0     0     0
	    sdao               ONLINE       0     0     0
	    sdh                ONLINE       0     0     0
	    sdp                ONLINE       0     0     0
	    sdad               ONLINE       0     0     0
	spares
	  s0-draid2:3g:2s-0    AVAIL
	  s1-draid2:3g:2s-0    AVAIL

errors: No known data errors
```
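A pool like the one above can be created with a single command. The sketch below uses the `draid2:3g:2s` notation of the development branch I tested (parity 2, 3 redundancy groups, 2 distributed spares); note that the syntax changed in the released OpenZFS 2.1, which instead takes the number of data disks, children and spares, e.g. `draid2:<d>d:<c>c:<s>s`.

```shell
# Sketch (pre-release dRAID syntax): DRAID2, three groups,
# two distributed spares across 45 disks.
zpool create data draid2:3g:2s \
    sdy sdam sdf sdau sdab sdo sdw sdak sdd sdas sdm sdu sdai sdaq sdk \
    sds sdag sdi sdq sdae sdz sdan sdg sdac sdx sdal sde sdat sdaa sdn \
    sdv sdaj sdc sdar sdl sdt sdah sdap sdj sdr sdaf sdao sdh sdp sdad
```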