It is widely known that ZFS can compress and deduplicate data. Deduplication works at the pool level and removes duplicate data blocks as they are written to disk. The result is that only unique blocks are stored on disk, while duplicate blocks are shared among the files.

There is a good read about how dedup works, including tweaks such as changing the checksum hashing function: https://blogs.oracle.com/bonwick/entry/zfs_dedup

Note: Compression works fine under zfsonlinux, but the current version does not yet support deduplication (as of 16.09.2014). ZFS on FreeBSD (for example FreeNAS) and Solaris (and OpenSolaris) have a higher pool version and support deduplication. Deduplication was introduced with pool version 21.

Zpool versions and features (blogs.oracle.com)
List of operating systems supporting ZFS (wikipedia)

Now, how do you determine whether you would actually benefit from deduplicated and compressed datasets?

I ran the following under FreeNAS with a test setup filled with real data: self-recorded camera .mov files, ISOs, virtual disks (mostly Xen VDIs, which are thin provisioned and therefore a bad example too), iSCSI LUNs, office documents, a few audio files, zipped files, and log files.

The rule of thumb says to have 1 GB of RAM per TB of data. For deduplicated zpools you should actually have 5 GB of RAM per TB of data.
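As a back-of-the-envelope sketch of where the 5 GB figure can come from (the 320 bytes per DDT entry and the 64 KiB average block size are assumptions for illustration, not measurements from this pool):

```shell
# Rough DDT memory estimate for dedup (illustrative assumptions):
# 1 TiB of data at an assumed average block size of 64 KiB
data_bytes=$((1024 * 1024 * 1024 * 1024))   # 1 TiB
avg_block=$((64 * 1024))                     # assumed average block size
blocks=$((data_bytes / avg_block))
# commonly cited figure: ~320 bytes of core memory per DDT entry
ddt_ram=$((blocks * 320))
echo "$blocks blocks -> $((ddt_ram / 1024 / 1024)) MiB of RAM for the DDT"
```

With those assumptions the maths lands almost exactly on the 5 GB per TB rule of thumb.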

Let's dig into some figures.

Here you can see my pool (for this demo I'm running on a single 1 TB HD):

[root@freenas] ~# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
ZFS_1TB_Disk 928G 732G 196G 78% 1.02x ONLINE /mnt

732 GB allocated.
Please note: running a ZFS pool above 80% allocation (used) will make the file system extremely sluggish. Performance will drop down to a few MB/sec.
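A small sketch for keeping an eye on that: parse the capacity column of zpool list and warn above 80% (the helper name and the 80% threshold are my own choices):

```shell
# Warn when a pool crosses 80% capacity.
# Expects "name<TAB>cap%" lines on stdin, as produced by:
#   zpool list -H -o name,capacity
check_pool_cap() {
    awk -F'\t' '{ cap = $2; sub(/%/, "", cap);
                  if (cap + 0 > 80) print $1 " is at " cap "% - expect slow writes" }'
}
# e.g.: zpool list -H -o name,capacity | check_pool_cap
```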

Now some nice stats.
Spoiler: dedup does not pay off in this case. But look at the numbers.

[root@freenas] ~# zpool status -D ZFS_1TB_Disk
pool: ZFS_1TB_Disk
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
ZFS_1TB_Disk ONLINE 0 0 0
gptid/e77ad37c-3d0a-11e4-874d-0026b97b0cc9 ONLINE 0 0 0
errors: No known data errors
dedup: DDT entries 6120540, size 507 on disk, 163 in core
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 5.69M 719G 711G 711G 5.69M 719G 711G 711G
2 150K 16.9G 16.6G 16.6G 306K 34.0G 33.5G 33.5G
4 2.00K 25.1M 11.7M 14.7M 8.23K 106M 49.6M 62.0M
8 239 6.82M 5.92M 6.53M 2.29K 65.3M 54.9M 60.9M
16 96 6.53M 6.12M 6.27M 1.85K 123M 114M 117M
32 10 390K 98.5K 120K 465 21.1M 5.65M 6.50M
64 6 768K 396K 396K 500 62.5M 34.6M 34.6M
256 1 128K 128K 128K 302 37.8M 37.8M 37.8M
4K 1 128K 4K 4K 5.84K 747M 23.3M 23.3M
8K 1 128K 4K 4K 9.97K 1.25G 39.9M 39.9M
Total 5.84M 736G 728G 728G 6.02M 755G 745G 745G

736G written to disk out of 755G of addressed data.
What does that tell me? More figures. Warning: the zdb -b command took about 20 minutes to run.

[root@freenas] ~# zdb -b ZFS_1TB_Disk
Traversing all blocks to verify nothing leaked ...
750G completed (1496MB/s) estimated time remaining: 4294654283hr 4294967295min 4294967272sec
No leaks (block sum matches space maps exactly)
bp count: 6715048
bp logical: 813464691712 avg: 121140
bp physical: 801489885184 avg: 119357 compression: 1.01
bp allocated: 804917678080 avg: 119867 compression: 1.01
bp deduped: 18510073856 ref>1: 156038 deduplication: 1.02
SPA allocated: 786407604224 used: 78.92%
[root@freenas] ~# zdb -DD ZFS_1TB_Disk
DDT-sha256-zap-duplicate: 156038 entries, size 467 on disk, 151 in core
DDT-sha256-zap-unique: 5964505 entries, size 508 on disk, 164 in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 5.69M 719G 711G 711G 5.69M 719G 711G 711G
2 150K 16.9G 16.6G 16.6G 306K 34.0G 33.5G 33.5G
4 2.00K 25.1M 11.7M 14.7M 8.23K 106M 49.6M 62.0M
8 239 6.82M 5.92M 6.53M 2.29K 65.3M 54.9M 60.9M
16 96 6.53M 6.12M 6.27M 1.85K 123M 114M 117M
32 10 390K 98.5K 120K 465 21.1M 5.65M 6.50M
64 6 768K 396K 396K 500 62.5M 34.6M 34.6M
256 1 128K 128K 128K 302 37.8M 37.8M 37.8M
4K 1 128K 4K 4K 5.84K 747M 23.3M 23.3M
8K 1 128K 4K 4K 9.97K 1.25G 39.9M 39.9M
Total 5.84M 736G 728G 728G 6.02M 755G 745G 745G
dedup = 1.02, compress = 1.01, copies = 1.00, dedup * compress / copies = 1.04

So this is close to totally useless in my case. What do the numbers tell me? (Besides that I wasted loads of time with dedup.)

I have a total of 6.02 million referenced blocks (the DDT itself holds, to be accurate, 6,120,540 entries), where refcnt shows how many times each block is referenced. Each DDT entry takes 507 bytes on disk and 163 bytes in memory.
Quick maths:
6,120,540 × 163 = 997,648,020 bytes ≈ 951.43 MB used in RAM
6,120,540 × 507 = 3,103,113,780 bytes ≈ 2,959.35 MB used on disk just to hold the dedup tables.
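The same maths as a quick shell snippet, with the numbers copied from the "DDT entries" line of zpool status -D above:

```shell
# Recompute the DDT footprint from:
# "dedup: DDT entries 6120540, size 507 on disk, 163 in core"
entries=6120540
core_bytes=$((entries * 163))   # RAM held by the table
disk_bytes=$((entries * 507))   # on-disk footprint of the table
echo "RAM:  $((core_bytes / 1024 / 1024)) MiB"   # ~951 MiB
echo "Disk: $((disk_bytes / 1024 / 1024)) MiB"   # ~2959 MiB
```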

That makes my dedupe efforts even more useless.

 dedup = 1.02, compress = 1.01, copies = 1.00, dedup * compress / copies = 1.04

So my dedup ratio was 1.02, compression 1.01, and the combined effort brought me a 1.04 ratio.

Now, if you would like to test whether dedup would work on your existing pool, you can run “zdb -S poolname” to simulate deduplication and get figures similar to the above. (This command takes a few minutes.)

[root@freenas] ~# zdb -S ZFS_1TB_Disk
Simulated DDT histogram:
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 5.69M 719G 711G 711G 5.69M 719G 711G 711G
2 150K 16.9G 16.6G 16.6G 306K 34.0G 33.5G 33.5G
4 2.00K 25.1M 11.7M 14.7M 8.24K 106M 49.6M 62.0M
8 239 6.82M 5.92M 6.53M 2.29K 65.3M 54.9M 60.9M
16 96 6.53M 6.12M 6.27M 1.85K 123M 114M 117M
32 10 390K 98.5K 120K 465 21.1M 5.65M 6.50M
64 6 768K 396K 396K 500 62.5M 34.6M 34.6M
256 1 128K 128K 128K 302 37.8M 37.8M 37.8M
4K 1 128K 4K 4K 5.84K 747M 23.3M 23.3M
8K 1 128K 4K 4K 9.97K 1.25G 39.9M 39.9M
Total 5.84M 736G 728G 728G 6.02M 755G 745G 745G
dedup = 1.02, compress = 1.01, copies = 1.00, dedup * compress / copies = 1.04

Conclusion:
It is now up to you to consider whether deduplication is worth it for your data. It isn't for mine.

Let me throw in another sample. It's a Proxmox VM on Xen hosting OpenVZ guests. The idea was that my OpenVZ guests (all CentOS 6.x) share quite a lot of common data, which should be easy to deduplicate to shrink the small VDI I assigned. See below how that failed as well and moved up my list of things to get rid of and reconsider.

# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
zfs-data 149G 46.2G 103G 31% 1.05x ONLINE -
# zdb -DD zfs-data
DDT-sha256-zap-duplicate: 66133 entries, size 325 on disk, 175 in core
DDT-sha256-zap-unique: 509333 entries, size 304 on disk, 167 in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 497K 50.9G 42.6G 42.6G 497K 50.9G 42.6G 42.6G
2 39.8K 1.68G 1.23G 1.23G 90.4K 3.71G 2.68G 2.68G
4 22.9K 546M 303M 303M 93.9K 2.19G 1.22G 1.22G
8 1.61K 17.4M 12.8M 12.8M 14.5K 167M 124M 124M
16 184 4.36M 4.04M 4.04M 3.72K 84.5M 77.4M 77.4M
32 54 56.5K 35.5K 35.5K 2.20K 2.41M 1.44M 1.44M
64 31 978K 954K 954K 2.60K 84.9M 83.0M 83.0M
128 6 3.50K 3.50K 3.50K 1.04K 624K 624K 624K
256 2 1K 1K 1K 602 301K 301K 301K
1K 1 8K 512 512 1.29K 10.3M 662K 662K
Total 562K 53.1G 44.1G 44.1G 708K 57.1G 46.7G 46.7G
dedup = 1.06, compress = 1.22, copies = 1.00, dedup * compress / copies = 1.30

Dedup is disappointing, while compression makes up for it a bit.

ZFS compression

More of a note to myself.

To compare file-system usage in a directory with and without compression:

Referenced on disk

# du -hAd0
56G .

Allocated on disk

# du -hd0
54G .
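The effective ratio can also be read directly via zfs get compressratio; from the two du figures above it works out roughly as follows (integer maths, scaled by 100):

```shell
# Ratio of referenced (apparent) size to allocated size, from the du runs above
ref_gib=56
alloc_gib=54
echo "compressratio ~ $((ref_gib * 100 / alloc_gib)) / 100"
```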

ZFS block size (record size):

As described here https://www.joyent.com/blog/bruning-questions-zfs-record-size (with a few experimental samples), ZFS uses dynamic block allocation from 512 bytes up to 128 KB.

Taking into account the commonly cited figure that each deduplicated block needs about 320 bytes of memory (my pool above showed 163 bytes per entry in core), you should keep an eye on the number of blocks used by the system and do the maths to make sure you don't run out of memory and risk losing access to the pool.
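As a sketch of that maths, here is a hypothetical helper (name and approach are my own) that turns the "bp count" line from zdb -b into a worst-case DDT RAM estimate, assuming ~320 bytes per entry and that every block ends up in the table:

```shell
# Worst-case DDT core-memory estimate from zdb -b output.
# Reads zdb -b output on stdin and multiplies "bp count" by ~320 bytes/entry.
estimate_ddt_ram_mib() {
    awk '/bp count:/ {printf "%d\n", $3 * 320 / 1024 / 1024}'
}
# e.g.: zdb -b ZFS_1TB_Disk | estimate_ddt_ram_mib
```

For the bp count of 6,715,048 seen above, that would put the worst case at roughly 2 GiB of RAM.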

If you know your files are larger than the default, or you want to lock down a particular block size (for example when running a database on ZFS), you can always fix the block size using:

zfs set recordsize=128k pool/fs

However, some tests have not revealed any improvement from doing so, so I leave that type of tuning for specific DB use cases.

ZFS Dedupe and not “enough” RAM (worst case scenario)

I could not really test this because my in-RAM table size was so small that pulling 4 GB would not cause much harm. However, it has been reported that running out of memory for the dedup tables can cause the zpool to fail on reboot or import. I find that scenario believable, because a test of exporting my pool and re-importing it took a while (~4 minutes) until the dedup table had been loaded into memory.

ZFS Dedupe and removing deduped Zvol

On a large zvol with deduplication, removing a filesystem can cause the server to stall.
When using zfs destroy pool/fs, ZFS has to work through the whole deduplication table to free the blocks; on a 1 TB HD/zpool it took 5 hours to do so.

A workaround would be to disable the deduplication flag on that filesystem prior to deletion, and then remove it.
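The workaround sketched as commands (pool/fs is a placeholder dataset name; I have not timed this on a large deduped zvol):

```shell
# Disable deduplication on the dataset first (per the workaround above),
# then destroy it. "pool/fs" is a placeholder.
zfs set dedup=off pool/fs
zfs destroy pool/fs
```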
  • AngieK

Hey, so what does the x1.02 mean? What ratio is that?

• It means the ratio is 1 to 1.02, so about a 2% reduction from the original size. But that's an assumption; I wasn't able to find any man page explaining that ratio.

  • Scott S.

    “Please note: running ZFS over 80% of allocation (used) will make the
    file-system extremely sluggish. Performance will drop down to a few
    MB/sec.” – This should not be true, significant improvements were made in ZFS as it approaches closer to full space used.

    Using smaller recordsize will increase the probability that a block will be deduplicated, two blocks of 64k for example is more likely than two separate 1M blocks being identical. Usually compression can only help, lz4 is so lightweight I can’t see why not to use it.

• Thanks Scott, good point on the block size; I will run some tests on that one.

Which ZFS version will be the one with the better full-space handling? Just a few days ago I still had issues with the latest ZFSonLinux version.
