ZFS Evil Tuning Guide

Overview

Tuning is Evil

Tuning is evil and should not be done...in general.

First, consider that the default values are set by the people who know the most about the effects of the tuning. If a better value existed, it would be the default. While an alternative value might help a given workload, it could quite possibly degrade some other aspect of performance, perhaps catastrophically so.

Over time, tuning recommendations become stale at best and might lead to performance degradation at worst. Customers are leery of changing a tuning that is already in place, and the net effect is a worse product than it could be. Moreover, tuning enabled on a given system might spread to other systems, where it might not be warranted at all. If you must implement a ZFS tuning parameter, please reference the URL of this document:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

Review ZFS Best Practices Guide

On the other hand, ZFS best practices are things we encourage people to use. They are a set of recommendations that have been shown to work in different environments and are expected to keep working in the foreseeable future. So, before turning to tuning, make sure you've read and understood the best practices around deploying a ZFS environment that are described here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Identify ZFS Tuning Changes

The syntax for enabling a given tuning recommendation has changed over the life of ZFS releases. So, when upgrading to newer releases, make sure that the tuning recommendations are still effective. If you decide to use a tuning recommendation, reference this page in the /etc/system file or in the associated script.
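As an illustration of referencing this page from /etc/system, the following Python sketch renders a tunable entry preceded by a comment line. It assumes that /etc/system comment lines begin with an asterisk; the function name, its parameters, and the wording of the comment are hypothetical and only show one way the reference could be recorded.

```python
GUIDE_URL = "http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide"


def etc_system_entry(module: str, name: str, value: int, anchor: str = "") -> str:
    """Return an /etc/system fragment: a '*' comment line with the guide URL,
    followed by the 'set module:name = value' line."""
    url = f"{GUIDE_URL}#{anchor}" if anchor else GUIDE_URL
    return f"* Tuned per {url}\nset {module}:{name} = {value}\n"


if __name__ == "__main__":
    # Example: the file-level prefetch tunable described later in this guide.
    print(etc_system_entry("zfs", "zfs_prefetch_disable", 1, "ZFETCH"), end="")
```

Keeping the comment directly above the setting makes it easier to find and re-evaluate the tuning after an upgrade.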

The Tunables

In no particular order:

  • http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Checksums
  • http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
  • http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZFETCH
  • http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#VDEVPF
  • http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#MAXPEND
  • http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH
  • http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZIL
  • http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#METACOMP

Tuning ZFS Checksums

End-to-end checksumming is one of the great features of ZFS. It allows ZFS to detect and correct many kinds of errors that other products can't detect and correct. Disabling checksums is, of course, a very bad idea. Having file system level checksums enabled can alleviate the need to have application level checksums enabled. In this case, using the ZFS checksum becomes a performance enabler.

The checksums are computed asynchronously to most application processing and should normally not be an issue. However, each pool currently has a single thread computing the checksums (see the RFE below), and that computation can limit pool throughput. So, if the disk count is very large (>> 10) or a single CPU is weak (less than a GHz), then this tuning might help. If a system is close to CPU saturation, the checksum computations might become noticeable. In those cases, do a run with checksums off to verify whether checksum calculation is the problem.

If you tune this parameter, please reference this URL in the shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Checksums

Verify the type of checksum used:

zfs get checksum <filesystem>

Tuning is achieved dynamically by using:

zfs set checksum=off <filesystem>

And reverted by choosing one of the listed values:

zfs set checksum='on | fletcher2 | fletcher4 | sha256' <filesystem>

The fletcher2 checksum (the default) has been observed to consume roughly 1 GHz of a CPU when checksumming 500 MBytes per second.
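To put that observation in perspective, here is a small Python sketch that scales it linearly to other throughputs. The linear scaling, the reading of a MByte as 10**6 bytes, and the function names are assumptions made for illustration; the only figure taken from this guide is 1 GHz per 500 MBytes per second.

```python
# Observation quoted in this guide: about 1 GHz of CPU per 500 MBytes/s.
OBSERVED_HZ = 1e9               # CPU cycles per second consumed
OBSERVED_BYTES_PER_S = 500e6    # assumed: 1 MByte = 10**6 bytes


def cycles_per_byte() -> float:
    """CPU cycles spent per checksummed byte implied by the observation."""
    return OBSERVED_HZ / OBSERVED_BYTES_PER_S


def cpu_fraction(throughput_bytes_per_s: float, cpu_hz: float) -> float:
    """Fraction of one CPU (clocked at cpu_hz) needed at a given throughput,
    assuming the cost scales linearly with the data rate."""
    return throughput_bytes_per_s * cycles_per_byte() / cpu_hz


if __name__ == "__main__":
    print(f"{cycles_per_byte():.1f} cycles per byte")
    # Hypothetical pool: 800 MBytes/s on a 2 GHz CPU.
    print(f"{cpu_fraction(800e6, 2.0e9):.0%} of one CPU")
```

Under these assumptions the single checksum thread saturates one CPU when the pool throughput in bytes per second reaches half the clock rate in Hz, which is why a large disk count or a slow CPU makes the limit visible.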

RFEs
  • single-threaded checksum & raidz2 parity calculations limit write bandwidth on thumper

Fix integrated in Nevada, build 79

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6533726

Limiting the ARC Cache

The ARC is where ZFS caches data from all active storage pools. The ARC grows and consumes memory on the principle that no need exists to return data to the system while there is still plenty of free memory. When the ARC has grown and outside memory pressure exists, for example, when a new application starts up, then the ARC releases its hold on memory. ZFS is not designed to steal memory from applications. A few bumps appeared along the way, but the established mechanism works reasonably well for many situations and does not commonly warrant tuning.

However, a few situations stand out.

  • If a future memory requirement is significantly large and well defined, then it can be advantageous to prevent ZFS from growing the ARC into it. So, if we know that a future application requires 20% of memory, it makes sense to cap the ARC such that it does not consume more than the remaining 80% of memory.
  • If the application is a known consumer of large memory pages, then again limiting the ARC prevents ZFS from breaking up the pages and fragmenting the memory. Limiting the ARC preserves the availability of large pages.
  • If dynamic reconfiguration of a memory board is needed (supported on certain platforms), then it is a requirement to prevent the ARC (and thus the kernel cage) from growing onto all boards.

For these cases, it can be desirable to limit the ARC. This will, of course, also limit the amount of cached data, which can have adverse effects on performance. No easy way exists to foretell whether limiting the ARC degrades performance.

If you tune this parameter, please reference this URL in the shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE

Solaris 10 8/07 and Solaris Nevada (snv_51) Releases

For example, if an application needs 5 GBytes of memory on a system with 36 GBytes of memory, you could set the ARC maximum to 30 GBytes (0x780000000 or 32212254720 bytes). Set the zfs:zfs_arc_max parameter in the /etc/system file:

set zfs:zfs_arc_max = 0x780000000

or

set zfs:zfs_arc_max = 32212254720
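The byte values above follow from treating a GByte as 2**30 bytes. The following Python sketch reproduces the conversion for any cap; the function name is hypothetical and the helper is only a convenience for producing the /etc/system line.

```python
GBYTE = 1 << 30  # 2**30 bytes, which matches the 30-GByte figures above


def arc_max_setting(arc_max_gbytes: int, hexadecimal: bool = True) -> str:
    """Return the /etc/system line that caps the ARC at arc_max_gbytes."""
    arc_max_bytes = arc_max_gbytes * GBYTE
    value = f"{arc_max_bytes:#x}" if hexadecimal else str(arc_max_bytes)
    return f"set zfs:zfs_arc_max = {value}"


if __name__ == "__main__":
    print(arc_max_setting(30))                     # hexadecimal form
    print(arc_max_setting(30, hexadecimal=False))  # decimal form
```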

Earlier Solaris Releases

You can only change the ARC maximum size by using the mdb command. Because the system is already booted, the ARC init routine has already executed and other ARC size parameters have already been set based on the default c_max size. Therefore, you should tune the arc.c and arc.p values, along with arc.c_max, using the formula:

arc.c = arc.c_max

arc.p = arc.c / 2

For example, to set the ARC parameters to small values, such as arc.c_max to 512 MBytes, while complying with the formula above (arc.c and arc.c_max at 512 MBytes, arc.p at 256 MBytes), use the following syntax:

# mdb -kw
 > arc::print -a p c c_max
ffffffffc00b3260 p = 0xb75e46ff
ffffffffc00b3268 c = 0x11f51f570
ffffffffc00b3278 c_max = 0x3bb708000

 > ffffffffc00b3260/Z 0x10000000
ffffffffc00b3260:  0xb75e46ff        = 0x10000000
 > ffffffffc00b3268/Z 0x20000000
ffffffffc00b3268:  0x11f51f570        = 0x20000000
 > ffffffffc00b3278/Z 0x20000000
ffffffffc00b3278:  0x3bb708000        = 0x20000000
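The three values written in the session above can be derived from the chosen maximum with the formula arc.c = arc.c_max and arc.p = arc.c / 2. The Python sketch below does that arithmetic and prints the corresponding mdb write lines. The function names are hypothetical, and the addresses are the ones shown in the example session; they differ on every system and must be read from the arc::print -a output first.

```python
def arc_values(c_max_bytes: int) -> dict:
    """Apply the formula: c = c_max, p = c / 2."""
    c = c_max_bytes
    return {"p": c // 2, "c": c, "c_max": c_max_bytes}


def mdb_write_lines(addresses: dict, c_max_bytes: int) -> list:
    """Build one '<address>/Z <value>' line per ARC member (8-byte writes)."""
    values = arc_values(c_max_bytes)
    return [f"{addresses[name]}/Z {values[name]:#x}" for name in ("p", "c", "c_max")]


if __name__ == "__main__":
    # Addresses copied from the example session; not valid on other systems.
    example_addresses = {
        "p": "ffffffffc00b3260",
        "c": "ffffffffc00b3268",
        "c_max": "ffffffffc00b3278",
    }
    for line in mdb_write_lines(example_addresses, 512 * 1024 * 1024):
        print(line)
```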

You should verify that the values have been set correctly by examining them again in mdb (using the same print command as in the example). You can also monitor the actual size of the ARC to ensure that it has not exceeded the limit:

# echo "arc::print -d size" | mdb -k

The above command displays the current ARC size in decimal.

You can also use the arcstat script available at http://blogs.sun.com/realneel/entry/zfs_arc_statistics to check the ARC size as well as other ARC statistics.


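Here is a Perl script that you can call from an init script to configure your ARC on boot with the above guidelines:

#!/bin/perl
# arc_tune: set arc.c and arc.c_max to <arc max>, and arc.p to half of it,
# by writing to the live kernel through mdb.

use strict;
use IPC::Open2;

my $arc_max = shift @ARGV;
if ( !defined($arc_max) ) {
        print STDERR "usage: arc_tune <arc max>\n";
        exit -1;
}
# Accept a 0x-prefixed hex value as well as a decimal byte count.
$arc_max = hex($arc_max) if $arc_max =~ /^0x/i;

$| = 1;
my %syms;
my $mdb = "/usr/bin/mdb";
open2(*READ, *WRITE,  "$mdb -kw") || die "cannot execute mdb";

# Collect the address of each member of the arc structure.
print WRITE "arc::print -a\n";
while(<READ>) {
        my $line = $_;

        if ( $line =~ /^ +([a-f0-9]+) (.*) =/ ) {
                $syms{$2} = $1;
        } elsif ( $line =~ /^\}/ ) {
                last;
        }
}
# Refuse to write anything if one of the addresses was not found.
foreach my $field (qw(p c c_max)) {
        die "arc member $field not found\n" if !defined($syms{$field});
}

# set c & c_max to our max; set p to max/2
printf WRITE "%s/Z 0x%x\n", $syms{p}, ( $arc_max / 2 );
print scalar <READ>;
printf WRITE "%s/Z 0x%x\n", $syms{c}, $arc_max;
print scalar <READ>;
printf WRITE "%s/Z 0x%x\n", $syms{c_max}, $arc_max;
print scalar <READ>;

RFEs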
  • ZFS should avoiding growing the ARC into trouble

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6488341

  • The ARC allocates memory inside the kernel cage, preventing DR

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6522017

  • ZFS/ARC should cleanup more after itself

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6424665

  • Each zpool needs to monitor its throughput and throttle heavy writers

Fix integrated into Nevada, build 87. For more information, see this link: New ZFS write throttle

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205

Further Reading

http://blogs.sun.com/roch/entry/does_zfs_really_use_more

http://blogs.sun.com/realneel/entry/zfs_arc_statistics

File-Level Prefetching

ZFS implements a file-level prefetching mechanism labeled zfetch. This mechanism looks at the patterns of reads to files and anticipates some reads, reducing application wait times. The current code needs attention (see the RFEs below) and suffers from two drawbacks:

  • Sequential read patterns made of small reads very often hit in the cache. In this case, the current code consumes a significant amount of CPU time trying to find the next I/O to issue, whereas performance is governed more by the CPU availability.
  • The zfetch code has been observed to limit scalability of some loads.

So, if CPU profiling, by using lockstat(1M) with the -I argument or er_kernel as described here:

http://developers.sun.com/prodtech/cc/articles/perftools.html

shows significant time in zfetch_* functions, or if lock profiling (lockstat(1M)) shows contention around zfetch locks, then disabling file-level prefetching should be considered.

Disabling prefetching can be achieved dynamically or through a setting in the /etc/system file.

If you tune this parameter, please reference this URL in the shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZFETCH

Solaris 10 8/07 and Solaris Nevada (snv_51) Releases

Set dynamically:

echo zfs_prefetch_disable/W0t1 | mdb -kw

Revert to default:

echo zfs_prefetch_disable/W0t0 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_prefetch_disable = 1

Earlier Solaris Releases

Set dynamically:

echo zfetch_array_rd_sz/Z0x0 | mdb -kw

Revert to default:

echo zfetch_array_rd_sz/Z0x100000 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfetch_array_rd_sz = 0
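The dynamic settings in this guide all follow one mdb pattern: a kernel symbol, a write format (/W for a 4-byte value, /Z for an 8-byte value), and a literal that is either decimal with the 0t prefix or hexadecimal with the 0x prefix. The Python sketch below only assembles such command strings so that the pattern is easier to see; the function name and its parameters are hypothetical, and the width that a given tunable needs must be taken from the examples in this guide rather than guessed.

```python
def mdb_set(symbol: str, value: int, width: int = 4, base: int = 10) -> str:
    """Build an 'echo <symbol>/<fmt><literal> | mdb -kw' command string.

    width selects the mdb write format: 4 bytes -> W, 8 bytes -> Z.
    base selects the literal prefix: 10 -> 0t (decimal), 16 -> 0x (hex).
    """
    writer = {4: "W", 8: "Z"}[width]
    literal = f"0t{value}" if base == 10 else f"{value:#x}"
    return f"echo {symbol}/{writer}{literal} | mdb -kw"


if __name__ == "__main__":
    # The two styles used in the prefetch examples of this section.
    print(mdb_set("zfs_prefetch_disable", 1))
    print(mdb_set("zfetch_array_rd_sz", 0x100000, width=8, base=16))
```

Writing a value with the wrong width can overwrite the neighboring kernel variable, so a helper like this should never be used to pick the format on its own.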

RFEs
  • 6412053 zfetch needs some love

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6412053

  • 6579975 dnode_new_blkid should first check as RW_READER

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6579975

Device-Level Prefetching

ZFS does device-level prefetching in addition to file-level prefetching. When ZFS reads a block from a disk, it inflates the I/O size, hoping to pull interesting data or metadata from the disk. Prior to the Solaris Nevada (snv_70) release, the code caused problems for systems with lots of disks because the extra prefetched data can cause congestion on the channel between the storage and the host. Tuning down the prefetching has been effective for OLTP type loads in the past. However, in the Solaris Nevada release, the code now prefetches only metadata, and this is not expected to require any tuning.

If you tune this parameter, please reference this URL in the shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#VDEVPF

No tuning is required for snv_70 and after.

Solaris 10 8/07 and Nevada (snv_53 to snv_69) Releases

Set the following parameter in the /etc/system file:

set zfs:zfs_vdev_cache_bshift = 13

Comments:

  • Setting zfs_vdev_cache_bshift with mdb crashes a system.
  • zfs_vdev_cache_bshift is the base 2 logarithm of the size used to read disks.
  • The default value of 16 means reads are issued in size of 1 << 16 = 64K.
  • A value of 13 means disk reads are padded to 8K.
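Because zfs_vdev_cache_bshift is a base 2 logarithm, the read size it selects is a simple shift. This short Python sketch reproduces the two figures given in the comments; the function name is hypothetical.

```python
def vdev_read_size(bshift: int) -> int:
    """Size in bytes of the inflated disk read for a given bshift value."""
    return 1 << bshift


if __name__ == "__main__":
    for bshift in (16, 13):
        size = vdev_read_size(bshift)
        print(f"bshift {bshift}: {size} bytes ({size // 1024}K)")
```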

For earlier releases, see: http://blogs.sun.com/roch/entry/tuning_the_knobs

RFEs
  • vdev_cache wises up: increase DB performance by 16%

Fix integrated in Nevada, build 70

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6437054

Further Reading

http://blogs.sun.com/erickustarz/entry/vdev_cache_improvements_to_help

Device I/O Queue Size (I/O Concurrency)

ZFS controls the I/O queue depth for a given LUN. The default is 35, which allows common SCSI and SATA disks to reach their maximum throughput under ZFS. However, having 35 concurrent I/Os means that the service times can be inflated. For NVRAM-based storage, this 35-deep queue is not expected to be reached or to play a significant role, so tuning this parameter is expected to be ineffective. For JBOD-type storage, tuning this parameter is expected to help response times at the expense of raw streaming throughput.
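One rough way to see why a deep queue inflates service times is Little's law, which relates the number of outstanding requests to throughput and average time in the system. The Python sketch below applies it to a single disk. The model and the 200 I/Os per second figure are assumptions chosen for illustration and do not come from this guide; only the queue depths 35 and 10 are taken from the text.

```python
def mean_time_in_queue(outstanding_ios: int, ios_per_second: float) -> float:
    """Little's law: average time in system = items in system / throughput.
    Returns seconds per I/O when the queue is kept full."""
    return outstanding_ios / ios_per_second


DISK_IOPS = 200.0  # hypothetical random-I/O rate of one disk (assumption)

if __name__ == "__main__":
    for depth in (35, 10):
        ms = mean_time_in_queue(depth, DISK_IOPS) * 1000
        print(f"queue depth {depth}: about {ms:.0f} ms per I/O")
```

With these assumed numbers, a full 35-deep queue corresponds to roughly 175 ms per I/O and a 10-deep queue to roughly 50 ms, which is the response time gain that the tuning trades against streaming throughput.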

The Solaris Nevada release now has the option of storing the ZIL on separate devices from the main pool. Using separate intent log devices can alleviate the need to tune this parameter for loads that are synchronously write intensive.

If you tune this parameter, please reference this URL in the shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#MAXPEND

Tuning is not expected to be effective for NVRAM-based storage arrays.

Solaris 10 8/07 and Solaris Nevada (snv_53 to snv_69) Releases

Set dynamically:

echo zfs_vdev_max_pending/W0t10 | mdb -kw

Revert to default:

echo zfs_vdev_max_pending/W0t35 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_vdev_max_pending = 10

For earlier Solaris releases, see:

http://blogs.sun.com/roch/entry/tuning_the_knobs

RFEs
  • 6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6471212

Further Reading

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

Cache Flushes

If you've noticed terrible NFS or database performance on a SAN storage array, the problem is not with ZFS, but with the way the disk drivers interact with the storage devices.

ZFS is designed to work with storage devices that manage a disk-level cache. ZFS commonly asks the storage device to ensure that data is safely placed on stable storage by requesting a cache flush. For JBOD storage, this works as designed and without problems. For many NVRAM-based storage arrays, a problem might come up if the array takes the cache flush request and actually does something rather than ignoring it. Some storage arrays flush their caches despite the fact that the NVRAM protection makes those caches as good as stable storage.

ZFS issues infrequent flushes (every 5 seconds or so) after the uberblock updates. These flushes are fairly inconsequential, and no tuning is warranted for them.

ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, and so on). The application waits for this type of flush to complete, which greatly impacts performance. From a performance standpoint, this neutralizes the benefits of having NVRAM-based storage.

The upcoming fix is that the flush request semantic will be qualified to instruct storage devices to ignore the requests if they have the proper protection. This change requires a fix to our disk drivers and for the storage to support the updated semantics.

Since ZFS is not aware of the nature of the storage or of whether NVRAM is present, the best way to fix this issue is to tell the storage to ignore the requests. For more information, see:

http://blogs.digitar.com/jjww/?itemid=44

http://forums.hds.com/index.php?showtopic=497

Please check with your storage vendor for ways to achieve the same thing.

As a last resort, when all LUNs exposed to ZFS come from an NVRAM-protected storage array and procedures ensure that no unprotected LUNs will be added in the future, ZFS can be tuned to not issue the flush requests. If some LUNs exposed to ZFS are not protected by NVRAM, then this tuning can lead to data loss, application level corruption, or even pool corruption.

NOTE: Cache flushing is commonly done as part of the ZIL operations. While disabling cache flushing can, at times, make sense, disabling the ZIL does not.

If you tune this parameter, please reference this URL in the shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH

Solaris 10 11/06 and Solaris Nevada (snv_52) Releases

Set dynamically:

echo zfs_nocacheflush/W0t1 | mdb -kw

Revert to default:

echo zfs_nocacheflush/W0t0 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_nocacheflush = 1

Risk: Some storage might revert to working like a JBOD disk when its battery is low, for instance. Disabling cache flushing can have adverse effects in that situation. Check with your storage vendor.

Earlier Solaris Releases

Set the following parameter in the /etc/system file:

set zfs:zil_noflush = 1

Set dynamically:

echo zil_noflush/W0t1 | mdb -kw

Revert to default:

echo zil_noflush/W0t0 | mdb -kw

Risk: Some storage might revert to working like a JBOD disk when its battery is low, for instance. Disabling cache flushing can have adverse effects in that situation. Check with your storage vendor.

RFEs
  • sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to SBC-2 devices (integrated in snv_74)

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6462690

  • zil shouldn't send write-cache-flush command ...

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6460889

Disabling the ZIL (Don't)

ZIL stands for ZFS Intent Log. It is used during synchronous write operations. The ZIL is an essential part of ZFS and should never be disabled. Significant performance gains can be achieved by not having the ZIL, but that would be at the expense of data integrity. One can be infinitely fast if correctness is not required.

One reason to disable the ZIL is to check if a given workload is significantly impacted by it. A little while ago, a workload that was a heavy consumer of ZIL operations was shown to not be impacted by disabling the ZIL. It convinced us to look elsewhere for improvements. If the ZIL is shown to be a factor in the performance of a workload, more investigation is necessary to see if the ZIL can be improved.

The Solaris Nevada release now has the option of storing the ZIL on separate devices from the main pool. Using separate, possibly low-latency devices for the Intent Log is a great way to improve ZIL-sensitive loads.

Caution: Disabling the ZIL on an NFS server will lead to client side corruption. The ZFS pool integrity itself is not compromised by this tuning.

If you tune this parameter, please reference this URL in the shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZIL

Current Solaris Releases

If you must, then:

echo zil_disable/W0t1 | mdb -kw

Revert to default:

echo zil_disable/W0t0 | mdb -kw

RFEs
  • zil synchronicity

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6280630

Further Reading

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

http://blogs.sun.com/erickustarz/entry/zil_disable

http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

Disabling Metadata Compression

Caution: This tuning needs to be researched as it's now apparent that the tunable applies only to indirect blocks leaving a lot of metadata compressed anyway.

With ZFS, compression of data blocks is under the control of the file system administrator and can be turned on or off by using the command "zfs set compression ...".

On the other hand, ZFS internal metadata is always compressed on disk, by default. For metadata intensive loads, this default is expected to gain some amount of space (a few percent) at the expense of a little extra CPU computation. However, a bigger motivation exists to have metadata compression on. For directories that grow to millions of objects and then shrink to just a few, metadata compression saves large amounts of space (>>10X).

In general, metadata compression can be left as is. If your workload is CPU intensive (say > 80% load), kernel profiling shows that metadata compression is a significant contributor, and you do not expect to create and shrink huge directories, then disabling metadata compression can be attempted with the goal of providing more CPU to handle the workload.

If you tune this parameter, please reference this URL in the shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#METACOMP

Solaris 10 11/06 and Solaris Nevada (snv_52) Releases

Set dynamically:

echo zfs_mdcomp_disable/W0t1 | mdb -kw

Revert to default:

echo zfs_mdcomp_disable/W0t0 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_mdcomp_disable = 1

Earlier Solaris Releases

Not tunable.

RFEs
  • 6391873 metadata compression should be turned back on (Integrated in NEVADA snv_36)

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6391873

Additional ZFS References

  • ZFS Best Practices

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

  • ZFS Dynamics

http://blogs.sun.com/roch/entry/the_dynamics_of_zfs

  • ZFS Links

http://opensolaris.org/os/community/zfs/links/

  • Er_kernel profiling

http://developers.sun.com/prodtech/cc/articles/perftools.html

  • ZFS and Database/OLTP

http://blogs.sun.com/realneel/entry/zfs_and_databases

  • ZFS and Database/OLTP

http://blogs.sun.com/realneel/entry/zfs_and_databases_time_for

  • ZFS and Database/OLTP

http://blogs.sun.com/roch/entry/zfs_and_oltp

  • ZFS and NFS

http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

  • ZFS and Direct I/O

http://blogs.sun.com/roch/entry/zfs_and_directio

  • ZFS Separate Intent Log (SLOG)

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

Integrated RFEs that introduced or changed tunables
  • snv_51 : 6477900 want more /etc/system tunables for ZFS performance analysis
  • snv_52 : 6485204 more tuneable tweakin
  • snv_53 : 6472021 vdev knobs can not be tuned