Performance features of ZFS
 This is a work in progress
The three basic building blocks of ZFS are
- Copy on write
- Transactional semantics
- End-to-end checksums
ZFS maintains its records as a tree of blocks. Every block in the filesystem is accessible via a single super block called the uberblock. When an existing block is modified, a copy of the data is made, the copy is modified, and the result is written to a new location on disk. This principle, called copy-on-write, means that ZFS never overwrites live data, so there is no window of vulnerability where a system crash during an overwrite can corrupt live data. This also means that there is no need for fsck.
In ZFS, operations that modify the filesystem are bunched together in transactions before being committed to disk. Related changes (for example, write 100 bytes and close the file) are put together into a transaction, and either the whole transaction completes or it fails. Since operations are transactional, individual operations in a transaction can be reordered (as long as this does not affect data integrity) to optimize performance. This also allows operations to be coalesced to maximize performance.
ZFS maintains a checksum for every block in its tree of blocks. The checksum is stored in the parent block. Since it is maintained separately from the data, errors like phantom writes or misdirected writes will not cause data to be silently inconsistent. Moreover, since the checksum is in the parent block, and we must access the parent to get to the desired block, no separate read is needed to fetch the checksum. The links within the tree are thus all guaranteed to be valid, and the filesystem will not panic by following a bad link.
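The checksum-in-parent idea can be sketched as follows. This is a toy model, not ZFS's actual structures: the classes and names are illustrative, and SHA-256 stands in for whatever checksum the pool uses.

```python
import hashlib

# Toy model: the checksum of a child block is stored next to the pointer
# to it in the parent, so a read of the child is verified against
# metadata that lives in a different block.

def checksum(data):
    return hashlib.sha256(data).hexdigest()

class Block:
    def __init__(self, data):
        self.data = data

class Parent:
    def __init__(self):
        self.children = []                  # list of (block, stored checksum)

    def add_child(self, block):
        self.children.append((block, checksum(block.data)))

    def read_child(self, i):
        block, stored = self.children[i]
        if checksum(block.data) != stored:  # phantom/misdirected write caught here
            raise IOError("checksum mismatch")
        return block.data

parent = Parent()
parent.add_child(Block(b"file contents"))
assert parent.read_child(0) == b"file contents"

# Simulate a phantom write: the device claimed success but the data changed.
parent.children[0][0].data = b"garbage"
try:
    parent.read_child(0)
    assert False, "corruption should have been detected"
except IOError:
    pass  # the bad block is caught before its data is used
```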
The major components of ZFS are SPA, DMU, ZAP, and ZPL.
The Storage Pool Allocator (SPA) handles all block allocation and I/O. It abstracts devices into vdevs and provides virtually addressed blocks to the DMU, along with interfaces for allocating and freeing those blocks.
The Data Management Unit (DMU) transforms the virtually addressed blocks into a transactional object interface for the ZPL, and is responsible for maintaining data consistency. The DMU never modifies blocks in place; instead it does a COW of the block and writes the new copy. Since all blocks are children of the uberblock, all indirect blocks between the leaf block and the uberblock (including the uberblock itself) need to be written for the transaction to succeed. Once the intermediate blocks are written, the DMU rewrites the uberblock in one atomic operation, switching atomically from the old tree of blocks to the new tree. Multiple copies of the uberblock are written as insurance against disk failure.
The ZFS POSIX Layer (ZPL) implements a POSIX-compliant filesystem on top of the DMU objects, exporting vnode operations to the system call layer. It also implements features like range locks.
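The COW commit path can be sketched like this. It is a simplified illustration (not the DMU's actual types): modifying a leaf re-writes new copies of every block up to the root, unmodified subtrees are shared, and the final "uberblock" switch is a single pointer update.

```python
# Toy model of a COW tree update: nothing live is ever overwritten.

class Node:
    def __init__(self, children=None, data=None):
        self.children = children or []
        self.data = data

def cow_update(root, path, new_data):
    """Return a new root that shares all unmodified subtrees with the old one."""
    if not path:
        return Node(data=new_data)
    idx = path[0]
    new_children = list(root.children)    # siblings are shared, not copied
    new_children[idx] = cow_update(root.children[idx], path[1:], new_data)
    return Node(children=new_children)    # a fresh copy of each indirect block

old_root = Node(children=[Node(data="a"), Node(data="b")])
new_root = cow_update(old_root, [1], "b2")

assert old_root.children[1].data == "b"                # live data untouched
assert new_root.children[1].data == "b2"
assert new_root.children[0] is old_root.children[0]    # unchanged subtree shared

uberblock = new_root   # the atomic switch: one pointer update commits everything
```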
 ZFS Key performance features
ZFS has several key features that give it very high performance. Some of the main performance wins in ZFS are
- Write sequentialization
- Dynamic Striping
- Parallel three-phase transaction groups
- Intelligent prefetch
- Multiple block sizes
- Sync semantics at async speed
- Concurrent, constant time directory operations
- POSIX compliant concurrent writes
- Explicit IO priority with deadline scheduling
 Write sequentialization
Many filesystems (including ZFS) cache regular writes in memory. In a traditional filesystem, modified data blocks have to be updated in place, so random writes to a file translate to random writes to disk. In ZFS, however, when modified data needs to be sent to disk (as a result of an fsync(3C) or close(2)), the COW nature of ZFS allows it to choose sequential blocks for the data. This has the effect of converting random writes into sequential writes. Since sequential I/O is faster than random I/O due to less head movement, ZFS can drive high bandwidth from the disk. COW filesystems pay an extra penalty for updating the metadata blocks; however, since these updates are all part of a transaction and are also COW, ZFS chooses sequential blocks for the metadata too. Since ZFS metadata is compressed by default, there is less data to write, and since a transaction can contain updates for multiple files, metadata writes are further coalesced. The cost of the extra metadata blocks is more than offset by the improved locality from sequentialization.
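The random-to-sequential conversion can be shown with a trivial bump allocator. This is illustrative only (the real allocation is done inside the SPA): because nothing is updated in place, dirty blocks can be placed wherever allocation is cheapest, so random logical offsets land at sequential physical blocks.

```python
# Toy allocator: every write, whatever its logical offset, is placed at
# the next sequential physical block.

class BumpAllocator:
    def __init__(self):
        self.next_block = 0
        self.mapping = {}                  # logical offset -> physical block

    def write(self, logical_offset, data):
        self.mapping[logical_offset] = self.next_block
        self.next_block += 1               # always the next sequential block

alloc = BumpAllocator()
for off in [700, 13, 420, 5]:              # random logical offsets
    alloc.write(off, b"...")

# The physical blocks issued to disk are strictly sequential.
assert list(alloc.mapping.values()) == [0, 1, 2, 3]
```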
 Dynamic striping
In ZFS you can add new devices to a pool at any time, and the extra bandwidth provided by the new disks is immediately available for applications to use. Jeff Bonwick explains this in his blog. This is mainly enabled by the copy-on-write feature: the more you modify, the more your writes spread across all devices, and the more bandwidth you get.
 Parallel three-phase transaction groups
There are up to three transaction groups active at any given time in ZFS:
- Open: accepting transactions
- Quiescing: waiting for transactions to complete
- Syncing: pushing changes to disk
Every update to ZFS (a data write or a metadata update) is wrapped in a transaction, and each transaction is assigned to a transaction group. Transactions can be assigned to the next open transaction group, the current transaction group, or a specific transaction group. Usually, if an operation is not holding any locks, it is assigned to the open transaction group; if the current open transaction group is full, it waits until a new one opens. This throttles the issue of writes to disk and also prevents burstiness.
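The three-phase pipeline can be modeled as follows. This is a toy sketch of the idea (the real code lives in txg.c): at any instant one group is open, one may be quiescing, one may be syncing, and the groups advance in lock step.

```python
# Toy model of the open -> quiescing -> syncing pipeline.

class TxgPipeline:
    def __init__(self):
        self.open_txg = 1          # the group currently accepting transactions
        self.quiescing = None      # waiting for its transactions to complete
        self.syncing = None        # being pushed to disk

    def assign(self):
        """New operations are assigned to the currently open group."""
        return self.open_txg

    def advance(self):
        """Open the next group; the old one quiesces, then syncs to disk."""
        self.syncing = self.quiescing
        self.quiescing = self.open_txg
        self.open_txg += 1

p = TxgPipeline()
assert p.assign() == 1
p.advance()                        # txg 1 quiesces, txg 2 opens
assert p.assign() == 2 and p.quiescing == 1
p.advance()                        # txg 1 syncs, txg 2 quiesces, txg 3 opens
assert p.syncing == 1 and p.quiescing == 2 and p.assign() == 3
```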
How many operations per transaction group?
 Intelligent prefetch
There are three kinds of prefetching
- metadata prefetch (dmu_prefetch())
- znode prefetch
- vdev cache prefetch
ZFS has a lot of semantic information about how a user is interacting with the filesystem. If the user does a readdir(), it is highly probable that he/she will stat() files in that directory, so the metadata contents of the directory can be prefetched. Similarly, a thread reading through a whole file will benefit from ZFS intelligently prefetching unread blocks in the background. Since ZFS has an I/O scheduler, these prefetch I/Os can be given lower priority, and a prefetch I/O inherits the priority of the actual I/O if the actual read happens while the prefetch is in flight. The prefetching policy is on a per-filesystem basis [VERIFY], and is highly extensible so that in the future workload-specific prefetchers could be integrated.
Metadata prefetch: issued via dmu_prefetch(), this is used by the ZAP, space_map_load(), and zfs_readdir().
znode prefetch: ZFS can detect sequential (increasing and decreasing) and strided (increasing and decreasing) access to files. Since this detection is global, multiple processes reading the same file all benefit from the prefetch. Streams are maintained in an AVL tree. When a read happens, a check is made to see if it matches any existing stream; if so, further checks identify whether it is forward/reverse sequential or strided access, and the actual prefetching is done. dmu_zfetch() is the main entry point. Prefetch is smart enough to figure out that files smaller than the maximum record size (default 128k) consist of a single block, so no prefetch is required.
zfs_prefetch_disable controls whether prefetch is on or off.
UFS only has a single prefetch stream. Consequently, it should perform quite poorly on something like a diff(1) of two files because it can't effectively prefetch them both.
ZFS keeps track, on a per-znode basis, of the last offset that it read from, and it prefetches a fixed amount of data at a fixed distance ahead of the last read.
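The per-file tracking described above can be sketched like this. It is a toy model of the idea behind dmu_zfetch(); the block size, read-ahead distance, and names are made up for illustration.

```python
# Toy model: remember the last offset read, treat a read at exactly the
# expected next offset as a sequential hit, and read ahead on a hit.

BLOCK = 4096
PREFETCH_AHEAD = 4                 # blocks to read ahead on a sequential hit

class Stream:
    def __init__(self):
        self.next_expected = None  # offset the next sequential read would use

    def on_read(self, offset, length):
        """Return the list of block offsets to prefetch, if any."""
        hit = (offset == self.next_expected)
        self.next_expected = offset + length
        if not hit:
            return []              # no pattern (yet): do not prefetch
        return [offset + length + i * BLOCK for i in range(PREFETCH_AHEAD)]

s = Stream()
assert s.on_read(0, BLOCK) == []             # first read: no pattern yet
assert s.on_read(BLOCK, BLOCK) != []         # sequential: fire prefetch I/Os
assert s.on_read(10 * BLOCK, BLOCK) == []    # random jump: no prefetch
```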
6289676: ARC needs traffic control for in-flight I/Os. One of the early interesting prefetch issues is eloquently described in CR 6289676. Originally, in-flight prefetch I/Os were not tracked by the ARC. Take the case of a thread reading data sequentially: ZFS prefetch recognizes the sequential access and fires off a bunch of prefetch I/Os, but the ARC does not know about them. When the thread tries to read the next block, it incurs an ARC miss, which causes the ARC to issue a duplicate I/O! This was fixed by inserting I/Os into the ARC hash table once they have been issued, so that other callers can locate them. If another caller finds an in-progress I/O, it blocks pending the completion of the in-progress I/O instead of issuing both and resolving the conflict later. This prevents double issue of an I/O by the reader and the prefetch code (note that there is no separate prefetch thread).
Care is taken to ensure that prefetched blocks are not granted MFU status until they are read at least twice. [6289686 dbuf_prefetch() disturbs balance of ARC]
When callers of dbuf_prefetch() prefetch lots of blocks which are subsequently read, the ARC has a tendency to become heavily MFU-based. This is because the prefetch is considered one access and the following read a second, implying that the block is going to be frequently used. Solution: introduce an ARC flag which specifies that a particular read is part of a prefetch, and append this flag, if appropriate, to the ARC buf header. If this flag is present when performing a state change operation, check whether the buffer actually needs to remain in the MRU state before switching it to the MFU state.
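A toy sketch of this fix (not the actual arc.c state machine): the prefetch flags the buffer, the first demand read only clears the flag and keeps the buffer MRU, and only a second demand read earns MFU status.

```python
# Toy model of the 6289686 fix: a prefetch does not count toward MFU promotion.

class ArcBuf:
    def __init__(self, prefetched):
        self.prefetch = prefetched     # the ARC flag from the fix
        self.state = "MRU"

    def access(self):
        if self.prefetch:
            self.prefetch = False      # first demand read just clears the flag
        elif self.state == "MRU":
            self.state = "MFU"         # a genuine second access: promote

buf = ArcBuf(prefetched=True)
buf.access()                  # the demand read following the prefetch
assert buf.state == "MRU"     # not promoted on the strength of the prefetch
buf.access()                  # a genuine second read
assert buf.state == "MFU"
```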
How do I know if prefetch is happening?
 Prefetch Parameters [DO NOT USE]
- zfetch_max_streams = 8: max number of streams per zfetch
- zfetch_min_sec_reap = 2: min time before a stream is reclaimed
- zfetch_block_cap = 32: max number of blocks to fetch at a time
- zfetch_array_rd_sz = 1024 * 1024 (1MB): number of bytes in an array_read at which we stop prefetching
reads bigger than zfetch_array_rd_sz are not prefetched.
dmu_zfetch() is the main entry point; dbuf_prefetch() is used to fetch the blocks.
What exactly is a stream?
Is there a separate thread that issues the prefetches? No. As part of zfs_read(), dmu_zfetch() gets called, which in turn fires off a bunch of NO_WAIT I/Os via dbuf_prefetch().
 Multiple block sizes
Most filesystems have a fixed block size. If your data does not match the block size, you get sub-optimal results: if the block is too big there is less metadata but a higher probability of wasted space; if the block is too small there is more metadata. ZFS supports power-of-two block sizes up to 128k, and the choice of block size is automatic by default (there is a manual override). All blocks of a file are the same size. It is not (currently) possible to downgrade the block size of a file after it has been created, but you can always set the recordsize property and copy the file to get this effect.
ZFS algorithm for selecting block sizes
- The initial block size is the smallest supported block size larger than the first write to the file.
- Grow to the next largest block size for the entire file when the total file length increases beyond the current block size (up to the maximum block size).
- Shrink the block size when the entire file will fit in a single smaller block.
ZFS currently supports nine block sizes, from 512 bytes to 128K. Larger block sizes could be supported in the future, but see Roch's blog on why 128k is enough.
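The three rules above can be sketched as a simple function. This is illustrative only: it computes the block size for a file of a given length under the grow/shrink rules, with the nine power-of-two sizes from the text.

```python
# Toy version of the block size selection rules: the block size is the
# smallest supported size that covers the whole file, capped at 128K.

SIZES = [512 << i for i in range(9)]         # 512 .. 128K, nine sizes

def pick_blocksize(file_length):
    for s in SIZES:
        if file_length <= s:
            return s                         # smallest size covering the file
    return SIZES[-1]                         # beyond 128K: multiple max blocks

assert pick_blocksize(100) == 512            # initial size from the first write
assert pick_blocksize(4000) == 4096          # grows as the file length grows
assert pick_blocksize(1 << 20) == 131072     # capped at the 128K maximum
```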
How do I know if a block is being upgraded to the next size?
 Sync semantics at async speed
ZFS is always consistent on disk. However, in the case of a system crash, ZFS would lose the updates since the last committed transaction. To overcome this, and to provide a fast path for synchronous writes, ZFS uses an intent log (called the ZIL). In the intent log we could log only metadata (like UFS), but then recent data writes would be lost. We could log user data and metadata and recover everything (like NFS), but this is slow. We could just write everything to disk and wait for the I/O to complete, but this is very slow. We could log to NVRAM on the I/O bus like the NetApp filers do, but this is very expensive. Ideally we want to log everything to NVRAM, but that is not there yet.
ZFS logs everything via the ZFS intent log. The key feature of the ZIL is that its I/O can be coalesced to achieve greater bandwidth to disk. The ZIL uses all available disks, and hence can utilize the full bandwidth of the pool; there is no dedicated disk for the ZIL.
 Concurrent, constant time directory operations
Large directories need constant-time operations (lookup, create, delete, etc.), and hot directories need concurrent operations. ZFS uses extensible hashing to solve this: block-based, with amortized growth cost, short chains for constant-time operations, and per-block locking for high concurrency. A caveat is that readdir() returns entries in hash order.
Directories are implemented via the ZFS Attribute Processor (ZAP), which can be used to store arbitrary name-value pairs. ZAP uses two algorithms, one optimized for large lists (large directories) and one for small lists (attribute lists).
The ZAP implementation is in zap.c and zap_leaf.c. Each directory is maintained as a table of pointers to constant-sized buckets holding a variable number of entries. Each directory record is 16k in size; when a block gets full, a new block of the next power-of-two size is allocated.
A directory starts off as a microzap, and is upgraded to a fat zap (via mzap_upgrade()) if the length of a name exceeds MZAP_NAME_LEN (MZAP_ENT_LEN - 8 - 4 - 2 = 50) or if the size of the microzap exceeds MZAP_MAX_BLKSZ (128k).
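The upgrade test can be sketched directly from the two conditions above. The constants come from the text; the function itself is a toy, not the mzap_upgrade() logic.

```python
# Toy check: a directory stays a microzap until an entry name is too long
# or the microzap block itself would exceed its maximum size.

MZAP_NAME_LEN = 50               # MZAP_ENT_LEN - 8 - 4 - 2
MZAP_MAX_BLKSZ = 128 * 1024      # 128k

def needs_fatzap(name, mzap_size):
    return len(name) > MZAP_NAME_LEN or mzap_size > MZAP_MAX_BLKSZ

assert not needs_fatzap("README", 4096)        # small dir stays a microzap
assert needs_fatzap("x" * 60, 4096)            # long name forces a fat zap
assert needs_fatzap("short", 256 * 1024)       # oversized microzap forces it
```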
 POSIX compliant concurrent writes
Existing filesystems force tradeoffs between write concurrency and POSIX compliance; this is often termed the single-writer lock. UFS implements a per-file reader/writer lock, which in the default case allows only one thread to update the file at any given time. This lock guarantees write ordering for synchronous writes. That is fine for most files, but in some cases (like databases using preallocated files) it causes a serialization of all writes to the file, creating scaling problems. UFS directio relaxes this constraint, and thus allows much more scalability for applications that do not care about write ordering for synchronous writes.
ZFS eliminates the single-writer lock while guaranteeing write ordering [FIXME] by using range locks. ZFS employs byte-range locking to allow maximum concurrency while satisfying POSIX overlapping-write semantics. Note that there is still a single lock to grow a file, but it is rare for concurrent threads to grow a file at the same time.
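The byte-range locking idea can be sketched as follows. This is a toy model of the concept (the real implementation is considerably more involved): writers to disjoint ranges of the same file proceed concurrently, and only overlapping ranges conflict.

```python
# Toy byte-range lock: exclusive locks on [start, end) intervals.

class RangeLock:
    def __init__(self):
        self.held = []                       # list of (start, end) exclusive ranges

    def try_lock(self, start, end):
        for (s, e) in self.held:
            if start < e and s < end:        # intervals overlap: must wait
                return False
        self.held.append((start, end))
        return True

    def unlock(self, start, end):
        self.held.remove((start, end))

rl = RangeLock()
assert rl.try_lock(0, 4096)                  # writer A
assert rl.try_lock(4096, 8192)               # writer B: disjoint, no blocking
assert not rl.try_lock(2048, 6144)           # overlaps both: would serialize
rl.unlock(0, 4096)
assert rl.try_lock(2048, 4096)               # A's range is free again
```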
 Explicit IO priority with deadline scheduling
ZFS uses quantized deadline scheduling to schedule I/O, a combination of deadline scheduling and a unidirectional elevator algorithm. Each I/O is assigned a priority based on its type.
Drive-by scheduling: when we issue a high-priority I/O to a different region of the disk, we also issue nearby I/Os.
I/O scheduler activations: I/O completion interrupts are treated as an event stream that drives the ZFS I/O scheduler; there are no timeouts, polling, high/low watermarks, flush operations, etc. This enables the I/O scheduler to re-examine device utilization and adjust.
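The quantized-deadline idea can be sketched as a sort key. This is a toy model with made-up constants, not the actual scheduler: priority determines the deadline, deadlines are quantized into coarse buckets, and within a bucket I/Os are ordered by block offset so the elevator sweeps in one direction.

```python
# Toy sort key for quantized deadline scheduling.

QUANTUM = 100                  # ms per deadline bucket (illustrative)

def io_key(issue_time_ms, priority, offset):
    deadline = issue_time_ms + priority * QUANTUM    # higher priority = sooner
    return (deadline // QUANTUM, offset)             # bucket first, then LBA

ios = [
    (0, 3, 900),       # low-priority I/O at a far offset
    (10, 0, 500),      # high-priority I/O
    (20, 0, 100),      # high-priority I/O at a nearer offset
]
order = sorted(ios, key=lambda io: io_key(*io))

# Both high-priority I/Os run first, swept in ascending offset order;
# the low-priority I/O waits despite being issued earliest.
assert [io[2] for io in order] == [100, 500, 900]
```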
The ARC is a modified version of the Megiddo/Modha replacement cache. It caches data from all pools in host memory, and grows and shrinks on demand.
Things to measure: cache hits, cache misses, ARC size, ARC MFU/MRU list lengths, prefetch hits?