ZFS for Databases

From Siwiki


There is a lot of interest in using ZFS with databases. Wherever ZFS performance meets the application performance requirements, its simple administration model, extreme robustness, and zero-cost pricing make it extremely attractive! This section discusses what is known about ZFS performance with databases, and outlines its limitations.

Since ZFS performance is being improved continually, we highly recommend that you use the latest Oracle Solaris Express release.

General Comments

  • ZFS performance on databases is a fast moving target. Keeping up-to-date with Solaris releases is very important.
  • If the database uses a fixed disk block or record size for I/O, set the ZFS recordsize property to match. You can do this on a per-file system basis, even though multiple file systems might share a single pool.
  • With its copy-on-write design, tuning down the ZFS recordsize is a way to improve OLTP performance at the expense of batch reporting queries.
  • ZFS checksums every block stored on disk. This alleviates the need for the database layer to checksum data an additional time. If checksums are computed by ZFS instead of at the database layer, any discrepancy can be caught and fixed before the data is returned to the application.
  • UFS Direct I/O is used to overcome some of the design deficiencies of UFS and to eliminate double buffering of data. In ZFS, the UFS design deficiencies do not exist and ZFS uses the primarycache and secondarycache file system properties to manage buffering data in the ARC. Note that using the secondarycache (L2ARC) property to improve random reads also requires the primarycache property to be enabled.
  • Keep pool space under 80% utilization to maintain pool performance.
  • The following features might impact database performance and are described in the ZFS Evil Tuning Guide:
    • Device I/O Queue Size (I/O Concurrency) - By default, ZFS issues concurrent I/Os to each top-level device, and this can lead to inflated service times. For more information, see Device I/O Queue Size.
    • ZFS does some low-level prefetches of up to 64K for each input block, which can cause saturation of storage channels. For more information, see bug 6437054 (fixed) and this blog [1].
    • Device-Level Prefetching - Using vdev-level prefetches of 8K and between 5 and 10 concurrent I/O was shown to help some database loads. For help tuning this value, see Device-Level_Prefetching.
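
The property settings above can be sketched as follows. This is a minimal example with hypothetical pool and file system names; note that recordsize only affects files created after it is set:

```shell
# Match recordsize to the database I/O size before creating any data files;
# files already on disk keep the record size they were written with.
zfs set recordsize=8k dbpool/oradata

# Buffering is managed per file system through the cache properties.
# The L2ARC (secondarycache) only helps if primarycache is not disabled.
zfs set primarycache=all dbpool/oradata
zfs set secondarycache=all dbpool/oradata

# Watch the CAP column and keep pool utilization under 80%.
zpool list dbpool
```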

Oracle Considerations

  • General Recommendations:
    • For better OLTP performance, match the ZFS recordsize to the Oracle db_block_size.
    • Keep an eye on batch reporting during mixed batch and OLTP activity; a small recordsize can lead to IOPS inflation during batch runs.
    • A mirrored pool or hardware RAID is preferable to a RAIDZ pool. If you do not use a redundant ZFS storage pool, such as a mirrored storage pool, then ZFS can only report, but not repair, data inconsistencies.
    • If you have a busy database with high modification activity, consider the following pool configuration:
      • Primary database pool for data files, such as tables, index, undo, and temp data
      • Redo log pool with a separate log device - A separate pool for redo log gives higher priority to the write activity on the redo log, which can improve database response time.
      • Archive log storage pool - You can use a small internal disk for this purpose. Or, create the archive log file system in the main database pool.
    • If your database activity does not warrant the administrative overhead of 3 separate pools, then consider the following two-pool configuration, possibly using 3 slices on a large shared storage array LUN for both pools:
      • Primary database pool
      • Redo log pool with a separate log device
    • Create separate ZFS file systems as follows:
      • Database tables file system and index file system in the main database pool, using an 8 KB recordsize and the default primarycache value
      • Temp data file system and undo table space file system in the main database pool, using default recordsize and primarycache values
      • Redo log file system in the redo log pool, using default recordsize and primarycache values
      • Archive log file system in the archive pool, enabling compression and default recordsize value and primarycache set to metadata
    • For more information about separate log devices, see separate intent log. Do not confuse ZFS intent log devices with Oracle logs.
    • For Oracle over JBOD storage, using a segregated set of spindles that are not subject to competing reads or writes can help log latency. This in turn can help some workloads, such as those with a high write-to-read ratio at the storage level. For the Oracle log pool, using fast separate devices for the intent log, such as NVRAM, battery-backed DRAM, or 15K RPM drives, appears to be a very good idea.
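
The three-pool layout described above can be sketched as follows. Device, pool, and file system names are hypothetical; adjust them for your configuration:

```shell
# Primary database pool, redo log pool with a separate intent log
# device, and a small archive log pool.
zpool create dbpool mirror c1t0d0 c1t1d0
zpool create logpool mirror c2t0d0 c2t1d0 log c4t0d0
zpool create archpool c3t0d0

# File systems per the recommendations above.
zfs create -o recordsize=8k dbpool/tables    # tables: 8 KB recordsize
zfs create -o recordsize=8k dbpool/index     # indexes: 8 KB recordsize
zfs create dbpool/temp                       # temp: default recordsize
zfs create dbpool/undo                       # undo: default recordsize
zfs create logpool/redo                      # redo: default recordsize
zfs create -o compression=on -o primarycache=metadata archpool/arch
```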

PostgreSQL Considerations

  • Limit the ZFS ARC as described in Memory and Dynamic Reconfiguration Recommendations
  • See the Oracle Considerations section above about using a separate pool for logs
  • Set ZFS recordsize=8K (Note: Do this before any datafile creation)
  • Initialize the database from the log pool, then for each database create a new table space pointing to the data pool
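
A minimal sketch of the steps above, with hypothetical pool and path names (by default a file system is mounted at /poolname/fsname):

```shell
# recordsize must be 8K before any data file is created.
zfs create -o recordsize=8k dbpool/pgdata
zfs create logpool/pgcluster

# Initialize the cluster on the log pool, then point each database's
# table space at the 8 KB-recordsize data file system.
initdb -D /logpool/pgcluster/data
psql -c "CREATE TABLESPACE dataspace LOCATION '/dbpool/pgdata/dataspace';"
```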

MySQL Considerations

  • For better OLTP performance, match the ZFS recordsize to the storage engine block size.
  • InnoDB:
    • Limit the ZFS ARC as described in Memory and Dynamic Reconfiguration Recommendations
    • See Oracle Considerations section above about using a separate pool for logs
    • Use different paths for data and logs (set in my.cnf)
    • Set ZFS recordsize=16K for the InnoDB data files, and leave it at the default for InnoDB logs. (Note: Do this before any datafile creation)
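
A minimal sketch of the InnoDB settings above, with hypothetical pool and file system names:

```shell
# recordsize must be 16K before any InnoDB data file is created;
# the log file system keeps the default recordsize.
zfs create -o recordsize=16k dbpool/innodb
zfs create logpool/innodb-log

# Then point InnoDB at the two paths in my.cnf, for example:
#   [mysqld]
#   innodb_data_home_dir      = /dbpool/innodb
#   innodb_log_group_home_dir = /logpool/innodb-log
```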

MyISAM Considerations

  • Limit the ZFS ARC as described in Memory and Dynamic Reconfiguration Recommendations
  • Create a separate intent log for the log (WAL). If you do not have this feature (that is, you're running a Solaris 10 release), then create at least 2 pools, one for data and one for the log (WAL)
  • Set ZFS recordsize=8K (Note: Do this before any datafile creation)
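
A minimal sketch of the two variants above, with hypothetical device and pool names:

```shell
# With separate intent log support, attach a log device to the pool:
zpool create dbpool mirror c1t0d0 c1t1d0 log c3t0d0

# Without it (for example, on a Solaris 10 release), use two pools:
#   zpool create dbpool mirror c1t0d0 c1t1d0
#   zpool create logpool mirror c2t0d0 c2t1d0

# Set recordsize=8K before any data file is created.
zfs create -o recordsize=8k dbpool/data
```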

See the real results obtained on PostgreSQL and MySQL with the db_STRESS benchmark.

For ZFS/MySQL performance information, see Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7

OLTP Tests with ZFS

In addition to the above configuration and tuning guidelines, see the following links about ongoing ZFS database configuration guidelines and performance testing:

Some experiments with OLTP workloads using ZFS for Oracle have attained steady-state throughput better than UFS out of the box, but were initially about 12% slower than UFS direct I/O. If you discount the double checksum penalty (both at the ZFS level and the database level), the gap narrows further. As more performance enhancements are integrated, we expect ZFS to outperform UFS in these tests.

Limitations

When assessing ZFS performance for a particular application, keep these things in mind.

  1. Variable performance of sequential reads - Since ZFS does block-level copy-on-write for all writes, the physical layout of a file tends to become random over time for files that are updated randomly. Normal RDBMS operations, such as updating tables and indexes, characteristically result in such random writes. As a consequence, the performance of database queries that do sustained large sequential reads (including "full table scans", "full index scans", or CREATE INDEX operations) can deteriorate over time as the database is updated. Simple ad-hoc or utility queries, such as SELECT COUNT(*) from a non-indexed table, commonly cause such sustained sequential I/O demand. The worst-case impact varies inversely with the ZFS record size, posing a tradeoff between the performance of random reads and the performance of sequential reads in relation to the degree of potential fragmentation.
  2. Interference between random and sequential reads - If and when sequential reads of fragmented files occur, they may adversely impact other random reads in the workload. In other words, OLTP operations can be adversely impacted in direct proportion to the degree of fragmentation in data that is the subject of sustained large sequential reads, such as untuned or ad-hoc queries, or utility operations.
  3. Sharability - ZFS is not at this time a sharable file system, at least not in the manner of Shared QFS or VxFS with Oracle Disk Manager (ODM) on shared volumes. Therefore, it cannot be directly used as the shared storage required for a clustered database like Oracle RAC. However, it can be shared by using NFS, or as a ZFS volume, which might be accessed by using iSCSI. From the database engine's perspective, the net performance will include the performance characteristics of whatever stack is interposed between a client server and the ZFS server. Those intermediate layers may also pose new challenges to delivering end-to-end integrity comparable to that of native ZFS.
  4. Read+Modify+Write ("RMW") Overhead - Synchronous writes (as typically used by databases like Oracle) can require that ZFS fetch the target ZFS record into the ARC to modify it before issuing the actual write operation to disk. The worst case is when the ZFS record size is very large in relation to the database block size. For example, if the Oracle block size is 8 KB and the ZFS record size is 128 KB, an 8 KB synchronous write might require a 128 KB read followed by a 128 KB write.
  5. RAIDZ caveat - Since RAIDZ reads all of its underlying elements for every read request, it is generally not advised for use with databases. Databases tend to rely heavily on efficient random reads, and RAIDZ is a bad match for that requirement.
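
The read+modify+write overhead in item 4 can be quantified with a small sketch (the function name is ours; the 8 KB / 128 KB figures come from the example above):

```python
def rmw_amplification(db_block_kb, zfs_record_kb):
    """Bytes moved per byte of payload for one synchronous database write.

    If the ZFS record is no larger than the database block, the block can
    be written directly.  Otherwise ZFS may first read the full record
    into the ARC and then write the full record back out.
    """
    if zfs_record_kb <= db_block_kb:
        return 1.0
    return (zfs_record_kb + zfs_record_kb) / db_block_kb

# An 8 KB Oracle block on a 128 KB ZFS record: 128 KB read plus 128 KB
# write to service 8 KB of payload, i.e. 32x the I/O volume.
print(rmw_amplification(8, 128))   # → 32.0
# Matching recordsize to the block size removes the penalty.
print(rmw_amplification(8, 8))     # → 1.0
```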

Legacy Solaris Version and Patch Information

The Solaris 10 8/07 release contains numerous ZFS performance fixes and enhancements, and should be regarded as the minimum Solaris level for application-specific performance evaluations. Some of these issues are platform-specific, and others are more generic. CRs fixed in Solaris 10 8/07 or via patch include: 4369168, 4614772, 6197606, 6217718, 6236019, 6295662, 6296436 (in S10 rm patch 124244-02), 6310079, 6335025, 6371967, 6373298, 6374138, 6402533, 6413510, 6421427 (in S10 ATA Driver Patch: 123836-01), 6428639, 6436999, 6458128, 6465634, 6467201, 6470042, 6472230, 6473775, 6477900, 6478388, 6479563, 6484044, 6486390, 6487440, 6493685, 6498799, 6518397, 6547155, 6562458
