ZFS for Databases
From Siwiki
Contents |
[edit] ZFS for Databases
There is a lot of interest in using ZFS with databases. Wherever ZFS performance meets the application performance requirements, its simple administration model, extreme robustness, and zero-cost pricing make it extremely attractive! This section discusses what is known about ZFS performance with databases, and outlines its limitations.
[edit] OLTP Tests with ZFS
Some experiments with OLTP workloads using ZFS for Oracle have attained steady-state throughput better than UFS out of the box, but were initially about 12% slower when compared with UFS direct I/O. If you discount the double checksum penalty (both at ZFS level, and database level), this gap gets even closer. As more performance enhancements are integrated, we expect ZFS to outperform UFS in these tests.
Configuration and tuning guidelines for using ZFS with databases as derived from these tests are documented in the following blogs ...
The configuration and tuning guidelines from these blogs are recapped here:
- Match the database block size with ZFS record size.
- If you have a high throughput, write heavy workload, you are better off by separating the log files on a separate pool.
- As an alternate to putting the logs in a separate pool, they might be placed on some other filesystem option such as UFS direct I/O.
- Tuning down the vdev cache (until CR 6437054 is fixed) yields some improvement for OLTP workloads
- Tuning vq max pending (CR 6457709) helps improve performance in some cases.
Since Solaris 10 8/07 (update 4) you can tune vdev cache and vq max pending using /etc/system as described in the ZFS Evil Tuning Guide. For ealier releases one can use the ztune script.
Since ZFS performance is being improved continually, we highly recommend that you use latest OpenSolaris builds. Performance enhancements to OpenSolaris generally make it into Solaris updates every quarter.
[edit] Limitations
When assessing ZFS performance for a particular application, keep these things in mind.
- 1: Variable performance of sequential reads -- Since ZFS does block-level Copy-On-Write for all writes, the physical layout of a file will tend to become random over time for files that are updated randomly. Normal RDBMS operations such as the updating of tables and indexes characteristically result in such random writes. As a consequence, the performance of database queries that do sustained large sequential reads (including "full table scans", "full index scans", or CREATE INDEX operations) can deteriorate over time as the database is updated. Simple ad-hoc or utility queries such as SELECT COUNT (*) from a non-indexed table commonly cause such sustained sequential I/O demand. The worst case impact varies inversely with the ZFS record size, thus posing a tradeoff between the performance of random reads and the performance of sequential reads in relation to the degree of potential fragmentation.
- 2: Variable covariance between random and sequential reads -- If and when sequential reads of fragmented files occur, it may adversely impact other random reads in the workload. In other words, OLTP operations can be adversely impacted in direct proportion to the degree of fragmentation in data that is the subject of sustained large sequential reads, such as untuned or ad-hoc queries, or utility operations.
- 3: Sharability -- ZFS is not at this time a sharable filesystem, at least not in the manner of Shared QFS or VxFS with Oracle Disk Manager (ODM) on shared volumes. Therefore, it cannot be directly used as the shared storage required for a clustered database like Oracle RAC. However, it can be exported via NFS, or as ZVOL volumes which might be accessed via iSCSI. From the database engine's perspective, the net performance will include the performance characteristics of whatever stack is interposed between a client server and the ZFS server. Those intermediate layers may also pose new challenges to delivering end-to-end integrity comparable to that of native ZFS.
- 4: Read+Modify+Write ("RMW") Overhead -- Synchronous writes (such as typically used by databases like Oracle) can require that ZFS fetch the target ZFS record to the ARC to modify it before issuing the actual write operation to disk. The worst case scenario is where the ZFS record size is very large in relation to the database block size. As an example, if the Oracle block size was 8 KB and the ZFS record size was 128 KB, an 8 KB synchronous write might require a 128 KB read followed by a 128 KB write.
- 5: RAID-Z caveat -- Since RAID-Z reads all of its underlying elements for every read request, it is generally not advised for use with databases. Databases tend to rely heavily on efficient random reads, and RAID-Z is a bad match for that requirement.
[edit] Solaris Version and Patching Notes for ZFS Performance
Solaris 10 Update 3 (S10U3 a.k.a 11/06) contains a performance fix that especially helps database performance (CR 6424554 full block re-writes need not read data in). We highly recommend that you upgrade to Solaris 10 Update 3, if you are still using Solaris 10 Update 2.
Solaris 10 Update 4 (S10U4 a.k.a. 8/07) contains numerous ZFS performance fixes and enhancements, and should be regarded as the minimum Solaris level to pursue application-specific performance evaluations. Some of these issues are platform-specific, and others more generic. CR's with fixes in S10U4 or via patch include: 4369168, 4614772, 6197606, 6217718, 6236019, 6295662, 6296436 (in S10 rm patch 124244-02), 6310079, 6335025, 6371967, 6373298, 6374138, 6402533, 6413510, 6421427 (in S10 ATA Driver Patch: 123836-01), 6428639, 6436999, 6458128, 6465634, 6467201, 6470042, 6472230, 6473775, 6477900, 6478388, 6479563, 6484044, 6486390, 6487440, 6493685, 6498799 (in S10U4 since Build 8, but not in any other patch.), 6518397, 6547155, 6562458 (in S10U4 since Build 11, but not in any other patch.) - so the case for using S10U4 as a baseline is quite compelling.
Solaris 10 Update 5 (S10U5 a.k.a. 5/08) ...
Categories: ZFS | Oracle | File System | Database
