Last month we took a walk through some of the major features of three common file systems for Solaris: the Solaris UFS file system, the Veritas VxFS file system, and the LSC QFS file system. This month we start to look at some of the factors that affect file system performance, and how different file system configuration options can affect performance in different ways.

File System Performance

File system performance is often a major component of overall system performance, and is heavily dependent on the nature of the application generating the load. To achieve optimal performance, the underlying file system configuration must be balanced to match the application characteristics.

If you are a developer, then you may already have a good idea of how your application is reading or writing through the file system, but if you are an administrator of an application then you may need to spend some time analyzing the application to understand what type of I/O profile is being presented to the file system.

Once we have a good understanding of the application, we can try to optimize the file system configuration to make the most efficient use of the underlying storage device.

We will only touch on file system caching this month, and leave the bulk of the Solaris caching implementation to next month.

Understanding the Workload Profile

Before we can configure a file system, we need to understand the characteristics of the workload that is going to be using the file system. We begin by looking at a simple breakdown of application workload profiles and how we can determine what type of profile a given application has by tracing the application.

The important characteristics of the workload profile can be grouped into five categories, which are shown in Table 1.

| Characteristic | Values | Description |
| --- | --- | --- |
| File Access Profile | Data or Attribute Intensive? | Does the application read/write/create/delete many small files or does it just read/write within existing files? |
| Access Pattern | Random/Sequential/Strided | Are the read/writes random or sequentially ordered? |
| Bandwidth | Megabytes per second | What is the bandwidth requirement of the application? What is the average and peak rate of data that is read from or written to the file system from the application? |
| IO Size | Bytes | What is the most common I/O size requested? Does it match the block size of the file system? |
| Latency Sensitivity | Milliseconds | Is the application sensitive to read, and especially write latency? |

Table 1. Examples of Application Characteristics

Data or Attribute Intensive?

Data intensive applications are those which shift a lot of data around without creating or deleting many files, whereas attribute intensive applications are those which create and delete a lot of files and only read and write small amounts of data in each. An example of a data intensive workload is a scientific batch program that creates 20 gigabyte files, and an example of an attribute intensive workload is an office automation file server that creates, deletes and stores hundreds of small files, each less than 1 megabyte.

Access Patterns

The access pattern of an application has a lot to do with the amount of optimization that can be done. An application can either be reading or writing sequentially through a file, or may be accessing a file in a random order. Sequential workloads are the easiest to tune, since we can group the I/O's and optimize how we issue I/O's to the underlying storage device.

Another type of access pattern is strided access, which is typically found in scientific applications. Strided workloads are sequential in small groups (perhaps 1 megabyte); the application then seeks within the file and writes another small sequential block. They can be considered mostly sequential when tuning the file system, but have some caching characteristics similar to random workloads.

Bandwidth

The bandwidth of an application characterises the amount of data that the application is shifting to and from its files. In many cases, bandwidth is most useful for capacity planning the underlying storage devices, but there are also some important caching characteristics which are affected by the amount of bandwidth the application uses.

I/O Size

The size of each I/O has a large impact on how the file system should be configured. I/O devices generally are much less efficient with smaller I/O's, and we can use the file system to group small adjacent I/O's into larger transfers to the storage devices. For random workload patterns, the I/O size has some strong interactions with the block size of the file system.

Latency Sensitivity

Some applications are very sensitive to the amount of time taken for each I/O; for example, a database is very sensitive to the amount of time taken to write to its log file. These types of application are the ones that benefit most from well thought out caching strategies.

Data Intensive Sequential Workloads

Sequential workloads are those which perform repetitive I/O's in ascending or descending order, reading or writing sequential file blocks. We typically see sequential patterns when we shift large amounts of data around, for example when we copy a file, read a large portion of a file, or write a large portion of a file.

We can use the truss command to investigate the nature of an application's access pattern by looking at the read, write and lseek system calls. Below, we use the truss command to trace the system calls generated by our application, process id 20432.
# truss -p 20432

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512)      = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512)      = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512)      = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512)      = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512)      = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512)      = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512)      = 512
We use the arguments from the read system calls to determine that the application is reading 512 byte blocks, and because there are no lseek system calls in between each read we can deduce that the application is reading sequentially. When the application requests the first read, the file system reads in the first file system block for the file, and an 8 kilobyte chunk will be read in. This operation requires a physical disk read, which takes in the order of a few milliseconds. The second 512 byte read will simply read the next 512 bytes from the 8k file system block, which is still in memory, and this only takes a few hundred microseconds. We only see 1 physical disk read for each 8k worth of data.

This is a major benefit to the performance of the application, because each disk I/O takes in the order of a few milliseconds, and if we were to wait for a disk I/O every 512 bytes then the application would spend most of its time waiting for the disk. Reading the physical disk in 8k blocks means that the application only needs to wait for a disk I/O every 16 reads rather than every read, reducing the amount of time spent waiting for I/O.

"Read Ahead" helps Sequential Performance

Waiting for a disk I/O every sixteen 512 byte reads is still terribly inefficient, since we might only spend a few hundred microseconds processing these 512 byte blocks, and then spend 10 milliseconds waiting for the next 8 kilobyte block to come from disk. Putting this in perspective, this is the same ratio as catching a bus to travel 10 minutes down the road, getting off the bus, and then waiting 6 hours for the next bus to travel for the next 10 minutes.
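
The proportions here are easy to check with a few lines of Python. This is only a sketch: the 10 millisecond disk wait comes from the text, while the 280 microsecond processing time is our own pick for "a few hundred microseconds", chosen so that the ratio roughly matches the bus example.

```python
# Illustrative figures, not measurements.
BLOCK_SIZE = 8192        # file system block size, bytes
READ_SIZE = 512          # application read size, bytes
DISK_WAIT_US = 10_000    # one physical disk read, microseconds
CPU_US = 280             # assumed processing time per 8k block, microseconds


def reads_per_block(read_size, block_size):
    """Number of application reads satisfied by one physical block read."""
    return block_size // read_size


def wait_fraction(cpu_us, wait_us):
    """Fraction of elapsed time spent waiting for the disk."""
    return wait_us / (cpu_us + wait_us)


print(reads_per_block(READ_SIZE, BLOCK_SIZE), "reads per disk I/O")
print("waiting %.0f%% of the time" % (100 * wait_fraction(CPU_US, DISK_WAIT_US)))
```

With these assumed figures the application spends nearly all of its elapsed time waiting, even though only one read in sixteen goes to the disk.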

File systems are smart enough to be able to work around this problem. Because the access pattern is repeating, the file system can predict that it is very likely you will read the next block, given that you are reading in a sequential order. Most file systems implement an algorithm that does this, commonly known as the "read ahead" algorithm. The read ahead algorithm detects that a file is being read sequentially by looking at the current block being requested, and comparing it to the last block that was requested. If they are adjacent, then the file system can initiate a read for the next few blocks as it reads the current block, and then when we come back for the next block it should have already been read in and we don't need to stop and wait. Revisiting our example, this is analogous to phoning ahead and booking the next bus, so that when we get off the bus the next one is already waiting for us. The only thing that we need to worry about now is that we initiate read ahead of enough blocks at a time so that we rarely catch up and have to stop and wait for a physical disk I/O.

The actual algorithms used in each file system type vary, but they all follow the same principles; they look at the recent access patterns and decide to read ahead a  number of blocks in advance. The number of blocks that are read ahead is usually configurable, and often the defaults are not big enough to provide optimal performance.
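
The principle can be sketched in a few lines. This is a toy model under loud assumptions, not the UFS or VxFS implementation: the class and function names are ours, the prefetch is treated as completing instantly, and nothing is ever evicted from memory.

```python
class ReadAhead:
    """Toy sequential-read detector; not any real file system's code."""

    def __init__(self, cluster_blocks=16):
        self.cluster_blocks = cluster_blocks   # blocks to prefetch
        self.last_block = None                 # last block requested
        self.cached = set()                    # blocks already in memory
        self.disk_waits = 0                    # reads that had to wait

    def read(self, block):
        if block not in self.cached:
            self.disk_waits += 1               # stop and wait for the disk
            self.cached.add(block)
        # Adjacent to the previous request: prefetch the next cluster.
        if self.last_block is not None and block == self.last_block + 1:
            for b in range(block + 1, block + 1 + self.cluster_blocks):
                self.cached.add(b)
        self.last_block = block


def waits_for(blocks, cluster_blocks=16):
    """Count how many reads in the sequence had to wait for the disk."""
    ra = ReadAhead(cluster_blocks)
    for b in blocks:
        ra.read(b)
    return ra.disk_waits


print(waits_for(range(100)), "waits for 100 sequential blocks")
print(waits_for([0, 50, 10, 70, 30]), "waits for 5 scattered blocks")
```

In this model the sequential reader waits only until the pattern is recognised, while the scattered reader waits on every block.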
 

UFS File System Read Ahead

The UFS file system decides when to implement read ahead by keeping track of the last read operation; if the last read and the current read are sequential, then a read ahead of the next sequential series of file system blocks is initiated. Several criteria must be met before UFS read ahead engages.

The UFS file system uses the notion of "cluster size" to describe the number of blocks that are read ahead. This defaulted to seven 8 kilobyte blocks (56 kilobytes) in Solaris versions before Solaris 2.6 (56 kilobytes was the maximum DMA size on ancient I/O bus systems). In Solaris 2.6 the default changed to the maximum transfer size that the underlying device supports, which is sixteen 8 kilobyte blocks (128 kilobytes) on most storage devices.

The default values for read ahead are often inappropriate, and must be set larger to allow optimal read rates. The size of the read ahead cluster should be set very large for high performance sequential access to take advantage of modern I/O systems. Modern I/O systems are capable of very large bandwidth, but the cost for each I/O is still considerable, and as a result we want to choose as large a cluster size as possible to minimize the number of individual I/Os.

We can observe the default behavior of our 512 byte read example by looking at the average size of the I/O's that are reported for the underlying storage device.
# iostat -x 5
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
fd0       0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd6       0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd11     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd12     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd13     49.0 0.0 6272.0    0.0  0.0  3.7   73.7   0  93 
ssd15     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd16     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd17     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
The iostat command shows us that we are issuing 49 read operations per second to the disk ssd13, and averaging 6272 kilobytes per second from the disk. If we divide the transfer rate by the number of I/O's per second, we derive that the average transfer size is 128 kilobytes. This confirms that the default 128k cluster size is grouping the 512 byte read requests into 128 kilobyte groups.
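
The same division is handy to keep as a small helper when scanning iostat output. A minimal sketch follows; the function name is ours, and the inputs are simply the kr/s and r/s columns (or kw/s and w/s for writes).

```python
def avg_io_kb(kb_per_sec, ops_per_sec):
    """Average transfer size in kilobytes; 0 if the device is idle."""
    if ops_per_sec == 0:
        return 0.0
    return kb_per_sec / ops_per_sec


# Values taken from the busy device in the iostat output above.
print(avg_io_kb(6272.0, 49.0), "kilobytes per read")
```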

We can look at the cluster size of a UFS file system by using the mkfs command with the -m option to reveal the current file system parameters. The cluster size or read ahead size is shown by the maxcontig parameter.
# mkfs -m /dev/rdsk/c1t2d3s0

mkfs -F ufs -o nsect=80,ntrack=19,bsize=8192,fragsize=1024,
cgsize=32,free=4,rps=90,nbpi=4136,opt=t,apc=0,gap=0,nrpos=8,
maxcontig=16 /dev/rdsk/c1t2d3s0 3105360

For this file system, we can see that the cluster size is 16 blocks of 8192 bytes, or 128 kilobytes. As an alternative, you can use the fstyp -v command on some file systems.
# fstyp -v /dev/dsk/c1t4d0s2
ufs
magic   11954   format  dynamic time    Sun May 23 16:16:40 1999
sblkno  16      cblkno  24      iblkno  32      dblkno  400
sbsize  2048    cgsize  5120    cgoffset 40     cgmask  0xffffffe0
ncg     86      size    2077080 blocks  2044038
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
frag    8       shift   3       fsbtodb 1
minfree 3%      maxbpg  2048    optim   time
maxcontig 16    rotdelay 0ms    rps     90
csaddr  400     cssize  2048    shift   9       mask    0xfffffe00
ntrak   19      nsect   80      spc     1520    ncyl    2733
cpg     32      bpg     3040    fpg     24320   ipg     2944
nindir  2048    inopb   64      nspf    2
nbfree  190695  ndir    5902    nifree  247279  nffree  304
cgrotor 31      fmod    0       ronly   0
fs_reclaim is not set
file system state is valid, fsclean is 2

Choosing Read Ahead Cluster Sizes

As a general rule of thumb, I like to ensure that no device has to do more than 200 I/O operations per second to achieve maximum bandwidth. This rule allows us to pick the optimal cluster size for file system read ahead, and by keeping the number of I/O operations per second low we save a lot of host CPU time because the operating system does not need to issue as many SCSI requests. This rule provides valid cluster sizes for all the storage devices I have come across to date.

The previous example (shown with iostat) was able to achieve the maximum bandwidth of this 7200 RPM 4 gigabyte disk, needing only 49 I/O operations per second. Today's 10,000 RPM disk drives are capable of 20 megabytes per second, and also work well with the default cluster size of 128k. Some of the more advanced storage devices, such as hardware RAID controllers or software RAID stripes of many disks, are capable of much higher transfer rates. You will typically see 50 - 100 megabytes per second from most modern storage devices, and when we put a file system atop one of these devices we need to use different values for the cluster sizes to achieve efficient read ahead.

Using the 200 operations per second rule, a Sun A5200 fiber storage array that can do 100 megabytes per second would need a 512k cluster size to allow us to saturate the device with 200 I/O operations per second.
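
The rule reduces to one division. The sketch below assumes 1 megabyte is 1024 kilobytes; the function name is ours.

```python
MAX_OPS_PER_SEC = 200   # rule-of-thumb ceiling from the text


def min_cluster_kb(bandwidth_mb_per_sec, max_ops=MAX_OPS_PER_SEC):
    """Smallest cluster size (KB) that reaches the bandwidth within max_ops."""
    return bandwidth_mb_per_sec * 1024 / max_ops


print(min_cluster_kb(100), "KB for a 100 megabyte per second array")
print(min_cluster_kb(20), "KB for a 20 megabyte per second disk")
```

The second figure comes out below 128 kilobytes, which is why the default cluster size is adequate for a single disk.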

Setting cluster sizes for RAID volumes

There are two other important things that we need to consider before we leap in and configure our cluster size larger than 128 kilobytes. The first is that the SCSI drivers in Solaris limit the maximum size of a SCSI transfer to 128 kilobytes by default, and even if we configure the file system to issue 512 kilobyte requests, the SCSI drivers will still break the requests into smaller 128 kilobyte chunks. The same limit applies with volume managers like Solstice DiskSuite and Veritas Volume Manager. Whenever we use a device that requires us to use larger cluster sizes, we need to set the SCSI and volume manager parameters in the /etc/system configuration file to allow bigger transfers. The following changes in /etc/system provide the necessary configuration for larger cluster sizes.
 
*
* Allow larger SCSI I/O transfers, parameter is bytes
*
set maxphys = 2097152

*
* Allow larger DiskSuite I/O transfers, parameter is bytes
*
set md_maxphys = 2097152


*
* Allow larger VxVM I/O transfers, parameter is 512 byte units
*
set vxio:vol_maxio = 4096
The second thing we need to consider is that RAID devices and volumes often consist of several physical devices that are arranged as one larger volume, and we need to pay attention to what happens to these large I/O requests when they arrive at the RAID volume. For example, a simple RAID level 0 stripe can be constructed from 7 disks in a Sun A5200 storage subsystem, and then I/O's are interlaced across each of the 7 devices according to the interlace size or stripe size. (More information on RAID configuration can be found in Brian Wong's RAID configuration column at http://www.sunworld.com/sunworldonline/swol-09-1995/swol-09-raid5.html).

The effect of interlacing the I/O's across separate devices is that we can potentially break up the I/O that comes from the file system into several requests that can occur in parallel. Consider a single 512 kilobyte read request that comes from our file system. When it arrives at a RAID volume that is configured with a 128 kilobyte interlace size, it will be broken into 4 separate 128 kilobyte requests, rather than the 7 that the volume could service in parallel.
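
To see how a request is carved up, here is a small sketch of a RAID 0 interlace. It is a simplification with our own function name: sizes are in kilobytes and disks are numbered from 0 in the order the stripe uses them.

```python
def split_request(offset_kb, length_kb, interlace_kb, ndisks):
    """Return (disk, length_kb) pieces for one request on a RAID 0 stripe."""
    pieces = []
    pos, end = offset_kb, offset_kb + length_kb
    while pos < end:
        chunk = pos // interlace_kb              # index of the stripe unit
        chunk_end = (chunk + 1) * interlace_kb   # where that unit finishes
        piece = min(end, chunk_end) - pos
        pieces.append((chunk % ndisks, piece))
        pos += piece
    return pieces


print(split_request(0, 512, 128, 7))   # a 512 KB request on a 7 disk stripe
```

A 512 kilobyte request touches only four of the seven disks, leaving the other three idle.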

Since we have 7 separate disk devices in our RAID volume, we have the ability to perform 7 I/O operations in parallel, and ideally we should have the file system issue a request that will initiate I/O's on all 7 devices at once. To do this, we would need to initiate an I/O which is the size of the entire stripe width of the RAID volume, so that when it is broken up it is broken up into exactly 7 components. This requires us issuing I/O's that are 7 x 128 kilobytes, or 896 kilobytes each, and to do this we need to set the cluster size to 896 kilobytes.

RAID level 5 is similar, but we need to remember that we only have an effective space of n - 1 devices, so an 8 way RAID level 5 stripe will have the same stripe width and cluster size as a 7 way RAID 0 stripe. The guideline for cluster sizes on RAID 5 devices is to set the cluster size to the number of disks minus 1, multiplied by the interlace size.
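
The cluster size arithmetic for both RAID levels can be sketched as follows. The function names are ours, and the 8 kilobyte block size is the UFS block size used in this article's examples.

```python
FS_BLOCK_KB = 8   # file system block size assumed throughout


def stripe_width_kb(ndisks, interlace_kb, raid_level=0):
    """Data stripe width: all disks for RAID 0, n - 1 disks for RAID 5."""
    data_disks = ndisks - 1 if raid_level == 5 else ndisks
    return data_disks * interlace_kb


def maxcontig_for(width_kb, block_kb=FS_BLOCK_KB):
    """Cluster size expressed as a count of file system blocks."""
    return width_kb // block_kb


width = stripe_width_kb(7, 128)              # 7 way RAID 0, 128 KB interlace
print(width, "KB cluster,", maxcontig_for(width), "blocks")
print(stripe_width_kb(8, 128, raid_level=5), "KB for an 8 way RAID 5")
```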

We can either set the cluster size at the time we create the file system using newfs options, or after the fact by using the tunefs command. To create a file system with an 896 kilobyte cluster size we would use the newfs command with the -C option as follows.
# newfs -C 112 /dev/md/dsk/d20

We can change the cluster size after the file system has been created with the tunefs command.
# tunefs -a 112 /dev/md/dsk/d20

Limitations of UFS Read Ahead

It is important to note that the UFS read ahead algorithms do not differentiate between multiple readers, and hence two processes reading the same file will break the read ahead algorithms.

VxFS File System Read Ahead

The Veritas VxFS file system also implements read ahead, but VxFS uses a different mechanism for setting the read ahead size.

The read ahead size for VxFS is set automatically when using the Veritas Volume Manager at mount time. The mount command queries the volume manager and sets the  read ahead options to suit the underlying volume. Alternatively, the options can be set at mount time by command line options or by entries in the /etc/vx/tunefstab.

The VxFS file system uses a parameter read_pref_io in conjunction with the read_nstream parameter to determine how much data to read ahead. The default read ahead is 64K. The parameter read_nstream reflects the desired number of parallel read requests of size read_pref_io to have outstanding at one time. The file system uses the product of read_nstream multiplied by read_pref_io to determine its read ahead size. The default value for read_nstream is 1.
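
As a quick check of the product, here is a sketch using the two parameter names from the text. The function name is ours, and we assume the 64K default applies to read_pref_io.

```python
def vxfs_read_ahead_bytes(read_pref_io, read_nstream=1):
    """Read ahead size as the product of the two VxFS tunables."""
    return read_pref_io * read_nstream


print(vxfs_read_ahead_bytes(64 * 1024))          # defaults
print(vxfs_read_ahead_bytes(896 * 1024))         # one 896 KB stream
print(vxfs_read_ahead_bytes(128 * 1024, 7))      # seven 128 KB streams
```

The last two lines give the same read ahead size; which split suits a given volume depends on how the underlying stripe is laid out.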

An example shows how to set the read ahead size to 896 kilobytes using the vxtunefs command.
# mount -F vxfs /dev/dsk/c0t3d0s7 /mnt
# vxtunefs -o read_pref_io=917504 /mnt

LSC QFS File System Read Ahead

The QFS file system implements read ahead in a similar manner, and also uses the maxcontig parameter to reflect the number of blocks per cluster for read ahead. The QFS file system maxcontig parameter must be set at mount time, by using the -o maxcontig option.
# mount -o maxcontig=112 samfs1

Storage Device Read Ahead

Modern I/O systems often have some intelligence in the storage device, and prefetching is possible at this level. For example, the Sun A3500 storage controller has comprehensive read ahead options in hardware that control the read ahead size at the controller level, which should be aligned with the cluster size of the file system. These options are described in the A3500 user's guide and the A3500 tuning white paper, available from Sun.

Read ahead with Memory Mapped Files

Memory mapped files invoke different read ahead algorithms, since they bypass the read logic in the file system. Sequential access through a memory mapped file is either detected or forced using MADV_SEQUENTIAL in the memory segment driver that implements mapped files, seg_vn, and read ahead does not use the file system cluster size; it is fixed at 64k.
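
For readers who have not used the sequential hint, the sketch below shows it from an application's point of view using Python's mmap module. It is an illustration only: the madvise method and the MADV_SEQUENTIAL constant are available only on some platforms and Python versions, so the call is guarded, and the 64k chunk size is simply our choice for walking the mapping.

```python
import mmap
import os
import tempfile


def read_mapped_sequentially(path, chunk=64 * 1024):
    """Map a file, hint sequential access, and walk it front to back."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Not every platform or Python version offers the hint.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            for start in range(0, len(m), chunk):
                total += len(m[start:start + chunk])
    return total


# Small demonstration on a scratch file.
fd, scratch = tempfile.mkstemp()
os.write(fd, b"\0" * 200_000)
os.close(fd)
print(read_mapped_sequentially(scratch), "bytes read through the mapping")
os.remove(scratch)
```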

File System Write Behind

If we were to write each I/O synchronously, we would have to wait a relatively long time between processing for each write to complete; in fact, we would most likely spend most of our execution time waiting for the I/O's to complete rather than doing much real work. Unix uses a far more efficient way to process writes: it passes the writes over to the operating system, which allows the application to continue processing. This practice of delayed asynchronous writes is the default way the file systems write data blocks, and synchronous writes are only used when a special file option is set.

Delayed asynchronous writes allow the application to continue to process without having to wait for each I/O, and it also allows the operating system to delay the writes long enough to group together adjacent writes. When we are writing sequentially this allows us to issue fewer larger writes, rather than several smaller writes. As we discussed earlier, it is far more efficient to write a few large writes than many smaller writes.
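
The grouping step can be sketched as a simple merge of adjacent writes. This is a toy model with our own function name, not the algorithm of any particular file system; writes are (offset, length) pairs in bytes, taken in the order they were issued.

```python
def coalesce(writes, max_cluster):
    """Merge adjacent (offset, length) writes into clusters of at most max_cluster bytes."""
    merged = []
    for offset, length in writes:
        if merged:
            last_off, last_len = merged[-1]
            adjacent = last_off + last_len == offset
            if adjacent and last_len + length <= max_cluster:
                merged[-1] = (last_off, last_len + length)
                continue
        merged.append((offset, length))
    return merged


# Sixty-four sequential 8k writes grouped with a 128k cluster size.
small_writes = [(i * 8192, 8192) for i in range(64)]
for offset, length in coalesce(small_writes, 128 * 1024):
    print(offset, length)
```

Sixty-four small writes leave the file system as four large ones, which is the saving that delayed writes make possible.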

UFS File System Write Behind

The UFS file system uses the same cluster size parameter, maxcontig, to control how many writes are grouped together before a physical write is performed. The same guidelines should be followed as for read ahead, and again if a RAID device is used, care should be taken to align the cluster size to the stripe size of the underlying device.

The following example shows the I/O statistics for writes generated by the mkfile command, which issues sequential 8k writes.
# mkfile 500m testfile&
# iostat -x 5
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd49     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd50     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd64     0.0 39.8    0.0 5097.4  0.0 39.5  924.0   0 100  
ssd65     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd66     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd67     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 

The iostat command shows us that we are issuing 39.8 write operations per second to the disk ssd64, and averaging 5097.4 kilobytes per second to the disk. If we divide the transfer rate by the number of I/O's per second, we derive that the average transfer size is 128 kilobytes. This confirms that the default 128k cluster size is grouping the 8 kilobyte write requests into 128 kilobyte writes.

We can change the cluster size of the UFS file system and observe the results quite easily. Let's change the cluster size to 1 megabyte, or 1024k. To do this, we need to set maxcontig to 128, which represents 128 8 kilobyte blocks, or 1 megabyte. We also need to have already set maxphys larger in /etc/system as described earlier.
# tunefs -a 128 /ufs
maximum contiguous block count changes from 16 to 128
# mkfile 500m testfile&
# iostat -x 5
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd49     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd50     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd64     0.2  6.0    1.0 6146.0  0.0  5.5  804.4   0  99  
ssd65     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd66     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
ssd67     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 

We can see now from iostat that we are issuing 6.0 write operations per second to the disk ssd64, and averaging 6146 kilobytes per second to the disk. If we divide the transfer rate by the number of I/O's per second, we derive that the average transfer size is 1024 kilobytes. Our new 1024 kilobyte cluster size is now grouping the 8 kilobyte write requests into 1024 kilobyte write requests.

Note that the UFS clustering algorithm will only work properly if there is 1 process or thread writing to the file at a time. If there is more than one process or thread writing to the same file concurrently, then the delayed write algorithm in UFS will start breaking up the writes in random sizes.

The UFS Write Throttle

The UFS file system starting with Solaris 2.x contains a throttle to limit the amount of unwritten data per file. This prevents any one user from saturating all of memory by limiting the amount of outstanding writes on a file to 384 kilobytes by default.

The default parameters for the UFS write throttle will prevent you from using the full sequential write performance of most disks and storage systems. If you ever have trouble getting a disk, stripe or RAID controller to show up as 100% busy when writing sequentially, the UFS write throttle is the likely cause.

There are two parameters that control the write throttle, the high water mark and the low water mark. The UFS file system suspends writing when the amount of outstanding writes grows larger than the number of bytes in the system variable ufs_HW, and then resumes writing when the amount of writes falls below ufs_LW.
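
The interplay of the two marks can be sketched as a small state machine. This is a toy model, not the UFS source: the class and method names are ours, and the marks used in the demonstration are example values rather than defaults.

```python
class WriteThrottle:
    """Toy model of a per-file write throttle with two water marks."""

    def __init__(self, high_water, low_water):
        self.high_water = high_water
        self.low_water = low_water
        self.outstanding = 0      # bytes queued but not yet on disk
        self.suspended = False    # True while the writer must wait

    def queue(self, nbytes):
        """The application hands nbytes to the file system."""
        self.outstanding += nbytes
        if self.outstanding > self.high_water:
            self.suspended = True

    def complete(self, nbytes):
        """The storage device finishes writing nbytes."""
        self.outstanding -= nbytes
        if self.outstanding < self.low_water:
            self.suspended = False


MB = 1024 * 1024
throttle = WriteThrottle(high_water=16 * MB, low_water=8 * MB)
throttle.queue(17 * MB)
print("suspended after queueing 17 MB:", throttle.suspended)
throttle.complete(10 * MB)
print("suspended after 10 MB reach disk:", throttle.suspended)
```

The gap between the two marks is what stops the writer from flapping between running and suspended on every completed I/O.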

You can increase the UFS write throttle while the system is running, and observe the change in results online, with the adb command.
# adb -kw
physmem 4dd7

ufs_HW/W 0t16777216
ufs_HW:         0x1000000       =       0x1000000
ufs_LW/W 0t8388608
ufs_LW:         0x800000        =       0x800000

You can also set the write throttle permanently in /etc/system. I recommend setting the write throttle high water mark to 1/64th of total memory size, and the low water mark to 1/128th of total memory size, e.g. for a 1 gigabyte machine set ufs_HW to 16777216 and ufs_LW to 8388608.
*
* ufs_LW = 1/128th of memory
* ufs_HW = 1/64th of memory
*
set ufs_LW=8388608
set ufs_HW=16777216
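
For machines of other sizes, the two values follow directly from the memory size. A minimal sketch of the 1/64th and 1/128th recommendation, with a function name of our own:

```python
def write_throttle_marks(memory_bytes):
    """Return (ufs_HW, ufs_LW) as 1/64th and 1/128th of memory."""
    return memory_bytes // 64, memory_bytes // 128


hw, lw = write_throttle_marks(1 << 30)   # a 1 gigabyte machine
print("set ufs_LW=%d" % lw)
print("set ufs_HW=%d" % hw)
```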

No Write Throttle in Veritas VxFS

It should be noted that there is no equivalent write throttle in the Veritas VxFS file system. Caution should be taken when creating large files, as excessive memory paging may result.

LSC QFS File System Write Throttle

The QFS file system from LSC has a similar write throttle to UFS, which can be configured at the time the file system is mounted using the wr_throttle option. The wr_throttle option is the number of kilobytes that can be outstanding before the file system writes are suspended, and can range from 256 to 32768 kilobytes. An example of setting the QFS write throttle is shown.
# mount -o wr_throttle=16384 /qfs1

RAID Level 5 Stripes and Cluster Alignment

Earlier we talked about the importance of matching the cluster size to the stripe width of a storage device, and how this balances the I/O as it is split into several independent requests for each member of the stripe. There is one more important related factor when using RAID level 5 - alignment.
RAID level 5 volumes protect data integrity by calculating parity information and storing that as extra data that can allow a single drive to fail without causing a data loss. Each time we write to a stripe, the parity information is calculated by reading all of the data for a given stripe, recomputing the parity and then writing out the parity information. For example, if we have a 5 wide RAID level 5 stripe with a 128 kilobyte interlace, and we want to write 128 kilobytes to the stripe, we need to read 128 kilobytes of data from each of 4 drives, recompute the parity, write the new parity block, and then write the 128 kilobytes of data. We have to do several reads and writes just to allow a single write.

All of this overhead causes a substantial write penalty, in fact writing to a RAID level 5 volume can be an order of magnitude slower than writing to an equivalent RAID level 0 stripe. This overhead is worst when our write request is smaller than the size of the stripe, since we have to read, modify and write. If we write an exact stripe width, then we only have to write, since we have everything we need to calculate the parity for the entire stripe. Writing stripe width I/Os to a RAID level 5 volume can be substantially faster than partial stripe writes as a result.
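
The reason a full stripe write needs no reads is easiest to see with the parity calculation itself. The sketch below assumes the usual XOR parity scheme for RAID level 5; the function names and the tiny two byte chunks are ours, for illustration only.

```python
from functools import reduce


def parity(chunks):
    """XOR equal-length byte strings together to form the parity chunk."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def rebuild(surviving_chunks, parity_chunk):
    """Recover one missing data chunk from the survivors and the parity."""
    return parity(list(surviving_chunks) + [parity_chunk])


# A full stripe on a 5 wide RAID 5 volume: 4 data chunks plus parity.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]
p = parity(stripe)                        # computed from the new data alone
lost = stripe[2]
print(rebuild(stripe[:2] + stripe[3:], p) == lost)
```

When the whole stripe is in hand, the parity is computed from the data being written; only a partial stripe write has to fetch the chunks it does not already hold.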

Given that full stripe writes are much more efficient, we want to ensure that we write exact stripe width units where possible, and we can enable this by setting the cluster size of the file system to exactly match the stripe width, which as mentioned before is the number of disks minus 1, multiplied by the interlace size. There is however still one catch: even if we write the right size I/O, what happens if we start our write half way through the stripe? If we do this, we end up writing two partial stripes, which as discussed is many times slower than a full stripe write.
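
The effect of a misaligned start can be counted directly. A minimal sketch, with our own function name and all sizes in kilobytes:

```python
def stripes_touched(offset_kb, length_kb, width_kb):
    """Return (full, partial) stripe counts covered by one write."""
    full = partial = 0
    pos, end = offset_kb, offset_kb + length_kb
    while pos < end:
        stripe_start = (pos // width_kb) * width_kb
        stripe_end = stripe_start + width_kb
        piece_end = min(end, stripe_end)
        if pos == stripe_start and piece_end == stripe_end:
            full += 1            # covers the whole stripe: write only
        else:
            partial += 1         # read, modify, write
        pos = piece_end
    return full, partial


WIDTH = 4 * 128   # 5 wide RAID 5 with a 128 KB interlace
print(stripes_touched(0, WIDTH, WIDTH))      # aligned write
print(stripes_touched(256, WIDTH, WIDTH))    # same size, started half way
```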

To overcome this problem, some file systems have an option to align clustered writes with the stripe on a preconfigured boundary. This is known as write alignment, and although UFS does not provide an option to do this, the Veritas and QFS file systems do have options to configure write alignment.

Stripe alignment is most critical on software RAID 5 implementations, since the volume manager has to write each request as it is initiated. Hardware RAID 5 implementations are less of an issue since they have a non-volatile memory cache (NVRAM) that can delay the writes long enough in hardware to correctly re-align each write. If you have a Sun A5000/5200 storage subsystem, or are using a group of independent SCSI disks with Veritas VxVM or DiskSuite RAID 5, then write alignment will buy you a lot of extra performance.

For VxFS you can specify the alignment when the file system is constructed with the mkfs command. The align argument is in bytes.
# mkfs -F vxfs -o align=524288 /dev/vx/dsk/benchvol /mnt

For LSC QFS, you can set the alignment when you build the file system with the -a option. The argument is in kilobytes.
# sammkfs -a 512 samfs1

Data Intensive Random Workloads

We can use the truss command to investigate the nature of an application's access pattern by looking at the read, write and lseek system calls. Below, we use the truss command to trace the system calls generated by our application, process id 19231.
# truss -p 19231
lseek(3, 0x0D780000, SEEK_SET)                  = 0x0D780000
read(3, 0xFFBDF5B0, 8192)                       = 0
lseek(3, 0x0A6D0000, SEEK_SET)                  = 0x0A6D0000
read(3, 0xFFBDF5B0, 8192)                       = 0
lseek(3, 0x0FA58000, SEEK_SET)                  = 0x0FA58000
read(3, 0xFFBDF5B0, 8192)                       = 0
lseek(3, 0x0F79E000, SEEK_SET)                  = 0x0F79E000
read(3, 0xFFBDF5B0, 8192)                       = 0
lseek(3, 0x080E4000, SEEK_SET)                  = 0x080E4000
read(3, 0xFFBDF5B0, 8192)                       = 0
lseek(3, 0x024D4000, SEEK_SET)                  = 0x024D4000
We use the arguments from the read and lseek system calls to determine the size of each I/O and the seek offset at which each read is performed. The lseek system call shows us the offset within the file in hexadecimal, and for our example the first two seeks are to offset 0x0D780000 and 0x0A6D0000, or byte numbers 225968128 and 174915584 respectively. These two addresses appear to be random, and further inspection of the remaining offsets shows us that the reads are indeed completely random. We can also look at the arguments to the read system call and see the size of each read as the third argument, and in our example every read is exactly 8192 bytes, or 8 kilobytes. In summary, in our example we can see that the seek pattern is completely random, and that the file is being read in 8k blocks.
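
Converting the offsets by hand gets tedious on a long trace, so a short script can do it. The sketch below is an assumption-laden helper of our own: it only understands lseek lines with SEEK_SET in the layout shown above, and it calls a trace sequential only if every seek lands exactly one I/O size after the previous one.

```python
import re

# Matches truss lines such as: lseek(3, 0x0D780000, SEEK_SET) = 0x0D780000
LSEEK = re.compile(r"lseek\(\d+, (0x[0-9A-Fa-f]+), SEEK_SET\)")


def seek_offsets(truss_lines):
    """Extract lseek target offsets (as integers) from truss output lines."""
    offsets = []
    for line in truss_lines:
        match = LSEEK.search(line)
        if match:
            offsets.append(int(match.group(1), 16))
    return offsets


def is_sequential(offsets, io_size):
    """True if every seek lands exactly io_size bytes after the previous one."""
    return all(b - a == io_size for a, b in zip(offsets, offsets[1:]))


trace = [
    "lseek(3, 0x0D780000, SEEK_SET)                  = 0x0D780000",
    "read(3, 0xFFBDF5B0, 8192)                       = 0",
    "lseek(3, 0x0A6D0000, SEEK_SET)                  = 0x0A6D0000",
    "read(3, 0xFFBDF5B0, 8192)                       = 0",
    "lseek(3, 0x0FA58000, SEEK_SET)                  = 0x0FA58000",
]
offsets = seek_offsets(trace)
print(offsets)
print("sequential" if is_sequential(offsets, 8192) else "random")
```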
 
Several factors should be considered when configuring a file system for random I/O.

For workloads that include a large proportion of writes, it is very important to ensure that the I/O size is a multiple of the file system block size. A write that is not a multiple of the block size will result in a partial write of a block, which requires the old block to be read, the new contents updated, and then the whole block written out again. This read/modify/write cycle causes a lot of extra I/O's. The I/O size of the application should be chosen to match the block size. Applications that do odd size writes should be modified, where possible, to pad each record out to the nearest block size multiple to eliminate the read/modify/write cycle.
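
Padding a record out to a block multiple is a one line calculation. A minimal sketch, with our own function name and the 8 kilobyte block size from this article's examples as the default:

```python
def padded_size(record_bytes, block_bytes=8192):
    """Round a record size up to the next multiple of the block size."""
    return -(-record_bytes // block_bytes) * block_bytes


for size in (5000, 8192, 8193):
    print(size, "->", padded_size(size))
```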
 
Random I/O workloads often access data in very small blocks (2 kilobytes through 8 kilobytes), and each I/O to or from the storage device requires a seek and an I/O because we are reading only 1 file system block at a time. Each disk I/O takes in the order of a few milliseconds, and while the I/O is occurring the application needs to stall and wait for the I/O to complete. This can represent a large proportion of the application's response time. As a result, caching file system blocks in memory can make a big difference to application performance, since we can avoid many of those expensive and slow I/O's. For example, consider a database that does 3 reads from a storage device to retrieve a customer record from disk. If the database takes 500 microseconds of CPU time to retrieve the record, and spends 3 x 5 ms to read the data from disk, it spends a total of 15.5 milliseconds retrieving the record, and 97% of that time is spent waiting for disk reads.
 
We can dramatically reduce the amount of time spent waiting for I/O's by caching. We can use memory to cache previously read disk blocks, and if that disk block is needed again we simply retrieve it from memory, avoiding the need to go to the storage device again.
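
The database example can be extended to show what a cache buys us. The sketch below is a simplification with our own function names: it assumes a read that hits the cache costs nothing, and it treats the hit ratio as a simple average across reads.

```python
def record_time_ms(reads, disk_ms, cpu_ms, hit_ratio=0.0):
    """Average time to fetch a record when hit_ratio of reads come from cache."""
    return cpu_ms + reads * disk_ms * (1.0 - hit_ratio)


def wait_share(reads, disk_ms, cpu_ms, hit_ratio=0.0):
    """Fraction of the record time spent waiting for disk reads."""
    total = record_time_ms(reads, disk_ms, cpu_ms, hit_ratio)
    return (total - cpu_ms) / total


# The customer record example: 3 reads of 5 ms each, 0.5 ms of CPU time.
for ratio in (0.0, 0.5, 0.9):
    t = record_time_ms(3, 5, 0.5, ratio)
    share = 100 * wait_share(3, 5, 0.5, ratio)
    print("hit ratio %.1f: %.1f ms per record, %.0f%% waiting" % (ratio, t, share))
```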

I'm not going to go into details about random workloads yet, since next month we will be discussing caching in detail.

Summary

So far, we have begun to cover some of the important factors that can affect file system performance, and how the file system parameters affect performance in different ways.

Next month we will be walking through the Solaris file caching implementation, and we will discuss further how the file system uses cache and how it affects file system performance.