Last month we took a walk through some of the major features of three
common file systems for Solaris: the Solaris UFS file system, the Veritas
VxFS file system, and the LSC QFS file system. This month we
start to look at some of the factors that affect file system performance,
and how different file system configuration options can affect performance
in different ways.
File System Performance
File system performance is often a major component of overall system performance,
and is heavily dependent on the nature of the application generating the
load. To achieve optimal performance, the underlying file system
configuration must be balanced to match the application characteristics.
If you are a developer, then you may already have a good idea of how
your application is reading or writing through the file system, but if you
are an administrator of an application then you may need to spend some
time analyzing the application to understand what type of I/O profile is
being presented to the file system.
Once we have a good understanding of the application, we can try to
optimize the file system configuration to make the most efficient use of
the underlying storage device. Our objective is to:
-
Reduce the number of I/O's to the underlying device(s) where possible
-
Group smaller I/O's together into larger I/O's where possible
-
Optimize the seek pattern to reduce the amount of time spent waiting for
disk seeks
-
Cache as much data as is realistic to reduce physical I/O's
We will only touch on file system caching this month, and leave the bulk
of the Solaris caching implementation to next month.
Understanding the Workload Profile
Before we can configure a file system, we need to understand the characteristics
of the workload that is going to be using the file system. We begin by
looking at a simple breakdown of application workload profiles and how
we can determine what type of profile a given application has by tracing
the application.
The important characteristics of the workload profile can be grouped
into 5 categories, which are shown in Table 1.
| Characteristic | Values | Description |
| File Access Profile | | Does the application read/write/create/delete many small files or does it just read/write within existing files? |
| Access Pattern | Random/Sequential/Strided | Are the read/writes random or sequentially ordered? |
| Bandwidth | | What is the bandwidth requirement of the application? What is the average and peak rate of data that is read from or written to the file system from the application? |
| I/O Size | | |
| Latency Sensitivity | | Is the application sensitive to read, and especially write latency? |
Table 1. Examples of Application Characteristics
Data or Attribute Intensive?
Data intensive applications are those which shift a lot of data around
without creating or deleting many files, whereas attribute intensive applications
are those which create and delete a lot of files and only read and write
small amounts of data in each. An example of a data intensive workload
is a scientific batch program that creates 20 gigabyte files, and an example
of an attribute intensive workload is an office automation file server
that creates, deletes and stores hundreds of small files, each less than
1 megabyte.
Access Patterns
The access pattern of an application has a lot to do with the amount of
optimization that can be done. An application can either be reading or
writing sequentially through a file, or may be accessing a file in a random
order. Sequential workloads are the easiest to tune, since we can group
the I/O's and optimize how we issue I/O's to the underlying storage device.
Another type of access pattern is strided access, which is typically found
in scientific applications. Strided workloads are sequential in small
groups (perhaps 1 megabyte), then seek within the file and write another
small sequential block. They can be considered mostly sequential when tuning
the file system, but have some caching characteristics similar to random
workloads.
Bandwidth
The bandwidth of an application characterizes the amount of data that the
application is shifting to and from its files. In many cases, bandwidth
is most useful for capacity planning the underlying storage devices, but
there are also some important caching characteristics which are affected
by the amount of bandwidth the application uses.
I/O Size
The size of each I/O has a large impact on how the file system should be
configured. I/O devices generally are much less efficient with smaller
I/O's, and we can use the file system to group small adjacent I/O's into
larger transfers to the storage devices. For random workload patterns,
the I/O size has some strong interactions with the block size of the file
system.
Latency Sensitivity
Some applications are very sensitive to the amount of time taken for each
I/O; for example, a database is very sensitive to the amount of time taken
to write to its log file. These types of application are the ones that
benefit most from well thought out caching strategies.
Data Intensive Sequential Workloads
Sequential workloads are those which perform repetitive
I/O's in ascending or descending order, reading or writing sequential file
blocks. We typically see sequential patterns when we shift large
amounts of data around, for example when we copy a file, read a large
portion of a file, or write a large portion of a file.
We can use the truss command to investigate the nature of an application's
access pattern by looking at the read, write and lseek system calls. Below,
we use the truss command to trace the system calls generated by our application,
process id 20432.
# truss -p 20432
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 512) = 512
We use the arguments from the read system calls to determine that the application
is reading 512 byte blocks, and because there are no lseek system calls
in-between each read we can deduce that the application is reading sequentially.
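The same deduction can be automated. The following Python fragment is a minimal sketch, not a complete truss parser: it assumes the output has been saved as text lines in the format shown above, and the function name classify_trace is hypothetical. It collects the requested size from each read or write call and reports the trace as sequential only if no lseek calls appear.

```python
import re

# Matches truss lines such as: read(3, "\0\0".., 512) = 512
IO_CALL = re.compile(r'^(read|write)\((\d+),.*,\s*(\d+)\)\s*=\s*(-?\d+)')
SEEK_CALL = re.compile(r'^lseek\((\d+),')

def classify_trace(lines):
    """Return (pattern, io_sizes) for a list of truss output lines."""
    sizes = set()
    seeks = 0
    ios = 0
    for line in lines:
        line = line.strip()
        match = IO_CALL.match(line)
        if match:
            ios += 1
            sizes.add(int(match.group(3)))  # third argument: bytes requested
        elif SEEK_CALL.match(line):
            seeks += 1
    if ios == 0:
        return "unknown", sizes
    # No seeks between I/O calls implies each call continues where the last ended.
    pattern = "sequential" if seeks == 0 else "non-sequential"
    return pattern, sizes
```

A non-sequential result only tells us that the application is seeking; the offsets still need to be inspected to tell a random pattern from a strided one.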
When the application requests the first read, the file system reads in
the first file system block for the file, and an 8 kilobyte chunk will
be read in. This operation requires a physical disk read, which takes in
the order of a few milliseconds. The second 512 byte read will simply read
the next 512 bytes from the 8k file system block, which is still in memory,
and this only takes a few hundred microseconds. We only see 1 physical disk
read for each 8k worth of data.
This is a major benefit to the performance of the application, because
each disk I/O takes in the order of a few milliseconds, and if we were
to wait for a disk I/O every 512 bytes then the application would spend
most of its time waiting for the disk. Reading the physical disk in 8k
blocks means that the application only needs to wait for a disk I/O every
16 reads rather than every read, reducing the amount of time spent
waiting for I/O.
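The arithmetic behind the "every 16 reads" figure can be checked with a few lines of Python. This is only an illustrative sketch; the function name physical_reads and the 1 megabyte file size are assumptions chosen for the example.

```python
def physical_reads(app_read_size, fs_block_size, total_bytes):
    """Count application reads and physical block reads for a sequential scan."""
    app_reads = -(-total_bytes // app_read_size)    # ceiling division
    disk_reads = -(-total_bytes // fs_block_size)   # one disk read per block
    return app_reads, disk_reads

app, disk = physical_reads(512, 8192, 1024 * 1024)
print(f"{app} application reads, {disk} disk reads, {app // disk} reads per disk I/O")
```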
"Read Ahead" helps Sequential Performance
Waiting for a disk I/O every sixteen 512 byte reads is still terribly inefficient,
since we might only spend a few hundred microseconds processing these
512 byte blocks, and then spend 10 milliseconds waiting for the next 8
kilobyte block to come from disk. Putting this in perspective, this
is the same ratio as catching a bus to travel 10 minutes down the road,
getting off the bus and then waiting 6 hours for the next bus to
travel for the next 10 minutes.
File systems are smart enough to be able to work around this problem.
Because the access pattern is repeating, the file system can predict that
it is very likely you will read the next block, given that you are reading
in a sequential order. Most file systems implement an algorithm that
does this, commonly known as the "read ahead" algorithm. The read ahead
algorithm detects that a file is being read sequentially by looking at
the current block being requested, and comparing it to the last block that
was requested. If they are adjacent, then the file system can initiate
a read for the next few blocks as it reads the current block, and then
when we come back for the next block it should have already been read in
and we don't need to stop and wait. Revisiting our example, this is analogous
to phoning ahead and booking the next bus, so that when we get off
the bus the next one is already waiting for us.
The only thing that we need to worry about now is that we initiate read ahead
of enough blocks at a time so that we rarely catch up and have to
stop and wait for a physical disk I/O.
The actual algorithms used in each file system type vary, but they all
follow the same principles; they look at the recent access patterns and
decide to read ahead a number of blocks in advance. The number of
blocks that are read ahead is usually configurable, and often the defaults
are not big enough to provide optimal performance.
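The principle can be shown with a toy model in Python. This is a simplified sketch and not the algorithm of any particular file system: it assumes that a prefetched block is always in memory by the time it is requested, and the function name count_stalls is hypothetical. It counts how many block requests have to stop and wait for the disk.

```python
def count_stalls(block_requests, read_ahead_blocks):
    """Count how many requests must wait for a physical read."""
    cached = set()   # blocks already in memory
    last = None      # previously requested block number
    stalls = 0
    for block in block_requests:
        if block not in cached:
            stalls += 1          # miss: the application waits for the disk
            cached.add(block)
        if last is not None and block == last + 1:
            # Current block is adjacent to the last one: prefetch the next few.
            for ahead in range(block + 1, block + 1 + read_ahead_blocks):
                cached.add(ahead)
        last = block
    return stalls

print(count_stalls(range(100), 16))   # sequential scan with read ahead
print(count_stalls(range(100), 0))    # sequential scan without read ahead
```

In this model a sequential scan only stalls until the pattern is recognized, while a random pattern stalls on every request no matter how large the read ahead is.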
UFS File System Read Ahead
The UFS file system decides when to implement read ahead by keeping track
of the last read operation; if the last read and the current read are sequential
then a read ahead of the next sequential series of file system blocks is
initiated. There are several criteria that must be met to engage UFS read
ahead:
-
The last file system read must be sequential with the current one
-
There must be only one concurrent reader of the file (reads from other
processes will break the sequential access pattern for the file)
-
The blocks of the file being read must be laid out sequentially on the
disk
-
The file must be read or written via the read and write system calls;
memory mapped files do not use UFS read ahead
The UFS file system uses the notion of "cluster size" to describe the number
of blocks that are read ahead in advance. This defaults to seven 8 kilobyte
blocks (56 kilobytes) in Solaris versions up until Solaris 2.6 (56 kilobytes
was the maximum DMA size on ancient I/O bus systems), and then in Solaris
2.6 the default changed to be the maximum size transfer that the underlying
device supports, which defaults to sixteen 8 kilobyte blocks (128 kilobytes)
on most storage devices.
The default values for read ahead are often inappropriate, and must
be set larger to allow optimal read rates. The size of the read ahead cluster
should be set very large for high performance sequential access to take
advantage of modern I/O systems. Modern I/O systems are capable of very
large bandwidth, but the cost for each I/O is still considerable, and as
a result we want to choose as large a cluster size as possible to minimize
the number of individual I/Os.
We can observe the default behavior of our 512 byte read example by
looking at the average size of the I/O's that are reported for the underlying
storage device.
# iostat -x 5
device r/s w/s kr/s kw/s wait actv svc_t %w %b
fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd13 49.0 0.0 6272.0 0.0 0.0 3.7 73.7 0 93
ssd15 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd16 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd17 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
The iostat command shows us that we are issuing 49 read operations per
second to the disk ssd13, and averaging 6272 kilobytes per second from
the disk. If we divide the transfer rate by the number of I/O's per
second, we derive that the average transfer size is 128 kilobytes. This
confirms that the default 128k cluster size is grouping the 512 byte read requests
into 128 kilobyte groups.
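The division is easy to script when there are many devices to check. The sketch below assumes the column order of the iostat -x output shown above (device, r/s, w/s, kr/s, kw/s); the function name average_io_kb is hypothetical.

```python
def average_io_kb(iostat_line):
    """Return (device, average read KB, average write KB) for one iostat -x row."""
    fields = iostat_line.split()
    device = fields[0]
    reads, writes, kread, kwritten = (float(f) for f in fields[1:5])
    avg_read = kread / reads if reads else 0.0        # idle devices report 0.0
    avg_write = kwritten / writes if writes else 0.0
    return device, avg_read, avg_write

print(average_io_kb("ssd13 49.0 0.0 6272.0 0.0 0.0 3.7 73.7 0 93"))
# prints ('ssd13', 128.0, 0.0)
```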
We can look at the cluster size of a UFS file system by using the mkfs
command with the -m option to reveal the current file system parameters.
The cluster size or read ahead size is shown by the maxcontig parameter.
# mkfs -m /dev/rdsk/c1t2d3s0
mkfs -F ufs -o nsect=80,ntrack=19,bsize=8192,fragsize=1024,
cgsize=32,free=4,rps=90,nbpi=4136,opt=t,apc=0,gap=0,nrpos=8,
maxcontig=16 /dev/rdsk/c1t2d3s0 3105360
For this file system, we can see that the cluster size is 16 blocks of
8192 bytes, or 128 kilobytes. As an alternative, you can use the fstyp
-v command on some file systems.
# fstyp -v /dev/dsk/c1t4d0s2
ufs
magic 11954 format dynamic time Sun May 23 16:16:40 1999
sblkno 16 cblkno 24 iblkno 32 dblkno 400
sbsize 2048 cgsize 5120 cgoffset 40 cgmask 0xffffffe0
ncg 86 size 2077080 blocks 2044038
bsize 8192 shift 13 mask 0xffffe000
fsize 1024 shift 10 mask 0xfffffc00
frag 8 shift 3 fsbtodb 1
minfree 3% maxbpg 2048 optim time
maxcontig 16 rotdelay 0ms rps 90
csaddr 400 cssize 2048 shift 9 mask 0xfffffe00
ntrak 19 nsect 80 spc 1520 ncyl 2733
cpg 32 bpg 3040 fpg 24320 ipg 2944
nindir 2048 inopb 64 nspf 2
nbfree 190695 ndir 5902 nifree 247279 nffree 304
cgrotor 31 fmod 0 ronly 0
fs_reclaim is not set
file system state is valid, fsclean is 2
Choosing Read Ahead Cluster Sizes
As a general rule of thumb, I like to ensure that no device has to do more
than 200 I/O operations per second to achieve maximum bandwidth. This rule
allows us to pick the optimal cluster size for file system read ahead,
and by keeping the number of I/O operations per second low we save a lot
of host CPU time because the operating system does not need to issue as
many SCSI requests. This rule provides valid cluster sizes for all the
storage devices I have come across to date.
The previous example (shown with iostat) was able to achieve the
maximum bandwidth that this 7200 RPM 4 gigabyte disk could achieve, needing
only 49 I/O operations per second. Today's 10,000 RPM disk drives are capable
of 20 megabytes per second, and also work well with the default cluster
size of 128k. Some of the more advanced storage devices, such as hardware
RAID controllers or software RAID stripes of many disks, are capable of
much higher transfer rates. You will typically see 50 - 100 megabytes per
second from most modern storage devices, and when we put a file system
atop one of these devices we need to use different values for the cluster
sizes to achieve efficient read ahead.
Using the 200 operations per second rule, a Sun A5200 fiber storage
array that can do 100 megabytes per second would need a 512k cluster size
to allow us to saturate the device with 200 I/O operations per second.
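The rule of thumb reduces to a single division. The following Python sketch assumes an 8 kilobyte file system block and treats a megabyte as 1024 kilobytes; the function names are hypothetical.

```python
import math

def cluster_size_kb(bandwidth_mb_per_sec, max_iops=200):
    """Smallest cluster size (KB) that reaches the bandwidth within max_iops."""
    return bandwidth_mb_per_sec * 1024 / max_iops

def maxcontig_blocks(cluster_kb, block_kb=8):
    """Cluster size expressed in file system blocks, rounded up."""
    return math.ceil(cluster_kb / block_kb)

size = cluster_size_kb(100)          # the 100 megabyte per second array
print(size, maxcontig_blocks(size))  # 512.0 KB, or 64 blocks of 8 KB
```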
Setting cluster sizes for RAID volumes
There are two other important things that we need to consider before we
leap in and configure our cluster size larger than 128 kilobytes. The first
is that the SCSI drivers in Solaris limit the maximum size of a SCSI transfer
to 128 kilobytes by default, and even if we configure the file system to
issue 512 kilobyte requests, the SCSI drivers will still break the requests
into smaller 128 kilobyte chunks. The same limit applies with volume managers
like Solstice DiskSuite and Veritas Volume Manager. Whenever we use a
device that requires us to use larger cluster sizes, we need to set the
SCSI and volume manager parameters in the /etc/system configuration file
to allow bigger transfers. The following changes in /etc/system provide
the necessary configuration for larger cluster sizes.
*
* Allow larger SCSI I/O transfers, parameter is bytes
*
set maxphys = 2097152
*
* Allow larger DiskSuite I/O transfers, parameter is bytes
*
set md_maxphys = 2097152
*
* Allow larger VxVM I/O transfers, parameter is 512 byte units
*
set vxio:vol_maxio = 4096
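Note that the three tunables do not share the same units. As a quick cross-check, this Python sketch converts one maximum transfer size in bytes into the value each parameter expects, using the units given in the comments above; the function name transfer_limits is hypothetical.

```python
def transfer_limits(max_transfer_bytes):
    """Express one maximum transfer size in the units each tunable expects."""
    return {
        "maxphys": max_transfer_bytes,                  # bytes
        "md_maxphys": max_transfer_bytes,               # bytes
        "vxio:vol_maxio": max_transfer_bytes // 512,    # 512 byte units
    }

for name, value in transfer_limits(2 * 1024 * 1024).items():
    print(f"set {name} = {value}")
```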
The second thing we need to consider is that RAID devices and volumes often
consist of several physical devices that are arranged as one larger volume,
and we need to pay attention to what happens to these large I/O requests
when they arrive at the RAID volume. For example, a simple RAID level 0
stripe can be constructed from 7 disks in a Sun A5200 storage subsystem,
and then I/O's are interlaced across each of the 7 devices according to
the interlace size or stripe size. (More information on RAID configuration
can be found at Brian Wong's RAID configuration column at
http://www.sunworld.com/sunworldonline/swol-09-1995/swol-09-raid5.html).
The effect of interlacing the I/O's across separate devices is that
we can potentially break up the I/O that comes from the file system into
several requests that can occur in parallel. Consider a single 512 kilobyte
read request that comes from our file system. When it arrives at a RAID
volume that is configured with a 128 kilobyte interlace size, it will be
broken into 4 separate 128 kilobyte requests, rather than 7.
Since we have 7 separate disk devices in our RAID volume, we have the
ability to perform 7 I/O operations in parallel, and ideally we should
have the file system issue a request that will initiate I/O's on all 7
devices at once. To do this, we would need to initiate an I/O which is
the size of the entire stripe width of the RAID volume, so that when it
is broken up it is broken up into exactly 7 components. This requires us
issuing I/O's that are 7 x 128 kilobytes, or 896 kilobytes each, and to
do this we need to set the cluster size to 896 kilobytes.
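To see how a request is divided, here is a small Python model of a RAID 0 stripe. It is a sketch under simple assumptions (chunks are assigned to disks round-robin, and the request starts at the given offset within the volume); the function name split_request is hypothetical.

```python
def split_request(offset_kb, size_kb, interlace_kb, disks):
    """Return a list of (disk, size_kb) pieces for one request to a stripe."""
    pieces = []
    end = offset_kb + size_kb
    pos = offset_kb
    while pos < end:
        chunk_index = pos // interlace_kb
        chunk_end = (chunk_index + 1) * interlace_kb
        length = min(end, chunk_end) - pos
        pieces.append((chunk_index % disks, length))   # round-robin placement
        pos += length
    return pieces

print(len(split_request(0, 512, 128, 7)))   # 4 pieces: only 4 of 7 disks busy
print(len(split_request(0, 896, 128, 7)))   # 7 pieces: one per disk
```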
RAID level 5 is similar, but we need to remember that we only have
effective space on n - 1 devices, so an 8 way RAID level 5 stripe will have
the same stripe width and cluster size as a 7 way RAID 0 stripe. The guidelines
for cluster sizes on the other RAID levels are:
-
RAID level 0, striping - Cluster size = number of stripe members x interlace
size
-
RAID level 1, mirroring - Cluster size = the same as for a single disk
-
RAID level 10, striping + mirroring - Cluster size = number of stripe members
per mirror x interlace size
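These guidelines, together with the n - 1 rule for RAID level 5, can be collected into one helper. The Python below is an illustrative sketch: the function names are hypothetical, the 128 kilobyte single-disk cluster size is taken from the default discussed earlier, and an 8 kilobyte block size is assumed when converting to maxcontig.

```python
BLOCK_KB = 8   # assumed UFS block size

def raid_cluster_kb(level, members, interlace_kb, single_disk_kb=128):
    """Cluster size in kilobytes for a RAID volume.

    members is the number of stripe members (per mirror for RAID 10).
    """
    if level in (0, 10):
        return members * interlace_kb
    if level == 1:
        return single_disk_kb
    if level == 5:
        return (members - 1) * interlace_kb   # one member's worth is parity
    raise ValueError(f"unsupported RAID level: {level}")

def maxcontig(cluster_kb):
    """Cluster size expressed in file system blocks."""
    return cluster_kb // BLOCK_KB

print(raid_cluster_kb(0, 7, 128), maxcontig(raid_cluster_kb(0, 7, 128)))
print(raid_cluster_kb(5, 8, 128), maxcontig(raid_cluster_kb(5, 8, 128)))
```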
We can either set the cluster size at the time we create the file system
using newfs options, or after the fact by using the tunefs command. To
create a file system with a 896 kilobyte cluster size we would use the
newfs command with the -C option as follows.
# newfs -C 112 /dev/md/dsk/d20
We can change the cluster size after the file system has been created with
the tunefs command.
# tunefs -a 112 /dev/md/dsk/d20
Limitations of UFS Read Ahead
It is important to note that the UFS read ahead algorithms do not differentiate
between multiple readers, and hence two processes reading the same file
will break the read ahead algorithms.
VxFS File System Read Ahead
The Veritas VxFS file system also implements read ahead, but VxFS uses
a different mechanism for setting the read ahead size.
The read ahead size for VxFS is set automatically when using the Veritas
Volume Manager at mount time. The mount command queries the volume manager
and sets the read ahead options to suit the underlying volume. Alternatively,
the options can be set at mount time by command line options or by entries
in the /etc/vx/tunefstab.
The VxFS file system uses a parameter read_pref_io in conjunction with
the read_nstream parameter to determine how much data to read ahead. The
default read ahead size is 64K. The parameter read_nstream reflects the
desired number of parallel read requests of size read_pref_io to have outstanding
at one time. The file system uses the product of read_nstream multiplied
by read_pref_io to determine its read ahead size. The default value for
read_nstream is 1.
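As a quick illustration of the product described above, the following Python sketch computes the read ahead size from the two parameters. The function name is hypothetical and the values are in bytes.

```python
def vxfs_read_ahead_bytes(read_pref_io, read_nstream=1):
    """Read ahead size as the product of the two VxFS tunables."""
    return read_pref_io * read_nstream

print(vxfs_read_ahead_bytes(65536) // 1024)       # defaults, in kilobytes
print(vxfs_read_ahead_bytes(131072, 7) // 1024)   # 7 streams of 128 kilobytes
```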
An example shows how to set the read ahead size to 896 kilobytes using
the vxtunefs command.
# mount -F vxfs /dev/dsk/c0t3d0s7 /mnt
# vxtunefs -o read_pref_io=917504 /mnt
LSC QFS File System Read Ahead
The QFS file system implements read ahead in a similar
manner, and also uses the maxcontig parameter to reflect the number
of blocks per cluster for read ahead. The QFS file system maxcontig parameter
must be set at mount time, by using the -o maxcontig option.
# mount -o maxcontig=112 samfs1
Storage Device Read Ahead
Modern I/O systems often have some intelligence in the storage device, and
prefetching is possible at this level. For example, the Sun A3500 storage
controller has comprehensive hardware options to control the read ahead size
performed at the controller level, which should be aligned with the cluster
size of the file system. These options are described in the A3500 users guide
and the A3500 tuning white paper, available from Sun.
Read ahead with Memory Mapped Files
Memory mapped files invoke different read ahead algorithms, since they
bypass the read logic in the file system. Sequential access through a memory
mapped file is either detected or forced using MADV_SEQUENTIAL in the memory
segment driver that implements mapped files, seg_vn, and read ahead does
not use the file system cluster size - it is fixed at 64k.
File System Write Behind
If we were to write each I/O synchronously we would have to wait a relatively
long amount of time in between processing for each write to complete; in
fact we would most likely spend most of our execution time waiting for
the I/O's to complete rather than doing much real work. Unix uses a
far more efficient way to process writes: it passes the writes over to
the operating system, which allows the application to continue processing.
This practice of delayed asynchronous writes is the default way the file
systems write data blocks, and synchronous writes are only used when a
special file option is set.
Delayed asynchronous writes allow the application to continue to process
without having to wait for each I/O, and it also allows the operating system
to delay the writes long enough to group together adjacent writes. When
we are writing sequentially this allows us to issue fewer larger writes,
rather than several smaller writes. As we discussed earlier, it is far
more efficient to write a few large writes than many smaller writes.
UFS File System Write Behind
The UFS file system uses the same cluster size parameter, maxcontig,
to control how many writes are grouped together before a physical write
is performed. The same guidelines should be followed as for read ahead, and
again if a RAID device is used, care should be taken to align the cluster
size to the stripe size of the underlying device.
The following example shows the I/O statistics for writes generated by
the mkfile command, which issues sequential 8k writes.
# mkfile 500m testfile&
# iostat -x 5
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd49 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd64 0.0 39.8 0.0 5097.4 0.0 39.5 924.0 0 100
ssd65 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd66 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd67 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
The iostat command shows us that we are issuing 39.8 write operations
per second to the disk ssd64, and averaging 5097.4 kilobytes per second
to the disk. If we divide the transfer rate by the number of I/O's
per second, we derive that the average transfer size is 128 kilobytes.
This confirms that the default 128k cluster size is grouping the 8 kilobyte write
requests into 128 kilobyte groups.
We can change the cluster size of the UFS file system and observe the
results quite easily. Let's change the cluster size to 1 megabyte, or 1024k.
To do this, we need to set maxcontig to 128, which represents 128 8 kilobyte
blocks, or 1 megabyte. We also need to have already set maxphys larger
in /etc/system as described earlier.
# tunefs -a 128 /ufs
maximum contiguous block count changes from 16 to 128
# mkfile 500m testfile&
# iostat -x 5
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd49 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd64 0.2 6.0 1.0 6146.0 0.0 5.5 804.4 0 99
ssd65 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd66 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
ssd67 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
We can see now from iostat that we are issuing 6.0 write operations
per second to the disk ssd64, and averaging 6146 kilobytes per second to
the disk. If we divide the transfer rate by the number of I/O's per
second, we derive that the average transfer size is 1024 kilobytes. Our
new 1024 kilobyte cluster size is now grouping the 8 kilobyte write requests
into 1024 kilobyte write requests.
Note that the UFS clustering algorithm will only work properly if there
is 1 process or thread writing to the file at a time. If there is more
than one process or thread writing to the same file concurrently, then
the delayed write algorithm in UFS will start breaking up the writes in
random sizes.
The UFS Write Throttle
The UFS file system starting with Solaris 2.x contains a throttle to limit
the amount of unwritten data per file. This prevents any one user from
saturating all of memory by limiting the amount of outstanding writes on
a file to 384 kilobytes by default.
The default parameters for the UFS write throttle will prevent you from
using the full sequential write performance of most disks and storage systems.
If you ever have trouble getting a disk, stripe or RAID controller to show
up as 100% busy when writing sequentially, the UFS write throttle is the
likely cause.
There are two parameters that control the write throttle, the high water
mark and the low water mark. The UFS file system suspends writing when
the amount of outstanding writes grows larger than the number of bytes
in the system variable ufs_HW, and then resumes writing when the amount
of writes falls below ufs_LW.
You can increase the UFS write throttle while the system is running
and observe the change in results on line with the adb command.
# adb -kw
physmem 4dd7
ufs_HW/W 0t16777216
ufs_HW: 0x1000000 = 0x1000000
ufs_LW/W 0t8388608
ufs_LW: 0x800000 = 0x800000
You can also set the write throttle permanently in /etc/system. I recommend
setting the write throttle high water mark to 1/64th of total memory size,
and the low water mark to 1/128th of total memory size, e.g. for a 1 gigabyte
machine set ufs_HW to 16777216 and ufs_LW to 8388608.
*
* ufs_LW = 1/128th of memory
* ufs_HW = 1/64th of memory
*
set ufs_LW=8388608
set ufs_HW=16777216
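The recommended values for other memory sizes follow directly from the two fractions. This Python sketch simply applies the 1/64th and 1/128th rule; the function name is hypothetical.

```python
def ufs_write_throttle(memory_bytes):
    """Return (ufs_HW, ufs_LW) as 1/64th and 1/128th of memory."""
    return memory_bytes // 64, memory_bytes // 128

high, low = ufs_write_throttle(1024 ** 3)   # a 1 gigabyte machine
print(f"set ufs_LW={low}")
print(f"set ufs_HW={high}")
```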
No Write Throttle in Veritas VxFS
It should be noted that there is no equivalent write throttle
in the Veritas VxFS file system. Caution should be taken when creating
large files, as excessive memory paging may result.
LSC QFS File System Write Throttle
The QFS file system from LSC has a similar write throttle
to UFS, which can be configured at the time the file system is mounted
using the wr_throttle option. The wr_throttle option is the number of kilobytes
that can be outstanding before the file system writes are suspended, and
can range from 256 to 32768 kilobytes. An example of setting the QFS write
throttle is shown.
# mount -o wr_throttle=16384 /qfs1
RAID Level 5 Stripes and Cluster Alignment
Earlier we talked about the importance of matching the
cluster size to the stripe width of a storage device, and how this balances
the I/O as it is split into several independent requests for each member
of the stripe. There is one more important related factor when using RAID
level 5 - alignment.
RAID level 5 volumes protect data integrity by calculating
parity information and storing that as extra data that can allow a single
drive to fail without causing a data loss. Each time we write to a stripe,
the parity information is calculated by reading all of the data for a given
stripe, recomputing the parity and then writing out the parity information.
For example, if we have a 5 wide RAID level 5 stripe with a 128 kilobyte
interlace, and we want to write 128 kilobytes to the stripe, we need to
read 128 kilobytes of data from each of 4 drives, recompute the parity,
write the new parity block, and then write the 128 kilobytes of data. We
have to do several reads and writes just to allow a single write.
All of this overhead causes a substantial write penalty,
in fact writing to a RAID level 5 volume can be an order of magnitude slower
than writing to an equivalent RAID level 0 stripe. This overhead is worst
when our write request is smaller than the size of the stripe, since we
have to read, modify and write. If we write an exact stripe width, then
we only have to write, since we have everything we need to calculate the
parity for the entire stripe. Writing stripe width I/Os to a RAID level
5 volume can be substantially faster than partial stripe writes as a result.
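The reason a full stripe write needs no reads can be seen from how parity is computed. The Python sketch below assumes the parity block is the bytewise XOR of the data blocks in the stripe, which is the usual RAID level 5 scheme but is an assumption here and not a description of any particular volume manager; the block contents are made up for the example.

```python
from functools import reduce

def parity(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four data blocks of one stripe on a 5 wide RAID 5 volume (tiny for clarity).
data = [bytes([value] * 4) for value in (1, 2, 4, 8)]

# Full stripe write: parity comes from the data we already hold in memory.
p = parity(data)

# The same XOR rebuilds a lost block from the survivors plus parity.
rebuilt = parity([data[0], data[1], data[3], p])
print(rebuilt == data[2])
```

A partial stripe write does not have all of the data blocks in hand, which is why it must read from disk before the new parity can be calculated.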
Given that full stripe writes are much more efficient, we
want to ensure that we write exact stripe width units where possible, and
we can enable this by setting the cluster size of the file system to exactly
match the stripe width, which as mentioned before is the number of disks
minus 1, multiplied by the interlace size. There is however still one catch:
even if we write the right size I/O, what happens if we start our
write half way through the stripe? If we do this, we end up writing two
partial stripes, which as discussed is many times slower than a full stripe
write.
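The effect of a misaligned start can be counted with a short Python sketch. It assumes stripes begin at offset zero of the volume, and the function name stripe_writes is hypothetical; the 512 kilobyte stripe width corresponds to the 5 wide, 128 kilobyte interlace example.

```python
def stripe_writes(offset_kb, size_kb, stripe_width_kb):
    """Return (full_stripes, partial_stripes) touched by one write."""
    end = offset_kb + size_kb
    first = offset_kb // stripe_width_kb
    last = (end - 1) // stripe_width_kb
    full = partial = 0
    for stripe in range(first, last + 1):
        start = stripe * stripe_width_kb
        covered = min(end, start + stripe_width_kb) - max(offset_kb, start)
        if covered == stripe_width_kb:
            full += 1
        else:
            partial += 1      # needs a read/modify/write of parity
    return full, partial

print(stripe_writes(0, 512, 512))     # aligned: one full stripe write
print(stripe_writes(256, 512, 512))   # misaligned: two partial stripe writes
```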
To overcome this problem, some file systems have an option
to align clustered writes with the stripe on a preconfigured boundary.
This is known as write alignment, and although UFS does not provide an
option to do this, the Veritas and QFS file systems do have options to configure
write alignment.
Stripe alignment is most critical on software RAID 5 implementations,
since the volume manager has to write each request as it is initiated.
Hardware RAID 5 implementations are less of an issue since they have a
non-volatile memory cache (NVRAM) that can delay the writes long enough
in hardware to correctly re-align each write. If you have a Sun A5000/5200
storage subsystem, or are using a group of independent SCSI disks with
Veritas VxVM or DiskSuite RAID 5, then write alignment will buy you a lot
of extra performance.
For VxFS you can specify the alignment when the file system
is constructed with the mkfs command. The align argument is in bytes.
# mkfs -F vxfs -o align=524288 /dev/vx/dsk/benchvol /mnt
For LSC QFS, you can set the alignment when you build the
file system with the -a option. The argument is in kilobytes.
# sammkfs -a 512 samfs1
Data Intensive Random Workloads
We can use the truss command to investigate the nature of
an application's access pattern by looking at the read, write and lseek
system calls. Below, we use the truss command to trace the system calls
generated by our application, process id 19231.
# truss -p 19231
lseek(3, 0x0D780000, SEEK_SET) = 0x0D780000
read(3, 0xFFBDF5B0, 8192) = 0
lseek(3, 0x0A6D0000, SEEK_SET) = 0x0A6D0000
read(3, 0xFFBDF5B0, 8192) = 0
lseek(3, 0x0FA58000, SEEK_SET) = 0x0FA58000
read(3, 0xFFBDF5B0, 8192) = 0
lseek(3, 0x0F79E000, SEEK_SET) = 0x0F79E000
read(3, 0xFFBDF5B0, 8192) = 0
lseek(3, 0x080E4000, SEEK_SET) = 0x080E4000
read(3, 0xFFBDF5B0, 8192) = 0
lseek(3, 0x024D4000, SEEK_SET) = 0x024D4000
We use the arguments from the read and lseek system calls to determine
the size of each I/O and the seek offset at which each read is performed.
The lseek system call shows us the offset within the file in hexadecimal,
and for our example the first two seeks are to offset 0x0D780000 and 0x0A6D0000,
or byte numbers 225968128 and 174915584 respectively. These two addresses
appear to be random, and further inspection of the remaining offsets shows
us that the reads are indeed completely random. We can also look at the
argument to the read system call and see the size of each read as the third
argument, and in our example every read is exactly 8192 bytes, or 8 kilobytes.
In summary, in our example we can see that the seek pattern is completely
random, and that the file is being read in 8k blocks.
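Converting the offsets by hand is error prone, so it can help to let a few lines of Python do it. This sketch uses the six offsets from the trace above; the function name seek_distances is hypothetical. A distance of zero between the end of one read and the start of the next would indicate sequential access.

```python
offsets = [0x0D780000, 0x0A6D0000, 0x0FA58000,
           0x0F79E000, 0x080E4000, 0x024D4000]
io_size = 8192

def seek_distances(offsets, io_size):
    """Bytes between the end of each read and the start of the next."""
    return [nxt - (cur + io_size) for cur, nxt in zip(offsets, offsets[1:])]

for offset in offsets:
    # Show each offset in hex, in bytes, and as a block number.
    print(hex(offset), offset, offset // io_size)

print(seek_distances(offsets, io_size))
```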
There are several factors that should be considered when configuring
a file system for random I/O:
-
Try to match the I/O size and the file system block size
-
Choose an appropriately large file system cache
-
Disable prefetch and read ahead, or limit read ahead to the size of each
I/O
-
Disable file system caching when the application does its own caching,
e.g. databases
It is very important to try to make the I/O size a multiple of the file
system block size for workloads that include a large proportion
of writes. A write to a file system that is not a multiple of the block
size will result in a partial write of a block, which requires the old
block to be read, the new contents updated and then the whole block written
out again. This read/modify/write cycle causes a lot of extra I/O's. The
I/O size of the application should be chosen to match the block size. Applications
that do odd size writes should be modified to pad each record out to the
nearest block size multiple where possible to eliminate the read/modify/write
cycle.
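The padding rule and the condition that triggers a read/modify/write can be written down in a few lines of Python. This is a sketch that assumes an 8 kilobyte file system block; both function names are hypothetical.

```python
def padded_size(record_bytes, block_bytes=8192):
    """Round a record size up to the nearest multiple of the block size."""
    blocks = -(-record_bytes // block_bytes)   # ceiling division
    return blocks * block_bytes

def needs_read_modify_write(offset, size, block_bytes=8192):
    """True if a write covers only part of a file system block."""
    return offset % block_bytes != 0 or size % block_bytes != 0

print(padded_size(5000))                    # a 5000 byte record padded to one block
print(needs_read_modify_write(0, 5000))     # partial block
print(needs_read_modify_write(8192, 8192))  # whole block
```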
Random I/O workloads often access data in very small blocks (2 kilobyte
through 8 kilobyte), and each I/O to/from the storage device requires a
seek and an I/O because we are reading only 1 file system block at a time.
Each disk I/O takes in the order of a few milliseconds, and while the I/O
is occurring the application needs to stall and wait for the I/O to complete.
This can represent a large proportion of the application's response time.
As a result, caching file system blocks into memory can make a big difference
to application performance, since we can avoid many of those expensive
and slow I/O's. For example, consider a database that does 3 reads from
a storage device to retrieve a customer record from disk. If the database
takes 500 microseconds of CPU time to retrieve the record, and spends 3
x 5 ms to read the data from disk, it spends a total of 15.5 milliseconds
to retrieve the record, and 97% of that time is spent waiting for disk reads.
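The same calculation can be repeated for other workloads with a short Python sketch; the function name io_wait_fraction is hypothetical.

```python
def io_wait_fraction(cpu_us, reads, read_ms):
    """Return (total time in ms, fraction of that time spent waiting on disk)."""
    io_us = reads * read_ms * 1000
    total_us = cpu_us + io_us
    return total_us / 1000, io_us / total_us

total_ms, waiting = io_wait_fraction(cpu_us=500, reads=3, read_ms=5)
print(f"{total_ms} ms total, {waiting:.0%} waiting for disk")
```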
We can dramatically reduce the amount of time spent waiting
for I/O's by caching. We can use memory to cache previously read disk blocks,
and if that disk block is needed again we simply retrieve it from memory,
avoiding the need to go to the storage device again.
I'm not going to go into details about random workloads yet, since next
month we will be discussing caching in detail.
Summary
So far, we have begun to cover some of the important factors that can affect
file system performance, and how the file system parameters affect performance
in different ways.
Next month we will be walking through the Solaris file caching implementation,
and we will discuss further how the file system uses cache and how it affects
file system performance.