Application Specific Tuning
 Oracle RDBMS
 System Setup Recommendations for Solaris 8 and Solaris 9
The following settings are categorized as setup recommendations because they do not tune performance. They enable Oracle to install and run for typical loads, as the default values for the System V IPC parameters tend to be too small. Note that the IPC parameters msgmap, msgssz, msgseg and semmap are obsolete in Solaris 8 and beyond. They are ignored if they appear in /etc/system. Note also that shmmin and shmseg are obsolete in Solaris 9 and beyond. For Solaris 8 and 9, in /etc/system:
#
# shared memory parameters
#
set shmsys:shminfo_shmmax = (0.25 * total physical memory, or just 4294967296 on a 32-bit system)
set shmsys:shminfo_shmmni = 128
set shmsys:shminfo_shmseg = 64
#
# semaphore parameters
#
set semsys:seminfo_semmsl = 512
set semsys:seminfo_semmns = 2500
set semsys:seminfo_semmnu = 2500
set semsys:seminfo_semume = 128
set semsys:seminfo_semmni = 128
set semsys:seminfo_semopm = 512
#
# message queues
#
set msgsys:msginfo_msgmax = 65536
set msgsys:msginfo_msgmnb = 65536
set msgsys:msginfo_msgmni = 128
set msgsys:msginfo_msgtql = 8192
#
# Increase sync queues for network packets
#
set sq_max_size = 64
#
# Minimize possibility of TS threads getting starved
# This is for Solaris 8 ONLY!
#
set TS:ts_sleep_promote = 1
#
# Tune fsflushd to stay out of the way
#
set tune_t_fsflushr = 1
set autoup = (300 if 64GB of RAM or less; otherwise 4 x GB of RAM, e.g. 4 x 96 = 384 for a system with 96GB)
The suggestions above do NOT apply to Solaris 10 or OpenSolaris.
 System Setup Recommendations for Solaris 10 and OpenSolaris
In Solaris 10, the System V IPC parameters have been made dynamic. That is, a parameter value can be changed on-the-fly using the new Solaris 10 resource control facilities. Also, many of the older parameters have been obsoleted, and remaining parameters have much larger default values. This information is summarized in the Solaris Tunable Parameters Reference Manual.
If /etc/system entries exist in Solaris 10, and the set values are larger than the new default values, the settings will work system-wide. However, Solaris 10 contains new administrative commands for managing IPC parameters on a per-project basis (projects and tasks were introduced in Solaris 9 as a resource control name space).
Below is an example of using the prctl(1) command to examine the current value of the maximum shared memory segment size (20GB in this example, with a system maximum settable value of 16 Exabytes).
sol10# id -p
uid=0(root) gid=0(root) projid=100(filebench)
sol10# prctl -n project.max-shm-memory -i project 100
project: 100: filebench
NAME                    PRIVILEGE    VALUE  FLAG  ACTION  RECIPIENT
project.max-shm-memory  privileged  20.0GB    -   deny         -
                        system      16.0EB  max   deny         -
sol10# prctl -n project.max-shm-memory -r -v 24gb -i project 100
sol10# prctl -n project.max-shm-memory -i project 100
project: 100: filebench
NAME                    PRIVILEGE    VALUE  FLAG  ACTION  RECIPIENT
project.max-shm-memory  privileged  24.0GB    -   deny         -
                        system      16.0EB  max   deny         -
In the example above, the id(1) command with the "-p" flag shows we are in the filebench project, which has a project ID of 100. The prctl(1) command to fetch the current max-shm-memory parameter indicates it is currently 20GB. We then use prctl(1) to dynamically change max-shm-memory to 24GB, and interrogate it once again to ensure the change took effect (which it did!).
Permanent changes are made using the projadd(1) and projmod(1) commands to add entries for the project in the /etc/project file, ensuring that the desired values will be set if the system ever reboots. For example, to make this permanent, we would create a project entry for the user or project in question.
sol10# projadd -c "My database" -U oracle user.oracle
sol10# projmod -sK "project.max-shm-memory=(privileged,64G,deny)" user.oracle
sol10# su - oracle
oracle$ prctl -n project.max-shm-memory -i project user.oracle
project: 101: user.oracle
NAME                    PRIVILEGE    VALUE  FLAG  ACTION  RECIPIENT
project.max-shm-memory  privileged  64.0GB    -   deny         -
                        system      16.0EB  max   deny         -
 Tuning Recommendations
- File System and I/O Related
- Pick the right file system. Good combinations are:
- Problematic combinations are any filesystems that rely on the OS page cache for buffering
- Always use UFS Direct I/O, NFS Direct I/O or raw disk for the redo logs. Putting Oracle redo logs on cached file systems will cause a hot bottleneck at the file system level. This is the most significant file system / I/O related tuning tip.
- Details on Oracle File System Performance are covered in this older, but still relevant paper on Oracle I/O 
- Optionally, put database tables on Direct I/O as well. Further information is covered in the Direct I/O page.
- Oracle over NFS. Oracle RDBMS works quite well over NFS, provided you are using gigabit ethernet (or faster), Direct I/O is enabled, and you are using Solaris 9 or later. The following paper covers Oracle on NFS studies done in our PAE group: 
- Oracle Related
- Use the largest SGA/block buffer cache possible to maximize Oracle cache hit rates. We have tested up to 540GB SGA with the 64-bit version of Oracle.
- Use the 64-bit version of RDBMS. This allows SGA larger than 3.75 GB, which is almost always beneficial when there is sufficient RAM.
- For Solaris 8, ensure that the scheduler starvation prevention parameter is set (this is the default for Solaris 9 and later). Set the following in /etc/system
set TS:ts_sleep_promote = 1
- For optimal performance, use the Solaris FX Scheduler Class
- This is available on Solaris 9 or later.
- This fixed priority scheduler provides longer quanta on CPU and reduces the context switching that results from the default behaviour of the time-share (TS) scheduler.
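As a sketch, the FX class can be applied at runtime with priocntl(1); the project ID 100 here is a hypothetical example (use the project your Oracle processes actually run in):

```
# Move all processes in project 100 into the FX scheduling class
# at the default priority and time quantum
priocntl -s -c FX -i projid 100
```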
- For Solaris 8 and 9, tune the file system flush daemon. (NOTE: Since early 2006 patch levels, tune_t_fsflushr now defaults to 1, so it should no longer be explicitly set. Any system with the default otherwise is now severely out-of-date on patches!)
#
# Tune fsflushd to stay out of the way ...
#
# Only set tune_t_fsflushr if it is not 1 by default.
#
set tune_t_fsflushr = 1
#
# For autoup, if 64GB of RAM or less, use 300. Otherwise, use 4 x GB of RAM.
# For example, for a system with 96GB of RAM, 4 x 96 = 384.
#
set autoup = <number>
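The autoup rule above is simple arithmetic; the helper function below is a hypothetical convenience for computing the value, not part of any Solaris tool:

```shell
# Compute the recommended autoup value from RAM size in GB:
# 300 for systems with 64GB or less, otherwise 4 x GB of RAM.
autoup_for_ram() {
    ram_gb=$1
    if [ "$ram_gb" -le 64 ]; then
        echo 300
    else
        echo $((4 * ram_gb))
    fi
}

autoup_for_ram 96    # 4 x 96 = 384
```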
- Hardware/Platform Specific
- For T1/2000, in /etc/system add set consistent_coloring=2 or ensure that patch 118833-03 or later has been applied.
 Solaris Optimizations for Oracle
Here's a quick summary of some of the important optimizations we've done for Oracle specifically:
- Containers: zones licensing for processor pools. To facilitate server consolidation, you can use Solaris Containers to partition the CPU resources. For example, 4 CPUs could be given to each of two containers using processor pools. Furthermore, there is [Oracle licensing support] for Zones, so that only the number of processors in each zone need be licensed.
- Resource manager: Fair Share Scheduler supported for consolidation. The Solaris Fair Share Scheduler can be used to prioritize the CPU resources between instances of Oracle. The Fair Share scheduler offers fine grained allocation of resources based on CPU ratios (e.g. shares of 10/90 will guarantee at least 10% and 90% respectively to each instance).
- DR: Ability to add/remove memory. Solaris allows physical RAM to be added dynamically. The file system cache will grow/shrink automatically based on the additional RAM. Furthermore, the Dynamic ISM feature allows the Oracle SGA to grow and shrink dynamically using SQL commands. The grow/shrink may also be automated using the Resource Configuration Manager.
- UFS Direct I/O: Optimized database I/O performance. Solaris 8 Concurrent Direct I/O - A new implementation of Direct I/O which eliminates the single writer lock in typical database usage was made available in Solaris 8 update 3. This allows the UFS file system to perform at near raw disk performance when used with the Oracle database.
- libaio: async I/O library for database data and log writes. Oracle performs writes in batches or lists, which are submitted to the operating system using the listio() functionality. The listio() asynchronous I/O facility was added specifically to optimize database performance.
- KAIO: Kernelized async I/O, and its many enhancements. When using raw disks, kernel async I/O allows a list of I/O requests to be passed directly into the kernel in a single system call, reducing the number of system calls. A substantial reduction in latency and instruction path length was gained from this feature.
- Efficient listio reaping (aio_waitn). The [aio_waitn()] functionality allows lists of I/Os to be waited for and reaped in a single system call.
- Timed semaphore wait (semtimedop). This new system call allows Oracle to wait on a single event with a timeout, rather than having to arm and fire a timer, saving several system calls.
- Scheduler Activations
- Large pages
- Giant pages for newer platforms
- Large pages for Oracle binaries/libraries
- MPO APIs
- Intimate shared memory for databases
- FX: Fixed priority scheduler for databases
 Oracle's use of the "times()" syscall
"Why does Oracle call times() so often? Is something broken?"
When using truss or dtrace to profile Oracle shadow processes, one often sees a lot of calls to times(). Sysadmins often get worried and think something is wrong, but Oracle is working as designed and the impact is minimal.
root@catscratchb> truss -cp 7700
^C
syscall               seconds   calls  errors
read                     .002     120
write                    .008     210
times                    .053   10810
semctl                   .000      17
semop                    .000       8
semtimedop               .000       9
mmap                     .003      68
munmap                   .003       5
yield                    .002     231
pread                    .150    2002
kaio                     .003      68
kaio                     .001      68
                     --------  ------  ----
sys totals:              .230   13616     0
usr time:               1.127
elapsed:               22.810
At first glance it seems alarming to have so many times() calls, but how much does this really affect performance? This question is best answered by looking at the overall elapsed and CPU time. Below is output from the "procsystime" tool included in the [DTrace Toolkit].
root@catscratchb> ./procsystime -Teco -p 7700
Hit Ctrl-C to stop sampling...
^C
Elapsed Times for PID 7700,
         SYSCALL          TIME (ns)
            mmap           17615703
           write           21187750
          munmap           21671772
           times           90733199   <<== Only 0.28% of elapsed time
          semsys          188622081
            read          226475874
           yield          522057977
           pread        31204749076
          TOTAL:        32293113432

CPU Times for PID 7700,
         SYSCALL          TIME (ns)
          semsys            1346101
           yield            3283406
            read            7511421
            mmap           16701455
           write           19616610
          munmap           21576890
           times           33477300   <<== 10.6% of CPU time for the times syscall
           pread          211710238
          TOTAL:          315223421

Syscall Counts for PID 7700,
         SYSCALL              COUNT
          munmap                 17
          semsys                 84
            read                349
            mmap                350
           yield                381
           write                540
           pread               3921
           times              24985   <<== 81.6% of syscalls.
          TOTAL:              30627
According to the profile above, the times() syscall accounts for only 0.28% of the overall response time. It does use 10.6% of system CPU time. The usr/sys CPU split is 83/17 for this application, so using the 17% system CPU we can calculate the overall share of CPU spent in times(): 100 * (0.17 * 0.106) = 1.8%.
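The arithmetic above can be checked directly; this is plain calculation on the measured percentages, not a Solaris tool:

```shell
# 17% of CPU is system time; times() is 10.6% of that system time.
# Overall CPU share of times() = 100 * 0.17 * 0.106
awk 'BEGIN { printf "%.1f%%\n", 100 * 0.17 * 0.106 }'
# prints 1.8%
```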
Oracle uses the times() syscall to keep track of timed performance statistics. Timed statistics can be enabled/disabled by setting the init.ora parameter "TIMED_STATISTICS=TRUE". In fact, it is an *old* benchmark trick to disable TIMED_STATISTICS after all tuning has been done. This is usually good for another 2% in overall throughput. In a production environment, it is NOT advisable to ever disable TIMED_STATISTICS. These statistics are extremely important to monitor and maintain application performance.
Also see MySQL Performance Tuning on Solaris (Yufei's blog).
 MySQL InnoDB performance tuning on T2000
MySQL is one of the most popular open source databases. It supports several pluggable storage engines. MyISAM manages non-transactional tables and is a good performer, especially in MySQL 5.1. The InnoDB and BDB storage engines provide transaction-safe tables; both were acquired by Oracle not long ago. The MEMORY storage engine, formerly known as the heap storage engine, provides in-memory tables. InnoDB supports both file systems and raw disks on Solaris, and does row-level locking.
MySQL InnoDB allocates a few types of memory buffers out of the heap. The main buffer, sized by innodb_buffer_pool_size, is used to cache data and indexes of the tables. Each buffer block is 16KB. There is a dedicated thread that does pread to bring data into the buffer from disk. There are also other memory buffers such as the sort buffer, log buffer, and query cache buffer. The following commands show how to monitor MySQL statistics to get an idea of whether those buffer sizes need to be increased.
mysql> show innodb status;
mysqladmin extended-status
It is important to understand how MySQL executes select queries by using EXPLAIN, which can help optimize select performance.
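As a sketch (the table and column names here are hypothetical), EXPLAIN reports the access type, candidate keys, and estimated row count for each table in the query, so you can see whether an index is actually being used:

```
mysql> EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```

If the "key" column is NULL and "type" is ALL, the query is doing a full table scan and may need an index on customer_id.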
MySQL is a single process, multi-threaded. For each new user connection, one thread is created by MySQL. We can control the number of concurrent threads that run simultaneously by limiting innodb_thread_concurrency. Normally on T2000 we set it equal to the number of CPUs.
There are no dedicated log write threads. There is one I/O thread that wakes up frequently to do group writes, but in our evaluation it rarely fired. Each user thread can issue a pwrite when a transaction is committed, and these writes are serialized! The default innodb_flush_log_at_trx_commit is 1, meaning that after each transaction commits, there is a pwrite followed by an fsync. Setting innodb_flush_log_at_trx_commit to 0 improves performance quite dramatically: the pwrite and fsync are done once per second instead of after each transaction commit. The risk is that the customer might lose up to one second of data in case of power loss. innodb_flush_log_at_trx_commit = 2 issues a pwrite after each commit but an fsync only every second. If log disk response time is not fast enough, it can become a huge performance bottleneck, so it is very important to size enough IOPS for the log disks. We have seen performance double or triple right away by using a few more disks for the log files. For performance, if you use file systems to store data files and log files, you should use forcedirectio.
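The trade-off above comes down to one line in /etc/my.cnf; the comments summarize the flush behavior of each setting:

```ini
# 1 = pwrite + fsync at every commit (safest, slowest; the default)
# 0 = pwrite + fsync once per second (fastest; may lose ~1s of data on power loss)
# 2 = pwrite at every commit, fsync once per second
innodb_flush_log_at_trx_commit = 1
```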
This is an example of how to use Solaris Volume Manager (SVM) to create a soft partition and then set up the logfile links to this location.
# metadb -d -f c1t0d0s7 c1t1d0s7
# metadb -a -f c1t0d0s7 c1t1d0s7
Then create a concatenation
# metainit d10 1 4 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0
You can stop here and do newfs on /dev/md/rdsk/d10, or continue to create a soft partition (for example, 10G) on d10
# metainit d101 -p d10 10G
# newfs /dev/md/rdsk/d101
# mount -o forcedirectio /dev/md/dsk/d101 /logs
There is also innodb_flush_method. The default performs very well on Solaris; it uses fsync() to flush both the data and log files.
When MySQL starts, there are 10 threads created. Thread 1 handles network connections and creates new threads for user connections. Then there are 4 I/O threads. There is 1 log write thread doing group commits once in a while. There is one insert buffer thread: InnoDB stores data physically in primary key order, so insertion does not cause random reads on disk, but inserts into a non-unique secondary index could cause a lot of random reads. To avoid that, InnoDB checks whether the secondary index page is in the buffer pool. If it is, InnoDB does the insertion directly to the index page. If the index page is not found in the buffer pool, InnoDB inserts the record into a special insert buffer structure. The insert buffer is kept so small that it fits entirely in the buffer pool, so insertions can be done very fast. Periodically, the insert buffer is merged into the secondary index trees in the database. Often it is possible to merge several insertions into the same page of the index tree, saving disk I/O operations. There is another thread whose purpose I am not sure of. Thread 6 handles rollback. Threads 7 and 8 are monitoring threads. Thread 9 is the master thread. Thread 10 is the signal handler thread.
Here are a few steps of how to tune MySQL InnoDB performance on T2000 or other Solaris platforms
- Choose at least MySQL version 5.0.22. In my experience MySQL 4.1 has more internal user locks and hence doesn't scale well. MySQL 5.1 beta is even better than 5.0.
- If your database size is more than 4GB, use 64-bit MySQL so you can use more than 4GB of memory for the InnoDB database buffers.
- The 64-bit MySQL 5.0.22 from the MySQL website is a pretty good performer. But in case you want to compile it on your own on Solaris, you can refer to a good whitepaper by Jenny Chen from Sun Microsystems. Compile time can be reduced to a few minutes using the parallel build tool "dmake" in Sun Studio 11.
- You can use libumem or libmtmalloc to get scalable memory allocation performance with multi-threaded MySQL. To do this, before mysqld is started: setenv LD_PRELOAD_64 /usr/lib/sparcv9/libumem.so. This example is for 64-bit MySQL.
- Always keep an eye on MySQL internal statistics. Normally you can compare reads versus hits to tell whether you need to increase a certain buffer size:
mysql> show innodb status;
mysql> show full processlist;
mysqladmin extended-status
- Make sure there is no I/O bottleneck. Collect "iostat -xtcnz 5" on the system and look at the disk response time (column asvc_t, in milliseconds) and the average number of outstanding I/Os (column actv). Make sure you have enough IOPS for the logfile disks and datafile disks.
- On Solaris systems, you can use "prstat -Lmc" to monitor the active processes on the system.
- MySQL startup options are very important. The following is an example of the /etc/my.cnf I use to evaluate MySQL sysbench OLTP performance on T2000:
#
# MySQL configuration
#
# The MySQL server
[mysqld]
port=3306
socket=/tmp/mysql.sock
user=root
datadir=/data
basedir=/usr/local/mysql
max_connections=1600
max_connect_errors=10
table_cache=2048
max_allowed_packet=1048576
binlog_cache_size=1048576
key_buffer_size=16777216
max_heap_table_size=67108864
sort_buffer_size=65536
join_buffer_size=1048576
query_cache_size=26214400
thread_cache=16
thread_concurrency=32
thread_stack=64K
ft_min_word_len=4
default_table_type=MYISAM
transaction_isolation=REPEATABLE-READ
tmp_table_size=64M
skip-locking
skip-concurrent-insert
server-id=1
####### below is for innodb ############
innodb_status_file=0
innodb_data_file_path=ibdata1:100M:autoextend
innodb_buffer_pool_size=1800M
innodb_additional_mem_pool_size=20M
innodb_log_file_size=200M
innodb_log_files_in_group=2
innodb_log_buffer_size=80M
innodb_flush_log_at_trx_commit=1
innodb_lock_wait_timeout=300
innodb_max_dirty_pages_pct=90
innodb_thread_concurrency=32
 MySQL MyISAM performance tuning on T2000
The default storage engine of MySQL is MyISAM. Unlike InnoDB, it stores each MyISAM table in three files: the schema file (.frm), the data file (.MYD), and the index file (.MYI). It only supports file systems, and it does table-level locking.
Compared to InnoDB, MyISAM doesn't have its own data buffer cache; it only has an index buffer cache, the key buffer (variable key_buffer_size). We need to use the file system buffer cache as the data cache for MyISAM tables.
Here are a few steps of how to tune MySQL MyISAM performance on T2000 or other Solaris platforms
- There is a very good performance feature in MySQL 5.1 beta that allows MyISAM to use mmap memory instead of malloc'd buffers. If you experienced very high mutex contention in earlier MySQL releases, you can get a huge performance improvement. How do you know if you have this problem?
- If you are using MySQL 5.0 or an earlier version, collect "lockstat sleep 1" during the test; if you see ufs_lockfs_begin and ufs_lockfs_end in the callers column, you are likely to get the performance boost by going to the latest 5.1. You need "myisam_use_mmap=1" in /etc/my.cnf to enable this feature. Below is an example of lockstat output showing the performance issue.
Adaptive mutex spin: 140294 events in 1.145 seconds (122505 events/sec)

Count indv cuml rcnt     spin Lock           Caller
-------------------------------------------------------------------------------
31341  22%  22% 0.00        9 0x600052c4d10  ufs_lockfs_end+0x70
30952  22%  44% 0.00       10 0x600052c4d10  ufs_lockfs_begin+0xe4
- You can use libumem or libmtmalloc to get scalable memory allocation performance with multi-threaded MySQL. To do this, before mysqld is started: setenv LD_PRELOAD_64 /usr/lib/sparcv9/libumem.so. This example is for 64-bit MySQL. For 32-bit MySQL, use setenv LD_PRELOAD /usr/lib/libumem.so.
- Since we have to use the file system buffer cache to cache data for MyISAM tables, it is important to tune segmap_percent in /etc/system. The default segmap_percent is 12% on Solaris, which means only 12% of system memory can be used for the file system buffer cache. The right value depends on your database size, but setting it too high can lead to low memory on the system, causing excessive paging. In our case, we set it to 80%: in /etc/system, set segmap_percent=80. You need to reboot the system for this to take effect.
- The key cache is an important MyISAM feature for caching index blocks, and it allows multiple threads to access the key buffer simultaneously. You can check key cache performance with "show status" by looking at the ratio of key_reads to key_read_requests; it should be less than 1%. Otherwise, you might need to increase key_buffer_size. The maximum key_buffer_size is 4GB, but you can create multiple key caches (the 4GB limit applies to each cache individually, not as a group). Multiple caches also help in situations where access to one key cache structure would otherwise block access to another.
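The miss-ratio check above is simple arithmetic; the counter values below are made-up sample numbers standing in for what "show status" would report, not measurements:

```shell
# Key cache miss ratio = key_reads / key_read_requests; should be under 1%.
key_reads=6500
key_read_requests=1000000
awk -v r="$key_reads" -v q="$key_read_requests" \
    'BEGIN { printf "miss ratio: %.2f%%\n", 100 * r / q }'
# prints miss ratio: 0.65%
```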
- Make sure there is no I/O bottleneck. Collect "iostat -xtcnz 5" on the system and look at the disk response time (column asvc_t, in milliseconds) and the average number of outstanding I/Os (column actv).
- On Solaris systems, you can use "prstat -Lmc" to monitor the active processes on the system.
 Solaris Parameters
- The following parameters should be used for T2000, and are listed at 
set ip:ip_squeue_bind = 0
set ip:ip_squeue_fanout = 1
set ipge:ipge_tx_syncq = 1
set ipge:ipge_taskq_disable = 0
set ipge:ipge_tx_ring_size = 2048
set ipge:ipge_srv_fifo_depth = 2048
set ipge:ipge_bcopy_thresh = 512
set ipge:ipge_dvma_thresh = 1
set segkmem_lpsize = 0x400000
set consistent_coloring = 2
set pcie:pcie_aer_ce_mask = 0x1
- The following parameters are only needed for Solaris 9 and earlier releases
set semsys:seminfo_semmni = 4096
set shmsys:shminfo_shmmax = 15032385536
set shmsys:shminfo_shmmni = 4096
Reboot the system for the above changes to take effect.
- The appropriate values for System V IPC tunables depend on your settings in postgresql.conf such as shared_buffers, max_connections, etc.
- For Solaris 10, changes to System V IPC tunables are not usually necessary - default values are large enough for most typical environments. Should changes to the defaults be required, use the prctl command to set resource controls (rctls) for the affected projects. For the sake of backward compatibility, values may still be set in /etc/system, although those values take effect only if they are larger than the Solaris 10 defaults.
- Putting the log into a different filesystem with Direct I/O can improve performance for OLTP workloads. Here's a sample entry from /etc/vfstab showing a filesystem with forcedirectio enabled:
/dev/dsk/c1t1d0s0 /dev/rdsk/c1t1d0s0 /pglog ufs 1 yes forcedirectio
 PostgreSQL Tuning
Performance tests have been carried out on a Sun Fire T2000 system with 8 cores and 32 GB memory using an internal OLTP workload located on the internal drives. The test database was less than 2GB in size though, and could fit into memory. Our main interest was CPU scalability, and indeed PostgreSQL scales quite well on the T2000 when increasing the number of users.
The testing showed that a number of parameters in postgresql.conf can be tuned to improve OLTP performance. The following settings provided the best throughput:
shared_buffers = 300000
max_connections = 1500
work_mem = 512
maintenance_work_mem = 512000
max_fsm_pages = 1000000
wal_sync_method = fdatasync
wal_buffers = 128
checkpoint_segments = 128
effective_cache_size = 1200000
The above parameters were tuned for our OLTP test workload and the results may vary for your application.
The following links provide some guidelines for determining the optimal settings for your specific workload:
- Server Configuration (PostgreSQL 8.1.3 Manual)
- PostgreSQL 8.0 Performance Checklist (from Power PostgreSQL, by Josh Berkus and Joe Conway)
 Apache/PHP
In the Apache 2.2 provided with CoolStack 1.1 for the x86 architecture, PHP might cause the maximum number of semaphores to be exceeded. To bump the max from 128 up to 256:
# projadd -c "Apache-PHP" -U webservd httpd.php
# projmod -sK "project.max-sem-ids=(privileged,256,deny)" httpd.php
The Apache bits provided with CoolStack 1.1 are not SMF-ready, so before you can use this project resource control, the service must be converted to SMF; then add the project to the service configuration:
svccfg -s http:CSKapache2 setprop start/project = astring: httpd.php
Also see converting CoolStack Apache to SMF for tips on making the conversion.
 NFS Client
 NFS Server
- The number of NFS threads is controlled by the NFSD_SERVERS parameter in /etc/default/nfs. You need to increase this parameter for CMT or SMP systems; for servers with 1GB or more of memory, 1024 is a good starting point.
# Maximum number of concurrent NFS requests.
# Equivalent to last numeric argument on nfsd command line.
NFSD_SERVERS=1024
- For T2000, the following tunables need to be set in /etc/system to achieve near line rate on the gigabit ethernet ports while using NFS.
set ipge:ipge_tx_syncq=1 set ipge:ipge_taskq_disable=1
- For NFS servers with a large set of files, the default value of ncsize might need to be increased. Set ncsize in /etc/system to approximately double the number of active files.
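The ncsize rule of thumb above (roughly double the active file count) can be turned into the /etc/system line directly; the file count below is a made-up example:

```shell
# ncsize ~= 2 x number of active files on the NFS server
active_files=250000
echo "set ncsize = $((2 * active_files))"
# prints set ncsize = 500000
```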