Application Specific Tuning



Oracle RDBMS

System Setup Recommendations for Solaris 8 and Solaris 9

The following settings are categorized as setup recommendations because they do not tune performance. They enable Oracle to install and run for typical loads, as the default values for the System V IPC parameters tend to be too small. Note that the IPC parameters msgmap, msgssz, msgseg and semmap are obsolete in Solaris 8 and beyond. They are ignored if they appear in /etc/system. Note also that shmmin and shmseg are obsolete in Solaris 9 and beyond. For Solaris 8 and 9, in /etc/system:

# shared memory parameters
set shmsys:shminfo_shmmax = (0.25 * total physical memory) # or simply 4294967296 on a 32-bit system
set shmsys:shminfo_shmmni = 128
set shmsys:shminfo_shmseg = 64

# semaphore parameters
set semsys:seminfo_semmsl = 512
set semsys:seminfo_semmns = 2500
set semsys:seminfo_semmnu = 2500
set semsys:seminfo_semume = 128
set semsys:seminfo_semmni = 128
set semsys:seminfo_semopm = 512
# message queues
set msgsys:msginfo_msgmax = 65536
set msgsys:msginfo_msgmnb = 65536
set msgsys:msginfo_msgmni = 128
set msgsys:msginfo_msgtql = 8192
# Increase sync queues for network packets
set sq_max_size = 64
# Minimize possibility of TS threads getting starved
# This is for Solaris 8 ONLY!
set TS:ts_sleep_promote = 1
# Tune fsflushd to stay out of the way
set tune_t_fsflushr = 1
# If 64GB of RAM or less, set autoup to 300. Otherwise, set it to
# 4 x GB of RAM (e.g. for a system with 96GB of RAM, 4 x 96 = 384).
set autoup = <number>
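The autoup sizing rule above can be sketched as a small shell calculation; ram_gb=96 is an assumed example value, not a recommendation.

```shell
# Compute the recommended autoup value from RAM size in GB,
# per the guideline: <= 64GB -> 300, otherwise 4 x GB of RAM.
ram_gb=96
if [ "$ram_gb" -le 64 ]; then
  autoup=300
else
  autoup=$((4 * ram_gb))
fi
echo "set autoup = $autoup"
```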

The suggestions above do NOT apply to Solaris 10 or OpenSolaris.

System Setup Recommendations for Solaris 10 and OpenSolaris

In Solaris 10, the System V IPC parameters have been made dynamic. That is, a parameter value can be changed on-the-fly using the new Solaris 10 resource control facilities. Also, many of the older parameters have been obsoleted, and remaining parameters have much larger default values. This information is summarized in the Solaris Tunable Parameters Reference Manual.

If /etc/system entries exist in Solaris 10, and the set values are larger than the new default values, the settings will work system-wide. However, Solaris 10 contains new administrative commands for managing IPC parameters on a per-project basis (projects and tasks were introduced in Solaris 9 as a resource control name space).

Below is an example of using the prctl(1) command to examine the current value of the maximum shared memory segment size (20GB in this example, with a system maximum settable value of 16 Exabytes).

sol10# id -p
uid=0(root) gid=0(root) projid=100(filebench)

sol10# prctl -n project.max-shm-memory -i project 100
project: 100: filebench
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
        privileged      20.0GB      -   deny                                 -
        system          16.0EB    max   deny                                 -

sol10# prctl -n project.max-shm-memory -r -v 24gb -i project 100

sol10# prctl -n project.max-shm-memory -i project 100
project: 100: filebench
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
        privileged      24.0GB      -   deny                                 -
        system          16.0EB    max   deny                                 -

In the example above, the id(1) command with the "-p" flag shows we are in the filebench project, which has a project ID of 100. The prctl(1) command to fetch the current max-shm-memory parameter indicates it is currently 20GB. We then use prctl(1) to dynamically change max-shm-memory to 24GB, and interrogate it once again to ensure the change took effect (which it did!).

Permanent changes are made using the projadd(1) and projmod(1) commands to add entries for the project in the /etc/project file, ensuring that the desired values will be set if the system ever reboots. For example, to make this permanent, we would create a project entry for the user or project in question.

sol10# projadd -c "My database" -U oracle user.oracle

sol10# projmod -sK "project.max-shm-memory=(privileged,64G,deny)" user.oracle

sol10# su - oracle

oracle$ prctl -n project.max-shm-memory -i project
project: 101:
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
        privileged      64.0GB      -   deny                                 -
        system          16.0EB    max   deny                                 -
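The projmod command above records the limit as a line in /etc/project. A sketch of what that entry would look like (the project name and ID are illustrative, and 68719476736 is 64G expressed in bytes):

```
user.oracle:101:My database:oracle::project.max-shm-memory=(privileged,68719476736,deny)
```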

Tuning Recommendations

  • File System and I/O Related
    • Pick the right file system; problematic combinations are any file systems that rely on the OS page cache for buffering
    • Always use UFS Direct I/O, NFS Direct I/O or Raw-disk for the re-do logs. Putting Oracle redo logs on cached file systems will cause a hot bottleneck at the file system level. This is the most significant file system / IO related tuning tip.
    • Details on Oracle File System Performance are covered in this older, but still relevant paper on Oracle I/O [1]
    • Optionally, put database tables on Direct I/O as well. Further information is covered in the Direct I/O page.
    • Oracle over NFS. Oracle RDBMS works quite well over NFS, providing you are using gigabit ethernet (or faster), Direct I/O is enabled, and you are using Solaris 9 or later. The following paper covers Oracle on NFS studies done in our PAE group: [2]
  • Oracle Related
    • Use the largest SGA/block buffer cache possible to maximize Oracle cache hit rates. We have tested up to 540GB SGA with the 64-bit version of Oracle.
    • Use the 64-bit version of RDBMS. This allows SGA larger than 3.75 GB, which is almost always beneficial when there is sufficient RAM.
  • For Solaris 8, ensure that the scheduler starvation prevention parameter is set (this is the default for Solaris 9 and later). Set the following in /etc/system
set TS:ts_sleep_promote = 1
  • For optimal performance, use the Solaris FX Scheduler Class
    • This is available on Solaris 9 or later.
    • This fixed-priority scheduler provides longer time quanta on the CPU and reduces the context switching that results from the default behaviour of the timeshare (TS) scheduler.
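Processes can be moved into the FX class with the priocntl(1) command; the project ID used here is illustrative, following the filebench example earlier on this page.

```
# Put all processes in project 100 into the fixed-priority (FX) class
sol9# priocntl -s -c FX -i projid 100
```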
  • For Solaris 8 and 9, tune the file system flush daemon. (NOTE: since early-2006 patch levels, tune_t_fsflushr defaults to 1, so it no longer needs to be set explicitly. Any system where this is not the default is severely out of date on patches!)
# Tune fsflushd to stay out of the way ...
# Only set tune_t_fsflushr if it is not 1 by default.
set tune_t_fsflushr = 1
# For autoup, if 64GB of RAM or less, use 300. Otherwise, use 4 x GB of RAM.
# For example, for a system with 96GB of RAM, 4 x 96 = 384.
set autoup = <number>
  • Hardware/Platform Specific
    • For the T1000/T2000, in /etc/system add set consistent_coloring=2, or ensure that patch 118833-03 or later has been applied.
set consistent_coloring=2

Solaris Optimizations for Oracle

Here's a quick summary of some of the important optimizations we've done for Oracle specifically:

  • Containers: zones licensing for processor pools. To facilitate server consolidation, you can use Solaris Containers to partition the CPU resources. For example, 4 CPUs could be given to each of two containers using processor pools. Furthermore, there is [Oracle licensing support] for Zones, so that only the number of processors in each zone need be licensed.
  • Resource manager: Fair Share Scheduler supported for consolidation. The Solaris Fair Share Scheduler can be used to prioritize the CPU resources between instances of Oracle. The Fair Share scheduler offers fine grained allocation of resources based on CPU ratios (e.g. shares of 10/90 will guarantee at least 10% and 90% respectively to each instance).
  • DR: ability to add/remove memory. Solaris allows physical RAM to be added dynamically. The file system cache will grow/shrink automatically based on the additional RAM. Furthermore, the Dynamic ISM feature allows the Oracle SGA to grow and shrink dynamically using SQL commands. The grow/shrink may also be automated using the Resource Configuration Manager.
  • UFS Direct I/O: Optimized database I/O performance. Solaris 8 Concurrent Direct I/O - A new implementation of Direct I/O which eliminates the single writer lock in typical database usage was made available in Solaris 8 update 3. This allows the UFS file system to perform at near raw disk performance when used with the Oracle database.
  • libaio: async I/O library for database data and log writes. Oracle performs writes in batches or lists, which are submitted to the operating system using the listio() functionality. The listio() asynchronous I/O facility was added specifically to optimize database performance.
  • KAIO: Kernelized async I/O, and its many enhancements. When using raw disks, kernel async I/O allows lists to be passed directly into the kernel in a single system call, reducing the number of system calls. A substantial reduction of latency and instruction path was gained from this feature.
  • Efficient listio reaping (aio_waitn). The [aio_waitn()] functionality allows lists of I/Os to be waited for and reaped in a single system call.
  • Timed semaphore wait (semtimedop). This new system call allows Oracle to wait on a single event with a timer, rather than having to arm and fire a timer, saving several system calls.
  • Scheduler Activations
  • Large pages
  • Giant pages for newer platforms
  • Large pages for Oracle binaries/libraries
  • MPO APIs
  • Intimate shared memory for databases
  • FX: Fixed priority scheduler for databases

Oracle's use of the "times()" syscall

"Why does Oracle call times() so often? Is something broken?" When using truss or dtrace to profile Oracle shadow processes, one often sees a lot of calls to times(). Sysadmins often get worried and think something is wrong, but Oracle is working as designed and the impact is minimal.

root@catscratchb> truss -cp 7700
syscall               seconds   calls  errors
read                     .002     120
write                    .008     210
times                    .053   10810
semctl                   .000      17
semop                    .000       8
semtimedop               .000       9
mmap                     .003      68
munmap                   .003       5
yield                    .002     231
pread                    .150    2002
kaio                     .003      68
kaio                     .001      68
                     --------  ------   ----
sys totals:              .230   13616      0
usr time:               1.127
elapsed:               22.810

At first glance it would seem alarming to have so many times() calls, but how much does this really affect performance? This question is best answered by looking at the overall "elapsed" and "cpu" time. Below is output from the "procsystime" tool included in the [DTrace Toolkit].

root@catscratchb> ./procsystime -Teco -p 7700
Hit Ctrl-C to stop sampling...
Elapsed Times for PID 7700,
         SYSCALL          TIME (ns)
            mmap           17615703
           write           21187750
          munmap           21671772
           times           90733199       <<== Only 0.28% of elapsed time
          semsys          188622081
            read          226475874
           yield          522057977
           pread        31204749076
          TOTAL:        32293113432

CPU Times for PID 7700,
         SYSCALL          TIME (ns)
          semsys            1346101
           yield            3283406
            read            7511421
            mmap           16701455
           write           19616610
          munmap           21576890
           times           33477300         <<== 10.6% of CPU time for the times syscall
           pread          211710238
          TOTAL:          315223421

Syscall Counts for PID 7700,
         SYSCALL              COUNT
          munmap                 17
          semsys                 84
            read                349
            mmap                350
           yield                381
           write                540
           pread               3921
           times              24985    <<== 81.6% of syscalls.
          TOTAL:              30627

According to the profile above, the times() syscall accounts for only 0.28% of the overall response time. It does use 10.6% of sys CPU. The usr/sys CPU percentages are "83/17" for this application. So, using the 17% for system CPU we can calculate the overall amount of CPU for the times() syscall: 100*(.17*.106)= 1.8%.
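The arithmetic above can be checked directly: the overall CPU share of times() is the system-CPU fraction multiplied by the times() share of system CPU.

```shell
# 17% of total CPU is system time; times() is 10.6% of that system time.
awk 'BEGIN { printf "overall CPU in times(): %.1f%%\n", 100 * 0.17 * 0.106 }'
```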

Oracle uses the times() syscall to keep track of timed performance statistics. Timed statistics can be enabled or disabled via the init.ora parameter TIMED_STATISTICS (TRUE/FALSE). In fact, it is an *old* benchmark trick to disable TIMED_STATISTICS after all tuning has been done. This is usually good for another 2% in overall throughput. In a production environment, it is NOT advisable to ever disable TIMED_STATISTICS. These statistics are extremely important for monitoring and maintaining application performance.

SAP

MySQL

Also see MySQL Performance Tuning on Solaris (yufei's blog).

MySQL InnoDB performance tuning on T2000

MySQL is one of the most popular open source databases. It supports several pluggable storage engines. MyISAM manages non-transactional tables and is a good performer, especially on MySQL 5.1. The InnoDB and BDB storage engines provide transaction-safe tables; both were acquired by Oracle not long ago. The MEMORY storage engine, formerly known as the HEAP storage engine, provides in-memory tables. InnoDB supports both file systems and raw disks on Solaris, and does row-level locking.

MySQL InnoDB allocates a few types of memory buffers out of the heap. The main buffer, sized by innodb_buffer_pool_size, is used to cache data and indexes of the tables. Each buffer block is 16KB. A dedicated thread issues pread to bring data into the buffer from disk. There are also other memory buffers such as the sort buffer, log buffer, and query_cache buffer. The following commands show MySQL statistics that indicate whether those buffer sizes need to be increased:

mysql> show innodb status 
mysqladmin extended-status

It is also important to understand how MySQL executes SELECT queries, using EXPLAIN; that can help optimize SELECT performance.
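For example (the table and column names here are made up purely for illustration), the output shows which index, if any, the query will use:

```
mysql> EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```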

MySQL is a single-process, multi-threaded server. For each new user connection, one thread is created by MySQL. We can control the number of concurrent threads that run simultaneously by limiting innodb_thread_concurrency. Normally on a T2000 we set it equal to the number of CPUs.

There are no dedicated log-write threads. There is one I/O thread that wakes up frequently to do group writes, but in our evaluation I rarely saw it fire. Each user thread can issue a pwrite when its transaction is committed, and these writes are serialized! The default innodb_flush_log_at_trx_commit is 1, meaning that after each transaction is committed there is a pwrite, followed by a flush. If innodb_flush_log_at_trx_commit is set to 0, performance improves quite dramatically: the pwrite and flush are done only once per second instead of after each transaction commit, but the risk is that the customer might lose up to one second of data on a power loss. innodb_flush_log_at_trx_commit = 2 issues the pwrite after each commit but flushes only every second. If log disk response time is not fast enough, it can become a huge performance bottleneck, so it is very important to size enough IOPS for the log disks. We have seen performance double or triple right away by using a few more disks for the log files. For performance, if you use file systems to store the data files and log files, you should use forcedirectio.

This is an example of how to use Solaris Volume Manager (SVM) to create a soft partition and then set up the log file links to this location:

# metadb -d -f c1t0d0s7 c1t1d0s7
# metadb -a -f c1t0d0s7 c1t1d0s7

Then create a concatenation

# metainit d10 1 4 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0

You can stop here and do newfs on /dev/md/rdsk/d10, or continue to create a soft partition (for example, 10GB) on d10:

# metainit d101 -p d10 10G  

After that

# newfs /dev/md/rdsk/d101
# mount -o forcedirectio /dev/md/dsk/d101  /logs
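To make the forcedirectio mount persistent across reboots, an /etc/vfstab entry along these lines could be added (a sketch; the device names follow the SVM example above):

```
/dev/md/dsk/d101  /dev/md/rdsk/d101  /logs  ufs  2  yes  forcedirectio
```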

There is also innodb_flush_method. The default performs very well on Solaris; it uses fsync() to flush both the data and log files.

When MySQL starts, 10 threads are created. Thread 1 handles network connections and creates new threads for new user connections. Then there are 4 I/O threads. There is 1 log-write thread doing group commits once in a while. There is one insert-buffer thread: InnoDB stores data physically in primary key order, so insertion does not cause random reads on the disk, but inserts into a non-unique secondary index can cause a lot of random reads. To avoid that, InnoDB checks whether the secondary index page is in the buffer pool. If it is, InnoDB does the insertion directly into the index page. If the index page is not found in the buffer pool, InnoDB inserts the record into a special insert buffer structure. The insert buffer is kept small enough that it fits entirely in the buffer pool, so insertions can be done very fast. Periodically, the insert buffer is merged into the secondary index trees in the database; often it is possible to merge several insertions into the same page of the index tree, saving disk I/O operations. There is another thread whose purpose I am not sure of. Thread 6 handles rollback. Threads 7 and 8 are monitoring threads. Thread 9 is the master thread. Thread 10 is the signal handler thread.

Here are a few steps for tuning MySQL InnoDB performance on a T2000 or other Solaris platforms:

  • Choose at least MySQL version 5.0.22. In my experience MySQL 4.1 has more internal user locks and hence does not scale well. The MySQL 5.1 beta is even better than 5.0.
  • If your database size is more than 4GB, use the 64-bit version of MySQL, because you can then use more than 4GB of memory for the InnoDB database buffers.
  • The 64-bit MySQL 5.0.22 from the MySQL website is a pretty good performer. But in case you want to compile it on your own on Solaris, you can refer to a good whitepaper by Jenny Chen of Sun Microsystems. Compile time can be reduced to a few minutes using the parallel build tool "dmake" in Sun Studio 11.
  • You can use libumem or libmtmalloc to get scalable memory allocation performance with multi-threaded MySQL. To do this, before mysqld is started, run setenv LD_PRELOAD_64 /usr/lib/sparcv9/ (this example is for 64-bit MySQL).
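Assuming csh syntax (the text uses setenv) and libumem as the chosen allocator, the full preload line would look like the following; the 64-bit libumem library on Solaris lives at /usr/lib/sparcv9/libumem.so.

```
# csh syntax; substitute libmtmalloc.so to use libmtmalloc instead
setenv LD_PRELOAD_64 /usr/lib/sparcv9/libumem.so
```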
  • Always keep an eye on the MySQL internal statistics. Normally you can compare reads versus hits to tell whether you need to increase a particular buffer size:
mysql> show innodb status;
mysql> show full processlist;
mysqladmin extended-status
  • Make sure there is no I/O bottleneck. Collect "iostat -xtcnz 5" on the system and look at the disk response time (column asvc_t, in milliseconds) and the average number of outstanding I/Os (column actv). Make sure you have enough IOPS for the log file disks and data file disks.
  • On Solaris systems, you can use "prstat -Lmc" to monitor the active processes on the system.
  • MySQL startup options are very important. The following is an example of the /etc/my.cnf I use to evaluate MySQL sysbench OLTP performance on a T2000:
#MySQL configure
# The MySQL server
#######below is for innodb############
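A minimal sketch of the InnoDB-related lines such a my.cnf might contain, covering the settings discussed in this section; all values here are illustrative assumptions, not measured recommendations.

```
[mysqld]
# Main data/index cache (16KB blocks); size to fit the working set
innodb_buffer_pool_size = 4G
# Text above suggests roughly the number of CPUs on a T2000
innodb_thread_concurrency = 32
# Log flushing policy discussed above: 1 = write + flush at every commit
innodb_flush_log_at_trx_commit = 1
innodb_log_buffer_size = 16M
```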

MySQL MyISAM performance tuning on T2000

The default storage engine of MySQL is MyISAM. Unlike InnoDB, it stores each MyISAM table in three files: the schema file (.frm), the data file (.MYD), and the index file (.MYI). It supports only file systems, and does table-level locking.

Compared to InnoDB, MyISAM doesn't have its own data buffer cache; it has only an index buffer cache, the key buffer (variable key_buffer_size). We need to use the file system buffer cache as the data cache for MyISAM tables.

Here are a few steps for tuning MySQL MyISAM performance on a T2000 or other Solaris platforms:

  • There is a very good performance feature in the MySQL 5.1 beta that allows MyISAM to use mmap'd memory instead of malloc'd buffers. If you experienced very high mutex contention in earlier MySQL releases, you can get a huge performance improvement. How do you know if you have this problem?
  • If you are using MySQL 5.0 or an earlier version, collect "lockstat sleep 1" during the test; if you see ufs_lockfs_begin and ufs_lockfs_end in the callers column, you are likely to get a performance boost by moving to the latest 5.1. You need "myisam_use_mmap=1" in /etc/my.cnf to enable this feature. Below is an example of lockstat output from a system with this performance issue.
Adaptive mutex spin: 140294 events in 1.145 seconds (122505 events/sec)

Count indv cuml rcnt     spin Lock                   Caller                  
31341  22%  22% 0.00        9 0x600052c4d10          ufs_lockfs_end+0x70     
30952  22%  44% 0.00       10 0x600052c4d10          ufs_lockfs_begin+0xe4   
  • You can use libumem or libmtmalloc to get scalable memory allocation performance with multi-threaded MySQL. To do this, before mysqld is started, run setenv LD_PRELOAD_64 /usr/lib/sparcv9/ (for 64-bit MySQL) or setenv LD_PRELOAD /usr/lib/ (for 32-bit MySQL).
  • Since we have to use the file system buffer cache to cache data for MyISAM tables, it is important to tune segmap_percent in /etc/system. The default segmap_percent is 12% on Solaris, which means only 12% of system memory can be used for the file system buffer cache. The right value depends on your database size, but setting it too high could leave the system short of memory and cause excessive paging. In our case, we set it to 80%: in /etc/system, set segmap_percent=80. You need to reboot the system for this to take effect.
  • The key cache is an important MyISAM feature that caches index blocks and allows multiple threads to access the key buffer simultaneously. You can check key cache performance with "show status" by looking at the ratio of key_reads to key_read_requests; it should be less than 1%. Otherwise, you might need to increase key_buffer_size. The maximum for key_buffer_size is 4GB, but you can create multiple key caches (the 4GB size limit applies to each cache individually, not to the group), which also helps because access to one key cache structure does not block access to the others.
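The miss ratio described above is simply key_reads divided by key_read_requests; the counter values below are assumed sample numbers, not measurements.

```shell
# key_reads / key_read_requests as reported by "show status"
# (assumed sample values; below 1% means the key cache is adequate)
awk 'BEGIN { key_reads = 6000; key_read_requests = 1000000;
             printf "key cache miss ratio: %.2f%%\n",
                    100 * key_reads / key_read_requests }'
```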
  • Make sure there is no I/O bottleneck. Collect "iostat -xtcnz 5" on the system and look at the disk response time (column asvc_t, in milliseconds) and the average number of outstanding I/Os (column actv).
  • On Solaris systems, you can use "prstat -Lmc" to monitor the active processes on the system.

PostgreSQL

Solaris Parameters

  • The following parameters should be used for T2000, and are listed at [3]
    set ip:ip_squeue_bind = 0
    set ip:ip_squeue_fanout = 1
    set ipge:ipge_tx_syncq=1
    set ipge:ipge_taskq_disable = 0
    set ipge:ipge_tx_ring_size = 2048
    set ipge:ipge_srv_fifo_depth = 2048
    set ipge:ipge_bcopy_thresh = 512
    set ipge:ipge_dvma_thresh = 1
    set segkmem_lpsize=0x400000
    set consistent_coloring=2
    set pcie:pcie_aer_ce_mask=0x1 
  • The following parameters are only needed for Solaris 9 and earlier releases
   set semsys:seminfo_semmni = 4096
   set shmsys:shminfo_shmmax = 15032385536
   set shmsys:shminfo_shmmni = 4096 

Reboot the system for the above changes to take effect.

  • The appropriate values for the System V IPC tunables depend on your settings in postgresql.conf, such as shared_buffers, max_connections, etc.
  • For Solaris 10, changes to System V IPC tunables are not usually necessary - default values are large enough for most typical environments. Should changes to the defaults be required, use the prctl command to set resource controls (rctls) for the affected projects. For backward compatibility, values may still be set in /etc/system, although those values take effect only if they are larger than the Solaris 10 defaults.
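Following the same pattern shown for Oracle earlier on this page, a sketch of per-project shared memory setup for a postgres user; the project name and the 8G limit are illustrative assumptions.

```
# projadd -c "PostgreSQL" -U postgres user.postgres
# projmod -sK "project.max-shm-memory=(privileged,8G,deny)" user.postgres
```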
  • Putting the log into a different filesystem with Direct I/O can improve performance for OLTP workloads. Here's a sample entry from /etc/vfstab showing a filesystem with forcedirectio enabled:
    /dev/dsk/c1t1d0s0 /dev/rdsk/c1t1d0s0 /pglog ufs 1 yes forcedirectio 

PostgreSQL Tuning

Performance tests have been carried out on a Sun Fire T2000 system with 8 cores and 32 GB memory using an internal OLTP workload located on the internal drives. The test database was less than 2GB in size though, and could fit into memory. Our main interest was CPU scalability, and indeed PostgreSQL scales quite well on the T2000 when increasing the number of users.

The testing showed that a number of parameters in postgresql.conf can be tuned to improve OLTP performance. The following settings provided the best throughput:

    shared_buffers = 300000
    max_connections = 1500
    work_mem = 512
    maintenance_work_mem = 512000
    max_fsm_pages = 1000000
    wal_sync_method = fdatasync
    wal_buffers = 128
    checkpoint_segments = 128
    effective_cache_size = 1200000 

The above parameters were tuned for our OLTP test workload and the results may vary for your application.

The following links provide some guidelines for determining the optimal settings for your specific workload:

Apache

PHP5

In the Apache 2.2 provided with CoolStack 1.1 for x86 architecture, PHP might cause the maximum number of semaphores to be exceeded. To bump the max from 128 up to 256:

   # projadd -c "Apache-PHP" -U webservd httpd.php
   # projmod -sK "project.max-sem-ids=(privileged,256,deny)" httpd.php

The Apache bits provided with CoolStack 1.1 are not SMF ready, so before you can use this project resource control, the service must be converted to SMF; then add the project to the service configuration:

   svccfg -s http:CSKapache2 setprop start/project = astring: httpd.php

Also see converting CoolStack Apache to SMF for tips on making the conversion.

NFS Client

See File_Systems

NFS Server

  • The number of NFS threads is controlled by the NFSD_SERVERS parameter in /etc/default/nfs. You need to increase this parameter for CMT or SMP systems. For servers with 1GB or more of memory, 1024 is a good starting point.
# Maximum number of concurrent NFS requests.
# Equivalent to last numeric argument on nfsd command line.
NFSD_SERVERS=1024
  • For T2000, the following tunables need to be set in /etc/system to achieve near line rate on the gigabit ethernet ports while using NFS.
set ipge:ipge_tx_syncq=1
set ipge:ipge_taskq_disable=1 
  • For NFS servers with a large set of files, the default value of ncsize might need to be increased. Set ncsize in /etc/system to approximately double the number of active files.
set ncsize=1048576