Multiple Page Size Support

From Siwiki

One of my favorite features of Solaris 9 is "Multiple Page Size Support" (MPSS). Why? Because it's one of the easiest ways to get a significant performance gain for a large range of applications.

MPSS and Application Performance

Memory-intensive applications with a large working set often perform sub-optimally unless they use larger MMU pages, because they make inefficient use of the microprocessor's Translation Lookaside Buffer (TLB). MPSS allows larger page sizes to be used by the microprocessor's memory management unit (MMU), which makes more efficient use of the TLB and ultimately improves application performance.

Applications most likely to benefit from MPSS are memory intensive and typically have working sets greater than a few hundred megabytes. Since the TLB can hold only a few hundred translations at a time, these applications typically overflow the microprocessor's TLB. On UltraSPARC, the Solaris kernel services these overflows, which can consume a significant amount of system-software time.
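
To make the "few hundred translations" limit concrete: the amount of memory a TLB can map at once (its reach) is roughly the number of entries multiplied by the page size. The sketch below is illustrative arithmetic only. The helper name tlb_reach_kb is made up for this example, the entry counts are the 64-entry and 512-entry figures quoted later in this article, and a POSIX shell is assumed (ksh rather than /bin/sh on Solaris 9).

```shell
# Illustrative arithmetic only: TLB reach = number of entries * page size.
# tlb_reach_kb is a hypothetical helper name, not a Solaris command.
tlb_reach_kb() {
    entries=$1   # number of TLB entries
    page_kb=$2   # page size in Kbytes
    echo $(( entries * page_kb ))
}

tlb_reach_kb 64 8        # 64 entries of 8 Kbyte pages
tlb_reach_kb 512 8       # 512 entries of 8 Kbyte pages
tlb_reach_kb 512 4096    # 512 entries of 4 Mbyte pages
```

With 8 Kbyte pages, a 512-entry TLB covers only 4 Mbytes, far less than a working set of a few hundred megabytes. With 4 Mbyte pages the same number of entries covers 2 Gbytes.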

Here comes the catch: the time spent processing these TLB overflows (we refer to them as TLB misses) is not reported as "system time" by our regular performance tools. Instead, mpstat, sar and vmstat report an application's TLB misses as user time. This can be quite misleading, since it can appear that the CPU is spending all of its time running your application when in fact it's spending a large fraction of its time in the kernel.

Measuring MMU Overheads

So how do you tell how frequently your application is overflowing the TLB? Answering that question is one objective of trapstat, a new tool introduced in Solaris 9. trapstat provides an easy way to measure the time spent in the kernel servicing TLB misses. Using the -t option, trapstat will report how many TLB misses are occurring, and what percentage of the total CPU time is spent processing TLB misses.

The -t option provides first-level summary statistics. Time spent servicing TLB misses is summarized in the lower right corner; in the example below, 46.2% of the total execution time is spent servicing TLB misses. TLB miss details are broken down to show misses incurred in the data portion (dTLB) and in the instruction portion (iTLB) of the address space. Data is also provided for user-mode (u) and kernel-mode (k) misses. We are primarily interested in the user-mode misses, since our application likely runs in user mode.

sol9# trapstat -t 1 111
cpu m| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
-----+-------------------------------+-------------------------------+----
  0 u|         1  0.0         0  0.0 |   2171237 45.7         0  0.0 |45.7
  0 k|         2  0.0         0  0.0 |      3751  0.1         7  0.0 | 0.1
=====+===============================+===============================+====
 ttl |         3  0.0         0  0.0 |   2192238 46.2         7  0.0 |46.2
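
When monitoring from a script, the total figure in the lower right corner can be pulled out of the output. The following is a minimal sketch, assuming the column layout shown above (a summary row beginning with "ttl" and the total in the last '|'-separated column). The helper name tlb_total_pct is hypothetical.

```shell
# Hypothetical helper: print the total %tim figure from `trapstat -t` output
# read on standard input. Assumes the layout shown above, where the summary
# row starts with "ttl" and the last '|'-separated column is the total.
tlb_total_pct() {
    awk -F'|' '$1 ~ /ttl/ { gsub(/ /, "", $NF); print $NF }'
}

# Usage on a Solaris system (not run here):
#   trapstat -t 1 1 | tlb_total_pct
```

Fed the sample output above, this would print the 46.2 from the ttl row.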

For further detail, use the -T option to provide a per-page-size breakdown. In our example, trapstat -T shows us that almost all of the misses occurred on 8Kbyte pages.

sol9# trapstat -T 1
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
  0 u   8k|        30  0.0         0  0.0 |   2170236 46.1         0  0.0 |46.1
  0 u  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 u 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 u   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
  0 k   8k|         1  0.0         0  0.0 |      4174  0.1        10  0.0 | 0.1
  0 k  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
==========+===============================+===============================+====
      ttl |        31  0.0         0  0.0 |   2174410 46.2        10  0.0 |46.2

We can conclude from this output that the application could potentially run almost twice as fast if we could eliminate the majority of the TLB misses. Our objective in using the mechanisms discussed below is to minimize the user-mode data TLB (dTLB) misses, potentially by instructing the application to use larger pages for its data segments. Typically, data misses are incurred in the program's heap or stack segments. We can use the Solaris 9 multiple-page-size support commands to direct the application to use 4Mbyte pages for its heap, stack or anonymous memory mappings.
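
The "almost twice as fast" estimate follows from simple arithmetic: if a fraction p of the run time goes to servicing TLB misses, removing all of it gives a speedup of at most 1 / (1 - p). The sketch below only illustrates that calculation. The helper name max_speedup is made up, and an awk that accepts -v is assumed (nawk or /usr/xpg4/bin/awk on Solaris 9).

```shell
# Illustrative arithmetic: if pct percent of run time is spent servicing
# TLB misses, removing all of it gives at most a 1 / (1 - pct/100) speedup.
# max_speedup is a hypothetical helper name.
max_speedup() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", 1 / (1 - pct / 100) }'
}

max_speedup 46.2    # prints 1.86
```

This is an upper bound: larger pages reduce TLB misses but rarely eliminate every one.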

Using Larger MMU Page Sizes

Prior to Solaris 9, databases were typically the only users of larger page sizes, via the intimate shared memory (ISM) facility enabled by the SHM_SHARE_MMU flag of shmat(2). Solaris 9 provides three methods of advising preferred page sizes for applications:

  1. A wrapper program, ppgsz(1)
  2. A preload library, mpss.so.1
  3. A programmatic interface, memcntl(2) 

UltraSPARC supports four page sizes: 8Kbyte (default), 64Kbyte, 512Kbyte and 4Mbyte. See "Things to watch out for" below for the best combinations of page sizes on the various UltraSPARC versions.
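
Before requesting a page size, it can help to confirm that the system supports it. The following is a minimal sketch, assuming that pagesize -a (available from Solaris 9) prints the supported sizes in bytes, one per line. The helper name pgsz_supported is hypothetical, and an awk that accepts -v is assumed (nawk or /usr/xpg4/bin/awk on Solaris 9).

```shell
# Hypothetical helper: succeed if the byte count given as $1 appears in a
# list of page sizes (one per line, in bytes) read on standard input.
# Assumption: `pagesize -a` prints the supported sizes in that form.
pgsz_supported() {
    awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Usage on a Solaris system (not run here):
#   pagesize -a | pgsz_supported 4194304 && echo "4M pages available"
```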

Enabling large pages for an application's heap is quite straightforward. Simply wrap the target binary with the ppgsz command and the appropriate options:

sol9# ppgsz -o heap=4M ./testprog &

After starting the target program, we can check to see how many large pages were allocated with the pmap -sx command. In the following example, we can see that the majority of our heap has been allocated with 4Mbyte pages:

sol9# pmap -sx `pgrep testprog`
2953:   ./testprog
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  dev:277,83 ino:114875
00020000       8       8       8       -   8K rwx--  dev:277,83 ino:114875
00022000    3960    3960    3960       -   8K rwx--    [ heap ]
00400000  131072  131072  131072       -   4M rwx--    [ heap ]
FF280000     120     120       -       -   8K r-x--  libc.so.1
FF29E000     136     128       -       -    - r-x--  libc.so.1
FF2C0000      72      72       -       -   8K r-x--  libc.so.1
FF2D2000     192     192       -       -    - r-x--  libc.so.1
FF302000     112     112       -       -   8K r-x--  libc.so.1
FF31E000      48      32       -       -    - r-x--  libc.so.1
FF33A000      24      24      24       -   8K rwx--  libc.so.1
FF340000       8       8       8       -   8K rwx--  libc.so.1
FF390000       8       8       -       -   8K r-x--  libc_psr.so.1
FF3A0000       8       8       -       -   8K r-x--  libdl.so.1
FF3B0000       8       8       8       -   8K rwx--    [ anon ]
FF3C0000     152     152       -       -   8K r-x--  ld.so.1
FF3F6000       8       8       8       -   8K rwx--  ld.so.1
FFBFA000      24      24      24       -   8K rwx--    [ stack ]
-------- ------- ------- ------- -------
total Kb  135968  135944  135112       -
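
To check this from a script rather than by eye, the Kbytes column can be totalled per page size. The following is a minimal sketch, assuming the pmap -sx column layout shown above. The helper name kb_with_pgsz is hypothetical, and an awk that accepts -v is assumed (nawk or /usr/xpg4/bin/awk on Solaris 9).

```shell
# Hypothetical helper: total the Kbytes column of `pmap -sx` output (read on
# standard input) for mappings with a given Pgsz value, e.g. 4M or 8K.
# Assumes the column layout shown above: Address Kbytes RSS Anon Locked Pgsz.
kb_with_pgsz() {
    awk -v want="$1" '
        # Mapping rows start with a hexadecimal address; skip other rows.
        $1 ~ /^[0-9A-Fa-f]+$/ && NF >= 7 && $6 == want { sum += $2 }
        END { print sum + 0 }
    '
}

# Usage on a Solaris system (not run here):
#   pmap -sx `pgrep testprog` | kb_with_pgsz 4M
```

For the listing above, the 4M total would be 131072 Kbytes, out of 135968 Kbytes mapped in all.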

The ppgsz command is the simplest to use, but the specified page size preferences are not inherited across exec(2). If your program execs another program and you want that program to keep the page size preferences, use the mpss.so.1 preload library instead.

The mpss.so.1 shared object in /usr/lib provides a means by which the preferred stack or heap page size can be automatically enforced for launched processes and their descendants. The library has the advantage over the wrapper that page sizes are inherited across exec(2). To enable mpss.so.1, set LD_PRELOAD in the environment (see ld.so.1(1)) along with one or more MPSS (multiple page-size support) environment variables.

     MPSSHEAP=size

     MPSSSTACK=size
           MPSSHEAP and  MPSSSTACK  specify  the  preferred  page
           sizes for the heap and stack, respectively. The speci-
           fied  page  size(s)  are  applied   to   all   created
           processes.

For example: Using sh or ksh:

sol9# LD_PRELOAD=$LD_PRELOAD:mpss.so.1 MPSSHEAP=4M
sol9# export LD_PRELOAD MPSSHEAP
sol9# ./myprog

Using csh or tcsh:

sol9# setenv LD_PRELOAD $LD_PRELOAD:mpss.so.1
sol9# setenv MPSSHEAP 4M
sol9# ./myprog

To confirm that the application is now running more efficiently, run trapstat again. Ideally, the percentage of time spent in the kernel will be lower. In this example, the percentage of time has dropped from 46.2% to 0.2%!

sol9# trapstat -T 1
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
  0 u   8k|        30  0.0         0  0.0 |       221  0.1         0  0.0 | 0.1
  0 u  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 u 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 u   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
  0 k   8k|         1  0.0         0  0.0 |      4271  0.1        10  0.0 | 0.1
  0 k  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
==========+===============================+===============================+====
      ttl |        31  0.0         0  0.0 |      4491  0.2        10  0.0 | 0.2

Things to watch out for

TLB sizes vary between UltraSPARC versions. On the UltraSPARC I and II microprocessors (143 MHz to 480 MHz), the data TLB has 64 entries and supports all four page sizes. User applications can use any of the four page sizes available.

However, the 750 MHz UltraSPARC III microprocessor includes several small TLBs and one large 512-entry TLB that supports only 8Kbyte entries. Thus, use of large pages is typically not beneficial on 750 MHz UltraSPARC III systems.

The UltraSPARC III from 900 MHz onwards has two large 512-entry TLBs, one of which is configured automatically based on the most common page sizes in a process's address space. A process using one large page size in addition to the base page size (8 Kbytes) will have one of its large TLBs automatically programmed for the large page size once 8 or more pages of that size are in use within the process. Thus, large pages can be used very effectively on these UltraSPARC III processors. However, since the large TLBs can only be configured for one page size at a time per process, only two page sizes should be used concurrently (typically 8 Kbytes plus one other size).

Another point of interest is large page fragmentation. Large pages require contiguous physical memory, and if smaller pages fragment physical memory, a request for larger pages might not be fulfilled. When the system boots, a sizable pool of large pages is available, but if a significant number of smaller pages are allocated and locked down, an application requesting larger page sizes might not have its request fulfilled with the larger page size. The application's request for memory is still satisfied; it just might not be allocated with its preferred larger page size. In other words, the application will still run, but its mappings will be backed by smaller pages. For this reason, it is advisable to check with pmap -xs that larger page sizes are indeed allocated.

There is, however, a way to minimize the amount of fragmentation. The Solaris 9 kernel will attempt to relocate pages to create the required contiguous memory; this works for all but locked pages. Enabling the kernel cage vastly improves this situation, because kernel memory is then allocated from a small contiguous range of memory, which minimizes the fragmentation of other pages within the system. The kernel cage is enabled on E10k and F15k systems, but not on other systems. It can be enabled by setting kernel_cage_enable in /etc/system:

	set kernel_cage_enable=1

Commands and API Quick-Reference

ppgsz	     An administrative wrapper program for advising page size
	     preferences. ppgsz is not inherited across exec() by a
	     new program.

pmap -sx     A utility to print the MMU page size for each mapping in
	     the program.

mpss.so.1    A preload library, enabled by setting
	     LD_PRELOAD=mpss.so.1 for advising page size preferences
	     for existing applications. Advice is held across exec()
	     of a new program.

trapstat -t  A tool for measuring the amount of time spent servicing
	     TLB misses.

cc	     New options are included in the Sun ONE Studio 8 compiler:
	     -xpagesize_heap and -xpagesize_stack

That's it! There's a paper with a longer description.


Support for Large Pages within the Kernel's Address Space

Support for mapping the kernel heap with large pages was added in Solaris 10 Update 2, but by default it is enabled only on systems with 1GB of memory or more. You can lower this threshold by setting the segkmem_lpminphysmem tunable in /etc/system. The default kernel large page size is 4M on USII systems, and should give the best performance, but you can experiment with 64K if you wish by setting segkmem_lpsize in /etc/system. The USIIi TLB is fully associative, so there is no disadvantage in enabling a mixture of page sizes on the system.

Over time, physical memory fragmentation may cause attempts to allocate a large page for the kernel heap to fail, in which case small pages are used. You can avoid this by estimating your kernel heap memory demand and preallocating a large page kernel heap area at boot time, by setting the segkmem_kmemlp_min tunable in /etc/system.

All of these tunables have units of bytes.
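
Because the values are in bytes, it is easy to slip by a factor of 1024 when writing them by hand. The following is a minimal sketch that converts a size in Mbytes to bytes and prints the corresponding /etc/system line. The helper name set_line_mb is hypothetical, the 512 Mbyte figure is only an example value rather than a recommendation, and a POSIX shell is assumed (ksh rather than /bin/sh on Solaris 9).

```shell
# Hypothetical helper: print an /etc/system line for a tunable whose value is
# given in Mbytes, converted to bytes. The tunable name comes from the text
# above; the 512 Mbyte figure below is only an example value.
set_line_mb() {
    name=$1
    mbytes=$2
    echo "set $name = $(( mbytes * 1024 * 1024 ))"
}

set_line_mb segkmem_lpminphysmem 512
# prints: set segkmem_lpminphysmem = 536870912
```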

Large Pages for Binaries

Solaris 10 Update 1 provides large page support for instructions within a binary. Solaris 9 introduced Multiple Page Size Support (MPSS), which lets applications use any of the hardware address translation sizes supported by the system; however, MPSS only provides large pages for user applications in anonymous memory, e.g. heap, stack, ISM and MAP_ANON mappings.

Solaris 10 introduced dynamic TSB support, which helps reduce the dTSB miss rate considerably. However, on Solaris 10, user iTLB misses are still a major bottleneck. In Solaris, user text segments are implemented with memory mapped files, and can potentially be mapped with large pages.

VMPSS - Large Pages for Executables and Files

With Solaris 10 Update 1, Sun introduced MPSS for executables (text/instructions) and files, which is referred to as MPSS for Vnodes or simply VMPSS. VMPSS extends MPSS support to regular file mappings in order to reduce iTLB misses for text, dTLB misses for initialized data, and dTLB misses in general for other memory mapped files. With VMPSS, large applications may get large pages automatically for text and initialized data segments, based on the segment size and the page sizes supported by the underlying hardware. So if you have a large application running on Solaris 10 Update 1, it might already be enjoying the benefits of VMPSS, and you may observe a noticeable improvement in its run-time performance. To check whether the application is taking advantage of VMPSS, run pmap -sx <pid> and look at the page sizes in the Pgsz column of the output.

Default page sizes, and tunables on different platforms

UltraSPARC II

No large pages are used by default on US-II systems. To enable the default use of 64K or 4M pages for text, add the following lines to /etc/system (and reboot the machine once):

set use_text_pgsz64k = 1
set use_text_pgsz4m = 1

To enable the default use of 64K pages for data, also add set use_initdata_pgsz64k = 1 to /etc/system.

UltraSPARC III/III+/IV/IV+

4M is the only large page size supported by default for text mappings. If the application exhibits performance regression with this default behavior, the default use of 4M text pages can be disabled by adding the following line in /etc/system:

set use_text_pgsz4m = 0

No large pages are used by default for initialized data segments on these platforms.

UltraSPARC T1 (aka Niagara)

64K and 4M page sizes for text, and 64K page sizes for initialized data are used by default.

In case of regressions, add the following lines in /etc/system.

To disable the default use of 64K or 4M text pages:

set use_text_pgsz64k = 0
set use_text_pgsz4m = 0

To disable the default use of 64K data pages:

set use_initdata_pgsz64k = 0

x86

No large pages are used by default on x86. The default use of 2M text pages on PAE machines or 4M pages on non-PAE machines can be enabled with the following setting in /etc/system:

set use_text_largepages = 1
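
When several of these settings accumulate in /etc/system, it is useful to confirm what a given tunable is currently set to. The following is a minimal sketch, assuming the simple "set name = value" form used in this article and that lines starting with "*" are comments. The helper name system_tunable is hypothetical, and an awk that accepts -v is assumed (nawk or /usr/xpg4/bin/awk on Solaris 9).

```shell
# Hypothetical helper: print the value assigned to a tunable in an
# /etc/system style file. Lines beginning with "*" are treated as comments.
# Handles both "set name=value" and "set name = value".
system_tunable() {
    file=$1
    name=$2
    awk -v name="$name" '
        /^[ \t]*\*/ { next }                 # skip comment lines
        $1 == "set" {
            line = $0
            sub(/^[ \t]*set[ \t]+/, "", line)
            n = split(line, kv, "=")
            if (n < 2) next
            gsub(/[ \t]/, "", kv[1])
            gsub(/[ \t]/, "", kv[2])
            if (kv[1] == name) value = kv[2]  # keep the last match found
        }
        END { if (value != "") print value }
    ' "$file"
}

# Usage on a Solaris system (not run here):
#   system_tunable /etc/system use_text_pgsz4m
```

The helper prints nothing when the tunable is not set in the file, in which case the platform default described above applies.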