Networks

Tuning Network Performance

Tunings are divided into the TCP/IP layer and the driver layer. Driver tuning is further divided into cross-platform tuning and platform-specific tuning. For example, to tune FTP over Sun Multithreaded 10GbE on an x64 server, apply the FTP tuning, then the cross-platform and x64-specific tunings for Sun Multithreaded 10GbE.

Tuning TCP/IP

For 10GbE throughput

ndd -set /dev/tcp tcp_recv_hiwat 400000                                
ndd -set /dev/tcp tcp_xmit_hiwat 400000
ndd -set /dev/tcp tcp_max_buf 2097152
ndd -set /dev/tcp tcp_cwnd_max 2097152

The default socket buffer size is too small for a small number of TCP streams to reach 10 Gbit/s throughput. Set it to 256 KB or larger in your application or benchmark tool (iperf, netperf, ttcp, uperf).
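
For example, with iperf (a minimal sketch; "server-host" is a placeholder, and the -w value should not exceed tcp_max_buf):

# On the receiving side: start an iperf server with a 400 KB socket buffer.
iperf -s -w 400K
# On the sending side: run a 60-second test with the same buffer size.
iperf -c server-host -w 400K -t 60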

For FTP throughput

Increase the TCP window size on both ends so that FTP throughput is not limited by TCP flow control. Add

tcpwindow 400000

to /etc/ftpd/ftpaccess on the FTP server and run /usr/sbin/ftprestart. On the FTP client, issue the command "tcpwindow 400000" before the file transfer.
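
As a sketch of the steps above ("server-host" and "bigfile" are placeholders):

# On the FTP server: append the directive and re-read the configuration.
echo "tcpwindow 400000" >> /etc/ftpd/ftpaccess
/usr/sbin/ftprestart

# On the FTP client: set the window inside the ftp session before the transfer.
ftp server-host
ftp> tcpwindow 400000
ftp> binary
ftp> get bigfile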

For NFS throughput

To maximize parallelism between an NFS client and its servers, add to /etc/system:

set rpcmod:clnt_max_conns = 8

As of S10U8, the default is 1, which allows only a single connection from the client to each server. This results in a single-thread bottleneck for communication with each server. Increasing this value on the client increases the number of connections to each server.

A new default is being investigated under CR 6887770.

As a starting point, use ip:ip_soft_rings_cnt/2 (this assumes ip_soft_rings_cnt has already been configured for your platform and driver).

To be effective, this requires the fix for CR 2179399, which is available in snv_117, S10U8, or S10 patch 141914-02.
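
To confirm the setting is in effect after reboot, the number of TCP connections from the client to each NFS server can be checked, for example (a sketch; run on the client):

# One line is shown per NFS-over-TCP connection to port 2049 on each server.
netstat -n -P tcp | grep 2049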

Additional tunable for single connection TCP throughput

To optimize throughput for a small number (<= number of CPUs / 2) of TCP connections with heavy traffic, add to /etc/system:

set ip:tcp_squeue_wput=1

This has been found to help throughput on x86, and jumbo frame throughput on CMT for this type of workload.

Additional tunable for bursty TCP connection establishment

Bursty TCP connection establishment can lead to an unbalanced connection-to-CPU mapping (CR 6364567). For example, TCP throughput with multiple connections, as measured by iperf, may be limited. The work-around is to add to /etc/system:

set hires_tick=1

Using this tunable may increase CPU utilization.
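
For example, a multi-connection iperf run of the kind mentioned above ("server-host" is a placeholder); watch per-CPU load while it runs to see whether connections are spread evenly:

# Open 8 parallel TCP connections for 60 seconds.
iperf -c server-host -P 8 -t 60
# In another terminal, check per-CPU utilization.
mpstat 5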

Tuning Ethernet Adaptors

Tuning Sun 10 GbE with Intel 82598/82599 10 Gigabit Ethernet Controller

82599 specific tuning

The 82599 may generate too many interrupts due to CR 6855939, which is fixed in Solaris Nevada 122. On earlier builds, the work-around is to add one line to ixgbe.conf:

intr_throttling=200;
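
To compare the interrupt rate before and after applying the work-around, intrstat can be used, e.g.:

# Report per-device interrupt rates and CPU time every 5 seconds (run as root).
intrstat 5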

Cross-platform tuning

By default, UDP packets are received on a single RX ring. UDP packets can be received on multiple rings with the following tuning:

ixgbe.conf

rss_udp_enable=1;

Note: fragmented UDP packets may be received out of order with this tuning.

Tuning the bcopy threshold on x64 systems may reduce CPU utilization and/or improve throughput. Experiment to find the optimal value. As an example:

ixgbe.conf

tx_copy_threshold=1024;

x64 specific - Nevada 110 or later; S10U8 or later

/etc/system

set ddi_msix_alloc_limit=8

x64 specific - S10U7; before Nevada 110

/etc/system

set ddi_msix_alloc_limit=8
set pcplusmp:apic_multi_msi_max=8
set pcplusmp:apic_msix_max=8
set pcplusmp:apic_intr_policy=1

/kernel/drv/ixgbe.conf

rx_queue_number=8;
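
To verify that multiple MSI-X vectors were allocated and are spread across CPUs, the interrupt layout can be inspected (a sketch, assuming the ::interrupts dcmd is available on your x86 kernel):

# List interrupt vectors, their type (Fixed/MSI/MSI-X), and target CPU.
echo "::interrupts" | mdb -k
# Observe per-device interrupt distribution at run time.
intrstat 5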

Tuning NIU, XAUI, or Sun Multithreaded 10GbE

Cross-platform tuning

Software LSO helps to maximize TCP transmit throughput and/or reduce CPU utilization. Software LSO is available in Nevada 82 and as an S10U5 patch (Patch-ID#: 138048). To enable software LSO, edit nxge.conf and uncomment the line:

soft-lso-enable = 1;

Tuning the bcopy threshold on x64 systems may reduce CPU utilization and/or improve throughput. Add to /etc/system:

set nxge:nxge_bcopy_thresh=1024

To minimize latency (at the cost of increased CPU utilization), interrupt blanking can be reduced. Add to /kernel/drv/nxge.conf:

rxdma-intr-time=1;
rxdma-intr-pkts=8;

Note that the fix for CR 6722278 is needed for this tuning to take effect.

CMT specific

T5440/T5240/T5140/T5220/T5120/T2000/T1000

For S10U7 or earlier, add to /etc/system:

set ip:ip_soft_rings_cnt=16

No tuning is needed for S10U8.

For T1000/T2000 systems with 1.0 GHz CPUs, interrupt fencing by

psradm -i 1-3 5-7 9-11 13-15 17-19 21-23 25-27 29-31

may improve throughput.
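
After the psradm command, psrinfo shows the fenced strands as "no-intr", for example:

# Strands disabled for interrupts are reported as "no-intr";
# the remaining strand per core still takes interrupts.
psrinfo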

M series specific - S10U8

The default values should work well for general workloads.

M series specific - S10U7 or earlier

/etc/system

set ddi_msix_alloc_limit=8
set ip:ip_soft_rings_cnt=16
set ip_squeue_soft_ring=1
set ip_threads_per_cpu=2

x64 specific

x64 - Nevada 110 or later

/etc/system

set ddi_msix_alloc_limit=8

x64 - S10U5 or later; before Nevada 110

/etc/system

set ddi_msix_alloc_limit=8
set pcplusmp:apic_multi_msi_max=8
set pcplusmp:apic_msix_max=8
set pcplusmp:apic_intr_policy=1
set nxge:nxge_msi_enable=2

x64 - S10U4

/etc/system

set ip_squeue_soft_ring=1
set ip:ip_soft_rings_cnt=n (where n = min(8, number of cores))

Sun Fire X4150, X4450 specific

Disable 'Hardware Prefetcher' and 'Adjacent Cache Line Prefetch' in the BIOS, in addition to the tuning above.

Tuning Sun Multithreaded Quad Gigabit Ethernet

Cross-platform tuning

Software LSO helps to maximize TCP transmit throughput and/or reduce CPU utilization. Software LSO is available in Nevada 82 and as an S10U5 patch (Patch-ID#: 138048). To enable software LSO, edit nxge.conf and uncomment the line:

soft-lso-enable = 1;

Tuning the bcopy threshold on x64 systems may reduce CPU utilization and/or improve throughput. Add to /etc/system:

set nxge:nxge_bcopy_thresh=1024

To minimize latency (at the cost of increased CPU utilization), interrupt blanking can be reduced. Add to /kernel/drv/nxge.conf:

rxdma-intr-time=1;
rxdma-intr-pkts=8;

Note that the fix for CR 6722278 is needed for this tuning to take effect.

CMT specific

None for general workloads. For small packet workloads:

/etc/system

set ip:ip_soft_rings_cnt=16

M series specific

None for general workloads. For small packet workloads:

/etc/system

set ip_squeue_soft_ring=1
set ip:ip_soft_rings_cnt=16

and interrupt fencing:

psradm -i 1 3 5 7 9 ... (#cpu - 1)

x64 specific

None for general workloads. For small packet workloads:

x64 - Nevada 110 or later

/etc/system

set ddi_msix_alloc_limit=4

/kernel/drv/nxge.conf

#msix-request=4;

x64 - S10U5 or later; before Nevada 110

/etc/system

set ddi_msix_alloc_limit=4
set pcplusmp:apic_multi_msi_max=4
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_intr_policy=1
set nxge:nxge_msi_enable=2

/kernel/drv/nxge.conf

#msix-request=4;

Tuning for link aggregation

Because the soft ring count applies to the aggregated link, not to individual interfaces, more soft rings are recommended for link aggregation. As a starting point, use (# of recommended soft rings for one interface) * (# of aggregated interfaces) soft rings, e.g.

set ip:ip_soft_rings_cnt=8

for 4 aggregated e1000g interfaces: since 2 soft rings are recommended for e1000g, 2 * 4 = 8.
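
The number of aggregated interfaces can be confirmed with dladm, e.g.:

# List link aggregations and their member ports.
dladm show-aggr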

If mpstat shows that the interrupt CPU is almost 100% utilized, distribute NIC interrupts across all cores. Using an 8-core T2000 as an example:

psradm -i 1-3 5-7 9-11 13-15 17-19 21-23 25-27 29-31
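
To confirm the interrupt load is now spread out, re-run mpstat and compare the intr/ithr and idl columns across CPUs:

# Per-CPU statistics every 5 seconds; no single CPU should sit near 0% idl
# while handling most of the interrupts (intr/ithr columns).
mpstat 5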

Tuning in /etc/system vs. ndd

Some tunables, such as the number of soft rings, can be changed either in /etc/system or with ndd. Changes made in /etc/system take effect after the system reboots and persist across reboots. Changes made with ndd take effect immediately but do not persist across reboots.

For example, changing the number of soft rings with ndd affects NICs plumbed afterwards, but NICs that are already plumbed are not affected.
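
As a sketch, assuming the soft ring count is exposed through /dev/ip on your release, the ndd form would be:

# Read the current value, then change it; only NICs plumbed afterwards pick it up.
ndd -get /dev/ip ip_soft_rings_cnt
ndd -set /dev/ip ip_soft_rings_cnt 16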

Explanation for tunables

  • ddi_msix_alloc_limit: This is a system-wide setting of the maximum number of MSI (Message Signaled Interrupts) and MSI-X interrupts that can be allocated per PCI device. It has been removed in S10U8 on SPARC platforms. The default is to allocate at most 2 MSIs per device. If this value is set too high, the system may panic because it runs out of interrupts.

Each receive DMA channel of a NIC can generate one interrupt, and each interrupt targets one CPU. Sun Multithreaded 10GbE has 8 receive DMA channels per port, and Quad GbE has 4, so their interrupts can target at most 8 and 4 different CPUs, respectively. To avoid the interrupt CPU becoming the performance bottleneck, start with either the number of receive DMA channels per port or the number of CPUs, whichever is lower, so that the interrupt load is distributed across enough CPUs. For example, for Sun Multithreaded 10GbE on a system with at least 8 CPUs, min(8, # of CPUs) = 8, hence ddi_msix_alloc_limit=8.

  • apic_multi_msi_max and apic_msix_max were removed in Nevada 110.
  • ip_soft_rings_cnt: This is a system-wide setting of how many software rings (aka soft rings) are used to process received packets. The default is 2 on Niagara systems. For optimal receive throughput, it is recommended to start with 8 to 16 software rings on CMT, and 16 or 32 on OPL. The optimal number of software rings depends on the network device and the workload. You can specify a different number of software rings per network device.
  • ip_soft_rings_10gig_cnt: Starting with Solaris 10 Update 8, the number of soft rings is controlled by different tunables: ip_soft_rings_cnt for GigE and ip_soft_rings_10gig_cnt for 10 GigE. The number of soft rings used is 2 * ip_soft_rings_10gig_cnt per 10 GigE port. ip_soft_rings_10gig_cnt defaults to 8, so the default is 16 soft rings per 10 GigE port.
  • tcp_squeue_wput: When this is set to 1 (the default is 2), the application processes its own packets but does not try to drain the squeue. As a result, more TCP packets are processed by the soft ring thread, and the load for one connection is balanced more evenly across two CPUs. CPU efficiency may be slightly lower.
  • For systems with 1.0 GHz CPUs under heavy network traffic, the interrupt CPU may become the bottleneck when NIC interrupts fall on only 2 or 3 cores. The psradm command above enables only one strand per core to take interrupts, so NIC interrupts are distributed to all cores.
  • apic_intr_policy: A value of 1 selects round-robin interrupt distribution; this is the default after Nevada 110.
  • apic_enable_dynamic_migration: A value of 0 disables interrupt migration between CPUs.
  • nxge_msi_enable: A value of 2 selects MSI-X. There are more MSI-X vectors available than MSI vectors, so MSI-X is preferred.