Networks
From Siwiki
Tuning Network Performance
Tunings are divided into the TCP/IP layer and the driver layer. Driver tuning is further divided into cross-platform tuning and platform-specific tuning. For example, to tune FTP over Sun Multithreaded 10GbE on an x64 server, apply the FTP tuning together with the cross-platform and x64-specific tuning for Sun Multithreaded 10GbE.
Tuning TCP/IP
For 10GbE throughput
ndd -set /dev/tcp tcp_recv_hiwat 400000
ndd -set /dev/tcp tcp_xmit_hiwat 400000
ndd -set /dev/tcp tcp_max_buf 2097152
ndd -set /dev/tcp tcp_cwnd_max 2097152
The default socket buffer size is too small for a small number of TCP streams to reach 10 Gbit/s throughput. Set it to 256KB or larger in your application or benchmark tool (iperf, netperf, ttcp, uperf).
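The 256KB figure is consistent with a bandwidth-delay product estimate: a single stream needs a buffer of roughly bandwidth times round-trip time to keep the link full. The sketch below is only an illustration of that arithmetic. The helper name bdp_bytes and the 200-microsecond round-trip time are assumptions made for the example, not values from the original notes.

```shell
# Rough bandwidth-delay product estimate (hypothetical helper, not a Solaris tool).
# $1 = link speed in Mbit/s
# $2 = round-trip time in microseconds (an assumed value; measure your own)
# Mbit/s * microseconds = bits in flight; divide by 8 to get bytes.
bdp_bytes() {
    echo $(( $1 * $2 / 8 ))
}

# 10GbE with an assumed 200 us round trip
bdp_bytes 10000 200
```

With these assumed inputs the estimate is 250000 bytes, which is in line with the 256KB starting point. A longer round-trip time calls for a proportionally larger buffer.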
For FTP throughput
Increase the TCP window size on both ends so that FTP throughput is not limited by TCP flow control. On the FTP server, add
tcpwindow 400000
to /etc/ftpd/ftpaccess and run /usr/sbin/ftprestart. On the FTP client, issue the command "tcpwindow 400000" before the file transfer.
Additional tunable for single connection TCP throughput
To optimize throughput on a few (<= # CPU / 2) TCP connections with heavy traffic, add to /etc/system:
set ip:tcp_squeue_wput=1
This has been found to help throughput on x86, and jumbo frame throughput on CMT for this type of workload.
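Whether a workload counts as "few connections" under the (<= # CPU / 2) rule can be checked with a one-line comparison. This is a minimal sketch: few_connections is a hypothetical helper, and both counts are passed in by hand (the CPU count could, for instance, be taken from the number of lines psrinfo prints).

```shell
# Hypothetical helper: succeeds when connections <= (number of CPUs) / 2.
# $1 = number of heavily loaded TCP connections
# $2 = number of CPUs
# Multiplies instead of dividing so that odd CPU counts are not rounded.
few_connections() {
    [ $(( $1 * 2 )) -le "$2" ]
}

# Example: 4 busy connections on an assumed 8-CPU system
if few_connections 4 8; then
    echo "workload fits the few-connections case for tcp_squeue_wput=1"
fi
```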
Additional tunable for bursty TCP connection establishment
Bursty TCP connection establishment leads to an unbalanced connection-to-CPU mapping (CR 6364567 [1]). For example, TCP throughput measured by iperf with multiple connections may be limited. The workaround is to add to /etc/system:
set hires_tick=1
Using this tunable may increase CPU utilization.
Tuning Ethernet Adaptors
Tuning Sun 10 GbE with Intel 82598EB 10 Gigabit Ethernet Controller
Cross-platform tuning
Tuning the bcopy threshold on x64 systems may reduce CPU utilization and/or improve throughput. Experiment to find the optimal value. As an example, in ixgbe.conf:
tx_copy_threshold=1024;
x64 specific
Nevada 110 or later
/etc/system
set ddi_msix_alloc_limit=8
S10U7; before Nevada 110
/etc/system
set ddi_msix_alloc_limit=8
set pcplusmp:apic_multi_msi_max=8
set pcplusmp:apic_msix_max=8
set pcplusmp:apic_intr_policy=1
/kernel/drv/ixgbe.conf
rx_queue_number=8;
Tuning NIU, XAUI, or Sun Multithreaded 10GbE
Cross-platform tuning
Software LSO helps to maximize TCP transmit throughput and/or reduce CPU utilization. It is available in Nevada 82 and as an S10U5 patch (Patch-ID#: 138048). To enable software LSO, edit nxge.conf and uncomment the line
soft-lso-enable = 1;
Tuning the bcopy threshold on x64 systems may reduce CPU utilization and/or improve throughput.
/etc/system
set nxge:nxge_bcopy_thresh=1024
To minimize latency (at the cost of increased CPU utilization), interrupt blanking can be reduced.
/kernel/drv/nxge.conf
rxdma-intr-time=1;
rxdma-intr-pkts=8;
Note that a fix for CR 6722278 is needed for the tuning to take effect.
CMT specific
For T5440/T5240/T5140/T5220/T5120/T2000/T1000:
/etc/system
set ip:ip_soft_rings_cnt=16
For T1000/T2000 systems with 1.0GHz CPUs, interrupt fencing with
psradm -i 1-3 5-7 9-11 13-15 17-19 21-23 25-27 29-31
may improve throughput.
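The psradm argument list above follows a regular pattern: on each core, every strand except the first is barred from taking interrupts. The sketch below generates that list for a given core and strand count. fence_list is a hypothetical helper, and it assumes CPU IDs are contiguous, that core c owns IDs c*strands through c*strands+strands-1, and that each core has at least two strands.

```shell
# Hypothetical helper that builds the argument list for psradm -i.
# $1 = number of cores, $2 = hardware strands per core (assumed >= 2).
# Assumes contiguous CPU IDs with core c owning IDs c*strands .. c*strands+strands-1.
# The first strand of each core is left out so it can still take interrupts.
fence_list() {
    cores=$1 strands=$2 c=0 out=""
    while [ "$c" -lt "$cores" ]; do
        first=$(( c * strands + 1 ))
        last=$(( c * strands + strands - 1 ))
        if [ "$first" -eq "$last" ]; then
            item=$first
        else
            item="$first-$last"
        fi
        out="$out${out:+ }$item"
        c=$(( c + 1 ))
    done
    echo "$out"
}

# 8 cores x 4 strands, as on an 8-core T2000
echo "psradm -i $(fence_list 8 4)"
```

The script only prints the command, so the list can be compared with the actual processor layout before anything is run.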
M series specific
/etc/system
set ddi_msix_alloc_limit=8
set ip:ip_soft_rings_cnt=16
set ip_squeue_soft_ring=1
set ip_threads_per_cpu=2
x64 specific
x64 - Nevada 110 or later
/etc/system
set ddi_msix_alloc_limit=8
x64 - S10U5 or later; before Nevada 110
/etc/system
set ddi_msix_alloc_limit=8
set pcplusmp:apic_multi_msi_max=8
set pcplusmp:apic_msix_max=8
set pcplusmp:apic_intr_policy=1
set nxge:nxge_msi_enable=2
x64 - S10U4
/etc/system
set ip_squeue_soft_ring=1
set ip:ip_soft_rings_cnt=n
where n = min(8, number of cores).
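The value of n can be worked out mechanically from the core count. This is a minimal sketch of the min(8, number of cores) rule. soft_rings_s10u4 is a hypothetical helper name and the 4-core system is an assumed example.

```shell
# Hypothetical helper: n = min(8, number of cores).
# $1 = number of cores on the system
soft_rings_s10u4() {
    if [ "$1" -lt 8 ]; then
        echo "$1"
    else
        echo 8
    fi
}

# Print the /etc/system line for an assumed 4-core system
echo "set ip:ip_soft_rings_cnt=$(soft_rings_s10u4 4)"
```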
Sun Fire X4150, X4450 specific
In addition to the tuning above, disable 'Hardware Prefetcher' and 'Adjacent Cache Line Prefetch' in the BIOS.
Tuning Sun Multithreaded Quad Gigabit Ethernet
Cross-platform tuning
Software LSO helps to maximize TCP transmit throughput and/or reduce CPU utilization. It is available in Nevada 82 and as an S10U5 patch (Patch-ID#: 138048). To enable software LSO, edit nxge.conf and uncomment the line
soft-lso-enable = 1;
Tuning the bcopy threshold on x64 systems may reduce CPU utilization and/or improve throughput.
/etc/system
set nxge:nxge_bcopy_thresh=1024
To minimize latency (at the cost of increased CPU utilization), interrupt blanking can be reduced.
/kernel/drv/nxge.conf
rxdma-intr-time=1;
rxdma-intr-pkts=8;
Note that a fix for CR 6722278 is needed for the tuning to take effect.
CMT specific
None for general workloads. For small packet workloads:
/etc/system
set ip:ip_soft_rings_cnt=16
M series specific
None for general workloads. For small packet workloads:
/etc/system
set ip_squeue_soft_ring=1
set ip:ip_soft_rings_cnt=16
and interrupt fencing:
psradm -i 1 3 5 7 9 ... (#cpu - 1)
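The elided list above is the set of odd-numbered CPU IDs up to (#cpu - 1). A short loop can write it out in full for a given CPU count. odd_cpus is a hypothetical helper, and it assumes an even number of CPUs with contiguous IDs starting at 0. The 8-CPU call is only an example.

```shell
# Hypothetical helper: list the odd-numbered CPU IDs 1, 3, ..., (#cpu - 1).
# $1 = number of CPUs (assumed even, with contiguous IDs starting at 0)
odd_cpus() {
    i=1 out=""
    while [ "$i" -lt "$1" ]; do
        out="$out${out:+ }$i"
        i=$(( i + 2 ))
    done
    echo "$out"
}

# Print the fencing command for an assumed 8-CPU system
echo "psradm -i $(odd_cpus 8)"
```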
x64 specific
None for general workloads. For small packet workloads:
x64 - Nevada 110 or later
/etc/system
set ddi_msix_alloc_limit=4
/kernel/drv/nxge.conf
#msix-request=4;
x64 - S10U5 or later; before Nevada 110
/etc/system
set ddi_msix_alloc_limit=4
set pcplusmp:apic_multi_msi_max=4
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_intr_policy=1
set nxge:nxge_msi_enable=2
/kernel/drv/nxge.conf
#msix-request=4;
Tuning for link aggregation
Because the soft ring count applies to the aggregated link rather than to each individual interface, more soft rings are recommended for link aggregation. As a starting point, use (# of recommended soft rings for 1 interface) * (# of aggregated interfaces) soft rings. For example, use
set ip:ip_soft_rings_cnt=8
for 4 aggregated e1000g interfaces: 2 soft rings are recommended for e1000g, and 2 * 4 = 8.
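The starting-point rule for aggregations is a single multiplication, sketched here so it can be reused for other interface types. agg_soft_rings is a hypothetical helper. The per-interface recommendation has to come from the tuning for the specific NIC.

```shell
# Hypothetical helper: starting-point soft ring count for a link aggregation.
# $1 = recommended soft rings for one interface
# $2 = number of aggregated interfaces
agg_soft_rings() {
    echo $(( $1 * $2 ))
}

# Four aggregated e1000g interfaces at 2 soft rings each
echo "set ip:ip_soft_rings_cnt=$(agg_soft_rings 2 4)"
```

This gives only a starting value. The best count still depends on the workload.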
If mpstat shows that the interrupt CPU is almost 100% utilized, distribute NIC interrupts across all cores. Using an 8-core T2000 as an example:
psradm -i 1-3 5-7 9-11 13-15 17-19 21-23 25-27 29-31
Tuning in /etc/system vs. ndd
Some tunables, such as the number of soft rings, can be changed in either /etc/system or ndd. Changes made in /etc/system take effect after the system reboots and persist across reboots. Changes made using ndd take effect immediately but do not persist across reboots.
For example, changing the number of soft rings using ndd affects NICs plumbed afterwards, but NICs that are already plumbed are not affected.
Explanation for tunables
- ddi_msix_alloc_limit: This is a system-wide setting of the maximum number of MSI (Message Signaled Interrupt) and MSI-X interrupts that can be allocated per PCI device. By default, at most 2 MSIs are allocated per device. If this value is set too high, the system may panic because it runs out of interrupts.
Each receive DMA channel of a NIC can generate one interrupt, and each interrupt targets one CPU. Sun Multithreaded 10GbE has 8 receive DMA channels per port, and Quad GbE has 4, so their interrupts can target at most 8 and 4 different CPUs, respectively. To keep the interrupt CPU from becoming the performance bottleneck, start with the number of receive DMA channels per port or the number of CPUs, whichever is lower, so that the interrupt load is distributed across enough CPUs.
- apic_multi_msi_max and apic_msix_max are removed in Nevada 110.
- ip_soft_rings_cnt: This is a system-wide setting of how many software rings (aka soft rings) to use to process received packets. The default is 2 on Niagara systems. For optimal receive throughput, it is recommended to start with 8 to 16 software rings on CMT, and 16 or 32 on OPL. The optimal number of software rings depends on the network device and workload. You can specify a different number of software rings per network device.
- tcp_squeue_wput: When this is set to 1 (default is 2), the application tries to process its own packets but does not try to drain the squeue. As a result, more TCP packets are processed by the soft ring thread, and utilization is more balanced across the 2 CPUs serving one connection. CPU efficiency may be slightly lower.
- For systems with 1.0GHz CPUs under heavy network traffic, the interrupt CPU may become the bottleneck when NIC interrupts fall on only 2 or 3 cores. The psradm command given for T1000/T2000 systems leaves only 1 strand per core able to take interrupts, so NIC interrupts are distributed across all cores.
- apic_intr_policy: 1 is round robin interrupt distribution. This is the default after Nevada 110.
- apic_enable_dynamic_migration: 0 disables interrupt migration between CPUs.
- nxge_msi_enable: 2 is MSI-X. There are more MSI-X vectors available than MSI, so MSI-X is preferred.
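The starting value recommended for ddi_msix_alloc_limit (the number of receive DMA channels per port or the number of CPUs, whichever is lower) can be sketched as a small calculation. msix_limit is a hypothetical helper and the CPU counts in the example calls are assumptions. The channel counts of 8 and 4 are the ones given above for Sun Multithreaded 10GbE and Quad GbE.

```shell
# Hypothetical helper: starting value for ddi_msix_alloc_limit.
# $1 = receive DMA channels per port (8 for Sun Multithreaded 10GbE, 4 for Quad GbE)
# $2 = number of CPUs
# Returns the lower of the two.
msix_limit() {
    if [ "$1" -lt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Sun Multithreaded 10GbE on an assumed 32-CPU system
echo "set ddi_msix_alloc_limit=$(msix_limit 8 32)"
# Quad GbE on an assumed 2-CPU system
echo "set ddi_msix_alloc_limit=$(msix_limit 4 2)"
```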
