Copyright 2006, Hewlett-Packard Company

          A Brief Look at Latency vs Throughput Tradeoffs
              For High-Speed Network Interfaces

                         Rick Jones
                  Hewlett-Packard Company
                   Cupertino, California

ftp://tardy.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt
$Id: nic_latency_vs_tput.txt 50 2006-10-14 00:37:41Z raj $
http://tardy.hpl.hp.com/svn/briefs/trunk/nic_latency_vs_tput.txt

Introduction:

This evolving document will discuss tradeoffs between minimizing latency
and maximizing throughput for various "high-speed" (gigabit and higher)
network interfaces such as Gigabit Ethernet and 10 Gigabit Ethernet. The
netperf (http://www.netperf.org) benchmark will be used to demonstrate the
minimum latency versus maximum throughput tradeoffs made by the drivers
(e.g. e1000) for these NICs. Initial measurements will use Debian Sarge. It
is expected that while the "constants" involved may differ between distros,
the basic concepts remain the same. Still, later measurements may use RHEL4
or SLES9 as desired. The 2.0 HP Debian Telco Edition may be included in a
later revision of this document. Reader feedback will determine whether,
and in which order, data for other distros are included. Vote early, vote
often.

Executive Summary:

Up to version 7.3.12, the e1000 driver used in conjunction with the A9900A
PCI-X Dual-port Gigabit Ethernet adaptor strongly favors maximum packet per
second throughput over minimum request/response latency. Anyone desiring
the lowest possible request/response latency needs to alter the modprobe
parameters used when the e1000 driver is loaded. This appears to reduce
round-trip latency by as much as 85%. However, configuring the A9900A PCI-X
Dual-port Gigabit Ethernet adaptor for minimum request/response latency
will reduce maximum packet per second performance (as measured with the
netperf TCP_RR test) by ~23% and increase the service demand for bulk data
transfer by ~63% for sending and ~145% for receiving.

With version 7.3.12 and later the default situation is much improved, with
default single-transaction, single-stream latency significantly lower than
in previous versions. This is done in a way that has a positive effect on
the bulk-transfer service demands, although it does appear to have a slight
negative effect on default aggregate request/response performance. It is
still possible to obtain further latency reductions by "hand-tuning" the
InterruptThrottleRate settings even with version 7.3.12 of the e1000
driver. However, some of these latency reductions come at a heavy price for
other sorts of workloads.

In similar contrast :) the tg3 driver used in conjunction with the A7061A
or NC7781 interfaces and/or rx1600/dl145/bl20p core Gigabit Ethernet, when
measured in the bl20p, went from ~84 microseconds to ~46 microseconds, or a
round-trip latency reduction of ~45%. Achieving this lower round-trip
latency increases service demand for bulk throughput by 32% on the sending
side and nearly 110% on the receiving side. Issues in test behaviour make
it difficult to quantify the effect on aggregate request/response
performance; however, the data does show that tuning for minimum latency
dropped aggregate request/response performance. Tuning the coalescing
settings in the tg3 driver for only ~1000 interrupts per second made the
single-instance round-trip latency go to ~1000 microseconds, but increased
aggregate request/response by perhaps 12%.
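For reference, altering the modprobe parameters in that way might look
something like the following sketch. It uses the InterruptThrottleRate
module parameter discussed later in this document; the per-port handling of
comma-separated values should be confirmed against the modinfo output for
the e1000 version actually installed:

  # Reload e1000 with interrupt throttling disabled on both ports of the
  # adaptor (one comma-separated value per port; 0 disables the throttle,
  # favoring latency over CPU efficiency).
  rmmod e1000
  modprobe e1000 InterruptThrottleRate=0,0

Naturally, unloading the driver disturbs any traffic on its interfaces, so
this is something to do from the console or over another interface.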
Configuration:

The initial flavor of Hewlett-Packard Integrity server used in these tests
was the rx1600, which had two 1.0 GHz Itanium2 LV CPUs and some quantity of
RAM. An add-on A9900A PCI-X Dual-Port Gigabit Ethernet adaptor was
installed in one of the PCI-X slots. The rx1600's were connected via UTP to
1000Base-T ports on an HP ProCurve 9308 switch.

The initial OS load on the rx1600s was Debian Sarge, stable, with the
2.6.8-2-mckinley-smp kernel. The version of the e1000 driver used for the
A9900A in that OS load was 5.2.52-k4. As the author was aware of "issues"
with TSO in the 2.6.8 kernel, TSO was disabled for the tests involving
2.6.8.

For these tests, netperf was configured with --enable-burst. This optional
feature will inject a number of requests (specified via a test-specific -b
option) into the test connection before entering the core request/response
loop. When coupled with the test-specific -D option this can be used to
have multiple transactions outstanding at a time on the connection. In this
way it is possible to maximize the aggregate transactions per second with
fewer netperf processes. Previous tests had a single synchronous
transaction per netperf/netserver pair and so required _MANY_ such pairs,
which became as much a test of context switching as anything else.

That each transaction remained a distinct pair of packets on the wire was
verified via link-level statistics gathered via ethtool. If the reader
wishes to utilize this method on other platforms they would be wise to do
the same. The author _knows_ there is at least one platform out there that
will generate spurious ACKs when it receives back-to-back sub-MSS TCP
segments, which would throw such measurements off entirely.

Results:

First, we see the average Transaction/s performance of a single instance of
netperf with no initial burst of requests injected into the data
connection, and the e1000 driver using default interrupt coalescing
settings:

linger:/opt/netperf2# src/netperf -H 192.168.3.213 -t TCP_RR -i 10,3 -I 99,5 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      : 15.3%
!!!                       Local CPU util  : 11.9%
!!!                       Remote CPU util : 11.8%

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % S    us/Tr   us/Tr

16384  87380  1        1       60.00    1906.74  1.67   1.67   17.505  17.503
16384  87380

The author tried a number of times to hit the confidence intervals, but
didn't get any better than +/- 1/2 what you see above for Throughput and
CPU util. The netperf and/or netserver processes may have been migrating
from one CPU to another on their respective systems. At such low
transaction per second rates it may simply not be possible to hit the
confidence intervals, or the author was not sufficiently patient.

Now we look at what happens when the e1000 drivers on both systems are
loaded with InterruptThrottleRate set to 0 to disable the interrupt
throttling. Since both were single, synchronous transaction tests, we can
invert the Transaction/s figures to arrive at the average RTT or round-trip
latency.
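For example, using the 1906.74 transactions per second from the run above
(simple arithmetic, shown with awk merely for convenience):

  # a single-transaction TCP_RR exchange is one request/response round trip,
  # so the average RTT is simply the inverse of the transaction rate
  awk 'BEGIN { printf "%.1f microseconds\n", 1e6 / 1906.74 }'
  524.5 microseconds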
With the default interrupt throttle rate settings that becomes ~525 microseconds. With the interrupt throttle disabled that becomes: linger:/opt/netperf2# src/netperf -H 192.168.3.213 -t TCP_RR -i 10,3 -I 99,5 -l 60 -c -C TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.2% !!! Local CPU util : 8.6% !!! Remote CPU util : 7.0% Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 13017.80 13.95 14.24 21.432 21.874 16384 87380 Which translates to ~77 microseconds RTT. This of course is much, Much, MUCH lower than the default settings. As such, it would seem that one would want to always disable the interrupt throttle. However, in the finest traditions of there being no such things as free lunches, if we now look at a simple unidirectional throughput test and compare service demands (CPU consumed per unit of work)... First, with the interrupt throttle enabled: linger:/opt/netperf2# src/netperf -H 192.168.3.213 -i 10,3 -I 99,5 -l 60 -c -C -- -s 128K -S 128K -m 32K TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 262144 262144 32768 60.00 941.37 17.91 19.45 3.118 3.385 And then with the throttle disabled: linger:/opt/netperf2# src/netperf -H 192.168.3.213 -i 10,3 -I 99,5 -l 60 -c -C -- -s 128K -S 128K -m 32K TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.0% !!! Local CPU util : 14.0% !!! Remote CPU util : 0.7% Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 262144 262144 32768 60.00 941.37 29.22 47.73 5.086 8.308 In both cases we hit basically link-rate (94X Mbit/s) however, with the throttle disabled on both sides service demand is significantly increased - by ~63% for sending and by ~145% for receive. As one might imagine, this is rather significant. 
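For those wondering where the service demand figures come from, they can be
re-derived from the throughput and the system-wide CPU utilization. A
sketch of the arithmetic for the throttle-enabled send side, assuming
netperf's utilization figure covers both CPUs of these two-CPU systems and
that a KB here is 1024 bytes:

  # CPU-seconds consumed per second = (%CPU / 100) * number of CPUs
  # KB transferred per second       = (bits/s / 8) / 1024
  # service demand (usec/KB)        = CPU-seconds per second / KB per second * 1e6
  awk 'BEGIN { cpu = (17.91 / 100) * 2; kb = 941.37e6 / 8 / 1024;
               printf "%.3f usec/KB\n", cpu / kb * 1e6 }'
  3.117 usec/KB

which lands within rounding distance of the 3.118 us/KB netperf reported.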
Now we do aggregate TCP_RR with the first burst option, this time using a
total of three rx1600's, one as our "SUT" (System Under Test) and two as
our "LG's" (Load Generators).

Netperf TCP_RR Single-Byte Transactions per Second
vs First Burst and Number of Netperf Instances
Hewlett-Packard Integrity Server rx1600, 2x1.0GHz Itanium2
Single-port of A9900A PCI-X Gigabit Ethernet NIC
InterruptThrottle Enabled

                  Concurrent Netperfs
First burst      1        2        4
--------------------------------------------------
     1 |      4090     7350    14890
     2 |      6040    11550    21800
     4 |     10320    18560    34390
     8 |     13650    30970    52440
    16 |     24790    49560    84890
    32 |     45980    72750   134190
    64 |     65660   122310   142250
   128 |     65520   117400   141710

In the above, keep in mind that the single-concurrent netperf tests were to
a single LG rather than to two. Also, the number of outstanding requests
will be the number of netperfs multiplied by one more than the "First
burst" value. So, 1 concurrent netperf with a first burst of 2 is three
outstanding requests, and 2 concurrent netperfs with a first burst of 1 is
four outstanding requests.

Now we look at maximum transaction per second rates with the Interrupt
Throttle disabled:

Netperf TCP_RR Single-Byte Transactions per Second
vs First Burst and Number of Netperf Instances
Hewlett-Packard Integrity Server rx1600, 2x1.0GHz Itanium2
Single-port of A9900A PCI-X Gigabit Ethernet NIC
InterruptThrottle Disabled

                  Concurrent Netperfs
First burst      1        2        4
--------------------------------------------------
     1 |     24240    45140    75250
     2 |     33740    58820    97060
     4 |     48330    78120   108230
     8 |     56670   102800   109720
    16 |     64650    99670   106090
    32 |     68410   100500
    64 |     69330
   128 |     71660

We can see that there are much higher transaction per second rates at lower
numbers of simultaneous transactions. However, there is also a reduction in
the peak transaction per second rate the system can achieve.

10G - AD144A PCI-X 1.0 10 Gigabit Ethernet SR

Here we have aggregate data for TCP_RR through a 10 Gigabit Ethernet
Adaptor. These data were gathered with a Debian 2.6.12-1-mckinley-smp
kernel on the SUT; the kernel on the LG's remains as before. Also, the SUT
is an rx1620 with two 1.6 GHz 3MB cache CPUs. We disable the
InterruptThrottle on the LG's to minimize their effect on the measurement
since we don't have a 10G NIC at the other end. The 10G card is being
driven by version 1.7.7.1 of the "s2io" driver. Subsequent tests used the
2.0.12.0 version of the driver on a 2.6.15 kernel. The reader will also
notice we start having a "first burst" row for "0", which means no
additional requests are injected into the data connection before the core
loop.

Netperf TCP_RR Single-Byte Transactions per Second
vs First Burst and Number of Netperf Instances
Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2
AD144A PCI-X 10 Gigabit Ethernet NIC
InterruptThrottle Disabled on rx1600 LG's, 10G at Defaults

                  Concurrent Netperfs
First burst      1        2        4
--------------------------------------------------
     0 |     16890    34490    52930
     1 |     29530    54030    89660
     2 |     37380    72630   109970
     4 |     52730   103790   130580
     8 |     68500   124840   134420
    16 |     72770   126300   150310
    32 |     72540   126100   150920
    64 |     72940   126730   153100
   128 |     74900   125700   156400

Observations with top showed that both CPUs in the rx1620 were virtually
saturated by the First Burst == 64, four-concurrent netperf test.

Since we have switched CPUs, a proper comparison with the previous e1000
data is not really possible. So, we run a whole bunch of netperf numbers
again :) First with the e1000 driver (now version 6.0.54-k2-NAPI)
configured for defaults.
(The LG's sticking with the InterruptThrottle disabled) Netperf TCP_RR Single-Byte Transactions per Second vs First Burst and Number of Netperf Instances Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2 Single-port of A9900A PCI-X Gigabit Ethernet NIC InterruptThrottle Enabled Concurrent Netperfs First burst 1 2 4 -------------------------------------------------- 0 | 8000 16000 32000 1 | 16000 32000 64000 2 | 24000 47990 95730 4 | 39830 79730 107720 8 | 60639 111060 151140 16 | 68530 135910 203150 32 | 88130 165690 217460 64 | 88440 167730 219820 128 | 48040 117360 222270 [The author has no explanation for the 128 first burst performance drop-offs. However, they appear to be repeatable, and there does appear to be a slight difference in the number of packets transmitted vs received when they should be the same. Perhaps some strange interactions with congestion windows at connection startup. Packet traces, which were not taken, would probably be required to arrive at an answer. ] Notice how the zero first burst, single instance transaction rate is now 8000 transactions per second. This is considerably better than with the driver version in 2.6.8 on the rx1600 - likely as not there were a couple changes in the driver since there was a major version number change :) The result is also similar to what the author sees when the target system is running HP-UX rather than Linux and from this he concludes that the default InterruptThrottle stuff has changed considerably. Heck, OS influence may have changed as well. Anyway, we can see that the default setting for the e1000-driven NIC still keeps the transaction per second rate for a single stream rather lower than it could be, likely as not in the interest of making bulk throughput more efficient. And now with the InterruptThrottle disabled: Netperf TCP_RR Single-Byte Transactions per Second vs First Burst and Number of Netperf Instances Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2 Single-port of A9900A PCI-X Gigabit Ethernet NIC InterruptThrottle Disabled Concurrent Netperfs First burst 1 2 4 -------------------------------------------------- 0 | 16230 31577 50580 1 | 26270 48850 89710 2 | 36920 68790 114630 4 | 51350 105137 140440 8 | 69000 127170 151620 16 | 75999 129030 160580 32 | 75690 128970 162590 64 | 76140 129900 160990 128 | 77460 115780 169450 To give some idea of the effect of different values for the InterruptThrottleRate here are some additional results: Netperf TCP_RR Single-Byte Transactions per Second vs First Burst and Number of Netperf Instances Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2 Single-port of A9900A PCI-X Gigabit Ethernet NIC InterruptThrottleRate 1000 Concurrent Netperfs First burst 1 2 4 6 8 -------------------------------------------------------------------- 0 | 1000 4000 1 | 2000 8000 2 | 3000 12000 4 | 5000 20000 8 | 8990 35990 16 | 16990 68000 32 | 32990 129730 64 | 63960 180610 215230 239150 128 | 63990 173180 216290 235990 The one concurrent netperf was to a single rx1600 LG. The four concurrent netperf tests were to two rx1600 LGs and the six concurrent netperf tests were to three rx1600 LGs and the 8 concurrent netperf tests were to four rx1600's. The number of concurrent netperfs was increased in an attempt to absorb remaining idle CPU cycles. Increasing the first burst value beyond 128 did not appear to increase performance. 
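For completeness, each cell in these aggregate tables was produced along
the lines of the following sketch. The load-generator hostnames and the
output handling are placeholders, and netperf must have been compiled with
--enable-burst for the test-specific -b option to be available:

  # launch N concurrent netperf TCP_RR instances, each priming its connection
  # with an extra burst of b requests (-b) and disabling Nagle (-D), then sum
  # the per-instance transaction rates
  N=4; b=16
  for i in $(seq 1 $N); do
    netperf -P 0 -H lg$i -t TCP_RR -l 60 -- -b $b -D > tcp_rr_$i.out &
  done
  wait
  # the transactions/s figure is the last field of the TCP_RR result line
  # when CPU utilization is not being measured
  cat tcp_rr_*.out | awk '{ sum += $NF } END { print sum, "Trans/s total" }'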
Here we have data for the 2.0.12.0 version of the s2io driver on the 2.6.15 kernel: Netperf TCP_RR Single-Byte Transactions per Second vs First Burst and Number of Netperf Instances Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2 AD144A PCI-X 10 Gigabit Ethernet NIC Linux 2.6.15 Kernel s2io 2.0.12.0 s2io Interrupt Coalescing at Defaults Concurrent Netperfs First burst 1 2 4 -------------------------------------------------- 0 | 16430 31220 59890 1 | 28784 49950 95470 2 | 37380 70530 114160 4 | 53740 98280 136900 8 | 71700 117690 159570 16 | 71610 130190 n/a 32 | 71350 129910 n/a 64 | 72270 130410 n/a 128 | 74490 133490 n/a Don't forget that the clients remain rx1600s, albeit running 2.6.12 now instead of 2.6.8. And we use 1, 2 or 4 distinct rx1600's as load generators here. Here is the single-stream 0 burst run with CPU util and service demand, interrupt coalescing at the default setting for the s2io driver. The remote remains a Debian 2.6.12 rx1600 with that kernel's e1000 driver configured for no interrupt throttling: languid:/opt/netperf2# src/netperf -t TCP_RR -H 192.168.3.213 -i 30,3 -c TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr 16384 87380 1 1 10.00 16380.05 9.51 -1.00 11.607 -1.000 16384 87380 A quick compare with e1000 6.1.16-k2-NAPI, using single-stream and interrupt defaults: languid:/opt/netperf2# src/netperf -t TCP_RR -H 192.168.3.213 -i 30,3 -c -C TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 10.00 7998.18 5.37 8.67 13.427 21.670 16384 87380 Again we can see the effect of the default interrupt throttle settings on the e1000 driver: languid:/opt/netperf2# src/netperf -t TCP_RR -H 192.168.3.213 -i 30,3 -c TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr 16384 87380 1 1 10.00 15341.98 11.85 -1.00 15.447 -1.000 16384 87380 The service demands suggest that one should be able to achieve higher aggregate request/response performance over the AD144A, if perhaps its interrupt rates were throttled. Copyright 2006, Hewlett-Packard Company And here we compare settings on the core Gigabit Ethernet interface on the rx1600s. This is based on a BCM5701 chip and should mimic behavior of other BCM-based NICs such as the add-on A7061A PCI-X Gigabit Ethernet Adaptor. This is with the rx1600's still running the Debian 2.6.8-2-mckinley-smp and the interfaces connected via an HP ProCurve 2724 switch. The version of the tg3 driver is 3.10. First, Transaction rate and CPU utilization for the default setting: linger:/opt/netperf2# TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0 !!! WARNING !!! 
Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.2% !!! Local CPU util : 12.5% !!! Remote CPU util : 13.1% Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 30.00 13000.91 13.89 14.47 21.361 22.267 16384 87380 the variability likely came from the netperf/netserver's bouncing from CPU to CPU - so +/- 6-7% on the CPU util numbers (% not percentage points). The transaction rate seems to have been dead-on though. Here is the TCP_STREAM 128x32 data: linger:/opt/netperf2# src/netperf -H 192.168.4.213 -i 10,3 -I 99,5 -l 60 -c -C -- -s 128K -S 128K -m 32K TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.213 (192.168.4.213) port 0 AF_INET : +/-2.5% @ 99% conf. !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.0% !!! Local CPU util : 5.1% !!! Remote CPU util : 0.0% Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 262144 262144 32768 60.00 941.33 29.78 50.01 5.183 8.705 The Local CPU util was only 0.1% short of hitting the confidence interval. And then the aggregate single-byte TCP_RR performance, still with default coalescing parms. Netperf TCP_RR Single-Byte Transactions per Second vs First Burst and Number of Netperf Instances Hewlett-Packard Integrity Server rx1600, 2x1.0GHz Itanium2 Core or A7061A PCI-X Gigabit Ethernet NIC Debian 2.6.8 Kernel tg3 3.10 Interrupt Coalescing at Defaults Concurrent Netperfs First burst 1 2 4 -------------------------------------------------- 0 | 11940 22510 37550 1 | 22100 38380 65810 2 | 27640 48080 85650 4 | 40510 66660 103310 8 | 48120 94770 105350 16 | 56190 97680 103540 32 | 63070 97960 96390 64 | 67060 97860 98180 128 | 67170 97790 98950 It was not possible to alter the coalescing parms via ethtool with the 3.10 version of the driver on 2.6.8-1, so the rx1600's were upgraded to 2.6.12-1 which has the 3.31 driver in it. We re-run the single-stream numbers with the defaults for 2.6.12-1/3.31: linger:/opt/netperf2# src/netperf -H 192.168.4.213 -i 10,3 -I 99,5 -l 60 -c -C -- -s 128K -S 128K -m 32K TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.213 (192.168.4.213) port 0 AF_INET : +/-2.5% @ 99% conf. Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 262144 262144 32768 60.00 941.44 26.09 28.59 4.540 4.975 linger:/opt/netperf2# src/netperf -H 192.168.4.213 -t TCP_RR -i 30,3 -I 99,5 -l 60 -c -C TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.213 (192.168.4.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0 !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.0% !!! 
Local CPU util : 13.5% !!! Remote CPU util : 23.3% Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 9722.71 7.53 6.72 15.499 13.831 16384 87380 And the aggregates: Netperf TCP_RR Single-Byte Transactions per Second vs First Burst and Number of Netperf Instances Hewlett-Packard Integrity Server rx1600, 2x1.0GHz Itanium2 Core or A7061A PCI-X Gigabit Ethernet NIC Debian 2.6.12 Kernel tg3 3.31 Interrupt Coalescing at Defaults Concurrent Netperfs First burst 1 2 4 -------------------------------------------------- 0 | 9720 17470 28140 1 | 17670 29010 48760 2 | 24470 39350 72100 4 | 33630 61320 108490 8 | 50230 99420 133600 16 | 84900 125090 132110 32 | 94920 124700 152590 64 | 94140 124490 129410 128 | 93660 124670 129960 Alas, it seems that even 3.31 does not allow changing the parms "on the fly" with ethtool. Curses, foiled again. Copyright (c) 2006, Hewlett-Packard Company For a different take, here is some data from BL25p systems with pairs of dual-core 2.4 GHz Opteron processors running SuSE SLES9 SP2. Uname -a reports the kernel as 2.6.5-7.241-smp and the NIC driver was version 3.37 of tg3. The NIC in use was an NC7881 (aka Broadcom 5703). The default interrupt coalescing parms from ethtool -c are: rx-usecs: 20 rx-frames: 5 rx-usecs-irq: 20 rx-frames-irq: 5 tx-usecs: 72 tx-frames: 53 tx-usecs-irq: 20 tx-frames-irq: 5 This system had four cores, and was communicating with between one and four other identical systems as the load generators. Here we have a single instance TCP_RR with service demands: lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_RR -H lnx21 -i 10,3 -l 60 -c -C TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx21 (129.2.10.71) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 11877.66 2.75 3.16 9.251 10.640 16384 87380 Here we have a single instance TCP_STREAM 128x32 with service demands: lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_STREAM -H lnx21 -i 10,3 -l 60 -c -C -- -s 128K -S 128K -m 32K TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx21 (129.2.10.71) port 0 AF_INET : +/-2.5% @ 99% conf. Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 262142 262142 32768 60.00 932.66 8.97 8.18 3.153 2.876 And here are the aggregates: Netperf TCP_RR Single-Byte Transactions per Second vs First Burst and Number of Netperf Instances Hewlett-Packard ProLiant Server BL25p Core or NC7781 PCI-X Gigabit Ethernet NIC Interrupt Coalescing at Defaults Concurrent Netperfs First burst 1 2 4 8 ------------------------------------------------------------- 0 | 11720 23450 38360 80860 1 | 22120 43150 85550 150940 2 | 31010 62520 124820 215080 4 | 51260 102420 198480 n/a 8 | 81420 168470 279820 n/a 16 | 163990 250930 n/a n/a 32 | 199660 248960 n/a n/a 64 | 198860 258310 n/a n/a 128 | 201000 257940 n/a n/a The "n/a" stems from there not being a sufficiently close match between TX and RX packets through the interface to satisfy the assumption that each transaction was a distinct pair of TCP segments. 
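The TX/RX packet comparison behind those "n/a" entries was of the sort
sketched below. Statistics names differ from driver to driver (and the
interface name here is just an example), so treat this as illustrative
rather than literal:

  # snapshot the per-interface packet counters before and after a run; for a
  # clean run the tx and rx deltas should match the number of transactions,
  # with neither side inflated by bundling or retransmission
  ethtool -S eth2 | awk '/[rt]x_packets/ { print $1, $2 }' > before
  # ... run the aggregate netperf tests here ...
  ethtool -S eth2 | awk '/[rt]x_packets/ { print $1, $2 }' > after
  paste before after | awk '{ print $1, $4 - $2 }'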
The author noticed that there was a correlation between the packet count
mismatches and the presence of TCP retransmissions. However, he was unable
to find the source of the retransmissions to correct them and see if that
would resolve the packet count mismatches. It would not take very many
retransmissions to completely bunch-up transactions - even a single
retransmission might suffice.

Now, if we alter the rx-frames parameter on both sides to minimize latency:

rx-frames: 1

and run the single-stream, no-burst TCP_RR test:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_RR -H lnx22 -i 30,3 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx22 (129.2.10.72) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate      local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec   % S    % S    us/Tr   us/Tr

16384  87380  1        1       60.00    21512.70  5.70   5.58   10.595  10.383
16384  87380

We can see that the round-trip latency has gone from ~84 microseconds to
~46 microseconds, or a reduction of ~45%. In somewhat broad handwaving
terms, the service demand for a transaction remains the same, since in a
single-instance, single-transaction test we are taking just as many
interrupts as we were before.

If we rerun the single-stream TCP_STREAM (128x32) test:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_STREAM -H lnx22 -i 30,3 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx22 (129.2.10.72) port 0 AF_INET : +/-2.5% @ 99% conf.

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262142 262142  32768    60.00    941.52      11.95    17.32    4.159   6.029

Throughput appears to have increased slightly but that is within the error
limits. Sending service demand has increased by nearly 32% and receiving
service demand has increased by nearly 110%!

Now, if we rerun the aggregate TCP_RR tests:

Netperf TCP_RR Single-Byte Transactions per Second
vs First Burst and Number of Netperf Instances
Hewlett-Packard ProLiant Server BL25p
Core or NC7781 PCI-X Gigabit Ethernet NIC
rx-frames set to 1 to minimize latency

                    Concurrent Netperfs
First burst      1        2        4        8
-------------------------------------------------------------
     0 |     21790    38860    71720   127470
     1 |     36430    69740   134240   206200
     2 |     51570   102200   176230   261060
     4 |     83470   158310   226170      n/a
     8 |    137150   188650      n/a
    16 |    138830   192010      n/a
    32 |    139620   193330      n/a
    64 |    140755   195450      n/a
   128 |    140722   196070      n/a

Now for grins, let us drop the interrupt rate lower than the defaults. We
set:

rx-usecs: 1000
rx-frames: 100
rx-usecs-irq: 1000
rx-frames-irq: 100
tx-usecs: 1000
tx-frames: 100
tx-usecs-irq: 1000
tx-frames-irq: 100

and then set some descriptor counts:

lnx20:~/netperf-2.4.2pre1 # ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX:             511
RX Mini:        0
RX Jumbo:       255
TX:             0
Current hardware settings:
RX:             511
RX Mini:        0
RX Jumbo:       100
TX:             511

This was to make sure that the slower interrupts don't lead to a lack of
DMA buffers. This drops the single-stream transaction rate to ~1000:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_RR -H lnx22 -i 30,3 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx22 (129.2.10.72) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % S    us/Tr   us/Tr

16384  87380  1        1       60.00    1010.79  0.56   0.61   22.039  24.009
16384  87380

It also has an effect on single-stream bulk throughput:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_STREAM -H lnx22 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx22 (129.2.10.72) port 0 AF_INET

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262142 262142  32768    60.00    522.27      5.12     9.75     3.210   6.119

It is possible that this could be addressed by increasing the socket
buffers and thus the TCP window size, but the author didn't take the time
to make that test.

Here we re-run the aggregate TCP_RR tests:

Netperf TCP_RR Single-Byte Transactions per Second
vs First Burst and Number of Netperf Instances
Hewlett-Packard ProLiant Server BL25p
Core or NC7781 PCI-X Gigabit Ethernet NIC
Interrupt Coalescing set to ~1000 Interrupts/s

                    Concurrent Netperfs
First burst      1        2        4        8
-------------------------------------------------------------
     0 |      1010     2000     3920     7650
     1 |      2010     3950     7680    14730
     2 |      2990     5870    11290    21360
     4 |      4930     9580    18060    33620
     8 |      8720    16540    30490    54360
    16 |     15780    28710    50740   312800
    32 |     28239    49250   306900      n/a
    64 |     47913   202450      n/a
   128 |     48140   217980      n/a

The author has no certain explanation for the sudden jump, but can say that
the sanity check against ethtool -S output shows there was no sudden
bundling of multiple transactions per TCP segment. Perhaps it was a NAPI
cross-over point - polling the NIC rather than taking interrupts. For
grins, the author re-ran the two-concurrent-netperfs, burst-of-16 data
point and saw that the CPU utilization was 4.17% of the system, or a
service demand of 5.71 microseconds of CPU consumed per transaction. For
comparison, he reran with the default coalescing parms with two concurrent
netperfs and a burst of zero for a similar PPS rate, and the CPU
utilization was 6.59% and the service demand was 9.65 microseconds of CPU
consumed per transaction.

Around September 2006, the author was contacted by some driver writers at
Intel who had seen earlier versions of this writeup. It seems they had a
"new and improved" driver for the author to try which was meant to address
the latency vs throughput tradeoffs in the e1000 driver. After a bit of
delay, the author agreed to run this new driver through its paces. By this
time the systems being used were a pair of rx1620's, each with two 1.6GHz
3MB "fast" FSB Itanium 2 CPUs. The kernel has moved to:

[root@tarry src]# uname -a
Linux tarry.hpl.hp.com 2.6.9-42.EL #1 SMP Wed Jul 12 23:25:09 EDT 2006 ia64 ia64 ia64 GNU/Linux

and the e1000 driver is initially:

[root@tarry src]# ethtool -i eth2
driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: 0000:20:02.1

The systems remain connected via an HP ProCurve 9308 switch. The version of
netperf used was top of trunk on October 12, 2006 - aka revision 81 in the
repository at: http://www.netperf.org/svn/netperf2/trunk .
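For anyone wishing to reproduce these runs, grabbing and building that
netperf revision went something along the lines of the sketch below (a
fresh checkout may first need its autoconf machinery regenerated; only the
--enable-burst option is essential to the first-burst tests used here):

  # check out the netperf trunk and build it with burst-mode (-b) support
  svn checkout http://www.netperf.org/svn/netperf2/trunk netperf2_work
  cd netperf2_work
  ./configure --enable-burst
  make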
We start with basic latency, varying the CPU to which netperf/netserver are being bound since these are SMP systems: [root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j \ `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30`;\ done; done Single-stream, Single-byte, Single-transaction netperf TCP_RR Transaction Rate, CPU Util and Service Demand vs Netperf (Loc) and Netserver (Rem) CPU Binding 99% Confidence <= +/- 2.5% Unless Noted rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores RedHat ES4 U4, e1000 7.0.33-k2-NAPI, Default Interrupt Coalescing CPU Transactions % CPU Service Dem. CPU Util Binding Per Utilization usec/tran +/- % Loc Rem Second Loc Rem Loc Rem Loc Rem 0 0 6463.07 6.43 5.85 19.884 18.099 0 1 6609.95 6.47 4.41 19.586 13.357 1 0 6597.81 4.88 5.96 14.806 18.073 1 1 6749.12 5.02 4.48 14.876 13.289 We can see that things are much better by default than when this paper was first written, by a factor large enough to be more than just the change in CPU frequency, but the transaction rate is still rather low. How about aggregates? Well, lets take a look :) [root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\ `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\ -D -b 128`; done; done Single-stream, Single-byte, Aggregate-transaction netperf TCP_RR Transaction Rate, CPU Util and Service Demand vs Netperf (Loc) and Netserver (Rem) CPU Binding 99% Confidence <= +/- 2.5% Unless Noted rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores RedHat ES4 U4, e1000 7.0.33-k2-NAPI, Default Interrupt Coalescing CPU Transactions % CPU Service Dem. CPU Util Binding Per Utilization usec/tran +/- % Loc Rem Second Loc Rem Loc Rem Loc Rem 0 0 157976.35 78.58 75.10 9.949 9.508 0 1 157721.99 78.90 47.15 10.005 5.979 1 0 159642.81 49.98 76.56 6.261 9.591 1 1 160227.92 50.17 47.66 6.262 5.949 And if we initiate two concurrent such tests: 66387.29 99.41 98.67 144435.00 99.41 98.67 -------------------------------------------------- 2 210820 99.41 98.67 15.09 14.98 Next we look at single-stream unidirectional throughput, still with the original driver: [root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\ `netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.2.125 -c -C -i 30,3 -l 30\ -- -s 128K -S 128K -m 32K`; done; done Single-stream, Unidirectional netperf TCP_STREAM 128x32 Transfer Rate, CPU Util and Service Demand vs Netperf (Loc) and Netserver (Rem) CPU Binding 99% Confidence <= +/- 2.5% Unless Noted rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores RedHat ES4 U4, e1000 7.0.33-k2-NAPI, Default Interrupt Coalescing CPU Megabits % CPU Service Dem. CPU Util Binding Per Utilization usec/tran +/- % Loc Rem Second Loc Rem Loc Rem Loc Rem 0 0 941.46 7.98 19.44 1.389 3.382 0 1 941.26 7.10 12.38 1.236 2.156 1 0 941.38 7.58 19.43 1.319 3.382 1 1 941.45 6.71 12.42 1.167 2.161 And now single-connection, bidirectional performance. 
Using 32KB requests and responses, 353X transactions per second is basically link-rate - > ~1850 Mbit/s: [root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\ `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\ -s 128K -S 128K -r 32K -b 6`; done; done Single-stream, 32K Request/Response, Single-transaction netperf TCP_RR Transaction Rate, CPU Util and Service Demand vs Netperf (Loc) and Netserver (Rem) CPU Binding 99% Confidence <= +/- 2.5% Unless Noted rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores RedHat ES4 U4, e1000 7.0.33-k2-NAPI, Default Interrupt Coalescing CPU Transactions % CPU Service Dem. CPU Util Binding Per Utilization usec/tran +/- % Loc Rem Second Loc Rem Loc Rem Loc Rem 0 0 3537.10 28.84 27.11 163.054 153.315 0 1 3536.68 27.90 18.35 157.762 103.788 1 0 3536.84 19.78 26.29 111.845 148.654 1 1 3551.75 19.37 17.94 109.047 101.023 Now, we switch to the "new and improved" :) driver, version: 7.3.12-NAPI. First with the default InterruptThrottleRate [root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j \ `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30`;\ done; done Single-stream, Single-byte, Single-transaction netperf TCP_RR Transaction Rate, CPU Util and Service Demand vs Netperf (Loc) and Netserver (Rem) CPU Binding 99% Confidence <= +/- 2.5% Unless Noted rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores RedHat ES4 U4, e1000 7.3.12-NAPI, Default Interrupt Coalescing CPU Transactions % CPU Service Dem. CPU Util Binding Per Utilization usec/tran +/- % Loc Rem Second Loc Rem Loc Rem Loc Rem 0 0 9362.13 10.82 9.68 23.114 20.671 0 1 10360.67 11.59 8.11 22.378 15.664 1 0 10290.14 9.04 10.38 17.578 20.183 1 1 12187.93 9.95 9.07 16.334 14.890 We can see that this leads to a significant improvement in Transactions per second, from ~6500 per second to as high as 12000 per second. There is a corresponding increase in service demand - at the highest transaction rate it seems to be on the order of 10% on the netperf side and ~12% on the netserver side. [root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\ `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\ -D -b 128`; done; done Single-stream, Single-byte, Aggregate-transaction netperf TCP_RR Transaction Rate, CPU Util and Service Demand vs Netperf (Loc) and Netserver (Rem) CPU Binding 99% Confidence <= +/- 2.5% Unless Noted rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores RedHat ES4 U4, e1000 7.3.12-NAPI, Default Interrupt Coalescing CPU Transactions % CPU Service Dem. CPU Util Binding Per Utilization usec/tran +/- % Loc Rem Second Loc Rem Loc Rem Loc Rem 0 0 151140.06 79.10 76.24 10.468 10.089 0 1 152856.23 79.25 49.27 10.369 6.447 1 0 147345.40 50.12 72.39 6.804 9.826 1 1 148385.54 50.04 48.16 6.744 6.491 Two concurrent tests: 140898.89 99.51 98.19 59397.15 99.51 98.19 --------------------------------------------------- 2 200290 For the single stream, on average the new driver with default settings has about 5.6% lower single-stream aggregate transaction throughput, and the two stream throughput is about 5% lower as well. 
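Returning for a moment to the 32K bidirectional results, a quick check of
the "basically link-rate" assertion - with a 32768-byte request and a
32768-byte response in flight per transaction, the aggregate wire traffic
is roughly:

  awk 'BEGIN { printf "%.0f Mbit/s aggregate\n", 3530 * 32768 * 8 * 2 / 1e6 }'
  1851 Mbit/s aggregate

or very nearly both directions of a gigabit link running flat-out, before
even counting header overhead.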
[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
`netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.2.125 -c -C -i 30,3 -l 30\
-- -s 128K -S 128K -m 32K`; done; done

Single-stream, Unidirectional netperf TCP_STREAM 128x32
Transfer Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, Default Interrupt Coalescing

 CPU       Megabits   % CPU          Service Dem.   CPU Util
Binding    Per        Utilization    usec/KB        +/- %
Loc Rem    Second     Loc     Rem    Loc     Rem    Loc   Rem
 0   0     941.39     5.80    17.50  1.010   3.046
 0   1     941.39     5.22    10.45  0.908   1.818
 1   0     941.40     5.23    17.44  0.910   3.036
 1   1     941.45     4.81    10.62  0.838   1.849

Here we can see that the throughput holds up and the service demand is
reduced, both goodness. Now the bidirectional:

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
`netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
-s 128K -S 128K -r 32K -b 6`; done; done

Single-stream, 32K Request/Response, Single-transaction netperf TCP_RR
Transaction Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, Default Interrupt Coalescing

 CPU       Transactions   % CPU          Service Dem.     CPU Util
Binding    Per            Utilization    usec/tran        +/- %
Loc Rem    Second         Loc     Rem    Loc       Rem    Loc   Rem
 0   0     3530.70        27.60   27.00  156.367   152.971
 0   1     3531.69        26.93   17.13  152.495    97.020
 1   0     3531.62        18.16   26.24  102.839   148.621
 1   1     3537.95        17.92   17.03  101.319    96.288

The Transactions per second remain basically unchanged as things were at
link-rate. We can see that the netperf-side service demand has decreased
from an average (across the four data points) of ~135 us/Tran for the old
driver to an average of ~128 us/Tran. The netserver side has decreased from
an average of ~126 usec/tran to ~124, which is likely "within the noise" of
the benchmark.

Now we try the InterruptThrottleRate at the more aggressive setting of "1",
which means the driver will attempt to autotune within a broader range of
values for the InterruptThrottleRate.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j \
`netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30`;\
done; done

Single-stream, Single-byte, Single-transaction netperf TCP_RR
Transaction Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1

 CPU       Transactions   % CPU          Service Dem.    CPU Util
Binding    Per            Utilization    usec/tran       +/- %
Loc Rem    Second         Loc     Rem    Loc      Rem    Loc   Rem
 0   0     13723.42       15.44   13.95  22.501   20.328
 0   1     14683.49       16.64   11.44  22.663   15.582
 1   0     14580.45       12.92   14.70  17.725   20.169
 1   1     15746.54       14.32   13.02  18.182   16.541

We can see that with this setting the transactions per second are even
higher, meaning the average latency is even lower. This stems from the
driver being willing to alter the InterruptThrottleRate across a wider
range.
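If one wanted the InterruptThrottleRate=1 behaviour to survive a reboot on
this RHEL4-vintage system, the usual mechanism would be an options line in
/etc/modprobe.conf rather than a one-off modprobe. A sketch only - the file
name and per-port value list should be checked against the distro and
driver version in use:

  # /etc/modprobe.conf - use the dynamic throttle mode on both e1000 ports
  options e1000 InterruptThrottleRate=1,1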
[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
`netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
-D -b 128`; done; done

Single-stream, Single-byte, Aggregate-transaction netperf TCP_RR
Transaction Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1

 CPU       Transactions   % CPU          Service Dem.    CPU Util
Binding    Per            Utilization    usec/tran       +/- %
Loc Rem    Second         Loc     Rem    Loc      Rem    Loc   Rem
 0   0     149165.36      85.09   82.09  11.408   11.007
 0   1     115868.52      64.50   50.04  11.134    8.637
 1   0     103507.95      50.17   58.13   9.693   11.231
 1   1      99628.70      50.03   46.78  10.044    9.392

And two concurrent such tests:

139051.76   99.59   98.22
 35250.74   99.59   98.22
---------------------------------------------------
2  174302   99.59   98.22

We can see further erosion in the single-stream, aggregate TCP_RR
performance when the InterruptThrottleRate is set to 1. Once again, no such
thing as a free lunch :) The degradation is most severe when the netperf
and netserver are bound to the same CPU that takes interrupts from the NIC,
and when we take the system to complete CPU saturation. This stands to
reason given the driver may be letting the NIC generate more interrupts,
which steals CPU cycles from the rest of the stack.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
`netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.2.125 -c -C -i 30,3 -l 30\
-- -s 128K -S 128K -m 32K`; done; done

Single-stream, Unidirectional netperf TCP_STREAM 128x32
Transfer Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1

 CPU       Megabits   % CPU          Service Dem.   CPU Util
Binding    Per        Utilization    usec/KB        +/- %
Loc Rem    Second     Loc     Rem    Loc     Rem    Loc   Rem
 0   0     941.34     5.71    17.23  0.995   2.999
 0   1     941.40     5.28    10.51  0.919   1.829
 1   0     941.45     5.22    17.37  0.909   3.022
 1   1     941.45     4.84    10.56  0.842   1.837

InterruptThrottleRate=1 remains dynamic, and under the unidirectional
workload the driver is still able to keep the interrupt overhead minimized
even under the more "aggressive" setting.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
`netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
-s 128K -S 128K -r 32K -b 6`; done; done

Single-stream, 32K Request/Response, Single-transaction netperf TCP_RR
Transaction Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1
 CPU       Transactions   % CPU          Service Dem.     CPU Util
Binding    Per            Utilization    usec/tran        +/- %
Loc Rem    Second         Loc     Rem    Loc       Rem    Loc   Rem
 0   0     3529.82        27.43   26.72  155.418   151.388
 0   1     3531.71        26.91   17.21  152.399    97.472
 1   0     3531.75        18.05   25.91  102.191   146.748
 1   1     3537.95        18.06   17.07  102.088    96.488

Finally, with InterruptThrottleRate set to 0:

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j \
`netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30`;\
done; done

Single-stream, Single-byte, Single-transaction netperf TCP_RR
Transaction Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=0

 CPU       Transactions   % CPU          Service Dem.    CPU Util
Binding    Per            Utilization    usec/tran       +/- %
Loc Rem    Second         Loc     Rem    Loc      Rem    Loc   Rem
 0   0     16848.52       19.73   16.60  23.417   19.703  2.6   3.2
 0   1     18291.56       20.81   13.75  22.759   15.034
 1   0     18165.77       16.59   19.31  18.263   21.259
 1   1     19902.64       15.33   13.95  15.408   14.017  4.8   4.1

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
`netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
-D -b 128`; done; done

Single-stream, Single-byte, Aggregate-transaction netperf TCP_RR
Transaction Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=0

 CPU       Transactions   % CPU          Service Dem.    CPU Util
Binding    Per            Utilization    usec/tran       +/- %
Loc Rem    Second         Loc     Rem    Loc      Rem    Loc   Rem
 0   0     135953.23      97.74   97.80  14.379   14.388
 0   1     100257.48      63.46   50.04  12.660    9.983
 1   0      96633.78      50.20   56.14  10.391   11.620
 1   1      94182.15      50.03   45.93  10.624    9.753

134900.96   99.48  100.48
  4528.33   97.53  100.15   Trans/s +/- 35%
-------------------------------------------------------
2  139420   99.48  100

Things were not especially stable for the two concurrent, aggregate tests
when the InterruptThrottleRate was set to zero, even using the maximum of
30 iterations in netperf. The smaller of the two, which was likely bound to
the CPU taking interrupts from the NIC, had its throughput vary rather a
lot. Still, it would seem reasonable to state that with the
InterruptThrottleRate set to zero, while the single-instance,
single-transaction latency is lowest, it does indeed still drag down the
aggregates, even more than when the dynamic throttle is enabled.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
`netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.2.125 -c -C -i 30,3 -l 30\
-- -s 128K -S 128K -m 32K`; done; done

Single-stream, Unidirectional netperf TCP_STREAM 128x32
Transfer Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=0

 CPU       Megabits   % CPU          Service Dem.   CPU Util
Binding    Per        Utilization    usec/KB        +/- %
Loc Rem    Second     Loc     Rem    Loc     Rem    Loc   Rem
 0   0     941.44     21.59   51.06  3.758   8.886
 0   1     941.20     20.71   38.46  3.605   6.695
 1   0     941.37     21.46   50.83  3.735   8.847
 1   1     941.35     20.16   38.44  3.509   6.690

We can see that while throughput may remain at link-rate, CPU utilization
and service demand have shot through the roof with the interrupt throttle
disabled. This does not bode well for the bidirectional throughput test,
which we rerun next.
[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
`netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
-s 128K -S 128K -r 32K -b 6`; done; done

Single-stream, 32K Request/Response, Single-transaction netperf TCP_RR
Transaction Rate, CPU Util and Service Demand
vs Netperf (Loc) and Netserver (Rem) CPU Binding
99% Confidence <= +/- 2.5% Unless Noted
rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=0

 CPU       Transactions   % CPU          Service Dem.     CPU Util
Binding    Per            Utilization    usec/tran        +/- %
Loc Rem    Second         Loc     Rem    Loc       Rem    Loc   Rem
 0   0     3494.09        68.92   60.72  394.498   347.575
 0   1     3499.08        69.02   42.46  394.478   242.690
 1   0     3435.35        50.16   59.90  292.031   348.755
 1   1     3458.17        50.06   42.93  289.497   248.279

Sure enough, we see huge increases in CPU utilization and service demand.
It is only because there was plenty of left-over CPU in the previous tests
that we do not see much of a decrease in Transactions per second here.
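Though it was not captured for these runs, one straightforward way to see
what a given InterruptThrottleRate or coalescing setting is actually
producing is to sample /proc/interrupts while a test is running. A sketch -
the eth2 pattern and the ten second window are only examples, and the
interface-to-IRQ naming varies from system to system:

  # approximate the NIC's interrupt rate over a ten second window by summing
  # the per-CPU counts for its IRQ line(s) before and after a sleep
  irqs() { grep eth2 /proc/interrupts |
           awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i }
                END { print s }'; }
  a=$(irqs); sleep 10; b=$(irqs)
  echo "$(( (b - a) / 10 )) interrupts/second"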