Copyright 2006, Hewlett-Packard Company


	    A Brief Look at Latency vs Througput Tradeoffs
		  For High-Speed Network Interfaces


		   Rick Jones <rick.jones2@hp.com>
		       Hewlett-Packard Company
			Cupertino, California


ftp://tardy.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt

$Id: nic_latency_vs_tput.txt 50 2006-10-14 00:37:41Z raj $
http://tardy.hpl.hp.com/svn/briefs/trunk/nic_latency_vs_tput.txt


	       Copyright 2005, Hewlett-Packard Company


Introduction:

This evolving document will discuss tradeoffs between minimizing
latency and maximizing throughput for various "high-speed" (gigabit
and higher) network interfaces such as Gigabit Ethernet and 10 Gigabit
Ethernet.  The netperf (http://www.netperf.org) benchmark will be used
to demonstrate the minimum latency versus maximum throughput tradeoffs
made by the driver (e1000) for this NIC.  Initial measurements will
use Debian Sarge.

It is expected that while the "constants" involved may differ between
distros the basic concepts remain the same.  Still, later measurements
may use RHEL4 or SLES9 as desireable.  The 2.0 HP Debian Telco Edition
may be included in a later revision of this document.  Reader feedback
will determine whether, and in which order, data for other distros are
included. Vote early, vote often.

Executive Summary:

Up to version 7.3.12, the e1000 driver used in conjunction with the
A9900A PCI-X Dual-port Gigabit Ethernet adaptor strongly favors
maximum packet per second throughput over minimum request/response
latency.  Anyone desiring lowest possible request/response latency
needs to alter the modprobe parameters used when the e1000 driver is
loaded.  This appears to reduce round-trip latency by as much as 85%.

However, configuring the A9900A PCI-X Dual-port Gigabit Ethernet
adaptor for minimum request/response latency will reduce maximum
packet per second performance (as measured with the netperf TCP_RR
test) by ~23% and increase the service demand for bulk data transfer
by ~63% for sending and ~145% for receiving.

With version 7.3.12 and later the default situation is much improved,
with default single-transaction, single-stream latency significantly
lower than previous versions.  This is done in a way that has a
positive effect on the bulk-transfer service demands, although it does
appear to have a slight negative effect on default aggregate
request/response performance.

It is still possible to obtain further latency reductions by
"hand-tuning" the InterruptThrottleRate settings even with version
7.3.12 of the e1000 driver.  However, some of these latency reductions
come at a heavy price for other sorts of workloads.

In similar contrast :) the tg3 driver used in conjunction with the
A7061A or NC7781 interfaces and/or rx1600/dl145/bl20p core Gigabit
Ethernet, when measured in the bl20p went from ~84 microseconds to ~46
microseconds, or a round-trip latency reduction of ~45%.  Achieving
this lower round-trip latency increases service demand for bulk
throughput by 32% on the sending side and nearly 110% on the receiving
side.  Issues in test behaviour make it difficult to quantify the
effect on aggregate request/response performance, however, the data
does show that tuning for minimum latency dropped aggregate
request/response performance.

Tuning the coalescing settings in the tg3 driver for only ~1000
interrupts per second made the single-instance round-trip latency go
to ~1000 microseconds, but increased aggregate request/response by
perhaps 12%.


	       Copyright 2005, Hewlett-Packard Company


Configuration:

The initial flavor of Hewlett-Packard Integrity server used in these
tests was the rx1600, which had 2, 1.0 GHz Itanium2 LV CPUs and some
quantity of RAM.  An add-on A9900A PCI-X Dual-Port Gigabit Ethernet
adaptor was added to one of the add-on PCI-X slots.

The rx1600's were connected via UTP to 1000-BaseT ports on an HP
ProCurve 9308 switch.

The initial OS load on the rx1600s was Debian Sarge, stable, with
2.6.8-2-mckinley-smp kernel.  The version of the e1000 driver used for
the A9900A in that OS load was 5.2.52-k4.

As the author was aware of "issues" with TSO in the 2.6.8 kernel, TSO
was disabled for the tests involving 2.6.8.

For these tests, netperf was configured with --enable-burst.  This
optional feature will inject a specified (via a test-specific -b
option) some number of requests into the test connection before
entering the core request/response loop.  When coupled with the
test-specific -D option this can be used to have multiple transactions
outstanding at a time on the connection.  In this way it is possible
to maximize the aggregate transactions per second with fewer netperf
processes.  Previous tests had a single synchronous transaction per
netperf/netserver pair and so required _MANY_ such pairs which became
as much a test of contexct switching as anything else.

That each transaction remained a distinct pair of packets on the wire
was verified via link-level statistics gathered via ethtool.  If the
reader wishes to utilize this method on other platforms they would be
wise to do the same.  The author _knows_ there is at least one
platform out there that will generate spurrious ACKs when it receives
back-to-back sub-MSS TCP segments, which would throw such measurements
off entirely.


	       Copyright 2005, Hewlett-Packard Company


Results:

First, we see the average Transaction/s performance of a single
instance of netperf with no initial burst of requests injected into
the data connection, and the e1000 driver using default interrupt
coalescing settings:

linger:/opt/netperf2# src/netperf -H 192.168.3.213 -t TCP_RR -i 10,3 -I 99,5 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      : 15.3%
!!!                       Local CPU util  : 11.9%
!!!                       Remote CPU util : 11.8%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   1906.74  1.67   1.67   17.505  17.503
16384  87380

The author tried a number of times to hit the confidence intervals,
but didn't get any better than +/- 1/2 what you see above for
Throughput and CPU util. The netperf and/or netserver processes may
have been migrating from one CPU to another on their respective
systems.  At such low transaction per second rates it may simply not
be possible to hit the confidence intervals, or the author was not
sufficiently patient.  Now we look when the e1000 drivers on both
systems are loaded with InterruptThrottleRate set to 0 to disable the
interupt throttling:


Since both were single, synchronous transaction tests, we can invert
the Transaction/s figures to arrive at the average RTT or round-trip
latency.  With the default interrupt throttle rate settings that
becomes ~525 microseconds.  With the interrupt throttle disabled that
becomes:

linger:/opt/netperf2# src/netperf -H 192.168.3.213 -t TCP_RR -i 10,3 -I 99,5 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.2%
!!!                       Local CPU util  :  8.6%
!!!                       Remote CPU util :  7.0%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   13017.80  13.95  14.24  21.432  21.874
16384  87380

Which translates to ~77 microseconds RTT.  This of course is much,
Much, MUCH lower than the default settings.

As such, it would seem that one would want to always disable the
interrupt throttle.  However, in the finest traditions of there being
no such things as free lunches, if we now look at a simple
unidirectional throughput test and compare service demands (CPU
consumed per unit of work)...

First, with the interrupt throttle enabled:

linger:/opt/netperf2# src/netperf -H 192.168.3.213 -i 10,3 -I 99,5 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf.
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262144 262144  32768    60.00       941.37   17.91    19.45    3.118   3.385


And then with the throttle disabled:

linger:/opt/netperf2# src/netperf -H 192.168.3.213 -i 10,3 -I 99,5 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.0%
!!!                       Local CPU util  : 14.0%
!!!                       Remote CPU util :  0.7%

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262144 262144  32768    60.00       941.37   29.22    47.73    5.086   8.308

In both cases we hit basically link-rate (94X Mbit/s) however, with
the throttle disabled on both sides service demand is significantly
increased - by ~63% for sending and by ~145% for receive.

As one might imagine, this is rather significant.

Now we do aggregate TCP_RR with the first burst option, this time
using a total of three rx1600's, one as our "SUT" (System Under Test)
and two as our "LG's" (Load Generators)

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1600, 2x1.0GHz Itanium2
	   Single-port of A9900A PCI-X Gigabit Ethernet NIC
		      InterruptThrottle Enabled

                                 Concurrent Netperfs
              First burst      1          2           4
	  --------------------------------------------------
                   1    |     4090       7350       14890
                   2    |     6040      11550       21800
                   4    |    10320      18560       34390
                   8    |    13650      30970       52440
                  16    |    24790      49560       84890
                  32    |    45980      72750      134190
                  64    |    65660     122310      142250
                 128    |    65520     117400      141710


In the above, keep in mind that the single-concurrent netperf tests
were to a single LG rather than to two.  Also, the number of
outstanding requests will be the number of netperfs multiplied by one
more than the "First burst" value.  So, 1 concurrent netperf with
first burst of 2 is three outstanding requests, and 2 concurrent
netperfs with a first burst of 1 is four outstanding requests.


Now we look at maximum transaction per second rates with the Interrupt
Throttle disabled:

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1600, 2x1.0GHz Itanium2
	   Single-port of A9900A PCI-X Gigabit Ethernet NIC
		      InterruptThrottle Disabled

                                 Concurrent Netperfs
              First burst      1          2           4
	  --------------------------------------------------
                   1    |    24240      45140       75250
                   2    |    33740      58820       97060
                   4    |    48330      78120      108230
                   8    |    56670     102800      109720
                  16    |    64650      99670      106090
                  32    |    68410     100500
                  64    |    69330
                 128    |    71660


We can see that there are much higher transaction per second rates at
lower numbers of simultaneous transactions.  However, there is also a
reduction in the peak transaction per second rate the system can
achieve.


10G - AD144A PCI-X 1.0 10 Gigabit Ethernet SR

Here we have aggregate data for TCP_RR through a 10 Gigabit Ethernet
Adaptor.  These data were gathered with a Debian 2.6.12-1-mckinley-smp
on the SUT, the kernel on the LG's remains as before.  Also, the SUT
is an rx1620 with two 1.6 GHz 3MB cache CPUs.  We disable the
InterrruptThrottle on the LG's to minimize their effect on the
measurement since we don't have a 10G NIC at the other end.  The 10G
card is being driven by version 1.7.7.1 of the "s2io" driver.
Subsequent tests used the 2.0.12.0 version of the driver on a 2.6.15
kernel.

The reader will also notice we start having a "first burst" row for
"0" which means no additional request injected into the data
connection before the core loop.

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2
		 AD144A PCI-X 10 Gigabit Ethernet NIC
      InterruptThrottle Disabled on rx1600 LG's, 10G at Defaults

                                 Concurrent Netperfs
              First burst      1          2           4
	  --------------------------------------------------
                   0    |    16890      34490       52930
                   1    |    29530      54030       89660
                   2    |    37380      72630      109970
                   4    |    52730     103790      130580
                   8    |    68500     124840      134420
                  16    |    72770     126300      150310
                  32    |    72540     126100      150920
                  64    |    72940     126730      153100
                 128    |    74900     125700      156400

Observations with top showed that both CPUs in the rx1620 were
virtually saturated by the First Burst == 64, four-concurrent netperf
test.

Since we have switched CPUs, a proper comparison with the previous
e1000 data is not really possible.  So, we run a whole bunch of
netperf numbers again :) First with the e1000 driver (now version
6.0.54-k2-NAPI) configured for defaults. (The LG's sticking with the
InterruptThrottle disabled)


	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2
	   Single-port of A9900A PCI-X Gigabit Ethernet NIC
		      InterruptThrottle Enabled

                                 Concurrent Netperfs
              First burst      1          2           4
	  --------------------------------------------------
                   0    |     8000      16000       32000
                   1    |    16000      32000       64000
                   2    |    24000      47990       95730
                   4    |    39830      79730      107720
                   8    |    60639     111060      151140
                  16    |    68530     135910      203150
                  32    |    88130     165690      217460
                  64    |    88440     167730      219820
                 128    |    48040     117360      222270

[The author has no explanation for the 128 first burst performance
 drop-offs.  However, they appear to be repeatable, and there does
 appear to be a slight difference in the number of packets transmitted
 vs received when they should be the same. Perhaps some strange
 interactions with congestion windows at connection startup. Packet
 traces, which were not taken, would probably be required to arrive at
 an answer. ]

Notice how the zero first burst, single instance transaction rate is
now 8000 transactions per second.  This is considerably better than
with the driver version in 2.6.8 on the rx1600 - likely as not there
were a couple changes in the driver since there was a major version
number change :) The result is also similar to what the author sees
when the target system is running HP-UX rather than Linux and from
this he concludes that the default InterruptThrottle stuff has changed
considerably.  Heck, OS influence may have changed as well.

Anyway, we can see that the default setting for the e1000-driven NIC
still keeps the transaction per second rate for a single stream rather
lower than it could be, likely as not in the interest of making bulk
throughput more efficient.

And now with the InterruptThrottle disabled:

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2
	   Single-port of A9900A PCI-X Gigabit Ethernet NIC
		      InterruptThrottle Disabled

                                 Concurrent Netperfs
              First burst      1          2           4
	  --------------------------------------------------
                   0    |    16230      31577       50580
                   1    |    26270      48850       89710
                   2    |    36920      68790      114630
                   4    |    51350     105137      140440
                   8    |    69000     127170      151620
                  16    |    75999     129030      160580
                  32    |    75690     128970      162590
                  64    |    76140     129900      160990
                 128    |    77460     115780      169450

To give some idea of the effect of different values for the
InterruptThrottleRate here are some additional results:

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2
	   Single-port of A9900A PCI-X Gigabit Ethernet NIC
		      InterruptThrottleRate 1000

                               Concurrent Netperfs
      First burst      1        2        4        6        8
   --------------------------------------------------------------------
           0    |     1000              4000
           1    |     2000              8000
           2    |     3000             12000
           4    |     5000             20000
           8    |     8990             35990
          16    |    16990             68000
          32    |    32990            129730
          64    |    63960            180610    215230   239150
         128    |    63990            173180    216290   235990


The one concurrent netperf was to a single rx1600 LG.  The four
concurrent netperf tests were to two rx1600 LGs and the six concurrent
netperf tests were to three rx1600 LGs and the 8 concurrent netperf
tests were to four rx1600's.  The number of concurrent netperfs was
increased in an attempt to absorb remaining idle CPU cycles.
Increasing the first burst value beyond 128 did not appear to increase
performance.

Here we have data for the 2.0.12.0 version of the s2io driver on the
2.6.15 kernel:

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1620, 2x1.6GHz Itanium2
		 AD144A PCI-X 10 Gigabit Ethernet NIC
		  Linux 2.6.15 Kernel s2io 2.0.12.0
		s2io Interrupt Coalescing at Defaults

                                 Concurrent Netperfs
              First burst      1          2           4
	  --------------------------------------------------
                   0    |    16430      31220       59890
                   1    |    28784      49950       95470
                   2    |    37380      70530      114160
                   4    |    53740      98280      136900
                   8    |    71700     117690      159570
                  16    |    71610     130190        n/a
                  32    |    71350     129910        n/a
                  64    |    72270     130410        n/a
                 128    |    74490     133490        n/a

Don't forget that the clients remain rx1600s, albeit running 2.6.12
now instead of 2.6.8.  And we use 1, 2 or 4 distinct rx1600's as load
generators here.

Here is the single-stream 0 burst run with CPU util and service
demand, interrupt coalescing at the default setting for the s2io
driver. The remote remains a Debian 2.6.12 rx1600 with that kernel's
e1000 driver configured for no interrupt throttling:

languid:/opt/netperf2# src/netperf -t TCP_RR -H 192.168.3.213 -i 30,3 -c
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  1       1      10.00   16380.05  9.51   -1.00  11.607  -1.000
16384  87380

A quick compare with e1000 6.1.16-k2-NAPI, using single-stream and
interrupt defaults:

languid:/opt/netperf2# src/netperf -t TCP_RR -H 192.168.3.213 -i 30,3 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.00   7998.18  5.37   8.67   13.427  21.670
16384  87380

Again we can see the effect of the default interrupt throttle settings
on the e1000 driver:

languid:/opt/netperf2# src/netperf -t TCP_RR -H 192.168.3.213 -i 30,3 -c
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  1       1      10.00   15341.98  11.85  -1.00  15.447  -1.000
16384  87380

The service demands suggest that one should be able to achieve higher
aggregate request/response performance over the AD144A, if perhaps its
interrupt rates were throttled.


	       Copyright 2006, Hewlett-Packard Company

And here we compare settings on the core Gigabit Ethernet interface on
the rx1600s.  This is based on a BCM5701 chip and should mimic
behavior of other BCM-based NICs such as the add-on A7061A PCI-X
Gigabit Ethernet Adaptor.  This is with the rx1600's still running the
Debian 2.6.8-2-mckinley-smp and the interfaces connected via an HP
ProCurve 2724 switch.  The version of the tg3 driver is 3.10.

First, Transaction rate and CPU utilization for the default setting:

linger:/opt/netperf2# TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.213 (192.168.3.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.2%
!!!                       Local CPU util  : 12.5%
!!!                       Remote CPU util : 13.1%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      30.00   13000.91  13.89  14.47  21.361  22.267
16384  87380

the variability likely came from the netperf/netserver's bouncing from
CPU to CPU - so +/- 6-7% on the CPU util numbers (% not percentage
points).  The transaction rate seems to have been dead-on though.

Here is the TCP_STREAM 128x32 data:

linger:/opt/netperf2#  src/netperf -H 192.168.4.213 -i 10,3 -I 99,5 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.213 (192.168.4.213) port 0 AF_INET : +/-2.5% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.0%
!!!                       Local CPU util  :  5.1%
!!!                       Remote CPU util :  0.0%

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262144 262144  32768    60.00       941.33   29.78    50.01    5.183   8.705

The Local CPU util was only 0.1% short of hitting the confidence
interval.

And then the aggregate single-byte TCP_RR performance, still with
default coalescing parms.


	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1600, 2x1.0GHz Itanium2
	      Core or A7061A PCI-X Gigabit Ethernet NIC
		     Debian 2.6.8 Kernel tg3 3.10
		   Interrupt Coalescing at Defaults

                                 Concurrent Netperfs
              First burst      1          2           4
	  --------------------------------------------------
                   0    |    11940      22510       37550
                   1    |    22100      38380       65810
                   2    |    27640      48080       85650
                   4    |    40510      66660      103310
                   8    |    48120      94770      105350
                  16    |    56190      97680      103540
                  32    |    63070      97960       96390
                  64    |    67060      97860       98180
                 128    |    67170      97790       98950


It was not possible to alter the coalescing parms via ethtool with the
3.10 version of the driver on 2.6.8-1, so the rx1600's were upgraded
to 2.6.12-1 which has the 3.31 driver in it.  We re-run the
single-stream numbers with the defaults for 2.6.12-1/3.31:


linger:/opt/netperf2# src/netperf -H 192.168.4.213 -i 10,3 -I 99,5 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.213 (192.168.4.213) port 0 AF_INET : +/-2.5% @ 99% conf.
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262144 262144  32768    60.00       941.44   26.09    28.59    4.540   4.975

linger:/opt/netperf2# src/netperf -H 192.168.4.213 -t TCP_RR -i 30,3 -I 99,5 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.213 (192.168.4.213) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.0%
!!!                       Local CPU util  : 13.5%
!!!                       Remote CPU util : 23.3%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   9722.71  7.53   6.72   15.499  13.831
16384  87380

And the aggregates:

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
      Hewlett-Packard Integrity Server rx1600, 2x1.0GHz Itanium2
	      Core or A7061A PCI-X Gigabit Ethernet NIC
		    Debian 2.6.12 Kernel tg3 3.31
		   Interrupt Coalescing at Defaults

                                 Concurrent Netperfs
              First burst      1          2           4
	  --------------------------------------------------
                   0    |     9720      17470       28140
                   1    |    17670      29010       48760
                   2    |    24470      39350       72100
                   4    |    33630      61320      108490
                   8    |    50230      99420      133600
                  16    |    84900     125090      132110
                  32    |    94920     124700      152590
                  64    |    94140     124490      129410
                 128    |    93660     124670      129960

Alas, it seems that even 3.31 does not allow changing the parms "on
the fly" with ethtool.  Curses, foiled again.


	     Copyright (c) 2006, Hewlett-Packard Company


For a different take, here is some data from BL25p systems with pairs
of dual-core 2.4 GHz Opteron processors running SuSE SLES9 SP2.  Uname
-a reports the kernel as 2.6.5-7.241-smp and the NIC driver was
version 3.37 of tg3. The NIC in use was an NC7881 (aka Broadcom 5703).
The default interrupt coalescing parms from ethtool -c are:

rx-usecs: 20
rx-frames: 5
rx-usecs-irq: 20
rx-frames-irq: 5

tx-usecs: 72
tx-frames: 53
tx-usecs-irq: 20
tx-frames-irq: 5


This system had four cores, and was communicating with between one and
four other identical systems as the load generators.

Here we have a single instance TCP_RR with service demands:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_RR -H lnx21 -i 10,3 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx21 (129.2.10.71) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   11877.66  2.75   3.16   9.251   10.640
16384  87380

Here we have a single instance TCP_STREAM 128x32 with service demands:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_STREAM -H lnx21 -i 10,3 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx21 (129.2.10.71) port 0 AF_INET : +/-2.5% @ 99% conf.
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262142 262142  32768    60.00       932.66   8.97     8.18     3.153   2.876


And here are the aggregates:

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
		Hewlett-Packard ProLiant Server BL25p
	      Core or NC7781 PCI-X Gigabit Ethernet NIC
		   Interrupt Coalescing at Defaults

                                   Concurrent Netperfs
          First burst      1          2           4           8
       -------------------------------------------------------------
               0    |    11720      23450       38360       80860
               1    |    22120      43150       85550      150940
               2    |    31010      62520      124820      215080
               4    |    51260     102420      198480        n/a
               8    |    81420     168470      279820        n/a
              16    |   163990     250930       n/a          n/a
              32    |   199660     248960       n/a          n/a
              64    |   198860     258310       n/a          n/a
             128    |   201000     257940       n/a          n/a

The "n/a" stems from there not being a sufficiently close match
between TX and RX packets through the interface to satisfy the
assumption that each transaction was a distinct pair of TCP segments.
The author noticed that there was a correlation between the packet
count mismatches and the presence of TCP retransmissions.  However, he
was unable to find the source of the retransmissions to correct them
and see if that would resolve the packet count mismatches.  It would
not take very many retransmssions to completely bunch-up transactions
- even a single retransmission might suffice.


Now, if we alter the rx-frames parameter on both sides to minimize
latency:

rx-frames: 1

and run the single-stream, no-burst TCP_RR test:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_RR -H lnx22 -i 30,3 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx22 (129.2.10.72) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   21512.70  5.70   5.58   10.595  10.383
16384  87380

We can see that the round-trip latency has gone from ~84 microseconds
to ~46 microseconds, or a reduction of ~45%.  In somewhat broad handwaving
terms, the service demand for a transaction remains the same, since in
a single-instance, single-transaction test we are taking just as many
interrupts as we were before

If we rerun the single-stream TCP_RR test:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_STREAM -H lnx22 -i 30,3 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx22 (129.2.10.72) port 0 AF_INET : +/-2.5% @ 99% conf.
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262142 262142  32768    60.00       941.52   11.95    17.32    4.159   6.029

Throughput appears to have increased slightly but that is within the
error limits.  Sending service demand has increased by nearly 32% and
receiving service demand has increased by nearly 110%!

Now, if we rerun the aggregate TCP_RR tests:

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
		Hewlett-Packard ProLiant Server BL25p
	      Core or NC7781 PCI-X Gigabit Ethernet NIC
		rx-frames set to 1 to minimize latency

                                   Concurrent Netperfs
          First burst      1          2           4           8
       -------------------------------------------------------------
               0    |    21790      38860       71720      127470
               1    |    36430      69740      134240      206200
               2    |    51570     102200      176230      261060
               4    |    83470     158310      226170       n/a
               8    |   137150     188650       n/a
              16    |   138830     192010       n/a
              32    |   139620     193330       n/a
              64    |   140755     195450       n/a
             128    |   140722     196070       n/a

Now for grins, let us drop the interrupt rate lower than the defaults.
We set:

rx-usecs: 1000
rx-frames: 100
rx-usecs-irq: 1000
rx-frames-irq: 100
tx-usecs: 1000
tx-frames: 100
tx-usecs-irq: 1000
tx-frames-irq: 100

and then set some descriptor counts:
lnx20:~/netperf-2.4.2pre1 # ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX:             511
RX Mini:        0
RX Jumbo:       255
TX:             0
Current hardware settings:
RX:             511
RX Mini:        0
RX Jumbo:       100
TX:             511

To make sure that the slower interrupts don't lead to lack of DMA buffers.

This drops the single-stream transaction rate to ~1000:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_RR -H lnx22 -i 30,3 -l 60 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx22 (129.2.10.72) port 0 AF_INET : +/-2.5% @ 99% conf. : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   1010.79  0.56   0.61   22.039  24.009
16384  87380


It also hase an affect on single-stream bulk throughput:

lnx20:~/netperf-2.4.2pre1 # src/netperf -t TCP_STREAM -H lnx22 -l 60 -c -C -- -s 128K -S 128K -m 32K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lnx22 (129.2.10.72) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262142 262142  32768    60.00       522.27   5.12     9.75     3.210   6.119

it is possible that this could be addressed via increasing the socket
buffers and thus the TCP widow size, but the author didn't take the
time to make that test.

Here we re-run the aggregate TCP_RR tests:

	  Netperf TCP_RR Single-Byte Transactions per Second
				  vs
	     First Burst and Number of Netperf Instances
		Hewlett-Packard ProLiant Server BL25p
	      Core or NC7781 PCI-X Gigabit Ethernet NIC
	    Interrupt Coalescing set to ~1000 Interrupts/s

                                   Concurrent Netperfs
          First burst      1          2           4           8
       -------------------------------------------------------------
               0    |     1010       2000        3920        7650
               1    |     2010       3950        7680       14730
               2    |     2990       5870       11290       21360
               4    |     4930       9580       18060       33620
               8    |     8720      16540       30490       54360
              16    |    15780      28710       50740      312800
              32    |    28239      49250      306900        n/a
              64    |    47913     202450        n/a
             128    |    48140     217980        n/a

The author has no certain explanation for the sudden jump but can say
that the sanity check against ethtool -S output shows there was no
sudden bundling of multiple transactions per TCP segment.  Perhaps it
was a NAPI cross-over point - polling the NIC rather than taking
interrupts.


For grins, the author re-ran the 2 concurrent netperfs with a burst of
16 data point and saw that the CPU utilization was 4.17% of the system
or a service demand of 5.71 microseconds CPU consumed per transaction.
For somparison, he reran with the default coalescing parms with two
concurrent netperfs and a burst of zero for a similar PPS rate and the
CPU utilization was 6.59% and the service demand was 9.65 microseconds
of CPU consumed per transaction.


Copyright 2006, Hewlett-Packard Company

Around September 2006, the author was contacted by some driver writers
at Intel who had seen earlier versions of this writeup.  Seems they
had a "new and improved" driver for me to try which was meant to
address the latency vs throughput tradeoffs in the e1000 driver.
After a bit of delay, the author agreed to run this new driver through
its paces.  By this time the systems being used were a pair of
rx1620's each with 2, 1.6GHz 3MB "fast" FSB Itanium 2 CPUs.  The
kernel has moved to:

[root@tarry src]# uname -a
Linux tarry.hpl.hp.com 2.6.9-42.EL #1 SMP Wed Jul 12 23:25:09 EDT 2006 ia64 ia64 ia64 GNU/Linux

and the e1000 driver is initially:

[root@tarry src]# ethtool -i eth2
driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: 0000:20:02.1

The systems remain connected via an HP ProCurve 9308 switch.  The
version of netperf used was top of trunk on October 12, 2006 - aka
revision 81 in the repository at:

  http://www.netperf.org/svn/netperf2/trunk .

We start with basic latency, varying the CPU to which
netperf/netserver are being bound since these are SMP systems:

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j \
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30`;\
   done; done

    Single-stream, Single-byte, Single-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
  RedHat ES4 U4, e1000 7.0.33-k2-NAPI, Default Interrupt Coalescing

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0    6463.07     6.43  5.85   19.884  18.099
  0   1    6609.95     6.47  4.41   19.586  13.357
  1   0    6597.81     4.88  5.96   14.806  18.073
  1   1    6749.12     5.02  4.48   14.876  13.289

We can see that things are much better by default than when this paper
was first written, by a factor large enough to be more than just the
change in CPU frequency, but the transaction rate is still rather low.

How about aggregates?  Well, lets take a look :)

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
   -D -b 128`; done; done

   Single-stream, Single-byte, Aggregate-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
  RedHat ES4 U4, e1000 7.0.33-k2-NAPI, Default Interrupt Coalescing

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0  157976.35     78.58 75.10   9.949   9.508
  0   1  157721.99     78.90 47.15  10.005   5.979
  1   0  159642.81     49.98 76.56   6.261   9.591
  1   1  160227.92     50.17 47.66   6.262   5.949

And if we initiate two concurrent such tests:

   	  66387.29     99.41 98.67
   	 144435.00     99.41 98.67
 --------------------------------------------------
   2     210820        99.41 98.67  15.09   14.98


Next we look at single-stream unidirectional throughput, still with
the original driver:

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.2.125 -c -C -i 30,3 -l 30\
   -- -s 128K -S 128K -m 32K`; done; done

       Single-stream, Unidirectional netperf TCP_STREAM 128x32
	      Transfer Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
  RedHat ES4 U4, e1000 7.0.33-k2-NAPI, Default Interrupt Coalescing

  CPU     Megabits       %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem
  0   0    941.46      7.98 19.44    1.389   3.382
  0   1    941.26      7.10 12.38    1.236   2.156
  1   0    941.38      7.58 19.43    1.319   3.382
  1   1    941.45      6.71 12.42    1.167   2.161

And now single-connection, bidirectional performance.  Using 32KB
requests and responses, 353X transactions per second is basically
link-rate - > ~1850 Mbit/s:

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
   -s 128K -S 128K -r 32K -b 6`; done; done

Single-stream, 32K Request/Response, Single-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
  RedHat ES4 U4, e1000 7.0.33-k2-NAPI, Default Interrupt Coalescing

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0    3537.10     28.84 27.11 163.054 153.315
  0   1    3536.68     27.90 18.35 157.762 103.788
  1   0    3536.84     19.78 26.29 111.845 148.654
  1   1    3551.75     19.37 17.94 109.047 101.023

Now, we switch to the "new and improved" :) driver, version:
7.3.12-NAPI. First with the default InterruptThrottleRate

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j \
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30`;\
   done; done

    Single-stream, Single-byte, Single-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
    RedHat ES4 U4, e1000 7.3.12-NAPI, Default Interrupt Coalescing

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0    9362.13     10.82  9.68  23.114  20.671
  0   1   10360.67     11.59  8.11  22.378  15.664
  1   0   10290.14      9.04 10.38  17.578  20.183
  1   1   12187.93      9.95  9.07  16.334  14.890

We can see that this leads to a significant improvement in
Transactions per second, from ~6500 per second to as high as 12000 per
second.  There is a corresponding increase in service demand - at the
highest transaction rate it seems to be on the order of 10% on the
netperf side and ~12% on the netserver side.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
   -D -b 128`; done; done

   Single-stream, Single-byte, Aggregate-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
    RedHat ES4 U4, e1000 7.3.12-NAPI, Default Interrupt Coalescing

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0  151140.06     79.10 76.24  10.468  10.089
  0   1  152856.23     79.25 49.27  10.369   6.447
  1   0  147345.40     50.12 72.39   6.804   9.826
  1   1  148385.54     50.04 48.16   6.744   6.491

Two concurrent tests:

   	 140898.89     99.51  98.19
   	  59397.15     99.51  98.19
 ---------------------------------------------------
   2     200290


For the single stream, on average the new driver with default settings
has about 5.6% lower single-stream aggregate transaction throughput,
and the two stream throughput is about 5% lower as well.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.2.125 -c -C -i 30,3 -l 30\
   -- -s 128K -S 128K -m 32K`; done; done

       Single-stream, Unidirectional netperf TCP_STREAM 128x32
	      Transfer Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
  RedHat ES4 U4, e1000 7.3.12-NAPI, Default Interrupt Coalescing

  CPU     Megabits       %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0    941.39      	5.80 17.50   1.010   3.046
  0   1    941.39      	5.22 10.45   0.908   1.818
  1   0    941.40      	5.23 17.44   0.910   3.036
  1   1    941.45      	4.81 10.62   0.838   1.849

Here we can see that the throughput holds-up, and the service demand
is reduced, both goodness.

Now the bidirectional:
[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
   -s 128K -S 128K -r 32K -b 6`; done; done

Single-stream, 32K Request/Response, Single-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
    RedHat ES4 U4, e1000 7.3.12-NAPI, Default Interrupt Coalescing

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0    3530.70     27.60 27.00 156.367 152.971
  0   1    3531.69     26.93 17.13 152.495  97.020
  1   0    3531.62     18.16 26.24 102.839 148.621
  1   1    3537.95     17.92 17.03 101.319  96.288

The Transactions per second remain basically unchanged as things were
at link-rate.  We can see that the netperf-side service demand has
decreased from an average (across the four data points) of ~135
us/Tran for the old driver to an average of ~128 us/Tran. Netserver
side has decreased from an average of ~126 usec/tran to ~124, which is
likely "within the noise" of the benchmark.

Now we try the InterruptThrottleRate at a more aggressive setting of
"1" which means the driver will attempt to autotune withing a broader
range of values for the InterruptThrottleRate.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j \
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30`;\
   done; done

    Single-stream, Single-byte, Single-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
      RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem
  0   0   13723.42     15.44 13.95  22.501  20.328
  0   1   14683.49     16.64 11.44  22.663  15.582
  1   0   14580.45     12.92 14.70  17.725  20.169
  1   1   15746.54     14.32 13.02  18.182  16.541

We can see that with this setting the transactions per second are even
higher, meaning the average latency is even lower.  This stems from
the driver being willing to alter the InterruptThrottleSetting across
a wider range.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
   -D -b 128`; done; done

   Single-stream, Single-byte, Aggregate-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
      RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0  149165.36     85.09 82.09  11.408  11.007
  0   1  115868.52     64.50 50.04  11.134   8.637
  1   0  103507.95     50.17 58.13   9.693  11.231
  1   1   99628.70     50.03 46.78  10.044   9.392

   	 139051.76     99.59 98.22
   	  35250.74     99.59 98.22
 ---------------------------------------------------
  2      174302        99.59 98.22

We can see further erosion in the single-stream, aggregate TCP_RR
performance when the InterruptThrottleRate is set to 1.  Once again,
no such thing as a free lunch :) The degregation is most severe when
the netperf and netserver are bound to the same CPU as takes
interrupts from the NIC and when we take the system to complete CPU
saturation.  This stands to reason given the driver may be letting the
NIC generate more interrupts, which steals CPU cycles from the rest of
the stack..

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.2.125 -c -C -i 30,3 -l 30\
   -- -s 128K -S 128K -m 32K`; done; done

       Single-stream, Unidirectional netperf TCP_STREAM 128x32
	      Transfer Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
      RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1

  CPU     Megabits       %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0    941.34      5.71 17.23   0.995   2.999
  0   1    941.40      5.28 10.51   0.919   1.829
  1   0    941.45      5.22 17.37   0.909   3.022
  1   1    941.45      4.84 10.56   0.842   1.837

InterruptThrottleRate=1 remains dynamic, and under the unidrectional
workload, the driver is still able to keep the interrupt overhead
minimized even under the more "aggressive setting."

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
   -s 128K -S 128K -r 32K -b 6`; done; done

Single-stream, 32K Request/Response, Single-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
    RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0    3529.82     27.43 26.72 155.418 151.388
  0   1    3531.71     26.91 17.21 152.399  97.472
  1   0    3531.75     18.05 25.91 102.191 146.748
  1   1    3537.95     18.06 17.07 102.088  96.488


Finally, with InterruptThrottleRate set to 0:

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j \
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30`;\
   done; done

    Single-stream, Single-byte, Single-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
      RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=0

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0   16848.52     19.73 16.60  23.417  19.703  2.6     3.2
  0   1   18291.56     20.81 13.75  22.759  15.034 
  1   0   18165.77     16.59 19.31  18.263  21.259 
  1   1   19902.64     15.33 13.95  15.408  14.017  4.8     4.1

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
   -D -b 128`; done; done

   Single-stream, Single-byte, Aggregate-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
      RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=0

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0  135953.23     97.74 97.80  14.379  14.388
  0   1  100257.48     63.46 50.04  12.660   9.983
  1   0   96633.78     50.20 56.14  10.391  11.620
  1   1   94182.15     50.03 45.93  10.624   9.753


          134900.96    99.48 100.48
            4528.33    97.53 100.15  Trans/s +/- 35%
 -------------------------------------------------------
  2       139420       99.48 100

Things were not especially stable for the two concurrent, aggregate
tests when the InterruptThrottleRate was set to zero, even using the
maximum of 30 iterations in netperf.  The smaller of the two, which
was likely bound to the CPU taking interrupts from the NIC, had its
throughput vary rather a lot.  Still, it would seem reasonable to
state that with the InterruptThrottleRate set to zero, while the
single-instance, single-transaction latency is lowest, it does indeed
still drag-down the aggregates, even more than when the dynamic
throttle is enabled.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.2.125 -c -C -i 30,3 -l 30\
   -- -s 128K -S 128K -m 32K`; done; done

       Single-stream, Unidirectional netperf TCP_STREAM 128x32
	      Transfer Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
      RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=0

  CPU     Megabits       %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0    941.44      21.59 51.06  3.758   8.886
  0   1    941.20      20.71 38.46  3.605   6.695
  1   0    941.37      21.46 50.83  3.735   8.847
  1   1    941.35      20.16 38.44  3.509   6.690

We can see that while throughput may remain at link-rate, CPU
utilization and service demand have shot though the roof with the
interrupt throuttle disabled.  This does not bode well for the
bidirectional throughput test which we rerun next.

[root@tarry netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j\
  `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.2.125 -c -C -i 30,3 -l 30 --\
   -s 128K -S 128K -r 32K -b 6`; done; done

Single-stream, 32K Request/Response, Single-transaction netperf TCP_RR
	    Transaction Rate, CPU Util and Service Demand
				  vs
	    Netperf (Loc) and Netserver (Rem) CPU Binding
	       99% Confidence <= +/- 2.5% Unless Noted
	      rx1620 1.6 GHz "fast" FSB 2 Chips, 2 Cores
    RedHat ES4 U4, e1000 7.3.12-NAPI, InterruptThrottleRate=1

  CPU    Transactions    %  CPU      Service Dem.    CPU Util
 Binding     Per       Utilization     usec/tran       +/- %
 Loc Rem   Second       Loc   Rem     Loc     Rem   Loc     Rem

  0   0   3494.09      68.92 60.72 394.498 347.575
  0   1   3499.08      69.02 42.46 394.478 242.690
  1   0   3435.35      50.16 59.90 292.031 348.755
  1   1   3458.17      50.06 42.93 289.497 248.279

Sure enough, we see huge increases in CPU utilization and service
demand. It was only that we had lots of left-over CPU in the previous
tests that we see less of a decrease in Transactions per second.