    $ sudo ethtool -S eth0
    NIC statistics:
         rx_packets: 597028087
         rx_broadcast: 96
         tx_broadcast: 116
         rx_multicast: 20294528
         ...

Monitoring this data can be difficult. It is easy to obtain, but there is no standardization of the field values. Different drivers, or even different versions of the same driver, might produce different field names that have the same meaning. You should look for values with "drop", "buffer", "miss", etc. in the label. Next, you will have to read your driver source. You'll be able to determine which values are accounted for totally in software (e.g., incremented when there is no memory) and which values come directly from hardware via a register read.
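For example, a quick (if crude) way to pull out candidate counters, assuming the interface is named eth0:

    $ sudo ethtool -S eth0 | grep -i -E 'drop|miss|err|buf'

The exact labels that match will vary from driver to driver, so treat this as a starting point, not a definitive list.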
In the case of a register value, you should consult the data sheet for your hardware to determine what the meaning of the counter really is; many of the labels given via ethtool can be misleading.

Using sysfs

sysfs also provides a lot of statistics values, but they are slightly higher level than the direct NIC-level stats provided by ethtool. You can find the number of dropped incoming network data frames for, e.g., eth0 by using cat on a file:
    $ cat /sys/class/net/eth0/statistics/rx_dropped
    2

The counter values will be split into files like collisions, rx_dropped, rx_errors, rx_missed_errors, etc. Unfortunately, it is up to the drivers to decide what the meaning of each field is, and thus, when to increment them and where the values come from. You may notice that some drivers count a certain type of error condition as a drop, but other drivers may count the same as a miss. If these values are critical to you, you will need to read your driver source to understand exactly what your driver thinks each of these values means.

Using /proc/net/dev

An even higher level file is /proc/net/dev, which provides high-level, summary-esque information for each network adapter on the system.

    $ cat /proc/net/dev
    Inter-|   Receive                                                |  Transmit
     face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
      eth0: 00 0 2 0 0 0 209248582604 0 0 0 0 0 0
        lo: 535 0 0 0 0 0 0 535 0 0 0 0 0 0

This file shows a subset of the values you'll find in the sysfs files mentioned above, but it may serve as a useful general reference.
The caveat mentioned above applies here as well: if these values are important to you, you will still need to read your driver source to understand exactly when, where, and why they are incremented, to ensure your understanding of an error, drop, or fifo is the same as your driver's.

Tuning network devices

Check the number of RX queues being used

If your NIC and the device driver loaded on your system support RSS / multiqueue, you can usually adjust the number of RX queues (also called RX channels) by using ethtool.

    $ sudo ethtool -l eth0
    Channel parameters for eth0:
    Cannot get device channel parameters: Operation not supported

This means that your driver has not implemented the ethtool get_channels operation.
This could be because the NIC doesn't support adjusting the number of queues, doesn't support RSS / multiqueue, or because your driver has not been updated to handle this feature.

Adjusting the number of RX queues

Once you've found the current and maximum queue count, you can adjust the values by using sudo ethtool -L. Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
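For example, a hedged sketch of raising the queue count (the interface name and the count of 8 are purely illustrative; check the "Pre-set maximums" reported by ethtool -l first, and note that some drivers expose separate rx and tx counts instead of combined):

    $ sudo ethtool -l eth0               # show current and maximum channel counts
    $ sudo ethtool -L eth0 combined 8    # use 8 combined RX/TX queues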
Adjusting the size of the RX queues

Some NICs and their drivers also support adjusting the size of the RX queue. Exactly how this works is hardware specific, but luckily ethtool provides a generic way for users to adjust the size.
Increasing the size of the RX queue can help prevent network data drops at the NIC during periods where large numbers of data frames are received. Data may still be dropped in software, though, and other tuning is required to reduce or eliminate drops completely. Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
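A sketch of checking and growing the RX ring (the value of 4096 is illustrative and must not exceed the maximum your hardware reports):

    $ sudo ethtool -g eth0            # shows pre-set maximums and current ring sizes
    $ sudo ethtool -G eth0 rx 4096    # grow the RX ring, if the hardware allows it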
Adjusting the processing weight of RX queues

Some NICs support the ability to adjust the distribution of network data among the RX queues by setting a weight. You can configure this if:

- Your NIC supports flow indirection.
- Your driver implements the ethtool functions get_rxfh_indir_size and get_rxfh_indir.
- You are running a new enough version of ethtool that supports the command line options -x and -X to show and set the indirection table, respectively.
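If all of the above hold, you can inspect and rewrite the indirection table; the weights below are purely illustrative:

    $ sudo ethtool -x eth0             # show the current RX flow hash indirection table
    $ sudo ethtool -X eth0 weight 6 2  # send hashed flows to queues 0 and 1 in a 3:1 ratio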
You can also use ethtool to adjust which fields are taken into account when computing the RSS hash. For example, to hash on source and destination IP addresses and ports for UDP flows:

    $ sudo ethtool -N eth0 rx-flow-hash udp4 sdfn

The sdfn string is a bit cryptic; check the ethtool man page for an explanation of each letter. Adjusting the fields to take a hash on is useful, but ntuple filtering is even more useful for finer-grained control over which flows will be handled by which RX queue.

Ntuple filtering for steering network flows

Some NICs support a feature known as "ntuple filtering." This feature allows the user to specify (via ethtool) a set of parameters to use to filter incoming network data in hardware and queue it to a particular RX queue. For example, the user can specify that TCP packets destined to a particular port should be sent to RX queue 1. On Intel NICs this feature is commonly known as Intel Ethernet Flow Director.
Other NIC vendors may have other marketing names for this feature. As we'll see later, ntuple filtering is a crucial component of another feature called Accelerated Receive Flow Steering (aRFS), which makes using ntuple much easier if your NIC supports it. aRFS will be covered later. This feature can be useful if the operational requirements of the system involve maximizing data locality with the hope of increasing CPU cache hit rates when processing network data. For example, consider the following configuration for a webserver running on port 80:

- A webserver running on port 80 is pinned to run on CPU 2.
- IRQs for an RX queue are assigned to be processed by CPU 2.
- TCP traffic destined to port 80 is "filtered" with ntuple to CPU 2.

All incoming traffic to port 80 is then processed by CPU 2, from data arrival all the way up to the userland program. Careful monitoring of the system, including cache hit rates and networking stack latency, will be needed to determine effectiveness. As mentioned, ntuple filtering can be configured with ethtool, but first you'll need to ensure that this feature is enabled on your device.
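A hedged sketch of such a setup (the port and queue numbers follow the example above; exact ntuple capabilities and syntax vary by driver and ethtool version):

    $ sudo ethtool -K eth0 ntuple on                            # enable ntuple filtering
    $ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2  # steer TCP port 80 to RX queue 2
    $ sudo ethtool -u eth0                                      # list the installed filter rules

Here queue 2 is chosen only because its IRQ is pinned to CPU 2 in the example; the action number refers to an RX queue, not a CPU.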
Tuning network data arrival

Interrupt coalescing is a method of preventing interrupts from being raised by a device to a CPU until a specific amount of work or number of events is pending. This can help prevent interrupt storms and can help increase throughput or decrease latency, depending on the settings used.
Fewer interrupts generated result in higher throughput, increased latency, and lower CPU usage. More interrupts generated result in the opposite: lower latency and lower throughput, but also increased CPU usage. Historically, earlier versions of the igb, e1000, and other drivers included support for a parameter called InterruptThrottleRate. This parameter has been replaced in more recent drivers with a generic ethtool function.

    $ sudo ethtool -c eth0
    Coalesce parameters for eth0:
    Adaptive RX: off  TX: off
    stats-block-usecs: 0
    sample-interval: 0
    pkt-rate-low: 0
    pkt-rate-high: 0
    ...

ethtool provides a generic interface for setting various coalescing settings. Keep in mind, however, that not every device or driver will support every setting.
You should check your driver documentation or driver source code to determine what is, or is not, supported. As per the ethtool documentation: "Anything not implemented by the driver causes these values to be silently ignored." One interesting option that some drivers support is "adaptive RX/TX IRQ coalescing." This option is typically implemented in hardware. The driver usually needs to do some work to inform the NIC that this feature is enabled, and some bookkeeping as well (as can be seen in the igb driver source). The result of enabling adaptive RX/TX IRQ coalescing is that interrupt delivery will be adjusted to improve latency when the packet rate is low and to improve throughput when the packet rate is high.

    $ sudo ethtool -C eth0 adaptive-rx on

You can also use ethtool -C to set several other options.
Some of the more common options to set are:

- rx-usecs: How many usecs to delay an RX interrupt after a packet arrives.
- rx-frames: Maximum number of data frames to receive before an RX interrupt.
- rx-usecs-irq: How many usecs to delay an RX interrupt while an interrupt is being serviced by the host.
- rx-frames-irq: Maximum number of data frames to receive before an RX interrupt is generated while the system is servicing an interrupt.

And many, many more. Note: while interrupt coalescing seems to be a very useful optimization at first glance, the rest of the networking stack internals also come into the fold when attempting to optimize.
Interrupt coalescing can be useful in some cases, but you should ensure that the rest of your networking stack is also tuned properly. Simply modifying your coalescing settings alone will likely provide minimal benefit in and of itself.
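As an illustration (the values are arbitrary and not every driver honors every field), a couple of these options can be combined in one command:

    $ sudo ethtool -C eth0 rx-usecs 50 rx-frames 64
    $ sudo ethtool -c eth0    # check which of the requested values the driver actually applied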
Adjusting IRQ affinities

If your NIC supports RSS / multiqueue or if you are attempting to optimize for data locality, you may wish to use a specific set of CPUs for handling interrupts generated by your NIC. Setting specific CPUs allows you to segment which CPUs will be used for processing which IRQs. These changes may affect how upper layers operate, as we've seen for the networking stack. If you do decide to adjust your IRQ affinities, you should first check whether you are running the irqbalance daemon. This daemon tries to automatically balance IRQs to CPUs and it may overwrite your settings.
If you are running irqbalance, you should either disable irqbalance or use the --banirq option in conjunction with IRQBALANCE_BANNED_CPUS to let irqbalance know that it shouldn't touch the set of IRQs and CPUs that you want to assign yourself. Next, you should check the file /proc/interrupts for a list of the IRQ numbers for each network RX queue for your NIC.
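For example, assuming the interface is named eth0:

    $ grep -E 'CPU|eth0' /proc/interrupts    # the left-hand column is the IRQ number for each queue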
Finally, you can control which CPUs will handle each of those IRQs by modifying /proc/irq/IRQ_NUMBER/smp_affinity for each IRQ number. You simply write a hexadecimal bitmask to this file to instruct the kernel which CPUs it should use for handling the IRQ.
    $ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'

(A bitmask of 1 selects CPU 0.)

Network data processing begins

Once the softirq code determines that a softirq is pending, begins processing, and executes net_rx_action, network data processing begins. Let's take a look at portions of the net_rx_action processing loop to understand how it works, which pieces are tunable, and what can be monitored.

net_rx_action processing loop

net_rx_action begins the processing of packets from the memory the packets were DMA'd into by the device. The function iterates through the list of NAPI structures that are queued for the current CPU, dequeuing each structure and operating on it.
The processing loop bounds the amount of work and execution time that can be consumed by the registered NAPI poll functions. It does this in two ways:

- by keeping track of a work budget (which can be adjusted), and
- by checking the elapsed time.

From net/core/dev.c:

    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        int work, weight;

        /* If softirq window is exhausted then punt.
         * Allow this to run for 2 jiffies since which will allow
         * an average latency of 1.5/HZ.
         */
        if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
            goto softnet_break;

Statistics from this processing loop are exported in /proc/net/softnet_stat:

    $ cat /proc/net/softnet_stat
    6dcad20 000000 000000 0000000 6f0e150 000000 000000 0000000 660774ec 000000 000000 000000 61c000 000000 000000 000b1b3 000000 000000 000000 6488cb0 000000 000000 0000000

Important details about /proc/net/softnet_stat:

- Each line of /proc/net/softnet_stat corresponds to a struct softnet_data structure, of which there is one per CPU.
- The values are separated by a single space and are displayed in hexadecimal.
- The first value, sd->processed, is the number of network frames processed. This can be more than the total number of network frames received if you are using ethernet bonding. There are cases where the ethernet bonding driver will trigger network data to be re-processed, which would increment the sd->processed count more than once for the same packet.
- The second value, sd->dropped, is the number of network frames dropped because there was no room on the processing queue. More on this later.
- The third value, sd->time_squeeze, is (as we saw) the number of times the net_rx_action loop terminated because the budget was consumed or the time limit was reached, but more work could have been done. Increasing the budget, as described below, can help reduce this.
- The next 5 values are always 0.
- The ninth value, sd->cpu_collision, is a count of the number of times a collision occurred when trying to obtain a device lock when transmitting packets. This article is about receive, so this statistic will not be seen below.
- The tenth value, sd->received_rps, is a count of the number of times this CPU has been woken up to process packets via an Inter-processor Interrupt (IPI).
- The last value, flow_limit_count, is a count of the number of times the flow limit has been reached. Flow limiting is an optional feature that will be examined shortly.
If you decide to monitor this file and graph the results, you must be extremely careful that the ordering of these fields hasn’t changed and that the meaning of each field has been preserved. You will need to read the kernel source to verify this.
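If you do monitor it, a small sketch like the following can make the per-CPU squeeze and drop counts easier to read; it relies on gawk's strtonum (since the values are hexadecimal) and on the field layout described above:

    $ awk '{ printf "cpu%d: time_squeeze=%d dropped=%d\n", NR-1, strtonum("0x" $3), strtonum("0x" $2) }' /proc/net/softnet_stat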
Tuning network data processing

Adjusting the net_rx_action budget

You can adjust the net_rx_action budget, which determines how much packet processing can be spent among all NAPI structures registered to a CPU, by setting a sysctl value named net.core.netdev_budget.

    $ sudo sysctl -w net.core.netdev_budget=600

You may also want to write this setting to your /etc/sysctl.conf file so that the change persists between reboots. The default value on Linux 3.13.0 is 300.

Generic Receive Offloading (GRO)

Generic Receive Offloading (GRO) is a software implementation of a hardware optimization known as Large Receive Offloading (LRO). The main idea behind both methods is that reducing the number of packets passed up the network stack by combining "similar enough" packets together can reduce CPU usage. For example, imagine a case where a large file transfer is occurring and most of the packets contain chunks of data from the file. Instead of sending small packets up the stack one at a time, the incoming packets can be combined into one packet with a huge payload.
That packet can then be passed up the stack. This allows the protocol layers to process a single packet’s headers while delivering bigger chunks of data to the user program. The problem with this sort of optimization is, of course, information loss.
If a packet had some important option or flag set, that option or flag could be lost if the packet is coalesced into another. And this is exactly why most people don't use or encourage the use of LRO. LRO implementations, generally speaking, had very lax rules for coalescing packets. GRO was introduced as an implementation of LRO in software, but with stricter rules around which packets can be coalesced. By the way: if you have ever used tcpdump and seen unrealistically large incoming packet sizes, it is most likely because your system has GRO enabled. As you'll see soon, packet capture taps are inserted further up the stack, after GRO has already happened.

Tuning: Adjusting GRO settings with ethtool

You can use ethtool to check if GRO is enabled and also to adjust the setting.
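For example (assuming the interface is eth0):

    $ ethtool -k eth0 | grep generic-receive-offload
    generic-receive-offload: on

    $ sudo ethtool -K eth0 gro on    # enable GRO if it is currently off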
Note: enabling RPS to distribute packet processing to CPUs which were previously not processing packets will cause the number of `NET_RX` softirqs to increase for that CPU, as well as the `si` or `sitime` in the CPU usage graph. You can compare before and after snapshots of your softirq and CPU usage graphs to confirm that RPS is configured to your liking.

Receive Flow Steering (RFS)

Receive Flow Steering (RFS) is used in conjunction with RPS.
RPS attempts to distribute incoming packet load amongst multiple CPUs, but does not take into account any data locality issues for maximizing CPU cache hit rates. You can use RFS to help increase cache hit rates by directing packets for the same flow to the same CPU for processing.

Tuning: Enabling RFS

For RFS to work, you must have RPS enabled and configured. RFS keeps track of a global hash table of all flows, and the size of this hash table can be adjusted by setting the net.core.rps_sock_flow_entries sysctl. The number of flows tracked per RX queue is set by writing to that queue's rps_flow_cnt file, for example:

    $ sudo bash -c 'echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt'
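The global flow table size itself can be set with sysctl; the value below is only an example:

    $ sudo sysctl -w net.core.rps_sock_flow_entries=32768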
Hardware accelerated Receive Flow Steering (aRFS)

RFS can be sped up with the use of hardware acceleration; the NIC and the kernel can work together to determine which flows should be processed on which CPUs. To use this feature, it must be supported by the NIC and your driver. Consult your NIC's data sheet to determine if this feature is supported. If your NIC's driver exposes a function called ndo_rx_flow_steer, then the driver has support for accelerated RFS.

Tuning: Enabling accelerated RFS (aRFS)

Assuming that your NIC and driver support it, you can enable accelerated RFS by enabling and configuring the following:

- Have RPS enabled and configured.
- Have RFS enabled and configured.
- Your kernel has CONFIG_RFS_ACCEL enabled at compile time (the Ubuntu kernel 3.13.0 does).
- Have ntuple support enabled for the device, as described previously.
- You can use ethtool to verify that ntuple support is enabled for the device.
- Configure your IRQ settings to ensure each RX queue is handled by one of your desired network processing CPUs.

Once the above is configured, accelerated RFS will be used to automatically move data to the RX queue tied to a CPU core that is processing data for that flow, and you won't need to specify an ntuple filter rule manually for each flow.

Moving up the network stack with netif_receive_skb

Picking up where we left off with netif_receive_skb, which is called from a few places.
The two most common (and also the two we've already looked at) are:

- napi_skb_finish, if the packet is not going to be merged into an existing GRO'd flow, or
- napi_gro_complete, if the protocol layers indicated that it's time to flush the flow.

Reminder: netif_receive_skb and its descendants are operating in the context of the softirq processing loop, and you'll see the time spent here accounted for as sitime or si with tools like top.

netif_receive_skb begins by first checking a sysctl value to determine whether the user has requested receive timestamping before or after a packet hits the backlog queue. If this setting is enabled, the data is timestamped now, prior to it hitting RPS (and the CPU's associated backlog queue). If this setting is disabled, it will be timestamped after it hits the queue. This can be used to distribute the load of timestamping amongst multiple CPUs if RPS is enabled, but will introduce some delay as a result.
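The sysctl in question (net.core.netdev_tstamp_prequeue, at least on kernels of this era) can be flipped like so; for example, to timestamp only after the packet reaches the backlog queue:

    $ sudo sysctl -w net.core.netdev_tstamp_prequeue=0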
    $ cat /proc/net/snmp
    Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
    Ip: 1 8125 0 0 15771700 0 0 6 4 12987882 51 1 101520 1 0 0 0

This file contains statistics for several protocol layers. The IP protocol layer appears first. The first line contains space-separated names for each of the corresponding values in the next line. In the IP protocol layer, you will find statistics counters being bumped.
Those counters are referenced by a C enum. All of the valid enum values and the field names they correspond to in /proc/net/snmp can be found in include/uapi/linux/snmp.h:

    enum
    {
        IPSTATS_MIB_NUM = 0,
        /* frequently written fields in fast path, kept in same cache line */
        IPSTATS_MIB_INPKTS,             /* InReceives */
        IPSTATS_MIB_INOCTETS,           /* InOctets */
        IPSTATS_MIB_INDELIVERS,         /* InDelivers */
        IPSTATS_MIB_OUTFORWDATAGRAMS,   /* OutForwDatagrams */
        IPSTATS_MIB_OUTPKTS,            /* OutRequests */
        IPSTATS_MIB_OUTOCTETS,          /* OutOctets */
        /* ... */
    };

    $ cat /proc/net/snmp | grep Udp:
    Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
    Udp: 16314 0 0 17161 0 0

Much like the detailed statistics found in this file for the IP protocol, you will need to read the protocol layer source to determine exactly when and where these values are incremented.

- InDatagrams: Incremented when recvmsg was used by a userland program to read a datagram.
    $ sudo sysctl -w net.ipv4.tcp_dma_copybreak=2048

The default value is 4096.

Conclusion

The Linux networking stack is complicated. It is impossible to monitor or tune it (or any other complex piece of software) without understanding at a deep level exactly what's going on. Often, out in the wild of the Internet, you may stumble across a sample sysctl.conf that contains a set of sysctl values that should be copied and pasted onto your computer.
This is probably not the best way to optimize your networking stack. Monitoring the networking stack requires careful accounting of network data at every layer, starting with the drivers and proceeding up.
That way you can determine where exactly drops and errors are occurring and then adjust settings to determine how to reduce the errors you are seeing. There is, unfortunately, no easy way out.
Help with Linux networking or other systems

Need some extra help navigating the network stack? Have questions about anything in this post or related things not covered? Send us an email and let us know how we can help.