XDP and the Intel X710


A tale of woe

Recently we committed to using the vc5 load balancer (which I wrote) as part of the refresh of our edge infrastructure at work. We have been using it for some time for small, lower-priority services, deployed on virtual machines rather than hardware. It speaks BGP to advertise healthy services to the network routers and works well, but VMs won’t scale to the demands of our main public-facing services.

I had performed some pretty extensive testing, with some success, on a retired Dell R730 server with a dual-port 10Gbps Intel X520 ethernet card. The load balancer uses XDP to process packets with a small eBPF program which is loaded into the kernel and attached to the network interfaces.

The X520 card supports XDP natively in the driver, which means that ethernet frames can be processed and forwarded by XDP before the kernel has a chance to allocate an “SKB” — the socket buffer structure (struct sk_buff) that Linux uses internally for processing network packets. As practically all traffic will be forwarded by the load balancer and not processed by the network stack, this massively reduces the overhead of processing traffic and increases the potential capacity of the server.
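For context, a native-mode XDP program has roughly the following shape. This is a minimal sketch for illustration only (not vc5’s actual eBPF code, and the names are placeholders), but it shows the point at which the driver hands over the raw frame:

/* Minimal XDP program sketch. In native (driver) mode the NIC driver
 * calls this for every received frame, before any SKB is allocated. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int balancer(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* Bounds check required by the verifier before touching the header. */
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    /* A load balancer would parse IP/TCP here, pick a backend, rewrite
     * the destination MAC and return XDP_TX to bounce the frame straight
     * back out of the same interface, all without an SKB ever existing. */

    return XDP_PASS; /* anything else goes up the normal kernel stack */
}

char _license[] SEC("license") = "GPL";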

Flushed with hope, we ordered some R650 servers from Dell, each with a quad-port 10Gbps Intel X710 OCP 3.0 card. The servers are excellent and a single one is able to handle our full production load (many tens of Gbps of egress traffic) with single-digit percentages of CPU capacity; the load balancer works as a Direct Server Return (DSR) device, which only steers incoming traffic, and 99% of this is simply TCP ACK packets. The X710 seemed like the natural progression from the X520; it uses the i40e driver rather than the X520’s ixgbe, but both have native XDP support.

After initial testing, the time came to move the servers to the colocation facilities where they will operate, but we had overlooked that we would have to change the media type of the network interfaces from 10GBASE-T to SFP+ fibre modules, so we got replacement cards which differed only in this respect and shipped them out.

The first issue we encountered was that the X710 would not always be able to get a link; the card didn’t seem to be able to negotiate a carrier with the switch. We’re using partitioned 40Gbps ports on the switch side, so maybe it’s something to do with that.

We have a separate management interface so we could install the OS and log in, and I found that it was usually possible to get the X710 links to come up with an ethtool -r (renegotiate) command. With some startup scripts to mitigate this, we could get the interfaces to form a 4x10Gbps bond with switches in MLAG, and all seemed well.
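For the curious, ethtool -r boils down to a single ioctl, which is roughly what our startup scripts trigger per interface. A sketch, with error handling trimmed and a hypothetical interface name passed in by the caller:

/* Roughly what `ethtool -r <iface>` does: an SIOCETHTOOL ioctl asking
 * the driver to restart autonegotiation on the link. */
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int restart_autoneg(const char *ifname)
{
    struct ethtool_value ev = { .cmd = ETHTOOL_NWAY_RST };
    struct ifreq ifr = { 0 };

    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ev;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int ret = ioctl(fd, SIOCETHTOOL, &ifr);
    close(fd);
    return ret; /* 0 on success */
}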

Then a switch rebooted. The affected links did not come back up. Not a huge problem, as the bonded interface was still operational over the links to the other switch, and we could persuade the card to renegotiate with ethtool, potentially automating this in future to avoid manual effort.

Disaster averted. However, I then noticed that although traffic was flowing to the load balancer, the service did not seem to function. I had put some instrumentation in to check that packets were being processed by the XDP code, and they were, but they did not reach their destination. It was as if they were simply not being retransmitted after processing, so effectively blackholed.

Stopping and restarting the load balancer service restored normal operation, so it looks like re-attaching the XDP code to the interfaces brings them back to life. I confirmed this by including some code to hot re-attach XDP whilst the load balancer is running, by means of a command channel, and this also works.
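The re-attach itself is nothing exotic. With libbpf it looks roughly like the sketch below; vc5 does this through its own machinery, so this is illustrative only, and it assumes the program is already loaded and its file descriptor is to hand:

/* Sketch of detaching and re-attaching an already-loaded XDP program in
 * native driver mode (libbpf >= 0.8). prog_fd would come from
 * bpf_object__load()/bpf_program__fd(); names here are hypothetical. */
#include <net/if.h>         /* if_nametoindex() */
#include <linux/if_link.h>  /* XDP_FLAGS_DRV_MODE, XDP_FLAGS_SKB_MODE */
#include <bpf/libbpf.h>

static int reattach_xdp(const char *ifname, int prog_fd)
{
    int ifindex = if_nametoindex(ifname);
    if (!ifindex)
        return -1;

    /* Drop whatever is currently attached in driver mode... */
    bpf_xdp_detach(ifindex, XDP_FLAGS_DRV_MODE, NULL);

    /* ...and attach the program again. This is the step that brings a
     * blackholed interface back to life. (Generic mode, mentioned later,
     * would use XDP_FLAGS_SKB_MODE instead.) */
    return bpf_xdp_attach(ifindex, prog_fd, XDP_FLAGS_DRV_MODE, NULL);
}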

Currently the mitigation for a downed link is to remove the interface from the bond device, use ethtool to restore the carrier, ask the load balancer to re-attach XDP, and then return the interface to the bond. A real pain in the neck, but at least it’s not a show stopper.
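The bond membership part of that sequence is just a write to the bonding driver’s sysfs interface (the ethtool and XDP steps are sketched above). A rough example, with hypothetical bond and interface names:

/* Remove or re-add a slave interface by writing "-<iface>" or "+<iface>"
 * to the bond's sysfs slaves file. Names are hypothetical. */
#include <stdio.h>

static int bond_slave_ctl(const char *bond, const char *op_and_iface)
{
    char path[128];
    snprintf(path, sizeof(path), "/sys/class/net/%s/bonding/slaves", bond);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%s\n", op_and_iface);
    return fclose(f); /* 0 on success */
}

/* e.g. bond_slave_ctl("bond0", "-enp81s0f0");   remove from bond
 *      ... restart_autoneg(), reattach_xdp() ...
 *      bond_slave_ctl("bond0", "+enp81s0f0");   return to bond */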

A colleague on the network team has created a systemd.link file (below) to advertise only 10000baseLR/Full instead of both supported link modes (1000baseX/Full and 10000baseLR/Full), and after another switch reboot the affected links did come back up, but I would like to see this behave consistently before breathing a sigh of relief.

/etc/systemd/network/10-Set10Gfull-autooff.link:

[Match]
Path=pci-0000:51:00.?

[Link]
BitsPerSecond=10G
Duplex=full
Advertise=10000baselr-full

Unfortunately, the switches in question are critical production devices, so we’re not able to test without disruption. Thankfully, the load balancers are still in testing and no production services have been migrated to them yet, so no harm done.

I need to investigate further what is happening when XDP stops functioning. I’ve not been able to reproduce this on the X520 card, and XDP in generic mode doesn’t seem affected, so maybe it is some bug in the i40e driver? If you have any insights and are able to raise an issue on the GitHub repository, I’d be most grateful.

Update: the issues experienced appear to be fixed!

X710  XDP  eBPF 
