Bypassing the Linux kernel for high-performance packet filtering

mavam · on Sept 7, 2015

We've developed packet BRICKS for this: https://github.com/bro/packet-bricks

Packet BRICKS is a Linux/FreeBSD daemon that is capable of receiving and distributing ingress traffic to userland applications. Its main responsibilities may include (i) load-balancing, (ii) duplicating and/or (iii) filtering ingress traffic across all registered applications. The distribution is flow-aware (i.e. packets of one connection will always end up in the same application). At the moment, packet-bricks uses the netmap packet I/O framework for receiving packets. It employs netmap pipes to forward packets to end host applications.

(Credit goes to Asim Jamshed, who pulled this off as part of an internship at ICSI.)

tinco · on Sept 7, 2015

So is the advantage that it's more performant than iptables+virtual interfaces? If I use iptables to distribute traffic over virtual interfaces do IP headers get parsed twice by the kernel in some inefficient way?

mavam · on Sept 7, 2015

Iptables sits in the kernel and is also not available on non-Linux platforms like FreeBSD. With packet bricks you bypass the kernel and expose "virtual" interfaces to your applications by means of a simple configuration. Here's an example from the README:

        bricks> lb = Brick.new("LoadBalancer", 2)
        bricks> lb:connect_input("eth3")
        bricks> lb:connect_output("eth3{0", "eth3{1", "eth3{2", "eth3{3", "eth2")
        bricks> pe:link(lb)

This binds pkteng pe with LoadBalancer brick and asks the system to read ingress packets from eth3 and split them flow-wise based on the 2-tuple (src & dst IP addresses) metadata of the packet header. The "lb:connect_output(...)" command creates four netmap-specific pipes named "netmap:eth3{x" where 0 <= x < 4 and an egress interface named "eth2". The traffic is evenly split between all five channels based on the 2 tuple header as previously mentioned. Userland applications can now use packet-bricks to get their fair share of ingress traffic. The brick is finally linked with the packet engine.
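
The flow-aware split described above can be illustrated with a small sketch. This is not packet-bricks' actual code: `pick_output` and the choice of CRC32 as the hash are assumptions made purely for illustration. The idea it shows is that hashing only the (src, dst) IP pair, in a direction-independent way, sends both directions of a connection to the same output channel.

```python
import ipaddress
import zlib

# Output names copied from the README example above.
OUTPUTS = ["eth3{0", "eth3{1", "eth3{2", "eth3{3", "eth2"]

def pick_output(src_ip, dst_ip, outputs=OUTPUTS):
    """Map a packet to an output channel with a symmetric 2-tuple hash.

    Hypothetical helper: sorting the two addresses makes the key
    independent of direction, so A->B and B->A pick the same channel.
    """
    a = int(ipaddress.ip_address(src_ip))
    b = int(ipaddress.ip_address(dst_ip))
    lo, hi = sorted((a, b))
    # 16 bytes per address is wide enough for both IPv4 and IPv6.
    key = lo.to_bytes(16, "big") + hi.to_bytes(16, "big")
    return outputs[zlib.crc32(key) % len(outputs)]

if __name__ == "__main__":
    print(pick_output("10.0.0.1", "192.168.1.7"))
    print(pick_output("192.168.1.7", "10.0.0.1"))  # same channel as the line above
```

A real dispatcher would compute this per packet in the fast path; how evenly traffic spreads depends on the hash and on the mix of address pairs.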

eikenberry · on Sept 7, 2015

So the goals of packet bricks are portability and ease of configuration, not performance gains?

mavam · on Sept 7, 2015

The goal is to have both. In fact, the whole point of kernel bypass is performance, so just having ease of configuration would defeat the point.

We're using packet bricks primarily for high-performance network monitoring in environments with more than 10 Gbps aggregate upstream traffic.

technion · on Sept 8, 2015

Incredibly interesting project. Are you aware of any experiences using this with nginx or HAProxy on the load balancing side and the impact it may have had?

awgn · on Sept 7, 2015

In this essay the author forgot to mention PFQ, which at the time of writing represents a performant and innovative approach to packet capture and in-kernel functional processing of packets (running on top of vanilla drivers). The software is available at www.pfq.io.

tuukkah · on Sept 7, 2015

PFQ is impressive: "Rx and Tx line-rate on 10-Gbit links (14,8 Mpps), on-top-of Intel ixgbe vanilla drivers." http://www.pfq.io/

majke · on Sept 7, 2015

Can you explain how to use PFQ only for selected network flows? (and not do a take-over-the-whole-nic?)

4h53n · on Sept 7, 2015

University of Pisa, right? The authors of netmap and PF_RING are also from that university, if I am not mistaken? You guys seem to have a mutual interest in wire-rate packet processing. Keep it up...

lukego · on Sept 7, 2015

Great perspective.

Thinking of the future: Recent experience suggests that we are able to do around 50-100 Mpps of traffic dispatching in Snabb Switch using one CPU core. I suspect that dedicated software traffic dispatchers will displace hardware (RSS, VMDq, etc) in the immediate future.

We are planning to explore this soon in the context of software dispatching for 100Gbps ethernet ports.

abrookewood · on Sept 7, 2015

I remember reading about this: http://highscalability.com/blog/2014/2/13/snabb-switch-skip-...

wmf · on Sept 7, 2015

So would you recommend giving Snabb the NIC, letting it drop the DDoS traffic, and then injecting what's left into the kernel stack via a tap interface or something?

lukego · on Sept 7, 2015

Great question actually.

That is one solution: take a 10G port into Snabb Switch, filter and sort the traffic, then feed a slice to the kernel e.g. via a tap device with multiqueue /dev/vhost-net acceleration (same interface that QEMU/KVM uses).

The risk I see here is that maybe it is even harder to tune your kernel when it is using a software interface for I/O instead of a hardware one. The kernel can depend on so many things (multiqueue, TSO, LRO, checksum, encap offload, etc) that can behave differently between hardware/software NICs and you would need to be confident that this will work out well. Otherwise the risk is that you take your hardest problem - tuning the kernel - and make it even harder.

If we are only talking about 2x10G ports per server then one alternative would be to connect Snabb Switch to the kernel with a physical 10G port instead of a software one. That is, separate the DDoS-protecting frontend (Snabb) from the backend (kernel) with a network cable. You could still run the Snabb application on the same server but with a dedicated network card cabled directly to the kernel. (You could also run it on a different server if you prefer.)

End off-the-cuff braindump :-)

justincormack · on Sept 7, 2015

Don't use a tap interface, but Snabb does have a fast virtio driver for KVM that maybe you could adapt (or use directly if you don't mind the userspace parts being in another VM).

Or just do everything in userspace. Reinjecting into the kernel is not strictly necessary, although obviously you may be constrained by existing code.

ck2 · on Sept 7, 2015

For those who are not at CloudFlare's level and don't want to reinvent the wheel, try ipset instead of iptables

http://daemonkeeper.net/781/mass-blocking-ip-addresses-with-...

TLDR: http://daemonkeeper.net/wp-content/uploads/2012/05/ipset4.pn...
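
Why a set beats a long rule list can be sketched with a toy model. This is only an analogy, not how the kernel implements either tool: a chain with one rule per address is checked rule by rule, while a hash-based set is matched with a single lookup. All function names and the generated addresses are made up for the illustration.

```python
import ipaddress
import timeit

def build_blocklist(n):
    """Return n distinct IPv4 addresses as integers (made-up test data)."""
    base = int(ipaddress.ip_address("10.0.0.0"))
    return [base + i for i in range(n)]

def blocked_linear(addr, rules):
    """One rule per address: compare against each rule in turn."""
    for rule in rules:
        if rule == addr:
            return True
    return False

def blocked_set(addr, members):
    """Set-based match: a single hash lookup, whatever the set size."""
    return addr in members

if __name__ == "__main__":
    rules = build_blocklist(50_000)
    members = frozenset(rules)
    probe = int(ipaddress.ip_address("192.0.2.1"))  # not listed: worst case for the scan
    t_linear = timeit.timeit(lambda: blocked_linear(probe, rules), number=20)
    t_set = timeit.timeit(lambda: blocked_set(probe, members), number=20)
    print(f"linear scan: {t_linear:.4f}s, set lookup: {t_set:.6f}s")
```

Both functions give the same verdict; only the cost per packet differs, and the gap grows with the size of the blocklist.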

Symmetry · on Sept 7, 2015

This reminds me of the work the Arrakis folks are doing:

https://www.usenix.org/system/files/conference/osdi14/osdi14...

https://www.youtube.com/watch?v=WG3b2hE4i6U

s1m0n · on Sept 7, 2015

The article appears to be inaccurate. Why? AFAIK it's possible to make use of netmap on a box with a single NIC. I tried this out for myself about 2 years ago on a VMware virtual machine. How does it work? A userland packet filter can "forward" certain packets on to the kernel -- e.g. ssh packets in my case -- while others stick around in shared memory for kernel bypass. This means that I can ssh to the box and the ssh packets flow via the kernel, while the rest of the packets bypass the kernel, but all packets flow over the same NIC. Nice :-)
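
The kind of userland decision described here can be sketched as a tiny classifier. This is a hedged illustration, not netmap code: `classify` and `make_tcp_packet` are hypothetical helpers, the input is assumed to be a raw IPv4 packet without an Ethernet header, and the actual hand-off to the kernel is left out entirely.

```python
import struct

SSH_PORT = 22
TCP_PROTO = 6

def classify(ip_packet):
    """Return "host" for packets to pass to the kernel, else "bypass".

    Only TCP traffic to or from port 22 is sent to the kernel; anything
    else (including packets too short to parse) stays on the bypass path.
    """
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        return "bypass"
    ihl = (ip_packet[0] & 0x0F) * 4            # IPv4 header length in bytes
    if ihl < 20 or ip_packet[9] != TCP_PROTO or len(ip_packet) < ihl + 4:
        return "bypass"
    src_port, dst_port = struct.unpack_from("!HH", ip_packet, ihl)
    return "host" if SSH_PORT in (src_port, dst_port) else "bypass"

def make_tcp_packet(src_port, dst_port):
    """Build a minimal IPv4 + TCP header pair for the demo (checksums left at zero)."""
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, TCP_PROTO, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    tcp = struct.pack("!HHIIBBHHH", src_port, dst_port, 0, 0, 0x50, 0x02, 65535, 0, 0)
    return ip + tcp

if __name__ == "__main__":
    print(classify(make_tcp_packet(50000, 22)))   # ssh: goes to the kernel
    print(classify(make_tcp_packet(50000, 80)))   # everything else: bypass
```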

majke · on Sept 7, 2015

Well, sure. You can use netmap and use a "host ring" to inject the packets back to the kernel. While good for a toy app, this won't work in real workloads. For example, the host ring doesn't support multiple RX queues. The article goes into detail on how to work around it and just leave most of the network flows to be dealt with by the kernel, while opting in to the kernel bypass for only selected flows.

s1m0n · on Sept 7, 2015

If you are only interested in pushing infrequently used ssh packets into the kernel for e.g. low bandwidth health monitoring -- while all other packets bypass the kernel -- then why would this be considered a "toy app"? Surely it's a useful technique because it allows netmap to be used on very many cheap dedicated hosts for rent where only one NIC is available and you have no control over the hardware, or?

s1m0n · on Sept 7, 2015

This is the inaccurate sentence: "Snabbswitch, DPDK and netmap take over the whole network card, not allowing any traffic on that NIC to reach the kernel." Obviously with netmap traffic to the NIC may reach the kernel...

majke · on Sept 7, 2015

There are many ways to inject packets back to kernel. Tuntap, raw socket on loopback, "dummy" device, etc. So by this count you can always make packets reach the kernel.

There are two problems with doing the "take over the nic" techniques:

1) I don't believe you can actually push, say, 2M pps back to the kernel with any of these techniques. There is a reason RSS exists, and even if you can process 10M pps on one CPU, it doesn't mean it's easy to insert them back into the kernel.

2) I don't think putting a piece of custom code between CloudFlare kernel and network card is feasible on the architectural level. You really want to stand in the way and have to actively forward all these packets?

s1m0n · on Sept 7, 2015

The title of the article does not mention CloudFlare; only bypassing. The fact that the CloudFlare architecture pushes a higher bandwidth of packets into the network kernel and bypasses the rest does not make it a good technique or to be recommended. If you are primarily interested in the best performance with a single NIC solution then I believe it is suboptimal. Why? You are asking the CPU to do two different types of work: optimized and unoptimized. Because of cache line pollution, the "unoptimized" work done via the network kernel will slow down the other work. I may be wrong but I would bet you'd get better performance by separating your CloudFlare specific workload onto two boxes, each with one NIC. In this scenario no such cache line pollution can occur. Of course, these two boxes might not be easily possible within the existing CloudFlare architecture. But this has nothing to do with the general idea of packets bypassing the kernel. After the bypass you want the CPU to process those packets in the most efficient way...

ilanco · on Sept 7, 2015

It doesn't mention in the article why the kernel is so slow at processing network packets. I'm not a kernel programmer so this may be utterly wrong, but wouldn't it be possible to sacrifice some feature for speed by disabling it in the kernel code?

acdha · on Sept 7, 2015

One thing to remember is that high-speed packet filtering is an unusual workflow and CloudFlare operates at a much greater scale than most of us see: most Linux devices are not connected to 10G, much less 100G, networks and they're usually doing more work than looking at a packet to decide whether to accept or reject it. The fact that the APIs and the kernel stack were designed many years before those kinds of speeds were possible doesn't matter because most sites don't have that much traffic and most server applications will bottleneck at doing other work well before that point.

The example in the article found a single core handling 1.4M packets per second. If you're running a web-server shoveling data out to clients those packets are going to be close to the maximum size which, if I haven't screwed up the math, looks something like this:

1.4M * 1400 bytes (assuming a low MTU) * 8 (bytes -> bits) = 15Gbps

That's not to say that there isn't still plenty of room to improve and, as lukego noted, there's a lot of work in progress (see e.g. https://lwn.net/Articles/615238/ on work to batch operations to avoid paying some of the processing costs for every packet) but for the average server you'd find bottlenecks on something like a database, application logic, request handling, client network capacity, etc. before the network stack overhead is your greatest challenge. The people who encounter this tend to be CDN vendors like CloudFlare and security people who need to filter, analyze, or generate traffic on levels which are at least the scale of a large company (e.g. https://github.com/robertdavidgraham/masscan).
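
The back-of-the-envelope arithmetic above, together with the 14.88 Mpps line-rate figure quoted elsewhere in the thread, can be checked with a few lines. The helper names are made up for this sketch, and the 3 GHz clock is an assumption chosen only to show the per-packet cycle budget.

```python
# Per-frame overhead on the wire: preamble + start-of-frame (8 bytes)
# plus the inter-frame gap (12 bytes).
WIRE_OVERHEAD_BYTES = 20

def throughput_gbps(pps, bytes_per_packet):
    """Gbps carried by pps packets of the given size."""
    return pps * bytes_per_packet * 8 / 1e9

def line_rate_pps(link_bps, frame_bytes):
    """Maximum Ethernet frames per second on a link of link_bps."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)

def cycles_per_packet(cpu_hz, pps):
    """CPU cycles available per packet if one core must keep up with pps."""
    return cpu_hz / pps

if __name__ == "__main__":
    print(f"{throughput_gbps(1.4e6, 1400):.2f} Gbps")    # 15.68 Gbps
    rate = line_rate_pps(10e9, 64)                        # minimum-size frames on 10G
    print(f"{rate / 1e6:.2f} Mpps")                       # 14.88 Mpps
    print(f"{cycles_per_packet(3e9, rate):.0f} cycles")   # 202 cycles on an assumed 3 GHz core
```

The two regimes are very different: large packets reach 10G at a modest packet rate, while minimum-size packets leave only a couple of hundred cycles per packet.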

X-Istence · on Sept 7, 2015

I work at an ISP, we are starting to connect stuff at 40G (times 2 for redundancy) because we want a single machine to do more work.

Improving the kernel would greatly speed up many of our applications with no real downsides.

With 40G to the edge, we also need to be able to properly firewall/filter/and all that fun stuff the traffic!

lukego · on Sept 7, 2015

People are also spending hundreds of billions of dollars on equipment like routers every year. These could be Linux boxes if the kernel had sensible performance.

I am kind of amazed that the Linux kernel did not become the dominant data-plane for the networking industry ahead of proprietary implementations from Cisco, Juniper, etc.

Hopefully Snabb Switch will have better luck there... ;-)

asdfaoeu · on Sept 7, 2015

ASICs are always going to be way faster than a Linux kernel.

__d · on Sept 7, 2015

Most switch/router dataplane processing is done in hardware, with control by custom drivers in the control-plane OS.

Cisco IOS is variously hosted on a proprietary real-time OS, QNX or Linux. Juniper JunOS is FreeBSD-based. Arista EOS is Linux-based.

With the advent of SDN, the packet-rate limitations of a general-purpose OS led to things like DPDK and Snabb, but they both run within a Linux host environment (DPDK can use FreeBSD as well; not sure re: Snabb).

zurn · on Sept 7, 2015

When 100M ethernet was introduced in the 90s, PCs had Pentiums, at < 100 M instructions per second. Both x86 CPUs and Ethernet have gotten ~1000x faster in the same time - cf 100G Ethernet and processors with 30x frequency, ~2x more IPC, and 20x more cores.

tedunangst · on Sept 7, 2015

You haven't accounted for memory bandwidth and latency. And counting increased core count means you can't actually assume the same old single threaded algorithm will scale linearly.

zurn · on Sept 7, 2015

Those are real-world speed bumps that make sense when you want to do stateful processing w/ large working sets, but not really for this "just get packets through the stack" toy benchmark that fits in cache and the workload is parallel-friendly.

And also the article is talking about handling just 10G, not 100G!

gonzo · on Sept 7, 2015

> One thing to remember is that high-speed packet filtering is an unusual workflow and CloudFlare operates at a much greater scale than most of us see: most Linux devices are not connected to 10G

You're about to see a whole bunch more 10G in the coming 5 years.

scurvy · on Sept 7, 2015

Sorry if I sound mean, but this is just a long apologist post about how things are just so hard. Really? Why? Why can't Linux match BSD's performance?

Also, 10gb servers are not rare by any means. Take a look around the next time you walk in a colo. 10 gb servers everywhere.

acdha · on Sept 7, 2015

> Sorry if I sound mean, but this is just a long apologist post about how things are just so hard. Really? Why? Why can't Linux match BSD's performance?

Mostly I just wish you'd read it again: you appear to have missed the part where I said that this is a real problem which needed working on.

The point I was making is that it's not a problem for most Linux users. Linux includes millions of devices attached to sub-100Mb networks but even if you want to look solely at things in modern data centers ask yourself how many of them are running network-limited applications or are providing services to internet users over an uplink which is actually fast enough to stress modern hardware. For all but the most demanding users the Linux vs. BSD decision will be made on other factors.

scurvy · on Sept 7, 2015

Those millions of sub-100Mb network devices aren't pushing Linux forward though. Sure, they're using Linux, but they're not the ones pushing it forward. It's colo and datacenter use cases that are. A basic MySQL OLTP load with SSDs and a modicum of compute power will easily saturate a 10gig NIC. It does start to die shortly after that due to network/kernel stuff. I'd really like to get a lot more out of my existing hardware if I could.

acdha · on Sept 7, 2015

You mean “pushing it forward in this particular direction”. Again, it's great to have people working on this and better if the companies using Linux support developers working on the problem but it's only a deal-breaker for a much smaller number of people.

Just to use an example which you see a lot today: how many of the developers jumping on Docker care that much about network performance at this level? I would argue that continued development of the container system has done more to boost Linux usage than low-level networking performance, even though both are entirely legitimate and worthwhile concerns worth developer sponsorship. (ZFS has pulled in the opposite direction for people who care about storage)

From the other direction, imagine if *BSD had gotten serious about package management by the mid-to-late 90s when it was obvious how much better the experience was on Debian so that a generation of developers wasn't trained to favor Linux to avoid getting sucked into dependency management. That doesn't have anything to do with the kernel but it mattered more for many, many people. This would have been really interesting if kfreebsd had hit critical mass and made the cost of switching that much lower.

betaby · on Sept 7, 2015

> Why can't Linux match BSD's performance?

You do sound both mean and factless. A meaningful apples-to-apples BSD vs. Linux comparison is needed.

scurvy · on Sept 7, 2015

LMGTFY: http://bsdrp.net/documentation/technical_docs/performance

BSD numbers > Linux numbers.

betaby · on Sept 7, 2015

Except there is no Linux presence on any graph on this link.

scurvy · on Sept 8, 2015

The data are in the original post. Are you just trolling comments or have you actually read the Cloudflare articles?

tedunangst · on Sept 7, 2015

I'm not certain if "apologist post" refers to the article or the comment you replied to, but neither mentions BSD?

justincormack · on Sept 7, 2015

Well the BSDs have issues too. FreeBSD/Netflix has been doing in kernel SSL to try to fix some issues, and there is Netmap available for packet processing in userspace. So they are running into similar issues as Linux.

gonzo · on Sept 7, 2015

Netflix is doing in-kernel SSL so they don't have to hit the web server to encrypt the disk blocks for their streaming movies.

That doesn't mean it's a (Free)BSD issue, just that Netflix chose a particular architecture and are doing the work to make it go fast.

Netmap is in FreeBSD, and available for linux. I think it might be in Dragonfly at this point as well.

jeffreyrogers · on Sept 7, 2015

Doesn't look like anyone has really answered your question yet.

There are three main reasons the kernel is slow for networking: per-packet dynamic memory allocation, lots of memory copying, and system call overheads.

The first two can be improved by modifying the kernel, and I think people are attempting to do this. The system call overheads arise naturally from having the networking code in the kernel. Basically every time you perform a system call the kernel has to save the userspace context, do the system call, and restore the context. This takes time and is bad for cache locality.
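
The system call part of this can be illustrated with a simple cost model. The numbers below are invented for the sketch (they are not measurements), and `packets_per_second` is a hypothetical helper; the point is only that a fixed per-call cost shrinks per packet when one call delivers a batch.

```python
def packets_per_second(batch_size, syscall_ns, per_packet_ns):
    """Single-core throughput when each system call delivers batch_size packets.

    syscall_ns is the fixed cost of entering and leaving the kernel
    (including lost cache locality); per_packet_ns is the remaining work
    done for every packet. Both are assumed inputs, not measured values.
    """
    cost_ns = syscall_ns / batch_size + per_packet_ns
    return 1e9 / cost_ns

if __name__ == "__main__":
    SYSCALL_NS = 500      # assumption for illustration only
    PER_PACKET_NS = 200   # assumption for illustration only
    for batch in (1, 8, 64):
        mpps = packets_per_second(batch, SYSCALL_NS, PER_PACKET_NS) / 1e6
        print(f"batch={batch:3d}: {mpps:.2f} Mpps")
```

Batching only removes the fixed cost; once the batch is large, the per-packet work (allocation and copying, in the list above) dominates.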

But as others noted, for most people who aren't cloudflare this doesn't really matter.

acconsta · on Sept 7, 2015

>But as others noted, for most people who aren't cloudflare this doesn't really matter.

Aren't most web applications I/O bound? The Arrakis team sped up Memcached and Haproxy quite a lot by bypassing the kernel. It seems like there could be a large market for these techniques as they become easier to use.

http://people.inf.ethz.ch/troscoe/pubs/peter-arrakis-osdi14....

jeffreyrogers · on Sept 7, 2015

Hmmm, that's a good point. I was thinking more of a typical webapp, but you're probably right that there are certain classes of applications (e.g. caching) that are I/O bound under load.

s1m0n · on Sept 8, 2015

Reason #4: Another problem is that the network kernel was never designed to do internet on the mass scale desired today. Companies like whatsapp devoted lots of time to getting e.g. 2M concurrent TCP connections (considered good) running on a single box, mainly because of the greedy overhead and design of the legacy network kernel. Whereas, in theory it should be possible to have 10M or more concurrent TCP connections on modern average hardware. So from this POV then the legacy network kernel is the bloated memory greedy mess that Java is to software development. See http://c10m.robertgraham.com/p/manifesto.html

lukego · on Sept 7, 2015

Smart people are working on this: https://lwn.net/Articles/629155/

However, even after a few years now I don't think they are seeing light at the end of the tunnel.

zurn · on Sept 7, 2015

Note they cite higher numbers as the current status than the CloudFlare article. In the LWN article it says "The kernel, today, can only forward something between 1M and 2M packets per core every second"

Also eg. http://vger.kernel.org/netconf2009_slides/LinuxCon2009_Jespe... from 6 years ago show forwarding at 4 Mpps.

(Forwarding is, of course, receiving AND sending instead of just receiving so these should translate to higher RX-only numbers.)

zurn · on Sept 7, 2015

It isn't as slow as they claim, just like was discussed in HN comments to the predecessor article ("How to receive a million packets per second"). Still they repeat the general claim of "Vanilla Linux can do only about 1M pps". Makes for better headlines I guess.

zobzu · on Sept 7, 2015

tldr: cloudflare like to reimplement things and claim it solves the world's problems. That's cool, that's how open source works. Fork, copy, reimplement, try new stuff.

that-said...slightly-longer: for what they do there are alternatives like ipset; for other things it's not as clear-cut, hence things like PF_RING. it's not that great though, you're sacrificing all features for fast sniffing.

technically a good zero-copy implementation of packet mmap w/ a userspace ring would achieve +- the same thing, too.

acconsta · on Sept 7, 2015

I feel like there's a disconnect between the HPC community, which has publicly deployed these techniques for years, and the broader tech community. Even some enterprise hardware uses InfiniBand (with kernel bypass) these days.

Yet you never hear about Google or AWS using kernel bypass in their load balancers, for example (possibly a trade secret, possibly the result of Linux monoculture).

lukego · on Sept 7, 2015

I reckon that networking is transitioning from being a system programming problem (interrupt - switch to kernel - grab packet - process quickly - switch back) to being a HPC problem (infinite stream of packets arriving in memory).

ISPs are where I expect to see the disruption of HPC-oriented x86 servers being supremely capable of handling work previously done by specialized hardware.

revelation · on Sept 7, 2015

At this point, they could presumably just use FPGAs. There are plenty of dev boards with 10GbE+ interfaces precisely because FPGAs are such a good fit for this kind of processing.

majke · on Sept 7, 2015

Please factor in development and support cost. There is a reason why people prefer general purpose computers to dedicated hardware. I think only big players like Google can afford dedicated hardware teams.

__d · on Sept 7, 2015

Not to disagree about the need to factor in the dev cost, but there's plenty of companies smaller than Google doing TCP stacks on FPGAs.

Pretty much all high-frequency trading today uses FPGAs, for instance, and that's often with teams of fewer than ten people.

majke · on Sept 7, 2015

This is very exciting. Can you give some examples?

corysama · on Sept 8, 2015

http://www.argondesign.com/case-studies/2013/sep/18/high-per...

https://www.youtube.com/watch?v=uDy_8Q0GdTk

IIRC, the FPGA has been incorporated into a switch. When a market data packet starts to arrive, the system starts sending a response packet before the input packet has completely arrived and before the system has actually made a decision. While the input packet is read, the system decides whether or not it will cancel the response market order by intentionally corrupting the checksum of the output packet at the last possible instant.

stzup7 · on Sept 7, 2015

I've been following these posts for a while and it looks to me they've decided to go with Solarflare but haven't really explained why. I would be interested to see a fair comparison between Solarflare & openonload and their competitors such as Dolphin & super sockets, Mellanox & VMA, Chelsio & rdma. Also, if they're looking at pure PPS stats a good FPGA with a built-in hardware TCP/IP stack could be a very powerful filter.

masklinn · on Sept 7, 2015

Have they? They just noted that Solarflare's proprietary library EF_VI has an interesting approach to kernel bypass, then indicate that you can replicate that approach on other NICs and show how.

stzup7 · on Sept 7, 2015

Right. The fact that it's the only proprietary platform tested biased my mind too quickly :)

__d · on Sept 7, 2015

It's worth noting that OpenOnload is GPLv2.

The issue is that Solarflare has patents on some of the underlying techniques. They were very open to licensing those patents for a reasonable figure when I last talked with them about it, but that wouldn't work for a general-purpose OSS project.

scurvy · on Sept 7, 2015

We actually use the offloading Myricom NICs in our deployment. We've found them to work very well. Solarflare makes nice stuff, too, but Myricom is easier to get started with.

__d · on Sept 7, 2015

Was this Myricom before they went bust and got taken over? Or more recent? It seemed like most of the momentum (such as it was) that DBL had was lost in the mess.

scurvy · on Sept 8, 2015

After.

omgtehlion · on Sept 7, 2015

Of all those vendors with bypass stacks, only Solarflare and Mellanox are really friendly to start with. Others require you to register, request a quote, or pay an arm and a leg just to get started. On the other hand, you could easily pick up an SF or mlnx card at almost any shop, or get a used one from eBay for a hundred bucks, then download the software from the internet and start playing with the stuff right away. Just 'onload ./your-binary', that's it.

wslh · on Sept 7, 2015

(I wasn't expecting so many downvotes for this question)

I am curious about packet filtering in Windows. Anyone with experience in HN?

Now, in my company, we are doing some tests using different methods: WinPcap, WFP, NDIS, and WinPcap is the winner in a VM but we will start to test with real 10gbps ethernet cards next week.

UnoriginalGuy · on Sept 7, 2015

I legitimately don't see why you would.

This isn't one of these "Windows sucks!" posts. Windows is a very good endpoint for many services (DNS, DHCP, AD, IIS/ASP.net, VPNs, etc) and of course an extremely popular client.

But be that as it may, packet storms should never be allowed to hit endpoints; that's the entire purpose of packet filtering. So you'll want to be taking them out on front-line appliances, and appliances based on the Windows NT kernel simply don't exist.

So this is why CloudFlare cares about this, they're utilising Linux on an appliance in front of their endpoints to try and drop as many "bad" packets as they can detect. Both Linux and various BSD variants are used commonly on networking equipment, so trying to optimise them seems to make a lot of sense.

Windows on the other hand? If you're trying to do packet filtering on the endpoint itself then you're fighting a losing battle. For example, they're low-level hooking network traffic, and while that works wonderfully for filtering, it is a terrible idea if the machine is used for other things as it can disrupt normal legitimate machine traffic.

wslh · on Sept 7, 2015

My use case is different. It is more about packet capturing on the endpoint than packet filtering. In my use case the filtering refers to filtering out uninteresting data.

Also, this is oriented to internal network endpoints not visible from Internet, so I don't expect to receive massive network-intensive attacks.

MichaelGG · on Sept 7, 2015

So long as you create a filter that limits the data you're capturing you should be fine. This is assuming you can write a filter that gets your captured traffic down to something manageable. It's possible that most endpoint devices will not have spare capacity to add 10G capturing to their workload.

drewg123 · on Sept 7, 2015

Check out the Myricom Sniffer 10G, it supports Windows, Linux & FreeBSD https://www.myricom.com/software/sniffer10g.html

It is completely OS-bypass, and should be good to handle full line rate (14.8Mpps per port).

Noxwizard · on Sept 7, 2015

For my Masters' work, I needed high speed tx/rx on Windows and looked into the same things you did. I can't find the statistics for the tests I ran, but WinPcap's speeds weren't much better than Winsock's, which was fairly poor. The solution I used was an NDIS kernel filter and protocol driver which pushed the packets into user-space memory. Luigi Rizzo has recently added a Windows port of netmap to his repository, so you might want to look into that: https://github.com/luigirizzo/netmap

trentnelson · on Sept 9, 2015

Did you look into registered I/O?

https://technet.microsoft.com/en-us/library/Hh997032.aspx

wslh · on Sept 7, 2015

Thanks for the netmap reference, we will take a look.

In our tests WinPcap was faster than an NDIS driver, so it will be interesting to compare.

MichaelGG · on Sept 7, 2015

I faced this. I wrote a network search engine and used F#. For ease I deployed on Windows. Winpcap is fine, but you don't have a lot of space to easily improve. Looking at the features the Intel NICs had and how easy it was to use them on Linux... Why would I ever want to try to optimize it on Windows?

That said I think the Wireshark guys (linked from the Wireshark site anyways) might have some answers. I know for WiFi capture they had fully functional Windows devices.

wslh · on Sept 7, 2015

I replied to UnoriginalGuy about my use case.

karthick18 · on Sept 8, 2015

Good post.

And I have performance numbers with OVS-dpdk that make kernel bypass compelling, since it's off the charts compared with the kernel datapath.

For those interested, my ovs-dpdk experiments, which also include patches, a README etc. so that others can run them themselves, can be found here:

https://www.dropbox.com/sh/nfe70cgksmy543k/AABD_0qsQ15e2GItX...

The perf results directory has the ovs-dpdk perf for all the use-cases in the dataplane performance pdf that you might be interested in.

Has use-cases covering up to 11VM or 11 containers with 11 IP flows to measure dataplane performance.

Also you really need a server with 1 gig hugetlb support (and also enable that for guest) to extract maximum performance.

Expected I guess ...

s1m0n · on Sept 7, 2015

Wouldn't it be more efficient to just have all or nearly all packets bypass the network kernel? Why compromise?

mavam · on Sept 7, 2015

That's exactly the idea behind packet bricks (see other comment): you have one single tool that takes all the packets directly from the NIC (say eth0) and then exposes them according to your bricks configuration to a bunch of other interfaces (eth0}0, eth0}1, etc.). Very similar to Click. This layer of indirection obviates the need for shared NIC access, which is what CloudFlare works around in a more cumbersome way.

s1m0n · on Sept 8, 2015

Looks interesting. I'll take a look. You might also be interested in mTCP (https://github.com/eunyoung14/mtcp), or possibly adding mTCP functionality to packet bricks?

mavam · on Sept 8, 2015

Excellent follow up, because the main developer of packet bricks is also a co-author of mTCP :-).

amelius · on Sept 7, 2015

> I do hope an open source kernel bypass API will emerge soon

Or how about a faster kernel? :)

acconsta · on Sept 7, 2015

The cost of kernel I/O isn't just the direct cost of time spent in the kernel. Even a really, really fast kernel will pollute the CPU caches and TLB.

inversionOf · on Sept 7, 2015

Given the overhead of context switches, is it possible to take a general purpose application like nginx and use a user-mode TCP stack? For instance if I had a network adapter that is solely dedicated to nginx, and don't need any of the kernel TCP services. Is this even a viable consideration?

I've done high performance nginx, in the million request per second range (there are situations that benefit from these, though unfortunately such discussions always get waylaid by people insisting that performance doesn't matter), but there is enormous system overhead at this rate that I'd like to get around.

justincormack · on Sept 7, 2015

You can run Nginx with a rump kernel (rumpkernel.org), so with a completely userspace TCP stack. I haven't yet done any work on optimising it for 10Gb networking; it is on my TODO list (there are Snabb, Netmap and DPDK drivers, although I might just get it to drive the NIC directly). (Current tests are just with a tap device or raw socket, which is very slow.)

jsnell · on Sept 7, 2015

It's possible, you'd need to override the relevant system calls with an LD_PRELOADed library.

I don't know if a complete drop-in solution is the right solution though. If your application is performance sensitive enough to require embedding a full networking stack, you might as well make use of better APIs. For example it'd be silly to indirect the event dispatching through something poll/select-like. Instead you'd much rather just have the core IO loop call the handlers directly. Or as another example, zero-copy will be impossible with a recv()-like interface where the client provides the buffer that data needs to go to, but will be trivial with an API where it's the network stack giving the client a buffer that already has the data.
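
The buffer-ownership point can be made concrete with a toy comparison. Both classes below are hypothetical stand-ins written for this sketch (they are not mTCP's or any real stack's API): one mimics a recv()-style call where the caller supplies the buffer, the other lends out a view of the stack's own buffer.

```python
class CopyingStack:
    """recv()-style interface: the caller owns the buffer, so data is copied in."""

    def __init__(self, packets):
        self._packets = list(packets)

    def recv_into(self, buf):
        pkt = self._packets.pop(0)
        n = min(len(pkt), len(buf))
        buf[:n] = pkt[:n]                  # the copy this interface cannot avoid
        return n

class LendingStack:
    """Stack-owned buffers: the caller borrows a read-only view, nothing is copied."""

    def __init__(self, packets):
        self._ring = [bytearray(p) for p in packets]   # stands in for an RX ring
        self._head = 0

    def next_packet(self):
        slot = self._ring[self._head]
        self._head += 1
        return memoryview(slot).toreadonly()           # same memory, no copy

if __name__ == "__main__":
    buf = bytearray(2048)
    n = CopyingStack([b"GET / HTTP/1.1"]).recv_into(buf)
    print(bytes(buf[:n]))

    view = LendingStack([b"GET / HTTP/1.1"]).next_packet()
    print(bytes(view[:3]))                 # the handler reads straight from the ring slot
```

The price of the lending style is that the application must tell the stack when it is done with a buffer, which is one reason it cannot be retrofitted behind the plain socket calls.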

If you want to experiment with this, mTCP (http://shader.kaist.edu/mtcp/) is probably the right starting point.

blibble · on Sept 7, 2015

this is exactly how openonload works, it's pretty impressive that they get nearly all weird behaviour of the Linux socket API correct (correct behaviour across fork, select/poll/epoll, multicast behaviour, etc).

presentation: http://www.openonload.org/openonload-google-talk.pdf

acconsta · on Sept 7, 2015

The Arrakis team did it with HAProxy, Redis, and Memcached, but in a research operating system:

http://people.inf.ethz.ch/troscoe/pubs/peter-arrakis-osdi14....

BSD has had userland networking for a while:

http://www.bsdcan.org/2014/schedule/events/447.en.html

I'm not sure if there's anything comparable for Linux.