Packet BRICKS is a Linux/FreeBSD daemon that is capable of receiving and distributing ingress traffic to userland applications. Its main responsibilities may include (i) load-balancing, (ii) duplicating and/or (iii) filtering ingress traffic across all registered applications. The distribution is flow-aware (i.e. packets of one connection will always end up in the same application). At the moment, packet-bricks uses netmap packet I/O framework for receiving packets. It employs netmap pipes to forward packets to end host applications.
(Credit goes to Asim Jamshed, who pulled this off as part of an internship at ICSI.)
So is the advantage that it's more performant than iptables+virtual interfaces? If I use iptables to distribute traffic over virtual interfaces do IP headers get parsed twice by the kernel in some inefficient way?
Iptables sits in the kernel and is also not available on non-Linux platforms like FreeBSD. With packet bricks you bypass the kernel and expose "virtual" interfaces to your applications by means of a simple configuration. Here's an example from the README:
This binds pkteng pe with LoadBalancer brick and asks the system to read ingress packets from eth3 and split them flow-wise based on the 2-tuple (src & dst IP addresses) metadata of the packet header. The "lb:connect_output(...)" command creates four netmap-specific pipes named "netmap:eth3{x" where 0 <= x < 4 and an egress interface named "eth2". The traffic is evenly split between all five channels based on the 2 tuple header as previously mentioned. Userland applications can now use packet-bricks to get their fair share of ingress traffic. The brick is finally linked with the packet engine.
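To make the "split them flow-wise based on the 2-tuple" part concrete, here is a minimal sketch of how a flow-aware splitter could pick an output channel. The hash function (CRC32) and the symmetric ordering of the two addresses are assumptions for illustration; the README excerpt does not say which hash packet-bricks actually uses. The channel names are the ones mentioned above.

```python
import ipaddress
import zlib

# The five outputs described above: four netmap pipes plus eth2.
CHANNELS = ["netmap:eth3{0", "netmap:eth3{1", "netmap:eth3{2",
            "netmap:eth3{3", "eth2"]


def channel_for(src_ip: str, dst_ip: str, n_channels: int) -> int:
    """Map a (src IP, dst IP) 2-tuple to a channel index.

    Hypothetical helper: the same pair of addresses always yields the same
    index, so every packet of a connection lands in the same application.
    """
    a = int(ipaddress.ip_address(src_ip))
    b = int(ipaddress.ip_address(dst_ip))
    # Assumption: order the addresses so both directions hash identically.
    lo, hi = sorted((a, b))
    key = lo.to_bytes(16, "big") + hi.to_bytes(16, "big")
    return zlib.crc32(key) % n_channels


if __name__ == "__main__":
    idx = channel_for("192.0.2.10", "198.51.100.7", len(CHANNELS))
    print(CHANNELS[idx])
```

Because only the addresses go into the hash, two connections between the same pair of hosts always share a channel; a 4-tuple hash that also includes ports would spread them out.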
Incredibly interesting project. Are you aware of any experiences using this with nginx or HAProxy on the load balancing side and the impact it may have had?
In this essay the author forgot to mention PFQ, that at the time of writing represents a performant and innovative approach to packet capture and in-kernel functional processing of packets (running on-top-of vanilla drivers). The software is available at www.pfq.io.
University of Pisa, right? Authors of the Netmap and PF_RING also from that university, if I am not mistaken? You guys seem to have a mutual interest towards wire-rate packet processing. Keep it up...
Thinking of the future: Recent experience suggests that we are able to do around 50-100 Mpps of traffic dispatching in Snabb Switch using one CPU core. I suspect that dedicated software traffic dispatchers will displace hardware (RSS, VMDq, etc) in the immediate future.
We are planning to explore this soon in the context of software dispatching for 100Gbps ethernet ports.
So would you recommend giving Snabb the NIC, letting it drop the DDoS traffic, and then injecting what's left into the kernel stack via a tap interface or something?
That is one solution: take a 10G port into Snabb Switch, filter and sort the traffic, then feed a slice to the kernel e.g. via a tap device with multiqueue /dev/vhost-net acceleration (same interface that QEMU/KVM uses).
The risk I see here is that maybe it is even harder to tune your kernel when it is using a software interface for I/O instead of a hardware one. The kernel can depend on so many things (multiqueue, TSO, LRO, checksum, encap offload, etc) that can behave differently between hardware/software NICs and you would need to be confident that this will work out well. Otherwise the risk is that you take your hardest problem - tuning the kernel - and make it even harder.
If we are only talking about 2x10G ports per server then one alternative would be to connect Snabb Switch to the kernel with a physical 10G port instead of a software one. That is, separate the DDoS-protecting frontend (Snabb) from the backend (kernel) with a network cable. You could still run the Snabb application on the same server but with a dedicated network card cabled directly to the kernel. (You could also run it on a different server if you prefer.)
Don't use a tap interface, but Snabb does have a fast virtio driver for kvm, that maybe you could adapt (or use directly if you dont mind the userspace parts being in another vm).
Or just do everything in userspace. Reinjecting into the kernel is not strictly necessary, although obviously you may be constrained by existing code.
The article appears to be inaccurate. Why? AFAIK it's possible to make use of netmap on a box with a single NIC. I tried this out for myself about 2 years ago on a VMware virtual machine. How does it work? A userland packet filter can "forward" certain packets on to the kernel -- e.g. ssh packets in my case -- while others stick around in shared memory for kernel bypass. This means that I can ssh to the box and the ssh packets flow via the kernel, while the rest of the packets bypass the kernel, but all packets flow over the same NIC. Nice :-)
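The decision such a userland filter makes per packet can be sketched as below. This only shows the classification step ("is this ssh, so hand it to the kernel?"); actually moving a packet between the NIC ring and the host ring is done through the netmap C API and is not shown. The function name and the choice to ignore VLAN tags, IPv6 and IP fragments are assumptions made to keep the sketch short.

```python
import struct

ETH_HDR_LEN = 14
ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6
SSH_PORT = 22


def goes_to_kernel(frame: bytes) -> bool:
    """Return True if an Ethernet frame carries IPv4/TCP to or from port 22.

    Hypothetical helper: everything else is treated as bypass traffic.
    """
    if len(frame) < ETH_HDR_LEN + 20:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return False
    # IHL is the low nibble of the first IP byte, counted in 32-bit words.
    ihl = (frame[ETH_HDR_LEN] & 0x0F) * 4
    if ihl < 20 or frame[ETH_HDR_LEN + 9] != IPPROTO_TCP:
        return False
    l4 = ETH_HDR_LEN + ihl
    if len(frame) < l4 + 4:
        return False
    # TCP source and destination ports are the first four bytes of the header.
    sport, dport = struct.unpack_from("!HH", frame, l4)
    return SSH_PORT in (sport, dport)
```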
Well, sure. You can use netmap and use a "host ring" to inject the packets back to the kernel. While good for toy app, this won't work in real workloads. For example host ring doesn't support multiple RX queues. The article goes into details how to work around it and just leave most of the network flows to be dealt with by the kernel, while opting-in to the kernel bypass for only selected flows.
If you are only interested in pushing infrequently used ssh packets into the kernel for e.g. low bandwidth health monitoring -- while all other packets bypass the kernel -- then why would this be considered a "toy app"? Surely it's a useful technique because it allows netmap to be used on very many cheap dedicated hosts for rent where only one NIC is available and you have no control over the hardware, or?
This is the inaccurate sentence: "Snabbswitch, DPDK and netmap take over the whole network card, not allowing any traffic on that NIC to reach the kernel." Obviously with netmap traffic to the NIC may reach the kernel...
There are many ways to inject packets back to kernel. Tuntap, raw socket on loopback, "dummy" device, etc. So by this count you can always make packets reach the kernel.
There are two problems with doing the "take over the nic" techniques:
1) I don't believe you can actually push, say, 2M pps back to the kernel with any of these techniques. There is a reason RSS exists, and even if you can process 10M pps on one CPU, it doesn't mean it's easy to insert them back into the kernel.
2) I don't think putting a piece of custom code between CloudFlare kernel and network card is feasible on the architectural level. You really want to stand in the way and have to actively forward all these packets?
The title of the article does not mention CloudFlare; only bypassing. The fact that the CloudFlare architecture pushes a higher bandwidth of packets into the network kernel and bypasses the rest does not make it a good technique or to be recommended. If you are primarily interested in the best performance with a single NIC solution then I believe it is suboptimal. Why? You are asking the CPU to do two different types of work; optimized and unoptimized. Because of cache line pollution, the "unoptimized" work via the network kernel will pollute the other work. I may be wrong but I would bet you'd get better performance by separating your CloudFlare specific workload onto two boxes, each with one NIC. In this scenario no cache line pollution can occur. Of course, these two boxes might not be easily possible within the existing CloudFlare architecture. But this has nothing to do with the general idea of packets bypassing the kernel. After the bypass you want the CPU to process those packets in the most efficient way...
It doesn't mention in the article why the kernel is so slow at processing network packets. I'm not a kernel programmer so this may be utterly wrong, but wouldn't it be possible to sacrifice some feature for speed by disabling it in the kernel code?
One thing to remember is that high-speed packet filtering is an unusual workflow and CloudFlare operates at a much greater scale than most of us see: most Linux devices are not connected to 10G, much less 100G, networks and they're usually doing more work than looking at a packet to decide whether to accept or reject it. The fact that the APIs and the kernel stack were designed many years before those kind of speeds were possible doesn't matter because most sites don't have that much traffic and most server applications will bottleneck at doing other work well before that point.
The example in the article found a single core handling 1.4M packets per second. If you're running a web-server shoveling data out to clients those packets are going to be close to the maximum size which, if I haven't screwed up the math, looks something like this:
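A rough version of that math, assuming 1500-byte packets (an assumed typical Ethernet MTU, not a figure from the article) and ignoring framing overhead, comes out to about 16.8 Gbit/s for a single core:

```python
PACKETS_PER_SECOND = 1_400_000  # single-core figure quoted from the article
PACKET_BYTES = 1500             # assumption: near-MTU packets


def throughput_gbps(pps: int, packet_bytes: int) -> float:
    """Convert a packet rate and packet size into gigabits per second."""
    return pps * packet_bytes * 8 / 1e9


if __name__ == "__main__":
    print(throughput_gbps(PACKETS_PER_SECOND, PACKET_BYTES))
```

In other words, at full-size packets one core at that packet rate would already exceed a 10G link.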
That's not to say that there isn't still plenty of room to improve and, as lukego noted, there's a lot of work in progress (see e.g. https://lwn.net/Articles/615238/ on work to batch operations to avoid paying some of the processing costs for every packet) but for the average server you'd find bottlenecks on something like a database, application logic, request handling, client network capacity, etc. before the network stack overhead is your greatest challenge. The people who encounter this tend to be CDN vendors like CloudFlare and security people who need to filter, analyze, or generate traffic on levels which are at least the scale of a large company (e.g. https://github.com/robertdavidgraham/masscan).
People are also spending hundreds of billions of dollars on equipment like routers every year. These could be Linux boxes if the kernel had sensible performance.
I am kind of amazed that the Linux kernel did not become the dominant data-plane for the networking industry ahead of proprietary implementations from Cisco, Juniper, etc.
Hopefully Snabb Switch will have better luck there... ;-)
Most switch/router dataplane processing is done in hardware, with control by custom drivers in the control-plane OS.
Cisco IOS is variously hosted on a proprietary real-time OS, QNX or Linux.
Juniper JunOS is FreeBSD-based.
Arista EOS is Linux-based.
With the advent of SDN, the packet-rate limitations of a general-purpose OS lead to things like DPDK and Snabb, but they both run within a Linux host environment (DPDK can use FreeBSD as well; not sure re: Snabb).
When 100M ethernet was introduced in the 90s, PCs had Pentiums, at < 100 M instructions per second. Both x86 CPUs and Ethernet have gotten ~1000x faster in the same time - cf 100G Ethernet and processors with 30x frequency, ~2x more IPC, and 20x more cores.
You haven't accounted for memory bandwidth and latency. And counting increased core count means you can't actually assume the same old single threaded algorithm will scale linearly.
Those are real-world speed bumps that make sense when you want to do stateful processing w/ large working sets, but not really for this "just get packets through the stack" toy benchmark that fits in cache and the workload is parallel-friendly.
And also the article is talking about handling just 10G, not 100G!
> One thing to remember is that high-speed packet filtering is an unusual workflow and CloudFlare operates at a much greater scale than most of us see: most Linux devices are not connected to 10G
You're about to see a whole bunch more 10G in the coming 5 years.
> Sorry if I sound mean, but this is just a long apologist post about how things are just so hard. Really? Why? Why can't Linux match BSD's performance?
Mostly I just wish you'd read it again: you appear to have missed the part where I said that this is a real problem which needed working on.
The point I was making is that it's not a problem for most Linux users. Linux includes millions of devices attached to sub-100Mb networks but even if you want to look solely at things in modern data centers ask yourself how many of them are running network-limited applications or are providing services to internet users over an uplink which is actually fast enough to stress modern hardware. For all but the most demanding users the Linux vs. BSD decision will be made on other factors.
Those millions of sub-100Mb network devices aren't pushing Linux forward though. Sure, they're using Linux, but they're not the ones pushing it forward. It's colo and datacenter use cases that are. A basic MySQL OLTP load with SSD's and a modicum of compute power will easily saturate a 10gig NIC. It does start to die shortly after that due to network/kernel stuff. I'd really like to get a lot more out of my existing hardware if i could.
You mean “pushing it forward in this particular direction”. Again, it's great to have people working on this and better if the companies using Linux support developers working on the problem but it's only a deal-breaker for a much smaller number of people.
Just to use an example which you see a lot today: how many of the developers jumping on Docker care that much about network performance at this level? I would argue that continued development of the container system has done more to boost Linux usage than low-level networking performance, even though both are entirely legitimate and worthwhile concerns worth developer sponsorship. (ZFS has pulled in the opposite direction for people who care about storage)
From the other direction, imagine if *BSD had gotten serious about package management by the mid-to-late 90s when it was obvious how much better the experience was on Debian so that a generation of developers wasn't trained to favor Linux to avoid getting sucked into dependency management. That doesn't have anything to do with the kernel but it mattered more for many, many people. This would have been really interesting if kfreebsd had hit critical mass and made the cost of switching that much lower.
Well the BSDs have issues too. FreeBSD/Netflix has been doing in kernel SSL to try to fix some issues, and there is Netmap available for packet processing in userspace. So they are running into similar issues as Linux.
Doesn't look like anyone has really answered your question yet.
There are three main reasons the kernel is slow for networking: per-packet dynamic memory allocation, lots of memory copying, and system call overheads.
The first two can be improved by modifying the kernel, and I think people are attempting to do this. The system call overheads arise naturally from having the networking code in the kernel. Basically every time you perform a system call the kernel has to save the userspace context, do the system call, and restore the context. This takes time and is bad for cache locality.
But as others noted, for most people who aren't cloudflare this doesn't really matter.
>But as others noted, for most people who aren't cloudflare this doesn't really matter.
Aren't most web applications I/O bound? The Arrakis team sped up Memcached and Haproxy quite a lot by bypassing the kernel. It seems like there could be a large market for these techniques as they become easier to use.
Hmmm, that's a good point. I was thinking more of a typical webapp, but you're probably right that there are certain classes of applications (e.g. caching) that are I/O bound under load.
Reason #4: Another problem is that the network kernel was never designed to do internet on the mass scale desired today. Companies like whatsapp devoted lots of time to getting e.g. 2M concurrent TCP connections (considered good) running on a single box, mainly because of the greedy overhead and design of the legacy network kernel. Whereas, in theory it should be possible to have 10M or more concurrent TCP connections on modern average hardware. So from this POV then the legacy network kernel is the bloated memory greedy mess that Java is to software development. See http://c10m.robertgraham.com/p/manifesto.html
Note they cite higher numbers as the current status than the CloudFlare article. In the LWN article it says "The kernel, today, can only forward something between 1M and 2M packets per core every second"
It isn't as slow as they claim, just like was discussed in HN comments to the predecessor article ("How to receive a million packets per second"). Still they repeat the general claim of "Vanilla Linux can do only about 1M pps". Makes for better headlines I guess.
tldr: cloudflare like to reimplement things and claim it solves the world's problems. That's cool, that's how open source works. Fork, copy, reimplement, try new stuff.
that-said...slightly-longer: for what they do there are alternatives like ipset
for other things it's not as clear-cut, hence things like PF_RING. it's not that great though, you're sacrificing all features for fast sniffing.
technically a good zero-copy implementation of packet mmap w/ a userspace ring would achieve +- the same thing, too.
I feel like there's a disconnect between the HPC community, which has publicly deployed these techniques for years, and the broader tech community. Even some enterprise hardware uses InfiniBand (with kernel bypass) these days.
Yet you never hear about Google or AWS using kernel bypass in their load balancers, for example (possibly a trade secret, possibly the result of Linux monoculture).
I reckon that networking is transitioning from being a system programming problem (interrupt - switch to kernel - grab packet - process quickly - switch back) to being a HPC problem (infinite stream of packets arriving in memory).
ISPs are where I expect to see the disruption of HPC-oriented x86 servers being supremely capable of handling work previously done by specialized hardware.
At this point, they could presumably just use FPGAs. There are plenty of dev boards with 10Gb+ interfaces precisely because FPGAs are such a good fit for this kind of processing.
Please factor in development and support cost. There is a reason why people prefer general purpose computers to dedicated hardware. I think only big players like Google can afford dedicated hardware teams.
IIRC, the FPGA has been incorporated into a switch. When a market data packet starts to arrive, the system starts sending a response packet before the input packet has completely arrived and before the system has actually made a decision. While the input packet is read, the system decides whether or not it will cancel the response market order by intentionally corrupting the checksum of the output packet at the last possible instant.
I've been following these posts for a while and it looks to me they've decided to go with Solarflare but haven't really explained why. I would be interested to see a fair comparison between Solarflare & openonload and their competitors such as Dolphin & super sockets, Mellanox & VMA, Chelsio & rdma. Also, if they're looking at pure PPS stats a good FPGA with a built-in hardware TCP/IP stack could be a very powerful filter.
Have they? They just noted that Solarflare's proprietary library EF_VI has an interesting approach to kernel bypass, then indicate that you can replicate that approach on other NICs and show how.
The issue is that Solarflare has patents on some of the underlying techniques. They were very open to licensing those patents for a reasonable figure when I last talked with them about it, but that wouldn't work for a general-purpose OSS project.
We actually use the offloading Myricom NICs in our deployment. We've found them to work very well. Solarflare makes nice stuff, too, but Myricom is easier to get started with.
Was this Myricom before they went bust and got taken over? Or more recent? It seemed like most of the momentum (such as it was) that DBL had was lost in the mess.
Of all those vendors with bypass stacks, only solarflare and mellanox are really friendly to start with. Others require you to register/request a quote/pay an arm and a leg just to get started. On the other side: you could easily pick up an SF or mlnx card at almost any shop, or get a used one from ebay for a hundred bucks, then download software from the internet and start playing with the stuff right away. Just 'onload ./your-binary', that's it.
(I wasn't expecting so many downvotes for this question)
I am curious about packet filtering in Windows. Anyone with experience in HN?
Now, in my company, we are doing some tests using different methods: WinPcap, WFP, NDIS, and WinPcap is the winner in a VM but we will start to test with real 10gbps ethernet cards next week.
This isn't one of these "Windows sucks!" posts. Windows is a very good endpoint for many services (DNS, DHCP, AD, IIS/ASP.net, VPNs, etc) and of course an extremely popular client.
But be that as it may, packet storms should never be allowed to hit endpoints; that's the entire purpose of packet filtering. So you'll want to be taking them out on front-line appliances, and appliances based on the Windows NT kernel simply don't exist.
So this is why CloudFlare cares about this, they're utilising Linux on an appliance in front of their endpoints to try and drop as many "bad" packets as they can detect. Both Linux and various BSD variants are used commonly on networking equipment, so trying to optimise them seems to make a lot of sense.
Windows on the other hand? If you're trying to do packet filtering on the endpoint itself then you're fighting a losing battle. For example, they're low-level hooking network traffic, and while that works wonderfully for filtering, it is a terrible idea if the machine is used for other things as it can disrupt normal legitimate machine traffic.
My use case is different. It is more about packet capturing on the endpoint than packet filtering. In my use case the filtering refers to filtering out uninteresting data.
Also, this is oriented to internal network endpoints not visible from Internet, so I don't expect to receive massive network-intensive attacks.
So long as you create a filter that limits the data you're capturing you should be fine. This is assuming you can write a filter that gets your captured traffic down to something manageable. It's possible that most endpoint devices will not have spare capacity to add 10G capturing to their workload.
For my Masters' work, I needed high speed tx/rx on Windows and looked into the same things you did. I can't find the statistics for the tests I ran, but WinPcap's speeds weren't much better than Winsock's, which was fairly poor. The solution I used was an NDIS kernel filter and protocol driver which pushed the packets into user-space memory. Luigi Rizzo has recently added a Windows port of netmap to his repository, so you might want to look into that: https://github.com/luigirizzo/netmap
I faced this. I wrote a network search engine and used F#. For ease I deployed on Windows. Winpcap is fine, but you don't have a lot of space to easily improve. Looking at the features the Intel NICs had and how easy it was to use them on Linux... Why would I ever want to try to optimize it on Windows?
That said I think the Wireshark guys (linked from the Wireshark site anyways) might have some answers. I know for WiFi capture they had fully functional Windows devices.
That's exactly the idea behind packet bricks (see other comment): you have one single tool that takes all the packets directly from the NIC (say eth0) and then exposes them according to your bricks configuration to a bunch of other interfaces (eth0}0, eth0}1, etc.). Very similar to Click. This layer of indirection obviates the need for shared NIC access, which is what CloudFlare works around in a more cumbersome way.
Looks interesting. I'll take a look. You might also be interested in mTCP (https://github.com/eunyoung14/mtcp), or possibly adding mTCP functionality to packet bricks?
Given the overhead of context switches, is it possible to take a general purpose application like nginx and use a user-mode TCP stack? For instance if I had a network adapter that is solely dedicated to nginx, and don't need any of the kernel TCP services. Is this even a viable consideration?
I've done high performance nginx, in the million request per second range (there are situations that benefit from these, though unfortunately such discussions always get waylaid by people insisting that performance doesn't matter), but there is enormous system overhead at this rate that I'd like to get around.
You can run Nginx with a rump kernel (rumpkernel.org), so with a completely userspace TCP stack. I haven't yet done any work on optimising it for 10Gb networking; it is on my TODO list (there are Snabb, Netmap and DPDK drivers, although I might just get it to drive the NIC directly). (Current tests are just with a tap device or raw socket, which is very slow.)
It's possible, you'd need to override the relevant system calls with LD_PRELOADed library.
I don't know if a complete drop-in solution is the right solution though. If your application is performance sensitive enough to require embedding a full networking stack, you might as well make use of better APIs. For example it'd be silly to indirect the event dispatching through something poll/select-like. Instead you'd much rather just have the core IO loop call the handlers directly. Or as another example, zero-copy will be impossible with a recv()-like interface where the client provides the buffer that data needs to go to, but will be trivial with an API where it's the network stack giving the client a buffer that already has the data.
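The two API points above (the IO loop calling handlers directly, and the stack handing out a buffer that already holds the data) can be illustrated with a toy receive ring. This is a sketch under stated assumptions, not how any real userspace stack is implemented: `RxRing`, `inject` and `dispatch` are hypothetical names, and `inject` merely stands in for a NIC writing into preallocated memory.

```python
from typing import Callable


class RxRing:
    """Toy receive ring: all packets live in one preallocated buffer."""

    def __init__(self, slots: int, slot_size: int) -> None:
        self._buf = bytearray(slots * slot_size)
        self._slots = slots
        self._slot_size = slot_size
        self._lengths = [0] * slots
        self._head = 0  # next slot to fill
        self._tail = 0  # next slot to deliver

    def inject(self, payload: bytes) -> None:
        """Stand-in for the NIC writing a packet into the next free slot."""
        if self._head - self._tail == self._slots:
            raise BufferError("ring full")
        if len(payload) > self._slot_size:
            raise ValueError("payload larger than slot")
        i = self._head % self._slots
        start = i * self._slot_size
        self._buf[start:start + len(payload)] = payload
        self._lengths[i] = len(payload)
        self._head += 1

    def dispatch(self, handler: Callable[[memoryview], None]) -> int:
        """Core IO loop: call the handler once per pending packet.

        The handler receives a memoryview into the ring itself, so no
        per-packet copy into a caller-supplied buffer takes place.
        """
        view = memoryview(self._buf)
        delivered = 0
        while self._tail < self._head:
            i = self._tail % self._slots
            start = i * self._slot_size
            handler(view[start:start + self._lengths[i]])
            self._tail += 1
            delivered += 1
        return delivered


if __name__ == "__main__":
    ring = RxRing(slots=4, slot_size=64)
    ring.inject(b"hello")
    ring.inject(b"world!")
    ring.dispatch(lambda pkt: print(len(pkt), bytes(pkt)))
```

Compare this with a recv()-style call, where the caller owns the destination buffer and the stack has no choice but to copy into it. The catch with the view-based design is lifetime: a handler that wants to keep a packet past the callback has to copy it or otherwise tell the stack not to reuse the slot.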
this is exactly how openonload works, it's pretty impressive that they get nearly all weird behaviour of the Linux socket API correct (correct behaviour across fork, select/poll/epoll, multicast behaviour, etc).