A bit dated in the sense that for Linux you'd probably use io_uring nowadays, but otherwise it's a timeless design.
Still, I'm conflicted on whether separating stages per thread (accept on one thread and the client loop in another) is a good idea. It sounds like the gains would be minimal or non-existent even in ideal circumstances, and on some workloads where there's not a lot of clients or connection churn it would waste an entire core for handling a low-volume event.
I'm open to contrarian opinions on this though, maybe I'm not seeing something...
It’s not a good idea, and that’s where I’d really start with the "dated" commentary here, rather than focusing on the polling mechanism. It depends on the application, but if the buffers are large (>=64 KB), as in a common TCP workload, then io_uring won’t necessarily help that much. You’ll gain a lot of scalability regardless of polling mechanism by making sure you can utilize RSS and XPS optimizations.
It's been a while but why is uring not helpful for larger buffers? I'd think the zero-copy I/O capabilities would make it more helpful for larger payloads, not less
uring supports zero-copy, but is not a copy-reduction mechanism; it is a syscall-reduction mechanism. Large buffers mean fewer syscalls to start with, so less benefit.
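To put rough numbers on the "large buffers mean fewer syscalls" point, here is a back-of-the-envelope sketch. It assumes every read fills the whole buffer (real sockets often return short reads), and `syscalls_needed` is just a name made up for illustration.

```python
def syscalls_needed(total_bytes: int, buffer_bytes: int) -> int:
    """Reads needed to move total_bytes, assuming each call fills the buffer."""
    # Integer ceiling division, so a partial final buffer still costs one call.
    return (total_bytes + buffer_bytes - 1) // buffer_bytes


GIB = 1024 ** 3

if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024, 1024 * 1024):
        calls = syscalls_needed(GIB, size)
        print(f"{size // 1024:>5} KiB buffer: {calls:>7} calls per GiB")
```

Going from 4 KiB to 64 KiB buffers already removes 15 out of every 16 calls, so there is much less per-call overhead left for any syscall-batching mechanism to save.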
io_uring is in a curious place. Yes it does offer significant performance advantages, but it continues to be such a consistent source of bugs - many with serious security implications - that it's questionable if it's really worth using.
I do agree that it's a bit dated and today you'd do other things (notably SO_REUSEPORT), just feel that io_uring is a questionable example.
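For anyone who hasn't used SO_REUSEPORT, a minimal sketch of what it buys you: several independent listening sockets on the same port, so each worker can accept on its own socket instead of going through one shared acceptor. This assumes Linux (or another platform where Python exposes `socket.SO_REUSEPORT`); the function name and the backlog of 128 are arbitrary choices for illustration.

```python
import socket


def reuseport_listeners(host: str, port: int, count: int) -> list[socket.socket]:
    """Open `count` independent listening sockets on the same host:port."""
    listeners = []
    for _ in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # The option has to be set on every socket before bind(),
        # including the first one.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(128)
        if port == 0:
            # Port 0 asked the kernel for a free port; reuse it for the rest.
            port = sock.getsockname()[1]
        listeners.append(sock)
    return listeners


if __name__ == "__main__":
    socks = reuseport_listeners("127.0.0.1", 0, 4)
    print("listening on port", socks[0].getsockname()[1], "with", len(socks), "sockets")
    for s in socks:
        s.close()
```

In a thread-per-core server each worker would own one of these sockets and register it in its own epoll instance, and the kernel spreads incoming connections across them.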
> continues to be such a consistent source of bugs - many with serious security implications... just feel that io_uring is a questionable example.
Are you saying this as someone with experience, or is it just a feeling? Please give examples of recent bugs in io_uring that have security implications.
There are a couple of notable examples of projects[0] and companies[1] that have got tired of it, and no longer use it.
There's considerable difficulty these days extrapolating "real" vulnerabilities from kernel CVEs, as the kernel team quite reasonably feel that basically any bug can be a vulnerability in the right situation, but the list of vulnerabilities in io_uring over the past 12 months[2] is pretty staggering to me.
Not OP, and I'm no expert in the area at all, but I _do_ have a feeling that there have been quite a few such issues posted here and elsewhere that I read in the last year.
https://www.cve.org/CVERecord/SearchResults?query=io_uring seems to back that up. Only one relevant CVE is listed there for 2026 so far, versus more than two per month on average in 2025. Caveat: I've not looked into the severity and ease of exploit for any of those issues listed.
Did you read the CVEs? Half these aren't vulnerabilities. One allows the root user to create a kernel thread and then block its shutdown for several minutes. One is that if you do something that's obviously stupid, you don't get an event notification for it.
Remember the Linux kernel's policy of assigning a CVE to every single bug, in protest to the stupid way CVEs were being assigned before that.
If we apply risk/reward analysis, how probable is such a chain of exploits? If you already got local root, you might as well do a little bit more than a simple DoS.
Depending on how much performance would be gained by using io_uring in a particular case, and how many layers of protection exist around your server, it might be a risk worth taking.
In node.js I've seen time and time again some slow task that happens only every now and then, but which causes significant latency spikes. Having the one single event loop, with tasks big and small, from all stages of the processing pipeline mixed in, feels so crude. I really want a more sophisticated architecture where different stages of the execution can be managed independently.
I also want to mention that very very very few programs do, but io_uring does let you run multiple io_urings!! Your program can pick from which completion queue it wants to read, can put high priority tasks in a specific iou.
> One thread per core, pinned (affinity) to separate CPUs, each with their own epoll/kqueue fd
> Each major state transition (accept, reader) is handled by a separate thread, and transitioning one client from one state to another involves passing the file descriptor to the epoll/kqueue fd of the other thread.
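A minimal sketch of the hand-off described in that quote, assuming Linux (`select.epoll`) and only two stages. `ReaderStage`, `adopt`, `accept_stage` and `demo` are made-up names, the reader just echoes, and there is no CPU pinning or real error handling, so treat it as an illustration of the mechanism rather than the article's implementation.

```python
import select
import socket
import threading


class ReaderStage:
    """Second stage: owns its own epoll fd and echoes whatever clients send."""

    def __init__(self) -> None:
        self.epoll = select.epoll()
        self.conns: dict[int, socket.socket] = {}
        self.running = True

    def adopt(self, conn: socket.socket) -> None:
        # Called from the accept thread: this is the cross-thread hand-off.
        conn.setblocking(False)
        self.conns[conn.fileno()] = conn  # store before register, so run() can find it
        self.epoll.register(conn.fileno(), select.EPOLLIN)

    def run(self) -> None:
        while self.running:
            for fd, _events in self.epoll.poll(0.1):
                conn = self.conns[fd]
                try:
                    data = conn.recv(4096)
                    if data:
                        conn.sendall(data)  # fine for tiny replies only
                except OSError:
                    data = b""
                if not data:  # peer closed or errored
                    self.epoll.unregister(fd)
                    del self.conns[fd]
                    conn.close()


def accept_stage(listener: socket.socket, reader: ReaderStage) -> None:
    """First stage: only accepts, then passes each connection on."""
    while reader.running:
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue  # wake up periodically to notice shutdown
        except OSError:
            break  # listener was closed
        reader.adopt(conn)


def demo() -> bytes:
    listener = socket.create_server(("127.0.0.1", 0))
    listener.settimeout(0.1)
    reader = ReaderStage()
    threads = [
        threading.Thread(target=accept_stage, args=(listener, reader)),
        threading.Thread(target=reader.run),
    ]
    for t in threads:
        t.start()
    with socket.create_connection(listener.getsockname()) as client:
        client.settimeout(2)
        client.sendall(b"ping")
        reply = client.recv(4096)
    reader.running = False
    for t in threads:
        t.join()
    listener.close()
    reader.epoll.close()
    return reply


if __name__ == "__main__":
    print(demo())
```

The `adopt` call is the part people in this thread are arguing about: it is cheap to write, but the connection's state was touched on the accept thread's core and is next touched on the reader's.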
So this seems like a little pipeline that all of the requests go through, right? For somebody who doesn’t do server stuff, is there a general idea of how many stages a typical server might be able to implement? And does it create a load-balancing problem? I’d expect some stages to be quite cheap…
> For somebody who doesn’t do server stuff, is there a general idea of how many stages a typical server might be able to implement?
On the HTTP server from the article, what I understood is that those 2 you are seeing are the ones you have. Or maybe 3, if disposing of things is slow.
I'm not sure what I prefer. On one hand, there's some expensive coordination for passing those file descriptors around. On the other hand, having some separate code bother with creating and closing the connections makes it easier to focus on the actual performance issues where they appear, and creates an opportunity to dispatch work smartly.
Of course, you can go all the way in and make a green threads server where every bit of IO puts the work back on the queue. But you would use a single queue then, and dispatch the code that works on it. So you get more branching, but less coordination.
It’s an interesting throwback to SEDA, but physically passing file descriptors between different cores as a connection changes state is usually a performance killer on modern hardware. While it sounds elegant on a whiteboard to have a dedicated 'accept' core and a 'read' core, you end up trading a slightly simpler state machine for massive L1/L2 cache thrashing. Every time you hand off that connection, you immediately invalidate the buffers and TCP state you just built up. There’s a reason the industry largely settled on shared-nothing architectures like NGINX: having a single pinned thread handle the entire lifecycle of a request keeps all that data strictly local to the CPU cache. When you're trying to scale, respecting data locality almost always beats pipeline cleanliness.
You could presumably have an acceptor thread per core, which passes the fds to the core-aligned next thread, etc.
That would get you the code simplicity benefits the article suggests, while keeping the socket bound to a single core, which is definitely needed.
Depending on whether you actually need to share anything, you could do process per core, thread per loop, and you have no core-to-core communication from the usual workings of the process (I/O may cross though).
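For the "one worker per core" part, a minimal sketch of pinning threads on Linux using `os.sched_setaffinity`. The worker body is a placeholder (a real one would create its own epoll fd and loop), and `pinned_worker` / `start_pinned_workers` are names made up for this example.

```python
import os
import threading


def pinned_worker(cpu: int, results: dict[int, set[int]]) -> None:
    # pid 0 means "the calling thread" for the underlying Linux syscall,
    # so this pins only this thread, not the whole process.
    os.sched_setaffinity(0, {cpu})
    # Placeholder: a real worker would set up its own epoll fd and loop here.
    results[cpu] = os.sched_getaffinity(0)


def start_pinned_workers() -> dict[int, set[int]]:
    """Start one thread per CPU this process is allowed to run on."""
    results: dict[int, set[int]] = {}
    cpus = sorted(os.sched_getaffinity(0))
    threads = [
        threading.Thread(target=pinned_worker, args=(cpu, results))
        for cpu in cpus
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for cpu, mask in sorted(start_pinned_workers().items()):
        print(f"worker for cpu {cpu} is restricted to {sorted(mask)}")
```

An acceptor and a reader pinned to the same CPU would simply both call `os.sched_setaffinity(0, {cpu})` with the same value, which keeps the socket on one core but, as pointed out elsewhere in the thread, adds context switches between the two.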
I don't think the author intended "code simplicity" as an end unto itself but a way to reduce cache pressure. He popped into the 2016 discussion [1] to say:
> Another benefit of this design overlooked is that individual cores may not ever need to read memory -- the entire task can run in L1 or L2. If a single worker becomes too complicated this benefit is lost, and memory is much much slower than cache.
I think this is wrong or at least overstated: if you're passing off fds and their associated (kernel- and/or user-side) buffers between cores, you can't run entirely in L1 or L2. And in general, I'd expect data to be responsible for much more cache pressure than code, so I'm skeptical of localizing the code at the expense of the data.
But anyway, if the goal is to organize which cores are doing the work, splitting a single core's work from a single thread (pinned to it) to several threads (still pinned to it) doesn't help. It just introduces more context switching.
While I agree that shared nothing beats the pants off shared state performance-wise, surely the penalty you've outlined is only for super short-lived connections?
For longer-lived connections the cache is going to thrash on an inevitable context switch anyway (either due to needing to wait for more I/O or normal preemption). As long as processing of I/O is handled on a given core, I don't know if there is actually such a huge benefit. A single pinned thread for the entire lifecycle has the problem that you get latency bottlenecks under load where two CPU-heavy requests end up contending for the same core vs work stealing making use of available compute.
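A toy model of that contention, with made-up request costs in milliseconds. It compares requests that stay on the core they were pinned to against an idealised work-stealing scheduler where the next request goes to whichever core frees up first. It deliberately ignores the cache misses that stealing causes, which is exactly the cost the other side of this argument cares about.

```python
import heapq


def pinned_makespan(costs_by_core: list[list[float]]) -> float:
    """Each core runs only the requests that were pinned to it."""
    return max(sum(costs) for costs in costs_by_core)


def stealing_makespan(costs: list[float], cores: int) -> float:
    """Idealised stealing: each request goes to the core that frees up first."""
    finish_times = [0.0] * cores
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


if __name__ == "__main__":
    # Hypothetical load: two heavy requests happen to land on core 0.
    pinned = [[100.0, 100.0], [1.0], [1.0], [1.0]]
    all_requests = [cost for core in pinned for cost in core]
    print("pinned:  ", pinned_makespan(pinned), "ms")
    print("stealing:", stealing_makespan(all_requests, cores=4), "ms")
```

With these numbers the pinned layout takes twice as long to drain, because the second heavy request waits behind the first while three cores sit nearly idle.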
The ultimate benefit would be if you could arrange each core to be given a dedicated NIC. Then the interrupts for the NIC are arriving on the core that's processing each packet. But otherwise you're already going to have to wake up the NIC on a random core to do a cross-core delivery of the I/O data.
TLDR: It's super complex to get a truly shared-nothing approach unless you have a single application and you correctly allocate the work. It's really hard to solve this generically and optimally for all possible combinations of request and processing patterns.
[0] https://github.com/containerd/containerd/pull/9320
[1] https://security.googleblog.com/2023/06/learnings-from-kctf-...
[2] https://nvd.nist.gov/vuln/search#/nvd/home?offset=0&rowCount...
You obviously didn't read to the end of my little post, yet feel righteous enough to throw that out…
> One allows the root user to create a kernel thread and then block its shutdown for several minutes.
Which as part of a compromise chain could cause a DoS issue that might be able to bypass common protections like cgroup imposed limits.
https://www.techempower.com/benchmarks/#section=data-r23&tes...
[1] https://news.ycombinator.com/item?id=10874616