Nano-vLLM: How a vLLM-style inference engine works

▲

Nano-vLLM: How a vLLM-style inference engine works(neutree.ai)

132 points byyz-yu4 hours ago |2 comments

▲jbarrow2 hours ago

The whole thing feels AI written, generated from the codebase.*

*this is incorrect per the author’s response, my apologies.

For instance, it goes into (nano)vLLM internals and doesn’t mention PagedAttention once (one of the core ideas that vLLM is based on)[1].

Also mentions that Part 2 will cover dense vs MoE’s, which is weird because nanovllm hardcodes a dense Qwen3 into the source.

Here are better (imo) explainers about how vLLM works:

- https://hamzaelshafie.bearblog.dev/paged-attention-from-firs...

- https://www.aleksagordic.com/blog/vllm

- https://huggingface.co/blog/continuous_batching

Aleksa’s blog is a bit in the weeds for my taste but it’s really worth working through.

A lot of the magic of vLLM happens in the PagedAttention kernels, which are really succinctly implanted in nanovllm. And the codebase is great and readable by itself!

—

1. https://arxiv.org/abs/2309.06180

▲WhitneyLand12 minutes ago

Actually I thought it was a great example clarity, focus, and economy of words that AI is not capable of at this point in time.

▲yz-yu2 hours ago

Hi jbarrow, thanks for your feedback and the links you shared—they're great readings for me (and likely others too).

That said, I need to clarify: the content was not written by AI, and certainly not generated from a database in one shot. If there's some agent + prompt that can produce what I wrote, I'd love to learn it—it would've saved me two weekends :)

Before addressing your questions further, some context: I'm a developer with no ML background but plenty of Cloud Infra experience. I'm currently building an open-source AI Infra project, which is why I studied nano-vllm. So my writing reflects some gaps in ML knowledge.

To your specific points:

> it goes into (nano)vLLM internals and doesn't mention PagedAttention once

I didn't find any explicit "paged attention" naming in nano-vllm. After reading the first article you linked—specifically the "Paged KV Caching" section—I believe the block management logic and CPU/GPU block mapping it describes is exactly what I covered in both posts. It may not be the full picture of paged attention, but I interpreted what I saw in the code and captured the core idea. I think that's a reasonable outcome.

> Part 2 will cover dense vs MoE's, which is weird because nanovllm hardcodes a dense Qwen3 into the source

This reflects my learning approach and background. Same as point 1—I may not have realized the block design was the famous PagedAttention implementation, so I didn't name it as such. For point 2, seeing a dense Qwen3 naturally made me wonder how it differs from the xx-B-A-yy-B MoE models I'd seen on Hugging Face—specifically what changes in the decoder layers. That curiosity led me to learn about MoE and write it up for others with the same questions.

---

I completely understand that in this era, people care more about whether what they're reading is AI-generated—no one wants to waste time on low-effort slop with no human involvement.

But as I explained above—and as my hand-drawn Excalidraw diagrams show (I haven't seen an LLM produce diagrams with logic that satisfies me)—this is the result of learning shaped by my own knowledge background and preferences.

▲jacquesm1 hour ago

Funny, this reads even more AI written than the article itself.

▲lambda7 minutes ago

One thing to keep in mind is that a lot of non-native English speakers use LLMs to translate to English, or to polish their English prose; they may not realize that it causes the translation to come out in a very LLM-style tone. Not sure if that's the case here, but it looks like OP is a native Chinese speaker so may be using tools to translate to English.

▲Juvination1 hour ago

The em dashes really aren't helping their case.

▲_alternator_54 minutes ago

Wait—do people here really think the em dash was nonexistent before LLMs? It’s widely used by people like me who care about writing style. The reason LLMs use it is because they reflect care and concern about writing style.

▲WhitneyLand4 minutes ago

No, people think humans use it a lot less often than AI, because it’s true. Especially for casual writing.

The contrast might become even greater because some humans that did use them have stopped to avoid false accusations.

▲CodeMage46 minutes ago

Yeah, people do seem to think that em dashes are an indicator of GenAI. I have been accused of using AI to write my posts on a forum, precisely because of em dashes. That's how I found out about that particular sniff test people use.

Hasn't made me change the way I write, though. Especially because I never actually type an em dash character myself. Back when I started using computers, we only had ASCII, so I got used to writing with double dashes. Nowadays, a lot of software is smart enough to convert a double dash into an em dash. Discourse does that and that's how I ended up being accused of being an AI bot.

▲171862744025 minutes ago

Shouldn't a double dash result in an en dash and only a triple in an em dash?

▲Juvination44 minutes ago

Nobody ever said that they were nonexistent before LLMs. When you are investigating and trying to determine if something is AI generated they are the number one indicator.

So if you're being accused of just spewing AI, then double down and spew what looks EVEN MORE like AI. What are you even doing?

▲mmaunder12 minutes ago

My guess it's a translator they're using.

▲marcellus231 hour ago

It really doesn't.

▲CodeMage38 minutes ago

It does, but what does that say about the state of communication in our industry? I've seen a lot of writing that reads like an AI produced it in contexts where I could be pretty sure no AI was involved. We want to sound professional, so we sanitize how we write so much that it becomes... whatever this current situation is.

No offense intended to @yz-yu, by the way. I miss the times when more people wrote in an eccentric style -- like Steve Yegge -- but that doesn't detract from what you wrote.

▲yz-yu1 hour ago

Cool, humans hallucinate too. — AI

▲Der_Einzige1 hour ago

Yup, this is SUPER AI generated.

Next time OP that you want to lie about using LLMs, either use the techniques from our paper: https://arxiv.org/abs/2510.15061

Or if you're more lazy, there's a "logit_bias" technique which you can use to ban the em-dash in language models.

But you were too lazy to do that, and you lied about not using AI. Shame on you big time.

Also, even if you somehow didn't AI generate this, you sure as shit got infected by the LLM mind-virus and now you write/talk like it. That's basically just as bad. Either square up with proof that you overused EM dashes before late 2022 (like Dang!) or fix your writing style.

https://arxiv.org/abs/2409.01754

https://arxiv.org/abs/2508.01491

https://aclanthology.org/2025.acl-short.47/

https://arxiv.org/abs/2506.06166

https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

https://osf.io/preprints/psyarxiv/wzveh_v1

https://arxiv.org/abs/2506.08872

https://aclanthology.org/2025.findings-acl.987/

https://aclanthology.org/2025.coling-main.426/

https://aclanthology.org/2025.iwsds-1.37/

https://www.medrxiv.org/content/10.1101/2024.05.14.24307373v...

https://journals.sagepub.com/doi/full/10.1177/21522715251379...

https://arxiv.org/abs/2506.21817

▲lukax2 hours ago

Not really in the PagedAttention kernels. Paged attention was integrated into FlashAttention so that FlashAttention kernels can be used both for prefill and decoding with paged KV. The only paged attention specific kernels are for copying KV blocks (device to device, device to host and host to device). At least for FA2 and FA3, vLLM maintained a fork of FA with paged attention patches.

▲yz-yu1 hour ago

Since HN only allows one link per submission, dropping Part 2 here.

https://www.neutree.ai/blog/nano-vllm-part-2