bitdeep 12 hours ago
Not sure if you guys know: Groq is already doing this with their ASIC chips. So... they already passed the FPGA phase and are in the ASIC phase.

The problem is: it seems their costs are 1x or 2x what they are charging.

latchkey 9 hours ago
Probably more than 2x...

"Semi analysis did some cost estimates, and I did some but you’re likely paying somewhere in the 12 million dollar range for the equipment to serve a single query using llama-70b. Compare that to a couple of gpus, and it’s easy to see why they are struggling to sell hardware, they can’t scale down.

Since they didn’t use hbm, you need to stich enough cards together to get the memory to hold your model. It takes a lot of 256mb cards to get to 64gb, and there isn’t a good way to try the tech out since a single rack really can’t serve an LLM."

https://news.ycombinator.com/item?id=39966620
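
For a rough sense of scale, here's a back-of-the-envelope using the figures from that quote (256 MB of SRAM per card, ~64 GB of weights); these are the quoted numbers, not official Groq specs:

    # Sketch of why SRAM-only cards have to be ganged together.
    # Figures come from the quoted comment, not from Groq's own spec sheets.
    card_sram_gb = 0.256      # on-chip SRAM per card, in GB (quoted figure)
    model_size_gb = 64        # Llama-70B-class weights (quoted figure)

    cards_needed = model_size_gb / card_sram_gb
    print(f"~{cards_needed:.0f} cards just to hold the weights")  # ~250 cards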

qwertox 11 hours ago
The way I see it, one day we'll be buying small LLM cartridges.
fhdsgbbcaA 18 hours ago
Looks like LLM inference will follow the same path as Bitcoin: CPU -> GPU -> FPGA -> ASIC.
hackernudes 17 hours ago
I really doubt it. Bitcoin mining is quite fixed, just massive amounts of SHA256. On the other hand, ASICs for accelerating matrix/tensor math are already around. LLM architecture is far from fixed and currently being figured out. I don't see an ASIC any time soon unless someone REALLY wants to put a specific model on a phone or something.
pzo 9 hours ago
Apple has the Neural Engine and it really speeds up many CoreML models - if most operators are implemented on the NPU, inference is significantly faster than on the GPU on my MacBook M2 Max (and it has a similar NPU to the one in e.g. the iPhone 13). These ASIC NPUs just implement many of the typical low-level operators used in most ML models.
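
For anyone curious, targeting the Neural Engine is mostly a one-flag choice at conversion time; a minimal sketch with coremltools (the toy model and file name are made up for illustration):

    import torch
    import coremltools as ct

    # Toy network standing in for any CoreML-convertible model (hypothetical).
    class TinyMLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(128, 256)
            self.fc2 = torch.nn.Linear(256, 10)
        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    example = torch.rand(1, 128)
    traced = torch.jit.trace(TinyMLP().eval(), example)

    # Ask Core ML to schedule supported ops on the Neural Engine;
    # anything unsupported falls back to CPU.
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=example.shape)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )
    mlmodel.save("tiny_mlp.mlpackage")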
YetAnotherNick 16 hours ago
Google's TPU is an ASIC and performs competitively. Also, Tesla and Meta are building something, AFAIK.

Although I doubt you could do a lot better, as GPUs already have half the die area reserved for matrix multiplication.

danielmarkbruce 15 hours ago
It depends on your precise definition of ASIC. The FPGA thing here would be analogous to an MSIC where m = model.

Building a chip for a specific model is clearly different from what a TPU is.

Maybe we'll start seeing MSICs soon.

YetAnotherNick 13 hours ago
LLMs and many other models spend 99% of their FLOPs in matrix multiplication. And the TPU initially had just a single operation, i.e. matrix multiply. Even if an MSIC were 100x better than a GPU at everything else, it would only be about 1% faster overall.
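
For what it's worth, the ~1% figure is just Amdahl's law applied to the 99%/1% split above; a minimal sketch:

    # Amdahl's law: if only the non-matmul 1% of the FLOPs gets the 100x
    # speedup, the overall gain is capped by the untouched 99%.
    matmul_frac = 0.99     # fraction of work spent in matrix multiplication
    other_speedup = 100.0  # hypothetical speedup on everything else

    overall = 1.0 / (matmul_frac + (1.0 - matmul_frac) / other_speedup)
    print(f"overall speedup: {overall:.4f}x")  # ~1.0101x, i.e. about 1% faster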
danielmarkbruce 12 hours ago
You can still optimize the various layers of the memory hierarchy for a specific model, make it all 8-bit or 4-bit or whatever you want, maybe burn in a specific activation function, all kinds of stuff.

No chance you'd only get 1% speedup on a chip designed for a specific model.
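
To put rough numbers on the 8-bit/4-bit point, here's what a burned-in weight precision alone buys in capacity (illustrative arithmetic only):

    # Weight-memory footprint of a 70B-parameter model at different precisions.
    # Illustrative only; real deployments also carry KV cache, activations
    # and per-group quantization scales.
    params = 70e9
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        gigabytes = params * bits / 8 / 1e9
        print(f"{name}: ~{gigabytes:.0f} GB of weights")
    # fp16: ~140 GB, int8: ~70 GB, int4: ~35 GB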

evanjrowley 5 hours ago
You forgot to place "vertically-integrated unobtanium" after ASIC.
winwang 12 hours ago
As far as I understand, the main issue for LLM inference is memory bandwidth and capacity. Tensor cores are already an ASIC for matmul, and they idle half the time waiting on memory.
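
A crude way to see the bandwidth bound: in single-stream decoding every token has to stream the full weight set through memory, so tokens/s is roughly bandwidth divided by model bytes. The numbers below are illustrative, not tied to any particular card:

    # Upper bound on single-batch decode speed when the weights must be
    # re-read from memory for every token (illustrative numbers only).
    model_params = 70e9     # Llama-70B-class model
    bytes_per_param = 2     # fp16/bf16 weights
    mem_bandwidth = 2e12    # 2 TB/s of memory bandwidth (hypothetical)

    bytes_per_token = model_params * bytes_per_param
    max_tokens_per_s = mem_bandwidth / bytes_per_token
    print(f"bandwidth-bound ceiling: ~{max_tokens_per_s:.0f} tokens/s")  # ~14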
bee_rider 15 hours ago
LLM inference is a small task built into some other program you are running, right? Like an office suite with some sentence suggestion feature, probably a good use for an LLM, would be… mostly office suite, with a little LLM inference sprinkled in.

So, the “ASIC” here is probably the CPU with, like, slightly better vector extensions. AVX1024-FP16 or something, haha.

p1esk 8 hours ago
"would be… mostly office suite, with a little LLM inference sprinkled in."

No, it would be LLM inference with a little bit of an office suite sprinkled in.

jsheard 18 hours ago
Is there any particular reason you'd want to use an FPGA for this? Unless your problem space is highly dynamic (e.g. prototyping) or you're making products in vanishingly low quantities for a price-insensitive market (e.g. military), an ASIC is always going to be better.

There doesn't seem to be much flux in the low-level architectures used for inferencing at this point, so you may as well commit to an ASIC, as is already happening with Apple, Qualcomm, etc. building NPUs into their SoCs.

PaulHoule 17 hours ago
(1) Academics can make an FPGA design but not an ASIC; (2) an FPGA is a first step toward making an ASIC.
israrkhan 16 hours ago
You can open-source your FPGA designs for wider collaboration with the community. Also, an FPGA is the starting step for making any modern digital chip.
wongarsu 17 hours ago
This specific project looks like a case of "we have this platform for automotive and industrial use, running Llama on the dual-core ARM CPU is slow but there's an FPGA right next to it". That's all the justification you really need for a university project.

Not sure how useful this is for anyone who isn't already locked into this specific architecture. But it might be a useful benchmark or jumping-off point for more useful FPGA-based accelerators, like ones optimized for 1-bit or 1.58-bit LLMs.
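
As an aside, the "1.58 bit" figure is just log2(3) for ternary weights in {-1, 0, +1}; a tiny sketch of the resulting footprint (illustrative numbers, not tied to this project):

    import math

    # "1.58-bit" LLMs (e.g. BitNet b1.58) use ternary weights in {-1, 0, +1},
    # so the information content per weight is log2(3) bits.
    bits_per_weight = math.log2(3)   # ~1.585
    params = 7e9                     # a 7B-parameter model, for scale
    ternary_gb = params * bits_per_weight / 8 / 1e9
    fp16_gb = params * 16 / 8 / 1e9
    print(f"ternary: ~{ternary_gb:.1f} GB vs fp16: ~{fp16_gb:.0f} GB")  # ~1.4 GB vs ~14 GB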

danielmarkbruce 15 hours ago
Model architecture changes fast. Maybe it will slow down.
someguydave 18 hours ago
gotta prototype the thing somewhere. If it turns out that the LLM algos become pretty mature I suspect accelerators of all kinds will be baked into silicon, especially for inference.
jsheard 17 hours ago
That's the thing though, we're already there. Every new consumer ARM and x86 ASIC is shipping with some kind of NPU; the time for tentatively testing the waters with FPGAs was a few years ago, before this stuff came to market.
PaulHoule 17 hours ago
But the NPU might be poorly designed for your model or workload or just poorly designed.
KeplerBoy 17 hours ago
4 times as efficient as on the SoC's low-end ARM cores, so many times less efficient than on modern GPUs, I guess?

Not that I was expecting GPU-like efficiency from a fairly small-scale FPGA project. Nvidia engineers spent thousands of man-years making sure that stuff works well on GPUs.