bitdeep 12 hours ago
Not sure if you guys know: Groq is already doing this with their ASIC chips. So... they already passed the FPGA phase and are in the ASIC phase.

The problem is: it seems their costs are 1x or 2x what they are charging.

latchkey 9 hours ago
Probably more than 2x...

"Semi analysis did some cost estimates, and I did some but you’re likely paying somewhere in the 12 million dollar range for the equipment to serve a single query using llama-70b. Compare that to a couple of gpus, and it’s easy to see why they are struggling to sell hardware, they can’t scale down.

Since they didn’t use hbm, you need to stich enough cards together to get the memory to hold your model. It takes a lot of 256mb cards to get to 64gb, and there isn’t a good way to try the tech out since a single rack really can’t serve an LLM."

https://news.ycombinator.com/item?id=39966620
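
For a rough sense of scale, here's a back-of-the-envelope using the figures from that quote (256 MB of SRAM per card, ~64 GB of weights); these are the quoted numbers, not official Groq specs:

    # Sketch of why SRAM-only cards have to be ganged together.
    # Figures come from the quoted comment, not from Groq's own spec sheets.
    card_sram_gb = 0.256      # on-chip SRAM per card, in GB (quoted figure)
    model_size_gb = 64        # Llama-70B-class weights (quoted figure)

    cards_needed = model_size_gb / card_sram_gb
    print(f"~{cards_needed:.0f} cards just to hold the weights")  # ~250 cards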

qwertox 11 hours ago
The way I see it, one day we'll be buying small LLM cartridges.
fhdsgbbcaA 18 hours ago
Looks like LLM inference will follow the same path as Bitcoin: CPU -> GPU -> FPGA -> ASIC.
hackernudes 17 hours ago
I really doubt it. Bitcoin mining is quite fixed, just massive amounts of SHA256. On the other hand, ASICs for accelerating matrix/tensor math are already around. LLM architecture is far from fixed and currently being figured out. I don't see an ASIC any time soon unless someone REALLY wants to put a specific model on a phone or something.
pzo 9 hours ago
Apple has the Neural Engine and it really speeds up many CoreML models - if most operators are implemented on the NPU, inference is significantly faster than on the GPU on my MacBook M2 Max (and it has a similar NPU to the one in e.g. the iPhone 13). These ASIC NPUs just implement many of the typical low-level operators used in most ML models.
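
For anyone curious, targeting the Neural Engine is mostly a one-flag choice at conversion time; a minimal sketch with coremltools (the toy model and file name are made up for illustration):

    import torch
    import coremltools as ct

    # Toy network standing in for any CoreML-convertible model (hypothetical).
    class TinyMLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(128, 256)
            self.fc2 = torch.nn.Linear(256, 10)
        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    example = torch.rand(1, 128)
    traced = torch.jit.trace(TinyMLP().eval(), example)

    # Ask Core ML to schedule supported ops on the Neural Engine;
    # anything unsupported falls back to CPU.
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=example.shape)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )
    mlmodel.save("tiny_mlp.mlpackage")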
YetAnotherNick 16 hours ago
Google's TPU is an ASIC and performs competitively. Also, Tesla and Meta are building something, AFAIK.

Although I doubt you could do a lot better, as GPUs already have half the die area reserved for matrix multiplication.

danielmarkbruce 15 hours ago
It depends on your precise definition of ASIC. The FPGA thing here would be analogous to an MSIC where m = model.

Building a chip for a specific model is clearly different from what a TPU is.

Maybe we'll start seeing MSICs soon.

YetAnotherNick 13 hours ago
LLMs and many other models spend 99% of their FLOPs in matrix multiplication. And the TPU initially had just a single operation, i.e. matrix multiply. Even if an MSIC were 100x better than a GPU at everything else, it would only be about 1% faster overall.
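
For what it's worth, the ~1% figure is just Amdahl's law applied to the 99%/1% split above; a minimal sketch:

    # Amdahl's law: if only the non-matmul 1% of the FLOPs gets the 100x
    # speedup, the overall gain is capped by the untouched 99%.
    matmul_frac = 0.99     # fraction of work spent in matrix multiplication
    other_speedup = 100.0  # hypothetical speedup on everything else

    overall = 1.0 / (matmul_frac + (1.0 - matmul_frac) / other_speedup)
    print(f"overall speedup: {overall:.4f}x")  # ~1.0101x, i.e. about 1% faster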
danielmarkbruce 12 hours ago
You can still optimize the various layers of the memory hierarchy for a specific model, make it all 8-bit or 4-bit or whatever you want, maybe burn in a specific activation function, all kinds of stuff.

No chance you'd only get 1% speedup on a chip designed for a specific model.
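
To put rough numbers on the 8-bit/4-bit point, here's what a burned-in weight precision alone buys in capacity (illustrative arithmetic only):

    # Weight-memory footprint of a 70B-parameter model at different precisions.
    # Illustrative only; real deployments also carry KV cache, activations
    # and per-group quantization scales.
    params = 70e9
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        gigabytes = params * bits / 8 / 1e9
        print(f"{name}: ~{gigabytes:.0f} GB of weights")
    # fp16: ~140 GB, int8: ~70 GB, int4: ~35 GB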

evanjrowley 5 hours ago
You forgot to place "vertically-integrated unobtanium" after ASIC.
winwang 12 hours ago
As far as I understand, the main issue for LLM inference is memory bandwidth and capacity. Tensor cores are already an ASIC for matmul, and they idle half the time waiting on memory.
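
A crude way to see the bandwidth bound: in single-stream decoding every token has to stream the full weight set through memory, so tokens/s is roughly bandwidth divided by model bytes. The numbers below are illustrative, not tied to any particular card:

    # Upper bound on single-batch decode speed when the weights must be
    # re-read from memory for every token (illustrative numbers only).
    model_params = 70e9     # Llama-70B-class model
    bytes_per_param = 2     # fp16/bf16 weights
    mem_bandwidth = 2e12    # 2 TB/s of memory bandwidth (hypothetical)

    bytes_per_token = model_params * bytes_per_param
    max_tokens_per_s = mem_bandwidth / bytes_per_token
    print(f"bandwidth-bound ceiling: ~{max_tokens_per_s:.0f} tokens/s")  # ~14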
bee_rider 15 hours ago
LLM inference is a small task built into some other program you are running, right? Like an office suite with some sentence suggestion feature, probably a good use for an LLM, would be… mostly office suite, with a little LLM inference sprinkled in.

So, the “ASIC” here is probably the CPU with, like, slightly better vector extensions. AVX1024-FP16 or something, haha.

p1esk 8 hours ago
"would be… mostly office suite, with a little LLM inference sprinkled in."

No, it would be LLM inference with a little bit of an office suite sprinkled in.

jsheard 18 hours ago
Is there any particular reason you'd want to use an FPGA for this? Unless your problem space is highly dynamic (e.g. prototyping) or you're making products in vanishingly low quantities for a price-insensitive market (e.g. military), an ASIC is always going to be better.

There doesn't seem to be much flux in the low-level architectures used for inferencing at this point, so you may as well commit to an ASIC, as is already happening with Apple, Qualcomm, etc. building NPUs into their SoCs.

PaulHoule 17 hours ago
(1) Academics can make an FPGA design but not an ASIC; (2) an FPGA is a first step toward making an ASIC.
israrkhan 16 hours ago
You can open-source your FPGA designs for wider collaboration with the community. Also, an FPGA is the starting step for making any modern digital chip.
wongarsu 17 hours ago
This specific project looks like a case of "we have this platform for automotive and industrial use, running Llama on the dual-core ARM CPU is slow but there's an FPGA right next to it". That's all the justification you really need for a university project.

Not sure how useful this is for anyone who isn't already locked into this specific architecture. But it might be a useful benchmark or jumping-off point for more useful FPGA-based accelerators, like ones optimized for 1-bit or 1.58-bit LLMs.
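
As an aside, the "1.58 bit" figure is just log2(3) for ternary weights in {-1, 0, +1}; a tiny sketch of the resulting footprint (illustrative numbers, not tied to this project):

    import math

    # "1.58-bit" LLMs (e.g. BitNet b1.58) use ternary weights in {-1, 0, +1},
    # so the information content per weight is log2(3) bits.
    bits_per_weight = math.log2(3)   # ~1.585
    params = 7e9                     # a 7B-parameter model, for scale
    ternary_gb = params * bits_per_weight / 8 / 1e9
    fp16_gb = params * 16 / 8 / 1e9
    print(f"ternary: ~{ternary_gb:.1f} GB vs fp16: ~{fp16_gb:.0f} GB")  # ~1.4 GB vs ~14 GB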

danielmarkbruce 15 hours ago
Model architecture changes fast. Maybe it will slow down.
someguydave 18 hours ago
gotta prototype the thing somewhere. If it turns out that the LLM algos become pretty mature I suspect accelerators of all kinds will be baked into silicon, especially for inference.
jsheard 17 hours ago
That's the thing though, we're already there. Every new consumer ARM and x86 ASIC is shipping with some kind of NPU; the time for tentatively testing the waters with FPGAs was a few years ago, before this stuff came to market.
PaulHoule 17 hours ago
But the NPU might be poorly designed for your model or workload or just poorly designed.
KeplerBoy 17 hours ago
4 times as efficient as on the SoC's low-end ARM cores, so many times less efficient than on modern GPUs, I guess?

Not that I was expecting GPU-like efficiency from a fairly small-scale FPGA project. Nvidia engineers spent thousands of man-years making sure that stuff works well on GPUs.