Consistency diffusion language models: Up to 14x faster, no quality loss (together.ai)

91 points by zagwdt 4 hours ago | 11 comments

▲ WiSaGaN 5 minutes ago

I think diffusion makes much more sense than auto-regressive (AR) generation specifically for code generation, compared to chatbots.

▲ LarsDu88 24 minutes ago

A lot of this post-training recipe feels reminiscent of DINO training (teacher/student, use of stop gradients). I wonder if the more recent LeJEPA SIGReg regularization research might be relevant here for simpler post-training.

▲ simonw 27 minutes ago

I'd love to know what's going on with the Gemini Diffusion model - they had a preview last May and it was crazy fast but I've not heard anything since then.

▲ MASNeo 2 hours ago

I wish there were more research like this to speed things up, rather than building ever larger models.

▲ nl 1 hour ago

Why not both?

Scaling laws are real! But they don't preclude faster processing.

▲ yjftsjthsd-h 3 hours ago

Is anyone doing any form of diffusion language models that are actually practical to run today on the actual machine under my desk? There's loads of more "traditional" .gguf options (well, quants) that are practical even on shockingly weak hardware, and I've been seeing things that give me hope that diffusion is the next step forward, but so far it's all been early research prototypes.

▲ janalsncm 34 minutes ago

I worked on it for a more specialized task (query rewriting). It’s blazing fast.

A lot of inference code is set up for autoregressive decoding now. Diffusion is less mature. Not sure if Ollama or llama.cpp support it.

▲ Bolwin 3 hours ago

Based on my experience running diffusion image models, I really hope this isn't going to take over anytime soon. Parallel decoding may be great if you have a nice parallel GPU or NPU, but it is dog slow on CPUs.
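A deliberately crude cost model may help make the point above concrete. It is only a sketch under stated assumptions: autoregressive decoding uses a KV cache and computes one new position per pass, the diffusion decoder recomputes every position on every denoising step with no caching, and memory bandwidth is ignored. The function names and the 256-token / 32-step figures are made up for illustration and do not come from the article.

```python
def ar_cost(n_tokens: int) -> dict:
    """Autoregressive decoding (assumed KV cache): one pass per token,
    one new position computed per pass."""
    return {"sequential_passes": n_tokens, "positions_computed": n_tokens}


def diffusion_cost(n_tokens: int, n_steps: int) -> dict:
    """Parallel decoding (assumed no caching): each denoising step
    recomputes all n_tokens positions."""
    return {
        "sequential_passes": n_steps,
        "positions_computed": n_steps * n_tokens,
    }


# Hypothetical workload: 256 output tokens, 32 denoising steps.
ar = ar_cost(256)
diff = diffusion_cost(256, 32)

print("AR:       ", ar)
print("Diffusion:", diff)
print("pass ratio (AR / diffusion):",
      ar["sequential_passes"] / diff["sequential_passes"])
print("compute ratio (diffusion / AR):",
      diff["positions_computed"] / ar["positions_computed"])
```

Under these assumptions the diffusion decoder needs fewer sequential passes but computes many more positions in total. Hardware that can process a whole pass in parallel mostly pays for the passes, while a compute-limited CPU mostly pays for the positions.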

▲ nl 1 hour ago

Releasing this on the same day as Taalas's 16,000 token-per-second acceleration for the roughly comparable Llama 8B model must hurt!

I wonder how far down they can scale a diffusion LM? I've been playing with in-browser models, and the speed is painful.

https://taalas.com/products/

▲ aurareturn 1 hour ago

These have nothing to do with each other. This is a general optimization. Taalas's product is an ASIC that runs a tiny 8B model on SRAM.

But I wonder how Taalas's product can scale. Making a custom chip for one single tiny model is different from running arbitrary models with trillions of parameters for a billion users.

Roughly, 53B transistors for every 8B params. For a 2T param model, you'd need about 13 trillion transistors, assuming linear scaling. One chip uses 2.5 kW of power? That's 4x H100 GPUs. How does it draw so much power?

If you assume that the frontier model is 1.5 trillion parameters, you'd need an entire N5 wafer-scale chip to run it. And then if you need to change something in the model, you can't, since it's physically printed on the chip. So this is something you do if you know you're going to use this exact model without changing anything for years.

Very interesting tech for edge inference though. Robots and self-driving cars can make use of these in the distant future if power draw comes down drastically. A 2.4 kW chip running inside a robot is not realistic. Maybe a 150 W chip.
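The back-of-the-envelope scaling in the comment above can be checked in a few lines. This sketch uses only the figures quoted there (about 53B transistors for an 8B-parameter model) and takes on the same assumption the comment states, that transistor count scales linearly with parameter count. The function name is made up for illustration.

```python
# Figures as quoted in the comment; not independently verified.
TRANSISTORS_PER_CHIP = 53e9  # ~53B transistors
PARAMS_PER_CHIP = 8e9        # 8B-parameter model


def transistors_needed(params: float) -> float:
    """Linear extrapolation: same transistors-per-parameter ratio at any size."""
    return params * TRANSISTORS_PER_CHIP / PARAMS_PER_CHIP


ratio = TRANSISTORS_PER_CHIP / PARAMS_PER_CHIP
t_2t = transistors_needed(2e12)      # hypothetical 2T-parameter model
t_1_5t = transistors_needed(1.5e12)  # hypothetical 1.5T-parameter model

print(f"{ratio:.3f} transistors per parameter")
print(f"2T params   -> {t_2t / 1e12:.2f} trillion transistors")
print(f"1.5T params -> {t_1_5t / 1e12:.2f} trillion transistors")
```

The 2T case comes out slightly above 13 trillion transistors, which matches the rounded figure in the comment.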

▲ LASR 1 hour ago

Just tried this. Holy fuck.

I'd take an army of high-school graduate LLMs to build my agentic applications over a couple of genius LLMs any day.

This is a whole new paradigm of AI.

▲ tokenless 41 minutes ago

When that generates 10k of output slop with less latency than my web server doing some CRUD shit... amazing!

▲ bjt12345 1 hour ago

I do wonder why diffusion models aren't used alongside constrained decoding for programming. Surely it makes more sense than using an auto-regressive model.

▲ bob1029 50 minutes ago

Diffusion models need to infer the causality of language from within a symmetric architecture (information can flow forward or backward). AR forces information to flow in a single direction and is substantially easier to control as a result. The 2nd sentence in a paragraph of English text often cannot come before the first or the statement wouldn't make sense. Sometimes this is not an issue (and I think these are cases where parallel generation makes sense), but the edge cases are where all the money lives.
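The difference in information flow described above is commonly expressed as an attention mask. The following is a minimal sketch, assuming a plain causal mask for the autoregressive case and a fully bidirectional mask for the symmetric case. Real diffusion language models may use other patterns (block-wise masks, for example), and the function names here are made up for illustration.

```python
def causal_mask(n: int) -> list:
    """mask[i][j] is True when position i may attend to position j.
    Autoregressive: only positions j <= i are visible."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list:
    """Symmetric: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def show(mask: list) -> str:
    """Render a mask as rows of 1s and 0s."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


n = 4
ar = causal_mask(n)
bi = bidirectional_mask(n)

print("causal (AR):")
print(show(ar))
print("bidirectional:")
print(show(bi))

# Can the first token see the last one?
print("AR:", ar[0][n - 1], "| bidirectional:", bi[0][n - 1])
```

With the causal mask, the one-directional ordering is built into the architecture. With the bidirectional mask, the model has to learn it from data, which is the control problem the comment points to.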

▲ LarsDu88 3 hours ago

Google is working on a similar line of research. Wonder why they haven't rolled out a GPT-4o-scale version of this yet.

▲ vintermann 2 hours ago

Probably because it's expensive.

But I wish there were more "let's scale this thing to the skies" experiments from those who actually can afford to scale things to the skies.

▲ yorwba 1 hour ago

Scaling laws mean that there's not much need to actually scale things to the skies. Instead, you can run a bunch of experiments at small scale, fit the scaling law parameters, then extrapolate. If the predicted outcome is disappointing (e.g. it's unlikely to beat the previous scaled-to-the-sky model), you can save the really expensive experiment for a more promising approach.

It would certainly be nice, though, if this kind of negative result were published more often instead of leaving people to guess why a seemingly useful innovation wasn't adopted in the end.
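The fit-then-extrapolate workflow described above can be sketched in a few lines. This is an illustration under loud assumptions: the loss is taken to follow a pure power law, loss = a * N^(-b), with no irreducible-loss term, and the "small-scale runs" are synthetic numbers generated from made-up constants, not measurements from any real model.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size).
    Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return math.exp(intercept), -slope


def predict(a: float, b: float, size: float) -> float:
    """Extrapolate the fitted power law to a new model size."""
    return a * size ** (-b)


# Synthetic small-scale experiments (made-up constants, noise-free).
TRUE_A, TRUE_B = 20.0, 0.1
small_sizes = [1e7, 3e7, 1e8, 3e8, 1e9]
small_losses = [TRUE_A * n ** (-TRUE_B) for n in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)
predicted = predict(a, b, 1e11)  # extrapolate two orders of magnitude up

print(f"fitted a = {a:.4f}, b = {b:.4f}")
print(f"predicted loss at 1e11 params: {predicted:.4f}")
```

If the extrapolated loss does not beat the existing large model, the expensive run can be skipped. With real, noisy measurements the fit would of course carry uncertainty that this noise-free toy does not show.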

▲ hanifbbz 1 hour ago

Is this available as open source anywhere to try?

▲ LoganDark 59 minutes ago

Can't wait for the day I can actually try a diffusion model on my own machine (128GB M4 Max) rather than as a hosted service. So far I haven't seen a single piece of software that supports it.

▲ janalsncm 30 minutes ago

You can try it today. You can get them from Hugging Face. Here is an example:

https://huggingface.co/tencent/WeDLM-8B-Instruct

Diffusion isn’t natively supported in the transformers library yet so you have to use their custom inference code.

▲ refulgentis 3 hours ago

If this means there’s a 2x-7x speedup available to a scaled diffusion model like Inception Mercury, that’ll be a game changer. It feels 10x faster already…

▲ blurbleblurble 2 hours ago

Diffusion language models seem poised to smash purely autoregressive models. I'm giving it 1-2 years.

▲ impossiblefork 44 minutes ago

One appeal of it is for RL. If it ends up being a lot faster for generation, you'll be able to do a lot more RL.

If people can make RL scalable-- make it so that RL isn't just a final phase, but something which is as big as the supervised stuff, then diffusion models are going to have an advantage.

If not, I think autoregressive models will still be preferred. Diffusion models become fixed very fast; they can't actually refine their outputs, so we're not talking about some kind of refinement along the lines of: initial idea -> better idea -> something actually sound.

▲ meatmanek 2 hours ago

Feels like the sodium ion battery vs lithium ion battery thing, where there are theoretical benefits of one but the other has such a head start on commercialization that it'll take a long time to catch up.

▲ LarsDu88 17 minutes ago

Not really. Unlike with physical goods like batteries, the hardware for training a diffusion vs an autoregressive language model is more or less exactly the same.

That said, the lab that did this research (Chris Ré and Tri Dao are involved) is run by the world's experts in squeezing every last drop of performance out of CUDA and Nvidia hardware.

At the API level, the primary differences will be the addition of text infill capabilities for language generation. I also somewhat expect certain types of generation to be more cohesive (e.g. comedy or stories where you need to think of the punchline or ending first!)

▲ sroussey 1 hour ago

Same with digital vs analog