Especially with Gemini Pro, when providing long-form textual references, putting many documents in a single context window gives worse answers than having the model summarize each document first, asking the question against the summaries only, and then providing the full text of a sub-document on request (RAG-style, or just a simple agent loop).
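A minimal sketch of that summarize-first loop, assuming a hypothetical llm() helper wrapping whatever chat API you use (the function names and prompts are illustrative, not any particular SDK):

    def llm(prompt: str) -> str:
        # Placeholder for your model call (Gemini, Claude, etc.).
        raise NotImplementedError

    def answer_with_summaries(question: str, docs: dict[str, str]) -> str:
        # Pass 1: summarize each document individually, keeping each context small.
        summaries = {
            name: llm(f"Summarize this document in a few sentences:\n\n{text}")
            for name, text in docs.items()
        }
        catalog = "\n".join(f"- {name}: {s}" for name, s in summaries.items())

        # Pass 2: ask the question against the summaries only, and let the
        # model name which documents it actually needs in full.
        wanted = llm(
            f"Question: {question}\n\nDocument summaries:\n{catalog}\n\n"
            "List the document names (comma-separated) you need in full."
        )
        picked = [n.strip() for n in wanted.split(",") if n.strip() in docs]

        # Pass 3: answer with only the requested full texts in context.
        full = "\n\n".join(f"## {n}\n{docs[n]}" for n in picked)
        return llm(f"Question: {question}\n\nRelevant documents:\n{full}")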
Similarly, I've personally noticed that Claude Code with Opus or Sonnet gets worse the more compactions happen. It's unclear to me whether the summary itself degrades, or whether the context window ends up with a higher percentage of less relevant data, but even clearing the context and asking it to re-read the relevant files (even if they were mentioned and summarized in the compaction) gives better results.
Gemini loses coherence and reasoning ability well before the chat hits the context limit, and according to this report, it is the best model on several dimensions.
Long story short: Context engineering is still king, RAG is not dead
Yep, it can decohere really badly with bigger contexts. It's not only context-length related, though: sometimes it loses focus early on in a way that makes it impossible to get it back on track.
Have you tried NotebookLM? It basically does this as an app in the background (chunking and summarizing many docs), and you can chat with the full corpus using RAG.
This effect is well known but not well documented so far, so great job here.
It's actually even more significant than it's possible to benchmark easily (though I'm glad this paper has done so).
Truly useful LLM applications live at the boundaries of what the model can do. That is, attending to some aspect of the context that might be several logical "hops" away from the actual question or task.
I suspect that the context rot problem gets much worse for these more complex tasks; in fact, exponentially so for each logical "hop" required to answer successfully. Each hop compounds the "attention difficulty" that long or distracting contexts introduce.
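As a back-of-the-envelope illustration of the compounding (the accuracy numbers are made up, purely for illustration): if long context drops per-hop retrieval accuracy from, say, 0.95 to 0.80, a multi-hop task decays roughly like p^h:

    # Toy model: overall success ~= per_hop_accuracy ** hops.
    # The accuracy figures are invented for illustration only.
    for p in (0.95, 0.80):
        for hops in (1, 2, 4):
            print(f"p={p:.2f} hops={hops} -> {p ** hops:.2f}")
    # p=0.95: 0.95, 0.90, 0.81
    # p=0.80: 0.80, 0.64, 0.41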
Chroma does vector, full-text, and regex search. And it's designed for the multitenant workloads typical of AI applications. So, not just a "vectorDB company".
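For reference, a minimal sketch of mixing vector search with a full-text filter in Chroma's Python client (a sketch against the current API; the collection name and documents here are illustrative, and operator support can vary by version):

    import chromadb

    client = chromadb.Client()  # in-memory; use PersistentClient/HttpClient in practice
    docs = client.get_or_create_collection("docs")

    docs.add(
        ids=["a", "b"],
        documents=[
            "Context rot: LLM accuracy degrades as input length grows.",
            "RAG pipelines retrieve only the passages relevant to a query.",
        ],
    )

    # Vector search narrowed by a full-text filter on the document body.
    results = docs.query(
        query_texts=["why do long prompts hurt accuracy?"],
        n_results=1,
        where_document={"$contains": "context"},
    )
    print(results["documents"])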
LLMs will need RAG one way or another; you can hide it from the user, but it still must be there.
That's 99% of coders. No need to gatekeep.
Media literacy disclaimer: Chroma is a vectorDB company.