João Freitas

The following post discusses why AI loves to index all data and why indexing too much data may hurt the trained Large Language Model (LLM).

Many AI products today are focused on indexing as much as possible. Every meeting, every document, every moment of your day. Every modality — images, audio, and text. Devices that are meant to capture your every moment.

Then they run every data point through a complex pipeline of vector searches, heuristics, draft models, large models, and more to make sense of it. Models are trained to take in ever-increasing context lengths so they can fit in as many documents and pieces of information as possible.

But more information isn’t always better. Here are the limits of the ‘index everything’ approach.

Index size is a trade-off against retrieval quality. A larger index can capture more information, but it also increases the risk of false positives in retrieval. Google was lucky enough to get started in a world where the index size was relatively small and the bar for retrieval quality was already low.
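To make the false-positive risk concrete, here is a toy sketch. It assumes a deliberately naive retriever that returns any document sharing at least one word with the query; the documents, the query, and the function names are all invented for illustration and do not come from any real system.

```python
def tokens(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())


def retrieve(query, index, min_overlap=1):
    """Return every document sharing at least `min_overlap` words with the query."""
    q = tokens(query)
    return [doc for doc in index if len(q & tokens(doc)) >= min_overlap]


def precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


# Hypothetical documents: the two in `relevant` are the ones we want back.
relevant = {
    "fixing a python memory leak",
    "python memory leak in a worker process",
}
small_index = sorted(relevant) + ["quarterly sales report"]
# The larger index adds documents that share single words with the query.
large_index = small_index + [
    "python snake habitat",
    "memory foam mattress review",
    "kitchen pipe leak repair",
]

query = "python memory leak"
small_precision = precision(retrieve(query, small_index), relevant)
large_precision = precision(retrieve(query, large_index), relevant)
print(small_precision, large_precision)
```

The retriever is identical in both runs. Only the index grew, and the share of useful results fell with it.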

Each modality is hard enough. Searching websites with text is a hard enough problem for Google to solve. Searching images by text is harder. Searching images by images (reverse image search) is even harder. Text-to-speech search is another layer of UX and technical problems.

Irrelevant information does more harm than good. Just because models can handle larger context lengths doesn’t mean that they keep the same level of performance. Benchmarks are still being developed, but it looks like larger contexts see degraded performance, especially in the middle of the context. LLMs are easily led astray by irrelevant context.
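One way to act on this is to keep only the most relevant pieces before building a prompt, rather than filling the context window. The sketch below assumes each retrieved chunk already comes with a relevance score between 0 and 1. The function name, the threshold, and the cap are hypothetical defaults, not recommendations.

```python
def select_context(scored_chunks, min_score=0.5, max_chunks=3):
    """Keep the highest-scoring chunks that clear a relevance threshold.

    `scored_chunks` is a list of (score, text) pairs. Both `min_score`
    and `max_chunks` are illustrative values, not tuned ones.
    """
    relevant = [(score, text) for score, text in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in relevant[:max_chunks]]


# Hypothetical retrieval results for a question about a refund policy.
chunks = [
    (0.91, "Refunds are issued within 14 days of purchase."),
    (0.12, "Our office is closed on public holidays."),
    (0.78, "Refund requests must include the order number."),
    (0.30, "The company was founded in 2009."),
]
context = select_context(chunks)
print(context)
```

Dropping the low-scoring chunks makes the prompt shorter and leaves the model fewer chances to latch onto something unrelated.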

Indexing everything turns all problems into one difficult problem. LLMs can answer complex subjective questions but struggle with math problems. When you have a hammer, everything looks like a nail. Indexing everything lets us skip the essential task of asking if we can simplify the problem. Sometimes, it’s simpler to just use a calculator.
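As a minimal sketch of the calculator idea: check whether the input is plain arithmetic and compute it directly, and only fall back to a model otherwise. The `ask_llm` callable is a hypothetical stand-in for whatever model call you use. The arithmetic check is an assumption about what counts as a "simple" problem and covers only numbers and the four basic operators.

```python
import ast
import operator

# Operators the calculator path is willing to evaluate.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval(node):
    """Evaluate an AST node made only of numbers and basic operators."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    raise ValueError("not plain arithmetic")


def try_calculator(question):
    """Return the numeric result if `question` is plain arithmetic, else None."""
    try:
        tree = ast.parse(question.strip(), mode="eval")
        return _eval(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None


def answer(question, ask_llm):
    """Use the calculator when possible; otherwise defer to `ask_llm`."""
    result = try_calculator(question)
    if result is not None:
        return str(result)
    return ask_llm(question)


# `fake_llm` is a placeholder so the example runs without a real model.
fake_llm = lambda q: "(model answer)"
print(answer("12 * (3 + 4)", fake_llm))
print(answer("Why is the sky blue?", fake_llm))
```

The routing check is cheap and deterministic, so the hard, general-purpose path only runs when the simple one does not apply.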

Indexing everything isn’t a bad approach (inventor’s paradox), but it’s an extremely difficult problem. We’re still trying to figure out the targeted solutions with the latest AI.

#reads #matt rickard #ai #llm #indexing