RAG isn't failing you. You're failing RAG.
Two months ago someone sent me their "RAG-powered AI assistant" and asked why it kept hallucinating.
They proudly told me they used "state-of-the-art embeddings" and "a powerful vector database".
Turned out they were chunking 40-page PDFs into 200-character blobs and storing them without metadata.
Retrieval was basically roulette with extra steps.
This happens a lot. People treat RAG like a checkbox: "Yeah, we added RAG, so it won't hallucinate."
That's not how this works. Retrieval-augmented generation is not a magic patch.
It's a very picky pipeline that punishes laziness at every step: ingestion, retrieval, and prompting.
I build agent systems all day, and I'll tell you the same thing I tell product teams:
if your retrieval is trash, your "agent" is just an expensive autocomplete with vibes.
What RAG actually is (not the LinkedIn version)
RAG is simple on paper:
- You store your own data somewhere (usually a vector DB).
- You embed user queries and documents into vectors.
- You pull the nearest stuff and feed it to the model.
That's it. The whole trick is: "Give the model relevant context at the right time."
The problem is that every one of those words hides a mess:
- "Relevant": depends on your task, user, and level of detail.
- "Context": what format, how much, what metadata, which source?
- "Right time": do you retrieve once, multiple times, or in a loop?
So when someone tells me "we built RAG with Pinecone and OpenAI," that's like saying
"we built a car with metal and gasoline." Cool. Does it turn? Stop? Explode?
The three places people usually screw up RAG
Let's go through the most common self-inflicted wounds I see when debugging other people's systems.
1. Chunking like a maniac
If your chunks are wrong, everything downstream suffers. Most people either:
- Make chunks too small: model sees fragments with no context, or
- Make chunks too big: you store half a chapter and then can't fit enough of them in context.
I recently saw a team split a 120-page policy document into fixed-size 256-token chunks.
No overlap, no respect for headings, nothing. When we logged retrieval for the query:
"What's our refund policy for annual enterprise contracts?" we got:
- One chunk mentioning "refunds", but for consumer accounts
- Another chunk about "enterprise contracts", but from a different section
- Nothing that actually described the intersection of those two
So the model stitched them together and confidently invented a hybrid policy that didn't exist.
That's not "hallucination". That's you feeding it mismatched puzzle pieces.
Better pattern:
- Chunk by structure when possible: sections, headings, bullet groups.
- Add overlap (like 10–20%) so boundaries don't slice meaning in half.
- Store metadata: section title, doc type, date, version, source.
Tools like langchain-text-splitters or llama-index can help, but they won't think for you.
You still have to decide what a "unit of meaning" is in your domain.
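The structural chunking pattern above can be sketched in a few lines. Everything here (the function name, the 1,500-character cap, the metadata fields) is illustrative, not taken from any particular library:

```python
# Sketch: split a markdown-ish document on headings, then add overlap.
# Section titles travel with each chunk as metadata.

def chunk_by_headings(text, max_chars=1500, overlap=0.15):
    """Split on headings, then break oversized sections with overlap."""
    sections = []
    section_title = "untitled"
    buf = []
    for line in text.splitlines():
        if line.startswith("#"):  # a new section begins here
            if buf:
                sections.append({"title": section_title, "text": "\n".join(buf)})
                buf = []
            section_title = line.lstrip("# ").strip()
        buf.append(line)
    if buf:
        sections.append({"title": section_title, "text": "\n".join(buf)})

    # Second pass: split oversized sections with 10-20% overlap so that
    # chunk boundaries don't slice meaning in half.
    chunks = []
    step = int(max_chars * (1 - overlap))
    for s in sections:
        t = s["text"]
        for start in range(0, max(len(t), 1), step):
            chunks.append({"title": s["title"], "text": t[start:start + max_chars]})
            if start + max_chars >= len(t):
                break
    return chunks
```

In a real pipeline you would also carry doc type, date, version, and source in each chunk dict, and split on your domain's actual structure (clauses, tickets, functions) rather than markdown headings.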
2. "Just use embeddings" as a retrieval strategy
Another favorite: people shove everything into a vector DB and call it done.
No filters. No keyword backup. No reranking. Then they complain that embeddings "don't work for code"
or "don't work for support tickets".
I worked on a support assistant where we benchmarked different setups on 500 real Zendesk tickets.
Embeddings-only retrieval (OpenAI text-embedding-3-large, top-k=5) gave us the right article in the top 5 about 62% of the time.
Adding a simple term-based search (BM25) and then reranking with a cross-encoder pushed that to 84%.
Same content. Same model. Just better retrieval logic.
You don't need a fancy stack to do this:
- Use a hybrid index (most modern search backends support this: Elastic, OpenSearch, Vespa, etc.).
- Filter by metadata first: product, version, language.
- Rerank your top 20–50 candidates with a cross-encoder or an LLM "choose the top 5" step.
And for the love of all that is sane, log your queries and retrieved docs.
Don't just stare at some aggregate relevance score. Look at actual examples where it fails.
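One minimal way to implement the hybrid pattern is reciprocal rank fusion (RRF): merge the two rankings, then rerank the fused pool. In this sketch the search and rerank callables are stand-ins for your real vector index, BM25 backend, and cross-encoder:

```python
# Sketch: hybrid retrieval via reciprocal rank fusion, then a rerank step.
# vector_search / keyword_search / rerank are placeholders for real backends.

def rrf_fuse(vector_ranking, keyword_ranking, k=60):
    """Fuse two ranked lists of doc ids with reciprocal rank fusion."""
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc_id in enumerate(ranking):
            # Documents that rank well in either list accumulate score;
            # documents in both lists win.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query, vector_search, keyword_search, rerank, top_k=5, pool=50):
    """Pull a wide candidate pool from both indexes, fuse, then rerank."""
    fused = rrf_fuse(vector_search(query, pool), keyword_search(query, pool))
    return rerank(query, fused[:pool])[:top_k]
```

Metadata filtering (product, version, language) belongs inside the two search calls, before fusion, so you never waste reranker budget on documents that could not be the answer.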
3. Prompting like the model is psychic
Your prompt is part of the RAG system. If the model doesn't understand how to use the context,
it'll happily ignore everything and make stuff up.
Good RAG prompts do a few boring but critical things:
- Explain what the context is and where it came from.
- Tell the model to quote or reference items in the context, not its own training.
- Define what to do when the answer isn't in the context (say "I don't know").
- Give explicit formatting rules (like JSON schemas or bullet points).
Here's a simplified version of a system prompt we shipped to production in January 2025 for a compliance chatbot:
- "You answer questions only using the provided documents."
- "If you can't answer from them, say you don't know and suggest what document might help."
- "Always cite the document title and section ID in parentheses."
This single change cut "confident nonsense" answers by about 40% in our manual evals.
Same retrieval stack. Just better instructions.
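Wiring instructions like those together with retrieved chunks is mundane but worth getting right. A sketch of the assembly step; the `title` and `section` fields are assumptions about your chunk metadata, not any particular library's schema:

```python
# Sketch: assemble a grounded RAG prompt from retrieved chunk dicts.
# Each doc dict is assumed to carry "title", "section", and "text".

def build_prompt(question, docs):
    """Return (system_prompt, user_prompt) for a grounded RAG call."""
    context = "\n\n".join(
        f"[{d['title']} / {d['section']}]\n{d['text']}" for d in docs
    )
    system = (
        "You answer questions only using the provided documents. "
        "If you can't answer from them, say you don't know and suggest "
        "what document might help. "
        "Always cite the document title and section ID in parentheses."
    )
    user = f"Documents:\n{context}\n\nQuestion: {question}"
    return system, user
```

Labeling each chunk with its title and section ID in the context is what makes the "always cite" instruction enforceable: the model can only cite what you show it.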
RAG plus agents: where it gets fun and fragile
Agents make this more interesting because now you're not just answering one question.
You're orchestrating multiple steps: search, read, decide, act, maybe search again.
People keep wiring tools like:
- Tool 1: "search_kb(query) → top 5 docs"
- Tool 2: "call_api(payload) → result"
…and then expecting the agent to magically know when to search again or refine the query.
Spoiler: it doesn't. It just follows patterns.
Two simple fixes improve agent + RAG setups a lot:
- Expose uncertainty. Let the agent see retrieval scores, not just raw text.
- Give it a "refine_search" tool: a tool whose purpose is literally "rewrite the query if the results aren't good."
When we added both in a CRM assistant system in late 2024, task success on multi-hop queries
(like "Find all accounts where we violated the SLA in the last 90 days and summarize patterns")
jumped from 47% to 71%. Same model. Same documents. Better agent tooling around RAG.
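A rough sketch of both fixes together; the tool names, the 0.5 score threshold, and the index interface are all illustrative, not from our actual system:

```python
# Sketch: surface retrieval scores to the agent, and give it an explicit
# query-refinement step when the best hit looks weak.

def search_kb(query, index):
    """Return docs WITH scores, so the agent can judge retrieval quality."""
    hits = index.search(query, top_k=5)
    return [{"text": h.text, "score": round(h.score, 3)} for h in hits]

def agent_step(query, index, rewrite_query, max_refinements=2):
    """Retry with a rewritten query while the top hit looks unreliable."""
    hits = []
    for _ in range(max_refinements + 1):
        hits = search_kb(query, index)
        if hits and hits[0]["score"] >= 0.5:  # illustrative threshold
            return hits
        # The "refine_search" move: rewrite the query given the weak results.
        query = rewrite_query(query, hits)
    return hits
```

In a real agent, `rewrite_query` is itself an LLM call that sees the original query plus the low-scoring results, and the loop lives inside the agent's tool-use trajectory rather than plain Python; the shape of the logic is the same.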
Stop guessing. Start evaluating.
The number one difference between teams that get RAG working and teams that don't is boring:
the good teams evaluate. Constantly.
At minimum you want:
- Retrieval evals: for a set of queries, did we retrieve anything that actually contains the answer? Annotate 50–100 examples by hand. Yes, by hand. You're building a system, not vibes.
- Answer quality evals: is the final answer correct, grounded in the context, and complete? You can use another model as a judge, but spot-check with humans.
- Hallucination checks: did the answer include claims not supported by the context?
Don't chase "perfect." Chase "we know exactly how and where it fails."
Once you have that, fixing RAG becomes engineering work instead of witchcraft.
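The retrieval eval, for instance, can be a ten-line script over your hand-annotated examples. The field names here are assumptions about how you store annotations:

```python
# Sketch: hit rate for retrieval over hand-annotated (query, answer doc) pairs.
# Each example is assumed to look like {"query": ..., "answer_doc_id": ...},
# and retrieve(query, top_k) is assumed to return dicts with an "id" field.

def retrieval_hit_rate(examples, retrieve, top_k=5):
    """Fraction of queries whose annotated answer doc appears in the top k."""
    hits = 0
    for ex in examples:
        retrieved_ids = [d["id"] for d in retrieve(ex["query"], top_k)]
        hits += ex["answer_doc_id"] in retrieved_ids
    return hits / len(examples)
```

Run it on every retrieval change (chunking, hybrid search, reranking) and keep the per-example failures, not just the aggregate number; the failures are where the engineering work lives.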
FAQ
How many chunks should I retrieve for each query?
Start with 5–10 and measure. If your chunks are smaller (like 200–400 tokens), you can go higher.
Watch your context window: you want room for instructions and the user query, not just a wall of retrieved text.
Which embedding model and vector DB should I use?
Use something boring and well-supported first. OpenAI's text-embedding-3-small or
Cohere's embed-english-v3.0 are fine for most apps.
For storage, anything that can do vector + metadata filters + hybrid search (Elastic, Weaviate, Qdrant, etc.) is good enough.
Your retrieval logic matters way more than the logo on the box.
Can I skip RAG and just fine-tune the model on my docs?
You can, but for most business cases it's a bad trade. Fine-tuning is static and painful to update.
RAG lets you change answers by updating documents, not weights.
I only recommend fine-tuning for style, formatting, or when your domain is insanely niche and relatively stable.