RAG isn't failing you. You're failing RAG.
Two months ago someone sent me their "RAG-powered AI assistant" and asked why it kept hallucinating.
They proudly told me they used "state-of-the-art embeddings" and "a powerful vector database".
Turned out they were chunking 40-page PDFs into 200-character blobs and storing them without metadata.
Retrieval was basically roulette with extra steps.
This happens a lot. People treat RAG like a checkbox: "Yeah, we added RAG, so it won't hallucinate."
That's not how this works. Retrieval-augmented generation is not a magic patch.
It's a very picky pipeline that punishes laziness at every step: ingestion, retrieval, and prompting.
I build agent systems all day, and I'll tell you the same thing I tell product teams:
if your retrieval is trash, your "agent" is just an expensive autocomplete with vibes.
What RAG actually is (not the LinkedIn version)
RAG is simple on paper:
- You store your own data somewhere (usually a vector DB).
- You embed user queries and documents into vectors.
- You pull the nearest stuff and feed it to the model.
That's it. The whole trick is: "Give the model relevant context at the right time."
The problem is that every one of those words hides a mess:
- "Relevant": depends on your task, user, and level of detail.
- "Context": what format, how much, what metadata, which source?
- "Right time": do you retrieve once, multiple times, or in a loop?
So when someone tells me "we built RAG with Pinecone and OpenAI," that's like saying
"we built a car with metal and gasoline." Cool. Does it turn? Stop? Explode?
The three places people usually screw up RAG
Let's go through the most common self-inflicted wounds I see when debugging other people's systems.
1. Chunking like a maniac
If your chunks are wrong, everything downstream suffers. Most people either:
- Make chunks too small: model sees fragments with no context, or
- Make chunks too big: you store half a chapter and then can't fit enough of them in context.
I recently saw a team split a 120-page policy document into fixed-size 256-token chunks.
No overlap, no respect for headings, nothing. When we logged retrieval for the query:
"What's our refund policy for annual enterprise contracts?" we got:
- One chunk mentioning "refunds", but for consumer accounts
- Another chunk about "enterprise contracts", but from a different section
- Nothing that actually described the intersection of those two
So the model stitched them together and confidently invented a hybrid policy that didn't exist.
That's not "hallucination". That's you feeding it mismatched puzzle pieces.
Better pattern:
- Chunk by structure when possible: sections, headings, bullet groups.
- Add overlap (like 10–20%) so boundaries don't slice meaning in half.
- Store metadata: section title, doc type, date, version, source.
Tools like langchain-text-splitters or llama-index can help, but they won't think for you.
You still have to decide what a "unit of meaning" is in your domain.
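The structural chunking pattern above can be sketched in a few lines. Everything here (the function name, the 1,500-character cap, the metadata fields) is illustrative, not taken from any particular library:

```python
# Sketch: split a markdown-ish document on headings, then add overlap.
# Section titles travel with each chunk as metadata.

def chunk_by_headings(text, max_chars=1500, overlap=0.15):
    """Split on headings, then break oversized sections with overlap."""
    sections = []
    section_title = "untitled"
    buf = []
    for line in text.splitlines():
        if line.startswith("#"):  # a new section begins here
            if buf:
                sections.append({"title": section_title, "text": "\n".join(buf)})
                buf = []
            section_title = line.lstrip("# ").strip()
        buf.append(line)
    if buf:
        sections.append({"title": section_title, "text": "\n".join(buf)})

    # Second pass: split oversized sections with 10-20% overlap so that
    # chunk boundaries don't slice meaning in half.
    chunks = []
    step = int(max_chars * (1 - overlap))
    for s in sections:
        t = s["text"]
        for start in range(0, max(len(t), 1), step):
            chunks.append({"title": s["title"], "text": t[start:start + max_chars]})
            if start + max_chars >= len(t):
                break
    return chunks
```

In a real pipeline you would also carry doc type, date, version, and source in each chunk dict, and split on your domain's actual structure (clauses, tickets, functions) rather than markdown headings.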
2. "Just use embeddings" as a retrieval strategy
Another favorite: people shove everything into a vector DB and call it done.
No filters. No keyword backup. No reranking. Then they complain that embeddings "don't work for code"
or "don't work for support tickets".
I worked on a support assistant where we benchmarked different setups on 500 real Zendesk tickets.
Embeddings-only retrieval (OpenAI text-embedding-3-large, top-k=5) gave us the right article in the top 5 about 62% of the time.
Adding a simple term-based search (BM25) and then reranking with a cross-encoder pushed that to 84%.
Same content. Same model. Just better retrieval logic.
You don't need a fancy stack to do this:
- Use a hybrid index (most modern search backends support this: Elastic, OpenSearch, Vespa, etc.).
- Filter by metadata first: product, version, language.
- Rerank your top 20–50 candidates with a cross-encoder or an LLM "choose the top 5" step.
And for the love of all that is sane, log your queries and retrieved docs.
Don't just stare at some aggregate relevance score. Look at actual examples where it fails.
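One minimal way to implement the hybrid pattern is reciprocal rank fusion (RRF): merge the two rankings, then rerank the fused pool. In this sketch the search and rerank callables are stand-ins for your real vector index, BM25 backend, and cross-encoder:

```python
# Sketch: hybrid retrieval via reciprocal rank fusion, then a rerank step.
# vector_search / keyword_search / rerank are placeholders for real backends.

def rrf_fuse(vector_ranking, keyword_ranking, k=60):
    """Fuse two ranked lists of doc ids with reciprocal rank fusion."""
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc_id in enumerate(ranking):
            # Documents that rank well in either list accumulate score;
            # documents in both lists win.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query, vector_search, keyword_search, rerank, top_k=5, pool=50):
    """Pull a wide candidate pool from both indexes, fuse, then rerank."""
    fused = rrf_fuse(vector_search(query, pool), keyword_search(query, pool))
    return rerank(query, fused[:pool])[:top_k]
```

Metadata filtering (product, version, language) belongs inside the two search calls, before fusion, so you never waste reranker budget on documents that could not be the answer.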
3. Prompting like the model is psychic
Your prompt is part of the RAG system. If the model doesn't understand how to use the context,
it'll happily ignore everything and make stuff up.
Good RAG prompts do a few boring but critical things:
- Explain what the context is and where it came from.
- Tell the model to quote or reference items in the context, not its own training.
- Define what to do when the answer isn't in the context (say "I don't know").
- Give explicit formatting rules (like JSON schemas or bullet points).
Here's a simplified version of a system prompt we shipped to production in January 2025 for a compliance chatbot:
- "You answer questions only using the provided documents."
- "If you can't answer from them, say you don't know and suggest what document might help."
- "Always cite the document title and section ID in parentheses."
This single change cut "confident nonsense" answers by about 40% in our manual evals.
Same retrieval stack. Just better instructions.
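Wiring instructions like those together with retrieved chunks is mundane but worth getting right. A sketch of the assembly step; the `title` and `section` fields are assumptions about your chunk metadata, not any particular library's schema:

```python
# Sketch: assemble a grounded RAG prompt from retrieved chunk dicts.
# Each doc dict is assumed to carry "title", "section", and "text".

def build_prompt(question, docs):
    """Return (system_prompt, user_prompt) for a grounded RAG call."""
    context = "\n\n".join(
        f"[{d['title']} / {d['section']}]\n{d['text']}" for d in docs
    )
    system = (
        "You answer questions only using the provided documents. "
        "If you can't answer from them, say you don't know and suggest "
        "what document might help. "
        "Always cite the document title and section ID in parentheses."
    )
    user = f"Documents:\n{context}\n\nQuestion: {question}"
    return system, user
```

Labeling each chunk with its title and section ID in the context is what makes the "always cite" instruction enforceable: the model can only cite what you show it.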
RAG plus agents: where it gets fun and fragile
Agents make this more interesting because now you're not just answering one question.
You're orchestrating multiple steps: search, read, decide, act, maybe search again.
People keep wiring tools like:
- Tool 1: "search_kb(query) → top 5 docs"
- Tool 2: "call_api(payload) → result"
…and then expecting the agent to magically know when to search again or refine the query.
Spoiler: it doesn't. It just follows patterns.
Two simple fixes improve agent + RAG setups a lot:
- Expose uncertainty. Let the agent see retrieval scores, not just raw text.
- Give it a "refine_search" tool: a tool whose purpose is literally "rewrite the query if the results aren't good."
When we added both in a CRM assistant system in late 2024, task success on multi-hop queries
(like "Find all accounts where we violated the SLA in the last 90 days and summarize patterns")
jumped from 47% to 71%. Same model. Same documents. Better agent tooling around RAG.
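A rough sketch of both fixes together; the tool names, the 0.5 score threshold, and the index interface are all illustrative, not from our actual system:

```python
# Sketch: surface retrieval scores to the agent, and give it an explicit
# query-refinement step when the best hit looks weak.

def search_kb(query, index):
    """Return docs WITH scores, so the agent can judge retrieval quality."""
    hits = index.search(query, top_k=5)
    return [{"text": h.text, "score": round(h.score, 3)} for h in hits]

def agent_step(query, index, rewrite_query, max_refinements=2):
    """Retry with a rewritten query while the top hit looks unreliable."""
    hits = []
    for _ in range(max_refinements + 1):
        hits = search_kb(query, index)
        if hits and hits[0]["score"] >= 0.5:  # illustrative threshold
            return hits
        # The "refine_search" move: rewrite the query given the weak results.
        query = rewrite_query(query, hits)
    return hits
```

In a real agent, `rewrite_query` is itself an LLM call that sees the original query plus the low-scoring results, and the loop lives inside the agent's tool-use trajectory rather than plain Python; the shape of the logic is the same.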
Stop guessing. Start evaluating.
The number one difference between teams that get RAG working and teams that don't is boring:
the good teams evaluate. Constantly.
At minimum you want:
- Retrieval evals: for a set of queries, did we retrieve anything that actually contains the answer? Annotate 50–100 examples by hand. Yes, by hand. You're building a system, not vibes.
- Answer quality evals: is the final answer correct, grounded in the context, and complete? You can use another model as a judge, but spot-check with humans.
- Hallucination checks: did the answer include claims not supported by the context?
Don't chase "perfect." Chase "we know exactly how and where it fails."
Once you have that, fixing RAG becomes engineering work instead of witchcraft.
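The retrieval eval, for instance, can be a ten-line script over your hand-annotated examples. The field names here are assumptions about how you store annotations:

```python
# Sketch: hit rate for retrieval over hand-annotated (query, answer doc) pairs.
# Each example is assumed to look like {"query": ..., "answer_doc_id": ...},
# and retrieve(query, top_k) is assumed to return dicts with an "id" field.

def retrieval_hit_rate(examples, retrieve, top_k=5):
    """Fraction of queries whose annotated answer doc appears in the top k."""
    hits = 0
    for ex in examples:
        retrieved_ids = [d["id"] for d in retrieve(ex["query"], top_k)]
        hits += ex["answer_doc_id"] in retrieved_ids
    return hits / len(examples)
```

Run it on every retrieval change (chunking, hybrid search, reranking) and keep the per-example failures, not just the aggregate number; the failures are where the engineering work lives.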
FAQ
How many chunks should I retrieve for each query?
Start with 5–10 and measure. If your chunks are smaller (like 200–400 tokens), you can go higher.
Watch your context window: you want room for instructions and the user query, not just a wall of retrieved text.
Which embedding model and vector DB should I use?
Use something boring and well-supported first. OpenAI's text-embedding-3-small or
Cohere's embed-english-v3.0 are fine for most apps.
For storage, anything that can do vector + metadata filters + hybrid search (Elastic, Weaviate, Qdrant, etc.) is good enough.
Your retrieval logic matters way more than the logo on the box.
Can I skip RAG and just fine-tune the model on my docs?
You can, but for most business cases it's a bad trade. Fine-tuning is static and painful to update.
RAG lets you change answers by updating documents, not weights.
I only recommend fine-tuning for style, formatting, or when your domain is insanely niche and relatively stable.