How to Build a RAG Pipeline with FAISS
We’re building a RAG pipeline that actually handles messy PDFs, not the clean-text demos you see everywhere.
Prerequisites
- Python 3.11+
- pip install "langchain>=0.2.0"
- pip install langchain-openai
- pip install faiss-cpu
- pip install PyPDF2
- pip install transformers
Step 1: Setting Up Your Environment
# First, make sure your packages are installed
pip install "langchain>=0.2.0"
pip install langchain-openai
pip install faiss-cpu
pip install PyPDF2
pip install transformers
Why does this matter? Pinning compatible versions up front saves you from debugging mismatched dependencies later, and that is a nightmare; trust me, I’ve been there. Stick to Python 3.11+ and the versions above. One shell gotcha: quote specifiers like "langchain>=0.2.0", or the shell treats the > as output redirection and silently installs the wrong thing.
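Before going further, it's worth confirming everything is importable. Here's a minimal sketch of a preflight check; the helper name `missing_packages` is mine, and it uses `importlib.util.find_spec` so nothing heavy actually gets imported:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module names that aren't importable."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Note the module names differ from the pip package names in one case:
# faiss-cpu installs the module "faiss".
required = ["langchain", "faiss", "PyPDF2", "transformers"]
print(missing_packages(required))  # an empty list means you're good to go
```

Run this once after `pip install`; anything it prints is a package you still need.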
Step 2: Load Your PDF Data
import PyPDF2

def load_pdfs(pdf_paths):
    text = ""
    for path in pdf_paths:
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                # extract_text() can come back empty on image-only pages
                text += page.extract_text() or ""
    return text
# Example of loading multiple PDFs
pdf_paths = ["document1.pdf", "document2.pdf"]
pdf_data = load_pdfs(pdf_paths)
You need to handle PDFs, and they’re notoriously messy. This code loads text from each page of every PDF. Got a PDF that won’t let you extract? That’s because not all PDFs are created equal: image-only pages come back with no text, and encrypted or damaged files can raise exceptions mid-loop. Keep an eye out for both.
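If you want the loop to survive damaged pages instead of crashing, you can wrap extraction in a small helper. This is a sketch under my own naming (`extract_pages` is not part of PyPDF2); it substitutes an empty string whenever a page yields nothing or raises:

```python
def extract_pages(pages):
    """Pull text from an iterable of page objects, tolerating bad pages.

    Image-only pages can yield None or "", and damaged pages can raise;
    either way we substitute an empty string instead of crashing.
    """
    chunks = []
    for page in pages:
        try:
            chunks.append(page.extract_text() or "")
        except Exception:
            chunks.append("")
    return "\n".join(chunks)
```

Inside `load_pdfs`, you'd then write `text += extract_pages(reader.pages)` instead of the inner page loop.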
Step 3: Prepare Your Data for FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
data_vectors = embeddings.embed_documents([pdf_data])
Here’s where the magic begins. You’re converting the PDF’s textual content into embeddings for FAISS. This step is crucial because FAISS works with numerical vectors, not raw text. If you hit an error here, check that your PDF text isn’t empty after extraction; you might need to clean it up first.
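One more thing worth flagging: the snippet above embeds the entire corpus as a single vector, which means every query retrieves the whole blob. Retrieval works much better over smaller pieces. Here's a minimal sliding-window chunker; the function and its default sizes are my own choices, tune them for your documents:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character windows for embedding.

    Overlap keeps a sentence that straddles a boundary visible in
    both neighboring chunks, so retrieval doesn't lose it.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]

# Embed per chunk instead of one giant string:
# chunks = chunk_text(pdf_data)
# data_vectors = embeddings.embed_documents(chunks)
```

Character-based splitting is crude but dependency-free; token-aware splitters give better boundaries if you need them.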
Step 4: Building the FAISS Index
import faiss
import numpy as np
def build_faiss_index(vectors):
    # embed_documents returns a list of lists, so convert it first
    vectors = np.array(vectors, dtype=np.float32)
    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    return index

index = build_faiss_index(data_vectors)
This step creates an index from your embeddings. The FAISS IndexFlatL2 variant is a simple, exact nearest-neighbor index. It’s fast at this scale, but it keeps every vector in RAM, so watch your memory as the dataset grows. At that point you’d reach for more advanced index types such as IndexIVFFlat, but that’s another story.
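Because IndexFlatL2 stores every vector verbatim as float32, you can estimate its RAM footprint before you build it. A rough back-of-envelope helper (my own naming; it ignores FAISS's small fixed overhead), using 1536 dimensions as in OpenAI's common embedding models:

```python
def flat_index_bytes(num_vectors, dimension):
    """Approximate RAM used by IndexFlatL2: 4 bytes per float32 component."""
    return num_vectors * dimension * 4

# 1M vectors of 1536-dim embeddings is roughly 6 GB:
gib = flat_index_bytes(1_000_000, 1536) / 2**30
print(f"{gib:.2f} GiB")  # ~5.72 GiB
```

If that number exceeds your machine's RAM, that's your cue to look at compressed index types.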
Step 5: Querying the FAISS Index
def query_index(index, query_embedding, k=5):
    distances, indices = index.search(np.array([query_embedding], dtype=np.float32), k)
    return distances, indices
# Example usage
query_embedding = embeddings.embed_query("What is the summary of document1?")
distances, indices = query_index(index, query_embedding)
Now you can actually search the index with a query. The distances tell you how close each result is to your query in embedding space (IndexFlatL2 reports squared L2 distances). If you get strange results, double-check that your embeddings are being generated correctly; a common mistake is faulty input.
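Because IndexFlatL2 is exact, a plain NumPy scan over the same vectors should rank results identically, which makes it a handy sanity check when results look strange. A sketch (the helper name is mine):

```python
import numpy as np

def brute_force_l2(vectors, query, k=5):
    """Exact nearest neighbors by squared L2, matching IndexFlatL2.

    FAISS's L2 index reports *squared* distances, so we skip the sqrt.
    """
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    dists = ((vectors - query) ** 2).sum(axis=1)
    order = np.argsort(dists)[:k]
    return dists[order], order
```

If this disagrees with `query_index` on the same data, your index and your embeddings are out of sync (e.g. the index was built from a stale array).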
The Gotchas
- Empty or Corrupted PDFs: You’ll waste time debugging because the text is missing. Always check that your PDF loader works before diving into everything else.
- Memory Issues: Handling large datasets? Be prepared for crashes. Monitor your RAM usage and optimize your arrays accordingly.
- Embedding Problems: Your embeddings can be garbage if the content is too short or irrelevant. Make sure your queries are precise.
- API Rate Limits: If you’re using an external embedding API, rate limits can screw you over. Handle them gracefully in your code, or you’ll be waiting a long time.
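For the rate-limit gotcha, the usual pattern is exponential backoff around the embedding call. Here's a hedged sketch; the wrapper and its parameters are my own, and real API clients often ship retry logic of their own that you should prefer if available:

```python
import time

def with_retries(fn, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Call fn, retrying on failure with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts,
    then re-raises the last error once attempts are exhausted. In real
    code, catch the client's specific rate-limit exception rather than
    bare Exception.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# e.g. data_vectors = with_retries(embeddings.embed_documents, [pdf_data])
```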
Full Code Example
import PyPDF2
import faiss
import numpy as np
from langchain_openai import OpenAIEmbeddings

# 1. Load PDFs
def load_pdfs(pdf_paths):
    text = ""
    for path in pdf_paths:
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                text += page.extract_text() or ""
    return text

# 2. Prepare data
pdf_paths = ["document1.pdf", "document2.pdf"]
pdf_data = load_pdfs(pdf_paths)

# 3. Generate embeddings
embeddings = OpenAIEmbeddings()
data_vectors = embeddings.embed_documents([pdf_data])

# 4. Build FAISS index
def build_faiss_index(vectors):
    vectors = np.array(vectors, dtype=np.float32)
    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    return index

index = build_faiss_index(data_vectors)

# 5. Query the index
def query_index(index, query_embedding, k=5):
    distances, indices = index.search(np.array([query_embedding], dtype=np.float32), k)
    return distances, indices

# Example usage
query_embedding = embeddings.embed_query("What is the summary of document1?")
distances, indices = query_index(index, query_embedding)
What’s Next
Try integrating your RAG pipeline into a web app using Flask or FastAPI. Make it user-friendly so that even your non-technical friends can throw PDF documents at it and get useful information out!
FAQ
- What if my PDF contains images?
You won’t be able to extract any text from images. Consider using OCR libraries like Tesseract to handle that.
- Can I process URLs instead of PDF files?
Absolutely! You just need a library like BeautifulSoup to scrape the content from web pages and then perform the same embedding steps.
- How do I improve search results?
Fine-tune your embeddings or preprocess your text data for better context extraction. Sometimes it’s a simple matter of applying better filtering or stemming techniques.
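On the preprocessing point, even a tiny normalization pass helps. A minimal sketch (the helper is mine; heavier options like stemming would build on top of this):

```python
import re

def normalize(text):
    """Lowercase and collapse runs of whitespace before embedding.

    PDF extraction often leaves stray newlines and double spaces that
    add noise to embeddings; this is the cheapest cleanup that helps.
    """
    return re.sub(r"\s+", " ", text).strip().lower()
```

Apply it to both your document chunks and your queries so the two sides stay consistent.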
Last updated March 27, 2026. Data sourced from official docs and community benchmarks.