How to Build a RAG Pipeline with FAISS
We’re building a RAG pipeline that actually handles messy PDFs, not the clean-text demos you see everywhere.
Prerequisites
- Python 3.11+
- pip install "langchain>=0.2.0"
- pip install langchain-openai
- pip install faiss-cpu
- pip install PyPDF2
- pip install transformers
Step 1: Setting Up Your Environment
# First, make sure your packages are installed
pip install "langchain>=0.2.0"
pip install langchain-openai
pip install faiss-cpu
pip install PyPDF2
pip install transformers
Why does this matter? Pinning compatible versions up front saves you from debugging mismatched dependencies later, and that is a nightmare; trust me, I’ve been there. Stick to Python 3.11+ and the versions above. One shell gotcha: quote specifiers like "langchain>=0.2.0", or the shell treats the > as output redirection and silently installs the wrong thing.
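Before going further, it's worth confirming everything is importable. Here's a minimal sketch of a preflight check; the helper name `missing_packages` is mine, and it uses `importlib.util.find_spec` so nothing heavy actually gets imported:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module names that aren't importable."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Note the module names differ from the pip package names in one case:
# faiss-cpu installs the module "faiss".
required = ["langchain", "faiss", "PyPDF2", "transformers"]
print(missing_packages(required))  # an empty list means you're good to go
```

Run this once after `pip install`; anything it prints is a package you still need.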
Step 2: Load Your PDF Data
import PyPDF2

def load_pdfs(pdf_paths):
    text = ""
    for path in pdf_paths:
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                # extract_text() can come back empty on image-only pages
                text += page.extract_text() or ""
    return text
# Example of loading multiple PDFs
pdf_paths = ["document1.pdf", "document2.pdf"]
pdf_data = load_pdfs(pdf_paths)
You need to handle PDFs, and they’re notoriously messy. This code loads text from each page of every PDF. Got a PDF that won’t let you extract? That’s because not all PDFs are created equal: image-only pages come back with no text, and encrypted or damaged files can raise exceptions mid-loop. Keep an eye out for both.
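If you want the loop to survive damaged pages instead of crashing, you can wrap extraction in a small helper. This is a sketch under my own naming (`extract_pages` is not part of PyPDF2); it substitutes an empty string whenever a page yields nothing or raises:

```python
def extract_pages(pages):
    """Pull text from an iterable of page objects, tolerating bad pages.

    Image-only pages can yield None or "", and damaged pages can raise;
    either way we substitute an empty string instead of crashing.
    """
    chunks = []
    for page in pages:
        try:
            chunks.append(page.extract_text() or "")
        except Exception:
            chunks.append("")
    return "\n".join(chunks)
```

Inside `load_pdfs`, you'd then write `text += extract_pages(reader.pages)` instead of the inner page loop.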
Step 3: Prepare Your Data for FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
data_vectors = embeddings.embed_documents([pdf_data])
Here’s where the magic begins. You’re converting the PDF’s textual content into embeddings for FAISS. This step is crucial because FAISS works with numerical vectors, not raw text. If you hit an error here, check that your PDF text isn’t empty after extraction; you might need to clean it up first.
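One more thing worth flagging: the snippet above embeds the entire corpus as a single vector, which means every query retrieves the whole blob. Retrieval works much better over smaller pieces. Here's a minimal sliding-window chunker; the function and its default sizes are my own choices, tune them for your documents:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character windows for embedding.

    Overlap keeps a sentence that straddles a boundary visible in
    both neighboring chunks, so retrieval doesn't lose it.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]

# Embed per chunk instead of one giant string:
# chunks = chunk_text(pdf_data)
# data_vectors = embeddings.embed_documents(chunks)
```

Character-based splitting is crude but dependency-free; token-aware splitters give better boundaries if you need them.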
Step 4: Building the FAISS Index
import faiss
import numpy as np
def build_faiss_index(vectors):
    # embed_documents returns a list of lists, so convert it first
    vectors = np.array(vectors, dtype=np.float32)
    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    return index

index = build_faiss_index(data_vectors)
This step creates an index from your embeddings. The FAISS IndexFlatL2 variant is a simple, exact nearest-neighbor index. It’s fast at this scale, but it keeps every vector in RAM, so watch your memory as the dataset grows. At that point you’d reach for more advanced index types such as IndexIVFFlat, but that’s another story.
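Because IndexFlatL2 stores every vector verbatim as float32, you can estimate its RAM footprint before you build it. A rough back-of-envelope helper (my own naming; it ignores FAISS's small fixed overhead), using 1536 dimensions as in OpenAI's common embedding models:

```python
def flat_index_bytes(num_vectors, dimension):
    """Approximate RAM used by IndexFlatL2: 4 bytes per float32 component."""
    return num_vectors * dimension * 4

# 1M vectors of 1536-dim embeddings is roughly 6 GB:
gib = flat_index_bytes(1_000_000, 1536) / 2**30
print(f"{gib:.2f} GiB")  # ~5.72 GiB
```

If that number exceeds your machine's RAM, that's your cue to look at compressed index types.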
Step 5: Querying the FAISS Index
def query_index(index, query_embedding, k=5):
    distances, indices = index.search(np.array([query_embedding], dtype=np.float32), k)
    return distances, indices
# Example usage
query_embedding = embeddings.embed_query("What is the summary of document1?")
distances, indices = query_index(index, query_embedding)
Now you can actually search the index with a query. The distances tell you how close each result is to your query in embedding space (IndexFlatL2 reports squared L2 distances). If you get strange results, double-check that your embeddings are being generated correctly; a common mistake is faulty input.
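Because IndexFlatL2 is exact, a plain NumPy scan over the same vectors should rank results identically, which makes it a handy sanity check when results look strange. A sketch (the helper name is mine):

```python
import numpy as np

def brute_force_l2(vectors, query, k=5):
    """Exact nearest neighbors by squared L2, matching IndexFlatL2.

    FAISS's L2 index reports *squared* distances, so we skip the sqrt.
    """
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    dists = ((vectors - query) ** 2).sum(axis=1)
    order = np.argsort(dists)[:k]
    return dists[order], order
```

If this disagrees with `query_index` on the same data, your index and your embeddings are out of sync (e.g. the index was built from a stale array).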
The Gotchas
- Empty or Corrupted PDFs: You’ll waste time debugging because the text is missing. Always check that your PDF loader works before diving into everything else.
- Memory Issues: Handling large datasets? Be prepared for crashes. Monitor your RAM usage and optimize your arrays accordingly.
- Embedding Problems: Your embeddings can be garbage if the content is too short or irrelevant. Make sure your queries are precise.
- API Rate Limits: If you’re using an external embedding API, rate limits can screw you over. Handle them gracefully in your code, or you’ll be waiting a long time.
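For the rate-limit gotcha, the usual pattern is exponential backoff around the embedding call. Here's a hedged sketch; the wrapper and its parameters are my own, and real API clients often ship retry logic of their own that you should prefer if available:

```python
import time

def with_retries(fn, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Call fn, retrying on failure with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts,
    then re-raises the last error once attempts are exhausted. In real
    code, catch the client's specific rate-limit exception rather than
    bare Exception.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# e.g. data_vectors = with_retries(embeddings.embed_documents, [pdf_data])
```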
Full Code Example
import PyPDF2
import faiss
import numpy as np
from langchain_openai import OpenAIEmbeddings

# 1. Load PDFs
def load_pdfs(pdf_paths):
    text = ""
    for path in pdf_paths:
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                text += page.extract_text() or ""
    return text

# 2. Prepare data
pdf_paths = ["document1.pdf", "document2.pdf"]
pdf_data = load_pdfs(pdf_paths)

# 3. Generate embeddings
embeddings = OpenAIEmbeddings()
data_vectors = embeddings.embed_documents([pdf_data])

# 4. Build FAISS index
def build_faiss_index(vectors):
    vectors = np.array(vectors, dtype=np.float32)
    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    return index

index = build_faiss_index(data_vectors)

# 5. Query the index
def query_index(index, query_embedding, k=5):
    distances, indices = index.search(np.array([query_embedding], dtype=np.float32), k)
    return distances, indices

# Example usage
query_embedding = embeddings.embed_query("What is the summary of document1?")
distances, indices = query_index(index, query_embedding)
What’s Next
Try integrating your RAG pipeline into a web app using Flask or FastAPI. Make it user-friendly so that even your non-technical friends can throw PDF documents at it and get useful information out!
FAQ
- What if my PDF contains images?
You won’t be able to extract any text from images. Consider using OCR libraries like Tesseract to handle that.
- Can I process URLs instead of PDF files?
Absolutely! You just need a library like BeautifulSoup to scrape the content from web pages and then perform the same embedding steps.
- How do I improve search results?
Fine-tune your embeddings or preprocess your text data for better context extraction. Sometimes it’s a simple matter of applying better filtering or stemming techniques.
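On the preprocessing point, even a tiny normalization pass helps. A minimal sketch (the helper is mine; heavier options like stemming would build on top of this):

```python
import re

def normalize(text):
    """Lowercase and collapse runs of whitespace before embedding.

    PDF extraction often leaves stray newlines and double spaces that
    add noise to embeddings; this is the cheapest cleanup that helps.
    """
    return re.sub(r"\s+", " ", text).strip().lower()
```

Apply it to both your document chunks and your queries so the two sides stay consistent.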
Last updated March 27, 2026. Data sourced from official docs and community benchmarks.