Why I Measure My RAG (And You Should Too)
When I built my CV RAG Chatbot, the first version "worked." I asked it a few questions, got reasonable answers, and felt good about shipping it. Then I showed it to a friend who asked: "How do you know it's actually retrieving the right information?"
I didn't. I had tested it on maybe 10 questions, all ones I had written myself. The retrievals "looked fine." The answers "sounded right." But I had no quantitative evidence that my RAG pipeline was working correctly.
This is the trap most RAG builders fall into. We test on a handful of queries, see plausible outputs, and declare success. But plausible ≠ correct. And in production, correctness matters, especially when the system is representing you to potential employers or clients.
Enter RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is a framework that evaluates RAG pipelines on multiple dimensions. The key metrics I use:
- Faithfulness: Is the generated answer factually grounded in the retrieved context? (1.0 = perfectly faithful, 0.0 = hallucination)
- Answer Relevancy: Does the answer actually address the question asked? (penalizes rambling or off-topic responses)
- Context Precision: Of the retrieved chunks, what fraction is actually relevant to the question? (penalizes noisy retrieval)
- Context Recall: Of all the relevant information available, what fraction was retrieved? (penalizes missing context)
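To make the last two definitions concrete, here is a minimal set-based sketch. It is a simplification, not how RAGAS computes its scores (RAGAS relies on LLM judgments rather than exact ID matching), and the chunk IDs are hypothetical.

```python
from __future__ import annotations


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunk IDs that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunk IDs that made it into the retrieval."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    found = relevant.intersection(retrieved)
    return len(found) / len(relevant)


# Hypothetical chunk IDs for a single question.
retrieved_ids = ["exp-1", "skills-1", "edu-1", "proj-2"]
relevant_ids = {"exp-1", "proj-2", "proj-3"}

precision = context_precision(retrieved_ids, relevant_ids)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved_ids, relevant_ids)        # 2 of 3 relevant were retrieved
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means the generator has to wade through noise; low recall means the answer it needs may never reach the prompt at all.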
What I Measured, and What I Learned
I created a test set of 50 synthetic questions about my CV, ranging from "What is Rafif's experience with PyTorch?" to "Has he ever built a financial application?" For each question, I had a ground-truth answer (from my actual CV) to compare against.
Round 1: Naive Implementation
- Faithfulness: 0.72. Concerning: the model was making up details.
- Answer Relevancy: 0.85. Decent but not great.
- Context Precision: 0.61. Nearly 40% of retrieved chunks were irrelevant.
- Context Recall: 0.78. Missing about 22% of relevant information.
The Context Precision score was the wake-up call. My chunking strategy, splitting by character count, was breaking semantic units. A chunk might contain half a project description and half a skill list, confusing the retriever.
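The failure mode is easy to reproduce. The sketch below uses a hypothetical CV snippet and an arbitrary chunk size of 40 characters; it illustrates fixed-size splitting in general, not the chatbot's original code.

```python
from __future__ import annotations


def chunk_by_chars(text: str, size: int) -> list[str]:
    """Split text into consecutive fixed-size pieces, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical CV snippet with two sections.
cv_text = (
    "PROJECTS\n"
    "Built a CV RAG chatbot with section-aware retrieval.\n"
    "SKILLS\n"
    "Python, PyTorch, SQL\n"
)

chunks = chunk_by_chars(cv_text, size=40)

# Chunks that mix the tail of the project description with the skills list.
mixed = [c for c in chunks if "retrieval" in c and "SKILLS" in c]
for chunk in mixed:
    print(repr(chunk))
```

The mixed chunk starts mid-word in the project description and ends mid-word in the skill list, so its embedding represents neither section well.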
Round 2: Section-Aware Chunking
I rewrote the chunking to respect document structure: each CV section (Experience, Skills, Projects) became its own chunk, with metadata about the section type. This immediately improved context precision to 0.84.
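A minimal sketch of the idea, assuming the CV is plain text where each section starts with a heading line. The `Chunk` class, the `chunk_by_section` helper, and the sample text are all hypothetical; the three section names come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Section names taken from the post; a real CV may need more.
SECTION_HEADINGS = ("Experience", "Skills", "Projects")


@dataclass
class Chunk:
    section: str  # metadata: which CV section this chunk came from
    text: str


def chunk_by_section(cv_text: str) -> list[Chunk]:
    """Emit one chunk per section. Lines before the first heading are dropped."""
    chunks: list[Chunk] = []
    current_section = None
    buffer: list[str] = []
    for line in cv_text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADINGS:
            # A new heading closes the previous section, if it had content.
            if current_section is not None and buffer:
                chunks.append(Chunk(current_section, "\n".join(buffer)))
            current_section = stripped
            buffer = []
        elif stripped and current_section is not None:
            buffer.append(stripped)
    if current_section is not None and buffer:
        chunks.append(Chunk(current_section, "\n".join(buffer)))
    return chunks


# Hypothetical CV text for illustration only.
sample_cv = """Experience
ML engineer on a computer vision team.

Skills
Python, PyTorch, SQL

Projects
CV RAG chatbot.
"""

section_chunks = chunk_by_section(sample_cv)
for chunk in section_chunks:
    print(chunk.section, "->", chunk.text)
```

Keeping the section name as metadata also makes it possible to filter or boost by section at query time, though whether that helps is something to measure rather than assume.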
Round 3: Prompt Engineering
The faithfulness issue was partly a prompting problem. I added explicit instructions: "Only use information from the provided context. If the context doesn't contain the answer, say so." Faithfulness jumped to 0.91.
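One way to wire that instruction into the prompt is sketched below. The instruction text is the one quoted above; the `build_prompt` helper, the numbered-context layout, and the sample inputs are assumptions made for illustration.

```python
from __future__ import annotations

# Instruction text quoted from the post.
GROUNDING_INSTRUCTION = (
    "Only use information from the provided context. "
    "If the context doesn't contain the answer, say so."
)


def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a grounded prompt: instruction, numbered context, then the question."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        f"{GROUNDING_INSTRUCTION}\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks.
prompt = build_prompt(
    "Has he ever built a financial application?",
    ["Projects: CV RAG chatbot.", "Skills: Python, PyTorch, SQL"],
)
print(prompt)
```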
The Evaluation Pipeline
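The overall loop can be sketched roughly as follows: run every test question through retrieval and generation, then collect one record per question for scoring. The function names, the record field names, and the stand-in retriever and generator are all assumptions; the scoring step is deliberately left out because the exact RAGAS call and expected column names depend on the installed version.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases: any retriever or generator with these shapes will do.
Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], str]


def build_eval_records(
    test_set: List[Dict[str, str]],
    retrieve: Retriever,
    generate: Generator,
) -> List[Dict[str, object]]:
    """Run each test question through the pipeline and collect scoring inputs."""
    records = []
    for case in test_set:
        question = case["question"]
        contexts = retrieve(question)
        answer = generate(question, contexts)
        records.append(
            {
                "question": question,
                "contexts": contexts,
                "answer": answer,
                "ground_truth": case["ground_truth"],
            }
        )
    return records


# Stand-ins so the sketch runs without a vector store or an LLM.
def fake_retrieve(question: str) -> List[str]:
    return ["<retrieved CV chunk>"]


def fake_generate(question: str, contexts: List[str]) -> str:
    return f"<answer based on {len(contexts)} chunk(s)>"


# Questions are from the post; the reference answers are placeholders.
test_set = [
    {
        "question": "What is Rafif's experience with PyTorch?",
        "ground_truth": "<reference answer from the CV>",
    },
    {
        "question": "Has he ever built a financial application?",
        "ground_truth": "<reference answer from the CV>",
    },
]

records = build_eval_records(test_set, fake_retrieve, fake_generate)
print(len(records), "records ready for scoring")
```

Each record holds the four inputs the metrics above need: the question, the retrieved contexts, the generated answer, and the ground-truth answer.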
Why This Matters
If you're building a RAG system (whether it's a customer support bot, a document Q&A tool, or a CV chatbot), you need to measure it. Here's why:
- Hallucinations erode trust. One confidently wrong answer and users stop using the system.
- Chunking is make-or-break. Metrics tell you whether your chunking strategy works; guessing doesn't.
- Prompt changes have side effects. A prompt that improves one metric might hurt another. You won't know without measuring.
- It's not that hard. RAGAS takes about 30 lines of Python to set up. The ROI is massive.
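Catching side effects can be as simple as diffing metric scores between two runs. The sketch below uses the scores reported earlier in this post; the `compare_runs` helper and the 0.01 tolerance are hypothetical, and only metrics present in both runs are compared.

```python
from __future__ import annotations


def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> tuple[dict[str, float], list[str]]:
    """Return per-metric deltas and the metrics that dropped by more than tolerance."""
    deltas: dict[str, float] = {}
    regressions: list[str] = []
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = round(candidate[metric] - baseline[metric], 2)
        deltas[metric] = delta
        if delta < -tolerance:
            regressions.append(metric)
    return deltas, regressions


# Round 1 scores from the post.
round_1 = {
    "faithfulness": 0.72,
    "answer_relevancy": 0.85,
    "context_precision": 0.61,
    "context_recall": 0.78,
}
# Later scores the post reports (Round 2 and Round 3); the other two were not restated.
later_rounds = {"faithfulness": 0.91, "context_precision": 0.84}

deltas, regressions = compare_runs(round_1, later_rounds)
print(deltas)
print("regressions:", regressions)
```

Running a check like this after every prompt or chunking change turns "I think it got better" into a diff you can read.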
The Bottom Line
"It looks right" is not a metric. If you're shipping a RAG system to real users, invest the hour it takes to set up RAGAS. Your future self, and your users, will thank you.
See the CV RAG Chatbot case study for the full implementation details.