Why I Measure My RAG (And You Should Too)

When I built my CV RAG Chatbot, the first version "worked." I asked it a few questions, got reasonable answers, and felt good about shipping it. Then I showed it to a friend who asked: "How do you know it's actually retrieving the right information?"

I didn't. I had tested it on maybe 10 questions, all ones I had written myself. The retrievals "looked fine." The answers "sounded right." But I had no quantitative evidence that my RAG pipeline was working correctly.

This is the trap most RAG builders fall into. We test on a handful of queries, see plausible outputs, and declare success. But plausible ≠ correct. And in production, correctness matters, especially when the system is representing you to potential employers or clients.

Enter RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is a framework that evaluates RAG pipelines on multiple dimensions. The key metrics I use:

  • Faithfulness: Is the generated answer factually grounded in the retrieved context? (1.0 = perfectly faithful, 0.0 = hallucination)
  • Answer Relevancy: Does the answer actually address the question asked? (penalizes rambling or off-topic responses)
  • Context Precision: Of the retrieved chunks, what fraction is actually relevant to the question? (penalizes noisy retrieval)
  • Context Recall: Of all the relevant information available, what fraction was retrieved? (penalizes missing context)
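To make the two retrieval metrics concrete, here is a deliberately simplified, set-based sketch of the definitions above. It assumes you already know which chunk IDs are relevant for a question. RAGAS itself relies on LLM judgments and its exact formulas differ, so the function names and sample chunk IDs below are hypothetical illustrations, not the library's API.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant (simplified, ignores ranking)."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were actually retrieved (simplified)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Hypothetical chunk IDs, invented for this illustration.
retrieved = ["skills", "projects", "hobbies", "contact"]
relevant = ["skills", "projects", "experience"]

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved chunks are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant chunks were retrieved
```

Noisy retrieval drags precision down, while missing chunks drag recall down, which is why the two scores can move independently.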

What I Measured, and What I Learned

I created a test set of 50 synthetic questions about my CV, ranging from "What is Rafif's experience with PyTorch?" to "Has he ever built a financial application?" For each question, I had a ground-truth answer (from my actual CV) to compare against.
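A test set like this can be kept as a small list of question and ground-truth pairs. The sketch below is one possible shape, not the format I actually used: `EvalCase` and `validate_test_set` are hypothetical names, and the ground-truth strings are placeholders rather than real CV content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One evaluation question with its reference answer."""
    question: str
    ground_truth: str


def validate_test_set(cases):
    """Return a list of human-readable problems found in the test set."""
    problems = []
    seen = set()
    for index, case in enumerate(cases):
        key = case.question.strip().lower()
        if not key:
            problems.append(f"case {index}: empty question")
        if not case.ground_truth.strip():
            problems.append(f"case {index}: empty ground truth")
        if key in seen:
            problems.append(f"case {index}: duplicate question")
        seen.add(key)
    return problems


# Questions come from the article; the ground truths are placeholders.
cases = [
    EvalCase("What is Rafif's experience with PyTorch?", "<answer copied from the CV>"),
    EvalCase("Has he ever built a financial application?", "<answer copied from the CV>"),
]
problems = validate_test_set(cases)
```

Checking for duplicates and empty references up front keeps a sloppy test set from quietly skewing the scores.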

Round 1: Naive Implementation

  • Faithfulness: 0.72. Concerning: the model was making up details.
  • Answer Relevancy: 0.85. Decent but not great.
  • Context Precision: 0.61. Nearly 40% of retrieved chunks were irrelevant.
  • Context Recall: 0.78. About 22% of the relevant information was missing.

The Context Precision score was the wake-up call. My chunking strategy, splitting by character count, was breaking semantic units. A chunk might contain half a project description and half a skill list, confusing the retriever.
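The failure mode is easy to reproduce. The sketch below assumes a plain fixed-size splitter (`chunk_by_chars` is a hypothetical stand-in for my original code) and a made-up two-section snippet, not my real CV.

```python
def chunk_by_chars(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[start:start + size] for start in range(0, len(text), size)]


# Made-up sample text with two sections.
sample_cv = "Projects\nBuilt a demo app.\nSkills\nPython, SQL\n"
chunks = chunk_by_chars(sample_cv, size=20)

# Chunks that mix the tail of the project description with the Skills heading.
straddling = [c for c in chunks if "app." in c and "Skills" in c]
```

Here the second chunk ends the project sentence and starts the skill list, so a query about either topic can pull in text about the other.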

Round 2: Section-Aware Chunking

I rewrote the chunking to respect document structure: each CV section (Experience, Skills, Projects) became its own chunk, with metadata recording the section type. Context precision immediately rose from 0.61 to 0.84.
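A minimal sketch of section-aware chunking, assuming each section starts with a heading line that exactly matches a known section name. The function name, the dictionary layout, and the sample text are illustrative assumptions; a real CV parser would need to handle messier input.

```python
def chunk_by_section(cv_text, section_names=("Experience", "Skills", "Projects")):
    """Return one chunk per CV section, tagged with its section name."""
    chunks = []
    for line in cv_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped in section_names:
            # A heading starts a new chunk.
            chunks.append({"text": "", "metadata": {"section": stripped}})
        elif chunks:
            # Body lines are appended to the most recent section.
            current = chunks[-1]
            current["text"] = (current["text"] + "\n" + stripped).strip()
        # Lines before the first heading are ignored in this sketch.
    return chunks


# Made-up sample text, not a real CV.
sample_cv = (
    "Experience\nWorked on ML systems.\n\n"
    "Skills\nPython, SQL\n\n"
    "Projects\nBuilt a demo app.\n"
)
section_chunks = chunk_by_section(sample_cv)
```

Because every chunk carries its section name, the retriever can also filter or boost by section type instead of relying on text similarity alone.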

Round 3: Prompt Engineering

The faithfulness issue was partly a prompting problem. I added explicit instructions: "Only use information from the provided context. If the context doesn't contain the answer, say so." Faithfulness jumped to 0.91.
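One way to wire that instruction into a prompt is sketched below. The instruction text is the one quoted above; the template layout, the `build_prompt` name, and the sample contexts are assumptions made for illustration.

```python
GROUNDING_RULES = (
    "Only use information from the provided context. "
    "If the context doesn't contain the answer, say so."
)


def build_prompt(question, contexts):
    """Assemble a grounded prompt from the instruction, the chunks, and the question."""
    numbered = "\n\n".join(
        f"[{index}] {chunk}" for index, chunk in enumerate(contexts, start=1)
    )
    return (
        f"{GROUNDING_RULES}\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder contexts, invented for this example.
prompt = build_prompt(
    "Has he ever built a financial application?",
    ["<retrieved chunk one>", "<retrieved chunk two>"],
)
```

Putting the rule first and numbering the chunks makes it easier to inspect which context an answer leaned on when a faithfulness score drops.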

The Evaluation Pipeline

graph LR
    A[Test Questions] --> B[RAG Pipeline]
    B --> C[Answers + Contexts]
    C --> D[RAGAS Metrics]
    D --> E{Faithfulness ≥ 0.9?}
    E -->|No| F[Fix & Retry]
    E -->|Yes| G[Ship It]
    F --> B
    style D fill:#6c5ce7,stroke:#7c6df0,color:#fff
    style G fill:#00d68f,stroke:#00b868,color:#0a0a0f
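The loop in the diagram can be sketched as a small gate function. This is an outline under stated assumptions: `rag_pipeline` and `score_faithfulness` are hypothetical callables you would supply (the second one wrapping whatever RAGAS call you use), and the stubs at the bottom exist only to make the example run.

```python
def run_gate(questions, rag_pipeline, score_faithfulness, threshold=0.9):
    """Run every question through the pipeline, score it, and decide whether to ship."""
    records = []
    for question in questions:
        answer, contexts = rag_pipeline(question)
        records.append({"question": question, "answer": answer, "contexts": contexts})
    faithfulness = score_faithfulness(records)
    return {
        "faithfulness": faithfulness,
        "ship": faithfulness >= threshold,  # "No" means fix and retry
        "records": records,
    }


# Stubs standing in for a real pipeline and a real scorer.
def stub_pipeline(question):
    return "stub answer", ["stub context"]


def stub_scorer(records):
    return 0.95


result = run_gate(["What is Rafif's experience with PyTorch?"], stub_pipeline, stub_scorer)
```

Keeping the answers and contexts in the result means a failed run already contains what you need to debug the fix-and-retry step.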

Why This Matters

If you're building a RAG system (whether it's a customer support bot, a document Q&A tool, or a CV chatbot), you need to measure it. Here's why:

  • Hallucinations erode trust. One confidently wrong answer and users stop using the system.
  • Chunking is make-or-break. Metrics tell you whether your chunking strategy works; guessing doesn't.
  • Prompt changes have side effects. A prompt that improves one metric might hurt another. You won't know without measuring.
  • It's not that hard. RAGAS takes about 30 lines of Python to set up. The ROI is massive.
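The point about side effects is easy to automate: compare each new run against the previous one and flag any metric that dropped. The sketch below uses made-up scores purely for illustration (they are not results from my chatbot), and `find_regressions` is a hypothetical helper name.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return metrics whose score dropped by more than `tolerance`."""
    regressions = {}
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is not None and old_score - new_score > tolerance:
            regressions[name] = (old_score, new_score)
    return regressions


# Made-up scores: the new prompt helps faithfulness but hurts relevancy.
before_change = {"faithfulness": 0.80, "answer_relevancy": 0.90}
after_change = {"faithfulness": 0.92, "answer_relevancy": 0.84}

regressions = find_regressions(before_change, after_change)
```

A small tolerance keeps run-to-run noise from LLM-judged metrics from raising false alarms.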

The Bottom Line

"It looks right" is not a metric. If you're shipping a RAG system to real users, invest the hour it takes to set up RAGAS. Your future self (and your users) will thank you.

See the CV RAG Chatbot case study for the full implementation details.
