
Why I Measure My RAG (And You Should Too)

📅 June 2024 ⏱️ 6 min read 🏷️ RAG · Evaluation · RAGAS · LLM

When I built my CV RAG Chatbot, the first version "worked." I asked it a few questions, got reasonable answers, and felt good about shipping it. Then I showed it to a friend who asked: "How do you know it's actually retrieving the right information?"

I didn't. I had tested it on maybe 10 questions — all ones I had written myself. The retrievals "looked fine." The answers "sounded right." But I had no quantitative evidence that my RAG pipeline was working correctly.

This is the trap most RAG builders fall into. We test on a handful of queries, see plausible outputs, and declare success. But plausible ≠ correct. And in production, correctness matters — especially when the system is representing you to potential employers or clients.

Enter RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is a framework that evaluates RAG pipelines on multiple dimensions. The key metrics I use:

Faithfulness: Is the generated answer factually grounded in the retrieved context? (1.0 = every claim is supported by the context, 0.0 = none are)
Answer Relevancy: Does the answer actually address the question asked? (penalizes rambling or off-topic responses)
Context Precision: Of the retrieved chunks, what fraction is actually relevant to the question? (penalizes noisy retrieval)
Context Recall: Of all the relevant information available, what fraction was retrieved? (penalizes missing context)
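
To make the two retrieval metrics concrete, here is a toy set-based version of the fractions described above. It is only a sketch: RAGAS scores these with LLM judgments rather than known chunk IDs, and the function names and chunk IDs below are hypothetical.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant (toy version)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved (toy version)."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    found = len(relevant & set(retrieved))
    return found / len(relevant)


# Hypothetical chunk IDs for a single question.
retrieved = ["projects-1", "skills-1", "education-1", "projects-2"]
relevant = {"projects-1", "projects-2", "experience-1"}

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
```

Noisy retrieval lowers the first number; missing chunks lower the second. The two can move independently, which is why both are worth tracking.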

What I Measured — And What I Learned

I created a test set of 50 synthetic questions about my CV — ranging from "What is Rafif's experience with PyTorch?" to "Has he ever built a financial application?" For each question, I had a ground-truth answer (from my actual CV) to compare against.
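
A test set like this can be kept as plain data and sanity-checked before every run. The sketch below assumes a hypothetical `EvalCase` record and `validate` helper; the reference answers are placeholders, not text from the real CV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    ground_truth: str  # reference answer taken from the CV


def validate(cases: list[EvalCase]) -> list[str]:
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    seen = set()
    for i, case in enumerate(cases):
        if not case.question.strip():
            problems.append(f"case {i}: empty question")
        if not case.ground_truth.strip():
            problems.append(f"case {i}: missing ground truth")
        key = case.question.strip().lower()
        if key in seen:
            problems.append(f"case {i}: duplicate question")
        seen.add(key)
    return problems


cases = [
    EvalCase("What is Rafif's experience with PyTorch?", "<reference answer from the CV>"),
    EvalCase("Has he ever built a financial application?", "<reference answer from the CV>"),
]
problems = validate(cases)
```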

Round 1: Naive Implementation

Faithfulness: 0.72 — concerning. The model was making up details.
Answer Relevancy: 0.85 — decent but not great.
Context Precision: 0.61 — nearly 40% of retrieved chunks were irrelevant.
Context Recall: 0.78 — missing about 22% of relevant information.

The Context Precision score was the wake-up call. My chunking strategy — splitting by character count — was breaking semantic units. A chunk might contain half a project description and half a skill list, confusing the retriever.
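
The failure mode is easy to reproduce. This sketch uses a hypothetical `chunk_by_chars` helper and a made-up CV excerpt (not the real document) to show a fixed-size cut landing in the middle of a section.

```python
def chunk_by_chars(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` characters, ignoring structure."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical CV excerpt, for illustration only.
cv_text = (
    "PROJECTS\n"
    "Chatbot that answers questions about a CV.\n"
    "SKILLS\n"
    "Python, PyTorch, SQL\n"
)

chunks = chunk_by_chars(cv_text, size=40)

# A chunk that contains the SKILLS heading without starting at it mixes the
# tail of the project description with the start of the skill list.
mixed = [c for c in chunks if "SKILLS" in c and not c.startswith("SKILLS")]
```

Here the second chunk begins partway through the project sentence and then runs into the skill list, so a query about either topic can pull it in.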

Round 2: Section-Aware Chunking

I rewrote the chunking to respect document structure: each CV section (Experience, Skills, Projects) became its own chunk, with metadata about the section type. This immediately raised Context Precision from 0.61 to 0.84.
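
A minimal sketch of that idea, assuming the CV is plain text where each section starts with a heading line on its own. The heading set, the `chunk_by_section` name, and the sample text are all hypothetical; the real implementation may differ.

```python
SECTION_NAMES = {"EXPERIENCE", "SKILLS", "PROJECTS"}  # assumed heading format


def chunk_by_section(text: str) -> list[dict]:
    """Return one chunk per section, tagged with the section type.

    Lines that appear before the first recognized heading are dropped.
    """
    sections = []
    current = None
    for line in text.splitlines():
        heading = line.strip().upper()
        if heading in SECTION_NAMES:
            current = {"section": heading.title(), "lines": []}
            sections.append(current)
        elif current is not None and line.strip():
            current["lines"].append(line.strip())
    return [
        {"text": "\n".join(s["lines"]), "metadata": {"section": s["section"]}}
        for s in sections
    ]


# Placeholder content, not the real CV.
cv_text = """Experience
Role A at Company X

Skills
Python, PyTorch

Projects
CV RAG Chatbot
"""

chunks = chunk_by_section(cv_text)
```

Keeping the section type in metadata also makes it possible to filter or boost by section at query time, if the retriever supports it.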

Round 3: Prompt Engineering

The faithfulness issue was partly a prompting problem. I added explicit instructions: "Only use information from the provided context. If the context doesn't contain the answer, say so." Faithfulness jumped to 0.91.
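
One way to wire that instruction into the pipeline is a small prompt builder. The two instruction sentences are the ones quoted above; the `build_prompt` name and the overall layout are assumptions made for illustration.

```python
GROUNDING_RULES = (
    "Only use information from the provided context. "
    "If the context doesn't contain the answer, say so."
)


def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a grounded prompt from retrieved chunks and the question."""
    numbered = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(contexts, start=1)
    )
    return (
        f"{GROUNDING_RULES}\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is Rafif's experience with PyTorch?",
    ["<retrieved chunk 1>", "<retrieved chunk 2>"],  # placeholders
)
```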

The Evaluation Pipeline

graph LR
    A[Test Questions] --> B[RAG Pipeline]
    B --> C[Answers + Contexts]
    C --> D[RAGAS Metrics]
    D --> E{Faithfulness ≥ 0.9?}
    E -->|No| F[Fix & Retry]
    E -->|Yes| G[Ship It]
    F --> B
    style D fill:#6c5ce7,stroke:#7c6df0,color:#fff
    style G fill:#00d68f,stroke:#00b868,color:#0a0a0f
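
The loop in the diagram can be sketched as a simple gate. The function names and the `max_rounds` cap are hypothetical, and the stub below replays the two faithfulness scores reported in this post instead of calling a real pipeline.

```python
FAITHFULNESS_THRESHOLD = 0.9  # the gate shown in the diagram


def should_ship(scores: dict[str, float], threshold: float = FAITHFULNESS_THRESHOLD) -> bool:
    """True when faithfulness meets the threshold."""
    return scores["faithfulness"] >= threshold


def evaluate_until_passing(run_evaluation, apply_fix, max_rounds: int = 5):
    """Evaluate; if the gate fails, apply a fix and retry, up to max_rounds."""
    history = []
    for round_number in range(1, max_rounds + 1):
        scores = run_evaluation()
        history.append(scores)
        if should_ship(scores):
            return True, history
        if round_number < max_rounds:
            apply_fix(round_number)
    return False, history


# Stub evaluation: replays reported scores rather than running RAGAS.
scripted = iter([{"faithfulness": 0.72}, {"faithfulness": 0.91}])
fixes_applied = []

shipped, history = evaluate_until_passing(
    run_evaluation=lambda: next(scripted),
    apply_fix=fixes_applied.append,
)
```

Capping the number of rounds keeps the loop from running forever when a fix does not help.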

Why This Matters

If you're building a RAG system — whether it's a customer support bot, a document Q&A tool, or a CV chatbot — you need to measure it. Here's why:

Hallucinations erode trust. One confidently wrong answer and users stop using the system.
Chunking is make-or-break. Metrics tell you whether your chunking strategy works — guessing doesn't.
Prompt changes have side effects. A prompt that improves one metric might hurt another. You won't know without measuring.
It's not that hard. RAGAS takes about 30 lines of Python to set up. The ROI is massive.
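
To catch the side effects mentioned above, it helps to diff scores between runs rather than look at one metric in isolation. This is a sketch with a hypothetical `compare_runs` helper and an arbitrary tolerance of 0.02; the "later" dictionary holds only the two improved scores reported in this post.

```python
def compare_runs(before: dict[str, float], after: dict[str, float], tolerance: float = 0.02):
    """Return per-metric deltas and the metrics that dropped by more than `tolerance`."""
    deltas = {
        name: round(after[name] - before[name], 4)
        for name in before
        if name in after  # only compare metrics present in both runs
    }
    regressions = sorted(name for name, delta in deltas.items() if delta < -tolerance)
    return deltas, regressions


round_1 = {
    "faithfulness": 0.72,
    "answer_relevancy": 0.85,
    "context_precision": 0.61,
    "context_recall": 0.78,
}
later = {"faithfulness": 0.91, "context_precision": 0.84}

deltas, regressions = compare_runs(round_1, later)
```

An empty `regressions` list means no shared metric dropped by more than the tolerance; anything listed there deserves a look before the change ships.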

The Bottom Line

"It looks right" is not a metric. If you're shipping a RAG system to real users, invest the hour it takes to set up RAGAS. Your future self — and your users — will thank you.

See the CV RAG Chatbot case study for the full implementation details.
