The Detect-Then-Classify Pattern

Most computer vision tutorials teach you to build a single model that does everything: detect objects AND classify them in one forward pass. YOLO, SSD, Faster R-CNN โ€” these are end-to-end detectors that output bounding boxes with class labels.

In production, this often isn't the right approach. Here's why โ€” and what to do instead.

The Problem with End-to-End

When you have hundreds of fine-grained classes, the combinatorial complexity explodes. A single YOLO model trying to distinguish between 200 product SKUs will struggle because:

  • Class imbalance: some products appear 100x more than others
  • Feature interference: the detector's backbone learns features optimized for localization, not fine-grained discrimination
  • Update friction: adding a new product category means retraining the entire model
  • Deployment inflexibility: you can't swap just the classification logic without touching the detector

The Solution: Decouple Detection and Classification

graph LR A[Input Image] --> B[Stage 1: Detector] B --> C[Object Crops] C --> D[Stage 2: Coarse Classifier] D --> E[Category Group] E --> F[Stage 3: Fine Classifier] F --> G[Final Label] style B fill:#6c5ce7,stroke:#7c6df0,color:#fff style D fill:#00d2ff,stroke:#00b8e6,color:#0a0a0f style F fill:#f7c948,stroke:#e0b830,color:#0a0a0f

The detect-then-classify pattern splits the problem into independent stages:

  1. Detection (YOLO / EfficientDet): Find all objects, output bounding boxes. This model only needs to distinguish "object vs background" โ€” much simpler, much more robust.
  2. Coarse Classification: Assign each crop to a category group (e.g., "electronics," "clothing," "food"). A lightweight model with high recall.
  3. Fine Classification: Within each group, a specialized classifier distinguishes between specific items. Each classifier only handles 20โ€“50 classes โ€” much easier to train and maintain.

Why This Wins in Production

1. Better Accuracy Per Class

A classifier handling 30 classes will almost always outperform a detector handling 200 classes. The model capacity is focused on discrimination, not localization. In my production pipeline, the staged approach improved per-class F1 by 12โ€“18% over the end-to-end YOLO baseline.

2. Independent Maintainability

When a client adds 5 new product SKUs, I only retrain the relevant fine classifier โ€” not the detector, not the coarse classifier, not the other fine classifiers. Training time drops from hours to minutes. Deployment risk drops because only one component changes.

3. Flexible Deployment

The detector can run on-device (ONNX, quantized) while fine classifiers run on a server if needed. Or you can swap the detector from YOLOv8 to YOLO-NAS without touching classification. Each component evolves independently.

4. Better Debugging

When the system makes a mistake, you know exactly where: was the object not detected? Was it misrouted to the wrong coarse group? Did the fine classifier get confused? Staged architectures give you per-stage metrics that make debugging systematic rather than guesswork.

The Tradeoffs

  • Latency: Multiple models mean multiple forward passes. But with ONNX and batching, the overhead is often under 10ms โ€” acceptable for most use cases.
  • Complexity: More components to orchestrate. But the separation of concerns usually reduces overall system complexity compared to a monolithic model with thousands of edge cases.
  • Data requirements: Each fine classifier needs its own training data. But this is often already available โ€” you're just organizing it differently.

When NOT to Use This Pattern

  • You have fewer than 20 classes โ€” end-to-end is simpler and likely sufficient
  • Latency is extremely tight (< 5ms) โ€” the multi-pass overhead matters
  • Classes are mutually exclusive and well-separated โ€” no fine-grained discrimination needed
  • You're in early prototyping โ€” start simple, add stages as needed

Code Sketch

# Simplified detect-then-classify pipeline
import torch
from detector import YOLODetector
from classifiers import CoarseClassifier, FineClassifier

class StagedPipeline:
    def __init__(self):
        self.detector = YOLODetector("yolov8n.pt")
        self.coarse = CoarseClassifier("efficientnet_b0_coarse.pt")
        self.fine = {
            "electronics": FineClassifier("resnet18_electronics.pt"),
            "clothing": FineClassifier("resnet18_clothing.pt"),
            "food": FineClassifier("resnet18_food.pt"),
        }
    
    def __call__(self, image):
        crops = self.detector.detect(image)
        results = []
        for crop, bbox in crops:
            coarse_label = self.coarse.classify(crop)
            fine_label = self.fine[coarse_label].classify(crop)
            results.append({"bbox": bbox, "label": fine_label})
        return results

The Bottom Line

The detect-then-classify pattern isn't flashy. It won't win benchmarks. But in production, where maintainability and accuracy on real data matter more than architectural elegance, it's often the pragmatic choice. Split the problem, solve each piece well, and keep the system evolvable.

See the Multi-Stage CV Pipeline case study for the full implementation details.

โ† Back to Blog