The Detect-Then-Classify Pattern
Most computer vision tutorials teach you to build a single model that does everything: detect objects AND classify them in one forward pass. YOLO, SSD, Faster R-CNN โ these are end-to-end detectors that output bounding boxes with class labels.
In production, this often isn't the right approach. Here's why โ and what to do instead.
The Problem with End-to-End
When you have hundreds of fine-grained classes, the combinatorial complexity explodes. A single YOLO model trying to distinguish between 200 product SKUs will struggle because:
- Class imbalance: some products appear 100x more than others
- Feature interference: the detector's backbone learns features optimized for localization, not fine-grained discrimination
- Update friction: adding a new product category means retraining the entire model
- Deployment inflexibility: you can't swap just the classification logic without touching the detector
The Solution: Decouple Detection and Classification
The detect-then-classify pattern splits the problem into independent stages:
- Detection (YOLO / EfficientDet): Find all objects, output bounding boxes. This model only needs to distinguish "object vs background" โ much simpler, much more robust.
- Coarse Classification: Assign each crop to a category group (e.g., "electronics," "clothing," "food"). A lightweight model with high recall.
- Fine Classification: Within each group, a specialized classifier distinguishes between specific items. Each classifier only handles 20โ50 classes โ much easier to train and maintain.
Why This Wins in Production
1. Better Accuracy Per Class
A classifier handling 30 classes will almost always outperform a detector handling 200 classes. The model capacity is focused on discrimination, not localization. In my production pipeline, the staged approach improved per-class F1 by 12โ18% over the end-to-end YOLO baseline.
2. Independent Maintainability
When a client adds 5 new product SKUs, I only retrain the relevant fine classifier โ not the detector, not the coarse classifier, not the other fine classifiers. Training time drops from hours to minutes. Deployment risk drops because only one component changes.
3. Flexible Deployment
The detector can run on-device (ONNX, quantized) while fine classifiers run on a server if needed. Or you can swap the detector from YOLOv8 to YOLO-NAS without touching classification. Each component evolves independently.
4. Better Debugging
When the system makes a mistake, you know exactly where: was the object not detected? Was it misrouted to the wrong coarse group? Did the fine classifier get confused? Staged architectures give you per-stage metrics that make debugging systematic rather than guesswork.
The Tradeoffs
- Latency: Multiple models mean multiple forward passes. But with ONNX and batching, the overhead is often under 10ms โ acceptable for most use cases.
- Complexity: More components to orchestrate. But the separation of concerns usually reduces overall system complexity compared to a monolithic model with thousands of edge cases.
- Data requirements: Each fine classifier needs its own training data. But this is often already available โ you're just organizing it differently.
When NOT to Use This Pattern
- You have fewer than 20 classes โ end-to-end is simpler and likely sufficient
- Latency is extremely tight (< 5ms) โ the multi-pass overhead matters
- Classes are mutually exclusive and well-separated โ no fine-grained discrimination needed
- You're in early prototyping โ start simple, add stages as needed
Code Sketch
# Simplified detect-then-classify pipeline
import torch
from detector import YOLODetector
from classifiers import CoarseClassifier, FineClassifier
class StagedPipeline:
def __init__(self):
self.detector = YOLODetector("yolov8n.pt")
self.coarse = CoarseClassifier("efficientnet_b0_coarse.pt")
self.fine = {
"electronics": FineClassifier("resnet18_electronics.pt"),
"clothing": FineClassifier("resnet18_clothing.pt"),
"food": FineClassifier("resnet18_food.pt"),
}
def __call__(self, image):
crops = self.detector.detect(image)
results = []
for crop, bbox in crops:
coarse_label = self.coarse.classify(crop)
fine_label = self.fine[coarse_label].classify(crop)
results.append({"bbox": bbox, "label": fine_label})
return results
The Bottom Line
The detect-then-classify pattern isn't flashy. It won't win benchmarks. But in production, where maintainability and accuracy on real data matter more than architectural elegance, it's often the pragmatic choice. Split the problem, solve each piece well, and keep the system evolvable.
See the Multi-Stage CV Pipeline case study for the full implementation details.