Case Study

Multi-Stage CV Pipeline

A detect-then-classify architecture for high-accuracy object recognition — optimized for edge deployment.

🔒 NDA · Simplified Demo

YOLO · SSD · PyTorch · TensorFlow · OpenCV · ONNX · Edge Inference · 2023–2024

The Problem

A client needed an object recognition system for industrial quality inspection. The challenge: hundreds of fine-grained product categories that a single detector couldn't handle accurately. Products looked similar at the category level but had subtle differences that mattered for quality decisions.

The system also had to run on edge devices: inference latency under 100 ms, no cloud dependency, and models that fit in under 200 MB of memory.
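The memory constraint can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the parameter counts are hypothetical placeholders, not the client's models, and it counts weights only (activations and runtime overhead are ignored). It assumes 4 bytes per weight for FP32 and 1 byte for INT8.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_mb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in MB (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e6


def fits_budget(models: dict, dtype: str, budget_mb: float = 200.0):
    """Return (total size in MB, whether it fits the budget)."""
    total = sum(model_size_mb(n, dtype) for n in models.values())
    return total, total <= budget_mb


# Hypothetical parameter counts, chosen only to illustrate the arithmetic.
hypothetical_models = {
    "detector": 11_000_000,
    "coarse": 5_000_000,
    "fine_a": 25_000_000,
    "fine_b": 25_000_000,
    "fine_c": 25_000_000,
}

fp32_total, fp32_ok = fits_budget(hypothetical_models, "fp32")
int8_total, int8_ok = fits_budget(hypothetical_models, "int8")
```

With these placeholder numbers the FP32 weights alone exceed a 200 MB budget, while the INT8 weights fit with room to spare for activations and the runtime.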

The Architecture

graph TB
    A[Camera Input] --> B[YOLO Detection]
    B --> C{Object Crops}
    C --> D[Stage 1: Coarse Classifier]
    D --> E{Category Group}
    E -->|Group A| F[Fine Classifier A]
    E -->|Group B| G[Fine Classifier B]
    E -->|Group C| H[Fine Classifier C]
    F --> I[Results Aggregation]
    G --> I
    H --> I
    I --> J[JSON Output]
    style B fill:#6c5ce7,stroke:#7c6df0,color:#fff
    style D fill:#00d2ff,stroke:#00b8e6,color:#0a0a0f
    style F fill:#f7c948,stroke:#e0b830,color:#0a0a0f
    style G fill:#f7c948,stroke:#e0b830,color:#0a0a0f
    style H fill:#f7c948,stroke:#e0b830,color:#0a0a0f
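The routing in the diagram can be sketched in a few lines. This is a minimal illustration, not the production code: `StagedPipeline`, `register_fine`, and the callable signatures are hypothetical names, and the detector and classifiers are stand-in callables rather than real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A crop is left abstract here; in practice it would be an image array.
Crop = object
# A classifier maps a crop to (label, confidence).
Classifier = Callable[[Crop], Tuple[str, float]]


@dataclass
class StagedPipeline:
    detector: Callable[[object], List[Crop]]  # frame -> object crops
    coarse: Classifier                        # crop -> (group, confidence)
    fine: Dict[str, Classifier] = field(default_factory=dict)

    def register_fine(self, group: str, clf: Classifier) -> None:
        """Add or replace the fine classifier for one category group."""
        self.fine[group] = clf

    def run(self, frame: object) -> List[dict]:
        results = []
        for crop in self.detector(frame):
            group, group_conf = self.coarse(crop)
            clf = self.fine.get(group)
            if clf is None:
                # No fine classifier registered for this group yet:
                # report the coarse result only.
                results.append(
                    {"group": group, "label": None, "confidence": group_conf}
                )
                continue
            label, conf = clf(crop)
            results.append({"group": group, "label": label, "confidence": conf})
        return results
```

Because fine classifiers live in a dictionary keyed by group, supporting a new group is a single `register_fine` call; the detector and the other classifiers are not touched.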

Why It's Hard

Detect-then-classify reduces combinatorial complexity. Instead of one model handling all N categories, the detector handles localization and each classifier handles a manageable subset. Each model then trains faster and reaches higher accuracy on its narrower task.
Edge deployment constraints are real. ONNX export, model quantization (INT8), and careful memory management were essential. The full pipeline including pre/post-processing had to fit within a tight latency budget.
Data imbalance across categories. Some product types appeared 100x more frequently than others. Custom loss weighting, data augmentation, and occasional synthetic data generation were needed.
Model update strategy. When new products are added, you don't want to retrain everything. The staged architecture allows adding a new fine-classifier without touching the detector or other classifier stages.
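As one illustration of the loss weighting mentioned for imbalanced categories, the sketch below computes inverse-frequency class weights. It is an assumption about one reasonable scheme, not the weighting actually used on the project; the formula is n_samples / (n_classes * class_count), the same "balanced" heuristic scikit-learn documents for `class_weight="balanced"`. The function name and the labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def inverse_frequency_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get proportionally larger weights, so each class
    contributes roughly equally to a weighted loss.
    """
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Hypothetical 100:1 imbalance between two product types.
labels = ["common"] * 100 + ["rare"] * 1
weights = inverse_frequency_weights(labels)
```

The resulting weights differ by the same 100x factor as the class frequencies, and can be passed to a weighted cross-entropy loss in whichever training framework is in use.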

Technical Stack

YOLOv8 + SSD — object detection, trained on custom dataset with domain-specific augmentations
PyTorch + TensorFlow — custom classifier training with ResNet/EfficientNet backbones across both frameworks
ONNX Runtime — inference optimization for edge deployment
OpenCV — image preprocessing, crop extraction, and post-processing
Model quantization — INT8 quantization reducing model size by 4x with minimal accuracy loss
Custom evaluation harness — per-stage metrics, confusion matrices, and latency profiling
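The per-stage latency profiling in the evaluation harness could look roughly like the sketch below. `StageTimer` and its methods are hypothetical names, and the real harness is not shown in this case study; the sketch only demonstrates timing each stage separately and comparing the summed means against the 100 ms budget.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict


class StageTimer:
    """Collects wall-clock timings per pipeline stage, in milliseconds."""

    def __init__(self) -> None:
        self.samples_ms = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised an exception.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples_ms[name].append(elapsed_ms)

    def mean_ms(self) -> Dict[str, float]:
        return {name: sum(v) / len(v) for name, v in self.samples_ms.items()}

    def within_budget(self, budget_ms: float = 100.0) -> bool:
        """True if the summed mean stage latencies fit the budget."""
        return sum(self.mean_ms().values()) <= budget_ms


timer = StageTimer()
with timer.stage("detect"):
    pass  # the detector call would go here
with timer.stage("classify"):
    pass  # the classifier calls would go here
```

Timing stages separately shows which one to optimize first when the end-to-end number exceeds the budget. Means hide tail latency, so a real harness would likely also track percentiles.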

What I'd Do Differently

Explore YOLO-NAS or RT-DETR. Newer architectures may offer better accuracy-latency tradeoffs than YOLOv8 for edge deployment, and are worth benchmarking on the target hardware.
Add a rejection class. When the classifier is uncertain, route to human review instead of forcing a prediction. This matters in quality inspection.
Implement active learning. Automatically flag low-confidence predictions for human labeling, then feed the corrected labels back into training so the model improves over time.
Use Triton Inference Server. For multi-model orchestration on edge, Triton would simplify model loading, versioning, and batching.
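The rejection idea from the list above can be sketched as a confidence threshold on the classifier's softmax output. This is a minimal illustration under stated assumptions: `predict_or_reject` is a hypothetical helper, the 0.8 threshold is an arbitrary placeholder that would need tuning on validation data, and raw softmax confidence is often poorly calibrated.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(
    logits: Sequence[float], labels: Sequence[str], threshold: float = 0.8
) -> dict:
    """Return the top label, or flag the sample for human review
    when the top probability falls below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confident = probs[best] >= threshold
    return {
        "label": labels[best] if confident else None,
        "confidence": probs[best],
        "needs_review": not confident,
    }


labels = ["ok", "scratch", "dent"]
confident = predict_or_reject([5.0, 0.0, 0.0], labels)
uncertain = predict_or_reject([1.0, 0.9, 0.8], labels)
```

Samples flagged with `needs_review` are also natural candidates for the active-learning labeling queue.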

Key Takeaways

The detect-then-classify pattern is underrated. Splitting the problem into stages gives you better accuracy, easier maintenance, and the flexibility to update one stage without touching others. It's now my default approach for any multi-class vision problem.
