Multi-Stage CV Pipeline

A detect-then-classify architecture for high-accuracy object recognition, optimized for edge deployment.

🔒 NDA · Simplified Demo
YOLO SSD PyTorch TensorFlow OpenCV ONNX Edge Inference 2023–2024

The Problem

A client needed an object recognition system for industrial quality inspection. The challenge: hundreds of fine-grained product categories that a single detector couldn't handle accurately. Products looked similar at the category level but had subtle differences that mattered for quality decisions.

Additionally, the system had to run on edge devices: inference latency under 100 ms, no cloud dependency, and models that fit in under 200 MB of memory.
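A rough way to sanity-check a memory budget like this is to multiply parameter counts by bytes per weight. The sketch below does only that; the parameter counts are hypothetical placeholders, not the real models, and it ignores activations and runtime overhead, which also count against the budget on a real device.

```python
# Back-of-envelope weight-memory estimate for a multi-model pipeline.
# Only the bytes-per-weight figures are facts; model sizes below are made up.
from typing import Dict, Tuple

BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_mb(num_params: int, dtype: str) -> float:
    """Size of the weights alone, in MB (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e6


def fits_budget(param_counts: Dict[str, int], dtype: str,
                budget_mb: float = 200.0) -> Tuple[float, bool]:
    """Return (total weight size in MB, whether it fits the budget)."""
    total = sum(weights_mb(n, dtype) for n in param_counts.values())
    return total, total <= budget_mb


# Hypothetical parameter counts, for illustration only.
EXAMPLE_PARAMS = {
    "detector": 11_000_000,
    "coarse": 5_000_000,
    "fine_a": 25_000_000,
    "fine_b": 25_000_000,
    "fine_c": 25_000_000,
}

if __name__ == "__main__":
    for dtype in ("fp32", "int8"):
        total, ok = fits_budget(EXAMPLE_PARAMS, dtype)
        print(f"{dtype}: {total:.1f} MB, fits 200 MB budget: {ok}")
```

With these made-up sizes, the FP32 weights alone would blow the budget while the INT8 versions fit comfortably, which is the kind of gap that makes quantization non-optional on edge hardware.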

The Architecture

graph TB
    A[Camera Input] --> B[YOLO Detection]
    B --> C{Object Crops}
    C --> D[Stage 1: Coarse Classifier]
    D --> E{Category Group}
    E -->|Group A| F[Fine Classifier A]
    E -->|Group B| G[Fine Classifier B]
    E -->|Group C| H[Fine Classifier C]
    F --> I[Results Aggregation]
    G --> I
    H --> I
    I --> J[JSON Output]
    style B fill:#6c5ce7,stroke:#7c6df0,color:#fff
    style D fill:#00d2ff,stroke:#00b8e6,color:#0a0a0f
    style F fill:#f7c948,stroke:#e0b830,color:#0a0a0f
    style G fill:#f7c948,stroke:#e0b830,color:#0a0a0f
    style H fill:#f7c948,stroke:#e0b830,color:#0a0a0f
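The flow in the diagram can be sketched in a few lines of Python. This is a simplified illustration, not the production code: `StagedPipeline`, `make_demo_pipeline`, and the stub lambdas are hypothetical names, and the image is a toy nested list rather than a real frame. In practice each callable would wrap a real detector or classifier.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
Image = List[List[int]]           # toy grayscale image as nested lists


@dataclass
class StagedPipeline:
    detect: Callable[[Image], List[Box]]        # stand-in for the detector
    coarse: Callable[[Image], str]              # crop -> category group name
    fine: Dict[str, Callable[[Image], str]]     # group name -> fine classifier

    def run(self, image: Image) -> str:
        results = []
        for x1, y1, x2, y2 in self.detect(image):
            # Cut the detected region out of the frame.
            crop = [row[x1:x2] for row in image[y1:y2]]
            # Stage 1 picks the group, which selects the fine classifier.
            group = self.coarse(crop)
            label = self.fine[group](crop)
            results.append({"box": [x1, y1, x2, y2],
                            "group": group, "label": label})
        # Aggregate everything into one JSON document.
        return json.dumps(results)


def make_demo_pipeline() -> StagedPipeline:
    """Wire the pipeline with trivial stubs so the routing can be exercised."""
    return StagedPipeline(
        detect=lambda img: [(0, 0, 2, 2), (2, 0, 4, 2)],
        coarse=lambda crop: "A" if sum(map(sum, crop)) > 10 else "B",
        fine={"A": lambda crop: "a-bright", "B": lambda crop: "b-dark"},
    )


if __name__ == "__main__":
    frame = [[9, 9, 0, 0], [9, 9, 0, 0]]
    print(make_demo_pipeline().run(frame))
```

Keeping the fine classifiers in a plain dictionary is what makes the design easy to extend: supporting another group amounts to registering one more entry in `fine`, plus teaching the coarse stage to emit that group name.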

Why It's Hard

  • Detect-then-classify reduces combinatorial complexity. Instead of one model handling N categories, the detector handles localization and the classifiers each handle a manageable subset. Each classifier sees a narrower label space, which shortens training and makes the subtle within-group differences easier to learn.
  • Edge deployment constraints are real. ONNX export, model quantization (INT8), and careful memory management were essential. The full pipeline including pre/post-processing had to fit within a tight latency budget.
  • Data imbalance across categories. Some product types appeared 100x more frequently than others. Custom loss weighting, data augmentation, and occasional synthetic data generation were needed.
  • Model update strategy. When new products are added, you don't want to retrain everything. The staged architecture allows adding a new fine-classifier without touching the detector or other classifier stages.
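To make the loss-weighting point concrete, here is a minimal sketch of one common scheme: inverse-frequency class weights fed into a weighted cross-entropy. The write-up above doesn't say which weighting was actually used, so treat this as an assumption. The function names are hypothetical, and the pure-Python loss stands in for what a framework loss with per-class weights would compute.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by N / (K * count).

    A perfectly balanced dataset gives every class a weight of 1.0;
    rare classes get proportionally larger weights.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_cross_entropy(probs: List[Dict[str, float]],
                           targets: List[str],
                           weights: Dict[str, float]) -> float:
    """Weighted mean of -log p(target), normalized by the sum of weights."""
    total = sum(weights[t] * -math.log(p[t]) for p, t in zip(probs, targets))
    return total / sum(weights[t] for t in targets)


if __name__ == "__main__":
    # A 100:1 imbalance, like the one described above.
    labels = ["common"] * 100 + ["rare"]
    w = inverse_frequency_weights(labels)
    print({cls: round(v, 3) for cls, v in w.items()})
```

With a 100:1 imbalance the rare class ends up weighted 100 times more heavily than the common one, so a mistake on a rare product costs as much as a hundred mistakes on a common one. Weights that extreme can destabilize training, which is one reason to combine weighting with augmentation and synthetic data rather than rely on it alone.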

Technical Stack

  • YOLOv8 + SSD: object detection, trained on a custom dataset with domain-specific augmentations
  • PyTorch + TensorFlow: custom classifier training with ResNet/EfficientNet backbones across both frameworks
  • ONNX Runtime: inference optimization for edge deployment
  • OpenCV: image preprocessing, crop extraction, and post-processing
  • Model quantization: INT8 quantization reducing model size by 4x with minimal accuracy loss
  • Custom evaluation harness: per-stage metrics, confusion matrices, and latency profiling
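As an illustration of the latency-profiling part of the harness, a per-stage timer can be as small as the sketch below. `StageTimer` and the stage names are hypothetical, not the actual harness, and the `pass` bodies stand in for real inference calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import median
from typing import Dict, List


class StageTimer:
    """Collects wall-clock timings per pipeline stage, in milliseconds."""

    def __init__(self) -> None:
        self.samples_ms: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the sample even if the stage raised.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples_ms[name].append(elapsed_ms)

    def report(self, budget_ms: float) -> dict:
        per_stage = {n: median(v) for n, v in self.samples_ms.items()}
        # Sum of per-stage medians: a rough total, not a true end-to-end median.
        total = sum(per_stage.values())
        return {"per_stage_ms": per_stage, "total_ms": total,
                "within_budget": total <= budget_ms}


if __name__ == "__main__":
    timer = StageTimer()
    for _ in range(5):
        with timer.stage("detect"):
            pass  # placeholder for the detector call
        with timer.stage("classify"):
            pass  # placeholder for coarse + fine classification
    print(timer.report(budget_ms=100.0))
```

Timing each stage separately, rather than only end to end, shows where the budget actually goes, which matters when pre- and post-processing can cost as much as the models themselves.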

What I'd Do Differently

  • Explore YOLO-NAS or RT-DETR. Newer architectures may offer better accuracy-latency tradeoffs than YOLOv8, though that would need benchmarking on the target edge hardware.
  • Add a rejection class. When the classifier is uncertain, route to human review instead of forcing a prediction. This matters in quality inspection.
  • Implement active learning. Automatically flag low-confidence predictions for human labeling and model improvement over time.
  • Use Triton Inference Server. For multi-model orchestration on edge, Triton would simplify model loading, versioning, and batching.
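The rejection idea could start as simply as thresholding the classifier's top softmax probability. The sketch below is one possible shape for it, not something that shipped: `predict_or_reject` and the 0.8 threshold are hypothetical, and softmax scores are often poorly calibrated, so a real threshold would need tuning on validation data.

```python
import math
from typing import List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def predict_or_reject(logits: Sequence[float], labels: Sequence[str],
                      threshold: float = 0.8) -> Tuple[Optional[str], float]:
    """Return (label, confidence), or (None, confidence) to signal human review."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return labels[best], probs[best]


if __name__ == "__main__":
    labels = ["ok", "scratch", "dent"]
    print(predict_or_reject([4.0, 0.0, 0.0], labels))   # confident prediction
    print(predict_or_reject([1.0, 0.9, 0.0], labels))   # ambiguous, rejected
```

The same rejected samples are natural candidates for the active-learning queue: anything that returns `None` gets a human label and can feed the next training round.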

Key Takeaways

The detect-then-classify pattern is underrated. Splitting the problem into stages gives you better accuracy, easier maintenance, and the flexibility to update one stage without touching others. It's now my default approach for any multi-class vision problem.