MLOps Without Burnout

📅 April 2024 ⏱️ 7 min read 🏷️ MLOps · Docker · MLflow · Grafana

Every MLOps tutorial seems to assume you have a Kubernetes cluster, a team of five DevOps engineers, and a cloud budget that would make a startup founder weep. Here's the reality for most of us: a VPS with 4 vCPUs and 8 GB of RAM, Docker Compose, and the determination to make it work.

I run my MLOps stack in Garut, Indonesia — not exactly AWS us-east-1. Here's what I've learned about building reliable ML infrastructure without burning out (or burning cash).

The Minimal Viable MLOps Stack

graph TB
    A[Training Scripts] --> B[MLflow Tracking]
    B --> C[MLflow Model Registry]
    C --> D[Docker Container]
    D --> E[Prediction API]
    E --> F[Grafana Monitoring]
    F -->|Drift Alert| G[Retrain Trigger]
    G --> A
    style B fill:#6c5ce7,stroke:#7c6df0,color:#fff
    style F fill:#00d2ff,stroke:#00b8e6,color:#0a0a0f

Here's what you actually need — and what you can skip:

✅ Must-Have

Experiment tracking (MLflow). You WILL forget which hyperparameters produced which results. MLflow is free, runs anywhere, and takes 20 minutes to set up.
Containerization (Docker). "It works on my machine" is the enemy of production ML. Docker Compose handles your entire stack in one docker-compose.yml.
Monitoring (Grafana). If you can't see prediction drift, you don't know your model is degrading. Grafana dashboards take an hour to set up and pay back immediately.
Model versioning. MLflow Model Registry handles this. Every model in production should have a version, a stage (staging/production/archived), and a clear lineage back to the training run.

❌ Can Skip (For Now)

Kubernetes: Docker Compose is sufficient for single-node deployments
Feature stores (Feast, Tecton): a PostgreSQL table with versioned feature sets works fine initially
CI/CD pipelines for ML: manual retraining with the MLflow API is fine when you're the only engineer
Data versioning (DVC): Git LFS or even dated data folders work for small-to-medium datasets
Model serving frameworks (TorchServe, Triton): a FastAPI wrapper around your model is simpler and more debuggable
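To make the "PostgreSQL table with versioned feature sets" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in so it runs anywhere. The table name, column names, and helper functions are all hypothetical, and a real PostgreSQL setup would likely use JSONB for the payload.

```python
import json
import sqlite3

# Hypothetical layout: one row per (entity, feature set, version).
# Versions are immutable, so a duplicate insert raises an IntegrityError.
SCHEMA = """
CREATE TABLE IF NOT EXISTS features (
    entity_id   TEXT    NOT NULL,
    feature_set TEXT    NOT NULL,
    version     INTEGER NOT NULL,
    payload     TEXT    NOT NULL,  -- JSON-encoded feature values
    PRIMARY KEY (entity_id, feature_set, version)
)
"""


def write_features(conn, entity_id, feature_set, version, values):
    """Store one versioned feature row."""
    conn.execute(
        "INSERT INTO features VALUES (?, ?, ?, ?)",
        (entity_id, feature_set, version, json.dumps(values)),
    )


def read_features(conn, entity_id, feature_set, version):
    """Return the feature dict for an exact version, or None if missing."""
    row = conn.execute(
        "SELECT payload FROM features "
        "WHERE entity_id = ? AND feature_set = ? AND version = ?",
        (entity_id, feature_set, version),
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
conn.execute(SCHEMA)
write_features(conn, "user-1", "churn", 1, {"tenure_days": 120})
write_features(conn, "user-1", "churn", 2, {"tenure_days": 120, "tickets_30d": 3})
print(read_features(conn, "user-1", "churn", 2))
```

Pinning a model to an exact feature set version means a training run can always be reproduced against the features it actually saw.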

My docker-compose.yml

# The stack that runs on a $20/month VPS
version: '3.8'
services:
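  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.12.0
    ports: ["5000:5000"]
    # Backend store is the postgres service defined below. The stock image
    # may need a PostgreSQL driver (e.g. psycopg2-binary) installed on top.
    command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://...
    depends_on: [postgres]
    restart: unless-stopped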

  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
    volumes: ["./grafana/dashboards:/etc/grafana/provisioning/dashboards"]
    restart: unless-stopped

  prediction-api:
    build: ./api
    ports: ["8000:8000"]
    depends_on: [mlflow]
    restart: unless-stopped

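  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: mlops
      POSTGRES_USER: mlflow
      # The postgres image refuses to initialize without a password;
      # set POSTGRES_PASSWORD in a .env file next to this compose file.
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes: ["pgdata:/var/lib/postgresql/data"]
    restart: unless-stopped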

volumes:
  pgdata:

Principles for Solo MLOps

1. Prefer Boring Technology

PostgreSQL over specialized time-series DBs. FastAPI over TensorFlow Serving. Docker Compose over Kubernetes. Every technology choice should be defensible with: "I can debug this at 2 AM when it breaks." If you can't, it's too complex.

2. Monitoring Is Non-Negotiable

The minimum viable monitoring:

Prediction distribution over time — a histogram that updates daily. If the shape changes, your data changed.
Feature drift (PSI) — Population Stability Index for top 10 features. Alert if PSI > 0.2.
Model latency p50/p95/p99 — because a slow model is a broken model.
Error rate — prediction failures, timeouts, NaN outputs.

All of this fits in a single Grafana dashboard. Set it up once, glance at it daily.
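The two numeric checks in that list can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the PSI uses equal-width bins derived from the baseline sample (quantile bins are a common alternative), and empty bins are floored at a small epsilon to keep the logarithm defined.

```python
import math
import statistics


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range land in the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def latency_percentiles(latencies_ms):
    """p50/p95/p99 from raw latencies (needs at least two samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


baseline = [i / 100 for i in range(1000)]  # synthetic training-time feature values
shifted = [v + 3 for v in baseline]        # synthetic drifted production values

print("PSI, no drift:", psi(baseline, baseline))
print("PSI, shifted: ", psi(baseline, shifted))
print(latency_percentiles(list(range(1, 101))))
```

A job that writes these numbers to a PostgreSQL table on a schedule gives Grafana something to chart and alert on, with the PSI > 0.2 rule as the alert condition.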

3. Automate Retraining Triggers, Not Schedules

Retraining on a fixed schedule (every Monday!) is wasteful and often misses the moment when retraining is actually needed. Instead, trigger retraining when drift metrics cross thresholds. This means retraining happens when it matters — not when the calendar says so.
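A threshold-based trigger might look like the sketch below. Only the PSI threshold of 0.2 comes from this post. The error-rate threshold, the one-day cooldown, and all names are assumptions chosen for illustration. The cooldown is there so a noisy metric does not start a retrain on every check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DriftSnapshot:
    """Latest monitoring values (hypothetical structure)."""
    max_feature_psi: float  # worst PSI across the monitored features
    error_rate: float       # fraction of failed predictions


def should_retrain(
    snapshot,
    last_retrain,
    now,
    psi_threshold=0.2,           # from the monitoring rule in this post
    error_threshold=0.05,        # assumed value
    cooldown=timedelta(days=1),  # assumed value
):
    """Return True when a drift metric crosses its threshold outside the cooldown."""
    if now - last_retrain < cooldown:
        return False
    return (
        snapshot.max_feature_psi > psi_threshold
        or snapshot.error_rate > error_threshold
    )


now = datetime(2024, 4, 15, 12, 0)
last = datetime(2024, 4, 10, 12, 0)
print(should_retrain(DriftSnapshot(max_feature_psi=0.31, error_rate=0.01), last, now))
print(should_retrain(DriftSnapshot(max_feature_psi=0.05, error_rate=0.01), last, now))
```

Whatever runs the check (a cron job, a Grafana alert webhook) only needs to call this function and kick off the training script when it returns True.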

4. Document Architecture Decisions

When you're a solo engineer, there's no one to ask "why did we choose XGBoost over LightGBM?" Six months from now, you'll be the one asking, and the only person who can answer. Write it down. A simple DECISIONS.md in your repo saves future-you hours of archaeology.
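One low-friction way to keep such a log is a tiny helper that formats each entry the same way. The entry layout and the function name below are hypothetical, and the example text is placeholder content rather than a real decision record.

```python
from datetime import date


def decision_entry(title, context, decision, alternatives, decided_on=None):
    """Format one decision record as a Markdown section for DECISIONS.md."""
    decided_on = decided_on or date.today()
    lines = [
        f"## {decided_on.isoformat()}: {title}",
        "",
        f"Context: {context}",
        f"Decision: {decision}",
        "Alternatives considered:",
    ]
    lines += [f"- {alt}" for alt in alternatives]
    return "\n".join(lines) + "\n"


entry = decision_entry(
    title="XGBoost over LightGBM",
    context="(placeholder) what problem prompted the choice",
    decision="(placeholder) what was chosen and why",
    alternatives=["LightGBM"],
    decided_on=date(2024, 4, 1),
)
print(entry)
# To persist it: open("DECISIONS.md", "a", encoding="utf-8").write(entry)
```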

What I'd Add With More Resources

A proper feature store when feature engineering becomes the bottleneck
Shadow deployment for zero-risk model updates
A/B testing infrastructure for model comparison in production
Automated hyperparameter tuning (Optuna) integrated with MLflow

But none of these are blockers. You can ship reliable ML systems today with the stack described above. I do it every day — from Garut, on a VPS that costs less than a dinner out.

The Bottom Line

MLOps doesn't require a PhD in infrastructure. Start with experiment tracking and monitoring. Add complexity only when the current setup actually hurts. Most ML projects don't fail because of inadequate infrastructure — they fail because nobody knows if the model is still working.