Case Study

Makmur Grosir — E-Commerce Image Scraper

A Playwright-based scraper that automatically finds product images across 4 Indonesian e-commerce platforms — with sophisticated anti-bot evasion.

Node.js · Playwright · GitHub Actions · Anti-Bot · Web Scraping · Automation · 2025

🌐 Visit Makmur Grosir →

The Problem

Makmur Grosir — a wholesale store in Indonesia with 2,500+ products — needed to build an online storefront. The product catalog existed as an Excel file with names and prices, but images had to be sourced manually. Hiring someone to find and download 2,500 product images would take weeks and cost thousands.

The challenge: automatically search for and download product images from 4 Indonesian e-commerce platforms — Lazada, Blibli, Shopee, and Tokopedia — without getting blocked, rate-limited, or CAPTCHA'd.

The Architecture

graph TB
    A[Excel Product Catalog] --> B[Playwright Bot]
    B --> C[Lazada Search]
    B --> D[Blibli Search]
    B --> E[Shopee Search]
    B --> F[Tokopedia Search]
    C --> G{Anti-Bot Layer}
    D --> G
    E --> G
    F --> G
    G --> H[Download Best Match]
    H --> I[Organized Image Output]
    B --> J[GitHub Actions Scheduler]
    J -->|50 products / 3h| B
    style B fill:#6c5ce7,stroke:#7c6df0,color:#fff
    style G fill:#f7c948,stroke:#e0b830,color:#0a0a0f
    style J fill:#00d2ff,stroke:#00b8e6,color:#0a0a0f
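
As a rough check on the scheduler numbers in the diagram, the sketch below works out how many batches the catalog needs and how long one full pass takes. It assumes every product succeeds on its first attempt. `estimateSchedule` is an illustrative helper, not a function from the project.

```javascript
// Back-of-the-envelope schedule using the figures quoted in this case study.
// Assumption: no retries. Failed items would add extra batches.
const TOTAL_PRODUCTS = 2500;  // catalog size
const BATCH_SIZE = 50;        // products per scheduled run
const HOURS_BETWEEN_RUNS = 3; // scheduler interval

function estimateSchedule(total, batchSize, intervalHours) {
  // Round up so a partial final batch still counts as a run.
  const batches = Math.ceil(total / batchSize);
  // One batch per interval, so elapsed time is batches * interval.
  const hours = batches * intervalHours;
  return { batches, hours, days: hours / 24 };
}

const plan = estimateSchedule(TOTAL_PRODUCTS, BATCH_SIZE, HOURS_BETWEEN_RUNS);
console.log(plan); // { batches: 50, hours: 150, days: 6.25 }
```

A clean pass therefore takes a little over six days. Any retried failures push the total past 50 batches.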

Why It's Hard

Anti-bot evasion is an arms race. Indonesian e-commerce platforms actively detect and block automated browsers. I had to implement browser fingerprint spoofing, stealth page scripts, randomized delays, and headless detection circumvention — each platform has different detection vectors.
Rate limiting and CAPTCHAs. Searching for 2,500 products means hitting the same platform hundreds of times. Spreading the searches across 4 platforms, in batches of 50 products every 3 hours, keeps the scraper under each platform's threshold.
Unreliable search results. A search for "Sabun Cuci Piring Ekonomi 500ml" might return the wrong product entirely. The scraper uses title similarity scoring to pick the best match, and flags edge cases for manual review.
Incremental progress and resume support. If the system crashes or a batch fails, it needs to resume from where it left off — not restart from zero. With 2,500 products over 50+ batches, this was essential.
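
The title similarity step described above can be sketched roughly as follows. The case study does not say which fuzzy metric the scraper uses, so this example swaps in a simple Dice coefficient over word tokens as a stand-in. `pickBestMatch`, the result shape (`title`, `imageUrl`), the `REVIEW_THRESHOLD` value, and the example search results are all assumptions made for illustration.

```javascript
// Lowercase, replace anything that is not a-z, 0-9, or whitespace with a
// space, then split into word tokens. (ASCII-only: a simplifying assumption.)
function tokenize(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// Dice coefficient over unique tokens: 2|A ∩ B| / (|A| + |B|), in [0, 1].
function titleSimilarity(a, b) {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  if (setA.size === 0 || setB.size === 0) return 0;
  let shared = 0;
  for (const token of setA) {
    if (setB.has(token)) shared += 1;
  }
  return (2 * shared) / (setA.size + setB.size);
}

// Illustrative cutoff, not a tuned value.
const REVIEW_THRESHOLD = 0.6;

// Returns the highest-scoring result, flagged for manual review when even
// the best score is below the threshold. Returns null for an empty list.
function pickBestMatch(productName, results) {
  let best = null;
  for (const result of results) {
    const score = titleSimilarity(productName, result.title);
    if (best === null || score > best.score) best = { ...result, score };
  }
  if (best === null) return null;
  return { ...best, needsReview: best.score < REVIEW_THRESHOLD };
}

// Made-up search results for the product name used as an example above.
const match = pickBestMatch("Sabun Cuci Piring Ekonomi 500ml", [
  { title: "Ekonomi Sabun Cuci Piring Jeruk Nipis 500ml", imageUrl: "https://example.com/a.jpg" },
  { title: "Sabun Mandi Cair 500ml", imageUrl: "https://example.com/b.jpg" },
]);
console.log(match.imageUrl, match.needsReview);
```

Token-based scoring ignores word order, which suits marketplace listings where sellers shuffle brand, variant, and size around. A character-level metric such as Levenshtein distance would cope better with typos, at the cost of being more sensitive to reordering.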

Technical Stack

Node.js + Playwright — browser automation with stealth enhancements
GitHub Actions — scheduled workflow runs (50 products per batch, every 3 hours)
Anti-bot evasion — browser fingerprint spoofing, stealth scripts, randomized delays, headless detection circumvention
Title similarity matching — fuzzy string comparison to pick the best image match
Resume & checkpoint system — failed items logged and retried; progress saved between runs
Excel parsing — reads product catalog from .xlsx, writes image-mapped results back
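
To make the resume and checkpoint idea concrete, here is a minimal sketch of how progress could be persisted between runs and how the next batch could be chosen. The checkpoint file format (`done` and `failed` arrays of product IDs) and all function names are hypothetical. The real project may store progress differently, for example in the Excel file itself.

```javascript
const fs = require("node:fs");

// Load saved progress, or start fresh when no checkpoint file exists yet.
function loadCheckpoint(path) {
  if (!fs.existsSync(path)) return { done: [], failed: [] };
  return JSON.parse(fs.readFileSync(path, "utf8"));
}

// Write to a temporary file and rename it into place, so a crash mid-write
// cannot leave a half-written checkpoint behind.
function saveCheckpoint(path, checkpoint) {
  const tmp = `${path}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(checkpoint, null, 2));
  fs.renameSync(tmp, path);
}

// Choose the next batch: previously failed items first, then items that
// have not been attempted, capped at batchSize.
function nextBatch(productIds, checkpoint, batchSize) {
  const done = new Set(checkpoint.done);
  const failed = checkpoint.failed.filter((id) => !done.has(id));
  const failedSet = new Set(failed);
  const fresh = productIds.filter((id) => !done.has(id) && !failedSet.has(id));
  return [...failed, ...fresh].slice(0, batchSize);
}

// Return a new checkpoint reflecting one product's outcome.
function recordResult(checkpoint, id, ok) {
  const failed = checkpoint.failed.filter((x) => x !== id);
  if (!ok) return { done: checkpoint.done, failed: [...failed, id] };
  const done = checkpoint.done.includes(id)
    ? checkpoint.done
    : [...checkpoint.done, id];
  return { done, failed };
}
```

A run would call `loadCheckpoint`, take `nextBatch(ids, checkpoint, 50)`, and call `recordResult` followed by `saveCheckpoint` after each product, so a crash loses at most the item in flight. On GitHub Actions the checkpoint file also has to survive between workflow runs, for instance by committing it back to the repository or storing it as an artifact. That part is outside this sketch.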

What I'd Do Differently

Use a rotating proxy pool. The scraper currently runs without proxies, so each batch goes out from a single GitHub Actions runner IP. Rotating proxies would reduce detection risk and allow faster batch processing.
Add image quality filtering. Some downloaded images are low-resolution or watermarked. Automated quality scoring would reduce manual review time.
Explore LLM-based matching. Instead of fuzzy title matching, a small LLM could determine if an image result actually matches the product — especially useful for ambiguous product names.
Build a simple monitoring dashboard. A Grafana dashboard showing batch progress, success rates, and platform-specific error rates would make the system more observable.

Key Takeaways

Web scraping at scale isn't just about writing selectors. It's about building a system that adapts as each platform's anti-bot measures evolve, handles failures gracefully, and keeps making progress despite CAPTCHAs and rate limits. The same principles apply to any system that depends on external services beyond your control.
