Makmur Grosir: E-Commerce Image Scraper
A Playwright-based scraper that automatically finds product images across 4 Indonesian e-commerce platforms, with sophisticated anti-bot evasion.
The Problem
Makmur Grosir, a wholesale store in Indonesia with 2,500+ products, needed to build an online storefront. The product catalog existed as an Excel file with names and prices, but images had to be sourced manually. Hiring someone to find and download 2,500 product images would take weeks and cost thousands.
The challenge: automatically search for and download product images from 4 Indonesian e-commerce platforms (Lazada, Blibli, Shopee, and Tokopedia) without getting blocked, rate-limited, or CAPTCHA'd.
The Architecture
Why It's Hard
- Anti-bot evasion is an arms race. Indonesian e-commerce platforms actively detect and block automated browsers. I had to implement browser fingerprint spoofing, stealth page scripts, randomized delays, and headless detection circumvention, and each platform has different detection vectors.
- Rate limiting and CAPTCHAs. Searching 2,500 products means hitting the same platform hundreds of times. By spreading searches across 4 platforms and processing 50 products per batch every 3 hours, the scraper stays under each platform's threshold.
- Unreliable search results. A search for "Sabun Cuci Piring Ekonomi 500ml" might return the wrong product entirely. The scraper uses title similarity scoring to pick the best match, but edge cases require manual review flags.
- Incremental progress and resume support. If the system crashes or a batch fails, it needs to resume from where it left off, not restart from zero. With 2,500 products over 50+ batches, this was essential.
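The title similarity scoring described above can be illustrated with a small sketch. The write-up does not say which metric the scraper uses, so this example assumes the Sørensen-Dice coefficient over character bigrams. The names `normalizeTitle`, `titleSimilarity`, `pickBestMatch`, and the `REVIEW_THRESHOLD` value are hypothetical, chosen only for illustration.

```javascript
// Assumed metric: Sørensen-Dice coefficient over character bigrams.
// The source only says "title similarity scoring"; this is one possible choice.

// Lowercase and replace every run of non-alphanumeric characters with a single
// space, so "Sabun Cuci-Piring 500ML" and "sabun cuci piring 500ml" compare equal.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// Count each two-character substring of the normalized title.
function bigramCounts(text) {
  const counts = new Map();
  for (let i = 0; i < text.length - 1; i++) {
    const gram = text.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Returns a score in [0, 1]; 1 means the normalized titles are identical.
function titleSimilarity(a, b) {
  const left = normalizeTitle(a);
  const right = normalizeTitle(b);
  if (left === right) return 1;
  if (left.length < 2 || right.length < 2) return 0;
  const leftGrams = bigramCounts(left);
  const rightGrams = bigramCounts(right);
  let overlap = 0;
  for (const [gram, count] of leftGrams) {
    overlap += Math.min(count, rightGrams.get(gram) || 0);
  }
  const totalGrams = (left.length - 1) + (right.length - 1);
  return (2 * overlap) / totalGrams;
}

// Assumed cutoff below which a match is flagged for manual review.
const REVIEW_THRESHOLD = 0.6;

// Pick the highest-scoring search result and flag weak matches.
// Each candidate is assumed to look like { title, imageUrl }.
function pickBestMatch(productName, candidates) {
  let best = null;
  for (const candidate of candidates) {
    const score = titleSimilarity(productName, candidate.title);
    if (best === null || score > best.score) {
      best = { ...candidate, score };
    }
  }
  if (best === null) return null;
  return { ...best, needsReview: best.score < REVIEW_THRESHOLD };
}
```

A bigram-based score tolerates small spelling and spacing differences ("500ml" vs "500 ml") better than exact token comparison, which is why it is a common choice for product titles. The threshold would need tuning against real search results.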
Technical Stack
- Node.js + Playwright: browser automation with stealth enhancements
- GitHub Actions: scheduled workflow runs (50 products per batch, every 3 hours)
- Anti-bot evasion: browser fingerprint spoofing, stealth scripts, randomized delays, headless detection circumvention
- Title similarity matching: fuzzy string comparison to pick the best image match
- Resume & checkpoint system: failed items logged and retried; progress saved between runs
- Excel parsing: reads product catalog from .xlsx, writes image-mapped results back
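To make the resume and checkpoint idea concrete, here is a minimal sketch of how a run might choose its next batch from saved progress. The checkpoint shape, the function names, and the `MAX_ATTEMPTS` limit are assumptions for illustration; the write-up does not describe the actual data format or retry policy.

```javascript
// Assumed checkpoint shape: { done: { [id]: imageUrl }, failed: { [id]: attempts } }.
const BATCH_SIZE = 50;  // from the write-up: 50 products per batch
const MAX_ATTEMPTS = 3; // assumed retry limit, not stated in the source

function emptyCheckpoint() {
  return { done: {}, failed: {} };
}

// Select the next batch: skip finished products and products that have used up
// their retries. Previously failed items go first, then untouched items in
// catalog order.
function nextBatch(products, checkpoint, batchSize = BATCH_SIZE) {
  const pending = products.filter((p) => {
    if (Object.hasOwn(checkpoint.done, p.id)) return false;
    return (checkpoint.failed[p.id] || 0) < MAX_ATTEMPTS;
  });
  const retries = pending.filter((p) => Object.hasOwn(checkpoint.failed, p.id));
  const fresh = pending.filter((p) => !Object.hasOwn(checkpoint.failed, p.id));
  return [...retries, ...fresh].slice(0, batchSize);
}

// Record one outcome and return a new checkpoint; the input is not mutated.
// A falsy imageUrl counts as a failed attempt.
function recordResult(checkpoint, productId, imageUrl) {
  const done = { ...checkpoint.done };
  const failed = { ...checkpoint.failed };
  if (imageUrl) {
    done[productId] = imageUrl;
    delete failed[productId];
  } else {
    failed[productId] = (failed[productId] || 0) + 1;
  }
  return { done, failed };
}
```

Because the checkpoint is a plain object, it could be serialized with `JSON.stringify` and reloaded at the start of the next scheduled run; how the project actually persists state between runs is not described. As a rough check on scale, 2,500 products at 50 per batch is 2,500 / 50 = 50 batches, or about 150 hours at one batch every 3 hours if every run completes, and retries push the count above 50.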
What I'd Do Differently
- Use a rotating proxy pool. Currently runs from a single IP via GitHub Actions. Rotating proxies would reduce detection risk and allow faster batch processing.
- Add image quality filtering. Some downloaded images are low-resolution or watermarked. Automated quality scoring would reduce manual review time.
- Explore LLM-based matching. Instead of fuzzy title matching, a small LLM could determine if an image result actually matches the product, which is especially useful for ambiguous product names.
- Build a simple monitoring dashboard. A Grafana dashboard showing batch progress, success rates, and platform-specific error rates would make the system more observable.
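As one way the rotating proxy pool idea might look, the sketch below assigns a proxy to each batch in round-robin order and builds launch options for the browser. The proxy URLs and function names are hypothetical placeholders; nothing here reflects an existing part of the project.

```javascript
// Hypothetical proxy endpoints; real ones would come from a proxy provider.
const PROXIES = [
  "http://proxy-a.example.com:8080",
  "http://proxy-b.example.com:8080",
  "http://proxy-c.example.com:8080",
];

// Round-robin: batch N uses proxy number (N mod pool size).
function proxyForBatch(batchIndex, proxies = PROXIES) {
  if (proxies.length === 0) return undefined;
  return { server: proxies[batchIndex % proxies.length] };
}

// Build browser launch options; the proxy key is omitted when the pool is empty,
// which falls back to the current single-IP behaviour.
function launchOptionsForBatch(batchIndex, proxies = PROXIES) {
  const proxy = proxyForBatch(batchIndex, proxies);
  return proxy ? { headless: true, proxy } : { headless: true };
}
```

Playwright's `chromium.launch()` accepts a `proxy` option with a `server` field, so the result could be passed as `chromium.launch(launchOptionsForBatch(batchIndex))`. Round-robin is the simplest policy; a real pool would probably also need to drop proxies that start returning CAPTCHAs.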
Key Takeaways
Web scraping at scale isn't just about writing selectors; it's about building a system that adapts to each platform's evolving anti-bot defenses, handles failures gracefully, and keeps making progress in the face of CAPTCHAs and rate limits. The same principles apply to any system that depends on external services beyond your control.