Building Google Reviews Scraper Pro
A resilient Python web scraper for multi-language Google Maps reviews
Overview
Google Reviews Scraper Pro is a Python tool that extracts reviews from Google Maps listings, handles multiple languages, downloads review images, and stores the results in MongoDB. It was built to solve a real operational problem: manually collecting reviews for thousands of listings is not just slow; it is error-prone and impossible to scale.
The problem it solves
Review data is locked behind a JavaScript-heavy interface that actively resists scraping. Off-the-shelf tools break within weeks because Google rotates DOM selectors, throttles requests, and serves different markup to different user agents. This project takes the long view: it assumes the DOM will change and designs around that assumption.
Key features
- Multi-language extraction. Reviews are captured regardless of their original language, with metadata preserved for later translation or classification.
- Incremental scraping. On subsequent runs it picks up where it left off, only fetching new reviews. This makes daily cron runs cheap.
- Image downloading. Reviews with photos get their images pulled into storage, with URLs rewritten to point at the local copies.
- MongoDB integration. Built-in persistence means no CSV juggling. Queries are fast and the schema supports filtering by rating, language, date, and author.
- Detection resilience. Rate limiting, user-agent rotation, and request shaping keep the scraper under the radar.
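The incremental behavior above boils down to a checkpoint of already-seen review IDs. A minimal sketch, assuming a JSON checkpoint file and a `review_id` field (the project's actual schema may differ):

```python
import json
from pathlib import Path

def load_seen_ids(checkpoint: Path) -> set[str]:
    """Load the set of review IDs captured on previous runs."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()

def filter_new_reviews(reviews: list[dict], seen: set[str]) -> list[dict]:
    """Keep only reviews whose ID has not been stored yet."""
    return [r for r in reviews if r["review_id"] not in seen]

def save_seen_ids(checkpoint: Path, seen: set[str]) -> None:
    """Persist the updated ID set so the next run starts from here."""
    checkpoint.write_text(json.dumps(sorted(seen)))
```

On a daily cron run, only the delta passes through parsing and storage, which is what keeps repeated runs cheap.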
Tech stack
Python, Playwright for headless browsing, BeautifulSoup for parsing, MongoDB for storage, and Pillow for image processing. Dockerized so it runs anywhere with one command.
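To make the parsing step concrete: the project runs BeautifulSoup over Playwright-rendered HTML, but the same idea can be shown with only the standard library. This stand-in (including the `review-text` class name) is illustrative, not Google's real markup, which changes frequently:

```python
from html.parser import HTMLParser

class ReviewTextExtractor(HTMLParser):
    """Collect the text inside elements carrying a target class."""

    def __init__(self, target_class: str = "review-text"):
        super().__init__()
        self.target_class = target_class
        self._depth = 0            # >0 while inside a matching element
        self.reviews: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self._depth or self.target_class in classes:
            self._depth += 1
            if self._depth == 1:   # entering a new review element
                self.reviews.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:            # accumulate nested text nodes
            self.reviews[-1] += data
```

In the real pipeline, the class name would be one of the rotating selectors the project treats as disposable configuration rather than hard-coded logic.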
What I would do differently
If I were rebuilding this today, I'd move the image pipeline to a proper object store (Cloudflare R2 or S3) instead of local filesystem, and I'd split the scraping logic from the persistence layer so each can be tested independently. The current version couples them tightly, which makes unit tests awkward.
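That decoupling could look like the following: the scraper writes through a storage protocol instead of calling MongoDB directly, so tests can inject an in-memory fake. The names here are hypothetical, not the project's actual API:

```python
from typing import Protocol

class ReviewStore(Protocol):
    """What the scraping layer needs from persistence, and nothing more."""
    def upsert(self, review: dict) -> None: ...

class InMemoryStore:
    """Test double standing in for the MongoDB-backed implementation."""
    def __init__(self):
        self.rows: dict[str, dict] = {}

    def upsert(self, review: dict) -> None:
        self.rows[review["review_id"]] = review

def persist_reviews(reviews: list[dict], store: ReviewStore) -> int:
    """Scraper output flows through the protocol, not a concrete DB."""
    for r in reviews:
        store.upsert(r)
    return len(reviews)
```

With this seam in place, unit tests exercise extraction logic against `InMemoryStore` while the MongoDB adapter gets its own narrow integration tests.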
Takeaway
Scraping at scale is less about clever selectors and more about resilience. Every decision — rate limits, retries, checkpointing, logging — matters more than the HTML parsing itself. Build assuming things will break, and they break less.
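One of those resilience decisions, retries with capped exponential backoff, can be sketched like this (the attempt count, delays, and jitter range are illustrative defaults, not the project's tuned values):

```python
import random
import time

def with_retries(fn, attempts: int = 4, base_delay: float = 1.0,
                 cap: float = 30.0, jitter: float = 0.5, sleep=time.sleep):
    """Call fn, retrying on any exception with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(cap, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, jitter))  # jitter avoids thundering herds
```

Injecting `sleep` keeps even this piece testable, which is the same lesson in miniature: design for failure, and the failure paths get exercised instead of discovered in production.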
