I needed to scrape property listings from a real estate site that heavily relied on JavaScript to load content. The page source was empty until React rendered everything, so traditional tools like requests + BeautifulSoup wouldn’t work. Selenium was the answer.
What I was scraping Link to heading
Property data from a site that dynamically loaded listings as you scrolled. I needed to:
- Load the page
- Scroll to trigger lazy loading
- Wait for listings to appear
- Extract the data
The site had no public API, and while I could have reverse-engineered their internal API calls, using Selenium meant I didn’t have to worry about authentication or rate limiting logic.
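To give a sense of the shape of the scraper, here's a rough sketch of the scroll-and-wait loop. It uses the common "compare page heights" trick and assumes a driver object already created (setup is covered below); the 2-second pause and the assumption that listings are appended to the main document are placeholders that will vary by site.

import time

# Scroll until the page stops growing, i.e. all lazy-loaded listings have appeared.
# Assumes the site appends listings to the main document as you scroll;
# a site with an inner scroll container would need a different target element.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give newly loaded listings time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, we've reached the end
    last_height = new_height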
Selenium vs Playwright vs Puppeteer Link to heading
Selenium - The oldest and most mature option. Works with multiple browsers. Can be slow, but it’s stable and well-documented.
Playwright - Microsoft’s newer tool. Faster than Selenium and has better async support. More modern API. If I were starting fresh today, I’d probably use this.
Puppeteer - Google’s tool, Chrome-focused. Great if you only need Chrome/Chromium. Popular but now arguably superseded by Playwright.
I went with Selenium because:
- Huge community means better Stack Overflow coverage
- Works well with Chrome DevTools for debugging
- I already knew it from previous projects
Being a good citizen Link to heading
Before scraping any site:
- Check robots.txt - Visit https://example.com/robots.txt to see what the site allows. Respect the User-agent: * and Disallow: directives.
- Rate limit yourself - Add delays between requests. Using time.sleep(2) between page loads is a good starting point.
- Use a real User-Agent - If you're a bot, don't pretend to be a regular browser; identify yourself properly.
- Check for an API - Many sites have official APIs. Use those instead if they exist.
- Cache aggressively - Don't request the same page twice. Save to disk and reuse.
The property site I scraped had a robots.txt that allowed /listings, so I was in the clear. I also rate-limited to one request every 3 seconds and ran scraping overnight to avoid peak hours.
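If you'd rather check robots.txt programmatically than eyeball it, Python's standard library handles the parsing. A minimal sketch, where the base URL and User-Agent string are placeholders you'd swap for your own:

from urllib.robotparser import RobotFileParser

BASE_URL = "https://example.com"                      # placeholder site
USER_AGENT = "my-scraper (contact: me@example.com)"   # placeholder identity

robots = RobotFileParser()
robots.set_url(f"{BASE_URL}/robots.txt")
robots.read()

# True if the rules allow this user agent to fetch the path.
print(robots.can_fetch(USER_AGENT, f"{BASE_URL}/listings"))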
Installation Link to heading
pip install selenium webdriver-manager
Basic setup Link to heading
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install())
)
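If you're running this on a server or in a cron job, Chrome can run without a visible window by passing options. A sketch of the same setup with headless mode enabled; note that --headless=new applies to recent Chrome versions (older ones use plain --headless):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")            # no visible browser window
options.add_argument("--window-size=1920,1080")   # some lazy loading depends on viewport size

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)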
Open a page Link to heading
driver.get("https://example.com")
Wait for element to load Link to heading
This is crucial - never assume elements are ready immediately:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myid"))
)
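If the element never shows up, the wait raises TimeoutException, so it's worth catching. A sketch reusing the imports above; the div.listing selector is a placeholder:

from selenium.common.exceptions import TimeoutException

try:
    listings = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.listing"))
    )
except TimeoutException:
    listings = []  # page didn't load in time, or the selector is wrong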
Find elements Link to heading
driver.find_element(By.ID, "myid")
driver.find_element(By.CLASS_NAME, "myclass")
driver.find_element(By.CSS_SELECTOR, "div.myclass")
driver.find_elements(By.TAG_NAME, "a") # returns list
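For scraping, find_elements (plural) is the one you'll use most, since it returns a list you can loop over. For example, pulling the href off every link on the page:

# Iterate over all anchor tags and print their destinations.
for link in driver.find_elements(By.TAG_NAME, "a"):
    print(link.get_attribute("href"))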
Get text Link to heading
element.text
Click Link to heading
element.click()
Always close when done Link to heading
driver.quit()
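A try/finally block makes sure the browser process is cleaned up even if the scrape fails halfway through. Roughly, with a placeholder URL:

try:
    driver.get("https://example.com/listings")  # placeholder URL
    # ... find elements, extract data ...
finally:
    driver.quit()  # runs even if an exception was raised above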
For simple HTML parsing after JS loads Link to heading
Pass the rendered page source to BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
This is often easier than using Selenium’s selectors if you need to do complex HTML traversal.
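For example, pulling a title and price out of each listing card might look like the sketch below, continuing from the soup object above. The class names are placeholders; inspect the real page to find the actual ones.

# Placeholder selectors; adjust to the site's real markup.
listings = []
for card in soup.select("div.listing-card"):
    title = card.select_one("h2.listing-title")
    price = card.select_one("span.price")
    listings.append({
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    })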
Further reading Link to heading
- Selenium documentation
- robots.txt specification - learn how to be a polite scraper
- Playwright - modern alternative worth considering