Web Scraping: Extracting Data from the Web
Web scraping is the automated process of extracting data from websites — reading the HTML structure of a page, parsing it, and turning the relevant pieces into structured information (CSV, JSON, database) that machines can use.
Why This Matters
The internet is the largest repository of human knowledge ever built — but most of it is formatted for human eyes, not for machine consumption. If you want to answer a question like "how much does a 65-inch OLED TV cost across ten different retailers," reading each site manually would take hours. Writing a traditional scraper (that extracts the data automatically) is faster, but brittle: one CSS class rename and the whole thing breaks.
Web scraping matters because it bridges the gap between "the data exists on the web" and "the data is available for computation." Every price comparison site, every market research report, every AI training dataset, and every lead generation pipeline depends on scraping in some form. And as Browser Use and similar AI tools show, the line between scraping and autonomous web interaction is blurring: the same techniques that extract data can also fill forms, navigate workflows, and act on the web.
Prerequisites
You should understand:
- HTML basics: websites are structured as HTML documents. Scraping fundamentally means reading HTML and extracting the parts you need.
- HTTP: the protocol your scraper uses to fetch pages. Status codes, request headers, and rate limiting all matter.
- Headless browsers: modern websites are built with JavaScript frameworks (React, Vue, Angular) that render content dynamically. A headless browser is often necessary to let the JavaScript execute before you scrape (covered in the Headless Browsers article).
Core Idea
Imagine you have a stack of printed restaurant menus. Each menu has the restaurant name at the top, dishes in the middle, and prices on the right. You need to build a spreadsheet of all the prices for "chicken shawarma" across all the menus.
- Manually: You read each menu, find the line with "chicken shawarma," copy the price, and type it into a spreadsheet. Slow but reliable.
- Traditional scraping: You write a robot that looks for the exact position of "chicken shawarma" on every menu. Works as long as all menus use the same layout. If a restaurant rearranges its menu, the robot fails.
- AI-assisted scraping: You tell a smart assistant: "Find me the price of chicken shawarma on every menu." It reads the menus, understands that "chicken shawarma" and "شاورما دجاج" mean the same thing, finds the nearest number, and gives you the answer — even if each menu is laid out differently.
That last option is what modern AI tools bring to web scraping: the ability to adapt to different page structures by understanding meaning rather than relying on fixed patterns.
How It Actually Works
The Traditional Scraping Pipeline
Traditional web scraping follows a predictable sequence:
- Fetch: Download the page's HTML using an HTTP client (Python's requests, cURL). For JavaScript-heavy sites, use a headless browser (Playwright, Puppeteer) that renders the page first.
- Parse: Read the HTML structure into a traversable tree using a parser (BeautifulSoup in Python, Cheerio in Node.js, or native DOM APIs in a headless browser).
- Extract: Find the specific elements you want, usually by CSS selectors, XPath expressions, or element IDs. Extract the text, attributes, or URLs from those elements.
- Transform: Clean the extracted data: strip whitespace, convert strings to numbers, handle missing values, format dates.
- Store: Save the structured data to a CSV, JSON file, database, or API endpoint.
Here is what that looks like in practice:
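import csv

import requests
from bs4 import BeautifulSoup

# Step 1: Fetch
response = requests.get("https://books.toscrape.com/", timeout=10)
response.raise_for_status()  # stop on a 4xx/5xx response instead of parsing an error page

# Step 2: Parse (passing bytes lets BeautifulSoup detect the page's own encoding)
soup = BeautifulSoup(response.content, "html.parser")

# Step 3: Extract
books = []
for article in soup.select("article.product_pod"):
    title = article.h3.a["title"]
    # Step 4: Transform (here, only trimming surrounding whitespace)
    price = article.select_one("p.price_color").get_text(strip=True)
    books.append({"title": title, "price": price})

# Step 5: Store
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(books)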
This script works until the website changes its CSS classes. If .product_pod becomes .book-card, the selector matches nothing and the script silently writes a CSV that contains only the header row.
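Two small hardening steps address the weaknesses just described: converting the price text into a number (the Transform step) and failing loudly when a selector stops matching. The following is a minimal sketch, not part of the original script; the names `parse_price`, `require_matches`, and `SelectorDriftError` are hypothetical helpers introduced here for illustration.

```python
import re
from decimal import Decimal


class SelectorDriftError(RuntimeError):
    """Raised when a selector matches fewer elements than expected (hypothetical helper)."""


def require_matches(elements, selector, minimum=1):
    """Return elements unchanged, or raise if the selector matched too few of them."""
    if len(elements) < minimum:
        raise SelectorDriftError(
            f"selector {selector!r} matched {len(elements)} element(s); "
            f"expected at least {minimum}"
        )
    return elements


_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text):
    """Return the first number in a price string as a Decimal, or None if there is none."""
    match = _NUMBER.search(text.replace(",", ""))
    return Decimal(match.group()) if match else None


print(parse_price("£51.77"))  # 51.77
print(parse_price("N/A"))     # None
```

In the script above, wrapping the selection as `require_matches(soup.select("article.product_pod"), "article.product_pod")` would turn a renamed class into an immediate error rather than an empty CSV.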
The Modern AI-Assisted Approach
AI-powered scraping (as in Browser Use or Stagehand) replaces the fragile selector-based extraction with natural language goals:
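import asyncio
from browser_use import Agent, ChatOpenAI  # assumption: the LLM import path and model name vary by browser-use version

async def main():
    agent = Agent(task="Extract all book titles and prices from books.toscrape.com", llm=ChatOpenAI(model="gpt-4.1-mini"))
    history = await agent.run()  # run() is a coroutine, so it must be awaited
    print(history.final_result())  # the extracted data, obtained without any CSS selectors

asyncio.run(main())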
The agent works in roughly these steps:
- It opens the page in a headless browser.
- It reads the DOM or takes a screenshot.
- The LLM identifies which elements are book titles and which are prices, and returns them in a structured format.
- If the site uses a different layout on page 2, the agent adapts without manual re-coding.
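Because the structured output comes from an LLM rather than from fixed selectors, it is worth validating before storing it. The sketch below assumes the agent's answer arrives as a JSON array of objects with "title" and "price" keys; that shape, the `Book` type, and the `parse_books` function are assumptions made for illustration, not something the library guarantees.

```python
import json
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    price: str


def parse_books(raw_json):
    """Validate a JSON array of {"title", "price"} objects and return Book records."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of books")
    books = []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not an object")
        title, price = item.get("title"), item.get("price")
        if not isinstance(title, str) or not isinstance(price, str):
            raise ValueError(f"item {index} is missing a string title or price")
        books.append(Book(title=title.strip(), price=price.strip()))
    return books


# Hypothetical agent output used only to exercise the validator
sample = '[{"title": "A Light in the Attic", "price": "£51.77"}]'
print(parse_books(sample))
```

Rejecting malformed output at this boundary keeps a misread page from quietly corrupting the stored dataset.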
Basic vs. Interactive Scraping
A useful distinction from the Browser Use 2026 scraping guide:
| Type | What it means | Tools | Reliability |
|---|---|---|---|
| Basic scraping | The data is already in the initial HTML. No interaction needed — just fetch and parse. | requests + BeautifulSoup, Firecrawl | High (static pages), but misses JS-rendered content |
| Interactive scraping | The scraper must act on the page: scroll, click buttons, fill forms, wait for lazy-loading. | Playwright, Browser Use, Puppeteer | Lower (more moving parts), but gets data that basic scraping misses |
Most modern web scraping involves a mix of both: use basic scraping for the initial page structure, then interactive scraping for dynamic content that loads on scroll or after user interaction.
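One rough way to decide which approach a page needs is to check whether a value you can see in the browser also appears in the visible text of the raw HTML. The sketch below is a heuristic under that assumption, using only the standard library; `in_initial_html` is a hypothetical helper, and the two HTML strings are made-up samples.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def in_initial_html(html, expected_text):
    """Return True if expected_text is present in the page's visible text before any JavaScript runs."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return expected_text in " ".join(parser.chunks)


static_page = '<html><body><p class="price_color">£51.77</p></body></html>'
js_page = (
    '<html><body><div id="root"></div>'
    '<script>window.__DATA__ = {"price": "£51.77"}</script></body></html>'
)
print(in_initial_html(static_page, "£51.77"))  # True
print(in_initial_html(js_page, "£51.77"))      # False
```

A False result suggests the content is rendered or loaded later, which points toward interactive scraping for that page.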
The Anti-Bot Arms Race
A critical and often frustrating aspect of real-world scraping: many websites do not want to be scraped. They employ increasingly sophisticated anti-bot systems:
- Rate limiting — Block IPs that make too many requests too quickly.
- User-agent checking — Block requests that don't come from known browser user agents.
- JavaScript challenges — Require the client to execute JavaScript before serving content (Cloudflare's "I am human" check).
- Behavioral analysis — Track mouse movements, scroll patterns, navigation speed.
- CAPTCHAs — reCAPTCHA v3, hCaptcha, Turnstile. Increasingly common and increasingly difficult to solve automatically.
- Fingerprinting — Detect headless browsers by checking for the absence of certain browser features or the presence of automation flags.
As of 2026, over 60% of scraping professionals report increased infrastructure costs year-over-year, driven primarily by the need to evade adaptive anti-bot defenses. This has pushed many teams toward managed scraping infrastructure (Firecrawl, Browserbase) that handles stealth and proxy rotation as a service.
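On the client side, the simplest response to rate limiting is to pace requests so the scraper never sends them too quickly. The following is a minimal sketch of a fixed-interval throttle; the `Throttle` class and its two-second interval are illustrative assumptions, not a recommendation for any particular site.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=2.0)
throttle.wait()  # the first call returns immediately; later calls sleep as needed
```

Calling `throttle.wait()` before each fetch keeps the request rate at or below one per `min_interval` seconds.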
Common Misconceptions
"Web scraping is illegal."
Scraping publicly accessible data is generally legal in most jurisdictions, but the legal landscape is complicated and evolving. In hiQ Labs v. LinkedIn, the US Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). Even so, terms of service violations, copyright concerns, and data privacy regulations (GDPR, CCPA) can still create legal exposure. This article is not legal advice.
"AI scraping completely replaces traditional scraping."
Not for high-volume work. AI-assisted scraping costs $0.01–0.05 per task in API calls. If you need to scrape 100,000 pages daily, a traditional Playwright script costs pennies in server time, while AI-driven scraping would cost roughly $1,000 to $5,000 per day. The sweet spot for AI scraping is low-volume, high-variety tasks.
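The arithmetic behind that comparison can be checked directly. This sketch assumes each page counts as one AI task, which is a simplification; the function name is hypothetical.

```python
def daily_ai_cost(pages_per_day, cost_per_task_usd):
    """Daily API spend in USD if every page is handled as one AI task."""
    return pages_per_day * cost_per_task_usd


PAGES_PER_DAY = 100_000
low = daily_ai_cost(PAGES_PER_DAY, 0.01)   # lower bound of the per-task price range
high = daily_ai_cost(PAGES_PER_DAY, 0.05)  # upper bound of the per-task price range
print(f"${low:,.0f} to ${high:,.0f} per day")  # $1,000 to $5,000 per day
```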
"You need to know programming to scrape data."
In 2026, no-code tools (Browse AI, Apify, Bardeen, Claygent) let non-developers extract data by pointing and clicking. But for anything custom, high-volume, or technically challenging, programming skills remain necessary.
"If you can see it in your browser, you can scrape it."
Not always. Some sites dynamically load content behind authentication, use WebSockets for real-time updates, or render content in Canvas/WebGL that a scraper cannot easily extract. Some data is protected by law or terms of service even if technically accessible.
Key Takeaways
- Web scraping is the process of extracting structured data from websites that lack a dedicated API.
- The traditional approach (requests + BeautifulSoup/Playwright + CSS selectors) is fast and cheap but brittle — a single class name change breaks the scraper.
- The modern AI-assisted approach (Browser Use, Stagehand) uses LLMs to extract data by natural language description, adapting to different layouts without re-coding.
- The two approaches serve different needs: traditional scraping for high-volume deterministic extraction, AI scraping for low-volume adaptive extraction.
- Anti-bot defenses are escalating rapidly, making production scraping increasingly expensive and complex.
- Legal considerations around scraping are jurisdiction-specific and evolving. When in doubt, consult legal counsel.
References
- Browser Use — "The Ultimate Guide to Web Scraping (2026)" — browser-use.com
- Browserless — "State of Web Scraping 2026" — browserless.io
- Kadoa — "What is Web Scraping? Enterprise Use Cases for 2026" — kadoa.com
- Ropes & Gray LLP — "Web Scraping in the Age of AI" (May 2026) — ropesgray.com
- Wikipedia — "Web scraping" — en.wikipedia.org/wiki/Web_scraping
- Firecrawl — "Top 9 Browser Automation Tools for Web Testing and Scraping in 2026" — firecrawl.dev