Intermediate · 11 min read

Browser Use: Making Websites Accessible to AI Agents

Browser Use is a family of techniques and tools, led by the open-source Python library browser-use, that lets AI agents control a real web browser the way a human would, driven by tasks described in natural language.

Why This Matters

The web runs on APIs — but most of the web doesn't have one that you can use. Amazon, LinkedIn, Booking.com, most government portals, and thousands of other sites are designed for human eyeballs and human clicks, not for machine consumption. Traditional automation tools (Selenium, Playwright) can drive a browser programmatically, but they require a human developer to write explicit selectors and handling logic for every single interaction. When the website's layout changes, the script breaks.

AI browser automation inverts this. Instead of a human writing brittle code, an AI agent looks at the page, decides what to do next, and executes it, adapting to different layouts, handling edge cases, and recovering from errors. The practical consequence is significant: a task that once required hours of scripting (scrape 500 product pages, fill out 50 job applications, monitor a competitor's pricing) can now be described in a single sentence and executed autonomously.

This matters because the gap between "the web exists" and "the web is programmable" has been the bottleneck for AI agents since the beginning. Browser Use is arguably the most important infrastructure project closing that gap in 2025–2026.

Prerequisites

This article connects to several related concepts on Q4bits:

  • Agent Loops: an AI that repeatedly works toward a goal until a stopping condition is met (covered in the Agent Loops article).
  • LLM Function Calling: modern LLMs output structured commands that programs execute (covered in the Function Calling article).
  • Headless Browsers: browsers running without a graphical window (covered in the Headless Browsers article).
  • Web Scraping: extracting data from websites (covered in the Web Scraping article).

Core Idea

Imagine you need someone to fill out a job application on LinkedIn. You have two options:

  1. Write a script: Find the HTML IDs of every input field, write code to find each element, handle the dropdown menus, wait for the page to load, deal with the CAPTCHA — and pray LinkedIn doesn't redesign the page tomorrow.
  2. Tell a smart assistant: "Go to LinkedIn, find software engineering jobs at Google, and submit my profile to the top three."

Browser Use is option 2. It's a Python library that gives an LLM (like GPT-4 or Claude) the ability to control a real browser. The LLM receives a text or vision snapshot of the page, reasons about what to do next — "I see a search bar, I should type 'software engineer' into it" — and calls a browser action. The library handles the actual DOM interaction under the hood.

The key insight is that the LLM replaces the hardcoded logic. Traditional browser automation needs a programmer to anticipate every branching path. Browser Use lets the LLM improvise, which makes the automation layout-resistant and zero-shot — you don't need to pre-program interactions for every site.
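
To make the idea of an LLM "calling a browser action" concrete, here is a minimal sketch of how a structured decision from a model could be mapped onto browser operations. Everything in it is an assumption made for illustration: the JSON shape, the action names navigate, type_text, and click, and the FakeBrowser class are hypothetical stand-ins, not the browser-use API.

```python
import json


class FakeBrowser:
    """Hypothetical stand-in for a real browser; it only records calls."""

    def __init__(self):
        self.log = []

    def navigate(self, url):
        self.log.append(("navigate", url))

    def type_text(self, index, text):
        self.log.append(("type_text", index, text))

    def click(self, index):
        self.log.append(("click", index))


# Whitelist of actions the model may request (assumed names).
ALLOWED_ACTIONS = {"navigate", "type_text", "click"}


def dispatch(browser, llm_output):
    """Parse one JSON decision and call the matching browser method."""
    decision = json.loads(llm_output)
    name = decision["action"]
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {name}")
    getattr(browser, name)(**decision.get("args", {}))


browser = FakeBrowser()
dispatch(browser, '{"action": "navigate", "args": {"url": "https://example.com"}}')
dispatch(browser, '{"action": "type_text", "args": {"index": 3, "text": "software engineer"}}')
print(browser.log)
```

The whitelist is the important design choice in this sketch: the model only proposes actions, and the surrounding program decides which of them are allowed to run.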

How It Actually Works

The High-Level Architecture

graph TD
    User["👤 User / Code"] -->|describes task| Agent["🤖 Agent Loop"]
    Agent -->|observes page state| Snapshot["📸 Page Snapshot"]
    Snapshot -->|feeds to| LLM["🧠 LLM (GPT-4 / Claude / Gemini)"]
    LLM -->|decides next action| Agent
    Agent -->|executes action| Browser["🌐 Real Browser (Chromium)"]
    Browser -->|returns new page state| Agent
    Agent -->|task complete?| Done["✅ Return Result"]

    classDef user fill:#d4a853,stroke:#c49a3c,stroke-width:2px,color:#050505
    classDef agent fill:#a78bfa,stroke:#8b5cf6,stroke-width:2px,color:#ffffff
    classDef snapshot fill:#2d2d2d,stroke:#4a4a4a,stroke-width:2px,color:#c9c9c9
    classDef llm fill:#1c1c3a,stroke:#5b8cf6,stroke-width:2px,color:#87CEEB
    classDef browser fill:#1a3a2a,stroke:#34d399,stroke-width:2px,color:#34d399
    classDef done fill:#1a2a1a,stroke:#34d399,stroke-width:2px,color:#34d399

    class User user
    class Agent agent
    class Snapshot snapshot
    class LLM llm
    class Browser browser
    class Done done

The loop — sometimes called a perception-action loop — works like this:

  1. Perceive: The agent captures the current browser state, either as a text representation (the rendered DOM or accessibility tree) or as a screenshot (for vision-capable models).
  2. Reason: The snapshot is fed to the LLM, which outputs a decision: which element to interact with and what action to take (click, type, scroll, wait, extract).
  3. Act: The agent executes the chosen action through Playwright or a direct CDP (Chrome DevTools Protocol) connection.
  4. Repeat: The browser updates, the agent captures the new state, and the loop continues until the task is complete, the agent hits a limit, or the LLM signals "done."
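
The four steps above can be sketched as a short loop. This is a toy model under stated assumptions, not the library's implementation: FakeBrowser, FakePage, and stub_llm are hypothetical, the "LLM" is a fixed rule rather than a real model call, and the page text is invented so the example can run offline.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Hypothetical page state: a URL plus the text the agent can read."""
    url: str = "about:blank"
    text: str = ""


@dataclass
class FakeBrowser:
    page: FakePage = field(default_factory=FakePage)

    def snapshot(self):
        return f"url={self.page.url} text={self.page.text}"

    def execute(self, action):
        # Only one action type is modelled here: navigation.
        if action["type"] == "navigate":
            self.page = FakePage(url=action["url"], text="Top post: Example title")


def stub_llm(task, snapshot):
    """Stand-in for a model call: a fixed rule instead of real reasoning."""
    if "about:blank" in snapshot:
        return {"type": "navigate", "url": "https://news.ycombinator.com"}
    return {"type": "done", "result": snapshot.split("text=", 1)[1]}


def run_agent(task, browser, decide, max_steps=10):
    for _ in range(max_steps):
        state = browser.snapshot()      # 1. perceive
        action = decide(task, state)    # 2. reason
        if action["type"] == "done":    # the model signals completion
            return action["result"]
        browser.execute(action)         # 3. act, then 4. repeat
    return None                         # step limit reached


print(run_agent("find the top post", FakeBrowser(), stub_llm))
```

The max_steps argument corresponds to the "agent hits a limit" exit condition: without it, a model that never signals completion would loop forever.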

DOM-First vs Vision-First

There are two fundamentally different approaches, and the choice affects cost, reliability, and latency:

| Approach | How it sees the page | Pros | Cons |
| --- | --- | --- | --- |
| DOM-first | Converts HTML / accessibility tree into text (e.g., a numbered list of clickable elements) | Cheap (small tokens), fast, precise element targeting | Misses visual layout, struggles with CSS-heavy rendering |
| Vision-first | Takes a screenshot, feeds it to a vision model (GPT-4V, Claude 3.5 Sonnet) | Understands visual context, works even when DOM is messy | Expensive (large images = lots of tokens), slower |

Most production systems use a hybrid: vision for initial understanding and layout awareness, DOM for precise action targeting. Browser Use supports both modes.
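
As an illustration of the DOM-first idea, the sketch below turns raw HTML into a numbered list of interactive elements using only Python's standard library. It is a simplified assumption of what such a text snapshot might look like; the tag list, the labelling rules, and the output format are invented for this example and are not taken from browser-use.

```python
from html.parser import HTMLParser

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementLister(HTMLParser):
    """Collects interactive elements in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []   # one dict per interactive element
        self._open = None    # element currently collecting inner text

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            attrs = dict(attrs)
            label = attrs.get("aria-label") or attrs.get("placeholder") or ""
            element = {"tag": tag, "label": label}
            self.elements.append(element)
            # <input> is a void element, so it has no inner text to collect.
            self._open = None if tag == "input" else element

    def handle_data(self, data):
        # Use the first inner text as the label when no attribute supplied one.
        if self._open is not None and not self._open["label"]:
            self._open["label"] = data.strip()

    def handle_endtag(self, tag):
        if self._open is not None and self._open["tag"] == tag:
            self._open = None


def dom_snapshot(html):
    parser = InteractiveElementLister()
    parser.feed(html)
    return "\n".join(
        f"[{i}] <{el['tag']}> {el['label']}" for i, el in enumerate(parser.elements)
    )


html = """
<div><h1>Jobs</h1>
<input placeholder="Search jobs">
<button>Search</button>
<a href="/saved">Saved jobs</a></div>
"""
print(dom_snapshot(html))
```

With a snapshot like this, the model can answer with an element number instead of a CSS selector, which is why DOM-first prompts stay small and targeting stays precise.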

The Browser Use Library Specifically

The browser-use/browser-use repository has grown to over 98,000 stars on GitHub (as of mid-2026), making it the most popular open-source browser agent framework by a wide margin.

Key architectural features of the library:

  • Rust-powered core (v0.13+): The latest beta agent rewrites the runtime in Rust for speed and reliability. Install with pip install browser-use[core].
  • Self-healing browser harness: If a click fails because the element moved, the harness retries with alternative selectors.
  • Multi-model support: Works with OpenAI, Anthropic, Google Gemini, Ollama (local models), and the custom ChatBrowserUse model optimised specifically for browser tasks (3–5× faster on benchmarks).
  • Persistent sessions: The browser stays open between commands in CLI mode, maintaining login state and cookies.
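
The self-healing idea, retrying a failed click with alternative selectors, can be sketched as a simple fallback chain. This is only an illustration of the general pattern, not how the library's harness is implemented: FakePage, ElementNotFound, click_with_fallbacks, and the selectors themselves are all hypothetical.

```python
class ElementNotFound(Exception):
    pass


class FakePage:
    """Hypothetical page that only knows a fixed set of selectors."""

    def __init__(self, present):
        self.present = set(present)
        self.clicked = []

    def click(self, selector):
        if selector not in self.present:
            raise ElementNotFound(selector)
        self.clicked.append(selector)


def click_with_fallbacks(page, selectors):
    """Try each candidate selector in order; return the one that worked."""
    errors = []
    for selector in selectors:
        try:
            page.click(selector)
            return selector
        except ElementNotFound as exc:
            errors.append(str(exc))
    raise ElementNotFound(f"all candidates failed: {errors}")


page = FakePage(present={"button[aria-label='Add to Cart']"})
used = click_with_fallbacks(
    page,
    ["#add-to-cart-button", "button[aria-label='Add to Cart']", "text=Add to Cart"],
)
print(used)
```

Here the first selector fails, the second succeeds, and the third is never tried. Only when every candidate fails does the error surface to the caller.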

A Minimal Working Example

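import asyncio

from browser_use import Agent, ChatOpenAI

async def main():
    agent = Agent(
        task="Go to Hacker News, find the top post, and tell me its title and points.",
        # The model is passed as a chat wrapper object via llm=, not as a string.
        llm=ChatOpenAI(model="gpt-4o"),  # other providers have their own chat wrappers
    )
    history = await agent.run()  # run() is a coroutine, so it must be awaited
    print(history.final_result())  # final answer extracted by the agent

asyncio.run(main())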

Under the hood, the agent opens Chrome, navigates to news.ycombinator.com, reads the rendered page, clicks the top link if needed, extracts the title and score, and returns the result as structured text. The whole interaction takes seconds and requires zero selectors.

Worked Example: A Real-World Task

Let's trace what happens when you ask Browser Use to "buy a 65-inch OLED TV under $1,500 on Amazon and add it to my cart."

Step 1 — Agent receives the task. The agent has no pre-programmed Amazon logic. It just knows it has browser tools.

Step 2 — Open Amazon. The agent calls browser_navigate("https://www.amazon.com"). Chromium loads the page.

Step 3 — Search for the product. The agent sees the search bar in the DOM snapshot. It decides to type into it, for example with an action like browser_type_text("#search", "65 inch OLED TV"), and then presses Enter. The action name and selector here are illustrative.

Step 4 — Filter results. The search results page loads. The agent reads the DOM or takes a screenshot. It sees price filters. It clicks the filter for "Under $1,500" using a selector or coordinates.

Step 5 — Pick the best option. The agent scans the filtered results, reads titles and prices, and picks a TV that matches the criteria. It clicks the product link.

Step 6 — Add to cart. On the product page, the agent looks for the "Add to Cart" button and clicks it. If a "Protection Plan" popup appears, the LLM decides whether to dismiss it or accept it — it reads the popup content and reasons about it.

Step 7 — Confirm. The agent extracts the cart total and reports back: "A [specific TV model] for $1,399 has been added to your cart."

If any step fails (the search returned no results, the page loaded slowly, a CAPTCHA appeared), the LLM reasons about the failure and tries an alternative approach rather than crashing, although some obstacles, such as CAPTCHAs, can still stop it.

The Ecosystem: How Browser Use Fits In

Browser Use is not alone. The AI browser automation space has rapidly matured:

| Tool | Language | Philosophy | Strengths |
| --- | --- | --- | --- |
| Browser Use | Python | Full autonomy: describe the goal, agent decides everything | Best autonomous reasoning, 89.1% WebVoyager benchmark, largest community (98k stars) |
| Stagehand (by Browserbase) | TypeScript | Hybrid: you write structured scripts with AI primitives | Faster for repetitive tasks, action caching, better for production where costs matter |
| Skyvern | Python/API | Managed + computer-vision-first | Best for CAPTCHA-heavy sites, form-heavy workflows, managed deployment |
| Playwright (vanilla) | TypeScript/Python | Deterministic: you write every selector | Fastest, cheapest, most reliable if selectors are stable |
| Selenium | Many languages | Legacy automation | Broadest language support, mature ecosystem, outdated architecture |
| Agent Browser (Vercel) | Rust/TS | Minimalist CLI for AI agents | 93% context window saving, Rust performance |

Browser Use leads in autonomous reasoning — the ability to handle completely novel, multi-step web tasks without any pre-written logic. Stagehand leads in production efficiency — if you run the same structured extraction every day, Stagehand's caching makes it cheaper. Playwright remains king for deterministic testing (run the same 1,000 tests a thousand times).

The Framework Wars

A civil war is quietly happening in the browser automation space. Traditional tools (Playwright, Selenium) are being caught between two forces: vision-first agents that treat the browser like a human (screenshots, pixels) and DOM-first agents that treat it like an API (structured text, selectors). Browser Use sits in the middle — it can do both, and its Rust-powered core gives it performance approaching Playwright's while keeping the flexibility of an AI-driven approach.

Common Misconceptions

"Browser Use replaces web scraping entirely."

It subsumes scraping for use cases that need reasoning and adaptivity, but for high-volume, deterministic data extraction (scrape 10,000 product pages every hour), a plain Playwright script with well-maintained selectors is still orders of magnitude faster and cheaper. Each Browser Use task incurs an LLM API cost of $0.01–0.05+ just for reasoning.

"It works on every website."

Sites with aggressive anti-bot protection (Cloudflare, DataDome), complex multi-step CAPTCHAs, or highly dynamic single-page apps can still fail. The open-source version struggles here; the Cloud version handles stealth browsers and CAPTCHA solving as premium features.

"I don't need to know anything about browsers to use it."

You can get very far with just natural language, but for production use you'll want to understand browser basics — what a DOM is, how sessions work, how cookies persist, what a user agent is. The LLM handles the reasoning, but debugging failures often requires browser knowledge.

"The LLM costs are negligible."

They add up. A single multi-step task (browse 5 pages, extract data from each) can cost $0.03–0.10 in API tokens. Running 1,000 such tasks = $30–100. For large-scale operations, the LLM cost is the bottleneck, not the browser infrastructure.
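
The arithmetic behind that estimate is simple enough to check in a few lines. The per-task figures are the ones quoted above; the batch_cost helper is a hypothetical name introduced for this sketch.

```python
def batch_cost(tasks, cost_per_task_low, cost_per_task_high):
    """Return the (low, high) total cost in dollars for a batch of tasks."""
    return tasks * cost_per_task_low, tasks * cost_per_task_high


low, high = batch_cost(1_000, 0.03, 0.10)
print(f"${low:.0f} to ${high:.0f}")
```

Because the cost scales linearly with task count, the same function shows how quickly a larger workload grows: ten times the tasks means ten times the bill.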

"It's just Selenium with an LLM."

Superficially, yes — both drive a browser. But the difference in reliability profile is huge. Selenium scripts break on layout changes; Browser Use adapts. Selenium requires a programmer to fix failures; Browser Use retries with different strategies. The failure mode shifts from "broken selector" to "LLM hallucination" — a very different debugging experience.

Key Takeaways

  • Browser Use is both a broad concept (AI agents controlling browsers) and a specific open-source library (browser-use) that leads the space with 98k+ GitHub stars.
  • The core mechanism is a perception-action loop: the agent sees the page (via DOM or vision), the LLM decides the next action, and the agent executes it through Playwright or CDP.
  • The library is MIT-licensed and free (you only pay for LLM API calls), with a paid Cloud offering ($29–$999/mo) for stealth browsers, CAPTCHA solving, and managed infrastructure at scale.
  • It excels at autonomous, multi-step web tasks that would require brittle scripting with traditional tools — job applications, price comparison, form filling, data extraction.
  • The major trade-off is LLM cost vs. reliability: each task costs tokens, but the agent adapts to layout changes and recovers from failures without human intervention.
