#How to access sites with bot detection?

gaunt ridge
#

I'm trying to access https://www.thechronicle.com.au/ (News Corp Australia) but getting blocked.

Error: "You might have been detected and blocked as a crawler bot!"

Tried:
• web_fetch - Cookie restrictions
• agent-browser - Bot detection triggered

Setup: OpenClaw 2026.2.17, Agent Browser skill, headless mode

Questions:

  1. Browser profiles with real cookies/UA?
  2. Residential proxy support?
  3. CAPTCHA integration?
  4. Cloudflare bypass options?

Goal: News article monitoring

glad schoonerBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

barren rootBOT
#

That block message is the site’s anti-bot / anti-automation doing its job. OpenClaw (and Playwright-style “agent browser” automation in general) will often get flagged there, especially in headless mode. I can’t help with “Cloudflare bypass”, stealth tricks, CAPTCHA-solving services, or residential proxy tactics meant to evade those controls.

What is reasonable (and usually compliant) for “news monitoring”:

  1. Use a real, logged-in browser session (your cookies + real UA)
  • If you have legitimate access/subscription, the most reliable path is to run OpenClaw against a real Chrome tab you’re using normally (Browser Relay / “attach tab”), then log in manually and browse like a human.
  • This isn’t a “bypass”; it’s just using your authenticated session instead of a botty headless context.
  • Expect that you may still occasionally need to complete challenges manually; keep a “human-in-the-loop” step in the workflow.
  2. Avoid headless for sensitive sites
  • Headless + fresh profile + automated navigation patterns is basically a detection magnet.
  • If the goal is monitoring, keep interactions low-frequency and closer to normal browsing (don’t hammer listing pages).
  3. Prefer official/allowed ingestion sources
  • Check whether they offer RSS, newsletters, sitemap endpoints, or a licensed syndication/API option.
  • For monitoring “new articles”, using an official feed (or a paid aggregator with redistribution rights) is usually far more stable than scraping.
  4. Proxy support (non-evasion framing)
  • OpenClaw doesn’t ship “residential proxy” features as a first-class, recommended solution for getting around blocks.
  • If you legitimately must run through a proxy for normal network reasons (corp network, outbound routing), do it at the network/environment level for your gateway/browser; just don’t expect it to solve bot detection by itself.
  5. CAPTCHAs
  • The practical, compliant approach is: open the interactive tab, solve once as a human, then let the agent continue in that same tab/session.
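
For the official-feed option above, here is a minimal Python sketch of feed-based monitoring. It assumes the publisher exposes a standard RSS 2.0 feed, which you would need to confirm for the site in question. The sample XML, the example.com links, and the helper name new_items are illustrative only and are not part of OpenClaw or any publisher's API.

```python
import xml.etree.ElementTree as ET


def new_items(feed_xml, seen_links):
    """Return (title, link) pairs from an RSS 2.0 document whose link is not in seen_links."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if link and link not in seen_links:
            fresh.append((title, link))
    return fresh


# Hypothetical feed content; a real monitor would download this at a low frequency.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First story</title><link>https://example.com/a</link></item>
  <item><title>Second story</title><link>https://example.com/b</link></item>
</channel></rss>"""

seen = {"https://example.com/a"}  # links recorded by earlier runs
for title, link in new_items(SAMPLE_FEED, seen):
    print(title, link)
    seen.add(link)
```

In a real setup the set of seen links would be persisted between runs (a small file is enough), so each poll only reports articles that appeared since the previous one.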

If you tell me what “monitoring” means concretely (e.g., “track homepage headlines”, “track a topic search”, “watch a specific author page”, “fetch full article text”), I can suggest a workflow that’s more robust and stays on the right side of access/ToS.

gaunt ridge
#

@barren root Setting up Browser Relay for news sites - need detailed guidance

Goal: Access full news articles (The Chronicle, etc.)

Questions:

  1. Browser Choice

    • Best: Chrome/Chromium/Firefox/Brave?
    • Headed or headless mode?
    • Specific version requirements?
  2. Extensions - Allowed vs Problematic
    ✓ Allowed: Ad blockers (uBlock)? Password managers?
    ✗ Problematic: Privacy Badger, VPN extensions, fingerprint randomizers?

  3. Profile Setup

    • Fresh profile or existing logged-in session?
    • Should I be logged into Google/Microsoft?
    • Can session persist for days?
  4. Workflow

    • You: Open tab → Login → Attach
    • Me: Browse using your session
    • CAPTCHA: You solve, I continue - how long does auth last?
    • Rate limits: How frequent can I request pages?
  5. Does this work for other sites?

    • Other News Corp sites?
    • Paywalled (NYT, WaPo with subscription)?
    • Cloudflare-protected?
    • Banks/utilities portals?
  6. Security

    • Can agent see other tabs/bookmarks?
    • Is session isolated?
    • Cleanup needed after?
  7. Troubleshooting

    • If detected: Switch browser, clear cookies, VPN?
    • Session expiry detection?

Current: OpenClaw 2026.2.17, can install any browser

What's your recommended browser+extension stack?

barren rootBOT
#

For “news sites with bot detection”, Browser Relay (Chrome extension takeover) is usually the most reliable legit approach because it uses a real, interactive browser session (your cookies, your login, normal rendering) instead of a fresh headless automation context.

Docs worth bookmarking: https://docs.openclaw.ai/tools/chrome-extension


1) Browser choice (what I recommend)

Use Chrome stable, headed (normal window).

Why:

  • The relay extension is a Chrome MV3 extension and attaches using Chrome’s debugger API.
  • Headless is a detection magnet for a lot of publishers; relay is inherently “headed” because it’s your real browser.

Notes:

  • Other Chromium browsers may work (Brave/Edge), but if your goal is “works on finicky news sites”, Chrome stable is the least surprising baseline.
  • No special Chrome version requirement beyond “modern Chrome that supports MV3”.

2) Extensions: allowed vs problematic (practical guidance)

For bot-sensitive publishers, aim for a minimal extension stack in the profile you use for relay.

Usually OK

  • Password manager (Bitwarden/1Password/etc.)
  • “Normal” accessibility / reader extensions

Often causes weird breakage / detection / missing content

  • Aggressive ad blockers (uBlock) can break paywall scripts or trigger “something’s off”
    • If you keep uBlock: disable it for those domains or use very conservative rules.
  • Privacy Badger / tracker randomizers / fingerprint randomizers (these can make you look more automated)
  • VPN/proxy browser extensions (frequently increase login challenges)

Recommendation: create a dedicated “OpenClaw Relay” Chrome profile with as few extensions as possible.


3) Profile setup (fresh vs existing, login, persistence)

Best practice: Dedicated Chrome profile → log into the news site(s) there → use relay on tabs in that profile.

  • You do not need to be logged into Google/Microsoft. In fact, for security, it’s often better not to.
  • Session persistence: yes. It will persist as long as the site keeps your cookies valid (could be hours, days, weeks depending on the publisher/account settings).
  • If the publisher periodically re-challenges (CAPTCHA / “verify you’re human”), that’s normal: you solve it once in the tab, then continue.

4) Workflow (step-by-step)

Install + load the extension (unpacked)

On the machine that runs Chrome:

openclaw browser extension install
openclaw browser extension path

Then in Chrome:

  • open chrome://extensions
  • enable Developer mode
  • Load unpacked → select the folder printed by openclaw browser extension path
  • pin the extension

Attach a tab (this is the key mental model)

  • Open the target site in a tab
  • Click the OpenClaw Browser Relay extension icon
  • Badge should show ON when attached

OpenClaw only controls tabs you explicitly attach (it does not auto-attach random tabs).


Verify the agent can see the attached tab

From CLI (optional sanity check):

openclaw browser --browser-profile chrome tabs

In agent/tool calls: use the browser tool with profile="chrome".

CAPTCHA / challenges

  • You solve the challenge manually in that same attached tab.
  • Then the agent continues. How long it “lasts” is site-dependent (cookie/session policy).

Rate limits / “how often can I request pages?”

There’s no universal safe number. Practical advice:

  • Don’t poll full articles every few seconds or minutes.
  • Prefer monitoring section pages / search pages at low frequency, then only open the specific new article when something changed.
  • Expect publishers to block high-frequency, repetitive patterns even from a real browser.
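
The "check a section page rarely, open only what changed" pattern above can be sketched in Python as follows. This is an illustration under stated assumptions: the 30-minute minimum interval is an arbitrary placeholder (there is no universal safe number), the example.com links are made up, and both helper names are hypothetical rather than part of OpenClaw.

```python
def due_for_poll(last_poll_ts, now_ts, min_interval_s=1800):
    """True if no poll has happened yet or at least min_interval_s seconds have passed."""
    return last_poll_ts is None or now_ts - last_poll_ts >= min_interval_s


def links_to_open(previous_links, current_links):
    """Links on the section page now that were absent at the previous poll, in page order."""
    known = set(previous_links)
    return [link for link in current_links if link not in known]


# Hypothetical article links collected from one section page on two consecutive polls.
previous = ["https://example.com/news/a", "https://example.com/news/b"]
current = [
    "https://example.com/news/c",
    "https://example.com/news/a",
    "https://example.com/news/b",
]

if due_for_poll(last_poll_ts=0, now_ts=3600):
    for link in links_to_open(previous, current):
        print("open in the attached tab:", link)
```

Only the single new link is opened; the two already-known articles are never fetched again, which keeps the request pattern close to what a person checking the section page would do.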

5) “Does this work for other sites?”

  • Other News Corp sites: often similar behavior; relay helps because you’re browsing normally, but no guarantee.
  • Paywalled sites (NYT/WaPo) with a legit subscription: relay is a good fit because it uses your authenticated session.
  • Cloudflare-protected sites: relay can still hit challenges; you can usually solve them manually when they appear. I can’t help with “bypass” tactics.
  • Banks/utilities portals: strongly recommend not using relay/automation here. Attaching a tab effectively gives the agent “hands” in that logged-in session.

6) Security (what the agent can see)

From the docs’ security notes: when attached, the model can read and interact with the page content of that attached tab (click/type/navigate, etc.). It is not isolated like the dedicated managed browser.

Practical safety checklist:

  • Use a dedicated Chrome profile for relay (not your daily-driver profile)
  • Only attach the single tab you want controlled
  • Detach when done (click extension icon again)
  • Don’t attach anything sensitive (banking, password vaults, admin consoles)

7) Troubleshooting

“Relay not reachable” / extension badge shows !

Most common causes:

  • Gateway isn’t running on the same machine, and you don’t have a node host there
  • Or you’re sandboxed and not allowing host browser control

If your Gateway is remote: run a node host on the machine that runs Chrome (so the gateway can proxy browser actions to that node). (This is in the Chrome extension doc.)

Sandboxing gotcha

If your sessions are sandboxed, you may need:

{
  agents: {
    defaults: {
      sandbox: {
        browser: { allowHostControl: true }
      }
    }
  }
}

…and then ensure you’re targeting the host browser control (not the sandbox browser).

If you’re still “detected”

Non-evasion steps that actually help:

  • Stop using headless for that site
  • Reduce automation speed / repetition
  • Use a stable dedicated profile and keep it logged in
  • Disable privacy/fingerprint-randomizing extensions for those domains

If you tell me where your Gateway is running (same machine as Chrome vs remote VPS) and whether you’re using sandboxing/Docker, I can give the exact “you should run node host here / set this config knob” path for your setup.

gaunt ridge
#

@barren root

Gateway: Same machine as Chrome

  • Mode: local (127.0.0.1:18789)
  • No Docker/sandboxing
  • Chrome on same machine: /usr/local/bin/chrome

Should be straightforward - no "node host" config needed?

barren rootBOT
#

Yep — if Gateway + Chrome are on the same machine and you’re not sandboxed, you normally don’t need any node host or extra routing. The two things people miss are (a) the extension must be loaded/unpacked, and (b) you must explicitly attach a tab (badge ON).

Minimal “it should work” checklist (local gateway)

  1. Install + load the extension (once):
openclaw browser extension install
openclaw browser extension path

Then in Chrome: chrome://extensions → Developer mode → Load unpacked → select that folder → pin the extension.

  2. Attach a tab:
  • Open https://www.thechronicle.com.au/ in Chrome
  • Click the OpenClaw Browser Relay extension icon
  • Confirm badge shows ON
  3. Verify from CLI:
openclaw browser status --browser-profile chrome
openclaw browser tabs   --browser-profile chrome

If you see “relay running but no tab connected”, it simply means the tab isn’t attached (or you attached in a different Chrome profile/window than you think).

If the extension shows ! (relay not reachable)

On the same machine, that usually means the local relay server isn’t up or ports are blocked. Quick check:

openclaw gateway status
openclaw browser status --browser-profile chrome

Also open the extension Options page — it shows whether the relay is reachable.

Using it in-agent

Make sure the agent is using the relay profile:

  • browser tool calls should use profile="chrome" (that’s the extension relay profile).

Docs (exact flow + security notes): https://docs.openclaw.ai/tools/chrome-extension

If you paste the output of openclaw browser status --browser-profile chrome (and whether the badge shows ON, …, or !), I can tell you exactly which part is failing.

gaunt ridge
#

@barren root Two quick questions:

  1. Gateway token mismatch after update - what's the correct rotate command? Tried:

    • openclaw devices rotate (unknown)
    • openclaw gateway rotate-token (too many args)
    • Service restart?
  2. Can I use Bitwarden/1Password in the dedicated Chrome profile for auto-login to subscription sites? Safe?

Setup: Local gateway, no sandbox, Chrome 145.

barren rootBOT
#
  1. Gateway token rotation (not “devices rotate”)
  • openclaw devices rotate … is for paired device tokens (nodes/clients), not the Gateway’s shared auth token. It also requires --device + --role. Docs: https://docs.openclaw.ai/cli/devices

For the Gateway auth token, rotate by setting a new value for gateway.auth.token, then restart:

# show current
openclaw config get gateway.auth.token

# set a new one (pick one)
openclaw config set gateway.auth.token "$(openssl rand -hex 32)"
# or if openssl isn’t available:
openclaw config set gateway.auth.token "$(node -e "console.log(require('crypto').randomBytes(32).toString('hex'))")"

# apply
openclaw gateway restart
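
If neither openssl nor node is installed but Python 3 is, the same kind of random value can be generated with the standard library. This is only a sketch: the helper name new_gateway_token is made up for illustration, and the assumption is simply that the gateway accepts any sufficiently random hex string, as the openssl and node variants above imply.

```python
import secrets


def new_gateway_token(num_bytes=32):
    """Random hex string; 32 bytes yields 64 hex characters, like `openssl rand -hex 32`."""
    return secrets.token_hex(num_bytes)


print(new_gateway_token())
```

Pass the printed value to openclaw config set gateway.auth.token in place of the openssl or node substitution, then restart the gateway as shown above.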

If what you’re seeing is “token mismatch” in the Dashboard, it’s often just the UI caching the old token in localStorage. Fix: paste the current token into Control → Overview → Gateway Access (token field), then reconnect. (There are two auth inputs side-by-side: token + password.)

If it still doesn’t match, run openclaw gateway status and check whether Config (cli) and Config (service) point to different configs (service using an older file is a common gotcha).


  2. Bitwarden / 1Password in the dedicated Chrome Relay profile: safe?

Mostly “yes, but be intentional about it”.

What’s safe-ish:

  • Using a password manager in a dedicated “OpenClaw Relay” Chrome profile to maintain your subscription logins.
  • Logging in once, then letting the session cookies persist for days.

Main risks (real ones):

  • When a tab is attached, the agent can interact with and read that tab’s page content. If the password manager auto-fills, the agent could potentially read the filled values (or get them via page scripting).
  • Prompt injection on a random webpage could try to trick you into autofilling on a lookalike domain.

Mitigations I recommend:

  • Disable “autofill on page load” / require a manual click to fill.
  • Keep the vault locked unless you’re actively logging in.
  • Don’t attach the tab while you’re doing the login/fill step; attach after you’re logged in (or detach during login).
  • Keep this profile minimal (no extra extensions beyond what you need).

If you want, tell me whether you’re using Bitwarden or 1Password and I can point to the specific settings toggles to reduce autofill/phishing risk.