Features|June 17, 2026

How to Check If AI Crawlers Can Access Your Site

Your robots.txt file tells AI crawlers what they can read on your site. Here is how it works, the crawlers to know, and how to check your own in seconds.

01 / What it is

robots.txt and AI crawlers.

Before an AI crawler reads a single page of your site, it checks one file: robots.txt. It lives at the root of your domain, at yourdomain.com/robots.txt, and it is the first thing a well-behaved crawler requests. The file is a short set of instructions that says which crawlers may visit and which parts of the site they may read.

Most major AI companies run crawlers that respect it. OpenAI, Anthropic, Google, Perplexity, and others read your robots.txt and follow the rules they find. If the file blocks them, they stay out, and your pages never make it into the training data or the live answers those models give about your brand. If you have never set one up, that is fine: no robots.txt means everything is open by default.
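As a rough illustration of where the file lives and what a missing file means, here is a minimal Python sketch. The function names robots_url and fetch_robots are hypothetical, defaulting to HTTPS is an assumption, and treating only an HTTP 404 as "no file" is a simplification; real crawlers handle more status codes than this.

```python
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import urlsplit


def robots_url(site: str) -> str:
    """Build the robots.txt URL for a bare domain or any page URL on it."""
    if "://" not in site:
        site = "https://" + site  # assumption: default to HTTPS
    parts = urlsplit(site)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def fetch_robots(site: str, timeout: float = 10.0) -> Optional[str]:
    """Return the robots.txt body, or None when the site has no such file."""
    try:
        with urllib.request.urlopen(robots_url(site), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no file: nothing is disallowed
        raise


print(robots_url("yourdomain.com"))  # https://yourdomain.com/robots.txt
```

Calling fetch_robots("yourdomain.com") makes a live request, so it is left out of the printed example.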

02 / How it works

user-agent, disallow, allow.

A robots.txt file is a list of rules grouped by the crawler they apply to. Here is a complete example with every piece you need to know:

# Rules for every crawler not named below
User-agent: *
Disallow: /private/
Allow: /private/press/
Disallow: /*.pdf$

# Block OpenAI’s training crawler from the whole site
User-agent: GPTBot
Disallow: /

# Let Anthropic’s crawler read everything
User-agent: ClaudeBot
Allow: /

Sitemap: https://example.com/sitemap.xml

Read it top to bottom:

User-agent names the crawler a block of rules applies to. The asterisk is a catch-all for any crawler you have not named on its own, so User-agent: * means "everyone else." User-agent: GPTBot targets exactly one crawler.
Disallow blocks a path. Disallow: / shuts the crawler out of the whole site. Disallow: /private/ blocks a single folder. An empty Disallow blocks nothing, which is how you open a site back up.
Allow carves an exception back out of a blocked area. In the example, /private/ is closed, but /private/press/ is reopened for crawlers to read.
The asterisk and dollar sign are wildcards. The asterisk stands in for any run of characters, and the dollar sign marks the end of the URL. Disallow: /*.pdf$ blocks every PDF on the site, wherever it lives.
A hash starts a comment that crawlers ignore, useful for labeling each block. Sitemap points crawlers at your sitemap so they can find your pages faster.
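The matching rules in the list above can be sketched in a few lines of Python. This is an illustrative matcher, not any crawler's actual code. The names pattern_to_regex and is_allowed are hypothetical, and the tie-breaking it uses (the longest matching pattern wins, and Allow wins a tie) is an assumption based on the Robots Exclusion Protocol standard, RFC 9309, which individual crawlers may implement differently.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")  # trailing $ pins the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces; each * becomes "any run of characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """rules is a list of (directive, pattern) pairs for one crawler.

    Assumption: the longest matching pattern wins, Allow wins a tie,
    and a path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern:  # an empty Disallow blocks nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = directive == "allow"
            longer = len(pattern) > best_len
            tie_for_allow = len(pattern) == best_len and is_allow
            if longer or tie_for_allow:
                best_len, allowed = len(pattern), is_allow
    return allowed


# The catch-all group from the example file.
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/press/"),
    ("disallow", "/*.pdf$"),
]

for path in ["/blog/", "/private/notes.html",
             "/private/press/launch.html", "/docs/guide.pdf"]:
    print(path, is_allowed(RULES, path))
```

Under those assumptions, /blog/ and /private/press/launch.html come back allowed, while /private/notes.html and /docs/guide.pdf come back blocked.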

Two things trip people up. First, precedence: once a crawler has its own named block, it ignores the catch-all block entirely. The moment you write a User-agent: GPTBot group, GPTBot follows only those lines and nothing from the asterisk group. Second, robots.txt is an honor system, not a lock. The major AI companies honor it, which is why it works, but it tells crawlers what they may do rather than physically stopping them. Treat it as a sign on the door, not a deadbolt.
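The precedence rule is easier to see in code. The sketch below parses the example file into groups and picks the rules a given crawler would follow. parse_groups and rules_for are hypothetical names, and matching the token by exact, case-insensitive name is a simplification of how real crawlers compare user-agent strings.

```python
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /private/press/
Disallow: /*.pdf$

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_groups(text: str) -> dict:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups = {}
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines stack up
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
            reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        # Other fields, such as Sitemap, are not tied to a group.
    return groups


def rules_for(groups: dict, agent: str) -> list:
    """A named group replaces the catch-all; the two are never merged."""
    return groups.get(agent.lower(), groups.get("*", []))


groups = parse_groups(ROBOTS_TXT)
print(rules_for(groups, "GPTBot"))     # [('disallow', '/')]
print(rules_for(groups, "ClaudeBot"))  # [('allow', '/')]
print(rules_for(groups, "Bingbot"))    # falls back to the * group
```

Note the side effect for ClaudeBot in this file: because its named group contains only Allow: /, the /private/ and PDF rules from the catch-all group do not apply to it.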

03 / The crawlers

the AI crawlers to know.

There is not one AI crawler; there are dozens knocking on your site, and they do different jobs. Centium tracks 21 across 14 companies. The name in the first column is the exact token you put after User-agent: in robots.txt.

Crawler (User-agent)	Company	Purpose
GPTBot	OpenAI	AI Training
ChatGPT-User	OpenAI	Live Browsing
OAI-SearchBot	OpenAI	Search Indexing
ClaudeBot	Anthropic	AI Training
Claude-User	Anthropic	Live Browsing
Claude-SearchBot	Anthropic	Search Indexing
Googlebot	Google	Search Indexing
Google-Extended	Google	Gemini Training & Search
Gemini-Deep-Research	Google	Deep Research
Grok	xAI	AI Training
PerplexityBot	Perplexity	Search Indexing
Perplexity-User	Perplexity	Live Browsing
DeepSeekBot	DeepSeek	AI Training & Search
DuckAssistBot	DuckDuckGo	AI Search
BraveBot	Brave	Search Indexing
Meta-ExternalAgent	Meta	Llama Training
Amazonbot	Amazon	Alexa AI
Bytespider	ByteDance	TikTok AI
CCBot	Common Crawl	Open Training Data
Applebot-Extended	Apple	Apple AI / Siri
Bingbot	Microsoft	Copilot

They fall into three jobs, and blocking the wrong one quietly removes you from where AI looks:

Training crawlers like GPTBot, ClaudeBot, Google-Extended, CCBot, and Meta-ExternalAgent collect your content to teach the next generation of models. Block these and AI never learns your brand exists.
Live browsing crawlers like ChatGPT-User, Perplexity-User, and Claude-User visit your site in real time when someone asks an AI about you. Block these and the model cannot cite you in its answer.
Search indexing crawlers like Googlebot, OAI-SearchBot, and Bingbot power AI-integrated search results.

A few are worth knowing by name. Google-Extended is the token that controls Gemini training, separate from Googlebot, so you can be in Google Search while opting out of Gemini, or the reverse. CCBot is Common Crawl, the open dataset nearly every model trains on, so blocking it removes you from many models at once. You can check whether your site is in Common Crawl with our free training data checker. And one crawler sits outside the system: xAI's Grok publishes user-agent tokens, but it has been reported to crawl as ordinary browser traffic and ignore robots.txt, so there is no reliable rule that keeps it out.
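To make the Googlebot and Google-Extended split concrete, here is a small check using Python's standard urllib.robotparser module. The two-group file and the example.com URL are made up for illustration. One caveat: this module matches paths by simple prefix and, as far as we know, does not interpret the * and $ wildcards, so it is only a rough stand-in for how a real crawler reads a more complex file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: stay in Google Search, opt out of Gemini training.
RULES = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

search_ok = parser.can_fetch("Googlebot", "https://example.com/pricing")
gemini_ok = parser.can_fetch("Google-Extended", "https://example.com/pricing")
print(search_ok, gemini_ok)  # True False
```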

04 / In your dashboard

monitor AI crawler access.

For Centium subscribers, watching the door is not a once-a-year audit. Crawler access lives in your dashboard under Indexing, and every time you open it, Centium fetches a live read of your robots.txt and re-checks all 21 crawlers against it. You see which are allowed, which are blocked, and the exact rule doing the blocking, refreshed on the spot. Check it daily if you want, or just glance at it when you open the dashboard for everything else.

It matters because robots.txt changes quietly, and often for reasons that have nothing to do with AI. A developer ships a site update and a staging rule slips into production. A security plugin adds a blanket Disallow. Most often, an IT admin sees the extra server load from all these bots, assumes it is traffic worth shedding, and blocks the crawlers to protect the server, with no idea those are the same bots feeding your brand into AI. Any one of these can shut out a crawler that mattered overnight. Catching it the day it happens, instead of months later once your AI visibility has already slipped, is the difference between a one-line fix and a season of lost ground.
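A change like this can be caught by comparing two snapshots of the file. The sketch below is not how the Centium dashboard works; it is a hypothetical illustration using Python's standard urllib.robotparser, with made-up function names (homepage_access, access_changes) and made-up snapshots. It checks only the homepage path, and the module's simple prefix matching means files that rely on wildcards may be judged differently by real crawlers.

```python
from urllib.robotparser import RobotFileParser


def homepage_access(robots_txt: str, crawlers: list) -> dict:
    """Return {crawler: True/False} for access to the homepage path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, "/") for name in crawlers}


def access_changes(old_txt: str, new_txt: str, crawlers: list) -> dict:
    """Return {crawler: (before, after)} for crawlers whose access flipped."""
    before = homepage_access(old_txt, crawlers)
    after = homepage_access(new_txt, crawlers)
    return {n: (before[n], after[n]) for n in crawlers if before[n] != after[n]}


# Yesterday's file, and today's file after someone added a GPTBot block.
OLD = "User-agent: *\nDisallow: /admin/\n"
NEW = OLD + "\nUser-agent: GPTBot\nDisallow: /\n"

print(access_changes(OLD, NEW, ["GPTBot", "ClaudeBot", "PerplexityBot"]))
# {'GPTBot': (True, False)}
```

Run on a schedule against a stored copy of yesterday's file, a comparison like this would surface the newly blocked crawler the day the rule appears.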

Free tool

check your site against all 21 crawlers.

Paste in your domain and Centium’s free AI Access Tester shows you, crawler by crawler, who can read your site. If one is blocked that should not be, the fix is usually a single line. No signup required.


Crawler access is just the first layer

see where you stand.

Open crawler access gets AI to your door. Centium shows what happens next: whether ChatGPT, Claude, Gemini, Perplexity, and Grok actually recommend you, how you stack up against your competitors, and what you can do about it.


FAQ

questions, answered.

An AI crawler is an automated bot that an AI company sends to read websites. Some collect pages to train models like ChatGPT, Claude, and Gemini. Others visit in real time when someone asks an AI a question, so the model can ground its answer in current information. Each crawler identifies itself with a user-agent name, like GPTBot or PerplexityBot, which is the name you use to allow or block it in robots.txt.

Open yourdomain.com/robots.txt in a browser and read the rules, or paste your domain into Centium’s free AI Access Tester to see all 21 AI crawlers checked at once. Centium subscribers also get a live crawler check inside the dashboard that refreshes every time they open it.

Add a named block to your robots.txt: a line that reads User-agent: GPTBot, followed by a line that reads Disallow: /. That shuts GPTBot out of the entire site. To block a single section instead, point the Disallow at that path, like Disallow: /members/.

It works on an honor system. The major AI companies, including OpenAI, Anthropic, Google, and Perplexity, read your robots.txt and follow it, so for them it is reliable. It is not a security control. It tells well-behaved crawlers what they may do rather than physically blocking anyone, so it will not stop a bad actor that ignores the rules.

For most brands, all of them. Training crawlers like GPTBot and ClaudeBot teach the next generation of models about you. Live browsing crawlers like ChatGPT-User and Perplexity-User let models cite you in real-time answers. Search indexing crawlers like Googlebot power AI search results. Blocking any of them quietly removes you from where buyers are now discovering brands.

GPTBot is OpenAI’s training crawler: it collects pages to teach future models, and what it reads stays in the model. ChatGPT-User is the live browsing agent: it visits your site the moment a user asks ChatGPT something, and what it reads is used only for that one answer. You can allow or block each one separately in robots.txt.

