Your robots.txt file tells AI crawlers what they can read on your site. Here is how it works, the crawlers to know, and how to check your own in seconds.
Before an AI crawler reads a single page of your site, it checks one file: robots.txt. It lives at the root of your domain, at yourdomain.com/robots.txt, and it is the first thing a well-behaved crawler requests. The file is a short set of instructions that says which crawlers may visit and which parts of the site they may read.
Nearly every major AI company runs crawlers that respect it. OpenAI, Anthropic, Google, Perplexity, and others read your robots.txt and follow the rules they find. If the file blocks them, they stay out, and your pages never make it into the training data or the live answers those models give about your brand. If you have never set one up, that is fine: no robots.txt means everything is open by default.
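A robots.txt file is a list of rules grouped by the crawler they apply to. Each group opens with a User-agent line that names a crawler, or an asterisk for every crawler, and is followed by Disallow and Allow lines listing the paths that crawler may not or may read.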
Two things trip people up. First, precedence: once a crawler has its own named block, it ignores the catch-all block entirely. The moment you write a User-agent: GPTBot group, GPTBot follows only those lines and nothing from the asterisk group. Second, robots.txt is an honor system, not a lock. The major AI companies honor it, which is why it works, but it tells crawlers what they may do rather than physically stopping them. Treat it as a sign on the door, not a deadbolt.
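The precedence rule can be checked with Python's standard-library `urllib.robotparser`. This is a minimal sketch, not a definitive test: the robots.txt content and its paths (`/admin/`, `/drafts/`) are made up for illustration, and `can_read` is a hypothetical helper. Python's parser is only one implementation of the protocol, so a given crawler may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the paths are invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /drafts/
"""


def can_read(robots_txt: str, crawler: str, path: str) -> bool:
    """Return True if `crawler` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(crawler, path)


if __name__ == "__main__":
    # ClaudeBot has no named group, so it falls back to the catch-all group.
    print(can_read(ROBOTS_TXT, "ClaudeBot", "/admin/"))  # False
    # GPTBot has its own group, so the catch-all rule does not apply to it.
    print(can_read(ROBOTS_TXT, "GPTBot", "/admin/"))     # True
    print(can_read(ROBOTS_TXT, "GPTBot", "/drafts/"))    # False
```

Note the side effect in the second result: giving GPTBot its own group opened `/admin/` to it, because the catch-all rule no longer applies. Any rule from the asterisk group that should still bind a named crawler has to be repeated inside that crawler's group.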
There is not one AI crawler. Dozens of them knock on your site, and they do different jobs. Centium tracks 21 across 11 companies. The name in the first column is the exact token you put after User-agent: in robots.txt.
| Crawler (User-agent) | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | AI Training |
| ChatGPT-User | OpenAI | Live Browsing |
| | OpenAI | Search Indexing |
| ClaudeBot | Anthropic | AI Training |
| | Anthropic | Live Browsing |
| | Anthropic | Search Indexing |
| Googlebot | Google | Search Indexing |
| Google-Extended | Google | Gemini Training & Search |
| | | Deep Research |
| | xAI | AI Training |
| | Perplexity | Search Indexing |
| | Perplexity | Live Browsing |
| | DeepSeek | AI Training & Search |
| | DuckDuckGo | AI Search |
| | Brave | Search Indexing |
| | Meta | Llama Training |
| | Amazon | Alexa AI |
| | ByteDance | TikTok AI |
| CCBot | Common Crawl | Open Training Data |
| | Apple | Apple AI / Siri |
| | Microsoft | Copilot |
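They fall into three jobs. Training crawlers collect pages to teach future models. Live browsing crawlers visit in real time when someone asks an AI a question. Search indexing crawlers build the indexes behind AI search results. Blocking the wrong one quietly removes you from where AI looks.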
A few are worth knowing by name. Google-Extended is the token that controls Gemini training, separate from Googlebot, so you can be in Google Search while opting out of Gemini, or the reverse. CCBot is Common Crawl, the open dataset nearly every model trains on, so blocking it removes you from many models at once. You can check whether your site is in Common Crawl with our free training data checker. And one crawler sits outside the system: xAI's Grok publishes user-agent tokens, but it has been reported to crawl as ordinary browser traffic and ignore robots.txt, so there is no reliable rule that keeps it out.
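Because each token gets its own group, opting out of one does not touch the others. The sketch below assumes a hypothetical policy that blocks Google-Extended and CCBot while saying nothing about Googlebot, then checks the result with Python's `urllib.robotparser`. `root_access` is an illustrative helper name, and the check only reflects how Python's parser reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: stay in Google Search, opt out of Gemini training
# and of the Common Crawl dataset.
TRAINING_OPT_OUT = """\
User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""


def root_access(robots_txt: str, tokens: list[str]) -> dict[str, bool]:
    """Map each user-agent token to whether it may read the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, "/") for token in tokens}


if __name__ == "__main__":
    result = root_access(TRAINING_OPT_OUT, ["Googlebot", "Google-Extended", "CCBot"])
    print(result)
    # {'Googlebot': True, 'Google-Extended': False, 'CCBot': False}
```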
For Centium subscribers, watching the door is not a once-a-year audit. Crawler access lives in your dashboard under Indexing, and every time you open it, Centium fetches a live read of your robots.txt and re-checks all 21 crawlers against it. You see which are allowed, which are blocked, and the exact rule doing the blocking, refreshed on the spot. Check it daily if you want, or just glance at it when you open the dashboard for everything else.
It matters because robots.txt changes quietly, and often for reasons that have nothing to do with AI. A developer ships a site update and a staging rule slips into production. A security plugin adds a blanket Disallow. Most often, an IT admin sees the extra server load from all these bots, assumes it is traffic worth shedding, and blocks the crawlers to protect the server, with no idea those are the same bots feeding your brand into AI. Any one of these can shut out a crawler that mattered overnight. Catching it the day it happens, instead of months later once your AI visibility has already slipped, is the difference between a one-line fix and a season of lost ground.
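One way to catch a quiet change like this is to compare crawler access between two saved copies of the file. The following is a rough sketch under stated assumptions, not Centium's implementation: the crawler list is just a subset of tokens named in this article, the two robots.txt versions are invented, and `access_map` and `newly_blocked` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# A small subset of tokens named in this article, for illustration.
CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot"]


def access_map(robots_txt: str, crawlers=CRAWLERS, path: str = "/") -> dict[str, bool]:
    """Map each crawler to whether it may read `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, path) for name in crawlers}


def newly_blocked(before: str, after: str, crawlers=CRAWLERS) -> list[str]:
    """List crawlers that were allowed in `before` but are blocked in `after`."""
    old, new = access_map(before, crawlers), access_map(after, crawlers)
    return [name for name in crawlers if old[name] and not new[name]]


# Invented example versions of the same site's robots.txt.
LAST_WEEK = "User-agent: *\nDisallow: /admin/\n"
STAGING_LEAK = "User-agent: *\nDisallow: /\n"            # blanket block shipped by mistake
GPTBOT_BLOCK = LAST_WEEK + "\nUser-agent: GPTBot\nDisallow: /\n"

if __name__ == "__main__":
    print(newly_blocked(LAST_WEEK, STAGING_LEAK))  # ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'CCBot']
    print(newly_blocked(LAST_WEEK, GPTBOT_BLOCK))  # ['GPTBot']
```

Run on a schedule against the previous day's copy, a non-empty result is the signal to go look at what changed.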
Paste in your domain and Centium’s free AI Access Tester shows you, crawler by crawler, who can read your site. If one is blocked that should not be, the fix is usually a single line. No signup required.
Open crawler access gets AI to your door. Centium shows what happens next: whether ChatGPT, Claude, Gemini, Perplexity, and Grok actually recommend you, how you stack up against your competitors, and what you can do about it.
An AI crawler is an automated bot that an AI company sends to read websites. Some collect pages to train models like ChatGPT, Claude, and Gemini. Others visit in real time when someone asks an AI a question, so the model can ground its answer in current information. Each crawler identifies itself with a user-agent name, like GPTBot or PerplexityBot, which is the name you use to allow or block it in robots.txt.
To check your own file, open yourdomain.com/robots.txt in a browser and read the rules, or paste your domain into Centium’s free AI Access Tester to see all 21 AI crawlers checked at once. Centium subscribers also get a live crawler check inside the dashboard that refreshes every time they open it.
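For a scripted version of the browser check, Python's standard library can fetch and evaluate the file directly. This is a sketch only: `robots_url`, `report`, and `check_live` are hypothetical helper names, the `https://` scheme is an assumption about your site, and the result reflects how Python's parser reads the rules rather than how any particular crawler does.

```python
from urllib.robotparser import RobotFileParser


def robots_url(domain: str) -> str:
    """Build the robots.txt location at the root of a domain (assumes HTTPS)."""
    return f"https://{domain}/robots.txt"


def report(parser: RobotFileParser, crawlers: list[str], path: str = "/") -> dict[str, str]:
    """Label each crawler as allowed or blocked for `path`."""
    return {
        name: "allowed" if parser.can_fetch(name, path) else "blocked"
        for name in crawlers
    }


def check_live(domain: str, crawlers: list[str]) -> dict[str, str]:
    """Fetch the live robots.txt over the network and report on each crawler."""
    parser = RobotFileParser(robots_url(domain))
    # read() treats a 401 or 403 response as "everything blocked" and other
    # 4xx responses, such as a 404 for a missing file, as "everything allowed".
    # Connection failures raise an exception instead of returning a result.
    parser.read()
    return report(parser, crawlers)
```

Calling `check_live("yourdomain.com", ["GPTBot", "ClaudeBot", "CCBot"])` with your real domain returns a dictionary of crawler names mapped to `"allowed"` or `"blocked"`.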
To block a specific crawler such as GPTBot, add a named block to your robots.txt: a line that reads User-agent: GPTBot, followed by a line that reads Disallow: /. That shuts GPTBot out of the entire site. To block a single section instead, point the Disallow at that path, like Disallow: /members/.
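The two variants differ only in the path after Disallow. As a small sketch, the hypothetical helper `block_rule` below renders the two-line group, and `is_allowed` checks its effect with Python's `urllib.robotparser`. The `/pricing` path is invented for the example.

```python
from urllib.robotparser import RobotFileParser


def block_rule(token: str, path: str = "/") -> str:
    """Render a named group that keeps `token` out of `path`."""
    return f"User-agent: {token}\nDisallow: {path}\n"


def is_allowed(robots_txt: str, token: str, path: str) -> bool:
    """Return True if `token` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, path)


if __name__ == "__main__":
    whole_site = block_rule("GPTBot")
    one_section = block_rule("GPTBot", "/members/")

    print(whole_site)
    # User-agent: GPTBot
    # Disallow: /

    print(is_allowed(whole_site, "GPTBot", "/pricing"))           # False
    print(is_allowed(one_section, "GPTBot", "/members/profile"))  # False
    print(is_allowed(one_section, "GPTBot", "/pricing"))          # True
```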
robots.txt works on an honor system. The major AI companies, including OpenAI, Anthropic, Google, and Perplexity, read your robots.txt and follow it, so for them it is reliable. It is not a security control. It tells well-behaved crawlers what they may do rather than physically blocking anyone, so it will not stop a bad actor that ignores the rules.
For most brands, the crawlers worth allowing are all of them. Training crawlers like GPTBot and ClaudeBot teach the next generation of models about you. Live browsing crawlers like ChatGPT-User and Perplexity-User let models cite you in real-time answers. Search indexing crawlers like Googlebot power AI search results. Blocking any of them quietly removes you from where buyers are now discovering brands.
GPTBot is OpenAI’s training crawler: it collects pages to teach future models, and what it reads stays in the model. ChatGPT-User is the live browsing agent: it visits your site the moment a user asks ChatGPT something, and what it reads is used only for that one answer. You can allow or block each one separately in robots.txt.