Web Crawler Directory & Bot Database
A database of 151 active web crawlers, AI bots, SEO tools, social bots, and scrapers used by the CrawlerCheck core engine. Get detailed profiles, copy robots.txt rules, and check whether your URL is allowed or blocked by each one.
Currently viewing 12 of 151 crawlers and bots
Filter by safety to see specific recommendations for each category.
AI2Bot is the crawler for the Allen Institute for AI (AI2). It collects data for open-source AI research and datasets.
A specific AI2 crawler used to build the Dolma dataset, a massive open corpus for training language models.
The web crawler for Big Sur AI, an e-commerce AI platform used to gather product data and market insights.
An agent associated with ChatGPT's operational tasks or specific plugin interactions.
ChatGPT-User is the user-agent used when a human user of ChatGPT explicitly asks the AI to browse a specific webpage.
Version 2.0 of the ChatGPT-User agent, representing updated browsing capabilities of the ChatGPT model.
Claude-SearchBot is a crawler used to index content specifically for search-related features within the Claude ecosystem.
Similar to Claude-Web, this agent represents direct user-initiated browsing requests from the Claude interface.
Claude-Web is used when a Claude AI user asks the model to visit a specific URL to answer a question.
The main crawler for Anthropic. It scrapes the web to build the training dataset for the Claude family of AI models.
A crawler used by DigitalOcean, likely for internal GenAI product testing or data gathering.
DuckAssistBot is used by DuckDuckGo for its AI-assisted search features, generating instant answers from specific sources.
Check URL for 151 crawlers and bots
Verify if the crawlers currently in view are allowed or blocked on a specific URL.
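The same allowed-or-blocked check can be reproduced locally with Python's standard-library robots.txt parser. This is a minimal sketch; the robots.txt content and the example URL are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot site-wide,
# leaves every other crawler unrestricted.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a given user agent may fetch a given URL.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

Note that `can_fetch` matches user-agent tokens case-insensitively, so it mirrors how well-behaved crawlers interpret your rules.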
⚠️ Caution: Advanced Configuration
Modifying your robots.txt file effectively controls who can access your website. Incorrect rules can accidentally de-index your entire site from search engines like Google. This tool generates valid syntax rules based on your selection. It does not analyze your specific website needs.
We strongly suggest testing any changes in Google Search Console or with CrawlerCheck before deploying to production.
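To illustrate how small the difference between a targeted rule and a site-wide de-indexing can be, compare these two hypothetical snippets. The first blocks a single bot; the second, with a wildcard user-agent, blocks every crawler, including Googlebot:

```
# Blocks only AhrefsBot:
User-agent: AhrefsBot
Disallow: /

# Blocks EVERY crawler, including Googlebot -- easy to add by accident:
User-agent: *
Disallow: /
```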
Generated robots.txt snippet for the 151 currently viewed bots.
Select one of the options below to Disallow or Allow the bots.
Generating rules to BLOCK all bots currently in the list.
Review the list above. We recommend blocking bots marked as 'Unsafe' and carefully evaluating the bots marked as 'Caution'.
This is a live generated robots.txt based on the filters you selected above.
User-agent: ai2bot
User-agent: ai2bot-dolma
User-agent: big-sur-ai
User-agent: chatgpt-operator
User-agent: chatgpt-user
User-agent: chatgpt-user-2-0
User-agent: claude-searchbot
User-agent: claude-user
User-agent: claude-web
User-agent: claudebot
User-agent: digitaloceangenai-crawler
User-agent: duckassistbot
User-agent: gptbot
User-agent: google-extended
User-agent: grok
User-agent: liner-bot
User-agent: mistralai-user
User-agent: mistralai-user-1-0
User-agent: mycentralaiscraperbot
User-agent: oai-searchbot
User-agent: perplexity-user
User-agent: perplexity-user-1-0
User-agent: perplexitybot
User-agent: sbintuitionsbot
User-agent: youbot
User-agent: anthropic-ai
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: img2dataset
User-agent: quillbot-com
User-agent: aliyunsecbot
User-agent: amazonbot
User-agent: google-cloudvertexbot
User-agent: google-inspectiontool
User-agent: googleother
User-agent: googleother-image
User-agent: googleother-video
User-agent: googlebot-discovery
User-agent: googlebot-image
User-agent: googlebot-news
User-agent: googlebot-video
User-agent: storebot-google
User-agent: turnitinbot
User-agent: viennatinybot
User-agent: archive-org-bot
User-agent: ia-archiver
User-agent: ia-archiver-web-archive-org
User-agent: peer39-crawler-1-0
User-agent: ahrefsbot
User-agent: audigentadbot
User-agent: awariorssbot
User-agent: awariosmartbot
User-agent: brandwatch
User-agent: chrome-lighthouse
User-agent: dataforseobot
User-agent: dotbot
User-agent: google-page-speed-insights
User-agent: jetslide
User-agent: mj12bot
User-agent: meltwater
User-agent: neticlebot
User-agent: netvibes
User-agent: screaming-frog-seo-spider
User-agent: searchmetricsbot
User-agent: semrushbot
User-agent: semrushbot-ocob
User-agent: semrushbotswa
User-agent: sidetrade-indexer-bot
User-agent: peer39-crawler
User-agent: amazon-kendra
User-agent: arquivo-web-crawler
User-agent: blexbot
User-agent: barkrowler
User-agent: bravest
User-agent: bytespider
User-agent: ccbot
User-agent: cotoyogi
User-agent: crawlspace
User-agent: diffbot
User-agent: echobot-bot
User-agent: echoboxbot
User-agent: factset-spyderbot
User-agent: friendlycrawler
User-agent: icc-crawler
User-agent: isscyberriskcrawler
User-agent: imagesiftbot
User-agent: jenkersbot
User-agent: kangaroo-bot
User-agent: livelapbot
User-agent: mauibot
User-agent: moodlebot
User-agent: newsnow
User-agent: novaact
User-agent: pangubot
User-agent: poseidon-research-crawler
User-agent: qualifiedbot
User-agent: scrapy
User-agent: seekportbot
User-agent: seekr
User-agent: seekrbot
User-agent: taragroup-intelligent-bot
User-agent: timpibot
User-agent: turnitin
User-agent: velenpublicwebcrawler
User-agent: webzio-extended
User-agent: coccocbot-web
User-agent: crawler4j
User-agent: hada-news
User-agent: iaskspider
User-agent: iaskspider-2-0
User-agent: imediaethics-org
User-agent: imgproxy
User-agent: magpie-crawler
User-agent: netestate-imprint-crawler
User-agent: news-please
User-agent: omgili
User-agent: omgilibot
User-agent: yacy
User-agent: yacybot
User-agent: applebot
User-agent: applebot-extended
User-agent: aspiegelbot
User-agent: baiduspider
User-agent: bingbot
User-agent: duckduckbot
User-agent: googlebot
User-agent: mojeek
User-agent: mojeekbot
User-agent: petalbot
User-agent: seznamhomepagecrawler
User-agent: slurp
User-agent: teoma
User-agent: yahoo-blogs
User-agent: yahoo-feedseeker
User-agent: yahoo-mmcrawler
User-agent: yahooseeker
User-agent: yandex
User-agent: yandexadditional
User-agent: yandexadditionalbot
User-agent: yandexbot
User-agent: baidu
User-agent: facebookbot
User-agent: meta-externalagent
User-agent: meta-externalfetcher
User-agent: pinterestbot
User-agent: quora-bot
User-agent: slackbot
User-agent: twitterbot
User-agent: facebookexternalhit
User-agent: meta-externalagent-lowercase
User-agent: meta-externalfetcher-lowercase
Disallow: /
Copy and paste these rules into your website's robots.txt file to block the identified bots.
Resource & Impact Analysis
Managing bot traffic is about more than just security. It's about optimizing your infrastructure and protecting your digital assets. Unchecked crawler activity can have significant downstream effects on your website's performance and business metrics.
📉 Server Load & Bandwidth
Every request from a bot consumes CPU cycles, RAM, and bandwidth. Aggressive scrapers can simulate a DDoS attack, slowing down your site for real human users and increasing your hosting costs, especially on metered cloud platforms.
💰 Crawl Budget Waste
Search engines like Google assign a "Crawl Budget" to your site: a limit on how many pages they will crawl in a given timeframe. If low-value bots clog your server queues, Googlebot may reduce its crawl rate, delaying the indexing of your new content.
🤖 AI & Data Privacy
Modern AI bots (like GPTBot and CCBot) scrape your content to train Large Language Models. While not malicious, they use your intellectual property without providing traffic back. Blocking them lets you opt out of having your data used for AI training.
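As an example, a minimal opt-out targeting the AI training crawlers mentioned above might look like the snippet below. The tokens GPTBot (OpenAI), CCBot (Common Crawl), ClaudeBot (Anthropic), and Google-Extended (Google's AI-training control) are the user-agent names published by the respective vendors; search crawlers such as Googlebot are unaffected by these rules.

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```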
🕵️ Competitive Intelligence
Many "SEO Tools" and commercial scrapers are used by competitors to monitor your pricing, copy your content strategy, or analyze your site structure. Restricting these bots protects your business intelligence.
Understanding Web Crawlers & Bots
Web crawlers (also known as spiders or bots) are automated software programs that browse the internet. CrawlerCheck classifies them into distinct categories to help you decide which ones to allow and which to block.
Search Engine Bots
Bots like Googlebot and Bingbot are essential for your website's visibility. They index your content so it appears in search results. Blocking these will remove your site from search engines.
AI Data Scrapers
Bots like GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot (Perplexity AI) crawl the web to collect data for training Large Language Models (LLMs). Blocking them prevents your content from being used to train AI, but does not affect your search rankings.
SEO Tools & Scrapers
Marketing tools like Ahrefs and Semrush scan your site to analyze backlinks and SEO health. While useful for SEO audits, aggressive scrapers can consume server bandwidth and impact performance.
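If you want to throttle an SEO crawler rather than block it outright, the non-standard Crawl-delay directive is one option. Support varies by bot: AhrefsBot documents that it honors it, while Googlebot ignores the directive entirely, so treat this as a best-effort request.

```
# Ask AhrefsBot to wait 10 seconds between requests.
# Crawl-delay is non-standard; not all bots respect it.
User-agent: AhrefsBot
Crawl-delay: 10
```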
Support the Project
You are invited to leave a review and suggest improvements to CrawlerCheck and the Directory. With your help, we can make them even better.