Documentation & Methodology

Everything you need to know about how CrawlerCheck works, our classification system, and best practices for managing bot traffic.

1. User-Agent Parsing

First, we analyze the User-Agent header of the incoming request. This gives us a preliminary identification of who the bot claims to be (e.g., "Googlebot" or "GPTBot").

2. Safety Score Calculation

We analyze known behavior patterns, official documentation, and community reports to assign a "Safety Score" to each crawler. This helps identify potential risks without requiring technical DNS verification on your end.

3. Behavioral Analysis

Finally, we check the claimed identity against our database of established behavior patterns, official documentation URLs, and known IP ranges. This behavioral evidence feeds into the Safety Score, so a bot that claims one identity but behaves like another can be flagged.

Safe: Verified & Beneficial

These are bots from major search engines (Google, Bing, DuckDuckGo) and trusted partners. Blocking them will directly hurt your website's visibility in search results.

  • ✅ Essential for SEO
  • ✅ Strict abuse policies

Caution: Use Discretion

This category includes AI data scrapers, SEO analysis tools, and commercial crawlers. They are run by legitimate companies, but the value exchange they offer may not benefit every website owner.

  • ⚖️ AI Training Bots
  • ⚖️ SEO Tools
  • ⚖️ Social Media Preview Bots

Aggressive: Low Value / Resource Intensive

These bots are often aggressive scrapers, lead generators, or tools that ignore robots.txt rules. They consume server resources while providing little to no tangible value to your visibility.

  • ❌ Aggressive scraping
  • ❌ Ignoring crawl delays
  • ❌ Often used for cloning content
  • ❌ Known to bypass standard blocks

Freshness & Accuracy

We meticulously track when each bot was added to our directory and when its profile was last updated. Look for the "Added on" and "Last updated on" dates on every crawler profile to gauge the recency of the information.

Robots.txt Best Practices

1. Specificity Matters

Your robots.txt file is the first line of defense. Under the Robots Exclusion Protocol, a crawler follows the most specific User-agent group that matches it; once a group names a bot directly, that bot ignores the global User-agent: * rules.

User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

In this example, all other bots are asked to stay out of /private/, while GPTBot, which matches its own more specific group, is blocked from the entire site.

2. Crawl-delay is not standard

The Crawl-delay directive is supported by some bots (like Bingbot and Yandex) but ignored by Google. Do not rely on it as a universal rate-limiting tool.
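For bots that do honor it, the directive sits inside that bot's User-agent group. The 10-second value below is just an illustration; supported values and their interpretation vary by crawler:

```text
User-agent: Bingbot
Crawl-delay: 10
```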

Advanced Blocking Methods

1. Meta Robots Tag (MetaNoIndex)

While robots.txt tells bots where not to go, it doesn't guarantee a page won't be indexed if bots discover it via a link. To explicitly prevent indexing, use the noindex meta tag in the <head> section of your HTML.

<meta name="robots" content="noindex">

You can also target specific bots. For example, to allow Google but block GPTBot from indexing:

<meta name="googlebot" content="index">
<meta name="GPTBot" content="noindex">

2. X-Robots-Tag Header (HeaderNoIndex)

For non-HTML content like PDFs, images, or API endpoints, you can't use a meta tag. Instead, use the X-Robots-Tag HTTP header in your server configuration (Nginx, Apache, etc.).

X-Robots-Tag: noindex

This is a powerful directive that is respected by all major search crawlers and keeps the targeted files out of their search results.
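As a sketch, an Nginx location block that attaches the header to every PDF might look like this (the file pattern is an example; Apache users would use mod_headers inside a <FilesMatch> block instead):

```nginx
# Send X-Robots-Tag: noindex with every PDF response,
# including error responses ("always")
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex" always;
}
```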

How CrawlerCheck Reports This

When you use these methods, CrawlerCheck will report the bot status as Allowed (MetaNoIndex) or Allowed (HeaderNoIndex) rather than "Disallowed".

Why? To respect a noindex tag, a bot must first be Allowed to crawl the page and see the instruction. If you block it in robots.txt instead, it will never see the tag, and the URL might still appear in search results (as a bare link without content).

Frequently Asked Questions

Will blocking AI bots hurt my SEO?

No. Blocking AI training bots like GPTBot or CCBot does not affect your ranking on search engines like Google or Bing. These bots are used for training Large Language Models, not for indexing content for search.

How often should I update my robots.txt?

You don't need to update it daily. However, as new AI bots and scrapers emerge frequently, we recommend reviewing your access logs and our Directory once every few months to identify new actors you may want to block.
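A quick way to review those logs is to tally User-Agent strings. The snippet below builds a small sample file to demonstrate; in practice you would point the final command at your real access log (e.g. a combined-format Nginx or Apache log):

```shell
# Sample combined-format log lines (stand-ins for your real access log)
cat > /tmp/sample_access.log <<'EOF'
203.0.113.1 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.2"
203.0.113.2 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"
203.0.113.1 - - [01/Jan/2025:00:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.2"
EOF

# In a quote-delimited combined log line, field 6 is the User-Agent;
# count requests per agent, busiest first
awk -F'"' '{print $6}' /tmp/sample_access.log | sort | uniq -c | sort -rn
```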

Is robots.txt legally binding?

No. It is a voluntary protocol ("The Robots Exclusion Protocol"). Reputable bots follow it strictly, but malicious scrapers and bad actors will ignore it. For hard blocking, you should use server-level firewall rules or IP blocking.
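As an illustration, a hard server-level block in Nginx might look like this (the bot name and IP address are placeholders; pull real ones from your own access logs):

```nginx
# Refuse any request whose User-Agent contains "BadBot" (placeholder name)
if ($http_user_agent ~* "BadBot") {
    return 403;
}

# Drop a known abusive address (documentation-range placeholder IP)
deny 203.0.113.42;
```

Unlike robots.txt, these rules are enforced by the server itself, so they work even against bots that ignore the Robots Exclusion Protocol.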