Documentation & Methodology

Everything you need to know about how CrawlerCheck works, our classification system, and best practices for managing bot traffic.

1. User-Agent Parsing

First, we analyze the User-Agent header of the incoming request. This gives us a preliminary identification of who the bot claims to be (e.g., "Googlebot" or "GPTBot").
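For example, a request from Google's crawler typically carries a User-Agent header like the one below (the exact string varies by crawler type and version):

User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)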

2. IP Verification (DNS)

We perform reverse DNS (rDNS) lookups on the visitor's IP address. For known major bots (Google, Bing, OpenAI), we verify that the IP actually belongs to their official network ranges.
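As an illustration, the widely documented forward-confirming DNS check for Googlebot looks roughly like this in Python (a minimal sketch of the general technique, not CrawlerCheck's production code):

import socket

def is_verified_googlebot(ip: str) -> bool:
    # Step 1: reverse lookup, i.e. which hostname does this IP claim to be?
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    # Step 2: the hostname must belong to one of Google's official crawler domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    # Step 3: forward lookup, the hostname must resolve back to the same IP.
    try:
        return socket.gethostbyname(hostname) == ip
    except socket.gaierror:
        return False

A production check would also handle IPv6, hosts that resolve to multiple addresses, and the published IP range lists that Google, Bing, and OpenAI make available.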

3. Behavioral Analysis

Our database tracks established behavior patterns, official documentation URLs, and known IP ranges to provide a "Safety Score" for each crawler.

Safe: Verified & Beneficial

These are bots from major search engines (Google, Bing, DuckDuckGo) and trusted partners. Blocking them will directly hurt your website's visibility in search results.

  • ✅ Essential for SEO
  • ✅ Strict abuse policies
  • ✅ Verifiable by IP

Caution: Use Discretion

This category includes AI data scrapers, SEO analysis tools, and commercial crawlers. They are legitimate companies, but they offer a value exchange that may not benefit every website owner.

  • ⚖️ AI Training Bots
  • ⚖️ SEO Tools
  • ⚖️ Social Media Preview Bots

Unsafe: Aggressive / Low Value

These bots are often aggressive scrapers, lead generators, or tools that ignore robots.txt rules. They consume server resources while providing little to no tangible value to your visibility.

  • ❌ Aggressive scraping
  • ❌ Ignoring crawl delays
  • ❌ Often used for cloning content

Robots.txt Best Practices

1. Specificity Matters

Your robots.txt file is the first line of defense. A crawler follows the most specific User-agent group that matches it, so a group named for a specific bot overrides the global User-agent: * group for that bot.

User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

In this example, compliant bots are asked to stay out of /private/, while GPTBot, which follows only its own more specific group, is blocked from the entire site.

2. Crawl-delay is not standard

The Crawl-delay directive is supported by some bots (like Bingbot and Yandex) but ignored by Google. Do not rely on it as a universal rate-limiting tool.
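For crawlers that do honor it, the value is generally treated as a minimum number of seconds between requests; the 10 below is purely illustrative:

User-agent: Bingbot
Crawl-delay: 10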

Advanced Blocking Methods

1. Meta Robots Tag (MetaNoIndex)

While robots.txt tells bots where not to go, it doesn't guarantee a page won't be indexed if they find it via a link. To explicitly prevent indexing, use the noindex meta tag in your page's <head> section.

<meta name="robots" content="noindex">

You can also target specific bots. For example, to allow Google but block GPTBot from indexing:

<meta name="googlebot" content="index">
<meta name="GPTBot" content="noindex">

2. X-Robots-Tag Header (HeaderNoIndex)

For non-HTML content like PDFs, images, or API endpoints, you can't use a meta tag. Instead, use the X-Robots-Tag HTTP header in your server configuration (Nginx, Apache, etc.).

X-Robots-Tag: noindex

This is a powerful directive that is respected by all major crawlers and keeps the affected files out of their search results.
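For example, in Nginx you might attach the header to every PDF with a location block like this (adapt the pattern to your own setup):

location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}

The Apache equivalent, with mod_headers enabled, would be:

<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>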

How CrawlerCheck Reports This

When you use these methods, CrawlerCheck will report the bot status as Allowed (MetaNoIndex) or Allowed (HeaderNoIndex) rather than "Disallowed".

Why? To respect a noindex tag, a bot must first be Allowed to crawl the page to see the instruction. If you block it in robots.txt, it will never see the tag and might still index the URL (as a link without content).
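In practice that means you should not combine a robots.txt block with a noindex tag on the same URL. The /reports/ path below is only an illustration of the anti-pattern:

# robots.txt: the crawler is told never to fetch anything under /reports/
User-agent: *
Disallow: /reports/

<!-- /reports/summary.html: this tag is never seen, because the page is never crawled -->
<meta name="robots" content="noindex">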

Frequently Asked Questions

Will blocking AI bots hurt my SEO?

No. Blocking AI training bots like GPTBot or CCBot does not affect your ranking on search engines like Google or Bing. These bots are used for training Large Language Models, not for indexing content for search.
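A common pattern is therefore to block the AI training crawlers by name while leaving search crawlers untouched, for example:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

Because there is no blocking User-agent: * group here, Googlebot, Bingbot, and every other crawler continue to crawl normally.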

How often should I update my robots.txt?

You don't need to update it daily. However, as new AI bots and scrapers emerge frequently, we recommend reviewing your access logs and our Directory once every few months to identify new actors you may want to block.

Is robots.txt legally binding?

No. It is a voluntary protocol ("The Robots Exclusion Protocol"). Reputable bots follow it strictly, but malicious scrapers and bad actors will ignore it. For hard blocking, you should use server-level firewall rules or IP blocking.
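As a sketch of what server-level blocking can look like in an Nginx server block (the bot names and IP range below are placeholders; substitute the actual offenders from your access logs):

# Refuse requests whose User-Agent matches known bad actors
if ($http_user_agent ~* "(BadBot|ContentScraper)") {
    return 403;
}

# Or block an abusive network range outright
deny 203.0.113.0/24;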