Everything you need to know about how CrawlerCheck works, our classification system, and best practices for managing bot traffic.

1. User-Agent Parsing
First, we analyze the User-Agent header of the incoming request.
This gives us a preliminary identification of who the bot claims to be (e.g.,
"Googlebot" or "GPTBot").
2. Reverse DNS Verification
We perform Reverse DNS (rDNS) lookups on the visitor's IP address and validate the results. For known major bots (Google, Bing, OpenAI), we verify that the IP actually belongs to their official network ranges.
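A common way to implement this kind of check is forward-confirmed reverse DNS: resolve the IP to a hostname, make sure the hostname belongs to the crawler's official domain, then resolve that hostname back and confirm it maps to the original IP. The Python sketch below uses only the standard library; the hostname suffixes are an illustrative subset, and CrawlerCheck's actual verification may differ.

import socket

OFFICIAL_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def verify_crawler_ip(ip: str) -> bool:
    # Forward-confirmed reverse DNS: IP -> hostname -> IP must round-trip.
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith(OFFICIAL_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips

A visitor claiming to be Googlebot from an IP that fails this round trip is almost certainly an impostor.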
3. Reputation Database
Our database tracks established behavior patterns, official documentation URLs, and known IP ranges to provide a "Safety Score" for each crawler.
These are bots from major search engines (Google, Bing, DuckDuckGo) and trusted partners. Blocking them will directly hurt your website's visibility in search results.
This category includes AI data scrapers, SEO analysis tools, and commercial crawlers. They are legitimate companies, but they offer a value exchange that may not benefit every website owner.
These bots are often aggressive scrapers, lead generators, or tools that
ignore robots.txt rules. They consume server resources while providing
little to no tangible value to your visibility.
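As a rough illustration of what a reputation entry behind that score might contain, here is a sketch in Python; every field name, category label, and value below is an assumption for the example, not CrawlerCheck's actual schema.

from dataclasses import dataclass, field

@dataclass
class CrawlerRecord:
    name: str
    operator: str
    category: str                   # e.g. "verified", "commercial", "suspicious"
    docs_url: str                   # operator's official documentation page
    ip_ranges: list = field(default_factory=list)   # published CIDR blocks
    respects_robots_txt: bool = True
    safety_score: int = 0           # e.g. 0-100, higher = safer

example = CrawlerRecord(
    name="ExampleBot",
    operator="Example Corp",
    category="commercial",
    docs_url="https://example.com/bot-docs",
    ip_ranges=["198.51.100.0/24"],   # placeholder documentation range
    safety_score=55,
)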
Your robots.txt file is the first line of defense. A crawler follows the most specific group that matches its user-agent, so a dedicated User-agent group overrides the global User-agent: * rule for that bot.
User-agent: *
Disallow: /private/
User-agent: GPTBot
Disallow: /

In this example, most bots can't visit /private/, but GPTBot is blocked from the entire site.
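If you want to sanity-check how these groups interact before deploying a change, Python's standard urllib.robotparser module applies the same precedence rules. The URLs below are placeholders.

from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Generic bots fall under the * group: only /private/ is off limits.
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/blog/post"))     # True

# GPTBot matches its own group, which disallows everything.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))           # False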
The Crawl-delay directive is supported by some bots (like Bingbot
and Yandex) but ignored by Google. Do not rely on it as a universal rate-limiting
tool.
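For the bots that do honor it, the directive goes inside that bot's own group, with the delay given in seconds; the value below is arbitrary.

User-agent: Bingbot
Crawl-delay: 10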
While robots.txt tells bots where not to go, it doesn't guarantee they won't index a page if they find it via a link. To explicitly prevent indexing, use the noindex meta tag in the <head> of your HTML pages.
<meta name="robots" content="noindex"> You can also target specific bots. For example, to allow Google but block GPTBot from indexing:
<meta name="googlebot" content="index">
<meta name="GPTBot" content="noindex">For non-HTML content like PDFs, images, or API endpoints, you can't use
a meta tag. Instead, use the X-Robots-Tag HTTP header in your
server configuration (Nginx, Apache, etc.).
X-Robots-Tag: noindex

This is a powerful directive that is respected by all major crawlers and ensures specific files are never shown in search results.
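As an illustration, a minimal Nginx sketch that attaches this header to every PDF response might look like the following; adjust the location pattern to your own file types (Apache users can achieve the same with mod_headers and a <FilesMatch> block).

location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}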
When you use these methods, CrawlerCheck will report the bot status as Allowed (MetaNoIndex) or Allowed (HeaderNoIndex) rather than "Disallowed".
Why? To respect a noindex tag, a bot
must first be Allowed to crawl the page to see the
instruction. If you block it in robots.txt, it will never
see the tag and might still index the URL (as a link without content).
Does blocking AI bots hurt my search rankings?
No. Blocking AI training bots like GPTBot or CCBot does not affect your ranking on search engines like Google or Bing. These bots are used for training Large Language Models, not for indexing content for search.
How often should I update my robots.txt?
You don't need to update it daily. However, as new AI bots and scrapers emerge frequently, we recommend reviewing your access logs and our Directory once every few months to identify new actors you may want to block.
Will robots.txt stop every bot?
No. It is a voluntary protocol ("The Robots Exclusion Protocol"). Reputable bots follow it strictly, but malicious scrapers and bad actors will ignore it. For hard blocking, you should use server-level firewall rules or IP blocking.
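As an illustration, a minimal Nginx sketch for that kind of hard block is shown below; the IP range and bot name are placeholders. Keep in mind that User-Agent strings can be spoofed, so IP- or firewall-level rules are the stronger option.

# Inside a server block: refuse an abusive IP range outright
deny 203.0.113.0/24;

# Or reject requests whose User-Agent matches a known bad crawler
if ($http_user_agent ~* "BadBot") {
    return 403;
}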