Bots You Should Block To Protect Your Content And Bandwidth
Why I block most bots
I’m certainly not telling you that you have to block bots if you see no problem with them. However, they do cause some issues that you might like to avoid.
Firstly, they use your bandwidth and CPU resources with no benefits for you. Secondly, they scrape your websites for information that they sell to your competitors.
If you’re running your own VPS, you’ll know that CPU resources and network bandwidth don’t come cheap. And if you’re on shared hosting, you’ll want to reserve whatever resources are assigned to you for running your own services.
Regarding your content, do you want the blog posts, images and videos you’ve worked hard to create scraped by AI companies and then plagiarized? The scraped content is used by these companies to train their Large Language Models (LLMs) so that others can easily make content that looks like yours.
What’s more, SEO companies such as Ahrefs and SEMrush crawl your sites and sell the information to other businesses that are looking to outrank you on search engines. Why would you help them outcompete you?
Finally, some bots are simply malicious, probing your server and websites for vulnerabilities or hunting for information about your infrastructure and software that can be used for nefarious purposes.
Not all bots respect your robots.txt rules
You might ask: Why not just add the user agents to robots.txt? That’s because the “rules” you provide in there are only recommendations, and no one is under any obligation to follow them. Many of the bad bots are notorious for not respecting robots.txt, and to be sure your website and content are protected, I advise you to completely block their access.
How to block bots
There are several ways to block bots:
- Make a firewall rule in Cloudflare or other WAF
- Block them in the web server config (Nginx or Apache)
- Disallow them in robots.txt
Of the methods above, robots.txt is the most commonly used. However, it’s not my preferred method, as I don’t trust all bots to follow the rules set in there. Also, why waste your CPU resources and bandwidth on bots if you can block them before they even reach your server?
For these reasons, I recommend blocking bots in the Cloudflare Web Application Firewall. If that’s not an option, block them in your web server config. If you don’t have access to either of these, you have no choice but to rely on robots.txt.
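As a minimal sketch of the web server approach, here’s how blocking by user agent could look in Nginx. The bot names are just examples taken from the list below; extend the map to suit your needs. In Cloudflare, the equivalent is a custom WAF rule that matches the User-Agent header with a Block action.

```nginx
# In the http {} context: map the User-Agent header to a flag.
# The ~* prefix makes the match a case-insensitive regex.
map $http_user_agent $blocked_bot {
    default        0;
    ~*GPTBot       1;
    ~*Bytespider   1;
    ~*AhrefsBot    1;
    ~*SemrushBot   1;
}

server {
    listen 80;
    server_name example.com;   # placeholder

    # Deny flagged requests before they reach your application.
    if ($blocked_bot) {
        return 403;   # or 444 to drop the connection without a response
    }
}
```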
List of bad bots to block
AI Crawlers
A quick note: blocking AI crawlers from scraping your data only prevents future posts, articles and images from entering their datasets. It will do nothing to remove the data that these companies have already scraped and used to train the models.
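If you still want to declare your preference in robots.txt as well (remembering that compliance is voluntary), the entries for the AI crawlers below follow this pattern:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```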
anthropic-ai
anthropic-ai is an (unconfirmed) crawler operated by Anthropic to gather data for their LLMs such as Claude.
Applebot-Extended
Applebot-Extended collects data for training Apple’s LLMs. “We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot.”
Bytespider
Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok. The data is used for training its LLMs, including those powering the ChatGPT competitor Doubao. It’s an extremely aggressive bot that wastes a lot of server resources.
CCBot
CCBot is a web crawler used by Common Crawl, a non-profit that has spent years collecting massive amounts of data from the web, including copyrighted content, and organizing it into datasets used as free training data for LLMs.
ChatGPT-User
ChatGPT-User is not a web crawler, but a bot that ChatGPT plugins use to access your website. So while it is not automatic, it is still stealing data from your sites any time a user of ChatGPT references them.
ClaudeBot
ClaudeBot is a crawler operated by Anthropic. “Anthropic uses a variety of data sources for model development, such as publicly available data from the internet gathered via a web crawler.”
Claude-Web
Claude-Web is another user agent operated by Anthropic. It is unclear what it’s used for.
cohere-ai
cohere-ai is a user agent presumably operated by the AI company Cohere. It is also unclear what this one is used for, perhaps to retrieve web pages when a user asks for them in a chat.
Diffbot
Diffbot “automates web data extraction from any website using AI, computer vision, and machine learning”. Diffbot can “turn any site into a structured database of all their products, articles, and discussions in minutes”.
FacebookBot
FacebookBot is another crawler that scrapes the Internet for data used to train Meta’s LLMs.
FriendlyCrawler
FriendlyCrawler is a new web scraper that I’ve started seeing. There isn’t much info on it, but it runs on Amazon Web Services (AWS) and apparently collects data for “machine learning experiments”. It doesn’t seem to respect robots.txt either, so it’s far from “friendly”.
Google-Extended
Google-Extended is the user agent token assigned to crawls collecting data for Google’s LLMs, such as Gemini. Blocking this user agent will not affect indexing of your sites.
GPTBot
GPTBot is OpenAI’s web crawler, collecting data used to train their LLMs. Unless you enjoy having your content cannibalized, block this bot.
Image2dataset
Image2dataset is a scraper tool that people can use to crawl image URLs. The URLs are then fed into LLM datasets.
ImageSiftBot
ImageSiftBot is “a web crawler that scrapes the internet for publicly available images”. It is owned by Hive, a company that offers a reverse image search engine. However, Hive also offers an image generation AI, which means your images will be used as training data for their AI tool.
Meta-ExternalAgent
Meta-ExternalAgent “crawls the web for use cases such as training AI models or improving products by indexing content directly”. It belongs to Meta (Facebook).
omgili / omgilibot
omgilibot is a web scraper from webz.io. Data from the scraping is used to power their index, which they make accessible to businesses. The company sells their datasets for LLM training.
PerplexityBot
PerplexityBot is a web scraper used by Perplexity AI, an alternative search engine. Data from the scraping is used to power their index, as well as for training their AI models.
Timpibot
Timpibot is yet another shitbot scraping your data and selling it. This one belongs to Timpi, a company that claims its index is second only to Google’s. They sell datasets for LLM training.
SEO Crawlers
AhrefsBot
AhrefsBot is a crawler belonging to Ahrefs, a company that collects SEO-related data from your websites and sells it to your competitors so they can outrank you on search engines. This is of course terrible for your business, and I recommend blocking it, unless you’re using their services to keep track of your own SEO rankings and backlinks, in which case you need to allow it access.
Barkrowler
Barkrowler is an SEO crawler from Babbar.tech that fuels and updates their graph representation of the World Wide Web. The database and all the metrics they collect are used to provide a set of online marketing and referencing tools for SEO agencies.
BLEXBot
BLEXBot helps internet marketers get information on the link structure of sites and their interlinking on the web. It’s operated by WebMeUp, an Israeli company that offers SEO services such as a backlink checker.
DataForSeoBot
DataForSeoBot is a backlink checker bot that collects SEO information from your website and sells it to clients. “DataForSEO Bot is constantly crawling the web to add new links to our backlink database and check links that are already included”.
DotBot
DotBot is a backlink crawler from Moz. Again, if you don’t use their services yourself, there is no reason to freely hand this information over to your competitors.
MJ12Bot
MJ12Bot is a crawler collecting SEO data for the company Majestic. Block it if you don’t use their service.
SemrushBot
SemrushBot is an SEO crawler operated by Semrush, a competitor of Ahrefs providing the same kind of service. Block them without a second thought. As mentioned above, though, if you’re using Semrush as a service, you need to allow their bot access to your site.
Security Scanners
CensysInspect
CensysInspect is a security scanner operated by the Censys company. It scans the entire public IP range to identify running services on your servers. Information like this is used to identify vulnerabilities on your network and might be useful if you’re testing your own defenses. However, if you’re not using their services, this is of no value to you.
Expanse
Expanse is a web crawler from Palo Alto Networks. It crawls the Internet to create a database of “malicious URLs” which they offer as a subscription service to their customers. It is a very aggressive scanner that uses a lot of network resources.
internet-measurement
Internet-measurement is a crawler operated by Driftnet. “It is used to discover and measure services that network owners and operators have publicly exposed.” It is unclear to me how your data is being used, but it is surely not to your advantage.
Search Engine Crawlers
PetalBot
PetalBot is another very aggressive web crawler. This one is from Huawei, feeding data to Petal Search, an app only available on Huawei phones. On top of that, Huawei has been suspected of spying on behalf of the Chinese government.
Other bots and scrapers
Amazonbot
Amazonbot “is Amazon’s web crawler used to improve our services, such as enabling Alexa to answer even more questions for customers”. One user on X posted that Amazonbot had blasted through 380 GB of data in three nights.
BW/1.1
This is the user agent of the BuiltWith crawler, collecting data for the BuiltWith company. They collect data about the software and hosting platform your websites run on and then sell it packaged as “lead generation” and “sales intelligence”. This way, sales companies can know a lot about your business before they contact you with their offers.
Dalvik/2.1.0
Dalvik is a virtual machine (VM) for the Android operating system, usually seen on mobile click farms running automated tasks. No real human will be browsing with this user agent.
Dataprovider
Dataprovider bot is a web scraper collecting data such as your hosting provider, what CMS you’re running, visitor numbers, and other statistics. The company sells the data about your websites and running services to anyone interested.
“Empty user agent”
If the user agent is completely empty, there is definitely something bad going on. The user agent field should never be empty, and all regular users (people) as well as good bots will have one.
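If you’re on Nginx, a minimal sketch of a rule for this case (it goes inside a server {} block):

```nginx
# $http_user_agent is an empty string when the header is missing entirely.
if ($http_user_agent = "") {
    return 403;
}
```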
Go-http-client
Go-http-client is the default user agent of Go’s standard HTTP library, and it’s often seen on bots probing for vulnerabilities. No legitimate human visitor will use this user agent string. Though the tool can be used for legitimate purposes, it is mainly used by malicious bots.
GoogleOther
GoogleOther is a new set of data crawlers from Google, used for web scraping. “GoogleOther is the generic crawler that may be used by various product teams for fetching publicly accessible content from sites”. Besides the generic GoogleOther, they also use GoogleOther-Image and GoogleOther-Video. If you block on the user agent string “GoogleOther”, you will block all of them.
IonCrawl
IonCrawl is a crawler belonging to IONOS, a cloud hosting provider. According to their information, it is “used to improve and expand their hosting service”. I have no idea why a hosting company needs to crawl the entire Internet, and I see no benefit in being included in their index.
ISSCyberRiskCrawler
ISSCyberRiskCrawler is a new crawler I’ve started seeing in logs. It’s hosted on Amazon’s AWS cloud network. I have no idea what this bot is up to. Will update if I learn more.
Java
The Java user agent indicates that the visitor is a bot, not a human. Good bots declare themselves with their own user agent and state what their purpose is. I do not want unknown tools or bots scanning or scraping my sites for unknown reasons.
Mozlila
Mozlila is a user agent I often see in logs. It is always probing for exploits and vulnerabilities, and is never associated with legitimate requests. Not to be confused with Mozilla, the company behind the Firefox browser; it seems to have started when someone mistyped Mozilla while trying to spoof their user agent. This one is safe to block.
news-please
news-please is a web scraper tool “that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both most recent and also old, archived articles”.
Orbbot
Orbbot is a web scraper of unknown origin. Unknown bots that do not declare themselves are always up to no good.
peer39_crawler
Peer39_crawler is a scraper belonging to Peer39, a company that builds databases for advertising purposes.
python-requests
Python-requests is a Python HTTP library often used for web scraping. As with the Java bots above, these tools are used for automated web scraping or other nefarious purposes.
Scrapy
Scrapy is “an open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way”. In other words, it’s a scraper tool. Why are scrapers bad? Because they steal your content, which is then used to outrank you on search engines. Of course, anyone could copy and paste your pages manually, but it would take much more time. There is no need to hand your competition any ammunition.
VelenPublicWebCrawler
VelenPublicWebCrawler seems to be a scraper that collects email addresses from websites, which are then sold for marketing (spamming) purposes.
wp_is_mobile
Wp_is_mobile is a WordPress function that inspects the user agent to discern whether the visitor is on a desktop or a mobile device. It is misused as a user agent string by bad bots trying to look legitimate. No real user will ever have this user agent string, so it’s safe to block.
Zoominfobot
Zoominfobot “scans millions of corporate websites, press releases, electronic news services, SEC filings”, and is “focused on finding pages with information about businesses and business professionals”. This information is sold to businesses doing sales outreach.
Blocking old browser versions
Many bots and scripts use outdated browser versions as their user agent. So by blocking these outdated browser versions, you can get rid of a lot of malicious bot activity. As an example:
```
Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.9496.1505 Mobile Safari/537.36
```
Who still uses Chrome version 45? As of today (April 2024), the Chrome stable version is 123. Version 45 was released in 2015, nine years ago. I’ve even seen Chrome version 5 in my logs; that one was released in 2010, about 14 years ago. I’ll let you draw your own conclusions.
Note: do not block Chrome version 79 and above. Some useful bots still use this version of Chrome in their user agent string.
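As a sketch of how such a block could look in Nginx: the regex below matches Chrome versions 78 and lower while leaving 79 and above alone, per the note. Test it against your own logs before relying on it.

```nginx
# Matches "Chrome/0." through "Chrome/78." in the User-Agent header:
# (\d|[1-6]\d|7[0-8]) = a single digit, 10-69, or 70-78.
if ($http_user_agent ~ "Chrome/(\d|[1-6]\d|7[0-8])\.") {
    return 403;
}
```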
When blocking user agents isn’t enough
User agents can be spoofed (faked) so it is not always sufficient to rely on these for blocking bots. Most bots from large corporations do declare themselves with an accurate user agent, but there are cases when they don’t.
For example, Wired reported that Perplexity is lying about their user agent. When this is the case, you need to go further in order to block the bots. You have to identify where the bots are hosted and block or challenge those IP networks.
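In Nginx, blocking at the network level is a couple of deny directives. The CIDR ranges below are placeholders from the reserved documentation space, not any bot’s actual networks; substitute the ranges you trace from your own logs (for example with whois on the offending IPs).

```nginx
# Placeholder CIDR ranges (RFC 5737 documentation space) -- replace them
# with the networks you actually traced the bot traffic back to.
deny 192.0.2.0/24;
deny 198.51.100.0/24;
allow all;
```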
Need help blocking bad bots and scrapers?
It’s vital to protect your intellectual property and server resources. If you need help blocking bad bots and content scrapers from accessing your websites, contact me now.
This post will be updated along the way as I discover more bots. Be sure to let me know if there are bots missing from the list.