Major Websites Blocking Content from AI Crawlers


According to recent data from AI-content detector Originality.AI, nearly 20% of the top 1000 websites in the world are blocking the crawler bots that collect web data for AI services.

Websites big and small are taking matters into their own hands because there are no clear legal or regulatory guidelines limiting how AI companies may use intellectual property.

Early in August, OpenAI unveiled its GPTBot crawler, saying that the data it collects "might improve future models," promising that paywalled content would be excluded, and providing instructions on how website owners can block the crawler.

Several well-known news outlets, including the New York Times, Reuters, and CNN, started blocking GPTBot shortly afterward, and many more have since done the same.

According to Originality.AI, the share of the top 1000 most popular websites blocking OpenAI's GPTBot climbed from 9.1% on August 22 to 12% on August 29.

Amazon, Quora, and Indeed are three of the major websites that block OpenAI's bot. The analysis shows that larger websites are more likely to have already blocked AI bots. The Common Crawl bot, another crawler that regularly collects web data used by some AI services, is blocked by 6.77% of the top 1000 websites.

Here is how it works. Any webpage that can be accessed by a web browser can also be "scraped" by a crawler, which works much like a browser but stores the content in a database rather than displaying it to a user. That is how search engines like Google gather information.
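As a rough illustration of that idea, the sketch below fetches pages and writes them to a database instead of rendering them. It is a minimal example under stated assumptions, not how any particular company's crawler is built: the function names, the `pages` table, and the use of SQLite are all hypothetical choices made for the example.

```python
import sqlite3
from typing import Callable, Iterable
from urllib.request import urlopen


def fetch_over_http(url: str) -> str:
    """Download a page body as text, as a browser would (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    urls: Iterable[str],
    conn: sqlite3.Connection,
    fetch: Callable[[str], str] = fetch_over_http,
) -> int:
    """Fetch each URL and store its content; return the number of stored pages."""
    # Hypothetical schema: one row per URL, holding the raw page body.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    for url in urls:
        body = fetch(url)  # a browser would render this; a crawler keeps it
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (url, body),
        )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

Passing a different `fetch` function makes it possible to try the sketch without touching the network, for example with a stub that returns canned HTML.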

Site owners have long been able to publish instructions telling these crawlers to stay away, but compliance is entirely voluntary, and bad actors can simply ignore the instructions.
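Those instructions are usually published in a site's robots.txt file. The sketch below shows how a well-behaved crawler might check such a file before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the `example.com` URL, and the `is_allowed` helper are hypothetical and for illustration only; `GPTBot` and `CCBot` are the user-agent names used by OpenAI's and Common Crawl's crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/article"))
```

Nothing in this check is enforced by the website itself. A crawler that never runs it, or that ignores the answer, can still request the page, which is why compliance is voluntary.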

Although many publishers and owners of intellectual property have long objected, Google and other web companies view the activity of their data crawlers as fair use. As a result, Google has been involved in many legal battles over the practice. As generative AI and large language models gain popularity, the issue has resurfaced, with AI businesses sending out their own crawlers to gather information to train their models and feed their chatbots.

Because Google and other search engines directed consumers to publishers' ad-supported websites, some publishers found at least some value in allowing search crawlers access to their sites. In the age of AI, however, publishers are rejecting crawlers more adamantly because they see no benefit in providing their data to AI firms.

Many media businesses are currently in discussions with AI companies about licensing their data for a fee, but these discussions are still in the early stages. In the meantime, some websites and owners of intellectual property are suing, or considering suing, AI businesses that may have misused their data.

Media organizations that feel they were duped by Google over the past 20 years are viewing the increasing commercialization of AI companies like OpenAI with anger and a "we won't get fooled again" attitude. According to The Information, OpenAI is expected to earn more than $1 billion in revenue over the coming year.

News organizations in particular are having trouble striking the right balance between embracing AI and resisting it. On the one hand, the sector is desperately trying to come up with new ways to increase profit margins in its labor-intensive operations. On the other hand, integrating AI into a newsroom's workflow when public confidence in media organizations is at an all-time low raises difficult ethical issues.

If too much of the web blocks AI crawlers, the owners of those crawlers may find it harder to update and improve their AI products, especially as good data becomes tougher to find.

