By Hengyu, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

The Claude team has sparked public outrage this time!

The reason: its crawler hit one company's servers 1 million times within 24 hours, scraping the site's content without paying for it.

Not only did it blatantly ignore the site's "no crawling" notice, it also tied up the company's server resources.

The "victim" company did its best to defend itself but could not stop the crawler, and its content was scraped by Claude anyway.

Claude's team sparks outrage: it scraped data by any means, renaming its crawler to get around blocking rules

The head of the company was furious and sounded off on X:

Hey Anthropic, I know you're hungry for data. Claude is really smart!

But what you're doing is really not cool!

Many netizens were indignant about this, and a netizen who was engaged in copywriting left a message:

"I suggest using 'stealing' instead of 'not paying' to describe this behavior of Anthropic."

All of a sudden, the crowd was furious!

Between those voicing support, those condemning Anthropic, and those demanding that Claude pay up, the comment section turned into a free-for-all.

What's going on

The company that strongly condemned Anthropic is iFixit, a US e-commerce and how-to guide website.

Part of iFixit's business is to provide free Wikipedia-like repair guides for consumer electronics and gadgets.

There are millions of pages within the site, including repair guides, revision histories of guides, blogs, news posts and research, forums, community-contributed repair guides, and Q&A sections.

However, iFixit suddenly discovered that Claude's crawler, ClaudeBot, was sending thousands of requests per minute for several hours.

That's about a million visits to its website in a single day.

According to statistics, it accessed 10 TB of files in a single day, totaling 73 TB throughout the month of May.

In response, iFixit CEO Kyle Wiens (referred to below as "Lao K") had this to say:

Without permission, ClaudeBot took all of our data and maxed out our servers... Fine, it's not a big deal.

I wonder whether it crawled our licensing terms?

That's right, "without permission".

iFixit does in fact have a posted notice:

Reproduction, copying, or distribution of any content, materials, or design elements on this website for any other purpose, including training machine learning or artificial intelligence models, without the express prior written permission of iFixit, is strictly prohibited.

And yet.

Not only did Claude turn a blind eye and keep up its frenzied scraping, it also got around iFixit's defenses.

iFixit had in fact successfully blocked two of Anthropic's AI scraping bots, named "ANTHROPIC-AI" and "CLAUDE-WEB".

But those two bots appear to be a thing of the past. The crawler now doing the work is "ClaudeBot", which had not been blocked.

As a last resort, Lao K said, iFixit modified its robots.txt file this week specifically to stop Anthropic's crawler bots.
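Why blocking the two old names did not stop the new crawler can be illustrated with a minimal sketch using Python's standard urllib.robotparser. The rules below are illustrative only and are not iFixit's actual robots.txt; the URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules that only name the two retired crawlers.
OLD_RULES = """\
User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /
"""

# The same rules with an extra group naming the current crawler.
NEW_RULES = OLD_RULES + """
User-agent: ClaudeBot
Disallow: /
"""

URL = "https://example.com/Guide/1"  # placeholder URL, not a real page

print(allowed(OLD_RULES, "ClaudeBot", URL))  # True: no rule names ClaudeBot
print(allowed(NEW_RULES, "ClaudeBot", URL))  # False: now explicitly disallowed
```

robots.txt rules are matched by user-agent name, so a rule written for an old name says nothing about a renamed crawler. The file is also only a request: it has an effect only if the crawler chooses to honor it.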

So, what's the reaction from Anthropic?

They did not stay silent, and responded to the media:

ANTHROPIC-AI and CLAUDE-WEB were indeed old crawlers used by the company, but they have now been discontinued.

Anthropic, however, dodged the question of whether the currently active ClaudeBot respects robots.txt rules that prohibit crawling.

It's not the first time an AI company has done this

On Anthropic's official website, there has long been an article titled "Does Anthropic crawl data from the web, and how can site owners block the crawler?"

It mentions:

In accordance with industry standards, Anthropic uses a variety of data sources for model development, such as publicly available data from the internet collected through web crawlers.

Our crawling should not be intrusive or disruptive.

Our goal is to minimize disruption by being thoughtful about how quickly we crawl the same domain, and by respecting Crawl-delay where appropriate.

But judging from the public reaction, Anthropic clearly did not do this.

Crawling other people's data without permission is an old habit of theirs.
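The delay mentioned in Anthropic's statement presumably corresponds to the Crawl-delay directive that some sites put in robots.txt. Below is a minimal sketch of how a well-behaved crawler could read it with Python's urllib.robotparser and turn it into a request budget. The 10-second value, the ExampleBot name, and the 1-second fallback are assumptions for illustration, not anything iFixit or Anthropic published.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to wait 10 seconds per request.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""


def min_interval(robots_txt: str, user_agent: str, default: float = 1.0) -> float:
    """Seconds a polite crawler should wait between requests to one site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    return float(delay) if delay is not None else default


def max_requests_per_day(interval_seconds: float) -> int:
    """Upper bound on daily requests when waiting interval_seconds each time."""
    return int(86_400 // interval_seconds)


interval = min_interval(ROBOTS_TXT, "ExampleBot")
print(interval, max_requests_per_day(interval))  # 10.0 8640
```

For comparison, a million requests in 24 hours averages roughly 11.6 requests per second.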

For example, in April of this year, the Linux Mint forum was hit.

Over the course of a few hours, ClaudeBot repeatedly hit the forum to scrape data, leaving it extremely slow or intermittently unavailable for hours before it finally went down completely.

Some people said that during that period ClaudeBot accounted for the most traffic, 20 times that of the second-largest source and 40 times that of the third.

In the discussion threads about the April incident and this one, one suggestion was:

Since posting a no-crawling notice is useless, why not plant fake content containing trackable or unique information on your site, so you can detect who stole the data.

iFixit did in fact do this.

And it worked: the site's information turned out to have been scraped not only by Claude but also by OpenAI...

Realistically, though, what can be done? Not much.

Because besides Claude and GPT, quite a few other AI products raid websites like this.
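The article does not say how iFixit planted its trackable content. One possible approach, sketched here under stated assumptions, is to serve each crawler a unique marker derived from its user-agent name and then look for those markers in text that later surfaces elsewhere. The key, the page ID, the "ref-" prefix, and the helper names are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical key, keep it private


def canary_for(user_agent: str, page_id: str) -> str:
    """Derive a unique, reproducible marker for one crawler on one page."""
    message = f"{user_agent}|{page_id}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"ref-{digest[:16]}"


def identify(leaked_text: str, known_agents: list[str], page_id: str):
    """Return the agent whose marker appears in leaked_text, or None."""
    for agent in known_agents:
        if canary_for(agent, page_id) in leaked_text:
            return agent
    return None


AGENTS = ["ClaudeBot", "PerplexityBot"]  # crawler names mentioned in the article
token = canary_for("ClaudeBot", "guide-123")
sample = f"Step 4: order replacement part {token} before reassembly."
print(identify(sample, AGENTS, "guide-123"))  # ClaudeBot
```

Because the marker is an HMAC of the user-agent name and page ID, the site owner can regenerate it at any time without storing a table of tokens, while an outsider without the key cannot forge or predict it.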

A few days ago, a bot-detection startup called Tollbit claimed that Perplexity, Claude, and OpenAI ignore websites' robots.txt settings when crawling. When asked at the time for its position, OpenAI declined to comment.

Looking further back, there was a fuss last month.

Forbes condemned the AI search product Perplexity for allegedly plagiarizing its news articles, and more media outlets then came forward to accuse Perplexity's crawler, PerplexityBot, of illegally scraping information from their websites.

And Perplexity's position has always been:

We respect publishers' requests not to scrape content, and we operate within the bounds of fair-use copyright law.

In theory, both ClaudeBot and PerplexityBot should follow the robots.txt protocol and refrain from crawling a site's content when its robots.txt file declares crawling off-limits.

Since such declarations are proving ineffective, some are calling on creators to move content behind a paywall wherever possible to prevent unlimited crawling.

Do you think this approach will work?

Reference Links:

[1]https://www.404media.co/websites-are-blocking-the-wrong-ai-scrapers-because-ai-companies-keep-making-new-ones/

[2]https://www.404media.co/anthropic-ai-scraper-hits-ifixits-website-a-million-times-in-a-day/

[3]https://twitter.com/kwiens/status/1816128302542905620

[4]https://x.com/Carnage4Life/status/1804316030665396356

[5]https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler?ref=404media.co

