{"id":248442,"date":"2024-07-27T14:08:29","date_gmt":"2024-07-27T14:08:29","guid":{"rendered":"https:\/\/michigandigitalnews.com\/index.php\/2024\/07\/27\/websites-accuse-ai-startup-anthropic-of-bypassing-their-anti-scraping-rules-and-protocol\/"},"modified":"2025-06-25T17:13:47","modified_gmt":"2025-06-25T17:13:47","slug":"websites-accuse-ai-startup-anthropic-of-bypassing-their-anti-scraping-rules-and-protocol","status":"publish","type":"post","link":"https:\/\/michigandigitalnews.com\/index.php\/2024\/07\/27\/websites-accuse-ai-startup-anthropic-of-bypassing-their-anti-scraping-rules-and-protocol\/","title":{"rendered":"Websites accuse AI startup Anthropic of bypassing their anti-scraping rules and protocol"},"content":{"rendered":"<div>\n<p>Freelancer has accused Anthropic, the AI startup behind the Claude large language models, of ignoring its &#8220;do not crawl&#8221; robots.txt protocol to scrape its websites&#8217; data. Meanwhile, iFixit CEO Kyle Wiens said Anthropic has ignored the website&#8217;s policy prohibiting the use of its content for AI model training. Matt Barrie, the chief executive of Freelancer, told <a data-i13n=\"cpos:1;pos:1\" href=\"https:\/\/www.ft.com\/content\/07611b74-3d69-4579-9089-f2fc2af61baa\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:The Information;cpos:1;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>The Information<\/em><\/a> that Anthropic&#8217;s ClaudeBot is &#8220;the most aggressive scraper by far.&#8221; His website allegedly got 3.5 million visits from the company&#8217;s crawler within a span of four hours, which is &#8220;probably about five times the volume of the number two&#8221; AI crawler. 
Similarly, Wiens <a data-i13n=\"cpos:2;pos:1\" href=\"https:\/\/x.com\/kwiens\/status\/1816304897484284007\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:posted on X\/Twitter;cpos:2;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">posted on X\/Twitter<\/a> that Anthropic&#8217;s bot hit iFixit&#8217;s servers a million times in 24 hours. &#8220;You&#8217;re not only taking our content without paying, you&#8217;re tying up our devops resources,&#8221; he wrote.<\/p>\n<p>Back in June, <a data-i13n=\"cpos:3;pos:1\" href=\"https:\/\/www.engadget.com\/ai-companies-are-reportedly-still-scraping-websites-despite-protocols-meant-to-block-them-132308524.html\" data-ylk=\"slk:Wired accused;cpos:3;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Wired<\/em> accused<\/a> another AI company, Perplexity, of crawling its website despite the presence of the Robots Exclusion Protocol, or robots.txt. A robots.txt file typically contains instructions for web crawlers on which pages they can and can&#8217;t access. While compliance is voluntary, it&#8217;s mostly just been ignored by bad bots. After <a data-i13n=\"cpos:4;pos:1\" href=\"https:\/\/www.wired.com\/story\/perplexity-is-a-bullshit-machine\/\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Wired's piece;cpos:4;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Wired&#8217;s<\/em> piece<\/a> came out, a startup called TollBit that connects AI firms with content publishers reported that it&#8217;s not just Perplexity that&#8217;s bypassing robots.txt signals. 
While it didn&#8217;t name names, <a data-i13n=\"cpos:5;pos:1\" href=\"https:\/\/www.businessinsider.com\/openai-anthropic-ai-ignore-rule-scraping-web-contect-robotstxt\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Business Insider;cpos:5;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Business Insider<\/em><\/a> said it learned that OpenAI and Anthropic were ignoring the protocol, as well.<\/p>\n<p>Barrie said Freelancer tried to refuse the bot&#8217;s access requests at first, but it ultimately had to block Anthropic&#8217;s crawler entirely. &#8220;This is egregious scraping [which] makes the site slower for everyone operating on it and ultimately affects our revenue,&#8221; he added. As for iFixit, Wiens said the website has set alarms for high traffic, and his people got woken up at 3AM due to Anthropic&#8217;s activities. The company&#8217;s crawler stopped scraping iFixit after it added a line in its <a data-i13n=\"cpos:6;pos:1\" href=\"https:\/\/www.ifixit.com\/robots.txt\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:robots.txt file;cpos:6;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">robots.txt file<\/a> that disallows Anthropic&#8217;s bot, in particular.<\/p>\n<p>The AI startup told <em>The Information<\/em> that it respects robots.txt and that its crawler &#8220;respected that signal when iFixit implemented it.&#8221; It also said that it aims &#8220;for minimal disruption by being thoughtful about how quickly [it crawls] the same domains,&#8221; which is why it&#8217;s now investigating the case.<\/p>\n<p>AI firms use crawlers to collect content from websites that they can use to train their generative AI technologies. 
They&#8217;ve been the <a data-i13n=\"cpos:7;pos:1\" href=\"https:\/\/www.engadget.com\/the-new-york-times-is-suing-openai-and-microsoft-for-copyright-infringement-181212615.html\" data-ylk=\"slk:target of multiple lawsuits;cpos:7;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">target of multiple lawsuits<\/a> as a result, with publishers accusing them of copyright infringement. To prevent more lawsuits from being filed, companies like OpenAI have been striking deals with publishers and websites. OpenAI&#8217;s content partners, so far, include <a data-i13n=\"cpos:8;pos:1\" href=\"https:\/\/www.engadget.com\/openai-will-reportedly-pay-250-million-to-put-news-corps-journalism-in-chatgpt-214615249.html\" data-ylk=\"slk:News Corp;cpos:8;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">News Corp<\/a>, <a data-i13n=\"cpos:9;pos:1\" href=\"https:\/\/www.engadget.com\/the-atlantic-and-vox-media-made-their-own-deal-with-the-ai-devil-161017636.html\" data-ylk=\"slk:Vox Media;cpos:9;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">Vox Media<\/a>, the <a data-i13n=\"cpos:10;pos:1\" href=\"https:\/\/www.engadget.com\/openai-will-train-its-ai-models-on-the-financial-times-journalism-173249177.html\" data-ylk=\"slk:Financial Times;cpos:10;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Financial Times<\/em><\/a> and <a data-i13n=\"cpos:11;pos:1\" href=\"https:\/\/www.engadget.com\/openai-strikes-deal-to-put-reddit-posts-in-chatgpt-224133045.html\" data-ylk=\"slk:Reddit;cpos:11;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">Reddit<\/a>. 
iFixit&#8217;s Wiens seems open to the idea of signing a deal for the how-to-repair website&#8217;s articles, as well, telling Anthropic in a tweet he&#8217;s willing to have a conversation about licensing content for commercial use.<\/p>\n<div class=\"twitter-tweet-wrapper\" data-embed-anchor=\"97d06742-daf4-58cd-a178-dd16b2031211\">\n<blockquote placeholder=\"\" data-theme=\"light\" class=\"twitter-tweet\">\n<p>If any of those requests accessed our terms of service, they would have told you that use of our content expressly forbidden. But don&#8217;t ask me, ask Claude!<\/p>\n<p>If you want to have a conversation about licensing our content for commercial use, we&#8217;re right here. <a href=\"https:\/\/t.co\/CAkOQDnLjD\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:pic.twitter.com\/CAkOQDnLjD;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">pic.twitter.com\/CAkOQDnLjD<\/a><\/p>\n<p>\u2014 Kyle Wiens (@kwiens) <a href=\"https:\/\/twitter.com\/kwiens\/status\/1816136485785186335?ref_src=twsrc%5Etfw\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:July 24, 2024;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">July 24, 2024<\/a><\/p>\n<\/blockquote>\n<\/div>\n<\/div>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><br \/>\n<br \/><a href=\"https:\/\/www.engadget.com\/websites-accuse-ai-startup-anthropic-of-bypassing-their-anti-scraping-rules-and-protocol-133022756.html?src=rss\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Freelancer has accused Anthropic, the AI startup behind the Claude large language models, of ignoring its &#8220;do not crawl&#8221; robots.txt protocol to scrape 
its<\/p>\n","protected":false},"author":1,"featured_media":248443,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[159],"tags":[],"_links":{"self":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts\/248442"}],"collection":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/comments?post=248442"}],"version-history":[{"count":0,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts\/248442\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/media\/248443"}],"wp:attachment":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/media?parent=248442"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/categories?post=248442"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/tags?post=248442"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}