{"id":235341,"date":"2024-06-22T14:05:35","date_gmt":"2024-06-22T14:05:35","guid":{"rendered":"https:\/\/michigandigitalnews.com\/index.php\/2024\/06\/22\/ai-companies-are-reportedly-still-scraping-websites-despite-protocols-meant-to-block-them\/"},"modified":"2025-06-25T17:16:27","modified_gmt":"2025-06-25T17:16:27","slug":"ai-companies-are-reportedly-still-scraping-websites-despite-protocols-meant-to-block-them","status":"publish","type":"post","link":"https:\/\/michigandigitalnews.com\/index.php\/2024\/06\/22\/ai-companies-are-reportedly-still-scraping-websites-despite-protocols-meant-to-block-them\/","title":{"rendered":"AI companies are reportedly still scraping websites despite protocols meant to block them"},"content":{"rendered":"<div>\n<p>Perplexity, a company that describes its product as &#8220;a free AI search engine,&#8221; has been under fire over the past few days. Shortly after <a data-i13n=\"cpos:1;pos:1\" href=\"https:\/\/www.forbes.com\/sites\/randalllane\/2024\/06\/11\/why-perplexitys-cynical-theft-represents-everything-that-could-go-wrong-with-ai\/\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Forbes;cpos:1;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Forbes<\/em><\/a> accused it of stealing its story and republishing it across multiple platforms, <a data-i13n=\"cpos:2;pos:1\" href=\"https:\/\/www.wired.com\/story\/perplexity-is-a-bullshit-machine\/\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Wired;cpos:2;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Wired<\/em><\/a> reported that Perplexity has been ignoring the Robots Exclusion Protocol, or robots.txt, and has been scraping its website and other Cond\u00e9 Nast publications. 
Technology website <a data-i13n=\"cpos:3;pos:1\" href=\"https:\/\/www.theshortcut.com\/p\/perplexity-ai-is-stealing-from-the-shortcut\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:The Shortcut;cpos:3;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>The Shortcut<\/em><\/a> also accused the company of scraping its articles. Now, <a data-i13n=\"cpos:4;pos:1\" href=\"https:\/\/www.reuters.com\/technology\/artificial-intelligence\/multiple-ai-companies-bypassing-web-standard-scrape-publisher-sites-licensing-2024-06-21\/\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Reuters;cpos:4;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Reuters<\/em><\/a> has reported that Perplexity isn&#8217;t the only <a data-i13n=\"cpos:5;pos:1\" href=\"https:\/\/www.engadget.com\/if-ai-is-going-to-take-over-the-world-why-cant-it-solve-the-spelling-bee-170034469.html\" data-ylk=\"slk:AI company;cpos:5;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">AI company<\/a> that&#8217;s bypassing robots.txt files and scraping websites to get content that&#8217;s then used to train their technologies.<\/p>\n<p><em>Reuters<\/em> said it saw a letter addressed to publishers from TollBit, a startup that pairs them up with AI firms so they can reach licensing deals, warning them that &#8220;AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites.&#8221; The robots.txt file contains instructions for web crawlers on which pages they can and can&#8217;t access. 
Web developers have been using the protocol since 1994, but compliance is completely voluntary.<\/p>\n<p>TollBit&#8217;s letter didn&#8217;t name any company, but <a data-i13n=\"cpos:6;pos:1\" href=\"https:\/\/www.businessinsider.com\/openai-anthropic-ai-ignore-rule-scraping-web-contect-robotstxt\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Business Insider;cpos:6;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Business Insider<\/em><\/a> says it has learned that <a data-i13n=\"cpos:7;pos:1\" href=\"https:\/\/openai.com\/index\/approach-to-data-and-ai\/\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:OpenAI;cpos:7;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">OpenAI<\/a> and <a data-i13n=\"cpos:8;pos:1\" href=\"https:\/\/support.anthropic.com\/en\/articles\/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Anthropic;cpos:8;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">Anthropic<\/a> \u2014 the creators of the ChatGPT and Claude chatbots, respectively \u2014 are also bypassing robots.txt signals. Both companies previously proclaimed that they respect &#8220;do not crawl&#8221; instructions websites put in their robots.txt files.<\/p>\n<p>During its investigation, <em>Wired<\/em> discovered that a machine on an Amazon server &#8220;certainly operated by Perplexity&#8221; was bypassing its website&#8217;s robots.txt instructions. To confirm whether Perplexity was scraping its content, <em>Wired<\/em> provided the company&#8217;s tool with headlines from its articles or short prompts describing its stories. 
The tool reportedly came up with results that closely paraphrased its articles &#8220;with minimal attribution.&#8221; At times, it even generated inaccurate summaries of its stories: in one instance, <em>Wired<\/em> says, the chatbot falsely claimed that the publication had reported on a specific California cop committing a crime.<\/p>\n<p>In an interview with <a data-i13n=\"cpos:9;pos:1\" href=\"https:\/\/www.fastcompany.com\/91144894\/perplexity-ai-ceo-aravind-srinivas-on-plagiarism-accusations\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Fast Company;cpos:9;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \"><em>Fast Company<\/em><\/a>, Perplexity CEO Aravind Srinivas told the publication that his company &#8220;is not ignoring the Robot Exclusions Protocol and then lying about it.&#8221; That doesn&#8217;t mean, however, that it isn&#8217;t benefiting from crawlers that do ignore the protocol. Srinivas explained that the company uses third-party web crawlers on top of its own, and that the crawler <em>Wired<\/em> identified was one of them. When <em>Fast Company<\/em> asked if Perplexity had told the crawler provider to stop scraping <em>Wired<\/em>&#8217;s website, he only replied that &#8220;it&#8217;s complicated.&#8221;<\/p>\n<p>Srinivas defended his company&#8217;s practices, telling the publication that the Robots Exclusion Protocol is &#8220;not a legal framework&#8221; and suggesting that publishers and companies like his may have to establish a new kind of relationship. He also reportedly insinuated that <em>Wired<\/em> deliberately used prompts to make Perplexity&#8217;s chatbot behave the way it did, so ordinary users will not get the same results. 
As for the inaccurate summaries that the tool had generated, Srinivas said: &#8220;We have never said that we have never hallucinated.&#8221;<\/p>\n<\/div>\n<p><a href=\"https:\/\/www.engadget.com\/ai-companies-are-reportedly-still-scraping-websites-despite-protocols-meant-to-block-them-132308524.html?src=rss\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Perplexity, a company that describes its product as &#8220;a free AI search engine,&#8221; has been under fire over the past few days. Shortly after<\/p>\n","protected":false},"author":1,"featured_media":235342,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[159],"tags":[],"_links":{"self":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts\/235341"}],"collection":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/comments?post=235341"}],"version-history":[{"count":0,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts\/235341\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/media\/235342"}],"wp:attachment":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/media?parent=235341"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-j
son\/wp\/v2\/categories?post=235341"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/tags?post=235341"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}