{"id":222210,"date":"2024-04-09T22:17:28","date_gmt":"2024-04-09T22:17:28","guid":{"rendered":"https:\/\/michigandigitalnews.com\/index.php\/2024\/04\/09\/meta-google-openai-used-protected-data-to-train-llms-report\/"},"modified":"2025-06-25T17:18:57","modified_gmt":"2025-06-25T17:18:57","slug":"meta-google-openai-used-protected-data-to-train-llms-report","status":"publish","type":"post","link":"https:\/\/michigandigitalnews.com\/index.php\/2024\/04\/09\/meta-google-openai-used-protected-data-to-train-llms-report\/","title":{"rendered":"Meta, Google, OpenAI used protected data to train LLMs, report"},"content":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/fortune.com\/img-assets\/wp-content\/uploads\/2024\/04\/GettyImages-1255237098-e1712697888939.jpg?w=2048\" \/><\/p>\n<p><a href=\"https:\/\/fortune.com\/2023\/11\/19\/ai-expert-gary-marcus-warns-openai-investors-getting-sam-altman-reinstated-overpowering-board-ominous\/\" target=\"_self\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">Gary Marcus<\/a> is a leading AI researcher who\u2019s increasingly appalled at what he\u2019s seeing. He founded at least two AI startups, one of which sold to <a href=\"https:\/\/fortune.com\/company\/uber-technologies\/\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">Uber<\/a>, and has been researching the subject for over two decades. 
Just last weekend, the <a href=\"https:\/\/www.ft.com\/content\/648228e7-11eb-4e1a-b0d5-e65a638e6135\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\"><em>Financial Times<\/em><\/a> called him \u201cPerhaps the noisiest AI questioner\u201d and reported that <a href=\"https:\/\/fortune.com\/2024\/02\/12\/sam-altman-7-trillion-ai-chips-grind-for-future-substack\/#\" target=\"_self\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">Marcus<\/a> assumed he was targeted by a critical <a href=\"https:\/\/fortune.com\/longform\/chatgpt-openai-sam-altman-microsoft\/\" target=\"_self\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">Sam Altman<\/a> post on <a href=\"https:\/\/twitter.com\/sama\/status\/1512471289545383940?lang=en\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">X<\/a>: \u201cGive me the confidence of a mediocre deep-learning skeptic.\u201d<\/p>\n<div>\n<p>Marcus doubled down on his critiques the very next day after he appeared in the FT, <a href=\"https:\/\/garymarcus.substack.com\/p\/generative-ai-as-shakespearean-tragedy\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">writing on his Substack<\/a> about \u201cgenerative AI as Shakespearean tragedy.\u201d The subject was a <a href=\"https:\/\/www.nytimes.com\/2024\/04\/06\/technology\/tech-giants-harvest-data-artificial-intelligence.html\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">bombshell report from <em>The New York Times<\/em><\/a> that OpenAI violated YouTube\u2019s terms of service by scraping over a million hours of user-generated content. 
What\u2019s worse, Google\u2019s need for data to train its own AI model was so insatiable that it did the same thing, potentially violating the copyrights of the content creators whose videos it used without their consent.<\/p>\n<p>As far back as 2018, Marcus noted, he had expressed doubts about the \u201cdata-guzzling\u201d approach to training that sought to feed AI models with as much content as possible. In fact, he listed eight of his warnings, dating all the way back to his <a href=\"https:\/\/mitpress.mit.edu\/9780262632683\/the-algebraic-mind\/\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">diagnosis of hallucinations in 2001<\/a>, all coming true like a curse on Macbeth or Hamlet manifesting in the fifth act. \u201cWhat makes all this tragic is that many of us have tried so hard to warn the field that we would wind up here,\u201d Marcus wrote.<\/p>\n<p>While Marcus declined to comment to <em>Fortune<\/em>, the tragedy goes well beyond the fact that nobody listened to critics like him and Ed Zitron, another prominent skeptic <a href=\"https:\/\/www.ft.com\/content\/648228e7-11eb-4e1a-b0d5-e65a638e6135\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">cited<\/a> by the FT. According to the <em>Times<\/em>, which cites numerous background sources, both Google and OpenAI knew what they were doing was legally dubious\u2014banking on the fact that copyright in the age of AI had yet to be litigated\u2014but felt they had no choice but to keep pumping data into their large language models to stay ahead of their competition. And in Google\u2019s case, it potentially suffered harm as a result of OpenAI\u2019s massive scraping efforts, but its own bending of the rules to scrape the very same data left it with a proverbial arm tied behind its back.<\/p>\n<h3 class=\"wp-block-heading\">Did <strong>OpenAI use YouTube<\/strong> videos? 
<\/h3>\n<p>Google employees became aware that OpenAI was taking YouTube content to train its models, which would violate YouTube\u2019s terms of service and possibly infringe the copyrights of the creators to whom the videos belong. Caught in this bind, Google decided not to denounce OpenAI publicly because it was afraid of drawing attention to its own use of YouTube videos to train AI models, the <em>Times<\/em> reported.\u00a0<\/p>\n<p>A Google spokesperson told <em>Fortune<\/em> the company had \u201cseen unconfirmed reports\u201d that OpenAI had used YouTube videos. They added that YouTube\u2019s terms of service \u201cprohibit unauthorized scraping or downloading\u201d of videos, which the company has a \u201clong history of employing technical and legal measures to prevent.\u201d\u00a0<\/p>\n<p>Marcus says the behavior of these big tech firms was predictable because data was the key ingredient needed to build the AI tools they were in an arms race to develop. Without quality data, like well-written novels, podcasts by knowledgeable hosts, or expertly produced movies, the chatbots and image generators risk spitting out mediocre content. That idea can be summed up in the data science adage \u201ccrap in, crap out.\u201d In an op-ed for <em>Fortune<\/em>, Jim Stratton, the chief technology officer of HR software company Workday, <a href=\"https:\/\/fortune.com\/2023\/08\/10\/workday-data-ai-revolution\/\" target=\"_self\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">said<\/a> \u201cdata is the lifeblood of AI,\u201d making the \u201cneed for quality, timely data more important than ever.\u201d<\/p>\n<p>Around 2021, OpenAI ran into a shortage of data. Desperately needing more instances of human speech to continue improving its ChatGPT tool, which was still about a year away from being released, OpenAI decided to get it from YouTube. Employees discussed the fact that cribbing YouTube videos might not be allowed. 
Eventually, a group, including OpenAI president Greg Brockman, went ahead with the plan.\u00a0\u00a0<\/p>\n<p>That a senior figure like Brockman was involved in the scheme was evidence of how central such data-gathering methods were to developing AI, according to Marcus. Brockman did so, \u201cvery likely knowing that he was entering a legal gray area\u2014yet desperate to feed the beast,\u201d Marcus wrote. \u201cIf it all falls apart, either for legal reasons or technical reasons, that image may linger.\u201d<\/p>\n<p>When reached for comment, a spokesperson for OpenAI did not answer specific questions about its use of YouTube videos to train its models. \u201cEach of our models has a unique dataset that we curate to help their understanding of the world and remain globally competitive in research,\u201d they wrote in an email. \u201cWe use numerous sources including publicly available data and partnerships for non-public data, and are exploring synthetic data generation,\u201d they said, referring to the practice of using AI-generated content to train AI models.\u00a0<\/p>\n<p>OpenAI chief technology officer Mira Murati was asked in a <em>Wall Street Journal<\/em> <a href=\"https:\/\/www.wsj.com\/video\/series\/joanna-stern-personal-technology\/openai-made-me-crazy-videosthen-the-cto-answered-most-of-my-questions\/C2188768-D570-4456-8574-9941D4F9D7E2\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">interview<\/a> whether the company\u2019s new Sora video generator had been trained using YouTube videos; she answered, \u201cI\u2019m actually not sure about that.\u201d Last week YouTube CEO Neal Mohan <a href=\"https:\/\/fortune.com\/2024\/04\/04\/openai-youtube-clear-violation-terms-service-ai-sora-training\/#\" target=\"_self\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">responded<\/a> by saying that while he didn\u2019t know if OpenAI had actually used YouTube data to train Sora or any other tool, if it had, that would violate the 
platform\u2019s rules. Mohan did <a href=\"https:\/\/www.bloomberg.com\/news\/articles\/2024-04-04\/youtube-says-openai-training-sora-with-its-videos-would-break-the-rules\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">mention<\/a> that Google uses some YouTube content to train its AI tools based on a few contracts it has with individual creators, a statement a Google spokesperson reiterated to <em>Fortune<\/em> in an email.\u00a0<\/p>\n<h3 class=\"wp-block-heading\"><strong>Meta decides licensing deal would take too long<\/strong><\/h3>\n<p>OpenAI wasn\u2019t alone in facing a lack of adequate data. Meta was also grappling with the issue. When Meta realized its AI products weren\u2019t as advanced as OpenAI\u2019s, it held numerous meetings with top executives to figure out ways to secure more data to train its systems. Executives considered options like paying a licensing fee of $10 per book for new releases and outright buying the publisher Simon &amp; Schuster. During these meetings, executives acknowledged they had already used copyrighted material without the authors\u2019 permission. Ultimately, they decided to press on even if it meant possible lawsuits in the future, according to the <em>New York Times<\/em>.\u00a0\u00a0\u00a0<\/p>\n<p>Meta did not respond to a request for comment.<\/p>\n<p>Meta\u2019s lawyers believed that if things did end up in litigation, they would be covered by a <a href=\"https:\/\/www.theatlantic.com\/technology\/archive\/2015\/10\/fair-use-transformative-leval-google-books\/411058\/\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">2015 case Google won<\/a> against a consortium of authors. 
At the time, a judge ruled that Google was permitted to use the authors\u2019 books without having to pay a licensing fee because it was using their work to build a search engine, which was sufficiently transformative to be considered fair use.\u00a0<\/p>\n<p>OpenAI is arguing something similar in a <a href=\"https:\/\/fortune.com\/2023\/12\/27\/openai-microsoft-new-york-times-lawsuit-ai-copyright-infringement\/\" target=\"_self\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">case<\/a> brought against it by the <em>New York Times<\/em> in December. The <em>Times<\/em> alleges that OpenAI used its copyrighted material without compensating it for doing so. OpenAI <a href=\"https:\/\/fortune.com\/2024\/01\/08\/openai-blog-post-new-york-times-lawsuit-not-full-story-copyright\/\" target=\"_self\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">contends<\/a> its use of the materials is covered by fair use because they were gathered to train a large language model, not to compete with the news organization.\u00a0<\/p>\n<p>For Marcus, the hunger for more data was evidence that the whole proposition of AI was built on <a href=\"https:\/\/www.theverge.com\/24075086\/ai-investment-hype-earnings\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">shaky ground<\/a>. In order for AI to <a href=\"https:\/\/www.reuters.com\/breakingviews\/ai-hype-will-be-hard-puncture-2024-03-20\/\" target=\"_blank\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">live up<\/a> to the <a href=\"https:\/\/fortune.com\/2024\/04\/07\/ai-stocks-nvidia-artificial-intelligence-tsmc-taiwan-semiconductor-emerging-markets\/\" target=\"_self\" rel=\"noopener\" class=\"sc-76811d68-0 jyYcOa\">hype<\/a> with which it\u2019s been billed, it simply needs more data than is available. 
\u201cAll this happened upon the realization that their systems simply cannot succeed without even more data than the internet-scale data they have already been trained on,\u201d Marcus wrote on Substack.\u00a0<\/p>\n<p>OpenAI seemed to concede that was the case in written testimony to the U.K.\u2019s House of Lords in December. \u201cIt would be impossible to train today\u2019s leading AI models without using copyrighted materials,\u201d the company wrote.\u00a0<\/p>\n<\/div>\n<p><a href=\"https:\/\/fortune.com\/2024\/04\/09\/ai-openai-llm-meta-google-sam-altman-gary-marcus-youtube-chatgpt\/\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Gary Marcus is a leading AI researcher who\u2019s increasingly appalled at what he\u2019s seeing. 
He founded at least two AI startups, one of which<\/p>\n","protected":false},"author":1,"featured_media":222211,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[149],"tags":[],"_links":{"self":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts\/222210"}],"collection":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/comments?post=222210"}],"version-history":[{"count":0,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts\/222210\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/media\/222211"}],"wp:attachment":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/media?parent=222210"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/categories?post=222210"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/tags?post=222210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}