{"id":244286,"date":"2024-07-17T01:44:08","date_gmt":"2024-07-17T01:44:08","guid":{"rendered":"https:\/\/michigandigitalnews.com\/index.php\/2024\/07\/17\/apple-nvidia-and-anthropic-reportedly-used-youtube-transcripts-without-permission-to-train-ai-models\/"},"modified":"2025-06-25T17:14:35","modified_gmt":"2025-06-25T17:14:35","slug":"apple-nvidia-and-anthropic-reportedly-used-youtube-transcripts-without-permission-to-train-ai-models","status":"publish","type":"post","link":"https:\/\/michigandigitalnews.com\/index.php\/2024\/07\/17\/apple-nvidia-and-anthropic-reportedly-used-youtube-transcripts-without-permission-to-train-ai-models\/","title":{"rendered":"Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models"},"content":{"rendered":"<div>\n<p>Some of the world\u2019s largest tech companies trained their AI models on a dataset that included transcripts of more than 173,000 YouTube videos without permission, a <a data-i13n=\"elm:context_link;elmt:doNotAffiliate;cpos:1;pos:1\" class=\"link \" href=\"https:\/\/www.proofnews.org\/apple-nvidia-anthropic-used-thousands-of-swiped-youtube-videos-to-train-ai\/\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:new investigation;elm:context_link;elmt:doNotAffiliate;cpos:1;pos:1;itc:0;sec:content-canvas\">new investigation<\/a> from <em>Proof News<\/em> has found. The dataset, which was created by a nonprofit company called EleutherAI, contains transcripts of YouTube videos from more than 48,000 channels and was used by Apple, NVIDIA and Anthropic among other companies. 
The findings of the investigation spotlight AI\u2019s uncomfortable truth: the technology is largely built on the backs of data siphoned from creators without their consent or compensation.<\/p>\n<p>The dataset doesn\u2019t include any videos or images from YouTube, but contains video transcripts from the platform&#8217;s biggest creators including Marques Brownlee and MrBeast, as well as large news publishers like <em>The New York Times<\/em>, the <em>BBC<\/em>, and <em>ABC News<\/em>. Subtitles from videos belonging to Engadget are also part of the dataset.<\/p>\n<p>\u201cApple has sourced data for their AI from several companies,\u201d Brownlee <a data-i13n=\"elm:context_link;elmt:doNotAffiliate;cpos:2;pos:1\" class=\"link \" href=\"https:\/\/x.com\/MKBHD\/status\/1813206956716212511\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:posted on X;elm:context_link;elmt:doNotAffiliate;cpos:2;pos:1;itc:0;sec:content-canvas\">posted on X<\/a>. \u201cOne of them scraped tons of data\/transcripts from YouTube videos, including mine,\u201d he added. 
\u201cThis is going to be an evolving problem for a long time.\u201d<\/p>\n<div class=\"twitter-tweet-wrapper\" data-embed-anchor=\"633d85f7-5dfd-5674-95f6-524f18b32c2b\">\n<blockquote placeholder=\"\" data-theme=\"light\" class=\"twitter-tweet\">\n<p>Apple has sourced data for their AI from several companies<\/p>\n<p>One of them scraped tons of data\/transcripts from YouTube videos, including mine<\/p>\n<p>Apple technically avoids &#8220;fault&#8221; here because they&#8217;re not the ones scraping<\/p>\n<p>But this is going to be an evolving problem for a long time <a href=\"https:\/\/t.co\/U93riaeSlY\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:https:\/\/t.co\/U93riaeSlY;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">https:\/\/t.co\/U93riaeSlY<\/a><\/p>\n<p>\u2014 Marques Brownlee (@MKBHD) <a href=\"https:\/\/twitter.com\/MKBHD\/status\/1813206956716212511?ref_src=twsrc%5Etfw\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:July 16, 2024;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">July 16, 2024<\/a><\/p>\n<\/blockquote>\n<\/div>\n<p>A Google spokesperson told Engadget that <a data-i13n=\"elm:context_link;elmt:doNotAffiliate;cpos:3;pos:1\" class=\"link \" href=\"https:\/\/www.bloomberg.com\/news\/articles\/2024-04-04\/youtube-says-openai-training-sora-with-its-videos-would-break-the-rules?sref=10lNAhZ9\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:previous comments;elm:context_link;elmt:doNotAffiliate;cpos:3;pos:1;itc:0;sec:content-canvas\">previous comments<\/a> made by YouTube CEO Neal Mohan, who said that companies using YouTube&#8217;s data to train AI models would violate the platform&#8217;s terms of service, still stand. Apple, NVIDIA, Anthropic and EleutherAI did not respond to a request for comment from Engadget.<\/p>\n<p>So far, AI companies haven\u2019t been transparent about the data used to train their models. 
Earlier this month, artists and photographers <a data-i13n=\"elm:context_link;elmt:doNotAffiliate;cpos:4;pos:1\" class=\"link \" href=\"https:\/\/www.engadget.com\/artists-criticize-apples-lack-of-transparency-around-apple-intelligence-data-131250021.html\" data-ylk=\"slk:criticized Apple;elm:context_link;elmt:doNotAffiliate;cpos:4;pos:1;itc:0;sec:content-canvas\">criticized Apple<\/a> for failing to reveal the source of training data for <a data-i13n=\"elm:context_link;elmt:doNotAffiliate;cpos:5;pos:1\" class=\"link \" href=\"https:\/\/www.engadget.com\/apples-first-attempt-at-ai-is-apple-intelligence-181444846.html\" data-ylk=\"slk:Apple Intelligence;elm:context_link;elmt:doNotAffiliate;cpos:5;pos:1;itc:0;sec:content-canvas\">Apple Intelligence<\/a>, the company\u2019s own spin on generative AI coming to millions of Apple devices this year.<\/p>\n<p>YouTube, the world\u2019s largest repository of videos, in particular, is a goldmine of not only transcripts but also audio, video, and images, making it an attractive dataset for training AI models. Earlier this year, OpenAI\u2019s chief technology officer, Mira Murati, <a data-i13n=\"elm:context_link;elmt:doNotAffiliate;cpos:6;pos:1\" class=\"link \" href=\"https:\/\/www.youtube.com\/watch?v=mAUpxN-EIgU\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:evaded questions;elm:context_link;elmt:doNotAffiliate;cpos:6;pos:1;itc:0;sec:content-canvas\">evaded questions<\/a> from <em>The Wall Street Journal <\/em>about whether the company used YouTube videos to train <a data-i13n=\"elm:context_link;elmt:doNotAffiliate;cpos:7;pos:1\" class=\"link \" href=\"https:\/\/www.engadget.com\/openais-new-sora-model-can-generate-minute-long-videos-from-text-prompts-195717694.html\" data-ylk=\"slk:Sora;elm:context_link;elmt:doNotAffiliate;cpos:7;pos:1;itc:0;sec:content-canvas\">Sora<\/a>, OpenAI\u2019s upcoming AI video generation tool. 
\u201cI\u2019m not going to go into the details of the data that was used, but it was publicly available or licensed data,\u201d Murati said at the time. Alphabet CEO <a data-i13n=\"cpos:8;pos:1\" href=\"https:\/\/www.theverge.com\/24158374\/google-ceo-sundar-pichai-ai-search-gemini-future-of-the-internet-web-openai-decoder-interview\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:Sundar Pichai;cpos:8;pos:1;elm:context_link;itc:0;sec:content-canvas\" class=\"link \">Sundar Pichai<\/a> has also said that companies using data from YouTube to train their AI models would violate the platform\u2019s terms of service.<\/p>\n<p>If you want to see if subtitles from your YouTube videos or from your favorite channels are part of the dataset, head over to the Proof News&#8217; <a data-i13n=\"elm:context_link;elmt:doNotAffiliate;cpos:9;pos:1\" class=\"link \" href=\"https:\/\/www.proofnews.org\/youtube-ai-search\/\" rel=\"nofollow noopener\" target=\"_blank\" data-ylk=\"slk:lookup tool;elm:context_link;elmt:doNotAffiliate;cpos:9;pos:1;itc:0;sec:content-canvas\">lookup tool<\/a>.<\/p>\n<p><strong>Update, July 16 2024, 3:17 PM PT:<\/strong> This story has been updated to include a statement from Google.<\/p>\n<\/div>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><br \/>\n<br \/><a href=\"https:\/\/www.engadget.com\/apple-nvidia-and-anthropic-reportedly-used-youtube-transcripts-without-permission-to-train-ai-models-170827317.html?src=rss\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Some of the world\u2019s largest tech companies trained their AI models on a dataset that included transcripts of more than 173,000 YouTube videos 
without<\/p>\n","protected":false},"author":1,"featured_media":244287,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[159],"tags":[],"_links":{"self":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts\/244286"}],"collection":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/comments?post=244286"}],"version-history":[{"count":0,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/posts\/244286\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/media\/244287"}],"wp:attachment":[{"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/media?parent=244286"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/categories?post=244286"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michigandigitalnews.com\/index.php\/wp-json\/wp\/v2\/tags?post=244286"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}