Common Crawl is a nonprofit that maintains an open repository of “crawled” internet data and makes that data freely available to researchers. Some news outlets have demanded that the group stop crawling their websites because of its role in training artificial intelligence (AI) programs, claiming that the group violates their copyrights. 1
Background
Gil Elbaz founded Common Crawl in 2007 to build and maintain an open repository of web data for researchers and others to use. 2
The group focuses on downloading HTML pages and does not archive images, videos, or JavaScript files. 3
The group asserts that it is crucial for an information-based society that web crawl data be open and accessible to anyone who wants to use it, such as researchers. The group stores its data on Amazon’s S3 servers. 2
Copyright Complaints
Online publishers have argued that groups like Common Crawl violate their copyrights. Some web publishers believe that web archives that package gigabytes, terabytes, or even petabytes of web content and provide it to researchers could be viewed as “redistributing” copyrighted content. Traditionally, research universities have largely taken the position that researchers are free to crawl the web and bulk download vast quantities of content for data-mining research. Web archives as a whole have taken the opposite position: that they cannot make their holdings available for data mining because doing so would, in their view, amount to “redistributing” the downloaded content to third parties. 3
Common Crawl makes all its data available for research. The group excludes pages covered by a robots.txt exclusion policy and allows publishers to opt out of having their pages crawled. The group asserts that it addresses copyright concerns by capturing only a sample of a website rather than the entire website. 3
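The robots.txt opt-out described above can be illustrated with a minimal sketch using Python's standard-library `urllib.robotparser`. It assumes the crawler identifies itself with the user-agent token `CCBot`, a detail not stated in the sources cited here; the sample robots.txt file, the `example.com` URL, and the helper name `is_crawl_allowed` are hypothetical and serve only as illustration. It does not represent Common Crawl's own crawler code.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt for a publisher that opts out of one crawler
# (assumed token: "CCBot") while leaving the site open to all others.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

article_url = "https://example.com/news/story.html"  # hypothetical URL
print(is_crawl_allowed(SAMPLE_ROBOTS_TXT, "CCBot", article_url))     # False
print(is_crawl_allowed(SAMPLE_ROBOTS_TXT, "OtherBot", article_url))  # True
```

A robots.txt rule of this kind only affects future crawls by crawlers that honor it; it does not by itself remove pages from data sets that were already collected.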
Artificial Intelligence Training
In February 2024, the Mozilla Foundation published a study examining the use of Common Crawl’s data by artificial intelligence (AI) companies. The study criticized AI builders for using the group’s data uncritically, and it criticized both AI builders and Common Crawl for relying too heavily on copyrighted data and for not filtering the data to exclude problematic content and bias. The study also called on Common Crawl to be more transparent and inclusive in its governance and to better highlight the limitations and biases of its data. 4
In June 2024, Danish media outlets demanded that the group remove copies of their articles from past data sets and stop crawling their websites immediately. In addition to the Danish media outlets, the New York Times, BuzzFeed, the Washington Post, and the Canadian Broadcasting Corporation (CBC) have blocked Common Crawl’s access to their webpages. The media outlets allege that Common Crawl’s data helps AI builders violate their copyrights by using their webpages without permission. 1
Leadership
Common Crawl’s executive director is Rich Skrenta. Skrenta was the founder and CEO of Blekko, a web search engine; the Open Directory Project, a community-edited search platform; Topix, a news aggregator combined with a social forum; and Tobiko, a restaurant recommendation platform. 5
The group’s chairman is Gil Elbaz. Elbaz is the co-founder of Applied Semantics, which was the original developer of AdSense, and the co-founder of Factual, which merged with Foursquare. 6
Financials
According to Common Crawl’s 2022 tax return, the group had $451,447 in revenue, $170,140 in expenses, and $633,865 in assets. 7
The group received a $450,000 donation from the Elbaz Family Foundation in 2022. 7
References
1. Knibbs, Kate. “Publishers Target Common Crawl in Fight over AI Training Data.” Wired, June 13, 2024. https://www.wired.com/story/the-fight-against-ai-comes-to-a-foundational-data-set/.
2. “Common Crawl Enters a New Phase.” Common Crawl, November 7, 2011. https://commoncrawl.org/blog/common-crawl-enters-a-new-phase.
3. Leetaru, Kalev. “Common Crawl and Unlocking Web Archives for Research.” Forbes, September 28, 2017. https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlocking-web-archives-for-research/.
4. “Mozilla Report: How Common Crawl’s Data Infrastructure Shaped the Battle Royale over Generative AI.” Mozilla Foundation, February 6, 2024. https://foundation.mozilla.org/en/blog/Mozilla-Report-How-Common-Crawl-Data-Infrastructure-Shaped-the-Battle-Royale-over-Generative-AI/.
5. “Common Crawl – Team – Rich Skrenta.” Common Crawl. Accessed June 24, 2024. https://commoncrawl.org/team/rich-skrenta-director.
6. “Common Crawl – Team – Gil Elbaz.” Common Crawl. Accessed June 24, 2024. https://commoncrawl.org/team/gil-elbaz-chairman.
7. “Commoncrawl Foundation, Full Filing – Nonprofit Explorer.” ProPublica. Accessed June 24, 2024. https://projects.propublica.org/nonprofits/organizations/261635908/202322869349100317/full.