Common Crawl

Type of business: 501(c)(3) non-profit
Founded: 2007
Headquarters: San Francisco, California; Los Angeles, California, United States
Founder(s): Gil Elbaz
Key people: Peter Norvig, Rich Skrenta, Eva Ho
URL: commoncrawl.org

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.[1][2] Common Crawl's web archive consists of petabytes of data collected since 2008.[3] It generally completes a crawl every month.[4]

Common Crawl was founded by Gil Elbaz.[5] Advisors to the nonprofit include Peter Norvig and Joi Ito.[6] The organization's crawlers respect nofollow and robots.txt policies. Open-source code for processing Common Crawl's dataset is publicly available.
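As a rough illustration of what respecting nofollow and robots.txt policies means in practice, the following minimal sketch uses only the Python standard library. It is not Common Crawl's actual crawler code. The robots.txt content, the example URLs, and the helper names `LinkCollector` and `allowed_links` are hypothetical, and the user-agent string `CCBot` is assumed here as the name under which the crawler identifies itself.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/
"""


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping links marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            return
        href = attr.get("href")
        if href:
            self.links.append(href)


def allowed_links(html, robots_txt, agent="CCBot"):
    """Return links from html that are neither nofollow nor disallowed by robots_txt."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    collector = LinkCollector()
    collector.feed(html)
    return [url for url in collector.links if rules.can_fetch(agent, url)]


PAGE = (
    '<a href="https://example.com/a">ok</a>'
    '<a rel="nofollow" href="https://example.com/b">skipped: nofollow</a>'
    '<a href="https://example.com/private/c">skipped: robots.txt</a>'
)
followable = allowed_links(PAGE, ROBOTS_TXT)
```

In this sketch a link is followed only if it passes both checks. A real crawler would additionally fetch each site's own robots.txt and resolve relative URLs, which are omitted here.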

The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have used techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions.[7]
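The sentence-shuffling idea can be sketched as follows: documents are broken into sentences, and the sentences are then mixed across the whole corpus so that no original document can be read in order. This is a simplified illustration and not the procedure of the cited paper. The regex-based sentence splitter is a naive assumption, and the function names `split_sentences` and `shuffled_corpus` are hypothetical.

```python
import random
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def shuffled_corpus(documents, seed=0):
    """Pool the sentences of all documents and return them in shuffled order."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    rng.shuffle(sentences)
    return sentences


docs = ["One. Two!", "Three? Four."]
mixed = shuffled_corpus(docs, seed=1)
```

The output contains the same sentences as the input, so sentence-level statistics are preserved while document-level order is lost.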

As of March 2023, in the most recent version of the Common Crawl dataset, 46% of documents had English as their primary language (followed by German, Russian, Japanese, French, Spanish and Chinese, all below 6%).[8]

  1. Rosanna Xia (February 5, 2012). "Tech entrepreneur Gil Elbaz made it big in L.A." Los Angeles Times. Retrieved July 31, 2014.
  2. "Gil Elbaz and Common Crawl". NBC News. April 4, 2013. Retrieved July 31, 2014.
  3. "So you're ready to get started". Common Crawl. Retrieved June 9, 2023.
  4. Lisa Green (January 8, 2014). "Winter 2013 Crawl Data Now Available". Retrieved June 2, 2018.
  5. "Startups - Gil Elbaz and Nova Spivack of Common Crawl - TWiST #222". This Week In Startups. January 10, 2012.
  6. Tom Simonite (January 23, 2013). "A Free Database of the Entire Web May Spawn the Next Google". MIT Technology Review. Archived from the original on June 26, 2014. Retrieved July 31, 2014.
  7. Schäfer, Roland (May 2016). "CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws". Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia: European Language Resources Association (ELRA): 4501.
  8. "Statistics of Common Crawl Monthly Archives by commoncrawl". commoncrawl.github.io. Retrieved April 2, 2023.