Culturomics

Culturomics is a form of computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts.^[1]^[2] Researchers data-mine large digital archives to investigate cultural phenomena reflected in language and word usage.^[3] The term is an American neologism first described in a 2010 Science article, "Quantitative Analysis of Culture Using Millions of Digitized Books", co-authored by Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden.^[4]

Michel and Aiden helped create the Google Labs project Google Ngram Viewer, which uses n-grams to analyze the Google Books digital library for cultural patterns in language use over time.
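The idea of tracking an n-gram over time can be illustrated with a minimal sketch. It assumes a tiny invented corpus keyed by publication year and a simple whitespace tokenizer; the function names and the data are hypothetical, and this is not the Ngram Viewer's actual pipeline. For each year it reports the phrase's share of all n-grams of the same length, which is roughly the kind of normalization needed to compare years with different amounts of text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(corpus, phrase):
    """Map each year to the phrase's share of all n-grams of the same length.

    corpus: dict mapping year -> list of texts (hypothetical toy input).
    phrase: whitespace-separated phrase, e.g. "steam engine".
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            # Naive tokenization: lowercase and split on whitespace.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


# Invented example data, for illustration only.
toy_corpus = {
    1900: ["the steam engine changed the town"],
    1950: ["the jet engine changed the sky", "the steam engine was gone"],
}
freq = relative_frequency_by_year(toy_corpus, "steam engine")
```

In this toy corpus the bigram "steam engine" makes up one of five bigrams in 1900 and one of nine in 1950, so its relative frequency falls even though its raw count is unchanged.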

Because the Google Ngram data set is not an unbiased sample^[5] and does not include metadata,^[6] there are several pitfalls when using it to study language or the popularity of terms.^[7] Medical literature accounts for a large but shifting share of the corpus,^[8] and the corpus does not take into account how often a work is printed or read.
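The effect of a shifting corpus composition can be shown with a small worked example. All numbers below are invented for illustration, and the two-genre split is an assumption: the term's rate within each genre is held constant, yet its pooled frequency rises simply because one genre's share of the corpus grows.

```python
def pooled_frequency(genres):
    """Relative frequency of a term across genres pooled into one corpus.

    genres: list of (term_count, total_words) pairs, one per genre.
    """
    term = sum(count for count, _ in genres)
    total = sum(words for _, words in genres)
    return term / total


# Invented numbers: within each genre the term's rate never changes
# (medical: 50 per 10,000 words; fiction: 1 per 10,000 words).
early = [(50, 10_000), (9, 90_000)]   # medical is 10% of the corpus
late = [(250, 50_000), (5, 50_000)]   # medical is 50% of the corpus

early_freq = pooled_frequency(early)
late_freq = pooled_frequency(late)
```

Here the pooled frequency more than quadruples between the two periods although usage within each genre is unchanged, which is why a rising curve in such data cannot be read directly as rising popularity.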

[1] Cohen, Patricia (16 December 2010). "In 500 Billion Words, New Window on Culture". New York Times.

[2] Hayes, Brian (May–June 2011). "Bit Lit". American Scientist. 99 (3): 190. doi:10.1511/2011.90.190. Archived from the original on 2016-10-18. Retrieved 2011-09-09.

[3] Letcher, David W. (April 6, 2011). "Cultoromics: A New Way to See Temporal Changes in the Prevalence of Words and Phrases" (PDF). American Institute of Higher Education 6th International Conference Proceedings. 4 (1): 228. Archived from the original (PDF) on March 3, 2016. Retrieved September 9, 2011.

[4] Michel, Jean-Baptiste; Lieberman Aiden, Erez (16 December 2010). "Quantitative Analysis of Culture Using Millions of Digitized Books". Science. 331 (6014): 176–82. doi:10.1126/science.1199644. PMC 3279742. PMID 21163965.

[5] Pechenick, Eitan Adam; Danforth, Christopher M.; Dodds, Peter Sheridan (2015-10-07). "Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution". PLOS ONE. 10 (10): e0137041. arXiv:1501.00960. Bibcode:2015PLoSO..1037041P. doi:10.1371/journal.pone.0137041. ISSN 1932-6203. PMC 4596490. PMID 26445406.

[6] Koplenig, Alexander (April 2017). "The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII". Digital Scholarship in the Humanities. 32 (1): 169–188. doi:10.1093/llc/fqv037. ISSN 2055-7671.

[7] Zhang, Sarah. "The Pitfalls of Using Google Ngram to Study Language". WIRED. Retrieved 2017-05-24.

[8] Comparison of example terms
