The Pile (dataset)

The Pile is a diverse, open-source dataset of English text, 886.03 GB in size, created for training large language models (LLMs). EleutherAI constructed it in 2020 and released it publicly on December 31 of that year.[1][2] It comprises 22 smaller datasets, 14 of which were new.[1]

  1. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027 [cs.CL].
  2. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". EleutherAI Website. EleutherAI. 13 February 2020. Retrieved 4 June 2023.