XingyanLiu

RedPajama-Data-1T

forked from XetHub/RedPajama-Data-1T

Assembled from URLs hosted at https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

Zach Nation

c5e0b6766c 16 commits

Add README

README.md

RedPajama-Data-1T

Imported from https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T. Fully assembled in one xet repo with all files retrieved. Use with git xet, pyxet, or the xet CLI.

The README from togethercomputer/RedPajama-Data-1T is preserved below:

Dataset Summary

RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset.

Dataset	Token Count
Commoncrawl	878 Billion
C4	175 Billion
GitHub	59 Billion
Books	26 Billion
ArXiv	28 Billion
Wikipedia	24 Billion
StackExchange	20 Billion
Total	1.2 Trillion
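As a quick sanity check on the table above, the per-slice counts can be summed and compared with the stated total. The snippet below is only an illustrative sketch; the numbers are copied from the table and the variable names are made up for this example.

```python
# Token counts in billions, copied from the table above.
token_counts_billions = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}

total_billions = sum(token_counts_billions.values())
total_trillions = total_billions / 1000

# Share of the total contributed by each slice, largest first.
shares = {
    name: count / total_billions
    for name, count in sorted(token_counts_billions.items(), key=lambda kv: -kv[1])
}

print(f"Total: {total_billions} billion (~{total_trillions:.1f} trillion)")
for name, share in shares.items():
    print(f"{name:<14}{share:6.1%}")
```

The slices add up to 1,210 billion tokens, which rounds to the 1.2 trillion shown in the Total row.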

Languages

Primarily English, though the Wikipedia slice contains multiple languages.

Dataset Structure

The dataset structure is as follows:

{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
    "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
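A record with this shape could be read and checked roughly as follows. This is a minimal sketch that assumes each record is stored as one JSON object per line and carries all three fields; the raw files in a given slice may differ, and `parse_record` and the sample line are hypothetical names and data used only for illustration.

```python
import json

# Subset names as listed in the structure above.
SUBSETS = {
    "common_crawl", "c4", "github", "books",
    "arxiv", "wikipedia", "stackexchange",
}


def parse_record(line: str) -> dict:
    """Parse one JSON line and check that it has the documented fields."""
    record = json.loads(line)
    if not isinstance(record.get("text"), str):
        raise ValueError("record is missing a string 'text' field")
    if not isinstance(record.get("meta"), dict):
        raise ValueError("record is missing a 'meta' object")
    subset = record.get("red_pajama_subset")
    if subset not in SUBSETS:
        raise ValueError(f"unexpected red_pajama_subset: {subset!r}")
    return record


# Hypothetical sample line, not taken from the dataset.
sample = json.dumps({
    "text": "Example document body.",
    "meta": {"url": "https://example.com", "language": "en"},
    "red_pajama_subset": "c4",
})
record = parse_record(sample)
print(record["red_pajama_subset"], len(record["text"]))
```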

Dataset Creation

This dataset was created to follow the LLaMA paper as closely as possible to try to reproduce its recipe.

Source Data

Commoncrawl

We download five dumps from Commoncrawl and run them through the official cc_net pipeline. We then deduplicate at the paragraph level and filter out low-quality text using a linear classifier trained to classify paragraphs as either Wikipedia references or random Commoncrawl samples.
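The paragraph-level deduplication step can be pictured with the sketch below. It is not the cc_net code: it assumes paragraphs are separated by newlines, normalizes them by lowercasing and collapsing whitespace, and keeps a truncated SHA-1 digest of each paragraph seen so far. The function names are hypothetical.

```python
import hashlib


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(paragraph.lower().split())


def dedup_paragraphs(documents):
    """Drop every paragraph whose normalized form was already seen.

    `documents` is a list of strings; paragraphs are assumed to be
    newline-separated (an assumption of this sketch).
    """
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            key_text = normalize(para)
            if not key_text:
                continue  # skip empty lines
            # Truncated digest keeps the 'seen' set small.
            digest = hashlib.sha1(key_text.encode("utf-8")).digest()[:8]
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(para)
        result.append("\n".join(kept))
    return result


docs = ["Hello world\nUnique A", "hello   WORLD\nUnique B"]
print(dedup_paragraphs(docs))
```

In this example the second document loses its first paragraph because it matches one in the first document after normalization.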

C4

C4 is downloaded from Hugging Face. The only preprocessing step is converting the data into our own format.

GitHub

The raw GitHub data is downloaded from Google BigQuery. We deduplicate at the file level, filter out low-quality files, and keep only projects that are distributed under the MIT, BSD, or Apache license.
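The license filter and file-level deduplication described above might look like the following. This is a sketch under stated assumptions: the input dictionaries, their keys (`path`, `license`, `content`), and the SPDX-like license strings are all hypothetical, exact-content hashing stands in for whatever deduplication was actually used, and the quality filters are not shown.

```python
import hashlib

# License families named in the text; the identifier format is an assumption.
ALLOWED_LICENSES = {"mit", "bsd", "apache"}


def filter_files(files):
    """Keep permissively licensed files, dropping exact content duplicates.

    `files` is an iterable of dicts with hypothetical keys
    'path', 'license', and 'content'.
    """
    seen = set()
    kept = []
    for f in files:
        # "apache-2.0" -> "apache", "bsd-3-clause" -> "bsd", "MIT" -> "mit"
        license_family = f["license"].lower().split("-")[0]
        if license_family not in ALLOWED_LICENSES:
            continue
        digest = hashlib.sha256(f["content"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same bytes already kept from another file
        seen.add(digest)
        kept.append(f)
    return kept


files = [
    {"path": "a.py", "license": "MIT", "content": "print('x')"},
    {"path": "b.py", "license": "apache-2.0", "content": "print('x')"},
    {"path": "c.py", "license": "gpl-3.0", "content": "print('y')"},
    {"path": "d.py", "license": "bsd-3-clause", "content": "print('z')"},
]
print([f["path"] for f in filter_files(files)])
```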

Wikipedia

We use the Wikipedia dataset available on Hugging Face, which is based on the Wikipedia dump from 2023-03-20 and contains text in 20 different languages. The dataset comes in a preprocessed format, so hyperlinks, comments, and other formatting boilerplate have been removed.

Gutenberg and Books3

The PG19 subset of Project Gutenberg and the Books3 dataset are downloaded from Hugging Face. After downloading, we use simhash to remove near-duplicates.
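For readers unfamiliar with simhash, the idea is to reduce each document to a short fingerprint so that similar documents get fingerprints differing in only a few bits. The sketch below is a generic illustration, not the code used to build this dataset: the word 3-gram shingles, the MD5-based feature hash, and the distance threshold of 3 bits are all assumptions chosen for the example.

```python
import hashlib

BITS = 64  # fingerprint width


def simhash(text: str) -> int:
    """Compute a 64-bit simhash over word 3-gram shingles."""
    tokens = text.lower().split()
    # Short texts yield a single shingle made of whatever tokens exist.
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    weights = [0] * BITS
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a: str, text_b: str, max_distance: int = 3) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance


print(is_near_duplicate("It was the best of times", "it was  the best of times"))
```

A real pipeline would compare fingerprints across the whole corpus, usually by bucketing on fingerprint prefixes rather than checking every pair.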

ArXiv

ArXiv data is downloaded from the arxiv requester-pays bucket on Amazon S3. We keep only LaTeX source files and remove preambles, comments, macros, and bibliographies.
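The kind of cleanup described above can be approximated with a few string and regex operations. This is a rough sketch rather than the actual preprocessing code: `clean_latex` is a hypothetical helper, it only handles the `thebibliography` environment, and it removes macro definitions only insofar as they live in the preamble.

```python
import re


def clean_latex(source: str) -> str:
    """Strip the preamble, bibliography, and comments from LaTeX source."""
    # Preamble: keep only what follows \begin{document}, if present.
    marker = r"\begin{document}"
    idx = source.find(marker)
    if idx != -1:
        source = source[idx + len(marker):]
    # Bibliography: drop the thebibliography environment.
    source = re.sub(
        r"\\begin\{thebibliography\}.*?\\end\{thebibliography\}",
        "",
        source,
        flags=re.DOTALL,
    )
    # Drop \end{document} and anything after it.
    source = source.split(r"\end{document}")[0]
    # Comments: an unescaped % through the end of the line.
    # (A literal backslash pair before % is not handled by this sketch.)
    source = re.sub(r"(?<!\\)%.*", "", source)
    return source.strip()


example = (
    "\\documentclass{article}\n"
    "\\newcommand{\\x}{y}\n"
    "\\begin{document}\n"
    "Hello % a comment\n"
    "50\\% done\n"
    "\\begin{thebibliography}{9}\n"
    "\\bibitem{a} A.\n"
    "\\end{thebibliography}\n"
    "\\end{document}\n"
)
print(clean_latex(example))
```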

Stackexchange

The Stack Exchange split of the dataset is downloaded from the Internet Archive. Here we keep only the posts from the 28 largest sites, remove HTML tags, group the posts into question-answer pairs, and order answers by their score.
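The tag stripping, grouping, and ordering steps could be sketched as below. The post dictionaries and their keys (`id`, `type`, `parent_id`, `score`, `body`) are hypothetical and do not reflect the actual dump schema, the regex is a crude tag remover rather than a full HTML parser, and the selection of the 28 largest sites is not shown.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove anything that looks like an HTML tag."""
    return TAG_RE.sub("", text)


def build_qa_pairs(posts):
    """Group answers under their question, highest score first.

    `posts` is an iterable of dicts with hypothetical keys
    'id', 'type', 'parent_id', 'score', and 'body'.
    """
    questions = {}
    answers = defaultdict(list)
    for post in posts:
        if post["type"] == "question":
            questions[post["id"]] = post
        elif post["type"] == "answer":
            answers[post["parent_id"]].append(post)
    pairs = []
    for qid, question in questions.items():
        ranked = sorted(answers.get(qid, []), key=lambda a: a["score"], reverse=True)
        pairs.append({
            "question": strip_html(question["body"]),
            "answers": [strip_html(a["body"]) for a in ranked],
        })
    return pairs


posts = [
    {"id": 1, "type": "question", "parent_id": None, "score": 5, "body": "<p>How?</p>"},
    {"id": 2, "type": "answer", "parent_id": 1, "score": 1, "body": "<p>Low</p>"},
    {"id": 3, "type": "answer", "parent_id": 1, "score": 9, "body": "<b>High</b>"},
]
print(build_qa_pairs(posts))
```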

SHA256 Checksums

SHA256 checksums for the dataset files for each data source are available here:

https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/arxiv_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/book_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/c4_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/common_crawl_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/github_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/stackexchange_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/wikipedia_SHA256SUMS.txt
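Once a checksum list and the corresponding data files are on disk, verification can be done with the standard library. The sketch below assumes the lists use the conventional `sha256sum` layout (hex digest, whitespace, file name); the helper names and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sha256sums(text):
    """Parse '<hex digest>  <file name>' lines into {name: digest}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary mode with a leading '*' on the name.
            expected[name.lstrip("*")] = digest.lower()
    return expected


def verify(path, expected_digest):
    """Return True if the file's SHA256 matches the expected digest."""
    return sha256_of_file(path) == expected_digest.lower()


# Demo on a temporary file rather than real dataset files.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "wb") as fh:
        fh.write(b"hello")
    sums = parse_sha256sums(f"{hashlib.sha256(b'hello').hexdigest()}  demo.jsonl\n")
    print(verify(demo_path, sums["demo.jsonl"]))
```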

To cite RedPajama, please use:

@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

Common Crawl Foundation Terms of Use
C4 license
GitHub was limited to MIT, BSD, or Apache licenses only
Books: the_pile_books3 license and pg19 license
ArXiv Terms of Use
Wikipedia License
StackExchange license on the Internet Archive

File List			Total items: 14
Name	Last Commit	Size	Last Modified
arxiv	Third batch of data		1 year ago
book	Fourth batch of data		1 year ago
c4	Third batch of data		1 year ago
common_crawl	Fourth batch of data		1 year ago
github	Fourth batch of data		1 year ago
stackexchange	Fourth batch of data		1 year ago
wikipedia	Fourth batch of data		1 year ago
.gitattributes	Initial commit	79 B	1 year ago
Makefile	Second batch of data	110 KiB	1 year ago
README.md	Add README	4.5 KiB	1 year ago
SHA256SUMS.txt	Third batch of data	238 KiB	1 year ago
checksum_urls.txt	Fix up Makefile and check sha256sums	569 B	1 year ago
curl.config	Add curl config	337 KiB	1 year ago
urls.txt	Add URL list	207 KiB	1 year ago
