Assembled from URLs hosted at https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
RedPajama-Data-1T
Imported from https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T.
Fully assembled in one xet repo with all files retrieved.
Use with git xet, pyxet, or the xet CLI.
The README from togethercomputer/RedPajama-Data-1T is preserved below:
Dataset Summary
RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.
Dataset | Token Count |
---|---|
Commoncrawl | 878 Billion |
C4 | 175 Billion |
GitHub | 59 Billion |
Books | 26 Billion |
ArXiv | 28 Billion |
Wikipedia | 24 Billion |
StackExchange | 20 Billion |
Total | 1.2 Trillion |
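As a quick sanity check on the table above, the per-source counts can be summed and turned into shares. This is only arithmetic on the rounded figures listed in the table; the variable names are illustrative.

```python
# Token counts in billions, copied from the table above (rounded figures).
TOKENS_BILLIONS = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}

total = sum(TOKENS_BILLIONS.values())
print(f"total: {total} billion (about {total / 1000:.1f} trillion)")

# Share of each source in the total, largest first.
for name, count in sorted(TOKENS_BILLIONS.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14}{count / total:6.1%}")
```

The rows add up to 1,210 billion tokens, which matches the 1.2 trillion total after rounding.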
Languages
Primarily English, though the Wikipedia slice contains multiple languages.
Dataset Structure
The dataset structure is as follows:
{
"text": ...,
"meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
"red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
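As a rough illustration of the record layout above, the following sketch parses one record and checks its fields. It assumes records are stored as one JSON object per line; the sample record and the helper name `parse_record` are made up for illustration, and real records may carry different or additional `meta` keys.

```python
import json

# Hypothetical sample line; not taken from the dataset.
SAMPLE_LINE = json.dumps({
    "text": "Hello world.",
    "meta": {"url": "https://example.com", "language": "en"},
    "red_pajama_subset": "c4",
})

# Subset names as listed in the structure above.
KNOWN_SUBSETS = {
    "common_crawl", "c4", "github", "books",
    "arxiv", "wikipedia", "stackexchange",
}


def parse_record(line: str) -> dict:
    """Parse one JSON line and check the fields described above."""
    record = json.loads(line)
    if not isinstance(record.get("text"), str):
        raise ValueError("record has no string 'text' field")
    if record.get("red_pajama_subset") not in KNOWN_SUBSETS:
        raise ValueError("unexpected 'red_pajama_subset' value")
    # The keys inside 'meta' vary by source, so default to an empty dict.
    record.setdefault("meta", {})
    return record


record = parse_record(SAMPLE_LINE)
print(record["red_pajama_subset"], len(record["text"]))
```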
Dataset Creation
This dataset was created to follow the LLaMa paper as closely as possible to try to reproduce its recipe.
Source Data
Commoncrawl
We download five dumps from Commoncrawl and run them through the official cc_net pipeline.
We then deduplicate at the paragraph level and filter out low-quality text using a linear classifier trained to classify paragraphs as Wikipedia references or random Commoncrawl samples.
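To make the idea of paragraph-level deduplication concrete, here is a minimal sketch that hashes normalized paragraphs and keeps only the first occurrence across a set of documents. It is not the cc_net implementation: the normalization (lowercasing and collapsing whitespace), the choice of SHA-1, and the function name `dedupe_paragraphs` are assumptions for illustration. The quality classifier is not covered.

```python
import hashlib


def dedupe_paragraphs(documents):
    """Drop every paragraph whose normalized text was already seen earlier."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            # Assumed normalization: lowercase and collapse runs of whitespace.
            normalized = " ".join(para.lower().split())
            if not normalized:
                continue
            digest = hashlib.sha1(normalized.encode("utf-8")).digest()
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(para)
        result.append("\n".join(kept))
    return result


docs = [
    "Cookie notice.\nFirst article body.",
    "cookie  notice.\nSecond article body.",
]
print(dedupe_paragraphs(docs))
```

Storing digests rather than the paragraphs themselves keeps the memory cost per paragraph fixed, which matters when the set of seen paragraphs grows large.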
C4
C4 is downloaded from Huggingface. The only preprocessing step is to bring the data into our own format.
GitHub
The raw GitHub data is downloaded from Google BigQuery. We deduplicate at the file level, filter out low-quality files, and keep only projects that are distributed under the MIT, BSD, or Apache license.
Wikipedia
We use the Wikipedia dataset available on Huggingface, which is based on the Wikipedia dump from 2023-03-20 and contains text in 20 different languages. The dataset comes in a preprocessed format, so hyperlinks, comments and other formatting boilerplate have been removed.
Gutenberg and Books3
The PG19 subset of the Gutenberg Project and Books3 datasets are downloaded from Huggingface. After downloading, we use simhash to remove near duplicates.
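For readers unfamiliar with simhash, the sketch below shows the general technique: each token votes on every bit of a fixed-width fingerprint, and two texts count as near duplicates when their fingerprints differ in only a few bits. The tokenization (whitespace-separated unigrams), the 64-bit width, the use of MD5 as the token hash, and the `max_distance` threshold are all assumptions for illustration, not the settings used to build this dataset.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint over whitespace-separated tokens."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(counts):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    # max_distance is a hypothetical threshold; tune it on real data.
    return hamming_distance(simhash(a), simhash(b)) <= max_distance


print(is_near_duplicate("call me ishmael", "Call me Ishmael"))
```

Because the features here are unordered unigrams, texts that differ only in word order get the same fingerprint; using word n-grams (shingles) as features is a common way to make the fingerprint order-sensitive.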
ArXiv
ArXiv data is downloaded from Amazon S3 in the arxiv requester pays bucket. We only keep latex source files and remove preambles, comments, macros and bibliographies.
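The cleaning steps above could look roughly like the following sketch. It covers only comments, the preamble, and the bibliography; macro handling is left out. The rules used here (everything before `\begin{document}` counts as preamble, the bibliography starts at `\begin{thebibliography}` or `\bibliography{`) and the function name `clean_latex` are assumptions, not the dataset's actual preprocessing code.

```python
import re

# Hypothetical input, not taken from the dataset.
SAMPLE = r"""\documentclass{article}
% a comment
\begin{document}
Growth was 5\% this year. % trailing note
\begin{thebibliography}{9}
\bibitem{x} Some reference.
\end{thebibliography}
\end{document}"""


def clean_latex(source: str) -> str:
    """Strip comments, preamble, and bibliography from a LaTeX source string."""
    # Remove comments: an unescaped '%' up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", source)
    # Assumed preamble rule: drop everything up to \begin{document}.
    marker = r"\begin{document}"
    start = text.find(marker)
    if start != -1:
        text = text[start + len(marker):]
    # Assumed bibliography rule: cut at the first bibliography command.
    match = re.search(r"\\begin\{thebibliography\}|\\bibliography\{", text)
    if match:
        text = text[: match.start()]
    return text.replace(r"\end{document}", "").strip()


print(clean_latex(SAMPLE))
```

The comment pattern is deliberately simple: it treats any `%` directly preceded by a backslash as escaped, so a forced line break followed by a comment (`\\%`) would be left in place, and `%` inside verbatim environments would be stripped.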
Stackexchange
The Stack Exchange split of the dataset is downloaded from the Internet Archive. Here we only keep the posts from the 28 largest sites, remove HTML tags, group the posts into question-answer pairs, and order answers by their score.
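The grouping step can be sketched as follows. The post layout (`id`, `type`, `parent_id`, `score`, `body`) is a hypothetical simplification of the Stack Exchange dump, the regex-based tag stripping is a shortcut rather than a full HTML parser, and sorting from highest to lowest score is an assumption about the ordering.

```python
import html
import re
from collections import defaultdict


def strip_html(body: str) -> str:
    """Remove tags with a simple regex, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", body)).strip()


def build_qa_pairs(posts):
    """Group answers under their question, ordered by score (highest first)."""
    questions = []
    answers = defaultdict(list)
    for post in posts:
        if post["type"] == "question":
            questions.append(post)
        else:
            answers[post["parent_id"]].append(post)
    pairs = []
    for question in questions:
        ranked = sorted(answers[question["id"]],
                        key=lambda a: a["score"], reverse=True)
        pairs.append({
            "question": strip_html(question["body"]),
            "answers": [strip_html(a["body"]) for a in ranked],
        })
    return pairs


# Hypothetical posts, not taken from the dataset.
posts = [
    {"id": 1, "type": "question", "parent_id": None, "score": 3,
     "body": "<p>How do I exit <b>vim</b>?</p>"},
    {"id": 2, "type": "answer", "parent_id": 1, "score": 2,
     "body": "<p>Close the terminal.</p>"},
    {"id": 3, "type": "answer", "parent_id": 1, "score": 10,
     "body": "<p>Type <code>:q</code> &amp; press Enter.</p>"},
]
pairs = build_qa_pairs(posts)
print(pairs)
```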
SHA256 Checksums
SHA256 checksums for the dataset files for each data source are available here:
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/arxiv_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/book_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/c4_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/common_crawl_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/github_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/stackexchange_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/wikipedia_SHA256SUMS.txt
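Once a checksum file and the corresponding data files have been downloaded, they can be compared locally. The sketch below assumes the checksum files use the usual `sha256sum` layout (a hex digest, whitespace, then a file name relative to some root directory); that layout, the helper names, and the demo file are assumptions, so check one of the files above before relying on it.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse '<hex digest>  <file name>' lines into {name: digest}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            # sha256sum marks binary mode with a leading '*' on the name.
            expected[parts[1].lstrip("*").strip()] = parts[0].lower()
    return expected


def verify_files(root, sums_text):
    """Return {name: True/False}; missing files count as failures."""
    results = {}
    for name, digest in parse_sums(sums_text).items():
        path = Path(root) / name
        results[name] = path.is_file() and sha256_of_file(path) == digest
    return results


# Demo on a throwaway file so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    payload = b'{"text": "hello"}\n'
    (Path(tmp) / "sample.jsonl").write_bytes(payload)
    good = hashlib.sha256(payload).hexdigest()
    sums_text = f"{good}  sample.jsonl\n{'0' * 64}  missing.jsonl\n"
    results = verify_files(tmp, sums_text)
print(results)
```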
To cite RedPajama, please use:
@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}
License
Please refer to the licenses of the data subsets you use.
- Common Crawl Foundation Terms of Use
- C4 license
- GitHub was limited to MIT, BSD, or Apache licenses only
- Books: the_pile_books3 license and pg19 license
- ArXiv Terms of Use
- Wikipedia License
- StackExchange license on the Internet Archive
File List (total items: 14)
- arxiv
- book
- c4
- common_crawl
- github
- stackexchange
- wikipedia
- .gitattributes
- Makefile
- README.md
- SHA256SUMS.txt
- checksum_urls.txt
- curl.config
- urls.txt