XetHub

Imported from https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample

Zach Nation


RedPajama-Data-1T-Sample

Imported from https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample. Fully assembled in one xet repo with all files retrieved. Use with git xet, pyxet, or xet CLI.

The full dataset is available at https://xethub.com/XetHub/RedPajama-Data-1T

The README from togethercomputer/RedPajama-Data-1T-Sample is preserved below:

task_categories:
- text-generation
language:
- en
pretty_name: Red Pajama 1T Sample

Dataset Card for RedPajama-Data-1T-Sample

Dataset Summary

RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset. This Hugging Face repo contains a 1B-token sample of the RedPajama dataset. The full dataset is available for download and has the following token counts:

Dataset	Token Count
Commoncrawl	878 Billion
C4	175 Billion
GitHub	59 Billion
Books	26 Billion
ArXiv	28 Billion
Wikipedia	24 Billion
StackExchange	20 Billion
Total	1.2 Trillion
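As a quick sanity check on the table above, the per-source counts can be summed and compared with the rounded total. This is only an arithmetic sketch; the dictionary name and the derived shares are illustrative and not part of the dataset tooling.

```python
# Token counts in billions, copied from the table above.
token_counts_billions = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}

# Sum of all sources, in billions of tokens.
total_billions = sum(token_counts_billions.values())

# Fraction of the full dataset contributed by each source.
shares = {
    name: count / total_billions
    for name, count in token_counts_billions.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Total: {total_billions} billion tokens")
```

The sum is 1,210 billion tokens, which rounds to the 1.2 trillion stated in the table.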

A full set of scripts to recreate the dataset from scratch can be found at https://github.com/togethercomputer/RedPajama-Data.

Languages

Primarily English, though the Wikipedia slice contains multiple languages.

Dataset Structure

The dataset structure is as follows:

{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...}
}
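Since the sample files are JSON Lines (`.jsonl`), each line should parse to one record of the shape above. The following is a minimal sketch using only the standard library; the helper name `iter_records` and the in-memory sample lines are illustrative, and the exact `meta` keys are assumed to vary by source file.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-line sample in the shape shown above.
sample_lines = [
    '{"text": "Hello world.", "meta": {"source": "c4", "language": "en"}}',
    '{"text": "Second document.", "meta": {"source": "arxiv"}}',
]

records = list(iter_records(sample_lines))

# Use .get() for meta keys, since not every source provides every key.
sources = [record["meta"].get("source") for record in records]
print(sources)
```

To read one of the files in this repo instead, pass an open file handle, for example `iter_records(open("c4_sample.jsonl", encoding="utf-8"))`, so that records are streamed rather than loaded all at once.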

Dataset Creation

This dataset was created to follow the LLaMA paper as closely as possible, with the goal of reproducing its recipe.

Source Data

Commoncrawl

We download five dumps from Commoncrawl and run them through the official cc_net pipeline. We then deduplicate at the paragraph level and filter out low-quality text using a linear classifier trained to distinguish paragraphs from pages used as references in Wikipedia from paragraphs in random Commoncrawl samples.
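Paragraph-level deduplication can be illustrated with a small sketch. This is not the cc_net implementation; it assumes paragraphs are separated by newlines and treats two paragraphs as duplicates when they match after lowercasing and whitespace normalization. The function name `dedup_paragraphs` is hypothetical.

```python
import hashlib
from typing import List


def dedup_paragraphs(documents: List[str]) -> List[str]:
    """Drop paragraphs whose normalized form was already seen in any document."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            if not para.strip():
                continue  # skip empty lines
            # Assumed normalization: lowercase and collapse whitespace.
            normalized = " ".join(para.lower().split())
            key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if key in seen:
                continue  # duplicate paragraph, drop it
            seen.add(key)
            kept.append(para)
        result.append("\n".join(kept))
    return result


docs = ["Cookie notice\nUnique A", "cookie  notice\nUnique B"]
deduped = dedup_paragraphs(docs)
print(deduped)
```

Storing hashes rather than full paragraphs keeps memory use bounded per paragraph, which matters when the same boilerplate recurs across many pages.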

C4

C4 is downloaded from Hugging Face. The only preprocessing step is to convert the data into our own format.
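The format conversion might look like the sketch below. It assumes each C4 record is a flat dictionary with a `text` field plus metadata fields such as `url` and `timestamp`, and that the target is the `text`/`meta` layout described under Dataset Structure. The function name and the `"c4"` source label are assumptions for illustration.

```python
def to_redpajama_format(c4_record: dict) -> dict:
    """Move every field except 'text' under 'meta' and tag the source."""
    meta = {key: value for key, value in c4_record.items() if key != "text"}
    meta["source"] = "c4"  # assumed label, not confirmed by the README
    return {"text": c4_record["text"], "meta": meta}


# Hypothetical input record.
example = {
    "text": "Some web text.",
    "url": "https://example.com/page",
    "timestamp": "2019-04-25T12:57:54Z",
}

converted = to_redpajama_format(example)
print(converted)
```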

GitHub

The raw GitHub data is downloaded from Google BigQuery. We deduplicate at the file level, filter out low-quality files, and keep only projects that are distributed under the MIT, BSD, or Apache licenses.
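A minimal sketch of the license filter and file-level deduplication follows. It assumes each file is a dictionary with hypothetical `path`, `license`, and `content` fields and that license names have already been normalized to short labels; real license identifiers are more varied. The low-quality filter is not shown because the README does not describe its criteria.

```python
import hashlib
from typing import Dict, List

# Assumed short labels for the three permitted license families.
ALLOWED_LICENSES = {"mit", "bsd", "apache"}


def filter_files(files: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep permissively licensed files, dropping exact content duplicates."""
    seen_digests = set()
    kept = []
    for item in files:
        if item["license"].lower() not in ALLOWED_LICENSES:
            continue  # license not in the allowed set
        digest = hashlib.sha256(item["content"].encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue  # identical file content already kept
        seen_digests.add(digest)
        kept.append(item)
    return kept


files = [
    {"path": "a.py", "license": "MIT", "content": "print(1)\n"},
    {"path": "b.py", "license": "mit", "content": "print(1)\n"},
    {"path": "c.py", "license": "GPL-3.0", "content": "print(2)\n"},
    {"path": "d.py", "license": "Apache", "content": "print(3)\n"},
]

kept_paths = [item["path"] for item in filter_files(files)]
print(kept_paths)
```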

Wikipedia

We use the Wikipedia dataset available on Hugging Face, which is based on the Wikipedia dump from 2023-03-20 and contains text in 20 different languages. The dataset comes preprocessed, so hyperlinks, comments, and other formatting boilerplate have already been removed.

Gutenberg and Books3

The PG19 subset of the Gutenberg Project and the Books3 dataset are downloaded from Hugging Face. After downloading, we use simhash to remove near-duplicates.
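The idea behind simhash can be sketched in a few lines. This is an illustrative implementation, not the one used to build the dataset: it assumes whitespace tokenization, unweighted tokens, and a 64-bit fingerprint derived from MD5. Two texts are considered near-duplicates when the Hamming distance between their fingerprints falls under a threshold chosen by the user.

```python
import hashlib

FINGERPRINT_BITS = 64


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint over whitespace tokens."""
    weights = [0] * FINGERPRINT_BITS
    for token in text.lower().split():
        # Take the first 8 bytes of MD5 as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(FINGERPRINT_BITS):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


first = simhash("It was the best of times, it was the worst of times")
second = simhash("It was the best of times, it was the worst of days")
print(hamming_distance(first, second))
```

Texts that share most of their tokens tend to agree on most fingerprint bits, so a small Hamming distance signals a likely near-duplicate. The threshold is a tuning choice and is not specified in this README.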

ArXiv

ArXiv data is downloaded from the arXiv requester-pays bucket on Amazon S3. We keep only LaTeX source files and remove preambles, comments, macros, and bibliographies.
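The cleaning steps can be approximated with a few regular expressions. The sketch below is a simplification under stated assumptions: the preamble is everything before `\begin{document}`, comments start at an unescaped `%`, and the bibliography is a `thebibliography` environment. It does not expand or remove macros used in the body, and the function name `clean_latex` is hypothetical.

```python
import re


def clean_latex(source: str) -> str:
    """Strip the preamble, comments, and bibliography from LaTeX source."""
    # Drop everything up to and including \begin{document}, if present.
    marker = r"\begin{document}"
    index = source.find(marker)
    if index != -1:
        source = source[index + len(marker):]
    # Remove comments: a percent sign not preceded by a backslash.
    source = re.sub(r"(?<!\\)%.*", "", source)
    # Remove an inline bibliography environment.
    source = re.sub(
        r"\\begin\{thebibliography\}.*?\\end\{thebibliography\}",
        "",
        source,
        flags=re.DOTALL,
    )
    return source.strip()


sample = "\n".join([
    r"\documentclass{article}",
    r"\newcommand{\R}{\mathbb{R}}",
    r"\begin{document}",
    r"We are 50\% done. % a comment",
    r"\begin{thebibliography}{9}",
    r"\bibitem{a} Some reference.",
    r"\end{thebibliography}",
    r"\end{document}",
])

cleaned = clean_latex(sample)
print(cleaned)
```

Macro definitions usually live in the preamble, so removing the preamble removes most of them as a side effect.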

Stackexchange

The Stack Exchange split of the dataset is downloaded from the Internet Archive. We keep only the posts from the 28 largest sites, remove HTML tags, group the posts into question-answer pairs, and order answers by their score.
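The grouping step above can be sketched as follows. The post fields (`id`, `type`, `parent_id`, `score`, `body`) are hypothetical and do not mirror the actual dump schema, the tag removal is a simple regular expression rather than a full HTML parser, and answers are assumed to be sorted from highest to lowest score. Site selection is not shown.

```python
import re
from collections import defaultdict
from typing import Dict, List

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove anything that looks like an HTML tag."""
    return TAG_RE.sub("", text)


def build_qa_pairs(posts: List[Dict]) -> List[Dict]:
    """Group answers under their question, highest score first (assumed)."""
    answers_by_question = defaultdict(list)
    questions = []
    for post in posts:
        if post["type"] == "question":
            questions.append(post)
        else:
            answers_by_question[post["parent_id"]].append(post)
    pairs = []
    for question in questions:
        ranked = sorted(
            answers_by_question.get(question["id"], []),
            key=lambda answer: answer["score"],
            reverse=True,
        )
        pairs.append({
            "question": strip_html(question["body"]),
            "answers": [strip_html(answer["body"]) for answer in ranked],
        })
    return pairs


posts = [
    {"id": 1, "type": "question", "parent_id": None, "score": 5,
     "body": "<p>How do I sort a list?</p>"},
    {"id": 2, "type": "answer", "parent_id": 1, "score": 2,
     "body": "<p>Use a loop.</p>"},
    {"id": 3, "type": "answer", "parent_id": 1, "score": 10,
     "body": "<p>Use <code>sorted()</code>.</p>"},
]

pairs = build_qa_pairs(posts)
print(pairs)
```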

File List			Total items: 14
Name	Last Commit	Size	Last Modified
.gitattributes	Initial commit	79 B	4 months ago
README.md	Add a blurb to the README	3.9 KiB	4 months ago
RedPajama-Data-1T-Sample.py	Take a drop from HuggingFace	3.3 KiB	4 months ago
arxiv_sample.jsonl	Take a drop from HuggingFace	89 MiB	4 months ago
book_sample.jsonl	Take a drop from HuggingFace	105 MiB	4 months ago
c4_sample.jsonl	Take a drop from HuggingFace	826 MiB	4 months ago
cc_2019-30_sample.jsonl	Take a drop from HuggingFace	657 MiB	4 months ago
cc_2020-05_sample.jsonl	Take a drop from HuggingFace	797 MiB	4 months ago
cc_2021-04_sample.jsonl	Take a drop from HuggingFace	765 MiB	4 months ago
cc_2022-05_sample.jsonl	Take a drop from HuggingFace	703 MiB	4 months ago
cc_2023-06_sample.jsonl	Take a drop from HuggingFace	817 MiB	4 months ago
github_sample.jsonl	Take a drop from HuggingFace	212 MiB	4 months ago
stackexchange_sample.jsonl	Take a drop from HuggingFace	77 MiB	4 months ago
wikipedia_sample.jsonl	Take a drop from HuggingFace	113 MiB	4 months ago
