A clone of the Laion 400M open dataset, an uncurated dataset that enables testing model training at larger scale for the broad research community and other interested communities.

Laion400M dataset

This large repository of Parquet files is ideal for testing out XetHub at scale. Try mounting this repository for instant access.

Before beginning, make sure that git-xet is installed and set up.

Mount and explore

Get a quick glimpse at the shape of this large dataset with Xet mount and DuckDB.

  1. Mount the dataset with pre-fetch disabled, the recommended setting when working with file types that provide efficient random access to tools (e.g., Parquet, SQLite).
     git xet mount xet@xethub.com:XetHub/Laion400M.git --prefetch 0
  2. Install Python, along with DuckDB and Pandas, then enter the mounted directory and start Python.
     pip install duckdb pandas
     cd Laion400M
     python
  3. From your Python prompt, try the following queries:
> import duckdb

# Count the number of rows
> duckdb.query("select COUNT(*) from 'data/*.parquet'")

# Preview 10 rows (LIMIT returns whichever rows are read first, not a true random sample)
> duckdb.query("select * from 'data/*.parquet' LIMIT 10").df()

# See the distribution of licenses
> duckdb.query("select LICENSE, count() as COUNT from 'data/*.parquet' group by LICENSE order by COUNT desc").df()

# See the distribution of NSFW labels
> duckdb.query("select NSFW, count() as COUNT from 'data/*.parquet' group by NSFW order by COUNT desc").df()

# Find the images with the largest width
> maxwidth=duckdb.query("select MAX(width) from 'data/*.parquet'").fetchall()[0][0]
> duckdb.query("select * from 'data/*.parquet' where width == {} LIMIT 10".format(maxwidth)).df()
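
The distribution queries above share one shape, so they could be wrapped in a small helper. The following is a minimal sketch under stated assumptions: `distribution_sql`, `widest_rows_sql`, and `run` are hypothetical names (not part of DuckDB or XetHub), and the code assumes the same `data/*.parquet` layout and column names used in the queries above. It also folds the two-step max-width lookup into a single query with a subquery, which is a swapped-in variant rather than the README's own approach.

```python
import re

PARQUET_GLOB = "data/*.parquet"  # same relative path as the queries above

# Plain SQL identifier: letters, digits, underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check_column(column):
    """Reject anything that is not a plain column name before it reaches SQL."""
    if not _IDENTIFIER.fullmatch(column):
        raise ValueError(f"not a plain column name: {column!r}")
    return column


def distribution_sql(column, source=PARQUET_GLOB):
    """Build a query counting rows per distinct value of `column`."""
    col = _check_column(column)
    return (
        f"select {col}, count(*) as n from '{source}' "
        f"group by {col} order by n desc"
    )


def widest_rows_sql(limit=10, source=PARQUET_GLOB):
    """Build a query returning up to `limit` rows that have the maximum width."""
    return (
        f"select * from '{source}' "
        f"where width = (select max(width) from '{source}') "
        f"limit {int(limit)}"
    )


def run(sql):
    """Execute the query with DuckDB and return a pandas DataFrame."""
    import duckdb  # imported lazily so the SQL builders work without it
    return duckdb.query(sql).df()
```

From the Python prompt in the mounted directory, `run(distribution_sql("LICENSE"))` should produce the same kind of table as the license query above, with the count column named `n` instead of `COUNT`.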

Summary

Want to easily browse or use a big repository without needing to wait for it to download? Mount is the tool for you. You can work directly with any repository (read-only) from any local tool, whether you're using local notebooks, code, or your Finder window.
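
Because a mounted repository appears as an ordinary directory, standard-library file tools should work on it too. As a rough illustration, the sketch below lists the Parquet files under `data/` with their sizes. The names `parquet_inventory` and `human_size` are hypothetical helpers, and the code assumes it is pointed at the root of the mounted `Laion400M` directory.

```python
from pathlib import Path


def parquet_inventory(root="."):
    """Return (file name, size in bytes) for each Parquet file in root/data."""
    data_dir = Path(root) / "data"
    files = sorted(data_dir.glob("*.parquet"))
    return [(p.name, p.stat().st_size) for p in files]


def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

For example, `sum(size for _, size in parquet_inventory("."))` passed to `human_size` gives the total size of the Parquet files as reported by the file system.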

Need edit access? Clone the full repository with git xet clone, or use the --no-smudge option to only download specific files.
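
If you prefer to drive these commands from a script, the sketch below assembles the mount and clone invocations as argument lists for `subprocess`. It is illustrative only: `mount_command`, `clone_command`, and `run` are hypothetical helper names, the repository URL is the one from the mount step, and placing `--no-smudge` directly after `git xet clone` is an assumption, so check the git-xet help output before relying on it.

```python
import subprocess

REPO_URL = "xet@xethub.com:XetHub/Laion400M.git"  # URL from the mount step


def mount_command(url=REPO_URL, prefetch=0):
    """Argument list for a read-only mount with the given pre-fetch setting."""
    return ["git", "xet", "mount", url, "--prefetch", str(prefetch)]


def clone_command(url=REPO_URL, no_smudge=False):
    """Argument list for a clone, optionally with --no-smudge."""
    cmd = ["git", "xet", "clone"]
    if no_smudge:
        cmd.append("--no-smudge")  # assumed placement: before the URL
    cmd.append(url)
    return cmd


def run(cmd):
    """Run the command, raising CalledProcessError if it exits with an error."""
    subprocess.run(cmd, check=True)
```

Calling `run(clone_command(no_smudge=True))` would then start a clone that skips downloading file contents up front, assuming git-xet is installed and set up.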

For a walkthrough of a non-mount workflow, check out the guided tutorial step of the Quick Start.

