A clone of the Laion 400M open dataset, an uncurated dataset intended to let the broad research community and other interested communities test model training at a larger scale.
Laion400M dataset
This large repository of Parquet files is ideal for testing out XetHub at scale. Try mounting this repository for instant access.
Before beginning, make sure that git-xet is installed and set up.
Mount and explore
Get a quick glimpse at the shape of this large dataset with Xet mount and DuckDB.
- Mount the dataset with pre-fetch disabled, the recommended setting when working with file types that tools can read efficiently through random access (e.g., Parquet, SQLite).
git xet mount xet@xethub.com:XetHub/Laion400M.git --prefetch 0
- Install Python, along with DuckDB and Pandas, then enter the mounted directory and start Python.
pip install duckdb pandas
cd Laion400M
python
- From your Python prompt, try the following queries:
> import duckdb
# Count the number of rows
> duckdb.query("select COUNT(*) from 'data/*.parquet'")
# Preview 10 rows (LIMIT returns the first rows read, not a true random sample)
> duckdb.query("select * from 'data/*.parquet' LIMIT 10").df()
# See the distribution of licenses
> duckdb.query("select LICENSE, count() as COUNT from 'data/*.parquet' group by LICENSE order by COUNT desc").df()
# See the distribution of NSFW labels
> duckdb.query("select NSFW, count() as COUNT from 'data/*.parquet' group by NSFW order by COUNT desc").df()
# Find the images with the largest width
> maxwidth = duckdb.query("select MAX(width) from 'data/*.parquet'").fetchall()[0][0]
> duckdb.query("select * from 'data/*.parquet' where width = {} LIMIT 10".format(maxwidth)).df()
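The two distribution queries above differ only in the column name, so they can be generated by a small helper when run from a script instead of the prompt. The following is a minimal sketch, not part of the original README: the names `distribution_query`, `run`, and `PARQUET_GLOB` are hypothetical, and it assumes the script is started from the mounted Laion400M directory so that the relative glob resolves.

```python
import re

# Same relative glob as the queries above; assumes the current
# directory is the mounted repository.
PARQUET_GLOB = "data/*.parquet"


def distribution_query(column, glob=PARQUET_GLOB):
    """Build a SQL query that counts rows per distinct value of `column`."""
    # The column name is interpolated into the SQL text, so only
    # plain identifiers are accepted.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", column):
        raise ValueError(f"not a simple column name: {column!r}")
    return (
        f"select {column}, count(*) as n from '{glob}' "
        f"group by {column} order by n desc"
    )


def run(sql):
    """Execute `sql` with DuckDB and return the result as a Pandas DataFrame."""
    # Imported lazily so the query builder can be used without DuckDB installed.
    import duckdb

    return duckdb.query(sql).df()
```

For example, `run(distribution_query("NSFW"))` should produce the same counts as the NSFW query above, with the count column named `n`.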
Summary
Want to easily browse or use a big repository without needing to wait for it to download? Mount is the tool for you. You can work directly with any repository (read-only) from any local tool, whether you're using local notebooks, code, or your Finder window.
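As one illustration of using ordinary local tooling against a mount, the sketch below lists the Parquet files under data/ using only the Python standard library. It is a hedged example rather than anything from XetHub: the function name `list_parquet_files` is hypothetical, and it assumes the mounted directory exposes regular file metadata such as sizes.

```python
from pathlib import Path


def list_parquet_files(repo_dir):
    """Return (file name, size in bytes) for each Parquet file under data/."""
    data_dir = Path(repo_dir) / "data"
    # Sorted so the listing is stable across runs.
    return [
        (path.name, path.stat().st_size)
        for path in sorted(data_dir.glob("*.parquet"))
    ]
```

Calling `list_parquet_files("Laion400M")` from the directory that contains the mount would then give a quick inventory of the files the DuckDB queries read.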
Need edit access? Clone the full repository with git xet clone, or use the --no-smudge option to only download specific files.
For a walkthrough of a non-mount workflow, check out the guided tutorial step of the Quick Start.