URL and caption metadata for the LAION-400M dataset: 400M English (image, text) pairs built for research purposes, to enable testing model training at larger scale for the broad research community and other interested communities.
Access this dataset
Our file streaming features provide a simple way to explore large repositories in seconds from your machine, without waiting for file downloads.
If you haven't already, install and authenticate, then dive into this LAION-400M metadata dataset.
Mount
Mounting provides streaming read-only access to a repository as if it were a folder on your machine. For files that support efficient random access (e.g., Parquet, SQLite), we can mount with prefetch disabled for improved performance.
If you installed with PyXet:
xet mount --prefetch 0 xet://XetHub/LAION-400M/main LAION-400M
If you installed with Git-Xet:
git xet mount https://xethub.com/XetHub/LAION-400M.git --prefetch 0
Explore
One easy way to look at this collection of Parquet files (about 1.7GB each) is with DuckDB, an in-process OLAP database.
Create a virtual environment if you haven't already:
python -m venv .venv
source .venv/bin/activate
Install DuckDB, enter the mounted directory, and start a Python interpreter:
pip install duckdb
cd LAION-400M
ipython
From your Python prompt, import DuckDB:
import duckdb
Now check out how quickly the following queries run:
- Count the number of rows
duckdb.query("select COUNT(*) from 'data/*.parquet'")
- See 10 example rows (LIMIT returns the first rows scanned, not a random sample)
duckdb.query("select * from 'data/*.parquet' LIMIT 10").df()
- See the distribution of licenses
duckdb.query("select LICENSE, count() as COUNT from 'data/*.parquet' group by LICENSE order by COUNT desc").df()
- See the distribution of NSFW labels
duckdb.query("select NSFW, count() as COUNT from 'data/*.parquet' group by NSFW order by COUNT desc").df()
- Find the images with the largest width
maxwidth = duckdb.query("select MAX(width) from 'data/*.parquet'").fetchall()[0][0]
duckdb.query("select * from 'data/*.parquet' where width == {} LIMIT 10".format(maxwidth)).df()
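The two distribution queries above share one shape, so they can be generated from a column name. The following is a minimal sketch, not part of XetHub or DuckDB: `KNOWN_COLUMNS`, `distribution_sql`, and `run` are hypothetical names introduced here, the column list is copied from the schema described in this README, and `count(*)` is used as the standard SQL spelling of the row count.

```python
# Column names as listed in this README's schema description.
KNOWN_COLUMNS = {
    "SAMPLE_ID", "URL", "TEXT", "LICENSE",
    "NSFW", "SIMILARITY", "WIDTH", "HEIGHT",
}


def distribution_sql(column, parquet_glob="data/*.parquet"):
    """Build a group-by query that counts rows per distinct value of `column`.

    The column name is checked against KNOWN_COLUMNS (case-insensitively)
    because it is interpolated directly into the SQL text.
    """
    if column.upper() not in KNOWN_COLUMNS:
        raise ValueError(f"unknown column: {column!r}")
    return (
        f"select {column}, count(*) as COUNT "
        f"from '{parquet_glob}' "
        f"group by {column} order by COUNT desc"
    )


def run(sql):
    """Run a query with DuckDB and return the result as a pandas DataFrame."""
    # Imported lazily so the query builder can be used without DuckDB installed.
    import duckdb
    return duckdb.query(sql).df()


print(distribution_sql("NSFW"))
```

From inside the mounted directory, `run(distribution_sql("LICENSE"))` would then execute the license query. Validating the column name keeps arbitrary text out of the interpolated SQL string.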
Sure beats waiting for a 54GB download! When you're done, unmount the directory:
cd ..
umount LAION-400M
Mount can be used with local and remote environments, and can be especially useful for fast read access to big data on distributed training jobs.
Next steps
🎓 Read more about streaming access.
🛠️ Move on to the next step of the Quick Start to make your first changes in XetHub!
About this dataset
Obtained from https://laion.ai/laion-400-open-dataset/.
Concept
The LAION-400M dataset is a freely accessible dataset of 400M English (image, text) pairs. This URL and caption metadata dataset provides 32 parquet files with the image URLs, the associated texts and additional metadata in the following format:
SAMPLE_ID | URL | TEXT | LICENSE | NSFW | similarity | WIDTH | HEIGHT
where
- SAMPLE_ID: A unique identifier
- LICENSE: Where we found a Creative Commons license in the image data, we recorded it here, e.g. "creativecommons.org/licenses/by-nc-sa/3.0/". Otherwise the value is "?".
- NSFW: We used CLIP to estimate whether the image has NSFW content. The estimation is fairly conservative, reducing false negatives at the cost of more false positives. Possible values are "UNLIKELY", "UNSURE" and "NSFW".
- similarity: Value of the cosine similarity between the text and image embeddings.
- WIDTH and HEIGHT: image size as the image was embedded. We downsized originals that were larger than 4K to 4K.
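As an illustration of how these fields might be used together, here is a small sketch that models one metadata row and selects rows with a recorded Creative Commons license and an "UNLIKELY" NSFW label. The `Sample` class, its method names, and the Python field types are assumptions made for this example (the README does not specify column types), and the rows shown are made up.

```python
from dataclasses import dataclass

# NSFW labels as listed in this README.
NSFW_VALUES = ("UNLIKELY", "UNSURE", "NSFW")


@dataclass(frozen=True)
class Sample:
    """One metadata row; field types are assumptions, not taken from the files."""
    sample_id: int
    url: str
    text: str
    license: str
    nsfw: str
    similarity: float
    width: int
    height: int

    def has_cc_license(self) -> bool:
        # "?" marks rows where no Creative Commons license was found.
        return self.license != "?"

    def is_unlikely_nsfw(self) -> bool:
        if self.nsfw not in NSFW_VALUES:
            raise ValueError(f"unexpected NSFW label: {self.nsfw!r}")
        return self.nsfw == "UNLIKELY"


# Made-up rows for illustration only.
rows = [
    Sample(1, "https://example.com/a.jpg", "a cat", "?", "UNLIKELY", 0.35, 640, 480),
    Sample(2, "https://example.com/b.jpg", "a dog",
           "creativecommons.org/licenses/by-nc-sa/3.0/", "UNLIKELY", 0.41, 800, 600),
    Sample(3, "https://example.com/c.jpg", "a bird",
           "creativecommons.org/licenses/by-nc-sa/3.0/", "UNSURE", 0.33, 320, 240),
]

selected = [r for r in rows if r.has_cc_license() and r.is_unlikely_nsfw()]
print([r.sample_id for r in selected])
```

Because the NSFW estimate is conservative, filtering on "UNLIKELY" alone is the strictest choice; whether to also admit "UNSURE" rows depends on the use case.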
Check the official LAION site for the full description of this dataset.
All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP: the cosine similarity between the text and image embeddings was calculated, and pairs with a similarity below 0.3 were dropped.
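The filtering rule can be sketched in a few lines. This is an illustration of the cosine-similarity threshold only, not LAION's actual pipeline: the function names are hypothetical, and the short vectors stand in for real CLIP embeddings, which have many more dimensions.

```python
import math

SIMILARITY_THRESHOLD = 0.3  # cutoff stated in this README


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def keep_pair(image_embedding, text_embedding, threshold=SIMILARITY_THRESHOLD):
    """Keep an (image, text) pair unless its similarity is below the threshold."""
    return cosine_similarity(image_embedding, text_embedding) >= threshold


# Toy embeddings: nearly aligned vectors pass, orthogonal vectors are dropped.
print(keep_pair([3.0, 4.0], [4.0, 3.0]))
print(keep_pair([1.0, 0.0], [0.0, 1.0]))
```

The value computed this way for each surviving pair corresponds to what the `similarity` column stores.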
License: CC BY 4.0