
URL and caption metadata for the LAION-400M dataset: 400M English (image, text) pairs built for research purposes, so that researchers and other interested communities can test model training at a larger scale.

Access this dataset

Our file streaming features provide a simple way to explore large repositories in seconds from your machine, without waiting for file downloads.

If you haven't already, install and authenticate, then dive into this LAION-400M metadata dataset.

Mount

Mounting provides streaming read-only access to a repository as if it were a folder on your machine. For files that support efficient random access (e.g., Parquet, SQLite), we can mount with prefetch disabled for improved performance.

If you installed with PyXet:

xet mount --prefetch 0 xet://XetHub/LAION-400M/main LAION-400M

If you installed with Git-Xet:

git xet mount https://xethub.com/XetHub/LAION-400M.git --prefetch 0

Explore

One easy way to look at this collection of Parquet files (about 1.7GB each) is with DuckDB, an in-process OLAP database.

Create a virtual environment if you haven't already:

python -m venv .venv
source .venv/bin/activate

Install DuckDB, enter the mounted directory, and start a Python interpreter:

pip install duckdb
cd LAION-400M
ipython

From your Python prompt, import DuckDB:

import duckdb

Now check out how quickly the following queries run:

  • Count the number of rows

    duckdb.query("select COUNT(*) from 'data/*.parquet'")
    
  • Preview 10 rows (LIMIT returns the first rows read, not a random sample)

    duckdb.query("select * from 'data/*.parquet' LIMIT 10").df()
    
  • See the distribution of licenses

    duckdb.query("select LICENSE, count() as COUNT from 'data/*.parquet' group by LICENSE order by COUNT desc").df()
    
  • See the distribution of NSFW labels

    duckdb.query("select NSFW, count() as COUNT from 'data/*.parquet' group by NSFW order by COUNT desc").df()
    
  • Find the images with the largest width

    maxwidth=duckdb.query("select MAX(width) from 'data/*.parquet'").fetchall()[0][0]
    duckdb.query("select * from 'data/*.parquet' where width == {} LIMIT 10".format(maxwidth)).df()
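
The queries above inline their values with string formatting. As a minimal sketch, assuming the NSFW and similarity columns listed in the dataset description and the same data/*.parquet layout, the filter values can instead be bound as prepared-statement parameters. The helper names build_filter_query and preview_filtered are hypothetical and are not part of DuckDB or XetHub.

```python
def build_filter_query(min_similarity=0.3, nsfw_label="UNLIKELY", limit=10,
                       parquet_glob="data/*.parquet"):
    """Return (sql, params) for rows with a given NSFW label and a minimum similarity."""
    # The glob and the row limit are inlined into the SQL text; the two
    # filter values stay as ? placeholders and are bound at execution time.
    sql = (
        f"select URL, TEXT, NSFW, similarity from '{parquet_glob}' "
        "where NSFW = ? and similarity >= ? "
        f"limit {int(limit)}"
    )
    return sql, [nsfw_label, min_similarity]


def preview_filtered(**kwargs):
    """Run the filter query from the mounted directory and return a DataFrame."""
    # Imported lazily so build_filter_query can be used without DuckDB installed.
    import duckdb

    sql, params = build_filter_query(**kwargs)
    return duckdb.execute(sql, params).df()
```

Run from the mounted LAION-400M directory, preview_filtered(min_similarity=0.35) would return up to 10 matching rows. Like the LIMIT query above, it stops after the first matches it reads rather than sampling the whole dataset.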
    

Sure beats waiting for a 54GB download! When you're done, unmount the directory:

cd ..
umount LAION-400M

Mounting works in both local and remote environments, and is especially useful for fast read access to large datasets in distributed training jobs.

Next steps

🎓 Read more about streaming access.

🛠️ Move on to the next step of the Quick Start to make your first changes in XetHub!

About this dataset

Obtained from https://laion.ai/laion-400-open-dataset/.

Concept

The LAION-400M dataset is a freely accessible dataset of 400M English (image, text) pairs. This URL and caption metadata dataset provides 32 Parquet files with the image URLs, the associated texts, and additional metadata in the following format:

SAMPLE_ID | URL | TEXT | LICENSE | NSFW | similarity | WIDTH | HEIGHT

where

  • SAMPLE_ID: A unique identifier
  • LICENSE: If we found a Creative Commons license in the image data, it is given here, e.g. "creativecommons.org/licenses/by-nc-sa/3.0/"; otherwise the value is "?"
  • NSFW: We used CLIP to estimate whether the image has NSFW content. The estimate is fairly conservative, reducing false negatives at the cost of more false positives. Possible values are "UNLIKELY", "UNSURE" and "NSFW".
  • similarity: Value of the cosine similarity between the text and image embeddings
  • WIDTH and HEIGHT: image size as the image was embedded. We downsized originals that were larger than 4K to 4K.

Check the official LAION site for the full description of this dataset.

All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3.
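
To make the filtering rule concrete, here is a minimal sketch of the cosine-similarity check in plain Python. The toy vectors and the helper names cosine_similarity and keep_pair are illustrative assumptions; the real pipeline compares CLIP text and image embeddings, which this sketch does not compute.

```python
import math

SIMILARITY_THRESHOLD = 0.3  # cutoff stated in the dataset description


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def keep_pair(text_embedding, image_embedding, threshold=SIMILARITY_THRESHOLD):
    """Keep a pair unless its similarity falls below the threshold."""
    return cosine_similarity(text_embedding, image_embedding) >= threshold
```

For example, keep_pair([1.0, 0.0], [1.0, 1.0]) keeps the pair (similarity about 0.71), while keep_pair([1.0, 0.0], [0.1, 1.0]) drops it (similarity about 0.10).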

License: CC BY 4.0
