Read-only mount
Working with large files often means long downloads or custom I/O code. When all you want is to read what's in a file or repository, there's an easier way: mount the repository for file system access, using syntax that works the same locally and on remote servers.
Ideal use cases:
Using mount to browse locally
XetHub's mount feature gives you streaming read-only access to a repository as if it were a folder on your machine, regardless of its size. Use your favorite local tools to explore any repository at any commit, with no slow downloads.
To try this out, mount the 1.2GB CalTech Birds Image Similarity dataset.
-
From your terminal:
git xet mount xet://xethub.com:XetHub/CalTechBirds/main birdsdemo
The mount command for any repository can be copied from the "Access" button on any page.
-
Confirm a successful mount:
Mounting to "/birdsdemo"
Mounting as a background task...
Setting up mount point...
Mount at "/birdsdemo" successful. Unmount with 'umount "/birdsdemo"'
Mount complete in 4.713231s
-
Use your local file browser to find Crested_Auklet_0011_794927.jpg within the mounted birdsdemo folder. This bird sure is surprised and delighted!
-
Browse more pictures using your file browser, then unmount the repository when you're done:
umount "/birdsdemo"
You can just as easily eject the mounted folder from the file browser's UI.
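Because a mount behaves like an ordinary folder, standard library file APIs should work on it unchanged. The following is a minimal sketch, assuming the repository is mounted at a local path such as birdsdemo. The helper name `first_images` and the `.jpg` filter are illustrative choices, not part of XetHub's tooling.

```python
from itertools import islice
from pathlib import Path


def first_images(mount_dir, limit=5, suffix=".jpg"):
    """Return up to `limit` file paths under `mount_dir` ending in `suffix`.

    rglob is lazy, so islice stops the directory walk early instead of
    listing every file in the repository.
    """
    matches = (
        p for p in Path(mount_dir).rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )
    return list(islice(matches, limit))
```

Calling `first_images("birdsdemo")` would return the first few image paths found under the mount point. Opening one of them reads it like any other local file.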
Using mount for interactive reads
Browsing is fun, but sometimes you need to query a dataset to fully understand it. Let's explore the 54GB LAION-400M metadata dataset, which is full of Parquet files.
-
From your terminal, mount with prefetch disabled. Since parquet files support efficient random access, prefetch is optional and turning it off improves performance:
git xet mount https://xethub.com/XetHub/LAION-400M.git --prefetch 0
cd LAION-400M
-
Let's use DuckDB, an in-process OLAP database, to easily load each of the 1.7GB Parquet files in this repository. Create a Python environment, install DuckDB, and start IPython:
python -m venv .venv
source .venv/bin/activate
pip install duckdb pandas ipython
ipython
-
Import DuckDB:
import duckdb
Now check out how quickly the following queries run:
-
Count the number of rows
duckdb.query("select COUNT(*) from 'data/*.parquet'")
-
See a sample of 10 rows
duckdb.query("select * from 'data/*.parquet' LIMIT 10").df()
-
See the distribution of licenses
duckdb.query("select LICENSE, count() as COUNT from 'data/*.parquet' group by LICENSE order by COUNT desc").df()
-
See the distribution of NSFW labels
duckdb.query("select NSFW, count() as COUNT from 'data/*.parquet' group by NSFW order by COUNT desc").df()
-
Find the images with the largest width
maxwidth=duckdb.query("select MAX(width) from 'data/*.parquet'").fetchall()[0][0]
duckdb.query("select * from 'data/*.parquet' where width == {} LIMIT 10".format(maxwidth)).df()
-
That was faster than waiting for a 54GB download! When you're done, shut down IPython with exit() and unmount the folder:
cd ..
umount LAION-400M
Generally prefer working with files in Python? Check out our advanced Python access patterns!
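Why can prefetch be turned off for Parquet? A reader needs only the footer at the end of each file plus the column chunks a query touches, so it seeks rather than scanning. The sketch below illustrates that access pattern with the standard library alone. It reads the final 8 bytes of a file, which in the Parquet format hold a 4-byte little-endian footer length followed by the magic bytes PAR1. The function name is an illustrative assumption. DuckDB handles this internally, and how much data the mount actually fetches for such a read is not something this sketch measures.

```python
import os
import struct


def parquet_footer_length(path):
    """Return the footer metadata length recorded at the end of a Parquet file.

    Only the last 8 bytes are requested, so the read is a single seek to the
    file tail rather than a sequential pass over the whole file.
    """
    if os.path.getsize(path) < 8:
        raise ValueError(f"{path} is too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)  # jump straight to the file tail
        tail = f.read(8)
    if tail[4:] != b"PAR1":
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return footer_len
```

Pointing this at one of the files under data/ in the mounted repository should return quickly, which is the same property that lets query engines work well without prefetch.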
Using mount for fast data loads
When teams train at scale, the first thing their code does after setting up an environment is to load data to their machine. Fully downloading huge datasets is costly and slow, leading to idle cycles on high-demand machines.
Instead, mount your data to decrease idle compute time. As long as your code doesn't need the full dataset in memory at once and doesn't write back to the dataset, mounting shortens the time to first read and speeds up training and iteration, with no custom code to partition and fetch data. This is especially true for distributed training jobs, where each worker downloads the full dataset but typically needs only a chunk of it. Mount the full repository on every worker and automatically stream just the data each worker needs.
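As a rough illustration of the distributed case, each worker can list the same mounted folder and keep only its own share of the files. This is a minimal sketch, not a XetHub API: `files_for_worker`, `rank`, and `world_size` are hypothetical names, and in practice the rank values would come from your training framework.

```python
from pathlib import Path


def files_for_worker(mount_dir, rank, world_size, pattern="*"):
    """Return the files under `mount_dir` assigned to one worker."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sort so that every worker sees the same ordering before slicing.
    all_files = sorted(p for p in Path(mount_dir).rglob(pattern) if p.is_file())
    # Round-robin assignment: worker r takes files r, r + world_size, ...
    return all_files[rank::world_size]
```

Each worker then opens only the paths in its own slice, so only those files are read through the mount.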
Mounting on a Docker container
Use our xethub/xetfs Docker volume plugin to mount a repository on a Docker container.
-
Install the Docker volume plugin
docker plugin install xethub/xetfs
-
Create the volume, specifying the repository and commit/branch to be mounted. This example mounts the main branch of the XetHub/Flickr30k repository into a volume named flickr30k.
docker volume create --driver xethub/xetfs \
  -o repo=https://xethub.com/XetHub/Flickr30k.git \
  -o commit=main \
  flickr30k
You can use the -o options to further specify username or personal access token credentials. If you're using our XetData GitHub app, simply set the repo to the GitHub Git URL.
-
Once the volume is created, you can attach it to any container. This command mounts the flickr30k volume under the /app directory on an Ubuntu container and lists the contents:
docker run --rm -it -v flickr30k:/app ubuntu:latest ls -lR /app
-
Now that your volume is mounted, you can access files in the repository within seconds, without downloading the whole 4.2GB repository. To show the Flickr30k README, for instance:
cat /app/README.md
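If you create volumes for several repositories, it can help to assemble the command programmatically. The sketch below only builds the argument list for the `docker volume create` invocation shown above, using the same `--driver`, `-o repo=`, and `-o commit=` options. The function name `volume_create_command` is hypothetical, and the code does not run Docker unless you pass the result to `subprocess.run` yourself.

```python
import shlex


def volume_create_command(repo_url, commit, volume_name, driver="xethub/xetfs"):
    """Build the `docker volume create` argument list for a repository mount."""
    return [
        "docker", "volume", "create",
        "--driver", driver,
        "-o", f"repo={repo_url}",
        "-o", f"commit={commit}",
        volume_name,
    ]


cmd = volume_create_command(
    "https://xethub.com/XetHub/Flickr30k.git", "main", "flickr30k"
)
# Print the command for review; to run it, use subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Passing a list rather than a shell string keeps repository URLs and volume names from being reinterpreted by the shell.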
Mounting on Kubernetes
Use our k8s-csi-xetfs Kubernetes container storage interface (CSI) plugin to mount a repository. This plugin supports single pod read-only access for CSI ephemeral storage. Reference the k8s-csi-xetfs repository for more details and an example.
-
Install the CSI plugin
- Remote install
curl -skSL https://raw.githubusercontent.com/xetdata/k8s-csi-xetfs/master/deploy/install-driver.sh | bash -s main --
- Local install
git clone git@github.com:xetdata/k8s-csi-xetfs.git
cd k8s-csi-xetfs
./deploy/install-driver.sh main local
-
Check pod status to confirm that it's running:
kubectl -n kube-system get pod -o wide -l app=csi-blob-node
The output should look something like:
NAME                 READY   STATUS    RESTARTS   AGE     IP             NODE
xet-csi-node-zvfj9   3/3     Running   0          5m59s   192.168.49.2   minikube
-
Update the pod spec YAML. Specify the volume mount name and path in the containers section, then the repository and commit/branch in the volumes section.
This example apps.yaml mounts the main branch of the XetHub/Flickr30k repository into a volume named xet-flickr-30 on path /data:
apiVersion: v1
kind: Pod
metadata:
  name: app1
spec:
  containers:
    - name: app1
      image: counter-app:latest
      imagePullPolicy: Never
      volumeMounts:
        - name: xet-flickr-30
          mountPath: /data
  volumes:
    - name: xet-flickr-30
      csi:
        driver: csi.xethub.xetdata.com
        readOnly: true
        volumeAttributes:
          repo: https://xethub.com/XetHub/Flickr30k.git
          commit: main
-
Apply the changes to the pod spec:
kubectl apply -f apps.yaml
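When many pods need the same kind of mount, the manifest can be generated instead of hand-edited. The following sketch builds the same structure as the example pod spec and prints it as JSON, which kubectl accepts alongside YAML. The function name `xet_pod_spec` and its parameters are illustrative. The driver name and volume attributes are copied from the example above, not taken from a separate API reference.

```python
import json


def xet_pod_spec(pod_name, image, volume_name, mount_path, repo_url, commit):
    """Return a Pod manifest dict that mounts a repository read-only via CSI."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [{
                "name": pod_name,
                "image": image,
                "imagePullPolicy": "Never",
                # The volumeMount name must match the volume name below.
                "volumeMounts": [{"name": volume_name, "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": volume_name,
                "csi": {
                    "driver": "csi.xethub.xetdata.com",
                    "readOnly": True,
                    "volumeAttributes": {"repo": repo_url, "commit": commit},
                },
            }],
        },
    }


manifest = xet_pod_spec(
    "app1", "counter-app:latest", "xet-flickr-30", "/data",
    "https://xethub.com/XetHub/Flickr30k.git", "main",
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file such as apps.json and applying it with kubectl apply -f should be equivalent to applying the YAML version.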
Uninstalling the driver on a Kubernetes cluster
- Remote uninstall
curl -skSL https://raw.githubusercontent.com/xetdata/k8s-csi-xetfs/master/deploy/uninstall-driver.sh | bash -s main --
- Local uninstall
git clone git@github.com:xetdata/k8s-csi-xetfs.git
cd k8s-csi-xetfs
./deploy/uninstall-driver.sh main local