Read-only mount

Working with large files often means long download times or custom code to handle I/O. When all you want to do is read what's in a file or repository, there's an easier way: mount the repository for file system access, using syntax that works the same both locally and on remote servers.

Ideal use cases:

  • Browsing a repository locally
  • Interactive reads and queries
  • Fast data loads for training

Using mount to browse locally

XetHub's mount feature gives you streaming read-only access to a repository as if it were a folder on your machine, regardless of its size. Use your favorite local tools to explore any repository at any commit, with no slow downloads needed.

To try this out, mount the 1.2GB CalTech Birds Image Similarity dataset.

  1. From your terminal:

    git xet mount xet://XetHub/CalTechBirds/main birdsdemo

    The mount command for any repository can be copied from the "Access" button on any page.

  2. Confirm a successful mount:

    Mounting to "/birdsdemo"
    Mounting as a background task...
    Setting up mount point...
    Mount at "/birdsdemo" successful. Unmount with 'umount "/birdsdemo"'
    Mount complete in 4.713231s

  3. Use your local file browser to find Crested_Auklet_0011_794927.jpg within the mounted birdsdemo folder. This bird sure looks surprised and delighted!

  4. Browse more pictures using your file browser, then unmount the repository when you're done:

    umount "/birdsdemo"

    You can just as easily eject the mounted folder from the file browser's UI.
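If you would rather explore the mounted folder from a script than from a file browser, a minimal Python sketch might look like the following. The helper names `list_images` and `find_by_name` are hypothetical, and the sketch assumes the repository is mounted at a local path such as birdsdemo; it relies only on ordinary file system calls, which is all a mount needs.

```python
from pathlib import Path

# File extensions treated as images (an assumption for this sketch).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(root):
    """Return all image files found anywhere under root, sorted by path."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def find_by_name(root, filename):
    """Return every path under root whose file name matches filename exactly."""
    return sorted(Path(root).rglob(filename))
```

For example, `find_by_name("birdsdemo", "Crested_Auklet_0011_794927.jpg")` should locate the picture from step 3 while the mount is active.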

Using mount for interactive reads

Browsing is fun, but sometimes you need to query a dataset to fully understand it. Let's explore the 54GB LAION-400M metadata dataset, which is made up of Parquet files.

  1. From your terminal, mount with prefetch disabled. Since parquet files support efficient random access, prefetch is optional and turning it off improves performance:

    git xet mount https://xethub.com/XetHub/LAION-400M.git --prefetch 0
    cd LAION-400M

  2. Let's use DuckDB, an in-process OLAP database, to easily load each of the 1.7GB parquet files in this repository. Create a Python environment, install DuckDB, and start IPython:

    python -m venv .venv
    source .venv/bin/activate
    pip install duckdb
    ipython
  3. Import DuckDB and try out these queries:

    import duckdb

    Now check out how quickly the following queries run:

    • Count the number of rows

      duckdb.query("select COUNT(*) from 'data/*.parquet'")
    • Preview 10 rows

      duckdb.query("select * from 'data/*.parquet' LIMIT 10").df()
    • See the distribution of licenses

      duckdb.query("select LICENSE, count() as COUNT from 'data/*.parquet' group by LICENSE order by COUNT desc").df()

    • See the distribution of NSFW labels

      duckdb.query("select NSFW, count() as COUNT from 'data/*.parquet' group by NSFW order by COUNT desc").df()
    • Find the images with the largest width

      maxwidth=duckdb.query("select MAX(width) from 'data/*.parquet'").fetchall()[0][0]
      duckdb.query("select * from 'data/*.parquet' where width == {} LIMIT 10".format(maxwidth)).df()
  4. That was faster than waiting for a 54GB download! When you're done, shut down IPython with exit() and unmount the folder:

    cd ..
    umount LAION-400M

Note: Generally prefer working with files in Python? Check out our advanced Python access patterns!
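To put numbers on "how quickly the following queries run", a small timing helper can wrap any call. This is only a sketch using the standard library; `timed` is a hypothetical name, not part of DuckDB or XetHub.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With DuckDB, wrap a call that actually materializes the rows, for example `rows, secs = timed(lambda: duckdb.query("select COUNT(*) from 'data/*.parquet'").fetchall())`. This assumes that `duckdb.query` alone may return a lazily evaluated relation, so timing it without a fetch could understate the real cost.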

Using mount for fast data loads

When teams train at scale, the first thing their code does after setting up an environment is to load data to their machine. Fully downloading huge datasets is costly and slow, leading to idle cycles on high-demand machines.

Instead, mount your data to your machine to decrease idle compute time. As long as your code doesn't need the full dataset in memory at once and doesn't need to write back to the dataset, mounting shortens the time to first read, speeds up training iteration, and removes the need to write custom code to partition and fetch data. This is especially true for distributed training jobs, where each worker downloads the full dataset but typically needs only a chunk of it to do its work. Mount the full repository on each worker and automatically stream just the data that worker needs.
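As an illustration of the distributed case, each worker can pick its own slice of the file list from the mounted folder and read only those files. The sketch below is one simple way to do that; `shard_files`, `rank`, and `world_size` are hypothetical names, and real training frameworks usually provide their own sharding utilities.

```python
def shard_files(files, rank, world_size):
    """Return the subset of files that worker `rank` (0-based) should read.

    Files are sorted first so every worker sees the same order, then dealt
    out round-robin: worker 0 gets items 0, world_size, 2*world_size, ...
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return sorted(files)[rank::world_size]
```

Because the mount streams data on read, a worker that opens only its own shard should pull only that part of the repository over the network.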

Mounting on a Docker container

Use our xethub/xetfs Docker volume plugin to mount a repository on a Docker container.

  1. Install the Docker volume plugin

    docker plugin install xethub/xetfs
  2. Create the volume, specifying the repository and commit/branch to be mounted. This example mounts the main branch of the XetHub/Flickr30k repository into a volume named flickr30k.

    docker volume create --driver xethub/xetfs \
    -o repo=https://xethub.com/XetHub/Flickr30k.git \
    -o commit=main \
    flickr30k

    You can use the -o options to further specify username or personal access token credentials. If you're using our XetData GitHub app, simply set the repo to the GitHub Git URL.

  3. Once the volume is created, you can attach it to any container. This command mounts the flickr30k volume under the /app directory on an Ubuntu container and lists the contents:

    docker run --rm -it -v flickr30k:/app ubuntu:latest ls -lR /app

  4. With the volume attached, a container can access files in the repository within seconds, without downloading the whole 4.2GB repository. To show the Flickr30k README from a shell inside such a container, for instance:

    cat /app/README.md
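If you create volumes for several repositories, it can help to generate the command from a script. The following sketch only assembles the same `docker volume create` arguments shown in step 2; `volume_create_args` is a hypothetical helper, and it deliberately leaves out the credential options because their exact names are not given here.

```python
import shlex


def volume_create_args(volume, repo, commit="main", driver="xethub/xetfs"):
    """Build the argument list for the `docker volume create` call in step 2."""
    return [
        "docker", "volume", "create",
        "--driver", driver,
        "-o", f"repo={repo}",
        "-o", f"commit={commit}",
        volume,
    ]


def as_shell_command(args):
    """Render an argument list as a single, safely quoted shell command."""
    return shlex.join(args)
```

The list can be passed to `subprocess.run(args, check=True)` on a machine with Docker and the plugin installed, or printed with `as_shell_command` to review it first.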

Mounting on Kubernetes

Use our k8s-csi-xetfs Kubernetes container storage interface (CSI) plugin to mount a repository. This plugin supports single-pod, read-only access for CSI ephemeral storage. See the k8s-csi-xetfs repository for more details and an example.

  1. Install the CSI plugin

    curl -skSL https://raw.githubusercontent.com/xetdata/k8s-csi-xetfs/master/deploy/install-driver.sh | bash -s main --
  2. Check pod status to confirm that it's running:

    kubectl -n kube-system get pod -o wide -l app=csi-blob-node

    The output should look something like:

    NAME                 READY   STATUS    RESTARTS   AGE     IP             NODE
    xet-csi-node-zvfj9   3/3     Running   0          5m59s   192.168.49.2   minikube
  3. Update the pod spec YAML. Specify the volume mount name and path in the containers section, then the repository and commit/branch in the volumes section.

    This example apps.yaml mounts the main branch of the XetHub/Flickr30k repository as a volume named xet-flickr-30 at path /data:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app1
    spec:
      containers:
        - name: app1
          image: counter-app:latest
          imagePullPolicy: Never
          volumeMounts:
            - name: xet-flickr-30
              mountPath: /data
      volumes:
        - name: xet-flickr-30
          csi:
            driver: csi.xethub.xetdata.com
            readOnly: true
            volumeAttributes:
              repo: https://xethub.com/XetHub/Flickr30k.git
              commit: main
  4. Apply the changes to the pod spec:

    kubectl apply -f apps.yaml
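When the same pod spec is needed for different repositories or branches, the manifest can be generated instead of edited by hand. This sketch mirrors the fields of the example spec above; `pod_manifest` and its parameters are hypothetical names. It emits JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def pod_manifest(name, image, volume, repo, commit="main", mount_path="/data"):
    """Build a pod spec that mounts a repository read-only via the CSI driver."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "imagePullPolicy": "Never",
                # The volumeMounts name must match the volume name below.
                "volumeMounts": [{"name": volume, "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": volume,
                "csi": {
                    "driver": "csi.xethub.xetdata.com",
                    "readOnly": True,
                    "volumeAttributes": {"repo": repo, "commit": commit},
                },
            }],
        },
    }


def to_json(manifest):
    """Serialize the manifest so it can be saved to a file for kubectl."""
    return json.dumps(manifest, indent=2)
```

Writing `to_json(pod_manifest("app1", "counter-app:latest", "xet-flickr-30", "https://xethub.com/XetHub/Flickr30k.git"))` to a file such as apps.json should produce a spec equivalent to the example above.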

Uninstalling the driver on a Kubernetes cluster

    curl -skSL https://raw.githubusercontent.com/xetdata/k8s-csi-xetfs/master/deploy/uninstall-driver.sh | bash -s main --