Warning: This blog post references deprecated XetHub links and functionality. Please use as a reference only and follow our current work on Hugging Face.

December 18, 2023

Train ML Models Faster in Kubernetes by Mounting Training Datasets

Assaf Vayner

Srini Kadamati

Waiting on Data

Training large models can take a long time, and expensive resources like GPUs can easily spend most of their time just waiting for data to load. We’re willing to bet that “my models are still training” is the new “my code’s compiling”, only more costly.

While building up the ML data and compute infrastructure at Apple, several of our team members witnessed this problem at scale. Internal teams would download the same huge datasets over and over again in each node of their distributed training jobs, sometimes spending more time on the download than actual training. To improve training speed and GPU utilization, we tried a variety of approaches like updating data loading logic and chunking datasets. These helped, but seemed heavy-handed. A more natural way would be to simply provide file system access to the containers themselves.

Our Solution: Mount

What if you could just mount your large datasets and stream whatever you need to access just-in-time as your code references specific chunks of data? We’ve built two plugins to support the most popular container types around:

Docker Plugin

It’s usually difficult to mount a file system inside a Docker container because mounting requires the elevated CAP_SYS_ADMIN capability. To get around this, we created a Docker volume plugin that connects to XetHub, enabling read-only mounts of XetHub repos from inside your Docker container without any extra permissions.

Kubernetes CSI Plugin

Naturally, once we released our Docker plugin, we were asked about supporting mount on Kubernetes! Kubernetes is being used more and more frequently for efficiently orchestrating ML workflows.

We built a simple node CSI plugin to mount your XetHub repo and fetch files on the fly so your primary applications can use the data without up-front downloads. The plugin sets up a read-only ephemeral volume that uses our git-xet mount process to access your repository.

Our plugin is entirely open source and you can find it here on GitHub.

Getting Started

Our installation process is very simple and currently uses kubectl rather than Helm charts (open an issue if you want Helm charts!). You can find up-to-date documentation here.

The simplest way to install our plugin is to download and run our install script, which you can do with the following one-liner. Alternatively, follow the local install steps documented in our README to run the install script yourself:

curl -skSL https://raw.githubusercontent.com/xetdata/k8s-csi-xetfs/main/deploy/install-driver.sh | bash -s main --

Once you have the plugin installed, you can create volumes and use them from within your pods! To set up a volume, add a volumes section to your pod's configuration file:

# apps.yaml (https://github.com/xetdata/k8s-csi-xetfs/blob/main/example/apps.yaml#L14-L21)
volumes:
  - name: xet-flickr-30
    csi:
      driver: csi.xethub.xetdata.com
      readOnly: true
      volumeAttributes:
        repo: https://xethub.com/XetHub/Flickr30k.git
        commit: main # this can be a branch name or a commit hash

Then, in your pod's containers section, add a volume mount referencing the volume name created above: xet-flickr-30.

volumeMounts:
  - name: xet-flickr-30
    mountPath

Once you apply these changes, your container will have access to your XetHub repo under the mount path.

kubectl apply -f
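Putting the two fragments together, a complete Pod manifest might look like the sketch below. Only the volume definition comes from our example; the Pod name, container name, image, command, and the mount path /data/flickr30k are placeholders made up for illustration, so swap in your own values.

```yaml
# pod.yaml: a hypothetical Pod that mounts the Flickr30k repo read-only.
# The Pod name, container name, image, command, and mountPath are
# illustrative assumptions, not values taken from the plugin's docs.
apiVersion: v1
kind: Pod
metadata:
  name: xet-mount-example              # placeholder Pod name
spec:
  restartPolicy: Never
  containers:
    - name: trainer                    # placeholder container name
      image: busybox:1.36              # stand-in for your training image
      # List the top level of the mounted repo, then exit.
      command: ["sh", "-c", "ls /data/flickr30k"]
      volumeMounts:
        - name: xet-flickr-30          # must match the volume name below
          mountPath: /data/flickr30k   # assumed path; choose your own
          readOnly: true
  volumes:
    - name: xet-flickr-30
      csi:
        driver: csi.xethub.xetdata.com
        readOnly: true
        volumeAttributes:
          repo: https://xethub.com/XetHub/Flickr30k.git
          commit: main                 # a branch name or a commit hash
```

Assuming the file is saved as pod.yaml, kubectl apply -f pod.yaml creates the Pod, and kubectl logs xet-mount-example should print the directory listing once the container has run. The volume name is the only link between the two sections, so a typo there is the first thing to check if the Pod fails to start.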

To set up a volume with a private repository you will need to create a secret in Kubernetes. Please follow the documentation in our README for how to do this.

Why Rust? 🦀

What can we say — we just love Rust! We wrote an NFS server implementation (nfsserve) and our Docker plugin (docker-volume-xetfs) both in Rust.

Most Kubernetes CSI drivers out there in the ether are written in Golang, as are most of the backing components of Kubernetes. The CSI spec is in essence a well laid-out gRPC spec, so we decided to lean into our love for and expertise in writing Rust.

Contributing

See room for improvement? Please contribute to help us improve! You're also welcome to join our Slack community, where you can interact with our team.
