Warning: This blog post references deprecated XetHub links and functionality. Please use as a reference only and follow our current work on Hugging Face.
November 9, 2023
Version Your S3 Datasets & Models with Git
Your models deserve better than S3
An important piece of ML development is the ability to look back at past versions of models and the datasets used to create them. Versioning is about change management, and people naturally think about changes in the context of their own work. While object storage services like AWS S3 have become the default platform for storing model and dataset artifacts, their versioning capability tracks changes one object at a time. Turning on S3 versioning can also be prohibitively expensive: every change to an object is stored as a complete new copy, even if only a single line has been modified.
For ML, however, knowing the set of files related to a particular version, and what has changed between versions, is very important. Users want context and documentation for the version and the changes contained within — for both the artifacts and code used to generate them — in order to understand why a model’s behavior has changed.
XetHub as a versioned object store
XetHub provides versioning in a way that alleviates many of the issues described earlier. Xet repositories are backed by Git and scaled to support over 10TB per repo to match the needs of modern ML workflows. Like Git, XetHub uses commits to track versions across all files in a repository, allowing users to view snapshots of their data at any particular iteration. Users can even use XetHub to track code and artifacts together, removing the need for extra tooling to coordinate the versions of code and data across multiple systems.
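To make the idea of commit-level snapshots concrete, here is a minimal Python sketch. It is a toy model, not XetHub's or Git's implementation: the `Commit` class, the `commit` and `diff` helpers, and the file names are all hypothetical. It only shows how one commit can describe every file at once, and how two commits can be compared to see what changed.

```python
import hashlib
from dataclasses import dataclass


def digest(data: bytes) -> str:
    """Content hash used to tell whether a file changed."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Commit:
    message: str
    manifest: dict  # path -> content hash for every file in the snapshot


def commit(files: dict, message: str) -> Commit:
    """Record one snapshot covering all files at once."""
    return Commit(message, {path: digest(data) for path, data in files.items()})


def diff(old: Commit, new: Commit) -> dict:
    """List paths added, removed, or modified between two snapshots."""
    old_paths, new_paths = set(old.manifest), set(new.manifest)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "modified": sorted(
            p for p in old_paths & new_paths if old.manifest[p] != new.manifest[p]
        ),
    }


# Hypothetical file contents, for illustration only.
v1 = commit({"train.csv": b"a,b\n1,2\n", "model.bin": b"weights-v1"}, "baseline model")
v2 = commit(
    {"train.csv": b"a,b\n1,2\n3,4\n", "model.bin": b"weights-v2", "eval.json": b"{}"},
    "add rows and retrain",
)
changes = diff(v1, v2)
print(changes)
```

Per-object versioning would record three unrelated version histories here. The snapshot model ties the new dataset, the retrained model, and the new evaluation file to a single commit message.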
With XetHub’s built-in block-level deduplication, additional versions of files only require storing changed blocks of data, so iterations on data and models are stored much more efficiently.
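The storage saving from block-level deduplication can be illustrated with a small Python sketch. It makes simplifying assumptions: the `BlockStore` class is hypothetical, and it splits files into fixed-size blocks, which is not necessarily how XetHub chunks data. The point is only that a second version of a file adds just the blocks that differ.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


class BlockStore:
    """Toy content-addressed store: each unique block is kept once."""

    def __init__(self):
        self.blocks = {}  # block hash -> block bytes

    def put(self, data: bytes) -> list:
        """Store a file version; return the block hashes that rebuild it."""
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i : i + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(key, block)  # no-op if already stored
            recipe.append(key)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild a file version from its list of block hashes."""
        return b"".join(self.blocks[key] for key in recipe)


store = BlockStore()
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # only the third block differs

recipe_v1 = store.put(v1)
blocks_after_v1 = len(store.blocks)
recipe_v2 = store.put(v2)
new_blocks = len(store.blocks) - blocks_after_v1
print(blocks_after_v1, new_blocks)
```

Both versions remain fully recoverable from their recipes, yet the second one costs one extra block rather than a full copy. Fixed-size blocks are the simplest choice for a sketch; they handle in-place edits well but not insertions that shift later bytes, which is why production systems often use content-defined chunking instead.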
If you already have your datasets and artifacts in S3, it can be a heavy lift to manually download all the data to a local machine and move it into XetHub just to try it out. To make testing XetHub versioning on your files easier than ever, we have just released an S3 import function that periodically syncs an S3 bucket to a XetHub repository.
Flex your new versioning muscles
Follow our instructions to import and sync your S3 bucket with a new XetHub repository. Each sync copies files from S3 to XetHub and commits them with a message that shows where each file was copied from, allowing you to try out XetHub risk-free.
Once your files are in XetHub, install our Xet CLI and try these Xet access patterns to get the most out of your newly versioned assets:
Read the latest version of your files without needing to download anything
Access a copy of your files from a week ago
Grab the output file associated with a certain commit
And easily explore your files through our UI.
🎉 Making the switch
Loving your versioned XetHub view and functionality? Easily move away from S3 by running a one-time S3 import (with no sync) and start writing your files to XetHub instead of S3. This will allow you to fully manage your large ML assets using Git semantics, Python, or the Xet CLI. For example, writing to a Xet repository is as easy as:
Git
Python
Xet CLI
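As a rough sketch of the Git path, the snippet below builds the standard `git add`, `git commit`, and `git push` sequence and wraps it in Python. It assumes you already have a local clone of your Xet repository with Git set up to work with it; the helper names, the `my-repo` directory, and the file path are hypothetical.

```python
import subprocess


def git_write_commands(path: str, message: str) -> list:
    """Build the standard Git commands that stage, commit, and push one file."""
    return [
        ["git", "add", path],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def write_to_repo(repo_dir: str, path: str, message: str) -> None:
    """Run the commands inside an existing local clone (repo_dir is hypothetical)."""
    for cmd in git_write_commands(path, message):
        subprocess.run(cmd, cwd=repo_dir, check=True)


# Building the commands has no side effects; running them needs a real clone:
#   write_to_repo("my-repo", "models/model.bin", "retrain on new data")
commands = git_write_commands("models/model.bin", "retrain on new data")
for cmd in commands:
    print(" ".join(cmd))
```

Keeping command construction separate from execution makes the sequence easy to inspect or log before anything is pushed.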
XetHub is free for all public use and private repositories under 20GB — try it today!
Next Steps
If you're using S3 to version your large models and datasets, we'd love to hear from you! You're welcome to reach out to us over email or join our Slack.