Warning: This blog post references deprecated XetHub links and functionality. Please use as a reference only and follow our current work on Hugging Face.
November 16, 2023
XetData: Scale your GitHub Repos to 100 Terabytes
Announcing our XetData integration for GitHub!
We’re excited to announce that XetData is now in beta. XetData turns GitHub into a single source of truth for machine learning projects by enabling you to version code, datasets, models, and other large assets in the same repo.
With XetData, after you configure our GitHub app and install our Git extension, the tooling fades into the background and you can continue using your familiar Git commands.
Versioning your machine learning development lifecycle in a single system lets us build some unique features and enable some unique workflows.
Dataset and model diffs
We can compare the same model file across 2 different Git branches and show useful model architecture visualizations powered by Lutz Roeder’s Netron library.
Automatic reproducibility
Most ML frameworks promise reproducibility only if you make major workflow changes, learn new commands, and lock into their system. And even then, there’s often drift between source code (managed by GitHub) and their MLOps system (for versioning datasets and models).
With our XetData integration, you can include all of the code, datasets, and models in your commits and pull requests. Reviewers then have all of the context they need to give useful feedback and help you confidently merge your work into the repo. When you’re trying to identify issues with your model in production 3 months later, you can go back to the specific commit and recreate the work again.
Over time, you can implement a lightweight ML versioning philosophy using commits and branches. You can read more about our opinionated, branch-centric ML versioning framework here.
Tap into the Git tooling ecosystem
Because our Git-Xet extension uses Git under the hood, we work with most Git tools out of the box! For example, you can carry out most common Git operations from inside VS Code or PyCharm using their existing no-code Git features. You can learn more in our docs.
How it Works
Staging changes and making commits
Every time you stage changes (git add .)or make a commit (git commit -m “commit message”), the Git-Xet filter runs behind the scenes to detect duplicate blocks of data in your files. This helps speed up the upload process, reduces storage space as well, and encourages practitioners to version more liberally because incremental changes are very cheap.
Pushing up your changes
When you push your changes to GitHub (git push origin), the Git-Xet client:
pushes raw binary files (videos, images, ML models, etc) and large files (big CSV’s) are to XetData
pushes raw source code and just file hashes for the large & binary files above to GitHub
Because we’re built on top of Git, we don’t require you to learn a new set of commands that you need to run in addition to your Git commands. Compare the XetData approach to the DVC approach:
DVC workflow image credit to Evgenii Munin
Block-based Deduplication
If the changes in the commit(s) you’re pushing overlap heavily with what’s in your main branch, then the upload can be quite fast because of the deduplication detection step that happened earlier. We invented our own deduplication technique and even published a paper at CIDR'23 on our approach.
On average, we find that we’re 5-8x faster than DVC, S3, and LakeFS when uploading large files because our deduplication.
XetHub vs XetData
Our founding team built Apple’s internal ML data and compute platform from the ground up. They witnessed firsthand the challenges machine learning engineers and data scientists faced when it came to versioning machine learning datasets and models. The teams there had tried Git, Git LFS, custom S3 versioning schemes, and DVC but none of them were widely used and users were unhappy with all of their approaches.
The early team set out to sea by creating a blob store with version control built in, as if Git and S3 had a baby. We built a complete platform with a lot of batteries included and called it XetHub.
Git repo management (VCS)
Website with a UI for collaboration (viewing commits, PRs, etc)
Compute platform for deploying Streamlit, Gradio, and other Python apps
To really tap into the power of versioning data, code, and models in one repo, our users had to migrate everything over to XetHub. For many teams, they were happy to do so for the platform benefits.
But we asked ourselves — could we bring the power of XetHub to GitHub? Could we upgrade the GitHub experience to continue managing code but have XetHub take over large files like datasets and models? XetData is what emerged out of that exploration and we’re excited to have you try it out. 🎉
Git Started
While we’re in beta, XetData is completely free of charge. Store and version large files in your GitHub repos to your heart’s content! To get started, follow the Quick Start on our XetData GitHub app page.
In summary, follow these 3 steps to try it for yourself:
Add our GitHub app to one of your GitHub repos
Install our tiny Git-Xet extension
Run through the Git lifecycle (add, commit, and push) 🎉
Get Help
If you have questions or run into issues, reach out to us in our Slack community! You can also file bugs in our public GitHub issue tracker.
Share on