December 8, 2023
Our New, Git-Centric ML Versioning Framework
The Messy ML Workflow
Machine learning practitioners are used to gluing together multiple tools for experiment tracking, hosting datasets, versioning datasets, hosting models, versioning models, reviewing changes, and monitoring models. Entire products exist to solve each of these problems individually, and stitching them all together causes:
model reproducibility to be very difficult
challenges onboarding new team members or interfacing with other teams
friction for teams looking to move fast in ML
We believe the root of these problems lies in the limitations of Git.
While Git and hosted Git solutions like GitHub and GitLab have become extremely popular in the world of software engineering, they struggle to store and version even 100 MB datasets and models. ML teams are left to use other tools like DVC, S3, and HuggingFace in addition to Git and GitHub.
At XetHub, we asked ourselves a simple question: what if we could simply remove Git's limitations? Could we scale Git to handle terabytes of large files?
Scaling Git would enable ML practitioners to version datasets, code, and models in the same repo. After we solved these problems, wrote about how we solved them, and launched the XetHub platform, people started to ask us for advice on:
When should we make Git commits?
How should we separate, name, and categorize our branches?
How do we know what model is in production?
In this post, we’ll provide an overview of our opinionated ML versioning framework and a few resources to dive deeper.
An Approachable Overview (Talk)
Recently, our team member Yonatan Alexander gave a fantastic talk at pyGrunn (30min) that is approachable and a good starting point.
The Framework in Detail
Next, we recommend reading Yonatan’s blog on Towards Data Science that dives into incredible detail and provides a clear playbook you can take with you to use in any tool.
Branches are at the heart of our ML versioning framework. In our view, using branches deliberately helps you and your team experiment freely and confidently, and keeps model discovery separate from delivery work.
We believe that you should maintain a few different types of branches as you work on an ML project:
data: mainly contain datasets and documentation
analysis: run analysis, A/B tests, etc
stable: active branches for training & inference
coding: meant for code development and active data exploration
monitoring: contain production data, commit tags, and model predictions → useful for detecting data drift
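To make the layout concrete, here is a minimal sketch that initializes a repository with one branch per role in the framework. The branch names (`data/main`, `stable/train`, etc.) are our own illustrative choices, not names prescribed by the framework; any plain Git repo will do.

```python
import os
import subprocess
import tempfile

# Hypothetical branch names, one per role in the framework above.
BRANCHES = [
    "data/main",          # datasets and documentation
    "analysis/ab-test",   # analysis and A/B tests
    "stable/train",       # active training & inference
    "coding/explore",     # code development and data exploration
    "monitoring/prod",    # production data, tags, and predictions
]

def init_repo_with_branches(path: str) -> None:
    """Create a fresh repo with an empty root commit and one branch per role."""
    env = {**os.environ,
           "GIT_AUTHOR_NAME": "demo", "GIT_AUTHOR_EMAIL": "demo@example.com",
           "GIT_COMMITTER_NAME": "demo", "GIT_COMMITTER_EMAIL": "demo@example.com"}
    subprocess.run(["git", "init", "-q", path], check=True)
    subprocess.run(["git", "-C", path, "commit", "-q", "--allow-empty",
                    "-m", "init"], check=True, env=env)
    for branch in BRANCHES:
        # Each branch starts from the same root; they diverge as work lands.
        subprocess.run(["git", "-C", path, "branch", branch], check=True)

repo = tempfile.mkdtemp()
init_repo_with_branches(repo)
```

From here, day-to-day work happens on the `coding/*` and `analysis/*` branches, while `stable/*` and `monitoring/*` change only through deliberate merges.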
To dive deeper, we recommend reading the full post here.
An Example Repo
Once you’re ready to get your hands dirty, we recommend cloning and playing with our example repo hosted on XetHub.
Deduplication
Immediately, you’ll notice that our deduplication technology reduced the total repo size from 120 megabytes to 54 megabytes. This is part of our not-so-secret sauce to scaling Git to handle large files.
Because of our efficient deduplication, remixing datasets or creating branches with training & test sets from full datasets often doesn’t consume more storage space. It also makes switching branches and uploading changes back to a central repository significantly faster.
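The intuition behind this can be shown with a toy content-addressed chunk store: identical chunks of data are stored once, so a branch holding a subset of a dataset adds references, not bytes. This is an illustrative sketch only — it uses fixed-size chunks, whereas XetHub's actual deduplication is more sophisticated.

```python
import hashlib

CHUNK = 4  # unrealistically small chunk size, for demonstration

class ChunkStore:
    """Toy chunk store: each unique chunk is stored exactly once."""

    def __init__(self):
        self.chunks = {}  # sha256 hex digest -> chunk bytes

    def put(self, data: bytes) -> list:
        """Store a 'file', returning the list of chunk references."""
        refs = []
        for i in range(0, len(data), CHUNK):
            piece = data[i:i + CHUNK]
            digest = hashlib.sha256(piece).hexdigest()
            self.chunks[digest] = piece  # re-storing a known chunk is a no-op
            refs.append(digest)
        return refs

    def get(self, refs: list) -> bytes:
        """Reassemble a 'file' from its chunk references."""
        return b"".join(self.chunks[digest] for digest in refs)

store = ChunkStore()
full_dataset = store.put(b"AAAABBBBCCCCDDDD")  # the full dataset
training_set = store.put(b"AAAABBBBCCCC")      # a subset for training
# Both "files" exist, but the subset reused the full dataset's chunks,
# so only 4 unique chunks are stored for 7 total chunk references.
```

Branching works the same way: a new branch referencing existing data costs roughly the size of its references, not another copy of the data.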