What is XetHub?
XetHub is a collaboration platform for accessing, exploring, and iterating on large-scale repositories, backed by the power of Git.
Wait, don't GitHub, GitLab, and Git LFS already do this?
We believe that a user experience for exploring things like datasets, models, media, and full projects is very different from one that is solely focused on code. When working with large files using existing tools like GitHub and GitLab, non-code files are never as accessible as we want them to be, and even with Git LFS, there are cost and speed barriers to working with larger repositories.
With XetHub, we want users to work together on datasets, media, and more as easily as developers currently do with code. With an interface that enables custom visualizations on repositories and differences, and a backend that makes interacting with big files faster than ever, we want to provide better ways for users to share and understand how their projects change over time.
Is it just... Git?
Kinda. It is Git with an extension that transparently enables support for arbitrarily large binary files.
My files are currently in GitHub/GitLab/Git LFS. Do I have to move them into XetHub to use these features?
Yes, for XetHub to provide the speed, visualizations, and summaries that we do, any files that you want to access with Xet functionality should be migrated into XetHub.
If your code is happy where it is and you are primarily interested in using Xet functionality on your larger files (datasets, binary assets, etc.), you can leave the code in place. Simply move your larger files into XetHub and use Git submodules to keep your project in sync. Want to talk through your use case? We'd be happy to chat!
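As a rough sketch of the submodule approach, the helper below uses only standard Git commands. The function name add_data_submodule, the remote URL, and the directory name are placeholders chosen for illustration, not something XetHub prescribes, and the machine is assumed to already have git-xet set up for XetHub remotes.

```shell
# Hypothetical helper; run it from the root of the existing code repository.
#   $1: Git remote URL of the XetHub repository holding the large files (placeholder)
#   $2: directory inside the code repository where those files should appear
add_data_submodule() {
    # Register the data repository as a submodule; this stages .gitmodules
    # and a link to the data repository's current commit.
    git submodule add "$1" "$2" || return 1
    # Record the pinned data commit in the code repository's history.
    git commit -m "Add $2 as a submodule"
}

# Example call (placeholder URL):
#   add_data_submodule "<Git remote URL>" data
# Collaborators can then fetch the pinned data version after cloning the code:
#   git submodule update --init --recursive
```

Because the code repository records one specific commit of the data repository, code and data versions stay paired until someone deliberately updates the submodule.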
Arbitrarily large? How large is that, actually?
XetHub can comfortably store repositories of 100 GB or more, up to 1 TB, with plans to scale to 100 TB per repository in the future. Operations on repositories near the TB end of the spectrum may take longer as we continue to optimize our platform for scale.
The data I work with is huge, but I only need to work with pieces of it. Can XetHub help?
Yes! We built Xet mount for easy read-only exploration, allowing you to access a repository at any commit without downloading everything. File contents are streamed and cached on demand, so you don't need to have the local space to store it all, and subsequent reads are even faster due to the cache.
If you need edit access to a subset of files within a huge repository, we've got you covered. Follow our best practices for working with huge repositories: clone the repository with the --no-smudge flag to keep all large files as pointers, then manually check out the files you need to fully materialize them for local editing.
Can I choose to work with pointers or actual files and convert as needed?
Yes, and doing so can be a convenient way to save disk space and download time. You can clone an "unsmudged" version of the repository, which checks out every file that is not UTF-8 decodable or is larger than 256KB as a pointer file, with either of the following commands:
git xet clone --no-smudge <Git remote URL>
XET_NO_SMUDGE=1 git clone <Git remote URL>
To convert pointers to full files:
git xet checkout <pointers>
To convert full files back to pointers:
touch <files>
XET_NO_SMUDGE=1 git checkout <files>
This is a neat one. The touch makes Git think the files have changed, so it will overwrite them on the checkout command. Git-xet then checks out those files again with XET_NO_SMUDGE set, forcing Git to check out the pointers instead.
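To get a feel for which files an unsmudged clone would leave as pointers, the rule stated above (not UTF-8 decodable, or larger than 256KB) can be approximated locally. This is only a sketch: would_be_pointer is a hypothetical helper, not part of git-xet, it assumes 256KB means 256 * 1024 bytes, and git-xet's actual check may differ in its details.

```shell
# Hypothetical helper: succeeds (exit status 0) if the file would likely be
# kept as a pointer under the rule above, and fails (exit status 1) otherwise.
would_be_pointer() {
    wbp_file=$1
    # Assumption: "256KB" is interpreted as 256 * 1024 bytes.
    wbp_limit=$((256 * 1024))
    # Arithmetic expansion strips the padding some wc implementations emit.
    wbp_size=$(($(wc -c < "$wbp_file")))
    if [ "$wbp_size" -gt "$wbp_limit" ]; then
        return 0
    fi
    # iconv exits non-zero when the input is not valid UTF-8.
    if ! iconv -f UTF-8 -t UTF-8 "$wbp_file" > /dev/null 2>&1; then
        return 0
    fi
    return 1
}

# Example use: list the files in the current directory that match the rule.
#   for f in *; do
#       [ -f "$f" ] && would_be_pointer "$f" && echo "$f"
#   done
```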
What can I use XetHub for? Is it just for ML?
XetHub is for anything that you want to collaborate on where reliable history and metadata are important. Built-in Git functionality provides difference-based review and custom visualizations, while XetHub adds instant access. It is especially useful for iterative workflows where you want to quickly access different versions without fully downloading them, such as large asset development (Unreal and Unity models) or ML model training iterations (model checkpoints). It can also replace workflows where you currently append "_1", "_2", etc. to your file names to track versions manually.
Intrigued but unsure it's a good fit for your use case? Let's talk.
What operating systems are supported right now?
macOS and Linux are fully supported. Windows is currently in preview, with some known limitations.
Is it ready for production use?
Yes! Reach out to us for a custom onboarding plan.
Ok, so how much does this cost?
Community users can store up to 20 GB of deduplicated data across any number of repositories without charge. Need more? Contact us to share your requirements.
How does git-xet work?
Magic! Read how Xet deduplication works for a high-level overview and some commands you can run to explore its internals.