July 10, 2024
Benchmarking the Modern Development Experience across Versioning Tools: S3, DVC, Git LFS, and XetHub
The advent of artificial intelligence and machine learning (AI/ML) has brought a new level of scale and complexity to software projects. Now, in addition to code, developers must also version and store artifacts such as datasets, models, and metadata to ensure provenance and reproducibility.
While a variety of tools are used to address this problem, not all tools are equal. Each uses a different approach to storing and versioning files that has implications on the end developer experience. In this post, we compare the performance of four versioning solutions — Amazon S3, DVC, Git LFS, and XetHub — across three modern development use cases to understand the trade-offs of each tool.
Methodology
Benchmarks frequently focus on the performance of a single operation, such as the time to upload or download a file, but real-world development requires many iterative operations. For our benchmark, we are specifically interested in how versioning tools perform as artifacts change over the course of a project.
To do this, we examine three publicly developed projects from the domains of gaming, biotech, and research. Each has published a replay-able history of updates, allowing us to mirror the incremental edits across the full project in order to perform our benchmark calculations. Finding these projects in the wild can be difficult, as many such projects are only shared in their final state.
Use Cases
1. Game Development: Megacity Multiplayer
The Megacity 2019 repo contains a multiplayer version of the original Megacity game, packaged to run using Unity, a cross-platform game engine. The repo holds an interesting mix of asset and metadata files (e.g., .fbx, .mesh, .mat, .jpg, .png, .psd, .asset, .unity, .ogg), and its history currently spans 37 commits mixing changes to assets and code over the course of game development. The original repo uses Git LFS hosted on GitHub.
The uncompressed size of storing all 37 versions is approximately 239 GB.
2. Biotech: RCSB PDB SARS-CoV-2 structures
The Research Collaboratory for Structural Bioinformatics (RCSB) publishes a list of SARS-CoV-2 Protein Data Bank (PDB) structures on its registry, with ~4,000 structures of the proteins that make up the virus. PDB files are text-based, containing key-value entries that describe the structure and coordinates of the different atoms and molecules that make up the protein crystal, along with additional metadata. RCSB versions the PDB registry via a naming convention applied to filenames and folder structures on its FTP servers. This approach has some limitations:
Only the last major versions are kept, leading to incomplete history.
PDB files are renamed on each version, making it difficult to track changes between versions.
The uncompressed size of storing all 5 versions is 29 GB.
3. Research dataset: CORD-19
The CORD-19 dataset is a corpus of COVID-19 papers, curated and maintained by researchers between 2020 and 2022. The dataset contains document embeddings, full-text parses, and metadata for all the papers in the corpus, with 50 additions (commits, in this benchmarking exercise) over the two-year history of the repository. This dataset was used in XetHub’s original Git is for Data research paper and was included in this benchmark because it shows the typical pattern of many modern datasets, where data is only appended.
The uncompressed size of storing all 50 versions is approximately 2.45 TB.
Measurements
For each use case, we stored the original state of the repository and replayed all updates as incremental commits, as sketched below. In the case of the gaming and research projects, this was quite natural, as the commit history is readily available as part of the GitHub repository. For the biotech PDB structures, we wrote a script to download PDB files from RCSB’s FTP servers and committed each successive version.
For each commit, we measured:
Upload or push time
Download or checkout time
Storage size
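To make the methodology concrete, here is a minimal sketch of the replay-and-measure loop, using plain Git commands as a stand-in; the real harness substitutes the equivalent DVC, Git LFS, S3, or XetHub commands, and the repo path, commit ids, and branch name shown here are placeholders.

# Sketch of the replay-and-measure loop described above. Paths, commit
# ids, and branch names are placeholders; tool-specific commands differ.
import subprocess
import time
from pathlib import Path

def dir_size_bytes(path: Path) -> int:
    # Total size of all files under `path` -- our "storage size" measure
    # for tools that keep each version in full.
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

repo = Path.home() / "benchmarks" / "megacity"   # working copy with full history
clone = Path("/tmp/clean-clone")                 # separate clone used for downloads
commits = ["a1b2c3", "d4e5f6"]                   # replayed versions, oldest first

for commit in commits:
    # Upload/push time: push this version of the project to the remote.
    t0 = time.perf_counter()
    subprocess.run(["git", "push", "origin", f"{commit}:refs/heads/main"],
                   cwd=repo, check=True)
    push_s = time.perf_counter() - t0

    # Download/checkout time: fetch and check out the same version elsewhere.
    t0 = time.perf_counter()
    subprocess.run(["git", "fetch", "origin"], cwd=clone, check=True)
    subprocess.run(["git", "checkout", commit], cwd=clone, check=True)
    pull_s = time.perf_counter() - t0

    print(f"{commit}: push {push_s:.1f}s, checkout {pull_s:.1f}s, "
          f"workdir {dir_size_bytes(clone) / 1e9:.2f} GB")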
Tools
There are known differences in the versioning and storage approaches of each tool. The “raw size” column in the table above refers to the space occupied if each commit is stored as a subdirectory and all subdirectory sizes are summed together.
S3 versioning, when enabled, stores every version of every file at “raw size” — no deduplication.
Git LFS and DVC both use pointers to refer to large files stored on servers, and implement file-level deduplication, meaning that identical files in subsequent versions will not be stored again, regardless of the storage backend.
XetHub uses pointers to refer to large files stored on S3-backed XetHub servers, which store files in a data format that deduplicates at the block level. This means that unchanged blocks in subsequent versions will not be stored again.
We’ll see the impact of these approaches in our results.
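To make the three strategies concrete, the following sketch estimates how much storage each would need for a set of replayed commit snapshots stored as sibling directories. It uses fixed-size 64 KiB blocks purely for illustration; XetHub’s actual block boundaries are chosen by its own format, so treat this as a rough model of the idea rather than any tool’s algorithm. The snapshots directory layout is hypothetical.

# Illustrative storage accounting for the three strategies described above:
# raw size (no deduplication), file-level dedup, and block-level dedup.
# Files are read whole for brevity; fine for a sketch, not for 2 TB corpora.
import hashlib
from pathlib import Path

BLOCK = 64 * 1024  # arbitrary fixed block size used only for illustration

def storage_estimates(snapshots_root: Path) -> dict:
    raw = 0
    unique_files, unique_blocks = set(), set()
    file_dedup = block_dedup = 0

    for f in snapshots_root.rglob("*"):
        if not f.is_file():
            continue
        data = f.read_bytes()
        raw += len(data)  # every snapshot pays full price (the S3 approach)

        file_hash = hashlib.sha256(data).hexdigest()
        if file_hash not in unique_files:        # file-level dedup (Git LFS, DVC)
            unique_files.add(file_hash)
            file_dedup += len(data)

        for i in range(0, len(data), BLOCK):     # block-level dedup (XetHub-style)
            block = data[i:i + BLOCK]
            block_hash = hashlib.sha256(block).hexdigest()
            if block_hash not in unique_blocks:
                unique_blocks.add(block_hash)
                block_dedup += len(block)

    return {"raw": raw, "file_dedup": file_dedup, "block_dedup": block_dedup}

# Example: snapshots/commit_000, snapshots/commit_001, ... (hypothetical layout)
print(storage_estimates(Path("snapshots")))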
In our configuration, we set up DVC to use S3 as its remote and Git LFS to use GitHub’s LFS offering. All S3 buckets use the S3 Standard storage class. Benchmarks were run on Amazon Elastic Compute Cloud (EC2) m5.xlarge instances (4 vCPUs, 16 GB memory) with 10 TB Elastic Block Store volumes.
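For reference, that configuration implies a setup roughly like the following; the bucket name, remote name, and tracked patterns are placeholders, not our actual benchmark settings.

# Rough sketch of the per-tool configuration described above.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# DVC with an S3 remote (run inside an existing Git repository).
run("dvc", "init")
run("dvc", "remote", "add", "-d", "s3remote", "s3://example-benchmark-bucket/dvc")

# Git LFS against GitHub's LFS offering: install the extension and track
# the kinds of large binary formats found in the Megacity repo.
run("git", "lfs", "install")
run("git", "lfs", "track", "*.fbx", "*.psd", "*.asset")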
Results
Each project’s development patterns, size, and complexity lead to differences in transfer times and storage costs. Let’s review the patterns and performance for each use case.
Use Case 1: Megacity Multiplayer
The Megacity Multiplayer commit history reflects what we see in many modern projects: multiple collaborators making frequent modifications across code and artifacts. In this experiment, we found that upload/download speeds varied as the project progressed.
Initial development:
Despite Git LFS, DVC, and XetHub’s storage deduplication algorithms, all three tools perform similarly to S3 on upload and download times for the first four commits. This is likely because so many files change dramatically at the beginning of a project, leaving little duplicate content to deduplicate.
Later development:
Git LFS and XetHub perform well compared to both S3 and DVC on uploads and downloads in later commits, likely owing to parallelization of upload and download requests.
On larger pushes of changes, commits 4, 8, and 29 in particular, we can see both Git LFS and XetHub upload times spike. Because XetHub only needs to push altered blocks, it performs much faster on these changes than Git LFS. On the download side, Git LFS has a slight edge over XetHub.
Throughout development:
Our S3-backed DVC performs significantly worse than S3 and Git LFS for uploads, likely due to the lack of parallelization in handling these requests. DVC’s focus on making it easy to version code and artifacts side by side does not extend to optimizing performance for subsequent updates.
As expected, DVC and Git LFS storage track almost identically since both deduplicate at the file level, while XetHub’s block-level deduplication leads to an additional 60% savings in storage. We intentionally leave S3 off the plot because its ever-growing stored size, ending near 240 GB, made the chart unreadable for the other tools.
Use Case 2: RCSB PDB structures
In many respects, the versioning-via-naming-convention approach of the RCSB PDB group reflects the reality of how many AI/ML teams version models today:
Storage cost considerations lead to only keeping major versions
Naming conventions are easy to implement but require manually tracking down changes across versions
Meanwhile, the edit patterns mirror more closely what we see in traditional software development: a number of edits across a set of smaller files. After the initial PDB commit, transfer times for pushing and pulling changes are essentially constant across each tool for subsequent versions. Between tools, however, the differences are stark: DVC upload/download times are much longer than those of all other tools, and XetHub outperforms on every dimension.
Use Case 3: Research dataset
Datasets used in modern AI/ML projects typically experience incremental updates as opposed to sweeping changes. Modifying labels, deleting old data, and appending new data are common operations. The CORD-19 dataset is an example of the last pattern.
As updates to these types of datasets are infrequent, upload times are less significant in the overall development process. However, running iterative model training and evaluation experiments is common, and that code often starts by downloading the dataset, so download delays cost both developer time and expensive idle GPU hours.
For a dataset of this size, upload and download times are the bottleneck for running benchmarks. Downloading the first 20 commits for DVC took 32 hours. Projecting forward, we estimate that replaying the remaining 30 commits would take approximately 336 hours. To avoid wasting resources, we cut off our download measurements at 20 commits for all tools. Here are the results:
For DVC, the first 20 re-played commits averaged 96 minutes per download, generally increasing with each commit. The 20th commit took nearly 4 hours to download.
Git LFS averaged 51 minutes per download. The first two commits each took over 2 hours to download, skewing the average; our suspicion is that this is due to inefficiencies in Git LFS smudge filters. Subsequent downloads normalized, with the 20th commit taking 55 minutes.
S3 averaged nearly 47 minutes per download. The 20th commit took 56 minutes.
XetHub averaged 19 minutes per download. The 20th commit took 32 minutes.
The variation in download times is partially a function of each product’s approach to deduplication, as mentioned above in the Tools section. See the table below for the final stored size for each tool.
Downloads for all tools are only calculated for the first 20 commits to conserve resources.
Takeaways
Each evaluated tool allows for easy storage and custom naming conventions. However, what a product optimizes for determines the developer experience, leading to significant differences in performance across use cases:
XetHub performs the best across upload, download, and storage benchmarks for all use cases.
For the two more development-focused use cases, Megacity Multiplayer and RCSB PDB, upload/download time per commit generally normalized by tool after the first few commits. The Megacity Multiplayer example, in particular, demonstrates some peculiar behavior:
S3 consistently outperforms DVC on uploads, but lags far behind on downloads due to lack of deduplication in what’s being downloaded.
For larger pushes of data (commits 4, 8, and 29), upload times spike and re-normalize for XetHub and Git LFS, while DVC times don’t fully renormalize after each spike and S3 times barely change. Corresponding downloads of the larger data pushes show the same spikes and re-normalization for XetHub, Git LFS, and DVC.
For the append-only scenario of CORD-19, upload time trends were similar for all tools. However, on the download side, DVC stood out with increasingly slow downloads as the size of the dataset grew. After commit 8 (a push of roughly 84 GB), DVC download times rapidly fell behind all other tools, indicating that it might not be an ideal choice for projects that will exceed 80 GB.
All benchmarks for DVC and XetHub were performed with S3 providing the backing storage. However, even with a similar tech stack, XetHub transfer times were consistently faster than both S3 and DVC.
One aspect of the development experience not covered in the benchmarking above is the user experience of working with each tool. Here are some differences we noticed:
XetHub and Git LFS integrate with Git workflows and are installed via Git extensions. Normal Git commands (e.g., push, pull) generally work on your repository once your files are tracked.
With Git LFS, you must configure a remote server and explicitly track every large file you want stored on that server with the git lfs track command, which turns the file into a reference to the copy on the remote server. Forgetting to track a file requires going through a complicated migration process.
With XetHub, your files are automatically stored on XetHub servers, and large files are automatically converted to references. No additional commands are needed for file tracking. XetHub also supports additional access patterns, such as an instant file-system mount, that can further improve load time for training.
DVC uses a Git-like syntax, requiring additional commands like dvc init and dvc checkout to track and pull files. It can be used with or without Git. DVC supports a variety of backing cloud storage options so you don’t have to move your data from one provider to another. It creates an additional .dvc file to reference each tracked large file. Like Git LFS, if you accidentally forget to run a DVC command, you may end up in an unhappy state.
S3 can be used for versioning in two ways: by turning on the S3 versioning feature or by bucket and file naming conventions.
S3 bucket versioning is a feature that stores all versions of S3 objects. On any write or deletion of a file, the previous version is saved, allowing for recoverability at the file level. However, there is no way to revert the state of all contents in a bucket to a certain time, making project-level recovery difficult. This feature is disabled by default on all buckets because, without any storage deduplication algorithms in place, costs can quickly build up.
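For completeness, here is roughly what enabling and inspecting bucket versioning looks like with boto3; the bucket name and object key are placeholders.

# Minimal boto3 sketch of S3 bucket versioning: enable it, then list the
# stored versions of one object. Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "example-benchmark-bucket"

# Versioning is off by default; every overwrite after this call keeps the
# previous object version at full size (no deduplication).
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Each write of datasets/train.parquet now gets a new VersionId. Older
# versions can be fetched individually, but there is no single call that
# restores the whole bucket to an earlier point in time.
versions = s3.list_object_versions(Bucket=bucket, Prefix="datasets/train.parquet")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"], v["Size"])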
The naming convention approach, as demonstrated by the RCSB PDB repository, can be brittle, takes significant storage space, and makes comparing against previous versions difficult. Human errors and re-runs of code may also accidentally overwrite or delete previous versions, resulting in uncertain provenance.
Conclusion
Twenty years ago, it would have been absurd to consider pushing multiple gigabytes of changes to the cloud in a single commit. S3 didn’t exist until 2006, Git LFS was created in the mid-2010s, and DVC was started in 2017. Recently, the promise of AI/ML has led to an influx of data collection and creation, all versioned and stored with tools that were originally architected to support a much smaller scale. As a result, developers pay the price in three dimensions: storage costs, developer time spent waiting for files to transfer, and idle GPU time spent waiting for files to download.
XetHub was founded in late 2021 to address these pains. The benchmarks in this post highlight the differences in how each of the tools mentioned above performs across three different modern use cases: game development, biotech, and research. The results demonstrate the benefits of XetHub’s storage and compression algorithms for each project over the course of its development:
About 50% savings in average upload times compared to the nearest competitor
A range of improvements in average download times compared to competitors
Over 50% savings in final storage used compared to the nearest competitor
Our customers have seen similar results; Gather AI reduced storage costs by 51% and improved deployment times by 40% by integrating with XetHub.
Storage and performance aren’t the only things that matter to XetHub. The user experience of developing with large files is just as challenging as versioning, so we invested in efficient load patterns, custom visualizations, and multiple access patterns to support observability and collaboration at scale with minimal friction. Try XetHub today to experience the benefits for yourself.