October 15, 2022

Why Git for Data?

Yucheng Low

The Problem

Everyone who has worked on ML has encountered the dataset problem.

When working on an ML project for fun by myself, I never know where to store all the source data and intermediate results. The most recent example was a whisky mass-spec project: I did not want to take on the friction of a pipeline system while just playing with data, but once I had interesting results I had nowhere sensible to put them. Copying and pasting directories (whisky_v1, whisky_v2, whisky_v2_nmf_denoise, whisky_v2_weird_looking) felt wrong and reminiscent of source control in the '80s and '90s, but it was also the simplest solution available.

When scaling up to bigger datasets and bigger teams, the pains of dataset management only get worse. This has consistently been a source of frustration for every Machine Learning team we have worked with over the last 10+ years (both at Apple and GraphLab/Dato/Turi).

Balancing experiments at different scales (from tiny on-laptop explorations, to training on a data subset, to production-ready training code) is almost impossible. Invariably, every team architects its own solution for storage, file formats, streaming, pipelining, and versioning: a solution unique to its use case that requires constant care and maintenance.

There has to be a better way.

Who are we?

We are a group of co-founders who together have 10+ years of experience building large-scale distributed systems. We architected a foundational piece of the ML storage system at Apple and have worked with countless ML teams, learning their needs and supporting them in delivering their features. The founding team also has an incredible breadth of experience, spanning high-performance systems, databases, distributed systems, visualization, UX design, and even game development!

This team is the culmination of our experience and intuition. Our goal is to bring about a better solution for everyone.

What do we need?

To take a step back, we will use the first three levels of Maslow’s hierarchy of needs as a metaphor and stretch it a bit:

  1. The basics: When working with data, we need fast, scalable storage. Without this, achieving results is actively painful.

  2. Safety and security: People want the confidence to make changes and be able to undo them, and to be able to create new versions of datasets without worrying about access to older ones. When a new version of a dataset is made, people want to know how it has changed and to have the confidence to use the new version in their work.

  3. Full potential: The ideal is to have the ability to collaborate with others on projects, share results, and exchange ideas. Being able to easily explore what datasets and models are available, and to use them in your own projects, creates an environment where innovation can thrive.

Our solution

We have architected XetHub, a collaborative storage platform designed for managing data at scale. XetHub delivers blobstore-like performance with built-in data deduplication, file-system semantics, and full Git compatibility.

Let's break it down.

Automatic deduplication

We believe that effective data deduplication for ML workloads is not merely a storage optimization but is fundamental to performance, since datasets frequently share common elements: subsets, minor modifications, appends and updates, and straight-up copies made for convenience. The ability to make copies of a large training set, with the freedom to make whatever small changes you need (fix a label, delete a bad row), all while knowing that these changes will be cheap to maintain, is liberating.
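To make the intuition concrete, here is a toy sketch of chunk-level deduplication in Python. It is not XetHub's implementation: it uses naive fixed-size chunks and an in-memory dictionary as the chunk store, and the file names are made up. Real systems typically use content-defined chunking so that an insertion does not shift every subsequent chunk boundary.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # toy fixed-size chunks; real systems use content-defined chunking


def store_file(path, chunk_store):
    """Split a file into chunks and store each chunk once, keyed by its hash.

    Two copies of a dataset that differ only slightly share almost all of
    their chunks, so keeping the second copy costs almost nothing.
    """
    manifest = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)  # the bytes are stored only once
            manifest.append(digest)
    return manifest  # the file is now just an ordered list of chunk hashes


chunk_store = {}
v1 = store_file("whisky_v1.csv", chunk_store)  # hypothetical files
v2 = store_file("whisky_v2.csv", chunk_store)  # a near-copy reuses most of v1's chunks
print(f"{len(v1) + len(v2)} chunk references, {len(chunk_store)} unique chunks stored")
```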

File-system semantics

We believe that files are the lowest common denominator and are needed to support the broadest range of workflows. But the storage architecture should be able to optimize deeply for specific file types, whether for deduplication or for visualization metadata.

Having file semantics as a core part of the system architecture allows us to embrace file-system mounts as a core access pattern, enabling both laptops and clusters to get an identical view of the same repository for both exploration and distributed training workflows.
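As an illustration of why this matters, the snippet below assumes a repository has already been mounted at a hypothetical path; the mount point and file names are made up, and the mount command itself is not shown. The point is that, once mounted, ordinary file and dataframe APIs work unchanged whether the code runs on a laptop or on every worker of a training cluster.

```python
import pandas as pd

# Hypothetical mount point: assume the repository is mounted (read-only) here.
REPO = "/mnt/whisky-data"

# Standard file APIs just work against the mounted repository, so the same
# script runs on a laptop for exploration or on a cluster for training.
df = pd.read_csv(f"{REPO}/spectra/batch_01.csv")  # hypothetical file in the repo
print(df.shape)
```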

Git compatibility

I have learned the painful lesson that being ‘almost-like’ something familiar only adds friction to the user experience. When I built the SFrame dataframe for Turi Create (previously GraphLab Create), it had huge advantages over other dataframe implementations in Python (namely Pandas). However, because the APIs were incompatible, the friction of learning a new API put people off. (Even the Spark DataFrame is trying to be Pandas-compatible now!) With our platform, we quickly realized that we cannot be merely ‘Git-like’.

As such, we embrace Git compatibility whole-heartedly: so much so that our usage documentation sometimes reads like a Git tutorial.

With Git compatibility, we want to automatically enable all the workflows that Git users are familiar with: branching, versioning, tagging, merging, etc. But all on your data.
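As a rough sketch of what that looks like in practice, here is the familiar branch-edit-commit-push loop applied to a data repository, written with the GitPython library. The repository URL, file name, and the specific edit are all made up, and the large-file handling that a platform like ours provides underneath plain Git is not shown.

```python
from git import Repo  # GitPython: pip install GitPython

# Clone a (hypothetical) data repository and branch it, exactly like a code repo.
repo = Repo.clone_from("https://example.com/alice/whisky-data.git", "whisky-data")
repo.git.checkout("-b", "fix-bad-labels")

# Fix a single label in place. On a deduplicating backend this small edit is
# cheap to version even if the underlying file is large.
labels_path = "whisky-data/labels.csv"
with open(labels_path) as f:
    rows = f.readlines()
rows[42] = rows[42].replace("islay", "speyside")  # hypothetical correction
with open(labels_path, "w") as f:
    f.writelines(rows)

# Commit and push the branch, ready for review and merging like any other change.
repo.index.add(["labels.csv"])
repo.index.commit("Fix region label for sample 42")
repo.remote("origin").push("fix-bad-labels")
```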

While we don’t require changes to Git itself right now, we have some ideas for improvements which we will share and contribute upstream further down the road.

Collaboration

GitHub has shown the way for developers to collaborate on code. Can we extend the metaphors for data? What does it mean to do a Pull Request on data? What kind of visualizations can we provide automatically (or with configuration) to enable data exploration? So many questions to explore.

To begin to address these questions, we've started by building an auto-summarization feature for CSV files that displays basic column statistics, as well as custom visualizations that load automatically when reviewing differences.
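For a sense of what "basic column statistics" might cover, here is a minimal sketch using pandas. It illustrates the kind of per-column summary that is useful when reviewing a new dataset version; it is not the feature's actual implementation, and the file name is made up.

```python
import pandas as pd


def summarize_csv(path: str) -> pd.DataFrame:
    """Compute basic per-column statistics for a CSV file.

    Numeric columns get count/mean/std/min/max; other columns get
    count/unique/top/freq -- the kind of at-a-glance summary that helps
    when deciding whether a new version of a dataset looks sane.
    """
    df = pd.read_csv(path)
    return df.describe(include="all").transpose()


print(summarize_csv("whisky_v2.csv"))  # hypothetical file
```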

Current limits

There is an architectural principle I strongly believe in: every couple of orders of magnitude, almost everything needs to be redesigned. We currently support up to 1TB per repository, a scale at which we can rapidly gain a better understanding of the bottlenecks in our system. The learnings at this level will inform the v2 architecture required to scale to 10TB and beyond.
