Warning: This blog post references deprecated XetHub links and functionality. Please use as a reference only and follow our current work on Hugging Face.

January 24, 2024

XetHub > Google Drive for LLM Fine-tuning in Colab

Srini Kadamati

Google Colab is a cloud notebook with attached compute that has become a very popular way to load, explore, and fine-tune large language models (LLMs). Colab gives free users access to CPU and GPU compute units, with the option to upgrade to Colab Pro for more compute and fewer restrictions.

We love Colab for prototyping and quick exploration, but we believe it falls short on the storage and versioning options it offers.

Google Drive… for data and model storage?

Google Colab prefers that you store datasets and models in Google Drive, which has very poor ergonomics for data professionals.

Browsing & building context

Google Drive is fine for photos and text documents, but does a poor job rendering JSON, tabular datasets, model files, and more. This makes discoverability of datasets and models very difficult.

Access Patterns

While you can mount entire folders from Google Drive in Google Colab, there is no convenient way to materialize only specific files or to access previous versions of a file.

Lack of Version Control

Colab users have turned to alternatives like Hugging Face to host their ML model files, blob stores (like S3) to host raw datasets, and GitHub for code & notebook versioning. To version and keep track of your long-running ML experiments, you need to version in 3 different places and somehow manage all of this overhead yourself. This problem is further magnified in a team environment.

In addition, it's extremely easy for you to overwrite files and very clunky to revert to an older version!

Poor collaboration features

While Google Docs, Sheets, etc. are associated with live collaboration, Google Drive itself offers few features for collaboration, especially with collaborators who don't have Google accounts.

XetHub as a Google Drive Alternative

XetHub is a new kind of version control repo that can scale to handle large file types (up to 100 terabytes), provides useful context for most file types, has rich collaborative features (issues, pull requests, etc), and supports multiple access patterns.

This means that:

your commits can contain ALL of the context from a specific ML experiment
you can reproduce any past ML work because you can rewind to the specific dataset, model, and code
you only need to share access to the repo, not 3 different repos in 3 different tools

To showcase the workflow, we’ll fine-tune Meta’s CodeLlama model in Google Colab from a XetHub repo.

Getting Started

Start by creating a free XetHub account and forking our XetHub repo.
Save a copy of the Colab notebook to your own Google Drive so you can edit it.
Run the first 3 cells in the notebook. In the 3rd cell, you’ll be asked to fill out:
- Your XetHub username
- Your XetHub email
Run the rest of the cells in the Colab notebook. Here’s a breakdown of the steps you’re executing in this notebook:
- Install libraries for fine-tuning.
- Install Git-Xet so you can access large files from XetHub.
- Use Git-Xet to lazy clone our repo and then materialize (or download) the Code Llama 7B model. Note that this may take a few minutes, as gigabytes of files are downloaded to Colab’s local filesystem and the model is then loaded into memory.
- Establish the baseline performance by asking the model to generate some code that’s contextual to our PyXet library. Notice how the generated code is highly erroneous.
- Fetch the source code for PyXet, tokenize it, and fine-tune the model using LoRA.
- Load the new model checkpoint and ask it to perform the same code generation task we did in the baseline.
- Finish by loading the new weights back into the original model, creating a Git commit, and pushing the changes back to XetHub.
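For readers new to LoRA, the fine-tuning step above can be illustrated with a small pure-Python sketch. LoRA keeps a weight matrix W frozen and trains two much smaller matrices, B (d x r) and A (r x k), so the effective weight becomes W + (alpha / r) * BA. The function names, toy shapes, and values below are illustrative assumptions and are not taken from the notebook, which relies on the fine-tuning libraries it installs.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the LoRA rank."""
    r = len(a)  # A has shape (r, k), so its row count is the rank
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, r):
    """Parameters trained by LoRA for one (d x k) weight: B is d x r, A is r x k."""
    return r * (d + k)


# Toy example: a frozen 3x3 weight with a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]  # shape (3, 1)
A = [[0.5, 0.0, 1.0]]      # shape (1, 3)

W_eff = lora_effective_weight(W, B, A, alpha=2.0)

# Hypothetical layer size, chosen only to show the difference in scale.
full = 4096 * 4096
lora = trainable_params(4096, 4096, r=8)

print(W_eff)
print(f"full: {full:,} params, LoRA (r=8): {lora:,} params")
```

Because only B and A are trained, the number of trainable parameters per layer grows with the rank r rather than with the full size of W. That is what makes this kind of fine-tuning feasible on Colab-sized GPUs.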

Every change in XetHub is a Git commit and you can get helpful context on what changed.

In addition, XetHub natively supports rendering of common data formats and model files. Check out this model visualization in our file browser.

Experiment Further

Finetune Code Llama with your own Source Code Repo

We used our very own PyXet library in our example but with just a small change, you can run this Colab notebook to fine-tune Code Llama to generate better code in the context of your own database, library, or other software project!

We recently created the XetCache library for improving the reproducibility & rerun experience in Jupyter Notebook. Let’s see if we can fine-tune Code Llama to generate valid code for us using this library.
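As background, here is a rough sketch of the general idea behind input-keyed memoization in notebooks: hash the inputs and reuse a stored result when the same inputs are seen again. This is not XetCache's implementation or API. The decorator name, the in-memory dictionary, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib
import pickle

_cache = {}  # hypothetical in-memory store; a real tool would persist results


def memoize_by_inputs(func):
    """Cache results keyed on a hash of the function name and its pickled arguments."""
    def wrapper(*args, **kwargs):
        payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        key = hashlib.sha256(payload).hexdigest()
        if key not in _cache:
            _cache[key] = func(*args, **kwargs)
        return _cache[key]
    return wrapper


calls = []


@memoize_by_inputs
def slow_square(x):
    calls.append(x)  # record each real execution
    return x * x


first = slow_square(12)
second = slow_square(12)  # same input: served from the cache, body is not re-run
third = slow_square(5)    # new input: computed and cached

print(first, second, third, calls)
```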

Add a new cell early in the notebook to create and check out a new branch:

!cd {model_repo} && git checkout -b 'finetune/xetcache'

Let’s establish a baseline by asking the Code Llama model to generate some relevant code for us.

eval_prompt = """
Write a DuckDB SQL query that converts the integer 32 to a string with radix base 16.
"""

Change the repo we want to fine-tune on in this cell.

import importlib
myscripts=importlib.import_module(f"{model_repo}.scripts")
import pandas as pd
from datasets import Dataset

# Clones a source code repository as fine tuning data
username='xetdata'
repository='xetcache'
parquet_file = myscripts.create_dataset_from_git_repo(username,repository)
# Optionally you can save this dataset back to the model repo
df = pd.read_parquet(parquet_file)
dataset = Dataset.from_pandas(df, split="train")
# Split once with a fixed seed so the train and eval sets never overlap
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
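The helper `create_dataset_from_git_repo` lives in the model repo's scripts, and its implementation is not shown in this post. As a rough illustration of what turning a source tree into fine-tuning records and splitting it a single time might look like, here is a standard-library sketch. The function names, record fields, and file extensions are hypothetical.

```python
import random
import tempfile
from pathlib import Path


def collect_source_records(root, extensions=(".py", ".md")):
    """Walk a checked-out repo and return one record per matching text file."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions and ".git" not in path.parts:
            text = path.read_text(encoding="utf-8", errors="ignore")
            records.append({"path": str(path.relative_to(root)), "content": text})
    return records


def split_once(records, test_size=0.1, seed=42):
    """Shuffle with a fixed seed and split a single time, so train and eval never overlap."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Demo on a throwaway directory standing in for a cloned repository.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(10):
        Path(tmp, f"module_{i}.py").write_text(
            f"def f{i}():\n    return {i}\n", encoding="utf-8"
        )
    Path(tmp, "notes.bin").write_bytes(b"\x00\x01")  # skipped: extension not listed

    records = collect_source_records(tmp)
    train, test = split_once(records)

print(len(records), len(train), len(test))
```

Splitting once matters because two independent random splits can place the same file in both the training set and the evaluation set, which makes evaluation results look better than they are.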

Run the rest of the cells to run LoRA (which will take a while) and re-evaluate the same prompt. The baseline response was highly inaccurate (XetCache is a Python package, not an npm one):

## Installation
```
npm install -g xetcache-cli
```

## Usage
```
Usage: xetcache-cli [options]

Options:
  -V, --version                 output the version number
  -f, --function <functionName>  The name of the function to memoize
  -o, --output <outputFile>      The output file to write the code snippet to
  -h, --help                     output usage information
```

The fine-tuned response is nearly a perfect match:

## Installation

```
pip install xetcache
```

## Usage

```
import xetcache
xetcache.set_xet_project([give a project name here])

%%xetmemo input=v1,v2 output=v3,v4
```

Use Git to commit the fine-tuned model on our finetune/xetcache branch and push it back to XetHub.

!cd {model_repo} && git add {model_name} && git commit -m "Fine tuned model trained on xetcache" && git

Next Steps

We hope you give this workflow a try! If you have feedback or run into issues, you can join us in our Xet Community Slack.

If you want to learn more about XetHub, check out our homepage here.
