24 GiB materialized 23 GiB stored

Kelton Zhang
7aef67f2c7 6 months ago 8 commits


Building expert language models in different domains with code, data, and models separated on xet branches

This project shows how xet can tie code and data together across machine learning model iterations, using "Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models" as the example. Branch-Train-Merge trains a transformer language model (LM) on one corpus as the "seed" or base LM, then adapts it to different domains by further training the seed LM's parameters on domain data. The resulting expert LMs (ELMs) can also be merged back into a more general model.

Xet comes in by storing the code, data, and experiment artifacts of each LM on a separate branch, e.g., the seed LM on main and the legal LM on the legal branch. For the developer, this keeps the training environment clean; for a user trying out the language models, it makes comparing and deploying different models as simple as switching branches. In terms of cost and performance, xet also deduplicates data across branches. Branch-Train-Merge has 64 domains, which would be hard to manage and inefficient to store without xet.
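As an analogy for deduplication across branches, the sketch below uses plain git in a throwaway repository. It is not a description of how xet deduplicates data internally; it only shows the general content-addressing idea that a file left unchanged on a domain branch resolves to the same stored object as on main. The file names, branch name, and commit identity are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the sketch does not touch a real checkout.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.email "demo@example.com"
git config user.name "Demo"

# Hypothetical shared data committed on main.
echo "shared training corpus" > corpus.txt
git add corpus.txt
git commit -q -m "seed data on main"

# Hypothetical domain branch that adds one new file and leaves corpus.txt alone.
git checkout -q -b legal
echo "legal-only data" > legal.txt
git add legal.txt
git commit -q -m "legal domain data"

# The unchanged file has the same object ID on both branches,
# so its content is stored once and shared.
main_id=$(git rev-parse main:corpus.txt)
legal_id=$(git rev-parse legal:corpus.txt)
echo "main:  $main_id"
echo "legal: $legal_id"
```

Plain git shares whole identical files in this way; how xet handles large data files may differ and is not covered by this sketch.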


The repo is adapted from the "Branch-Train-Merge" repo, which relies on Fairseq for data preprocessing, training, and inference.


Sign up on XetHub and download the git-xet client.

Grab a copy of the repo with:

git xet clone https://xethub.com/keltonzhang/branchTrainMergeLM.git
cd branchTrainMergeLM/fairseq
pip install -e .



Each new domain branch adds only the new, unique artifacts on top of the main branch:

git checkout -b new_domain
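
To illustrate what "only new artifacts on top of main" means, here is a minimal sketch using plain git in a throwaway repository. It assumes that branch commands behave the same way in a git-xet checkout; the file names and commit identity are hypothetical and not taken from this repo.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the sketch does not touch a real checkout.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.email "demo@example.com"
git config user.name "Demo"

# Hypothetical seed LM artifact on main.
echo "seed weights" > seed_model.txt
git add seed_model.txt
git commit -q -m "seed LM"

# Branch off main and commit only the domain-specific artifact.
git checkout -q -b new_domain
echo "domain weights" > new_domain_model.txt
git add new_domain_model.txt
git commit -q -m "new_domain expert LM"

# List the files the domain branch adds or changes relative to main.
added_files=$(git diff --name-only main new_domain)
echo "$added_files"
```

The diff lists only the file committed on the domain branch; everything inherited from main is left out.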

Branch switching

Just git checkout a domain branch and you will have all the code, data, and models needed to experiment with that domain.

Checking out a domain branch takes some time, in this case 5 minutes (screenshot: check_into_domain).
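
The branch switching described above can be sketched with plain git in a throwaway repository. The sketch assumes that checkout works the same way in a git-xet clone; the branch name legal follows the example earlier in this README, while model.txt and the commit identity are hypothetical stand-ins for real model artifacts.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the sketch does not touch a real checkout.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.email "demo@example.com"
git config user.name "Demo"

# Hypothetical seed model on main.
echo "seed LM" > model.txt
git add model.txt
git commit -q -m "seed LM on main"

# Hypothetical legal expert model on its own branch.
git checkout -q -b legal
echo "legal expert LM" > model.txt
git commit -q -a -m "legal expert LM"

# Switch between branches and read the model file each one carries.
report=""
for branch in main legal; do
  git checkout -q "$branch"
  report="$report$branch=$(cat model.txt);"
done
echo "$report"
```

Each checkout replaces the working tree with that branch's files, so the same path holds a different model depending on the branch.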

File List (total items: 6)

| Name | Last Commit | Size | Last Modified |
| btm_shell_scripts | init repo | | 6 months ago |
| btm_utils | init repo | | 6 months ago |
| fairseq | init repo | | 6 months ago |
| .gitattributes | Initial commit | 79 B | 6 months ago |
| LICENSE | init repo | 1.0 KiB | 6 months ago |
| README.md | readmeee | 9.5 KiB | 6 months ago |
