Building expert language models in different domains with code, data, and models separated across xet branches
This project shows how xet can tie code and data together across machine learning model iterations, using "Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models" as the example. Branch-Train-Merge first trains a transformer language model (LM) on one corpus as the "seed" or base LM, then adapts it to different domains by further training the seed LM's parameters on domain-specific data. The resulting ELMs (expert LMs) can also be merged back into a more general model.
This is where xet comes in: the code, data, and experiment artifacts of each LM are stored on separate branches, e.g., the seed LM on main and the legal LM on a legal branch. For the developer, this keeps the training environment clean; for the user trying out the language models, it makes the different models easy to compare and deploy by simply switching branches. In terms of cost and performance, xet also deduplicates data across branches. Branch-Train-Merge uses 64 domains, which would be hard to manage and storage-inefficient without xet.
Grab a copy of the repo with:
git xet clone https://xethub.com/keltonzhang/branchTrainMergeLM.git
cd fairseq
pip install -e .
Adding each new domain branch adds only new and unique artifacts on top of the main branch:
git checkout -b new_domain
Just git checkout a given domain branch and you will have all the code, data, and model needed to experiment with that domain.
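The per-domain branching workflow above can be sketched end to end in a throwaway local repo. This is a minimal illustration using plain git: the domain names below are made up for the example (the real project has 64 domain branches), and the actual repo would of course carry training code, data, and checkpoints on each branch rather than empty commits.

```shell
# Minimal sketch of the per-domain branching model, run in a throwaway repo.
# Domain names are illustrative; the real project uses 64 domain branches.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b main
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "seed LM on main"

for domain in legal medical news; do
  git branch "$domain" main   # each domain branch starts from the seed LM
done

# Switching contexts is just a checkout; only a branch's unique artifacts
# differ from main, and xet deduplicates the shared data across branches.
git checkout -q legal
git checkout -q main
git branch                    # main plus the three domain branches
```

Because every domain branch forks from main, the seed LM's code and data exist once and each branch layers only its own domain artifacts on top.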