Building expert language models in different domains with code, data, and models separated on xet branches
This project shows how xet can tie code and data together across machine learning model iterations, using "Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models" as the example. Branch-Train-Merge trains a transformer language model (LM) on one corpus as the "seed" (base) LM, then adapts it to different domains by further training the seed LM's parameters on domain data. The resulting expert LMs (ELMs) can also be merged back into a more general model.
xet comes in by storing the code, data, and experiment artifacts of each LM on a separate branch, e.g., the seed LM on main and the legal LM on the legal branch. For the developer, this keeps the training environment clean; for the user trying out the language models, it makes comparing and deploying the different models as simple as switching branches. In terms of cost and performance, xet also deduplicates data across branches. Branch-Train-Merge has 64 domains, which would be hard to manage and inefficient to store without xet.
Code
The repo is adapted from the "Branch-Train-Merge" repo, which relies on Fairseq for data preprocessing, training and inference.
Setup
Sign up on XetHub and download the git-xet client.
Grab a copy of the repo with:
git xet clone https://xethub.com/keltonzhang/branchTrainMergeLM.git
cd branchTrainMergeLM/fairseq
pip install -e .
Demo
Deduplication
Each new domain branch adds only its new, unique artifacts on top of the main branch.
git checkout -b new_domain
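To make the idea concrete, here is a minimal sketch that uses plain git in a throwaway repository to show which files a new domain branch contributes relative to main. It does not exercise xet's own deduplication; it only illustrates the branch layout. All file names (`seed_model.txt`, `corpus.txt`, `domain_model.txt`) are hypothetical stand-ins, not files from this repo.

```shell
set -eu

# Throwaway repository; every file name below is a hypothetical stand-in.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any git version
git config user.name "demo"
git config user.email "demo@example.com"

# Shared artifacts live on main (stand-ins for the seed LM and its data).
echo "seed weights" > seed_model.txt
echo "shared corpus" > corpus.txt
git add .
git commit -q -m "seed LM on main"

# The domain branch adds only its own artifact on top of main.
git checkout -q -b new_domain
echo "domain weights" > domain_model.txt
git add domain_model.txt
git commit -q -m "add domain expert"

# List the files that differ between the two branches.
changed=$(git diff --name-only main new_domain)
echo "$changed"
```

Only the file added on `new_domain` is listed; everything inherited from main is shared between the branches.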
Branch switching
Just git checkout a domain branch and you will have all the code, data, and models needed to experiment with that domain.
Checking out a domain branch takes some time, in this case 5 minutes.
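Switching between models by switching branches can be sketched as follows, again with plain git in a throwaway repository. The `legal` branch name follows the example in the introduction; `model.txt` and its contents are hypothetical placeholders for real model artifacts, and a real checkout of large artifacts would take far longer than this toy one.

```shell
set -eu

# Throwaway repository; model.txt is a hypothetical stand-in for a model file.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any git version
git config user.name "demo"
git config user.email "demo@example.com"

# Seed LM on main.
echo "seed" > model.txt
git add model.txt
git commit -q -m "seed LM"

# Legal expert LM on its own branch.
git checkout -q -b legal
echo "legal" > model.txt
git commit -q -am "legal expert LM"

# Switch branches and read the model that each one carries.
git checkout -q main
on_main=$(cat model.txt)
git checkout -q legal
on_legal=$(cat model.txt)
echo "main: $on_main, legal: $on_legal"
```

The working tree always reflects the branch that is checked out, so the same path yields a different model on each branch.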