Building expert language models in different domains with code, data, and models separated across xet branches
This project shows how xet can tie code and data together across machine learning model iterations, using "Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models" as the example. Branch-Train-Merge first trains a transformer language model (LM) on one corpus as the "seed" or base LM, then adapts it to different domains by further training the seed LM's parameters on domain-specific data. The resulting ELMs (expert LMs) can also be merged back into a more general model.
This is where xet comes in: the code, data, and experiment artifacts of each LM are stored on separate branches, e.g., the seed LM on main and the legal LM on a legal branch. For the developer, this keeps the training environment clean; for the user trying out the language models, it makes the different models easy to compare and deploy by simply switching branches. In terms of cost and performance, xet also deduplicates data across branches. Branch-Train-Merge uses 64 domains, which would be hard to manage and storage-inefficient without xet.
Grab a copy of the repo with:
git xet clone https://xethub.com/keltonzhang/branchTrainMergeLM.git
cd fairseq
pip install -e .
Adding each new domain branch adds only new and unique artifacts on top of the main branch:
git checkout -b new_domain
Just git checkout a given domain branch and you will have all the code, data, and model needed to experiment with that domain.
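The per-domain branching workflow above can be sketched end to end in a throwaway local repo. This is a minimal illustration using plain git: the domain names below are made up for the example (the real project has 64 domain branches), and the actual repo would of course carry training code, data, and checkpoints on each branch rather than empty commits.

```shell
# Minimal sketch of the per-domain branching model, run in a throwaway repo.
# Domain names are illustrative; the real project uses 64 domain branches.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b main
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "seed LM on main"

for domain in legal medical news; do
  git branch "$domain" main   # each domain branch starts from the seed LM
done

# Switching contexts is just a checkout; only a branch's unique artifacts
# differ from main, and xet deduplicates the shared data across branches.
git checkout -q legal
git checkout -q main
git branch                    # main plus the three domain branches
```

Because every domain branch forks from main, the seed LM's code and data exist once and each branch layers only its own domain artifacts on top.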