Git submodules
If your project combines existing Git-versioned code with a XetHub asset repo, Git submodules can streamline your workflow.
Submodules allow you to work with a secondary repo as if it were a subdirectory of a primary repo. This is most useful when you want to continue developing your code in GitHub or GitLab while using XetHub to version the assets generated by your code (data, models, other artifacts). To see this in action, skip down to the simple data cleaning example.
Adding a submodule
To add a submodule to an existing project, navigate to the directory in the parent project where you would like to place the submodule and run:
git submodule add <SUBMODULE_URL>
git submodule update --init --recursive
The <SUBMODULE_URL> can match either the main address of the repository or the .git URL (e.g., https://xethub.com/XetHub/Flickr30k or https://xethub.com/XetHub/Flickr30k.git).
This will clone the submodule into a new folder in the current working directory, create a .gitmodules file at the top level of the parent project, and explicitly tell Git to download the contents of the submodule.
Inside .gitmodules you will see an entry for the newly added submodule in the following format:
[submodule "path/to/submodule/submodule_folder"]
path = path/to/submodule/submodule_folder
url = <SUBMODULE_URL>
If you have a project with existing submodules, this entry will be appended to the current .gitmodules file.
To ensure the submodule is tracked in the existing project, commit and push the changes to the parent repository. This will add both the .gitmodules file and a directory entry for the submodule.
In the parent repository, this entry will not contain the submodule's contents, but will rather be a pointer to the submodule at a specific commit hash. All collaborators, when cloning the repository, will have an easy way of accessing the submodule at that commit hash when working with the parent project.
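To make the pointer concrete, here is a minimal sandbox sketch that uses two throwaway local repositories in place of the real XetHub and GitHub repos. The repository names (assets, parent), the commit identity, and the data file are all hypothetical. The protocol.file.allow override is only needed because the sandbox adds a submodule from a local path; an https URL would not need it.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# "assets" stands in for the XetHub repo.
git init -q assets
echo "id,value" > assets/data.csv
git -C assets add data.csv
git -C assets commit -q -m "Add data"

# "parent" stands in for the code repo.
git init -q parent
cd parent
git -c protocol.file.allow=always submodule --quiet add "$work/assets" assets
git commit -q -m "Add assets submodule"

# The .gitmodules entry records the submodule's path and URL.
cat .gitmodules

# The parent tree stores a gitlink entry (mode 160000), not the submodule's files.
entry=$(git ls-tree HEAD assets)
echo "$entry"

# The commit pinned by the parent matches the submodule's checked-out commit.
pinned=$(git rev-parse HEAD:assets)
sub_head=$(git -C assets rev-parse HEAD)
echo "parent pins $pinned; submodule HEAD is $sub_head"
```

The git ls-tree line is the directory entry described above: a single commit hash rather than the submodule's contents.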
While it is possible to use the SSH remote address for the submodule URL, we recommend using https to reduce confusion. Contributors who have not configured their machine to work with SSH may encounter an error while cloning the repository.
Contributors with SSH access can still use SSH locally by running git remote set-url origin <SSH-URL> in the submodule directory after cloning the parent repository.
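As a small illustration, git remote set-url takes the remote's name (usually origin) followed by the new URL. The sketch below runs in a scratch repository rather than a real submodule, and the SSH address is a placeholder, not a real XetHub endpoint.

```shell
set -eu
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Start from the https URL that would be recorded in .gitmodules.
git remote add origin https://xethub.com/XetHub/Flickr30k.git

# Placeholder SSH address (an assumption for illustration only).
ssh_url="git@example.com:XetHub/Flickr30k.git"
git remote set-url origin "$ssh_url"

# Confirm which URL the remote now uses.
current=$(git remote get-url origin)
echo "$current"
```

This only changes the local clone's configuration, so the https URL in .gitmodules stays in place for other contributors. Running git submodule sync later resets the submodule's remote to the .gitmodules URL.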
Editing a submodule
If you modify the contents of a submodule, you must commit these changes to the submodule locally and then push the updates to both remotes. To do this:
- Navigate to the directory where your submodule is located and commit the changes to the submodule.
- Navigate to the project's top-level directory, stage the submodule directory, and commit the updated submodule pointer in the parent project.
- From the parent, run the following:
git push --recurse-submodules=on-demand
This will push commits to both repositories:
- Submodule: records your local changes in the XetHub repository
- Parent project: makes the newest version of the submodule available to collaborators
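The three steps above can be sketched end to end in a sandbox. Two bare local repositories stand in for the XetHub and GitHub remotes; their names (assets.git, parent.git), the data file, and the commit identity are hypothetical. The protocol.file.allow override is only there because the sandbox uses local paths.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# Bare repositories stand in for the two remotes.
git init -q --bare assets.git
git init -q --bare parent.git

# Seed the assets remote with one commit.
git clone -q assets.git seed
echo "id,value" > seed/data.csv
git -C seed add data.csv
git -C seed commit -q -m "Add data"
git -C seed push -q origin HEAD

# Create the parent project with the assets repo as a submodule.
git clone -q parent.git parent
cd parent
git -c protocol.file.allow=always submodule --quiet add "$work/assets.git" assets
git commit -q -m "Add assets submodule"
git push -q -u origin HEAD

# 1. Commit the change inside the submodule.
echo "1,42" >> assets/data.csv
git -C assets commit -q -am "Add a row"

# 2. Record the new submodule commit in the parent project.
git add assets
git commit -q -m "Update assets submodule"

# 3. Push both repositories from the parent.
git -c protocol.file.allow=always push -q --recurse-submodules=on-demand

# Both remotes should now hold the new submodule commit.
sub_head=$(git -C assets rev-parse HEAD)
remote_assets=$(git -C "$work/assets.git" rev-parse HEAD)
remote_pin=$(git -C "$work/parent.git" rev-parse HEAD:assets)
echo "submodule: $sub_head"
echo "assets remote: $remote_assets"
echo "pinned in parent remote: $remote_pin"
```

If the three hashes match, the submodule commit reached its remote and the parent remote points at it.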
Cloning a repository with a submodule
If you are cloning a repository with a XetHub repository included as a submodule and need the submodule artifacts available locally, run:
git clone --recurse-submodules <PARENT_PROJECT_GIT_URL>
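Without the --recurse-submodules flag, Git clones the parent project files but leaves the submodule directory empty. You can populate it afterwards by running git submodule update --init --recursive from the parent project.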
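The difference between the two kinds of clone can be seen in a local sandbox. The repository names (assets, parent, plain, full) and the data file are hypothetical, and the protocol.file.allow override is only needed because the submodule URL here is a local path.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# Stand-in asset repo and a parent project that includes it as a submodule.
git init -q assets
echo "id,value" > assets/data.csv
git -C assets add data.csv
git -C assets commit -q -m "Add data"

git init -q parent
git -C parent -c protocol.file.allow=always submodule --quiet add "$work/assets" assets
git -C parent commit -q -m "Add assets submodule"

# A plain clone creates the submodule directory but leaves it empty.
git clone -q parent plain
plain_before=$(ls -A plain/assets)
echo "plain/assets before update: '${plain_before}'"

# A recursive clone also checks out the submodule's files.
git -c protocol.file.allow=always clone -q --recurse-submodules parent full
ls full/assets

# The plain clone can be filled in after the fact.
git -C plain -c protocol.file.allow=always submodule --quiet update --init --recursive
ls plain/assets
```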
Pulling submodule updates
If you are collaborating on a project and need to ensure the submodule files are current, run:
git pull --recurse-submodules
This will run a git pull followed by a git submodule update, bringing the parent project up to date with its remote and checking out each submodule at the commit recorded by the updated parent project.
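A sandbox sketch of this collaboration flow follows. One working copy (parent) plays the upstream project and another (collab) plays a collaborator. All repository names, the data file, and the commit identity are hypothetical, and the protocol.file.allow override is only needed for the local-path submodule URL.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# Stand-in asset repo and parent project.
git init -q assets
echo "id,value" > assets/data.csv
git -C assets add data.csv
git -C assets commit -q -m "Add data"

git init -q parent
git -C parent -c protocol.file.allow=always submodule --quiet add "$work/assets" assets
git -C parent commit -q -m "Add assets submodule"

# A collaborator clones the project, submodule included.
git -c protocol.file.allow=always clone -q --recurse-submodules parent collab

# Upstream: new data lands in assets, and the parent is updated to pin it.
echo "1,42" >> assets/data.csv
git -C assets commit -q -am "Add a row"
git -C parent/assets pull -q
git -C parent add assets
git -C parent commit -q -m "Update assets submodule"

# The collaborator picks up both changes with one command.
git -C collab -c protocol.file.allow=always pull -q --recurse-submodules

expected=$(git -C assets rev-parse HEAD)
collab_head=$(git -C collab/assets rev-parse HEAD)
echo "expected $expected, collaborator has $collab_head"
```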
Simple data cleaning example
This example demonstrates a simple data cleaning pipeline where:
- A XetHub repository versions all data for the project
- A GitHub repository is responsible for the data cleaning code and populating the XetHub repository with the cleaned data
In the XetHub repository, nsf-awards, the history of awards granted by the United States National Science Foundation is stored as Parquet files, downloaded from Hugging Face. There are 518,285 rows across 4 files, requiring 589 MB of storage. The unprocessed files are held in the raw directory of the XetHub repository.
In the GitHub repository, nsf-awards-cleaning, the XetHub repository is added as a submodule using the command:
git submodule add https://xethub.com/XetHub/nsf-awards
After committing and pushing the changes that add the submodule, the GitHub interface shows the nsf-awards repository as a directory entry with a commit hash pointing to the most recent commit for nsf-awards.
The .gitmodules file in the GitHub repository contains an entry for the XetHub repository. This file is essential for all Git submodule commands to keep both projects in sync with one another. No other folders or files from the XetHub repository are added to the GitHub repository.
From nsf-awards-cleaning, the nsf_cleaning.py file is executed, dropping all rows with null data from the files in the nsf-awards/raw directory and saving the results to the nsf-awards/clean directory (removing 36,681 rows). To version these files in the XetHub repository and ensure the GitHub repository points to the most current version of the submodule, the cleaned files are committed in the submodule, the updated submodule pointer is committed in the parent, and a git push is run from the top level of the GitHub project:
git push --recurse-submodules=on-demand
Subsequent updates to the XetHub repository (e.g., when the data for 2025 is added to the raw directory) can be pulled into this simple cleaning pipeline by running git pull --recurse-submodules in the GitHub project. The newly added data will be available for cleaning and any updates can be committed back to the clean directory in the XetHub repository using the same commands.
Advanced submodule usage
Helpful Git configs for submodules
A downside of Git submodules is the mental overhead of the additional steps and flags they add to your workflow. The following commands configure Git to handle submodules for you and streamline your experience.
For Git versions >=2.14, the following will add the --recurse-submodules option to all Git commands that support it (except git clone):
git config --global submodule.recurse true
This configuration will allow you to always see a summary of changes made to your local submodule when running git status:
git config --global status.submodulesummary 1
Without this, you will only see when there are new commits to the submodule, not what those commits contain.
To always push submodule changes when running git push, without adding the --recurse-submodules flag each time, use:
git config --global push.recurseSubmodules on-demand
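If you would rather try these settings before changing your global configuration, one option is to set them for a single repository by leaving out --global. The sketch below does this in a scratch repository (an assumption for illustration) and reads the values back.

```shell
set -eu
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Without --global, git config writes to this repository's .git/config only.
git config submodule.recurse true
git config status.submodulesummary 1
git config push.recurseSubmodules on-demand

# List the repository-level values that were just set.
git config --local --get-regexp '^(submodule|status|push)\.'
```

Repository-level values take precedence over global ones, so this is also a way to opt a single project in or out.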
Branches
Working with branches and submodules can be tricky as different parent project branches may point to different versions of the submodule. For example, you may have a stable branch in the parent project that points to a stable branch in the XetHub repository, and a similar mapping of two dev branches. Unfortunately, Git does not update the submodule state by default when using git checkout in the parent project.
To ensure you are working with the version of the submodule your parent project expects, always add the --recurse-submodules flag when checking out a branch in the parent project. This will update the submodule to point to the commit expected by the branch in the parent project.
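The following sandbox sketch contrasts a checkout with and without the flag. The repository names, the dev branch, and the v1/v2 data are hypothetical. The -c submodule.recurse=false on the plain checkout simply reproduces Git's default in case you have enabled recursion globally.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# Asset repo with two versions of the data.
git init -q assets
echo "v1" > assets/data.txt
git -C assets add data.txt
git -C assets commit -q -m "v1"
echo "v2" > assets/data.txt
git -C assets commit -q -am "v2"
v1=$(git -C assets rev-parse HEAD~1)
v2=$(git -C assets rev-parse HEAD)

# Parent project: the default branch pins v1, the dev branch pins v2.
git init -q parent
cd parent
git -c protocol.file.allow=always submodule --quiet add "$work/assets" assets
git -C assets checkout -q "$v1"
git add assets
git commit -q -m "Pin assets at v1"
base=$(git rev-parse --abbrev-ref HEAD)
git checkout -q -b dev
git -C assets checkout -q "$v2"
git add assets
git commit -q -m "Pin assets at v2"

# With the flag, the submodule follows the branch being checked out.
git checkout -q --recurse-submodules "$base"
with_flag=$(git -C assets rev-parse HEAD)

# Without it, the parent switches to dev but the submodule stays where it was.
git -c submodule.recurse=false checkout -q dev
without_flag=$(git -C assets rev-parse HEAD)

# git submodule update brings the submodule back in line with the branch.
git submodule --quiet update
after_update=$(git -C assets rev-parse HEAD)

echo "v1=$v1 v2=$v2"
echo "with flag: $with_flag"
echo "without flag: $without_flag"
echo "after update: $after_update"
```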
Private repositories
For private repositories, all prior commands are still applicable. However, every collaborator who requires access to the submodule in the parent project must also have access to the submodule's repository in XetHub. Anyone without that access will encounter a permissions error (Error: Could not read from remote repository.) when cloning the submodule.