
Git submodules

If you want to work on a project that combines your existing Git-versioned code and a XetHub asset repo, leveraging Git submodules can streamline your workflow.

Submodules allow you to work with a secondary repo as if it were a subdirectory of a primary repo. This is most useful when you want to continue developing your code in GitHub or GitLab while using XetHub to version the assets generated by your code (data, models, other artifacts). To see this in action, skip down to the simple data cleaning example.


Adding a submodule

To add a submodule to an existing project, navigate to the directory in the parent project where you would like to place the submodule and run:

git submodule add <SUBMODULE_URL>
git submodule update --init --recursive

The <SUBMODULE_URL> can match either the main address of the repository or the .git URL (e.g., https://xethub.com/XetHub/Flickr30k or https://xethub.com/XetHub/Flickr30k.git).

The first command clones the submodule into a new folder in the current working directory and creates a .gitmodules file at the top level of the parent project. The second explicitly tells Git to download the contents of the submodule.

Inside .gitmodules you will see an entry for the newly added submodule in the following format:

[submodule "path/to/submodule/submodule_folder"]
path = path/to/submodule/submodule_folder
url = <SUBMODULE_URL>

If you have a project with existing submodules, this entry will be appended to the current .gitmodules file.

To ensure the submodule is tracked in the existing project, commit and push the changes to the parent repository. This adds the .gitmodules file and creates a directory entry for the submodule.

In the parent repository, this entry will not contain the submodule's contents, but will rather be a pointer to the submodule at a specific commit hash. All collaborators, when cloning the repository, will have an easy way of accessing the submodule at that commit hash when working with the parent project.
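
To inspect this pointer directly, the sandbox sketch below builds two throwaway local repositories and prints what the parent stores. The names assets and parent, the demo identity, and the local-path URL are assumptions standing in for a real XetHub repository and its parent project. The protocol.file.allow override is needed only because recent Git versions refuse local-path submodule URLs by default.

```shell
set -e
SANDBOX="$(mktemp -d)" && cd "$SANDBOX"
# Throwaway identity so commits work on a machine without Git configured.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical stand-in for the XetHub asset repository.
git init -q assets
git -C assets commit -q --allow-empty -m "Initial assets"

# Hypothetical stand-in for the parent code repository.
git init -q parent
cd parent
# protocol.file.allow is required only because the URL is a local path.
git -c protocol.file.allow=always submodule --quiet add "$SANDBOX/assets" assets
git commit -q -m "Add assets submodule"

# .gitmodules records the submodule's path and URL.
cat .gitmodules
# The tree entry has mode 160000 (a gitlink): a commit hash, not file contents.
git ls-tree HEAD assets
```

In the git ls-tree output, the object type reads commit rather than blob or tree, which is how Git marks a submodule pointer.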

Note

While it is possible to use the SSH remote address for the submodule URL, we recommend using HTTPS to reduce confusion. Contributors who have not configured their machine to work with SSH may encounter an error while cloning the repository.

Contributors with SSH access can still use SSH locally by running git remote set-url origin <SSH-URL> in the submodule directory after cloning the parent repository.
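
As a quick sketch of that switch, the snippet below points a repository's origin remote at an SSH address and prints the result. The throwaway directory and the SSH address format are assumptions for illustration; use the SSH URL shown for your own repository.

```shell
set -e
SANDBOX="$(mktemp -d)" && cd "$SANDBOX"
# Hypothetical stand-in for a submodule directory that was cloned over HTTPS.
git init -q Flickr30k
cd Flickr30k
git remote add origin https://xethub.com/XetHub/Flickr30k.git

# Switch this clone to SSH (assumed address format).
git remote set-url origin git@xethub.com:XetHub/Flickr30k.git
# Confirm which URL the clone now uses.
git remote get-url origin
```

Only the local clone's remote changes. The URL recorded in .gitmodules stays on HTTPS for other contributors.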

Editing a submodule

If you modify the contents of a submodule, you must commit these changes to the submodule locally and then push the updates to both remotes. To do this:

  1. Navigate to the directory where your submodule is located and commit the changes to the submodule.
  2. Navigate to the project's top-level directory and commit the updated submodule reference in the parent project.
  3. From the parent, run the following:
git push --recurse-submodules=on-demand

This pushes commits to both repositories:

  • Submodule: records the local changes in the XetHub repository.
  • Parent project: makes the newest version of the submodule available to collaborators.
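
The three steps above can be sketched end to end in a local sandbox. The bare repositories assets.git and parent.git are hypothetical stand-ins for the XetHub and GitHub remotes, and the file data.txt and the demo identity are invented for illustration. The protocol.file.allow override is needed only because the submodule URL is a local path.

```shell
set -e
SANDBOX="$(mktemp -d)" && cd "$SANDBOX"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Bare repositories play the role of the two remotes.
git init -q --bare assets.git
git init -q --bare parent.git

# Seed the asset remote with one commit so it can be added as a submodule.
git clone -q assets.git seed
git -C seed commit -q --allow-empty -m "Initial assets"
git -C seed push -q -u origin HEAD

# Clone the parent and add the submodule.
git clone -q parent.git parent
cd parent
git -c protocol.file.allow=always submodule --quiet add "$SANDBOX/assets.git" assets
git commit -q -m "Add assets submodule"
git push -q -u origin HEAD

# Step 1: commit the change inside the submodule.
echo "cleaned rows" > assets/data.txt
git -C assets add data.txt
git -C assets commit -q -m "Add cleaned data"

# Step 2: commit the updated submodule pointer in the parent.
git add assets
git commit -q -m "Update assets submodule"

# Step 3: push the submodule commit and the parent commit together.
git push -q --recurse-submodules=on-demand
```

After the final push, both bare remotes hold the new commits, so a collaborator cloning the parent can resolve the submodule pointer.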

Cloning a repository with a submodule

If you are cloning a repository with a XetHub repository included as a submodule and need the submodule artifacts available locally, run:

git clone --recurse-submodules <PARENT_PROJECT_GIT_URL>

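Without the --recurse-submodules flag, Git clones the parent project files but leaves the submodule directory empty. To populate it afterwards, run git submodule update --init --recursive from the parent project.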
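
The effect of a plain clone, and how to recover from it, can be tried in a sandbox. In this sketch the repositories assets, parent, and plain, the file data.txt, and the demo identity are all hypothetical. The protocol.file.allow override is needed only because the submodule URL is a local path.

```shell
set -e
SANDBOX="$(mktemp -d)" && cd "$SANDBOX"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical asset repository containing one file.
git init -q assets
echo "raw rows" > assets/data.txt
git -C assets add data.txt
git -C assets commit -q -m "Add data"

# Hypothetical parent project that includes it as a submodule.
git init -q parent
git -C parent -c protocol.file.allow=always submodule --quiet add "$SANDBOX/assets" assets
git -C parent commit -q -m "Add assets submodule"

# A plain clone creates the submodule directory but leaves it empty.
git clone -q parent plain
FILES_BEFORE="$(ls -A plain/assets)"

# Initialize and download the submodule after the fact.
git -C plain -c protocol.file.allow=always submodule --quiet update --init --recursive
FILES_AFTER="$(ls -A plain/assets)"

echo "before: [$FILES_BEFORE]"
echo "after:  [$FILES_AFTER]"
```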

Pulling submodule updates

If you are collaborating on a project and need to ensure the submodule files are current, run:

git pull --recurse-submodules

This runs a git pull followed by a git submodule update, so the parent project is brought up to date with its remote and the submodule is checked out at the commit the parent project now records.
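
A sandbox sketch of this flow follows. It assumes hypothetical local repositories named assets, parent, and collaborator in place of the XetHub repository, the parent project, and a teammate's clone. The file data.txt and the demo identity are invented, and the protocol.file.allow override is needed only because the submodule URL is a local path.

```shell
set -e
SANDBOX="$(mktemp -d)" && cd "$SANDBOX"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Asset repository and a parent project that includes it as a submodule.
git init -q assets
git -C assets commit -q --allow-empty -m "Initial assets"
git init -q parent
git -C parent -c protocol.file.allow=always submodule --quiet add "$SANDBOX/assets" assets
git -C parent commit -q -m "Add assets submodule"

# A collaborator clones the parent together with its submodule.
git -c protocol.file.allow=always clone -q --recurse-submodules parent collaborator

# New data lands in the asset repository, and the parent records the new commit.
echo "2025 rows" > assets/data.txt
git -C assets add data.txt
git -C assets commit -q -m "Add 2025 data"
git -C parent/assets pull -q
git -C parent add assets
git -C parent commit -q -m "Update assets pointer"

# One command updates the collaborator's parent checkout and submodule.
git -C collaborator -c protocol.file.allow=always pull -q --recurse-submodules
git -C collaborator submodule status
```

The collaborator's submodule ends up at the commit recorded by the parent, which is not necessarily the newest commit on the submodule's remote.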

Simple data cleaning example

This example demonstrates a simple data cleaning pipeline where:

  • A XetHub repository versions all data for the project
  • A GitHub repository is responsible for the data cleaning code and populating the XetHub repository with the cleaned data

In the XetHub repository, nsf-awards, the history of awards granted by the United States National Science Foundation is stored as Parquet files, downloaded from Hugging Face. There are 518,285 rows across 4 files, requiring 589 MB of storage. The unprocessed files are held in the raw directory of the XetHub repository.

In the GitHub repository, nsf-awards-cleaning, the XetHub repository is added as a submodule using the command:

git submodule add https://xethub.com/XetHub/nsf-awards

After committing and pushing the changes that add the submodule, the GitHub interface shows the nsf-awards repository as a directory entry with a commit hash pointing to the most recent commit for nsf-awards:

Screenshot of GitHub submodule view

The .gitmodules file in the GitHub repository contains an entry for the XetHub repository. This file is essential for all Git submodule commands to keep both projects in sync with one another. No other folders or files from the XetHub repository are added to the GitHub repository.

From nsf-awards-cleaning, the nsf_cleaning.py file is executed, dropping all rows with null data from the files in the nsf-awards/raw directory and saving the results to the nsf-awards/clean directory (removing 36,681 rows). To version these files in the XetHub repository and ensure the GitHub repository points to the most current version of the submodule, the changes are committed in the submodule and then in the parent project, and a git push is run from the top level of the GitHub project:

git push --recurse-submodules=on-demand

Subsequent updates to the XetHub repository (e.g., when the data for 2025 is added to the raw directory) can be pulled into this simple cleaning pipeline by running git pull --recurse-submodules in the GitHub project once the parent project records the new submodule commit. To fetch the latest commit from the XetHub repository directly, run git submodule update --remote instead. The newly added data will be available for cleaning and any updates can be committed back to the clean directory in the XetHub repository using the same commands.

Advanced submodule usage

Helpful Git configs for submodules

A downside to using Git submodules is that it adds the mental overhead of additional steps or flags to your workflow. The following commands will configure Git for working with submodules and streamline your experience.

For Git versions >=2.14, the following will add the --recurse-submodules option to all Git commands that support it (except git clone):

git config --global submodule.recurse true

This configuration will allow you to always see a summary of changes made to your local submodule when running git status:

git config --global status.submodulesummary 1

Without this, git status only reports that the submodule has new commits, not what those commits contain.

To always push submodule changes when running git push, without having to pass --recurse-submodules=on-demand each time, use:

git config --global push.recurseSubmodules on-demand
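
To try these settings in a single repository before applying them globally, the same keys can be set without --global. This is a minimal sketch in a throwaway repository; the directory name parent is hypothetical.

```shell
set -e
SANDBOX="$(mktemp -d)" && cd "$SANDBOX"
git init -q parent
cd parent

# Without --global, each setting is written to this repository's .git/config only.
git config submodule.recurse true
git config status.submoduleSummary true
git config push.recurseSubmodules on-demand

# List the repository-level values that were just written.
git config --local --get-regexp 'submodule'
```

Repository-level values take precedence over global ones, so this also works for opting a single project out of a global default.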

Branches

Working with branches and submodules can be tricky as different parent project branches may point to different versions of the submodule. For example, you may have a stable branch in the parent project that points to a stable branch in the XetHub repository, and a similar mapping of two dev branches. Unfortunately, Git does not update the submodule state by default when using git checkout in the parent project.

To ensure you are working with the version of the submodule your parent project expects, always add the --recurse-submodules flag when checking out a branch in the parent project. This will update the submodule to point to the commit expected by the branch in the parent project.
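
A sandbox sketch of this behavior follows. It assumes a hypothetical asset repository named assets with a file model.txt, plus parent branches named stable and dev that record different submodule commits. The demo identity is invented, and the protocol.file.allow override is needed only because the submodule URL is a local path.

```shell
set -e
SANDBOX="$(mktemp -d)" && cd "$SANDBOX"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical asset repository holding the stable version of a file.
git init -q assets
echo "stable" > assets/model.txt
git -C assets add model.txt
git -C assets commit -q -m "Stable model"

# Parent project: the stable branch records the stable asset commit.
git init -q parent
cd parent
git -c protocol.file.allow=always submodule --quiet add "$SANDBOX/assets" assets
git commit -q -m "Add assets submodule"
git branch stable

# The dev branch records a newer commit made inside the submodule.
git checkout -q -b dev
echo "dev" > assets/model.txt
git -C assets commit -q -am "Dev model"
git add assets
git commit -q -m "Point dev at the newer assets commit"

# With the flag, the submodule moves to the commit the target branch records.
git checkout -q --recurse-submodules stable
cat assets/model.txt
```

After the final checkout, assets/model.txt holds the stable version again. A plain git checkout stable would have left the dev version in place and reported the submodule as modified in git status.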

Private repositories

For private repositories, all prior commands still apply. However, every collaborator who needs the submodule in the parent project must also have access to the submodule's repository in XetHub. Anyone without access will encounter a permissions error (Error: Could not read from remote repository.) when cloning the submodule.