Thursday, July 07, 2022

[ubhavbqh] cloning a git submodule and its submodules

if you have a checked-out git repository, and that checkout has a checked-out submodule, you can clone the submodule locally with "git clone /path/to/checkout/path/to/submodule".  in cloning the submodule, git promotes it to a standalone repository.  it does not need to access the network, or more generally, the submodule's origin.  it's nice that this works; git knows to walk up then down the tree from /path/to/checkout/path/to/submodule to /path/to/checkout/.git/modules/path/to/submodule to fetch the history.  (this continues to work even if the submodule is itself nested inside another submodule.)
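as a rough sketch of the behavior described above, the following builds a throwaway superproject with one submodule, deletes the submodule's origin, and then clones the submodule from inside the checkout.  the directory names (lib, super, sub/lib, copy) and the g wrapper are made up for illustration; protocol.file.allow=always is only there because recent git versions refuse to add a submodule from a local path by default.

```shell
#!/bin/bash
set -eu
work=$(mktemp -d)
cd "$work"
# hypothetical wrapper: supplies an identity for commits and permits
# local-path submodules, which newer git versions block by default
g() { git -c user.name=demo -c user.email=demo@example.com -c protocol.file.allow=always "$@"; }

# a tiny repository that plays the role of the submodule's origin
g init -q lib
echo hello > lib/file.txt
g -C lib add file.txt
g -C lib commit -q -m 'initial commit'

# a superproject with that repository checked out as a submodule
g init -q super
g -C super submodule add -q "$work/lib" sub/lib
g -C super commit -q -m 'add submodule'

# delete the origin, so the clone below can only succeed from local data
rm -rf "$work/lib"

# clone the submodule straight out of the superproject's checkout
g clone -q "$work/super/sub/lib" copy
cat copy/file.txt
```

the history comes from super/.git/modules/sub/lib, which git finds by following the .git file inside the submodule's working tree.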

however, if the submodule has nested submodules within it, then cloning the subtree without going over the network (not going to the nested submodules' origins) requires more effort.  this was inspired by a repository with a submodule with a very large nested submodule, and not wanting to go over the network when I already have a local clone, albeit buried inside the submodule hierarchy of another repository.

"git clone --recurse-submodules /path/to/checkout/path/to/submodule" does not work.  although it fetches the submodule from the local path, it tries to fetch the nested submodules from their origins.

(note: if you want to copy an entire repository and its checked-out submodules, you can simply do "cp -r".  here, we want to copy just a submodule and its checked-out submodules.)

the script below accomplishes the task.  it uses "git submodule foreach" to walk the checked-out submodule hierarchy of the source and, in the new clone, temporarily point each nested submodule's url at the corresponding local checkout, then afterward uses "git checkout .gitmodules" and "git submodule sync" to restore each nested submodule's url to its pristine value.  it only copies checked-out nested submodules.

note: as of git 2.30.2, git submodule set-url takes a name, not a path.  this is inconsistent with its documentation.  this only matters for the rare submodule which uses a name different from its path.  an earlier, much uglier, version of the script looked up the name from the path by parsing the output of "git config -f .gitmodules -l".
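for illustration, here is one way such a name lookup could be done.  this is not the earlier version of the script, just a sketch: name_from_path is a hypothetical helper, it uses "git config --get-regexp" with -z rather than parsing "-l" output, and the libfoo / vendor/foo entry is made up.

```shell
#!/bin/bash
# hypothetical helper: print the name of the submodule whose path is $1,
# reading the .gitmodules-style file $2 (default: .gitmodules)
name_from_path() {
    local want=$1 file=${2:-.gitmodules} entry key value
    # with -z, each entry is "submodule.<name>.path", a newline, the path, then NUL
    while IFS= read -r -d '' entry; do
        key=${entry%%$'\n'*}      # everything before the first newline
        value=${entry#*$'\n'}     # everything after it
        if [ "$value" = "$want" ]; then
            key=${key#submodule.}           # strip the leading "submodule."
            printf '%s\n' "${key%.path}"    # strip the trailing ".path"
            return 0
        fi
    done < <(git config -z -f "$file" --get-regexp '^submodule\..*\.path$')
    return 1
}

# demonstration on a made-up .gitmodules whose name differs from its path
demo=$(mktemp)
printf '[submodule "libfoo"]\n\tpath = vendor/foo\n\turl = https://example.com/foo.git\n' > "$demo"
found=$(name_from_path vendor/foo "$demo")
echo "$found"
rm -f "$demo"
```

the -z output keeps names and paths containing spaces intact, which line-based parsing would not.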

although the script copies the files and history, which are the bulk of the data, it fails to copy the references to the remote branches ("git branch -a" shows nothing).  do a "git fetch" (which accesses the network) afterward to get the branches, if you need them.  the first fetch might mysteriously emit the errors "fatal: remote error: upload-pack: not our ref" and "Errors during submodule fetch", but the fetch seems to work perfectly, and subsequent fetches do not repeat the errors.  I don't understand what is going on.

it's nice that the special variables available inside "git submodule foreach", $name, $displaypath, and $sm_path, give us exactly all the information we need.  the percent sign syntax ${displaypath%$sm_path} in bash means "Remove matching suffix pattern", documented in the "Parameter Expansion" section of the bash manpage.
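a small sketch of that suffix removal, using made-up values of the kind "git submodule foreach --recursive" would set (vendor/lib as a submodule, deps/zlib nested inside it).  the variable names parent and top are hypothetical.

```shell
#!/bin/bash
# nested case: vendor/lib is a submodule, deps/zlib is a submodule of vendor/lib
displaypath='vendor/lib/deps/zlib'   # path from the top-level checkout
sm_path='deps/zlib'                  # path inside the immediate parent
parent=${displaypath%$sm_path}       # remove the suffix, leaving the parent's path
echo "$parent"                       # prints vendor/lib/

# top-level case: both variables are equal, so nothing is left
displaypath='vendor/lib'
sm_path='vendor/lib'
top=${displaypath%$sm_path}
echo "[$top]"                        # prints []
```

the suffix is treated as a pattern, so a path containing characters like * or ? would need the inner expansion quoted, as in ${displaypath%"$sm_path"}, to be matched literally.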

#!/bin/bash
set -xeu
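# path to the checked-out submodule to copy (assumed here to be an absolute
# path, since it is reused below as the url of each nested submodule)
source=$1
git clone "$source"
target="$PWD"/$(basename "$source")
cd "$source"
# for each checked-out nested submodule of the source: go to its parent in the
# target, point its url at the local checkout, update it, then restore .gitmodules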
git submodule foreach --recursive 'bash -x -c '"'"'cd "'"$target"'"/"${displaypath%$sm_path}" && git submodule init "$sm_path" && git submodule set-url "$name" "'"$source"'"/"$displaypath" && git submodule update "$sm_path" && git checkout .gitmodules'"'"
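# point the copy's origin at the source's real origin instead of the local path
sourceorigin=$(git config --get remote.origin.url)
cd "$target"
git remote set-url origin "$sourceorigin"
# propagate the urls from the restored .gitmodules files into each repository's config
git submodule sync --recursive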

tested with paths with spaces and quotes in them.

note: git submodule set-url was introduced in git 2.25, so it is not present in older distributions such as Debian Buster.  on those, splice in the following in place of the "git submodule set-url" command:
git config -f .gitmodules submodule."$name".url "'"$source"'"/"$displaypath" && git submodule sync "$sm_path"
