Extracting Code from a git Repo

Today I nerd sniped myself. I wanted to do something completely different, but then decided to extract the JavaScript compressor that is built into DokuWiki into its own repository.

My goal was:

  • have a new repository with only the relevant code
  • have a working git commit history for that code
  • do not have any old cruft in the repo that is unrelated to the code

It turns out this isn't so easy, but in the end I managed.

Quick side note: I used ChatGPT a lot to figure out how this stuff works. It was quite helpful for getting started, but whenever things got more difficult it started to hallucinate command options, which was a bit frustrating. So in the end I had to combine its advice with good old googling and thinking for myself. ;-)

Now to get started, you need to clone the original repo. I used an HTTP checkout, just to make sure that I wouldn't accidentally push anything back to origin.
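As an extra guard against accidental pushes, the push URL of a clone can be pointed at a placeholder. This is a minimal sketch of an additional precaution, not a step from the workflow described here. A throwaway local repository stands in for the real upstream, and `upstream.git`, `work` and `no-pushing-allowed` are made-up names.

```shell
# Throwaway stand-in for the real upstream repository (hypothetical paths).
tmp=$(mktemp -d)
git init -q --bare "$tmp/upstream.git"

# Clone it, then point the push URL at a placeholder that cannot be resolved.
git clone -q "$tmp/upstream.git" "$tmp/work"
git -C "$tmp/work" remote set-url --push origin no-pushing-allowed

# Fetching still uses the real URL; only the push URL was changed.
fetch_url=$(git -C "$tmp/work" remote get-url origin)
push_url=$(git -C "$tmp/work" remote get-url --push origin)
echo "fetch: $fetch_url"
echo "push:  $push_url"
```

With this in place, a stray `git push` fails because the placeholder is not a valid remote, while `git fetch` keeps working.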

In my case I needed only two files and a directory:

  • lib/exe/js.php
  • _test/tests/lib/exe/js_js_compress.test.php
  • _test/tests/lib/exe/js_js_compress/

Additionally, only one function in the js.php file was of interest. So the first step was to edit that file, remove everything but that function and commit the change.

Next, git-filter-repo is the hero of the day. I installed it from the AUR on my Arch Linux system.

# delete all tags
git tag | xargs git tag -d
# delete all branches
git branch | grep -v "master" | xargs git branch -D
# remove everything we don't want
git filter-repo \
  --path '_test/tests/lib/exe/js_js_compress' \
  --path '_test/tests/lib/exe/js_js_compress.test.php' \
  --path 'lib/exe/js.php' \
  --replace-refs delete-no-add \
  --prune-empty always \
  --prune-degenerate always \
  --commit-callback '
    commit.message += b"\n\nOriginal commit:\ndokuwiki/dokuwiki@" + commit.original_id
    '

The above call rewrites the history. The path options tell it which files we want to keep; filter-repo removes all commits that do not touch these files, and strips changes to unrelated files from the commits it keeps.
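Before rewriting anything, plain `git log` with the same paths gives a rough preview of which commits touch the files to keep. The sketch below runs against a throwaway repository with made-up commits and an invented `inc/other.php`; only the three paths come from the real call, and merge commits may be treated differently by filter-repo than by `git log`.

```shell
# Throwaway repository with made-up history (identity set only for this demo).
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Demo"
git -C "$repo" config user.email "demo@example.com"

mkdir -p "$repo/lib/exe" "$repo/inc"
echo "function js_compress() {}" > "$repo/lib/exe/js.php"
git -C "$repo" add . && git -C "$repo" commit -q -m "add js.php"          # touches a kept path
echo "unrelated" > "$repo/inc/other.php"
git -C "$repo" add . && git -C "$repo" commit -q -m "add unrelated file"  # would be dropped
echo "// tweak" >> "$repo/lib/exe/js.php"
git -C "$repo" add . && git -C "$repo" commit -q -m "tweak js.php"        # touches a kept path

# List only the commits that touch the paths passed to filter-repo.
kept=$(git -C "$repo" log --format=%H -- \
  _test/tests/lib/exe/js_js_compress \
  _test/tests/lib/exe/js_js_compress.test.php \
  lib/exe/js.php)
kept_count=$(printf '%s\n' "$kept" | wc -l | tr -d ' ')
total_count=$(git -C "$repo" rev-list --count HEAD)
echo "$kept_count of $total_count commits touch the kept paths"
```

In the demo repository, two of the three commits touch a kept path; the one that only adds `inc/other.php` is the kind of commit the path filter drops.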

I'm not sure how necessary the prune and replace-refs options are.

The commit callback will append the original commit ID to each commit that is kept. Useful if some greater context is needed in the future.
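To get from a rewritten commit back to its origin later, the appended line can be pulled out of the commit message again. This is a minimal sketch using a hard-coded sample message in the format the callback produces; the subject line and the hash are placeholders, not a real DokuWiki commit. In a real repository the message would come from something like `git log -1 --format=%B`.

```shell
# Sample message in the format the callback produces (placeholder subject and hash).
message='Fix string handling in compressor

Original commit:
dokuwiki/dokuwiki@0123456789abcdef0123456789abcdef01234567'

# Print only the line carrying the prefix, with the prefix stripped.
original_id=$(printf '%s\n' "$message" | sed -n 's|^dokuwiki/dokuwiki@||p')
echo "$original_id"
```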

At this point, the repo still contains a whole bunch of commits that touched the js.php file but addressed other functions in that file. Functions we no longer have or care about.

Ideally filter-repo would be able to remove those, too, but I couldn't figure out how. Instead I opted to use git blame to get the commits that are still relevant, then use git filter-branch to remove all other commits:

# Get the commit hashes to keep:
#  We want the newest commit (which removed all unwanted cruft)
#  We want all commits that show up in a git blame on any existing file
commit_hashes=$(git rev-parse HEAD)" "$(git ls-files | xargs -I{} git blame --minimal --abbrev=40 -- {}|grep -vF '^'| awk '{print $1}' | sort -u)
# Rewrite the Git history to remove obsolete commits
if [ "$commit_hashes" != "" ]; then
    git filter-branch --commit-filter '
        if echo "'"$commit_hashes"'" | grep -q "$GIT_COMMIT"; then
            git commit-tree "$@";
        else
            skip_commit "$@";
        fi' HEAD
fi

The result is a git repo with just those commits that address the current code.
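The reason git blame works as a filter here is that it only reports commits that still own at least one line of the current files. A small sketch with a throwaway repository and made-up file contents shows the effect; it uses `git blame -l -s` for easier parsing rather than the exact flags from the script, and `code.txt` is an invented file name.

```shell
# Throwaway repository with made-up content (identity set only for this demo).
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Demo"
git -C "$repo" config user.email "demo@example.com"

printf 'old line\n' > "$repo/code.txt"
git -C "$repo" add code.txt
git -C "$repo" commit -q -m "first version"        # fully overwritten below
printf 'new one\nnew two\n' > "$repo/code.txt"
git -C "$repo" commit -q -a -m "rewrite everything"
printf 'new three\n' >> "$repo/code.txt"
git -C "$repo" commit -q -a -m "append a line"

# Collect the commits that still own a line (-l: full hashes, -s: no author/date).
blamed=$(git -C "$repo" blame -l -s -- code.txt | awk '{print $1}' | sort -u)
blamed_count=$(printf '%s\n' "$blamed" | wc -l | tr -d ' ')
first=$(git -C "$repo" rev-list --max-parents=0 HEAD)
echo "$blamed_count commits still own lines; the first commit $first is not among them"
```

The first commit no longer owns any line, so it does not show up in the blame output and would not end up on the keep list.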

From there it's easy to continue to clean up the repo structure with git mv.
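As a minimal sketch of that last step, the following moves the extracted file to the top level of a throwaway repository. The target name `JSCompress.php`, the file contents and the demo repository are made up for illustration.

```shell
# Throwaway repository holding one file in the old layout (made-up content).
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Demo"
git -C "$repo" config user.email "demo@example.com"

mkdir -p "$repo/lib/exe"
echo "<?php // compressor code" > "$repo/lib/exe/js.php"
git -C "$repo" add lib/exe/js.php
git -C "$repo" commit -q -m "extracted code"

# Move the file to a flatter layout (hypothetical target name) and commit.
git -C "$repo" mv lib/exe/js.php JSCompress.php
git -C "$repo" commit -q -m "flatten directory layout"

# --follow lets git log trace the file across the rename.
history=$(git -C "$repo" log --follow --format=%s -- JSCompress.php)
echo "$history"
```

Because the move is committed as a rename, `git log --follow` still shows the earlier history of the file under its new name.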
