Chapter 16: Reducing Repository Size

People say "large repo" and mean wildly different things. The checkout may be too wide. The object database may carry too much history. A few large blobs may dominate reachable history. Generated files or duplicate binaries may have polluted the repository. Pack layout may be inefficient. Maintenance may be stale. So there is no one "make repo smaller" button. Some tools reduce only local footprint. Some reduce only transfer cost. If you want the canonical repository itself to become smaller, you are probably talking about history rewriting, which is the most invasive option in this chapter.

Start by Separating Local Footprint from Canonical History

The easiest mistake is to conflate three different quantities: how much data is checked out right now, how much data exists in .git/, and how much data exists in the full repository history. Sparse-checkout reduces the first. Partial clone reduces what arrives up front on a client. Neither one shrinks the full history.

If the goal is "make developer machines smaller," then sparse-checkout and partial clone may be enough. But if you want to "make clones and fetches smaller for everyone forever," you are probably talking about history and object reduction. That's non-trivial; history reduction is invasive. Before deleting, rewriting, or repacking anything, make sure the problem actually requires it.

Start with a few questions:

how large is the packed object store?
how many packfiles exist?
are there lots of loose objects?
which blobs are the largest?
are the largest objects reachable from current refs or just historical?

And then apply some basic tooling:

git count-objects -vH
git rev-list --objects --all
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n | tail
git-sizer

git count-objects -vH gives you a quick storage summary. verify-pack is ugly, but it helps find large packed objects. git-sizer is a separate GitHub project, not a built-in, and it identifies whether the problem is blob size, tree width, path count, ref count, history shape, or some combination.

I would usually start with:

git count-objects -vH
git-sizer
git rev-list --objects --all | sort -k2

The first two show size and shape, and the third is the raw object inventory for later blob analysis. Measure before choosing a fix. Many "huge repo" complaints turn out to be one bad database dump, years of tracked build output, or some media directory that never belonged in Git.
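One way to turn that raw inventory into a ranked list is to pipe it through git cat-file --batch-check and keep only the blobs, largest first. This is a minimal sketch: the function names rank_blobs and largest_blobs and the default of 20 rows are my own illustrative choices, not part of Git.

```shell
# rank_blobs [N]: read "type size sha path" lines on stdin (the batch-check
# format used below), keep only blobs, and print "size sha path" with the
# largest first. N defaults to 20, an arbitrary choice for this sketch.
rank_blobs() {
  awk '$1 == "blob" { sub(/^blob /, ""); print }' \
    | sort -k1,1nr \
    | head -n "${1:-20}"
}

# largest_blobs [N]: run the inventory against every ref in the current
# repository. Sizes are uncompressed object sizes in bytes.
largest_blobs() {
  git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' \
    | rank_blobs "$@"
}
```

Calling largest_blobs 10 inside a clone prints the ten largest blobs reachable from any ref. Each object appears once, under one of the paths it was stored at, so a blob that moved between directories shows only one of its names.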

History Rewriting Is the Real Size-Reduction Tool

If you actually want the repository itself to become permanently smaller, you usually have to rewrite history to remove or transform old objects. That can mean:

removing paths entirely from history
replacing large assets with smaller versions
flattening accidental imports
stripping generated directories
moving oversized binaries out of ordinary Git history

Once you do that, old commit IDs are no longer valid because the trees changed, the commits changed, and therefore the history graph changed. This is the same rule as rebase, just at repository scale, and the consequences include:

every branch and tag may need rewriting
downstream clones must be coordinated carefully
open pull requests may become awkward
old SHAs in docs, tickets, or deployment systems may stop being useful

This is major work; in effect it is a repository migration. For repository-scale history rewriting, git filter-repo is a common choice. It is a separately installed tool that ships outside the stock Git build, so make sure git filter-repo -h is available before you build a cleanup plan around it. It is usually a better choice than old filter-branch workflows: faster, less error-prone, better at path filtering and blob stripping, and easier to use in repeatable cleanup scripts. Typical uses include:

remove a path from all history
keep only a subdirectory
rewrite author or path metadata
strip blobs larger than a chosen threshold

Examples:

git filter-repo --path build/ --invert-paths
git filter-repo --strip-blobs-bigger-than 20M
git filter-repo --path src/ --path docs/

Those commands change the repository for real, not just the current checkout, and they can hurt you badly if you run them casually on a live shared repository. The safer pattern is to rehearse in a mirror and measure before and after:

git clone --mirror <url> repo-cleanup.git
cd repo-cleanup.git
git filter-repo --strip-blobs-bigger-than 20M
git count-objects -vH

That lets you measure the storage effect before asking a shared repository to absorb the rewrite. The BFG Repo-Cleaner can still be useful when the job is narrowly:

remove very large blobs
remove known bad file types
replace secrets or obvious junk

It is less general than git filter-repo, but for some cleanup tasks that narrower model is fine.

If BFG is the chosen tool, the safe approach is still mirror-first:

git clone --mirror <url> repo-bfg.git
java -jar bfg.jar --strip-blobs-bigger-than 20M repo-bfg.git
cd repo-bfg.git
git reflog expire --expire=now --all
git gc --prune=now
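Whichever tool does the rewrite, the before-and-after measurement in a rehearsal mirror can be scripted so the numbers are comparable between runs. The sketch below assumes it runs inside the disposable mirror; pack_kib, percent_saved, and measure_rewrite are hypothetical helper names, and the only Git behavior it relies on is the size-pack line (in KiB) of git count-objects -v.

```shell
# pack_kib: extract the size-pack value (KiB) from `git count-objects -v`
# output supplied on stdin.
pack_kib() {
  awk '$1 == "size-pack:" { print $2 }'
}

# percent_saved BEFORE AFTER: integer percentage reduction between two sizes.
percent_saved() {
  awk -v b="$1" -v a="$2" \
    'BEGIN { if (b > 0) printf "%d\n", (b - a) * 100 / b; else print 0 }'
}

# measure_rewrite CMD...: record packed size, run the given rewrite command,
# then report the change. If the number barely moves, old objects may still
# be waiting for reflog expiry and gc.
measure_rewrite() {
  before=$(git count-objects -v | pack_kib)
  "$@" || return 1
  after=$(git count-objects -v | pack_kib)
  echo "size-pack: ${before} KiB -> ${after} KiB ($(percent_saved "$before" "$after")% smaller)"
}
```

For example, measure_rewrite git filter-repo --strip-blobs-bigger-than 20M wraps the filter-repo rehearsal, and measure_rewrite git gc --prune=now can wrap the final cleanup after a BFG run.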

Large Binary Assets Usually Belong Somewhere Else

Sometimes the repo just has a bunch of giant files in it.

There is no perfect answer here. Git LFS is one option when the repository really does need those files in ordinary developer workflows, but normal Git object storage is still the wrong place for them. Large-object promisors (see Chapter 11) are another direction. In many environments, though, the boring answer is the right one: keep large assets in artifact storage, object storage, package registries, model registries, release buckets, or some other system built for large binary distribution.

If you do move large files out of ordinary Git history, whether through Git LFS or another system, the tradeoff changes:

Git may keep only lightweight references in ordinary history
large payloads may live in LFS, a promisor-backed object service, or an external artifact system
clones and fetches stop dragging all large blob content through the normal object pipeline

None of those choices is free. LFS adds its own operational overhead, and it can be slow, depending on the server and the network path to it. External artifact systems split the asset workflow away from ordinary Git usage. Promisor-based designs are interesting, but they are not a universal drop-in answer today.

Policy still matters. If generated files keep reaching history, the repository bloats again. A one-time rewrite helps less than expected if the same habits continue. Once a repository has been reduced, protect it with rules like:

reject oversized files in server-side hooks or CI
keep generated outputs out of tracked history
store releases and build artifacts outside the main repo
keep large assets in the storage system that actually fits them
document what belongs in Git and what belongs adjacent to Git

One concrete CI gate is to fail a branch that introduces new oversized blobs relative to the protected base branch:

base=origin/main
git rev-list --objects "$base"..HEAD \
| git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
| awk '$1 == "blob" && $2 > 20000000 { print; bad = 1 } END { exit bad }'

Replace origin/main and 20000000 with your real protected base ref and size limit. The point is to check newly reachable blobs, not just large files in the working tree.
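The same reachability check can be sketched as a server-side pre-receive hook, which Git feeds one "old new refname" line per updated ref on stdin. Treat this as a starting point under stated assumptions rather than a finished hook: LIMIT is an assumed policy value, and is_zero_id, rev_args, and check_updates are hypothetical helper names.

```shell
#!/bin/sh
# Assumed size limit in bytes; override through the environment if needed.
LIMIT=${LIMIT:-20000000}

# is_zero_id ID: true when the object ID is all zeros, which Git uses to
# signal ref creation (old side) or ref deletion (new side).
is_zero_id() {
  case "$1" in
    *[!0]*) return 1 ;;
    *) return 0 ;;
  esac
}

# rev_args OLD NEW: print rev-list arguments selecting what the update adds.
rev_args() {
  if is_zero_id "$2"; then
    return 0                    # deletion: nothing new to inspect
  elif is_zero_id "$1"; then
    echo "$2 --not --all"       # new ref: objects not reachable from existing refs
  else
    echo "$1..$2"
  fi
}

# check_updates: read ref updates on stdin; return nonzero if any update
# introduces a blob larger than LIMIT.
check_updates() {
  status=0
  while read -r old new ref; do
    args=$(rev_args "$old" "$new")
    [ -n "$args" ] || continue
    # $args is left unquoted on purpose so it splits into separate arguments.
    if ! git rev-list --objects $args \
      | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
      | awk -v limit="$LIMIT" '$1 == "blob" && $2 > limit { print "too large:", $0; bad = 1 } END { exit bad }'
    then
      echo "rejected $ref: blob over $LIMIT bytes" >&2
      status=1
    fi
  done
  return $status
}
```

An installed hook would end with a call to check_updates so that its exit status becomes the hook's verdict. Hosted Git services often expose their own push-size rules instead of raw hooks, so check what your server supports before relying on this.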

Treat Repository Reduction as Migration

A reduction project usually looks like this:

measure the problem
identify exactly which paths or blob classes must go
rehearse the rewrite in a disposable mirror
compare before and after size, clone time, and fetch behavior
communicate the cutover plan clearly
freeze writes during the final migration if needed
publish the rewritten repository and migration instructions

The publish step:

git remote set-url origin <canonical-url>
git push --mirror origin

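Run that only after the communication and write-freeze step, because git push --mirror republishes the entire rewritten ref space, not just one branch tip, and it deletes remote refs that no longer exist locally. Also note that git filter-repo removes the origin remote after rewriting as a safety measure, so in a mirror it has rewritten you may need git remote add origin <canonical-url> in place of set-url.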

Useful migration questions include:

which branches are authoritative?
which tags must survive?
which forks matter?
which CI systems pin SHAs?
who needs to re-clone versus hard-reset?

The bigger the repository and the team, the less you can improvise this.
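The question of which branches and tags survive can be checked mechanically by comparing ref names between an untouched mirror and the rewritten one. A minimal sketch, assuming a pristine mirror in a directory I am calling original.git next to the rewritten repo-cleanup.git; list_refs and missing_refs are hypothetical helper names.

```shell
# list_refs DIR: print the sorted ref names of the repository at DIR.
# The explicit sort keeps the ordering consistent with what comm expects.
list_refs() {
  git -C "$1" for-each-ref --format='%(refname)' | sort
}

# missing_refs BEFORE AFTER: print refs listed in the first sorted file
# that are absent from the second.
missing_refs() {
  comm -23 "$1" "$2"
}

# Typical use during a rehearsal:
#   list_refs original.git     > before.txt
#   list_refs repo-cleanup.git > after.txt
#   missing_refs before.txt after.txt
```

An empty result means every ref name still exists after the rewrite. It says nothing about whether the commits behind those refs still contain what you expect, so spot-check the authoritative branches and release tags by hand as well.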

After a successful history rewrite, the old objects do not vanish instantly from every clone.

You still have to think about:

reflogs
unreachable objects
old packs
mirrors and caches
stale clones on developer machines

Until those are expired or replaced, some copies of the old bulk may still exist. Most rewrites happen in a fresh mirror, and users then re-clone rather than trying to clean every existing clone in place. The rewritten canonical repository can be clean while older local clones still retain unreachable junk until maintenance and expiration catch up. For a clone that is kept and cleaned in place, that cleanup has its own commands:

git reflog expire --expire=now --all
git gc --prune=now

Those commands clean up leftover copies. They do not replace the migration plan.