High Performance Git

Section IV · Large-Repo Operations, Transport, and Scale

Chapter 16

Reducing Repository Size

Pencil sketch of a runner on a harbor path with waterfront buildings and boats ahead.

Sooner or later, a team stops asking "why is this repository slow?" and starts asking a blunter question:

"Why is this repository so big?"

That question can mean several different things:

  - clones and fetches take too long
  - local checkouts use too much disk and too many files
  - the history itself carries years of objects nobody needs anymore

Those are related problems, but they do not all have the same fix.

A repository can feel "large" because:

  - the working tree is too wide for day-to-day work
  - too much data moves over the wire on clone and fetch
  - the canonical object store itself has grown huge

Some techniques only reduce what a user downloads locally. Some only reduce working-tree size. Some rewrite history to make the repository itself smaller for everyone.


Start by Separating Local Footprint from Canonical History

The easiest mistake is to mix up:

  - the local footprint of one clone
  - the canonical size of the repository everyone shares

Sparse-checkout reduces the first problem. Partial clone reduces what arrives up front on a client. Neither one automatically makes the repository itself smaller at the server or history level.

If the goal is a repository that is permanently smaller for everyone, at the server and in every future clone, you are heading toward history reduction.

History reduction is invasive.

Before deleting, rewriting, or repacking anything, measure where the bulk actually lives.

Useful questions include:

  - which blobs are largest, and at which paths do they live?
  - is the bulk blob size, tree width, path count, or ref count?
  - is the growth old history, or still arriving in new commits?

The basic probes are familiar:

git count-objects -vH
git rev-list --objects --all
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n | tail
git-sizer

git count-objects -vH gives you a quick storage summary. verify-pack is ugly, but it helps find large packed objects. git-sizer is often the best first high-level warning system because it tells you whether the problem is blob size, tree width, path count, ref count, history shape, or some nasty combination.

In practice, I would usually start with:

git count-objects -vH
git-sizer
git rev-list --objects --all | sort -k2

The first two tell you how large and oddly shaped the repository is. The third gives you the raw inventory you can join back to large-object investigation.
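The raw inventory becomes useful once you join it to object sizes. A sketch of that join, assuming GNU sort and a clone of the repository at hand:

```shell
# List the 20 largest blobs reachable from any ref, with the path each was seen at.
# %(rest) carries the path that rev-list printed after the object ID.
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '$1 == "blob"' \
| sort -k3 -n \
| tail -n 20
```

Note that %(objectsize) is the uncompressed size; swap in %(objectsize:disk) to rank by packed on-disk cost instead.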

Do this before deciding on a fix. Many teams assume they have a "huge repo" problem when they really have:

  - a working-tree width problem
  - a transfer-volume problem
  - a long-deferred maintenance problem

None of those requires rewriting history.

Use the Safe Wins First

If the repository is merely expensive locally, use the non-destructive tools first.

That usually means:

  - sparse-checkout to narrow the working tree
  - partial clone filters to defer blob downloads
  - bundles or prefetching to cheapen transfers

These help because they change local shape and transfer cost without changing object identity for everyone else.

That is the first bar to clear. If your users only need a smaller local footprint, do not jump straight to history surgery.
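As a concrete sketch of that first bar, assuming REPO_URL stands in for the real remote and that src/ and docs/ are the directories this user actually works in:

```shell
# Partial clone: blob contents are fetched on demand instead of up front.
git clone --filter=blob:none --sparse "$REPO_URL" repo
cd repo

# Sparse-checkout: narrow the working tree to the relevant directories.
git sparse-checkout set src docs
```

Neither command shrinks the canonical repository; both shrink this one clone.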

Normal maintenance can reduce on-disk size without changing history semantics.

Examples:

  - git gc to collect and compress loose objects
  - git repack to consolidate fragmented packs
  - git maintenance to keep those tasks running on a schedule

That can produce meaningful savings, especially on older repositories that accumulated years of loose objects, fragmented packs, or poor compression decisions.
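A minimal before-and-after maintenance pass might look like this; -f is CPU-expensive on big repositories, so treat it as an occasional deep repack, not a routine one:

```shell
git count-objects -vH   # note size-pack before
git repack -adf         # rewrite everything into one pack, recomputing deltas
git count-objects -vH   # compare size-pack after
```

git maintenance start can keep lighter versions of this running on a schedule.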

But maintenance has a hard limit:

If the repository contains ten years of giant PSDs, maintenance can repack them. It cannot wish them away.

So maintenance is often step one, not step two hundred.

Repository size problems often split into two buckets:

  - the contents are wrong
  - the contents are fine, but packed and stored inefficiently

The first bucket usually means large binaries, generated artifacts, or other junk committed into reachable history.

The second bucket usually means years of loose objects, fragmented packs, and missed compression opportunities.

If the contents are wrong, you are usually heading toward policy changes, externalized asset storage, or history rewriting.

If the contents are mostly fine but the packing is poor, repacking and maintenance strategy may produce large wins without changing history at all.

Both cases can make a repository feel huge. Only one of them means the history itself is fundamentally wrong.

In practice, repository bloat usually comes from blobs:

  - binary assets checked in "temporarily"
  - build outputs and archives committed by accident
  - datasets and media that never belonged in source control

Text compresses well. Binary junk often does not.

Even worse, Git is happiest when similar content can delta well across versions. Giant binary artifacts often defeat that. A repository with many large opaque binaries is bigger, less compressible, and more expensive to move.

So the main question becomes:

"Which blobs should never have been in reachable history?"

True size reduction starts there.

History Rewriting Is the Real Size-Reduction Tool

If you want the repository itself to become permanently smaller, you usually have to rewrite history to remove or transform old objects.

That can mean:

  - deleting paths that should never have been tracked
  - stripping blobs above a size threshold
  - keeping only the subset of paths that still matters

Once you do that, old commit IDs are no longer valid because the trees changed, the commits changed, and therefore the history graph changed.

This is the same rule as rebase, just on a much larger scale.

That has consequences:

Do not present this as a harmless cleanup. It is repository migration.

For repository-scale history rewriting, git filter-repo is the practical default.

It is a separately installed tool, not part of a stock Git build. Confirm git filter-repo -h works on your machine before you plan a cleanup around it.

It is usually a better choice than old filter-branch workflows because it is:

  - much faster
  - safer by default, refusing to run in a non-fresh clone without --force
  - far more expressive about path, blob, and message filtering

Typical uses include removing junk directories, stripping oversized blobs, and extracting only the paths worth keeping.

Examples:

git filter-repo --path build/ --invert-paths        # drop build/ from every commit
git filter-repo --strip-blobs-bigger-than 20M       # drop all blobs over 20 MB
git filter-repo --path src/ --path docs/            # keep only src/ and docs/

Those commands are powerful because they change the repository for real, beyond the current checkout.

They are also powerful enough to hurt you badly if run casually on a live shared repository.

The safer pattern is to rehearse in a mirror and measure before and after:

git clone --mirror <url> repo-cleanup.git
cd repo-cleanup.git
git filter-repo --strip-blobs-bigger-than 20M
git count-objects -vH

A mirror rehearsal cannot answer every migration question, but it does make the storage effect measurable before you ask a shared repository to absorb the rewrite.
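It is also worth confirming in the mirror that the offending content is really unreachable, not just repacked. Assuming the build/ removal from the earlier example:

```shell
# Should print nothing once no surviving commit touches build/.
git log --all --oneline -- build/

# Compare size-pack against the number recorded before the rewrite.
git count-objects -vH
```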

The BFG Repo-Cleaner can still be useful when the job is narrowly:

  - strip all blobs bigger than a threshold, or
  - delete specific files or replace leaked text everywhere in history

It is less general than git filter-repo, but for some cleanup tasks that narrower model is fine.

If BFG is the chosen tool, the shape is still mirror-first:

git clone --mirror <url> repo-bfg.git
java -jar bfg.jar --strip-blobs-bigger-than 20M repo-bfg.git
cd repo-bfg.git
git reflog expire --expire=now --all
git gc --prune=now

Like git filter-repo, BFG is a separate tool. Verify the exact launcher on your machine before you write a cleanup runbook around it.

These are history-rewrite tools, not storage-maintenance tools.

They change object identity.

Large Binary Assets Usually Belong Somewhere Else

Sometimes the right answer is not "keep smaller copies in Git." It is "do not keep these assets in ordinary Git history at all."

Candidates include:

  - design assets such as PSDs, video, and audio
  - compiled artifacts, installers, and archives
  - datasets and trained models

There is no perfect answer here.

Git LFS is one option when the repository still needs those files in the workflow but ordinary Git object storage is the wrong place for them. Large-object promisors are another interesting direction, using promisor-style object retrieval for large payloads. In many environments, though, the more boring answer is still the better one: keep large assets in artifact storage, object storage, package registries, model registries, release buckets, or some other system that is already designed for large binary distribution.

A future-looking LFS setup is explicit:

git lfs version
git lfs install
git lfs track "*.psd" "*.zip" "models/**"
git add .gitattributes
git commit -m 'track large assets with Git LFS'

git lfs is also a separate tool, so git lfs version should succeed before you count on this workflow. This changes future commits. It does not rewrite the large blobs already buried in old history.
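If the goal is to pull already-committed blobs into LFS as well, git lfs migrate can do that, but it is a history rewrite with all the consequences described earlier, so it belongs in a disposable mirror first. A hedged sketch, with the patterns as placeholders:

```shell
git clone --mirror <url> repo-lfs.git
cd repo-lfs.git

# Rewrite every ref so blobs matching the patterns become LFS pointer files.
git lfs migrate import --everything --include="*.psd,*.zip"
```

Commit IDs change, exactly as with git filter-repo.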

That changes the tradeoff:

  - future clones stop carrying every historical version of every large asset
  - in exchange, you depend on an LFS server and on each user having the extension installed

None of those choices is free. LFS adds its own operational overhead. External artifact systems separate the assets from ordinary Git workflows. Promisor-based designs are promising, but they are not a universal drop-in answer today.

This does not remove the need for policy. If people keep committing giant generated files into ordinary Git history, the repository will bloat again.

A one-time rewrite helps less than people hope if the same habits continue.

Once a repository has been reduced, protect it with rules like:

  - blob-size limits enforced in CI or in pre-receive hooks
  - .gitattributes tracking rules so new large assets land in LFS or external storage
  - review scrutiny for any new binary file type

One concrete CI gate is to fail a branch that introduces new oversized blobs relative to the protected base branch:

base=origin/main
git rev-list --objects "$base"..HEAD \
| git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
| awk '$1 == "blob" && $2 > 20000000 { print; bad = 1 } END { exit bad }'

Replace origin/main and 20000000 with your real protected base ref and size limit. The point is to check newly reachable blobs, not only working-tree file size.
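The same idea can run server-side, rejecting the push outright. A sketch of a hypothetical pre-receive hook under the same 20 MB assumption:

```shell
#!/bin/sh
# pre-receive: refuse pushes that introduce blobs larger than the limit.
limit=20000000
zero=$(printf '%040d' 0)

while read old new ref; do
  # Skip ref deletions; there is nothing new to inspect.
  [ "$new" = "$zero" ] && continue

  # Inspect only objects not already reachable on the server.
  git rev-list --objects "$new" --not --all \
  | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
  | awk -v limit="$limit" \
      '$1 == "blob" && $2 > limit { print "oversized blob:", $3; bad = 1 }
       END { exit bad }' || exit 1
done
```

The hook path, exact limit, and message are illustrative; the mechanism is the same newly-reachable-blob check as the CI gate.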

Without policy, repository reduction is just recurring surgery.

Treat Repository Reduction as Migration

A real reduction project should usually look like this:

  1. measure the problem
  2. identify exactly which paths or blob classes must go
  3. rehearse the rewrite in a disposable mirror
  4. compare before and after size, clone time, and fetch behavior
  5. communicate the cutover plan clearly
  6. freeze writes during the final migration if needed
  7. publish the rewritten repository and migration instructions

That planning is how you avoid turning size cleanup into a multi-week repository outage.

The publish step should also be explicit:

git remote set-url origin <canonical-url>
git push --mirror origin

Run that only after the communication and write-freeze step, because git push --mirror republishes the rewritten ref space, not just one branch tip.

Useful migration questions include:

  - who owns the open branches and pull requests that reference old commit IDs?
  - which automation, pins, or deployments reference hashes that will stop existing?
  - how long does the old repository stay readable, and when is it archived?

The bigger the repository and the team, the less you can improvise this.

After a successful history rewrite, the old objects do not vanish instantly from every clone.

You still have to think about:

  - reflogs that keep old commits reachable in existing clones
  - forks and mirrors that still hold the old objects
  - local clones that will not shrink until they re-clone or aggressively expire and GC

Until those are expired or replaced, some copies of the old bulk may still exist.

Teams often do the rewrite in a fresh mirror and then have users re-clone rather than trying to coax every existing clone into a perfectly cleaned final state.

The rewritten canonical repository can be clean. Existing local clones may lag behind or retain unreachable junk until maintenance and expiration catch up.

That cleanup has its own commands:

git reflog expire --expire=now --all
git gc --prune=now

Those commands do not replace the migration plan, but they do explain why "we rewrote history" and "the old bulk is gone everywhere" are not the same moment.

If the reduction plan becomes too complicated, one honest answer is to start a fresh repository and archive the old one as read-only history.

This is especially reasonable when:

  - the old history is mostly noise that nobody queries
  - the rewrite risk outweighs the value of continuous history
  - the team can absorb a clean cutover more easily than a long migration

Sometimes a clean break is operationally safer than a heroic rewrite.

Repository reduction is about cost control.

A smaller repository can mean:

  - faster clones and fetches
  - cheaper CI checkouts
  - lower storage and bandwidth bills on the server

But the right method depends on which cost you are actually paying.

If the problem is local checkout width, use sparse tools.

If the problem is transfer volume, use partial clone, prefetch strategy, bundles, and better transport shape.

If the problem is years of bad historical objects, use a deliberate history rewrite with a migration plan.

"Reducing repository size" only becomes a useful phrase once you say which one you mean.

Sparse-checkout reduces working-tree size, not repository history. Partial clone reduces what arrives up front, not what exists canonically. Maintenance can store the same history more efficiently, but it cannot delete bad history. Real repository shrinkage usually means removing blobs from history. git filter-repo is usually the right tool for serious cleanup. Large assets often belong outside ordinary Git history, but there is no perfect universal replacement.