People say "large repo" and mean wildly different things. The checkout may be too wide. The object database may carry too much history. A few large blobs may dominate reachable history. Generated files or duplicate binaries may have polluted the repository. Pack layout may be inefficient. Maintenance may be stale. So there is no one "make repo smaller" button. Some tools reduce only local footprint. Some reduce only transfer cost. If you want the canonical repository itself to become smaller, you are probably talking about history rewriting, which is the most invasive option in this chapter.
Start by Separating Local Footprint from Canonical History
The easiest mistake is to mix up three quantities: how much data is checked out right now, how much data exists in .git/, and how much data exists in the full repository history. Sparse-checkout reduces the first. Partial clone reduces what arrives up front on a client, which shrinks the second. Neither fixes the third.
If the goal is "make developer machines smaller," then sparse-checkout and partial clone may be enough. If the goal is "make clones and fetches smaller for everyone forever," you are probably talking about history and object reduction, which is invasive. Before deleting, rewriting, or repacking anything, make sure the problem actually requires it.
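One way to keep those quantities apart is to measure the first two directly. The sketch below assumes an ordinary non-bare clone whose .git directory sits at the top of the working tree; repo_footprint is a hypothetical helper name, not a Git command.

```shell
#!/usr/bin/env bash
# Sketch: split a clone's disk usage into "checked out now" and "stored in .git".
# Assumption: .git is a real directory at the top of the working tree
# (not a linked worktree or submodule, where .git is a small file).
repo_footprint() {
  local repo="${1:-.}"
  local top total_kib git_kib
  top=$(git -C "$repo" rev-parse --show-toplevel) || return 1
  # du -sk reports kilobytes; the total includes .git, so subtract it.
  total_kib=$(du -sk "$top" | awk '{ print $1 }')
  git_kib=$(du -sk "$top/.git" | awk '{ print $1 }')
  echo "worktree_kib=$(( total_kib - git_kib ))"
  echo "git_dir_kib=$git_kib"
}
```

The third quantity, full history, has to be measured in a complete clone or mirror, because a partial or shallow clone's .git understates it.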
Start with a few questions:
- how large is the packed object store?
- how many packfiles exist?
- are there lots of loose objects?
- which blobs are the largest?
- are the largest objects reachable from current refs or just historical?
And then apply some basic tooling:
git count-objects -vH
git rev-list --objects --all
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n | tail
git-sizer
git count-objects -vH gives you a quick storage summary. verify-pack is ugly, but it helps find large packed objects: the third column it prints is the object size, which is what sort -k3 -n orders by. git-sizer is a separate GitHub project, not a built-in, and it identifies whether the problem is blob size, tree width, path count, ref count, history shape, or some combination.
I would usually start with:
git count-objects -vH
git-sizer
git rev-list --objects --all | sort -k2
The first two show size and shape, and the third is the raw object inventory for later blob analysis. Measure before choosing a fix. Many "huge repo" complaints turn out to be one bad database dump, years of tracked build output, or some media directory that never belonged in Git.
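The raw inventory becomes useful once it is joined with object sizes. The following is a minimal sketch, not a polished tool: largest_blobs is a hypothetical helper name, and the sizes are uncompressed object sizes rather than on-disk sizes (git cat-file also offers %(objectsize:disk) for the latter). It also answers the earlier question of whether a large blob is still in the current tree or only in history.

```shell
#!/usr/bin/env bash
# Sketch: rank blobs reachable from any ref by uncompressed size and mark
# whether each one is still present in the current HEAD tree.
# Run from inside the repository. The count argument defaults to 10.
largest_blobs() {
  local count="${1:-10}"
  local head_blobs where
  # Blob IDs in the current HEAD tree (empty if HEAD does not exist yet).
  head_blobs=$(git ls-tree -r HEAD 2>/dev/null | awk '{ print $3 }')
  git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | awk '$1 == "blob"' \
    | sort -k3,3nr \
    | head -n "$count" \
    | while read -r _type oid size path; do
        if printf '%s\n' "$head_blobs" | grep -qxF "$oid"; then
          where=current
        else
          where=historical
        fi
        # Columns: size, current/historical, object ID, path.
        printf '%s\t%s\t%s\t%s\n' "$size" "$where" "$oid" "$path"
      done
}
```

A blob marked historical is no longer in the HEAD tree, so removing it from the checkout has already happened and only a history rewrite will remove it from clones.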
History Rewriting Is the Real Size-Reduction Tool
If you actually want the repository itself to become permanently smaller, you usually have to rewrite history to remove or transform old objects. That can mean:
- removing paths entirely from history
- replacing large assets with smaller versions
- flattening accidental imports
- stripping generated directories
- moving oversized binaries out of ordinary Git history
Once you do that, old commit IDs are no longer valid because the trees changed, the commits changed, and therefore the history graph changed. This is the same rule as rebase, just at repository scale, and the consequences include:
- every branch and tag may need rewriting
- downstream clones must be coordinated carefully
- open pull requests may become awkward
- old SHAs in docs, tickets, or deployment systems may stop being useful
This is major work. It is effectively a repository migration. For repository-scale history rewriting, git filter-repo is a common choice. It is a separately installed tool that ships outside the stock Git build, so make sure git filter-repo -h is available before you build a cleanup plan around it. It is usually a better choice than old git filter-branch workflows: faster, less error-prone, better at path filtering and blob stripping, and easier to use in repeatable cleanup scripts. Typical uses include:
- remove a path from all history
- keep only a subdirectory
- rewrite author or path metadata
- strip blobs larger than a chosen threshold
Examples:
git filter-repo --path build/ --invert-paths
git filter-repo --strip-blobs-bigger-than 20M
git filter-repo --path src/ --path docs/
Those commands change the repository for real, not just the current checkout, and they can hurt you badly if you run them casually on a live shared repository. The safer pattern is to rehearse in a mirror and measure before and after:
git clone --mirror <url> repo-cleanup.git
cd repo-cleanup.git
git filter-repo --strip-blobs-bigger-than 20M
git count-objects -vH
That lets you measure the storage effect before asking a shared repository to absorb the rewrite.
The BFG Repo-Cleaner can still be useful when the job is narrow:
- remove very large blobs
- remove known bad file types
- replace secrets or obvious junk
It is less general than git filter-repo, but for some cleanup tasks that narrower model is fine.
If BFG is the chosen tool, the safe approach is still mirror-first:
git clone --mirror <url> repo-bfg.git
java -jar bfg.jar --strip-blobs-bigger-than 20M repo-bfg.git
cd repo-bfg.git
git reflog expire --expire=now --all
git gc --prune=now
Large Binary Assets Usually Belong Somewhere Else
Sometimes the repo just has a bunch of giant files in it.
There is no perfect answer here. Git LFS is one option when the repository really does need those files in ordinary developer workflows, but normal Git object storage is still the wrong place for them. Large-object promisors (see Chapter 11) are another direction. In many environments, though, the boring answer is the right one: keep large assets in artifact storage, object storage, package registries, model registries, release buckets, or some other system built for large binary distribution.
Moving large payloads out of ordinary Git history, whether through Git LFS or one of the other options, changes the tradeoff:
- Git may keep only lightweight references in ordinary history
- large payloads may live in LFS, a promisor-backed object service, or an external artifact system
- clones and fetches stop dragging all large blob content through the normal object pipeline
None of those choices is free. LFS adds its own operational overhead, and it is often slow. External artifact systems split the asset workflow away from ordinary Git usage. Promisor-based designs are interesting, but they are not a universal drop-in answer today.
Policy still matters. If generated files keep reaching history, the repository bloats again. A one-time rewrite helps less than expected if the same habits continue. Once a repository has been reduced, protect it with rules like:
- reject oversized files in server-side hooks or CI
- keep generated outputs out of tracked history
- store releases and build artifacts outside the main repo
- keep large assets in the storage system that actually fits them
- document what belongs in Git and what belongs adjacent to Git
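The first rule in that list, rejecting oversized files on the server, could be sketched as the core of a pre-receive hook. This is an illustration under assumptions, not a drop-in hook: oversized_new_blobs and reject_oversized_push are hypothetical helper names, the 20000000-byte default is only an example, and hosted Git services often expose their own push-size controls instead of raw hooks.

```shell
#!/usr/bin/env bash
# Sketch of the size check a pre-receive hook could run.

# Print "size path" for each blob over <limit_bytes> that <new_rev> would
# introduce beyond what the repository's existing refs already reach.
oversized_new_blobs() {
  local new_rev="$1" limit_bytes="$2"
  git rev-list --objects "$new_rev" --not --all \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk -v limit="$limit_bytes" \
        '$1 == "blob" && $2 > limit { sub(/^blob /, ""); print }'
}

# Read "<old> <new> <ref>" lines, as a pre-receive hook receives them on
# stdin, and return non-zero if any pushed ref introduces an oversized blob.
reject_oversized_push() {
  local limit_bytes="${1:-20000000}"
  local old new ref found status=0
  while read -r old new ref; do
    # An all-zero new ID means the ref is being deleted; nothing to check.
    case "$new" in
      *[!0]*) ;;
      *) continue ;;
    esac
    found=$(oversized_new_blobs "$new" "$limit_bytes")
    if [ -n "$found" ]; then
      printf 'rejected %s: blobs over %s bytes:\n%s\n' \
        "$ref" "$limit_bytes" "$found" >&2
      status=1
    fi
  done
  return "$status"
}
```

A hook script would define these functions and end with a call such as reject_oversized_push 20000000, so that a non-zero status refuses the whole push.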
One concrete CI gate is to fail a branch that introduces new oversized blobs relative to the protected base branch:
base=origin/main
git rev-list --objects "$base"..HEAD \
| git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
| awk '$1 == "blob" && $2 > 20000000 { print; bad = 1 } END { exit bad }'
Replace origin/main and 20000000 with your real protected base ref and size limit. The point is to check newly reachable blobs, not just large files in the working tree.
Treat Repository Reduction as Migration
A reduction project usually looks like this:
- measure the problem
- identify exactly which paths or blob classes must go
- rehearse the rewrite in a disposable mirror
- compare before and after size, clone time, and fetch behavior
- communicate the cutover plan clearly
- freeze writes during the final migration if needed
- publish the rewritten repository and migration instructions
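For the comparison step in that list, the packed size reported by git count-objects is a reasonable first number to record. The following is a small sketch; pack_kib and report_reduction are hypothetical helper names, and the directory names in the usage comment are placeholders. Clone time and fetch behavior still need their own measurements.

```shell
#!/usr/bin/env bash
# Sketch: compare packed object-store size before and after a rehearsal.

# Print the size-pack value (in KiB) that git count-objects -v reports.
pack_kib() {
  git -C "${1:-.}" count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Print the absolute and percentage reduction between two KiB figures.
report_reduction() {
  local before="$1" after="$2"
  if [ "$before" -le 0 ]; then
    echo "saved_kib=0 percent=0"
    return 0
  fi
  echo "saved_kib=$(( before - after )) percent=$(( (before - after) * 100 / before ))"
}

# Usage (placeholder directory names):
#   before=$(pack_kib original-mirror.git)
#   after=$(pack_kib repo-cleanup.git)
#   report_reduction "$before" "$after"
```

The figure only covers packed objects, so compare repositories that have both been freshly cloned or repacked.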
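The publish step, run from inside the rewritten mirror. git filter-repo normally removes the origin remote as part of its cleanup, so if set-url reports that the remote does not exist, use git remote add origin <canonical-url> instead: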
git remote set-url origin <canonical-url>
git push --mirror origin
Run that only after the communication and write-freeze step, because git push --mirror republishes the rewritten ref space, not just one branch tip.
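Because a mirror push is hard to take back, a dry run beforehand is cheap insurance. The sketch below is one possible preflight, with preflight_publish as a hypothetical helper name; it assumes the remote has already been configured and changes nothing on it.

```shell
#!/usr/bin/env bash
# Sketch: checks to run in the rewritten repository before the real push.
preflight_publish() {
  local remote="${1:-origin}"
  # Stop early if the rewritten object store is not internally consistent.
  git fsck --no-progress --no-dangling >/dev/null || return 1
  # List every ref a mirror push would publish, for a last human review.
  git for-each-ref --format='%(refname) %(objectname:short)'
  # Ask Git what the push would do without updating the remote.
  git push --mirror --dry-run "$remote"
}
```

Only when the ref list and the dry-run output match the cutover plan is it time to run the real git push --mirror.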
Useful migration questions include:
- which branches are authoritative?
- which tags must survive?
- which forks matter?
- which CI systems pin SHAs?
- who needs to re-clone versus hard-reset?
The bigger the repository and the team, the less you can improvise this.
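The question about pinned SHAs can be checked mechanically once a rehearsal mirror exists. The sketch below assumes you have already collected the pinned commit IDs into a list, one per line; missing_commits is a hypothetical helper name and the file and directory names in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: read commit IDs on stdin and print the ones that do not exist
# as commits in the given (rewritten) repository.
missing_commits() {
  local repo="${1:-.}"
  local sha
  while read -r sha; do
    [ -n "$sha" ] || continue
    # cat-file -e exits non-zero when the object is absent or not a commit.
    git -C "$repo" cat-file -e "${sha}^{commit}" 2>/dev/null \
      || printf '%s\n' "$sha"
  done
}

# Usage (placeholder names):
#   missing_commits repo-cleanup.git < pinned-shas.txt
```

Every ID it prints is a reference that some pipeline, ticket, or document will need updated after the cutover.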
After a successful history rewrite, the old objects do not vanish instantly from every clone.
You still have to think about:
- reflogs
- unreachable objects
- old packs
- mirrors and caches
- stale clones on developer machines
Until those are expired or replaced, some copies of the old bulk may still exist. Most rewrites happen in a fresh mirror, then users re-clone rather than trying to clean every existing clone in place. The rewritten canonical repository can be clean while older local clones still retain unreachable junk until maintenance and expiration catch up. That cleanup has its own commands:
git reflog expire --expire=now --all
git gc --prune=now
Those commands clean up leftover copies. They do not replace the migration plan.
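To see whether an existing clone still carries leftovers, you can count the objects that nothing but reflogs keeps alive. This is a rough sketch; unreachable_count is a hypothetical helper name. It relies on git fsck --unreachable with --no-reflogs, which ignores reflog entries when deciding what is reachable.

```shell
#!/usr/bin/env bash
# Sketch: count objects that no ref reaches, ignoring reflog entries.
unreachable_count() {
  git -C "${1:-.}" fsck --unreachable --no-reflogs --no-progress 2>/dev/null \
    | grep -c '^unreachable' || true
}

# Usage:
#   unreachable_count .   # before cleanup: often non-zero
#   git reflog expire --expire=now --all
#   git gc --prune=now
#   unreachable_count .   # after cleanup: expected to drop to 0
```

A non-zero count after cleanup usually means something other than a reflog, such as another ref or a pack marked to be kept, is still holding the old objects.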