Sooner or later, a team stops asking "why is this repository slow?" and starts asking a blunter question:
"Why is this repository so big?"
That question can mean several different things:
- clone takes too long
- fetch transfers too much
- CI spends too long checking out history
- developer machines keep filling up
- backup, mirroring, and migration are more expensive than they should be
Those are related problems, but they do not all have the same fix.
A repository can feel "large" because:
- the current checkout is too wide
- the object database contains too much history
- too many large blobs exist in reachable history
- too many duplicate or near-duplicate binary assets were committed
- packfiles or delta relationships are storing the history inefficiently
- maintenance has not compacted the repository well
Some techniques only reduce what a user downloads locally. Some only reduce working-tree size. Some rewrite history to make the repository itself smaller for everyone.
Start by Separating Local Footprint from Canonical History
The easiest mistake is to mix up:
- how much data is checked out right now
- how much data exists in .git/
- how much data exists in the full repository history
Sparse-checkout reduces the first problem. Partial clone reduces what arrives up front on a client. Neither one automatically makes the repository itself smaller at the server or history level.
If the goal is:
- "make developer machines smaller," sparse-checkout and partial clone may be enough
- "make clones and fetches smaller for everyone forever," you are probably talking about history and object reduction
History reduction is invasive.
Before deleting, rewriting, or repacking anything, measure where the bulk actually lives.
Useful questions include:
- how large is the packed object store?
- how many packfiles exist?
- are there lots of loose objects?
- which blobs are the largest?
- are the largest objects reachable from current refs or just historical?
The basic probes are familiar:
git count-objects -vH
git rev-list --objects --all
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n | tail
git-sizer
git count-objects -vH gives you a quick storage summary. verify-pack is ugly, but it helps find large packed objects. git-sizer is often the best first high-level warning system because it tells you whether the problem is blob size, tree width, path count, ref count, history shape, or some nasty combination.
In practice, I would usually start with:
git count-objects -vH
git-sizer
git rev-list --objects --all | sort -k2
The first two tell you how large and oddly shaped the repository is. The third gives you the raw inventory you can join back to large-object investigation.
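One way to do that join is a single pipeline that lists every reachable blob with its size and path, largest last. This is a sketch of the standard pattern, run from inside the repository being measured:

```shell
# List reachable blobs as "<size> blob <sha> <path>", sorted so the
# largest blobs land at the bottom of the output.
git rev-list --objects --all \
  | git cat-file --batch-check='%(objectsize) %(objecttype) %(objectname) %(rest)' \
  | awk '$2 == "blob"' \
  | sort -n \
  | tail -20
```

The size-first format exists so that a plain numeric sort does the ranking; the trailing `%(rest)` carries the path that rev-list attached to each object.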
Do this before deciding on a fix. Many teams assume they have a "huge repo" problem when they really have:
- one accidentally committed database dump
- years of versioned build artifacts
- a media directory that should never have lived in Git
- repeated vendor drops of large binaries
Take the Safe Wins First
If the repository is merely expensive locally, use the non-destructive tools first.
That usually means:
- sparse-checkout
- sparse-index
- partial clone with --filter=blob:none
- git backfill for batched later blob retrieval
- background maintenance and incremental repack
These help because they change local shape and transfer cost without changing object identity for everyone else.
That is the first bar to clear. If your users only need a smaller local footprint, do not jump straight to history surgery.
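The local-footprint path can be sketched in a few commands. The URL, the branch name, and the src/docs directory list are placeholders for your own repository:

```shell
# Blobless partial clone plus cone-mode sparse-checkout:
# small initial transfer, narrow working tree.
git clone --filter=blob:none --no-checkout <url> repo
cd repo
git sparse-checkout set --cone src docs
git checkout main   # fetches only the blobs the sparse cone needs
```

Missing blobs are fetched on demand from the promisor remote as the checkout or later commands need them, which is exactly why this changes transfer cost without changing object identity.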
Normal maintenance can reduce on-disk size without changing history semantics.
Examples:
- repacking loose objects into packs
- better delta compression
- consolidating many packs into fewer packs
- writing a more efficient object layout over time
That can produce meaningful savings, especially on older repositories that accumulated years of loose objects, fragmented packs, or poor compression decisions.
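A minimal compaction pass looks like this; the --depth and --window values are illustrative tuning knobs, not requirements:

```shell
# Repack everything into a single pack with freshly computed deltas,
# then measure what that bought you. Same history, tighter layout.
git repack -a -d -f --depth=50 --window=250
git count-objects -vH
```

Running the measurement immediately after the repack is the point: maintenance wins should be demonstrated in numbers, not assumed.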
But maintenance has a hard limit:
- it can store the same history more efficiently
- it cannot make that history logically smaller
If the repository contains ten years of giant PSDs, maintenance can repack them. It cannot wish them away.
So maintenance is often step one, not step two hundred.
Repository size problems often split into two buckets:
- the repository contains the wrong things
- the repository stores reasonable things inefficiently
The first bucket usually means:
- giant blobs that never belonged in Git
- generated outputs committed repeatedly
- vendor drops, archives, media, or datasets living in ordinary history
The second bucket usually means:
- too many fragmented packs
- weak object locality
- poor delta choices
- a storage layout that has drifted into an inefficient shape over time
If the contents are wrong, you are usually heading toward policy changes, externalized asset storage, or history rewriting.
If the contents are mostly fine but the packing is poor, repacking and maintenance strategy may produce large wins without changing history at all.
Both cases can make a repository feel huge. Only one of them means the history itself is fundamentally wrong.
In practice, repository bloat usually comes from blobs:
- archives
- generated assets
- media
- build outputs
- vendored dependencies committed repeatedly
- machine-produced JSON or logs
Text compresses well. Binary junk often does not.
Even worse, Git is happiest when similar content can delta well across versions. Giant binary artifacts often defeat that. A repository with many large opaque binaries is bigger, less compressible, and more expensive to move.
So the main question becomes:
"Which blobs should never have been in reachable history?"
True size reduction starts there.
History Rewriting Is the Real Size-Reduction Tool
If you want the repository itself to become permanently smaller, you usually have to rewrite history to remove or transform old objects.
That can mean:
- removing paths entirely from history
- replacing large assets with smaller versions
- flattening accidental imports
- stripping generated directories
- moving oversized binaries out of ordinary Git history
Once you do that, old commit IDs are no longer valid because the trees changed, the commits changed, and therefore the history graph changed.
This is the same rule as rebase, just on a much larger scale.
That has consequences:
- every branch and tag may need rewriting
- downstream clones must be coordinated carefully
- open pull requests may become awkward
- old SHAs in docs, tickets, or deployment systems may stop being useful
Do not present this as a harmless cleanup. It is repository migration.
For repository-scale history rewriting, git filter-repo is the practical default.
It is a separately installed tool, not part of a stock Git build. Confirm git filter-repo -h works on your machine before you plan a cleanup around it.
It is usually a better choice than old filter-branch workflows because it is:
- much faster
- less error-prone
- better suited to path filtering and blob stripping
- easier to reason about in repeatable cleanup scripts
Typical uses include:
- remove a path from all history
- keep only a subdirectory
- rewrite author or path metadata
- strip blobs larger than a chosen threshold
Examples:
git filter-repo --path build/ --invert-paths
git filter-repo --strip-blobs-bigger-than 20M
git filter-repo --path src/ --path docs/
Those commands are powerful because they change the repository for real, beyond the current checkout.
They are also powerful enough to hurt you badly if run casually on a live shared repository.
The safer pattern is to rehearse in a mirror and measure before and after:
git clone --mirror <url> repo-cleanup.git
cd repo-cleanup.git
git filter-repo --strip-blobs-bigger-than 20M
git count-objects -vH
That rehearsal makes the storage effect measurable before you ask a shared repository to absorb the rewrite.
The BFG Repo-Cleaner can still be useful when the job is narrowly:
- remove very large blobs
- remove known bad file types
- replace secrets or obvious junk
It is less general than git filter-repo, but for some cleanup tasks that narrower model is fine.
If BFG is the chosen tool, the shape is still mirror-first:
git clone --mirror <url> repo-bfg.git
java -jar bfg.jar --strip-blobs-bigger-than 20M repo-bfg.git
cd repo-bfg.git
git reflog expire --expire=now --all
git gc --prune=now
Like git filter-repo, BFG is a separate tool. Verify the exact launcher on your machine before you write a cleanup runbook around it.
These are history-rewrite tools, not storage-maintenance tools.
They change object identity.
Large Binary Assets Usually Belong Somewhere Else
Sometimes the right answer is not "keep smaller copies in Git." It is "do not keep these assets in ordinary Git history at all."
Candidates include:
- media archives
- trained models
- packaged dependencies
- release artifacts
- design source files
There is no perfect answer here.
Git LFS is one option when the repository still needs those files in the workflow but ordinary Git object storage is the wrong place for them. Large-object promisors are another interesting direction, using promisor-style object retrieval for large payloads. In many environments, though, the more boring answer is still the better one: keep large assets in artifact storage, object storage, package registries, model registries, release buckets, or some other system that is already designed for large binary distribution.
An LFS setup that covers future commits is explicit:
git lfs version
git lfs install
git lfs track "*.psd" "*.zip" "models/**"
git add .gitattributes
git commit -m 'track large assets with Git LFS'
git lfs is also a separate tool, so git lfs version should succeed before you count on this workflow. This changes future commits. It does not rewrite the large blobs already buried in old history.
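Converting blobs that are already buried in history into LFS pointers is a separate, history-rewriting step. Assuming git-lfs is installed, git lfs migrate import can do it, and the mirror-first rehearsal pattern from the filter-repo section applies just as much here:

```shell
# Rewrite existing history so matching blobs become LFS pointers.
# This changes commit IDs like any other rewrite; rehearse in a mirror.
git lfs migrate import --everything --include="*.psd,*.zip"
git lfs ls-files   # confirm the rewritten paths are now LFS-tracked
```

The --everything flag rewrites all local refs, which is usually what a repository-wide cleanup wants; dropping it limits the rewrite to the current branch.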
That changes the tradeoff:
- Git may keep only lightweight references in ordinary history
- large payloads may live in LFS, a promisor-backed object service, or an external artifact system
- clones and fetches stop dragging all large blob content through the normal object pipeline
None of those choices is free. LFS adds its own operational overhead. External artifact systems separate the assets from ordinary Git workflows. Promisor-based designs are promising, but they are not a universal drop-in answer today.
This does not remove the need for policy. If people keep committing giant generated files into ordinary Git history, the repository will bloat again.
A one-time rewrite helps less than people hope if the same habits continue.
Once a repository has been reduced, protect it with rules like:
- reject oversized files in server-side hooks or CI
- keep generated outputs out of tracked history
- store releases and build artifacts outside the main repo
- keep large assets in the storage system that actually fits them
- document what belongs in Git and what belongs adjacent to Git
One concrete CI gate is to fail a branch that introduces new oversized blobs relative to the protected base branch:
base=origin/main
git rev-list --objects "$base"..HEAD \
| git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
| awk '$1 == "blob" && $2 > 20000000 { print; bad = 1 } END { exit bad }'
Replace origin/main and 20000000 with your real protected base ref and size limit. The point is to check newly reachable blobs, not only working-tree file size.
Without policy, repository reduction is just recurring surgery.
Treat Repository Reduction as Migration
A real reduction project should usually look like this:
- measure the problem
- identify exactly which paths or blob classes must go
- rehearse the rewrite in a disposable mirror
- compare before and after size, clone time, and fetch behavior
- communicate the cutover plan clearly
- freeze writes during the final migration if needed
- publish the rewritten repository and migration instructions
That planning is how you avoid turning size cleanup into a multi-week repository outage.
The publish step should also be explicit:
git remote set-url origin <canonical-url>
git push --mirror origin
Run that only after the communication and write-freeze step, because git push --mirror republishes the rewritten ref space, not just one branch tip.
Useful migration questions include:
- which branches are authoritative?
- which tags must survive?
- which forks matter?
- which CI systems pin SHAs?
- who needs to re-clone versus hard-reset?
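A quick inventory of the ref space helps answer the first two questions. This sketch just lists branches and tags newest-first with their creation dates; what counts as authoritative is still a human decision:

```shell
# Newest-first inventory of branches and tags that must survive the cutover.
git for-each-ref --sort=-creatordate \
  --format='%(refname:short) %(objecttype) %(creatordate:short)' \
  refs/heads refs/tags
```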
The bigger the repository and the team, the less you can improvise this.
After a successful history rewrite, the old objects do not vanish instantly from every clone.
You still have to think about:
- reflogs
- unreachable objects
- old packs
- mirrors and caches
- stale clones on developer machines
Until those are expired or replaced, some copies of the old bulk may still exist.
Teams often do the rewrite in a fresh mirror and then have users re-clone rather than trying to coax every existing clone into a perfectly cleaned final state.
The rewritten canonical repository can be clean. Existing local clones may lag behind or retain unreachable junk until maintenance and expiration catch up.
That cleanup has its own commands:
git reflog expire --expire=now --all
git gc --prune=now
Those commands do not replace the migration plan, but they do explain why "we rewrote history" and "the old bulk is gone everywhere" are not the same moment.
If the reduction plan becomes too complicated, one honest answer is:
- archive the old repository
- create a cleaner successor
- migrate active development to the new home
This is especially reasonable when:
- old history is mostly ballast
- branch and tag preservation is not mission-critical
- organizational boundaries changed anyway
- the repository has become a dumping ground rather than a source tree
Sometimes a clean break is operationally safer than a heroic rewrite.
Repository reduction is about cost control.
A smaller repository can mean:
- faster first clones
- cheaper fetches
- lower storage cost
- easier mirroring
- less CI setup time
- less pain on constrained developer machines
But the right method depends on which cost you are actually paying.
If the problem is local checkout width, use sparse tools.
If the problem is transfer volume, use partial clone, prefetch strategy, bundles, and better transport shape.
If the problem is years of bad historical objects, use a deliberate history rewrite with a migration plan.
"Reducing repository size" only becomes a useful phrase once you say which one you mean.
Sparse-checkout reduces working-tree size, not repository history. Partial clone reduces what arrives up front, not what exists canonically. Maintenance can store the same history more efficiently, but it cannot delete bad history. Real repository shrinkage usually means removing blobs from history. git filter-repo is usually the right tool for serious cleanup. Large assets often belong outside ordinary Git history, but there is no perfect universal replacement.