Chapter 8: Commit-Graph, Bloom Filters, MIDX, Bitmaps

If git log feels fast in a huge repo, or if a fetch knows exactly which objects to send without enumerating them from scratch, the credit usually belongs to one of four sidecar structures: commit-graph files, changed-path Bloom filters, multi-pack-indexes, and reachability bitmaps. They exist because recomputing everything the honest way gets expensive.

Git's base model reads the underlying data directly whenever it is needed:

commits and parent links for history
trees and blobs for content
packs and pack indexes for object storage

The problem is that raw object access is not always the cheapest way to answer large questions. Parsing commit objects one by one, scanning many pack indexes independently, or recomputing reachability sets from scratch can become expensive at scale. So Git builds acceleration structures around the same underlying repository.

These Structures Accelerate Existing Data

These structures sit alongside commits, packs, trees, and refs. They summarize or index information that is already present so Git can answer certain questions faster. The data they store can always be rebuilt from existing logical primitives.

When one of them is missing, Git should still work. It just has to do the slower version of the job.

Each accelerator has a place on disk:

commit-graph usually lives at .git/objects/info/commit-graph
split commit-graph chains live under .git/objects/info/commit-graphs/
the multi-pack-index lives at .git/objects/pack/multi-pack-index
pack bitmaps and related pack sidecar files live under .git/objects/pack/

That means you can often answer "do I even have this feature here?" with ordinary shell inspection before you reach for deeper Git commands:

test -f .git/objects/info/commit-graph && echo commit-graph present
test -d .git/objects/info/commit-graphs && find .git/objects/info/commit-graphs -maxdepth 1 -type f | sed -n '1,20p'
test -f .git/objects/pack/multi-pack-index && echo midx present
find .git/objects/pack -maxdepth 1 \( -name 'pack-*.idx' -o -name 'pack-*.pack' -o -name 'pack-*.bitmap' -o -name 'multi-pack-index*' \) | sort

Commit-Graph Serializes Commit Metadata for Fast Walks

Commit-Graph: Serialized file that stores commit metadata and parent relationships in a traversal-friendly form.

The motivation for commit-graph is direct: walking commit history in very large repositories can be too slow when Git has to parse commit objects one by one from packfiles. The commit-graph file precomputes and serializes the graph structure so that ancestry questions become cheaper lookups instead of repeated object parsing.

At the object level, Git can learn everything it needs about a commit by reading and parsing the commit object itself. But when a repository has a very large history, doing that repeatedly becomes expensive. The commit-graph file exists to make that cheaper. It stores commit OIDs together with associated metadata including generation numbers, root tree OIDs, commit dates, parent positions, and optionally changed-path Bloom filters.

Operations that need only graph shape and a few facts per node, rather than the full text of every commit object, benefit most. That includes:

ancestry walks
merge-base style reasoning
reachability checks
ordered history traversals over large commit sets

On disk, the simplest form is one file at .git/objects/info/commit-graph. Repositories using split commit-graph chains instead keep a commit-graph-chain file and one or more graph-*.graph files under .git/objects/info/commit-graphs/.
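
As a rough illustration of the split layout, the sketch below resolves a chain into the layer files it names. It assumes the commit-graph-chain file lists one hash per line, lowest layer first, and that each hash maps to a graph-<hash>.graph file in the same directory. The function name chain_layers is hypothetical, not part of Git.

```python
from pathlib import Path


def chain_layers(git_dir):
    """Return the graph-<hash>.graph paths named by a split commit-graph chain.

    Assumption: commit-graph-chain holds one hash per line, lowest layer first.
    Returns an empty list when the repository has no chain file.
    """
    chain_dir = Path(git_dir) / "objects" / "info" / "commit-graphs"
    chain_file = chain_dir / "commit-graph-chain"
    if not chain_file.is_file():
        return []
    hashes = [line.strip() for line in chain_file.read_text().splitlines()]
    return [chain_dir / f"graph-{h}.graph" for h in hashes if h]
```

Calling chain_layers(".git") in a repository without a split chain simply returns an empty list, which matches the idea that these files are optional.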

The commit-graph stores parent references by position within the graph file, not by repeating full object IDs for every edge. Once the commits are laid out in one dense table, parent navigation becomes much cheaper than repeatedly inflating scattered commit objects from packs just to chase edges.

The commit-graph also records generation data. A commit's generation number is always larger than that of any of its parents, so Git can rule out ancestry candidates by comparing two integers instead of walking: a commit with a lower generation number can never reach one with a higher number. That is why commit-graph files speed up commands that walk a lot of history. The reasoning is unchanged; Git just crosses useless candidates off the list sooner.
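
A toy model can make both ideas concrete: parents stored as row positions, and generation numbers used to prune a walk. This is a sketch under stated assumptions, not Git's file format. The COMMITS table, its object IDs, and both function names are invented for illustration, and the generation number used here is the simple topological level (1 for a root, otherwise 1 plus the largest parent value).

```python
# Toy commit table: parents are row positions, not repeated object IDs.
# Assumption of this sketch: parents always appear before their children.
COMMITS = [
    {"oid": "a1", "parents": []},      # 0: root
    {"oid": "b2", "parents": [0]},     # 1
    {"oid": "c3", "parents": [0]},     # 2: sibling branch
    {"oid": "d4", "parents": [1, 2]},  # 3: merge
    {"oid": "e5", "parents": [2]},     # 4
]


def generations(commits):
    """Topological level: 1 for a root, else 1 + max over the parents."""
    gen = []
    for row in commits:
        gen.append(1 + max((gen[p] for p in row["parents"]), default=0))
    return gen


def can_reach(commits, gen, start, target):
    """Return (is target reachable from start, number of rows visited)."""
    stack, seen, visited = [start], set(), 0
    while stack:
        pos = stack.pop()
        if pos in seen:
            continue
        seen.add(pos)
        visited += 1
        if pos == target:
            return True, visited
        # Every ancestor has a strictly smaller generation, so a row whose
        # generation is not above the target's cannot lead to the target.
        if gen[pos] <= gen[target]:
            continue
        stack.extend(commits[pos]["parents"])
    return False, visited
```

With this table, can_reach(COMMITS, generations(COMMITS), 4, 1) answers False after visiting two rows; the walk stops at row 2 instead of continuing down to the root.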

Git also supports corrected commit-date style generation data, with generation version 2 as the default for writing and reading commit-graph files. Even inside one accelerator, Git keeps refining the quality of its precomputed metadata.

Changed-Path Bloom Filters Speed Up Path-Limited History

Changed-Path Bloom Filter: Probabilistic filter attached to commit-graph data that helps Git skip commits unlikely to affect a requested path.

Changed-path Bloom filters extend the commit-graph for a specific problem: path-limited history queries like git log -- path can still be expensive even with a commit-graph because Git may need to inspect trees at each commit to determine whether the path changed. Bloom filters let Git skip that tree inspection for commits where the path almost certainly did not change.

The commit-graph becomes even more valuable when it carries changed-path Bloom filters. git log -- path is a graph walk plus repeated path reasoning. A cheap negative test can save a great deal of tree inspection there.

These filters are specific. The commit-graph format stores a Bloom filter for the paths changed between a commit and its first parent, if requested, and --changed-paths can provide significant gains for git log -- <path> style queries.

Operationally:

they are path-query accelerators
they sit on top of the commit graph
they are especially useful for history limited by file or directory

Bloom filters are probabilistic membership tests, not a stored path history. The property is asymmetric:

a negative result is strong enough for Git to skip extra work
a positive result only means "maybe, check further"

They help without changing correctness. Git uses them to avoid unnecessary inspections and falls back to the real underlying data whenever a filter cannot rule a commit out.
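
The asymmetry is easier to see in a minimal Bloom filter. The sketch below is illustrative only: the class PathBloom, its bit count, and its SHA-256-based hashing are assumptions chosen for brevity, and Git's real filters use a different hash function and sizing.

```python
import hashlib


class PathBloom:
    """Tiny Bloom filter over path strings (illustrative parameters)."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, path):
        # Derive num_hashes bit positions by salting the path with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{path}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, path):
        for pos in self._positions(path):
            self.bits |= 1 << pos

    def might_contain(self, path):
        # False is definite; True only means "maybe, go diff the trees".
        return all((self.bits >> pos) & 1 for pos in self._positions(path))


def commits_to_inspect(history, path):
    """Keep only commits whose filter cannot rule the path out."""
    return [oid for oid, bloom in history if bloom.might_contain(path)]
```

A path that was added always tests positive, so no real change is ever skipped. A path that was never added usually tests negative, but it can occasionally collide and test positive, which costs an extra tree comparison rather than a wrong answer.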

Keep their scope in mind. Changed-path Bloom filters are based on paths changed against the first parent, which leaves them silent on the full semantic story of renames, copies, and every possible merge interpretation. They help path-limited history without granting psychic powers about renames or turning the query into a free operation. The hard parts of history simplification, rename heuristics, and merge-heavy path reasoning still exist.

Multi-Pack-Index Makes Many Packs Behave More Like One Index

Multi-Pack-Index: Repository-level index that lets Git look up objects across multiple packfiles without consulting each pack index independently.

In simplified examples, a repository has one tidy pack. Real repositories are messier. They accumulate multiple packs for ordinary reasons: incremental fetches, maintenance strategy, kept packs, promisor packs, geometric repacking, and more.

Without an extra index, object lookup across many packs becomes needlessly repetitive. Git may have to consult pack after pack, each with its own .idx file, just to answer basic lookup questions.

The multi-pack-index, or MIDX, exists to collapse that lookup work into one repository-level index. Git can write or verify that MIDX as its own maintained artifact. That modest description hides an important effect. MIDX lets Git keep multiple packs around without taking the full lookup hit of treating them as unrelated islands.
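
The difference between probing each pack and consulting one merged index can be sketched as follows. Everything here is hypothetical: the pack names, the two-character object IDs, and the in-memory layout stand in for real .idx and multi-pack-index files, which are binary formats with more structure.

```python
from bisect import bisect_left

# Hypothetical per-pack indexes: object ID -> offset inside that pack.
PACK_INDEXES = {
    "pack-a": {"1f": 12, "9c": 340},
    "pack-b": {"3d": 12, "7a": 88},
    "pack-c": {"5e": 12},
}


def lookup_per_pack(oid):
    """Without a MIDX: consult each pack's index in turn."""
    probes = 0
    for pack, index in PACK_INDEXES.items():
        probes += 1
        if oid in index:
            return pack, index[oid], probes
    return None, None, probes


def build_midx(pack_indexes):
    """Merge all entries into one list of (oid, pack, offset), sorted by oid."""
    return sorted(
        (oid, pack, offset)
        for pack, index in pack_indexes.items()
        for oid, offset in index.items()
    )


def lookup_midx(midx, oid):
    """With a MIDX: one binary search answers for every pack at once."""
    i = bisect_left(midx, (oid,))
    if i < len(midx) and midx[i][0] == oid:
        return midx[i][1], midx[i][2]
    return None, None
```

In this toy setup, finding "5e" through lookup_per_pack takes three probes because it lives in the last pack, while lookup_midx performs a single search regardless of how many packs exist.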

The literal file is .git/objects/pack/multi-pack-index. It sits beside the .pack, .idx, .rev, .mtimes, and bitmap sidecar files for the packs it indexes.

MIDX is for cases where keeping multiple packs is better than rewriting everything into one. One answer to "too many packs" is "rewrite everything into one giant pack." Sometimes that is right. Sometimes it is unnecessary.

Modern Git has better options. Git can:

keep multiple packs
write a MIDX over them
repack selected packs
expire packs no longer referenced by the MIDX
maintain geometric pack structure instead of flattening the whole repository every time

Git can keep multiple packs healthy without flattening the whole repository into one perfect pack.

Bitmaps Accelerate Reachability Enumeration

Reachability Bitmap: Precomputed set representation that speeds up answering which objects are reachable from selected commits.

Bitmaps solve a different class of problem from commit-graph and Bloom filters. Clone, fetch, and some object-counting tasks are really reachability-enumeration problems in disguise, and bitmaps are what make that enumeration cheap. Commit-graph helps with commit traversal. Bloom filters help with path-limited history. Bitmaps help when Git needs to know which objects are reachable from a set of tips, right now, without walking the graph object by object.

Git exposes this in two places: git rev-list supports --use-bitmap-index for bitmap-assisted traversal, and git repack supports --write-bitmap-index for writing reachability bitmaps during repack. In large repositories, this is one of the strongest performance features Git has.

Bitmaps sit close to transfer, which is why they show up in server tuning more than local-command tutorials. When a server is preparing to satisfy a fetch or clone, it often needs to answer a question like:

"Which objects are reachable from these tips, but not already known on the other side?"

That question gets huge very quickly in a large repository. Bitmaps help because they precompute large reachability sets instead of forcing Git to rediscover them object by object every time.
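
With precomputed sets, the server's question reduces to bitwise arithmetic: OR together the sets for the wanted tips, OR together the sets for what the other side has, and keep the bits that are in the first but not the second. The sketch below uses plain Python integers as bitsets. The object names, their bit order, and the REACHABLE table are invented for illustration; Git's on-disk bitmaps are compressed and tied to pack order.

```python
# Every object gets a fixed bit position (hypothetical ordering).
OBJECTS = ["c1", "t1", "b1", "c2", "t2", "b2", "c3", "t3"]
BIT = {oid: 1 << i for i, oid in enumerate(OBJECTS)}


def bitmap(*oids):
    """Build a bitset with one bit set per listed object."""
    mask = 0
    for oid in oids:
        mask |= BIT[oid]
    return mask


# Precomputed reachability sets for selected commits.
REACHABLE = {
    "c1": bitmap("c1", "t1", "b1"),
    "c2": bitmap("c2", "t2", "b2", "c1", "t1", "b1"),
    "c3": bitmap("c3", "t3", "b2", "c2", "t2", "c1", "t1", "b1"),
}


def objects_to_send(wants, haves):
    """Objects reachable from `wants` but not from `haves`."""
    want_bits = 0
    for tip in wants:
        want_bits |= REACHABLE[tip]
    have_bits = 0
    for tip in haves:
        have_bits |= REACHABLE[tip]
    missing = want_bits & ~have_bits
    return [oid for oid in OBJECTS if missing & BIT[oid]]
```

For a client that already has c1 and asks for c3, objects_to_send(["c3"], ["c1"]) yields the five objects introduced after c1 without visiting a single commit or tree.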

Historically, bitmaps were often associated with a repository having one dominant pack. That is still an important case, and bitmap writing only makes sense when the bitmap can refer to all reachable objects.

Git also supports multi-pack bitmaps. git multi-pack-index includes --bitmap on write, and git repack --write-bitmap-index notes that when multiple packs are produced while writing a MIDX, a multi-pack bitmap can be created. Git can preserve a more flexible pack layout while still getting bitmap-style acceleration.

These files also live under .git/objects/pack/. In a single-pack case you will usually see a pack-*.bitmap file next to the corresponding pack and index. In a multi-pack case, the bitmap data lives alongside the MIDX in the same directory.

Different Accelerators Help Different Commands

Git performance advice often goes bad when these structures get thrown into one vague "make Git faster" bucket. Roughly speaking:

commit-graph helps commit walks and ancestry reasoning
changed-path Bloom filters help path-limited history queries
multi-pack-index helps object lookup across many packs
bitmaps help reachability enumeration, especially for transfer

Each structure maps to a class of expensive work. If I had to rank them by practical value on a typical repo: commit-graph first, because it helps on almost any history-heavy workload. MIDX next, once the repo accumulates more than one or two packs. Bloom filters when git log -- path is actually slow, reactively rather than preemptively. Bitmaps matter most server-side and for transfer-heavy workflows; on a single-developer clone they are rarely the lever worth pulling.

Two quick probes make that concrete:

git multi-pack-index verify
git rev-list --use-bitmap-index --count --all

The first verifies the contents of the multi-pack-index file, which only tells you something if one has been written. The second counts every commit reachable from any ref, using a bitmap when one is available.

A Few Commands Make These Files Concrete

Ask Git to write or verify them directly:

ls .git/objects/info
ls .git/objects/pack

git commit-graph write --reachable --changed-paths
git commit-graph verify

git multi-pack-index write
git multi-pack-index verify

git rev-list --use-bitmap-index --count --all

Those commands are a good way to see the structures in the chapter become actual files and actual query behavior. ls shows where the files live, the write commands create or refresh them, and the verify commands confirm that Git can read them coherently.

These files age, and maintenance is what keeps them aligned with the current repository state. Reading is the easy half: Git uses these accelerators automatically when they are present. Writing them is messier. I let background maintenance handle commit-graph and MIDX, and reach for explicit writes mainly when I want to force the structures into existence for a benchmark or a debugging pass. Bitmap writing is more common in server-oriented or repack-heavy setups than in an ordinary local clone.

If you want to force the structures into existence instead of waiting for maintenance:

git commit-graph write --reachable --changed-paths
git multi-pack-index write --bitmap

These files are maintained artifacts, not hand-built ones. When they are present and current, Git can answer large questions with much less repeated work.

The pattern across the chapter is the same. Git keeps the core repository model simple:

immutable objects
refs
parent links
packs

Then it layers accelerators around that core:

serialized commit metadata
probabilistic path hints
unified indexes across multiple packs
precomputed reachability sets

Those structures keep Git's model intact while making it usable at larger scales and lower latencies. Each one is the same bet: write an answer down once, so Git can read it instead of recomputing it every time you ask.