High Performance Git

Section III · Storage and Local Scale

Chapter 8

The Index as a Performance Structure


git status can take five seconds in a large checkout and produce no new history at all. The cost is usually local bookkeeping.

We have mostly talked about the index as a staging area so far. The index is also one of Git's most important local performance structures. Git uses it to cache path state, validate the working tree, and avoid rediscovering the repository on every command. Many local performance problems show up here.


git status Is Not Just Looking at Files

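Start with a simple split: git status is doing more than "looking at the working tree and telling you what changed." It compares several layers: the HEAD commit against the index (staged changes), the index against the working tree (unstaged changes), and the working tree against the index and the ignore rules (untracked files).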

So status can be expensive even though it writes no history. The command validates a cached model of the checkout against a mutable filesystem. On a 100k-file checkout, that usually means a great many lstat() calls, plus directory scans for paths the index has never seen. Five seconds of this feels especially rude because nothing visible happened.

You can make that visible with a very small set of commands:

git status --untracked-files=all
git ls-files --debug | sed -n '1,20p'
git update-index --refresh

status asks the high-level question. ls-files --debug shows the cached path state Git is leaning on. update-index --refresh forces a pass that revalidates index entries against the working tree.
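The comparisons that status combines can also be run one at a time with plumbing commands. The sketch below builds a throwaway repository (the scratch directory and file names are assumptions for illustration, not part of any real project) and asks each question separately.

```shell
#!/bin/sh
# Sketch: run each comparison that status combines, one at a time.
# The scratch repository and file names are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo one > staged.txt
echo one > edited.txt
git add staged.txt edited.txt
git -c user.name=Example -c user.email=example@example.com commit -q -m base

echo "two, staged" > staged.txt
git add staged.txt                      # now differs between HEAD and index
echo "two, not staged" > edited.txt     # now differs between index and working tree
echo new > untracked.txt                # not in the index at all

# HEAD commit vs index: what status lists as staged changes.
head_vs_index=$(git diff-index --cached --name-only HEAD)
# Index vs working tree: what status lists as unstaged changes.
index_vs_worktree=$(git diff-files --name-only)
# Working-tree paths missing from the index, minus ignored ones.
untracked=$(git ls-files --others --exclude-standard)

printf 'HEAD vs index:      %s\n' "$head_vs_index"
printf 'index vs worktree:  %s\n' "$index_vs_worktree"
printf 'untracked:          %s\n' "$untracked"
```

Only the last of the three has to walk directories looking for new paths; the first two start from entries the index already holds.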

Index Entry
Per-path record in the index containing staged object identity plus cached filesystem state and flags.

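The index is a sorted table of path entries. Each entry says, in effect: this path, with this mode, is staged as this object, and the last time Git looked, the file on disk had this size, these timestamps, and this inode.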

Git does not want to re-hash every tracked file every time you ask a local question. Instead, it uses the index to remember enough filesystem state that many paths can be treated as unchanged unless the cached facts no longer line up with what the filesystem reports.

That is the gap between a content-addressed history model and a usable local workflow. The history model says what the snapshot means. The index helps Git decide how much of the working tree it actually has to touch right now. Unglamorous work, maybe, but Git is full of these ungainly little masterpieces.
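To see what a single entry holds, it helps to look at one in a scratch repository. This is a minimal sketch; the repository and the hello.txt file are assumptions made up for the example.

```shell
#!/bin/sh
# Sketch: inspect one index entry. Repository and file name are illustrative.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf 'hello\n' > hello.txt
git add hello.txt

# Staged identity: mode, object ID, stage number, path.
git ls-files --stage -- hello.txt
# Cached filesystem state for the same entry:
# ctime, mtime, dev, ino, uid, gid, size, flags.
git ls-files --debug -- hello.txt

# The object ID in the entry is the blob hash of the staged content.
entry_oid=$(git ls-files --stage -- hello.txt | awk '{print $2}')
file_oid=$(git hash-object hello.txt)
# The cached size is what Git compares against lstat() later.
cached_size=$(git ls-files --debug -- hello.txt | sed -n 's/.*size: \([0-9]*\).*/\1/p')
echo "entry oid: $entry_oid"
echo "file oid:  $file_oid"
echo "cached size: $cached_size bytes"
```

The object ID answers "what is staged"; the stat fields answer "has the file on disk probably changed since then".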

Commits already point to trees, so it is fair to ask why Git also needs the index. Git needs it for staging and locality. The current commit tells Git what the last committed snapshot looked like. It does not tell Git which files in the working tree might have changed since then without further investigation. The index fills that gap. It sits between immutable history and mutable disk state.

That is why the index is central to so many everyday commands, including status, add, diff, commit, and checkout.

The commands differ, but they all benefit from Git having one structured place where path state is already organized and partially cached.

Tracked Files, Untracked Files, and Why Stat Data Matters

For tracked files, Git often begins with the index and asks: does the filesystem still agree with what I cached for this path?

For untracked files, Git has a different problem: it has to discover paths that are not in the index at all.

Those are not the same job, and they fail differently at scale. A repository can have a manageable set of tracked-file validations and still feel slow because untracked-file discovery is expensive. Or the opposite can happen. Large-repository tuning gets easier once you separate tracked-path validation, untracked-file discovery, and the cost of reading and rewriting the index itself.

All three sit close to the index, but they are not identical workloads. Different accelerators help different parts of the bill.
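One rough way to separate the tracked and untracked parts of the bill is to ask status to skip untracked discovery altogether and compare. The sketch below uses a throwaway repository whose paths are assumptions for illustration; on a real checkout you would wrap each status call in time.

```shell
#!/bin/sh
# Sketch: status with and without untracked discovery.
# Repository layout is an illustrative assumption.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo tracked > tracked.txt
git add tracked.txt
git -c user.name=Example -c user.email=example@example.com commit -q -m base
echo change >> tracked.txt
mkdir scratch
echo tmp > scratch/notes.txt

# Tracked-path validation only: no walk for unknown paths.
tracked_only=$(git status --porcelain --untracked-files=no)
# Tracked validation plus a full walk for paths the index has never seen.
with_untracked=$(git status --porcelain --untracked-files=all)

printf 'without untracked discovery:\n%s\n' "$tracked_only"
printf 'with untracked discovery:\n%s\n' "$with_untracked"
```

If the first form is fast and the second is slow on your checkout, the cost is in discovery rather than in validating tracked paths.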

Git does not normally open every tracked file and hash its content for every status. That would be too expensive on a large checkout. Instead, it compares cached filesystem facts from the index against current filesystem facts to decide whether a path probably needs deeper inspection.

Exactly which metadata Git trusts can vary with platform and configuration, but the principle is stable: the index holds enough cached path state that Git can often skip content reads for files that appear unchanged.
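The stat cache is easy to watch in action. In this sketch (scratch repository and file name are assumptions), only a file's timestamp changes. The low-level git diff-files command trusts the cached stat data and flags the path; update-index --refresh then reads the content, finds it identical, and repairs the cached facts.

```shell
#!/bin/sh
# Sketch: a timestamp-only change makes an entry "stat-dirty".
# Repository and file name are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo content > file.txt
git add file.txt
git -c user.name=Example -c user.email=example@example.com commit -q -m base

# Change only the modification time; the bytes are untouched.
touch -t 200101010000 file.txt

# diff-files compares cached stat data with lstat() and flags the mismatch.
before=$(git diff-files --name-only)
# --refresh re-reads flagged paths, sees identical content,
# and stores the new stat data in the index.
git update-index --refresh
after=$(git diff-files --name-only)

echo "flagged before refresh: ${before:-<none>}"
echo "flagged after refresh:  ${after:-<none>}"
```

A tool that restamps thousands of files puts every one of them through that re-read on the next refresh, which is how a build step can make status slow without changing a byte.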

That is also why some environments feel worse than others. Network filesystems, antivirus or endpoint-security agents, editors that do atomic-save renames, background indexers, and build flows that restamp files can all make the stat cache less trustworthy than it first appears.

The issue is usually not "Git is bad at files." The issue is that Git is trying to cheaply validate a cached model against a noisy real filesystem. The filesystem, as usual, has opinions.

If the index is a good cache and the filesystem gives Git trustworthy enough signals, many local commands remain cheap because Git can avoid deep inspection of most paths.

If the cache cannot be trusted or the working tree is too large to scan comfortably, local Git gets slower because Git has to do more direct filesystem work: more lstat() calls, more directory reads, and more file contents opened and hashed.

Local Git tuning often sounds unglamorous. The features that matter most usually reduce filesystem uncertainty. Hardly a thrilling answer. Still the right one.

Untracked Cache and Fsmonitor Help on Different Sides of the Problem

Untracked Cache
Index extension that caches directory mtimes so unchanged directories are skipped when looking for untracked files.

One of the clearest examples is the untracked cache. It targets the part of status that has to discover files Git does not already know about. If a directory has not changed since the last scan, Git may be able to avoid reading through it again while searching for untracked files. The feature depends on whether the filesystem updates directory mtimes in the way Git expects.

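The dependency is on very ordinary operating-system behavior: a directory's mtime has to change whenever an entry is added to it, removed from it, or renamed within it. Speed here comes down to whether Git can safely skip work.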

git update-index --test-untracked-cache
git config --get core.untrackedCache
git status

If the probe fails, the cache will not do much for you.
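If the probe passes, opting a repository in is short. The following is a minimal sketch in a scratch repository (the repository and file are assumptions); whether it pays off on a real checkout still depends on the filesystem behavior the probe checks.

```shell
#!/bin/sh
# Sketch: enable the untracked cache for one repository.
# Repository and file name are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo tracked > tracked.txt
git add tracked.txt

# Ask Git to keep the extension present whenever it writes the index.
git config core.untrackedCache true
# Add the extension to the index right away.
git update-index --untracked-cache
# The first status after enabling fills the cache; later runs can reuse it.
git status --porcelain >/dev/null

setting=$(git config --get core.untrackedCache)
echo "core.untrackedCache=$setting"
```

The benefit only appears from the second status onward, so time a repeated run rather than the first one.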

Fsmonitor
Integration that lets Git learn which paths changed from a filesystem monitor instead of probing every tracked path itself.

The fsmonitor feature attacks a different part of the same problem. Instead of asking the filesystem about every tracked path, Git can consult a monitor that reports which paths changed. That does not eliminate the index. It makes the index more effective by narrowing the set of paths Git needs to revalidate aggressively.

It is one of the strongest examples of Git turning a brute-force local scan into a more selective query. In a large working tree, that can change the feel of commands dramatically. It is the difference between Git checking everything because it has to and checking a short list because it knows better.

When combined with the untracked cache, fsmonitor helps on both sides of the local problem: fewer tracked paths to revalidate, and fewer directories to rescan for untracked files.

The corresponding visibility pass is just as small:

git config --get core.fsmonitor
git status

"Optimize the index" is not operational advice. The real question is which local bill is dominating: untracked discovery, tracked-path probes, or full-index rewrites. The Configuration Playbook chapter covers the specific settings (core.fsmonitor, core.splitIndex, feature.manyFiles, and others) with version caveats and compatibility guidance.

You do not have to treat the index as one opaque binary blob, either:

git ls-files --stage --debug | sed -n '1,20p'
git update-index --test-untracked-cache
git config --show-origin --get-regexp '^(core\.(fsmonitor|splitindex|untrackedcache)|index\.)' || true

ls-files --debug shows the cached path metadata Git is leaning on. --test-untracked-cache checks whether the filesystem can support that accelerator cleanly. The config query shows whether fsmonitor, split-index, sparse-index, and related index settings are enabled.

Split Index and Sparse Index Reduce Different Costs

Split Index
Mode that stores a stable shared index plus a smaller mutable overlay to reduce repeated full-index rewrites.

The index is valuable as a cache, but it also has a cost: Git has to read and write it. In very large repositories, repeatedly rewriting the full index becomes noticeable.

Split index addresses that problem by separating a mostly stable shared index from a smaller mutable layer where recent changes accumulate. Instead of rewriting one huge index file every time, Git can often update a smaller overlay and periodically push changes back into a shared index file.

That shows up directly on disk:

git update-index --split-index
ls -l .git/index .git/sharedindex.*
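The effect is easier to see with sizes attached. This sketch builds a scratch repository with a few hundred small files (the count and names are arbitrary assumptions) and compares .git/index before and after the split.

```shell
#!/bin/sh
# Sketch: watch the index split into a shared file plus a small overlay.
# File count and names are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
i=0
while [ "$i" -lt 300 ]; do
  echo "$i" > "file_$i.txt"
  i=$((i + 1))
done
git add .
size_before=$(wc -c < .git/index | tr -d ' ')

git update-index --split-index
size_after=$(wc -c < .git/index | tr -d ' ')
# The stable entries now live in .git/sharedindex.<hash>.
shared_count=$(ls .git/sharedindex.* 2>/dev/null | wc -l | tr -d ' ')

echo ".git/index before split: $size_before bytes"
echo ".git/index after split:  $size_after bytes"
echo "shared index files:      $shared_count"
```

After the split, .git/index typically shrinks to a small overlay that points at the shared file, and that overlay is what ordinary index updates rewrite.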

The core index still exists as the main path-state structure, but Git keeps adding extensions around it to make local operations more selective and less repetitive. Untracked cache, fsmonitor data, split index, sparse index, the end-of-index-entry marker, the index entry offset table, and related extensions all serve the same broad purpose: keep the basic model intact while reducing unnecessary work. The index is a subsystem, not a staging shelf.

The sparse index belongs in a later chapter because it depends on sparse-checkout, but it shows up here too. In a sparse working tree, Git does not always need a fully expanded per-file index for every path in the repository. Under the right conditions, entire directory regions can be summarized more compactly. That changes both memory behavior and command cost.

Once you see the index as a performance structure, sparse index looks like the natural next step. Git is avoiding work it does not currently need to represent.

Slow status Usually Means One of a Few Things

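When git status is slow, the problem is often one or more of these: too many tracked paths to validate cheaply, expensive untracked-file discovery, a large index that is costly to read and rewrite, or a filesystem whose signals Git cannot trust.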

It is more useful to ask which layer is expensive than to say "Git is slow on this repo." That points toward the layer where the cost is happening.

If you need to go one step further, Trace2 is useful here too:

GIT_TRACE2_PERF=/tmp/status.perf git status >/dev/null
less /tmp/status.perf

Hot refresh_index or lstat time points toward tracked-path validation. Hot read_directory time points toward untracked discovery.
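A perf trace is long, so it usually helps to filter it down to region boundaries before reading. The sketch below does that in a scratch repository. The label names in the second grep are assumptions based on the regions named above, and they vary between Git versions, so adjust the pattern to whatever your trace actually contains.

```shell
#!/bin/sh
# Sketch: capture a Trace2 perf log for status and keep the region summaries.
# Repository contents and the label pattern are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo tracked > tracked.txt
git add tracked.txt
echo loose > untracked.txt

trace=$(mktemp)   # Trace2 needs an absolute path; mktemp provides one.
GIT_TRACE2_PERF="$trace" git status >/dev/null

# Each region_leave line reports the elapsed time of the region it closes.
grep 'region_leave' "$trace" || true
# Narrow to the regions of interest (label names may differ by version).
grep 'region_leave' "$trace" | grep -E 'refresh|preload|read_directory' || true
```

On a large checkout, the region with the biggest elapsed time is the layer to tune first.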

The Index Is Where Staging and Performance Meet

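The index matters here because it serves two jobs at once: it is the staging area that defines the next commit, and it is the cache that lets Git validate the working tree cheaply.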

Those jobs are closely related, but they are not identical. A feature can matter primarily because it affects staging semantics, or primarily because it affects local scan cost, or both.

The index is where Git's elegant snapshot model meets the messy realities of a large mutable filesystem. So it keeps showing up whenever local Git is either wonderfully fast or painfully slow. It is where the theory gets mud on its boots.

Even if most users never inspect .git/index directly, they feel its behavior constantly.

The index is one of the most important local data structures Git has.