High Performance Git

Section II · History, Rewrite, and Parallel Work

Chapter 4

Revisions and History Traversal


Git history is not a saved recording, and it is not a file that lists commits. It is a graph that Git walks every time you ask a question.

git log, git show, git blame, git bisect, and git merge-base all start from the same basic move: pick commits, walk parent links, and keep the results that match the question you asked.

This chapter takes that in three layers:


Revisions Are Expressions, Not Just Commit IDs

In everyday use, a "revision" often means "some commit I want to refer to," but Git's revision syntax is (for better or worse) broader than that. A revision expression can name a ref, a commit ID, a relative ancestor, a set difference, or a merge-related relationship that Git has to compute. You have probably seen tildes and carets in revisions (and likely forgotten what they mean); they are part of this syntax.

Revision
Expression that Git resolves to one or more commits or objects.

If revision syntax feels dense at first, that is because Git is giving you a compact language for describing commit sets rather than merely pointing at one object.

The important distinction is that ^ picks a specific parent edge, while ~ stays on the first-parent line and keeps walking. So HEAD~3 and HEAD^^^ often land on the same commit, while merge~1 and merge^2 can point in completely different directions.
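A minimal sketch of that difference, assuming a throwaway repository built in a temporary directory. The branch names, commit messages, and variable names are illustrative only, and the commits are empty because only the parent links matter here.

```shell
# Build a scratch repository containing one merge commit (all names are illustrative).
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # make sure the first branch is called main
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m 'base'
git checkout -q -b topic
git commit -q --allow-empty -m 'topic work'
git checkout -q main
git commit -q --allow-empty -m 'main work'
git merge -q --no-ff -m 'merge topic' topic

# At a merge, ~1 follows the first parent; ^2 follows the merged-in side.
first_parent=$(git rev-parse HEAD~1)
second_parent=$(git rev-parse HEAD^2)
git log -1 --format='HEAD~1 -> %s' "$first_parent"
git log -1 --format='HEAD^2 -> %s' "$second_parent"

# Along the first-parent line, a tilde count and chained carets agree.
tilde_two=$(git rev-parse HEAD~2)
caret_caret=$(git rev-parse HEAD^^)
```

Here HEAD~1 resolves to the commit made on main before the merge, while HEAD^2 resolves to the tip of the merged topic branch.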

The key point is that revisions are expressions (not mere IDs) that eventually resolve. There is another important symbol to cover, which adds to the density.

The Separator To Reduce Ambiguity

One of Git's smallest lifesavers is also one of the easiest to overlook: --.

Pathspec
Git path-matching expression used to limit a command to particular files, directories, or patterns.

In commands such as git log, git show, or git diff, Git has to distinguish revision arguments from path arguments. A revision argument is a particular kind of expression.

The -- marker tells Git that everything after it should be treated as a pathspec rather than as another revision.

A pathspec is Git's language for saying which parts of the tree you mean. Sometimes that is a literal path such as src/main.c. Sometimes it is a directory such as docs/. Sometimes it is a pattern. A revision changes where the history walk starts. A pathspec narrows which paths matter during that walk.

These two commands do different things:

git log main src
git log main -- src

In the first form, nothing separates src from the revisions, so Git has to work out for itself whether src names a revision or a path. In the second, main is the history starting point and src is a pathspec that limits the query to that part of the tree.

A Tiny -- Probe

You can make the ambiguity visible with a branch and a directory that share the same name:

# -b (Git 2.28 or newer) names the first branch main, which the commands below assume
git init -b main demo-double-meaning
cd demo-double-meaning

mkdir -p src docs
printf 'one\n' > src/file.txt
git add .
git commit -m 'add src file'

git branch src

printf 'note\n' > docs/note.txt
git add docs/note.txt
git commit -m 'docs only'

git show-ref --heads
git log --oneline main src
git log --oneline main -- src

show-ref proves that src is a branch name in this repository, but there is also a src/ directory. git log main src runs straight into that ambiguity: Git refuses to guess and stops with an error saying the argument is both a revision and a filename. git log main -- src removes it by saying, plainly, "walk from main, then treat src as a path."

Ranges Are (Simple) Set Algebra

Git history ranges make a little more sense once you stop reading them as punctuation tricks and start reading them as set operations over reachable commits.

Git begins with one or more starting commits, walks backward through parent links, and accumulates the commits reachable from those starting points. Range syntax changes which starting points are included and which reachable commits are excluded.

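Range syntax adds another layer. The most common forms are:

  A..B    commits reachable from B but not from A
  A...B   commits reachable from either A or B, but not from both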

The three-dot form is easy to misread. In git log, A...B is a set operation over history: the symmetric difference of the two reachable sets. In git diff, the same spelling instead compares the merge base of A and B with B. The punctuation is shared, but the command still matters.

If you keep the set model in mind, many Git queries get easier to parse. "Show me what is on this branch but not that one" is reachable-set subtraction. "Show me the commits unique to either side" is symmetric difference. Git is doing set algebra here, just with more carets than anyone would choose on purpose.
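One way to convince yourself of the set reading is to spell the same range several ways and compare the results. The sketch below assumes a scratch repository with hypothetical main and topic branches that diverge after a shared commit; the variable names are illustrative.

```shell
# Scratch repository: main and topic diverge after a shared base commit.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m 'base'
git checkout -q -b topic
git commit -q --allow-empty -m 'topic 1'
git commit -q --allow-empty -m 'topic 2'
git checkout -q main
git commit -q --allow-empty -m 'main 1'

# Subtraction: three spellings of "reachable from topic, not from main".
dotdot=$(git rev-list main..topic)
caret=$(git rev-list ^main topic)
notform=$(git rev-list topic --not main)

# Symmetric difference: both reachable sets, minus everything reachable from their merge base(s).
symmetric=$(git rev-list main...topic | sort)
expanded=$(git rev-list main topic --not $(git merge-base --all main topic) | sort)

# Counts per side: commits only on main first, then commits only on topic.
git rev-list --left-right --count main...topic
```

The three subtraction spellings should list the same two commits, and the expanded form of the symmetric difference should match the three-dot shorthand.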

rev-parse and rev-list

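At the plumbing level, keep in mind the difference between naming and walking. git rev-parse resolves a revision expression to object IDs. git rev-list starts from those commits, walks the graph, and lists what it reaches.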

As a rough implementation sketch: first Git decides what commits you mean. Then it walks the graph. Then, depending on the command, it formats or filters the result.

git log sometimes feels fast and sometimes expensive for exactly this reason. Despite the name, it is doing traversal work rather than simply reading from a log.
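The naming and walking split is easy to see side by side. This is a small sketch in a scratch repository with three empty commits; the commit messages and variable names are made up for the illustration.

```shell
# Scratch repository with a three-commit linear history.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m 'first'
git commit -q --allow-empty -m 'second'
git commit -q --allow-empty -m 'third'

# Naming: rev-parse turns the expression into exactly one object ID.
named=$(git rev-parse HEAD~1)

# Walking: rev-list starts at that commit and follows parent links to the root.
walked=$(git rev-list HEAD~1)

printf 'named:\n%s\n' "$named"
printf 'walked:\n%s\n' "$walked"
```

The walk begins at the commit that rev-parse named and then continues to its ancestors, which is the part that costs more as history grows.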

A Few Commands Make Revision Syntax Concrete

This gets easier once you ask Git to resolve and walk a few expressions directly:

git rev-parse HEAD HEAD^ HEAD~3
git rev-list --count main..topic
git rev-list --left-right --count main...topic
git merge-base main topic
git log --first-parent main

Those five commands cover a lot of ground: naming a commit and its relative ancestors, subtracting one reachable set from another, counting each side of a symmetric difference, finding a merge base, and reading a mainline view instead of the full graph.

If you want the set operations to look like actual commits instead of counts:

git rev-list --oneline main..topic | sed -n '1,5p'
git rev-list --left-right --oneline main...topic | sed -n '1,5p'

main..topic is one-sided subtraction. main...topic is symmetric difference, and --left-right marks which side each commit came from.

How Git Walks the Graph

Once Git has a set of starting revisions, the basic traversal model is straightforward: read a commit, emit it if it survives the filters, follow its parent links, and keep going until the walk is exhausted.

What varies is how Git orders and prunes that walk.

For example:

One history graph can therefore produce very different views depending on the command options. A release engineer reading git log --first-parent main is asking a different question from someone reading the full branch topology of a topic branch.
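As a rough illustration of ordering and pruning on one small graph, the sketch below builds a scratch repository with a single merge and asks for several views of it. The options shown are only a sample of what git rev-list accepts, and every name in the script is illustrative.

```shell
# Scratch repository: base, one commit on each side, then a merge.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m 'base'
git checkout -q -b topic
git commit -q --allow-empty -m 'topic work'
git checkout -q main
git commit -q --allow-empty -m 'main work'
git merge -q --no-ff -m 'merge topic' topic

# Pruning: the same walk, with different filters on what survives.
all=$(git rev-list --count HEAD)
no_merges=$(git rev-list --count --no-merges HEAD)
only_merges=$(git rev-list --count --merges HEAD)

# Ordering: --topo-order never shows a parent before its children,
# so reversing it puts the root commit first.
oldest=$(git rev-list --reverse --topo-order HEAD | head -n 1)
root=$(git rev-list --max-parents=0 HEAD)

git log --oneline --graph --topo-order
```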

Commit-graph files help here too. The commit graph itself already exists in the commit objects, but the commit-graph file stores derived metadata that lets Git answer ancestry questions more efficiently without recomputing the same structure from scratch on every run.
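If you want to see the file itself, a sketch like the following writes one in a scratch repository and checks it. The repository contents are placeholders, and the path shown assumes the single-file layout that git commit-graph write produces when --split is not used.

```shell
# Scratch repository with a few commits to index.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m 'first'
git commit -q --allow-empty -m 'second'
git commit -q --allow-empty -m 'third'

# Write derived metadata for every commit reachable from the refs.
git commit-graph write --reachable

# Check the file against the object database.
git commit-graph verify

# Without --split, the data lands in a single file under objects/info.
graph_file="$(git rev-parse --git-dir)/objects/info/commit-graph"
ls -l "$graph_file"
```

The commit objects are untouched by this; deleting the file only removes the shortcut, not the history.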

Merge Bases Are Computed, Not Stored

Merge Base
Best common ancestor Git computes for two or more commits.

Git does not store a permanent "common ancestor" field for every possible pair of commits. When a command needs a merge base, Git computes it by walking ancestry and finding the best common ancestor or ancestors for the commits in question.

That matters in more places than just merges. Merge bases show up in:

Git is not reading a canned answer here. It computes merge bases from the commit graph.
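A short sketch makes the computation visible, including the way three-dot git diff leans on it. The repository, branch names, and file names below are assumptions made for the example.

```shell
# Scratch repository: main and topic each add one file after a shared base.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

echo base > base.txt;   git add base.txt;  git commit -q -m 'base'
git checkout -q -b topic
echo topic > topic.txt; git add topic.txt; git commit -q -m 'topic work'
git checkout -q main
echo main > main.txt;   git add main.txt;  git commit -q -m 'main work'

# Git finds the best common ancestor by walking both ancestries.
base=$(git merge-base main topic)
git log -1 --format='merge base: %s' "$base"

# --is-ancestor answers a yes/no ancestry question through its exit status.
if git merge-base --is-ancestor "$base" topic; then
  echo 'the merge base is an ancestor of topic'
fi

# In git diff, main...topic means "from the merge base to topic".
three_dot=$(git diff --name-only main...topic)
explicit=$(git diff --name-only "$base" topic)
```

Both diffs should list only topic.txt, because the change made on main after the merge base is left out of the comparison.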

Path-Limited History Is Another Layer on Top

Plain history traversal walks commits and parent links. Path-limited history adds another job: deciding whether the path or paths you asked about changed across those commits.

Commands like git log -- path/to/file, git rev-list -- path/to/file, and git blame path/to/file do more than a plain history walk with fewer results. Git may have to inspect trees while it walks the graph in order to determine whether the path's content or location changed between a commit and its parents.

Path history is often more expensive than branch history for the same reason. The graph walk is still there, but now tree comparisons join the query.

A Tiny Path-History Probe

You can watch that extra work happen in a tiny repository:

git init demo-path-history
cd demo-path-history

mkdir -p src docs
printf 'one\n' > src/file.txt
git add .
git commit -m 'add file'

printf 'note\n' > docs/note.txt
git add docs/note.txt
git commit -m 'docs only'

git mv src/file.txt src/renamed.txt
git commit -m 'rename file'

printf 'two\n' >> src/renamed.txt
git commit -am 'edit file'

git rev-list --oneline HEAD
git log --oneline -- src/renamed.txt
git log --oneline --follow -- src/renamed.txt

git rev-list --reverse HEAD | while read -r commit
do
  printf '\n== %s ==\n' "$commit"
  git diff-tree --root --find-renames --name-status -r "$commit"
done

rev-list shows the whole reachable commit set. The path-limited log commands ask a narrower question about one file. diff-tree makes the extra job visible: Git still walks the graph, but it also compares tree state at each step, and --follow adds rename inference on top.

The basic shape looks like this:

  1. choose starting commits
  2. walk parent links
  3. for each commit of interest, inspect trees as needed
  4. keep or discard commits based on whether the path changed

That is already more work than a plain commit walk, and rename-following adds another bill.
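The four steps above can be imitated by hand with plumbing commands. This is a deliberately naive sketch, not how Git implements path limiting internally: it assumes a linear history (so it ignores the simplification Git applies around merges), and the repository, path, and variable names are illustrative.

```shell
# Scratch repository: only two of the three commits touch src/file.txt.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

mkdir src docs
printf 'one\n' > src/file.txt;   git add .; git commit -q -m 'add file'
printf 'note\n' > docs/note.txt; git add .; git commit -q -m 'docs only'
printf 'two\n' >> src/file.txt;  git commit -q -am 'edit file'

path=src/file.txt
manual=$(
  git rev-list HEAD |                 # steps 1-2: start at HEAD, walk parent links
  while read -r commit; do
    # step 3: compare this commit's tree with its parent's, limited to the path
    changed=$(git diff-tree --root --no-commit-id --name-only -r "$commit" -- "$path")
    # step 4: keep the commit only if the path changed
    if [ -n "$changed" ]; then
      printf '%s\n' "$commit"
    fi
  done
)

# Git's own path-limited walk, for comparison.
builtin=$(git rev-list HEAD -- "$path")
```

On this linear history the hand-rolled loop and the built-in walk should agree, and the loop makes it plain that every commit costs a tree comparison.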

Rename Detection Is Heuristic

Git does not store "rename" as a first-class event in commit history. As Chapter 2 showed, trees store names and blobs store content; if a path disappears and a similar path appears, Git may infer that one became the other.

That means path history with rename-following is not simply reading metadata. It is asking Git to compare content similarity across snapshots and infer whether one path should be treated as the continuation of another.

That has a few consequences:

Git's data model shows up directly in user experience here. Git gets flexibility because renames are inferred rather than stored, and the bill shows up at read time.
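To see that a rename is an inference rather than a stored fact, compare the same pair of commits with rename detection on, off, and restricted to exact matches. The sketch assumes a scratch repository with made-up file names, and a ten-line file so that one appended line leaves the content mostly similar.

```shell
# Scratch repository: one file is renamed and lightly edited in the same commit.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

for i in 1 2 3 4 5 6 7 8 9 10; do
  printf 'line %s of the original file\n' "$i"
done > old.txt
git add old.txt
git commit -q -m 'add file'

git mv old.txt new.txt
printf 'one appended line\n' >> new.txt
git commit -q -am 'rename and edit'

# Same two snapshots, three different readings.
with_renames=$(git diff-tree -r --name-status --find-renames HEAD~1 HEAD)
without_renames=$(git diff-tree -r --name-status --no-renames HEAD~1 HEAD)
exact_only=$(git diff-tree -r --name-status --find-renames=100% HEAD~1 HEAD)

printf 'detected:\n%s\n\n' "$with_renames"
printf 'not detected:\n%s\n\n' "$without_renames"
printf 'exact renames only:\n%s\n' "$exact_only"
```

With detection on, Git reports a single R entry with a similarity score. With detection off, or limited to exact matches, the same change reads as one deletion and one addition.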

Why Path History Gets Slow

When a path-limited command is slow, several costs may be stacking up at once:

Performance advice for path-limited history commands looks different from advice for git status. The first problem is primarily graph and tree work; the second is primarily index and filesystem work.

Changed-path Bloom filters help precisely because they let Git skip some tree inspections during path-limited history queries. They do not replace the commit graph or change the logical history model. They are a fast negative test layered on top of the existing walk.
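A minimal sketch of turning the filters on, assuming a Git new enough to support the --changed-paths option of git commit-graph write and a scratch repository with illustrative paths. It also checks that the answer to a path-limited query is the same whether or not Git consults the commit-graph file.

```shell
# Scratch repository: only two of the three commits touch src/file.txt.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

mkdir src docs
printf 'one\n' > src/file.txt;   git add .; git commit -q -m 'add file'
printf 'note\n' > docs/note.txt; git add .; git commit -q -m 'docs only'
printf 'two\n' >> src/file.txt;  git commit -q -am 'edit file'

# Store changed-path filters alongside the commit-graph data.
git commit-graph write --reachable --changed-paths

# The filters may let Git skip tree comparisons; they must not change the result.
with_graph=$(git rev-list HEAD -- src/file.txt)
without_graph=$(git -c core.commitGraph=false rev-list HEAD -- src/file.txt)
```

On a repository this small there is nothing to measure, so treat this as a correctness check; any speedup only becomes visible on histories with many commits and large trees.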

First-Parent History Is a Different View

One especially useful history mode is --first-parent. It tells Git to follow only the first parent at merge commits, which turns a fully branched history into a cleaner mainline view.

This is useful when the question is not "what is every commit reachable from here?" but something narrower, such as:

That option does not reveal a truer history. It reveals a different slice of the history graph, tuned for a different kind of reading.
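The difference between the two slices shows up even in a five-commit graph. The sketch assumes a scratch repository where a hypothetical topic branch with two commits is merged into main; all names are illustrative.

```shell
# Scratch repository: topic contributes two commits, then is merged into main.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m 'base'
git checkout -q -b topic
git commit -q --allow-empty -m 'topic 1'
git commit -q --allow-empty -m 'topic 2'
git checkout -q main
git commit -q --allow-empty -m 'main work'
git merge -q --no-ff -m 'merge topic' topic

# Full reachability versus the mainline slice.
full=$(git rev-list --count main)
mainline=$(git rev-list --count --first-parent main)

# The mainline view: the merge, the commit made on main, and the base.
git log --oneline --first-parent main
```

The two topic commits are still reachable from main; the first-parent walk simply never steps onto the second parent of the merge.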

History Is Computed Every Time You Ask for It

The main idea of this chapter is simple: Git history is a computed view over immutable objects and parent links.

When you ask a history question, Git has to:

Once that model is in place, several things get easier to understand at once: