High Performance Git

Section II · History and Rewrite

Chapter 4

Revisions and History Traversal

[Illustration: pencil sketch of people fishing from a long wooden pier, with waterfront houses and boats in the distance.]

Git history is a graph Git walks every time you ask a question. Commits are objects with parent pointers; the history view you see is computed from those walks on demand. git log, git show, git blame, git bisect, and git merge-base all start from the same basic move: pick commits, walk parent links, and keep the results that match the question you asked.

This chapter takes that in three layers:


Revisions Are Expressions, Not Just Commit IDs

In everyday use, a "revision" often means "some commit I want to refer to," but Git's revision syntax is (for better or worse) broader than that. A revision expression can name a ref, a commit ID, a relative ancestor, a set difference, or a merge-related relationship that Git has to compute. If you've seen tildes and carets in revisions and since forgotten what they mean, this is where they come from.

Revision
Expression that Git resolves to one or more commits or objects.

If revision syntax feels dense at first, that is because Git is giving you a compact language for describing commit sets.

The important distinction is that ^ picks a specific parent edge, while ~ stays on the first-parent line and keeps walking. A bare ^ means ^1, so HEAD~3 and HEAD^^^ always name the same commit. At a merge commit the two operators diverge: merge~1 is the first parent, while merge^2 is the second parent, the tip that was merged in.
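
To see the two operators side by side, here is a minimal sketch that builds a throwaway repository with one merge commit. The temporary directory, the demo identity, and the one-letter commit messages are assumptions made for illustration; only the revision expressions themselves come from the text above.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main      # pin the branch name regardless of defaults
git config user.name demo
git config user.email demo@example.com

# Mainline: A, B, C, then D after the side branch forks.
git commit -q --allow-empty -m 'A'
git commit -q --allow-empty -m 'B'
git commit -q --allow-empty -m 'C'
git checkout -q -b topic
git commit -q --allow-empty -m 'T'
git checkout -q main
git commit -q --allow-empty -m 'D'
git merge -q --no-ff -m 'M' topic          # HEAD is now the merge commit M

# ~3 walks three first-parent steps; ^^^ is ^1 applied three times.
tilde=$(git rev-parse HEAD~3)
carets=$(git rev-parse HEAD^^^)

# At the merge, ~1 stays on the mainline and ^2 picks the merged-in side.
first=$(git log -1 --format=%s HEAD~1)
second=$(git log -1 --format=%s HEAD^2)

printf 'HEAD~3  = %s\nHEAD^^^ = %s\n' "$tilde" "$carets"
printf 'HEAD~1 is %s, HEAD^2 is %s\n' "$first" "$second"   # HEAD~1 is D, HEAD^2 is T
```

The two rev-parse results are the same ID (commit B), while the last line shows the merge's two parents going in different directions.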

Revisions are expressions that Git resolves to objects, with raw IDs being only one possible form. One more symbol is worth covering before the range syntax.

The Separator To Reduce Ambiguity

A small but easy-to-miss bit of revision syntax: --.

Pathspec
Git path-matching expression used to limit a command to particular files, directories, or patterns.

In commands such as git log, git show, or git diff, Git has to distinguish revision arguments from path arguments, and the -- marker tells Git that everything after it should be treated as a pathspec rather than as another revision. A pathspec is Git's language for saying which parts of the tree you mean: a literal path such as src/main.c, a directory such as docs/, or a pattern. The distinction matters because a revision changes where the history walk starts, while a pathspec narrows which paths matter during that walk.

These two commands do different things:

git log main src
git log main -- src

In the first form, src is still in revision-argument position. In the second, main is the history starting point and src is a pathspec that limits the query to that part of the tree.

A Tiny -- Probe

You see this with a branch and a directory that share the same name:

# -b main (Git 2.28 or newer) pins the branch name that the log commands below rely on
git init -b main demo-double-meaning
cd demo-double-meaning

mkdir -p src docs
printf 'one\n' > src/file.txt
git add .
git commit -m 'add src file'

git branch src

printf 'note\n' > docs/note.txt
git add docs/note.txt
git commit -m 'docs only'

git show-ref --heads
git log --oneline main src
git log --oneline main -- src

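show-ref proves that src is a branch name in this repository, but there is also a src/ directory. git log main src runs straight into that ambiguity: src is both a valid revision and an existing path, so Git refuses to guess and stops with an "ambiguous argument" error that tells you to use --. git log main -- src removes the ambiguity by saying, plainly, "walk from main, then treat src as a path."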

Ranges Are (Simple) Set Algebra

Git history ranges make a little more sense once you stop reading them as punctuation tricks and start reading them as set operations over reachable commits.

Reachability
Whether an object — a commit, tree, or blob — can be reached by following parent links and content references starting from a given commit or set of tips.

Git begins with one or more starting commits, walks backward through parent links, and accumulates the commits reachable from those starting points. Range syntax changes which starting points are included and which reachable commits are excluded.

The three-dot form is easy to misread. In git log, A...B is a set operation over history: the commits reachable from either A or B but not from both. In git diff, the same spelling is merge-base based: it compares B against the merge base of A and B. The punctuation is shared, but the command still matters.

If you keep the set model in mind, many Git queries get easier to parse. "Show me what is on this branch but not that one" is reachable-set subtraction. "Show me the commits unique to either side" is symmetric difference. Git is doing set algebra here, just with more carets than anyone would choose on purpose.
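
The set reading can be checked directly. This sketch assumes a throwaway repository with hypothetical branches main and topic that diverge from one shared commit; it compares each range spelling with its longhand equivalent.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

git commit -q --allow-empty -m 'base'
git checkout -q -b topic
git commit -q --allow-empty -m 'topic 1'
git commit -q --allow-empty -m 'topic 2'
git checkout -q main
git commit -q --allow-empty -m 'main 1'

# Subtraction: reachable from topic, minus everything reachable from main.
two_dot=$(git rev-list main..topic | sort)
caret=$(git rev-list ^main topic | sort)

# Symmetric difference: reachable from either tip, minus what both sides share.
three_dot=$(git rev-list main...topic | sort)
spelled_out=$(git rev-list main topic --not $(git merge-base --all main topic) | sort)

only_topic=$(git rev-list --count main..topic)
sides=$(git rev-list --left-right --count main...topic)   # main-only count, then topic-only count

echo "main..topic has $only_topic commits"
echo "main...topic splits as: $sides"
```

In this repository the subtraction yields the two topic commits, and the symmetric difference yields those two plus the single commit unique to main.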

rev-parse and rev-list

At the plumbing level, I keep naming and walking as two separate ideas: git rev-parse resolves revision expressions, git rev-list walks commits, and git log is largely rev-list plus formatting and presentation. Roughly, Git first decides what commits you mean, then walks the graph, then formats or filters the result depending on the command.

git log sometimes feels fast and sometimes slow for exactly this reason. Despite the name, it is doing graph traversal, not reading a log file.
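
A rough way to see the three stages separately is to run them one at a time. The repository below is a hypothetical three-commit demo; the point is only that rev-list and log produce the same walk once log's formatting is reduced to bare IDs.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

git commit -q --allow-empty -m 'one'
git commit -q --allow-empty -m 'two'
git commit -q --allow-empty -m 'three'

# 1. Naming: rev-parse resolves an expression to an object ID. No walk happens here.
tip=$(git rev-parse HEAD)

# 2. Walking: rev-list follows parent links starting from that ID.
walked=$(git rev-list "$tip")

# 3. Presentation: log performs the same walk, then formats each commit.
logged=$(git log --format=%H "$tip")

echo "$walked"
```

With the format reduced to %H, the log output and the rev-list output are identical, which is the sense in which log is rev-list plus presentation.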

A Few Commands Make Revision Syntax Concrete

Resolve and walk a few expressions directly and it gets easier:

git rev-parse HEAD HEAD^ HEAD~3
git rev-list --count main..topic
git rev-list --left-right --count main...topic
git merge-base main topic
git log --first-parent main

Those five commands cover a lot of ground: naming one commit, naming relative ancestors, subtracting one reachable set from another, finding a merge base, and reading a mainline view instead of the full graph.

If you want the set operations to look like actual commits instead of counts:

git rev-list --oneline main..topic | sed -n '1,5p'
git rev-list --left-right --oneline main...topic | sed -n '1,5p'

main..topic is one-sided subtraction. main...topic is symmetric difference, and --left-right marks which side each commit came from.

How Git Walks the Graph

Once Git has a set of starting revisions, the basic traversal model is straightforward: read a commit, emit it if it survives the filters, follow its parent links, and keep going until the walk is exhausted.

What varies is how Git orders and prunes that walk.

For example:

One history graph can therefore produce very different views depending on the command options. A release engineer reading git log --first-parent main is asking a different question from someone reading the full branch topology of a topic branch.
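
Ordering is one place where the same graph visibly produces different views. The sketch below is an illustration under assumptions: a throwaway repository, a hypothetical helper named at that pins commit timestamps so the output is reproducible, and two interleaved lines of work that are then merged.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

# Hypothetical helper: run a command with fixed author and committer timestamps.
at() {
  stamp="$1 +0000"; shift
  GIT_AUTHOR_DATE="$stamp" GIT_COMMITTER_DATE="$stamp" "$@"
}

# Work alternates between main (A, B) and topic (T1, T2), then merges as M.
at 1700000001 git commit -q --allow-empty -m 'A'
git checkout -q -b topic
at 1700000002 git commit -q --allow-empty -m 'T1'
git checkout -q main
at 1700000003 git commit -q --allow-empty -m 'B'
git checkout -q topic
at 1700000004 git commit -q --allow-empty -m 'T2'
git checkout -q main
at 1700000005 git merge -q --no-ff -m 'M' topic

# Same commits, two orderings.
by_date=$(git log --date-order --format=%s | paste -sd' ' -)
by_topo=$(git log --topo-order --format=%s | paste -sd' ' -)

echo "date-order: $by_date"   # M T2 B T1 A: timestamps interleave the two lines
echo "topo-order: $by_topo"   # T2 and T1 stay adjacent; B lands before or after them
```

Both orderings keep children ahead of their parents. The difference is that --date-order otherwise sorts by commit timestamp, while --topo-order avoids mixing the two lines of history together.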

Commit-graph files help here too. The commit graph itself already exists in the commit objects, but the commit-graph file stores derived metadata that lets Git answer ancestry questions more efficiently without recomputing the same structure from scratch on every run.
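
If you want to see that file appear, the following sketch writes one in a throwaway repository and checks it. It assumes a Git version that ships git commit-graph with the write and verify subcommands, and the default single-file layout under .git/objects/info.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

git commit -q --allow-empty -m 'one'
git commit -q --allow-empty -m 'two'
git commit -q --allow-empty -m 'three'

before=$(git rev-list HEAD)

# Write derived metadata for every commit reachable from the refs.
git commit-graph write --reachable

# Check the file against the commit objects it was derived from.
git commit-graph verify

after=$(git rev-list HEAD)
ls -l .git/objects/info/commit-graph
```

The walk returns the same commits before and after. The file is an accelerator derived from the objects, and it adds nothing to the history itself.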

Merge Bases and Path-Limited History

Merge Base
Best common ancestor Git computes for two or more commits.

Git does not store a permanent "common ancestor" field for every possible pair of commits. When a command needs a merge base, Git computes it by walking ancestry and finding the best common ancestor or ancestors for the commits in question.
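
A small sketch makes the computation visible. It assumes a throwaway repository with hypothetical branches main and topic that fork from one commit, and it uses the merge base three ways: directly, as an ancestry test, and implicitly through a three-dot diff.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

echo base > base.txt
git add base.txt && git commit -q -m 'base'
git checkout -q -b topic
echo topic > topic.txt
git add topic.txt && git commit -q -m 'topic work'
git checkout -q main
echo main > main.txt
git add main.txt && git commit -q -m 'main work'

# Computed on demand from the graph; nothing stored names this commit as "the base".
base=$(git merge-base main topic)
base_subject=$(git log -1 --format=%s "$base")

# Ancestry questions use the same machinery. The exit status carries the answer.
if git merge-base --is-ancestor "$base" topic; then
  echo "$base_subject is an ancestor of topic"
fi

# A three-dot diff starts from the merge base, so only topic's side shows up.
three_dot=$(git diff --name-only main...topic)
explicit=$(git diff --name-only "$base" topic)
two_dot=$(git diff --name-only main topic)

echo "main...topic: $three_dot"
echo "main topic:  " $two_dot
```

The three-dot diff lists only topic.txt, while the plain two-endpoint diff also reports main.txt, because it compares the two tips directly.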

That matters in more places than just merges. In these cases, Git is computing merge bases from the commit graph instead of reading a canned history:

Plain history traversal walks commits and parent links. Path-limited history adds another job: deciding whether the path or paths you asked about changed across those commits.

Commands like git log -- path/to/file, git rev-list -- path/to/file, and git blame path/to/file do more than a plain history walk with fewer results. Git may have to inspect trees while it walks the graph in order to determine whether the path's content or location changed between a commit and its parents.

Path history is often more expensive than branch history for the same reason. The graph walk is still there, but now tree comparisons join the query.

A Tiny Path-History Probe

You can watch that extra work happen in a tiny repository:

git init demo-path-history
cd demo-path-history

mkdir -p src docs
printf 'one\n' > src/file.txt
git add .
git commit -m 'add file'

printf 'note\n' > docs/note.txt
git add docs/note.txt
git commit -m 'docs only'

git mv src/file.txt src/renamed.txt
git commit -m 'rename file'

printf 'two\n' >> src/renamed.txt
git commit -am 'edit file'

git rev-list --oneline HEAD
git log --oneline -- src/renamed.txt
git log --oneline --follow -- src/renamed.txt

git rev-list --reverse HEAD | while read -r commit
do
  printf '\n== %s ==\n' "$commit"
  git diff-tree --root --find-renames --name-status -r "$commit"
done

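rev-list shows the whole reachable commit set: four commits. The path-limited log commands ask a narrower question about one file. Without --follow, the log for src/renamed.txt stops at the rename commit, where that path first appears; with --follow, Git carries the query across the rename and also lists the commit that added src/file.txt. diff-tree makes the extra job clear: Git still walks the graph, but it also compares tree state at each step, and --follow adds rename inference on top.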

The basic shape looks like this:

  1. choose starting commits
  2. walk parent links
  3. for each commit of interest, inspect trees as needed
  4. keep or discard commits based on whether the path changed

That is already more work than a plain commit walk, and rename-following adds another layer of cost.
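
The four steps above can be imitated by hand with plumbing commands. This is a deliberately naive sketch, not how Git implements path limiting internally: it assumes a linear history with no merges, rebuilds a small demo repository like the probe above, and ignores history simplification and renames.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

mkdir src docs
printf 'one\n' > src/file.txt
git add . && git commit -q -m 'add file'
printf 'note\n' > docs/note.txt
git add . && git commit -q -m 'docs only'
git mv src/file.txt src/renamed.txt
git commit -q -m 'rename file'
printf 'two\n' >> src/renamed.txt
git commit -q -am 'edit file'

path=src/renamed.txt

manual=$(
  # Steps 1 and 2: start at HEAD and walk parent links.
  git rev-list HEAD |
  while read -r commit; do
    # Step 3: compare this commit's tree with its parent's, limited to the path.
    changed=$(git diff-tree --root -r --no-commit-id --name-only "$commit" -- "$path")
    # Step 4: keep the commit only if the path changed.
    if [ -n "$changed" ]; then
      git log -1 --format=%s "$commit"
    fi
  done
)

builtin=$(git log --format=%s -- "$path")
echo "$manual"
```

On this history the hand-rolled loop and git log -- src/renamed.txt agree: both keep "edit file" and "rename file" and drop the other two commits. The loop also shows where the cost comes from, since every commit on the walk triggers a tree comparison.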

Rename Detection Is Heuristic

Git does not store "rename" as a first-class event in commit history. As Chapter 2 showed, trees store names and blobs store content; if a path disappears and a similar path appears, Git may infer that one became the other.

That means path history with rename-following asks Git to compare content similarity across snapshots and infer whether one path should be treated as the continuation of another. It is real query-time work, well beyond reading metadata.

That has a few consequences:

Git's data model shows up directly in user experience here. Git gets flexibility because renames are inferred rather than stored, and the cost shows up at read time.
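
You can watch the inference succeed and fail in a few lines. The sketch below is an illustration with hypothetical file names: the first commit renames a file without touching its content, and the second renames it again while replacing every line.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

seq 1 20 > old.txt
git add old.txt && git commit -q -m 'add old.txt'

# Pure rename: the same blob appears under a new name.
git mv old.txt new.txt
git commit -q -m 'pure rename'
plain=$(git diff-tree -r --no-commit-id --no-renames --name-status HEAD)
detected=$(git diff-tree -r --no-commit-id --find-renames --name-status HEAD)

# Rename plus total rewrite: no similar content is left to match on.
git mv new.txt newer.txt
seq 101 120 > newer.txt
git commit -q -am 'rename and rewrite'
rewritten=$(git diff-tree -r --no-commit-id --find-renames --name-status HEAD)

printf 'without detection:\n%s\n' "$plain"
printf 'with detection:\n%s\n' "$detected"
printf 'rename plus rewrite:\n%s\n' "$rewritten"
```

The stored data for the first rename is just a delete and an add; the R100 line exists only because diff-tree was asked to look for similar content. In the second commit the same request finds nothing similar enough, so Git reports an unrelated delete and add even though a person would call it a rename.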

Why Path History Gets Slow

When a path-limited command is slow, several costs may be stacking up at once:

Performance advice for path-limited history commands looks different from advice for git status. The first problem is primarily graph and tree work; the second is primarily index and filesystem work.

Changed-path Bloom filters help precisely because they let Git skip some tree inspections during path-limited history queries. They sit on top of the existing walk as a fast negative test, leaving the commit graph and the logical history model intact.
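
Turning the filters on is a single option when writing the commit-graph file. This sketch assumes a Git version whose git commit-graph write accepts --changed-paths (Git 2.27 or newer, as far as I know) and a tiny hypothetical repository, so it demonstrates only that the answer is unchanged, not the speedup.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

mkdir src docs
printf 'one\n' > src/file.txt
git add . && git commit -q -m 'add file'
printf 'note\n' > docs/note.txt
git add . && git commit -q -m 'docs only'
printf 'two\n' >> src/file.txt
git commit -q -am 'edit file'

before=$(git log --format=%s -- src/file.txt)

# Store a changed-path Bloom filter per commit alongside the commit-graph data.
git commit-graph write --reachable --changed-paths

after=$(git log --format=%s -- src/file.txt)
echo "$after"
```

The query returns the same two commits either way. On a large history the filter lets Git rule out commits such as "docs only" without opening their trees.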

First-Parent History Is a Different View

One especially useful history mode is --first-parent. It tells Git to follow only the first parent at merge commits, which turns a fully branched history into a cleaner mainline view.

This is useful when the question is not "what is every commit reachable from here?" but something narrower, such as:

That option reveals a different slice of the history graph, tuned for a different kind of reading. It is one valid view among several, each shaped to the question being asked.
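
Here is a small comparison of the two views, assuming a throwaway repository in which a hypothetical topic branch with two commits is merged into main.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

git commit -q --allow-empty -m 'A'
git checkout -q -b topic
git commit -q --allow-empty -m 'T1'
git commit -q --allow-empty -m 'T2'
git checkout -q main
git merge -q --no-ff -m 'Merge topic' topic
git commit -q --allow-empty -m 'C'

# Full view: every commit reachable from main.
full=$(git rev-list --count main)

# Mainline view: at each merge, follow only the first parent.
mainline=$(git log --first-parent --format=%s main | paste -sd, -)

echo "reachable commits: $full"        # 5
echo "first-parent view: $mainline"    # C,Merge topic,A
```

The topic commits are still reachable and still part of history. The first-parent view simply summarizes them as the single merge that brought them in.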

History Is Computed Every Time You Ask for It

The main idea of this chapter is simple: Git history is a computed view over immutable objects and parent links.

When you ask a history question, Git has to:

Once that model is in place, several things get easier to understand at once: