Chapter 2: Git's Core Data Model | High Performance Git

Under the hood, Git is a small object database with a naming layer on top. Objects hold the data, refs point at them, the index stores the next snapshot, and the working tree is a checkout of repository state rather than the repository itself.

The sections that follow walk through the object store, refs, HEAD, the index, reflogs, and the working tree. The rest of the book builds on this model. Some of your mental model of Git may shift along the way, and most of the performance payoff comes from that shift.

Git Stores Snapshots, Not Patch Files

One of the biggest surprises when you dive into Git internals: Git does not store each commit as "the previous commit plus a diff."

At the logical level, each commit names a full snapshot through a tree object. Diffs are something Git computes when you ask for them rather than the primary representation of history.

Several things follow from that. A commit can be checked out without replaying a patch chain. A merge compares snapshots and ancestry rather than blindly applying text deltas. Path history queries are sometimes expensive because Git has to work out which snapshots changed a path; there is no neatly stored per-file timeline to read off.

Git does use delta compression inside packfiles to save storage and network bytes, and that is the closest thing in Git to a stored diff. But it is a physical encoding detail; the logical model is still full snapshots connected by parent links.
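A small sketch of that distinction, run in a throwaway repository (the file name notes.txt and the commit messages are placeholders): the commit object lists a tree and a parent, the snapshot is readable directly, and the diff appears only when asked for.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name Example
git config user.email example@example.com

printf 'one\n' > notes.txt
git add notes.txt
git commit -q -m 'first'

printf 'one\ntwo\n' > notes.txt
git commit -q -am 'second'

# The commit object names a tree and a parent; it contains no patch.
git cat-file -p HEAD

# The full file content is readable straight from the snapshot.
snapshot=$(git cat-file -p HEAD:notes.txt)
printf '%s\n' "$snapshot"

# The diff is computed on demand by comparing two snapshots.
git --no-pager diff HEAD~1 HEAD
```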

The Four Object Types

At the core of Git are four object types.

Blob
Tree
Commit
Tag

Everything else in day-to-day Git is built around these.

Blob

A blob is file content and nothing more. It has no filename, no path, and no executable bit, so if two files anywhere in the repository have identical bytes, they can point to the same blob object. That is one reason Git can reuse storage so effectively.

You can test that directly with a plumbing command:

printf 'hello\n' > first.txt
printf 'hello\n' > second.txt
git hash-object first.txt
git hash-object second.txt

Those two object IDs match because the blob does not include the filename. Change one byte in one file and run the command again; now the IDs diverge.

Tree

A tree is one directory snapshot. It maps names to object IDs and modes, with some entries pointing to blobs that represent files and others pointing to trees that represent subdirectories. Paths live here, not in blobs.

That is an important design choice, because it means the same blob can appear under different names or in different directories across history without changing the blob itself. Git infers renames at query time when a command asks for that view, rather than storing them as first-class events in the object model. The path-history chapter comes back to the cost of that.

You can test that with plumbing too:

blob=$(printf 'hello\n' | git hash-object -w --stdin)

tree_a=$(printf '100644 blob %s\talpha.txt\n' "$blob" | git mktree)
tree_b=$(printf '100644 blob %s\tbeta.txt\n' "$blob" | git mktree)

echo "$tree_a"
echo "$tree_b"
git cat-file -p "$tree_a"
git cat-file -p "$tree_b"

The blob is the same both times. The tree IDs differ because the name lives in the tree. Change the path, mode, or directory structure, and you get a different tree object even if the file bytes stay the same.

A Small Example Snapshot

Imagine a repository with three tracked files:

README.md
src/main.ts
src/store.ts

Git stores that snapshot as a small set of related objects:

a blob for README.md
a blob for src/main.ts
a blob for src/store.ts
a tree for src/, pointing to the two src blobs
a root tree, pointing to README.md and the src/ tree
a commit, pointing to the root tree

A commit names a full snapshot, but it does so by pointing at a tree structure that can reuse a great deal of existing data (like blobs and other trees). Without even looking at more advanced compression, Git's commits and branches can stay lightweight through reuse alone.

If you change only src/store.ts and commit, Git writes a new blob for the new file content, a new src/ tree because one entry changed there, a new root tree because the src/ subtree changed, and a new commit that points to the new root tree. The README.md blob and the src/main.ts blob stay exactly as they were. For any real repository, this reuse matters.
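The reuse described above can be checked in a throwaway repository. This is a sketch with placeholder file contents; it compares the IDs that each path resolves to before and after a commit that touches only src/store.ts.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name Example
git config user.email example@example.com

mkdir src
printf '# demo\n' > README.md
printf 'export const main = 1;\n' > src/main.ts
printf 'export const store = 1;\n' > src/store.ts
git add .
git commit -q -m 'first snapshot'

printf 'export const store = 42;\n' > src/store.ts
git commit -q -am 'change store.ts only'

# Unchanged files resolve to the same blob IDs in both commits.
readme_old=$(git rev-parse HEAD~1:README.md)
readme_new=$(git rev-parse HEAD:README.md)
main_old=$(git rev-parse HEAD~1:src/main.ts)
main_new=$(git rev-parse HEAD:src/main.ts)

# The edited file, its directory tree, and the root tree get new IDs.
store_old=$(git rev-parse HEAD~1:src/store.ts)
store_new=$(git rev-parse HEAD:src/store.ts)
src_old=$(git rev-parse HEAD~1:src)
src_new=$(git rev-parse HEAD:src)
root_old=$(git rev-parse 'HEAD~1^{tree}')
root_new=$(git rev-parse 'HEAD^{tree}')

printf 'README.md     %s -> %s\n' "$readme_old" "$readme_new"
printf 'src/main.ts   %s -> %s\n' "$main_old" "$main_new"
printf 'src/store.ts  %s -> %s\n' "$store_old" "$store_new"
printf 'src/          %s -> %s\n' "$src_old" "$src_new"
printf 'root tree     %s -> %s\n' "$root_old" "$root_new"
```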

"Git stores snapshots" can be true without meaning "Git copies the whole repository on every commit."

Why Paths Near the Root Matter More

Git's tree structure is recursive, so the cost of a change depends on where it lands in the hierarchy. If you edit src/lib/math/vector.ts, Git writes a new blob and a new tree object for each directory from vector.ts's directory back to the root. If you rename src/ itself, the impact is broader in a different way: the subtree object can be reused as-is, because a tree does not store its own name, but every path beneath src/ changes at once.

For one local edit in a small repository, this fades into the noise. In a large repository it adds up: each changed path costs a new tree at every level above it, while sibling subtrees that did not change are reused by object ID.
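A sketch of the deep-edit case in a throwaway repository (docs/guide.md is a hypothetical sibling added for contrast): after changing only vector.ts, each tree on the path back to the root has a new ID, while the untouched docs/ subtree keeps the ID it had.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name Example
git config user.email example@example.com

mkdir -p src/lib/math docs
printf 'v1\n' > src/lib/math/vector.ts
printf 'guide\n' > docs/guide.md
git add .
git commit -q -m 'base'

printf 'v2 changed\n' > src/lib/math/vector.ts
git commit -q -am 'deep edit'

# Every tree between the edited file and the root is rewritten.
for path in src/lib/math src/lib src; do
  printf '%-14s %s -> %s\n' "$path" \
    "$(git rev-parse "HEAD~1:$path")" "$(git rev-parse "HEAD:$path")"
done
math_before=$(git rev-parse HEAD~1:src/lib/math)
math_after=$(git rev-parse HEAD:src/lib/math)
src_before=$(git rev-parse HEAD~1:src)
src_after=$(git rev-parse HEAD:src)

# The sibling subtree was not touched, so its tree object is reused.
docs_before=$(git rev-parse HEAD~1:docs)
docs_after=$(git rev-parse HEAD:docs)
printf '%-14s %s -> %s\n' docs "$docs_before" "$docs_after"
```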

Commit

A commit points to one root tree and to zero or more parent commits. It also stores author and committer metadata (including timestamps) and a message. The commit object ties a snapshot to history.

Even a small history edit produces new commit IDs: if you change the parent list, the message, the tree, or the metadata, you have created a different commit object. Two commits built around the same tree but with different timestamps still get different IDs, because the commit bytes include the metadata too.

You can test that directly with plumbing too. Build one blob, one tree, and then two commits around the same tree:

blob=$(printf 'hello\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\thello.txt\n' "$blob" | git mktree)

GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com' \
GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com' \
GIT_AUTHOR_DATE='2024-01-01T00:00:00Z' GIT_COMMITTER_DATE='2024-01-01T00:00:00Z' \
git commit-tree "$tree" -m 'same snapshot'

GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com' \
GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com' \
GIT_AUTHOR_DATE='2024-01-02T00:00:00Z' GIT_COMMITTER_DATE='2024-01-02T00:00:00Z' \
git commit-tree "$tree" -m 'same snapshot'

The tree is the same both times. The message is the same both times. The commit IDs still differ, because the commit bytes include the timestamps and other metadata too.

Tag

Tag: Git label for another object, usually a commit; annotated tags are objects, while lightweight tags are plain refs.

Tags come in two flavors, and the difference matters. An annotated tag is a full object: it points to another object (usually a commit) and carries its own metadata, namely tagger, date, message, and optional signature. A lightweight tag is just a ref under refs/tags/ that points directly at a commit, with no object of its own.

So annotated tags show up in the object database alongside blobs, trees, and commits. Lightweight tags show up only as named pointers. Most everyday tagging (release markers, checkpoints) works with either; annotated tags matter when you need the extra metadata or a signature.
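The two flavors are easy to tell apart with plumbing. In this sketch the tag names light-v1 and annotated-v1 are placeholders; the point is what type of object each name resolves to.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -q -m 'first'

git tag light-v1                            # lightweight: a ref only
git tag -a annotated-v1 -m 'first release'  # annotated: a ref plus a tag object

# The lightweight tag resolves straight to a commit object.
light_type=$(git cat-file -t light-v1)
# The annotated tag resolves to a tag object that wraps the commit.
annotated_type=$(git cat-file -t annotated-v1)
printf 'light-v1 is a %s, annotated-v1 is a %s\n' "$light_type" "$annotated_type"

light_target=$(git rev-parse light-v1)
annotated_object=$(git rev-parse annotated-v1)
# Peeling with ^{commit} follows the tag object down to its commit.
annotated_target=$(git rev-parse 'annotated-v1^{commit}')

# The tag object carries its own tagger, date, and message.
git cat-file -p annotated-v1
```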

Object IDs and Object Format

Object ID: Content-derived identifier for a Git object, produced from the object's canonical bytes under the repository's hash format.

Each object has an object ID derived from its canonical bytes. Historically, Git repositories have used SHA-1 for this; Git also supports repositories created with SHA-256 object IDs.

An object ID names content, so if the underlying object bytes change, the object ID changes too. Git objects are effectively immutable: to change anything, Git writes a new object and moves a ref to point to it. That write pattern ends up being a kind of performance optimization, since immutable objects are easier to cache, easier to pack, and easier to share between related histories.
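To make "derived from its canonical bytes" concrete, here is a sketch that recomputes a blob ID by hand. It assumes the default SHA-1 object format and a sha1sum binary on the PATH (on some systems the equivalent tool is shasum); a blob's canonical bytes are a short header, "blob", a space, the size in bytes, and a NUL, followed by the content.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'hello\n' > hello.txt

# Size in bytes; tr strips the padding some wc implementations add.
size=$(wc -c < hello.txt | tr -d ' ')

# Hash the header plus the content, exactly as Git frames a blob.
manual_id=$( { printf 'blob %s\0' "$size"; cat hello.txt; } | sha1sum | cut -d' ' -f1 )

# Ask Git for the ID of the same bytes (assumes SHA-1 object format).
git_id=$(git hash-object --stdin < hello.txt)

printf 'manual: %s\n' "$manual_id"
printf 'git:    %s\n' "$git_id"
```

Change a single byte of hello.txt and both values change together, which is the sense in which an object ID names content.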

Refs Are Names for Important Object IDs

Object IDs are exact, but they are not convenient for humans, and refs solve that problem. A ref is a durable name that stores an object ID, usually the ID of a commit.

Common examples include:

refs/heads/main
refs/heads/feature-x
refs/tags/v1.2.0
refs/remotes/origin/main

Branches are refs, every tag is reached through a ref under refs/tags/ (an annotated tag adds an object behind it), and remote-tracking branches are refs as well. Much of Git's apparent complexity becomes easier to follow once you see that many commands are reading, updating, or comparing named pointers.

Branching is cheap because creating a branch does not copy history; it creates a new name that points at an existing commit. That is a lot of Git's magic, and it is easy to underappreciate now that it feels ordinary.
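A quick way to see that, sketched in a throwaway repository: count loose objects before and after creating a branch, then compare what the two branch names point at.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -q -m 'first'

# count-objects prints "<n> objects, <size> kilobytes"; keep the count.
objects_before=$(git count-objects | cut -d' ' -f1)
git branch feature-x
objects_after=$(git count-objects | cut -d' ' -f1)
printf 'objects before: %s, after: %s\n' "$objects_before" "$objects_after"

# Both names store the same commit ID; nothing was copied.
main_id=$(git rev-parse refs/heads/main)
feature_id=$(git rev-parse refs/heads/feature-x)
git for-each-ref --format='%(refname) %(objectname)' refs/heads
```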

HEAD Says Where You Are

HEAD is a special ref that usually points to another ref, such as refs/heads/main. It is literally just a text file at .git/HEAD that Git reads. In the normal attached state, new commits advance the branch ref, and HEAD continues to name that branch.

Detached HEAD makes more sense with this model. If HEAD points directly to a commit ID instead of a branch name, you are detached.

Since HEAD is no longer attached to a branch name, new commits are not anchored by any branch unless you create one.

You can watch that change happen in a throwaway repo:

tmp=$(mktemp -d)
cd "$tmp"

git init -b main
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -m 'first'

cat .git/HEAD
git symbolic-ref HEAD
git rev-parse HEAD

git switch --detach HEAD
cat .git/HEAD
git symbolic-ref -q HEAD || git rev-parse HEAD

At first, .git/HEAD contains ref: refs/heads/main. After git switch --detach HEAD, it contains the commit ID itself. That is the whole difference between attached and detached HEAD.

The Index Is the Next Snapshot

The index is usually introduced as "the staging area," which is correct but incomplete. It is also the data structure that records the next commit snapshot and helps Git avoid unnecessary filesystem work.

When you run git add, you are updating index entries, and when you run git commit, Git reads the index to decide what snapshot to write. Your working tree may contain additional unstaged edits, but they are not part of the commit unless they have been staged into the index.

This explains a few everyday surprises:

git commit does not automatically include every local edit.
git diff and git diff --cached answer different questions.
git reset can move refs, rewrite the index, update the working tree, or some combination of the three.

At scale, the index becomes even more important because it is part of Git's performance strategy, caching metadata about tracked paths so commands like git status do not have to rediscover everything from scratch on every run.
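The difference between staged and unstaged state can be sketched in a throwaway repository. The file names a.txt and b.txt are placeholders; both files are edited, but only one is staged before the commit.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name Example
git config user.email example@example.com
printf 'a1\n' > a.txt
printf 'b1\n' > b.txt
git add .
git commit -q -m 'base'

printf 'a2 edited\n' > a.txt
printf 'b2 edited\n' > b.txt
git add a.txt   # stage only a.txt

# --cached compares the index with HEAD: what the next commit will contain.
staged=$(git diff --cached --name-only)
# Plain diff compares the working tree with the index: what is not staged.
unstaged=$(git diff --name-only)
printf 'staged: %s, unstaged: %s\n' "$staged" "$unstaged"

git commit -q -m 'commit the staged change only'

# The commit contains a.txt; b.txt is still a local, unstaged edit.
committed=$(git diff --name-only HEAD~1 HEAD)
still_unstaged=$(git diff --name-only)
printf 'committed: %s, still unstaged: %s\n' "$committed" "$still_unstaged"
```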

The Working Tree Is a Checkout

Working Tree: Materialized on-disk checkout of tracked paths, plus any other files present there, distinct from the repository's stored objects and metadata.

Your working tree is the materialized copy of some tracked paths on disk plus whatever other files happen to exist there. It is not the repository's source of truth, because the object database, refs, index, and reflogs live in .git/ and define the repository proper.

That distinction clarifies many operations. You can remove and recreate a working tree from stored objects, you can have multiple working trees pointing at the same repository with git worktree, and you can even have a bare repository with no working tree at all. None of that is surprising once you see the working tree as a projection of repository state rather than the state itself.
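All three cases can be tried in a scratch directory. This is a sketch; the directory names repo, second, and bare.git are placeholders.

```shell
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -q -m 'first'

# Delete the checked-out file, then rematerialize it from stored state.
rm hello.txt
git restore hello.txt
restored=$(cat hello.txt)

# A second working tree backed by the same repository.
git worktree add --detach "$tmp/second"
second_copy=$(cat "$tmp/second/hello.txt")

# A bare clone has the objects and refs but no working tree at all.
git clone -q --bare "$tmp/repo" "$tmp/bare.git"
is_bare=$(git -C "$tmp/bare.git" rev-parse --is-bare-repository)

printf 'restored: %s, second tree: %s, bare: %s\n' "$restored" "$second_copy" "$is_bare"
```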

Reflogs Are Local Memory

Reflogs are one of the most useful Git features and one of the most underexplained. A reflog records how a ref moved over time on your local machine, which is why, if HEAD pointed to one commit yesterday and points somewhere else now, the reflog is often the easiest way to find the old location again.

That is why recovery so often starts with git reflog. A force-push, reset, rebase, or mistaken checkout may have moved names around, but the old commit objects often still exist for a while, and the reflog remembers where those names used to point.

Two practical details matter:

Reflogs are local metadata. They are generally not shared with remotes.
Reflogs expire. They are useful for recent recovery, but they are not permanent storage.

Later in the book, reflogs will show up both in recovery workflows and in the broader discussion of refs at scale.
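A minimal recovery sketch in a throwaway repository: a hard reset moves the branch away from a commit, and the HEAD reflog still records where HEAD was one step earlier.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name Example
git config user.email example@example.com
printf 'one\n' > notes.txt
git add notes.txt
git commit -q -m 'first'
printf 'one\ntwo\n' > notes.txt
git commit -q -am 'second'

lost=$(git rev-parse HEAD)

# Move the branch back; no branch names the second commit any more.
git reset -q --hard HEAD~1

# The reflog lists recent positions of HEAD, newest first.
git --no-pager reflog -3

# HEAD@{1} is where HEAD pointed before the most recent move.
recovered=$(git rev-parse 'HEAD@{1}')
git reset -q --hard "$recovered"
printf 'lost: %s\nrecovered: %s\n' "$lost" "$recovered"
```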

A Few Commands to Understand the Model

Print a few of these layers and the model gets less abstract:

git cat-file -t HEAD
git cat-file -p HEAD
git ls-tree HEAD
git ls-files --stage
git reflog -5

cat-file shows what HEAD actually names. ls-tree shows the root tree a commit points to. ls-files --stage shows the index entries that define the next snapshot. reflog shows recent local ref movement.

Putting the Pieces Together

Here is the compact version of the model:

Blobs store file bytes.
Trees store directory structure and names.
Commits point to trees and parents.
Tags can attach metadata to objects.
Refs give human-readable names to important object IDs.
HEAD tells Git what you currently have checked out.
The index records the next snapshot and caches path state.
The working tree is the on-disk checkout.
Reflogs remember how refs moved locally.

Once those pieces are in place, common commands become easier to read literally. git switch mostly moves HEAD and refreshes the checkout to match another commit. git commit reads the index, writes any needed objects, and advances a ref. git reset is really a family of commands, because it can target refs, the index, the working tree, or all three.

That is the habit worth building: when a command looks complicated, ask which objects it reads, which names it moves, and whether it touches the index or the working tree.
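As one worked instance of that habit, this sketch applies it to git reset in a throwaway repository. The file name f.txt is a placeholder; each mode is run from the same starting point so the layer it touches is visible.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name Example
git config user.email example@example.com
printf 'v1\n' > f.txt
git add f.txt
git commit -q -m 'first'
printf 'v2 longer\n' > f.txt
git commit -q -am 'second'
first=$(git rev-parse HEAD~1)
second=$(git rev-parse HEAD)

# --soft moves the branch ref only; the index still holds the v2 content.
git reset -q --soft "$first"
soft_staged=$(git diff --cached --name-only)
soft_unstaged=$(git diff --name-only)
git reset -q --hard "$second"   # back to the starting point

# --mixed moves the ref and resets the index; the working tree keeps v2.
git reset -q --mixed "$first"
mixed_staged=$(git diff --cached --name-only)
mixed_unstaged=$(git diff --name-only)
git reset -q --hard "$second"   # back to the starting point

# --hard moves the ref, resets the index, and rewrites the working tree.
git reset -q --hard "$first"
hard_content=$(cat f.txt)

printf 'soft:  staged=[%s] unstaged=[%s]\n' "$soft_staged" "$soft_unstaged"
printf 'mixed: staged=[%s] unstaged=[%s]\n' "$mixed_staged" "$mixed_unstaged"
printf 'hard:  f.txt=[%s]\n' "$hard_content"
```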

Why This Model Matters for Performance

The model maps directly to performance costs.

The working tree and index dominate commands like status.
Commits, trees, and graph metadata dominate history traversal.
Refs and reflogs matter for control flow, recovery, and scale.
Object IDs, packs, and auxiliary indexes matter for storage and transport.

When I look at a slow command, the first thing I ask is which of those layers it's exercising. Once you can answer that, you're already most of the way toward understanding why it is fast or slow.