Under the command line, Git stores immutable objects, gives names to some of them with refs, keeps the next snapshot in the index, and treats the working tree as a checkout rather than the repository itself.
This chapter lays out the object store, refs, HEAD, the index, reflogs, and the working tree so the rest of the book has something solid to build on. It ideally stands on its own; you should be able to read this and walk away with a strong understanding of Git internals.
Some of your mental model of Git may change once you understand its internals better, and that pays off directly in performance work.
Git Stores Snapshots, Not Patch Files
One of the biggest surprises when you dive into Git internals: Git does not store each commit as "the previous commit plus a diff."
At the logical level, each commit names a full snapshot through a tree object. Diffs are something Git computes when you ask for them rather than the primary representation of history.
Several behaviors follow from that. A commit can be checked out without replaying a patch chain. A merge compares snapshots and ancestry rather than blindly applying text deltas. Path history queries are sometimes expensive because Git has to inspect which snapshots changed a path rather than reading off a neatly stored per-file timeline.
Git does use delta compression inside packfiles to save storage and network bytes, which is the closest conceptual thing to a "diff." But that is a physical encoding detail; the logical model is still full snapshots connected by parent links.
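A small sketch of that separation, run in a throwaway repository: the same blob ID returns the same bytes before and after Git repacks its storage. The file name and commit message are made up for illustration, and whether Git actually delta-compresses anything in a repository this small is up to Git.

```shell
# Throwaway repository; all names here are illustrative.
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -q -m 'first'

# Resolve the blob behind hello.txt and read it while it is a loose object.
blob_id=$(git rev-parse HEAD:hello.txt)
before=$(git cat-file -p "$blob_id")

# Repack: reachable loose objects move into a packfile.
git gc --quiet

# Read the same ID again, now served from the pack.
after=$(git cat-file -p "$blob_id")
git count-objects -v    # loose ("count") versus packed ("in-pack") totals

[ "$before" = "$after" ] && echo "blob $blob_id reads the same after packing"
```

The physical storage changed, but the object ID and the bytes it names did not.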
The Four Object Types
At the core of Git are four object types. The first three do most of the work; the fourth, the tag, matters less day to day but is convenient to have.
- Blob
- Tree
- Commit
- Tag
Everything else in day-to-day Git is built around these.
Blob
A blob is file content and nothing more. It has no filename, no path, and no executable bit, so if two files anywhere in the repository have identical bytes, they can point to the same blob object. That is one reason Git can reuse storage so effectively.
You can test that directly with a plumbing command:
printf 'hello\n' > first.txt
printf 'hello\n' > second.txt
git hash-object first.txt
git hash-object second.txt
Those two object IDs match because the blob does not include the filename. Change one byte in one file and run the command again; now the IDs diverge.
Tree
A tree is one directory snapshot. It maps names to object IDs and modes, with some entries pointing to blobs that represent files and others pointing to trees that represent subdirectories. Paths live here, not in blobs. Remember that a blob is just file content: if a blob could talk, it would not know where it is.
That is an important design choice, because it means the same blob can appear under different names or in different directories across history without changing the blob itself. Git does not store renames as first-class events in the object model. It infers them later when a command asks for that view, and the path-history chapter comes back to the cost of that.
You can test that with plumbing too:
blob=$(printf 'hello\n' | git hash-object -w --stdin)
tree_a=$(printf '100644 blob %s\talpha.txt\n' "$blob" | git mktree)
tree_b=$(printf '100644 blob %s\tbeta.txt\n' "$blob" | git mktree)
echo "$tree_a"
echo "$tree_b"
git cat-file -p "$tree_a"
git cat-file -p "$tree_b"
The blob is the same both times. The tree IDs differ because the name lives in the tree. Change the path, mode, or directory structure, and you get a different tree object even if the file bytes stay the same.
A Small Example Snapshot
Imagine a repository with three tracked files:
- README.md
- src/main.ts
- src/store.ts
Git does not store that snapshot as one big document. It stores something more like this:
- a blob for README.md
- a blob for src/main.ts
- a blob for src/store.ts
- a tree for src/, pointing to the two src blobs
- a root tree, pointing to README.md and the src/ tree
- a commit, pointing to the root tree
That is the payoff moment in Git's snapshot model. A commit names a full snapshot, but it does so by pointing at a tree structure that can reuse a great deal of existing data (like blobs and other trees). Without even looking at more advanced compression methods, Git's commits and branches can stay lightweight through reuse.
If you change only src/store.ts and commit, Git does not duplicate everything. It writes a new blob for the new file content, a new src/ tree because one entry changed there, a new root tree because the src/ subtree changed, and a new commit that points to the new root tree. But the README.md blob and the src/main.ts blob stay exactly as they were. For any real repository, this reuse pays off.
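You can check that reuse in a throwaway repository. This sketch builds the three-file layout, changes only src/store.ts, and compares object IDs across the two commits. The file contents and commit messages are placeholders.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
mkdir src
printf '# demo\n' > README.md
printf 'export {}\n' > src/main.ts
printf 'export const store = 1\n' > src/store.ts
git add .
git commit -q -m 'first'

# Change one file only, then commit again.
printf 'export const store = 2\n' > src/store.ts
git commit -q -am 'change store'

# <rev>:<path> resolves a path inside a commit's snapshot to an object ID.
readme_before=$(git rev-parse HEAD~1:README.md)
readme_after=$(git rev-parse HEAD:README.md)
main_before=$(git rev-parse HEAD~1:src/main.ts)
main_after=$(git rev-parse HEAD:src/main.ts)
src_before=$(git rev-parse HEAD~1:src)
src_after=$(git rev-parse HEAD:src)
root_before=$(git rev-parse 'HEAD~1^{tree}')
root_after=$(git rev-parse 'HEAD^{tree}')

echo "README.md blob:   $readme_before -> $readme_after"
echo "src/main.ts blob: $main_before -> $main_after"
echo "src/ tree:        $src_before -> $src_after"
echo "root tree:        $root_before -> $root_after"
```

The two blob lines show the same ID on both sides, while the src/ tree and the root tree each get a new one.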
"Git stores snapshots" can be true without meaning "Git copies the whole repository on every commit."
Why Paths Near the Root Matter More
Git's tree structure is recursive, so a change rewrites only the tree objects on the path from the changed entry back to the root. If you edit src/lib/math/vector.ts, Git writes a new blob and then a new tree for math/, lib/, src/, and the root. If you rename src/ itself, only the root tree's entry changes and the subtree object underneath is reused, but every path below it is now different, so the impact on the index, the working tree, and path-based history queries is broader.
For one local edit in a small repository, this usually does not matter much. It is still useful intuition. Tree updates are path-sensitive: sibling subtrees that a change does not pass through are reused as-is, while every tree between the change and the root is written again.
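A rough way to see which trees an edit touches is to compare tree IDs before and after. The directory layout below is invented for the demonstration; docs/ is there only to act as an untouched sibling.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
mkdir -p src/lib/math docs
printf 'v1\n' > src/lib/math/vector.ts
printf 'notes\n' > docs/notes.md
git add .
git commit -q -m 'first'

# Edit one deeply nested file.
printf 'v2\n' > src/lib/math/vector.ts
git commit -q -am 'edit vector'

# Each tree between the edited file and the root gets a new ID...
math_before=$(git rev-parse HEAD~1:src/lib/math)
math_after=$(git rev-parse HEAD:src/lib/math)
src_before=$(git rev-parse HEAD~1:src)
src_after=$(git rev-parse HEAD:src)
# ...while a sibling subtree keeps the ID it already had.
docs_before=$(git rev-parse HEAD~1:docs)
docs_after=$(git rev-parse HEAD:docs)
echo "src/lib/math: $math_before -> $math_after"
echo "src:          $src_before -> $src_after"
echo "docs:         $docs_before -> $docs_after"

# Renaming the top-level directory changes every path under it,
# but the subtree object is the same one under a new name.
git mv src source
git commit -q -m 'rename src'
old_name_tree=$(git rev-parse HEAD~1:src)
new_name_tree=$(git rev-parse HEAD:source)
echo "src -> source: $old_name_tree -> $new_name_tree"
```

The rename is cheap in objects, since only the root tree is rewritten, even though it changes the path of every file below it.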
Commit
A commit points to one root tree and to zero or more parent commits, while also storing author and committer metadata (including timestamps), and a message. The commit object ties a snapshot to history.
Even a small history edit produces new commit IDs: if you change the parent list, the message, the tree, or the metadata, you have created a different commit object. Two commits built around the same tree but with different timestamps still get different IDs, because the commit bytes include the metadata too.
You can test that directly with plumbing too. Build one blob, one tree, and then two commits around the same tree:
blob=$(printf 'hello\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\thello.txt\n' "$blob" | git mktree)
GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com' \
GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com' \
GIT_AUTHOR_DATE='2024-01-01T00:00:00Z' GIT_COMMITTER_DATE='2024-01-01T00:00:00Z' \
git commit-tree "$tree" -m 'same snapshot'
GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com' \
GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com' \
GIT_AUTHOR_DATE='2024-01-02T00:00:00Z' GIT_COMMITTER_DATE='2024-01-02T00:00:00Z' \
git commit-tree "$tree" -m 'same snapshot'
The tree is the same both times. The message is the same both times. The commit IDs still differ, because the commit bytes include the timestamps and other metadata too.
Tag
- Tag
- Git label for another object, usually a commit; annotated tags are objects, while lightweight tags are plain refs.
Annotated tags are objects too, although they are less important than the "Big Three" objects we've discussed already.
An annotated tag points to another object, usually a commit, and attaches metadata such as a tagger, date, message, and optional signature. A lightweight tag is just a ref that points straight at the target object.
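The difference shows up when you ask Git what kind of object each tag name resolves to. The tag names in this sketch are made up.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -q -m 'first'

git tag light-example                             # lightweight: a ref only
git tag -a annotated-example -m 'release notes'   # annotated: writes a tag object

# A lightweight tag resolves straight to the commit.
light_type=$(git cat-file -t light-example)
# An annotated tag resolves to a tag object that in turn points at the commit.
annotated_type=$(git cat-file -t annotated-example)
echo "light-example     -> $light_type"
echo "annotated-example -> $annotated_type"

# Peeling the annotated tag reaches the same commit as HEAD.
peeled=$(git rev-parse 'annotated-example^{commit}')
head_id=$(git rev-parse HEAD)
git cat-file -p annotated-example   # object, type, tag, tagger, message
```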
Object IDs and Object Format
- Object ID
- Content-derived identifier for a Git object, produced from the object's canonical bytes under the repository's hash format.
Each object has an object ID derived from its canonical bytes. Historically, Git repositories have used SHA-1 for this, while modern Git also supports repositories created with SHA-256 object IDs.
What matters here is what the hash means. An object ID names content, so if the underlying object bytes change, the object ID changes too. Git objects are effectively immutable: Git does not edit a commit in place or tweak a tree object, but writes a new object and then moves a ref to point to it.
That write pattern is one of the reasons Git can be both robust and fast, since immutable objects are easier to cache, easier to pack, and easier to share between related histories.
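To make "derived from its canonical bytes" concrete, here is a sketch that recomputes a blob ID by hand. It assumes a SHA-1 repository, which the example requests explicitly, and a sha1sum binary such as the one in GNU coreutils. For a blob, the canonical bytes are the header "blob", a space, the content length in bytes, a NUL byte, and then the content.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q --object-format=sha1

# Ask Git for the blob ID of the six bytes "hello\n".
git_id=$(printf 'hello\n' | git hash-object --stdin)

# Hash the same canonical bytes without Git: header, NUL, then content.
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

echo "git hash-object: $git_id"
echo "manual sha1sum:  $manual_id"
```

The two values match. Change one byte of the content, and update the length if it changes, and both change together.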
Refs Are Names for Important Object IDs
Object IDs are exact, but they are not convenient for humans, and refs solve that problem. A ref is a durable name that stores an object ID, usually the ID of a commit.
Common examples include:
- refs/heads/main
- refs/heads/feature-x
- refs/tags/v1.2.0
- refs/remotes/origin/main
Branches are refs, most tags are refs, and remote-tracking branches are refs as well. Much of Git's apparent complexity becomes easier to follow once you see that many commands are reading, updating, or comparing named pointers.
Branching is cheap because creating a branch does not copy history; it creates a new name that points at an existing commit. Much of what makes Git feel fast comes from this, and because it now feels ordinary it is easy to underappreciate.
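A quick check of that claim in a throwaway repository: create a branch, then confirm that it names the same commit and that no new objects were written. The branch name is illustrative.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -q -m 'first'

count_before=$(git count-objects)   # summary of loose objects on disk
git branch feature-x                # a new name, nothing else
count_after=$(git count-objects)

head_id=$(git rev-parse HEAD)
branch_id=$(git rev-parse refs/heads/feature-x)

# Every ref with the object ID it stores.
git for-each-ref --format='%(refname) %(objectname)'
echo "objects before: $count_before"
echo "objects after:  $count_after"
```

Both refs store the same commit ID, and the object count is unchanged.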
HEAD Says Where You Are
HEAD is a special ref that usually points to another ref, such as refs/heads/main. With Git's default files-based ref storage, it is a small text file at .git/HEAD that Git reads. In the normal attached state, new commits advance the branch ref, and HEAD continues to name that branch.
Detached HEAD makes more sense with this model. If HEAD holds a commit ID directly instead of naming a branch, you are detached.
New commits made in that state still move HEAD forward, but no branch ref moves with them, so they stay reachable only through HEAD and its reflog until you create a branch or tag that points at them.
You can watch that change happen in a throwaway repo:
tmp=$(mktemp -d)
cd "$tmp"
git init -b main
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -m 'first'
cat .git/HEAD
git symbolic-ref HEAD
git rev-parse HEAD
git switch --detach HEAD
cat .git/HEAD
git symbolic-ref -q HEAD || git rev-parse HEAD
At first, .git/HEAD contains ref: refs/heads/main. After git switch --detach HEAD, it contains the commit ID itself. That is the whole difference between attached and detached HEAD.
The Index Is the Next Snapshot
The index is usually introduced as "the staging area," which is correct but incomplete. It is also the data structure that records the next commit snapshot and helps Git avoid unnecessary filesystem work.
When you run git add, you are updating index entries, and when you run git commit, Git reads the index to decide what snapshot to write. Your working tree may contain additional unstaged edits, but they are not part of the commit unless they have been staged into the index.
This explains a few everyday surprises:
- git commit does not automatically include every local edit.
- git diff and git diff --cached answer different questions.
- git reset can move refs, rewrite the index, update the working tree, or some combination of the three.
At scale, the index becomes even more important because it is part of Git's performance strategy, caching metadata about tracked paths so commands like git status do not have to rediscover everything from scratch on every run.
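The gap between the index and the working tree is easy to reproduce. This sketch stages one version of a file, edits it again without staging, and commits. The file contents are placeholders chosen only to be distinguishable.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
printf 'v1\n' > hello.txt
git add hello.txt
git commit -q -m 'first'

printf 'staged\n' > hello.txt
git add hello.txt                  # the index now holds "staged"
printf 'unstaged\n' > hello.txt    # the working tree moves on without the index

staged=$(git diff --cached --name-only)   # index versus HEAD
unstaged=$(git diff --name-only)          # working tree versus index
git ls-files --stage                      # mode, blob ID, stage number, path

git commit -q -m 'second'
committed=$(git cat-file -p HEAD:hello.txt)
on_disk=$(cat hello.txt)
echo "committed: $committed"
echo "on disk:   $on_disk"
```

The commit contains the staged content, and the later edit is still sitting in the working tree as an unstaged change.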
The Working Tree Is a Checkout, Not the Repository
- Working Tree
- Materialized on-disk checkout of tracked paths, plus any other files present there, distinct from the repository's stored objects and metadata.
Your working tree is the materialized copy of some tracked paths on disk plus whatever other files happen to exist there. It is important, but it is not the repository's source of truth, because the object database, refs, index, and reflogs live in .git/ and define the repository proper.
That distinction clarifies many operations. You can remove and recreate a working tree from stored objects, you can have multiple working trees pointing at the same repository with git worktree, and you can even have a bare repository with no working tree at all. None of that is surprising once the working tree is a projection of repository state rather than the state itself.
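Two of those operations fit in a short sketch: restoring a deleted file from repository state, and reading the same content out of a bare clone that has no checkout at all. The paths are throwaway temporary directories.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt
git commit -q -m 'first'

# Delete the checked-out file, then rebuild it from the index.
rm hello.txt
git restore hello.txt
restored=$(cat hello.txt)

# A bare clone carries objects and refs but no working tree.
bare="$(mktemp -d)/repo.git"
git clone -q --bare "$tmp" "$bare"
is_bare=$(git -C "$bare" rev-parse --is-bare-repository)
bare_content=$(git -C "$bare" cat-file -p HEAD:hello.txt)

echo "restored file:       $restored"
echo "bare repository:     $is_bare"
echo "content in bare repo: $bare_content"
```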
Reflogs Are Local Memory
Reflogs are one of the most useful Git features and one of the most underexplained. A reflog records how a ref moved over time on your local machine, which is why, if HEAD pointed to one commit yesterday and points somewhere else now, the reflog is often the easiest way to find the old location again.
That is why recovery so often starts with git reflog. A force-push, reset, rebase, or mistaken checkout may have moved names around, but the old commit objects often still exist for a while, and the reflog remembers where those names used to point.
Two practical details matter:
- Reflogs are local metadata. They are generally not shared with remotes.
- Reflogs expire. They are extremely useful, but they are not permanent archival storage.
Later in the book, reflogs will show up both in recovery workflows and in the broader discussion of refs at scale.
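As a minimal recovery sketch, the following throws away a commit with a hard reset and then finds it again through the reflog. It relies on reflogs being enabled, which is the default for a repository with a working tree.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
printf 'one\n' > hello.txt
git add hello.txt
git commit -q -m 'first'
printf 'second version\n' > hello.txt
git commit -q -am 'second'

lost=$(git rev-parse HEAD)     # the commit about to be "lost"
git reset -q --hard HEAD~1     # main and HEAD move back one commit

git reflog -3                  # recent movements of HEAD

# HEAD@{1} means "where HEAD pointed one movement ago".
recovered=$(git rev-parse 'HEAD@{1}')
git reset -q --hard "$recovered"
now=$(git rev-parse HEAD)
echo "lost:      $lost"
echo "recovered: $recovered"
```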
A Few Commands Make the Model Visible
The model gets less abstract once you ask Git to print a few of these layers directly:
git cat-file -t HEAD
git cat-file -p HEAD
git ls-tree HEAD
git ls-files --stage
git reflog -5
cat-file shows what HEAD actually names. ls-tree shows the root tree a commit points to. ls-files --stage shows the index entries that define the next snapshot. reflog shows recent local ref movement.
Putting the Pieces Together
Here is the compact version of the model:
- Blobs store file bytes.
- Trees store directory structure and names.
- Commits point to trees and parents.
- Tags can attach metadata to objects.
- Refs give human-readable names to important object IDs.
- HEAD tells Git what you currently have checked out.
- The index records the next snapshot and caches path state.
- The working tree is the on-disk checkout.
- Reflogs remember how refs moved locally.
Once those pieces are in place, common commands become easier to read literally. git switch mostly moves HEAD and refreshes the checkout to match another commit. git commit reads the index, writes any needed objects, and advances a ref. git reset is really a family of commands because it can target refs, the index, the working tree, or all three.
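Reading git commit literally, you can approximate it with three plumbing steps. This is a sketch of the idea rather than what git commit does internally: the real command also runs hooks, records a reflog message, and handles many edge cases.

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.name Example
git config user.email example@example.com
printf 'hello\n' > hello.txt
git add hello.txt                    # update the index

# 1. Write tree objects for the snapshot recorded in the index.
tree=$(git write-tree)
# 2. Write a commit object that points at that tree (no parents yet).
commit=$(git commit-tree "$tree" -m 'first, by hand')
# 3. Advance the branch ref that HEAD names.
git update-ref refs/heads/main "$commit"

head_id=$(git rev-parse HEAD)
head_tree=$(git rev-parse 'HEAD^{tree}')
git log --oneline
git status --short                   # prints nothing: index and checkout match HEAD
```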
That is the habit this book will keep building: when a command looks complicated, ask which objects it reads, which names it moves, and whether it touches the index or the working tree.
Why This Model Matters for Performance
The model in this chapter also maps directly to performance costs.
- The working tree and index dominate commands like status.
- Commits, trees, and graph metadata dominate history traversal.
- Refs and reflogs matter for control flow, recovery, and scale.
- Object IDs, packs, and auxiliary indexes matter for storage and transport.
If you can tell which layer a command is exercising, you are already most of the way toward understanding why it is fast or slow.