Blobs, trees, commits, and tags are the model. But loose objects, packfiles, indexes, and deltas are the machinery Git actually uses to write, compress, and move that model around.
We will look at:
- why loose objects exist
- why repositories do not stay loose for long
- how packfiles and indexes work
- what delta compression is actually doing
- why none of that changes Git's logical history model
Loose Objects, Packs, and Why Repositories Stop Staying Loose
- Loose Object
- Individual object stored as its own compressed file in the object database.
When Git writes a new object, the simplest thing it can do is store that object by itself. That is the loose-object path, and it is one reason Git's write model is so direct. Hash the content, determine the object ID, compress it, and place it in the object store.
In a traditional SHA-1 repository, a loose object for an object ID such as ab1234... lives under .git/objects/ab/1234.... Git spreads loose objects across 256 directories using the first byte of the object name so one directory does not become unmanageably large.
Loose objects exist because they are easy to write safely and incrementally. When you create a commit, stage new content, or otherwise materialize new objects, Git does not need to rebuild a large archive first. It can write the new objects one by one.
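That write path is simple enough to sketch directly. Here is a minimal, illustrative version for a SHA-1 repository; `write_loose_blob` and the throwaway `objects_dir` are inventions for this sketch, not Git's own code:

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

def write_loose_blob(objects_dir: Path, data: bytes) -> str:
    # Canonical object form: "<type> <size>\0" followed by the content.
    store = b"blob %d\x00" % len(data) + data
    # The object ID is the hash of that canonical form (SHA-1 here).
    oid = hashlib.sha1(store).hexdigest()
    # Fan out across 256 directories keyed on the first byte of the name.
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # objects are immutable, so never overwrite
        path.write_bytes(zlib.compress(store))
    return oid

oid = write_loose_blob(Path(tempfile.mkdtemp()), b"hello\n")
print(oid)  # ce013625030ba8dba906f756967f9e9ca394464a, matching git hash-object
```

You can confirm the hash against `echo hello | git hash-object --stdin`, which reports the same object ID for the same content.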
Loose objects are excellent for incremental creation, and they are easy to model and understand. In a tiny repository, they would be all you need, and performance would never be a concern.
At scale, however, the loose layout breaks down, and real repositories need another approach.
Disk space is only part of the problem. A large loose-object store also drives directory traversal, file-open overhead, metadata churn, and poor locality. It turns one logical database into a huge number of tiny files, which is rarely the best layout for repeated reads, bulk transfer, or compact storage.
- Packfile
- File that stores many Git objects together in compressed form, usually with an accompanying index for random access.
A packfile is the format Git uses to store most repository data over time. Instead of keeping every object as a separate file, Git writes many objects into one .pack file and pairs it with an .idx file so objects can still be found quickly by object ID.
Git is separating its logical model from physical layout very cleanly here. This point is critical: a blob is still the same blob. A commit is still the same commit. But the repository no longer needs to keep each one in its own standalone file. The user never sees the packfiles any more than a user of a relational database sees B-tree pages.
A packfile is a purpose-built object container with per-object headers, optional deltified representations, and companion index structures designed for lookup, traversal, repacking, and transport. You can think of it as the underpinnings of a database rather than a generic zip archive for the repository.
A pack is an organized collection of Git objects. The .idx file beside it is not a table of filenames. It is a lookup structure mapping object names to pack offsets.
If you want a better comparison, think of a pack plus its index as an immutable bulk-written data file with a sidecar lookup structure. That is closer to an SSTable or another storage file produced by compaction than to a hand-made archive. The comparison is not perfect, but it gives you the right instincts: bulk write, immutable file, separate index, periodic rewrite into a better layout.
How Packfiles and Pack Indexes Work
A loose object includes the canonical object header: the type, a space, the size, a NUL byte, and then the object data. In a pack, Git stores type and size in the pack entry itself instead of repeating that loose-object prefix inside the compressed payload.
That means packed storage is more than "the loose object moved into a bigger file." The representation is optimized for the pack format. Even so, the object ID still refers to the same logical object, because Git reconstructs the canonical form when it needs to reason about object identity.
Git's pack-format documentation describes a very small fixed header at the start of every pack:
- the four-byte signature PACK
- a four-byte version number
- a four-byte object count
After that come the object entries, and the file ends with a trailer checksum for the whole pack. Git currently writes version 2 packs.
That is already a useful clue about what a pack is. It is a binary container format with its own header, entry encoding rules, and trailer.
Inside the pack, each object entry starts with a compact header that encodes the object type and the unpacked size. Only then do you get to the stored bytes for that object.
For an ordinary packed commit, tree, blob, or tag, those bytes are the object's packed representation. For a deltified object, the entry also has to encode how to find the base object before the delta instructions can be applied.
That detail matters because it explains why Git can answer questions like "what type of object is this?" or "where in the pack does it live?" without treating the pack as one giant opaque compression stream.
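The entry header scheme can be made concrete with a small decoder. This follows the pack-format varint layout (type in bits 6-4 of the first byte, size continued seven bits at a time); `decode_entry_header` is a name made up for this sketch:

```python
OBJ_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
             6: "ofs-delta", 7: "ref-delta"}

def decode_entry_header(buf: bytes) -> tuple[str, int, int]:
    """Return (type, uncompressed size, header length) for a pack entry."""
    byte = buf[0]
    obj_type = (byte >> 4) & 0x7      # bits 6-4 carry the object type
    size = byte & 0x0F                # bits 3-0 are the low size bits
    shift, i = 4, 1
    while byte & 0x80:                # high bit set: more size bytes follow
        byte = buf[i]
        size |= (byte & 0x7F) << shift
        shift += 7
        i += 1
    return OBJ_TYPES[obj_type], size, i

# A blob of unpacked size 1000 = 0b11'1110'1000: low 4 bits go in the
# first byte (with the continuation bit and type 3), the remaining
# 0b111110 = 62 goes in the second byte.
print(decode_entry_header(bytes([0xB8, 0x3E])))  # ('blob', 1000, 2)
```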
- Pack Index
- Companion file that maps object IDs to locations inside a packfile.
If a .pack file were all Git had, object lookup would be far less practical. Git would need to scan or partially scan the pack to find one object. The .idx file is what turns a dense object container into something Git can query efficiently.
The pack index keeps object names in sorted order and records where their packed representations live. That is how Git can answer "do I have object X?" without inflating the whole pack.
Version 2 pack indexes add more structure than a flat "hash to offset" list:
- a 256-entry fanout table keyed by the first byte of the object name
- the sorted object names themselves
- CRC32 values for packed data
- 32-bit offsets, plus a large-offset table for very large packs
The fanout table is the first demystifier. If an object ID starts with byte 9c, Git does not binary-search every object in the pack. It uses the fanout table to jump straight to the slice of sorted object names whose first byte is 9c, and then searches only that range.
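That fanout-then-binary-search behavior is easy to model with synthetic data. This is a toy lookup over an in-memory name table, not a real .idx parser; the names and fanout below are fabricated for illustration:

```python
import bisect

def fanout_range(fanout, first_byte):
    # fanout[b] = number of objects whose first name byte is <= b
    lo = fanout[first_byte - 1] if first_byte else 0
    return lo, fanout[first_byte]

def lookup(names, fanout, oid):
    lo, hi = fanout_range(fanout, int(oid[:2], 16))
    i = bisect.bisect_left(names, oid, lo, hi)   # search only that slice
    return i if i < hi and names[i] == oid else None

# Synthetic sorted name table plus its cumulative fanout.
names = sorted(["0a" + "0" * 38, "9c" + "1" * 38, "9c" + "2" * 38, "ff" * 20])
fanout = [sum(1 for n in names if int(n[:2], 16) <= b) for b in range(256)]
print(lookup(names, fanout, "9c" + "2" * 38))  # 2: found in the 0x9c slice
```

The binary search never touches names outside the 0x9c slice, which is exactly the work the fanout table saves.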
Modern repositories often also carry adjacent files such as .rev, which provide pack-offset order, and later chapters will get into multi-pack indexes and related metadata. The main idea for now is simpler: packs make storage compact, and indexes make compact storage usable.
You do not need to guess what shape the object store is in:
git gc
git count-objects -vH
ls .git/objects/pack
git verify-pack -v .git/objects/pack/pack-*.idx | sed -n '1,12p'
If you want to look at the pack header directly, a few lines of Python are enough:
from pathlib import Path
import struct

pack = next(Path(".git/objects/pack").glob("pack-*.pack"))
with pack.open("rb") as f:
    signature = f.read(4).decode()
    version, count = struct.unpack("!II", f.read(8))
print(
    {
        "pack": pack.name,
        "signature": signature,
        "version": version,
        "objects": count,
    }
)
git gc makes sure you actually have a pack to inspect. count-objects tells you how much is loose and how much is packed. Listing .git/objects/pack shows which sidecar files exist. The Python snippet makes the fixed pack header concrete. verify-pack is the best quick x-ray: in verbose mode its non-delta lines are object-name type size size-in-packfile offset-in-packfile, and delta lines add depth plus the base object name. On a tiny repository you may see only non-delta entries. That is fine. The format is still visible.
Delta Compression Is About Storage, Not History
- Delta Object
- Packed representation stored as a set of instructions relative to another base object.
A common mistake is to hear that Git uses deltas and conclude that it stores history as "changes from one version to the next." Git still thinks in full objects and full snapshots. Delta storage is an opaque internal detail, unrelated to the first-class, user-facing concept of a diff.
Delta compression is a packing technique used to store one object more cheaply relative to another object.
That difference matters because it explains a lot of common confusion:
- delta compression is not the same thing as commit history
- delta compression does not mean Git abandoned snapshots
- delta compression does not imply that one object is the semantic "previous version" of another
Inside a pack, an object may be stored as an undeltified object, an offset delta, or a reference delta. In either delta case, Git stores instructions for reconstructing the target object from a base object.
One reason packfiles feel mysterious (and Git compression generally confusing) is that two different ideas get collapsed into one sentence. Let's get clear on all this.
Git uses ordinary zlib-style compression on stored bytes. Loose objects do this too. The objects are simply stored compressed; there is nothing delta-specific about that.
Packs may also use delta encoding across objects. That is a separate optimization. Git chooses a base object, stores an instruction stream for reconstructing the target from that base, and then compresses that stored representation like other packed data. Again, this is not the same as a diff; it is just an internal compression mechanism. Diffs are generated at a different, higher level of logical abstraction.
A delta payload starts with the size of the base object and the size of the reconstructed object. After that comes a sequence of instructions.
Those instructions are simple:
- copy a byte range from the base object
- insert literal bytes directly into the output
That is much closer to a compact binary patch program than to a line-oriented diff. Pack deltas operate on bytes, not source-code lines.
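A simplified model of delta application makes the copy/insert idea concrete. Real pack deltas encode these instructions as compact binary opcodes; here they are plain tuples for readability, and `apply_delta` is a name invented for this sketch:

```python
def apply_delta(base: bytes, instructions) -> bytes:
    """Simplified model of delta application: copy ranges from the
    base, insert literal bytes. Not the on-disk opcode encoding."""
    out = bytearray()
    for op, *args in instructions:
        if op == "copy":
            offset, size = args
            out += base[offset:offset + size]   # copy a byte range from base
        elif op == "insert":
            (literal,) = args
            out += literal                      # insert literal bytes
        else:
            raise ValueError(f"unknown delta instruction: {op}")
    return bytes(out)

base = b"int main(void) { return 0; }\n"
target = apply_delta(base, [
    ("copy", 0, 17),                 # reuse "int main(void) { " from base
    ("insert", b"return 1; }\n"),    # then splice in the changed tail
])
print(target)  # b'int main(void) { return 1; }\n'
```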
Another easy mistake is to imagine that a packed blob for src/main.c must delta against the previous version of src/main.c. Sometimes something like that happens, but it is not the governing rule.
Git's pack heuristics are trying to save space while keeping access practical. That means delta relationships are chosen for storage reasons, usually among objects that look similar enough to compress well together. The fact that two objects are related in path history may help, but the pack format itself does not encode "this file evolved from that file" as a first-class storage fact.
This is the same kind of distinction we saw with rename detection. Users often want semantic history. The storage engine wants efficient representation. Those are related, but they are not the same layer.
Delta Encodings and Delta Chains
Git uses two delta encodings in packs.
OBJ_OFS_DELTA points to the base object by a negative relative offset within the same pack. OBJ_REF_DELTA names the base by object ID instead.
If the base object is in the same on-disk pack, offset deltas are usually the tighter representation. Reference deltas are more general. They can refer to a base object outside the current pack, which is exactly what makes thin packs possible during transfer.
This also explains a subtle layout fact: offset deltas point backward in the pack, so the base object has to appear earlier in the file than the delta that depends on it.
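The backward distance at the start of an ofs-delta entry uses a small variable-length encoding of its own. The decoder below follows the scheme described in Git's pack-format documentation; note the +1 folded into each continuation step, which keeps multi-byte encodings unambiguous. `decode_ofs_delta_distance` is a name invented here:

```python
def decode_ofs_delta_distance(buf: bytes) -> tuple[int, int]:
    """Decode the backward distance stored at the start of an
    ofs-delta entry. Returns (distance, bytes consumed)."""
    c = buf[0]
    value = c & 0x7F                  # low 7 bits of the first byte
    i = 1
    while c & 0x80:                   # high bit set: continuation follows
        c = buf[i]
        value = ((value + 1) << 7) | (c & 0x7F)
        i += 1
    return value, i

print(decode_ofs_delta_distance(bytes([0x80, 0x48])))  # (200, 2)
```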
Delta compression reduces storage, but it also creates reconstruction work. To read a delta-compressed object, Git has to locate the base object, reconstruct it if necessary, and apply delta instructions until it reaches the requested object.
Pack design is a balancing act. Very aggressive delta chaining can save space while making reads more expensive. Git's packing heuristics therefore have to balance compression ratio against access cost.
Not every object is a good delta candidate for the same reason. Large binary files and other poor matches may compress badly as deltas or become too expensive to reconstruct repeatedly. Git's packing behavior reflects that tradeoff rather than assuming "more deltas is always better."
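The read-cost side of that tradeoff can be modeled in a few lines. The store layout below is invented purely for illustration; the point is that every extra hop in a delta chain adds reconstruction work before the requested object exists:

```python
def reconstruct(store, name):
    """Walk a delta chain: resolve the base first, apply the patch.
    store maps names to ("full", bytes) or ("delta", base, patch_fn)."""
    entry = store[name]
    if entry[0] == "full":
        return entry[1], 0            # undeltified: no chain to walk
    _, base_name, patch = entry
    base, depth = reconstruct(store, base_name)
    return patch(base), depth + 1     # one more hop in the delta chain

store = {
    "v1": ("full", b"alpha"),
    "v2": ("delta", "v1", lambda b: b + b" beta"),
    "v3": ("delta", "v2", lambda b: b + b" gamma"),
}
print(reconstruct(store, "v3"))  # (b'alpha beta gamma', 2): two hops to read
```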
Packs on Disk and Over the Wire
Packfiles also matter beyond local disk format. The same general pack representation is how Git moves object data over the network.
That unifies several parts of Git that otherwise seem separate:
- local repository storage
- fetch and push object transfer
- bundle files and related transport containers
The pack format therefore sits at an unusually important boundary. It is both a local storage mechanism and a transfer mechanism.
This also explains why so much Git performance work eventually touches packs. Clone, fetch, push, repack, bitmap traversal, multi-pack indexing, and maintenance all meet here.
Over the wire, Git can use a thin pack whose deltas refer to base objects the receiver already has. That can make transfer smaller. On disk, though, Git wants packs to be self-contained enough to avoid unresolved dependencies.
That distinction is subtle but useful. It is another example of Git separating transfer efficiency from long-term local storage hygiene.
Commands such as git gc and git repack reorganize object storage. They consolidate loose objects, combine smaller packs, and often rewrite pack contents into a more efficient layout.
Git is not rewriting history there. It is rewriting the storage layout around the same logical objects.
This is much closer to compaction in a log-structured storage system than to in-place filesystem defragmentation. Git writes new immutable pack files, often with better delta choices and better grouping, and then removes old redundant files once the new layout is ready.
If a repository becomes faster after repacking, that usually means object locality, delta choices, pack count, or associated metadata improved. It does not mean the repository now "contains different history."
Physical Layout Is a Performance Layer
A healthy repository does not always want one giant pack and nothing else.
Sometimes that happens. Real repositories may also have several packs for good operational reasons: incremental fetches, separate promisor packs, maintenance strategy, kept packs, or geometric repacking approaches that avoid rewriting the entire universe every time.
So the right model is not "loose objects become one final monolithic pack forever." The right model is that Git manages an object store whose physical layout may include loose objects, one or more packs, and auxiliary metadata files that help Git navigate that layout efficiently.
You can inspect that layout directly:
ls -lh .git/objects/pack
git verify-pack -v .git/objects/pack/pack-*.idx | tail
The directory listing shows whether the repository really has one pack or several, plus sidecar files such as .idx, .bitmap, .rev, or .mtimes. The verify-pack tail gives you one more quick slice of per-pack layout and delta summary.
Large packfiles are often a sign that Git has done useful consolidation work.
They can mean:
- better compression
- fewer files to open and scan
- less loose-object clutter
- simpler object locality
That is one reason repacking helps so often.
But one enormous pack is not automatically the ideal steady state.
Very large packfiles can become awkward when:
- full repacks take too long
- one rewrite has to touch too much data
- backup and mirroring workflows dislike giant monolithic files
- maintenance would be healthier with geometric or incremental repacking instead
- one damaged or stale large file becomes a bigger operational event
So the practical goal is not "the fewest packfiles possible." The goal is a storage layout that gives good compression and good read behavior without making maintenance brutal.
That is why Git leans on:
- incremental repack
- geometric repacking strategies
- multi-pack-index
- bitmaps
- background maintenance
Small fragmented packs are usually bad. Moderately large consolidated packs are usually good. Giant all-in-one packs can become an operations problem even when they are storage-efficient.
If you look in .git/objects/pack, a modern repository may have more than the classic pair of .pack and .idx files. You may also see files such as:
- .rev for pack reverse indexes
- .mtimes for per-object mtimes associated with certain packs
- multi-pack-index for lookup across multiple packs
And in maintenance-heavy repositories, you may also encounter cruft packs for unreachable objects instead of falling back to a sea of loose unreachable files.
Those structures do not replace packfiles. They are supporting structures around them. Later chapters will take them apart in more detail.
Once this layer is visible, several practical facts become easier to understand:
- new objects appear quickly because loose-object writes are simple
- repositories shrink and speed up after maintenance because packs improve layout
- reading one object from a pack is different from scanning many loose files
- network transfer performance depends heavily on pack generation and reuse
- storage deltas do not tell you what happened in human history
Here the storage-engine side of Git is easiest to see.
Git's logical model is still snapshot-oriented. It explains commits, trees, and most user-facing behavior. Git's physical storage engine also works hard to avoid storing every object in the most naive full-copy form. That separation is one of Git's strongest design choices. It lets the history model stay stable while the storage layer keeps getting more sophisticated. (It may even let you change the storage layer into something else; as software development speeds up and repos get larger, that may soon be needed.)