You drag two PDFs into a “merge PDF” tool and it spits out a combined file in under a second. Easy, right? So why does every PDF library on GitHub have a thousand open issues, why does PDF.js weigh in at over a megabyte minified, and why do “simple” merge tools occasionally produce files that open in Acrobat but break in Preview?

The answer lives inside the file format itself. PDF looks like a document on screen, but on disk it’s closer to a tiny object-oriented database with its own indexing system. Once you understand how PDF objects, object streams, and the cross-reference (xref) table fit together, the difficulty of “just merging two files” stops being mysterious and starts being obvious.

This post walks through what’s actually inside a PDF, why the format is structured the way it is, and why every operation more interesting than “open and view” requires a real parser.

What Is a PDF File, Really?

A PDF is a structured collection of objects—numbered, typed pieces of data—plus an index that tells a reader where to find each object inside the file. It is not a stream of pages rendered top to bottom. The byte order on disk has almost nothing to do with the page order on screen.

At the highest level, a classic PDF contains four parts, in this order:

  1. Header — a single line like %PDF-1.7 declaring the version.
  2. Body — the actual objects: pages, fonts, images, content streams.
  3. Cross-reference table (xref) — a lookup index mapping object numbers to byte offsets.
  4. Trailer — metadata pointing the reader at the xref and the document root.

Readers parse the file backwards. They jump to the end, read the trailer, follow it to the xref, and only then start fetching objects by number. This is why a corrupted last 100 bytes can render an otherwise pristine PDF unreadable.
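As a rough illustration of that backwards read, here is a minimal Python sketch. It relies on the fact that a PDF ends with the keyword startxref, the byte offset of the last cross-reference section, and an %%EOF marker. The function name find_startxref, the 1024-byte tail window, and the hand-built sample bytes are assumptions made for the example, not part of any library.

```python
import re

def find_startxref(data: bytes) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    tail = data[-1024:]  # assumption: the pointer sits in the last 1 KB
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref pointer found near end of file")
    return int(matches[-1])

# A hand-built stand-in for a real file, just enough to exercise the function.
sample = (
    b"%PDF-1.7\n"
    b"...objects...\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n23\n%%EOF\n"
)

offset = find_startxref(sample)
print(offset, sample[offset:offset + 4])  # 23 b'xref'
```

If the tail is damaged, this lookup fails before a single object has been read, which is the failure mode described above.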

Objects: The Building Blocks

Everything in a PDF is an object. There are eight types defined by the spec:

Type        Example                          Purpose
Boolean     true                             Flags
Number      42 or 3.14                       Coordinates, sizes
String      (Hello) or <48656C6C6F>          Text, metadata
Name        /Type                            Keys, identifiers
Array       [1 2 3]                          Ordered lists
Dictionary  << /Key /Value >>                Key-value maps
Stream      << ... >> stream ... endstream   Binary blobs (compressed)
Null        null                             Absence of value

Objects get a unique identifier—an object number and a generation number—and live in the body wrapped in obj/endobj markers:

12 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] >>
endobj

The 3 0 R is an indirect reference—“go look up object 3, generation 0.” PDFs are built almost entirely out of these references. A page references its parent, its content stream, its fonts, and its images. The graph can be deep, and cycles are legal.
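Because cycles are legal, any code that walks the object graph has to remember what it has already visited. The sketch below models that with a plain dictionary mapping an object number to the object numbers it references. The graph contents and the name reachable are made up for illustration.

```python
def reachable(refs: dict[int, list[int]], root: int) -> set[int]:
    """Collect every object number reachable from root, tolerating cycles."""
    seen: set[int] = set()
    stack = [root]
    while stack:
        num = stack.pop()
        if num in seen:
            continue  # already visited: this is what breaks the cycle
        seen.add(num)
        stack.extend(refs.get(num, []))
    return seen

# Hypothetical graph: page 12 points at its parent 3, and 3 points back at 12.
graph = {3: [12], 12: [3, 20, 21], 20: [], 21: []}
print(sorted(reachable(graph, 12)))  # [3, 12, 20, 21]
```

Without the seen set, the page-to-parent-to-page loop would never terminate.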

The xref Table: PDF’s Index

Objects can appear in any order in the body. To find object 12 without scanning the whole file, the reader consults the xref table, which lists the byte offset of every object:

xref
0 14
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
...

Each entry is exactly 20 bytes: a 10-digit offset, a space, a 5-digit generation number, a space, a flag (n for in-use, f for free), and a two-byte line ending. This fixed-width layout lets readers seek directly to the entry for object N without parsing the entries before it.
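The fixed width is what makes the lookup a single multiplication. Here is a small Python sketch of that arithmetic, using the three entries shown above (with a space plus newline as the two-byte line ending). The helper names parse_xref_entry and entry_for are illustrative only.

```python
def parse_xref_entry(entry: bytes) -> tuple[int, int, bool]:
    """Parse one 20-byte classic xref entry into (offset, generation, in_use)."""
    if len(entry) != 20:
        raise ValueError("classic xref entries are exactly 20 bytes")
    offset = int(entry[0:10])
    generation = int(entry[11:16])
    flag = entry[17:18]
    if flag not in (b"n", b"f"):
        raise ValueError(f"unexpected flag {flag!r}")
    return offset, generation, flag == b"n"

def entry_for(section: bytes, first_obj: int, obj_num: int) -> tuple[int, int, bool]:
    """Jump straight to the entry for obj_num; no scanning of earlier entries."""
    start = (obj_num - first_obj) * 20
    return parse_xref_entry(section[start:start + 20])

entries = (
    b"0000000000 65535 f \n"
    b"0000000015 00000 n \n"
    b"0000000074 00000 n \n"
)
print(entry_for(entries, 0, 2))  # (74, 0, True)
```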

When you save a PDF after editing, most writers don’t rewrite the whole file. They append new objects to the end, append a new xref section listing only the changed objects, and add a new trailer that points at the new xref. This is called an incremental update, and it’s why earlier content is sometimes recoverable from a document that was “redacted” by editing and re-saving.
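One way to see incremental updates in practice is to look for more than one %%EOF marker, since each appended revision ends with its own. The sketch below is a heuristic under that assumption: it would be fooled by a stream that happens to contain the same bytes, and the sample data is a placeholder rather than a valid PDF. The name revisions is hypothetical.

```python
def revisions(data: bytes) -> list[bytes]:
    """Return the byte prefixes of data that end at each %%EOF marker."""
    marker = b"%%EOF"
    prefixes = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        prefixes.append(data[:end])  # everything up to and including this marker
        pos = data.find(marker, end)
    return prefixes

# Placeholder bytes standing in for an original save plus one appended update.
original = b"%PDF-1.7\n1 0 obj\n(secret)\nendobj\nstartxref\n9\n%%EOF\n"
update = b"1 0 obj\n(redacted)\nendobj\nstartxref\n52\n%%EOF\n"

history = revisions(original + update)
print(len(history))               # 2
print(b"(secret)" in history[0])  # True
```

Truncating the file at the first marker gives back the earlier revision, old object and all.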

Object Streams: Compression for the Index

Here’s where modern PDFs get clever. The classic xref-plus-loose-objects layout works, but it wastes space: every object wrapper (12 0 obj ... endobj) is overhead, and the xref itself can be huge for a long document.

PDF 1.5 introduced object streams. An object stream is a single compressed stream that contains many small objects packed together. Instead of:

12 0 obj << /Type /Page ... >> endobj
13 0 obj << /Type /Font ... >> endobj
14 0 obj << /Type /FontDescriptor ... >> endobj

you get one stream object containing all three, FlateDecode-compressed. (Only non-stream objects can be packed this way, which is why the example uses a font descriptor rather than an image.) The xref then needs a companion structure, the cross-reference stream, which not only maps object numbers to byte offsets but also points inside object streams: “object 13 lives at index 1 within object stream 50.”

The payoff is significant. A typical document drops 20–40% in file size purely from packing dictionaries together and letting deflate find redundancy across them. The cost: you can no longer read a single object without decompressing the surrounding stream.
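To make the “decompress the surrounding stream” cost concrete, here is a minimal Python sketch of unpacking an object stream. It assumes the layout from the PDF specification: the decoded data starts with N pairs of integers (object number, offset relative to the first object), and the stream dictionary’s /N and /First entries give the pair count and where the first object begins. The function name and the two-object payload are invented for the example.

```python
import zlib

def unpack_object_stream(compressed: bytes, n: int, first: int) -> dict[int, bytes]:
    """Inflate an object stream and slice out each packed object by its offset."""
    data = zlib.decompress(compressed)  # FlateDecode is zlib/deflate
    nums = [int(tok) for tok in data[:first].split()]
    pairs = list(zip(nums[0::2], nums[1::2]))[:n]  # (object number, relative offset)
    objects = {}
    for i, (obj_num, rel) in enumerate(pairs):
        start = first + rel
        end = first + pairs[i + 1][1] if i + 1 < len(pairs) else len(data)
        objects[obj_num] = data[start:end].strip()
    return objects

# Hypothetical payload: objects 12 and 13, with no obj/endobj wrappers inside.
header = b"12 0 13 18\n"
body = b"<< /Type /Page >>\n<< /Type /Font >>"
payload = zlib.compress(header + body)

print(unpack_object_stream(payload, n=2, first=len(header)))
# {12: b'<< /Type /Page >>', 13: b'<< /Type /Font >>'}
```

Even to fetch only object 13, the whole payload has to be inflated first.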

Content Streams: What’s on the Page

Text and graphics aren’t objects in the structural sense. They live inside content streams attached to page objects. A content stream is a tiny stack-based drawing language:

BT
  /F1 12 Tf
  72 720 Td
  (Hello, world.) Tj
ET

This says: begin text, select font F1 at 12pt, move to (72, 720), show the string Hello, world., end text. To extract text from a PDF you must parse this language, resolve every font reference, and handle Unicode mapping (the bytes in the string don’t have to be Unicode; fonts can ship with arbitrary glyph encodings).
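A first step toward reading that language is grouping operands with the operator that follows them. The Python sketch below does only that, and it is deliberately incomplete: it ignores nested parentheses in strings, hex strings, arrays, inline images, and the quote operators. The names TOKEN and operations are illustrative.

```python
import re

TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"        # literal string, no nested parentheses
    rb"|/[^\s/\[\]()<>]*"           # name such as /F1
    rb"|[-+]?(?:\d+\.?\d*|\.\d+)"   # integer or real number
    rb"|[A-Za-z][A-Za-z0-9*]*"      # operator such as BT, Tf, T*
)

def operations(stream: bytes):
    """Yield (operator, operands) pairs; operands precede their operator."""
    operands = []
    for match in TOKEN.finditer(stream):
        tok = match.group()
        if tok[:1].isalpha():  # an operator consumes the operands collected so far
            yield tok.decode("latin-1"), operands
            operands = []
        else:
            operands.append(tok.decode("latin-1"))

sample = b"BT\n  /F1 12 Tf\n  72 720 Td\n  (Hello, world.) Tj\nET\n"
for op, args in operations(sample):
    print(op, args)
# BT []
# Tf ['/F1', '12']
# Td ['72', '720']
# Tj ['(Hello, world.)']
# ET []
```

Note that the Tj operand is still raw bytes in the font’s encoding; turning it into readable text needs the font lookup described above.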

This is why “copy text from PDF” is unreliable on documents produced by certain CAD tools or older scanners: the bytes in the content stream might be \x01\x02\x03 with a font that maps those to letter shapes but not to Unicode codepoints.

Why Merging Is Hard

Now the picture is complete. To merge two PDFs you cannot concatenate the files. You must:

  • Renumber every object in the second PDF so its object numbers don’t collide with the first.
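  • Rewrite every indirect reference inside every dictionary and array of the second PDF to use the new numbers. (Content streams refer to fonts and images by name through the page’s resource dictionary, so they contain no object numbers to rewrite.)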
  • Decompress and recompose object streams since their internal index includes object numbers.
  • Merge the page tree so both documents’ pages live under one root.
  • Deduplicate shared resources (fonts, color profiles) or accept the bloat.
  • Build a fresh xref covering the combined object set, and write a new trailer.

Miss any step and you can end up with a file that opens in Acrobat (which is forgiving), shows blank pages in Preview (which is not), and fails unpredictably in PDF.js (which sits somewhere in between).
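The renumbering and reference-rewriting steps from the list above can be sketched in a few lines of Python. This version works on raw bytes with a regular expression purely to show the idea. A real merger would rewrite references on the parsed object graph, because a pattern like this would also alter look-alike bytes inside strings or binary streams. The name renumber and the offset of 14 are assumptions for the example.

```python
import re

# Matches both object headers ("1 0 obj") and indirect references ("2 0 R").
REF = re.compile(rb"\b(\d+) (\d+) (R|obj)\b")

def renumber(body: bytes, offset: int) -> bytes:
    """Shift every object number in body by offset so it cannot collide."""
    def shift(match: re.Match) -> bytes:
        number = int(match.group(1)) + offset
        return b"%d %b %b" % (number, match.group(2), match.group(3))
    return REF.sub(shift, body)

second = b"1 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>\nendobj\n"

# Suppose the first document already uses object numbers 0 through 13.
print(renumber(second, 14).decode())
# 15 0 obj
# << /Type /Page /Parent 16 0 R /MediaBox [0 0 612 792] >>
# endobj
```

The MediaBox numbers are left alone because they are not followed by R or obj, which is the only thing distinguishing a reference from ordinary numbers.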

This is also why client-side merging matters for privacy: doing it correctly requires a real PDF engine running somewhere. Either you ship that engine to the browser (what Kestrel does) or you upload the files to a server that runs it. There is no third option where the bytes “just combine.”

Try It Yourself

If you want to see this structure with your own eyes, open any PDF in a hex editor and search for xref. In a file with a classic table, you’ll see object definitions above it and, below the entries, the trailer pointing back to the document root. In a file that uses cross-reference streams, the only match will be the startxref pointer near the end. For an interactive view, our PDF tools at Kestrel Tools run entirely in the browser: your files never leave your machine, and the merge happens with a real parser doing all the work above.

The format is more than thirty years old, byzantine, and remarkably durable. Once you see the object graph behind the page, every quirk (the slow merges, the broken text extraction, the mysteriously bloated files) starts to make sense.