PageIndex: Teaching AI to Actually Read Documents

Jai Kumar Meena | March 7, 2026 | 8 min read
Tags: PageIndex, Documents, PDF, RAG

The Document Problem

Most AI tools handle documents by chunking them into arbitrary 2000-character blocks with overlap. This destroys document structure: a chapter heading ends up in one chunk and most of its content in another. The AI can't navigate the document intelligently because it was fed a bag of text fragments.
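To make the failure concrete, here is a minimal sketch of fixed-size chunking. The function name `chunk_fixed` and the 200-character overlap are assumptions for illustration, not taken from any particular tool.

```python
def chunk_fixed(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# One section: a heading followed by a body longer than a single chunk.
doc = "# Chapter 2\n" + "Body of chapter two. " * 150
chunks = chunk_fixed(doc)

# Only the first chunk carries the heading. The second chunk is body text
# with no indication of which chapter it belongs to.
has_heading = ["# Chapter 2" in c for c in chunks]
```

Overlap only helps near a boundary. Once a section runs longer than one window, later chunks lose the heading that gave them context.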

CVC's Real Document Intelligence

CVC integrates the VectifyAI/PageIndex architecture for document understanding. Instead of arbitrary chunking, CVC builds a hierarchical tree that mirrors the document's actual structure.

PDF Pipeline:

  1. TOC Detection — finds the table of contents
  2. TOC Extraction — parses section structure
  3. TOC Transformation — normalizes into a tree
  4. Page Offset Mapping — maps sections to physical pages
  5. Tree Building — hierarchical node structure
  6. Recursive Splitting — splits oversized nodes
  7. LLM Summary Enrichment — AI-generated summaries per node
  8. Description Generation — document-level overview
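The tree-building and recursive-splitting steps can be sketched as follows. This is a simplified illustration, not PageIndex's actual code: `TocNode`, `split_oversized`, and the `max_pages` threshold are hypothetical names, and the split here cuts oversized leaves into even page ranges, where a real pipeline would more likely look for sub-section boundaries.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TocNode:
    title: str
    start_page: int  # inclusive physical page number
    end_page: int    # inclusive physical page number
    summary: str = ""  # would be filled in by the summary enrichment step
    children: list[TocNode] = field(default_factory=list)

    @property
    def page_count(self) -> int:
        return self.end_page - self.start_page + 1


def split_oversized(node: TocNode, max_pages: int = 10) -> None:
    """Recursively split leaf nodes that span more than max_pages pages."""
    for child in node.children:
        split_oversized(child, max_pages)
    if not node.children and node.page_count > max_pages:
        starts = range(node.start_page, node.end_page + 1, max_pages)
        for part, start in enumerate(starts, 1):
            end = min(start + max_pages - 1, node.end_page)
            node.children.append(TocNode(f"{node.title} (part {part})", start, end))


root = TocNode("Report", 1, 40, children=[
    TocNode("Introduction", 1, 5),
    TocNode("Methods", 6, 40),
])
split_oversized(root, max_pages=10)
methods_parts = [(c.start_page, c.end_page) for c in root.children[1].children]
```

Keeping every leaf under a page budget means each node's text fits comfortably in one summarization call.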

Markdown Pipeline: Header-based tree from #/##/### hierarchy.
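A header-based tree can be built with a stack of open sections. The sketch below is one plausible way to do it. The names `Section` and `build_tree` are hypothetical, and it handles only ATX-style `#` headers, skipping lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass, field

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


@dataclass
class Section:
    level: int
    title: str
    text: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def build_tree(markdown: str) -> Section:
    """Nest sections by header depth; body lines attach to the nearest header."""
    root = Section(level=0, title="(document)")
    stack = [root]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section at the same or deeper level.
            while stack[-1].level >= level:
                stack.pop()
            node = Section(level, match.group(2).strip())
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].text.append(line)
    return root


tree = build_tree("# A\nintro\n## B\nbody\n### C\n## D\n# E\n")
top_level = [s.title for s in tree.children]
```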

Plain Text Pipeline: LLM-generated pseudo-TOC for unstructured documents.

When the AI searches a document, it doesn't do brute-force matching. CVC reads node summaries at each tree level, selects the most relevant branches, and recurses to leaf nodes. Because only the summaries along the selected branches are read, the search stays precise and uses far fewer tokens than scanning the full text.
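The descent can be sketched as a small recursive function. This is an illustration under stated assumptions: `SummaryNode`, `search`, and the `beam` parameter are hypothetical names, and `overlap_score` is a simple word-overlap stand-in for the relevance judgment that an LLM would make from the node summaries.

```python
from dataclasses import dataclass, field


@dataclass
class SummaryNode:
    title: str
    summary: str
    children: list["SummaryNode"] = field(default_factory=list)


def overlap_score(query: str, node: SummaryNode) -> int:
    """Stand-in for an LLM relevance judgment: count shared words."""
    query_words = set(query.lower().split())
    node_words = set(f"{node.title} {node.summary}".lower().split())
    return len(query_words & node_words)


def search(node: SummaryNode, query: str, score=overlap_score, beam: int = 2) -> list[SummaryNode]:
    """Follow the best-scoring children at each level; return the leaves reached."""
    if not node.children:
        return [node]
    ranked = sorted(node.children, key=lambda c: score(query, c), reverse=True)
    picked = [c for c in ranked[:beam] if score(query, c) > 0]
    leaves: list[SummaryNode] = []
    for child in picked:
        leaves.extend(search(child, query, score, beam))
    return leaves


tree = SummaryNode("Report", "annual report", [
    SummaryNode("Finance", "revenue and costs", [
        SummaryNode("Revenue", "quarterly revenue figures"),
        SummaryNode("Costs", "operating costs breakdown"),
    ]),
    SummaryNode("People", "hiring and retention", [
        SummaryNode("Hiring", "new hires by team"),
    ]),
])
hits = [n.title for n in search(tree, "quarterly revenue")]
```

Only the summaries on the chosen path are scored, so the work grows with the depth of the tree and the `beam` width, not with the length of the document.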