PageIndex: Teaching AI to Actually Read Documents

The Document Problem

Most AI tools handle documents by splitting them into fixed-size blocks, for example 2000 characters with some overlap, regardless of where sections begin or end. This destroys document structure: a chapter heading can end up in one chunk and its content in another. The AI can't navigate the document intelligently because it was fed a bag of text fragments.
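To make the problem concrete, here is a minimal sketch of the fixed-size chunking described above. The function name `chunk_fixed` and the sample text are made up for illustration, and the small sizes are chosen only so the effect is visible in a short string.

```python
def chunk_fixed(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    # Boundaries depend only on character counts, never on headings.
    return [text[i:i + size] for i in range(0, len(text), step)]


# Illustrative sample document (not from any real file).
doc = (
    "Chapter 1: Setup\n"
    + "Install the tool. " * 5
    + "\nChapter 2: Usage\n"
    + "Run the tool. " * 5
)
chunks = chunk_fixed(doc, size=60, overlap=10)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

Printing the chunks shows that boundaries fall wherever the character count dictates, so nothing ties a heading to the text it introduces.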

CVC's Real Document Intelligence

CVC integrates the VectifyAI/PageIndex architecture for document understanding. Instead of arbitrary chunking, CVC builds a hierarchical tree that mirrors the document's actual structure.

PDF Pipeline:

TOC Detection — finds the table of contents
TOC Extraction — parses section structure
TOC Transformation — normalizes into a tree
Page Offset Mapping — maps sections to physical pages
Tree Building — hierarchical node structure
Recursive Splitting — splits oversized nodes
LLM Summary Enrichment — AI-generated summaries per node
Description Generation — document-level overview
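The page mapping, tree building, and splitting stages above can be sketched roughly as follows. This is a simplified illustration, not CVC's or PageIndex's actual code: the `Node` class, the `(level, title, start_page)` input format, and the fixed page-window splitting rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    start_page: int
    end_page: int = 0
    children: list["Node"] = field(default_factory=list)


def build_tree(toc: list[tuple[int, str, int]], total_pages: int) -> Node:
    """Nest flat (level, title, start_page) entries under a synthetic root."""
    root = Node("root", 0, 1, total_pages)
    stack = [root]
    for level, title, start in toc:
        node = Node(title, level, start)
        # Pop back up to the nearest ancestor with a shallower level.
        while len(stack) > 1 and stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    assign_end_pages(root)
    return root


def assign_end_pages(node: Node) -> None:
    """Assumption: a section ends just before its next sibling starts,
    and the last sibling ends where its parent ends."""
    for i, child in enumerate(node.children):
        if i + 1 < len(node.children):
            next_start = node.children[i + 1].start_page
            child.end_page = max(child.start_page, next_start - 1)
        else:
            child.end_page = node.end_page
        assign_end_pages(child)


def split_oversized(node: Node, max_pages: int) -> None:
    """Give leaves longer than max_pages child nodes of at most max_pages each."""
    for child in node.children:
        split_oversized(child, max_pages)
    length = node.end_page - node.start_page + 1
    if not node.children and length > max_pages:
        starts = range(node.start_page, node.end_page + 1, max_pages)
        for n, start in enumerate(starts, 1):
            end = min(start + max_pages - 1, node.end_page)
            part = Node(f"{node.title} (part {n})", node.level + 1, start, end)
            node.children.append(part)


# Illustrative TOC for a hypothetical 30-page PDF.
toc = [
    (1, "Introduction", 1),
    (1, "Methods", 5),
    (2, "Data", 5),
    (2, "Model", 9),
    (1, "Results", 20),
]
root = build_tree(toc, total_pages=30)
split_oversized(root, max_pages=6)
```

In this sketch the "Model" section spans pages 9 to 19, which exceeds the 6-page limit, so it receives two child nodes covering pages 9 to 14 and 15 to 19. A real pipeline would likely split on content boundaries rather than fixed page windows.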

Markdown Pipeline: Header-based tree from #/##/### hierarchy.
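A header-based tree of this kind might be built as in the sketch below. The function name `markdown_tree` and the dictionary layout are illustrative assumptions, not CVC's actual data model, and only `#`-style headings are handled.

```python
import re

# Matches "#" to "######" followed by whitespace and the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def markdown_tree(text: str) -> dict:
    """Build a nested tree from #, ##, ### ... headings."""
    root = {"title": "root", "level": 0, "body": [], "children": []}
    stack = [root]
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # a '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            level = len(match.group(1))
            node = {"title": match.group(2), "level": level, "body": [], "children": []}
            # Climb back to the nearest shallower heading (the root has level 0).
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Non-heading lines belong to the most recent heading.
            stack[-1]["body"].append(line)
    return root


sample = (
    "# Guide\nintro\n"
    "## Install\nsteps\n```\n# not a heading\n```\n"
    "## Usage\n### CLI\nrun it\n"
)
tree = markdown_tree(sample)
```

Each node keeps its own body text, so a heading and the content under it stay together instead of being separated by a chunk boundary.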

Plain Text Pipeline: LLM-generated pseudo-TOC for unstructured documents.

LLM-Guided Tree Search

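When the AI searches a document, it doesn't do brute-force matching over every fragment. CVC reads the node summaries at each tree level, selects the most relevant branches, and recurses until it reaches leaf nodes. Because only the summaries along the selected branches are read, rather than the full text, the search stays targeted and keeps token use low.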
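The traversal can be sketched as a recursive descent with a pluggable selection step. In the sketch below, `keyword_select` is a stand-in for the LLM call, ranking siblings by simple word overlap so the example runs offline. The `Node` class, the `Selector` type, and all function names are hypothetical and do not reflect CVC's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    title: str
    summary: str
    children: list["Node"] = field(default_factory=list)


# Hypothetical selector: given the query and sibling nodes,
# return the indices of the branches worth exploring.
Selector = Callable[[str, list[Node]], list[int]]


def tree_search(node: Node, query: str, select: Selector) -> list[Node]:
    """Descend only into the branches the selector picks; return leaves reached."""
    if not node.children:
        return [node]
    hits: list[Node] = []
    for i in select(query, node.children):
        hits.extend(tree_search(node.children[i], query, select))
    return hits


def keyword_select(query: str, nodes: list[Node], top_k: int = 1) -> list[int]:
    """Stand-in for the LLM: rank siblings by word overlap with their summaries."""
    words = set(query.lower().split())
    scores = [len(words & set(n.summary.lower().split())) for n in nodes]
    ranked = sorted(range(len(nodes)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > 0]


# Illustrative document tree with made-up summaries.
manual = Node("Manual", "installation and usage of the tool", [
    Node("Install", "installation steps for linux and macos", [
        Node("Linux", "apt packages for linux installation"),
        Node("macOS", "homebrew formula for macos installation"),
    ]),
    Node("Usage", "command line usage and flags"),
])
results = tree_search(manual, "linux installation", keyword_select)
print([n.title for n in results])
```

Only the summaries of the siblings at each visited level are inspected, so the "Usage" branch and the "macOS" leaf are never expanded. Swapping `keyword_select` for a function that prompts a model with the same summaries would keep the traversal logic unchanged.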