PageIndex: Teaching AI to Actually Read Documents
The Document Problem
Most AI tools handle documents by chunking them into arbitrary 2000-character blocks with overlap. This destroys document structure: a chapter heading can end up in one chunk and its content in another. The AI can't navigate the document intelligently because it receives a bag of text fragments rather than a structured document.
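To make the problem concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_fixed` and the tiny sample document are illustrative assumptions, and the window is shrunk to 40 characters so the effect is visible in a short string.

```python
def chunk_fixed(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows, ignoring document structure."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # each window starts `step` characters after the previous one
    return [text[i:i + size] for i in range(0, len(text), step)]


# Hypothetical two-chapter document, small enough to inspect by eye.
doc = (
    "Chapter 1: Setup\n"
    + "Install the tool. " * 4
    + "\nChapter 2: Usage\n"
    + "Run the tool. " * 4
)

chunks = chunk_fixed(doc, size=40, overlap=5)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

In this sample the "Chapter 2: Usage" heading lands in the same chunk as the tail of Chapter 1, while most of Chapter 2's body sits in the following chunk with no heading attached.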
CVC's Real Document Intelligence
CVC integrates the VectifyAI/PageIndex architecture for document understanding. Instead of arbitrary chunking, CVC builds a hierarchical tree that mirrors the document's actual structure.
PDF Pipeline:
- TOC Detection — finds the table of contents
- TOC Extraction — parses section structure
- TOC Transformation — normalizes into a tree
- Page Offset Mapping — maps sections to physical pages
- Tree Building — hierarchical node structure
- Recursive Splitting — splits oversized nodes
- LLM Summary Enrichment — AI-generated summaries per node
- Description Generation — document-level overview
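The page offset mapping and tree building stages can be sketched as follows. This is not CVC's or PageIndex's actual code: the `Node` class, the `build_tree` function, the `(level, title, printed_page)` TOC format, and the single constant `page_offset` are all assumptions made for illustration. TOC detection, recursive splitting, and the LLM summary steps are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    start_page: int               # physical page number in the PDF
    end_page: int = -1            # filled in after all nodes are placed
    summary: str = ""             # an LLM-written summary would be stored here
    children: list["Node"] = field(default_factory=list)


def build_tree(toc, page_offset, total_pages):
    """Build a section tree from TOC entries.

    toc: list of (level, title, printed_page) in document order, level >= 1.
    page_offset: assumed constant difference between printed and physical pages.
    """
    root = Node("root", 1, total_pages)
    stack = [(0, root)]           # path from the root to the most recent node
    flat = []                     # all nodes in document order, with their levels
    for level, title, printed_page in toc:
        node = Node(title, printed_page + page_offset)
        while stack[-1][0] >= level:   # climb until we find this node's parent
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
        flat.append((level, node))

    # Approximation: a section ends on the page before the next section at the
    # same or a shallower level starts (or on its own start page if they share one).
    for i, (level, node) in enumerate(flat):
        node.end_page = total_pages
        for next_level, nxt in flat[i + 1:]:
            if next_level <= level:
                node.end_page = max(node.start_page, nxt.start_page - 1)
                break
    return root


toc = [(1, "Introduction", 1), (2, "Background", 2), (1, "Methods", 5)]
tree = build_tree(toc, page_offset=2, total_pages=10)
```

Each node now carries a physical page range, so a later stage could split any node whose range is too large or attach a summary to it.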
Markdown Pipeline: Header-based tree from #/##/### hierarchy.
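A header-based tree can be built with a simple stack, as in the sketch below. The function name `markdown_tree` and the dictionary node layout are assumptions for illustration, not CVC's actual data model. For brevity the sketch does not skip `#` lines inside fenced code blocks, which a real parser would need to handle.

```python
import re

# ATX-style header: one to six '#' characters, whitespace, then the title.
HEADER = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def markdown_tree(text):
    """Nest sections by header depth; body text attaches to the nearest header above it."""
    root = {"title": "root", "level": 0, "text": "", "children": []}
    stack = [root]                          # path from the root to the current section
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            level = len(match.group(1))
            node = {"title": match.group(2), "level": level, "text": "", "children": []}
            while stack[-1]["level"] >= level:   # close sections at the same or deeper level
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["text"] += line + "\n"
    return root


md = "# Guide\nIntro.\n## Install\nRun setup.\n## Usage\n### CLI\nFlags.\n"
md_tree = markdown_tree(md)
```

Here "Install" and "Usage" become children of "Guide", and "CLI" becomes a child of "Usage" with the text "Flags." attached to it.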
Plain Text Pipeline: LLM-generated pseudo-TOC for unstructured documents.
LLM-Guided Tree Search
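When the AI searches a document, it doesn't brute-force match the query against every piece of text. CVC reads the node summaries at each tree level, selects the most relevant branches, and recurses until it reaches leaf nodes. Because only the summaries along the selected branches are read, the search touches a small part of the document and uses far fewer tokens than scanning the full text.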
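The descent can be sketched as a short recursive function. Everything here is illustrative: `Node`, `tree_search`, and the `beam` parameter are assumed names, and `keyword_pick` is a plain word-overlap ranker standing in for the LLM call that would choose branches from their summaries.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    summary: str
    children: list["Node"] = field(default_factory=list)


def tree_search(node, query, pick, beam=1):
    """Follow the `beam` most relevant children at each level down to leaf nodes."""
    if not node.children:
        return [node]
    hits = []
    for child in pick(query, node.children)[:beam]:
        hits.extend(tree_search(child, query, pick, beam))
    return hits


def keyword_pick(query, nodes):
    """Stand-in for the LLM: rank nodes by words shared between query and summary."""
    words = set(query.lower().split())
    return sorted(
        nodes,
        key=lambda n: len(words & set(n.summary.lower().split())),
        reverse=True,
    )


# Hypothetical document tree with hand-written summaries.
doc_tree = Node("root", "", [
    Node("Installation", "install and configure the tool", [
        Node("Linux", "install on linux with apt"),
        Node("Windows", "install on windows with the installer"),
    ]),
    Node("Billing", "pricing invoices and refunds", [
        Node("Refunds", "how refunds are issued"),
        Node("Invoices", "where to download invoices"),
    ]),
])

hits = tree_search(doc_tree, "how are refunds issued", keyword_pick)
print([n.title for n in hits])
```

Only the summaries of the children at each visited level are compared against the query, so the "Linux" and "Windows" nodes are never read in this run. Raising `beam` trades more reads for a lower chance of missing a relevant branch.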