RAG over corporate docs — what teams underestimate
RAG looks simple in demos: index documents, retrieve chunks, ask an LLM. Production RAG over real corporate knowledge is much harder, because teams underestimate data quality, chunking strategy, evaluation, and ongoing maintenance.
Build a RAG demo with 20 PDFs and OpenAI's API in an afternoon and it will feel like magic. Try the same approach against 50,000 corporate documents and the magic evaporates: wrong answers, missed information, slow queries, citations to outdated docs.
As a rule of thumb, production RAG is roughly 80% data engineering and evaluation and 20% model selection.
What demos hide
- Carefully selected documents, all clean Markdown or simple PDFs.
- Questions you wrote yourself, knowing the answer is in the corpus.
- No measurement — you eyeball the answers and they look fine.
- No production load, no concurrent users, no latency requirements.
What production looks like
- 10,000 to 1M documents in mixed formats: PDFs, Word, Confluence pages, Notion, SharePoint, emails, transcripts.
- Documents with tables, images, footnotes, sidebars, scanned pages.
- Real user questions that contain typos, half-context, ambiguous references.
- Multiple versions of the same document (policy v1, v2, v3 — which is current?).
- Sensitive content that must be filtered by access control.
- Documents in multiple languages.
- Knowledge that updates daily — the index must stay fresh.
Data quality is the foundation
Garbage in, garbage out. Before any RAG work:
- Identify canonical sources (single source of truth per topic).
- Deprecate outdated documents.
- Establish ownership for each knowledge area.
- Normalize formats — extract clean text from PDFs, preserve tables, capture metadata.
If the corporate wiki is full of conflicting drafts, RAG amplifies the chaos. Fix the documentation first.
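One piece of this cleanup, choosing a single current version per document, can be sketched in a few lines. This is a minimal illustration, not a full pipeline: the `DocVersion` fields (`family`, `version`, `updated`, `deprecated`) are hypothetical metadata that would have to be captured during extraction.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocVersion:
    family: str            # hypothetical grouping key, e.g. "travel-policy"
    version: int
    updated: date
    deprecated: bool = False


def select_canonical(docs):
    """Keep the newest non-deprecated version of each document family."""
    latest = {}
    for doc in docs:
        if doc.deprecated:
            continue  # deprecated versions never reach the index
        best = latest.get(doc.family)
        if best is None or (doc.version, doc.updated) > (best.version, best.updated):
            latest[doc.family] = doc
    return latest


docs = [
    DocVersion("travel-policy", 1, date(2022, 1, 10), deprecated=True),
    DocVersion("travel-policy", 2, date(2023, 2, 1)),
    DocVersion("travel-policy", 3, date(2024, 3, 1)),
    DocVersion("security-guide", 1, date(2023, 6, 15)),
]
canonical = select_canonical(docs)
for family, doc in sorted(canonical.items()):
    print(family, "-> v" + str(doc.version))
```

Only the selected versions would be passed on to chunking and indexing; a family whose versions are all deprecated simply drops out.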
Chunking strategy
How documents are split often affects retrieval quality more than the choice of embedding model:
- Fixed-size chunks (e.g., 500 tokens with 50-token overlap). Simple, works okay for unstructured text.
- Semantic chunks. Split on natural boundaries (sections, paragraphs). Preserves context.
- Hierarchical chunks. Index both full sections and child paragraphs; retrieve at the level needed.
- Sliding window. Overlapping chunks reduce risk of relevant info being split.
The right strategy depends on document structure. Test several, measure recall on your evaluation set, and pick the winner.
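The fixed-size option is the easiest to make concrete. The sketch below assumes the document is already tokenized; in the example, whitespace splitting stands in for a real tokenizer, and the function name `chunk_tokens` is illustrative.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting is an assumption standing in for a model tokenizer.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, size=4, overlap=1):
    print(" ".join(chunk))
```

With `size=4` and `overlap=1`, each window starts on the last token of the previous one, so a sentence cut at a boundary still appears whole in one of the two neighbors.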
Embeddings and retrieval
- Embedding model matters less than people think: most modern multilingual models work reasonably well.
- Vector search alone misses keyword-heavy queries ("section 4.2.3"). Add BM25.
- Hybrid retrieval (vector + BM25) with reciprocal rank fusion outperforms either alone.
- Re-ranking the top-20 with a cross-encoder model lifts precision substantially.
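Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retriever returns an ordered list of document ids; `k=60` is the commonly used smoothing constant, and the ids are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector search results
bm25_hits = ["doc_c", "doc_a", "doc_d"]     # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Because the fusion uses ranks rather than raw scores, the vector and BM25 scores never need to be normalized against each other. The fused list is what a cross-encoder would then re-rank.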
Citations and provenance
Every answer should cite the documents that informed it. Users see:
- Source document name and link.
- Section or page reference.
- Last updated date.
Without citations, users can't verify answers. With citations, they can check the source, trust the system, and give feedback on bad sources.
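One way to make this hard to skip is to carry the citation fields alongside every retrieved chunk. The sketch below is an assumption about how that record might look; the `Citation` class, its field names, and the example URL are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    title: str           # source document name
    url: str             # link to the source
    section: str         # section or page reference
    last_updated: date


def format_citations(citations):
    """Render citations as a numbered list to append to an answer."""
    lines = []
    for i, c in enumerate(citations, start=1):
        lines.append(
            f"[{i}] {c.title}, {c.section} "
            f"(updated {c.last_updated.isoformat()}) {c.url}"
        )
    return "\n".join(lines)


sources = [
    Citation("Travel Policy", "https://wiki.example.com/travel-policy",
             "Section 4.2", date(2024, 3, 1)),
]
print(format_citations(sources))
```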
Access control
Most corporate documents have access restrictions. RAG must respect them:
- Filter retrieval results by the asking user's permissions.
- Store each document's access list as metadata in the index, and keep documents that no user of the system may see out of the index entirely.
- Audit every query — who asked what.
- Periodically review who has access to what.
RAG that leaks confidential docs to the wrong users is a compliance incident.
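Filtering by the asking user's permissions can be sketched as a check on each retrieved chunk before anything reaches the LLM. The `Chunk` class and the group names here are hypothetical. In practice the same filter is better pushed into the search query itself, so restricted chunks are never fetched at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset   # groups permitted to read the source document


def filter_by_permissions(chunks, user_groups):
    """Keep only chunks whose source document the user may read.

    Fails closed: a chunk with no allowed groups is visible to nobody.
    """
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


retrieved = [
    Chunk("handbook", "Office hours are...", frozenset({"all-staff"})),
    Chunk("salary-bands", "Band 5 ranges from...", frozenset({"hr"})),
    Chunk("orphan", "No ACL recorded", frozenset()),
]
visible = filter_by_permissions(retrieved, {"all-staff", "engineering"})
print([c.doc_id for c in visible])
```

The fail-closed default matters: a document whose permissions were lost during ingestion should disappear from answers rather than become visible to everyone.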
Evaluation
Without measurement, RAG quality decays unnoticed:
- Build an evaluation set of 100-500 questions with expected answers and sources.
- Measure retrieval precision and recall at a fixed cutoff (precision@k, recall@k).
- Measure answer correctness with LLM-as-judge or human eval.
- Track over time as the corpus evolves.
Run the evaluation before and after each major change, and compare the results to verify nothing regressed.
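The retrieval half of the evaluation can be sketched as follows. It assumes each evaluation item lists the ids of its expected source documents and that `retrieve` is whatever function returns ranked document ids for a question; the questions and ids below are made up.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one question."""
    top_k = retrieved[:k]
    relevant = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(eval_set, retrieve, k=5):
    """Average precision@k and recall@k over the whole evaluation set."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    total_p = total_r = 0.0
    for item in eval_set:
        p, r = precision_recall_at_k(retrieve(item["question"]), item["sources"], k)
        total_p += p
        total_r += r
    n = len(eval_set)
    return {"precision": total_p / n, "recall": total_r / n}


# Hypothetical two-question evaluation set with a canned retriever.
eval_set = [
    {"question": "What is the travel per diem?", "sources": ["travel-policy"]},
    {"question": "How do I request a laptop?", "sources": ["it-handbook", "procurement"]},
]
canned = {
    "What is the travel per diem?": ["travel-policy", "expense-faq"],
    "How do I request a laptop?": ["it-handbook", "onboarding"],
}
result = evaluate(eval_set, canned.get, k=2)
print(result)
```

Storing the resulting numbers per run, together with the index and configuration version, is what makes regressions visible over time.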
Latency and cost
- End-to-end latency under 3 seconds is achievable with good design.
- Each query costs 0.5-3 cents (embedding + LLM call).
- Concurrent users multiply infrastructure costs.
- Caching answers to common questions reduces both latency and cost.
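A minimal sketch of such a cache follows, assuming exact-match lookup on a normalized question. The key also includes the user's groups so a cached answer is never served across permission boundaries. `AnswerCache` and `fake_llm` are illustrative names, and a real cache would also need expiry or invalidation when documents are re-indexed.

```python
import hashlib


def cache_key(question, user_groups):
    """Key on normalized question text plus the user's permission scope."""
    normalized = " ".join(question.lower().split())
    scope = ",".join(sorted(user_groups))
    return hashlib.sha256(f"{scope}|{normalized}".encode("utf-8")).hexdigest()


class AnswerCache:
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, question, user_groups, compute):
        key = cache_key(question, user_groups)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = compute(question)   # the expensive retrieval + LLM call
        self._store[key] = answer
        return answer


calls = []

def fake_llm(question):
    """Stand-in for the full RAG pipeline; records each real invocation."""
    calls.append(question)
    return "Employees accrue 2 days per month."

cache = AnswerCache()
cache.get_or_compute("What is  the PTO policy?", {"staff"}, fake_llm)
cache.get_or_compute("what is the pto policy?", {"staff"}, fake_llm)   # cache hit
cache.get_or_compute("what is the pto policy?", {"hr"}, fake_llm)      # different scope
print(cache.hits, cache.misses, len(calls))
```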
Ongoing maintenance
- Re-index when documents change (event-driven or daily).
- Monitor for documents marked as outdated.
- Review user feedback on bad answers weekly.
- Update evaluation set as questions evolve.
- Prune or archive content once it is confirmed outdated.
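For the re-indexing step, a content hash per document is one simple way to decide what actually needs work. The sketch assumes the indexer stored a hash for each document at the last run; `plan_reindex` and the document ids are hypothetical.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Compare current documents with the hashes stored at the last index run.

    current_docs:   {doc_id: text} as fetched from the sources now
    indexed_hashes: {doc_id: hash} recorded when each doc was last indexed
    Returns (ids to embed and upsert, ids to delete from the index).
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_upsert, to_delete


indexed = {
    "travel-policy": content_hash("Per diem is 50 EUR."),
    "old-faq": content_hash("Outdated answers."),
    "it-handbook": content_hash("Request laptops via the portal."),
}
current = {
    "travel-policy": "Per diem is 60 EUR.",              # changed
    "it-handbook": "Request laptops via the portal.",    # unchanged
    "security-guide": "Use a password manager.",         # new
}
to_upsert, to_delete = plan_reindex(current, indexed)
print(to_upsert, to_delete)
```

Unchanged documents are skipped, which keeps embedding costs proportional to what changed rather than to corpus size.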
Verdict
RAG demos lie. Production RAG over corporate docs is a data engineering challenge — clean sources, smart chunking, hybrid retrieval, access control, citations, eval, maintenance. Budget 60-70% of project time for these, not for prompts or LLMs. The teams that do this right get knowledge bases the org actually uses.