How Korvo's RAG Search Works on Your Local Files
RAG - retrieval-augmented generation - is the technique that makes AI outputs actually useful. Instead of relying on the model's training data alone, RAG retrieves relevant chunks from your own documents and feeds them as context alongside your query. The result: grounded answers with citations from your actual source material.
Most RAG implementations require uploading your documents to a cloud vector database - Pinecone, Weaviate, Chroma Cloud. Your sensitive files get chunked, embedded, and stored on servers you don't control. Korvo does the entire pipeline locally. Here's how.
The problem with cloud RAG
Traditional RAG-as-a-service has a fundamental tension: to make AI answers better, you have to give away your data. The more documents you upload, the better the retrieval - and the more exposure you have.
- Your documents are chunked and stored as embeddings on third-party servers.
- The original text is often stored alongside embeddings for retrieval.
- Vector databases are a high-value target - they contain the distilled knowledge of every customer.
- You have no control over retention policies, geographic storage, or access controls.
- Typical cost: $70–200/month for meaningful storage, on top of your AI subscription.
Korvo's local RAG pipeline
Korvo runs the entire RAG pipeline on your machine. No cloud vector databases. No document uploads. Here's the architecture, step by step:
Document ingestion
When you add files to a Korvo project - PDFs, markdown, plain text, Word documents - the app extracts the text content locally. No file is uploaded anywhere. The extracted text is stored in a local SQLite database within your project workspace.
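As a rough sketch, the ingestion step amounts to extracting text and writing it into a local SQLite file. The table and column names below are illustrative, not Korvo's actual schema:

```python
import sqlite3

def ingest(db_path: str, source_file: str, text: str) -> int:
    """Store locally-extracted document text in a per-project SQLite DB.

    Hypothetical schema for illustration; nothing leaves the machine.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY,
               source_file TEXT NOT NULL,
               content TEXT NOT NULL
           )"""
    )
    cur = conn.execute(
        "INSERT INTO documents (source_file, content) VALUES (?, ?)",
        (source_file, text),
    )
    conn.commit()
    doc_id = cur.lastrowid
    conn.close()
    return doc_id
```

Because the database is an ordinary file inside the project workspace, deleting the project deletes the extracted text with it.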
Chunking
Documents are split into semantically meaningful chunks - typically 500–1000 tokens each, with overlap to preserve context across boundaries. Korvo uses a recursive splitting strategy that respects paragraph and section boundaries rather than cutting mid-sentence. Chunk metadata (source file, page number, section heading) is preserved for citation.
Embedding generation
Each chunk is converted into a vector embedding - a numerical representation of its semantic meaning. Korvo uses your configured AI provider's embedding endpoint (e.g., OpenAI's text-embedding-3-small). The embedding request goes directly from your machine to the provider. The resulting vectors are stored locally - never on our servers.
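Conceptually, the embedding step batches chunks into requests aimed directly at the provider. The sketch below builds OpenAI-style request payloads (the endpoint and body shape follow OpenAI's embeddings API, which accepts a list of inputs per call); the batch size and request structure are illustrative:

```python
import json

EMBEDDING_URL = "https://api.openai.com/v1/embeddings"

def build_embedding_requests(chunks: list[str], batch_size: int = 100) -> list[dict]:
    """Batch chunks into embedding API payloads sent machine-to-provider."""
    requests = []
    for i in range(0, len(chunks), batch_size):
        requests.append({
            "url": EMBEDDING_URL,
            "body": json.dumps({
                "model": "text-embedding-3-small",
                # The embeddings endpoint accepts a list of inputs,
                # so one request can embed a whole batch of chunks.
                "input": chunks[i:i + batch_size],
            }),
        })
    return requests
```

Batching matters here: embedding a few hundred chunks per request is what makes ingestion of large document sets fast.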
Local vector storage
Embeddings are indexed in a local vector store using HNSW (Hierarchical Navigable Small World) indexing - the same algorithm used by production vector databases, but running entirely on your device. This enables fast approximate nearest-neighbor search across thousands of chunks without any cloud dependency.
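To make the search step concrete, here is an exact cosine-similarity top-k search over an in-memory index. HNSW approximates exactly this ranking in sub-linear time; the brute-force version below is an illustrative stand-in, not Korvo's index implementation:

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], index: dict[int, list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (chunk_id, similarity) pairs for the k nearest chunks, best first."""
    scored = ((cosine(query, vec), cid) for cid, vec in index.items())
    return [(cid, score) for score, cid in heapq.nlargest(k, scored)]
```

An HNSW library (hnswlib is a common choice) replaces the linear scan with a navigable graph, which is how sub-50ms search over tens of thousands of chunks stays feasible on a laptop.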
Query-time retrieval
When you ask a question or trigger Plan Mode, Korvo embeds your query using the same embedding model, then performs a similarity search against the local vector index. The top-k most relevant chunks are retrieved, ranked by cosine similarity, and injected into the prompt as grounding context.
Grounded generation with citations
The LLM receives your query plus the retrieved context and generates a response. Korvo's prompting instructs the model to cite its sources - producing inline references like [Source: pitch-deck.pdf, p.4] that link back to the exact chunk used. The output is grounded in your actual documents, not the model's general training data.
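The grounding step can be sketched as simple prompt assembly: each retrieved chunk is prefixed with its source reference, and the instructions tell the model to cite in that format. The wording and field names below are illustrative, not Korvo's actual prompt:

```python
def build_prompt(query: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: cited context blocks plus the user query."""
    context = "\n\n".join(
        f"[Source: {c['file']}, p.{c['page']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer using ONLY the context below. Cite every claim inline "
        "using the [Source: file, p.N] reference of the chunk it came from.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

Because each context block carries the chunk's own metadata, a citation in the output maps directly back to a specific file and page.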
What this looks like in practice
Say you're doing due diligence on a deal. You add the pitch deck, financial model, cap table, and three market research reports to a Korvo project. Here's what happens:
The entire pipeline - ingestion, chunking, embedding, retrieval, generation - runs locally. The only network calls are embedding requests and the final LLM call, both going directly to your chosen provider via your API key.
Performance considerations
A common concern with local RAG is performance. In practice, it's fast:
| Stage | Speed | Notes |
|---|---|---|
| Ingestion | ~50 pages/sec | PDF extraction + chunking |
| Embedding | ~200 chunks/sec | Via OpenAI API (batched) |
| Vector search | <50ms | Local HNSW index, 10k chunks |
| Full RAG query | 2–8 sec | Retrieval + LLM generation |
The bottleneck is almost always the LLM generation step - which is the same latency you'd have with a cloud RAG pipeline. The local retrieval step is actually faster because there's no network round-trip to a remote vector database.
Why citations matter
RAG without citations is only marginally better than hallucination. If you can't trace an AI-generated claim back to a specific source, you can't trust it - and for high-stakes decisions, untraceable claims are worse than no claims at all.
Korvo's citation system isn't an afterthought. Every output produced through RAG includes source references. Every reference links to the actual chunk. And every chunk links back to the original file, page, and section. This is what we call full provenance - the ability to trace any conclusion back through the reasoning chain to its source material.
Local RAG vs. cloud RAG: the comparison
|  | Cloud RAG | Korvo (local) |
|---|---|---|
| Data location | Third-party servers | Your device |
| Privacy | Provider-dependent | Enforced by architecture |
| Cost | $70–200/mo + AI sub | $0 (uses your API key) |
| Search latency | 100–300ms (network) | <50ms (local) |
| Offline access | No | Full index available |
| Vendor lock-in | High (proprietary index) | None (standard formats) |
The bottom line
RAG is what makes AI useful for real work - but the standard implementation requires surrendering your documents to cloud infrastructure. Korvo proves it doesn't have to.
Local ingestion. Local embeddings. Local vector search. Direct-to-provider generation. Full citations. Zero cloud storage.
Your documents stay on your machine. Your AI outputs are grounded in your actual sources. And every conclusion is traceable.
Try local RAG in Korvo
Upload your files, ask questions, get cited answers - all on your machine. Free to start.
Download free