Most enterprise AI pilots fail because data leaves the organisation. We built Kerdos AI — a fully private, RAG-powered document Q&A system — using open-source tools that run entirely within your environment.
The Problem: Enterprise AI Without Privacy
Most enterprise AI demos look impressive until a CTO asks: "Where does our data go?" Sending confidential documents to OpenAI, Anthropic, or any external API is a non-starter for most regulated organisations. Legal contracts, HR policies, financial models — these documents cannot leave the building.
We built Kerdos AI to solve exactly this problem. It's a Retrieval-Augmented Generation (RAG) system that runs entirely within your environment, grounded strictly in documents you upload, with zero external data transfer in the enterprise edition.
What is RAG and Why It Matters
RAG (Retrieval-Augmented Generation) combines a vector database for retrieval with a language model for generation. Instead of asking the LLM to memorise your proprietary information, RAG retrieves the most relevant document chunks at query time and injects them into the prompt context. The key advantage: the LLM's job shifts from "know everything" to "reason about what I'm given."
- Grounded answers: Every response is backed by retrievable source chunks
- Reduced hallucination on your domain: The model is constrained to what your documents say, and any claim can be audited against the retrieved sources
- Updateable without retraining: Add a new policy doc and it's instantly queryable
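The retrieve-then-inject flow described above is mostly prompt assembly. Here is a minimal sketch of that step, not Kerdos AI's actual code; the template wording and function name are illustrative:

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a grounded prompt: the retrieved chunks become the only
    context the LLM is allowed to reason over."""
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources as [Source N]. If the answer is not in the "
        "sources, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [
    "Annual leave accrues at 2 days per month of service.",
    "Unused leave may be carried over for up to 12 months.",
]
prompt = build_rag_prompt("How fast does annual leave accrue?", chunks)
```

The "don't know" instruction matters: it gives the model an explicit escape hatch when retrieval misses, instead of inviting it to guess.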
The Architecture
Our pipeline has six stages:
- Document Parsing: PyMuPDF for PDFs, python-docx for Word files, plain parsers for TXT/MD/CSV
- Text Chunking: 512-character chunks with 64-character overlap to preserve context at boundaries
- Embedding: sentence-transformers/all-MiniLM-L6-v2 — a fast, CPU-friendly model producing 384-dimensional dense vectors
- Indexing: FAISS in-memory flat L2 index for the demo; IVF indexes for enterprise scale
- Retrieval: Nearest-neighbour search over L2-normalised embeddings (on unit vectors, L2 distance and cosine similarity produce the same ranking, which is why the flat L2 index works), returning the Top-K most relevant chunks
- Generation: meta-llama/Llama-3.1-8B-Instruct receives only the retrieved chunks and produces a grounded, cited answer
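The chunking and retrieval stages above can be sketched end-to-end. This is a toy illustration with NumPy standing in for the real embedder and FAISS index; the chunk size and overlap match the article, but the function names and vectors are assumptions:

```python
import numpy as np

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size character chunks; each chunk starts `overlap` characters
    before the previous one ends, so sentences that straddle a boundary
    appear intact in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def top_k(query_vec: np.ndarray, index_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Top-K retrieval over L2-normalised vectors. On unit vectors,
    ranking by cosine similarity is identical to ranking by L2 distance,
    so a FAISS flat L2 index gives the same results at scale."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity against every chunk
    return np.argsort(-sims)[:k]      # indices of the k best chunks
```

In the real pipeline, `index_vecs` would be the 384-dimensional MiniLM embeddings of every chunk, and the returned indices map back to chunk text for prompt assembly.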
Why LLaMA 3.1 and Not GPT-4?
Three reasons:

- Deployability: LLaMA 3.1 can run entirely on-premise
- Licensing: Meta's community license permits commercial use for organisations under 700M monthly active users
- Performance: LLaMA 3.1 8B Instruct scores within 10-15% of GPT-4 on document Q&A benchmarks
The Demo vs. The Enterprise Edition
The public demo on Hugging Face Spaces uses the HuggingFace Inference API (data passes through HF's infrastructure). The enterprise edition replaces this with a self-hosted LLaMA 3.1 instance (vLLM or Ollama), a persistent FAISS/Milvus index with authentication, optional domain fine-tuning, and white-label branding.
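As a rough sketch, the self-hosted swap amounts to serving the model locally and pointing the app at that endpoint instead of the HF API. The commands below are illustrative deployment options, not the enterprise edition's actual configuration; model tags and ports will vary with your hardware, so consult the vLLM and Ollama documentation:

```shell
# Option A: vLLM serves an OpenAI-compatible endpoint on localhost
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Option B: Ollama pulls a quantised build and serves it locally
ollama run llama3.1:8b
```

Either way, documents and queries never leave the host, which is the entire point of the enterprise edition.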
Try It Yourself
The demo is live at kerdosdotio/Custom-LLM-Chat on Hugging Face Spaces. For enterprise deployment, partnerships, or investment: partnership@kerdos.in
Dr. Johnson leads AI research and implementation at Kerdos Infrasoft, specializing in healthcare AI and machine learning applications with over 12 years of experience.