Retrieval-Augmented Generation (RAG) is the most practical way to build AI chatbots that answer questions from your own documents — without sending data to the cloud. In this hands-on tutorial, you’ll build a fully local RAG chatbot using LangChain, Ollama, and ChromaDB that runs entirely on your machine. No API keys, no cloud costs, no data leaks.
What Is RAG and Why Does It Matter in 2026?
Large Language Models (LLMs) are powerful, but they hallucinate and lack knowledge of your private data. RAG solves both problems by retrieving relevant documents first, then feeding them as context to the LLM. In 2026, local RAG has become viable thanks to powerful open-source models like Llama 4, Mistral, and Qwen 3 running efficiently via Ollama.
Benefits of local RAG:
- Privacy: Your documents never leave your machine
- Cost: Zero API fees — runs on consumer hardware
- Speed: No network latency for inference
- Control: Choose your model, tweak your prompts, own your pipeline
Prerequisites
Before we begin, make sure you have:
- Python 3.11+ installed
- Ollama installed and running
- At least 8GB RAM (16GB recommended)
- Some PDF or text documents you want to query
Pull the models we’ll use:
```shell
ollama pull llama3.2
ollama pull nomic-embed-text
```

Step 1: Install Dependencies
Create a new project and install the required packages:
```shell
mkdir local-rag-chatbot && cd local-rag-chatbot
python -m venv venv
source venv/bin/activate
pip install langchain langchain-ollama langchain-chroma \
    langchain-community pypdf sentence-transformers
```

Step 2: Load and Chunk Your Documents
The first step in any RAG pipeline is ingesting documents and splitting them into manageable chunks. Here’s how:
```python
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def load_documents(data_path: str = "./docs"):
    """Load all PDFs from a directory."""
    loader = PyPDFDirectoryLoader(data_path)
    return loader.load()


def chunk_documents(documents, chunk_size=800, overlap=200):
    """Split documents into overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    print(f"Split {len(documents)} docs into {len(chunks)} chunks")
    return chunks
```

The RecursiveCharacterTextSplitter is ideal because it tries to split on natural boundaries (paragraphs, sentences) before falling back to character-level splits. The 200-character overlap ensures context isn’t lost at chunk boundaries.
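To see why overlap matters, here's a deliberately simplified sliding-window splitter (not LangChain's actual algorithm, which prefers natural boundaries): each chunk repeats the tail of the previous one, so a sentence straddling a boundary is still fully visible in at least one chunk.

```python
def sliding_chunks(text: str, size: int, overlap: int):
    """Simplified character-window splitter: each chunk starts
    (size - overlap) characters after the previous one, so the
    last `overlap` characters of a chunk reappear in the next."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


text = "The quick brown fox jumps over the lazy dog near the river bank."
chunks = sliding_chunks(text, size=30, overlap=10)
for c in chunks:
    print(repr(c))
```

Each consecutive pair of chunks shares 10 characters, which is exactly the safety margin the 200-character overlap provides at full scale.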
Step 3: Create the Vector Store
Next, we embed our chunks and store them in ChromaDB — a lightweight, local vector database:
```python
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma


def create_vector_store(chunks, persist_dir="./chroma_db"):
    """Embed chunks and store in ChromaDB."""
    embeddings = OllamaEmbeddings(
        model="nomic-embed-text",
        base_url="http://localhost:11434"
    )
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
        collection_name="local_docs"
    )
    print(f"Stored {len(chunks)} embeddings in ChromaDB")
    return vector_store
```

We use nomic-embed-text via Ollama for embeddings — it’s fast, runs locally, and produces high-quality 768-dimensional vectors.
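Under the hood, similarity search ranks chunks by how close their embedding vectors are to the query vector, most commonly via cosine similarity. Here's a toy illustration with made-up 3-dimensional vectors (real nomic-embed-text vectors have 768 dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" for illustration only
query = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # points in nearly the same direction as the query
chunk_b = [0.0, 0.1, 0.9]   # points in a very different direction

print(cosine_similarity(query, chunk_a))  # close to 1.0
print(cosine_similarity(query, chunk_b))  # close to 0.0
```

ChromaDB does this comparison against every stored vector (with indexing tricks to avoid a brute-force scan) and returns the closest matches.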
Step 4: Build the RAG Chain
Now the core: connect the retriever to the LLM with a prompt template:
```python
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


def build_rag_chain(vector_store):
    """Build a RAG chain with retrieval + LLM."""
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}
    )
    llm = ChatOllama(
        model="llama3.2",
        temperature=0.3,
        base_url="http://localhost:11434"
    )
    template = """You are a helpful assistant. Answer the question
based ONLY on the following context. If the context doesn't
contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""
    prompt = ChatPromptTemplate.from_template(template)

    def format_docs(docs):
        return "\n\n---\n\n".join(doc.page_content for doc in docs)

    chain = (
        {"context": retriever | format_docs,
         "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain
```

Key decisions here: k=4 retrieves four relevant chunks (good balance of context vs. noise), and temperature=0.3 keeps answers focused and factual.
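If the LCEL pipe syntax looks opaque, here's roughly what invoking the chain with a question string does, sketched in plain Python with stand-in functions (the real chain delegates these steps to the retriever and ChatOllama):

```python
def fake_retriever(question):
    """Stand-in for vector_store.as_retriever(): returns top-k chunk texts."""
    return ["Chunk about topic A.", "Chunk about topic B."]


def fake_llm(prompt_text):
    """Stand-in for ChatOllama: returns a canned answer."""
    return "Answer based on the retrieved context."


def invoke_rag(question):
    # 1. Retrieve: question -> top-k chunks, joined (retriever | format_docs)
    context = "\n\n---\n\n".join(fake_retriever(question))
    # 2. Prompt: fill the template with context and question (prompt)
    prompt_text = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    # 3. Generate: the LLM answers from the prompt (llm | StrOutputParser)
    return fake_llm(prompt_text)


print(invoke_rag("What is topic A?"))
```

The dict literal at the head of the chain runs both branches on the same input: the question goes to the retriever for the context slot and passes through unchanged for the question slot.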
Step 5: Create the Interactive Chatbot
Tie everything together into a working chatbot:
```python
def main():
    # Ingest documents
    print("Loading documents...")
    docs = load_documents("./docs")
    chunks = chunk_documents(docs)

    # Create or load vector store
    print("Building vector store...")
    vector_store = create_vector_store(chunks)

    # Build chain
    print("Initializing RAG chain...")
    chain = build_rag_chain(vector_store)

    # Chat loop
    print("\n🤖 Local RAG Chatbot Ready!")
    print("Ask questions about your documents. Type 'quit' to exit.\n")

    while True:
        question = input("You: ").strip()
        if question.lower() in ("quit", "exit", "q"):
            break
        if not question:
            continue
        print("\nAssistant: ", end="", flush=True)
        response = chain.invoke(question)
        print(response)
        print()


if __name__ == "__main__":
    main()
```

Step 6: Add Conversation Memory (Bonus)
For multi-turn conversations, add chat history awareness:
```python
from langchain_core.messages import HumanMessage, AIMessage
# Reuses ChatOllama, ChatPromptTemplate, and StrOutputParser from Step 4


def build_conversational_chain(vector_store):
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    llm = ChatOllama(model="llama3.2", temperature=0.3)
    template = """Given the chat history and context, answer
the user's question. Use context from retrieved documents.

Chat History:
{chat_history}

Context:
{context}

Question: {question}

Answer:"""
    prompt = ChatPromptTemplate.from_template(template)
    chat_history = []

    def ask(question: str) -> str:
        docs = retriever.invoke(question)
        context = "\n\n".join(d.page_content for d in docs)
        history_str = "\n".join(
            f"{'Human' if isinstance(m, HumanMessage) else 'AI'}: {m.content}"
            for m in chat_history[-6:]  # Keep last 3 exchanges
        )
        response = (prompt | llm | StrOutputParser()).invoke({
            "chat_history": history_str,
            "context": context,
            "question": question
        })
        chat_history.extend([
            HumanMessage(content=question),
            AIMessage(content=response)
        ])
        return response

    return ask
```

Performance Tips
- Chunk size matters: 800 characters works well for most documents. Too small = lost context, too large = noise
- Use quantized models: Ollama serves Q4_K_M quantized models by default — near-full quality at a fraction of the memory of full precision
- Persist your vector store: ChromaDB saves to disk, so you only embed once
- Hybrid search: For production, combine vector similarity with BM25 keyword search using LangChain's EnsembleRetriever
- GPU acceleration: Ollama auto-detects GPUs — even a 6GB VRAM card dramatically speeds up inference
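The hybrid-search tip deserves a closer look. EnsembleRetriever merges the ranked lists from multiple retrievers using weighted Reciprocal Rank Fusion: every document earns weight / (c + rank) from each list that ranks it, and the scores are summed. Here's a standalone sketch of that fusion step (the doc IDs and rankings are made up for illustration):

```python
def rrf_merge(ranked_lists, weights, c=60):
    """Weighted Reciprocal Rank Fusion: each list contributes
    weight / (c + rank) for every doc it ranks (rank is 1-based)."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings: BM25 rewards exact keyword hits,
# the vector retriever rewards semantic closeness
bm25_ranking = ["doc3", "doc1", "doc7"]
vector_ranking = ["doc1", "doc5", "doc3"]
merged = rrf_merge([bm25_ranking, vector_ranking], weights=[0.5, 0.5])
print(merged)  # doc1 ranks first: it appears high in both lists
```

Documents that score well under both keyword and semantic retrieval float to the top, which is exactly why hybrid search often beats either method alone.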
What’s Next?
You now have a fully local RAG chatbot. From here, you can:
- Add support for more file types (Markdown, HTML, DOCX) using LangChain’s loaders
- Build a web UI with Streamlit or Gradio
- Implement re-ranking with a cross-encoder for better retrieval accuracy
- Add streaming responses for a smoother UX
- Deploy behind a FastAPI endpoint for team access
Local RAG is one of the most impactful AI patterns in 2026. It gives you the power of LLMs over your own data, completely offline, with zero API costs. The tooling has matured — there’s never been a better time to build.