Build a Local RAG Chatbot with LangChain and Ollama in 2026: A Complete Python Tutorial

Retrieval-Augmented Generation (RAG) is the most practical way to build AI chatbots that answer questions from your own documents — without sending data to the cloud. In this hands-on tutorial, you’ll build a fully local RAG chatbot using LangChain, Ollama, and ChromaDB that runs entirely on your machine. No API keys, no cloud costs, no data leaks.

What Is RAG and Why Does It Matter in 2026?

Large Language Models (LLMs) are powerful, but they hallucinate and lack knowledge of your private data. RAG solves both problems by retrieving relevant documents first, then feeding them as context to the LLM. In 2026, local RAG has become viable thanks to powerful open-source models like Llama 4, Mistral, and Qwen 3 running efficiently via Ollama.

Benefits of local RAG:

  • Privacy: Your documents never leave your machine
  • Cost: Zero API fees — runs on consumer hardware
  • Speed: No network latency for inference
  • Control: Choose your model, tweak your prompts, own your pipeline

Prerequisites

Before we begin, make sure you have:

  • Python 3.11+ installed
  • Ollama installed and running
  • At least 8GB RAM (16GB recommended)
  • Some PDF or text documents you want to query

Pull the models we’ll use:

ollama pull llama3.2
ollama pull nomic-embed-text

Step 1: Install Dependencies

Create a new project and install the required packages:

mkdir local-rag-chatbot && cd local-rag-chatbot
python -m venv venv
source venv/bin/activate

pip install langchain langchain-ollama langchain-chroma \
  langchain-community langchain-text-splitters pypdf

Step 2: Load and Chunk Your Documents

The first step in any RAG pipeline is ingesting documents and splitting them into manageable chunks. Here’s how:

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_documents(data_path: str = "./docs"):
    """Load all PDFs from a directory."""
    loader = PyPDFDirectoryLoader(data_path)
    return loader.load()

def chunk_documents(documents, chunk_size=800, overlap=200):
    """Split documents into overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    print(f"Split {len(documents)} docs into {len(chunks)} chunks")
    return chunks

The RecursiveCharacterTextSplitter is ideal because it tries to split on natural boundaries (paragraphs, sentences) before falling back to character-level splits. The 200-character overlap ensures context isn’t lost at chunk boundaries.
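To see why overlap matters, here is a deliberately naive fixed-offset chunker, a toy sketch, not LangChain's recursive algorithm: each chunk repeats the tail of the previous one, so a sentence that straddles a boundary survives intact in at least one chunk.

```python
# Toy sliding-window chunker. This is NOT LangChain's recursive splitter;
# it splits on fixed character offsets purely to illustrate overlap.
def naive_chunk(text: str, chunk_size: int = 20, overlap: int = 5):
    step = chunk_size - overlap  # advance less than a full chunk each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "The quick brown fox jumps over the lazy dog near the river bank."
for c in naive_chunk(text):
    print(repr(c))  # note each chunk's first 5 chars repeat the previous tail
```

The last 5 characters of every full chunk reappear at the start of the next one, which is exactly the boundary context the 200-character overlap preserves above.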

Step 3: Create the Vector Store

Next, we embed our chunks and store them in ChromaDB — a lightweight, local vector database:

from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

def create_vector_store(chunks, persist_dir="./chroma_db"):
    """Embed chunks and store in ChromaDB."""
    embeddings = OllamaEmbeddings(
        model="nomic-embed-text",
        base_url="http://localhost:11434"
    )
    
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
        collection_name="local_docs"
    )
    print(f"Stored {len(chunks)} embeddings in ChromaDB")
    return vector_store

We use nomic-embed-text via Ollama for embeddings — it’s fast, runs locally, and produces high-quality 768-dimensional vectors.
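Under the hood, "retrieval" is just vector comparison: the query is embedded and chunks are ranked by cosine similarity. Here is a pure-Python sketch with made-up 3-dimensional embeddings (the real nomic-embed-text vectors have 768 dimensions); the chunk names and numbers are invented for illustration.

```python
import math

# Cosine similarity: the comparison a vector store performs to rank
# stored chunk embeddings against a query embedding.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]                  # made-up query embedding
chunks = {
    "refund policy": [0.8, 0.2, 0.1],    # points in a similar direction
    "office hours":  [0.1, 0.9, 0.3],    # points elsewhere
}
ranked = sorted(chunks, key=lambda name: cosine(query, chunks[name]), reverse=True)
print(ranked[0])  # → refund policy
```

The chunk whose embedding points in nearly the same direction as the query wins, regardless of vector magnitude; that direction-only comparison is why cosine similarity is the default for text embeddings.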

Step 4: Build the RAG Chain

Now the core: connect the retriever to the LLM with a prompt template:

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def build_rag_chain(vector_store):
    """Build a RAG chain with retrieval + LLM."""
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}
    )
    
    llm = ChatOllama(
        model="llama3.2",
        temperature=0.3,
        base_url="http://localhost:11434"
    )
    
    template = """You are a helpful assistant. Answer the question 
based ONLY on the following context. If the context doesn't 
contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""
    
    prompt = ChatPromptTemplate.from_template(template)
    
    def format_docs(docs):
        return "\n\n---\n\n".join(doc.page_content for doc in docs)
    
    chain = (
        {"context": retriever | format_docs, 
         "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain

Key decisions here: k=4 retrieves four relevant chunks (good balance of context vs. noise), and temperature=0.3 keeps answers focused and factual.
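If the LCEL pipe syntax feels opaque, this plain-Python walkthrough shows the same data flow step by step. `fake_retrieve` and `fake_llm` are hypothetical stubs standing in for the real retriever and ChatOllama; only the plumbing is the point.

```python
# What the LCEL chain does for each question, written out by hand.
def fake_retrieve(question):           # stub retriever: question -> top-k docs
    return ["Doc about refunds.", "Doc about shipping."]

def format_docs(docs):                 # join docs into one context string
    return "\n\n---\n\n".join(docs)

def fake_llm(prompt_text):             # stub LLM: filled prompt -> answer text
    return f"(answer grounded in {prompt_text.count('---') + 1} chunks)"

def run_chain(question):
    # Step 1: the dict branch. Retrieval runs on the question, while
    # RunnablePassthrough forwards the question itself unchanged.
    inputs = {"context": format_docs(fake_retrieve(question)),
              "question": question}
    # Step 2: the prompt template is filled with both values.
    prompt_text = (f"Context:\n{inputs['context']}\n\n"
                   f"Question: {inputs['question']}\n\nAnswer:")
    # Step 3: the LLM receives the filled prompt; StrOutputParser's job
    # is just to extract plain text, which our stub already returns.
    return fake_llm(prompt_text)

print(run_chain("What is the refund policy?"))  # → (answer grounded in 2 chunks)
```

The pipe operator simply automates this hand-off: each stage's output becomes the next stage's input.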

Step 5: Create the Interactive Chatbot

Tie everything together into a working chatbot:

def main():
    # Ingest documents
    print("Loading documents...")
    docs = load_documents("./docs")
    chunks = chunk_documents(docs)
    
    # Create or load vector store
    print("Building vector store...")
    vector_store = create_vector_store(chunks)
    
    # Build chain
    print("Initializing RAG chain...")
    chain = build_rag_chain(vector_store)
    
    # Chat loop
    print("\n🤖 Local RAG Chatbot Ready!")
    print("Ask questions about your documents. Type 'quit' to exit.\n")
    
    while True:
        question = input("You: ").strip()
        if question.lower() in ("quit", "exit", "q"):
            break
        if not question:
            continue
        
        print("\nAssistant: ", end="", flush=True)
        response = chain.invoke(question)
        print(response)
        print()

if __name__ == "__main__":
    main()

Step 6: Add Conversation Memory (Bonus)

For multi-turn conversations, add chat history awareness:

from langchain_core.messages import HumanMessage, AIMessage

def build_conversational_chain(vector_store):
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    llm = ChatOllama(model="llama3.2", temperature=0.3)
    
    template = """Given the chat history and context, answer 
the user's question. Use context from retrieved documents.

Chat History:
{chat_history}

Context:
{context}

Question: {question}

Answer:"""
    prompt = ChatPromptTemplate.from_template(template)
    
    chat_history = []
    
    def ask(question: str) -> str:
        docs = retriever.invoke(question)
        context = "\n\n".join(d.page_content for d in docs)
        history_str = "\n".join(
            f"{'Human' if isinstance(m, HumanMessage) else 'AI'}: {m.content}"
            for m in chat_history[-6:]  # Keep last 3 exchanges
        )
        response = (prompt | llm | StrOutputParser()).invoke({
            "chat_history": history_str,
            "context": context,
            "question": question
        })
        chat_history.extend([
            HumanMessage(content=question),
            AIMessage(content=response)
        ])
        return response
    
    return ask
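The `chat_history[-6:]` slice implements a rolling window by hand. If you prefer the window to enforce itself, `collections.deque` with `maxlen` does the same thing automatically; a small sketch:

```python
from collections import deque

# A deque with maxlen=6 keeps only the newest 6 messages (3 exchanges),
# silently discarding the oldest as new ones arrive -- no slicing needed.
history = deque(maxlen=6)
for turn in range(5):                      # 5 exchanges = 10 messages total
    history.append(f"Human: question {turn}")
    history.append(f"AI: answer {turn}")

print(len(history))   # → 6
print(history[0])     # → Human: question 2  (turns 0-1 were evicted)
```

Either approach keeps the prompt from growing without bound, which matters because every message in the window is re-sent to the model on each turn.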

Performance Tips

  • Chunk size matters: 800 characters (what the splitter above measures, via length_function=len) works well for most documents. Too small = lost context, too large = noise
  • Use quantized models: Ollama serves Q4_K_M quantized models by default, giving near-full quality at roughly a quarter of the memory of the unquantized weights
  • Persist your vector store: ChromaDB saves to disk, so you only embed once
  • Hybrid search: For production, combine vector similarity with BM25 keyword search using EnsembleRetriever
  • GPU acceleration: Ollama auto-detects GPUs — even a 6GB VRAM card dramatically speeds up inference
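On the hybrid-search tip: ensemble retrievers typically merge the two ranked lists with Reciprocal Rank Fusion (RRF), which LangChain's EnsembleRetriever also uses in weighted form. Here is a pure-Python sketch of unweighted RRF with made-up document IDs:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so items ranked highly by EITHER retriever float to the top.
def rrf_merge(rankings, k=60):
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # similarity-search order
bm25_hits   = ["doc_b", "doc_d", "doc_a"]   # keyword-search order
print(rrf_merge([vector_hits, bm25_hits]))  # → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

doc_b wins because both retrievers rank it near the top, while doc_c, found only by vector search and ranked last there, drops to the bottom. That consensus effect is what makes hybrid search robust to the failure modes of either method alone.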

What’s Next?

You now have a fully local RAG chatbot. From here, you can:

  • Add support for more file types (Markdown, HTML, DOCX) using LangChain’s loaders
  • Build a web UI with Streamlit or Gradio
  • Implement re-ranking with a cross-encoder for better retrieval accuracy
  • Add streaming responses for a smoother UX
  • Deploy behind a FastAPI endpoint for team access
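On streaming: LCEL chains expose a .stream() method that yields the answer in pieces instead of waiting for the full response. The printing pattern looks like this; `fake_stream` is a stub generator standing in for `chain.stream(question)` so the sketch runs without a live model.

```python
# Streaming UX sketch: print each piece as it arrives instead of
# blocking until the whole answer is ready.
def fake_stream(question):                  # stand-in for chain.stream(question)
    for token in ["Local ", "RAG ", "answers ", "appear ", "incrementally."]:
        yield token

collected = []
for token in fake_stream("demo"):
    print(token, end="", flush=True)        # flush so pieces show immediately
    collected.append(token)
print()
```

Swap `fake_stream("demo")` for `chain.stream(question)` in the Step 5 chat loop and the assistant's answer will render token by token.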

Local RAG is one of the most impactful AI patterns in 2026. It gives you the power of LLMs over your own data, completely offline, at zero cost. The tooling has matured — there’s never been a better time to build.

© 7Tech – Programming and Tech Tutorials