Retrieval-Augmented Generation (RAG) has become the go-to pattern for building AI chatbots that can answer questions about your own data. Instead of fine-tuning expensive models, RAG lets you ground LLM responses in your documents — reducing hallucinations and keeping answers accurate. In this guide, we’ll build a fully functional RAG chatbot using LangChain and ChromaDB in Python.
What Is RAG and Why Does It Matter?
RAG combines two steps: retrieval (finding relevant documents from a knowledge base) and generation (using an LLM to produce an answer based on those documents). This approach solves key LLM limitations:
- Fewer hallucinations — answers are grounded in your actual data
- No retraining needed — update your knowledge base anytime
- Cost-effective — works well even with smaller, cheaper models
- More private — your documents live in your own vector database (though with a hosted LLM, retrieved chunks are still sent to the provider at query time)
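To make the two steps concrete, here is a deliberately tiny, dependency-free sketch: "retrieval" is a similarity lookup (plain word overlap here), and "generation" is a response built from the retrieved text (a template standing in for the LLM call). All the document strings and function names are illustrative, not part of any library.

```python
# Toy RAG pipeline: retrieval by word overlap, "generation" by template.
# Illustrative only -- a real system uses embeddings and an LLM.

KNOWLEDGE_BASE = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by email from 9am to 5pm on weekdays.",
    "Shipping is free on all orders over 50 dollars.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many question words they share."""
    q_words = set(question.lower().replace("?", "").split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def generate(question: str, context: list[str]) -> str:
    """Stand-in for the LLM call: an answer grounded in the retrieved context."""
    return f"Based on: {' '.join(context)}"

context = retrieve("What is the refund policy?", KNOWLEDGE_BASE)
print(generate("What is the refund policy?", context))
```

Everything that follows in this guide replaces these toy pieces with real ones: embeddings for similarity, ChromaDB for storage, and an LLM for generation.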
Project Setup
First, install the required packages:
```shell
pip install langchain langchain-community langchain-openai chromadb tiktoken
```

We'll use OpenAI's embeddings and chat model, but you can swap in any provider (Ollama for local, Anthropic, Google, etc.).
```python
import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
```

Step 1: Load and Chunk Your Documents
RAG starts with your data. LangChain provides loaders for PDFs, text files, web pages, and more. Here we’ll load a directory of text files:
```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load all .txt files from a directory
loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# Split into chunks for embedding
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
```

The RecursiveCharacterTextSplitter is a strong general-purpose default. It tries to split on paragraph boundaries first, then sentences, then words, keeping chunks semantically coherent. The 200-character overlap helps preserve context at chunk boundaries.
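To see what "recursive" means here, the core idea can be sketched in a few lines of plain Python: try the highest-priority separator that appears in the text, merge the pieces back up to the size limit, and recurse with finer separators on any piece that is still too big. This is a simplified illustration, not LangChain's actual implementation (overlap is omitted for brevity).

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    preferring the earliest separator in the list that appears in the text."""
    if len(text) <= chunk_size:
        return [text]
    # Pick the highest-priority separator present in the text.
    for sep in separators:
        if sep and sep in text:
            parts = text.split(sep)
            break
    else:
        # No separator found: hard-split at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Greedily merge parts back into chunks under the size limit.
    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # A single part may itself be too large: recurse with finer separators.
                chunks.extend(recursive_split(part, chunk_size, separators[1:]))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("aaa bbb ccc ddd", 7, ["\n\n", "\n", " "]))  # ['aaa bbb', 'ccc ddd']
```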
Step 2: Create the Vector Store with ChromaDB
ChromaDB is a lightweight, open-source vector database that runs embedded (no server needed) or as a service. We’ll embed our chunks and store them:
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Create embeddings and vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs"
)
print(f"Stored {vectorstore._collection.count()} vectors")
```

The text-embedding-3-small model is fast, cheap, and surprisingly effective. For production, consider text-embedding-3-large for better retrieval accuracy.
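Under the hood, a vector store answers one question: which stored vectors are closest to the query vector? A minimal version of that search, with made-up 3-dimensional vectors (real embeddings from text-embedding-3-small have 1536 dimensions), looks like this:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up 3-d "embeddings" keyed by text; purely illustrative.
store = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping info": [0.1, 0.9, 0.1],
    "contact page":  [0.0, 0.2, 0.9],
}

def search(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda t: cosine_similarity(store[t], query_vec), reverse=True)
    return ranked[:k]

print(search([0.8, 0.2, 0.0]))  # most similar first
```

ChromaDB does exactly this, just with approximate-nearest-neighbor indexing so it stays fast at scale.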
Step 3: Build the RAG Chain
Now we wire retrieval and generation together using the LangChain Expression Language (LCEL), whose `|` operator pipes each stage's output into the next.
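If the pipe syntax looks magical, it helps to know it is essentially function composition. Here is a stripped-down illustration of the idea in plain Python (this is not LangChain's actual implementation; `Step` and the lambdas are invented for the sketch):

```python
class Step:
    """Minimal pipeable stage: `a | b` runs a, then feeds the result to b."""
    def __init__(self, fn):
        self.fn = fn
    def __or__(self, other):
        return Step(lambda x: other.fn(self.fn(x)))
    def invoke(self, x):
        return self.fn(x)

# Three fake stages mirroring retriever -> prompt -> LLM.
fetch_context = Step(lambda q: {"context": "docs about " + q, "question": q})
fill_prompt   = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm      = Step(lambda p: "answer based on: " + p)

chain = fetch_context | fill_prompt | fake_llm
print(chain.invoke("refunds"))
```

LCEL adds a lot on top (streaming, batching, async), but the mental model of "each `|` feeds the next stage" carries over directly to the real chain below.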
```python
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)

# Create retriever (top 4 most relevant chunks)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# RAG prompt template
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""
prompt = ChatPromptTemplate.from_template(template)

# Helper to format retrieved docs
def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

# Build the chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```

Step 4: Query Your Chatbot
Now you can ask questions and get grounded answers:
```python
# Ask a question
response = rag_chain.invoke("What is our refund policy?")
print(response)

# Interactive chat loop
while True:
    question = input("\nYou: ")
    if question.lower() in ["quit", "exit"]:
        break
    answer = rag_chain.invoke(question)
    print(f"\nBot: {answer}")
```

Step 5: Add Conversation Memory
A real chatbot remembers previous messages. Let’s add conversation history:
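Conceptually, window memory is nothing more than a bounded buffer of recent turns that gets prepended to each prompt. A plain-Python sketch of the idea (`WindowMemory` is invented for illustration):

```python
from collections import deque

class WindowMemory:
    """Keep only the last k (question, answer) exchanges."""
    def __init__(self, k: int = 5):
        self.turns = deque(maxlen=k)  # old turns fall off automatically

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def history(self) -> str:
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)

memory = WindowMemory(k=2)
memory.save("What products do you offer?", "Widgets and gadgets.")
memory.save("How much is a widget?", "Ten dollars.")
memory.save("Do you ship abroad?", "Yes, worldwide.")
print(memory.history())  # only the last 2 exchanges remain
```

LangChain's ConversationBufferWindowMemory does essentially this, plus message formatting and integration with the chain.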
```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=5  # Remember last 5 exchanges
)

conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    verbose=False
)

# Chat with memory
result = conversational_chain.invoke({"question": "What products do you offer?"})
print(result["answer"])

# Follow-up uses context from previous answer
result = conversational_chain.invoke({"question": "How much does the first one cost?"})
print(result["answer"])
```

Production Tips
Before deploying your RAG chatbot, consider these best practices:
- Chunk size matters — Too small and you lose context; too large and retrieval gets noisy. Test with 500-1500 characters.
- Use hybrid search — combining vector similarity with keyword search (BM25) often improves retrieval; in LangChain you can pair a BM25Retriever with your vector retriever in an EnsembleRetriever, and ChromaDB's where_document filters can additionally restrict results by keyword.
- Add metadata filtering — tag chunks with source, date, or category so users can scope their queries.
- Monitor and evaluate — Use LangSmith or custom logging to track retrieval relevance and answer quality.
- Consider reranking — Add a cross-encoder reranker (like Cohere Rerank) between retrieval and generation to improve relevance.
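A common way to merge the vector and keyword rankings mentioned above is reciprocal rank fusion (RRF): each document earns 1/(rank + c) from every list it appears in, and the scores are summed. A minimal sketch (doc ids and the two hit lists are made up):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids into a single ranking.
    Documents near the top of multiple lists accumulate the highest scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # from similarity search
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # from BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

The constant c (60 is a common default) damps the influence of top ranks so one list cannot dominate the fusion.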
Wrapping Up
You now have a working RAG chatbot that can answer questions about any document collection. The LangChain + ChromaDB stack is excellent for prototyping and small-to-medium production workloads. For larger scale, consider Pinecone, Weaviate, or Qdrant as your vector store.
The complete code for this tutorial is under 50 lines (excluding the conversational chain). That’s the beauty of RAG — powerful AI grounded in your data, without the complexity of fine-tuning. Start with your own documents and iterate from there.
