Tanishq Rupaal

[!INFO] Preamble → This blog is a collection of my research and experimentation over two days to learn about RAG and LLMs. The main idea I was chasing was to set up RAG as a service (RAGaaS) - it’s pretty straightforward! Point a container (or a multi-container stack) to a directory containing notes or files and be able to talk to an LLM about it. I don’t know much about machine learning and this post is about me getting on the AI train.

The idea I stated above is actually very common and nothing novel. In fact, several offerings already exist on the internet, the most famous one being NotebookLM by Google. Several other projects on GitHub also exist, but I wanted to try it out for myself and play around with RAG parameters. Also, I wanted it to be completely local (i.e., no sending data to an online service). So, this post will include some preliminary knowledge gathering, sample code, and information on the problems I faced and solved (maybe).

[!DANGER] Keep in mind → All the knowledge listed below is a summarized regurgitation of concepts I’ve read from various online sources. As such, some things may not be academically accurate; they just represent my own high-level understanding.

Concepts

Retrieval-Augmented Generation or RAG is a way of combining a data retrieval mechanism with an LLM to generate contextually relevant responses. It typically involves three key stages with distinct steps within each stage →

- Data Ingestion
  - More on Embedding
- Retrieval Process
- Forward Retrieved Documents to the LLM

These stages are how a general RAG system is typically implemented, and they let it provide contextually aware responses over highly specific information corpora. The toy sketch below shows the whole flow end to end; after that, let's explore the setup I landed on!
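To make those stages concrete before any libraries get involved, here is a toy, dependency-free sketch of the whole flow. Everything in it is a placeholder: split, embed, and llm stand in for whatever text splitter, embedding model, and chat model you end up using, and the "vector store" is just a Python list.

import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ingest(texts, split, embed, store):
    # Stage 1: Data Ingestion - split each document and keep (embedding, chunk) pairs.
    for text in texts:
        for chunk in split(text):
            store.append((embed(chunk), chunk))

def answer(question, embed, store, llm, top_k=10):
    # Stage 2: Retrieval - rank stored chunks by similarity to the question (toy linear scan).
    q_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q_vec, item[0]), reverse=True)
    context = "\n\n".join(chunk for _, chunk in ranked[:top_k])
    # Stage 3: Generation - forward the retrieved context plus the question to the LLM.
    return llm(f"Context:\n{context}\n\nQuestion: {question}")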

Exploring a Setup

Broadly speaking, given the details described in the previous section, the following components are needed for setting up RAG →

[!TIP] An easily forgotten fact → An orchestration framework like LangChain exists only to make building LLM applications easier. It provides wrappers around common LLM-based patterns that let you spin something up in just a few lines of code. However, it isn't strictly necessary: you could use the standard SDKs for platforms like OpenAI and Qdrant and build the exact same thing without an orchestration framework.
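To illustrate the point, here is a rough, self-contained sketch of the same retrieve-then-generate flow built directly on the qdrant-client and ollama Python SDKs (the latter needs pip install ollama). The collection name, sample chunks, and payload key below are assumptions made purely for illustration, separate from the setup described later.

from ollama import Client
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, PointStruct, VectorParams

ollama = Client(host="http://host.docker.internal:11434")
qdrant = QdrantClient("http://host.docker.internal", port=6333)

# Create a throwaway collection sized for mxbai-embed-large's 1024-dimensional vectors.
qdrant.create_collection(
    collection_name="sdk_only_demo",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Embed a couple of toy chunks and store them with their text as payload.
chunks = ["RAG pairs a retrieval step with an LLM.", "Qdrant stores the chunk embeddings."]
points = [
    PointStruct(
        id=i,
        vector=ollama.embeddings(model="mxbai-embed-large", prompt=chunk)["embedding"],
        payload={"text": chunk},
    )
    for i, chunk in enumerate(chunks)
]
qdrant.upsert(collection_name="sdk_only_demo", points=points)

# Retrieve the chunks most similar to the question and hand them to the chat model.
question = "What does Qdrant store?"
query_vec = ollama.embeddings(model="mxbai-embed-large", prompt=question)["embedding"]
hits = qdrant.search(collection_name="sdk_only_demo", query_vector=query_vec, limit=2)
context = "\n\n".join(hit.payload["text"] for hit in hits)
reply = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(reply["message"]["content"])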

That’s all we need to get started with the actual implementation of a local RAG system. Now let’s look at a breakdown of some code.

Proof of Concept Code Walkthrough

First, start by installing the following Python packages →

pip install langchain langchain_community langchain_ollama langchain_chroma langchain-qdrant qdrant-client

Next, import the necessary functions and libraries →

import sys
import time
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_ollama import OllamaEmbeddings
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

The following is a template for performing RAG. It is essentially a system prompt followed by two variables, context and question, which hold the retrieved documents and the user query respectively.

RAG_TEMPLATE = """
You are an assistant tasked with answering a question using the following retrieved context. Follow these guidelines to provide the most accurate and relevant answer:

1. **If you don't know the answer based on the context, explicitly say "I don't know based on the provided context."** Avoid guessing or adding details not found in the context.
2. **Provide information in an organized, hierarchical format**: Use headings, bullet points, and numbering for clear structure, and employ paragraphs where appropriate.
3. **Include all relevant code snippets**: If the context includes code, ensure it is reproduced accurately in the answer.
4. **Focus on relevance**: Only include details directly related to the question. Do not introduce arbitrary or unrelated information from the context.
5. **Avoid redundancy**: Summarize where possible and avoid repeating information unless necessary for clarity.
6. **Acronyms**: For any acronyms you encounter in the query, do not use pre-existing knowledge. Instead, use the context provided to determine the meaning of the acronym.

**Context**:
{context}

**Question**:
{question}

**Answer**:"""

With the prompt ready, the next block of code loads the docs directory and searches for all markdown files using glob. These files are loaded using the TextLoader class (there are other classes for loading Markdown, JSON, etc., but I went with text for simplicity). The loaded documents are then split using recursive character splitting so that each split contains at most 1,000 characters, with an overlap of at most 200 characters between consecutive splits.

loader = DirectoryLoader("docs/", glob="**/*.md", loader_cls=TextLoader, use_multithreading=True)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(documents)
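Optionally, a quick sanity check like the following confirms the loader and splitter picked up what you expect before anything gets embedded →

# Peek at what the loader and splitter produced.
print(f"{len(documents)} documents loaded, {len(all_splits)} chunks after splitting")
print(all_splits[0].page_content[:200])   # preview of the first chunk
print(all_splits[0].metadata)             # e.g. the source file path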

Next, the following code block instantiates an embedding model using Ollama and a Qdrant client pointed at a server instance that is already running via Docker (refer to the Qdrant documentation to launch a container). It then creates a collection and uploads the document splits through the vector store in batches of 35, pausing briefly between batches.

# Embedding model served by Ollama; mxbai-embed-large produces 1024-dimensional vectors.
local_embeddings = OllamaEmbeddings(model="mxbai-embed-large", base_url="http://host.docker.internal:11434")
qdrant = QdrantClient("http://host.docker.internal", port=6333)
qdrant.create_collection(
    collection_name="knowledgebase",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
vectorstore = QdrantVectorStore(
    client=qdrant,
    collection_name="knowledgebase",
    embedding=local_embeddings,
)
# Upload the splits in batches, pausing briefly between batches.
batch_size = 35
delay = 0.5
for i in range(0, len(all_splits), batch_size):
    batch = all_splits[i:i + batch_size]
    vectorstore.add_documents(batch)
    time.sleep(delay)
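As a quick check that every batch actually landed, the point count in the collection can be compared against the number of splits →

# Confirm that the collection now holds one point per chunk.
count = qdrant.count(collection_name="knowledgebase", exact=True).count
print(f"{count} points stored, {len(all_splits)} chunks expected")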

With the vector database ready, the next code block defines a function that performs a semantic similarity search of a user-provided query across the vector database and returns the top 10 matching documents. It also defines an Ollama chat client for interacting with Llama 3.1.

def format_hybrid_docs(query):
    # Retrieve the 10 most similar chunks and join them into a single context string.
    docs = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10}).invoke(query)
    return "\n\n".join(doc.page_content for doc in docs)

model = ChatOllama(
    model="llama3.1:8b",
    temperature=0,
    base_url="http://host.docker.internal:11434",
)
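Both pieces can be exercised on their own before wiring them into a chain, for example with a throwaway question →

# Spot-check retrieval and generation independently.
print(format_hybrid_docs("test question about my notes")[:500])
print(model.invoke("Reply with the single word: ready").content)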

Finally, the last code section repeatedly reads questions from the user and answers each one using the context of the documents retrieved by semantic similarity search against that query.

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)
qa_chain = (
    {"context": format_hybrid_docs | RunnablePassthrough(), "question": RunnablePassthrough()}
    | rag_prompt | model | StrOutputParser()
)
while True:
    ques = input("Ask me a question: ")
    if ques == "exit":
        break
    output = qa_chain.invoke(ques)
    print("\n\n", output, "\n\n")

[!INFO] Note that the chain does not resend earlier questions and answers back to the model, so each query is answered independently of the previous ones.

Some Problems

Here are some of the issues I encountered during my experimentation →

What’s Next After PoC?

With this proof of concept ready, here are some items that I may try to implement in the future →

And that’s a wrap on the quick RAG lesson. Cheers!