Tanishq Rupaal

[!INFO] Preamble → This blog is a collection of my research and experimentation over two days to learn about RAG and LLMs. The main idea I was chasing was to set up RAG as a service (RAGaaS) - it’s pretty straightforward! Point a container (or a multi-container stack) to a directory containing notes or files and be able to talk to an LLM about it. I don’t know much about machine learning and this post is about me getting on the AI train.

The idea I stated above is actually very common and nothing novel. In fact, several offerings already exist on the internet, the most famous one being NotebookLM by Google. Several other projects on GitHub also exist, but I wanted to try it out for myself and play around with RAG parameters. Also, I wanted it to be completely local (i.e., no sending data to an online service). So, this post will include some preliminary knowledge gathering, sample code, and information on the problems I faced and solved (maybe).

[!DANGER] Keep in mind → All the knowledge listed below is a summarized regurgitation of concepts I’ve read from various online sources. As such, some things may not be academically accurate; they just represent my own high-level understanding.

Concepts

Retrieval-Augmented Generation or RAG is a way of combining a data retrieval mechanism with an LLM to generate contextually relevant responses. It typically involves three key stages with distinct steps within each stage →

- Data Ingestion
  - More on Embedding
- Retrieval Process
- Forward Retrieved Documents to the LLM

These stages are how a general RAG system is typically implemented, and they let it provide contextually aware responses over highly specific information corpora. The toy sketch below shows the whole flow end to end; after that, let's explore the setup I landed on!
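To make those stages concrete before any libraries get involved, here is a toy, dependency-free sketch of the whole flow. Everything in it is a placeholder: split, embed, and llm stand in for whatever text splitter, embedding model, and chat model you end up using, and the "vector store" is just a Python list.

import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ingest(texts, split, embed, store):
    # Stage 1: Data Ingestion - split each document and keep (embedding, chunk) pairs.
    for text in texts:
        for chunk in split(text):
            store.append((embed(chunk), chunk))

def answer(question, embed, store, llm, top_k=10):
    # Stage 2: Retrieval - rank stored chunks by similarity to the question (toy linear scan).
    q_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q_vec, item[0]), reverse=True)
    context = "\n\n".join(chunk for _, chunk in ranked[:top_k])
    # Stage 3: Generation - forward the retrieved context plus the question to the LLM.
    return llm(f"Context:\n{context}\n\nQuestion: {question}")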

Exploring a Setup

Broadly speaking, given the details described in the previous section, the following components are needed for setting up RAG →

[!TIP] An easily forgotten fact → An orchestration framework like LangChain exists only to make building LLM applications easier. It provides wrappers around common LLM-based patterns that let you spin something up in just a few lines of code. However, it isn't strictly necessary: you could use the standard SDKs for platforms like OpenAI and Qdrant and build the exact same thing without an orchestration framework.
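To illustrate the point, here is a rough, self-contained sketch of the same retrieve-then-generate flow built directly on the qdrant-client and ollama Python SDKs (the latter needs pip install ollama). The collection name, sample chunks, and payload key below are assumptions made purely for illustration, separate from the setup described later.

from ollama import Client
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, PointStruct, VectorParams

ollama = Client(host="http://host.docker.internal:11434")
qdrant = QdrantClient("http://host.docker.internal", port=6333)

# Create a throwaway collection sized for mxbai-embed-large's 1024-dimensional vectors.
qdrant.create_collection(
    collection_name="sdk_only_demo",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Embed a couple of toy chunks and store them with their text as payload.
chunks = ["RAG pairs a retrieval step with an LLM.", "Qdrant stores the chunk embeddings."]
points = [
    PointStruct(
        id=i,
        vector=ollama.embeddings(model="mxbai-embed-large", prompt=chunk)["embedding"],
        payload={"text": chunk},
    )
    for i, chunk in enumerate(chunks)
]
qdrant.upsert(collection_name="sdk_only_demo", points=points)

# Retrieve the chunks most similar to the question and hand them to the chat model.
question = "What does Qdrant store?"
query_vec = ollama.embeddings(model="mxbai-embed-large", prompt=question)["embedding"]
hits = qdrant.search(collection_name="sdk_only_demo", query_vector=query_vec, limit=2)
context = "\n\n".join(hit.payload["text"] for hit in hits)
reply = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(reply["message"]["content"])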

That’s all we need to get started with the actual implementation of a local RAG system. Now let’s look at a breakdown of some code.

Proof of Concept Code Walkthrough

First, start by installing the following Python packages →

pip install langchain langchain_community langchain_ollama langchain_chroma langchain-qdrant qdrant-client

Next, import the necessary functions and libraries →

import sys
import time
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_ollama import OllamaEmbeddings
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

The following is a template for performing RAG. It is essentially a system prompt followed by two variables, context and question, which hold the retrieved documents and the user query respectively.

RAG_TEMPLATE = """
You are an assistant tasked with answering a question using the following retrieved context. Follow these guidelines to provide the most accurate and relevant answer:

1. **If you don't know the answer based on the context, explicitly say "I don't know based on the provided context."** Avoid guessing or adding details not found in the context.
2. **Provide information in an organized, hierarchical format**: Use headings, bullet points, and numbering for clear structure, and employ paragraphs where appropriate.
3. **Include all relevant code snippets**: If the context includes code, ensure it is reproduced accurately in the answer.
4. **Focus on relevance**: Only include details directly related to the question. Do not introduce arbitrary or unrelated information from the context.
5. **Avoid redundancy**: Summarize where possible and avoid repeating information unless necessary for clarity.
6. **Acronyms**: For any acronyms you encounter in the query, do not use pre-existing knowledge. Instead, use the context provided to determine the meaning of the acronym.

**Context**:
{context}

**Question**:
{question}

**Answer**:"""

With the prompt ready, the next block of code loads the docs directory and searches for all markdown files using glob. These files are loaded using the TextLoader class (there are other classes for loading Markdown, JSON, etc., but I went with text for simplicity). The loaded documents are then split using recursive character splitting so that each split contains at most 1,000 characters, with an overlap of at most 200 characters between consecutive splits.

loader = DirectoryLoader("docs/", glob="**/*.md", loader_cls=TextLoader, use_multithreading=True)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(documents)
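Optionally, a quick sanity check like the following confirms the loader and splitter picked up what you expect before anything gets embedded →

# Peek at what the loader and splitter produced.
print(f"{len(documents)} documents loaded, {len(all_splits)} chunks after splitting")
print(all_splits[0].page_content[:200])   # preview of the first chunk
print(all_splits[0].metadata)             # e.g. the source file path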

Next, the following code block instantiates an embedding model using Ollama and a Qdrant client pointed at a server instance that is already running via Docker (refer to the Qdrant documentation to launch a container). It then creates a collection and uploads the document splits through the vector store in batches of 35, pausing briefly between batches.

# Embedding model served by Ollama; mxbai-embed-large produces 1024-dimensional vectors.
local_embeddings = OllamaEmbeddings(model="mxbai-embed-large", base_url="http://host.docker.internal:11434")
qdrant = QdrantClient("http://host.docker.internal", port=6333)
qdrant.create_collection(
    collection_name="knowledgebase",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
vectorstore = QdrantVectorStore(
    client=qdrant,
    collection_name="knowledgebase",
    embedding=local_embeddings,
)
# Upload the splits in batches, pausing briefly between batches.
batch_size = 35
delay = 0.5
for i in range(0, len(all_splits), batch_size):
    batch = all_splits[i:i + batch_size]
    vectorstore.add_documents(batch)
    time.sleep(delay)
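As a quick check that every batch actually landed, the point count in the collection can be compared against the number of splits →

# Confirm that the collection now holds one point per chunk.
count = qdrant.count(collection_name="knowledgebase", exact=True).count
print(f"{count} points stored, {len(all_splits)} chunks expected")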

With the vector database ready, the next code block defines a function that performs a semantic similarity search of a user-provided query across the vector database and returns the top 10 matching documents. It also defines an Ollama chat client for interacting with Llama 3.1.

def format_hybrid_docs(query):
    # Retrieve the 10 most similar chunks and join them into a single context string.
    docs = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10}).invoke(query)
    return "\n\n".join(doc.page_content for doc in docs)

model = ChatOllama(
    model="llama3.1:8b",
    temperature=0,
    base_url="http://host.docker.internal:11434",
)
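Both pieces can be exercised on their own before wiring them into a chain, for example with a throwaway question →

# Spot-check retrieval and generation independently.
print(format_hybrid_docs("test question about my notes")[:500])
print(model.invoke("Reply with the single word: ready").content)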

Finally, the last code section repeatedly reads questions from the user and answers each one using the context of the documents retrieved by semantic similarity search against that query.

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)
qa_chain = (
    {"context": format_hybrid_docs | RunnablePassthrough(), "question": RunnablePassthrough()}
    | rag_prompt | model | StrOutputParser()
)
while True:
    ques = input("Ask me a question: ")
    if ques == "exit":
        break
    output = qa_chain.invoke(ques)
    print("\n\n", output, "\n\n")

[!INFO] Note that the chain does not resend earlier questions and answers back to the model, so each query is answered independently of the previous ones.

Some Problems

Here are some of the issues I encountered during my experimentation →

What’s Next After PoC?

With this proof of concept ready, here are some items that I may try to implement in the future →

And that’s a wrap on the quick RAG lesson. Cheers!