(RAG) Q/A with ChainFury
One of the first use cases of LLM-powered apps is question answering. This is how you should think about the problem:
LLMs are “general purpose string-to-string computers”: they take in text (the data) along with instructions on how to process that data
If you have a question, you first need to find the relevant data so you can add it to the LLM's input
This data can be found in a variety of places, such as blogs, PDFs, etc.
To achieve this you need to figure out how to index the information, how to query it, and then how to prompt the LLM. There are two building blocks:
Vector Database: You first need to store the data in a way that is easy to query. The first thing you might think of is a database like Postgres or MongoDB. However, these databases are built for structured querying; they are not suited to searching for a similar piece of text. For this you need a vector database like Qdrant. To get the embeddings you will need to vectorize your dataset, for which we will use the text-embedding-ada-002 model from OpenAI.
Prompt Engineering: Once you have the data, you need to figure out the right way to ask the LLM and get a response from it. This is where ChainFury comes into the picture.
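Before diving in, here is a purely illustrative sketch of what “searching for a similar piece of text” means. The vectors and page names below are made up; in practice the vectors are 1536-dimensional ada-002 embeddings and the search happens inside Qdrant.
def cosine_similarity(a, b):
    # similarity between two embedding vectors: closer to 1.0 means "points the same way"
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# pretend these are embeddings of the question and of two PDF pages
query_vec = [0.1, 0.3, 0.5]
chunk_vecs = {"page_4": [0.1, 0.2, 0.6], "page_9": [0.9, 0.1, 0.0]}
best = max(chunk_vecs, key=lambda k: cosine_similarity(query_vec, chunk_vecs[k]))
print(best)  # -> page_4, the chunk closest to the question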
For the code, go to GitHub @yashbonde/cf_demo
Want to play with the example? Try the demo app.
Objective
We are going to build a simple question answering system over the slides of the Blitzscaling PDF. You can download the PDF and keep it for reference.
Take a note of this; we will test our agent on this question later! The outcome will be a Streamlit app where you can query the PDF and get nicely summarized answers.
Step 0: Installing dependencies
We install the following dependencies for this demo:
cat > requirements.txt <<EOF
fire==0.5.0
PyMuPDF==1.22.5
fitz==0.0.1.dev2
chainfury>=1.4.3
qdrant-client==1.1.1
streamlit==1.26.0
EOF
pip install -r requirements.txt
# load the environment variables
export QDRANT_API_URL="https://xxx" # qdrant.tech
export QDRANT_API_KEY="hbl-xxxxxx"
export OPENAI_TOKEN="sk-xxx" # platform.openai.com
export CHATNBX_TOKEN="tune-xxxxx" # chat.nbox.ai
Step 1: Loading the PDF
We first load the PDF and extract the text from it; you can read the full code in load_data.py. I’ll only highlight the few important parts here.
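Extracting per-page text with PyMuPDF can be as simple as the sketch below; the filename and variable names are illustrative, the real logic lives in load_data.py.
import fitz  # PyMuPDF

pdf = "blitzscaling.pdf"  # illustrative filename
doc = fitz.open(pdf)
page_text = [page.get_text() for page in doc]  # one string per page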
Step 1.1: Chunking of PDF
The first step is to break the document apart into “chunks”. You can use several methods for this; we will use the simplest: one chunk = one page.
You can get into far more complex strategies based on tokens using tiktoken, but for now we will keep it simple; a token-based sketch appears after the code below.
However, a page can contain a lot of text or almost none, so we use simple rules:
a page must contain at least 10 words
if a page contains more than ~700 tokens (about 2500 chars), we break it into parts of 2500 chars each
payloads = []
chunk_size = 2500  # roughly 700 tokens
for i, p in enumerate(page_text):
    # rule 1: skip pages with fewer than 10 words
    if len(p.strip().split()) < 10:
        continue
    if len(p) > chunk_size:
        # rule 2: break long pages into 2500-char chunks with ~20% overlap
        for j, k in enumerate(range(0, len(p), int(chunk_size * 0.8))):
            payloads.append({"doc": pdf, "page_no": i, "chunk": j, "text": p[k:k + chunk_size]})
    else:
        # pages within the limit go in as a single chunk (assumed; not shown in the original excerpt)
        payloads.append({"doc": pdf, "page_no": i, "chunk": 0, "text": p})
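If you want the token-based strategy mentioned above, a minimal sketch with tiktoken could look like this; the encoding name and the 140-token overlap are assumptions, while the 700-token limit mirrors the rule above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding, used by ada-002 / ChatGPT-era models

def chunk_by_tokens(text, max_tokens=700, overlap=140):
    # encode once, then slide a window of max_tokens with some overlap
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens - overlap)
    ]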
Step 1.2: Embeddings
The next step is to get embeddings for each of these chunks. Just as important as getting the chunks is keeping the system performant. For this we will use chainfury.utils.threaded_map to parallelize the process. We create buckets of payloads and extract their text to get the embeddings (batching):
from chainfury.utils import threaded_map

# (batching + parallel) gives ~2 orders of magnitude speedup
for b in buckets:
    full_out = threaded_map(
        fn = get_embedding,               # calls the embedding API for one item
        inputs = [(x, pbar) for x in b],  # each input carries a shared progress bar
        max_threads = 16
    )
    all_items.extend(full_out)
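The get_embedding helper lives in load_data.py; a minimal sketch of what it might look like, assuming the openai Python package (v1+ client), the OPENAI_TOKEN variable from Step 0, and a tqdm-style progress bar, is:
import os
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_TOKEN"])  # assumes openai>=1.0

def get_embedding(item, pbar=None):
    # 'item' is assumed to carry the chunk text; the real helper also handles batching
    resp = openai_client.embeddings.create(model="text-embedding-ada-002", input=item["text"])
    if pbar is not None:
        pbar.update(1)  # tick the shared progress bar
    return resp.data[0].embedding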
Step 1.3: Loading in Qdrant
Finally we load the embeddings into Qdrant. Note that there are two ways to load this data; read more about Qdrant loading.
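Both snippets below assume a collection_name and, for the fresh load, a raw client handle. A minimal way to construct that client from the Step 0 environment variables (using qdrant-client directly; the collection name is hypothetical) is:
import os
from qdrant_client import QdrantClient

client = QdrantClient(
    url = os.environ["QDRANT_API_URL"],
    api_key = os.environ["QDRANT_API_KEY"],
)
collection_name = "blitzscaling"  # hypothetical collection name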
Fresh Load: You can load the data from scratch. This is usually the fastest option since you are only uploading to the disk directly; however, it is not good if you want to keep previous information in the database. For this we write:
from chainfury.components.qdrant import recreate_collection, disable_indexing, enable_indexing

recreate_collection(collection_name, 1536)  # OpenAI embedding dim
disable_indexing(collection_name)
success = client.upload_collection(
    collection_name = collection_name,
    vectors = embedding,
    payload = payloads,
    ids = None,        # Vector ids will be assigned automatically
    batch_size = 256   # How many vectors will be uploaded in a single request?
)
enable_indexing(collection_name)
Incremental Load: You can load the data incrementally. This is slower since you are indexing as you upload and compute becomes the bottleneck, so you can temporarily disable indexing and enable it again later. You can use the inbuilt chainfury.components.qdrant.qdrant_write function to do this.
from chainfury.components.qdrant import disable_indexing, enable_indexing, qdrant_write

disable_indexing(collection_name)
# NOTE: this part is not in the file and is just a representation of what the code will look like
for emb_bucket, payload_bucket in zip(embedding_buckets, payloads_buckets):
    success, status, err = qdrant_write(
        embeddings = emb_bucket,
        collection_name = collection_name,
        extra_payload = payload_bucket,
    )
enable_indexing(collection_name)
Step 2: Prompt Engineering
The next step is to retrieve the information at runtime and query the LLM; you can read the full code in streamlit_app.py. Again, I am only highlighting the important parts here.
from chainfury.components.qdrant import qdrant_read

# 'embedding' is the vector for the user's question
out, err = qdrant_read(
    embeddings = embedding,
    collection_name = collection_name,
    top = 3,  # How many results to return?
)
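The embedding passed above is simply the embedding of the user's question, and the retrieved chunks get flattened into the dp_text string used in the prompt below. A rough sketch, reusing the hypothetical get_embedding from Step 1.2 and assuming each result exposes its payload (the exact shape of qdrant_read's output may differ), is:
# embed the question the same way the chunks were embedded
embedding = get_embedding({"text": question})

# number each retrieved chunk so the model can cite it as <id>n</id>
dp_text = "\n\n".join(
    f"[{i}] (page {hit['page_no']}) {hit['text']}"  # 'hit' assumed to be a payload dict
    for i, hit in enumerate(out)
)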
From this we create a prompt like this:
messages = [
    {
        "role": "system",
        "content": '''
You are a helpful assistant that is helping user summarize the information with citations.
Tag all the citations with tags around it like:
```
this is some text [<id>2</id>, <id>14</id>]
```'''
    },
    {
        "role": "user",
        "content": f'''
Data points collection:
{dp_text}
---
User has asked the following question:
{question}
'''
    }
]
These messages are then passed to either ChatNBX or the OpenAI ChatGPT API. The response is parsed and returned to the user.
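For the OpenAI path, a minimal sketch of that call, reusing the openai_client from Step 1.2 and assuming gpt-3.5-turbo as the model, could be:
# send the prompt built above to the chat API
response = openai_client.chat.completions.create(
    model = "gpt-3.5-turbo",  # assumed model; the ChatNBX path uses the CHATNBX_TOKEN instead
    messages = messages,
)
answer = response.choices[0].message.content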
Step 3: Putting it all together
Finally we put it all together in a Streamlit app. You can read the full code in streamlit_app.py. The above code can be put inside a single function and called with each query. You can use the demo app for yourself now.
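As a sketch, the Streamlit shell can be this small, where answer_question is a hypothetical wrapper around the retrieval and chat steps above:
import streamlit as st

st.title("Blitzscaling Q/A")
question = st.text_input("Ask a question about the slides")
if question:
    # answer_question(): embed question -> qdrant_read -> build prompt -> chat API
    st.markdown(answer_question(question))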
We asked it a question and it gave the correct answer (see the image in the Objective section)!