(RAG) Q/A with ChainFury
========================
One of the first use cases of LLM-powered apps is question answering. This is how you should think about the problem:

* LLMs are "general purpose string-to-string computers": they take in text information (data) along with instructions
  on how to process that data
* If you have a question, you first need to find the relevant data so you can add it to the input of the LLM
* This data can live in a variety of places, such as blogs, PDFs, etc.

To achieve this you need to figure out how to index the information, how to query it, and how to prompt the LLM.
There are two building blocks:
* **Vector Database**: You first need to store the data in a way that is easy to query. The first thing you might
  think of is a database like Postgres or MongoDB. However, these databases are built for structured querying; they
  are not suited to searching for similar pieces of text. For that you need a vector database like
  `Qdrant <https://qdrant.tech/>`_. To get the embeddings you will need to vectorize your dataset, for which we will
  use the :code:`text-embedding-ada-002` model from OpenAI.
* **Prompt Engineering**: Once you have the data, you need to figure out the right way to ask the LLM and get a
  response from it. This is where **ChainFury** comes into the picture.
For the code, go to GitHub `@yashbonde/cf_demo <https://github.com/yashbonde/cf_demo>`_.
Want to play with the example? Try the `demo app`_.
Objective
---------
We are going to build a simple question answering system over the slides of the `Blitzscaling PDF`_. You can download
the PDF and keep it for reference.

.. image:: https://www.thepowermba.com/en/wp-content/uploads/2021/07/BLITZSCALING-5-1024x629.png

Take note of this slide, we will ask our agent a question about it later! The outcome will be a Streamlit app where
you can query the document and get nicely summarized answers.
Step 0: Installing dependencies
-------------------------------
We install the following dependencies for this demo:
.. code-block:: bash
echo '''fire==0.5.0
PyMuPDF==1.22.5
fitz==0.0.1.dev2
chainfury>=1.4.3
qdrant-client==1.1.1
streamlit==1.26.0''' >> requirements.txt
pip install -r requirements.txt
# load the environment variables
export QDRANT_API_URL="https://xxx" # qdrant.tech
export QDRANT_API_KEY="hbl-xxxxxx"
export OPENAI_TOKEN="sk-xxx" # platform.openai.com
export CHATNBX_TOKEN="tune-xxxxx" # chat.nbox.ai
Step 1: Loading the PDF
-----------------------
We first load the PDF and extract the text from it; you can read the full code in `load_data.py`_. I'll only
highlight the important parts here.
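For context, extracting per-page text with PyMuPDF can look something like the sketch below; the file path is
hypothetical and the actual loading code lives in `load_data.py`_.

.. code-block:: python

    import fitz  # PyMuPDF

    pdf = "blitzscaling.pdf"  # hypothetical path to the downloaded slides
    doc = fitz.open(pdf)
    page_text = [page.get_text() for page in doc]  # one string per page
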
Step 1.1: Chunking of PDF
~~~~~~~~~~~~~~~~~~~~~~~~~
The first step is to break the document into "chunks". You can use several methods for this; we will use the
simplest: **one chunk = one page**. You can get into far more complex token-based strategies using :code:`tiktoken`,
but for now we will keep it simple. However, a page can contain a lot of text or almost none, so we apply two simple
rules:

* a page must contain at least 10 words
* if a page contains :code:`> 700 tokens ~ 2500 chars`, we break it into overlapping parts of 2500 chars each

.. code-block:: python

    payloads = []
    for i, p in enumerate(page_text):
        # rule 1: skip pages with fewer than 10 words
        if len(p.strip().split()) < 10:
            continue
        chunk_size = 2500
        if len(p) > chunk_size:
            # rule 2: split long pages into 2500-char chunks with a 20% overlap
            for j, k in enumerate(range(0, len(p), int(chunk_size * 0.8))):
                payloads.append({"doc": pdf, "page_no": i, "chunk": j, "text": p[k:k + chunk_size]})
        else:
            payloads.append({"doc": pdf, "page_no": i, "chunk": 0, "text": p})
Step 1.2: Embeddings
~~~~~~~~~~~~~~~~~~~~
The next step is to get embeddings for each of these chunks. Just as important as the chunking itself is keeping the
pipeline fast. For this we will use :func:`chainfury.utils.threaded_map` to parallelize the process. We create
buckets of payloads and extract the text to get embeddings (batching):
.. code-block:: python

    # (batching + parallel) gives ~2 orders of magnitude speedup
    # buckets holds the payloads split into batches; pbar is a shared progress bar
    for b in buckets:
        full_out = threaded_map(
            fn = get_embedding,
            inputs = [(x, pbar) for x in b],
            max_threads = 16
        )
        all_items.extend(full_out)
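
For reference, here is a minimal sketch of what a :code:`get_embedding` helper could look like, assuming the
:code:`openai` Python package (v1 client) and the :code:`text-embedding-ada-002` model; the actual implementation is
in `load_data.py`_ and may differ.

.. code-block:: python

    import os
    from openai import OpenAI

    oai = OpenAI(api_key=os.environ["OPENAI_TOKEN"])

    def get_embedding(args):
        item, pbar = args  # matches the (payload, progress bar) tuples built above
        resp = oai.embeddings.create(model="text-embedding-ada-002", input=item["text"])
        pbar.update(1)
        return {**item, "embedding": resp.data[0].embedding}
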
Step 1.3: Loading in Qdrant
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Finally we load the embeddings into Qdrant. Note that there are two ways to load this data; read more about `Qdrant loading`_.

* **Fresh Load**: You can load the data from scratch; this is usually the fastest option since you upload directly to
  disk. However, it is not a good fit if you want to keep previous information in the database. The :code:`client`
  used below is a Qdrant client instance, see the sketch after this list. For this we write:

.. code-block:: python
from chainfury.components.qdrant import recreate_collection, disable_indexing, enable_indexing
recreate_collection(collection_name, 1536) # OpenAI embedding dim
disable_indexing(collection_name)
success = client.upload_collection(
collection_name = collection_name,
vectors = embedding,
payload = payloads,
ids = None, # Vector ids will be assigned automatically
batch_size = 256 # How many vectors will be uploaded in a single request?
)
enable_indexing(collection_name)
* **Incremental Load**: You can load the data incrementally; this is slower since you are indexing while you upload,
  and compute becomes the bottleneck. To mitigate this you can temporarily disable indexing and enable it again once
  the upload is done. You can use the built-in :func:`chainfury.components.qdrant.qdrant_write` function to do this:
.. code-block:: python
from chainfury.components.qdrant import disable_indexing, enable_indexing, qdrant_write
disable_indexing(collection_name)
# **NOTE:** This part is not in the file and is just a representation of what the code will look like
for emb_bucket, payload_bucket in zip(embedding_buckets, payloads_buckets):
success, status, err = qdrant_write(
embeddings = emb_bucket,
collection_name = collection_name,
extra_payload = payload_bucket,
)
enable_indexing(collection_name)
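
Note that :code:`client` in the fresh load snippet is a Qdrant client instance. A minimal sketch of how it can be
constructed from the environment variables we exported in Step 0 (the demo code may build it differently):

.. code-block:: python

    import os
    from qdrant_client import QdrantClient

    client = QdrantClient(
        url = os.environ["QDRANT_API_URL"],
        api_key = os.environ["QDRANT_API_KEY"],
    )
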
Step 2: Prompt Engineering
--------------------------
The next step is to retrieve the relevant information at runtime and query the LLM; you can read the full code in
`streamlit_app.py`_. Again, I am only highlighting the important parts here.
.. code-block:: python

    from chainfury.components.qdrant import qdrant_read

    # `embedding` is the vector for the user's question, computed with the same
    # text-embedding-ada-002 model that was used to index the chunks
    out, err = qdrant_read(
        embeddings = embedding,
        collection_name = collection_name,
        top = 3,  # how many results to return
    )
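
The retrieved payloads are then formatted into a single block of text, referred to as :code:`dp_text` in the prompt
below. A hypothetical sketch, assuming each retrieved point carries the payload we stored in Step 1 (the exact
structure of :code:`out` depends on :code:`qdrant_read`, see `streamlit_app.py`_):

.. code-block:: python

    # retrieved_payloads is a hypothetical list of the payload dicts stored in Step 1
    dp_text = "\n\n".join(
        f"[{p['page_no']}] {p['text']}" for p in retrieved_payloads
    )
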
With :code:`dp_text` and the user's :code:`question` we create the prompt:
.. code-block:: python
messages=[
{
"role" : "system",
"content" : '''
You are a helpful assistant that is helping user summarize the information with citations.
Tag all the citations with tags around it like:
```
this is some text [2, 14]
```'''},
{
"role": "user",
"content": f'''
Data points collection:
{dp_text}
---
User has asked the following question:
{question}
'''}
]
This is then passed to either `ChatNBX <https://chat.nbox.ai>`_ or the OpenAI ChatGPT API. The response is parsed and
returned to the user.
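
As a reference, here is a minimal sketch of the OpenAI path using the :code:`openai` package (v1 client); the model
name is only an example, and the demo may instead route the call through ChatNBX or ChainFury components.

.. code-block:: python

    import os
    from openai import OpenAI

    oai = OpenAI(api_key=os.environ["OPENAI_TOKEN"])
    resp = oai.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    answer = resp.choices[0].message.content
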
Step 3: Putting it all together
-------------------------------
Finally we put it all together in a Streamlit app. You can read the full code in `streamlit_app.py`_. The code above
can be wrapped in a single function and called on every query. You can try the `demo app`_ for yourself now.
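
A minimal sketch of what the Streamlit wrapper can look like, assuming the Step 2 logic is wrapped in a hypothetical
:code:`answer_question` function (the real `streamlit_app.py`_ does more):

.. code-block:: python

    import streamlit as st

    st.title("Blitzscaling Q/A")
    question = st.text_input("Ask a question about the Blitzscaling slides")
    if question:
        answer = answer_question(question)  # hypothetical wrapper around Step 2
        st.markdown(answer)
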
.. image:: https://d2e931syjhr5o9.cloudfront.net/chainfury/blitzscaling_qa_rag.png
We asked it a question and it gave the correct answer (see the image in the Objective section)!
.. all the links are here
.. _Blitzscaling PDF: https://drive.google.com/file/d/1QeWwfxEcYyAXkLexCgUX4AWr6nnO3Aqk/view?usp=sharing
.. _load_data.py: https://github.com/yashbonde/cf_demo/blob/master/load_data.py
.. _streamlit_app.py: https://github.com/yashbonde/cf_demo/blob/master/streamlit_app.py
.. _Qdrant loading: https://qdrant.tech/documentation/tutorials/bulk-upload/#upload-directly-to-disk
.. _demo app: https://blitzscaling.streamlit.app/