A LangChain demo on XetHub

We use a LangChain index and OpenAI to answer questions about fairy tales by indexing text files.

Requirements

pip install -r requirements.txt
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY

Usage

Train

# Retrain from scratch
python src/train.py

# Insert a file  
python src/put.py --branch=temp --hackx=10 tests/data/fresh_prince.txt data

# Remove a file
python src/delete.py --branch=temp data/fresh_prince.txt

Run the app

gradio app.py

Development story

First step, like any good POC: something quick and dirty that works. I used LangChain and a small Gradio app.

Next, I saw that I needed to split the data into separate files so I could test whether the model chooses the right source.

After a short exploration, I noticed that, unfortunately, the data is just a bunch of stories combined, with a different structure per collection, and the collections are simply appended to one another. I was about to make a copy of the file somewhere, so I could always redo this part without losing the original data, when I realised that this is the "old paranoid way". Instead, I created a branch and worked directly on the data as if I had nothing to worry about. XetPower!

I also noticed that the number of collections is small enough to handle manually.
Using regex, and with my new sense of confidence, I started hacking the file and writing new files, using add-commit-push to save progress and checkout as my "undo" across multiple files.
This was a great experience: even with 20MB of text, the work was ad hoc, and there was no point in writing scripts for this manual work.
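
For readers who would rather script a split like this, here is a minimal sketch. It is not the procedure used in this repo (that was done by hand). It assumes each collection starts with a recognizable header line; the `=== Title ===` marker and the function names are hypothetical, so adjust the pattern to whatever the real file contains.

```python
import re
from pathlib import Path


def split_collections(text: str, header_pattern: str) -> dict[str, str]:
    """Split combined text into {title: body}.

    header_pattern must have one capture group holding the collection title.
    """
    header = re.compile(header_pattern, flags=re.MULTILINE)
    matches = list(header.finditer(text))
    collections = {}
    for i, match in enumerate(matches):
        # The body runs from the end of this header to the start of the next one.
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        collections[match.group(1).strip()] = text[start:end].strip()
    return collections


def to_filename(title: str) -> str:
    """Turn a collection title into a snake_case .txt file name."""
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"{slug}.txt"


def write_collections(collections: dict[str, str], out_dir: str) -> list[Path]:
    """Write one file per collection and return the paths that were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for title, body in collections.items():
        path = out / to_filename(title)
        path.write_text(body + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Running this on a branch keeps the same safety net described above: if the pattern turns out to be wrong, checkout undoes the written files.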

In the end, I built the new index and committed it, together with the new training script, to the branch, so the model is always up to date with its data and completely reproducible.
Guess what: when I saved the new model, which had 53k vectors, I got 73% deduplication of the data. Just imagine the savings on a huge model with millions of documents!

Testing in the app:
Question: "Who was Pinocchio's father?"
Answer: "Pinocchio's father is Geppetto."
Sources: "data/the_blue_fairy_book.txt"

Question: "Did the emperor have clothes at the end of the story?"
Answer: "At the end of the story, the emperor did not have any clothes."
Sources: "data/the_blue_fairy_book.txt, data/andersen_fairy_tales.txt"

Optional: Try the query: "What is the name of the princess in the story of the frog prince?"

Looks legit!
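
To make "the model chose the right source" concrete, here is a toy sketch of source ranking. It is a stand-in, not what the app does: the demo uses OpenAI embeddings through LangChain, while this example uses plain word counts and cosine similarity so it runs without an API key. The function names and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sources(question: str, documents: dict[str, str], top_k: int = 2) -> list[str]:
    """Return up to top_k source names, most similar first, skipping zero scores."""
    query = Counter(tokenize(question))
    scored = [
        (cosine(query, Counter(tokenize(text))), source)
        for source, text in documents.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [source for score, source in scored[:top_k] if score > 0]


# Hypothetical miniature corpus, one entry per file.
documents = {
    "data/a.txt": "Geppetto carved Pinocchio from a piece of wood",
    "data/b.txt": "The emperor walked in the procession with no clothes",
}
print(rank_sources("Who was Pinocchio's father?", documents))
```

With one file per collection, a check like this can be turned into a test: ask a question whose answer lives in a known file and assert that the file appears in the returned sources.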

Next, I wanted to try a different OpenAI model. I made a long-running branch per model, "openai/text-davinci-003", where I can store the vector index and all the new code that relates only to that model.
The alternative was to generalize my app, the training, and, if I deployed the model too, the rest of the CI/CD to handle these two different models and structures. This way is much cleaner! If next time I want to follow up with a model of similar design, say from Hugging Face, I can create a new branch from this one, and if I want to try a different OpenAI model, I'll branch from there!

How about adding a new file to our index?
In this case it is important to keep both the new raw file, as a reference and in case of deletion, and its embedded vectors in the index at the same time. We can automate this so that every time we add a new file, it is also added to the index, and when we delete a file, it is deleted from the index too.
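
The idea of keeping raw files and vectors in step can be sketched as below. This is an illustration under assumptions, not the code in `src/put.py` or `src/delete.py`: `SyncedIndex` is a hypothetical class, the index is a plain in-memory dict rather than a real vector store, and `embed` is whatever embedding function you pass in. Branch handling and commits are left out.

```python
import shutil
from pathlib import Path
from typing import Callable


class SyncedIndex:
    """Keeps raw files in data_dir and their vectors in a dict, always together."""

    def __init__(self, data_dir: str, embed: Callable[[str], list[float]]):
        self.data_dir = Path(data_dir)
        self.data_dir.mkdir(parents=True, exist_ok=True)
        self.embed = embed  # assumption: maps a text to one vector
        self.vectors: dict[str, list[float]] = {}

    def put(self, source_path: str) -> str:
        """Copy the file into data_dir and index it under its file name."""
        source = Path(source_path)
        target = self.data_dir / source.name
        shutil.copyfile(source, target)
        self.vectors[target.name] = self.embed(target.read_text(encoding="utf-8"))
        return target.name

    def delete(self, name: str) -> None:
        """Remove both the raw file and its vector; missing entries are ignored."""
        (self.data_dir / name).unlink(missing_ok=True)
        self.vectors.pop(name, None)
```

Because `put` and `delete` are the only ways to change either side, the raw files and the index cannot drift apart, which is the property the scripts in the Usage section are meant to provide.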

Question: "Where was the prince of Bel-Air born and raised?" Don't know?
Let's add tests/data/fresh_prince.txt and now it does!
Let's remove it and see what happens: Yeah, it forgot!

Let's update other branches... Now let's check it out and see that the model is updated!

mkdir tmp \
  && cd tmp

git xet clone --no-smudge https://xethub.com/xdssio/langchain_demo.git \
  && cd langchain_demo \
  && git checkout temp \
  && git xet checkout -- model data docs \
  && gradio app.py

The file is here, and the question is answered! This way, any inference server that mounts the repository automatically gets the new data and index.

# clean up
python src/delete.py --branch=temp data/fresh_prince.txt
