A small langchain demo project of a QA on fairy tales books.
A langchain demo in xethub
We use langchain indexes and openai to answer questions about fairy tales by indexing text files.
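To make the "index text files, then answer with sources" idea concrete, here is a toy sketch. It is not the langchain/openai pipeline this repo uses: plain word overlap stands in for embeddings, and the function names (`tokenize`, `build_index`, `retrieve`) and the sample file names are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its words (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_index(documents: dict) -> list:
    """Split each document into paragraphs and keep the source path with every entry."""
    entries = []
    for source, text in documents.items():
        for paragraph in filter(None, (p.strip() for p in text.split("\n\n"))):
            entries.append((source, paragraph, tokenize(paragraph)))
    return entries


def retrieve(index: list, question: str, k: int = 1) -> list:
    """Return the k (source, paragraph) pairs sharing the most words with the question."""
    q = tokenize(question)
    scored = sorted(index, key=lambda e: sum((e[2] & q).values()), reverse=True)
    return [(source, paragraph) for source, paragraph, _ in scored[:k]]


if __name__ == "__main__":
    # Made-up sample texts, not the real contents of the data folder.
    docs = {
        "data/puppet.txt": "Geppetto carved a puppet from wood.",
        "data/frog.txt": "A frog lived in a well.",
    }
    for source, paragraph in retrieve(build_index(docs), "Which frog lived in a well?"):
        print(source, "->", paragraph)
```

Because every entry carries its source path, the answer can always report which file it came from, which is what the "Sources" lines in the tests further down rely on.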
Requirements
pip install -r requirements.txt
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY
Usage
Train
# Retrain from scratch
python src/train.py
# Insert a file
python src/put.py --branch=temp --hackx=10 tests/data/fresh_prince.txt data
# Remove a file
python src/delete.py --branch=temp data/fresh_prince.txt
Run the app
gradio app.py
Development story
First steps: like any good POC, I went quick and dirty to get something working, using langchain and a small gradio app.
Next, I saw I needed to split the data into different files so I could test whether the model chooses the right source.
After a short exploration I noticed that, unfortunately, the data is just a bunch of stories combined with a different structure per collection, and the collections are simply appended to each other. I was about to make a copy of the file somewhere, so I could always redo this part without losing the original data, when I realised that's the "old paranoid way". Instead, I created a branch and worked directly on the data as if I had nothing to worry about. XetPower!
I also noticed that the number of collections was small enough to handle manually.
Using regex and my new sense of confidence, I started hacking the file: writing new files, using add-commit-push to
save progress, and using checkout as my "undo" across multiple files.
This was a great experience because I knew that, even with 20MB of text, this was just ad-hoc work and there was no point
in even writing scripts for this manual work.
At the end, I built the new index and committed it, together with the new training script, to the branch, so the model is
always up to date with its correct data and completely reproducible.
Guess what: when I saved the new model, which had 53k vectors, I got 73% deduplication of the data. Just imagine the
savings on a huge model with millions of documents!
Testing in the app:
Question: "Who was Pinocchio's father?"
Answer: "Pinocchio's father is Geppetto."
Sources: "data/the_blue_fairy_book.txt"
Question: "Did the emperor had clothes at the end of the story?"
Answer: "At the end of the story, the emperor did not have any clothes."
Sources: "data/the_blue_fairy_book.txt, data/andersen_fairy_tales.txt"
Optional: Try the query: "What is the name of the princess in the story of the frog prince?"
Looks legit!
Next I wanted to try a different openai model. I made a long-running branch per model, e.g. "openai/text-davinci-003",
where I can store the vector index and all the new code that relates only to that model.
The alternative was to generalize my app, the training, and, if I deployed the model too, the rest of the CI/CD to
handle these two different models and structures. This way is much cleaner!
If next time I want to follow up with a model with a similar design to huggingface, I can create a new branch from this
one, and if I want to try a different openai model, I'll branch from there!
How about adding a new file to our index?
In this case it is important to keep both the new raw file (as a reference, and for the case of deletion) and its
embedded vectors in the index at the same time. We can automate this so that every time we add a new file we also add it to
the index, and when we delete a file we delete it from the index too.
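A minimal sketch of that "raw file and index move together" idea, under loud assumptions: `ToyIndex`, `put_file` and `delete_file` are hypothetical names, the index is a plain dict instead of a real vector store, and the commit-to-a-branch step that `src/put.py` and `src/delete.py` take a `--branch` flag for is left out entirely.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToyIndex:
    """Hypothetical stand-in for the vector index: source path -> list of text chunks."""
    chunks_by_source: dict = field(default_factory=dict)

    def add(self, source: str, text: str, chunk_size: int = 200) -> int:
        # A real index would embed each chunk; here we only store the raw chunks.
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        self.chunks_by_source[source] = chunks
        return len(chunks)

    def remove(self, source: str) -> int:
        # Drop every chunk that came from this source; return how many were dropped.
        return len(self.chunks_by_source.pop(source, []))


def put_file(index: ToyIndex, src: Path, data_dir: Path) -> Path:
    """Copy the raw file into data_dir and add it to the index in the same step."""
    data_dir.mkdir(parents=True, exist_ok=True)
    dest = data_dir / src.name
    text = src.read_text(encoding="utf-8")
    dest.write_text(text, encoding="utf-8")
    index.add(str(dest), text)
    return dest


def delete_file(index: ToyIndex, path: Path) -> None:
    """Remove the file's chunks from the index, then delete the raw file."""
    index.remove(str(path))
    path.unlink(missing_ok=True)
```

Keying the index by source path is what makes deletion cheap: everything that came from one file can be dropped without touching the rest.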
Question: "Where was the prince of Bel-Air born and raised?"
Don't know?
Let's add tests/data/fresh_prince.txt
and now it does!
Let's remove it and see what happens:
Yeah, it forgot!
Let's update other branches... Now let's check it out and see that the model is updated!
mkdir tmp \
&& cd tmp
git xet clone --no-smudge https://xethub.com/xdssio/langchain_demo.git \
&& cd langchain_demo \
&& git checkout temp \
&& git xet checkout -- model data docs \
&& gradio app.py
File is here, and the question is answered! This way, any inference server that has the repo mounted automatically gets updated with the new data and index.
# clean up
python src/delete.py --branch=temp data/fresh_prince.txt