Plug and Play RAG LLM

Author

Jesus Gonzalez

Published

April 23, 2024

Here’s an image GPT tried really hard to create!

This document demonstrates a quick and easy RAG implementation using OpenAI's API and the LangChain framework. I tried to keep it simple enough to copy and paste, for anyone who wants to quickly query their PDFs or any other large corpus of text.

The Model

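import os
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # assumes the key is set in the environment
model = ChatOpenAI(api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")
parser = StrOutputParser()  # converts the chat message into a plain string
chain = model | parser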
chain.invoke("hi, are you there gpt?")
"Hello! Yes, I'm here. How can I help you today?"

Plug & Play Section

from pdfminer.high_level import extract_text  # assumed source of extract_text (pdfminer.six)

def pdf_to_text(pdf_path, output_txt_path):
    # Extract all text from the PDF and write it to a UTF-8 text file.
    text = extract_text(pdf_path)
    with open(output_txt_path, 'w', encoding='utf-8') as file:
        file.write(text)

# pdf_path = '../../files/Computer Age Statistical Inference Book.pdf'
# output_txt_path = '../../files/rag-llm/Computer Age Statistical Inference Book.txt'
# pdf_to_text(pdf_path, output_txt_path)

I am showing the short script above so that anyone may replicate this in their environment. I used a course textbook as a test case.
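Text pulled out of a PDF usually keeps the page layout: words hyphenated across lines (the retrieved chunks further down contain "unfortu-\nnately") and hard line breaks in the middle of paragraphs. Here is a minimal cleanup sketch that could run on the text before it is written out. `clean_extracted_text` is a hypothetical helper of mine, not part of pdfminer or LangChain, and it is deliberately naive: it will also glue together genuinely hyphenated words that happen to break across lines, such as "large-\nsample".

```python
import re

def clean_extracted_text(text: str) -> str:
    """Light cleanup for PDF-extracted text (hypothetical helper, not from any library)."""
    # Rejoin words split across lines with a hyphen: "unfortu-\nnately" -> "unfortunately".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single newlines inside a paragraph with spaces; keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = (
    "Pivotality, unfortu-\nnately, is unavailable in most\n"
    "statistical situations.\n\n2.2 Frequentist Optimality"
)
cleaned = clean_extracted_text(sample)
print(cleaned)
```

If you use something like this, call it on `text` inside `pdf_to_text` before `file.write(text)`.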

Chunking Context Due to Token Limitations

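from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../../files/rag-llm/Computer Age Statistical Inference Book.txt")
text_documents = loader.load()
# Split the book into chunks of up to 1000 characters, sharing 20 characters between neighbors.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(text_documents)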
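To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: as I understand it, the LangChain splitter tries to break at paragraph, line, and word boundaries first, while this version slides a character window. `chunk_text` is a hypothetical name used only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> list[str]:
    """Split text into character windows that share chunk_overlap characters (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

# Small numbers so the overlap is easy to see.
demo_chunks = chunk_text("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=2)
print(demo_chunks)
```

Each chunk begins with the last two characters of the previous one. The overlap keeps a sentence that straddles a boundary from being cut off without any shared context.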

Embedding the Context for Improved Performance

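from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_openai import OpenAIEmbeddings

# Embed every chunk and keep the vectors in an in-memory store.
embeddings = OpenAIEmbeddings()
vector_store = DocArrayInMemorySearch.from_documents(documents, embeddings)

# The retriever returns the chunks whose embeddings are closest to the query's embedding.
retriever = vector_store.as_retriever()
retriever.invoke("Who or what do frequentists criticize?")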
[Document(page_content='What might be called the strong definition of frequentism insists on exact\nfrequentist correctness under experimental repetitions. Pivotality, unfortu-\nnately, is unavailable in most statistical situations. Our looser definition\nof frequentism, supplemented by devices such as those above,7 presents a\nmore realistic picture of actual frequentist practice.\n\n2.2 Frequentist Optimality\n\nThe popularity of frequentist methods reflects their relatively modest math-\nematical modeling assumptions: only a probability model F (more exactly\na family of probabilities, Chapter 3) and an algorithm of choice t.x/. This\nflexibility is also a defect in that the principle of frequentist correctness\ndoesn’t help with the choice of algorithm. Should we use the sample mean\nto estimate the location of the gfr distribution? Maybe the 25% Win-\nsorized mean would be better, as Table 2.1 suggests.', metadata={'source': '../../files/rag-llm/Computer Age Statistical Inference Book.txt'}),
 Document(page_content='Frequentism cannot claim to be a seamless philosophy of statistical in-\nference. Paradoxes and contradictions abound within its borders, as will\nbe shown in the next chapter. That being said, frequentist methods have\na natural appeal to working scientists, an impressive history of success-\nful application, and, as our list of five “devices” suggests, the capacity to\nencourage clever methodology. The story that follows is not one of aban-\ndonment of frequentist thinking, but rather a broadening of connections\nwith other methods.\n\n2.3 Notes and Details', metadata={'source': '../../files/rag-llm/Computer Age Statistical Inference Book.txt'}),
 Document(page_content='Despite its simplicity, or perhaps because of it, objective Bayes procedures\nare vulnerable to criticism from both ends of the statistical spectrum. From\nthe subjectivist point of view, objective Bayes is only partially Bayesian: it\nemploys Bayes’ theorem but without doing the hard work of determining a\nconvincing prior distribution. This introduces frequentist elements into its\npractice—clearly so in the case of Jeffreys’ prior—along with frequentist\nincoherencies.\n\nFor the frequentist, objective Bayes analysis can seem dangerously un-\ntethered from the usual standards of accuracy, having only tenuous large-\nsample claims to legitimacy. This is more than a theoretical objection. The\npractical advantages claimed for Bayesian methods depend crucially on the\nfine structure of the prior. Can we safely ignore stopping rules or selective\ninference (e.g., choosing the largest of many estimated parameters for spe-\ncial attention) for a prior not based on some form of genuine experience?', metadata={'source': '../../files/rag-llm/Computer Age Statistical Inference Book.txt'}),
 Document(page_content='Frequentist statistics has the advantage of being applicable to any algo-\nrithmic procedure, for instance to our Cp/OLS estimator. This has great\nappeal in an era of enormous data sets and fast computation. The draw-\nback, compared with Bayesian statistics, is that we have no guarantee that\nour chosen algorithm is best in any way. Classical statistics developed a\ntheory of best for a catalog of comparatively simple estimation and testing\nproblems. In this sense, modern inferential theory has not yet caught up\nwith modern problems such as data-based model selection, though tech-\nniques such as model averaging (e.g., bagging) suggest promising steps\nforward.\n\n20.3 Selection Bias', metadata={'source': '../../files/rag-llm/Computer Age Statistical Inference Book.txt'})]

The PDF extraction is not great, but you can see the potential with cleaner text. Read through the chunks above: these are the passages the retriever considered most relevant to the question. This is what separates a ‘RAG’ that simply reads a file from one that passes the corpus through an embedding model and retrieves chunks by similarity.
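The retrieval step above comes down to "embed the query, embed the chunks, return the closest ones." Here is a toy sketch of that idea with no API calls. The `embed` function is a stand-in I made up that only counts words. Real embedding models such as the OpenAI one produce dense vectors that capture meaning, so treat this as an illustration of the ranking logic only.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]

toy_chunks = [
    "Frequentist methods have a natural appeal to working scientists.",
    "Objective Bayes procedures are vulnerable to criticism.",
    "Bootstrap resampling reuses observed data.",
]
top = retrieve("What is the appeal of frequentist methods?", toy_chunks, k=1)
print(top)
```

A vector store does the same ranking, only with learned embeddings and a faster search over many more chunks.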

Setup & Chaining

instructions = """
Answer the question based on the context below. Prior to finalizing your response, 
remember to clean the data and make sense of it. You are a pretrained LLM that understands
common language, so use your best judgement if the text is too messy to give a definitive answer. 
If you can't answer the question because the text is too messy,
reply "The text is too messy to answer this question". If you can't answer the question in general, reply "I don't know". 

Context: {context}

Question: {question}
"""
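from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

prompt = ChatPromptTemplate.from_template(instructions)
# The retriever fills {context}; the question is passed through unchanged to fill {question}.
setup = RunnableParallel(context=retriever, question=RunnablePassthrough())

chain = setup | prompt | model | parser
chain.invoke("Who or what do frequentists criticize?")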
'Frequentists criticize the objective Bayes procedures for being vulnerable to criticism from both subjectivist and frequentist perspectives.'

Looks like neither of them was right … transformers for the win!
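One detail worth knowing: as far as I can tell, the chain above hands the retriever's list of Document objects straight to the `{context}` slot, so the model sees their string form, metadata included. A possible refinement is to join only the chunk text before it reaches the prompt. The sketch below shows the idea without LangChain. `Doc` is a hypothetical stand-in for LangChain's Document, and `format_docs` and `build_prompt` are names I picked for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document so this runs without LangChain."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs) -> str:
    """Join only the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

TEMPLATE = "Context: {context}\n\nQuestion: {question}"

def build_prompt(question: str, docs) -> str:
    """Fill the template with cleaned-up context and the user's question."""
    return TEMPLATE.format(context=format_docs(docs), question=question)

docs = [
    Doc("Frequentism cannot claim to be a seamless philosophy.", {"source": "book.txt"}),
    Doc("Objective Bayes procedures are vulnerable to criticism.", {"source": "book.txt"}),
]
prompt_text = build_prompt("Who or what do frequentists criticize?", docs)
print(prompt_text)
```

In the LangChain version, this would likely mean replacing `context=retriever` with `context=retriever | format_docs`, with `format_docs` defined as above.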

Exploring Embeddings

Exploring embeddings using gensim.

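from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('../../files/rag-llm/Computer Age Statistical Inference Book.txt')
# Note: this rebinds `model`; the chain built earlier keeps its reference to the ChatOpenAI model.
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4)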

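Setting up the model: LineSentence treats each line of the text file as one sentence and splits it into tokens on whitespace, so Word2Vec is trained on the book line by line rather than on true sentences.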

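from sklearn.decomposition import PCA

# Take the 30 words whose vectors are closest to "statistics".
words = [word for word, _ in model.wv.most_similar('statistics', topn=30)]
word_vectors = [model.wv[word] for word in words]
# Project the 100-dimensional vectors down to 2 dimensions for plotting.
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(word_vectors)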

A two-dimensional view of the 30 words most similar to “statistics,” based on their embeddings.

A three-dimensional view of the same query.

Cosine Similarity

Here you can see the cosine similarity between any pair of words you choose. Given the context of the book and the question I asked the GPT-powered chain above, I chose the following.

words = ['frequentist', 'Bayes', 'objective', 'subjective'] 
Cosine similarity between frequentist and Bayes: 0.995741069316864
Cosine similarity between objective and subjective: 0.9796308875083923

Both pairs score close to 1, so the model places these words near each other in the embedding space, which aligns with how the book discusses them together.
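For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths: 1 means they point in the same direction, 0 means they are orthogonal. The numbers above were presumably produced with something like gensim's `model.wv.similarity('frequentist', 'Bayes')`, but that call is not shown, so treat it as my assumption. Here is a minimal plain-Python sketch of the same computation.

```python
import math

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score about 1, whatever their lengths.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because only direction matters, the measure ignores how large the word vectors are, which makes it a common choice for comparing embeddings.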