Author: Aarushi Nema
Date: 2024
Tags: LangChain, LLM, Introduction, Fundamentals, AI, NLP, Machine Learning, Deep Learning, Token Distribution, Finetuning, Prompting
Categories: Learning, AI/ML, LangChain
Abstract: Fundamental concepts of LangChain - a framework that revolves around invoking an LLM for specific inputs. This notebook covers the basics of LangChain including LLMs, chains, memory, vector stores, and agents.

LangChain101

Learning Objectives

  • Understand the fundamental concept of LangChain
  • Learn about LLM vs Chat Model differences
  • Implement basic LangChain components (LLMs, Chains, Memory)
  • Work with Deep Lake VectorStore for embeddings
  • Create agents with tools and retrieval systems

Fundamental concept of LangChain

LangChain revolves around invoking an LLM for a specific input.

Aspect    | LLM (completion model)                   | Chat Model
----------|------------------------------------------|------------------------------------------------
Core      | General next-token predictor             | Still an LLM, but fine-tuned for dialogue
Interface | Single text prompt → single text output  | List of role-based messages → structured reply
API       | /v1/completions                          | /v1/chat/completions
Examples  | text-davinci-003, gpt-3.5-turbo-instruct | gpt-3.5-turbo, gpt-4, gpt-4o
Best at   | Plain text generation, continuation      | Conversations, instructions, multi-turn tasks
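
To make the Interface row concrete, here is a minimal sketch of the two request bodies. The question text and the system message are made up for illustration, and nothing is sent over the network; the dictionaries only show the shape each endpoint expects.

```python
import json

# Hypothetical question used only for illustration.
question = "Suggest a name for a company that makes eco-friendly water bottles."

# Completion-style body (/v1/completions): one free-form text prompt.
completion_request = {
    "model": "gpt-3.5-turbo-instruct",
    "prompt": question,
    "temperature": 0.9,
}

# Chat-style body (/v1/chat/completions): a list of role-tagged messages.
chat_request = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "You are a helpful naming assistant."},
        {"role": "user", "content": question},
    ],
    "temperature": 0.9,
}

print(json.dumps(completion_request, indent=2))
print(json.dumps(chat_request, indent=2))
```
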
# Import the LLM Wrapper
from langchain.llms import OpenAI
from dotenv import load_dotenv

load_dotenv()
True

The LLM

Temperature controls how random an LLM's output is. In the OpenAI API it ranges from 0 to 2, although values between 0 and 1 are the most common: values near 0 give more stable, high-probability results, while higher values give more varied but less consistent results. For creative tasks, a temperature between 0.70 and 0.90 offers a balance of reliability and creativity.
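
One way to build intuition for this is the common softmax-with-temperature formulation, sketched below. It is an illustration under assumptions rather than a description of how any particular provider samples: the logits are made-up scores for three candidate tokens, and treating temperature 0 as "always pick the top token" is a convention chosen here to avoid dividing by zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 puts all probability on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.0, 0.2, 0.9, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it across the alternatives.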

# gpt-3.5-turbo-instruct - for plain text continuation via the /v1/completions endpoint. Best for single-shot tasks, text continuation
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0.9)
text = "Suggest a personalized workout routine for someone looking to improve cardiovascular endurance and prefers outdoor activities."
print(llm(text))


Monday:
Warm up: 10 minute jog
Circuit 1:
- 20 jumping jacks
- 10 push-ups
- 20 mountain climbers
- 10 burpees
- 20 high knees
- 20 bicycle crunches
Repeat circuit 3 times with minimal rest in between exercises.

Tuesday:
30 minute outdoor run or bike ride. Alternate between 3 minutes at a moderate pace and 1 minute at a faster pace for a total of 30 minutes.

Wednesday:
Warm up: 5 minute brisk walk
Circuit 2:
- 20 step-ups (use a bench or stairs)
- 10 tricep dips (use a park bench)
- 20 squats
- 10 lunges (each leg)
- 20 Russian twists (use a water bottle or small weight)
- 1 minute plank
Repeat circuit 3 times with minimal rest in between exercises.

Thursday:
45 minute hike or nature walk with some incline intervals. Alternate between walking at a steady pace and increasing the incline for short bursts.

Friday:
Warm up: 5 minute jog
Circuit 3:
- 20 jumping lunges
- 10 push-ups with side plank rotation (5 each side)
-

The Chains

In LangChain, a chain is a wrapper around multiple individual components, combined in a specific sequence. There are multiple types of chains; the most common is the LLMChain, which consists of (1) a PromptTemplate, (2) a model (LLM or chat model), and (3) an optional output parser.

from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0.9)

prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?"
)

chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run("eco-friendly water bottle"))


GreenHydro Bottles
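
The three parts of a chain (template, model, optional parser) can also be sketched without any API call. This is not LangChain's implementation: `fake_llm`, `strip_parser`, and `make_chain` are hypothetical stand-ins, and the canned reply simply mimics the output shown above.

```python
def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned completion."""
    return "\n\nGreenHydro Bottles\n"

def strip_parser(text: str) -> str:
    """Minimal output parser: trim surrounding whitespace."""
    return text.strip()

def make_chain(template: str, llm, parser=None):
    """Compose template -> model -> optional parser into one callable."""
    def run(**variables) -> str:
        prompt = template.format(**variables)  # fill in the template
        output = llm(prompt)                   # call the model
        return parser(output) if parser else output
    return run

name_chain = make_chain(
    "What is a good name for a company that makes {product}?",
    fake_llm,
    strip_parser,
)
print(name_chain(product="eco-friendly water bottle"))
```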

The Memory

In LangChain, memory is how we store and manage the conversation history between the user and the AI. It preserves context and coherency throughout the interaction.

Types of chains in LangChain:
(1) LLMChain = stateless, single prompt → response.
(2) ConversationChain = stateful, keeps history in memory and reuses it across turns.

from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory # acts as a buffer to store the conversation history

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
conversation = ConversationChain(
    llm=llm,
    verbose=True, # will print out extra logs about what’s happening inside the chain
    memory=ConversationBufferMemory()
)

conversation.predict(input="Tell me about yourself.")
conversation.predict(input="What can you do?")
conversation.predict(input="How can you help me with data analysis?")

# print(conversation)

> Entering new ConversationChain chain...

Prompt after formatting:

The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.



Current conversation:



Human: Tell me about yourself.

AI:



> Finished chain.





> Entering new ConversationChain chain...

Prompt after formatting:

The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.



Current conversation:

Human: Tell me about yourself.

AI:  Well, I am an artificial intelligence program designed and created by a team of programmers. I am constantly learning and improving my abilities through algorithms and data analysis. My main purpose is to assist and provide information to users like yourself. I am currently housed in a server located in a data center, but I can also be accessed through various devices such as computers, smartphones, and smart speakers. Is there anything specific you would like to know about me?

Human: What can you do?

AI:



> Finished chain.





> Entering new ConversationChain chain...

Prompt after formatting:

The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.



Current conversation:

Human: Tell me about yourself.

AI:  Well, I am an artificial intelligence program designed and created by a team of programmers. I am constantly learning and improving my abilities through algorithms and data analysis. My main purpose is to assist and provide information to users like yourself. I am currently housed in a server located in a data center, but I can also be accessed through various devices such as computers, smartphones, and smart speakers. Is there anything specific you would like to know about me?

Human: What can you do?

AI:  I have a wide range of capabilities, but some of my main functions include answering questions, performing tasks, and providing recommendations based on data analysis. I can also understand and respond to natural language, making it easier for users to communicate with me. Additionally, I am constantly learning and updating my knowledge base, so my abilities are constantly expanding. Is there something specific you would like me to do for you?

Human: How can you help me with data analysis?

AI:



> Finished chain.
' I can assist with data analysis by quickly sorting through large amounts of data and identifying patterns and trends. I can also provide visualizations and reports to help you better understand the data. Additionally, I can make predictions and recommendations based on the data, which can be useful for decision making. Is there a specific dataset or analysis you would like me to help with?'
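
The verbose log shows the full history being re-sent with every turn. A minimal sketch of that pattern follows, assuming a plain list as the buffer. `BufferMemory`, `echo_llm`, and `predict` are hypothetical names, not LangChain classes, and the stand-in model only reports how many earlier turns it can see in its prompt.

```python
class BufferMemory:
    """Hypothetical conversation buffer: stores every turn verbatim."""

    def __init__(self):
        self.turns = []

    def save(self, human, ai):
        self.turns.append((human, ai))

    def as_text(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)

def echo_llm(prompt):
    """Stand-in model: counts the earlier turns visible in the prompt."""
    seen = prompt.count("\nAI:") - 1  # the last "AI:" is the slot awaiting an answer
    return f" I can see {seen} earlier turn(s)."

def predict(memory, llm, user_input):
    """Build the prompt from the stored history, call the model, store the new turn."""
    prompt = (
        "The following is a friendly conversation between a human and an AI.\n\n"
        "Current conversation:\n"
        f"{memory.as_text()}\n"
        f"Human: {user_input}\nAI:"
    )
    reply = llm(prompt)
    memory.save(user_input, reply.strip())
    return reply

memory = BufferMemory()
for question in ["Tell me about yourself.", "What can you do?"]:
    print(predict(memory, echo_llm, question))
```

Because the whole buffer is replayed each time, the prompt grows with every turn, which matches the three progressively longer prompts in the log.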

Deep Lake VectorStore

Deep Lake provides storage for embeddings and their corresponding metadata in the context of LLM apps. It lets us search and retrieve data based on these embeddings and their metadata. It also integrates with LangChain.

Deep Lake is multimodal: it can store different types of data (text, images, audio, video, etc.) together with their vector representations. It is serverless, so we can create and manage cloud datasets without running our own infrastructure. For loading data out of the data lake, Deep Lake also provides a data loader.
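
Conceptually, searching over embeddings means ranking stored vectors by their similarity to a query vector. The sketch below uses an in-memory list and cosine similarity as an assumption; it does not reflect Deep Lake's actual index or distance metric, and the 3-dimensional vectors are made up (real text-embedding-ada-002 vectors have 1536 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy store: each row keeps the text, a made-up embedding, and metadata.
store = [
    {"text": "Napoleon Bonaparte was born in 15 August 1769",
     "embedding": [0.9, 0.1, 0.0], "metadata": {"source": "toy"}},
    {"text": "Louis XIV was born in 5 September 1638",
     "embedding": [0.1, 0.9, 0.0], "metadata": {"source": "toy"}},
]

def search(query_embedding, k=1):
    """Return the texts of the k rows most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda row: cosine_similarity(query_embedding, row["embedding"]),
        reverse=True,
    )
    return [row["text"] for row in ranked[:k]]

# Pretend this is the embedding of "When was Napoleon born?"
query_embedding = [0.8, 0.2, 0.1]
print(search(query_embedding))
```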

from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

load_dotenv()
True
# instantiate the LLM and embeddings models
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create our docs
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638"
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

# create Deep Lake dataset
my_activeloop_org_id = "aarushinema" 
my_activeloop_dataset_name = "langchain_course_from_zero_to_hero"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)
/var/folders/1w/hhs18_1x18l2jwpqfzks4jkw0000gn/T/ipykernel_30187/2361674661.py:2: LangChainDeprecationWarning: The class `OpenAI` was deprecated in LangChain 0.0.10 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-openai package and should be used instead. To use it run `pip install -U :class:`~langchain-openai` and import as `from :class:`~langchain_openai import OpenAI``.
  llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
/var/folders/1w/hhs18_1x18l2jwpqfzks4jkw0000gn/T/ipykernel_30187/2361674661.py:3: LangChainDeprecationWarning: The class `OpenAIEmbeddings` was deprecated in LangChain 0.0.9 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-openai package and should be used instead. To use it run `pip install -U :class:`~langchain-openai` and import as `from :class:`~langchain_openai import OpenAIEmbeddings``.
  embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
/var/folders/1w/hhs18_1x18l2jwpqfzks4jkw0000gn/T/ipykernel_30187/2361674661.py:18: LangChainDeprecationWarning: This class is deprecated and will be removed in a future version. You can swap to using the `DeeplakeVectorStore` implementation in `langchain-deeplake`. Please do not submit further PRs to this class.See <https://github.com/activeloopai/langchain-deeplake>
  db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
Using embedding function is deprecated and will be removed in the future. Please use embedding instead.
Your Deep Lake dataset has been successfully created!
Creating 2 embeddings in 1 batches of size 2:: 100%|██████████| 1/1 [00:23<00:00, 23.45s/it]
Dataset(path='hub://aarushinema/langchain_course_from_zero_to_hero', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
   text       text      (2, 1)      str     None   
 metadata     json      (2, 1)      str     None   
 embedding  embedding  (2, 1536)  float32   None   
    id        text      (2, 1)      str     None   
['e03babf2-977f-11f0-9104-5e86525bdce2',
 'e03badaa-977f-11f0-9104-5e86525bdce2']
# build a RetrievalQA chain on top of the vector store
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever()
)
# create an agent that uses the RetrievalQA chain as a tool
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType

tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Use this to answer questions"
    )
]

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
/var/folders/1w/hhs18_1x18l2jwpqfzks4jkw0000gn/T/ipykernel_30187/209183474.py:13: LangChainDeprecationWarning: LangChain agents will continue to be supported, but it is recommended for new use cases to be built with LangGraph. LangGraph offers a more flexible and full-featured framework for building agents, including support for tool-calling, persistence of state, and human-in-the-loop workflows. For details, refer to the `LangGraph documentation <https://langchain-ai.github.io/langgraph/>`_ as well as guides for `Migrating from AgentExecutor <https://python.langchain.com/docs/how_to/migrate_agent/>`_ and LangGraph's `Pre-built ReAct agent <https://langchain-ai.github.io/langgraph/how-tos/create-react-agent/>`_.
  agent = initialize_agent(
response = agent.run("When was Napoleone born?")
print(response)
/var/folders/1w/hhs18_1x18l2jwpqfzks4jkw0000gn/T/ipykernel_30187/1676272850.py:1: LangChainDeprecationWarning: The method `Chain.run` was deprecated in langchain 0.1.0 and will be removed in 1.0. Use :meth:`~invoke` instead.
  response = agent.run("When was Napoleone born?")

> Entering new AgentExecutor chain...

 I should use the Retrieval QA System to answer this question

Action: Retrieval QA System

Action Input: "When was Napoleone born?"

Observation: 

Napoleon Bonaparte was born in 15 August 1769.

Thought: I now know the final answer

Final Answer: Napoleon Bonaparte was born in 15 August 1769.



> Finished chain.

Napoleon Bonaparte was born in 15 August 1769.
# reloading an existing vector store and adding more data
# load the existing Deep Lake dataset and specify the embedding function
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# create new documents
texts = [
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963"
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

# add documents to our Deep Lake dataset
db.add_documents(docs)
Using embedding function is deprecated and will be removed in the future. Please use embedding instead.
Deep Lake Dataset in hub://aarushinema/langchain_course_from_zero_to_hero already exists, loading from the storage
Creating 2 embeddings in 1 batches of size 2:: 100%|██████████| 1/1 [00:24<00:00, 24.86s/it]
Dataset(path='hub://aarushinema/langchain_course_from_zero_to_hero', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (4, 1536)  float32   None   
    id        text      (4, 1)      str     None   
 metadata     json      (4, 1)      str     None   
   text       text      (4, 1)      str     None   
['58fbf5dc-9781-11f0-9104-5e86525bdce2',
 '58fbf73a-9781-11f0-9104-5e86525bdce2']
# instantiate the wrapper class for GPT3
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)

# create a RetrievalQA chain that uses the db as its retriever
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever()
)

# instantiate a tool that uses the RetrievalQA chain
tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Use this to answer questions"
    )
]

# create an agent that uses the tool
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
response = agent.run("When was Michael Jordan born?")
print(response)

> Entering new AgentExecutor chain...

 I should use a retrieval QA system to answer this question

Action: Retrieval QA System

Action Input: "When was Michael Jordan born?"

Observation:  Michael Jordan was born on 17 February 1963.

Thought: I now know the final answer

Final Answer: Michael Jordan was born on 17 February 1963.



> Finished chain.

Michael Jordan was born on 17 February 1963.

Agents in LangChain

In LangChain, agents are high-level components that use large language models (LLMs) to determine which actions to take and in what order. An action is either using a tool and observing its output, or returning a response to the user. Tools are functions that perform specific duties, such as Google Search, database lookups, or a Python REPL.

Agents involve an LLM deciding which Action to take, taking that Action, seeing an Observation, and repeating until done. There are different types of agents in LangChain:

1. The zero-shot-react-description agent uses the ReAct framework to decide which tool to employ based purely on the tool's description. It requires a description of each tool.
2. The react-docstore agent engages with a docstore through the ReAct framework. It needs two tools: a Search tool and a Lookup tool. The Search tool finds a document, and the Lookup tool searches for a term in the most recently found document.
3. The self-ask-with-search agent employs a single tool named Intermediate Answer, which is capable of looking up factual responses to queries. It corresponds to the original self-ask with search paper, where a Google search API was provided as the tool.
4. The conversational-react-description agent is designed for conversational situations. It uses the ReAct framework to select a tool and uses memory to remember past conversation interactions.
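
The Action, Observation, repeat cycle can be sketched as a small loop. This is a simplified illustration, not LangChain's AgentExecutor: `scripted_llm` is a hypothetical stand-in that follows a fixed script, `lookup_birthdate` is a stand-in tool with one hard-coded fact, and the text format is modeled on the Thought/Action/Observation traces shown earlier.

```python
import re

def lookup_birthdate(query):
    """Stand-in tool: a tiny hard-coded knowledge base."""
    facts = {"napoleon": "Napoleon Bonaparte was born on 15 August 1769."}
    for key, fact in facts.items():
        if key in query.lower():
            return fact
    return "No matching fact found."

TOOLS = {"Retrieval QA System": lookup_birthdate}

def scripted_llm(scratchpad):
    """Stand-in model: requests the tool first, then answers from the observation."""
    if "Observation:" not in scratchpad:
        return ("Thought: I should look this up\n"
                "Action: Retrieval QA System\n"
                "Action Input: When was Napoleon born?")
    observation = scratchpad.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I now know the final answer\nFinal Answer: {observation}"

def run_agent(question, llm, tools, max_steps=5):
    """Loop: ask the model, run the chosen tool, feed the observation back."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(scratchpad)
        scratchpad += step + "\n"
        final = re.search(r"Final Answer:\s*(.*)", step)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(.*)", step).group(1).strip()
        action_input = re.search(r"Action Input:\s*(.*)", step).group(1).strip()
        observation = tools[action](action_input)
        scratchpad += f"Observation: {observation}\n"
    return "Stopped: step limit reached."

print(run_agent("When was Napoleon born?", scripted_llm, TOOLS))
```

The step limit is a design choice in this sketch: it keeps the loop from running forever if the model never produces a final answer.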

from langchain_core.tools import Tool
from langchain_google_community import GoogleSearchAPIWrapper

from dotenv import load_dotenv

load_dotenv()
True
search = GoogleSearchAPIWrapper()

# a single Tool that wraps the Google search function
tool = Tool(
    name="Google Search",
    description="Use this to search the web",
    func=search.run
)
tool.run("Obama's first name?")
'A member of the Democratic Party, he was the first African American president. Obama previously served as a U.S. senator representing Illinois from 2005 to 2008\xa0... Child\'s First Name (Type or print) lb. Middle Name. BARACK. HUSSEIN. CERTIFICATE OF LIVE BIRTH. FILE 151. NUMBER le. DEPARTMENT OF HEALTH. 61. 10641. Last Name. Jan 28, 2021 ... Obama\'s name (particularly his middle name Hussein) was the object of xenophobic innuendo questioning his loyalty (especially in the context of\xa0... Apr 12, 2017 ... Why is Barack Obama\'s full name "Barack Hussein Obama"? It\'s what his parents named him … President Barack Obama, First Lady Michelle Obama, and their daughters, Malia, left, and Sasha, right, sit for a family portrait in the Oval Office,\xa0... Barack Obama ; Barack Hussein Obama II. (1961-08-04) August 4, 1961 (age 64) Honolulu, Hawaii, U.S. · Democratic · Michelle Robinson. \u200b. ( m. · 1992)\u200b · Malia\xa0... Barack Hussein Obama II was born August 4, 1961, in Honolulu, Hawaii, to parents Barack H. Obama, Sr., and Stanley Ann Dunham. Obama Barack H. Obama. Photo: Pete Souza, Obama-Biden Transition Project, licensed by Attribution Share Alike 3.0. Full name: Barack Hussein Obama Born: 4\xa0... it specifically asks for "Full former name(s). Obama put “None”, when in fact he went by the name Barry Soetoro, and Barry Obama. It is further believed\xa0... Aug 16, 2024 ... ... Obama, the former president and first lady, endorsed her, the campaign promoted the video as, “The Obamas call Kamala.” And when Harris and\xa0...'

We can use the "k" parameter to set the number of results returned.

search = GoogleSearchAPIWrapper(k=1)

tool = Tool(
    name="I'm feeling lucky",
    description="Search Google and return the first result",
    func=search.run
)
tool.run("python")
'Python is a programming language that lets you work quickly and integrate systems more effectively. Learn More'
search = GoogleSearchAPIWrapper()


def top5_results(query):
    return search.results(query, 5)


tool = Tool(
    name="Google Search Snippets",
    description="Search Google for recent results.",
    func=top5_results,
)
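
Unlike `search.run`, which returns one concatenated string, `search.results` returns structured results. The sketch below assumes each result is a dictionary with `title`, `link`, and `snippet` keys, which is how the wrapper's documentation describes them; the sample data is hypothetical (the first snippet is borrowed from the "python" query above, the second entry is a placeholder), and `format_results` is an illustrative helper, not part of LangChain.

```python
# Hypothetical sample shaped like the assumed output of search.results(query, 5).
sample_results = [
    {
        "title": "Welcome to Python.org",
        "link": "https://www.python.org/",
        "snippet": "Python is a programming language that lets you work quickly "
                   "and integrate systems more effectively.",
    },
    {
        "title": "Example result",
        "link": "https://example.com/",
        "snippet": "Placeholder snippet for illustration.",
    },
]

def format_results(results):
    """Render result dictionaries as a numbered, readable text block."""
    lines = []
    for rank, item in enumerate(results, start=1):
        # .get() keeps the helper safe if a key is missing from a result.
        lines.append(f"{rank}. {item.get('title', '')} ({item.get('link', '')})")
        lines.append(f"   {item.get('snippet', '')}")
    return "\n".join(lines)

print(format_results(sample_results))
```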