Using Amazon SageMaker Studio to Run RAG-Enabled Question Answering

Created: Jun 19, 2024
Tags: RAG, SageMaker
A guide to using Amazon SageMaker Studio to run RAG-enabled question answering.

💡
Prerequisites:
1. A valid AWS account with SageMaker Studio access in your AWS Region.
2. An embedding model and an LLM already deployed, with their endpoints available.
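The second prerequisite can be checked before opening the notebook. The following is a minimal sketch, assuming the boto3 SageMaker client's describe_endpoint call and its EndpointStatus field; the helper names and the stub client are hypothetical and exist only so the example runs without AWS credentials.

```python
def endpoint_status(sagemaker_client, endpoint_name):
    """Return the status string reported for one endpoint."""
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


def all_in_service(sagemaker_client, endpoint_names):
    """Map each endpoint name to True if its status is 'InService'."""
    return {
        name: endpoint_status(sagemaker_client, name) == "InService"
        for name in endpoint_names
    }


class StubSageMakerClient:
    """Hypothetical stand-in for boto3.client('sagemaker'), for illustration only."""

    def __init__(self, statuses):
        self._statuses = statuses

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": self._statuses[EndpointName]}


# Demo with made-up endpoint names and statuses.
stub = StubSageMakerClient({"my-embedding": "InService", "my-llm": "Creating"})
print(all_in_service(stub, ["my-embedding", "my-llm"]))

# With real credentials, the same helper would be called like this:
#   import boto3
#   print(all_in_service(boto3.client("sagemaker"), ["<embedding endpoint>", "<LLM endpoint>"]))
```

An endpoint that is still being created reports a status other than InService, so the notebook cells that call it would fail until deployment finishes.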
 
 

Using SageMaker Studio to run RAG-enabled question answering

  • Using the Region selector at the top of the console, switch to the Region in which SageMaker Studio is enabled.
  • Select the user profile and click Open Studio.
  • Once Studio opens, click the Studio Classic icon at the top left.
  • Next, click the Run icon to activate Studio Classic. This takes a few minutes.
  • Next, click the Open icon to open Studio Classic. A new tab opens.
  • Inside the Studio Classic tab, navigate to the File Browser tab, click New Folder, and create a folder named Medical-QnA-RAG.
  • Go to the Medical-QnA-RAG folder and create a new notebook named medical-question-answering-rag.ipynb.
  • A notebook environment pop-up appears. Select "ml.m5.xlarge" as the instance type and click Next.
  • The notebook kernel starts initializing and is ready in 15-30 seconds.
  • The grey pop-up disappears once the notebook is ready.
     

    Data preparation

    NOTICE: "This link leads to a Third-Party Dataset. If you want, you can perform your own independent assessment, and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the Third-Party Dataset."
    1. The full dataset can be downloaded from https://drive.google.com/file/d/1ImYUSLk9JbgHXOemfvyiDiirluZHPeQw/view?usp=sharing. You can read more about the dataset at https://github.com/jind11/MedQA#data.
    2. To speed up the upload for this lab, a smaller version of the dataset is already available at https://d2qrbbbqnxtln.cloudfront.net/Pathology_Robbins.txt
    Download the Pathology_Robbins.txt file and place it in the Medical-QnA-RAG folder.
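Before running the notebook cells, it can help to confirm that the file landed in the right folder. This is a small sketch using only the standard library; the helper name preview_text_file is hypothetical, and it assumes the notebook's working directory is the Medical-QnA-RAG folder.

```python
from pathlib import Path


def preview_text_file(path, n_chars=200):
    """Return (size_in_bytes, first n_chars characters) of a text file."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{p} not found; upload it to the notebook folder first")
    size = p.stat().st_size
    # errors="replace" keeps the preview readable even if a byte is not valid UTF-8.
    with p.open(encoding="utf-8", errors="replace") as fh:
        preview = fh.read(n_chars)
    return size, preview


target = Path("Pathology_Robbins.txt")
if target.is_file():
    size, preview = preview_text_file(target)
    print(f"{target} is present ({size} bytes). It starts with:\n{preview}")
else:
    print(f"{target} is missing from {Path.cwd()}")
```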
     
    Enter the following code in the notebook and execute the cells one by one.
        %pip install faiss-cpu==1.7.4 --quiet
        %pip install langchain==0.0.222 --quiet

        %%capture
        !pip install PyYAML
    The cells above install the prerequisite libraries used by the notebook. Because %%capture is a cell magic, it must be the first line of its own cell.
    Now import the other Python libraries and set up logging.
        import requests
        import logging
        import boto3
        import yaml
        import json

        logger = logging.getLogger('sagemaker')
        logger.setLevel(logging.DEBUG)
        logger.addHandler(logging.StreamHandler())

        logger.info(f'Using requests=={requests.__version__}')
        logger.info(f'Using pyyaml=={yaml.__version__}')
    Run the cell above to confirm that the necessary imports are in place and that no errors occur.
    Now set the names of the embedding and LLM endpoints that we captured in our previous blog.
        TEXT_EMBEDDING_MODEL_ENDPOINT_NAME = 'jumpstart-dft-hf-textembedding-gpt-j-6b-fp16'  # INSERT EMBEDDING ENDPOINT NAME IF DIFFERENT
        TEXT_GENERATION_MODEL_ENDPOINT_NAME = 'jumpstart-dft-hf-llm-mistral-7b-instruct'  # INSERT TEXT GENERATION ENDPOINT NAME IF DIFFERENT
        REGION_NAME = boto3.session.Session().region_name
    Encode passages (chunks) using JumpStart's GPT-J text embedding model. We use only 1 of the 20 textbooks in the dataset. It takes about 6 minutes to generate embeddings for one textbook (for example, Pathology). You can index more textbooks if you allow a sufficient time buffer for execution.
    To follow the RAG approach, this notebook uses the LangChain framework, which integrates with different services and tools and allows efficient building of patterns such as RAG.
        import numpy as np
        from langchain.text_splitter import RecursiveCharacterTextSplitter
        from langchain.document_loaders import DirectoryLoader, TextLoader

        loader = DirectoryLoader("./", glob="**/Pathology*.txt", loader_cls=TextLoader)
        documents = loader.load()

        # - in our testing Character split works better with this PDF data set
        text_splitter = RecursiveCharacterTextSplitter(
            # Set a really small chunk size, just to show.
            chunk_size=1000,
            chunk_overlap=100,
        )
        docs = text_splitter.split_documents(documents)
        print(docs[0])
    Once you run the above code, it prints one of the chunks from Pathology_Robbins.txt:
    page_content='Plasma Membrane: Protection and Nutrient Acquisition\n\nBiosynthetic Machinery: Endoplasmic Reticulum and Golgi Apparatus\n\nWaste Disposal: Lysosomes and Proteasomes\n\nModular Signaling Proteins, Hubs, and\n\nComponents of the Extracellular Matrix\n\nProliferation and the Cell Cycle\n\nPathology literally translates to the study of suffering (Greek pathos = suffering, logos = study); as applied to modern medicine, it is the study of disease. Virchow was certainly correct in asserting that disease originates at the cellular level, but we now realize that cellular disturbances arise from alterations in molecules (genes, proteins, and others) that influence the survival and behavior of cells. Thus, the foundation of modern pathology is understanding the cellular and molecular abnormalities that give rise to diseases. It is helpful to consider these abnormalities in the context of normal cellular structure and function, which is the theme of this introductory chapter.' metadata={'source': 'Pathology_Robbins.txt'}
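To make the roles of chunk_size and chunk_overlap concrete, here is a simplified sketch of fixed-size character chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "x" * 2500
print([len(chunk) for chunk in split_with_overlap(sample)])  # [1000, 1000, 700]
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.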
     
    The next cell reports how many documents were loaded and the average length, in characters, of the documents before and after splitting.
        def avg_doc_length(documents):
            return sum(len(doc.page_content) for doc in documents) // len(documents)

        avg_char_count_pre = avg_doc_length(documents)
        avg_char_count_post = avg_doc_length(docs)
        print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
        print(f'After the split we have {len(docs)} documents, compared with the original {len(documents)}.')
        print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')
    Next, we create the SagemakerEndpointEmbeddingsJumpStart and ContentHandler classes. They build on SagemakerEndpointEmbeddings and EmbeddingsContentHandler, classes from the langchain library used for handling embeddings via Amazon SageMaker.
        from typing import Dict, List

        from langchain.embeddings import SagemakerEndpointEmbeddings
        from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler


        class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
            def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
                """Compute doc embeddings using a SageMaker Inference Endpoint.

                Args:
                    texts: The list of texts to embed.
                    chunk_size: The chunk size defines how many input texts will
                        be grouped together as request. If None, will use the
                        chunk size specified by the class.

                Returns:
                    List of embeddings, one for each text.
                """
                results = []
                _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
                for i in range(0, len(texts), _chunk_size):
                    # Send one batch of texts to the endpoint per request.
                    response = self._embedding_func(texts[i : i + _chunk_size])
                    results.extend(response)
                return results


        class ContentHandler(EmbeddingsContentHandler):
            content_type = "application/json"
            accepts = "application/json"

            def transform_input(self, prompt: List[str], model_kwargs: Dict = {}) -> bytes:
                # Serialize the batch of texts into the JSON body the endpoint expects.
                input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
                return input_str.encode("utf-8")

            def transform_output(self, output: bytes) -> List[List[float]]:
                # Parse the endpoint response and return the list of embeddings.
                response_json = json.loads(output.read().decode("utf-8"))
                embeddings = response_json["embedding"]
                return embeddings


        content_handler = ContentHandler()

        sagemakerEndpointEmbeddingsJumpStart = SagemakerEndpointEmbeddingsJumpStart(
            endpoint_name=TEXT_EMBEDDING_MODEL_ENDPOINT_NAME,
            region_name=REGION_NAME,
            content_handler=content_handler,
        )
        print(docs[0].page_content)
    The SagemakerEndpointEmbeddingsJumpStart class extends the SagemakerEndpointEmbeddings class. Its embed_documents method takes a list of texts and an optional chunk size that controls how requests to the SageMaker endpoint are batched. The method:
    • Iterates over the texts in chunks and sends them to the SageMaker endpoint for embedding.
    • Collects the results from the endpoint and returns them as a list of embeddings.
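The batching loop described above can be tried without an endpoint. In this sketch, embed_in_batches and the stand-in embedding function are hypothetical; the stand-in returns a made-up two-number vector per text instead of calling SageMaker.

```python
def embed_in_batches(texts, embed_batch, batch_size=5):
    """Call embed_batch on consecutive slices of texts and flatten the results."""
    results = []
    for i in range(0, len(texts), batch_size):
        results.extend(embed_batch(texts[i:i + batch_size]))
    return results


batch_sizes_seen = []


def fake_embed_batch(batch):
    """Hypothetical stand-in for the endpoint: one tiny vector per input text."""
    batch_sizes_seen.append(len(batch))
    return [[float(len(text)), float(text.count(" "))] for text in batch]


texts = [f"chunk {n}" for n in range(12)]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=5)
print(len(vectors), batch_sizes_seen)  # 12 [5, 5, 2]
```

Twelve texts with a batch size of 5 produce three requests, and the output keeps one embedding per input text in the original order.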
     
    The ContentHandler class extends EmbeddingsContentHandler and handles the transformation of input and output for the SageMaker endpoint:
    • transform_input method: Converts the input prompt and model arguments into a JSON string and encodes it into bytes.
    • transform_output method: Decodes the output bytes from the endpoint, parses the JSON, and extracts the embeddings.
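The two transformations can be exercised offline as plain functions. This sketch mirrors the "text_inputs" request key and the "embedding" response key used in the notebook code; the fake response body and its numbers are invented for illustration.

```python
import io
import json


def transform_input(texts, model_kwargs=None):
    """Encode a batch of texts (plus optional model arguments) as a JSON request body."""
    payload = {"text_inputs": texts, **(model_kwargs or {})}
    return json.dumps(payload).encode("utf-8")


def transform_output(output):
    """Read a file-like response body and return the embeddings it contains."""
    response_json = json.loads(output.read().decode("utf-8"))
    return response_json["embedding"]


request_body = transform_input(["acute kidney injury"])
print(json.loads(request_body))

# A made-up response body, wrapped in BytesIO so it has a read() method like a streaming body.
fake_response = io.BytesIO(json.dumps({"embedding": [[0.1, -0.2, 0.3]]}).encode("utf-8"))
print(transform_output(fake_response))
```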

    Semantic Similarity with Amazon SageMaker JumpStart Embedding Models

    Semantic search refers to searching for information based on the meaning and concepts of words and phrases, rather than just matching keywords. Embedding models like Amazon Titan Embeddings allow semantic search by representing words and sentences as dense vectors that encode their semantic meaning.
    Semantic matching is extremely helpful for RAG because it returns results that are conceptually related to the user's query, even if they don't contain the exact keywords. This leads to more relevant and useful search results which can be injected into our LLM's prompts.
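One common way to compare two embeddings is cosine similarity. The sketch below uses three-dimensional toy vectors invented for illustration (real embeddings from the model used here have thousands of dimensions), and the metric a given vector store uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: the first two point in nearly the same direction, the third does not.
query_vec = [0.9, 0.1, 0.0]
related_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.0, 0.1, 0.9]

print("related:  ", round(cosine_similarity(query_vec, related_vec), 3))
print("unrelated:", round(cosine_similarity(query_vec, unrelated_vec), 3))
```

Texts with similar meaning are expected to map to vectors with a higher score, which is what lets retrieval find passages that do not share the query's exact keywords.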
    First, let's look at a sample embedding.
        sample_embedding = np.array(sagemakerEndpointEmbeddingsJumpStart.embed_query(docs[0].page_content))
        print("Sample embedding of a document chunk: ", sample_embedding)
        print("Size of the embedding: ", sample_embedding.shape)
    Output:
        Sample embedding of a document chunk: [ 0.00541441 -0.01054054 0.01034561 ... -0.01216564 0.00790209 -0.00550144]
        Size of the embedding: (4096,)
    Now create embeddings for the entire document set. Note that for a single medical textbook, this takes about 6 minutes.
        from tqdm.contrib.concurrent import process_map
        from multiprocessing import cpu_count

        def generate_embeddings(x):
            return (x, sagemakerEndpointEmbeddingsJumpStart.embed_query(x))

        workers = 1 * cpu_count()
        texts = [i.page_content for i in docs]
        print(workers)
        data = process_map(generate_embeddings, texts, max_workers=workers, chunksize=100)
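The cell above produces a list of (text, embedding) pairs. As an illustration of that shape, here is a sketch that builds the same kind of list with a standard-library thread pool. This swaps in a different mechanism than the notebook's process_map, on the assumption that endpoint calls are network-bound; embed_all and fake_embed are hypothetical names, and fake_embed returns invented numbers instead of calling SageMaker.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_all(texts, embed_one, max_workers=4):
    """Embed every text concurrently and return (text, embedding) pairs in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        vectors = list(pool.map(embed_one, texts))  # map preserves input order
    return list(zip(texts, vectors))


def fake_embed(text):
    """Hypothetical stand-in for embed_query: a made-up two-number vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 7)]


pairs = embed_all(["renal", "tubular injury", "GFR"], fake_embed, max_workers=2)
for text, vector in pairs:
    print(text, vector)
```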
    Next, we insert the embeddings into the FAISS vector store and save the index locally. We will then ask the query "What is acute kidney injury?"
        from langchain.vectorstores import FAISS

        faiss = FAISS.from_documents(docs[0:2], sagemakerEndpointEmbeddingsJumpStart)
        faiss.add_embeddings(data)
        faiss.save_local("faiss_index")
    Now define the query and use the FAISS store's embedding function to embed it. The resulting vector is used to find the relevant documents through a vector DB search.
        query = "What is acute kidney injury?"  # the query must be defined before it is embedded
        query_embedding = faiss.embedding_function(query)
        relevant_documents = faiss.similarity_search_by_vector(query_embedding)

        context = ""
        print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
        print('----')
        for i, rel_doc in enumerate(relevant_documents):
            print(f'## Document {i+1}: {rel_doc.page_content}.......')
            print('---')
            context += rel_doc.page_content
        context = context.replace("\n", " ")
    Output:
        4 documents are fetched which are relevant to the query.
        ----
        ## Document 1: Knowles MA, Hurst CD: Molecular biology of bladder cancer: new insights into pathogenesis and clinical diversity, Nat Rev Cancer 15:25, 2015. [A comprehensive review of the molecular changes in different types of bladder cancer.] Mathieson PW: Minimal change nephropathy and focal segmental glomerulosclerosis, Semin Immunopathol 29:415, 2007. [An excellent overview of new insights into the pathogenesis and diagnosis of minimal change disease versus focal segmental glomerulosclerosis.] Miller O, Hemphill RR: Urinary tract infection and pyelonephritis, Emerg Med Clin North Am 19:655, 2001. [An excellent review of acute urinary tract infections.] Murray PT, Devarajan P, Levey AS, et al: A framework and key research questions in AKI diagnosis and staging in different environments, Clin J Am Soc Nephrol 3:864, 2008. [An excellent review outlining recent advances in early diagnosis and consequences of acute kidney injury.].......
        ---
        ## Document 2: disease. In those with preexisting chronic kidney disease, complete recovery is less frequent, and progression to end-stage renal disease is unfortunately common........
        ---
        ## Document 3: Acute tubular injury (ATI) is a clinicopathologic entity characterized by damage to tubular epithelial cells and an acute decline in renal function, often associated with shedding of granular casts and tubular cells into the urine. Clinicians use the term acute tubular necrosis, but frank necrosis is rarely observed in a kidney biopsy, so pathologists prefer the term acute tubular injury. The constellation of changes, broadly termed acute kidney injury, manifests clinically as decreased GFR with concurrent elevation of serum creatinine. ATI is the most common cause of acute kidney injury and may produce oliguria (defined as urine output of <400 mL/day). http://ebooksmedicine.net There are two forms of ATI that differ in the underlying causes........
        ---
        ## Document 4: Most forms of tubular injury also involve the interstitium, so the two are discussed together. Presented under this heading are diseases characterized by (1) inflammatory involvement of the tubules and interstitium (tubulointerstitial nephritis) and (2) ischemic or toxic tubular injury, leading to acute tubular injury and the clinical syndrome of acute kidney injury........
        ---
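Conceptually, a search by vector ranks the stored embeddings by their distance to the query embedding and returns the closest few. The following brute-force sketch illustrates that idea on toy data; it is not FAISS's implementation, Euclidean distance is an assumption here, and the default of k=4 simply mirrors the four documents returned above.

```python
import math


def top_k_by_distance(query_vec, indexed, k=4):
    """Return the texts of the k stored vectors closest to query_vec.

    indexed is a list of (text, vector) pairs, the same shape as the
    (text, embedding) pairs built earlier in the notebook.
    """
    scored = [(math.dist(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [text for _, text in scored[:k]]


# Toy two-dimensional "embeddings", invented for illustration.
toy_index = [
    ("a", [0.0, 0.0]),
    ("b", [1.0, 1.0]),
    ("c", [5.0, 5.0]),
    ("d", [0.1, 0.0]),
    ("e", [9.0, 9.0]),
]
print(top_k_by_distance([0.0, 0.0], toy_index, k=3))  # ['a', 'd', 'b']
```

A brute-force scan like this compares the query against every stored vector, which is fine for a toy example; a dedicated vector index exists to make this lookup fast at scale.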
    Now create a prompt template that passes the context from the vector search to the model. We specifically instruct the model to answer using the context provided.
        template = """
        You are a helpful, polite, fact-based agent.
        If you don't know the answer, just say that you don't know.
        Please answer the following question using the context provided.

        CONTEXT:
        {context}
        =========
        QUESTION: {question}
        ANSWER: """

        prompt = template.format(context=context, question=query)
        print(prompt)
    Output -
    You are a helpful, polite, fact-based agent. If you don't know the answer, just say that you don't know. Please answer the following question using the context provided. CONTEXT: Knowles MA, Hurst CD: Molecular biology of bladder cancer: new insights into pathogenesis and clinical diversity, Nat Rev Cancer 15:25, 2015. [A comprehensive review of the molecular changes in different types of bladder cancer.] Mathieson PW: Minimal change nephropathy and focal segmental glomerulosclerosis, Semin Immunopathol 29:415, 2007. [An excellent overview of new insights into the pathogenesis and diagnosis of minimal change disease versus focal segmental glomerulosclerosis.] Miller O, Hemphill RR: Urinary tract infection and pyelonephritis, Emerg Med Clin North Am 19:655, 2001. [An excellent review of acute urinary tract infections.] Murray PT, Devarajan P, Levey AS, et al: A framework and key research questions in AKI diagnosis and staging in different environments, Clin J Am Soc Nephrol 3:864, 2008. [An excellent review outlining recent advances in early diagnosis and consequences of acute kidney injury.]disease. In those with preexisting chronic kidney disease, complete recovery is less frequent, and progression to end-stage renal disease is unfortunately common.Acute tubular injury (ATI) is a clinicopathologic entity characterized by damage to tubular epithelial cells and an acute decline in renal function, often associated with shedding of granular casts and tubular cells into the urine. Clinicians use the term acute tubular necrosis, but frank necrosis is rarely observed in a kidney biopsy, so pathologists prefer the term acute tubular injury. The constellation of changes, broadly termed acute kidney injury, manifests clinically as decreased GFR with concurrent elevation of serum creatinine. ATI is the most common cause of acute kidney injury and may produce oliguria (defined as urine output of <400 mL/day). 
http://ebooksmedicine.net There are two forms of ATI that differ in the underlying causes.Most forms of tubular injury also involve the interstitium, so the two are discussed together. Presented under this heading are diseases characterized by (1) inflammatory involvement of the tubules and interstitium (tubulointerstitial nephritis) and (2) ischemic or toxic tubular injury, leading to acute tubular injury and the clinical syndrome of acute kidney injury. ========= QUESTION: What is acute kidney injury? ANSWER:
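In the printed prompt, the retrieved chunks run straight into each other (for example "...acute kidney injury.]disease."), because they are concatenated without a separator. One possible variant, sketched here with hypothetical helper names and a template modeled on the one in the notebook, joins the chunks with a blank line instead.

```python
PROMPT_TEMPLATE = (
    "You are a helpful, polite, fact-based agent.\n"
    "If you don't know the answer, just say that you don't know.\n"
    "Please answer the following question using the context provided.\n\n"
    "CONTEXT:\n{context}\n=========\nQUESTION: {question}\nANSWER: "
)


def build_context(chunks, separator="\n\n"):
    """Collapse whitespace runs inside each chunk, drop empty chunks, join with a separator."""
    cleaned = [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
    return separator.join(cleaned)


def build_prompt(chunks, question):
    """Fill the template with the joined chunks and the user's question."""
    return PROMPT_TEMPLATE.format(context=build_context(chunks), question=question)


demo_chunks = ["first retrieved\npassage.", "   ", "second retrieved passage."]
print(build_prompt(demo_chunks, "What is acute kidney injury?"))
```

Keeping a visible boundary between chunks makes it easier for a reader of the prompt to tell where one retrieved passage ends and the next begins.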
    Now, let's invoke the endpoint to generate a response from the LLM.
        smr_client = boto3.client("sagemaker-runtime")
        response_model = smr_client.invoke_endpoint(
            EndpointName=TEXT_GENERATION_MODEL_ENDPOINT_NAME,
            Body=json.dumps(
                {"inputs": prompt, "parameters": {"max_new_tokens": 500}}
            ),
            ContentType="application/json",
        )
        response = json.loads(response_model["Body"].read())
        print(response[0]["generated_text"])
    Now check the answer from the LLM. The generated text repeats the full prompt and then appends the model's answer after "ANSWER:".
    You are a helpful, polite, fact-based agent. If you don't know the answer, just say that you don't know. Please answer the following question using the context provided. CONTEXT: Knowles MA, Hurst CD: Molecular biology of bladder cancer: new insights into pathogenesis and clinical diversity, Nat Rev Cancer 15:25, 2015. [A comprehensive review of the molecular changes in different types of bladder cancer.] Mathieson PW: Minimal change nephropathy and focal segmental glomerulosclerosis, Semin Immunopathol 29:415, 2007. [An excellent overview of new insights into the pathogenesis and diagnosis of minimal change disease versus focal segmental glomerulosclerosis.] Miller O, Hemphill RR: Urinary tract infection and pyelonephritis, Emerg Med Clin North Am 19:655, 2001. [An excellent review of acute urinary tract infections.] Murray PT, Devarajan P, Levey AS, et al: A framework and key research questions in AKI diagnosis and staging in different environments, Clin J Am Soc Nephrol 3:864, 2008. [An excellent review outlining recent advances in early diagnosis and consequences of acute kidney injury.]disease. In those with preexisting chronic kidney disease, complete recovery is less frequent, and progression to end-stage renal disease is unfortunately common.Acute tubular injury (ATI) is a clinicopathologic entity characterized by damage to tubular epithelial cells and an acute decline in renal function, often associated with shedding of granular casts and tubular cells into the urine. Clinicians use the term acute tubular necrosis, but frank necrosis is rarely observed in a kidney biopsy, so pathologists prefer the term acute tubular injury. The constellation of changes, broadly termed acute kidney injury, manifests clinically as decreased GFR with concurrent elevation of serum creatinine. ATI is the most common cause of acute kidney injury and may produce oliguria (defined as urine output of <400 mL/day). 
http://ebooksmedicine.net There are two forms of ATI that differ in the underlying causes.Most forms of tubular injury also involve the interstitium, so the two are discussed together. Presented under this heading are diseases characterized by (1) inflammatory involvement of the tubules and interstitium (tubulointerstitial nephritis) and (2) ischemic or toxic tubular injury, leading to acute tubular injury and the clinical syndrome of acute kidney injury. ========= QUESTION: What is acute kidney injury? ANSWER: Acute kidney injury (AKI) is a clinical syndrome characterized by a sudden decline in renal function, as evidenced by decreased glomerular filtration rate (GFR) and an increase in serum creatinine. Acute tubular injury (ATI) is a common cause of AKI and is characterized by damage to tubular epithelial cells, often resulting in the shedding of granular casts and tubular cells into the urine. ATI can be caused by inflammatory involvement of the tubules and interstitium (tubulointerstitial nephritis) or ischemic or toxic injury. The prognosis for recovery from AKI depends on the underlying cause and the presence of preexisting chronic kidney disease. In those with preexisting chronic kidney disease, complete recovery is less frequent, and progression to end-stage renal disease is unfortunately common.
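Because the generated text shown above starts with the prompt itself, the answer can be separated out with a little string handling. This is a minimal sketch with a hypothetical helper name; it assumes the endpoint echoes the prompt verbatim and otherwise falls back to the last "ANSWER:" marker.

```python
def extract_answer(generated_text, prompt, marker="ANSWER:"):
    """Return only the model's answer from text that may begin with the echoed prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: keep whatever follows the last occurrence of the marker.
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


demo_prompt = "CONTEXT: ... ========= QUESTION: What is acute kidney injury? ANSWER:"
demo_output = demo_prompt + " Acute kidney injury (AKI) is a clinical syndrome."
print(extract_answer(demo_output, demo_prompt))
```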