# !pip install -U pip -q
# !pip install keras-hub -U -q #==0.22.1 -U -q
# !pip install -q tf-keras sentence_transformers #transformers==4.57.1 for embeddinggemma
Build with AI - Tailoring LLMs: RAG and Fine-Tuning
Instructors:
- Dr Nate Butterworth (Google XWF)
Date: May 14, 2026
Part 2: Retrieval-Augmented Generation (RAG)
…and how it is better than, worse than, and complementary to fine-tuning
In fine-tuning we update a model’s internal weights to teach it new knowledge. But the process is time-consuming and computationally expensive, and the knowledge can become outdated.
Now, we explore a powerful and complementary technique to adapt LLMs to your data and use-cases: Retrieval-Augmented Generation (RAG).
Fine-Tuning vs. RAG | 🧠 vs. 📚
Fine-Tuning:
You spend weeks training your model on all your lab’s data. The model memorises this information, changing its own neural network to incorporate it. It can now answer questions from memory.
Pro: Deeply understands the nuances and style of your data.
Con: Expensive to re-train. If new data comes in, the model is instantly out of date until you fine-tune it again. It can also “hallucinate” or forget specifics.
RAG:
You take a general-purpose, pre-trained model (one that has not been specially trained on your data) and give it access to a library or textbook at the moment you ask a question.
When you ask a question, the system first retrieves the most relevant pages from the textbook.
It then gives those pages to the model along with your question and says, “Using only this information, answer the question.”
The model generates an answer based on the documents it was just given.
RAG is often cheaper, faster, and allows you to use real-time data. Let’s see how it works in practice.
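Before building the real pipeline, the retrieve, augment, generate flow described above can be sketched with a toy example. This is only an illustration under loud assumptions: word overlap stands in for a real embedding search, and `fake_llm`, `toy_library` and the other names are hypothetical placeholders, not part of any library used in this notebook.

```python
# Toy sketch of the retrieve -> augment -> generate flow.
# Word overlap stands in for a real embedding search, and `fake_llm`
# is a hypothetical placeholder for a real model call.

def score(question: str, document: str) -> int:
    """Count how many distinct question words also appear in the document."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words)

def retrieve(question: str, library: list[str], top_n: int = 1) -> list[str]:
    """Return the top_n documents with the highest overlap score."""
    ranked = sorted(library, key=lambda d: score(question, d), reverse=True)
    return ranked[:top_n]

def build_prompt(question: str, context_docs: list[str]) -> str:
    """Augment the question with the retrieved context."""
    context = "\n\n".join(context_docs)
    return (
        "Instruction: Using only information from the context below, "
        "answer the following question.\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        "Answer:"
    )

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: just reports the prompt size."""
    return f"[model would answer here, given a {len(prompt)}-character prompt]"

toy_library = [
    "anemia is a condition with fewer red blood cells than normal",
    "asthma is a chronic disease that affects the airways",
]
toy_question = "what is anemia"
toy_context = retrieve(toy_question, toy_library)   # 1. retrieve
toy_prompt = build_prompt(toy_question, toy_context)  # 2. augment
print(fake_llm(toy_prompt))                           # 3. generate
```

The rest of this notebook replaces each toy piece with a real one: an embedding model for `score`, a vector store for `toy_library`, and Gemma for `fake_llm`.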
The RAG Pipeline
Our goal is to build a system that can answer medical questions using a knowledge base of documents. The LLM itself knows nothing about these specific documents; it will only use the information we find for it.
Get started
Visit Google Colab and start a “New notebook in Drive”.
Step 0: Prepare our environment and download our models/data
First, let’s get what we need: the libraries, an LLM to answer questions and generate responses, an “embedding model” that maps text to high-dimensional vectors so that texts with similar meanings end up close together, and some data to use as an example.
# Import necessary libraries
import os
import pandas as pd
import numpy as np
# Keras and KerasHub for loading our generative LLM
import keras_hub
import kagglehub
# SentenceTransformers for loading our embedding model
from sentence_transformers import SentenceTransformer
# Scikit-learn for calculating similarity between our questions asked and our knowledge base of data
from sklearn.metrics.pairwise import cosine_similarity
# Authenticate with KaggleHub to download models and datasets
kagglehub.login()
# Download the Gemma3 LLM model, the paraphrase embedding model, and a dataset to build our knowledge base
path = kagglehub.model_download("keras/gemma3/keras/gemma3_instruct_270m/4")
path_embedding_model = kagglehub.model_download("astrobutter/paraphrase-minilm-l3-v2/pyTorch/default")
# path_embedding_model = kagglehub.model_download("srg9000/all-minilm-l6-v2/transformers/default")
# path_embedding_model = kagglehub.model_download("google/embeddinggemma/transformers/embeddinggemma-300m") # embedding takes 1 hour
path_data = kagglehub.dataset_download("gpreda/medquad")
Step 1: Setup and Prepare the Knowledge Base
With the libraries loaded, we prepare the dataset that will serve as our “textbook” or “knowledge base.”
# Load knowledge base.
df = pd.read_csv(os.path.join(path_data, "medquad.csv"))
# --- Prepare the Documents for Retrieval ---
# To make each document in our knowledge base meaningful, we'll combine the question and its corresponding answer.
# This creates a single, self-contained block of text that is rich with context.
df['combined_text'] = df['question'] + " " + df['answer']
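# Convert the text column into a simple list of strings.
# Drop rows with a missing question or answer *before* casting to str: on older pandas
# versions, astype(str) turns missing values into the literal string "nan", which dropna() keeps.
# Each string in this list is a "document" in our knowledge base.
documents = df['combined_text'].dropna().astype(str).tolist() # All data ~7 minutes to index on CPU, 20 seconds on GPU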
# In this case we can "cheat" to craft a smaller, more focused knowledge base.
# For example, to only include documents about a specific topic:
# documents = df[df['focus_area']=="Pernicious Anemia"]['combined_text'].tolist()
Step 2: The “Retrieval” - Finding Relevant Knowledge
We can’t just do a keyword search (like Ctrl+F). We need to search by semantic meaning. To do this, we must convert our text documents into numerical representations called vector embeddings.
The Embedding Model
An embedding model is a specialised neural network that reads text and outputs a list of numbers (a vector). Texts with similar meanings will have mathematically similar vectors.
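To make “mathematically similar” concrete, here is a small worked example of cosine similarity, the measure used later in this notebook. The three-dimensional vectors are invented purely for illustration; a real embedding model produces vectors with hundreds of dimensions, and the helper name `cosine` is our own.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_vectors = {
    "What is anemia?": [0.9, 0.1, 0.0],
    "Low red blood cell count": [0.8, 0.2, 0.1],
    "How to bake bread": [0.0, 0.1, 0.9],
}

query_vec = toy_vectors["What is anemia?"]
for text, vec in toy_vectors.items():
    # Scores near 1 mean "pointing the same way"; scores near 0 mean unrelated.
    print(f"{cosine(query_vec, vec):.3f}  {text}")
```

The two medical texts point in nearly the same direction and score close to 1, while the unrelated text scores close to 0. That ordering is what the retrieval step relies on.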
# --- 2. Load Lightweight Embedding Model ---
print("Loading sentence embedding model...")
# 'paraphrase-MiniLM-L3-v2' is a small, efficient model suitable for this task.
# Its specific job is to create high-quality embeddings for sentences and paragraphs.
embedding_model = SentenceTransformer(path_embedding_model)
# embedding_model = SentenceTransformer(path_embedding_model+"/all-MiniLM-L6-v2")
# If a GPU is available, move the model to the GPU for a massive speedup in embedding generation.
# embedding_model.to('cuda') # Move the model to the GPU
Loading sentence embedding model...
Creating the Vector Store
The “vector store” (or vector database) is our searchable index of embeddings. For our dataset, we will create a simple in-memory store using a NumPy array. For larger datasets, you can use tools like FAISS, ChromaDB, or Pinecone.
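As a rough sketch of what “a simple in-memory store” can look like, the hypothetical class below keeps the embeddings and their texts together and normalises each row once, so that a plain dot product gives the cosine similarity. The class name `InMemoryVectorStore`, its methods, and the toy two-dimensional vectors are all assumptions made for illustration; in this notebook the embeddings would come from `embedding_model.encode(documents)`.

```python
import numpy as np

class InMemoryVectorStore:
    """Hypothetical minimal vector store: a matrix of unit vectors plus their texts."""

    def __init__(self, embeddings: np.ndarray, texts: list[str]):
        if len(embeddings) != len(texts):
            raise ValueError("Need exactly one embedding per text.")
        # Normalise each row once so a dot product equals cosine similarity.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / np.clip(norms, 1e-12, None)
        self.texts = list(texts)

    def search(self, query_embedding: np.ndarray, top_n: int = 3) -> list[tuple[str, float]]:
        """Return the top_n (text, cosine similarity) pairs, best first."""
        query = query_embedding / max(np.linalg.norm(query_embedding), 1e-12)
        scores = self.embeddings @ query
        best = np.argsort(scores)[::-1][:top_n]
        return [(self.texts[i], float(scores[i])) for i in best]

# Toy 2-dimensional embeddings, invented for illustration only.
toy_store = InMemoryVectorStore(
    np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]),
    ["doc about topic A", "doc about topic B", "doc about both"],
)
toy_hits = toy_store.search(np.array([0.9, 0.1]), top_n=2)
print(toy_hits)
```

Dedicated vector databases add what this sketch leaves out, such as approximate nearest-neighbour indexes, persistence, and metadata filtering.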
# %%time
# --- 3. Create Vector Store (In-Memory) ---
print("Creating embeddings for the vector store...")
# Generate embeddings for our text data. This might take a few minutes.
vector_store = embedding_model.encode(documents, show_progress_bar=True)
print(f"Vector store created with {vector_store.shape[0]} embeddings.")
Creating embeddings for the vector store...
Vector store created with 16407 embeddings.
Searching the Vector Store
This is the core of the retrieval step. When a user asks a question, we convert their question into a vector and then search our store for the most similar document vectors.
# --- Vector Similarity Search - "Retrieval" part of RAG. ---
example_question = "What is anemia?"
# Embed the query using the *same* embedding model.
# This ensures the question vector is in the same "semantic space" as our document vectors.
query_embedding = embedding_model.encode([example_question])
# Find the indices of the top_n most similar documents using
# cosine similarity between query and each document in the vector store
similarities = cosine_similarity(query_embedding, vector_store)
top_n = 1
top_indices = np.argsort(similarities[0])[-top_n:][::-1]
# Retrieve the original text of the most similar documents to use in our LLM context.
retrieved_docs = [documents[i] for i in top_indices]
rag_context = "\n\n---\n\n".join(retrieved_docs)
print("\n--- Retrieved Context for RAG ---")
print(rag_context)
print("---------------------------------\n")
--- Retrieved Context for RAG ---
What is (are) Anemia in Chronic Kidney Disease ? Anemia is a condition in which the body has fewer red blood cells than normal. Red blood cells carry oxygen to tissues and organs throughout the body and enable them to use energy from food. With anemia, red blood cells carry less oxygen to tissues and organsparticularly the heart and brainand those tissues and organs may not function as well as they should.
---------------------------------
Step 3: The “Generation” - Answering the Question
Now that we have retrieved relevant information, we pass it to an LLM (like Gemma) to formulate a final answer.
%%time
## --- Load the Pre-trained Language Model ---
# We load the pre-trained Gemma 3 model (270M, instruction-tuned) as-is. No fine-tuning is needed for RAG.
print("Loading Gemma3CausalLM model...")
gemma_lm = keras_hub.models.Gemma3CausalLM.from_preset(path)
Loading Gemma3CausalLM model...
CPU times: user 11.3 s, sys: 2.62 s, total: 13.9 s
Wall time: 7.3 s
Generating a Baseline Response (Without RAG)
First, let’s see what the model answers using only its pre-trained, general knowledge. This is our baseline.
%%time
## Baseline response
raw_prompt = f"""
Instruction: Answer the following question.
Question: {example_question}
Answer:
"""
response_raw = gemma_lm.generate(
raw_prompt,
max_length=256,
strip_prompt=True,
)
print("--- RAW Generated Response ---")
print(response_raw)
--- RAW Generated Response ---
**Answer:**
[... the same line repeats; 59 lines in total ...]
CPU times: user 1min 39s, sys: 11.2 s, total: 1min 50s
Wall time: 29.7 s
Generating an Augmented Response (With RAG)
Now, we’ll use the “augmented” prompt that includes the context we retrieved. This is the generation step of RAG.
%%time
## 4. Augment the Prompt with Retrieved Context
rag_prompt = f"""
Instruction: Using only information from the context below, answer the following question.
Question: {example_question}
Context: {rag_context}
Answer:
"""
## 5. Generate the Response
# This is the "Generation" step in RAG.
# The model uses the augmented prompt to generate a context-aware answer.
response_rag = gemma_lm.generate(
rag_prompt,
# max_length=256, # Increased max_length for a more complete answer
strip_prompt=True,
# stop_token_ids='.'
)
print("--- RAG Generated Response ---")
print(response_rag)
--- RAG Generated Response ---
Anemia is a condition in which the body has fewer red blood cells than normal.
Final Answer:
Anemia is a condition in which the body has fewer red blood cells than normal.
<end_of_turn>
CPU times: user 41.2 s, sys: 4.43 s, total: 45.7 s
Wall time: 18.6 s
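The RAG response above repeats its answer and ends with a literal `<end_of_turn>` marker. One possible contributing factor (an assumption, not something verified in this notebook) is that the plain `Instruction: ...` prompt does not follow the turn format that instruction-tuned Gemma models are trained on. The sketch below wraps the same RAG prompt in `<start_of_turn>` / `<end_of_turn>` markers and strips the marker from the output. The helper names `build_gemma_rag_prompt` and `clean_response`, and the character budget, are hypothetical.

```python
def build_gemma_rag_prompt(question: str, context_docs: list[str],
                           max_context_chars: int = 2000) -> str:
    """Hypothetical helper: wrap a RAG prompt in Gemma's chat turn markers."""
    context = "\n\n---\n\n".join(context_docs)
    # Crude character budget so a long context cannot crowd out the question.
    context = context[:max_context_chars]
    user_message = (
        "Using only information from the context below, "
        "answer the following question.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

def clean_response(text: str) -> str:
    """Remove any end-of-turn markers and surrounding whitespace."""
    return text.replace("<end_of_turn>", "").strip()

demo_prompt = build_gemma_rag_prompt(
    "What is anemia?",
    ["Anemia is a condition in which the body has fewer red blood cells than normal."],
)
print(demo_prompt)
print(clean_response("Anemia is a condition.\n<end_of_turn>"))
```

In this notebook, the prompt could be passed to the model as `gemma_lm.generate(build_gemma_rag_prompt(example_question, retrieved_docs), strip_prompt=True)`. Whether this improves the answer for the 270M model is something to test rather than assume.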
Advanced RAG
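This simple “find and inject” method can fail because vector similarity doesn’t always equal relevance. In the example above, the top match for “What is anemia?” was a document about anemia in chronic kidney disease rather than a general definition. Modern RAG systems add several more layers: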
- Query Transformation: Instead of searching with your exact (and perhaps messy) question, the system uses an LLM to rewrite your query into a better search term or breaks it into multiple sub-questions.
- Reranking: The system might pull 50 snippets but then use a specialised reranker model to score them again, keeping only the top 5 that actually answer the question. This helps with the “lost in the middle” problem, where LLMs ignore information buried in a long list of snippets.
- Knowledge Graphs: Instead of just text “chunks,” some systems retrieve relationships (e.g., “Company A acquired Company B”). This allows the LLM to understand context that a single snippet might miss.
- Contextual Compression: Rather than injecting a whole page, the system summarises or extracts only the relevant sentences from a snippet to save space in the LLM’s “brain” (context window).
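Two of the layers above, reranking and contextual compression, can be sketched in a few lines. This is a toy illustration only: `toy_rerank_score` uses word overlap as a hypothetical stand-in for a real reranker model (in practice this is typically a cross-encoder; the sentence_transformers library provides a `CrossEncoder` class), and `compress` uses a simple keyword filter where a real system might use an LLM or an extractive model. All function names and the sample snippets are invented for this sketch.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens, used by the toy scorer below."""
    return set(re.findall(r"[a-z]+", text.lower()))

def toy_rerank_score(question: str, snippet: str) -> float:
    """Hypothetical stand-in for a reranker model: fraction of question words covered."""
    q = tokens(question)
    return len(q & tokens(snippet)) / len(q) if q else 0.0

def rerank(question: str, candidates: list[str], keep: int = 2) -> list[str]:
    """Second stage: rescore the first-stage candidates and keep the best few."""
    ranked = sorted(candidates, key=lambda s: toy_rerank_score(question, s), reverse=True)
    return ranked[:keep]

def compress(question: str, snippet: str) -> str:
    """Contextual compression: keep only sentences sharing a content word with the question."""
    stopwords = {"what", "is", "are", "the", "a", "of", "in"}
    q = tokens(question) - stopwords
    sentences = re.split(r"(?<=[.!?])\s+", snippet.strip())
    kept = [s for s in sentences if q & tokens(s)]
    return " ".join(kept)

# Pretend these snippets came back from the first-stage vector search.
toy_candidates = [
    "Asthma affects the airways. It can cause wheezing.",
    "Anemia means fewer red blood cells than normal. Kidneys filter blood.",
    "What causes anemia? Low iron is a common cause of anemia.",
]
toy_q = "What is anemia?"
toy_top = rerank(toy_q, toy_candidates, keep=2)
toy_compressed = [compress(toy_q, s) for s in toy_top]
print(toy_compressed)
```

Note that the word-overlap scorer ranks the “What causes anemia?” snippet first simply because it shares the most words with the question. A trained reranker is meant to judge whether a snippet actually answers the question, which this toy cannot do.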