LLM_log #015: Fine-Tuning LLMs — Teach a 3B Model to Call Functions with QLoRA + Unsloth on Free Colab T4

Highlights:

Every modern LLM agent, from ChatGPT plugins to Claude tools, relies on a single learned skill: outputting a structured JSON function call instead of free text. In this post we teach that skill to a 3-billion-parameter model using QLoRA on a free Google Colab T4. We start from the fundamentals (why fine-tune, when to use LoRA, how quantization works), then build the full training pipeline from scratch. By the end, your model takes a user query like “What’s the weather in Tokyo?” and outputs {"name": "get_weather", "arguments": {"city": "Tokyo"}}. So let’s begin!

Tutorial Overview:

  1. What Is Fine-Tuning and Why?
  2. Encoder vs Decoder — Two Paths
  3. Instruction Fine-Tuning — The Key Idea
  4. LoRA — Train 1% of Parameters
  5. Quantization & QLoRA — Making It Fit
  6. Evaluation & RAG vs Fine-Tuning
  7. What Is Function Calling?
  8. The Training Data & The Goal
  9. Hands-On: The Complete Tutorial
  10. Results — Before vs After
  11. Appendix: Full Colab Code

3B parameters | 30 min training | $0 GPU cost | 113K examples

1. What Is Fine-Tuning and Why?

Think of fine-tuning an LLM like taking a general-purpose robotic arm and swapping its end-effector for a highly specialized tool: you don’t rebuild the servomotors or the base logic, you just calibrate the final few degrees of freedom for your specific task. Training a foundation model from scratch requires millions of dollars in compute, but adapting an open-source model to your proprietary data follows a strictly bounded pipeline.

Fine-Tuning Pipeline
Fig 1. The fine-tuning pipeline: from choosing a pre-trained model through data prep, training, evaluation, to deployment. We’ll walk through every step in our Colab notebook.

We are going to walk through every one of these steps in our Colab notebook today, transforming this flowchart into a deployable artifact. The pre-trained model already knows language — our job is to teach it a new skill: generating structured function calls.


2. Encoder vs Decoder — Two Paths

Before we spin up massive generative models, let’s acknowledge the traditional workhorses. If your task is strictly classification or NER on short texts, a compact encoder model like DistilBERT fits perfectly — think of it as a highly specialized filing clerk that reads a document and assigns a rigid label, easily fine-tuned in minutes using HuggingFace’s AutoModelForSequenceClassification.

Encoder vs Decoder
Fig 2. Encoder models output labels (classification). Decoder models output structured text (generation). For function calling, we need the decoder path.

The encoder architecture works beautifully for classification — the transformer processes tokenized text through attention layers, the CLS token gets routed through a task-specific head, and out comes a label. If you’ve used a ResNet backbone as a feature extractor for image classification, this setup should feel completely familiar.

Confusion Matrix
Fig 3. DistilBERT on AG News — 94%+ accuracy with a few epochs on a pre-trained model. Encoder fine-tuning works.

But we don’t want labels. We want the model to generate structured JSON output — a function call with arguments. That means we need a decoder model, and that means instruction fine-tuning.


3. Instruction Fine-Tuning — The Key Idea

Think of a pre-trained base LLM as a brilliant but uncooperative intern who has read every book in the library but just blurts out related sentences when asked a question. Instruction fine-tuning is the orientation process that teaches them to actually read the ticket, follow the formatting rules, and deliver the specific deliverable you asked for.

Instruction Format
Fig 4. Training data as (system, user, assistant) triples. The model learns to map user intent to structured output — a paradigm pioneered by Flan, Self-Instruct, and LIMA.

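We achieve this by formatting our training data into strict triples: a system message (the instruction and the available tools), a user message (the input), and an assistant message (the expected output). The critical insight: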

Key idea: Function calling is simply instruction fine-tuning where the expected output happens to be a validated JSON object instead of natural language. The same mechanism that teaches a model to answer politely can teach it to output {"name": "get_weather", "arguments": {"city": "Paris"}}.
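
To make the triple concrete, here is a minimal sketch of one training example as role-tagged messages, rendered in a simplified ChatML-style layout. The `render_chatml` helper and the one-line tool description are illustrative assumptions; in the notebook the tokenizer's own chat template does this job.

```python
import json

def render_chatml(messages):
    """Render role-tagged messages in a simplified ChatML-style layout (illustrative only)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

# The training target is ordinary text that happens to be valid JSON.
target_call = {"name": "get_weather", "arguments": {"city": "Paris"}}

example = [
    {"role": "system", "content": "You can call: get_weather(city)"},   # hypothetical tool list
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": json.dumps(target_call)},
]

text = render_chatml(example)
print(text)
```

Nothing in the training loop knows the assistant turn is JSON; the model simply learns to reproduce that text pattern after this kind of prompt.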


4. LoRA — Train 1% of Parameters

Think of LoRA like applying a surgical software patch instead of rewriting the entire operating system. During training, we freeze the massive base weights and only train two tiny matrices, A (d×r) and B (r×d), where the rank r, much smaller than d, sets how many parameters we train. For a 3B parameter model, this means we update roughly 1–5% of total parameters, depending on r and on which layers get adapters.

LoRA
Fig 5. LoRA: freeze the large weight matrix W, train tiny adapters A and B alongside it. After training, merge: W_new = W + A·B — zero inference overhead.

After training, we compute W_new = W + A·B and replace the original weights, so the merged model has zero added inference latency. Setting up this dual-path architecture is a single call: model = get_peft_model(model, lora_config) in PEFT, or FastLanguageModel.get_peft_model(...) in Unsloth, which is what our notebook uses.
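
The merge claim can be checked on a toy example. The sketch below uses plain Python lists and the same convention as above (A is d×r, B is r×d, inputs are row vectors). The numbers are made up, and the width of 2048 in `lora_fraction` is only an illustrative assumption.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_fraction(d, r):
    """Adapter weights (d*r + r*d) as a fraction of one dense d x d matrix."""
    return (2 * d * r) / (d * d)

d, r = 4, 1                                   # toy sizes
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weight
A = [[0.5], [0.0], [-1.0], [2.0]]             # d x r, trainable
B = [[1.0, 2.0, 0.0, -1.0]]                   # r x d, trainable
x = [[3.0, 1.0, 4.0, 1.0]]                    # one input row, 1 x d

# Training-time view: frozen path plus low-rank path.
two_path = add(matmul(x, W), matmul(matmul(x, A), B))

# Deployment view: fold the adapter into the weight once, then use a single matmul.
W_new = add(W, matmul(A, B))
merged = matmul(x, W_new)

print(two_path, merged)
print(lora_fraction(2048, 16))                # adapter share for one square 2048-wide layer
```

Both views give the same output, which is why merging costs nothing at inference time. For a single square layer the adapter share is 2r/d, so it shrinks as layers get wider.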


5. Quantization & QLoRA — Making It Fit

Precision formats are like suitcases: FP32 is the massive check-in bag holding every decimal detail, but it’s far too heavy for standard GPU memory. If we push compression all the way down to 4 bits per weight, we get an 8× memory reduction relative to FP32, which is exactly the trick that lets us fit a 3B parameter model onto a free Colab T4.

Quantization
Fig 6. The precision ladder: FP32 (4 bytes) → FP16 → BF16 → int8 → int4 (0.5 bytes). BF16 is preferred for training on GPUs that support it (same range as FP32, half the memory); the T4 does not, so training here falls back to FP16. int4 is what makes QLoRA possible.
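
The ladder translates directly into arithmetic. This sketch assumes a nominal 3 billion weights and counts raw storage only, with no quantization constants or layers kept in higher precision.

```python
# Bytes per weight for each rung of the precision ladder.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, fmt):
    """Raw weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

n_params = 3e9  # nominal parameter count used in this post
for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>5}: {weight_memory_gb(n_params, fmt):5.1f} GB")
```

The raw 4-bit figure lands slightly under the ~1.7 GB quoted for the loaded model. The gap is presumably quantization metadata plus the parts of the network that stay in 16-bit.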

QLoRA combines everything: a 4-bit quantized base model, 16-bit LoRA adapters, and a paged optimizer that offloads to CPU when needed. Think of it like bolting a small aftermarket turbocharger onto an engine rather than rebuilding the entire block from scratch.

QLoRA
Fig 7. QLoRA: 4-bit frozen base + 16-bit adapters + paged optimizer. Gradients flow through the 4-bit base but updates only touch the 16-bit adapters.
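
To get a feel for what 4 bits per weight means, here is a toy block quantizer. It is a deliberately simplified stand-in: it uses uniform signed levels with one absmax scale per block, whereas QLoRA's NF4 uses non-uniform levels. The sample weights are made up.

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats; this is what the forward pass would compute with."""
    return [c * scale for c in codes]

block = [0.12, -0.70, 0.33, 0.05, -0.21, 0.64, -0.02, 0.48]   # made-up weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(codes)
print(f"scale={scale:.4f}  max error={max_err:.4f}")
```

The frozen base lives in this lossy form. The 16-bit adapters train on top of it, which lets them compensate for some of the rounding error.
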

Component                  Memory
Base model (4-bit)         ~1.7 GB
LoRA adapters (16-bit)     ~100 MB
Optimizer (8-bit)          ~1.5 GB
Activations                ~2 GB
Total                      ~5.3 GB of 16 GB on T4

Memory Offloading
Fig 8. Memory offloading tiers: GPU → CPU → Disk. With 4-bit quantization, our 3B model fits entirely on GPU — no offloading needed.

6. Evaluation & RAG vs Fine-Tuning

Evaluation
Fig 9. Three evaluation approaches: automatic metrics (perplexity, BLEU, ROUGE), human eval (LMSYS Chatbot Arena), and task-specific validation.

While tracking human preference via systems like the LMSYS Chatbot Arena is necessary for foundational chat models, it’s the wrong approach for our engineering task today. For our function-calling fine-tune, we use a simpler metric: does the model output valid JSON with the correct function name and arguments?
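
That metric fits in a few lines. The sketch below is one possible scoring function, not the notebook's; `score_call` and the sample outputs are hypothetical.

```python
import json

def score_call(raw_output, expected):
    """Return (valid_json, name_ok, args_ok) for one model output."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return (False, False, False)
    if not isinstance(call, dict):
        return (True, False, False)
    name_ok = call.get("name") == expected["name"]
    args_ok = name_ok and call.get("arguments") == expected["arguments"]
    return (True, name_ok, args_ok)

expected = {"name": "get_weather", "arguments": {"city": "Tokyo"}}
outputs = [
    '{"name": "get_weather", "arguments": {"city": "Tokyo"}}',   # exact match
    '{"name": "get_weather", "arguments": {"city": "Kyoto"}}',   # wrong argument
    "The weather in Tokyo is usually mild.",                     # not JSON at all
]

scores = [score_call(o, expected) for o in outputs]
accuracy = sum(all(s) for s in scores) / len(scores)
print(scores, f"exact-match accuracy: {accuracy:.2f}")
```

Reporting the three flags separately shows whether failures come from broken JSON, the wrong tool, or the wrong arguments.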


Before we start coding, we need to rule out the most common alternative to fine-tuning: Retrieval-Augmented Generation (RAG). Think of RAG as giving the model an open-book exam — it queries an external vector database at runtime to fetch facts.

RAG vs Fine-Tuning
Fig 10. RAG retrieves changing facts at runtime. Fine-tuning teaches new behaviour. Function calling = new behaviour → fine-tuning.

Decision rule: If your data changes frequently → RAG. If you need the model to learn a new behaviour (structured output, tool use, persona) → Fine-tuning. We need fine-tuning.


7. What Is Function Calling?

Function calling is the backbone of every LLM agent system. Every tool-use system — ChatGPT plugins, Claude tools, LangChain agents — relies on the model generating structured function calls. Teaching a small model this skill makes it useful as a local, private agent.

What Is Function Calling
Fig 11. The model does NOT call the API — it generates the JSON. Your code does the rest.

Think of it like a restaurant order ticket: the LLM is the waiter taking your natural language request, but it doesn’t cook the food; it just writes down a highly structured ticket (the JSON) for the kitchen (your APIs). Your Python code then parses that JSON, executes the actual API call, and feeds the raw data back to the LLM for a final response.
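
The kitchen side of that analogy might look like the sketch below. The `get_weather` stub, its canned return value, and the `TOOLS` registry are all hypothetical; a real deployment would call an actual API here.

```python
import json

def get_weather(city, unit="celsius"):
    """Stand-in for a real API call; returns canned data."""
    return {"city": city, "temp": 21, "unit": unit}

TOOLS = {"get_weather": get_weather}   # registry: tool name -> Python callable

def dispatch(model_output):
    """Parse the model's JSON ticket and run the matching function."""
    call = json.loads(model_output)
    func = TOOLS.get(call["name"])
    if func is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return func(**call.get("arguments", {}))

ticket = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
result = dispatch(ticket)

# The result goes back to the model as a tool message for the final answer.
tool_message = {"role": "tool", "content": json.dumps(result)}
print(tool_message)
```

Looking the name up in an explicit registry, rather than evaluating the model's output, keeps the model from triggering anything you did not register.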

Function Calling Flow
Fig 12. The 6-step function calling workflow. Steps 2–3 (the JSON output) are what we fine-tune the model to generate.

8. The Training Data & The Goal

We use the Glaive Function Calling v2 dataset — 113K examples of user queries paired with the function calls the model should produce. Each example has three parts: system (available tools), user (natural language query), assistant (structured JSON output).

Training Data Example
Fig 13. One training example: System defines available tools → User asks naturally → Model outputs structured JSON.

After 100 training steps (~30 minutes on a free T4), the model goes from generic text completion to outputting structured JSON function calls. That’s the power of instruction fine-tuning + LoRA:

The Goal
Fig 14. Before: the base model rambles about weather. After: actionable JSON. 100 steps, 30 minutes, free GPU.

9. 🚀 Hands-On: The Complete Tutorial

Everything below runs on a free Google Colab T4. We use Unsloth for optimized training kernels (2× faster, 60% less VRAM) with TRL’s SFTTrainer for the training loop.

Tech Stack
Fig 15. Our stack: Colab T4 → bitsandbytes 4-bit → Unsloth → TRL SFTTrainer → Qwen 2.5 3B.

Step 1: Install Unsloth

One cell handles all dependencies. Unsloth installs PyTorch, Transformers, TRL, bitsandbytes, and PEFT automatically.

Install Code
%%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps \
    git+https://github.com/unslothai/unsloth.git

Step 2: Load Model with 4-bit Quantization

The key line is load_in_4bit=True. This triggers QLoRA’s 4-bit NF4 quantization on the base model, compressing 3B parameters from ~12 GB in FP32 down to ~1.7 GB.

Load Model
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",
    max_seq_length=2048,
    dtype=None,           # auto-detect: float16 on T4, bfloat16 on Ampere or newer
    load_in_4bit=True,    # QLoRA: 4-bit quantized base
)

print(f"Model loaded: {model.config._name_or_path}")
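# With load_in_4bit=True, quantized layers are stored packed (two weights per element),
# so this count can come out well below the nominal 3B.
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")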

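Why Qwen 2.5 3B? Strong quality for its size, instruction-tuned, with good multilingual support, and it fits comfortably on a T4 with 4-bit quantization. One caveat: unlike most other Qwen 2.5 sizes, which are Apache 2.0, the 3B checkpoint is released under the Qwen Research License, so check the terms before any commercial use.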

Step 3: Add LoRA Adapters

This is where the QLoRA diagram from earlier becomes code. We target all 7 attention and MLP projections. With rank r=16, we train roughly 1.5% of total parameters.

LoRA Config
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,                # Unsloth optimized: 0 is faster
    bias="none",
    use_gradient_checkpointing="unsloth",  # 60% less VRAM
    random_state=3407,
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")
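
As a cross-check on the printed numbers, the adapter size can be estimated by hand. The layer shapes below are assumptions about Qwen 2.5 3B (hidden size 2048, MLP width 11008, 36 layers, 256-wide key/value projections); confirm them against model.config before relying on the result.

```python
# Assumed Qwen 2.5 3B shapes; verify against model.config.
HIDDEN, INTERMEDIATE, LAYERS = 2048, 11008, 36
KV_DIM = 256          # assumed: 2 key/value heads x 128 head dim (grouped-query attention)
RANK = 16

# (input features, output features) of each targeted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """One adapter pair: (in x rank) plus (rank x out) weights."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * LAYERS
print(f"{total:,} adapter weights, {100 * total / 3e9:.2f}% of a nominal 3B")
```

Under these assumptions the adapters come to about 30M weights. The percentage the notebook prints can be higher than this estimate, because packed 4-bit layers report fewer elements than they logically hold, which shrinks the denominator.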

Step 4: Load the Function Calling Dataset

Glaive Function Calling v2 contains 113K examples of user queries paired with the expected function calls.

Load Dataset
from datasets import load_dataset

# Glaive Function Calling v2: 113k examples of user queries + function calls
dataset = load_dataset("glaiveai/glaive-function-calling-v2", split="train")
print(f"Dataset size: {len(dataset):,}")
print(f"Example keys: {list(dataset[0].keys())}")

# Preview one example
print("\n--- Example ---")
print(dataset[0]["system"][:200])
print("...")
print(dataset[0]["chat"][:300])

Step 5: Format into Chat Template

We parse the raw Glaive format (which uses USER:/ASSISTANT:/FUNCTION RESPONSE: markers) into Qwen’s chat template. The tokenizer handles the special tokens automatically.

Format Chat
def format_function_calling(example):
    """Convert Glaive format to Qwen chat template."""
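    # Glaive rows typically prefix the system text with "SYSTEM:" and end assistant
    # turns with a literal "<|endoftext|>"; strip both (no-ops if absent).
    system_prompt = example["system"].strip().removeprefix("SYSTEM:").strip()
    chat = example["chat"].replace("<|endoftext|>", "").strip()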

    messages = []
    messages.append({"role": "system", "content": system_prompt})

    # Split chat into USER/ASSISTANT/FUNCTION turns
    lines = chat.split("\n")
    current_role = None
    current_content = []

    for line in lines:
        line = line.strip()
        if line.startswith("USER:"):
            if current_role:
                messages.append({"role": current_role,
                                 "content": "\n".join(current_content).strip()})
            current_role = "user"
            current_content = [line[5:].strip()]
        elif line.startswith("ASSISTANT:"):
            if current_role:
                messages.append({"role": current_role,
                                 "content": "\n".join(current_content).strip()})
            current_role = "assistant"
            current_content = [line[10:].strip()]
        elif line.startswith("FUNCTION RESPONSE:"):
            if current_role:
                messages.append({"role": current_role,
                                 "content": "\n".join(current_content).strip()})
            current_role = "tool"
            current_content = [line[18:].strip()]
        elif line:
            current_content.append(line)

    if current_role and current_content:
        messages.append({"role": current_role,
                         "content": "\n".join(current_content).strip()})

    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_function_calling, num_proc=4)
print(dataset[0]["text"][:500])
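
Before mapping over 113K rows, it helps to test the turn-splitting logic on a tiny hand-written sample. The sketch below is a regex-based variant of the same idea, not the function above, and it needs no tokenizer. The sample chat is made up; it only mimics the USER:/ASSISTANT:/FUNCTION RESPONSE: markers.

```python
import re

MARKERS = {"USER:": "user", "ASSISTANT:": "assistant", "FUNCTION RESPONSE:": "tool"}

# Split just before any marker that starts a line; the lookahead keeps the marker in the piece.
TURN_RE = re.compile(r"^(?=USER:|ASSISTANT:|FUNCTION RESPONSE:)", re.MULTILINE)

def split_turns(chat):
    """Turn a marker-delimited transcript into a list of role/content messages."""
    messages = []
    for piece in TURN_RE.split(chat):
        piece = piece.strip()
        for marker, role in MARKERS.items():
            if piece.startswith(marker):
                messages.append({"role": role, "content": piece[len(marker):].strip()})
                break
    return messages

sample_chat = (
    "USER: What's the weather in Tokyo?\n\n"
    'ASSISTANT: {"name": "get_weather", "arguments": {"city": "Tokyo"}}\n\n'
    'FUNCTION RESPONSE: {"temp": 21}\n\n'
    "ASSISTANT: It is 21 degrees in Tokyo."
)

turns = split_turns(sample_chat)
for turn in turns:
    print(turn["role"], "->", turn["content"])
```

If the roles do not alternate the way you expect on a handful of real rows, fix the parser before spending GPU time.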

Step 6: Configure the Trainer

We use TRL’s SFTTrainer with sequence packing enabled — this packs multiple short examples into a single sequence for faster training.

Trainer Setup
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,               # Pack short examples together = faster
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch = 8
        warmup_steps=10,
        max_steps=100,              # ~30 min on T4, increase for better results
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)
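
The packing=True option above can be pictured as concatenating tokenized examples and cutting the stream into fixed-size blocks. This is a conceptual sketch only: TRL's actual packing strategy differs between versions, and the token ids, the EOS id of 0, and the block length of 4 are toy values.

```python
def pack_sequences(tokenized_examples, max_len, eos_id):
    """Concatenate examples (each followed by EOS) and cut into max_len-sized blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)
    # Keep only full blocks; the ragged tail is dropped in this sketch.
    n_full = len(stream) // max_len
    return [stream[i * max_len:(i + 1) * max_len] for i in range(n_full)]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]   # toy token ids
blocks = pack_sequences(examples, max_len=4, eos_id=0)
print(blocks)
```

Every block is completely full, so no compute is wasted on padding. The trade-off is that an example can be split across two blocks.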

Step 7: Hyperparameters

Training Arguments
Fig 16. Key hyperparameters. max_steps=100 is the minimum (~30 min). For production quality, increase to 500–1000 steps.
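
Two of these settings are easier to reason about as numbers: the effective batch size and the learning-rate curve. The sketch below reimplements a linear warmup and decay schedule by hand using the values from the trainer config. It is meant to mirror what lr_scheduler_type="linear" does, not to replace it.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=10, max_steps=100):
    """Linear warmup to base_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

effective_batch = 2 * 4                    # per-device batch x gradient accumulation
sequences_seen = effective_batch * 100     # packed sequences over max_steps=100
max_tokens_seen = sequences_seen * 2048    # upper bound: each sequence holds at most 2048 tokens

for step in (0, 5, 10, 55, 100):
    print(f"step {step:3d}: lr = {linear_schedule_lr(step):.2e}")
print(effective_batch, sequences_seen, max_tokens_seen)
```

At 100 steps the run sees 800 packed sequences, a small slice of the dataset. That is why more steps are recommended for production quality.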

Step 8: Train!

The moment of truth. Three lines of code, 30 minutes of GPU time.

Train
# Show memory before training
gpu_stats = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu_stats.name} ({gpu_stats.total_memory / 1024**3:.1f} GB)")

# Train
trainer_stats = trainer.train()

print(f"\nTraining complete!")
print(f"  Steps: {trainer_stats.global_step}")
print(f"  Loss:  {trainer_stats.metrics['train_loss']:.4f}")
print(f"  Time:  {trainer_stats.metrics['train_runtime']:.0f}s")

Step 9: Define Test Tools

Before testing inference, we define the tools our model should call — a JSON schema of available functions:

System Prompt
system_prompt = """You are a helpful assistant with access to the following functions:

[
  {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
      },
      "required": ["city"]
    }
  },
  {
    "name": "search_restaurants",
    "description": "Search for restaurants near a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string"},
        "cuisine": {"type": "string"},
        "price_range": {"type": "string", "enum": ["$", "$$", "$$$"]}
      },
      "required": ["location"]
    }
  }
]

To call a function, respond with a JSON object:
{"name": "function_name", "arguments": {...}}"""
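
A schema like this is also useful after generation, to check that a call's arguments fit before anything executes. The validator below is a deliberately small sketch covering only required keys, enums, and string types. `validate_arguments` is a hypothetical helper; a production system would more likely use a full JSON Schema library.

```python
def validate_arguments(arguments, parameters):
    """Return a list of problems; an empty list means the arguments fit the schema."""
    problems = []
    properties = parameters.get("properties", {})
    for key in parameters.get("required", []):
        if key not in arguments:
            problems.append(f"missing required argument: {key}")
    for key, value in arguments.items():
        spec = properties.get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}={value!r} not in {spec['enum']}")
        elif spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{key} should be a string")
    return problems

# Same shape as the get_weather entry in the system prompt.
weather_params = {
    "type": "object",
    "properties": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

print(validate_arguments({"city": "Tokyo"}, weather_params))
print(validate_arguments({"unit": "kelvin"}, weather_params))
```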

Step 10: Test the Fine-Tuned Model

Switch to inference mode and test with real queries. The model should now output clean JSON function calls instead of generic text.

Inference
FastLanguageModel.for_inference(model)  # Switch to fast inference mode

test_queries = [
    "What's the weather like in Tokyo?",
    "Find me some Italian restaurants in San Francisco under $$",
    "Is it cold in Berlin right now?",
]

for query in test_queries:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
    )
    response = tokenizer.decode(
        outputs[0][inputs.shape[-1]:], skip_special_tokens=True
    )

    print(f"\nQ: {query}")
    print(f"A: {response}")
    print("-" * 50)

Expected Output
Fig 17. Expected output: clean JSON function calls with correct function names and arguments.
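
In practice a model sometimes wraps the JSON in extra words, so it is worth extracting the call defensively rather than passing the raw response to json.loads. A minimal sketch using only the standard library follows; `extract_first_json` is a hypothetical helper, not part of the notebook.

```python
import json

def extract_first_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)   # parse starting at this brace
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass                                          # not valid from here; try the next brace
        start = text.find("{", start + 1)
    return None

response = 'Sure! {"name": "get_weather", "arguments": {"city": "Tokyo"}} Let me know.'
print(extract_first_json(response))
print(extract_first_json("I cannot help with that."))
```

A return value of None is a useful signal in its own right: it separates "the model chose not to call a tool" from "the model tried and produced broken JSON".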

Step 11: Save & Export

Save the LoRA adapter (~50 MB) or merge and export as GGUF for local inference with Ollama or llama.cpp.

Save Model
# Save LoRA adapter (small — ~50MB)
model.save_pretrained("qwen25-3b-function-calling-lora")
tokenizer.save_pretrained("qwen25-3b-function-calling-lora")
print("LoRA adapter saved!")

# Optional: merge and save full model in GGUF for llama.cpp / Ollama
# model.save_pretrained_gguf(
#     "qwen25-3b-fc", tokenizer, quantization_method="q4_k_m"
# )

# Optional: push to HuggingFace Hub
# model.push_to_hub("your-username/qwen25-3b-function-calling", token="hf_...")

10. Results — Before vs After

Before vs After
Fig 18. The base model rambles about weather. The fine-tuned model outputs actionable JSON. Same model, same weights — plus a 50 MB LoRA adapter.

After 100 training steps (~30 minutes on a free T4), the fine-tuned model answers with structured JSON function calls instead of free text. The LoRA adapter is only ~50 MB, so you can distribute it, version it, and swap it out without touching the base model weights.

Metric                Result
Training time         ~30 min on free Colab T4
VRAM used             5.3 / 16 GB
Parameters trained    1.5% (LoRA r=16)
Adapter size          ~50 MB
Output                Valid JSON with correct function names
Export                GGUF for Ollama / llama.cpp

What’s Next?

  • Increase steps: 500–1000 steps for production quality
  • Add validation: Write a validation loop that checks JSON validity and function name accuracy
  • Build an agent: Wrap the fine-tuned model in a LangChain/LlamaIndex agent with real API tools
  • Try other tasks: Text-to-SQL, JSON extraction, code generation — same technique, different dataset

11. Appendix: Full Colab Code

Copy each cell into a Google Colab notebook. Runtime: GPU → T4.

Cell 1: Install
%%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps \
    git+https://github.com/unslothai/unsloth.git

Cell 2: Load Model
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",
    max_seq_length=2048, dtype=None, load_in_4bit=True,
)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Cell 3: LoRA
model = FastLanguageModel.get_peft_model(
    model, r=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_alpha=16, lora_dropout=0, bias="none",
    use_gradient_checkpointing="unsloth", random_state=3407,
)

Cell 4: Dataset
from datasets import load_dataset
dataset = load_dataset("glaiveai/glaive-function-calling-v2", split="train")

Cell 5: Format
def format_function_calling(example):
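    # Strip Glaive's "SYSTEM:" prefix and literal "<|endoftext|>" markers (no-ops if absent).
    system_prompt = example["system"].strip().removeprefix("SYSTEM:").strip()
    chat = example["chat"].replace("<|endoftext|>", "").strip()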
    messages = [{"role": "system", "content": system_prompt}]
    lines = chat.split("\n")
    current_role, current_content = None, []
    for line in lines:
        line = line.strip()
        if line.startswith("USER:"):
            if current_role: messages.append({"role": current_role, "content": "\n".join(current_content).strip()})
            current_role, current_content = "user", [line[5:].strip()]
        elif line.startswith("ASSISTANT:"):
            if current_role: messages.append({"role": current_role, "content": "\n".join(current_content).strip()})
            current_role, current_content = "assistant", [line[10:].strip()]
        elif line.startswith("FUNCTION RESPONSE:"):
            if current_role: messages.append({"role": current_role, "content": "\n".join(current_content).strip()})
            current_role, current_content = "tool", [line[18:].strip()]
        elif line: current_content.append(line)
    if current_role: messages.append({"role": current_role, "content": "\n".join(current_content).strip()})
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}
dataset = dataset.map(format_function_calling, num_proc=4)

Cell 6: Trainer
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    dataset_text_field="text", max_seq_length=2048, dataset_num_proc=2, packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2, gradient_accumulation_steps=4,
        warmup_steps=10, max_steps=100, learning_rate=2e-4,
        fp16=not is_bfloat16_supported(), bf16=is_bfloat16_supported(),
        logging_steps=10, optim="adamw_8bit", weight_decay=0.01,
        lr_scheduler_type="linear", seed=3407, output_dir="outputs", report_to="none",
    ),
)

Cell 7: Train
trainer_stats = trainer.train()
print(f"Loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"Time: {trainer_stats.metrics['train_runtime']:.0f}s")

Cell 8: Inference
FastLanguageModel.for_inference(model)
system_prompt = """You are a helpful assistant with access to these functions:
[{"name":"get_weather","parameters":{"city":{"type":"string"},"unit":{"type":"string"}}},
 {"name":"search_restaurants","parameters":{"location":{"type":"string"},"cuisine":{"type":"string"}}}]
Respond with JSON: {"name":"...","arguments":{...}}"""

for q in ["Weather in Tokyo?", "Italian restaurants in SF?", "Cold in Berlin?"]:
    msgs = [{"role":"system","content":system_prompt},{"role":"user","content":q}]
    ids = tokenizer.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
    out = model.generate(input_ids=ids, max_new_tokens=256, temperature=0.1, do_sample=True)
    print(f"Q: {q}\nA: {tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)}\n")

Cell 9: Save
model.save_pretrained("qwen25-3b-function-calling-lora")
tokenizer.save_pretrained("qwen25-3b-function-calling-lora")