Category: Other

LLM_log #012: Introduction to Diffusion Models — From Noise to Geometry to Sampling

Highlights: In this post we build a complete understanding of diffusion models from the ground up — what they are, how images are represented, how the network is trained, what it geometrically learns, and finally how we turn that geometry into samples using DDIM and DDPM. Every formula is accompanied by concrete numbers you can verify by hand. So let’s begin! Tutorial Overview: What Are Diffusion Models? How Images Are Represented The Denoiser Network Noise…

LLM_log #011: Diffusion Models — From Noise to Wolves, Training from Scratch

In this post we build a complete diffusion model from scratch — training a UNet on a custom dataset, implementing the full DDPM pipeline, and understanding the math that makes iterative denoising work. We cover noise schedules, the reparameterization trick, FID evaluation, and three diffusion objectives (ε, x₀, v). By the end you’ll have generated novel images from pure Gaussian noise, and understand why diffusion models overtook GANs as the dominant paradigm for image generation.…

LLM_log #006: Implementing GPT-2 from scratch – Raschka

Highlights: In this post, we build a complete GPT-2 model (124 million parameters) from scratch in PyTorch. We implement every component — layer normalization, GELU activations, the feed forward network, shortcut connections — and wire them into a transformer block that we stack 12 times to create the full architecture. By the end, you will have a structurally complete GPT model that can generate text token by token. We also weave in key insights from…

LLM_log #005: Implementing Attention Mechanisms — From Simplified Self-Attention to Multi-Head Attention

Highlights: In this post, we will implement four types of attention mechanisms step by step. We start with a simplified self-attention to build intuition, then move to self-attention with trainable weight matrices that forms the backbone of modern LLMs. Next, we add causal masking and dropout to enforce temporal order during text generation. Finally, we extend everything to multi-head attention — the workhorse behind GPT, Claude, and LLaMA. Every formula is accompanied by concrete numbers…

LLM_log #004 From Scratch: Working with Text Data — Embeddings for LLMs

Highlights: Before we can build or train a Large Language Model, we need to solve a fundamental problem — LLMs cannot process raw text. In today’s post, we’ll walk through the complete pipeline that converts human-readable text into numerical vectors that a neural network can work with. We’ll cover tokenization, vocabulary building, byte pair encoding, sliding window sampling, and how token and positional embeddings come together to form the final input to a GPT-like transformer.…