Category: LLMs from Scratch

LLMs from Scratch #004: Mixture of Experts (MoE) Models: The Architecture Powering 2025’s Best AI Systems

🎯 What You’ll Learn: This comprehensive guide takes you from MoE fundamentals to state-of-the-art implementations like DeepSeek V3. You’ll understand why sparse architectures outperform dense models at every compute scale, master the critical routing mechanisms that determine expert selection, and learn the training techniques that make these complex systems work. We’ll examine real benchmark results from Llama 4, Grok, and DeepSeek, explore load-balancing challenges and solutions, and walk through the complete evolution of DeepSeek’s…

LLMs from Scratch #003: Modern Transformer Architectures: A Deep Dive into Design Principles and Training

🎯 What You’ll Learn: In this comprehensive guide, we’ll explore the evolution of transformer architectures from the original “Attention is All You Need” paper to modern implementations. You’ll discover why today’s language models use specific design choices like RoPE position embeddings and SwiGLU activations, understand the trade-offs between serial and parallel layer arrangements, and learn how to make informed decisions about hyperparameters like head…

LLMs from Scratch #002: PyTorch Fundamentals: Building Efficient Language Models from Scratch

🎯 What You’ll Learn: In this comprehensive guide, we’ll explore the fundamental building blocks of PyTorch for language model development. You’ll learn how to account for memory usage across different floating-point representations, understand tensor operations and their computational costs, master efficient data movement between CPU and GPU, and develop the mindset of resource accounting that’s essential for training large-scale models. This is the practical foundation you need…

LLMs from Scratch #001: Introduction to LLMs and Tokenization

🎯 What You’ll Learn: In this comprehensive introduction to large language models, we’ll explore why efficiency at scale is just as critical as raw compute power, showing how algorithmic improvements have outpaced Moore’s Law by 44X. You’ll understand why the “bitter lesson” is misunderstood, learn the critical difference between small- and large-scale phenomena, and trace the fascinating evolution from Shannon’s entropy estimates through Google’s massive N-gram models…