Understanding Large Language Models (LLMs) like LLaMA by Meta (Part 1)
Highlights: Large Language Models (LLMs) have become essential tools in Natural Language Processing, powering applications from chatbots to complex data analysis.
In today’s post, we’ll dive into the fascinating world of Large Language Models (LLMs) as explained in a lecture by Andrej Karpathy. We’ll study a notable model in this space: LLaMA (Large Language Model Meta AI), a groundbreaking LLM developed by Meta (formerly Facebook). We’ll also learn how LLMs in general, and this model by Meta in particular, are shaping the world of AI research. So let’s begin!
Tutorial Overview:
- A Look into LLaMA by Meta
- Training Large Language Models (LLMs)
- Training LLaMA
- LLaMA’s Transformer Architecture
- Next-Word Prediction Task
- The Dreaming Phenomenon
- Training ChatGPT
- Challenges faced by LLMs
1. A Look into LLaMA by Meta
LLaMA, or Large Language Model Meta AI, is a model developed by Meta that is known for its open-source availability, allowing the broader community to explore, experiment, and innovate with the technology. At the time of its release, the 70-billion-parameter version (LLaMA 2 70B) was one of the most powerful openly available models, sparking interest across the AI research landscape.
Have a look at the image below.
As shown above, the LLaMA model consists of two main files: a massive parameter file, weighing in at around 140 GB, and a lightweight C code file with about 500 lines to run it. This compact setup makes it accessible for researchers and developers who want to delve into the inner workings of LLMs and build customized applications.
Since the release of LLaMA, newer versions like LLaMA 3.2 have emerged, pushing the boundaries even further. However, LLaMA remains an influential model, illustrating how powerful open-source LLMs can democratize AI innovation.
Another fascinating aspect of LLaMA is its compact parameter storage. Each parameter in the model is stored in just two bytes, likely using a 16-bit floating-point format (float16). With 70 billion parameters, this storage approach results in a total model size of around 140 GB (70 billion multiplied by two bytes per parameter). This efficient format makes it feasible to work with such a large model on consumer hardware, albeit with significant storage requirements.
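To make the arithmetic concrete, here is a tiny back-of-the-envelope calculation in Python; the two-bytes-per-parameter figure is the float16 assumption from the paragraph above:

```python
# Rough size of a 70-billion-parameter model stored in float16 (2 bytes each).
num_parameters = 70_000_000_000      # 70 billion parameters
bytes_per_parameter = 2              # 16-bit floating point

total_gb = num_parameters * bytes_per_parameter / 1e9   # decimal gigabytes
print(f"Approximate parameter file size: {total_gb:.0f} GB")   # -> 140 GB
```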
Running LLaMA involves using an inference file, written in C, to execute the model. This file could just as easily be written in another language such as Python, since it simply implements the inference process: running the trained neural network (a transformer) forward on input text. Once you have these two key files, the model parameters and the inference code, you can run the model locally, turning your laptop into a powerful tool for natural language generation.
In practice, this setup allows you to engage with LLaMA as a generative model, capable of producing text responses. For example, one use case shared during the lecture involved prompting the model to write a poem about Scale AI, showcasing its ability to generate coherent, context-aware text. This simple setup—just two files—enables users to experiment with LLMs on their own devices, opening up a world of possibilities for creative and practical applications.
2. Training Large Language Models (LLMs)
Training LLaMA
Training a large language model like LLaMA requires an enormous amount of data and computational power. The model was trained on roughly 10 terabytes of text data—essentially a large chunk of the internet. This massive dataset helps the model learn language patterns, contexts, and structures across diverse topics, making it highly versatile in generating coherent text on a wide range of subjects.
However, unlike running the model locally on a PC for inference (generating responses), training LLaMA involves a much more intensive process. It requires powerful hardware and specialized infrastructure. In this case, thousands of GPUs were employed in data centers to process the vast dataset, with each GPU contributing to breaking down and understanding the text. This setup allows the model to process billions of parameters, updating them iteratively as it “learns” from the data. Training such a large model would be impossible without this immense processing power, making it feasible only for organizations with access to high-performance computing resources.
The scale of resources needed to train a model like LLaMA is staggering. Training this model on 10 terabytes of internet text data required powerful hardware working continuously for around 12 days, with an estimated cost of about $2 million. The resulting model can be thought of as a highly compressed version of the internet’s text, a sort of lossy compression with a roughly 100x reduction. Unlike lossless compression, this approach selectively retains meaningful patterns while discarding redundant details, resulting in a model that can generate coherent responses but doesn’t replicate the exact data it was trained on.
While these numbers seem immense, they are actually modest compared to the latest state-of-the-art models, where training costs can reach hundreds of millions of dollars. Today’s most advanced models require hardware and budgets that are over 10 times greater than what was used for LLaMA. However, once trained, running these models for inference is relatively inexpensive, making them accessible for various applications without ongoing high costs.
Before going into further details of training, let’s quickly understand the architecture of LLaMA’s transformer.
LLaMA’s Transformer Architecture
The image below illustrates the structure of the transformer model that underlies LLaMA’s design.
As described, LLaMA relies on a transformer architecture with layers of attention and feed-forward networks, which together hold the model’s 70 billion parameters. These components work together to process input text, predict the next word, and generate coherent responses. Exactly how knowledge is distributed across the parameters and how they collaborate to produce good predictions is still not fully understood. The transformer structure, as shown here, provides the backbone for LLaMA’s impressive language generation abilities, despite the inherent complexity of its internal parameter dynamics.
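To make the structure a bit more concrete, here is a heavily simplified sketch of one transformer block, written in PyTorch purely for illustration. It is not Meta’s implementation: the real LLaMA uses RMSNorm instead of LayerNorm, rotary positional embeddings, and a gated (SwiGLU) feed-forward layer, and the dimensions below are made up.

```python
import torch
import torch.nn as nn

class SimpleTransformerBlock(nn.Module):
    """A simplified decoder block: self-attention + feed-forward, each wrapped in a residual connection."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)          # LLaMA itself uses RMSNorm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(                        # LLaMA uses a gated SwiGLU variant instead
            nn.Linear(d_model, d_ff),
            nn.SiLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                   # residual connection around attention
        x = x + self.ff(self.ff_norm(x))   # residual connection around feed-forward
        return x

# A full model stacks many such blocks; their attention and feed-forward
# weights are where the tens of billions of parameters live.
block = SimpleTransformerBlock()
tokens = torch.randn(1, 10, 512)           # (batch, sequence length, embedding size)
print(block(tokens).shape)                 # torch.Size([1, 10, 512])
```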
Next-Word Prediction Task
A neural network like LLaMA, as shown in the image below, essentially functions as a predictive model: given a sequence of input words, it predicts the most probable next word in the sequence. For instance, if the input is “the cat sat on a,” the model might predict the word “mat” with a high probability, say 97%. This next-word prediction is the primary task the model is trained on, and it forms the foundation of LLM capabilities.
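In code terms, the network assigns a score (logit) to every word in its vocabulary, and a softmax turns those scores into probabilities. The snippet below is a toy illustration with made-up scores and a four-word vocabulary, not real model output:

```python
import math

# Hypothetical scores the model might assign to candidate next words
# after reading "the cat sat on a ..."
logits = {"mat": 9.2, "table": 5.7, "moon": 2.3, "banana": 0.4}

# Softmax: convert raw scores into a probability distribution
total = sum(math.exp(score) for score in logits.values())
probs = {word: math.exp(score) / total for word, score in logits.items()}

best = max(probs, key=probs.get)
print(best, round(probs[best], 2))   # -> mat 0.97
```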
There’s an intrinsic connection between prediction and compression in machine learning. If a model can accurately predict the next word in a sequence, it effectively understands the patterns in the text, which allows it to compress information. The entire training process for LLaMA is a next-word prediction task, which turns out to be a remarkably powerful objective. Through this, the model “learns” language structures and meaningful content that it compresses into its internal parameters.
Have a look at the image below.
Imagine the model reading a random Wikipedia page about a historical figure like Ruth Handler as in the image above. During training, it is presented with several words from the page, then tasked with predicting the next one. By repeating this process on massive datasets, the model builds an understanding of various subjects, learning contextual clues about facts such as who Ruth Handler was, where she was born, and what she achieved. All this information is encoded within the model’s parameters, effectively compressing vast amounts of knowledge into a manageable form that can generate relevant responses when prompted.
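Under the hood, this training objective is just next-token cross-entropy computed at every position: the target for each position is simply the token that follows it. Below is a minimal PyTorch sketch of that idea, using made-up token IDs and random logits as a stand-in for the real transformer:

```python
import torch
import torch.nn.functional as F

vocab_size = 32_000                                   # roughly the size of LLaMA's token vocabulary
token_ids = torch.tensor([[5, 812, 97, 3001, 46]])    # made-up token IDs for a snippet of text

inputs  = token_ids[:, :-1]    # the model reads tokens 0..n-2 ...
targets = token_ids[:, 1:]     # ... and must predict tokens 1..n-1 (the "next word" at each step)

# Stand-in for the transformer's output: one score per vocabulary word at each position
logits = torch.randn(inputs.shape[0], inputs.shape[1], vocab_size, requires_grad=True)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()   # during training, these gradients are what update the model's parameters
print(loss.item())
```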
The Dreaming Phenomenon
Now take a look at this image of a Neural Network’s ‘dream’ (yes, you read that right!).
The image above illustrates the concept of how large language models like LLaMA can “dream” or “hallucinate” internet documents. The network, after being trained on massive amounts of data, can generate realistic-looking content that mimics genuine documents. However, these generated outputs are often fabricated and not based on specific facts from the training data.
- Java Code Dream: The model generates what looks like Java code for managing a Field class with methods for adding, retrieving, and counting Card objects. While the syntax and structure appear correct, this code isn’t based on any real library or project. Instead, it’s an imitation, created based on the patterns the model learned from Java code it saw during training.
- Amazon Product Dream: In this example, the model generates a mock Amazon product listing for a book titled Hades Heroes by Maureen Fergus. It includes plausible details like an ISBN, format, page count, dimensions, publication date, and a brief description. While the output appears authentic, every detail is invented, a product of the model “dreaming” up what an Amazon listing might look like based on similar listings seen during training.
- Wikipedia Article Dream: Here, the model generates a Wikipedia-like entry about the “Blacknose dace,” a freshwater fish. The description is detailed and seemingly accurate, containing biological characteristics, habitat, and diet information. However, this too is a hallucination, synthesizing content that reads like a Wikipedia page without necessarily reflecting true facts.
This “dreaming” phenomenon illustrates how language models compress vast information into patterns and recreate plausible text based on them. This process can result in surprisingly accurate-seeming details, yet the content is often fabricated. It highlights the model’s nature as a predictive tool rather than a factual database, with outputs that reflect patterns rather than specific knowledge.
Training ChatGPT
Training a Large Language Model like ChatGPT involves more than just pre-training on massive datasets scraped from the internet. To create a truly useful and focused application, the model undergoes a second phase called fine-tuning or alignment, where it is shaped into a helpful assistant.
This process involves creating a tailored dataset specifically designed to improve the model’s ability to respond accurately and effectively to user queries. Rather than simply compressing internet knowledge into parameters, fine-tuning gives the model a directed purpose—like answering questions in a conversational manner.
To achieve this, organizations hire people to generate high-quality questions and answers that mimic real interactions between a user and an assistant. In the example shown, a user asks, “What does ‘monopsony’ mean in economics?” and the assistant provides a detailed, curated response. Initially, human writers generate these responses, creating a gold-standard dataset of around 100,000 conversations. Unlike the lower-quality, broad data used in pre-training, this focused dataset is high quality and specifically designed for interaction.
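The exact data format used internally isn’t public, but conceptually each fine-tuning example is just a human-written prompt paired with a human-written, gold-standard response. A hypothetical record (the field names are illustrative, not an official schema) might look like this:

```python
# One hypothetical fine-tuning example: a user question and the
# "gold standard" answer a hired labeler wrote for it.
example = {
    "prompt": "What does 'monopsony' mean in economics?",
    "response": (
        "A monopsony is a market with a single buyer and many sellers, "
        "which lets that buyer push prices, such as wages, below the "
        "competitive level."
    ),
}

# The full fine-tuning dataset is on the order of 100,000 such conversations,
# written and reviewed by people following detailed labeling instructions.
dataset = [example]   # in practice: a long, carefully curated list
```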
Through this fine-tuning stage, the model learns to format responses effectively, leverage the general knowledge gained in pre-training, and apply it in a way that aligns with human expectations. The result is an assistant model that can engage in coherent, relevant conversations, fine-tuned to meet the user’s needs. This phase transforms the model from a general-purpose language model into a practical, conversational assistant.
Let’s go through the process step by step.
Stage 1: Pre-Training
- Data Collection: Start by downloading a large dataset, roughly 10 terabytes of text data from sources across the internet.
- Computational Resources: Set up a cluster of around 6,000 GPUs to handle the processing requirements.
- Training Process: Compress the collected text data into a neural network by training on the GPUs. This process takes about 12 days and costs approximately $2 million.
- Output: The result of this stage is a base model, a general-purpose language model that has learned patterns, language structure, and knowledge from the data. This process typically occurs once a year.
Stage 2: Fine-Tuning
- Labeling Instructions: Develop specific instructions for labeling data to improve the model’s ability to respond in a structured, useful manner.
- High-Quality Data Collection: Employ people (or services like Scale AI) to create a dataset of around 100,000 high-quality question-and-answer pairs or comparisons. This dataset is more focused and crafted for the model’s intended applications.
- Fine-Tuning Process: Use the high-quality data to fine-tune the base model, which can take roughly one day (a minimal sketch of the core idea follows this list).
- Output: The result is an assistant model, optimized for user interaction and aligned to provide helpful, context-aware responses.
- Evaluation and Deployment: Conduct extensive evaluations to assess the model’s performance, then deploy it for public use.
- Monitoring and Iteration: Regularly monitor the model’s behavior to identify and address misbehaviors or areas for improvement. This feedback loop is iterative, with fine-tuning updates happening weekly to keep the model aligned with user expectations.
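Here is the minimal sketch referenced in the fine-tuning step above. It illustrates a common supervised fine-tuning trick under stated assumptions, not Meta’s or OpenAI’s actual recipe: the prompt and response are concatenated, and the cross-entropy loss is computed only on the response tokens, so the model learns to produce answers rather than to echo questions.

```python
import torch
import torch.nn.functional as F

vocab_size = 32_000
prompt_ids   = torch.tensor([11, 402, 87, 5000])     # made-up token IDs for the user question
response_ids = torch.tensor([903, 12, 77, 6001, 2])  # made-up token IDs for the assistant answer

token_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)   # shape (1, sequence length)
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]

# Ignore prompt positions in the loss: -100 is cross_entropy's "skip this target" marker
labels = targets.clone()
labels[:, : len(prompt_ids) - 1] = -100

# Stand-in for the base model's output logits
logits = torch.randn(1, inputs.shape[1], vocab_size, requires_grad=True)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100)
loss.backward()   # in real fine-tuning, this step updates the base model's weights
```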
This structured approach, from broad pretraining to targeted fine-tuning, transforms a general language model into a specialized assistant.
The image below provides a summary of the two main stages involved in training a model like ChatGPT.
In the end, developing an AI assistant is not just a matter of engineering but an ongoing process of refinement and adaptability. Each iteration of fine-tuning, monitoring, and addressing misbehaviors makes the model more responsive, precise, and aligned with user needs. This iterative improvement, where even small missteps feed back into the learning cycle, transforms a base model into a tailored, reliable assistant. As AI continues to evolve, it’s clear that a successful model isn’t static but constantly learning and adapting. This commitment to ongoing improvement ensures that the model remains relevant, accurate, and truly useful in the ever-changing landscape of human interaction.
Here’s a comparison of the various Chatbot LLMs that are in use today.
Before we conclude this post, let’s look at the scenarios in which LLMs can face a tough challenge.
3. Challenges Faced By LLMs
Have a look at this interesting image below.
The image above demonstrates a peculiar limitation in the way large language models, like LLaMA, handle information. Although these models build and maintain an internal “knowledge database” through their training on vast amounts of text, this knowledge can be inconsistent or incomplete.
The example, known as the “reversal curse,” highlights this issue. When asked, “Who is Tom Cruise’s mother?” the model correctly answers “Mary Lee Pfeiffer.” However, when the question is reversed to “Who is Mary Lee Pfeiffer’s son?” the model responds with “I don’t know.”
This inconsistency arises because, while the model can learn associations between entities, it doesn’t always handle relational information bidirectionally or logically. This limitation reveals the imperfect nature of language models’ “knowledge.” Unlike a structured database with clear relationships, LLaMA’s internal knowledge is based on patterns in text data, which may lack the structure needed for consistent bidirectional answers.
This was Part 1 of the post. We will talk more about LLMs in the next part. So stay tuned. Before we go, let’s quickly revise what we learnt today.
Understanding Large Language Models (LLMs) like LLaMA by Meta (Part 1)
- LLaMA, or Large Language Model Meta AI, is an open-source LLM whose 70-billion-parameter version was one of the most powerful models available at the time of its release
- Training LLMs requires a large amount of data and computational power
- LLaMA was trained on 10 TB of internet text data for about 12 days at a cost of roughly $2 million
- LLaMA’s neural network is trained on a next-word prediction task
- LLMs can even ‘dream’ or ‘hallucinate’ internet documents
- A challenge faced by LLMs is their inconsistency in answering relational questions in both directions (the “reversal curse”)
- Training a ChatGPT-like LLM involves two stages – Pre-Training and Fine-Tuning
Summary
Research on LLMs is progressing fast, so we need to keep up our own pace of learning, not just to catch up but to do our own research toward building next-generation models like LLaMA. This is the end of Part 1 of this post. We’ll be back soon with the continuation, where we’ll learn more about scaling in LLMs and other important topics. Catch you soon, dear friends! Have a good day and take care! 🙂