AI Education · 9 min read

What is a Large Language Model (LLM)? A Deep Technical Guide

LLMs are the engine behind ChatGPT, Claude, and Gemini. This guide explains exactly how they work — from transformers and tokenisation to attention mechanisms, fine-tuning, and why they sometimes hallucinate.

Gurpreet Singh
March 27, 2026

What Exactly is a Large Language Model?

A Large Language Model (LLM) is a neural network trained to predict the next token in a sequence of text. That single, deceptively simple objective — given everything written so far, what word comes next? — when scaled to hundreds of billions of parameters and trained on trillions of tokens of text, produces a system capable of reasoning, coding, summarising, translating, and holding complex conversations.

GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3, Mistral, and DeepSeek are all LLMs. They differ in size, training data, architecture details, and specialisations — but all share the same core mechanism.

The word "large" is doing real work in that name. GPT-3 had 175 billion parameters. GPT-4 is estimated at over 1 trillion. These numbers matter because they determine the model's capacity to memorise patterns and generalise to new situations.

The Transformer Architecture: The Engine Inside Every LLM

Every modern LLM is built on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. Before transformers, language models used Recurrent Neural Networks (RNNs) and LSTMs — architectures that processed text sequentially, one word at a time, which made them slow and poor at capturing long-range dependencies.

Transformers replaced sequential processing with self-attention — a mechanism that lets every token in a sequence attend to every other token simultaneously. This parallelism enabled training on vastly larger datasets and produced models that understand context at distances RNNs simply couldn't reach.

Tokenisation: How Text Becomes Numbers

Before any text enters the model, it must be converted into numbers. This process is called tokenisation. A tokeniser breaks text into subword units called tokens. The word "tokenisation" might become ["token", "isation"]. The sentence "I love AI" might become [40, 1842, 15360] — integer IDs looked up in a vocabulary table.

GPT-4 uses a Byte-Pair Encoding (BPE) tokeniser with a vocabulary of roughly 100,000 tokens. On average, 1 token ≈ 0.75 words in English. This is why API pricing is per-token, not per-word.
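To make the BPE idea concrete, here is a toy sketch of the core merge loop — repeatedly fusing the most frequent adjacent symbol pair into a new vocabulary entry. The three-word corpus and its frequencies are invented for illustration; production tokenisers operate on bytes and learn tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a toy corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, with a made-up frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # three BPE merge steps
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)  # frequent fragments like "we" and "wer" become single tokens
```

After a few thousand such merges on real text, common words become single tokens while rare words decompose into recognisable subword pieces.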

Each token ID is then mapped to a high-dimensional vector called an embedding — typically 768 to 12,288 floating-point numbers depending on the model size. These embeddings carry semantic meaning. "King" and "Queen" will have similar embedding vectors. "Bank" (financial) and "bank" (river) will have different embeddings once context is considered.
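Mechanically, the embedding step is just a row lookup in a learned matrix. A minimal sketch with made-up toy dimensions (a real model's table is roughly 100,000 × 768–12,288, and its values are learned during training, not random):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 8              # toy sizes for illustration
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned in practice

token_ids = np.array([40, 18, 63])        # hypothetical IDs from a tokeniser
embeddings = embedding_table[token_ids]   # simple row lookup, one vector per token
print(embeddings.shape)                   # (3, 8)
```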

Self-Attention: How the Model Understands Context

Self-attention is the core operation of the transformer. For every token in the input, the model computes three vectors: a Query (Q), a Key (K), and a Value (V). The attention score between two tokens is computed as the dot product of their Q and K vectors, scaled and passed through a softmax function.

The result is an attention weight — a number between 0 and 1 — representing how much each token should "attend to" every other token. The output for each token is then the weighted sum of all Value vectors. In plain English: the model learns which words to pay attention to when processing each word.
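The Q/K/V computation above can be sketched in a few lines of numpy. This is a single attention head with random toy weights (real models learn these matrices and add masking and positional information, both omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a (seq_len, d_model) input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of Value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one context-enriched vector per token
```

Row i of `weights` is exactly the "how much does token i attend to each other token" distribution described above.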

In the sentence "The bank was steep because it hadn't rained", the model uses attention to connect "bank" to "steep" and "rained" — learning from context that this is a riverbank, not a financial institution.

Modern LLMs use multi-head attention — running many attention operations in parallel with different learned weight matrices, allowing the model to attend to different aspects of the input simultaneously (syntax, semantics, coreference, etc.).

Feed-Forward Layers, LayerNorm, and Residual Connections

After the attention sub-layer, each transformer block passes tokens through a position-wise feed-forward network (FFN) — two linear layers with a non-linear activation (GELU or ReLU) between them. This is where much of the model's factual knowledge is believed to be stored.

Layer Normalisation (LayerNorm) stabilises training by normalising the activations across the feature dimension. Residual connections (adding the input of a sub-layer to its output) allow gradients to flow cleanly during backpropagation, enabling very deep networks to be trained without vanishing gradient problems.

A GPT-4-scale model stacks roughly 96–128 of these transformer blocks. Each token passes through every block sequentially, accumulating a richer contextual representation with each layer.
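The FFN sub-layer with its residual connection can be sketched as follows. This uses the pre-norm arrangement common in GPT-style models (LayerNorm before the sub-layer) and the tanh approximation of GELU; the weights and dimensions are toy placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's activations across the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn_sublayer(x, W1, b1, W2, b2):
    """Pre-norm position-wise feed-forward sub-layer with a residual connection."""
    h = gelu(layer_norm(x) @ W1 + b1)  # expand to the hidden dimension
    return x + h @ W2 + b2             # project back, then add the residual

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32      # real models use d_ff ≈ 4 × d_model
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = ffn_sublayer(x, W1, b1, W2, b2)
print(out.shape)  # (4, 8): same shape in, same shape out
```

Because input and output shapes match, dozens of these blocks can be stacked, with the residual path carrying information cleanly through all of them.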

Training: How an LLM Learns

Pre-training: Learning from the Entire Internet

Pre-training is the most expensive phase. The model is trained on a massive corpus of text — web pages, books, code, scientific papers, Wikipedia, forums. The training objective is next-token prediction (autoregressive language modelling): given tokens 1 through N, predict token N+1. The error between the prediction and the actual next token is computed via cross-entropy loss, and gradients are backpropagated through the entire network to update billions of parameters.
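The cross-entropy objective is simple to state in code: take the probability the model assigned to the actual next token at each position, and average the negative log. A minimal sketch with random toy logits:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def next_token_loss(logits, targets):
    """Average cross-entropy between predicted distributions and true next tokens."""
    probs = softmax(logits)
    # Probability the model assigned to each actual next token.
    p_true = probs[np.arange(len(targets)), targets]
    return -np.log(p_true).mean()

vocab_size = 10
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, vocab_size))  # model outputs for 5 positions
targets = np.array([3, 1, 4, 1, 5])        # actual next tokens at each position
print(next_token_loss(logits, targets))
```

A useful sanity check: a model that knows nothing (uniform logits) scores exactly ln(vocab_size); training pushes the loss below that baseline.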

Pre-training GPT-4 is estimated to have cost over $100 million in compute. It required thousands of A100 GPUs running for months. This is why most developers consume LLMs via APIs rather than training from scratch.

Instruction Fine-Tuning (SFT)

A raw pre-trained model is good at completing text — it will continue "The capital of France is" with "Paris" — but poor at following instructions. The second phase, Supervised Fine-Tuning (SFT), trains the model on a curated dataset of instruction-response pairs: {"instruction": "Summarise this article", "response": "…"}. This teaches the model to follow a user's intent, not just complete text.

RLHF: Teaching the Model What "Good" Means

Reinforcement Learning from Human Feedback (RLHF) is what turns a capable model into a helpful, harmless, honest assistant. Human raters rank multiple model outputs for the same prompt. A Reward Model is trained to predict which responses humans prefer. The LLM is then fine-tuned using PPO (Proximal Policy Optimisation) to maximise the reward model's score. This process is why GPT-4 refuses to help with harmful requests and tends to give balanced, hedged answers.

Context Window: The Model's Working Memory

The context window is the maximum number of tokens the model can process in a single forward pass — both input and output combined. GPT-4o supports 128,000 tokens (roughly 100,000 words). Gemini 1.5 Pro supports 2,000,000 tokens. Claude 3.5 supports 200,000 tokens.

Tokens within the context window are processed together via self-attention — the model can relate any token to any other token in the window. Tokens outside the context window are not accessible. This is why an LLM cannot "remember" a conversation from last week unless that conversation is re-injected into the context.

The context window is also why RAG (Retrieval-Augmented Generation) exists — you cannot fit 10,000 support documents into a context window, so you retrieve the most relevant ones first.
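In practice, chat applications enforce the window by trimming old messages to fit a token budget. A minimal sketch — here the token counter is a whitespace-split stand-in; a real system would count with the model's own tokeniser:

```python
def fit_to_context(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):   # walk backwards from the newest message
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break                     # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))       # restore chronological order

history = ["hello there", "how are you today", "fine thanks", "tell me about LLMs"]
print(fit_to_context(history, max_tokens=8))
# → ["fine thanks", "tell me about LLMs"]
```

Anything trimmed this way is simply invisible to the model — which is exactly why last week's conversation must be re-injected to be "remembered".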

Why Do LLMs Hallucinate?

Hallucination — generating confident, plausible-sounding but factually incorrect text — is the most significant failure mode of LLMs. It happens for several reasons:

  • Knowledge cutoff: The model's training data ends at a fixed date. It cannot know about events after that date.
  • Token probability: The model generates the statistically most likely next token, not the factually correct one. If a plausible-sounding fake fact has high probability in the training distribution, the model will generate it.
  • No retrieval: The model has no internet connection or database. All knowledge is "baked in" to the weights during training — and weights are lossy compressions of text, not perfect memories.
  • Instruction following tension: The RLHF process rewards helpfulness. A model that says "I don't know" is often scored lower by human raters than one that gives a confident (wrong) answer — inadvertently training the model to confabulate.

Solutions include: Retrieval-Augmented Generation (RAG), grounding outputs in retrieved documents; temperature control, reducing randomness in generation; chain-of-thought prompting, forcing the model to reason step by step; and tool use, allowing the model to call external APIs for real-time facts.

Inference: How the Model Generates Text

During inference, the model generates text one token at a time in an autoregressive loop. At each step, it processes the entire context (input + tokens generated so far), computes a probability distribution over the entire vocabulary (~100K tokens), and samples the next token. This continues until the model generates a stop token or reaches the maximum output length.

Temperature controls randomness: temperature 0 always picks the highest-probability token (deterministic, consistent); temperature 1.0 samples from the raw distribution (creative, varied); temperature >1.0 is increasingly random and incoherent.

Top-p (nucleus) sampling restricts sampling to the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9). This prevents very low-probability tokens from being sampled while preserving natural variety.
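Temperature scaling and top-p sampling compose naturally in one sampling step. A sketch over a toy five-token vocabulary (the logits are invented for illustration):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling."""
    rng = rng or np.random.default_rng()
    if temperature == 0:                        # greedy: always pick the argmax
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]             # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1    # smallest set with cum. prob ≥ p
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()   # renormalise inside the nucleus
    return int(rng.choice(nucleus, p=p))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_next_token(logits, temperature=0))           # deterministic argmax
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
```

Lowering the temperature sharpens the distribution before the cutoff is applied, so the two controls interact: low temperature plus low top-p yields very conservative output.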

Fine-Tuning vs Prompting vs RAG

There are three main ways to customise LLM behaviour for a specific use case:

  • Prompting: Provide instructions and examples in the context window. No training required. Works well for general tasks but uses context window space and costs tokens per request.
  • Fine-Tuning: Further train the model on a domain-specific dataset. Produces a specialised model that "knows" your domain implicitly. Requires compute and data. Cannot update knowledge in real time.
  • RAG (Retrieval-Augmented Generation): Retrieve relevant documents at inference time and inject them into the context. Allows the model to access up-to-date, proprietary knowledge without re-training. Best for most business applications.
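The RAG pattern reduces to two steps: retrieve relevant documents, then inject them into the prompt. A minimal sketch using keyword overlap as a stand-in for the embedding-similarity search a real system would use; the documents are invented examples:

```python
def retrieve(query, documents, k=2):
    """Toy retrieval by keyword overlap; real systems rank by embedding similarity."""
    query_words = set(query.lower().split())
    def score(doc):
        return len(query_words & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:k]

def build_prompt(query, documents):
    """Inject the retrieved documents into the prompt sent to the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is open Monday to Friday.",
    "Refunds require the original receipt.",
]
print(build_prompt("how long do refunds take", docs))
```

The model then answers from the injected context rather than its frozen weights, which is what keeps answers current and grounded.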

Open-Weight vs Closed-Source Models

Closed-source models (GPT-4, Claude, Gemini) are only accessible via API. You pay per token, cannot inspect the weights, and are subject to the provider's terms of service and rate limits.

Open-weight models (Llama 3, Mistral, Mixtral, DeepSeek, Phi-3) release the model weights publicly. You can run them on your own hardware, fine-tune them, and serve them without per-token API costs. This is ideal for businesses with sensitive data, high-volume use cases, or compliance requirements that prevent sending data to third-party APIs.

Running Llama 3 70B locally on an H100 GPU server replaces per-token API fees with fixed hardware and operating costs — at high volume, this can save hundreds of thousands of dollars annually.

Practical Takeaway

LLMs are not magic. They are large, well-trained pattern-matching systems that generate statistically likely continuations of text. Understanding their architecture — transformers, attention, tokenisation, context windows — makes you a much more effective developer when building AI systems. You know why hallucinations happen (token probability, no retrieval), how to fix them (RAG, grounding), and how to control output quality (temperature, system prompts, fine-tuning).

In my work building AI CRMs, RAG chatbots, and automation systems, LLMs are the reasoning engine. But they are only one component. The next articles in this series cover RAG, Vector Databases, AI Agents, and Tool Use — the infrastructure that makes LLMs useful in production.

#LLM #GPT-4 #Transformers #AIFundamentals #MachineLearning #OpenAI #Claude #AttentionMechanism
Gurpreet Singh

Senior Full Stack Developer — Laravel, Vue.js, Nuxt.js & AI. Available for freelance projects.

Hire Me for Your Project
