
A Guide to Pre-training Large Language Models


Abstract Algorithms · 14 min read

TLDR: Pre-training is the phase where an LLM learns "Language" and "World Knowledge" by reading petabytes of text. It uses Self-Supervised Learning to predict the next word in a sentence. This creates the "Base Model" which is later fine-tuned.


📖 The Library Metaphor: What Pre-training Actually Does

Imagine teaching a child to read.

  • Pre-training: You lock the child in a library for ten years. They read every book — grammar, history, math, code, recipes. They absorb the structure and content of human knowledge. But they have no social skills; they can't follow instructions or hold a polite conversation.
  • Fine-tuning (next step): You hire a tutor to teach manners ("don't say harmful things") and specific tasks ("summarize this document").

Pre-training creates the Base Model — a powerful but raw artifact. Fine-tuning shapes it into a product (ChatGPT, Claude, Gemini).


🔍 Deep Dive: Pre-Training Fundamentals

Before an LLM can write poetry, translate languages, or explain code, it needs to understand the basic patterns of human text. Pre-training is the process of building that foundational understanding from scratch — no hand-labeled data required.

Self-supervised learning is the key insight that makes this scale. Instead of asking humans to annotate millions of examples, the model creates its own training signal: given the words so far, predict the next one. The labels are already in the data — you just cover up the next word and ask the model to guess it. Correct it when it's wrong. Repeat for trillions of tokens.

Tokenization turns raw text into sequences of integers the model can process. Most modern LLMs use Byte-Pair Encoding (BPE) or SentencePiece, which break text into subword units. The word "unbelievable" might become three tokens: un, believ, able. A typical vocabulary has 32,000–128,000 unique tokens — enough to cover most languages, programming syntax, and scientific notation without exploding in size.
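A minimal sketch of how BPE learns its merge rules, assuming a toy whitespace-split corpus (real tokenizers such as SentencePiece add byte-level fallback, pretokenization rules, and a much larger merge budget):

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a tuple of symbols -> frequency in the corpus
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(corpus, num_merges):
    # Start from single characters; repeatedly merge the most frequent pair.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(best, words)
    return merges

merges = learn_bpe("low lower lowest low low", num_merges=3)
print(merges)
```

Frequent character sequences ("lo", then "low") get merged first, which is exactly why common words end up as single tokens while rare words split into subwords.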

The training corpus is the collection of text the model learns from. The mix and quality of sources shapes model strengths more than almost any other design decision:

| Source | Typical share | What it contributes |
| --- | --- | --- |
| Common Crawl (web text) | ~60–80% | Broad language coverage, diverse topics |
| Books and long-form writing | ~5–15% | Multi-paragraph coherence and reasoning |
| GitHub code repositories | ~5–10% | Programming ability and logical structure |
| Wikipedia / arXiv / papers | ~5–10% | Factual accuracy and technical depth |

A carefully filtered 300B-token corpus will produce a stronger model than a carelessly collected 3T-token one. Data curation is a competitive differentiator.


📊 The Pre-Training Workflow: From Raw Data to Base Model

Pre-training follows a structured pipeline. Each stage is essential — skipping or rushing any one of them measurably degrades the final model quality.

graph TD
    A[Raw Text Sources\nWeb · Books · Code · Papers] --> B[Deduplication & Quality Filtering]
    B --> C[Tokenization\nBPE / SentencePiece]
    C --> D[Packed Sequences\nFill Context Windows]
    D --> E[GPU / TPU Cluster\nForward Pass + Backpropagation]
    E --> F{Checkpoint?}
    F -->|Every N steps| G[Save Checkpoint to Storage]
    F -->|Continue training| E
    G --> H[Base Model]

What each stage does:

  • Dedup + filter: Remove near-duplicate web pages and low-quality HTML. Training on repeated text causes the model to memorize rather than generalize. Quality filtering is often more impactful than simply adding more raw tokens.
  • Tokenize: Convert text into integer token IDs using BPE; pack multiple shorter sequences together to fill each context window completely, maximizing GPU utilization.
  • Train: Run the transformer forward pass to produce token predictions, compute cross-entropy loss against the true next tokens, and backpropagate gradients to update all model weights.
  • Checkpoint: Save model weights to persistent storage every few thousand steps. Multi-week training runs on thousands of GPUs are prone to hardware failures; checkpointing is the safety net that prevents catastrophic loss of progress.
  • Base model: The final artifact — a transformer whose weights encode grammar, world facts, code patterns, and reasoning structures absorbed from the entire corpus.
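The packing step above can be sketched in a few lines of pure Python, assuming documents are already tokenized and using a hypothetical EOS token id of 0:

```python
def pack_sequences(sequences, context_len, eos_id=0):
    # Concatenate documents with an EOS separator, then slice the stream into
    # fixed-length windows so every training example is completely full.
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)
    # Drop the trailing remainder that doesn't fill a whole window.
    n_windows = len(stream) // context_len
    return [stream[i * context_len:(i + 1) * context_len] for i in range(n_windows)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
windows = pack_sequences(docs, context_len=4)
print(windows)
```

Note that a window can straddle a document boundary; the EOS token tells the model where one document ends and the next begins, so no context capacity is wasted on padding.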

🔢 Next-Token Prediction: The Self-Supervised Training Signal

The entire pre-training game is one question repeated billions of times:

"Given the text so far, what is the most likely next token?"

Input:   "The capital of France is"
Target:  "Paris"

This is self-supervised because the labels are already in the data — you just mask the next word. No human annotation needed.

The Loss Function: Cross-Entropy

$$L = -\sum_{t} \log P(x_t \mid x_{<t})$$

  • $x_t$: the correct next token at step $t$
  • $x_{<t}$: all preceding tokens
  • $P(x_t \mid x_{<t})$: the model's probability for the correct token

The model is penalized for assigning low probability to the correct next word. Minimizing $L$ over trillions of tokens forces the model to learn grammar, facts, and reasoning patterns.
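A tiny worked example makes the penalty concrete. Assuming a hypothetical three-word vocabulary ["Paris", "London", "banana"], the loss for one prediction is just softmax followed by a negative log:

```python
import math

def cross_entropy(logits, target_index):
    # Softmax over the vocabulary, then negative log-probability of the target.
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    p_target = exps[target_index] / sum(exps)
    return -math.log(p_target)

# Correct next token is "Paris" (index 0).
confident = cross_entropy([5.0, 1.0, -2.0], target_index=0)  # high P("Paris")
uncertain = cross_entropy([0.1, 0.0, 0.0], target_index=0)   # near-uniform
print(confident, uncertain)
```

A confident correct prediction costs almost nothing, while a near-uniform guess costs about log(vocabulary size); summed over trillions of tokens, this gradient pressure is what carves grammar and facts into the weights.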

📊 Next-Token Prediction Training Loop

sequenceDiagram
    participant D as Data Loader
    participant T as Tokenizer
    participant M as Transformer
    participant L as Loss Function
    participant O as Optimizer

    D->>T: Raw text sequence
    T->>M: Token IDs [x1, x2, ..., xN]
    M->>M: Forward pass (attention layers)
    M->>L: Predicted logits for each position
    L->>L: Cross-entropy vs true next token
    L->>O: Backpropagate gradients
    O->>M: Update all weights
    M->>D: Ready for next batch

⚙️ The Data Pipeline: From the Web to a Training Run

flowchart LR
    A[Common Crawl\nBooks3 / GitHub / arXiv] --> B[Deduplication]
    B --> C[Quality Filtering]
    C --> D[Tokenization]
    D --> E[Packed Sequences\n2K–128K tokens]
    E --> F[Training Shards\non Object Storage]
    F --> G[GPU Cluster]

| Stage | What happens | Why it matters |
| --- | --- | --- |
| Deduplication | Remove near-duplicate pages | Prevents memorization of repeated text |
| Quality filter | Remove boilerplate, low-quality HTML | Improves token efficiency |
| Tokenization | BPE / SentencePiece | Compresses text; handles rare words |
| Packing | Fill context windows to capacity | Maximizes GPU utilization |

Training data typically includes Common Crawl (web text), Books3, GitHub code, arXiv papers, and Wikipedia. The mix ratio shapes model strengths.
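The deduplication stage can be sketched with exact-match hashing over normalized text. This is a deliberate simplification: production pipelines typically use MinHash/LSH to also catch *near*-duplicates, but the structure is the same.

```python
import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences
    # don't defeat exact-match deduplication.
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents):
    # Hash each normalized document; keep only the first copy of each hash.
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "The quick brown fox.",
    "the   quick brown FOX.",   # same content, different formatting
    "A completely different page.",
]
print(deduplicate(docs))
```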


🧠 Deep Dive: Inside the Training Loop (Loss, Gradients, and Checkpoints)

The training loop looks like this in pseudocode:

step = 0
for batch in training_data:
    tokens = tokenize(batch)
    logits = model(tokens[:-1])               # predict at every position
    loss = cross_entropy(logits, tokens[1:])  # compare to the true next tokens
    loss.backward()                           # compute gradients
    optimizer.step()                          # update weights
    optimizer.zero_grad()
    step += 1
    if step % checkpoint_interval == 0:
        save_checkpoint(model)

In practice, training runs on thousands of GPUs or TPUs for weeks to months, using advanced parallelism strategies (data parallelism, tensor parallelism, pipeline parallelism).
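Alongside these parallelism strategies, large runs rely on gradient accumulation to fit big effective batch sizes onto limited memory: gradients from several micro-batches are summed before a single optimizer step. A toy sketch with a hypothetical 1-D linear model shows that the accumulated gradient matches the full-batch gradient exactly:

```python
def grad(w, x, y):
    # d/dw of the squared error (w*x - y)^2 for one example
    return 2 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5

# One big batch: average gradient over all four examples.
big = sum(grad(w, x, y) for x, y in data) / len(data)

# Gradient accumulation: two micro-batches of two, gradients summed and then
# averaged over the TOTAL example count before the optimizer step.
acc = 0.0
for micro in (data[:2], data[2:]):
    acc += sum(grad(w, x, y) for x, y in micro)
acc /= len(data)

print(big, acc)  # identical values
```

This equivalence is why frameworks can trade memory for steps without changing the optimization trajectory (up to optimizer-state details).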


🧭 Decision Guide: What a Base Model Can and Cannot Do

| Can do | Cannot do |
| --- | --- |
| Complete text in context | Follow instructions reliably |
| Summarize if prompted cleverly | Refuse harmful requests |
| Write code that is syntactically plausible | Admit when it doesn't know |
| Translate languages | Have a consistent helpful persona |

A base model will happily continue any text you give it — including harmful content. Fine-tuning with RLHF or SFT shapes it into a helpful, harmless assistant.


⚖️ Trade-offs & Failure Modes: Cost, Carbon, and Scaling

Training a frontier-scale LLM (GPT-4, LLaMA 3 70B) requires:

  • Compute: thousands of H100 GPUs running for months
  • Cost: $50M–$100M+ per frontier run
  • Energy: significant carbon footprint

The key trade-offs:

  • Data scale vs data quality: more tokens help, but noisy corpora have diminishing returns
  • Larger model vs smaller but better-trained: a 7B model trained on well-filtered data can outperform a poorly trained 70B
  • Pre-training breadth vs fine-tuning depth: broad pre-training creates a flexible base; fine-tuning sharpens it for specific tasks

Few organizations can afford to pre-train from scratch. Most practitioners work with open base models (LLaMA, Mistral, Qwen) and apply LoRA fine-tuning.

📊 Pre-Training Data Pipeline

flowchart LR
    Web["Raw Web\n(Common Crawl)"]
    Books["Books3 / arXiv\n/ GitHub / Wiki"]
    Dedup["Near-Duplicate\nRemoval"]
    Filter["Quality Filter\n(heuristics + classifiers)"]
    Mix["Domain Mix\n(web 70%, code 15%...)"]
    Tok["Tokenization\n(BPE / SentencePiece)"]
    Pack["Pack Sequences\n(fill context windows)"]
    Shards["Training Shards\n(object storage)"]
    Train["GPU Cluster\nForward + Backprop"]

    Web --> Dedup
    Books --> Dedup
    Dedup --> Filter --> Mix --> Tok --> Pack --> Shards --> Train

🌍 Real-World Applications: Where Pre-trained Models Power the World

Pre-trained language models are the invisible backbone of dozens of products in use today. The same foundational technique — next-token prediction on massive corpora — powers everything from consumer chat apps to scientific research tools.

| Product / System | Pre-trained base | What it enables |
| --- | --- | --- |
| ChatGPT / GPT-4 | OpenAI internal base model | Conversational AI, coding, long-form writing |
| Claude 3 | Anthropic base model | Safety-focused long-context assistant |
| Gemini 1.5 | Google DeepMind base | Multimodal: text, images, audio, video |
| GitHub Copilot | Codex / GPT-4 family | In-editor code completion and generation |
| LLaMA 3 / Mistral | Open-weight base models | Community fine-tuning and research platform |
| AlphaFold 2 | Pre-trained on protein sequences | 3D protein structure prediction |

Beyond chat assistants, pre-trained models drive progress across many fields:

  • Search engines: Google and Bing use LLMs to improve query understanding and surface direct answers within results.
  • Legal and finance: Domain-specific models read and summarize contracts, regulatory filings, and earnings calls in seconds.
  • Drug discovery: Models pre-trained on biochemical literature assist researchers in generating and filtering hypotheses.
  • Education: Tutoring tools use pre-trained bases fine-tuned for pedagogy — adapting explanations to the student's level.

The pattern is consistent: pre-train broadly on diverse data, fine-tune narrowly for a specific task. The expensive, reusable asset is always the base model.


🧪 Practical Considerations for Practitioners

Here is what pre-training means for your day-to-day work as a practitioner:

You almost certainly will not pre-train from scratch. Training a frontier model requires thousands of H100 GPUs for weeks and costs $50M–$100M+. Even well-funded startups typically start from an open base model and fine-tune from there.

Open base models available today give you a strong starting point at zero training cost:

  • LLaMA 3 8B / 70B (Meta): Strong general-purpose base; commercially licensed for most use cases.
  • Mistral 7B / Mixtral 8x7B: Efficient architectures with excellent performance-per-compute ratio.
  • Qwen 2.5 (Alibaba): Strong multilingual capabilities and coding performance.

Your practical levers after choosing a base model:

| Approach | Compute needed | When to use |
| --- | --- | --- |
| Prompt engineering | Inference only | Task is well-defined; no labeled training data available |
| LoRA fine-tuning | 1–2 consumer GPUs | Custom tone, domain vocabulary, specific task style |
| Full fine-tuning | 4–16 GPUs | Deep behavior change with a large labeled dataset |
| Pre-training from scratch | 1,000+ GPUs | Novel domain not covered by any existing model |

The LoRA shortcut: LoRA (Low-Rank Adaptation) freezes the base model weights and trains tiny adapter matrices inserted into each transformer attention layer. This reduces the number of trainable parameters by 100–10,000× while preserving most of the fine-tuning quality. It is the most practical approach for teams without a large GPU budget.
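A back-of-the-envelope sketch shows where the savings come from, assuming a single hypothetical 4096×4096 attention projection and rank-8 adapters:

```python
def lora_param_counts(d_model, rank):
    # Full fine-tuning updates the whole d_model x d_model projection matrix.
    full = d_model * d_model
    # LoRA instead trains two low-rank factors: B (d_model x rank) and
    # A (rank x d_model), whose product approximates the weight update.
    lora = 2 * d_model * rank
    return full, lora

full, lora = lora_param_counts(d_model=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```

For this one matrix the trainable-parameter count drops by a factor of 256; applied across every attention layer of a 7B model, that is the difference between needing a GPU cluster and needing one consumer card.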


🛠️ Hugging Face Transformers: The Trainer API and DataCollatorForLanguageModeling

Hugging Face Transformers is the standard open-source toolkit for working with pre-trained language models — it provides AutoModel classes for every major architecture, Trainer for training loops, and data utilities like DataCollatorForLanguageModeling that handle the causal LM training objective (next-token prediction with automatic label shifting) without boilerplate.

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

model_name = "gpt2"   # swap to mistralai/Mistral-7B for a larger pre-training run

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)

# Load and tokenize a text corpus (e.g., domain-specific pre-training data)
raw_dataset = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])

# DataCollatorForLanguageModeling:
# - mlm=False → causal LM mode (next-token prediction, not masked LM)
# - automatically shifts labels by one position — no manual label construction
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./pretrained-output",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=5e-5,
    bf16=True,
    logging_steps=100,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)

trainer.train()
trainer.save_model("./pretrained-output/final")

DataCollatorForLanguageModeling with mlm=False is the key piece — it constructs the (input_ids, labels) pairs for causal next-token prediction automatically, where labels[i] = input_ids[i+1] across each sequence. This is the same training objective described in the cross-entropy section above, just abstracted into two lines.
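The shift itself is easy to see in pure Python. The token IDs below are made up for illustration; in practice the collator and model construct these pairs internally, but the objective is conceptually equivalent to:

```python
input_ids = [464, 3139, 286, 4881, 318, 6342]   # hypothetical token IDs

# Causal LM pairs: at each position the model sees the prefix and is
# scored on the NEXT token -- i.e. labels are input_ids shifted left by one.
inputs = input_ids[:-1]
labels = input_ids[1:]

for x, y in zip(inputs, labels):
    print(f"see {x} -> predict {y}")
```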

For a full deep-dive on Hugging Face Transformers' Trainer API and pre-training pipelines, a dedicated follow-up post is planned.


📚 Lessons from Pre-Training at Scale

Years of large-scale pre-training experiments have surfaced insights that are not obvious from first principles — and that matter for anyone building on top of these models:

Scaling laws are predictable. Kaplan et al. (2020) showed that validation loss decreases as a smooth power law with compute, data size, and parameter count. You can reliably forecast how much a larger model will improve before committing to the expensive training run.

The Chinchilla lesson: you are probably undertraining. The 2022 Chinchilla paper (Hoffmann et al.) showed that most pre-2022 LLMs used too many parameters relative to their training token count. The compute-optimal recipe is roughly 20 tokens per parameter: a 7B model should train on ~140B tokens. Well-trained smaller models routinely outperform larger models that were trained on too little data.
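The 20-tokens-per-parameter heuristic is easy to turn into a sanity check when sizing a training run:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # Rough compute-optimal token budget from the Chinchilla heuristic;
    # the 20x ratio is an approximation, not an exact law.
    return n_params * tokens_per_param

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B params -> ~{chinchilla_optimal_tokens(n) / 1e9:.0f}B tokens")
```

Run against the 7B example in the text, this reproduces the ~140B-token budget; for a 70B model it implies roughly 1.4T tokens, which is why modern open models train on multi-trillion-token corpora.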

Data quality beats data quantity. Filtering noisy web text, deduplicating aggressively, and curating high-quality sources (books, code, peer-reviewed papers) often produces better results than simply dumping more raw tokens into training. LLaMA 3's heavily filtered corpus approach demonstrated this at scale.

Emergent capabilities appear suddenly. Some abilities — multi-step arithmetic, in-context few-shot learning, chain-of-thought reasoning — are nearly absent at small scale and appear almost discontinuously at certain parameter or data thresholds. This emergence phenomenon remains an active research area and has direct implications for evaluating safety before deployment.

A base model is not safe to deploy directly. It will complete any prompt without guardrails, including requests for harmful content. Reinforcement Learning from Human Feedback (RLHF) and Supervised Fine-Tuning (SFT) are required to transform a raw base model into a safe, helpful product.


📌 TLDR: Summary & Key Takeaways

  • The loss is cross-entropy; minimizing it forces the model to learn grammar, facts, and reasoning.
  • The result is a Base Model — capable but unaligned. Fine-tuning is required for product use.
  • The data pipeline (dedup → filter → tokenize → pack) is as important as the model architecture.
  • Most practitioners never pre-train from scratch; they fine-tune existing open models.

📝 Practice Quiz

  1. Why is next-token prediction described as self-supervised learning?

    • A) Human annotators label each sentence as grammatically correct or incorrect
    • B) The training labels come from the data itself — the next token is the target
    • C) The model generates its own synthetic training data from scratch
    • D) A teacher model provides soft probability labels for each token

    Correct Answer: B — No human labeling is needed. You mask the next token and the model learns to predict it. The supervision signal is already embedded in the raw text itself.

  2. What does Byte-Pair Encoding (BPE) do during the pre-training pipeline?

    • A) Compresses model weights to reduce GPU memory requirements
    • B) Removes duplicate sentences from the training corpus
    • C) Splits text into subword units and maps them to integer IDs
    • D) Adjusts the learning rate dynamically based on token frequency

    Correct Answer: C — BPE is a tokenization algorithm that breaks words into subword units (e.g., "unbelievable" → "un", "believ", "able") and assigns each unit a unique integer that the model can process numerically.

  3. According to the Chinchilla scaling paper, what is the approximate compute-optimal training ratio?

    • A) 1 token per parameter
    • B) 5 tokens per parameter
    • C) 20 tokens per parameter
    • D) 200 tokens per parameter

    Correct Answer: C — Hoffmann et al. (2022) showed that ~20 tokens per parameter is compute-optimal. A 7B-parameter model should train on roughly 140B tokens to reach its best possible performance for a given compute budget.

  4. A base model produced by pre-training differs from a fine-tuned assistant because:

    • A) A base model is smaller in size and faster to serve in production
    • B) A base model will complete any prompt without alignment or safety guardrails
    • C) A base model was trained on a smaller and more focused dataset
    • D) A base model cannot generate code or structured output

    Correct Answer: B — A base model is a powerful text completer but has no alignment. It will continue any prompt, including harmful ones. RLHF and SFT fine-tuning are required to add safety constraints and instruction-following behavior.


Written by Abstract Algorithms (@abstractalgorithms)