
Why Embeddings Matter: Solving Key Issues in Data Representation

How do computers understand that 'King' - 'Man' + 'Woman' = 'Queen'? Embeddings convert words into dense numerical vectors that place related meanings close together.

Abstract Algorithms · 14 min read

TLDR: Embeddings convert words (and images, users, products) into dense numerical vectors in a geometric space where semantic similarity = geometric proximity. "King - Man + Woman ≈ Queen" is not magic — it is the arithmetic property of well-trained embeddings.


📖 The One-Hot Problem: Numbers That Know Nothing

Before embeddings, machines represented words as one-hot vectors:

Vocabulary: [cat, dog, fish, car, truck]
"cat"  = [1, 0, 0, 0, 0]
"dog"  = [0, 1, 0, 0, 0]
"fish" = [0, 0, 1, 0, 0]

Problems:

  1. Sparse: 50,000-word vocab = 50,000-dimensional vectors that are 99.998% zeros.
  2. No similarity: To the machine, cat is exactly as distant from dog as from car. Nothing in the representation captures that cats and dogs are both pets.

Embeddings solve both problems.
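The "no similarity" problem is easy to demonstrate: every pair of distinct one-hot vectors is orthogonal, so cosine similarity between any two different words is exactly zero. A minimal sketch:

```python
import numpy as np

vocab = ["cat", "dog", "fish", "car", "truck"]

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every distinct pair of one-hot vectors is orthogonal: similarity is always 0
print(cosine(one_hot("cat"), one_hot("dog")))  # → 0.0
print(cosine(one_hot("cat"), one_hot("car")))  # → 0.0
```

No matter which two words you pick, the score is 0 — the representation carries no notion of relatedness at all.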


🔍 Vectors, Dimensions, and the Geometry of Meaning

Think of a vector as a list of numbers that acts like a GPS coordinate — except instead of latitude and longitude, it has hundreds of coordinates in "meaning space." Each number encodes some abstract feature of a word (its "animal-ness," "size," or "emotional tone"), though no single dimension has a human-readable label.

What is an embedding model? An embedding model is a neural network trained to produce these vectors. You feed in a word, sentence, or image, and the model outputs a dense vector — a numerical fingerprint of that input's meaning. Popular options include:

| Model | Input | Dimensions | Best For |
| --- | --- | --- | --- |
| word2vec | Single words | 100–300 | Word similarity |
| Sentence-BERT | Sentences | 384–768 | Sentence similarity |
| OpenAI text-embedding-3-small | Text | 1536 | Production semantic search |
| CLIP | Text or images | 512 | Cross-modal image/text search |

How cosine similarity works (no math degree required): Cosine similarity measures the angle between two vectors, not the raw distance between their endpoints. Two vectors pointing in nearly the same direction score close to 1.0 (very similar). Completely unrelated vectors score near 0. Opposites score near −1.0.

cosine_similarity("cat",  "kitten")   ≈ 0.92   → nearly identical meaning
cosine_similarity("cat",  "dog")      ≈ 0.76   → related concepts
cosine_similarity("cat",  "database") ≈ 0.05   → unrelated

This is why embedding-powered search engines understand what you mean, not just which keywords you typed.
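The formula behind cosine similarity is just the dot product of the two normalized vectors. The toy 3D vectors below (made up for illustration) show the key property: only direction matters, so scaling a vector does not change its score.

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: dot product divided by the product of the lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the length
c = np.array([3.0, -1.0, 0.0])  # a nearly orthogonal direction

print(cosine_similarity(a, b))   # ≈ 1.0  (direction matches; length is ignored)
print(cosine_similarity(a, c))   # ≈ 0.08 (nearly unrelated directions)
print(cosine_similarity(a, -a))  # ≈ -1.0 (exactly opposite direction)
```

This length-invariance is why cosine similarity, rather than Euclidean distance, is the default metric for comparing embeddings.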

flowchart LR
    TX[Text Input] --> EM[Embedding Model]
    EM --> VEC[Vector Representation]
    VEC --> VS[Vector Space]
    VS --> CD[Cosine Distance]
    CD --> SD[Similar Documents]

🔢 Dense Vectors: Coordinates in Meaning Space

An embedding represents a word as a dense low-dimensional vector (e.g., 300 dimensions):

"cat"  → [0.8, -0.3,  0.6, 0.1, ...]   (300 values, all non-zero)
"dog"  → [0.7, -0.2,  0.5, 0.2, ...]   (similar to cat)
"car"  → [-0.1, 0.9, -0.4, 0.8, ...]   (different region of space)

Cosine similarity between cat and dog: ~0.76 (close together).
Cosine similarity between cat and car: ~0.1 (far apart).

The model has learned that cats and dogs live in the same region of the 300D semantic space.


⚙️ The Learning Principle: "You Shall Know a Word by the Company It Keeps"

This is Firth's distributional hypothesis (1957), and it is the foundation of word2vec, GloVe, and modern LLM embeddings.

Training signal: Predict surrounding words.

Context window: "... feeds the ___ every morning ..."
Target word: "cat"

Words that appear in similar contexts get similar representations. Cat and dog both appear near "pet," "feed," "vet," "collar" → their vectors converge in the training process.

Word2Vec (skip-gram) objective: Given a word $w$, maximize the probability of observing its context words $c_i$:

$$\max \sum_{(w, c) \in \text{corpus}} \log P(c | w)$$
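The data preparation behind this objective is simple: slide a window over the corpus and emit (target, context) pairs. A toy sketch of that step (real word2vec training adds negative sampling and a neural lookup, which this omits):

```python
# Generate (target, context) training pairs for skip-gram with window size 2
corpus = "she feeds the cat every morning".split()
window = 2

pairs = []
for i, target in enumerate(corpus):
    # Context = words within `window` positions on either side of the target
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((target, corpus[j]))

print(pairs[:4])
# → [('she', 'feeds'), ('she', 'the'), ('feeds', 'she'), ('feeds', 'the')]
```

Because "cat" and "dog" keep appearing with the same context words across millions of such pairs, the optimizer pushes their vectors toward the same region of the space.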


🧠 Deep Dive: Why "King - Man + Woman ≈ Queen" Works

Because semantically coherent relationships are encoded as directions in embedding space:

direction("Man" → "King") ≈ direction("Woman" → "Queen")
i.e. vector("King") - vector("Man") ≈ vector("Queen") - vector("Woman")

Rearranging: $$\text{vector("Queen")} \approx \text{vector("King")} - \text{vector("Man")} + \text{vector("Woman")}$$

The "royalty" concept is a direction. The "gender" flip is another direction. These directions are geometrically consistent across thousands of analogies because they learned from the same statistical patterns.

flowchart LR
    King["King"] -->|subtract| ManAxis["— Man direction"]
    ManAxis -->|add| WomanAxis["+ Woman direction"]
    WomanAxis --> Queen["≈ Queen\n(nearest neighbor in embedding space)"]

⚙️ Embeddings in Production: Not Just Words

Modern embeddings go far beyond words:

| Input Type | Embedding Model | Application |
| --- | --- | --- |
| Text | BERT, Sentence-BERT, OpenAI text-embedding-3 | Semantic search, RAG, classification |
| Images | CLIP, ViT | Image search, visual Q&A, multimodal retrieval |
| Users | Collaborative filtering embeddings | Recommendation systems (Netflix, Spotify) |
| Products | Catalog embeddings | "Customers who bought X also bought Y" |
| Code | OpenAI Codex embeddings | Semantic code search |

Vector databases (Pinecone, Weaviate, Milvus, pgvector) store billions of embedding vectors and support approximate nearest neighbor (ANN) search — the "find the most semantically similar documents" query at scale.


📊 From Raw Text to a Point in Space: The Embedding Pipeline

When you type a query into a semantic search engine, here is what happens under the hood — from your raw text to a meaningful coordinate in vector space:

flowchart TD
    A["📝 Raw Text\n'How do cats sleep?'"] --> B["✂️ Tokenizer\nSplit into sub-word tokens"]
    B --> C["🔢 Token IDs\nMap each token to an integer index\n[2129, 103, 8855, 3581, 30]"]
    C --> D["🧠 Embedding Model\nNeural network forward pass\n(attention + feed-forward layers)"]
    D --> E["📐 Dense Vector\n[0.23, -0.41, 0.87, ..., 0.12]\n(e.g., 1536 dimensions)"]
    E --> F["🗺️ Vector Space\nA point in high-dimensional semantic space"]
    F --> G["🔍 Nearest Neighbor Search\nFind closest vectors → retrieve similar content"]

What each step does:

  1. Tokenizer — breaks text into sub-word units. "sleeping" might become ["sleep", "##ing"] in BERT's vocabulary, so even unknown words are represented.
  2. Token IDs — each token maps to an integer that the model's lookup table recognises.
  3. Embedding model — the neural network processes the token sequence through multiple layers and produces a single fixed-size output vector for the whole input.
  4. Dense vector — a compact numerical fingerprint. Two semantically similar texts produce vectors that are geometrically close to each other.
  5. Vector space — every embedded item lives together in this space. Related items form natural clusters; unrelated items are far apart.
  6. Nearest neighbor search — vector databases use algorithms like HNSW or IVFFlat to find the closest embeddings in milliseconds, even across billions of stored vectors.
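The steps above can be sketched end to end in a few lines. Everything here is a stand-in — a toy whitespace tokenizer, a random lookup table, and mean pooling instead of transformer layers — but the shapes and data flow mirror the real pipeline:

```python
import numpy as np

# Toy pipeline: tokenize → map to IDs → look up vectors → pool to one vector.
rng = np.random.default_rng(0)
vocab = {"how": 0, "do": 1, "cats": 2, "sleep": 3, "?": 4}
embedding_table = rng.normal(size=(len(vocab), 8))  # 8 dims for readability

def embed(text):
    tokens = text.lower().replace("?", " ?").split()   # 1. tokenizer
    ids = [vocab[t] for t in tokens]                   # 2. token IDs
    vectors = embedding_table[ids]                     # 3. per-token vectors
    return vectors.mean(axis=0)                        # 4. one fixed-size vector

v = embed("How do cats sleep?")
print(v.shape)  # → (8,) — a single dense vector for the whole input
```

A real model replaces step 3 with attention and feed-forward layers so that each token's vector depends on its context, but the contract is identical: variable-length text in, fixed-size vector out.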

📊 Embedding Lookup Flow

sequenceDiagram
    participant Q as Query Text
    participant M as Embed Model
    participant V as Vector DB
    participant R as Results
    Q->>M: encode query
    M->>V: 1536-dim vector
    V->>R: ANN search
    R-->>Q: top similar results

🌍 Real-World Applications: Embeddings Powering the Apps You Use Every Day

Embeddings are the invisible infrastructure behind a surprising number of modern products.

Semantic Search: Google, Notion, and GitHub Copilot all use embedding-based search. When you type "how do I handle an exception in Python," a keyword search might miss a result titled "Python error handling best practices" because the exact words differ. An embedding search finds it because both phrases map to nearby vectors. Keyword search looks for word overlap; semantic search looks for meaning overlap.

Retrieval-Augmented Generation (RAG): RAG systems give LLMs access to private knowledge bases. When you ask a company's AI assistant a question, it embeds your query, searches a vector database of company documents, retrieves the closest matches, and feeds them as context to the LLM. The model answers based on real, up-to-date information rather than stale training data — all without expensive fine-tuning. See RAG Explained for a deep dive.
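The retrieval step of that loop can be sketched with made-up document vectors. In production the vectors would come from a real embedding model and live in a vector database; here they are hand-picked 3D stand-ins so the ranking logic is visible:

```python
import numpy as np

# Hypothetical document embeddings (in practice: output of an embedding model)
docs = {
    "Refund policy: 30 days":   np.array([0.9, 0.1, 0.2]),
    "Shipping times by region": np.array([0.1, 0.9, 0.1]),
    "How to request a refund":  np.array([0.8, 0.2, 0.3]),
}
query_vec = np.array([0.85, 0.15, 0.25])  # pretend: embed("refund question")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query; keep the top k as LLM context
top = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)[:2]
prompt = "Answer using this context:\n" + "\n".join(top) + "\n\nQ: ..."
print(top)  # the two refund documents rank first
```

The retrieved snippets are then prepended to the user's question, so the LLM grounds its answer in documents it was never trained on.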

Recommendation Systems: Spotify's "Discover Weekly" and Netflix's "Because You Watched" features rely on embeddings. Each song, show, and user has an embedding vector. If your listening-history vector is geometrically close to another user's vector, you receive their top recommendations. The model does not need to understand why you like something — geometric proximity does the reasoning.

Text Classification: Spam filters, sentiment analyzers, and content-moderation systems embed text and train a simple classifier on top of the resulting vectors. The embedding handles the language understanding; the classifier just learns which regions of the vector space correspond to spam, positive sentiment, or policy violations.
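A sketch of that "embed, then classify" pattern: a linear classifier trained directly on embedding vectors. The 2D vectors below are hand-made stand-ins for real embeddings; in practice they would come from a sentence embedding model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Pretend embeddings: spam-like texts cluster in one region, normal texts in another
X = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # vectors for spam-like texts
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],   # vectors for normal texts
])
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The classifier only learns a boundary between regions of embedding space
clf = LogisticRegression().fit(X, y)

print(clf.predict([[0.88, 0.12]]))  # → ['spam']
```

Because the embedding already encodes the language understanding, even this simple linear model is often enough — no deep network needs to be trained on your labels.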


🧪 Generating Your First Embedding in Five Lines of Python

The fastest way to build intuition is to generate a real embedding and measure similarity yourself.

With the OpenAI API:

from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from environment

def embed(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat    = embed("cat")
kitten = embed("kitten")
car    = embed("car")

print(cosine_similarity(cat, kitten))  # → ~0.88  (very similar)
print(cosine_similarity(cat, car))     # → ~0.15  (unrelated)

With HuggingFace Sentence Transformers (free, runs locally):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I love cats",
    "Kittens are adorable",
    "SQL databases store tabular data"
]
embeddings = model.encode(sentences)

print(cosine_similarity([embeddings[0]], [embeddings[1]]))  # → ~0.76 (related)
print(cosine_similarity([embeddings[0]], [embeddings[2]]))  # → ~0.04 (unrelated)

Querying stored embeddings with pgvector:

-- Store a document with its embedding
INSERT INTO documents (content, embedding)
VALUES ('Kittens are adorable', '[0.23, -0.41, 0.87, ...]'::vector);

-- Retrieve the 5 most semantically similar documents to a query vector
SELECT content, embedding <=> '[0.24, -0.39, 0.88, ...]'::vector AS distance
FROM documents
ORDER BY distance
LIMIT 5;

The <=> operator in pgvector computes cosine distance. Vector databases like Pinecone, Weaviate, and Milvus offer the same capability with built-in ANN indexing for billion-scale corpora.
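Cosine distance is simply 1 − cosine similarity, so pgvector's ascending ORDER BY puts the most similar vectors first. A quick numpy check of what `<=>` computes, using the truncated example vectors from the SQL above:

```python
import numpy as np

# Cosine distance (what pgvector's <=> returns) = 1 - cosine similarity
a = np.array([0.23, -0.41, 0.87])
b = np.array([0.24, -0.39, 0.88])

similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
distance = 1.0 - similarity

print(round(distance, 4))  # near 0 → the vectors point in almost the same direction
```

A distance of 0 means identical direction, 1 means orthogonal, and 2 means opposite — hence "smallest distance first" is equivalent to "most similar first".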


⚖️ Trade-offs & Failure Modes: One-Hot vs. Dense Embeddings

| Property | One-Hot | Dense Embedding |
| --- | --- | --- |
| Dimensionality | Equal to vocabulary size (50K+) | Fixed low dimension (128–1536) |
| Sparsity | ~100% sparse | Dense (all values non-zero) |
| Semantic similarity | Not captured | Captured via geometric distance |
| Computation | High (huge sparse vectors) | Efficient (small dense vectors) |
| Supports analogies (King − Man + Woman) | ❌ | ✅ |
| Requires training | ❌ (constructed) | ✅ (learned from data) |

🧭 Decision Guide: Choosing the Right Embedding Approach

Use pre-trained embeddings (OpenAI, sentence-transformers) for most NLP tasks — they require no training data and perform well out of the box. Fine-tune when your domain has specialized vocabulary (legal, medical). Choose embedding dimension based on benchmarks, not size: smaller dense embeddings often beat larger ones. For tabular data, embeddings rarely help — use classical ML features instead.


📚 Five Things Beginners Get Wrong About Embeddings

1. "More dimensions always means better embeddings" Not so. OpenAI's text-embedding-3-small (1536 dims) outperforms many larger models on real benchmarks. Dimension count matters far less than training data quality, model architecture, and whether the model was fine-tuned for your task. Always benchmark on your actual data before committing to a model.

2. "One embedding model works for every use case" Word2vec embeddings trained on Wikipedia perform poorly on product descriptions. CLIP image embeddings are wrong for protein sequences. Use a model trained on data similar to your domain. For production systems, evaluate multiple models on your downstream task before choosing one.

3. "Cosine similarity above 0.9 means the texts are identical" High cosine similarity means semantic relatedness, not identity. "Cat" and "kitten" might score 0.92, but they are not interchangeable in every context. Always validate your similarity threshold on real examples from your domain rather than relying on a single number.

4. "Embeddings capture everything about the text" Embeddings are a lossy compression. They excel at capturing statistical co-occurrence patterns but miss sarcasm, cultural nuance, and domain-specific jargon absent from the training corpus. A medical embedding model will interpret "acute" differently from a general-purpose one trained on web text.

5. "You can compare embeddings from different models" Embeddings from different models live in entirely separate vector spaces. Comparing a word2vec vector to a BERT vector is like comparing GPS coordinates in two different coordinate systems — the result is meaningless. Always embed both query and corpus with the exact same model version.


📌 TLDR: Summary & Key Takeaways

  • One-hot encoding is sparse and captures no semantic similarity.
  • Embeddings are dense vectors learned from co-occurrence statistics; semantically similar items are geometrically close.
  • Firth's distributional hypothesis: "You shall know a word by the company it keeps" — words that share contexts end up with similar embeddings.
  • Vector arithmetic (King - Man + Woman ≈ Queen) works because semantic relationships are consistent directions in the embedding space.
  • Vector databases (Pinecone, pgvector) serve nearest-neighbor queries over billions of embeddings for RAG, recommendation, and semantic search.

📝 Practice Quiz

  1. Two words have a cosine similarity of 0.95 in embedding space. What does this indicate?

    • A) Their one-hot vectors overlap in 95% of positions.
    • B) The words appear in very similar contexts and are semantically close (e.g., "cat" and "kitten").
    • C) One word contains 95% of the letters of the other.
    • D) They were both present in the same training sentence exactly 95 times.

    Correct Answer: B — Cosine similarity measures the angle between two vectors. A score near 1.0 means the vectors point in nearly the same direction, indicating the two words appear in similar contexts and carry related meanings. One-hot vectors and letter overlap have no connection to cosine similarity.

  2. Why does "King - Man + Woman ≈ Queen" work with word embeddings?

    • A) It is a hard-coded rule written into the embedding model.
    • B) The model learned consistent semantic directions from co-occurrence data — "royalty" and "gender" are separate geometric directions in the embedding space.
    • C) It only works for those four specific words and does not generalise.
    • D) The model stores lookup tables that map arithmetic operations to word labels.

    Correct Answer: B — During training, words appearing in similar contexts develop similar vectors. The relationships "Man → King" and "Woman → Queen" are parallel geometric directions because they reflect the same underlying pattern (gender applied to royalty) throughout the training corpus.

  3. You need to find the 10 most semantically similar documents to a query in a corpus of 100 million documents. Which tool is designed for this task?

    • A) A relational database with LIKE '%query%' full-text search.
    • B) A vector database (e.g., Pinecone, Weaviate, pgvector) with approximate nearest-neighbor (ANN) search over embedding vectors.
    • C) An inverted index with TF-IDF ranking.
    • D) A hash map keyed by exact query string.

    Correct Answer: B — Vector databases are purpose-built to store dense embedding vectors and run ANN search efficiently at scale. LIKE searches and TF-IDF match keywords, not meaning. Hash maps require exact key matches and cannot handle semantic similarity at all.


🛠️ sentence-transformers & OpenAI Embeddings API: Generating Embeddings in Practice

sentence-transformers is an open-source Python library that wraps pre-trained BERT/RoBERTa-family models with a single .encode() method — producing sentence-level embedding vectors locally, with no API key or GPU required for smaller models. It is the fastest path from raw text to embeddings in a local or data-sensitive setup.

OpenAI's Embeddings API (text-embedding-3-small, text-embedding-3-large) provides state-of-the-art embeddings via a managed REST endpoint — no GPU infrastructure to operate, billed per token.

# pip install sentence-transformers openai numpy scikit-learn

# ── sentence-transformers: local embeddings, no API key ──────────────────────
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # 80 MB model, runs on CPU

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "PostgreSQL is a relational database management system",
]

# encode() returns a (3 × 384) tensor — one row per sentence
embeddings = model.encode(sentences, convert_to_tensor=True)

score_01 = util.cos_sim(embeddings[0], embeddings[1]).item()
score_02 = util.cos_sim(embeddings[0], embeddings[2]).item()

print(f"Sentences 0 & 1 (paraphrase):  {score_01:.2f}")  # → ~0.82
print(f"Sentences 0 & 2 (unrelated):   {score_02:.2f}")  # → ~0.07

# ── OpenAI Embeddings API: managed, production-grade ─────────────────────────
from openai import OpenAI
import numpy as np

client = OpenAI()   # reads OPENAI_API_KEY from environment

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(response.data[0].embedding)

v1 = embed("machine learning model training")
v2 = embed("training a neural network from scratch")

cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"OpenAI cosine similarity: {cosine:.2f}")   # → ~0.91

| Model | Dimensions | Runs locally | Cost | Best for |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | ✅ CPU (80 MB) | Free | Prototyping, private data |
| all-mpnet-base-v2 | 768 | ✅ CPU/GPU (420 MB) | Free | Higher accuracy, local hosting |
| OpenAI text-embedding-3-small | 1536 | ❌ API call | $0.02 / 1M tokens | Production semantic search |

Start with sentence-transformers during prototyping or when data cannot leave your environment. Switch to the OpenAI API when retrieval quality is the primary constraint and managed infrastructure is acceptable.

For a full deep-dive on sentence-transformers model selection and production embedding pipelines, a dedicated follow-up post is planned.



Written by Abstract Algorithms (@abstractalgorithms)