Why Embeddings Matter: Solving Key Issues in Data Representation
How do computers understand that 'King' - 'Man' + 'Woman' = 'Queen'?
TLDR: Embeddings convert words (and images, users, products) into dense numerical vectors in a geometric space where semantic similarity equals geometric proximity. "King - Man + Woman ≈ Queen" is not magic: it is an arithmetic property of well-trained embeddings.
📖 The One-Hot Problem: Numbers That Know Nothing
Before embeddings, machines represented words as one-hot vectors:
```text
Vocabulary: [cat, dog, fish, car, truck]
"cat"  = [1, 0, 0, 0, 0]
"dog"  = [0, 1, 0, 0, 0]
"fish" = [0, 0, 1, 0, 0]
```
Problems:
- Sparse: 50,000-word vocab = 50,000-dimensional vectors that are 99.998% zeros.
- No similarity: The machine sees `cat` and `dog` as equally distant as `cat` and `car`. Nothing in their representation captures that cats and dogs are both pets.
Embeddings solve both problems.
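A few lines of numpy make the "no similarity" problem concrete: every pair of distinct one-hot vectors is orthogonal, so cosine similarity is 0 for all of them, no matter how related the words are.

```python
import numpy as np

# One-hot vectors over a 5-word vocabulary: [cat, dog, fish, car, truck]
vocab = ["cat", "dog", "fish", "car", "truck"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every pair of distinct words is orthogonal: similarity is always 0
print(cosine(one_hot["cat"], one_hot["dog"]))  # → 0.0
print(cosine(one_hot["cat"], one_hot["car"]))  # → 0.0
print(cosine(one_hot["cat"], one_hot["cat"]))  # → 1.0 (only identity scores high)
```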
🔍 Vectors, Dimensions, and the Geometry of Meaning
Think of a vector as a list of numbers that acts like a GPS coordinate — except instead of latitude and longitude, it has hundreds of coordinates in "meaning space." Each number encodes some abstract feature of a word (its "animal-ness," "size," or "emotional tone"), though no single dimension has a human-readable label.
What is an embedding model? An embedding model is a neural network trained to produce these vectors. You feed in a word, sentence, or image, and the model outputs a dense vector — a numerical fingerprint of that input's meaning. Popular options include:
| Model | Input | Dimensions | Best For |
| --- | --- | --- | --- |
| word2vec | Single words | 100–300 | Word similarity |
| Sentence-BERT | Sentences | 384–768 | Sentence similarity |
| OpenAI text-embedding-3-small | Text | 1536 | Production semantic search |
| CLIP | Text or images | 512 | Cross-modal image/text search |
How cosine similarity works (no math degree required): Cosine similarity measures the angle between two vectors, not the raw distance between their endpoints. Two vectors pointing in nearly the same direction score close to 1.0 (very similar). Completely unrelated vectors score near 0. Opposites score near −1.0.
```text
cosine_similarity("cat", "kitten")   ≈ 0.92  → nearly identical meaning
cosine_similarity("cat", "dog")      ≈ 0.76  → related concepts
cosine_similarity("cat", "database") ≈ 0.05  → unrelated
```
This is why embedding-powered search engines understand what you mean, not just which keywords you typed.
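The angle intuition is easy to verify from scratch (toy 2D vectors for illustration, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction)  # ≈ 1.0  (same direction, regardless of length)
print(orthogonal)      # = 0.0  (unrelated directions)
print(opposite)        # ≈ -1.0 (opposite directions)
```

Note that `[1, 2]` and `[2, 4]` score 1.0 despite having different lengths: cosine similarity cares only about direction, which is why it is the standard metric for comparing embeddings.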
📊 Semantic Similarity Search
```mermaid
flowchart LR
    TX[Text Input] --> EM[Embedding Model]
    EM --> VEC[Vector Representation]
    VEC --> VS[Vector Space]
    VS --> CD[Cosine Distance]
    CD --> SD[Similar Documents]
```
🔢 Dense Vectors: Coordinates in Meaning Space
An embedding represents a word as a dense low-dimensional vector (e.g., 300 dimensions):
```text
"cat" → [0.8, -0.3, 0.6, 0.1, ...]   (300 values, all non-zero)
"dog" → [0.7, -0.2, 0.5, 0.2, ...]   (similar to cat)
"car" → [-0.1, 0.9, -0.4, 0.8, ...]  (different region of space)
```
Cosine similarity between cat and dog: ~0.9 (very close). Cosine similarity between cat and car: ~0.1 (far apart). These numbers are illustrative; the exact scores depend on the model.
The model has learned that cats and dogs live in the same region of the 300D semantic space.
⚙️ The Learning Principle: "You Shall Know a Word by the Company It Keeps"
This is the distributional hypothesis, famously summarized by J. R. Firth (1957), and it is the foundation of word2vec, GloVe, and modern LLM embeddings.
Training signal: Predict surrounding words.
```text
Context window: "... feeds the ___ every morning ..."
Target word:    "cat"
```
Words that appear in similar contexts get similar representations. Cat and dog both appear near "pet," "feed," "vet," "collar" → their vectors converge in the training process.
Word2Vec (skip-gram) objective: Given a word $w$, maximize the probability of observing its context words $c_i$:
$$\max \sum_{(w, c) \in \text{corpus}} \log P(c | w)$$
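To make the skip-gram objective concrete, here is a minimal numpy training loop with negative sampling on a toy corpus. Every hyperparameter here is illustrative, and on a corpus this small the resulting vectors are noisy; treat it as a mechanical sketch of the update rule, not real word2vec.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16           # vocabulary size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))    # target-word embeddings
W_out = rng.normal(0, 0.1, (V, D))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, window, epochs = 0.05, 2, 200
for _ in range(epochs):
    for pos, word in enumerate(corpus):
        w = idx[word]
        for off in range(-window, window + 1):
            ctx = pos + off
            if off == 0 or not (0 <= ctx < len(corpus)):
                continue
            c = idx[corpus[ctx]]
            # positive pair (label 1): pull the two vectors together
            g = sigmoid(W_in[w] @ W_out[c]) - 1.0
            d_in = g * W_out[c]          # grad for W_in[w], computed before updating W_out
            W_out[c] -= lr * g * W_in[w]
            W_in[w] -= lr * d_in
            # negative samples (label 0): push random words away
            # (real word2vec samples from a unigram^0.75 table and
            #  avoids accidental hits on true context words)
            for n in rng.integers(0, V, size=2):
                g = sigmoid(W_in[w] @ W_out[n])
                d_in = g * W_out[n]
                W_out[n] -= lr * g * W_in[w]
                W_in[w] -= lr * d_in

# "cat" and "dog" occur in near-identical contexts, so their vectors drift together
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(W_in[idx["cat"]], W_in[idx["dog"]]))
```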
🧠 Deep Dive: Why "King - Man + Woman ≈ Queen" Works
Because semantically coherent relationships are encoded as directions in embedding space:
```text
direction("Man" → "King") ≈ direction("Woman" → "Queen")
vector("King") - vector("Man") ≈ vector("Queen") - vector("Woman")
```
Rearranging: $$\text{vector("Queen")} \approx \text{vector("King")} - \text{vector("Man")} + \text{vector("Woman")}$$
The "royalty" concept is a direction. The "gender" flip is another direction. These directions are geometrically consistent across thousands of analogies because they were learned from the same statistical patterns.
```mermaid
flowchart LR
    King["King"] -->|subtract| ManAxis["— Man direction"]
    ManAxis -->|add| WomanAxis["+ Woman direction"]
    WomanAxis --> Queen["≈ Queen\n(nearest neighbor in embedding space)"]
```
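You can verify the arithmetic with hand-built toy vectors whose axes explicitly encode royalty and gender. Real embeddings never have labeled dimensions like this; the sketch only shows why consistent directions make the analogy work.

```python
import numpy as np

# Hand-built toy vectors on explicit [royalty, gender, other] axes
words = {
    "king":  np.array([0.9,  0.9,  0.1]),
    "queen": np.array([0.9, -0.9,  0.1]),
    "man":   np.array([0.1,  0.9,  0.2]),
    "woman": np.array([0.1, -0.9,  0.2]),
    "apple": np.array([0.0,  0.05, 0.9]),   # a distractor from another region
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vec, exclude=()):
    """Nearest word by cosine similarity, excluding the analogy's own inputs."""
    candidates = {w: v for w, v in words.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vec, candidates[w]))

target = words["king"] - words["man"] + words["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```

Excluding the input words is standard practice in analogy evaluation; otherwise "king" itself is usually the nearest neighbor of the result vector.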
⚙️ Embeddings in Production: Not Just Words
Modern embeddings go far beyond words:
| Input Type | Embedding Model | Application |
| --- | --- | --- |
| Text | BERT, Sentence-BERT, OpenAI text-embedding-3 | Semantic search, RAG, classification |
| Images | CLIP, ViT | Image search, visual Q&A, multimodal retrieval |
| Users | Collaborative filtering embeddings | Recommendation systems (Netflix, Spotify) |
| Products | Catalog embeddings | "Customers who bought X also bought Y" |
| Code | OpenAI Codex embeddings | Semantic code search |
Vector databases (Pinecone, Weaviate, Milvus, pgvector) store billions of embedding vectors and support approximate nearest neighbor (ANN) search — the "find the most semantically similar documents" query at scale.
📊 From Raw Text to a Point in Space: The Embedding Pipeline
When you type a query into a semantic search engine, here is what happens under the hood — from your raw text to a meaningful coordinate in vector space:
```mermaid
flowchart TD
    A["📝 Raw Text\n'How do cats sleep?'"] --> B["✂️ Tokenizer\nSplit into sub-word tokens"]
    B --> C["🔢 Token IDs\nMap each token to an integer index\n[2129, 103, 8855, 3581, 30]"]
    C --> D["🧠 Embedding Model\nNeural network forward pass\n(attention + feed-forward layers)"]
    D --> E["📐 Dense Vector\n[0.23, -0.41, 0.87, ..., 0.12]\n(e.g., 1536 dimensions)"]
    E --> F["🗺️ Vector Space\nA point in high-dimensional semantic space"]
    F --> G["🔍 Nearest Neighbor Search\nFind closest vectors → retrieve similar content"]
```
What each step does:
- Tokenizer — breaks text into sub-word units. "sleeping" might become `["sleep", "##ing"]` in BERT's vocabulary, so even unknown words are represented.
- Token IDs — each token maps to an integer that the model's lookup table recognises.
- Embedding model — the neural network processes the token sequence through multiple layers and produces a single fixed-size output vector for the whole input.
- Dense vector — a compact numerical fingerprint. Two semantically similar texts produce vectors that are geometrically close to each other.
- Vector space — every embedded item lives together in this space. Related items form natural clusters; unrelated items are far apart.
- Nearest neighbor search — vector databases use algorithms like HNSW or IVFFlat to find the closest embeddings in milliseconds, even across billions of stored vectors.
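The six steps above can be sketched end to end with toy stand-ins: a whitespace tokenizer and an identity lookup table in place of a real tokenizer and neural network (chosen only to keep the demo deterministic; a real model learns dense rows).

```python
import numpy as np

# 1–2. Tokenizer + token IDs: a toy whitespace tokenizer over a tiny vocabulary
vocab = {w: i for i, w in enumerate(
    ["how", "do", "cats", "sleep", "dogs", "eat", "databases", "store", "data"])}

def tokenize(text):
    return [vocab[tok] for tok in text.lower().split() if tok in vocab]

# 3–4. "Embedding model" stand-in: a lookup table plus mean pooling
E = np.eye(len(vocab))   # identity table keeps the demo deterministic

def embed(text):
    return E[tokenize(text)].mean(axis=0)   # one fixed-size vector per input

# 5–6. Vector space + nearest-neighbor search (brute force over all documents)
docs = ["cats sleep", "dogs eat", "databases store data"]
doc_vecs = np.stack([embed(d) for d in docs])

query = embed("How do cats sleep")
sims = doc_vecs @ query / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query))
print(docs[int(np.argmax(sims))])   # → cats sleep
```

Brute-force search like this is O(corpus size) per query; ANN indexes such as HNSW exist precisely to avoid that linear scan at scale.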
📊 Embedding Lookup Flow
```mermaid
sequenceDiagram
    participant Q as Query Text
    participant M as Embed Model
    participant V as Vector DB
    participant R as Results
    Q->>M: encode query
    M->>V: 1536-dim vector
    V->>R: ANN search
    R-->>Q: top similar results
```
🌍 Real-World Applications: Embeddings Powering the Apps You Use Every Day
Embeddings are the invisible infrastructure behind a surprising number of modern products.
**Semantic Search.** Google, Notion, and GitHub Copilot all use embedding-based search. When you type "how do I handle an exception in Python," a keyword search might miss a result titled "Python error handling best practices" because the exact words differ. An embedding search finds it because both phrases map to nearby vectors. Keyword search looks for word overlap; semantic search looks for meaning overlap.
**Retrieval-Augmented Generation (RAG).** RAG systems give LLMs access to private knowledge bases. When you ask a company's AI assistant a question, it embeds your query, searches a vector database of company documents, retrieves the closest matches, and feeds them as context to the LLM. The model answers based on real, up-to-date information rather than stale training data — all without expensive fine-tuning. See RAG Explained for a deep dive.
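The retrieval step of that loop can be sketched in a few lines. The documents, vectors, and prompt template below are all made up for illustration; in a real system the vectors come from an embedding model and live in a vector database.

```python
import numpy as np

# Toy pre-computed document embeddings
corpus = {
    "Refunds are processed within 5 business days.": np.array([0.9, 0.1, 0.0]),
    "Our office is closed on public holidays.":      np.array([0.1, 0.9, 0.1]),
    "Contact support via the in-app chat widget.":   np.array([0.0, 0.2, 0.9]),
}
query_text = "How long does a refund take?"
query_vec = np.array([0.85, 0.15, 0.05])   # toy embedding of the query

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query, docs, k=2):
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]

# Retrieve, then stuff the matches into the LLM prompt as context
context = top_k(query_vec, corpus)
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + f"\n\nQuestion: {query_text}")
print(context[0])  # the refund document ranks first
```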
**Recommendation Systems.** Spotify's "Discover Weekly" and Netflix's "Because You Watched" features rely on embeddings. Each song, show, and user has an embedding vector. If your listening-history vector is geometrically close to another user's vector, you receive their top recommendations. The model does not need to understand why you like something — geometric proximity does the reasoning.
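A minimal sketch of recommendation-by-proximity, with hand-made two-dimensional taste vectors; real systems learn high-dimensional user and item embeddings from billions of interactions.

```python
import numpy as np

# Toy item embeddings on [rock-ness, jazz-ness] axes (illustrative only)
items = {
    "Led Zeppelin": np.array([0.9, 0.1]),
    "Miles Davis":  np.array([0.1, 0.9]),
    "Nirvana":      np.array([0.8, 0.2]),
}
user = np.array([0.9, 0.15])   # a listener whose history skews rock

# Score every item by its dot product with the user vector, recommend the best
scores = {name: float(user @ vec) for name, vec in items.items()}
recommendation = max(scores, key=scores.get)
print(recommendation)   # → Led Zeppelin
```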
**Text Classification.** Spam filters, sentiment analyzers, and content-moderation systems embed text and train a simple classifier on top of the resulting vectors. The embedding handles the language understanding; the classifier just learns which regions of the vector space correspond to spam, positive sentiment, or policy violations.
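A sketch of the classifier-on-top-of-embeddings pattern, using hand-made 2D "embeddings" and a nearest-centroid rule standing in for a trained classifier:

```python
import numpy as np

# Toy scenario: the embedding model already separates spam from ham in
# vector space, so even a trivial classifier on top works (vectors invented)
train = {
    "spam": np.array([[0.9, 0.1], [0.8, 0.2], [0.95, 0.05]]),
    "ham":  np.array([[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]),
}
centroids = {label: vecs.mean(axis=0) for label, vecs in train.items()}

def classify(embedding):
    # assign the label whose centroid is closest in the vector space
    return min(centroids, key=lambda lbl: np.linalg.norm(embedding - centroids[lbl]))

print(classify(np.array([0.85, 0.1])))  # → spam
print(classify(np.array([0.1, 0.8])))   # → ham
```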
🧪 Generating Your First Embedding in Five Lines of Python
The fastest way to build intuition is to generate a real embedding and measure similarity yourself.
With the OpenAI API:
```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from environment

def embed(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = embed("cat")
kitten = embed("kitten")
car = embed("car")

print(cosine_similarity(cat, kitten))  # → ~0.88 (very similar)
print(cosine_similarity(cat, car))     # → ~0.15 (unrelated)
```
With HuggingFace Sentence Transformers (free, runs locally):
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I love cats",
    "Kittens are adorable",
    "SQL databases store tabular data"
]
embeddings = model.encode(sentences)

print(cosine_similarity([embeddings[0]], [embeddings[1]]))  # → ~0.76 (related)
print(cosine_similarity([embeddings[0]], [embeddings[2]]))  # → ~0.04 (unrelated)
```
Querying stored embeddings with pgvector:
```sql
-- Store a document with its embedding
INSERT INTO documents (content, embedding)
VALUES ('Kittens are adorable', '[0.23, -0.41, 0.87, ...]'::vector);

-- Retrieve the 5 most semantically similar documents to a query vector
SELECT content, embedding <=> '[0.24, -0.39, 0.88, ...]'::vector AS distance
FROM documents
ORDER BY distance
LIMIT 5;
```
The `<=>` operator in pgvector computes cosine distance. Vector databases like Pinecone, Weaviate, and Milvus offer the same capability with built-in ANN indexing for billion-scale corpora.
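Under the hood, cosine distance is just one minus cosine similarity. A brute-force numpy equivalent, treating the example vector prefixes above as if they were full 3-dimensional toy vectors:

```python
import numpy as np

# pgvector's <=> returns cosine *distance*: 1 - cosine similarity
def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

stored = np.array([0.23, -0.41, 0.87])
query = np.array([0.24, -0.39, 0.88])
print(cosine_distance(stored, query))  # near 0 → nearly identical vectors
```

Ordering by this distance ascending is exactly what the `ORDER BY distance LIMIT 5` query does, just executed inside the database with an ANN index instead of a full scan.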
⚖️ Trade-offs & Failure Modes: One-Hot vs. Dense Embeddings
| Property | One-Hot | Dense Embedding |
| --- | --- | --- |
| Dimensionality | Equal to vocabulary size (50K+) | Fixed low dimension (128–1536) |
| Sparsity | ~100% sparse | Dense (all values non-zero) |
| Semantic similarity | Not captured | Captured via geometric distance |
| Computation | High (huge sparse vectors) | Efficient (small dense vectors) |
| Supports analogies (King-Man+Woman) | ❌ | ✅ |
| Requires training | ❌ (constructed) | ✅ (learned from data) |
🧭 Decision Guide: Choosing the Right Embedding Approach
Use pre-trained embeddings (OpenAI, sentence-transformers) for most NLP tasks — they require no training data and perform well out of the box. Fine-tune when your domain has specialized vocabulary (legal, medical). Choose embedding dimension based on benchmarks, not size: smaller dense embeddings often beat larger ones. For tabular data, embeddings rarely help — use classical ML features instead.
📚 Five Things Beginners Get Wrong About Embeddings
1. "More dimensions always means better embeddings"
Not so. OpenAI's text-embedding-3-small (1536 dims) outperforms many larger models on real benchmarks. Dimension count matters far less than training data quality, model architecture, and whether the model was fine-tuned for your task. Always benchmark on your actual data before committing to a model.
2. "One embedding model works for every use case"
Word2vec embeddings trained on Wikipedia perform poorly on product descriptions. CLIP image embeddings are wrong for protein sequences. Use a model trained on data similar to your domain. For production systems, evaluate multiple models on your downstream task before choosing one.
3. "Cosine similarity above 0.9 means the texts are identical"
High cosine similarity means semantic relatedness, not identity. "Cat" and "kitten" might score 0.92, but they are not interchangeable in every context. Always validate your similarity threshold on real examples from your domain rather than relying on a single number.
4. "Embeddings capture everything about the text"
Embeddings are a lossy compression. They excel at capturing statistical co-occurrence patterns but miss sarcasm, cultural nuance, and domain-specific jargon absent from the training corpus. A medical embedding model will interpret "acute" differently from a general-purpose one trained on web text.
5. "You can compare embeddings from different models"
Embeddings from different models live in entirely separate vector spaces. Comparing a word2vec vector to a BERT vector is like comparing GPS coordinates in two different coordinate systems — the result is meaningless. Always embed both query and corpus with the exact same model version.
📌 TLDR: Summary & Key Takeaways
- One-hot encoding is sparse and captures no semantic similarity.
- Embeddings are dense vectors learned from co-occurrence statistics; semantically similar items are geometrically close.
- Firth's maxim, "You shall know a word by the company it keeps": embeddings are learned by predicting a word's context.
- Vector arithmetic (King - Man + Woman ≈ Queen) works because semantic relationships are consistent directions in the embedding space.
- Vector databases (Pinecone, pgvector) serve nearest-neighbor queries over billions of embeddings for RAG, recommendation, and semantic search.
📝 Practice Quiz
Two words have a cosine similarity of 0.95 in embedding space. What does this indicate?
- A) Their one-hot vectors overlap in 95% of positions.
- B) The words appear in very similar contexts and are semantically close (e.g., "cat" and "kitten").
- C) One word contains 95% of the letters of the other.
- D) They were both present in the same training sentence exactly 95 times.
Correct Answer: B — Cosine similarity measures the angle between two vectors. A score near 1.0 means the vectors point in nearly the same direction, indicating the two words appear in similar contexts and carry related meanings. One-hot vectors and letter overlap have no connection to cosine similarity.
Why does "King - Man + Woman ≈ Queen" work with word embeddings?
- A) It is a hard-coded rule written into the embedding model.
- B) The model learned consistent semantic directions from co-occurrence data — "royalty" and "gender" are separate geometric directions in the embedding space.
- C) It only works for those four specific words and does not generalise.
- D) The model stores lookup tables that map arithmetic operations to word labels.
Correct Answer: B — During training, words appearing in similar contexts develop similar vectors. The relationships "Man → King" and "Woman → Queen" are parallel geometric directions because they reflect the same underlying pattern (gender applied to royalty) throughout the training corpus.
You need to find the 10 most semantically similar documents to a query in a corpus of 100 million documents. Which tool is designed for this task?
- A) A relational database with `LIKE '%query%'` full-text search.
- B) A vector database (e.g., Pinecone, Weaviate, pgvector) with approximate nearest-neighbor (ANN) search over embedding vectors.
- C) An inverted index with TF-IDF ranking.
- D) A hash map keyed by exact query string.
Correct Answer: B — Vector databases are purpose-built to store dense embedding vectors and run ANN search efficiently at scale. LIKE searches and TF-IDF match keywords, not meaning. Hash maps require exact key matches and cannot handle semantic similarity at all.
🛠️ sentence-transformers & OpenAI Embeddings API: Generating Embeddings in Practice
sentence-transformers is an open-source Python library that wraps pre-trained BERT/RoBERTa-family models with a single .encode() method — producing sentence-level embedding vectors locally, with no API key or GPU required for smaller models. It is the fastest path from raw text to embeddings in a local or data-sensitive setup.
OpenAI's Embeddings API (text-embedding-3-small, text-embedding-3-large) provides state-of-the-art embeddings via a managed REST endpoint — no GPU infrastructure to operate, billed per token.
```python
# pip install sentence-transformers openai numpy scikit-learn

# ── sentence-transformers: local embeddings, no API key ──────────────────────
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 80 MB model, runs on CPU

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "PostgreSQL is a relational database management system",
]

# encode(..., convert_to_tensor=True) returns a (3 × 384) tensor — one row per sentence
embeddings = model.encode(sentences, convert_to_tensor=True)

score_01 = util.cos_sim(embeddings[0], embeddings[1]).item()
score_02 = util.cos_sim(embeddings[0], embeddings[2]).item()
print(f"Sentences 0 & 1 (paraphrase): {score_01:.2f}")  # → ~0.82
print(f"Sentences 0 & 2 (unrelated): {score_02:.2f}")   # → ~0.07

# ── OpenAI Embeddings API: managed, production-grade ─────────────────────────
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from environment

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(response.data[0].embedding)

v1 = embed("machine learning model training")
v2 = embed("training a neural network from scratch")
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"OpenAI cosine similarity: {cosine:.2f}")  # → ~0.91
```
| Model | Dimensions | Runs locally | Cost | Best for |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | ✅ CPU (80 MB) | Free | Prototyping, private data |
| all-mpnet-base-v2 | 768 | ✅ CPU/GPU (420 MB) | Free | Higher accuracy, local hosting |
| OpenAI text-embedding-3-small | 1536 | ❌ API call | $0.02 / 1M tokens | Production semantic search |
Start with sentence-transformers during prototyping or when data cannot leave your environment. Switch to the OpenAI API when retrieval quality is the primary constraint and managed infrastructure is acceptable.
For a full deep-dive on sentence-transformers model selection and production embedding pipelines, a dedicated follow-up post is planned.
🔗 Related Posts
- Vector Databases Explained — how ANN indexes like HNSW store and query billions of embeddings at scale.
- RAG Explained: How to Give Your LLM a Brain Upgrade — see embeddings in action as the retrieval backbone of a RAG pipeline.
- How Transformer Architecture Works — understand the neural network that produces modern contextual embeddings.
- How GPT/LLM Works — the full picture of how language models are trained and how embeddings fit in.

Written by Abstract Algorithms (@abstractalgorithms)