
A Beginner's Guide to Vector Database Principles

Vector databases turn text into meaning-aware vectors, enabling semantic search and reliable retrieval for RAG systems.

Abstract Algorithms · 14 min read

TLDR: A vector database stores meaning as numbers so you can search by intent, not exact keywords. That is why "reset my password" can find "account recovery steps" even if the words are different.


📖 Searching by Meaning, Not by Words

A standard database answers: "Does this row contain the exact string 'password reset'?"

A vector database answers: "Which rows are semantically similar to 'forgot my credentials'?"

Think of music playlists:

  • A keyword search finds songs with "love" in the title.
  • A vector search finds "chill late-night tracks" — matching mood, not lyrics.

| Search style | Matches | Strength | Weakness |
| --- | --- | --- | --- |
| Keyword (BM25) | Exact tokens | Precise for known words | Misses synonyms/rephrasing |
| Vector (semantic) | Meaning similarity | Handles natural language | Needs embeddings + tuning |
| Hybrid | Keyword + meaning | Best real-world quality | Slightly more complex |

🔍 What Makes a Vector Database Different from a Regular One

A relational database indexes values with B-trees and matches them exactly. A vector database indexes float arrays — long lists of numbers — and matches by geometric proximity in high-dimensional space.

Every record in a vector database has three parts:

| Part | What it is | Example |
| --- | --- | --- |
| Vector | Float array encoding meaning | [0.91, 0.12, -0.33, ...] (1536 dims) |
| Metadata | Structured fields for filtering | { source: "kb", lang: "en" } |
| ID | Unique document identifier | "doc-0042" |

The "search" operation is Approximate Nearest Neighbor (ANN): find the k vectors that point in the most similar direction to the query vector — without scanning every record.
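For contrast, here is what search looks like *without* an ANN index — a brute-force exact k-nearest-neighbour scan over toy 2-D vectors (real embeddings have hundreds of dimensions; the documents and vectors here are made up for illustration):

```python
# Exact k-NN by brute force: score every record against the query.
# This is the O(n)-per-query scan that ANN indexes are built to avoid.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

records = {
    "doc-1": [0.90, 0.10],   # "account recovery"
    "doc-2": [-0.22, 0.77],  # "banana bread"
    "doc-3": [0.85, 0.20],   # "password help"
}

def knn(query, k):
    ranked = sorted(records, key=lambda i: cosine(query, records[i]), reverse=True)
    return ranked[:k]

print(knn([0.91, 0.12], k=2))  # ['doc-1', 'doc-3'] -- doc-2 points elsewhere
```

An ANN index such as HNSW reaches near-identical top-k results while visiting only a small fraction of the records, which is the whole point at millions of vectors.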

The main products you will encounter:

| Product | Type | Best for |
| --- | --- | --- |
| Pinecone | Managed cloud | Production at scale, no ops |
| Weaviate | Open-source + cloud | Hybrid search, rich filtering |
| Chroma | Local / embedded | Fast prototyping, local dev |
| pgvector | PostgreSQL extension | Teams already on Postgres |

🔢 From Text to Numbers: What an Embedding Really Is

An embedding is a list of floats that captures the meaning of a piece of text.

You feed a sentence into an embedding model (e.g., text-embedding-ada-002, bge-base-en) and get back a vector like:

"reset my password"  →  [0.91, 0.12, -0.33, 0.07, ...]   (1536 dimensions)
"account recovery"   →  [0.90, 0.10, -0.31, 0.08, ...]   (1536 dimensions)
"banana bread"       →  [-0.22, 0.77,  0.55, -0.44, ...]  (very different)

The first two vectors point in nearly the same direction in 1536-dimensional space. The third points somewhere completely different.

Cosine similarity is the most common way to compare two vectors:

cosine(a, b) = (a · b) / (|a| × |b|)

Result near 1.0 = very similar meaning. Result near 0.0 = unrelated.

Toy walkthrough:

  • Query q = (0.91, 0.12), candidate d1 = (0.90, 0.10)
  • Dot product: 0.91×0.90 + 0.12×0.10 = 0.831
  • Norms: |q| ≈ 0.918, |d1| ≈ 0.906
  • Cosine: 0.831 / (0.918 × 0.906) ≈ 0.999 → highly similar ✅

Cosine similarity is length-invariant, so a long document and a short one on the same topic score high. Other options: dot product (fast, unnormalized) and Euclidean distance (L2, good when all vectors are unit-normalised).
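The toy walkthrough above can be checked in a few lines of plain Python:

```python
# Reproduce the toy cosine-similarity walkthrough step by step.
import math

q  = (0.91, 0.12)   # query: "reset my password"
d1 = (0.90, 0.10)   # candidate: "account recovery"

dot     = sum(a * b for a, b in zip(q, d1))   # 0.831
norm_q  = math.sqrt(sum(a * a for a in q))    # ~0.918
norm_d1 = math.sqrt(sum(b * b for b in d1))   # ~0.906
cos     = dot / (norm_q * norm_d1)

print(f"cosine = {cos:.4f}")  # ~0.9998 -> highly similar
```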

📊 ANN Search Sequence

sequenceDiagram
    participant U as User Query
    participant E as EmbeddingModel
    participant H as HNSW Index
    participant F as Filter Layer
    participant R as Results

    U->>E: "How do I reset my password?"
    E->>H: Query vector [0.91, 0.12, ...]
    H->>H: Traverse graph layers
    H->>H: Prune distant nodes
    H->>F: Top-K candidate vectors
    F->>F: Apply metadata filters
    F->>R: Return top-K chunks
    R-->>U: Relevant document chunks

📊 Vector DB Comparison

flowchart LR
    Managed["☁️ Managed / Cloud"]
    Open["🔓 Open-Source / Self-Hosted"]
    Postgres["🐘 PostgreSQL Extension"]

    Pinecone["Pinecone\nManaged, scalable\nno ops required"]
    Weaviate["Weaviate\nHybrid search\nrich filtering"]
    Chroma["Chroma\nLocal dev\nfast prototype"]
    pgvector["pgvector\nSQL + vectors\nexisting Postgres"]

    Managed --> Pinecone
    Open --> Weaviate
    Open --> Chroma
    Postgres --> pgvector

⚙️ The Two-Phase Pipeline: Indexing and Querying

Vector databases separate write-time indexing from read-time querying.

flowchart TD
    A[Raw Documents] --> B[Chunking]
    B --> C[Embedding Model]
    C --> D[Vector + Metadata]
    D --> E[ANN Index]
    Q[User Query] --> R[Query Embedding]
    R --> E
    E --> S[Top-k Candidates]
    S --> T[Optional Reranker]
    T --> U[Context for App or LLM]

Write path: chunk documents → embed each chunk → upsert vector + metadata into the ANN index. Read path: embed the query → ANN search → optional reranking → return top-k results.

| Phase | When it runs | Key step |
| --- | --- | --- |
| Indexing | Offline or near-line | Chunk → embed → upsert |
| Querying | Online, per request | Embed query → ANN search → rerank |

This separation matters: you can rebuild the index with a new embedding model without touching the query path.
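The two phases can be sketched in a few lines. The "embedding" here is a stand-in bag-of-words over a tiny fixed vocabulary, and the store is a plain dict — a real system would call an embedding model and upsert into an ANN index:

```python
# Minimal sketch of the write path (chunk -> embed -> upsert) and the
# read path (embed query -> search -> top-k). Chunking is omitted for brevity.
import math

VOCAB = ["password", "reset", "account", "recovery", "banana", "bread", "recipe"]
index = {}  # id -> (vector, metadata); a real store would be an ANN index

def embed(text):
    # Stand-in for an embedding model: normalised word counts over VOCAB.
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def upsert(doc_id, text, metadata):          # write path
    index[doc_id] = (embed(text), {"text": text, **metadata})

def search(query, k=2):                      # read path (brute force here)
    qv = embed(query)
    dot = lambda v: sum(a * b for a, b in zip(qv, v))
    ranked = sorted(index, key=lambda doc_id: dot(index[doc_id][0]), reverse=True)
    return ranked[:k]

upsert("doc-1", "how to reset your password", {"lang": "en"})
upsert("doc-2", "banana bread recipe", {"lang": "en"})
print(search("reset password", k=1))  # ['doc-1']
```

Because `embed` is only called inside `upsert` and `search`, swapping in a new embedding model means re-running the write path — the read path's code does not change.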


📊 How the RAG Pipeline Connects Every Piece

The most common production pattern is Retrieval-Augmented Generation (RAG), where the vector database acts as the LLM's long-term memory.

flowchart LR
    U[User Question] --> QE[Embed Query]
    QE --> VDB[(Vector DB\nPinecone / Weaviate\nChroma / pgvector)]
    VDB -->|Top-k chunks| CTX[Build Context]
    CTX --> LLM[LLM\nGPT-4 / Claude]
    LLM --> ANS[Grounded Answer]
    DOCS[Your Documents] --> IDX[Index Pipeline]
    IDX --> VDB

Without the vector database the LLM only knows what was in its training data. With it, the model can cite your private knowledge base, product catalog, or today's incidents.

The flow is: embed the user's question, retrieve the closest chunks from your vector store, inject them into the prompt, and let the LLM synthesise a grounded answer.
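The "inject them into the prompt" step can be sketched as follows — retrieval is stubbed out, and the prompt template and function names are illustrative, not a fixed API:

```python
# Sketch of RAG context assembly: retrieved chunks are numbered and placed
# into the prompt, with an instruction to ground the answer in them.

def retrieve_top_k(question, k=3):
    # Stand-in for: embed(question) -> ANN search in the vector DB.
    return [
        "Password reset sends a one-time link to your registered email.",
        "Recovery codes let you sign in when two-factor auth is unavailable.",
    ][:k]

def build_prompt(question, chunks):
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer based only on the provided articles.\n\n"
        f"Articles:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

question = "How do I reset my password?"
prompt = build_prompt(question, retrieve_top_k(question))
print(prompt)  # this string is what gets sent to the LLM
```

Numbering the chunks (`[1]`, `[2]`, ...) is what lets the LLM cite its sources in the final answer.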


🧠 Deep Dive: ANN Index Structures

ANN (Approximate Nearest Neighbor) indexes make vector search fast at scale by trading a tiny amount of recall for dramatically lower query latency:

| Index | Recall | Latency | Memory | Best for |
| --- | --- | --- | --- | --- |
| HNSW | High | Low | High | Low-latency semantic search |
| IVF | Medium | Medium | Medium | Large-scale, limited RAM |
| IVF+PQ | Medium | Medium | Low | Billion-scale, tight budgets |

Pinecone and Weaviate default to HNSW. Chroma uses HNSW via hnswlib. pgvector supports both HNSW and IVF.
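The IVF idea from the table fits in a few lines: vectors are bucketed under coarse centroids, and a query scans only the buckets of its `nprobe` nearest centroids. The centroids here are hard-coded toys; real systems learn them with k-means:

```python
# Toy IVF (inverted file) index: scan only the lists of the nprobe nearest
# centroids instead of every vector -- the recall/latency trade-off in code.
import math

centroids = [(1.0, 0.0), (0.0, 1.0)]   # 2 coarse clusters (hard-coded toy)
lists = {0: [], 1: []}                  # centroid id -> [(doc_id, vector)]

def add(doc_id, vec):
    cid = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
    lists[cid].append((doc_id, vec))

def search(query, k=1, nprobe=1):
    # Raising nprobe scans more clusters: higher recall, higher latency.
    probed = sorted(range(len(centroids)),
                    key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [item for c in probed for item in lists[c]]
    candidates.sort(key=lambda it: math.dist(query, it[1]))
    return [doc_id for doc_id, _ in candidates[:k]]

add("doc-1", (0.9, 0.1))
add("doc-2", (0.1, 0.95))
print(search((0.85, 0.2), k=1, nprobe=1))  # ['doc-1']
```

A vector falling near a cluster boundary can be missed when `nprobe` is small — that is exactly the "Medium recall" entry in the table.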


🌍 Real-World Application: Semantic Search for a Support Knowledge Base

Scenario: Your support team has 50,000 help articles. Customers type questions in natural language and expect the right article — even when wording does not match any article title.

Step 1 — Index: Chunk each article into 400-token segments. Embed each chunk with text-embedding-ada-002. Upsert the vector, chunk text, article ID, and language tag into Pinecone.

Step 2 — Query: When a customer types "my account keeps logging me out", embed that phrase, run a top-5 ANN search in Pinecone filtered to lang=en, and surface the matching article sections.

Step 3 — Augment: Feed the top-3 chunks into GPT-4 with "Answer based only on the provided articles." The LLM synthesises a direct answer with citations — no hallucination from training data.
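Step 2's filter-then-rank logic can be sketched with an in-memory stand-in — a real deployment would issue a single Pinecone query with a metadata filter; the records and vectors here are made up for illustration:

```python
# In-memory stand-in for a filtered top-k query: apply the metadata filter
# first, then rank only the surviving vectors by cosine similarity.
import math

records = [
    {"id": "kb-1", "vec": (0.90, 0.10), "lang": "en", "text": "Session timeout settings"},
    {"id": "kb-2", "vec": (0.88, 0.15), "lang": "es", "text": "Ajustes de sesión"},
    {"id": "kb-3", "vec": (0.10, 0.90), "lang": "en", "text": "Billing FAQ"},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def filtered_search(query_vec, lang, k=5):
    pool = [r for r in records if r["lang"] == lang]   # metadata pre-filter
    pool.sort(key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["id"] for r in pool[:k]]

print(filtered_search((0.91, 0.12), lang="en"))  # ['kb-1', 'kb-3']; kb-2 filtered out
```

Note that kb-2 is excluded even though its vector is very close to the query — the filter runs on metadata, not in vector space.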

Results seen in production:

  • Resolution rate improves because customers land on the right article, not the most-clicked one.
  • Agents use the same pipeline: "find all tickets similar to this escalation" surfaces precedent in seconds.

⚖️ Trade-offs & Failure Modes: Vector DB vs. Elasticsearch vs. Relational

| Dimension | Vector DB | Elasticsearch | Relational + pgvector |
| --- | --- | --- | --- |
| Semantic search | ✅ Native | ⚠️ With dense-vector plugin | ✅ With pgvector |
| Exact keyword / BM25 | ❌ Needs hybrid wrapper | ✅ Native | ⚠️ Full-text only |
| Joins / transactions | ❌ None | ❌ None | ✅ Full ACID |
| Ops complexity | Low (managed) | High | Low if on Postgres already |
| Cost at 100M+ vectors | High (managed) | Medium | Low hardware cost |

Common failure modes:

| Failure | Why it happens | Fix |
| --- | --- | --- |
| Chunk size too large | Irrelevant context floods results | 300–800 tokens per chunk |
| Embedding model upgrade | Old and new embeddings incompatible | Version embeddings; re-index on upgrade |
| No metadata filtering | Wrong language or tenant in results | Always filter on lang, tenant_id |
| No hybrid strategy | Exact product codes score low | Blend BM25 + vector with RRF |
| Stale documents | LLM cites outdated content | Scheduled re-embed + TTL on records |
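The "blend BM25 + vector with RRF" fix is small enough to show in full. Reciprocal Rank Fusion merges two ranked lists by summing 1/(k + rank) per document, so no score-weight tuning is needed (k = 60 is the commonly used constant; the document IDs below are illustrative):

```python
# Reciprocal Rank Fusion: merge a BM25 ranking and a vector ranking without
# tuning score weights. Each list contributes 1/(k + rank) per document.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["SKU-8842", "doc-7", "doc-3"]   # exact token match ranks the SKU first
vector_hits = ["doc-3", "SKU-8842", "doc-9"]   # semantic ranking

print(rrf([bm25_hits, vector_hits]))  # ['SKU-8842', 'doc-3', 'doc-7', 'doc-9']
```

Documents that appear high in both lists float to the top, which is why hybrid search handles both exact SKU codes and natural-language descriptions.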

🧭 Decision Guide: When to Reach for a Vector Database

| Situation | Recommendation |
| --- | --- |
| Use when | Queries are natural-language and meaning matters more than exact wording; data has rich text content (docs, tickets, product descriptions) |
| Avoid when | All lookups are by exact ID, timestamp range, or structured filters — a relational DB is simpler and cheaper |
| Consider hybrid | You need both keyword precision (product codes, proper nouns) and semantic recall — use Weaviate or Elasticsearch with dense-vector support |
| Start with pgvector if | You are already on Postgres, dataset is under 5M vectors, and you want zero additional infrastructure |
| Watch for | Embedding model lock-in: switching models requires re-indexing everything; plan for versioned index namespaces from day one |

🧪 Your First Semantic Search with Chroma in Python

Chroma is the fastest way to try a vector database locally — no signup, no cluster, one pip install.

import chromadb

client = chromadb.Client()
collection = client.create_collection(
    "support-docs",
    metadata={"hnsw:space": "cosine"},  # so "1 - distance" below is cosine similarity
)

# Index two documents (Chroma embeds them with its built-in model)
collection.add(
    documents=[
        "How to reset your account password via email link",
        "Steps to recover access when two-factor authentication is lost",
    ],
    ids=["doc-1", "doc-2"],
)

# Query with a natural-language question
results = collection.query(
    query_texts=["I can't log in, forgot my credentials"],
    n_results=2,
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[score {1 - dist:.3f}] {doc[:60]}...")

What happens under the hood: Chroma embeds the documents and query with its built-in default model (all-MiniLM-L6-v2), stores them in an HNSW index, and returns the nearest vectors by distance (L2 by default; cosine when the collection is created with the hnsw:space setting, as above). To go to production, swap chromadb.Client() for Pinecone or Weaviate and use text-embedding-ada-002.


📚 Three Things That Catch Every Vector Database Beginner

1. You cannot search across mixed embedding models. If you index with text-embedding-ada-002 and later query with bge-base-en, the vectors live in incompatible geometric spaces — ANN search returns garbage. Use the same model for both indexing and querying, and track which model version was used for each document batch.

2. Filtering happens in metadata, not in the vector space. Asking "find me billing content in Spanish" requires a metadata filter on lang=es applied before the ANN search — not a vector operation. Design your metadata schema before you start indexing.

3. ANN recall is approximate — and that is by design. HNSW occasionally misses the mathematically closest vector in exchange for sub-millisecond latency. For RAG, that trade-off is almost always worth it. Raise ef_search if recall quality is critical.
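Gotcha 1 can be demonstrated with two toy "models" — different random projections of a bag-of-words vector standing in for real embedding models (the vocabulary, seeds, and dimensions are all made up for illustration):

```python
# Toy demonstration of the mixed-embedding-model gotcha: within one model,
# related texts score high; across two models, the score is meaningless.
import math
import random

VOCAB = ["reset", "password", "account", "recovery", "banana", "bread"]
DIM = 32

def make_model(seed):
    # Each "model" is a different random projection of the bag-of-words.
    rng = random.Random(seed)
    proj = [[rng.gauss(0, 1) for _ in VOCAB] for _ in range(DIM)]
    def embed(text):
        bow = [float(text.lower().split().count(w)) for w in VOCAB]
        return [sum(p * x for p, x in zip(row, bow)) for row in proj]
    return embed

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

model_a, model_b = make_model(seed=1), make_model(seed=2)

within = cosine(model_a("reset password"), model_a("password reset"))
cross  = cosine(model_a("reset password"), model_b("reset password"))
print(f"same model: {within:.3f}, mixed models: {cross:.3f}")
```

Even though `cross` compares two embeddings of the *same* phrase, the result is an arbitrary number — exactly what happens when old and new model vectors share one index.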


🛠️ ChromaDB, Pinecone, Weaviate, and pgvector: Picking the Right Vector Store

ChromaDB is an open-source embedded vector database built for local development and rapid prototyping — zero infrastructure required. Pinecone is a managed cloud vector database with serverless scaling. Weaviate is an open-source vector search engine with native hybrid (BM25 + vector) search. pgvector is a PostgreSQL extension that adds vector storage and ANN search without leaving your existing relational database.

# --- ChromaDB + sentence-transformers (local prototype, no signup needed) ---
# pip install chromadb sentence-transformers
import chromadb
from sentence_transformers import SentenceTransformer

encoder    = SentenceTransformer("all-MiniLM-L6-v2")
client     = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(
    "knowledge-base",
    metadata={"hnsw:space": "cosine"},  # cosine distance so "1 - distance" is a similarity
)

docs = [
    "Password reset sends a one-time link to your registered email address.",
    "Two-factor authentication can be disabled from your account security settings.",
]
embeddings = encoder.encode(docs).tolist()
collection.upsert(documents=docs, embeddings=embeddings, ids=["doc-1", "doc-2"])

query_vec = encoder.encode(["I forgot my login credentials"]).tolist()
results   = collection.query(query_embeddings=query_vec, n_results=2)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[similarity {1 - dist:.3f}] {doc[:80]}")

# --- pgvector (stays inside Postgres — zero new infrastructure) ---
# pip install psycopg2-binary pgvector
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=support user=postgres")
register_vector(conn)
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id serial PRIMARY KEY, content text, embedding vector(384))"
    )
    # Reuses `encoder` (all-MiniLM-L6-v2, 384 dims) from the section above.
    # register_vector adapts numpy arrays, so pass the ndarray directly
    # rather than a Python list.
    vec = encoder.encode("Password reset guide")
    cur.execute("INSERT INTO docs (content, embedding) VALUES (%s, %s)",
                ("Password reset guide", vec))
    # Cosine similarity search: <=> is the pgvector cosine distance operator
    query_vec = encoder.encode("forgot credentials")
    cur.execute(
        "SELECT content, 1 - (embedding <=> %s) AS similarity "
        "FROM docs ORDER BY embedding <=> %s LIMIT 5",
        (query_vec, query_vec)
    )
    for row in cur.fetchall():
        print(f"[similarity {row[1]:.3f}] {row[0]}")
conn.commit()

| Tool | Best for | Infrastructure needed |
| --- | --- | --- |
| ChromaDB | Local dev, notebooks, fast prototyping | None (embedded) |
| Pinecone | Production at scale, serverless, no-ops | Cloud-managed |
| Weaviate | Hybrid search, multi-modal, open-source control | Self-hosted or cloud |
| pgvector | Teams already on Postgres, <5M vectors | Existing Postgres cluster |

For a full deep-dive on Pinecone index configuration and Weaviate hybrid search with BM25 + vector fusion, a dedicated follow-up post is planned.


📌 TLDR: Summary & Key Takeaways

TLDR: A vector database stores embeddings and finds nearest neighbors — reach for one when queries need semantic understanding, not exact keyword matching.

  • A vector database stores embeddings — numeric fingerprints of meaning — and returns the k most similar ones to any query.
  • Two phases: indexing (chunk → embed → upsert, done offline) and querying (embed query → ANN search → rerank, done online).
  • Three common ANN indexes: HNSW (best quality, high memory), IVF (clusters, medium memory), IVF+PQ (compressed, lowest memory).
  • The dominant production use case is RAG: injecting retrieved document chunks into an LLM prompt to ground answers in your private knowledge.
  • Do not mix embedding models across your index. Do use metadata filters for tenant and language isolation. Do retrieve top-k and rerank rather than relying on top-1.
  • Start locally with Chroma, scale with Pinecone (managed) or Weaviate (open-source), or stay on pgvector if you are already on Postgres.

📝 Practice Quiz

  1. A customer types "I can't log into my account" and your support search returns an article titled "Account Access and Recovery". Which search method made this possible?

    A) BM25 keyword search, because "account" appears in both
    B) Vector (semantic) search, because the embeddings of both phrases point in a similar direction
    C) SQL LIKE query with wildcard matching
    D) A synonym dictionary mapping "log in" to "access"

    Correct Answer: B — Embedding models encode intent and meaning, not just tokens. Semantically related phrases cluster near each other in vector space regardless of exact wording.

  2. You index 10 million product descriptions with text-embedding-ada-002 and later switch to bge-large-en-v1.5 for new products. What is the most likely outcome when a customer searches for an old product?

    A) The search works fine because both models use 1536 dimensions
    B) Old product results are ranked lower or missing because the two models produce vectors in incompatible geometric spaces
    C) The database automatically re-embeds old products using the new model
    D) Cosine similarity scores go above 1.0, causing an error

    Correct Answer: B — Different embedding models produce geometrically incompatible spaces. Mixing them in one index causes ANN search to return meaningless results for the older embeddings.

  3. Your HNSW-indexed vector database returns results in 4 ms for a corpus of 5 million chunks. You add a metadata filter so only documents from a specific tenant are returned. Which best describes the performance impact?

    A) Latency increases dramatically because HNSW must now scan all vectors
    B) Latency is roughly similar because metadata filtering narrows the search space rather than expanding it
    C) Latency goes to zero because filtered results are cached
    D) HNSW cannot support metadata filtering; you must switch to IVF

    Correct Answer: B — Pinecone, Weaviate, and Chroma all support pre-filtering that narrows the search space rather than expanding it, keeping latency roughly stable.

  4. You are building a product search feature. Users sometimes type exact SKU codes (e.g., "SKU-8842") and sometimes describe what they want ("waterproof hiking boots under $150"). Which architecture best handles both cases?

    A) Pure vector search with a high-dimensional model
    B) Pure BM25 keyword search
    C) Hybrid search: BM25 for exact token matches + vector search for semantic queries, scores merged with Reciprocal Rank Fusion
    D) A relational database with a LIKE query and a synonym table

    Correct Answer: C — Hybrid search pairs BM25 (exact token precision for SKUs and brand names) with vector search (semantic recall for natural-language descriptions). RRF merges both ranked lists without manual score-weight tuning.


Written by Abstract Algorithms (@abstractalgorithms)