# Embedding
Summary: An embedding is a dense vector representation that captures semantic meaning of discrete objects like words, sentences, or images in a continuous numerical space. Embeddings enable machine learning models to process symbolic data by mapping similar concepts to nearby points in high-dimensional vector space, forming the foundation for modern NLP, recommendation systems, and similarity search applications.
## Overview
An embedding is a learned representation that maps discrete symbolic information (like words, sentences, users, or products) into dense, continuous vector spaces where semantic similarity corresponds to geometric proximity. This transformation allows machine learning models to work with symbolic data by converting it into numerical form that preserves meaningful relationships and enables mathematical operations.
## Core Concepts
### From Symbolic to Numerical
Traditional approaches used sparse, one-hot representations:
```python
# One-hot encoding (sparse, high-dimensional)
vocabulary = ["cat", "dog", "bird", "fish"]
cat = [1, 0, 0, 0] # 4-dimensional, mostly zeros
dog = [0, 1, 0, 0] # No semantic relationship captured
bird = [0, 0, 1, 0] # All words equally distant
fish = [0, 0, 0, 1]
```
Embeddings use dense, low-dimensional vectors:
```python
## Dense embeddings (semantic relationships preserved)
cat = [0.2, 0.8, 0.3, 0.1] # 4-dimensional, all values meaningful
dog = [0.3, 0.7, 0.4, 0.0] # Similar to cat (both pets)
bird = [0.1, 0.2, 0.9, 0.3] # Different from mammals
fish = [0.0, 0.1, 0.8, 0.6] # Similar to bird (both non-mammals)
```
### Semantic Space Properties
Well-trained embeddings exhibit remarkable properties:
#### Semantic Similarity
```python
## Words with similar meanings have similar vectors
cosine_similarity("king", "queen") = 0.72
cosine_similarity("cat", "dog") = 0.81
cosine_similarity("car", "bicycle") = 0.54
cosine_similarity("king", "apple") = 0.02
```
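The scores above are illustrative rather than measured. As a reference point, here is a minimal NumPy sketch of the `cosine_similarity` shorthand used in the pseudocode throughout this article (later examples that import scikit-learn's `cosine_similarity` use its batched, 2-D interface instead):
```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

# Toy vectors from the example above
cat = [0.2, 0.8, 0.3, 0.1]
dog = [0.3, 0.7, 0.4, 0.0]
print(cosine_similarity(cat, dog))  # Close to 1.0: similar direction
```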
#### Analogical Relationships
```python
## Vector arithmetic captures relationships
vector("king") - vector("man") ≈ vector("queen") - vector("woman")
vector("Paris") - vector("France") ≈ vector("Tokyo") - vector("Japan")
vector("walked") - vector("walk") ≈ vector("ran") - vector("run")
```
#### Clustering
```python
## Related concepts cluster together
animals = {"cat", "dog", "bird", "fish"}
colors = {"red", "blue", "green", "yellow"}
countries = {"France", "Japan", "Brazil", "Canada"}
## Each group forms distinct clusters in embedding space
```
## Types of Embeddings
### Word Embeddings
#### Word2Vec (2013)
Two training approaches for learning word representations:
**Skip-gram**: Predict context words from target word
```python
## Skip-gram training example
target_word = "cat"
context_window = ["the", "cat", "sat", "on", "mat"]
## Model learns: cat → {the, sat, on, mat}
## Objective: maximize P(context | target)
```
**Continuous Bag of Words (CBOW)**: Predict target from context
```python
## CBOW training example
context_words = ["the", "sat", "on", "mat"]
target_word = "cat"
## Model learns: {the, sat, on, mat} → cat
## Objective: maximize P(target | context)
```
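Both objectives are available off the shelf; a minimal sketch with Gensim's `Word2Vec` (the two-sentence corpus is made up purely for illustration):
```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per list
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects skip-gram, sg=0 selects CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"].shape)              # (50,)
print(skipgram.wv.similarity("cat", "dog"))  # cosine similarity of the two vectors
```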
#### GloVe (Global Vectors)
Combines matrix factorization with local context windows:
```python
## GloVe objective function
## J = Σ_ij f(X_ij) * (w_i^T w_j + b_i + b_j - log(X_ij))²
## Where
## X_ij = co-occurrence count of word i with word j
## w_i, w_j = word vectors for words i and j
## b_i, b_j = bias terms
## f(x) = weighting function to handle frequent/rare words
```
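The weighting function f is not spelled out in the snippet above; in the original GloVe paper it ramps up with the co-occurrence count and is capped at 1 so that very frequent pairs do not dominate:
```python
def glove_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below the cap, 1 above it."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```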
#### FastText
Extends Word2Vec to handle subword information:
```python
## FastText represents words as sum of subword vectors
word = "playing"
subwords = ["<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"]
## Final embedding = sum of subword embeddings
embedding("playing") = Σ embedding(subword) for subword in subwords
```
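A small sketch of how such character n-grams can be generated; note that FastText itself uses a range of n-gram lengths (typically 3-6) and hashes them into a fixed-size table, which this simplified version omits:
```python
def char_ngrams(word, n=3):
    """Character n-grams of a word padded with the boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```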
Benefits:
- Handles out-of-vocabulary words
- Captures morphological information
- Better for languages with rich morphology
### Contextual Embeddings
Traditional word embeddings assign fixed vectors to words, but contextual embeddings create different representations
based on context:
#### ELMo (Embeddings from Language Models)
Uses bidirectional LSTM to create context-aware embeddings:
```python
## ELMo creates different embeddings for same word
sentence1 = "I went to the bank to deposit money"
sentence2 = "I sat by the river bank"
## "bank" gets different embeddings based on context
bank_financial = elmo_embedding("bank", sentence1)
bank_river = elmo_embedding("bank", sentence2)
cosine_similarity(bank_financial, bank_river) = 0.23 # Different meanings
```
#### BERT Embeddings
Transformer-based contextual embeddings:
```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
def get_word_embedding(sentence, word, model, tokenizer):
    # Tokenize and locate the word (assumes `word` is a single WordPiece token)
    tokens = tokenizer.tokenize(sentence)
    word_idx = tokens.index(word) + 1  # +1 offsets the [CLS] token BERT prepends
    # Get BERT embeddings
    inputs = tokenizer(sentence, return_tensors='pt')
    outputs = model(**inputs)
    # Extract embedding for the specific word
    word_embedding = outputs.last_hidden_state[0, word_idx, :]
    return word_embedding
```
### Sentence and Document Embeddings
#### Sentence-BERT (SBERT)
Creates meaningful sentence-level embeddings:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"The cat sat on the mat",
"A feline rested on the rug",
"Dogs are great pets",
"Python is a programming language"
]
embeddings = model.encode(sentences)
## Similar sentences have similar embeddings
similarity_1_2 = cosine_similarity(embeddings[0], embeddings[1]) # High
similarity_1_3 = cosine_similarity(embeddings[0], embeddings[2]) # Medium
similarity_1_4 = cosine_similarity(embeddings[0], embeddings[3]) # Low
```
#### Doc2Vec
Extends Word2Vec to document-level embeddings:
```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
## Prepare documents
documents = [
    TaggedDocument(["neural", "networks", "deep", "learning"], ["doc1"]),
    TaggedDocument(["machine", "learning", "algorithms"], ["doc2"]),
    TaggedDocument(["cooking", "recipes", "food"], ["doc3"])
]
## Train Doc2Vec model
model = Doc2Vec(documents, vector_size=100, epochs=40)
## Get document embeddings (model.dv replaces the older model.docvecs accessor)
doc1_vector = model.dv["doc1"]
doc2_vector = model.dv["doc2"]
doc3_vector = model.dv["doc3"]
```
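Continuing the snippet above, documents that were not part of the training set can also be embedded; `infer_vector` optimizes a fresh vector for the new text against the frozen model:
```python
# Embed a previously unseen document (list of tokens)
new_doc = ["reinforcement", "learning", "agents"]
new_vector = model.infer_vector(new_doc)
```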
### Multimodal Embeddings
#### CLIP (Contrastive Language-Image Pre-training)
Creates shared embedding space for text and images:
```python
import clip
import torch
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
## Process image and text
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a cat", "a dog", "a car"]).to(device)
## Get embeddings in shared space
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Calculate similarities
similarities = (image_features @ text_features.T).softmax(dim=-1)
print(f"Image-text similarities: {similarities}")
```
## Training Methods
### Contrastive Learning
Many modern embeddings use contrastive learning objectives:
```python
## Contrastive loss function
import torch

def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """
    emb_a, emb_b: embeddings of a pair of items
    is_similar: 1.0 if the pair should be close, 0.0 otherwise
    Similar pairs are pulled together; dissimilar pairs are pushed
    apart until they are at least `margin` away.
    """
    distance = torch.norm(emb_a - emb_b, p=2)
    loss = is_similar * distance.pow(2) + \
        (1.0 - is_similar) * torch.clamp(margin - distance, min=0.0).pow(2)
    return loss
```
### Triplet Loss
Used for learning embedding spaces with relative similarity:
```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Learn embeddings where positive is closer than negative to anchor"""
    pos_dist = torch.norm(anchor - positive, p=2)
    neg_dist = torch.norm(anchor - negative, p=2)
    loss = torch.max(
        torch.tensor(0.0),
        pos_dist - neg_dist + margin
    )
    return loss
```
### Self-Supervised Learning
Modern approaches use self-supervised objectives:
```python
## Masked language modeling (BERT-style)
# cross_entropy and classifier stand in for framework-specific implementations
def masked_lm_loss(predictions, masked_positions, masked_tokens):
    """Predict masked tokens from context embeddings"""
    loss = cross_entropy(predictions[masked_positions], masked_tokens)
    return loss

## Next sentence prediction
def next_sentence_loss(sentence_a_emb, sentence_b_emb, is_next):
    """Predict if sentence B follows sentence A"""
    combined = torch.cat([sentence_a_emb, sentence_b_emb], dim=-1)
    prediction = classifier(combined)
    loss = cross_entropy(prediction, is_next)
    return loss
```
## Applications
### Similarity Search and Retrieval
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
class EmbeddingSearch:
    def __init__(self, embeddings, items):
        self.embeddings = np.array(embeddings)
        self.items = items

    def search(self, query_embedding, top_k=5):
        # Calculate similarities
        similarities = cosine_similarity(
            query_embedding.reshape(1, -1),
            self.embeddings
        )[0]
        # Get top-k most similar items
        top_indices = np.argsort(similarities)[::-1][:top_k]
        results = [
            (self.items[i], similarities[i])
            for i in top_indices
        ]
        return results
## Usage example
search_engine = EmbeddingSearch(product_embeddings, product_names)
query = get_embedding("smartphone with good camera")
similar_products = search_engine.search(query, top_k=10)
```
### Recommendation Systems
```python
class EmbeddingRecommender:
    def __init__(self, user_embeddings, item_embeddings):
        self.user_embeddings = user_embeddings
        self.item_embeddings = item_embeddings

    def recommend(self, user_id, exclude_seen=None, top_k=10):
        user_emb = self.user_embeddings[user_id]
        # Calculate user-item similarities
        similarities = cosine_similarity(
            user_emb.reshape(1, -1),
            self.item_embeddings
        )[0]
        # Exclude already seen items
        if exclude_seen:
            similarities[exclude_seen] = -1
        # Get top recommendations
        top_items = np.argsort(similarities)[::-1][:top_k]
        return top_items
```
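Usage mirrors the search example above. The user and item embeddings are assumed to already exist (in practice they are typically learned jointly, e.g. by matrix factorization or a two-tower model):
```python
# user_embeddings: (num_users, dim) array, item_embeddings: (num_items, dim) array
recommender = EmbeddingRecommender(user_embeddings, item_embeddings)
recommendations = recommender.recommend(user_id=42, exclude_seen=[3, 17], top_k=10)
```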
### Clustering and Classification
```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
## Clustering using embeddings
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(word_embeddings)
## Classification using embeddings as features
classifier = LogisticRegression()
classifier.fit(sentence_embeddings, labels)
predictions = classifier.predict(new_sentence_embeddings)
```
### Question Answering
```python
class EmbeddingQA:
    def __init__(self, passages, passage_embeddings):
        self.passages = passages
        self.passage_embeddings = passage_embeddings

    def answer_question(self, question, top_k=3):
        # Get question embedding
        question_emb = get_embedding(question)
        # Find most relevant passages
        similarities = cosine_similarity(
            question_emb.reshape(1, -1),
            self.passage_embeddings
        )[0]
        top_passages = np.argsort(similarities)[::-1][:top_k]
        # Return relevant passages for further processing
        return [self.passages[i] for i in top_passages]
```
## Quality Assessment
### Intrinsic Evaluation
#### Word Similarity Tasks
```python
## Evaluate on human-annotated similarity datasets
import scipy.stats

def evaluate_similarity(embeddings, similarity_dataset):
    human_scores = []
    model_scores = []
    for word1, word2, human_score in similarity_dataset:
        emb1 = embeddings[word1]
        emb2 = embeddings[word2]
        model_score = cosine_similarity(emb1, emb2)
        human_scores.append(human_score)
        model_scores.append(model_score)
    # Spearman rank correlation with human judgements (spearmanr also returns a p-value)
    correlation, _ = scipy.stats.spearmanr(human_scores, model_scores)
    return correlation
```
#### Analogy Tasks
```python
def evaluate_analogies(embeddings, analogy_dataset):
    correct = 0
    total = 0
    for a, b, c, expected_d in analogy_dataset:
        # a : b :: c : ?
        # vector(d) ≈ vector(b) - vector(a) + vector(c)
        target_vector = embeddings[b] - embeddings[a] + embeddings[c]
        # Find closest word to target_vector
        similarities = {}
        for word in embeddings:
            if word not in [a, b, c]:  # Exclude input words
                sim = cosine_similarity(target_vector, embeddings[word])
                similarities[word] = sim
        predicted_d = max(similarities, key=similarities.get)
        if predicted_d == expected_d:
            correct += 1
        total += 1
    accuracy = correct / total
    return accuracy
```
### Extrinsic Evaluation
#### Downstream Task Performance
```python
## Evaluate embeddings on classification task
from sklearn.model_selection import train_test_split

def evaluate_on_task(embeddings, X_text, y_labels):
    # Convert text to embeddings
    X_embeddings = np.array([embeddings[text] for text in X_text])
    # Train classifier
    X_train, X_test, y_train, y_test = train_test_split(
        X_embeddings, y_labels, test_size=0.2
    )
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    # Evaluate
    accuracy = classifier.score(X_test, y_test)
    return accuracy
```
## Technical Considerations
### Dimensionality Selection
```python
## Common embedding dimensions
embedding_dims = {
    "word2vec": [100, 200, 300],          # Typical range
    "glove": [50, 100, 200, 300],         # Pre-trained available
    "bert": [768, 1024],                  # Fixed architecture
    "sentence_transformers": [384, 768],  # Model-dependent
    "openai_ada": [1536],                 # API-based embedding
}
## Trade-offs
## Smaller dimensions: Faster computation, less memory, potential information loss
## Larger dimensions: More expressive, slower computation, higher memory usage
```
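When a pre-trained model produces more dimensions than an application needs, one pragmatic option is to project the vectors down and check how much variance survives. A sketch with scikit-learn's PCA (the input array here is random placeholder data):
```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(1000, 300)  # placeholder for real pre-trained vectors

pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)                        # (1000, 128)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```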
### Normalization
```python
def normalize_embeddings(embeddings):
    """L2 normalize embeddings for cosine similarity"""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / (norms + 1e-8)  # Avoid division by zero
    return normalized

## Benefits of normalization
## 1. Cosine similarity becomes dot product (faster computation)
## 2. All embeddings have unit length (consistent scale)
## 3. Focuses on direction rather than magnitude
```
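A quick check of the first benefit: after L2 normalization, the plain dot product equals cosine similarity.
```python
import numpy as np

a = np.array([0.2, 0.8, 0.3, 0.1])
b = np.array([0.3, 0.7, 0.4, 0.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(cosine, a_n @ b_n))  # True
```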
### Handling Out-of-Vocabulary (OOV) Words
```python
class EmbeddingHandler:
    def __init__(self, embeddings, unknown_token="<UNK>"):
        self.embeddings = embeddings
        self.unknown_token = unknown_token
        self.unk_vector = embeddings.get(unknown_token, self._create_unk_vector())

    def _create_unk_vector(self):
        # Create random vector or average of all embeddings
        if len(self.embeddings) > 0:
            all_vectors = np.array(list(self.embeddings.values()))
            return np.mean(all_vectors, axis=0)
        else:
            return np.random.normal(0, 0.1, size=300)  # Random initialization

    def get_embedding(self, word):
        return self.embeddings.get(word, self.unk_vector)
```
## Modern Developments
### Large-Scale Embeddings
#### OpenAI Ada Embeddings
```python
import openai
## High-quality embeddings via API
client = openai.OpenAI()
def get_openai_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

## Properties
## - 1536 dimensions
## - Trained on diverse, high-quality data
## - Strong performance across tasks
## - Cost: ~$0.0001 per 1K tokens
```
#### Cohere Embeddings
```python
import cohere
co = cohere.Client("your-api-key")
def get_cohere_embedding(texts):
    response = co.embed(
        texts=texts,
        model='embed-english-v2.0'
    )
    return response.embeddings
```
### Specialized Embeddings
#### Code Embeddings
```python
## Microsoft CodeBERT for code similarity
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
def get_code_embedding(code_snippet):
    inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)
```
#### Scientific Paper Embeddings
```python
## SciBERT for scientific text
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert-scivocab-uncased")
model = AutoModel.from_pretrained("allenai/scibert-scivocab-uncased")
```
### Multilingual Embeddings
```python
## Multilingual Universal Sentence Encoder
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
texts = [
"Hello world", # English
"Bonjour le monde", # French
"Hola mundo", # Spanish
"こんにちは世界" # Japanese
]
embeddings = embed(texts)
## All texts mapped to same semantic space regardless of language
```
## Limitations and Challenges
### Bias and Fairness
Embeddings can perpetuate societal biases:
```python
## Example of gender bias in word embeddings
def check_bias(embeddings):
    # Problematic associations often found:
    programmer_vec = embeddings["programmer"]
    nurse_vec = embeddings["nurse"]
    male_vec = embeddings["he"] - embeddings["she"]
    # Programmer might be closer to "male" direction
    programmer_bias = cosine_similarity(programmer_vec, male_vec)
    nurse_bias = cosine_similarity(nurse_vec, male_vec)
    print(f"Programmer-male bias: {programmer_bias}")
    print(f"Nurse-male bias: {nurse_bias}")
```
### Computational Scalability
```python
## Challenges with large-scale similarity search
def naive_similarity_search(query_emb, database_embs):
    """O(n) complexity - doesn't scale"""
    similarities = []
    for emb in database_embs:  # This becomes slow for millions of embeddings
        sim = cosine_similarity(query_emb, emb)
        similarities.append(sim)
    return similarities

## Solutions: Approximate nearest neighbor (ANN) libraries
## - Faiss (Facebook AI Similarity Search)
## - Annoy (Spotify)
## - Nmslib
## - Hnswlib
```
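As a concrete example, a minimal Faiss setup over L2-normalized vectors (inner product on unit vectors equals cosine similarity); the dimensionality and data below are placeholders:
```python
import numpy as np
import faiss

d = 384                                       # embedding dimensionality (placeholder)
database = np.random.randn(100_000, d).astype("float32")
faiss.normalize_L2(database)                  # normalize so inner product = cosine

index = faiss.IndexFlatIP(d)                  # exact inner-product index
index.add(database)                           # swap in IndexHNSWFlat / IndexIVFFlat for ANN at scale

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)          # top-5 most similar vectors
```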
### Context Length Limitations
```python
## Most embedding models have fixed context windows
model_limits = {
    "sentence-transformers": 512,     # tokens
    "openai-ada-002": 8191,           # tokens
    "cohere-embed": 512,              # tokens
    "text-embedding-3-large": 8191    # tokens
}
## Long documents need chunking strategies
def chunk_document(text, max_length=512, overlap=50):
    """Split long text into overlapping chunks"""
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_length - overlap):
        chunk = " ".join(words[i:i + max_length])
        chunks.append(chunk)
    return chunks
```
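The chunk embeddings then have to be combined or indexed individually. One simple strategy, sketched below with the hypothetical `get_embedding` helper used elsewhere in this article, is to mean-pool the chunk vectors into a single document vector:
```python
import numpy as np

def embed_long_document(text, max_length=512, overlap=50):
    """Embed a long document by averaging the embeddings of its chunks."""
    chunks = chunk_document(text, max_length=max_length, overlap=overlap)
    chunk_vectors = np.array([get_embedding(chunk) for chunk in chunks])
    return chunk_vectors.mean(axis=0)
```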
## Future Directions
### Improved Training Methods
- **Self-Supervised Learning**: Better pre-training objectives
- **Contrastive Learning**: More effective positive/negative sampling
- **Multi-Task Learning**: Training on multiple objectives simultaneously
### Architectural Innovations
- **Sparse Embeddings**: Reducing computational and memory requirements
- **Hierarchical Embeddings**: Capturing concepts at multiple levels
- **Dynamic Embeddings**: Adapting representations based on context
### Cross-Modal Understanding
- **Vision-Language**: Better alignment between visual and textual concepts
- **Audio-Text**: Connecting spoken and written language
- **Multimodal Fusion**: Combining multiple input types effectively
### Efficiency Improvements
- **Quantization**: Reducing precision while maintaining quality
- **Distillation**: Creating smaller models that match larger ones
- **Hardware Optimization**: Specialized chips for embedding computation
Embeddings represent one of the most fundamental advances in machine learning, enabling models to work with symbolic
data in meaningful ways. As the field continues to evolve, embeddings are becoming more sophisticated, efficient, and
capable of capturing nuanced semantic relationships across diverse types of data and modalities.