In the world of Natural Language Processing (NLP), how we represent words fundamentally shapes how machines understand language. From early breakthroughs like Word2Vec and GloVe to the rise of Transformers, the evolution of word embeddings has transformed the field.
In this post, we’ll explore the key differences between these three approaches—and show you how to use them with Python code.

Word2Vec: Learning from Local Context
Developed at Google in 2013 by Mikolov and colleagues, Word2Vec learns word embeddings with a shallow neural network trained to predict words from their surrounding context (or the reverse).
How it works:
- Two architectures:
  - CBOW (Continuous Bag of Words): predicts a word from its surrounding context.
  - Skip-gram: predicts surrounding words from a target word.
- Trained on local context windows slid over large corpora (see the quick sketch after this list).
- Outputs a fixed-size vector for each word.
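For example, with a window size of 2, skip-gram turns each target word into a handful of (target, context) training pairs. Here is a quick sketch of the pairs generated for a single target word in a toy sentence:
sentence = ["the", "king", "rules", "the", "kingdom"]
window = 2
target_index = 1  # the word "king"
# Context words within `window` positions on either side of the target
context = [sentence[j]
           for j in range(max(0, target_index - window),
                          min(len(sentence), target_index + window + 1))
           if j != target_index]
pairs = [(sentence[target_index], c) for c in context]
print(pairs)  # [('king', 'the'), ('king', 'rules'), ('king', 'the')]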
Code Example: Word2Vec with Gensim
from gensim.models import Word2Vec
# Sample corpus
sentences = [["king", "queen", "man", "woman", "child", "royal"]]
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
# Get vector for a word
vector = model.wv["king"]
print("Vector for 'king':", vector[:5]) # Show first 5 dimensions
# Word analogy: king - man + woman ≈ ?
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"])
print("Analogy result:", result)
GloVe: Global Co-occurrence Counts
Stanford’s GloVe (Global Vectors for Word Representation) blends the strengths of count-based and predictive models.
How it works:
- Builds a co-occurrence matrix of word pairs across the entire corpus (a minimal counting sketch follows this list).
- Learns embeddings by factorizing this matrix, preserving the ratios of co-occurrence probabilities.
- Embeddings reflect global statistical information.
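Unlike skip-gram, which consumes (target, context) pairs one at a time, GloVe first aggregates counts over the whole corpus. Here is a minimal sketch of that counting step with a symmetric window (the real GloVe training also down-weights counts by distance and then fits vectors whose dot products approximate the log counts; in practice you would use the reference implementation or download pre-trained vectors):
from collections import defaultdict

corpus = [["the", "king", "rules", "the", "kingdom"],
          ["the", "queen", "rules", "the", "kingdom"]]
window = 2
cooc = defaultdict(float)

for sentence in corpus:
    for i, word in enumerate(sentence):
        # Count every neighbour within `window` positions on either side
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[(word, sentence[j])] += 1.0

print(cooc[("king", "rules")])  # co-occurrence count for the pair ("king", "rules")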
Code Example: GloVe via spaCy
import spacy
# Load spaCy's medium English model, which bundles pre-trained static word vectors
# (older spaCy releases shipped GloVe vectors trained on Common Crawl for this model)
nlp = spacy.load("en_core_web_md")
# Get vector for a word
vector = nlp("king").vector
print("Vector for 'king':", vector[:5])
# Similarity between words
similarity = nlp("king").similarity(nlp("queen"))
print("Similarity between 'king' and 'queen':", similarity)
Install the model with:
python -m spacy download en_core_web_md
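Under the hood, .similarity() defaults to the cosine similarity between the two static vectors, which is easy to verify by hand with NumPy:
import numpy as np

king = nlp("king").vector
queen = nlp("queen").vector
# Cosine similarity: dot product divided by the product of the vector norms
manual = np.dot(king, queen) / (np.linalg.norm(king) * np.linalg.norm(queen))
print("Manual cosine similarity:", manual)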
Transformers: Contextual Understanding with Attention
Transformers, introduced in the 2017 paper "Attention Is All You Need" and popularized by models like BERT and GPT, don't just embed words: they represent them in context.
How it works:
- Uses self-attention to weigh the importance of each word relative to others in a sentence.
- Learns contextual embeddings: the same word can have different meanings depending on usage.
- Trained on massive datasets with objectives like masked language modeling (BERT) or next-token prediction (GPT).
Code Example: BERT with Hugging Face
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
# Encode sentence
sentence = "The bank was flooded after the storm."
inputs = tokenizer(sentence, return_tensors="pt")
# Run the model without tracking gradients (we only need the embeddings)
with torch.no_grad():
    outputs = model(**inputs)
# Extract contextual embeddings
last_hidden_state = outputs.last_hidden_state  # shape: [1, tokens, 768]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# Print token and its vector
for token, vector in zip(tokens, last_hidden_state[0]):
    print(f"{token:12} → {vector[:5].tolist()}")
Bonus: Visualizing Embeddings with PCA
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
words = ["king", "queen", "man", "woman", "child", "royal"]
vectors = [model.wv[word] for word in words]  # static vectors from the Word2Vec model trained above
# Reduce dimensions
pca = PCA(n_components=2)
reduced = pca.fit_transform(vectors)
# Plot
plt.figure(figsize=(8, 6))
for i, word in enumerate(words):
    plt.scatter(*reduced[i])
    plt.text(reduced[i][0] + 0.01, reduced[i][1] + 0.01, word)
plt.title("Word2Vec Embeddings (2D PCA)")
plt.grid(True)
plt.show()
Summary Table
| Feature | Word2Vec | GloVe | Transformers (e.g., BERT) |
|---|---|---|---|
| Type | Predictive | Count-based | Contextual, attention-based |
| Context Sensitivity | ❌ Static | ❌ Static | ✅ Dynamic |
| Training Objective | Predict context | Factorize co-occurrence | Masked/causal language modeling |
| Speed | ⚡ Fast | ⚡ Fast | 🐢 Slower (but powerful) |
| Output | One vector per word | One vector per word | One vector per word per context |
| Use Case Fit | Simple NLP tasks | Semantic similarity | Advanced NLP (QA, NER, etc.) |
Final Thoughts
Word2Vec and GloVe laid the foundation for representing word semantics, but they assign each word a single, fixed vector regardless of context. Transformers bring context into the equation, which is why they form the backbone of modern NLP and generative AI.
Whether you’re building a chatbot, designing a RAG pipeline, or just exploring the magic of language models, understanding these embedding techniques is essential.


