In the world of Natural Language Processing (NLP), how we represent words fundamentally shapes how machines understand language. From early breakthroughs like Word2Vec and GloVe to the rise of Transformers, each generation of word embeddings has changed what machines can do with text.

In this post, we’ll explore the key differences between these three approaches—and show you how to use them with Python code.

Word2Vec: Learning from Local Context

Introduced by researchers at Google in 2013, Word2Vec learns word embeddings with a shallow neural network trained to predict words from their context (or context from a word).

How it works:

  • Two architectures:
    • CBOW (Continuous Bag of Words): Predicts a word from its surrounding context.
    • Skip-gram: Predicts surrounding words from a target word.
  • Trained on local context windows in large corpora.
  • Outputs a fixed-size vector for each word.

Code Example: Word2Vec with Gensim

from gensim.models import Word2Vec

# Sample corpus
sentences = [["king", "queen", "man", "woman", "child", "royal"]]

# Train a Skip-gram Word2Vec model (sg=1; use sg=0 for CBOW)
w2v_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

# Get vector for a word
vector = w2v_model.wv["king"]
print("Vector for 'king':", vector[:5])  # Show first 5 dimensions

# Word analogy: king - man + woman ≈ ?
# (With a real corpus this lands near "queen"; the toy corpus above is far too small for a meaningful result.)
result = w2v_model.wv.most_similar(positive=["king", "woman"], negative=["man"])
print("Analogy result:", result)

GloVe: Global Co-occurrence Counts

Stanford’s GloVe (Global Vectors for Word Representation), released in 2014, blends the strengths of count-based and predictive models.

How it works:

  • Builds a co-occurrence matrix of word pairs across the entire corpus (see the sketch after this list).
  • Learns embeddings with a weighted least-squares fit to the log co-occurrence counts, so that ratios of co-occurrence probabilities are preserved (an implicit factorization of the matrix).
  • Embeddings reflect global statistical information.
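To make the counting step concrete, here is a minimal sketch in plain Python (not the actual GloVe training code) of collecting co-occurrence counts within a symmetric window; GloVe then fits word vectors to these counts:

from collections import defaultdict

corpus = [["the", "king", "and", "the", "queen"],
          ["the", "royal", "child"]]
window = 2

# cooc[(w1, w2)] counts how often w2 appears within `window` words of w1
cooc = defaultdict(float)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                # (The reference GloVe implementation also down-weights each count by 1/distance.)
                cooc[(word, sentence[j])] += 1.0

print(cooc[("the", "king")])  # how often "king" appears near "the"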

Code Example: GloVe via spaCy

import spacy

# Load a spaCy pipeline that bundles static word vectors
# (GloVe vectors in older releases; comparable pretrained vectors in newer ones)
nlp = spacy.load("en_core_web_md")

# Get vector for a word
vector = nlp("king").vector
print("Vector for 'king':", vector[:5])

# Similarity between words
similarity = nlp("king").similarity(nlp("queen"))
print("Similarity between 'king' and 'queen':", similarity)

Install the model with:

python -m spacy download en_core_web_md
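spaCy's bundled vectors are convenient, but if you specifically want the original Stanford GloVe vectors, one option is gensim's downloader API. A minimal sketch (the "glove-wiki-gigaword-100" package is downloaded on first use, so an internet connection is assumed):

import gensim.downloader as api

# Download (on first use) and load 100-dimensional GloVe vectors
# trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-100")

print("Vector for 'king':", glove["king"][:5])
print("Most similar to 'king':", glove.most_similar("king", topn=3))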

Transformers: Contextual Understanding with Attention

Transformers, the architecture behind models like BERT and GPT, don't just assign each word a single fixed vector; they represent words in the context of the sentence around them.

How it works:

  • Uses self-attention to weigh the importance of each word relative to others in a sentence.
  • Learns contextual embeddings: the same word can have different meanings depending on usage.
  • Trained on massive datasets with objectives like masked language modeling (BERT) or next-token prediction (GPT).

Code Example: BERT with Hugging Face

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # evaluation mode: disables dropout so outputs are deterministic

# Encode sentence
sentence = "The bank was flooded after the storm."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Extract contextual embeddings
last_hidden_state = outputs.last_hidden_state  # shape: [1, tokens, 768]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Print token and its vector
for token, vector in zip(tokens, last_hidden_state[0]):
    print(f"{token:12} → {vector[:5].tolist()}")

Bonus: Visualizing Embeddings with PCA
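As a quick sanity check, we can project the static Word2Vec vectors trained earlier down to two dimensions with PCA and plot them. Keep in mind that the toy corpus above is far too small for the layout to mean much; with real training data, related words cluster together.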

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

words = ["king", "queen", "man", "woman", "child", "royal"]
vectors = [w2v_model.wv[word] for word in words]  # static vectors from the Word2Vec model trained earlier

# Reduce dimensions
pca = PCA(n_components=2)
reduced = pca.fit_transform(vectors)

# Plot
plt.figure(figsize=(8, 6))
for i, word in enumerate(words):
    plt.scatter(*reduced[i])
    plt.text(reduced[i][0]+0.01, reduced[i][1]+0.01, word)
plt.title("Word2Vec Embeddings (2D PCA)")
plt.grid(True)
plt.show()

Summary Table

| Feature | Word2Vec | GloVe | Transformers (e.g., BERT) |
| Type | Predictive | Count-based | Contextual, attention-based |
| Context Sensitivity | ❌ Static | ❌ Static | ✅ Dynamic |
| Training Objective | Predict context | Factorize co-occurrence | Masked/causal language modeling |
| Speed | ⚡ Fast | ⚡ Fast | 🐢 Slower (but powerful) |
| Output | One vector per word | One vector per word | One vector per word per context |
| Use Case Fit | Simple NLP tasks | Semantic similarity | Advanced NLP (QA, NER, etc.) |

Final Thoughts

Word2Vec and GloVe laid the foundation for understanding word semantics, but they treat words in isolation. Transformers, on the other hand, bring context into the equation—making them the backbone of modern NLP and generative AI.

Whether you’re building a chatbot, designing a RAG pipeline, or just exploring the magic of language models, understanding these embedding techniques is essential.
