dH #007: Static Word Representations: Understanding the Pre-Deep Learning Era of NLP

*A comprehensive exploration of traditional word embeddings, from one-hot vectors to Word2Vec, and the remarkable properties that revolutionized natural language processing*

## Introduction

This lecture introduces static word representations. What I mean by a static representation is that the representation of a word is the same regardless of the context it is used in (e.g. "river bank" vs. "money bank").

Static Word Representations Concept

You would almost always want to use contextualized representations in your research. However, we are covering this material because it is important background for understanding different approaches to NLP, and you may see it used in papers.

## Traditional Models of Words: The Pre-Deep Learning Era

Alright, so I want to start by saying just a brief word about traditional models of words, what this looked like in the pre-deep learning era. The traditional approach was to use a dataset like WordNet, which is effectively a thesaurus containing lists of synonyms and hypernyms.

Traditional Models Overview

So this was made by humans, with human-engineered relationships. The problem was that it required a lot of human labor to create and maintain, it could lack nuance or comprehensiveness, it didn't exist at all for lower-resource languages, and it was static and very costly to keep up to date.

WordNet Human-Engineered Relationships

## The Problem with One-Hot Vector Representations

Alright, and so you can imagine treating words as discrete objects and just encoding them with one-hot vectors where the vector dimension is the number of words in the vocabulary. So why would you not actually want to do this?

One-Hot Vector Problems

In traditional NLP, words are treated as discrete objects. These can be encoded as one-hot vectors, where the vector dimension is the number of words in the vocabulary. First of all, there is no notion of similarity with one-hot vectors: by definition, they are orthogonal.

Moreover, suppose you have 500,000 words in the vocabulary and you want to encode a thesaurus using one-hot vectors; that's a 500,000 by 500,000 matrix, so obviously it's going to create computational problems.

So one-hot encodings of words are not really a computable object. How else can we represent words such that they are computable?
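
To make both problems concrete, here is a minimal sketch in Python (the toy vocabulary is made up for illustration; real vocabularies are far larger):

```python
import numpy as np

# Toy vocabulary: in practice this could be 500,000 words.
vocab = ["river", "bank", "money", "stream"]
V = len(vocab)

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# Distinct words are orthogonal, so their dot product (similarity) is always 0.
print(one_hot("river") @ one_hot("stream"))   # 0.0, even though the words are related
print(one_hot("river") @ one_hot("river"))    # 1.0, only identical words match

# With a 500,000-word vocabulary, a word-by-word matrix would need
# 500,000 * 500,000 = 2.5e11 entries, which is computationally impractical.
```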

## The Foundation: “You Shall Know a Word by the Company It Keeps”

Well, in pretty much any NLP course you take, you'll see this obligatory quote from J.R. Firth in the 1950s: "you shall know a word by the company it keeps." What this is saying is that if you can explain the correct context in which to use a word, then you understand the meaning of that word.

J.R. Firth Quote Foundation

And so this leads us to the concept of a word embedding which is a dense vector representation of a word.

## Understanding Word Embeddings

A word embedding is a dense vector representing a word. By dense vector, I mean that, in contrast to one-hot vectors, which are sparse, dense vectors have values in all their positions rather than being mostly zeros.

Word Embedding Definition

So words are embedded such that words that appear in similar contexts have similar word vectors and words that appear in different contexts have more different word vectors.
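
Similarity between dense vectors is usually measured with cosine similarity. A tiny illustration, with hypothetical 4-dimensional embeddings (real ones are closer to 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical dense embeddings, invented for illustration.
cat = np.array([0.7, 0.2, -0.1, 0.5])
dog = np.array([0.6, 0.3, -0.2, 0.4])   # appears in similar contexts to "cat"
car = np.array([-0.4, 0.9, 0.3, -0.6])  # appears in different contexts

print(cosine_similarity(cat, dog))  # high (close to 1)
print(cosine_similarity(cat, car))  # much lower (negative here)
```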

## Count-Based Methods: The Pre-Deep Learning Approach

These predominated in the pre-deep learning era. The idea was to consider words in their context, that is, the words that appear in a nearby window.

Count-Based Methods Overview

Count-based methods, also known as distributional models, created a word-context matrix that counted the number of co-occurrences between each word and the surrounding words in its context. So this was a really large matrix.

They then did dimensionality reduction with singular value decomposition (SVD). In practice, they needed a lot of hacks to make this work.
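
A minimal sketch of the count-then-reduce idea (the toy corpus and window size are made up; real distributional models used corpora with billions of tokens and many weighting tricks):

```python
import numpy as np
from collections import Counter

# Tiny toy corpus for illustration only.
corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Build the word-context co-occurrence matrix with a window of 1 on each side.
window = 1
counts = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[(idx[w], idx[sent[j]])] += 1

M = np.zeros((len(vocab), len(vocab)))
for (i, j), c in counts.items():
    M[i, j] = c

# Dimensionality reduction with SVD: keep the top-k singular directions
# to get dense k-dimensional word vectors.
U, S, Vt = np.linalg.svd(M)
k = 2
word_vectors = U[:, :k] * S[:k]
print(dict(zip(vocab, word_vectors.round(2))))
```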

## Introduction to Word2Vec

So now I want to jump to Word2Vec and see how Word2Vec created these word embeddings. Where again, remember the goal is that words that are used in similar contexts should have similar representations.

Word2Vec Introduction

And so what Word2Vec does is create non-contextualized representations. For "I hate this movie," there is one representation for "hate", one for "this", and one for "movie".

Whereas a contextualized representation would jointly embed the entire input sentence and then create representations for each word that depend on its context. So that is what the transformer does.

We're going to see contextualized representations next week, but in this lecture we're talking about non-contextualized representations.

## Training Word2Vec: The Technical Process

So we want to train Word2Vec. We create a matrix of embeddings for the target word and a matrix of embeddings for the context.

Word2Vec Training Matrices

The dimensionality of each of these matrices is the vocabulary size by the embedding size. The embedding size is a choice; I think in Word2Vec they were typically something like 300-dimensional vectors.

At the start of training these can be randomly initialized. By the way, I think they trained Word2Vec on Wikipedia, just for some background.
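
A minimal sketch of this setup (the vocabulary size here is hypothetical; the embedding size follows the roughly 300 dimensions mentioned above):

```python
import numpy as np

vocab_size = 10_000       # number of words in the vocabulary (hypothetical)
embedding_dim = 300       # dimensionality of each word vector

rng = np.random.default_rng(0)

# Two matrices, both vocab_size x embedding_dim, randomly initialized:
# one holds embeddings for target (input) words, one for context words.
target_embeddings = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
context_embeddings = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
```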

## The Training Process: Positive and Negative Examples

At each training step, we take a positive example and sample some negative examples. The example given in this post uses a sentence something like "thou shalt not do something."

Training Examples

The randomly sampled words from the vocabulary are "aaron" and "taco". The input word is "not", and "thou" is an actual positive word that appears in its context, so its target is 1; "aaron" and "taco" are the randomly sampled negative words.

Remember, we're sampling negative words at random because we can't compute the softmax loss over every word in the vocabulary; that's too costly to do.

So we look up the input word in the embedding matrix, and we look up the context words (shown in green) in the context matrix.

Matrix Lookup Process

So remember, we have every word in the vocabulary in each matrix. So we've looked up those embeddings.

And then we’re going to take the dot product of the embedding vector with each of the context vectors. And remember that the higher the dot product, the more similar those two vectors are.

Dot Product Computation

The resulting number measures the similarity of those two vectors. We pass it through the sigmoid, then compute the error and use that to update the model parameters.

More precisely, we compute the gradient of the loss and update the model parameters, then move on to the next positive example and its negative samples, and so on and so forth.

Training Process Flow
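
Putting these steps together, here is a minimal sketch of a single training step in this skip-gram-with-negative-sampling style of training (toy sizes and hypothetical word indices; a real implementation would stream over a large corpus):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, dim, lr = 1000, 50, 0.025
W_target = rng.normal(scale=0.1, size=(vocab_size, dim))    # input-word embeddings
W_context = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-word embeddings

# One training example: input word "not", positive context word "thou",
# plus two randomly sampled negative words (indices are hypothetical).
input_idx = 7
context_idx = [13, 421, 858]        # first is the true context word, rest are negatives
labels = np.array([1.0, 0.0, 0.0])  # 1 for the positive example, 0 for the negatives

v_in = W_target[input_idx]          # look up the input word embedding
v_ctx = W_context[context_idx]      # look up the context/negative embeddings

scores = v_ctx @ v_in               # dot products: higher = more similar
preds = sigmoid(scores)             # squash to (0, 1)
errors = preds - labels             # prediction error for each context word

# Gradient step on both matrices (binary cross-entropy gradients).
W_context[context_idx] -= lr * np.outer(errors, v_in)
W_target[input_idx] -= lr * errors @ v_ctx
```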

## Word2Vec Variants and Remarkable Properties

There's a variant of this called the Continuous Bag of Words (CBOW) model, where instead of predicting context words from the input word, we predict the center word based on the sum of the surrounding words' embeddings.

Word2Vec Variants
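
As a rough sketch of the CBOW idea under the same toy setup (hypothetical indices; a real model trains this with a softmax or negative-sampling loss rather than a raw argmax):

```python
import numpy as np

# CBOW sketch: score every vocabulary word as the center word given the SUM
# of the surrounding context embeddings (toy sizes, hypothetical indices).
rng = np.random.default_rng(0)
W_in = rng.normal(size=(1000, 50))             # context (input) embeddings
W_out = rng.normal(size=(1000, 50))            # output embeddings

context_idx = [12, 48, 205, 7]                 # words around the (hidden) center word
h = W_in[context_idx].sum(axis=0)              # sum of context embeddings
scores = W_out @ h                             # score every vocabulary word
predicted_center = int(np.argmax(scores))      # highest-scoring word is the prediction
```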

And so people were extremely, extremely excited about these models, such as the Continuous Bag of Words model shown in the diagram (Mikolov et al., 2013).

Continuous Bag of Words Model

Word embeddings have some pretty remarkable properties, which served as the impetus for machine learning methods to revolutionize NLP.

## The Famous Analogy Examples

And so this is a kind of famous example from the paper, where you take "man", add some vector to it, and you get "woman".

Famous Word Analogies

And the same vector added to "uncle" gives you "aunt", and added to "king" gives you "queen". This is an example of analogies, which I think is the thing people got the most excited about using Word2Vec for.

People used these on analogy tests, and there's a package that will let you compute these analogies.

Analogy Test Examples

So Australia is to beer as France is to champagne, good is to fantastic as bad is to terrible, and so on. I don't know quite how cherry-picked these examples are, but in general you can get analogies that at least some of the time seem to make a lot of sense.

In a high-dimensional vector space, a word can be close to lots of other words in many different directions, so you can potentially capture many dimensions of similarity.
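
One widely used package for computing these analogies is gensim. A sketch, assuming you can download pretrained vectors through gensim's downloader (the model name is one of the standard bundled options; exact results will vary with the vectors you load):

```python
import gensim.downloader as api

# Pretrained word vectors (download happens on first use).
kv = api.load("glove-wiki-gigaword-100")

# "man" is to "woman" as "king" is to ... ?
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# "australia" is to "beer" as "france" is to ... ?
print(kv.most_similar(positive=["beer", "france"], negative=["australia"], topn=3))
```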

## Context Window Size and Its Effects

The context window size you want depends on what relationships you hope to capture.

Small Window Effects

Small windows are good for indicating that words are interchangeable (note that antonyms are often interchangeable if you look only at the surrounding words, e.g. "This class is awesome/horrible"). Smaller windows also tend to extract more syntax-based information (e.g. generic surrounding words indicate that the input word is a noun).

Window Size Comparison

Larger windows allow the context to have a larger effect, and the embeddings are more likely to measure similarity in terms of the relatedness of words.
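
For instance, with gensim you could train two models that differ only in window size (a sketch assuming the gensim 4.x API; the toy corpus below is far too small to produce meaningful embeddings):

```python
from gensim.models import Word2Vec

sentences = [
    ["this", "class", "is", "awesome"],
    ["this", "class", "is", "horrible"],
    ["the", "river", "bank", "was", "muddy"],
]

# Small window: more syntactic / interchangeability-based similarity.
small_window = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
# Large window: similarity driven more by topical relatedness.
large_window = Word2Vec(sentences, vector_size=50, window=10, min_count=1)
```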

## Evaluation Approaches for Word Embeddings

So how should we think about evaluating how we're doing? That can be hard for an unsupervised task like learning word embeddings, as there's not really a clear evaluation metric.

People thought about intrinsic evaluation (how good the representations are, based on the features produced) and extrinsic evaluation (how useful they are for downstream tasks).

Evaluation Types

One approach you saw taken was dimensionality reduction. As I said, word vectors were something like 300 dimensions; reduce them to a much lower-dimensional space using dimensionality reduction techniques like Principal Component Analysis (PCA) or t-SNE, and see if the reduced dimensions group similar words together in a sensible way.

Dimensionality Reduction Visualization
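
A minimal sketch of this kind of visualization (the random vectors below are stand-ins; in practice you would project your trained embeddings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["king", "queen", "man", "woman", "paris", "london"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 300))   # stand-in for real 300-d embeddings

# Project down to 2 dimensions and plot each word at its projected location.
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title("Word vectors projected to 2-D with PCA")
plt.show()
```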

You can also correlate word vector distances with human judgments; there were datasets collected by psychologists on this. The WordSim-353 dataset was used a lot in this literature.

WordSim353 Dataset
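
A sketch of how that correlation is typically computed (the word pairs, ratings, and random embeddings below are illustrative only; WordSim-353 provides 353 such pairs with human similarity judgments):

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative (pair -> human similarity rating) entries in the style of WordSim-353.
human_ratings = {("tiger", "cat"): 7.3, ("book", "paper"): 7.5, ("king", "cabbage"): 0.2}

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=300) for pair in human_ratings for w in pair}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Rank-correlate the model's cosine similarities with the human ratings.
model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in human_ratings]
corr, _ = spearmanr(model_scores, list(human_ratings.values()))
print(f"Spearman correlation with human judgments: {corr:.2f}")
```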

People also evaluated performance with analogy datasets. And then finally, there were human-created categorization tasks.

Analogy Dataset Evaluation

## Additional Evaluation Methods: The “Sesame Street” Approach

Another common evaluation is categorization tasks, where human-created datasets with four images are presented and the task is to identify which one is not like the others, reminiscent of the Sesame Street game. The key thing is that we didn't optimize the model to do well on analogy tasks, or on the Sesame Street task, or on any other task it was evaluated on.

Categorization Task Example

We tried to do a simple task, which is to predict words in context using unsupervised methods, and the word vectors with these powerful properties were a side effect of this. And this is going to be a theme in NLP.

Good representations can have very powerful properties and can be useful for tasks that you want to perform, even though the model was not pre-trained to do that specific task.

## Key Lessons: The Power of Neural Network Representations

Okay, so I’ll say if you learn only two things from this course, the second thing that I would like you to learn is that neural networks can learn better ways to represent data than what humans can engineer by hand, as illustrated by the word vector visualization showing semantic relationships between words like ‘king’, ‘queen’, ‘man’, and ‘woman’.

Neural Network Representations

Representing data well is essential to downstream tasks that we want to perform, as demonstrated by the powerful properties of word vectors learned in an unsupervised manner.

## Transfer Learning: A Fundamental Theme

And so this relates to another really, really important thing that's going to show up over and over again in this course, which is the power of transfer learning.

The general approach of learning a good representation for task 1 and then using it to perform a downstream task 2 is called transfer learning, and it is a powerful theme in deep learning. We will see, for example, that an ImageNet pre-trained backbone (estimated on a dataset of natural images, over 10% of which are dog breeds) with limited fine-tuning can be a powerful feature extractor for very different tasks.

Transfer Learning Concept

Word embeddings have powered named entity recognition, part-of-speech tagging, parsing, and things like online recommendation engines during the mid-to-late 2010s. So we often initialize a downstream NLP model with embeddings and then fine-tune the model to extract the relevant dimensions from those embeddings. This is going to be the norm in most economic applications.
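
A minimal sketch of that initialize-then-fine-tune pattern (PyTorch assumed; `pretrained_vectors` is a hypothetical vocab_size x embedding_dim tensor standing in for real Word2Vec-style vectors):

```python
import torch
import torch.nn as nn

pretrained_vectors = torch.randn(10_000, 300)   # stand-in for real pretrained embeddings

class Classifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # freeze=False lets fine-tuning adjust the embeddings for the downstream task.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.linear = nn.Linear(300, num_classes)

    def forward(self, token_ids):
        # Average the word embeddings in the sentence, then classify.
        return self.linear(self.embedding(token_ids).mean(dim=1))

model = Classifier()
logits = model(torch.tensor([[1, 5, 42, 7]]))   # a toy batch of one 4-token sentence
```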

## Bilingual Word Embeddings: Bridging Languages

Another separate but relevant note I wanted to make here is about some work that was done on bilingual word embeddings, because this is going to be a precursor to some of the things we'll see, especially when we talk about multimodal learning.

This figure is from a paper that aimed to embed words from two different languages into a single shared space. You take words from English and words from Mandarin and train two sets of word embeddings, W_en and W_zh respectively, but you optimize such that words we know are good translations end up close together.

Bilingual Word Embeddings
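
A simplified sketch of the alignment idea (this is not the paper's exact objective; here a linear map is fit on known translation pairs, and all data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
W_en = rng.normal(size=(5, dim))   # English embeddings for 5 known translation pairs
W_zh = rng.normal(size=(5, dim))   # corresponding Mandarin embeddings

# Least-squares linear map M minimizing ||W_en @ M - W_zh||^2 over the known pairs.
M, *_ = np.linalg.lstsq(W_en, W_zh, rcond=None)

# Words never seen as translation pairs can now be mapped into the shared space.
new_english_vector = rng.normal(size=dim)
mapped = new_english_vector @ M
```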

The striking thing about this paper is that this worked reasonably well on words not seen during training, which I think is a really interesting point: it suggests that human languages have similar topologies, such that when some words are forced to align, other words align as well.

All right, so nothing about this says we can only embed data of the same modality into a single representation. One could, for example, classify images by producing a word embedding vector: when you show the model an image of a dog, its output should be close to the word embedding for "dog", and similarly for cats or whatever else. And what's really remarkable is what happens when you test the model on a new class of image.

## Bias in Word Embeddings: A Critical Examination

This work uses word embeddings trained on a historical corpus of American English, which is mostly books in the public domain, to look at biases in the corpus. For example, it looks at vectors that are very similar to 'she', relative to 'he', that represent different occupations, listed under 'Extreme she' and 'Extreme he'.

Gender Bias in Word Embeddings

And so you can see things like, you know, homemaker, nurse, receptionist, librarian, socialite, hairdresser, nanny, bookkeeper, stylist, housekeeper under ‘Extreme she’. They also look at gender stereotypes and ‘she-he’ analogies like ‘sewing-carpentry’, ‘nurse-surgeon’, ‘blond-burly’, ‘giggle-chuckle’, ‘sassy-snappy’, ‘diva-superstar’, ‘volleyball-football’, ‘cupcakes-pizzas’ when the word embedding model is trained on this corpus. And you can definitely see stereotypes there.
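
A sketch of the kind of measurement involved: project occupation words onto the "she minus he" direction, where more positive means the word is used in contexts more similar to 'she' (the vectors below are random stand-ins for embeddings actually trained on the corpus):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=300) for w in ["she", "he", "nurse", "carpenter", "homemaker"]}

# Unit vector pointing from "he" toward "she" in embedding space.
gender_direction = emb["she"] - emb["he"]
gender_direction /= np.linalg.norm(gender_direction)

for occupation in ["nurse", "carpenter", "homemaker"]:
    v = emb[occupation] / np.linalg.norm(emb[occupation])
    print(occupation, float(v @ gender_direction))
```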

They then look at it by decade: the top adjectives associated with women in 1910, 1950, and 1990 by relative norm difference in the COHA embedding, as shown in Table 2 of the slide.

Historical Gender Stereotypes Analysis

They argue that this can be used to approximate shifts in gender stereotypes, at least if you think this corpus is representative. And again, these are adjectives that are used in very, very similar contexts to 'she' but not to 'he', based on the COHA embeddings trained on the Corpus of Historical American English/Google Books, by decade.

They also do something similar for ethnic stereotypes, which I believe they do by looking at the most common names of Chinese descent versus common white names, based on the information given in the slide title. And again, you see some pretty extreme stereotypes about Asians (vs. Whites) in the word embeddings, as shown by negative adjectives like 'Irresponsible', 'Envious', and 'Barbaric' in 1910; 'Disorganized', 'Outrageous', and 'Predatory' in 1950; and 'Inhibited', 'Passive', and 'Dissolute' in 1990.

Ethnic Stereotypes in Word Embeddings

So I thought this paper was a really interesting example of using word embeddings to quantify 100 years of gender and ethnic stereotypes (PNAS).

## Summary and Conclusion

This comprehensive exploration of static word representations has taken us through a fascinating journey from the limitations of traditional approaches to the revolutionary impact of Word2Vec and beyond. We began by understanding the fundamental concept of static representations, where words maintain the same vector regardless of context, contrasting this with the contextualized approaches that would later dominate the field.

The progression from one-hot vectors to dense word embeddings represents a crucial paradigm shift in natural language processing. The foundational insight that “you shall know a word by the company it keeps” led to the development of count-based distributional models, which, despite their computational challenges, paved the way for more sophisticated approaches.

Word2Vec emerged as a groundbreaking solution, introducing an elegant training process that uses positive and negative sampling to learn meaningful word representations. The remarkable properties of these embeddings—particularly their ability to capture semantic relationships through vector arithmetic—demonstrated that neural networks could discover linguistic patterns that were both interpretable and practically useful.

Perhaps most significantly, this exploration revealed several key themes that continue to shape modern NLP:

**The Power of Learned Representations**: Neural networks can learn better ways to represent data than what humans can engineer by hand, as demonstrated by the semantic relationships captured in word vector spaces.

**Transfer Learning as a Fundamental Principle**: The ability of word embeddings to excel at tasks they weren’t explicitly trained for showcases the power of learning good representations for one task and applying them to downstream applications.

**The Importance of Evaluation**: From dimensionality reduction visualizations to analogy tests and human judgment correlations, the field developed sophisticated methods to assess the quality of learned representations.

**Addressing Societal Implications**: The examination of bias in word embeddings, including gender and ethnic stereotypes reflected in historical corpora, highlights the critical importance of understanding how our models inherit and potentially amplify societal biases present in training data.

The extension to bilingual word embeddings and the potential for multimodal representations point toward the broader applicability of these techniques across languages and data modalities. The striking finding that different human languages exhibit similar topological structures when embedded in vector spaces suggests deep universal properties of human communication.

As we stand at this juncture in the evolution of NLP, static word representations serve as both a foundation and a stepping stone. While modern contextualized models have largely superseded these approaches, the fundamental insights about representation learning, transfer learning, and the power of unsupervised methods continue to drive innovation in the field. The journey from Word2Vec to transformers represents not just technological progress, but a deeper understanding of how machines can learn to understand and process human language.