# Section 2: Operations on word vectors

In this notebook, we will cover two different operations on word vectors:

1. Similarity: Do similar words have similar word vectors?
2. Word Analogy Tasks: Can the similarity of word vectors be used to solve analogy tasks like "a is to b as c is to what"?

In [None]:
## Code heavily based on https://github.com/gemaatienza/Deep-Learning-Coursera/blob/master/5.%20Sequence%20Models/Operations%20on%20word%20vectors%20-%20v2.ipynb

import numpy as np
import collections
import os

def read_glove_vecs(glove_file):
    # The GloVe file contains non-ASCII tokens, so read it explicitly as UTF-8
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}

        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    return words, word_to_vec_map



For the purpose of this notebook, we will not be training word vectors. We will instead use pre-trained word vectors. More specifically, we will use the [GloVe](https://aclanthology.org/D14-1162.pdf) word vectors trained on 6 billion tokens, where every vector has 50 dimensions.

You can download the word vectors from [here](https://nlp.stanford.edu/projects/glove/) or [here](https://www.kaggle.com/datasets/watts2/glove6b50dtxt).
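
The GloVe text files store one word per line, followed by its vector components separated by spaces. As a rough illustration of what `read_glove_vecs` parses, here is a minimal sketch that reads the same layout from an in-memory string. The two words and their three-dimensional values are made up for illustration and are not taken from the real file.

```python
import io
import numpy as np

# Hypothetical toy data in the GloVe text layout (values are invented):
# each line is a word followed by its space-separated vector components.
toy_glove_text = "the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n"

toy_vectors = {}
for line in io.StringIO(toy_glove_text):
    parts = line.strip().split()
    # parts[0] is the word; the remaining fields are the vector components
    toy_vectors[parts[0]] = np.array(parts[1:], dtype=np.float64)

print(sorted(toy_vectors))       # ['cat', 'the']
print(toy_vectors["cat"].shape)  # (3,)
```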

In [None]:

words, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')

# words -- set of words in the vocabulary.
# word_to_vec_map -- dictionary mapping words to their respective vectors.

In [None]:
# Define cosine similarity function

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v, i.e. dot(u, v) / (||u|| * ||v||), where ||.|| is the L2 norm.
    """

    # Compute the dot product between u and v
    dot = np.dot(u,v)
    # Compute the L2 norm of u
    norm_u = np.linalg.norm(u)

    # Compute the L2 norm of v
    norm_v = np.linalg.norm(v)
    # Compute the cosine similarity defined by formula
    cosine_similarity = dot/(norm_u*norm_v)

    return cosine_similarity
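
Cosine similarity is $\frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$, so it depends only on the angle between the two vectors and not on their lengths. The sketch below checks three easy cases by hand: vectors pointing the same way, orthogonal vectors, and vectors pointing in opposite directions. The helper name `cosine_sim_sketch` and the toy vectors are illustrative only.

```python
import numpy as np

def cosine_sim_sketch(u, v):
    # Dot product divided by the product of the L2 norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Same direction, different length: similarity is approximately 1
same_direction = cosine_sim_sketch(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
# Perpendicular vectors: similarity is 0
orthogonal = cosine_sim_sketch(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Opposite directions: similarity is approximately -1
opposite = cosine_sim_sketch(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))

print(same_direction, orthogonal, opposite)
```

Values close to 1 therefore mean two word vectors point in nearly the same direction, values near 0 mean they are unrelated in this sense, and negative values mean they point away from each other.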

First, let us evaluate whether family-relation words (e.g. mother, brother) are more similar to each other than to an unrelated word (e.g. ball).

In [None]:

# Are relation words more similar to each other?

father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
brother = word_to_vec_map["brother"]
ball = word_to_vec_map["ball"]

print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(father, brother) = ", cosine_similarity(father, brother))
print("cosine_similarity(father, ball) = ", cosine_similarity(father, ball))



cosine_similarity(father, mother) =  0.8909038442893615
cosine_similarity(father, brother) =  0.932262758630331
cosine_similarity(father, ball) =  0.37516525380179355


Next, let us check whether countries are more similar to each other than to a random word. Do you observe a trend in which pairs of countries have more similar word vectors?

In [None]:
# Are countries more similar to each other?

france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
japan = word_to_vec_map["japan"]
canada = word_to_vec_map["canada"]

print("cosine_similarity(france, italy) = ", cosine_similarity(france, italy))
print("cosine_similarity(france, canada) = ", cosine_similarity(france, canada))
print("cosine_similarity(france, japan) = ", cosine_similarity(france, japan))
print("cosine_similarity(france, ball) = ", cosine_similarity(france, ball))

cosine_similarity(france, italy) =  0.7788637392080094
cosine_similarity(france, canada) =  0.6477412715377313
cosine_similarity(france, japan) =  0.4044382941502501
cosine_similarity(france, ball) =  0.25069385386555776


Word Analogy Task -- we will first write the core function, which returns the best answer. We will do this by iterating over all words w in the vocabulary and computing the cosine similarity between the two difference vectors (b - a) and (w - c). The word w with the highest similarity is returned.

In [None]:
## Word Analogy

# Function to find the best word which fits the analogy

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
    best_word --  the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """

    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Get the word embeddings e_a, e_b and e_c; a word missing from the vocabulary raises a KeyError
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]

    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:
        # Skip the input words so that best_word is never one of them
        if w in (word_a, word_b, word_c):
            continue

        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        # If cosine_sim exceeds the best similarity seen so far,
        # record it as the new maximum and remember the current word
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word
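
To see the search at a scale small enough to check by hand, here is a minimal sketch on a hypothetical five-word vocabulary of two-dimensional vectors. The vectors are made up (they are not GloVe values), and `toy_analogy` is an illustrative stand-in that follows the same idea as `complete_analogy`.

```python
import numpy as np

# Made-up 2-D vectors: the second coordinate loosely encodes "female"
toy_map = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "boy":   np.array([3.0, 0.0]),
    "girl":  np.array([3.0, 1.0]),
    "ball":  np.array([5.0, -1.0]),
}

def toy_analogy(word_a, word_b, word_c, vec_map):
    target = vec_map[word_b] - vec_map[word_a]   # direction from a to b
    best_word, best_sim = None, -np.inf
    for w, e_w in vec_map.items():
        if w in (word_a, word_b, word_c):        # never return an input word
            continue
        diff = e_w - vec_map[word_c]             # direction from c to candidate w
        sim = np.dot(target, diff) / (np.linalg.norm(target) * np.linalg.norm(diff))
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

print(toy_analogy("man", "woman", "boy", toy_map))  # girl
```

Here `woman - man` is `[0, 1]`, `girl - boy` is also `[0, 1]` (cosine similarity 1), and `ball - boy` is `[2, -1]` (negative similarity), so `girl` wins.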

Next, let us try out some examples. In some cases the relation between a and b is more semantic (e.g. a country and its capital), while in others it is more grammatical (e.g. an adjective and its comparative form). Note that these word vectors do not work for every kind of relation, as the last example shows.

In [None]:
# May take 1-2 minutes to run

triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'),
                 ('man', 'woman', 'boy'), ('small', 'smaller', 'large'),
                 ('small', 'smaller', 'big')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger
small -> smaller :: big -> competitors
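
Most of the running time goes into the Python loop over the vocabulary. One possible alternative, sketched below under the assumption that all vectors fit in memory as a single matrix, computes every cosine similarity at once with NumPy array operations. The function name `complete_analogy_vectorized` and the toy vectors are hypothetical, and we have not timed this against the loop version on the GloVe vocabulary.

```python
import numpy as np

def complete_analogy_vectorized(word_a, word_b, word_c, vec_map):
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    vocab = list(vec_map.keys())
    matrix = np.stack([vec_map[w] for w in vocab])   # shape (V, n)

    target = vec_map[word_b] - vec_map[word_a]       # e_b - e_a, shape (n,)
    diffs = matrix - vec_map[word_c]                 # e_w - e_c for every w, shape (V, n)

    # Cosine similarity of every row of diffs with target
    denom = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    sims = np.full(len(vocab), -np.inf)
    valid = denom > 0                                # avoid division by zero (e.g. w == c)
    sims[valid] = (diffs[valid] @ target) / denom[valid]

    # Exclude the input words from the candidates
    for w in (word_a, word_b, word_c):
        sims[vocab.index(w)] = -np.inf

    return vocab[int(np.argmax(sims))]

# Made-up 2-D vectors, used only to check the function by hand
toy_map = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "boy":   np.array([3.0, 0.0]),
    "girl":  np.array([3.0, 1.0]),
    "ball":  np.array([5.0, -1.0]),
}
result = complete_analogy_vectorized("man", "woman", "boy", toy_map)
print(result)  # girl
```

When trying several analogies, building `vocab` and `matrix` once outside the function and passing them in would avoid repeating that work on every call.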
