# PyTorch Basic Tutorial

In this notebook, we cover the very basics of PyTorch. The [PyTorch documentation](https://pytorch.org/docs/stable/index.html) is your best reference throughout. We will use PyTorch in the upcoming homeworks, and it will be very helpful for your project as well. PyTorch is one of the most popular machine learning libraries, used in both academia and industry.

The resources this notebook heavily relies on are:
1. CS224N Stanford Pytorch [tutorial](https://web.stanford.edu/class/cs224n/materials/CS224N_PyTorch_Tutorial.html).
2. Udacity skip-gram [tutorial](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/word2vec-embeddings/Skip_Grams_Solution.ipynb)

## Tensors

Tensors are the basic building blocks in PyTorch. They are similar to the n-dimensional arrays you have encountered in NumPy.

We'll first look at a few different ways of creating tensors and at some of their basic properties:


In [11]:
import torch
import numpy as np

# Create tensors using list
x1 = [[0, 1], [2, 3], [4, 5]]
tensor1 = torch.tensor(x1)

# Create tensors using numpy array
x2 = np.array(x1)
tensor2 = torch.from_numpy(x2)

# Check the data type, shape, and device of a tensor
print(f"tensor1.dtype: {tensor1.dtype}") # data type
print(f"tensor1.shape: {tensor1.shape}") # shape
print(f"tensor1.device: {tensor1.device}") # device

tensor1.dtype: torch.int64
tensor1.shape: torch.Size([3, 2])
tensor1.device: cpu
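
The two constructors above differ in one way that is easy to miss: `torch.tensor` copies its input, while `torch.from_numpy` shares memory with the NumPy array. A minimal sketch of the difference (the names `arr`, `copied`, and `shared` are illustrative and not part of the notebook):

```python
import numpy as np
import torch

arr = np.array([[0, 1], [2, 3]])

copied = torch.tensor(arr)      # torch.tensor copies the data
shared = torch.from_numpy(arr)  # torch.from_numpy shares memory with arr

arr[0, 0] = 100  # modify the NumPy array in place

print(copied[0, 0].item())  # 0, the copy is unaffected
print(shared[0, 0].item())  # 100, the shared tensor sees the change
```

If you need an independent tensor from an array, `torch.tensor(arr)` or `torch.from_numpy(arr).clone()` are both reasonable choices.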


The indexing of tensors and operations on them are very similar to those used in numpy. 

In [12]:

# Create a tensor of ones with shape (5, 2)
a = torch.ones(5,2)

# Index a row of the tensor
print(f"a[0,:]: {a[0,:]}")

# Index a particular element
print(f"a[3,1]: {a[3,1]}")

# Get a python scalar corresponding to the item
print(f"a[3,1].item(): {a[3,1].item()}")


a[0,:]: tensor([1., 1.])
a[3,1]: 1.0
a[3,1].item(): 1.0
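
To make the NumPy analogy a little more concrete, here is a small sketch of slicing, boolean masking, and broadcasting on a tensor. The tensor `m` and the other variable names are made up for illustration:

```python
import torch

m = torch.arange(12).view(3, 4)  # rows: [0..3], [4..7], [8..11]

col = m[:, 1]          # second column of every row
block = m[1:, :2]      # last two rows, first two columns
evens = m[m % 2 == 0]  # a boolean mask returns a flattened 1-D tensor

# Broadcasting: the 1-D tensor is added to every row of m
shifted = m + torch.tensor([10, 20, 30, 40])

print(col)         # tensor([1, 5, 9])
print(block)       # tensor([[4, 5], [8, 9]])
print(evens)       # tensor([ 0,  2,  4,  6,  8, 10])
print(shifted[0])  # tensor([10, 21, 32, 43])
```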


The [documentation](https://pytorch.org/docs/stable/tensors.html) covers a lot of operations, but some of the useful ones are:

In [13]:

# Create a tensor with elements ranging from 0 to 9
a = torch.arange(10)

# Reshape tensor to (1,10)
a = a.view(1,10)

# Concatenate in dimension 0 and 1
a_cat0 = torch.cat([a, a, a], dim=0)
a_cat1 = torch.cat([a, a, a], dim=1)

# Check shape of created tensors
print(f"a_cat0.shape: {a_cat0.shape}")
print(f"a_cat1.shape: {a_cat1.shape}")

# Squeeze removes a dimension of size 1
a = a.squeeze(dim=0)
print(f"After squeezing, a.shape={a.shape}")


a_cat0.shape: torch.Size([3, 10])
a_cat1.shape: torch.Size([1, 30])
After squeezing, a.shape=torch.Size([10])
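
Two related operations are worth knowing alongside `view` and `squeeze`: `unsqueeze` adds a dimension of size 1, and `reshape` also works on tensors that are not contiguous in memory (for example after a transpose), where `view` would raise an error. A short sketch with illustrative variable names:

```python
import torch

a = torch.arange(6).view(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# unsqueeze is the inverse of squeeze: it inserts a dimension of size 1
b = a.unsqueeze(0)
print(b.shape)  # torch.Size([1, 2, 3])

# A transposed tensor is a non-contiguous view of the same data
t = a.t()
print(t.is_contiguous())  # False

# reshape copies when needed; t.view(6) would raise a RuntimeError here
flat = t.reshape(6)
print(flat)  # tensor([0, 3, 1, 4, 2, 5])
```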


## Autograd

One of the most useful features of PyTorch is automatic differentiation, provided through Autograd: once you define a computation, PyTorch runs the backpropagation algorithm to compute the gradients of all weights automatically. The main method we will look at is `backward()`.

In [5]:
# Create an example tensor
# requires_grad parameter tells PyTorch to store gradients
x = torch.tensor([2.], requires_grad=True)

# Print the gradient if it is calculated
# Currently None since backward() has not been called yet
print(x.grad)

y = x * x * 3 # 3x^2
y.backward()
print(x.grad) # d(y)/d(x) = d(3x^2)/d(x) = 6x = 12

None
tensor([12.])


If we run backpropagation again on a different operation involving the same tensor, PyTorch accumulates the gradient, i.e. it adds the new gradient to the sum of the gradients so far. One way to zero out the gradients (which is usually done in every training iteration) is the optimizer's `zero_grad()` method. We will look at this later in the notebook.

In [6]:

# Define a new operation using the same tensor
z = x * x * 3 # 3x^2
z.backward()

# Output will be sum of gradients so far
print(x.grad)


tensor([24.])
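
The accumulation behaviour can also be reset by hand, without an optimizer, by zeroing the `.grad` tensor in place. The following is a minimal sketch that starts from a fresh tensor so it can be run on its own:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

(3 * x * x).backward()
first = x.grad.clone()        # 6x = 12

(3 * x * x).backward()
accumulated = x.grad.clone()  # 12 + 12 = 24, gradients were summed

x.grad.zero_()                # reset the accumulated gradient in place
(3 * x * x).backward()
after_reset = x.grad.clone()  # back to 12

print(first, accumulated, after_reset)
```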


## Neural Network Module

Pytorch provides a lot of useful simple building blocks in `torch.nn` which can be useful to create complicated neural networks.

### Linear layer

One can create linear layers using `nn.Linear(H_in, H_out)`, where `H_in` is the size of the last dimension of the input tensor and `H_out` is the size of the last dimension of the output tensor. The layer takes a tensor of shape `(N, *, H_in)` and outputs a tensor of shape `(N, *, H_out)`. The `*` denotes that there can be an arbitrary number of dimensions in between. The linear layer performs the operation `Ax+b` on the last dimension, where `A` and `b` are initialized randomly. If we don't want the linear layer to learn the bias parameters, we can initialize our layer with `bias=False`.

In [7]:
import torch.nn as nn

# Create the inputs
input = torch.ones(2,3,4)

# Make a linear layer transforming (N, *, H_in)-dimensional inputs to
# (N, *, H_out)-dimensional outputs
# The weights are randomly initialized
linear = nn.Linear(4, 2)
linear_output = linear(input)

# Check shape of output
print(linear_output.shape)

torch.Size([2, 3, 2])
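
To see what the layer computes, we can reproduce its output by hand from its `weight` and `bias` attributes. PyTorch stores the weight with shape `(H_out, H_in)`, so the manual version multiplies by its transpose. This is a small check on random data, with a fixed seed chosen only for reproducibility:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(4, 2)
x = torch.randn(2, 3, 4)

# The weight has shape (H_out, H_in); the bias has shape (H_out,)
print(linear.weight.shape, linear.bias.shape)

# Apply the affine map to the last dimension by hand
manual = x @ linear.weight.T + linear.bias

print(torch.allclose(linear(x), manual, atol=1e-6))  # True
```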



You can find many other useful module layers besides the linear layer such as recurrent layers (e.g. `nn.RNN` or `nn.LSTM`), pooling layers (e.g. `nn.MaxPool2d`), normalization layers (e.g. `nn.LayerNorm`), dropout layers (e.g. `nn.Dropout`), loss layers (e.g. `nn.BCELoss`) etc.

### Activation Layers

We can also use the `nn` module to apply activation functions to our tensors. Activation functions are used to add non-linearity to our network. Some examples are `nn.ReLU()`, `nn.Sigmoid()` and `nn.Softmax()`. Most activation functions, such as ReLU and sigmoid, operate on each element separately; `nn.Softmax()` instead normalizes along a chosen dimension. In both cases, the output tensor has the same shape as the input.

In [8]:
# Check output of linear layer, which is input to activation layer
print(linear_output)

sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)

# Check output of activation layer
print(output)


tensor([[[-0.1536, -0.4272],
         [-0.1536, -0.4272],
         [-0.1536, -0.4272]],

        [[-0.1536, -0.4272],
         [-0.1536, -0.4272],
         [-0.1536, -0.4272]]], grad_fn=<ViewBackward0>)
tensor([[[0.4617, 0.3948],
         [0.4617, 0.3948],
         [0.4617, 0.3948]],

        [[0.4617, 0.3948],
         [0.4617, 0.3948],
         [0.4617, 0.3948]]], grad_fn=<SigmoidBackward0>)
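
As a quick comparison of an elementwise activation with one that works along a dimension, the sketch below applies `nn.ReLU()` and `nn.Softmax(dim=1)` to a small hand-written tensor (the values in `scores` are arbitrary):

```python
import torch
import torch.nn as nn

scores = torch.tensor([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])

# ReLU acts on each element independently: negatives become 0
relu_out = nn.ReLU()(scores - 2.0)
print(relu_out)  # tensor([[0., 0., 1.], [0., 0., 0.]])

# Softmax normalizes along dim=1, so each row sums to 1
probs = nn.Softmax(dim=1)(scores)
print(probs.sum(dim=1))  # each entry is (approximately) 1
print(probs.shape)       # same shape as the input: torch.Size([2, 3])
```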


So far we have seen that we can create layers and pass the output of one as the input of the next. Instead of creating intermediate tensors and passing them around, we can use `nn.Sequential`, which does exactly that.

In [9]:
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid()
)

input = torch.ones(2,3,4)
output = block(input)
print(output)

tensor([[[0.3704, 0.6840],
         [0.3704, 0.6840],
         [0.3704, 0.6840]],

        [[0.3704, 0.6840],
         [0.3704, 0.6840],
         [0.3704, 0.6840]]], grad_fn=<SigmoidBackward0>)


### Custom Modules

Instead of using the predefined modules, we can also build our own by extending the `nn.Module` class. We initialize our layers and parameters in the `__init__` function, starting with a call to the `__init__` function of the super class. Every class attribute that is itself an `nn.Module` object is registered as a submodule, and its parameters become parameters of our module, which can be learned during training. Plain tensors are not parameters, but they can be turned into parameters by wrapping them in the `nn.Parameter` class.

All classes extending `nn.Module` are also expected to implement a `forward(x)` function, where `x` is a tensor. This is the function that is called when an input is passed to our module, such as in `model(x)`.



In [10]:
class MultilayerPerceptron(nn.Module):

  def __init__(self, input_size, hidden_size):
    # Call to the __init__ function of the super class
    super(MultilayerPerceptron, self).__init__()

    # Bookkeeping: Saving the initialization parameters
    self.input_size = input_size 
    self.hidden_size = hidden_size 

    # Defining of our model
    self.model = nn.Sequential(
        nn.Linear(self.input_size, self.hidden_size),
        nn.ReLU(),
        nn.Linear(self.hidden_size, self.input_size),
        nn.Sigmoid()
    )
    
  def forward(self, x):
    output = self.model(x)
    return output


# Make a sample input filled with random numbers from a normal distribution with mean 0 and variance 1
input = torch.randn(2, 5)

# Create our model
model = MultilayerPerceptron(5, 3)

# Pass our input through our model
output = model(input)

# Check output
print("Output ", output, "\n")

# Check all parameters in our custom defined model
print("Parameters", list(model.named_parameters()))

Output  tensor([[0.5239, 0.3923, 0.6230, 0.6071, 0.6117],
        [0.5860, 0.4769, 0.7796, 0.6161, 0.7864]], grad_fn=<SigmoidBackward0>) 

Parameters [('model.0.weight', Parameter containing:
tensor([[ 0.3074, -0.3048,  0.2139, -0.0170, -0.1865],
        [ 0.2583, -0.2864,  0.4366,  0.1438, -0.0087],
        [ 0.1540,  0.4378,  0.0295,  0.0770,  0.4033]], requires_grad=True)), ('model.0.bias', Parameter containing:
tensor([-0.0056,  0.2543,  0.1362], requires_grad=True)), ('model.2.weight', Parameter containing:
tensor([[ 0.4098,  0.4714, -0.2060],
        [ 0.3935, -0.2444,  0.4201],
        [ 0.4898,  0.2844,  0.4218],
        [-0.0612,  0.0289,  0.0333],
        [ 0.0986,  0.5076,  0.4517]], requires_grad=True)), ('model.2.bias', Parameter containing:
tensor([ 0.0958, -0.4378,  0.5025,  0.4351,  0.4546], requires_grad=True))]
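
The difference between a plain tensor attribute and one wrapped in `nn.Parameter` can be checked directly with `named_parameters()`. The `ScaledLinear` class below is a hypothetical module written only for this illustration:

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Module):
    """Hypothetical module: a linear layer with a learnable scale."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)      # submodule: its parameters are registered
        self.scale = nn.Parameter(torch.ones(1))  # wrapped tensor: registered as a parameter
        self.offset = torch.zeros(1)              # plain tensor: NOT registered

    def forward(self, x):
        return self.scale * self.linear(x) + self.offset

layer = ScaledLinear(5, 3)
names = [name for name, _ in layer.named_parameters()]
print(names)  # contains 'scale', 'linear.weight', 'linear.bias' but not 'offset'

out = layer(torch.randn(2, 5))
print(out.shape)  # torch.Size([2, 3])
```

Because `offset` is not registered, an optimizer built from `layer.parameters()` would never update it.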


## Optimization


We have seen how gradients are calculated with the `backward()` function. Having the gradients isn't enough for our models to learn; we also need to know how to update the parameters of our models. This is where the optimizers come in. The `torch.optim` module contains several optimizers that we can use. Some popular examples are `optim.SGD` and `optim.Adam`. When initializing an optimizer, we pass our model parameters, which can be accessed with `model.parameters()`, telling the optimizer which values it will be optimizing. Optimizers also have a learning rate (`lr`) parameter, which determines how big an update is made in every step. Different optimizers have additional hyperparameters as well.


In [15]:
import torch.optim as optim

# Create the y data
y = torch.ones(10, 5)

# Add some noise to our goal y to generate our x
# We want our model to predict our original data despite the noise
x = y + torch.randn_like(y)  # _like means it returns a tensor with the same size as the input

# Instantiate the model
model = MultilayerPerceptron(5, 3)

# Define the optimizer
adam = optim.Adam(model.parameters(), lr=0.1)

# Define loss using a predefined loss function
loss_function = nn.BCELoss()

# Calculate how our model is doing now
y_pred = model(x)
print("Before optimizing, loss:")
print(loss_function(y_pred, y).item())


# Set the number of epochs, which determines the number of training iterations
n_epoch = 10 

for epoch in range(n_epoch):
  # Set the gradients to 0
  adam.zero_grad()

  # Get the model predictions
  y_pred = model(x)

  # Get the loss
  loss = loss_function(y_pred, y)

  # Print stats
  print(f"Epoch {epoch}: training loss: {loss}")

  # Compute the gradients
  loss.backward()

  # Take a step to optimize the weights
  adam.step()


Before optimizing, loss:
0.6627954840660095
Epoch 0: training loss: 0.6627954840660095
Epoch 1: training loss: 0.5840400457382202
Epoch 2: training loss: 0.48152053356170654
Epoch 3: training loss: 0.3546123504638672
Epoch 4: training loss: 0.2312818020582199
Epoch 5: training loss: 0.13476794958114624
Epoch 6: training loss: 0.07308819890022278
Epoch 7: training loss: 0.04059215635061264
Epoch 8: training loss: 0.023065410554409027
Epoch 9: training loss: 0.013868678361177444
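
To see what a single optimizer step does, it helps to look at plain SGD, whose update rule (without momentum or weight decay) is simply `w = w - lr * grad`. The sketch below uses `optim.SGD` rather than Adam because its update is easy to verify by hand; the tensor `w` and the quadratic loss are toy choices:

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)
sgd = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # the gradient of this loss is 2w

sgd.zero_grad()
loss.backward()

# What plain SGD should produce: w - lr * grad
expected = w.detach() - 0.1 * w.grad

sgd.step()  # updates w in place

print(w.grad)      # tensor([ 2., -4.])
print(w.detach())  # matches expected: approximately [0.8, -1.6]
```

Adam follows the same `zero_grad()`, `backward()`, `step()` pattern, but rescales each update using running statistics of the gradients.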


Let's check how our model performs on the training data and on freshly generated test data:

In [17]:
# See how our model performs on the training data
print("Model performance on training data")
y_pred = model(x)
print(y_pred)

# Create test data and check how our model performs on it
print("Model performance on test data")
x2 = y + torch.randn_like(y)
y_pred = model(x2)
print(y_pred)


Model performance on training data
tensor([[0.9992, 0.9999, 0.9993, 0.9987, 0.9980],
        [0.9450, 0.9427, 0.9112, 0.9278, 0.9000],
        [0.9989, 1.0000, 0.9999, 0.9982, 0.9999],
        [0.9972, 0.9997, 0.9984, 0.9955, 0.9967],
        [0.9950, 0.9994, 0.9989, 0.9923, 0.9985],
        [0.9973, 0.9998, 0.9997, 0.9957, 0.9996],
        [0.9999, 1.0000, 1.0000, 0.9998, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [0.9995, 1.0000, 0.9999, 0.9992, 0.9997],
        [0.9999, 1.0000, 1.0000, 0.9999, 1.0000]], grad_fn=<SigmoidBackward0>)
Model performance on test data
tensor([[0.9985, 0.9999, 0.9999, 0.9975, 0.9998],
        [0.9987, 1.0000, 0.9999, 0.9979, 0.9999],
        [0.9962, 0.9994, 0.9969, 0.9940, 0.9938],
        [0.9998, 1.0000, 1.0000, 0.9996, 1.0000],
        [0.9962, 0.9996, 0.9989, 0.9941, 0.9982],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [0.9997, 1.0000, 1.0000, 0.9995, 1.0000],
        [0.9989, 1.0000, 0.9999, 0.9981, 0.9998],
     

## Real Example: Skip-gram Model

In previous lectures, you learned about the word2vec skip-gram model, where the task is to predict the context words given a particular word. Now, we will look at a simple example of how to train these word vectors in PyTorch.

We will be using the corpus [here](https://s3.amazonaws.com/video.udacity-data.com/topher/2018/October/5bbe6499_text8/text8.zip) (based on Wikipedia) to train our word vectors.



In [18]:
# read in the extracted text file      
with open('text8') as f:
    text = f.read()

# print out the first 100 characters
print(text[:100])

 anarchism originated as a term of abuse first used against early working class radicals including t


Next, we will process the dataset using the tokenization functions from `nltk`. The dataset contains 16M tokens, but for faster training we are going to use only the first 100k. You can comment out the line that truncates the word list to train a model on the full data.

In [19]:
from nltk.tokenize import word_tokenize
from collections import Counter

def preprocess(text):
    words = word_tokenize(text)
    
    # Remove all words with 5 or fewer occurrences
    word_counts = Counter(words)
    trimmed_words = [word for word in words if word_counts[word] > 5]

    return trimmed_words

# get list of words
words = preprocess(text)
print(words[:30])

# for the purpose of illustration, we are going to restrict the size of dataset
words = words[:100000]

# print some stats about this word data
print("Total words in text: {}".format(len(words)))
print("Unique words: {}".format(len(set(words))))



['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst']
Total words in text: 100000
Unique words: 10916


Next, we are going to create dictionaries which map each word in our vocabulary to an integer and vice-versa.

In [20]:

def create_lookup_tables(words):
    """
    Create lookup tables for vocabulary
    :param words: Input list of words
    :return: Two dictionaries, vocab_to_int, int_to_vocab
    """
    word_counts = Counter(words)
    # sorting the words from most to least frequent in text occurrence
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    # create int_to_vocab dictionaries
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

    return vocab_to_int, int_to_vocab

vocab_to_int, int_to_vocab = create_lookup_tables(words)
int_words = [vocab_to_int[word] for word in words]
train_words = [word for word in int_words]

print(int_words[:30])


[91, 2240, 9, 6, 230, 1, 2666, 66, 100, 218, 127, 455, 554, 6085, 118, 0, 4232, 1, 0, 247, 708, 2, 0, 6086, 6087, 1, 0, 231, 708, 3298]
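
The same frequency-sorted mapping is easier to inspect on a toy sentence. This sketch repeats the idea with illustrative names (`word_to_id`, `id_to_word`) and checks that encoding followed by decoding recovers the original words:

```python
from collections import Counter

toy_words = "the cat sat on the mat the end".split()

counts = Counter(toy_words)
# Most frequent word first; ties keep their first-seen order
vocab = sorted(counts, key=counts.get, reverse=True)

word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}

encoded = [word_to_id[word] for word in toy_words]
decoded = [id_to_word[i] for i in encoded]

print(encoded)               # [0, 1, 2, 3, 0, 4, 0, 5]
print(decoded == toy_words)  # True
```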


The original implementation of word2vec used some additional preprocessing tricks, which we exclude here for simplicity (e.g. to remove noise, it randomly drops some words).
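
For reference, the random dropping of words mentioned above is usually called subsampling of frequent words. The sketch below is not part of this notebook's pipeline; it assumes the drop probability `1 - sqrt(t / f(w))` described in the word2vec paper, where `f(w)` is the relative frequency of word `w` and `t` is a threshold. The helper name `subsample`, the toy data, and the large threshold are all chosen only for illustration:

```python
import math
import random
from collections import Counter

def subsample(words, threshold=1e-5, seed=0):
    """Hypothetical helper: randomly drop frequent words."""
    rng = random.Random(seed)
    counts = Counter(words)
    total = len(words)
    # Drop probability per word; it is negative (never dropped) for rare words
    p_drop = {w: 1 - math.sqrt(threshold / (c / total)) for w, c in counts.items()}
    return [w for w in words if rng.random() >= p_drop[w]]

# Toy corpus: one very frequent word and one rare word
toy_words = ['the'] * 1000 + ['anarchism']
kept = subsample(toy_words, threshold=1e-2)

print(len(toy_words), len(kept))  # most copies of 'the' are dropped
print('anarchism' in kept)        # True, the rare word is always kept
```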

We'll next implement the function which gives us the neighbouring words within a window around a particular word. Instead of a fixed window size, we'll follow the original word2vec paper and select `R` randomly between `1` and `window_size` as the window size for each call to the function.


In [21]:


def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. '''
    
    R = np.random.randint(1, window_size+1)
    start = idx - R if (idx - R) > 0 else 0
    stop = idx + R
    target_words = words[start:idx] + words[idx+1:stop+1]
    
    return list(target_words)

int_text = [i for i in range(10)]
print('Input: ', int_text)
idx=5 # word index of interest

target = get_target(int_text, idx=idx, window_size=5)
print('Target: ', target)  # you should get some indices around the idx


Input:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Target:  [1, 2, 3, 4, 6, 7, 8, 9]



Next, let's implement a function which gives us batches of data which we can use to train our model. The generator function returns batches of input and target data for our model, using the `get_target` function from above. The idea is that it grabs `batch_size` words from the list of words. Then, for each word in that batch, it gets the target words in a window around it.


In [22]:
def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y)) # We want to use a word x to predict all its neighbor words
        yield x, y
        
int_text = [i for i in range(20)]
x,y = next(get_batches(int_text, batch_size=4, window_size=5))

print('x\n', x)
print('y\n', y)
    

x
 [0, 1, 1, 2, 2, 2, 3, 3]
y
 [1, 0, 2, 0, 1, 3, 1, 2]



We'll define the skip-gram model as an `nn.Module`, using the building blocks you just learned about in PyTorch.


In [25]:

class SkipGram(nn.Module):
    def __init__(self, n_vocab, n_embed):
        super().__init__()
        
        self.embed = nn.Embedding(n_vocab, n_embed)
        self.output = nn.Linear(n_embed, n_vocab)
        self.log_softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, x):
        x = self.embed(x)
        scores = self.output(x)
        log_ps = self.log_softmax(scores)
        
        return log_ps
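
The model returns log-probabilities because it is later paired with `nn.NLLLoss`. Applying `nn.LogSoftmax` and then `nn.NLLLoss` gives the same value as applying `nn.CrossEntropyLoss` to the raw scores. The sketch below checks this, and the tensor shapes, using the same kinds of layers on random indices; the sizes are arbitrary toy values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_vocab, n_embed, batch = 10, 4, 6
embed = nn.Embedding(n_vocab, n_embed)
output = nn.Linear(n_embed, n_vocab)

inputs = torch.randint(0, n_vocab, (batch,))   # word ids
targets = torch.randint(0, n_vocab, (batch,))  # context word ids

scores = output(embed(inputs))         # shape (batch, n_vocab)
log_ps = nn.LogSoftmax(dim=1)(scores)  # each row holds log-probabilities

nll = nn.NLLLoss()(log_ps, targets)
ce = nn.CrossEntropyLoss()(scores, targets)

print(log_ps.shape)                        # torch.Size([6, 10])
print(torch.allclose(nll, ce, atol=1e-6))  # True
```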


Before moving on to the final training loop, we are going to define a function which we will use to validate our training. This is the same cosine similarity function you looked at previously, except that we look up the word vectors in the `nn.Embedding()` layer, which contains all our word vectors, and compute similarities for a few randomly chosen words in our vocab.

In [23]:
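import random

def cosine_similarity(embedding, valid_size=16, valid_window=100, device='cpu'):
    
    # Here we're calculating the cosine similarity between some random words and 
    # our embedding vectors. With the similarities, we can look at what words are
    # close to our random words.
    
    # sim = (a . b) / |a||b|
    # Note: below we only divide by |b|. The factor |a| is constant for each
    # validation word, so it does not change which words rank as closest.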
    
    embed_vectors = embedding.weight
    
    # magnitude of embedding vectors, |b|
    magnitudes = embed_vectors.pow(2).sum(dim=1).sqrt().unsqueeze(0)
    
    # pick N words from our ranges (0,window) and (1000,1000+window). lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples,
                               random.sample(range(1000,1000+valid_window), valid_size//2))
    valid_examples = torch.LongTensor(valid_examples).to(device)
    
    valid_vectors = embedding(valid_examples)
    similarities = torch.mm(valid_vectors, embed_vectors.t())/magnitudes
        
    return valid_examples, similarities
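
For comparison, a fully normalized cosine similarity can be computed by normalizing both sets of vectors before the matrix product. This is a sketch on a small random embedding, not a replacement for the function above; `query_ids` and the sizes are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

embedding = nn.Embedding(20, 8)
query_ids = torch.tensor([0, 5])

vectors = embedding.weight.detach()  # (20, 8), all word vectors
queries = vectors[query_ids]         # (2, 8), the words we look up

# Normalize rows to unit length; the dot product is then the cosine similarity
sims = F.normalize(queries, dim=1) @ F.normalize(vectors, dim=1).t()  # (2, 20)

# Each query word is its own nearest neighbour, with similarity close to 1
top_vals, top_ids = sims.topk(3, dim=1)
print(top_ids[:, 0])  # tensor([0, 5])

# Same value as the built-in pairwise function for a single pair
pair = F.cosine_similarity(queries[0], vectors[3], dim=0)
print(torch.allclose(pair, sims[0, 3], atol=1e-6))  # True
```

This is also why the training loop later skips the first entry of each `topk` result: the closest word to any word is the word itself.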

Finally, we'll define the main training loop where we will iterate over batches of data, compute a forward pass, compute the gradients using the `backward()` function, and use the optimizer to update the model parameters.

**NOTE**: The training loop below will be very slow on a laptop CPU, so I would not recommend running it there. You can instead run it on GPUs. You should have access to the computing server now; if not, please leave your NetID on the corresponding note on Campuswire so that we can help you request access.

In [26]:
import random

# check if GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

embedding_dim=50 # you can change, if you want

model = SkipGram(len(vocab_to_int), embedding_dim).to(device)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

eval_every = 500
print_every = 10
steps = 0
epochs = 5

# train for some number of epochs
for e in range(epochs):
    
    # get input and target batches
    for inputs, targets in get_batches(train_words, 512):
        steps += 1
        inputs, targets = torch.LongTensor(inputs), torch.LongTensor(targets)
        inputs, targets = inputs.to(device), targets.to(device)
        
        log_ps = model(inputs)
        loss = criterion(log_ps, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if steps % print_every == 0:
            print("Loss: ", loss)

        if steps % eval_every == 0:                  
            # getting examples and similarities      
            valid_examples, valid_similarities = cosine_similarity(model.embed, device=device)
            _, closest_idxs = valid_similarities.topk(6) # topk the highest similarities
            
            valid_examples, closest_idxs = valid_examples.to('cpu'), closest_idxs.to('cpu')
            for ii, valid_idx in enumerate(valid_examples):
                closest_words = [int_to_vocab[idx.item()] for idx in closest_idxs[ii]][1:]
                print(int_to_vocab[valid_idx.item()] + " | " + ', '.join(closest_words))
            print("...")



Loss:  tensor(9.3538, grad_fn=<NllLossBackward0>)
Loss:  tensor(9.2237, grad_fn=<NllLossBackward0>)
Loss:  tensor(9.0586, grad_fn=<NllLossBackward0>)
Loss:  tensor(9.0187, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.8914, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.8088, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.4792, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.6375, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.2983, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.2334, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.0423, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.0246, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.0054, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.0163, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.1407, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.1291, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.6070, grad_fn=<NllLossBackward0>)
Loss:  tensor(7.9302, grad_fn=<NllLossBackward0>)
Loss:  tensor(8.0155, grad_fn=<NllLossBackward0>)
Loss:  tensor(7.2684, grad_fn=<NllLossBackward0>)
