2. Basic machine learning

Machine learning is the main driving force of the success in NLP these days. Instead of specifying rules to process text (e.g., translate between two specific language pairs), which is difficult to scale, the goal of ML is to automatically learn the “rules” from data.

The success of ML depends on two things:

  • Availability of large amounts of data. Machine learning makes sense only if it is cheaper to obtain data than to write rules. Fortunately, there are plenty of text online, although sometimes we need to be smart about getting labels.

    • Tasks where we can get labels for free: language modeling, news summarization etc.

    • Tasks where we need to crowdsource labels: text entailment, reading comprehension etc.

    • Tasks that require expert annotations: constituent parsing, semantic parsing etc.

A famous saying about ML is “garbage in, garbage out”: you only get results as good as your data. We should always keep in mind that there is noise and biases in the data and ML systems are not immune to them.

  • Generalization to unseen samples. We are often not interested in performing well on the collected data (the training set); after all it is easy to memorize all labels. Instead, we want the model to make predictions on unseen inputs (the test set). ML promises generalization to unseen examples from the same distribution. This is formalized in statistical learnign theory. In practice, however, the test examples (generated by end users) are rarely from the training data distribution. Therefore, we should be aware of the limit of learning.

2.1. Modeling, learning, inference

The ML approach to any task consists of the following three components: modeling, learning, and inference.

2.1.1. Modeling

The goal of modeling is to represent the task by a mathematical object that we can manipulate. Generally, for NLP problems, given an input space \(\mathcal{X}\) and an output space \(\mathcal{Y}\), the model is a function \(f\colon \mathcal{X} \rightarrow \mathcal{Y}\). Take sentiment classification as an example: we have \(\mathcal{X}=\{\text{sentences of interest}\}\) and \(\mathcal{Y}=\{\text{negative}, \text{positive}\}\).

The modeling question here is what is \(f\). One possible solution is to use a lookup table—we save the value of \(y\in\mathcal{Y}\) for every possible input. What is the problem with this model? Well, it has a huge size (number of parameters) and will be difficult to learn. Instead of taking the complete sentence as input, we might want to just consider individual words in it (i.e the bag-of-words model).

Now, obviously the bag of words model breaks the sentence structure and we lose information in this process (it’s not possible to get back the exact sentence given individual words in it). However, the point of modeling is to capture the essential relations in a managable form, and anything else that is not modeled is considered as “noise”. In this course we will go through different modeling techniques for text, e.g. log linear models and neural networks.

2.1.2. Learning

Specifiying the model is often not enough because it may depend on unknown parameters. Learning allows us to estimate these parameters from data. Typically, learning consists of two steps: (a) specify a loss function (e.g. squared loss), and (b) minimize the average loss (i.e. empirical risk minimization). The optimization step (b) is often done by stochastic gradient descent.

The key topic in most learning algorithms is generalization: If we minimize the loss on the training data, how do we know that the model will have low loss on the test data too? The situation we want to avoid is overfitting, where the model performs very well on the training data but poorly on the test data. To prevent overfitting, we often want to make our model family simple because a complex function is more likely to memorize the training examples (think a lookup table). There are many ways to restrict the model family depending on the specific model at hand.

2.1.3. Inference

Given a model with learned parameters, we still need to execute the model on some input to obtain the output. This process is called inference (sometimes refered to as decoding or the argmax problem). It may seem trivial at first. In the sentiment classification example, we can simly compute the classifier score for each class and predict the highest-scoring one, which takes \(O(\text{number of classes})\). However, inference becomes slightly more complex if our output space contains sequences or trees, because there are exponentially many sequences or trees and enumerating their scores is intractable. We will need dynamic programming or integer linear programming in these cases. In latent variable models, we might want to marginalize over the latent variables, which again can be intractable. The main point is that inference is often not as easy as taking the class with the maximum score.

2.2. Loss functions and optimization

Below we briefly review the two key steps in learning: specifying the loss function and minimizing the average loss on the training set. The following text is meant to be a refresher only. For proper coverage of the topic, please refer to an introductory ML course (e.g. DS-GA.1003).

2.2.1. Loss functions

A loss function measures the goodness of a specific model prediction. It takes the input \(x\), the gold label \(y\), the model output \(f_w(x)\), and produce a score for this prediction:

(2.2.1)\[L(x, y, f_w(x)) ,\]

where \(w\) denotes the parameters of the model \(f\).

Here are some common margin-based loss functions for binary classification. Recall that margin is the score for the correct class: \(yf_w(x)\), and a large margin means that the prediction is more correct.

%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
import numpy as np

display.set_matplotlib_formats('svg')

x = np.arange(-2, 3, 0.01)

plt.plot(x, [1 if _x <= 0 else 0 for _x in x], label="0-1 loss")
plt.plot(x, np.maximum(0, 1-x), label="hinge loss");
plt.plot(x, np.log(1 + np.exp(-x)), label="logistic loss")
plt.legend()
plt.ylabel('$L(x, y, f_w(x))$')
plt.xlabel('$yf_w(x)$')
Text(0.5, 0, '$yf_w(x)$')
../_images/output_basic_ml_50252b_1_1.svg

The loss function tells us how good the prediciton for a specific input \(x\) is. Note that the input is a random variable following some (unknown) data distribution, thus the loss is also a random variable. To measure the performance of the model on the distribution, we estimate the expected loss (risk) by averaging over the training examples. This gives us the learning objective:

(2.2.2)\[\min \frac{1}{n}\sum_{i=1}^n L(x^{(i)}, y^{(i)}, f_w) .\]

2.2.2. Gradient descent

Now that we have a learning objective, the rest is an optimization problem. Stochastic gradient descent (SGD) is the most widely used optimization algorithm in ML. Below we briefly review the main ideas in SGD.

Gradient is the direction where the objective increases fastest. Therefore, to minimize the objective, we want to move in the opposite direction of the gradient. Given an initial weight vector \(w_0\), we iteratively update it by

(2.2.3)\[w \leftarrow w - \eta \nabla_w F(w) ,\]

where \(\eta\) is the step size and \(F\) is our minimization objective. The algorithm is best visualized in 1D. GD is guaranteed to find the global minimum if \(F\) is convex. However, in practice it works well for non-convex functions such as neural networks.

Unfortunately, GD can be quite slow for ML problems. Note that computing \(\nabla_w F(w)\) requires going through the whole training set, which may easily contain millions of examples. What is we just compute gradient on a single example? Intuitively, the single-example is noisy but should still provide us some useful information. After all, we want to reduce the loss on all examples. This is exactly the idea behind SGD. The algorithm is almost the same as GD except that in each iteration the gradient is estimated on a single example (or a minibatch of examples). Compared to GD, SGD allows more updates to the model within the same amount of time. Since each step we are following a noisy direction, it is important to choose the step size carefully. See illustration of the effect of different step sizes.

2.3. Summary

We have provided a quick review of the basic recipe for building ML systems: design a model for the task, specify a loss function, minimize the average loss, and run inference given the learned model. In the rest of this course, we will see how to extend this framework to various structured prediction problems in NLP.