## Hugging Face Transformers

This notebook draws on several resources, including the [Hugging Face documentation](https://huggingface.co/docs/transformers/index) and the library's official [tutorials](https://huggingface.co/docs/transformers/notebooks).

### Tokenizer

A tokenizer preprocesses text into an array of numbers that serves as input to a model. Several rules govern tokenization, including how words are split and at what level (word, subword, or character). Most importantly, instantiate the tokenizer with the same model name as the model itself, so that you use the same tokenization rules the model was pretrained with.
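
To make subword splitting concrete, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece, the scheme BERT tokenizers use. The vocabulary `TOY_VOCAB`, the helpers `wordpiece` and `encode`, and the lowercasing step are all assumptions made for illustration. The real `bert-base-cased` tokenizer has a far larger vocabulary, keeps case, and handles punctuation that this sketch ignores.

```python
# Toy vocabulary (hypothetical): real BERT vocabularies have tens of thousands of entries.
TOY_VOCAB = {
    "[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
    "let": 3, "us": 4, "token": 5, "##ize": 6, "a": 7, "sentence": 8,
}


def wordpiece(word, vocab):
    """Split one word into subword pieces, always taking the longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no prefix matched: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Whitespace-split, subword-split, add special tokens, map to ids."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("Let us tokenize a sentence", TOY_VOCAB)
print(tokens)
print(ids)
```

Note how `tokenize` is not in the toy vocabulary as a whole word, so it is split into `token` and `##ize`.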


In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("Let's learn how to tokenize a sentence using transformers library.")
print(encoding)


{'input_ids': [101, 2421, 112, 188, 3858, 1293, 1106, 22559, 3708, 170, 5650, 1606, 11303, 1468, 3340, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Here `input_ids` are the vocabulary indices of the tokens; `token_type_ids` indicate which sequence each token belongs to when the input contains more than one (BERT uses this for sentence pairs); and `attention_mask` marks whether the model should attend to a token (for example, you don't want the model to attend to padding tokens).
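
As a rough illustration of `token_type_ids`, the sketch below assembles the three fields by hand for a pair of sequences, assuming BERT's `[CLS] A [SEP] B [SEP]` layout. The helper `build_pair_inputs` is hypothetical, the ids 101 and 102 are the `[CLS]` and `[SEP]` ids visible in the output above, and the word ids are arbitrary placeholders.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids seen in the bert-base-cased output


def build_pair_inputs(ids_a, ids_b):
    """Hand-build model inputs for a sentence pair (illustrative only)."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A and its [SEP]; segment 1 covers B and its [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position should be attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = build_pair_inputs([2421, 3858], [1293, 1106, 22559])  # placeholder ids
for name, values in pair.items():
    print(name, values)
```

With the real tokenizer, passing two texts in one call, e.g. `tokenizer(text_a, text_b)`, should produce this layout for you.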

You can decode the ids back into text:

In [2]:
tokenizer.decode(encoding["input_ids"])

"[CLS] Let's learn how to tokenize a sentence using transformers library. [SEP]"

Here `[CLS]` is a special token added at the start of the input, and `[SEP]` is a special separator token. When we finetune a model like BERT for classification, we often use the representation of `[CLS]` as the representation of the whole sentence.

You can also process sentences in a batch:

In [3]:
batch_sentences = [
    "I was literally starving by that time since I had not eating anything that morning.",
    "It was such a good day.",
    "How many apples should I buy Mark?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 146, 1108, 6290, 20285, 1118, 1115, 1159, 1290, 146, 1125, 1136, 5497, 1625, 1115, 2106, 119, 102], [101, 1135, 1108, 1216, 170, 1363, 1285, 119, 102], [101, 1731, 1242, 22888, 1431, 146, 4417, 2392, 136, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


#### Padding

As the output above shows, the sentences have different lengths, so the returned `input_ids`, `attention_mask`, etc. have different lengths too. We can pad the shorter sentences to match the longest one, which lets you pass the returned values to a model as a single batch:

In [4]:
batch_sentences = [
    "I was literally starving by that time since I had not eating anything that morning.",
    "It was such a good day.",
    "How many apples should I buy Mark?",
]
encoded_inputs = tokenizer(batch_sentences, padding=True)
print(encoded_inputs)

{'input_ids': [[101, 146, 1108, 6290, 20285, 1118, 1115, 1159, 1290, 146, 1125, 1136, 5497, 1625, 1115, 2106, 119, 102], [101, 1135, 1108, 1216, 170, 1363, 1285, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1731, 1242, 22888, 1431, 146, 4417, 2392, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
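
What `padding=True` does to `input_ids` and `attention_mask` can be sketched in a few lines of plain Python. This is only an illustration, not the library's code: `pad_batch` is a hypothetical helper, and the pad id of 0 is taken from the zeros in the output above.

```python
PAD_ID = 0  # the padded positions in the output above hold id 0


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The second and third sentences from the batch above, as token ids.
batch = [
    [101, 1135, 1108, 1216, 170, 1363, 1285, 119, 102],
    [101, 1731, 1242, 22888, 1431, 146, 4417, 2392, 136, 102],
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```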


#### Truncation

Sometimes a sentence is too long for a model to handle. In that case you can truncate it: with `truncation=True`, the sentence is cut to the maximum length the model accepts (512 tokens for BERT).

Additionally, so that the returned values can be passed directly to a model, you can set `return_tensors="pt"` to get PyTorch tensors:

In [5]:
batch_sentences = [
    "I was literally starving by that time since I had not eating anything that morning.",
    "It was such a good day.",
    "How many apples should I buy Mark?",
]
encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_inputs)

{'input_ids': tensor([[  101,   146,  1108,  6290, 20285,  1118,  1115,  1159,  1290,   146,
          1125,  1136,  5497,  1625,  1115,  2106,   119,   102],
        [  101,  1135,  1108,  1216,   170,  1363,  1285,   119,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1731,  1242, 22888,  1431,   146,  4417,  2392,   136,   102,
             0,     0,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
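
None of the sentences above is long enough to be truncated, so the effect of `truncation=True` is not visible in this output. The sketch below shows the idea on the first sentence with an artificially small limit. The helper `truncate` is hypothetical and assumes a single sequence that already carries `[CLS]` (101) and `[SEP]` (102); the real tokenizer truncates before adding special tokens, which should give the same result in this simple case.

```python
SEP_ID = 102  # [SEP] id from the outputs above


def truncate(input_ids, max_length, sep_id=SEP_ID):
    """Cut a single encoded sequence to max_length, keeping the final [SEP]."""
    if len(input_ids) <= max_length:
        return list(input_ids)
    # Keep [CLS] plus the first tokens, then re-attach [SEP] as the last id.
    return list(input_ids[: max_length - 1]) + [sep_id]


long_ids = [101, 146, 1108, 6290, 20285, 1118, 1115, 1159, 1290, 146,
            1125, 1136, 5497, 1625, 1115, 2106, 119, 102]
short_ids = [101, 1135, 1108, 1216, 170, 1363, 1285, 119, 102]

print(truncate(long_ids, max_length=8))
print(truncate(short_ids, max_length=16))  # already short enough: unchanged
```

With the real tokenizer, a limit below the model maximum can be requested by passing `max_length` together with `truncation=True`.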


### Auto Classes

The library provides Auto Classes, which let us load tokenizers, configurations, and models by name instead of importing a separate class for each tokenizer and model. For models, there are different Auto Classes depending on the task the model is used for.

In [6]:
from transformers import AutoConfig

# Load model configurations -- this will download from the hub and cache it
config = AutoConfig.from_pretrained("bert-base-uncased")

# Check what the config looks like
print(config)


BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
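
The idea behind an Auto Class can be sketched as a lookup from the `model_type` field of the config (here `"bert"`) to a concrete class. The code below is a toy illustration of that dispatch pattern only. Every name in it (`ToyBertModel`, `ToyGPT2Model`, `MODEL_REGISTRY`, `ToyAutoModel`) is hypothetical, and it is not how the library is actually implemented.

```python
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config):
        self.config = config


# Hypothetical registry: model_type string -> concrete class
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


class ToyAutoModel:
    """Picks the concrete class from the config instead of asking the user to."""

    @classmethod
    def from_config(cls, config):
        model_type = config["model_type"]
        if model_type not in MODEL_REGISTRY:
            raise ValueError(f"Unknown model_type: {model_type!r}")
        return MODEL_REGISTRY[model_type](config)


model_a = ToyAutoModel.from_config({"model_type": "bert", "hidden_size": 768})
model_b = ToyAutoModel.from_config({"model_type": "gpt2", "n_embd": 768})
print(type(model_a).__name__, type(model_b).__name__)
```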



We've already looked at loading tokenizers with the `AutoTokenizer` class, so let's now focus on how to load a model using the Auto Classes.

In [7]:
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification

# Load a model which is a causal LM (model with a LM head on top)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Load a masked language model
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load a sequence-2-sequence model (i.e. encoder-decoder based)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Load a model with a classification head on top (you can finetune this on any classification task)
# e.g. BERT model with a classification head on top
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

del model



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predi

There are many more Auto Classes for specific tasks, such as `AutoModelForTokenClassification` and `AutoModelForQuestionAnswering`. You can check the [documentation](https://huggingface.co/docs/transformers/model_doc/auto#auto-classes) for more details.

### Finetuning a Model

Next, let's look at how to load a pretrained model and finetune it on a dataset, using the `datasets` library from last time. For now, we will define the training loop manually, similar to what we covered in the PyTorch tutorial and what you had in HW2.


In [9]:
from datasets import load_dataset
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm
import evaluate


# Load the dataset
dataset = load_dataset("yelp_review_full")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Define a function to use in map (preprocessing)
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Prepare for training
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# For the purpose of this notebook we will use smaller datasets
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# Let's first look at what an example in our data looks like
print(dataset["train"][0])





{'label': 4, 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}


Next, using the dataset we will define the training loop for BERT. Once the model is trained for 3 epochs, we will evaluate it on the test data and compute the accuracy.

In [10]:
# Dataloader (this will give us an iterator with specified batch size)
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

# Define model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

# Define optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# Check if there is GPU and move model to device accordingly
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Main training loop
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        
# Evaluate the trained model
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

print("Accuracy: ", metric.compute())

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

  0%|          | 0/375 [00:00<?, ?it/s]

Accuracy:  {'accuracy': 0.6}
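
To see what the `linear` scheduler does to the learning rate in the loop above, here is a small sketch of the schedule as we understand it: a linear warmup from 0 (skipped here, since `num_warmup_steps=0`) followed by a linear decay to 0 at the last step. The function `linear_lr` is a hypothetical stand-in written for illustration, not the library's scheduler object.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step under linear warmup + linear decay."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


# Same numbers as the training loop: 1000 examples, batch size 8, 3 epochs.
steps_per_epoch = 1000 // 8
num_training_steps = 3 * steps_per_epoch  # 375, matching the progress bar

for step in (0, 125, 250, 375):
    print(step, linear_lr(step, base_lr=5e-5, num_training_steps=num_training_steps))
```

The learning rate starts at `5e-5` and reaches 0 exactly at step 375, which is why `lr_scheduler.step()` is called once per batch rather than once per epoch.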


#### Trainer

The library provides a `Trainer` class for training (or finetuning) models, so you don't have to write your own training loop. It supports a wide range of features, such as gradient accumulation and configurable evaluation strategies (e.g. evaluating every epoch).

In [11]:
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np

# Define model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

# Contains all the hyperparameters and arguments for training
# Note: recent transformers releases rename `evaluation_strategy` to `eval_strategy`
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

# Define the function to use for evaluation
# All models return logits which we need to process to compute our metric (accuracy in this case)
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.079788        | 0.513    |
| 2     | No log        | 1.035249        | 0.574    |
| 3     | No log        | 1.062526        | 0.574    |


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=0.949586669921875, metrics={'train_runtime': 151.7727, 'train_samples_per_second': 19.766, 'train_steps_per_second': 2.471, 'total_flos': 789354427392000.0, 'train_loss': 0.949586669921875, 'epoch': 3.0})

In the above code, we used the default values for the learning rate, optimizer etc. Make sure to check what these default values are in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) and change them as required.
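
The `compute_metrics` function above boils down to an argmax over each row of logits followed by a comparison with the labels. Here is a dependency-free sketch of that computation; the helpers `argmax` and `accuracy` are hypothetical, and the logits and labels are made up purely for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for 4 examples over 5 classes (e.g. 1-5 star reviews)
toy_logits = [
    [0.1, 2.0, -1.0, 0.3, 0.0],
    [1.5, 0.2, 0.1, 0.0, -0.5],
    [0.0, 0.1, 0.2, 0.3, 3.0],
    [0.9, 0.8, 0.7, 0.6, 0.5],
]
toy_labels = [1, 0, 4, 2]

result = accuracy(toy_logits, toy_labels)
print(result)
```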

### Text Generation

The above example covered how to finetune a model for text classification. In this part, we will focus on text generation. The training part is very similar to the above, so we will skip that and instead focus on how to do text generation with different decoding strategies.

In [13]:

# We'll try out using GPT2 and load its corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').input_ids

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))



Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /home/nhj4247/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
   

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll
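
Greedy decoding simply picks the most likely next token at every step, which helps explain the repetition in the output above. The sketch below shows this on a toy "language model" that is just a table of next-token probabilities. `TOY_LM` and `greedy_decode` are hypothetical and have nothing to do with GPT-2's actual probabilities.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
TOY_LM = {
    "<s>": {"I": 0.6, "we": 0.4},
    "I": {"walk": 0.6, "rest": 0.4},
    "we": {"walk": 0.9, "rest": 0.1},
    "walk": {"and": 0.6, "</s>": 0.4},
    "rest": {"</s>": 1.0},
    "and": {"I": 0.7, "we": 0.3},
}


def greedy_decode(lm, start="<s>", eos="</s>", max_length=8):
    """Always append the single most likely next token."""
    tokens = [start]
    while len(tokens) < max_length:
        dist = lm[tokens[-1]]
        next_token = max(dist, key=dist.get)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens


greedy_tokens = greedy_decode(TOY_LM)
print(" ".join(greedy_tokens))
```

Because the locally best choice after `and` leads straight back to `I`, the toy model loops until `max_length` cuts it off, much like the repeated sentence in the GPT-2 output.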


We can also use other decoding methods such as beam search decoding. For example:

In [14]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,  
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll
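
Beam search keeps the `num_beams` most likely partial sequences at every step instead of only one, so it can find a sequence whose first token is not the locally best choice. The sketch below is a bare-bones version on a toy probability table. `TOY_LM` and `beam_search` are hypothetical, and refinements used in practice, such as length normalization, are left out.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_LM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.4, "cat": 0.3, "bird": 0.3},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def beam_search(lm, num_beams=2, start="<s>", eos="</s>", max_length=5):
    """Return the best (tokens, log_prob) found while keeping num_beams hypotheses."""
    beams = [([start], 0.0)]
    finished = []
    while beams:
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, prob in lm[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the best num_beams; set aside those that ended or hit the limit.
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == eos or len(tokens) >= max_length:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
    return max(finished, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(TOY_LM, num_beams=1)  # 1 beam = greedy
beam_tokens, beam_score = beam_search(TOY_LM, num_beams=2)
print(" ".join(greedy_tokens), math.exp(greedy_score))
print(" ".join(beam_tokens), math.exp(beam_score))
```

With one beam the search commits to `a` and ends up with a less likely sequence; with two beams it keeps `the` alive long enough to discover that `the dog` is more probable overall.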


You can also use sampling methods such as top-k or top-p sampling, and use the `num_return_sequences` parameter to specify how many sequences generation should return:

In [15]:

# set top_k to 50
sample_outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50,
    num_return_sequences=2
)

print("Output:\n" + 100 * '-')
for i, output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(output, skip_special_tokens=True)))


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but there are too many options for us to deal with! I can walk on foot, or I can take a bike for long walks. I don't have space in the house, so I sometimes just work
1: I enjoy walking with my cute dog, though. We're usually not alone on the trail, and usually have a good time.

After our trip, I headed outside of my apartment, just so I could keep my sanity. When the sun
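
Top-k and top-p sampling both restrict the next-token distribution before sampling from it: top-k keeps the `k` most likely tokens, while top-p keeps the smallest set of tokens whose cumulative probability reaches `p`. The sketch below applies both filters to a single made-up distribution. The helpers `top_k_filter`, `top_p_filter`, and `sample` are hypothetical and only illustrate the idea.

```python
import random


def top_k_filter(dist, k):
    """Keep the k most likely tokens and renormalize."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {token: prob / total for token, prob in top}


def top_p_filter(dist, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample(dist, rng):
    """Draw one token according to the (filtered) probabilities."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token distribution
next_token_probs = {"dog": 0.5, "cat": 0.3, "bird": 0.15, "fish": 0.05}

top_k_dist = top_k_filter(next_token_probs, k=2)
top_p_dist = top_p_filter(next_token_probs, p=0.9)
print(top_k_dist)
print(top_p_dist)

rng = random.Random(0)  # seeded so the draws are reproducible
samples = [sample(top_k_dist, rng) for _ in range(5)]
print(samples)
```

In `generate`, the corresponding knobs are `top_k` and `top_p`, used together with `do_sample=True` as in the cell above.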
