## Hugging Face Datasets

For this tutorial, you will need to install the following libraries: `datasets` and `zstandard`. This notebook was adapted from the [official tutorial](https://huggingface.co/docs/datasets/tutorial) from Hugging Face.

In [1]:
import os
import sys
from pprint import pprint
import psutil
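# Note: `list_datasets` is deprecated in newer `datasets` releases in favor of `huggingface_hub.list_datasets`
from datasets import list_datasets, load_dataset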

First, let's check all the available datasets in this library.

In [2]:
datasets = list_datasets()

print(f"Currently {len(datasets)} datasets are available on the hub:")
pprint(datasets[:100] + [f"{len(datasets) - 100} more..."], compact=True)

Currently 22078 datasets are available on the hub:
['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc',
 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue',
 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'alt', 'amazon_polarity',
 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'americas_nli', 'ami',
 'amttl', 'anli', 'app_reviews', 'aqua_rat', 'aquamuse', 'ar_cov19',
 'ar_res_reviews', 'ar_sarcasm', 'arabic_billion_words', 'arabic_pos_dialect',
 'arabic_speech_corpus', 'arcd', 'arsentd_lev', 'art', 'arxiv_dataset',
 'ascent_kb', 'aslg_pc12', 'asnq', 'asset', 'assin', 'assin2', 'atomic',
 'autshumato', 'babi_qa', 'banking77', 'bbaw_egyptian', 'bbc_hindi_nli',
 'bc2gm_corpus', 'beans', 'best2009', 'bianet', 'bible_para', 'big_patent',
 'billsum', 'bing_coronavirus_query_set', 'biomrc', 'biosses', 'blbooks',
 'blbooksgenre', 'blended_skill_talk', 'blimp', 'blog_authorship_corpus',
 'bn_hate_speech', 'bnl_newspapers', 'bookcorpus', 'bookcorpusopen'

In [3]:
# Check metadata and attributes of a particular dataset

squad_dataset = list(list_datasets(with_details=True))[datasets.index('squad')]

pprint(squad_dataset.__dict__)

{'_id': '621ffdd236468d709f181f95',
 'author': None,
 'cardData': {'annotations_creators': ['crowdsourced'],
              'dataset_info': {'config_name': 'plain_text',
                               'dataset_size': 89789763,
                               'download_size': 35142551,
                               'features': [{'dtype': 'string', 'name': 'id'},
                                            {'dtype': 'string',
                                             'name': 'title'},
                                            {'dtype': 'string',
                                             'name': 'context'},
                                            {'dtype': 'string',
                                             'name': 'question'},
                                            {'name': 'answers',
                                             'sequence': [{'dtype': 'string',
                                                           'name': 'text'},
                                 



Next, let's try loading a dataset and seeing a few examples from the data.

The main function we are going to use is `load_dataset`, which downloads the dataset, processes it, and caches it in a structured Arrow table. Arrow tables are arbitrarily long, typed tables whose types map to NumPy, pandas, and standard Python types, and they can store nested objects. They can be accessed directly from disk, loaded into RAM, or even streamed over the web.

In [4]:
# Load the dataset

dataset = load_dataset('squad', split='validation[:10%]')

Found cached dataset squad (/Users/nitishjoshi/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


In [5]:
# Check what the returned dataset looks like

print(dataset)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 1057
})


In [6]:
# Check the size of the dataset and see what a particular datapoint looks like

print(f"👉 Dataset len(dataset): {len(dataset)}")
print("\n👉 First item 'dataset[0]':")
pprint(dataset[0])

👉 Dataset len(dataset): 1057

👉 First item 'dataset[0]':
{'answers': {'answer_start': [177, 177, 177],
             'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the '
            'champion of the National Football League (NFL) for the 2015 '
            'season. The American Football Conference (AFC) champion Denver '
            'Broncos defeated the National Football Conference (NFC) champion '
            'Carolina Panthers 24–10 to earn their third Super Bowl title. The '
            "game was played on February 7, 2016, at Levi's Stadium in the San "
            'Francisco Bay Area at Santa Clara, California. As this was the '
            '50th Super Bowl, the league emphasized the "golden anniversary" '
            'with various gold-themed initiatives, as well as temporarily '
            'suspending the tradition of naming each Super Bowl game with '
            'Roman numerals (un

In [7]:
# You can also get a full column of the dataset by indexing with its name as a string:

print(dataset['question'][:10])


['Which NFL team represented the AFC at Super Bowl 50?', 'Which NFL team represented the NFC at Super Bowl 50?', 'Where did Super Bowl 50 take place?', 'Which NFL team won Super Bowl 50?', 'What color was used to emphasize the 50th anniversary of the Super Bowl?', 'What was the theme of Super Bowl 50?', 'What day was the game played on?', 'What is the AFC short for?', 'What was the theme of Super Bowl 50?', 'What does AFC stand for?']


In [8]:
# You can also directly access the column names and their types as well as size

print("Column names:")
pprint(dataset.column_names)
print("Features:")
pprint(dataset.features)

print("The number of rows", dataset.num_rows, "also available as len(dataset)", len(dataset))
print("The number of columns", dataset.num_columns)
print("The shape (rows, columns)", dataset.shape)

Column names:
['id', 'title', 'context', 'question', 'answers']
Features:
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}
The number of rows 1057 also available as len(dataset) 1057
The number of columns 5
The shape (rows, columns) (1057, 5)


**Advantage of the library:** since the dataset is backed by Apache Arrow tables, we can load datasets far larger than the available RAM. The data is memory-mapped from disk and read on demand with fast I/O, so the dataset itself takes up almost no RAM.

In [9]:
# Check RAM usage right now

print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 167.78 MB


In [10]:
# Load a large dataset (may take a few minutes)

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")

Using custom data configuration default-6e3092816c4f845b
Found cached dataset json (/Users/nitishjoshi/.cache/huggingface/datasets/json/default-6e3092816c4f845b/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)


In [11]:
# Check RAM usage now and the size of the dataset

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

print(f"Dataset size in bytes : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")


RAM used: 600.97 MB
Dataset size in bytes : 20978892555
Dataset size (cache file) : 19.54 GB
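
The small RAM footprint above comes from memory mapping: the operating system pages in only the parts of a file that are actually read. The sketch below illustrates the idea with Python's standard `mmap` module on a made-up binary file of fixed-width integers. It is not how `datasets` is implemented internally (the library memory-maps Arrow files); the file layout and the `read_record` helper are assumptions chosen for illustration.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # one little-endian 64-bit integer per record


def read_record(path, index):
    """Return record `index` by memory-mapping the file instead of reading it all."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the page(s) containing this offset need to be brought into memory
            return RECORD.unpack_from(mm, index * RECORD.size)[0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")

    # Write 100,000 records; record i holds the value i * i
    with open(path, "wb") as f:
        for i in range(100_000):
            f.write(RECORD.pack(i * i))

    value = read_record(path, 12_345)
    print(value)  # 152399025
```

Random access by offset is what makes indexing such as `dataset[0]` cheap even when the cache file is many gigabytes.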


#### Data Preprocessing

The library also provides a `map()` function for preprocessing your dataset, for example removing stop words, handling punctuation, or tokenizing sentences.

In [12]:
# Let's add a prefix 'My cute title: ' to each of our titles

def add_prefix_to_title(example):
    example['title'] = 'My cute title: ' + example['title']
    return example

prefixed_dataset = dataset.map(add_prefix_to_title)

print(prefixed_dataset.unique('title'))  # `.unique()` is a fast way to get the unique elements in a column (see the docs for all the methods)

Loading cached processed dataset at /Users/nitishjoshi/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-b106b7610a1ce672.arrow


['My cute title: Super_Bowl_50', 'My cute title: Warsaw']


In [13]:
# You can also use the indices during map

# This will add the index in the dataset to the 'question' field
with_indices_dataset = dataset.map(lambda example, idx: {'question': f'{idx}: ' + example['question']},
                                   with_indices=True)

pprint(with_indices_dataset['question'][:5])




['0: Which NFL team represented the AFC at Super Bowl 50?',
 '1: Which NFL team represented the NFC at Super Bowl 50?',
 '2: Where did Super Bowl 50 take place?',
 '3: Which NFL team won Super Bowl 50?',
 '4: What color was used to emphasize the 50th anniversary of the Super Bowl?']
