Ever thought about how much easier things have become with technology? Well, the world of text classification is no exception. With tools like DistilBERT, we can dive into text analysis without getting bogged down in the nitty-gritty. It’s like having a super-smart assistant that’s done most of the heavy lifting for you. You tweak it here and there, and voilà—you’re getting precise, insightful results without breaking a sweat.
Bidirectional Encoder Representations from Transformers (BERT), the model that started it all, has set a new standard in natural language processing. As a foundational model, it captures the context in which words appear rather than treating each word in isolation, making it incredibly powerful. DistilBERT, a more efficient version of BERT, retains most of its capabilities while being faster and lighter. By fine-tuning DistilBERT on your specific dataset, you can tailor it to your needs, making text classification both accessible and highly effective for a wide range of applications.
You will have noticed the storm that has hit the internet over the past few weeks: a flood of academic papers and new, sleeker, faster, more performant models released out into the open. One useful application amid this velocity of information is the ability to classify research papers based on their abstracts. It is a measured step, one where we do not abuse the superpower we have with GPUs and yet get value from the amazing models at our disposal.
Let us start customizing DistilBERT for a seriously cool task: classifying scientific research abstracts. Imagine a tool that can instantly tag academic papers by their abstracts, making it a breeze for ML researchers to pinpoint exactly what they need. No more endless doomscrolling through countless PDFs! We’ll be harnessing the power of Hugging Face’s transformers and datasets libraries. Installation is a piece of cake:
pip install transformers datasets
You’ve got a CSV file called abstracts.csv. Inside, you’ll find two key columns: abstract (containing the research abstract text itself) and category (specifying the ML field – think “GAN,” “CNN,” or “NER”). Now, let’s get cracking on the code and see how this whole thing works its magic. We’ll start by loading our dataset and then get into the specifics of fine-tuning DistilBERT for this particular task.
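To picture the layout, here is how a hypothetical abstracts.csv might begin. The rows are invented purely for illustration, and note that the second column is headed label rather than category – a naming choice a tip later in this dispatch explains –

abstract,label
"We propose a generative adversarial network for synthesising realistic images ...",GAN
"A convolutional architecture for large-scale image classification ...",CNN
"A sequence-labelling model for extracting named entities from biomedical text ...",NER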
Loading a dataset
Hugging Face ships a library called datasets, whose central object is the Dataset. It is responsible for loading data from remote or local sources. It is different from a pandas DataFrame and has extra features that make such data easier to use with Hugging Face’s transformers library. We would load data like this –
from datasets import load_dataset
sample = load_dataset("imdb")
load_dataset will automatically connect over the internet to the Hugging Face Hub and fetch the dataset named imdb. That is the default behaviour when you do not supply additional parameters, reasonably so. In our case we have to load a CSV that holds the abstracts data, so we pass a few extra arguments –
abstracts = load_dataset('csv', data_files={'train': path_to_abstract_csv})
Notice we have used train. We could expand and include test and validate too. It will look like this –
abstracts = load_dataset('csv', data_files={'train': path_to_abstract_csv_train, 'test': '..', 'validate': '..'})
The easiest way, though, is to load a single file and then split it into train, test and validate yourself. It goes like this –
abstracts = load_dataset('csv', data_files=path_to_abstract_csv)
Then one would follow it with the splitting –
train_test_abstracts = abstracts['train'].train_test_split(test_size=0.1, seed=42)
train_abstracts = train_test_abstracts['train']
test_abstracts = train_test_abstracts['test']
For the validation dataset we will carve some records out of the training split, apportioning a share of the training data to perform the validation.
train_val_abstracts = train_abstracts.train_test_split(test_size=0.25, seed=9)
train_abstracts = train_val_abstracts['train']
val_abstracts = train_val_abstracts['test']
Well, this is now golden. We have the dataset up in memory and can inspect it by index, like –
train_abstracts[0]
All of Python’s usual indexing and slicing syntax can be applied to the dataset.
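A quick sketch of what that looks like in practice – the structures printed, rather than the contents, are the point here –

print(train_abstracts[0])     # a single row comes back as a dict: {'abstract': ..., 'label': ...}
print(train_abstracts[:3])    # a slice comes back as a dict of columns, each holding a list of values
print(len(train_abstracts))   # number of rows in this split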
Now, what is wrong with the DataFrame, then? If you are asking this, we have a special focus dispatch on exactly that in the coming months.
Safeguarding against the notorious None
Often, the weird errors that pop up during training stem from poor data in the dataset. One of the notorious culprits is the None object. If the CSV has an empty category or an empty abstract, that cell will arrive in the Dataset as None. It is therefore in your best interest to treat it before you even inspect the data.
We will do that with the map function, a feature reminiscent of mapping over Python lists, from which Dataset takes its inspiration.
def replace_none(row):
    # map passes each row in as a dict; return the cleaned row so the change sticks
    row['abstract'] = 'MISSING' if row['abstract'] is None else row['abstract']
    row['label'] = 'MISSING' if row['label'] is None else row['label']
    return row

train_abstracts = train_abstracts.map(replace_none)
One more suggestion that will save you a lot of time: keep the second column named ‘label’. The downstream steps we perform expect it that way. Renaming it by brute force after loading into a Dataset, or reaching for some other hack, will only ensure some of your precious hours are spent troubleshooting obscure error messages during training or inference.
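A quick, minimal sanity check – assuming the columns described earlier – confirms the split carries the names the downstream steps expect –

# The Dataset object exposes its column names directly
print(train_abstracts.column_names)   # expecting something like ['abstract', 'label']
assert 'label' in train_abstracts.column_names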
Tokenizer
Imagine DistilBERT as a skilled artisan, ready to craft intricate text-based creations. But before the artisan can work their magic, they need the raw materials prepared – words transformed into manageable pieces. This is where the tokenizer steps in, acting as the artisan’s meticulous assistant. It’s not just about splitting sentences into words; it’s a delicate process of understanding the nuances of language. Think of it like a chef carefully chopping vegetables – some ingredients need to be diced finely, others left in larger chunks to preserve their flavor. The tokenizer does something similar with the abstract we supplied. It breaks down the input into “tokens,” which can be words, subwords, or even individual characters, depending on the tokenizer’s strategy.
DistilBERT, like many transformer models, often employs a subword tokenization approach. This is a clever technique that balances vocabulary size with the ability to handle rare or unseen words. Instead of treating every word as a separate entity, it breaks down frequent words into smaller, more common subword units. For example, the word “unbreakable” might be tokenized into “un,” “break,” and “able.” This approach allows the model to learn representations for these subword units, which can then be combined to understand the meaning of the whole word, even if it hasn’t seen it before. It’s like building with LEGO bricks – individual bricks can be combined in countless ways to create complex structures.
The tokenizer doesn’t just split words; it also adds special tokens that carry crucial contextual information. Think of these as the artisan’s specialized tools. There’s the [CLS] token, which acts like a label for the entire input sequence, and the [SEP] token, which separates different sentences or parts of the text. These special tokens help DistilBERT understand the structure and boundaries of the input, enabling it to perform tasks like sentence classification or question answering effectively. Furthermore, the tokenizer maps each token to a unique numerical ID, a process called “encoding.” This numerical representation is what DistilBERT actually understands. It’s like giving the artisan a specific code for each type of raw material, allowing them to work efficiently and precisely. The tokenizer, in essence, translates human-readable text into a machine-understandable format, paving the way for DistilBERT to unleash its text-processing prowess. It’s a crucial first step in the customization process, ensuring that DistilBERT receives the right ingredients, prepared in the right way, to create meaningful and impactful results.
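To make those ideas concrete ahead of the next dispatch, here is a minimal sketch using the tokenizer that pairs with the distilbert-base-uncased checkpoint. The exact subword pieces you see depend on the pretrained vocabulary, so treat the comments as illustrative rather than exact –

from transformers import AutoTokenizer

# Load the tokenizer that matches the DistilBERT checkpoint we intend to fine-tune
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Subword tokenization: a rare word is split into smaller, more common pieces
print(tokenizer.tokenize('unbreakable'))   # e.g. ['un', '##break', '##able']

# Encoding adds the special tokens and maps every token to a numerical ID
encoded = tokenizer('Transformers changed natural language processing.')
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))   # note the [CLS] at the start and [SEP] at the end
print(encoded['input_ids'])                                    # the IDs DistilBERT actually consumes

The ‘##’ prefix simply marks a piece that continues the previous token, which is how a WordPiece vocabulary signals subword continuations.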
In our next dispatch we will initialise a tokenizer to treat the abstracts in our dataset. That will be a crucial step in preparing for what comes after. Until then, happy discovering new models, and wish you a normalised day.
#DistilBERT #Transformers #HuggingFace #Python #ML