
Lesson 3: Best Transformers and BERT Tutorial with Deep Learning and NLP



Welcome to our blog! 🎉 Today, we’re delving into Lesson 3: Exploring the Top Transformers and BERT Tutorial for Deep Learning and NLP.


Transformers and BERT are truly the superheroes of the NLP realm, changing the game in how we comprehend and analyze language.

Throughout this blog post, we’ll serve as your trusty guides as we delve into these incredible tools, revealing their hidden gems and maximizing their capabilities.


Whether you’re a seasoned NLP aficionado or just starting out, this tutorial is jam-packed with valuable insights, tricks, and practical examples to enhance your expertise and utilize the power of Transformers and BERT.

So, grab your superhero cape (or keyboard) and come along on this thrilling journey through the realm of deep learning and natural language processing! It’s an adventure you definitely won’t want to miss. Let’s get started! 🚀✨

Course Overview

Let’s start with the fundamentals of RNNs and then gradually move on to the most recent deep learning architectures designed for tackling NLP challenges. Here’s what we’ll go through:

  • Basic RNNs
  • Word Embeddings: Explanation and How to Acquire Them
  • LSTMs
  • GRUs
  • Bi-Directional RNNs
  • Encoder-Decoder Models (Seq2Seq Models)
  • Attention Models
  • Transformers – “Attention Is All You Need”
  • BERT

This detailed guide aims to provide you with a solid understanding of these methods. By the end, you’ll be well-versed in all these concepts if you stick with it.


Please remember that the purpose of this notebook is not to achieve a high LB (Leaderboard) score but to act as a beginner-friendly manual for understanding deep learning techniques applied in NLP. Furthermore, after covering these topics, I’ll introduce a basic solution for this competition.

So, get ready and let’s embark on this educational journey together! 🚀📚


Let’s Start Coding

This kernel took more than 10 days of work. If you find it useful and appreciate the effort, please upvote it; that motivates me to write more quality content.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU,SimpleRNN
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
Using TensorFlow backend.

Configuring TPUs

We will use TPUs for this version of the notebook, since we need to build a BERT model.

# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    # Connect to the TPU cluster and initialize it before building the strategy
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)
Running on TPU  grpc://
train = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')

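To make things simpler for training the models, we’ll focus on a small subset of the dataset: the first 12,001 rows (roughly 12,000 data points).

Additionally, we’ll treat this as a binary classification problem by dropping every label column except toxic.

# Keep only id, comment_text and toxic; drop the finer-grained label columns
train.drop(['severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1, inplace=True)
# .loc slicing is inclusive, so this keeps rows 0..12000 (12,001 rows)
train = train.loc[:12000,:]
train.shape
(12001, 3)

Let’s find the maximum number of words in a comment, which will help us choose a padding length later: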

train['comment_text'].apply(lambda x:len(str(x).split())).max()

Writing a function to compute the AUC score for validation

def roc_auc(predictions,target):
    '''
    This method returns the AUC Score when given the Predictions
    and Labels
    '''
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    roc_auc = metrics.auc(fpr, tpr)
    return roc_auc
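
To make the metric less of a black box, here is a minimal pure-Python sketch of what AUC measures: the fraction of (toxic, non-toxic) pairs in which the toxic comment gets the higher score. The function name pairwise_auc and the toy scores are illustrative assumptions, not part of the notebook; the notebook itself relies on sklearn's roc_curve and auc.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): two negatives, two positives
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(scores, labels))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why it is a more informative number than accuracy when most comments are non-toxic.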

Data Preparation

xtrain, xvalid, ytrain, yvalid = train_test_split(train.comment_text.values, train.toxic.values, 
                                                  test_size=0.2, shuffle=True)

Before We Begin

Before we dive into our tutorial, I want to make sure everyone’s on the same page. If you’re completely new to NLP and haven’t worked with text data before, don’t worry! I’ve got you covered with some excellent resources to get you started:

These kernels will serve as a fantastic starting point for your NLP journey.


Additionally, if you need a refresher on basic neural networks, check out these helpful resources:

Once you’re feeling comfortable, you can learn how to visualize text data and more with these insightful kernels:

These resources will provide a solid foundation for understanding NLP and neural networks. Let’s get ready to learn and grow together! 📚✨

Simple RNN


Basic Overview

What is an RNN?

Recurrent Neural Networks (RNNs) are a type of neural network that remember information from previous steps and use it to make decisions in the current step.

In regular neural networks, each input and output are treated separately, without any connection to previous or future steps. However, in tasks like predicting the next word in a sentence, we need to consider the words that came before.

RNNs help with this by using a hidden layer to remember information from previous steps, allowing them to better understand and predict sequences of data.
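
The idea of a hidden state that carries information forward can be sketched in a few lines of NumPy. This is a minimal illustration under assumed toy sizes and random weights (the names rnn_forward, W_xh, W_hh and b_h are hypothetical), not the Keras implementation used later in this notebook.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = np.zeros(W_hh.shape[0])        # initial hidden state: no memory yet
    states = []
    for x_t in inputs:                 # one time step per input vector
        # the new state mixes the current input with the previous state
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4   # toy sizes, chosen for illustration
inputs = rng.normal(size=(seq_len, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4): one hidden state per time step
```

Because the same weights are reused at every step, the loop works for a sequence of any length, which is what lets RNNs handle sentences of different sizes.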

Why RNNs?

RNNs excel in tasks where data order is crucial, such as predicting the next word in a sentence or forecasting future values in a time series. They possess a unique ability to retain information from previous steps, enhancing prediction accuracy by taking context into account.

RNNs stand out for their capability to accommodate sequences of varying lengths, making them adaptable for a wide range of applications. They are widely utilized in fields like natural language processing, speech recognition, and image captioning.

The reason RNNs are favored is due to their versatility, flexibility, and proficiency in comprehending data sequences!

For a deeper understanding of Recurrent Neural Networks (RNNs), check out these resources:

These resources offer valuable insights and explanations to help you grasp the concepts of RNNs more thoroughly.


Code Implementation

First I will implement the model, and then I will explain the code step by step.

# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 1500

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
with strategy.scope():
    # A simpleRNN without any pretrained embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        input_length=max_len))
    model.add(SimpleRNN(100))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
Model: "sequential_1"
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1500, 300)         13049100  
simple_rnn_1 (SimpleRNN)     (None, 100)               40100     
dense_1 (Dense)              (None, 1)                 101       
Total params: 13,089,301
Trainable params: 13,089,301
Non-trainable params: 0
CPU times: user 620 ms, sys: 370 ms, total: 990 ms
Wall time: 1.18 s
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync) # Multiplying by Strategy to run on TPU's

Epoch 1/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.3714 - accuracy: 0.8805
Epoch 2/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.2858 - accuracy: 0.9055
Epoch 3/5
9600/9600 [==============================] - 40s 4ms/step - loss: 0.2748 - accuracy: 0.8945
Epoch 4/5
9600/9600 [==============================] - 40s 4ms/step - loss: 0.2416 - accuracy: 0.9053
Epoch 5/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.2109 - accuracy: 0.9079
<keras.callbacks.callbacks.History at 0x7fae866d75c0>
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % (roc_auc(scores,yvalid)))
AUC: 0.69
scores_model = []
scores_model.append({'Model': 'SimpleRNN','AUC_Score': roc_auc(scores,yvalid)})

Code Explanation

After watching the videos and checking out the provided links, you’ve likely discovered that when working with RNNs, we feed sentences into the model word by word. Each word is transformed into a one-hot vector, with dimensions matching the total number of words in the vocabulary plus one.

Now, let’s talk about the Keras Tokenizer. It gathers all the distinct words from the text dataset and constructs a dictionary where each word is a key and its frequency is the value. This dictionary is then sorted by word frequency in descending order.

Subsequently, each word is assigned a unique numerical index. For instance, if ‘the’ is the most common word in the dataset, it will be given index 1. The vector representing ‘the’ will be a one-hot vector with a value of 1 at index 1 and zeros elsewhere.

By examining the first two elements of xtrain_seq, you’ll notice that each word is now denoted by a corresponding numerical index. This tokenization process is crucial for preparing the text data for input into the RNN model.
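
The frequency-ranked indexing and the padding step can be mimicked in plain Python. This is a simplified sketch, assuming whitespace splitting and lowercasing only; the helper names build_word_index, texts_to_sequences and pad_left are hypothetical stand-ins, and the real Keras Tokenizer also filters punctuation. The padding follows the default behaviour of pad_sequences, which pads and truncates at the front.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an index, most frequent word first, starting at 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

def pad_left(seq, maxlen, value=0):
    """Left-pad with zeros; long sequences keep their last maxlen tokens."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(texts)
seqs = texts_to_sequences(texts, word_index)

print(word_index["the"])      # 1, because 'the' is the most frequent word
print(pad_left(seqs[0], 5))   # [0, 0, 1, 3, 2]
print(pad_left(seqs[1], 5))   # [4, 2, 5, 1, 6]
```

Index 0 is never assigned to a word, so it is free to act as the padding value. That is also why the Embedding layer is created with len(word_index) + 1 rows.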


Now, you might be wondering: What is padding, and why is it done?

Here’s the answer:

Additionally, sometimes special tokens like EOS (end of string) and BOS (beginning of string) are used during tokenization. Here’s why:

The code token.word_index provides the vocabulary dictionary created by Keras.

Building the Neural Network:

To understand the dimensions of input and output given to an RNN in Keras, check out this helpful article:

Now, let’s break down the model:

The first line, model = Sequential(), tells Keras that we’ll be building our network as a linear stack of layers.

  • We start by adding an Embedding layer, which converts each word’s one-hot vector into a 300-dimensional vector, similar to Word2Vec.
  • Next, we add a SimpleRNN layer with 100 units, without any dropout or regularization.
  • Finally, we add a single neuron with a sigmoid function to predict the result from the output of the 100 SimpleRNN units.

After defining the model, we compile it using the Adam optimizer.
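
The Embedding layer described above is easiest to understand as a table lookup. The sketch below, with assumed toy sizes and a random matrix E, shows that multiplying a one-hot vector by the embedding matrix gives the same result as simply picking one row, which is what embedding layers do in practice.

```python
import numpy as np

vocab_size, embed_dim = 6, 4                    # toy sizes for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))    # one row per word index

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot @ E    # multiply the one-hot vector by the matrix
via_lookup = E[word_id]     # pick the row directly

print(np.allclose(via_matmul, via_lookup))  # True
```

In the notebook's model the matrix has 43,497 rows and 300 columns, which accounts for the 13,049,100 parameters reported for the embedding layer.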

Comments on the Model:

The model reaches a training accuracy of about 0.91, yet its validation AUC is only 0.69, which suggests it is overfitting the training data. This was the simplest model, and there are many hyperparameters we can tune to improve performance, such as adjusting the number of RNN units, using batch normalization, or adding dropout.

Still, getting an AUC of 0.69 from such a small model with minimal effort gives us a baseline to improve on.

Word Embeddings

When we were building our simple RNN models, we discussed the concept of word embeddings. So, what exactly are word embeddings and how do we obtain them?

Here are a couple of resources that can provide you with more information:

A common way to obtain word embeddings is to use pre-trained GloVe or fastText vectors. Without delving into too many technical details, I’ll explain how we can create sentence vectors and utilize them to build a machine learning model.

Personally, I’m a fan of GloVe vectors, word2vec, and fastText. In this notebook, I’ll be using the GloVe vectors. You can download the GloVe vectors from this link or search for GloVe in the datasets on Kaggle and add the file.
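
One simple way to turn word vectors into a sentence vector is to average them. The sketch below assumes that approach and uses a tiny hand-made dictionary (toy_embeddings, with 3 dimensions instead of GloVe's 300); the function name sentence_vector is hypothetical, and the models in this notebook feed word vectors into recurrent layers rather than averaging them.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real GloVe vectors have 300 dimensions
toy_embeddings = {
    "good": np.array([1.0, 0.0, 1.0]),
    "movie": np.array([0.0, 2.0, 1.0]),
}

def sentence_vector(sentence, embeddings, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)    # no known word: fall back to a zero vector
    return np.mean(vectors, axis=0)

print(sentence_vector("Good movie", toy_embeddings).tolist())  # [0.5, 1.0, 1.0]
```

A fixed-length vector like this can be passed to any classical classifier, at the cost of ignoring word order.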

# load the GloVe vectors in a dictionary:

embeddings_index = {}
f = open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt','r',encoding='utf-8')
for line in tqdm(f):
    values = line.split(' ')
    word = values[0]
    coefs = np.asarray([float(val) for val in values[1:]])
    embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
2196018it [06:43, 5439.09it/s]
Found 2196017 word vectors.


LSTMs

Basic Overview

Simple RNNs represented a significant improvement over traditional machine learning algorithms, achieving state-of-the-art results.

However, they struggled to capture long-term dependencies within sequences, particularly in long sentences, because gradients shrink as they are propagated back through many time steps. To overcome this limitation, LSTM (Long Short-Term Memory) networks were introduced by Hochreiter and Schmidhuber in 1997.
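
The difficulty with long-term dependencies can be seen in a one-dimensional toy RNN. The sketch below assumes the scalar recurrence h_t = tanh(w * h_{t-1}) with a hypothetical weight w = 0.9 and tracks how strongly the final state still depends on the initial one; the function name gradient_through_time is illustrative.

```python
import math

def gradient_through_time(w, h0, steps):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # chain rule: derivative of tanh(w * h_prev) with respect to h_prev
        grad *= w * (1.0 - h * h)
    return grad

for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.9, h0=0.5, steps=steps))
```

Every step multiplies the gradient by a factor below 1, so the influence of early inputs fades roughly exponentially with distance. The gates in an LSTM are designed to give the network a path along which this signal can survive.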

Why LSTMs?

What are LSTMs?

Code Implementation

We have already tokenized and padded our text for input to the LSTM.

# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
100%|██████████| 43496/43496 [00:00<00:00, 183357.18it/s]
with strategy.scope():
    # A simple LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))

    model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])

model.summary()
Model: "sequential_2"
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1500, 300)         13049100  
lstm_1 (LSTM)                (None, 100)               160400    
dense_2 (Dense)              (None, 1)                 101       
Total params: 13,209,601
Trainable params: 160,501
Non-trainable params: 13,049,100
CPU times: user 1.33 s, sys: 1.46 s, total: 2.79 s
Wall time: 3.09 s
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)

Epoch 1/5
9600/9600 [==============================] - 117s 12ms/step - loss: 0.3525 - accuracy: 0.8852
Epoch 2/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.2397 - accuracy: 0.9192
Epoch 3/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1904 - accuracy: 0.9333
Epoch 4/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1659 - accuracy: 0.9394
Epoch 5/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1553 - accuracy: 0.9470
<keras.callbacks.callbacks.History at 0x7fae84dac710>
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % (roc_auc(scores,yvalid)))
AUC: 0.96
scores_model.append({'Model': 'LSTM','AUC_Score': roc_auc(scores,yvalid)})

Code Explanation

To begin, we generate an embedding matrix for our vocabulary using pretrained GloVe vectors. When constructing the embedding layer, we pass this matrix as the layer’s weights and set trainable to False, so the vectors are used as-is instead of being learned from our data.

The remainder of the model remains unchanged, except for the substitution of SimpleRNN units with LSTM units.

Model Evaluation

Upon evaluation, we observe that the model no longer exhibits signs of overfitting and achieves an AUC score of 0.96, which is highly commendable.

Additionally, we observe a reduction in the gap between accuracy and AUC. In this instance, we employed dropout to mitigate overfitting of the data.


GRUs

GRU (Gated Recurrent Unit), introduced by Cho et al. in 2014, mitigates the vanishing gradient problem encountered in standard recurrent neural networks (RNNs).

Similar to LSTM (Long Short-Term Memory), GRU is designed to handle long-term dependencies in sequences.

While both GRU and LSTM share similarities and often yield comparable results, GRU is noted for its simplicity and speed, making it an attractive alternative.

However, there is no clear winner between the two architectures.
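
One concrete way to see why a GRU is lighter than an LSTM is to count parameters. The formulas below are a sketch that reproduces the counts shown in this notebook's model summaries (SimpleRNN 40,100, LSTM 160,400, GRU 540,900); the function names are hypothetical, and newer Keras releases default to a GRU variant with an extra bias vector, so the GRU count may differ there.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias for one block."""
    return (input_dim + units + 1) * units

def lstm_params(input_dim, units):
    # 4 blocks: input, forget and output gates plus the candidate cell state
    return 4 * simple_rnn_params(input_dim, units)

def gru_params(input_dim, units):
    # 3 blocks: update and reset gates plus the candidate state
    return 3 * simple_rnn_params(input_dim, units)

print(simple_rnn_params(300, 100))  # 40100
print(lstm_params(300, 100))        # 160400
print(gru_params(300, 300))         # 540900
```

For the same input size and number of units, a GRU therefore needs three quarters of the recurrent parameters of an LSTM.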

For a detailed understanding of GRU networks, refer to the following resources:

Code Implementation

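with strategy.scope():
    # GRU with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(SpatialDropout1D(0.3))  # rate not shown in the source; 0.3 is an assumption
    model.add(GRU(300))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])

model.summary()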
Model: "sequential_3"
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 1500, 300)         13049100  
spatial_dropout1d_1 (Spatial (None, 1500, 300)         0         
gru_1 (GRU)                  (None, 300)               540900    
dense_3 (Dense)              (None, 1)                 301       
Total params: 13,590,301
Trainable params: 541,201
Non-trainable params: 13,049,100
CPU times: user 1.3 s, sys: 1.29 s, total: 2.59 s
Wall time: 2.79 s
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)

Epoch 1/5
9600/9600 [==============================] - 191s 20ms/step - loss: 0.3272 - accuracy: 0.8933
Epoch 2/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.2015 - accuracy: 0.9334
Epoch 3/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.1540 - accuracy: 0.9483
Epoch 4/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.1287 - accuracy: 0.9548
Epoch 5/5
9600/9600 [==============================] - 188s 20ms/step - loss: 0.1238 - accuracy: 0.9551
<keras.callbacks.callbacks.History at 0x7fae5b01ed30>
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % (roc_auc(scores,yvalid)))
AUC: 0.97
scores_model.append({'Model': 'GRU','AUC_Score': roc_auc(scores,yvalid)})
[{'Model': 'SimpleRNN', 'AUC_Score': 0.6949714081921305},
 {'Model': 'LSTM', 'AUC_Score': 0.9598235453841757},
 {'Model': 'GRU', 'AUC_Score': 0.9716554069114769}]

Bi-Directional RNNs

Bidirectional recurrent neural networks (RNNs) are a powerful tool for processing input sequences. They have the unique ability to analyze sequences in both forward and backward directions simultaneously.

This allows them to capture contextual information from both past and future inputs, which is extremely beneficial for tasks like natural language processing (NLP) and time series analysis.

To achieve this dual-direction processing, bidirectional RNNs consist of two separate layers. One layer processes the inputs in the forward direction, while the other layer processes them in the backward direction. The outputs from these two layers are then concatenated or combined to generate the final output sequence.

However, because the backward layer reads future inputs, a bidirectional RNN needs the whole sequence up front. That makes it a good fit for classifying or tagging complete sequences, but not for tasks where the model must predict the next step without seeing the future, such as language modelling or live forecasting.
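
The two-layer arrangement can be sketched with a vanilla RNN run once over the sequence and once over its reverse. The sizes, random weights and the helper name run_rnn are assumptions for illustration; Keras' Bidirectional wrapper does the equivalent for LSTM cells, which is why Bidirectional(LSTM(300)) reports an output of size 600.

```python
import numpy as np

def run_rnn(inputs, W_xh, W_hh):
    """Return the final hidden state of a vanilla RNN (biases omitted)."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return h

rng = np.random.default_rng(1)
seq_len, input_dim, hidden_dim = 6, 3, 4      # toy sizes
inputs = rng.normal(size=(seq_len, input_dim))

# Each direction has its own, independent set of weights
shapes = ((hidden_dim, input_dim), (hidden_dim, hidden_dim))
fwd = [rng.normal(scale=0.5, size=s) for s in shapes]
bwd = [rng.normal(scale=0.5, size=s) for s in shapes]

h_forward = run_rnn(inputs, *fwd)           # reads the sequence first to last
h_backward = run_rnn(inputs[::-1], *bwd)    # reads the sequence last to first
combined = np.concatenate([h_forward, h_backward])

print(combined.shape)  # (8,): twice the hidden size
```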

If you’re interested in diving deeper into bidirectional RNNs and their practical applications, I recommend exploring the following resources:

  • Coursera Lecture on Bidirectional RNN: This lecture on Coursera provides an in-depth explanation of bidirectional RNNs. You can find it here.
  • Understanding Bidirectional RNN in PyTorch: This article on Towards Data Science offers a comprehensive understanding of bidirectional RNNs and their implementation using PyTorch. You can read it here.
  • Bidirectional RNN on The website provides a detailed explanation of bidirectional RNNs in their chapter on recurrent neural networks. You can find it here.

These resources will not only help you gain a thorough understanding of bidirectional RNNs but also provide practical guidance on implementing them in various domains. Happy exploring!


Code Implementation

with strategy.scope():
    # A simple bidirectional LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])

model.summary()
Model: "sequential_4"
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 1500, 300)         13049100  
bidirectional_1 (Bidirection (None, 600)               1442400   
dense_4 (Dense)              (None, 1)                 601       
Total params: 14,492,101
Trainable params: 1,443,001
Non-trainable params: 13,049,100
CPU times: user 2.39 s, sys: 1.62 s, total: 4 s
Wall time: 3.41 s
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)

Epoch 1/5
9600/9600 [==============================] - 322s 34ms/step - loss: 0.3171 - accuracy: 0.9009
Epoch 2/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1988 - accuracy: 0.9305
Epoch 3/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1650 - accuracy: 0.9424
Epoch 4/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1577 - accuracy: 0.9414
Epoch 5/5
9600/9600 [==============================] - 319s 33ms/step - loss: 0.1540 - accuracy: 0.9459
<keras.callbacks.callbacks.History at 0x7fae5a4ade48>
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % (roc_auc(scores,yvalid)))
AUC: 0.97
scores_model.append({'Model': 'Bi-directional LSTM','AUC_Score': roc_auc(scores,yvalid)})

Code Explanation

The only change is that the LSTM layer we used previously is now wrapped in Bidirectional, which is easy to follow. We have achieved similar accuracy and AUC scores as before, and we have now covered all the common RNN architectures.

With this, we conclude Part 1 of our notebook. Get ready for the next part where we will explore more complex and state-of-the-art models. If you have followed along and understood the concepts discussed so far, you should be able to grasp these advanced models.

I recommend completing Part 1 before diving into the upcoming techniques, as they can be quite challenging without a strong foundation.

Seq2Seq Model Architecture


RNNs come in various types, each designed for different purposes. Check out this informative video explaining the different model architectures: Different Types of RNNs.

Instead of providing code implementations, I’ll direct you to resources where the code has already been implemented and explained in detail:

# Visualization of Results obtained from various Deep learning models
results = pd.DataFrame(scores_model).sort_values(by='AUC_Score',ascending=False)
results.style.background_gradient(cmap='Blues')
   Model                AUC_Score
2  GRU                  0.971655
3  Bi-directional LSTM  0.966693
1  LSTM                 0.959824
0  SimpleRNN            0.694971
fig = go.Figure(go.Funnelarea(
    text=results.Model,
    values=results.AUC_Score,
    title={"position": "top center", "text": "Funnel chart of AUC scores by model"}
    ))
fig.show()

Attention Models

This part can be quite challenging, but mastering the intuition and functionality of the attention block is crucial. Once you grasp this concept, understanding transformers and transformer-based architectures like BERT will become much easier.

I highly recommend dedicating sufficient time to understand this section thoroughly. Follow these resources in the provided order to avoid confusion, and try to create your own representation of an attention block:
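
As a starting point for building your own attention block, here is a minimal NumPy sketch of scaled dot-product attention, the form used in transformers. The shapes and random inputs are assumptions for illustration, and this is only one of several attention variants covered by the resources.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Weight the values V by how well each query in Q matches each key in K."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query to every key
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries of dimension 4
K = rng.normal(size=(5, 4))   # 5 keys of dimension 4
V = rng.normal(size=(5, 3))   # 5 values of dimension 3

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (2, 3) (2, 5)
```

Each output row is a weighted average of the value vectors, with the weights telling you which input positions the query "attends" to.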

Code Implementation

Transformers : Attention is all you need

We’ve finally reached a crucial point in our learning journey, where we’ll dive into the technology that revolutionized NLP and paved the way for state-of-the-art techniques: Transformers. These were introduced in the groundbreaking paper “Attention Is All You Need” by Google. If you’ve grasped the concepts of attention models, understanding transformers will be a breeze. Here’s a comprehensive guide to Transformers:

Understanding Transformers:

Code Implementation:

BERT and Its Implementation:

Now, let’s delve into BERT, another groundbreaking architecture. Follow these resources in order:

  • Illustrated BERT: Gain an in-depth understanding of BERT with clear explanations and illustrations.

After exploring the above resource, you’ll have a solid understanding of how transformer architectures like BERT are leveraged by state-of-the-art models. These architectures can be used in two ways:

Using Pre-trained BERT without Tuning:

Tuning BERT for Your Task:

For our implementation, we’ll use the first example as a foundation, leveraging the Hugging Face and Keras libraries. However, unlike the first example, we’ll fine-tune our model for our task.

Acknowledgements: We’ll be following in the footsteps of this Kaggle kernel by xhlulu, which provides valuable insights and guidance.

Steps Involved:

  • Data Preparation: Tokenization and encoding of data.
  • Configuring TPUs.
  • Building a Function for Model Training and adding an output layer for classification.
  • Training the model and analyzing the results.
# Loading Dependencies
import os
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from kaggle_datasets import KaggleDatasets
import transformers

from tokenizers import BertWordPieceTokenizer

train1 = pd.read_csv("/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")
valid = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')

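Encoder for the data. To understand what encode_batch does, read the Hugging Face tokenizers documentation: here

def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encoder for encoding the text into sequence of integers for BERT Input
    """
    # Truncate and pad every sequence to maxlen so the ids form a rectangular array
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(length=maxlen)  # older tokenizers releases name this argument max_length
    all_ids = []
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    return np.array(all_ids)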


# Configuration
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192


To understand the tokenizer setup below, please refer to the Hugging Face documentation again.
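
The tokenizer used in this section is a WordPiece tokenizer, which splits unknown words into known sub-word pieces marked with a ## prefix. The sketch below shows the greedy longest-match-first idea on a tiny made-up vocabulary; the function name wordpiece and toy_vocab are hypothetical, and the real library adds normalization, special tokens and a vocabulary of over a hundred thousand entries.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:                 # try the longest remaining piece first
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece     # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]                   # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
```

Splitting into sub-words keeps the vocabulary manageable while still covering rare words and many languages, which matters for a multilingual model.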

# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally (this writes vocab.txt to the working directory)
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer

Downloading: 100%

996k/996k [00:00<00:00, 4.84MB/s]

Tokenizer(vocabulary_size=119547, model=BertWordPiece, add_special_tokens=True, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True, strip_accents=True, lowercase=False, wordpieces_prefix=##)
x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN)

y_train = train1.toxic.values
y_valid = valid.toxic.values
100%|██████████| 874/874 [00:35<00:00, 24.35it/s]
100%|██████████| 32/32 [00:01<00:00, 20.87it/s]
100%|██████████| 250/250 [00:11<00:00, 22.06it/s]
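AUTO = tf.data.experimental.AUTOTUNE

train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)  # buffer size is an assumption; it is not shown in the source
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

# The test pipeline is cut off in the source; batching x_test is the minimal assumption
test_dataset = tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE)

def build_model(transformer, max_len=512):
    """
    Build a binary classifier on top of a pretrained transformer
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    # Use the representation of the first ([CLS]) token as the sentence summary
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    return model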

Starting Training

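If you want to use another model, just replace the model class and checkpoint name from the transformers library below.

with strategy.scope():
    transformer_layer = (
        transformers.TFDistilBertModel
        .from_pretrained('distilbert-base-multilingual-cased')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()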

Downloading: 100%

618/618 [00:00<00:00, 1.11kB/s] Downloading: 100%

911M/911M [00:25<00:00, 36.0MB/s]

Model: "model"
Layer (type)                 Output Shape              Param #   
input_word_ids (InputLayer)  [(None, 192)]             0         
tf_distil_bert_model (TFDist ((None, 192, 768),)       134734080 
tf_op_layer_strided_slice (T [(None, 768)]             0         
dense (Dense)                (None, 1)                 769       
Total params: 134,734,849
Trainable params: 134,734,849
Non-trainable params: 0
CPU times: user 34.4 s, sys: 13.3 s, total: 47.7 s
Wall time: 50.8 s
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(train_dataset, steps_per_epoch=n_steps, validation_data=valid_dataset, epochs=3)
Train for 1746 steps, validate for 63 steps
Epoch 1/3
1746/1746 [==============================] - 255s 146ms/step - loss: 0.1221 - accuracy: 0.9517 - val_loss: 0.4484 - val_accuracy: 0.8479
Epoch 2/3
1746/1746 [==============================] - 198s 114ms/step - loss: 0.0908 - accuracy: 0.9634 - val_loss: 0.4769 - val_accuracy: 0.8491
Epoch 3/3
1746/1746 [==============================] - 198s 113ms/step - loss: 0.0775 - accuracy: 0.9680 - val_loss: 0.5522 - val_accuracy: 0.8500
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(valid_dataset.repeat(), steps_per_epoch=n_steps, epochs=6)
Train for 62 steps
Epoch 1/6
62/62 [==============================] - 18s 291ms/step - loss: 0.3244 - accuracy: 0.8613
Epoch 2/6
62/62 [==============================] - 25s 401ms/step - loss: 0.2354 - accuracy: 0.8955
Epoch 3/6
62/62 [==============================] - 7s 110ms/step - loss: 0.1718 - accuracy: 0.9252
Epoch 4/6
62/62 [==============================] - 7s 111ms/step - loss: 0.1210 - accuracy: 0.9492
Epoch 5/6
62/62 [==============================] - 7s 114ms/step - loss: 0.0798 - accuracy: 0.9686
Epoch 6/6
62/62 [==============================] - 7s 110ms/step - loss: 0.0765 - accuracy: 0.9696
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)
499/499 [==============================] - 41s 82ms/step

What are transformers in deep learning?

Transformers are a type of deep learning model used primarily for understanding and generating language. They excel at working out what each word means in the context of its sentence and combining those meanings into a useful representation.

Unlike older recurrent models, transformers do not process words one at a time. Instead, an attention mechanism looks at the entire sentence at once, weighing how strongly each word relates to every other word. Because all positions are processed in parallel, transformers train quickly and can capture intricate language structures.
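To make the idea of weighing every word against every other word more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is only an illustration, not the code used in this lesson: the random inputs, the sizes, and the names `self_attention`, `w_q`, `w_k` and `w_v` are all assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sentence.

    x: (seq_len, d_model) token vectors.
    w_q, w_k, w_v: (d_model, d_k) projection matrices (hypothetical, random here).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # One score per pair of tokens, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Each row becomes a probability distribution over all tokens in the sentence.
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4  # toy sizes, assumed for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(weights.round(2))  # row i shows how much token i attends to each token
```

Every row of `weights` sums to 1, so each output vector is a weighted average of the value vectors of all tokens in the sentence, computed for all positions in one matrix product.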

Transformers have gained significant prominence in recent years because they improve results on tasks such as language translation and text summarization. They are akin to the new superheroes of language comprehension!

Why are transformers better than CNNs?

Transformers are preferred over CNNs in tasks such as language processing due to their ability to capture long-range dependencies in sequential data more effectively.

While CNNs are great at recognizing local patterns, transformers’ attention mechanism enables them to grasp a wider context, making them more suitable for tasks related to language comprehension and generation.

Nevertheless, the decision between transformers and CNNs varies depending on the particular task and dataset being used.
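The local-versus-global contrast can be checked with a small experiment. The sketch below is a toy setup, assuming a width-3 convolution and an attention layer with identity projections (both simplifications made up for this example). It changes only the last token of a 10-token sequence and checks whether the output at the first position reacts.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1D convolution along the sequence axis with zero padding."""
    pad = len(kernel) // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(x.shape[0])
    ])

def attention(x):
    """Self-attention with identity projections, to keep the sketch short."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))   # 10 tokens with 4 features each (toy data)
x_changed = x.copy()
x_changed[9] += 1.0            # perturb only the last token

kernel = np.array([0.25, 0.5, 0.25])  # width-3 kernel, so each output sees 3 neighbours

# Does the output at position 0 change when only token 9 changes?
conv_sees_change = not np.array_equal(
    conv1d_same(x, kernel)[0], conv1d_same(x_changed, kernel)[0])
attn_sees_change = not np.array_equal(
    attention(x)[0], attention(x_changed)[0])
print(conv_sees_change, attn_sees_change)
```

A single convolution layer only sees a window of neighbouring tokens, so the first position is unaffected, while one attention layer already connects the first token to the last. Stacking many convolution layers widens the window, which is one reason the choice still depends on the task.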

What are transformers in NLP?

Transformers were first introduced in the 2017 paper “Attention is All You Need” by Vaswani et al. and have completely transformed the field of natural language processing (NLP).

Instead of the recurrence used in recurrent neural networks (RNNs) or the convolutions used in convolutional neural networks (CNNs), Transformers rely on an attention mechanism that processes all input positions in parallel, which helps them capture long-range dependencies.

These models have now become the foundation of numerous cutting-edge NLP models such as BERT, GPT, and T5.

What is the transformer in GPT?

GPT (Generative Pre-trained Transformer) utilizes the transformer architecture for various natural language processing tasks.

It specifically utilizes a variant called the decoder-only transformer. The model is pre-trained on extensive text data by predicting the next word in a sequence from the preceding context, and it can then be fine-tuned for specific tasks such as text generation and question answering.
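What lets a decoder-only model predict the next word without peeking ahead is a causal mask on the attention scores. The snippet below is a minimal NumPy sketch of that masking step only, using random scores and a hypothetical helper named `causal_attention_weights`. It is not GPT's actual implementation.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions so token i can only attend to tokens 0..i."""
    seq_len = scores.shape[0]
    # True above the diagonal marks the "future" positions to hide.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Softmax per row; exp(-inf) is 0, so masked positions get zero weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # toy attention scores for a 4-token sequence
weights = causal_attention_weights(scores)
print(weights.round(2))  # lower-triangular: no weight on later tokens
```

The first token can only attend to itself, the second to the first two tokens, and so on, which is what makes left-to-right next-word prediction a fair training task.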


I decided to share my learning journey with the community to help others benefit too. The encouragement and generosity I’ve experienced here inspired me to give back.

I’ve shared all the resources I utilized to grasp these ideas, aiming to make NLP competitions more accessible for all.

It took me 10 days to learn everything, but feel free to learn at your own speed. Don’t be disheartened by the intricacy of the methods. By the end of this journey, you’ll feel proud and it will all be worth it.

