Introduction
Welcome to our blog! Today, we’re delving into Lesson 3: Exploring the Top Transformers and BERT Tutorial for Deep Learning and NLP.
But don’t forget to check:
- Lesson 1: Best Deep Learning Tutorial for Beginners 2024
- Lesson 2: Best Pytorch Tutorial for Deep Learning
- Lesson 3: Best Transformers and BERT Tutorial with Deep Learning and NLP
- Lesson 4: Best Deep Reinforcement Learning Course
- Lesson 5: Introduction to Deep Learning with a Simple LSTM
Transformers and BERT are truly the superheroes of the NLP realm, changing the game in how we comprehend and analyze language.
Throughout this blog post, we’ll serve as your trusty guides as we delve into these incredible tools, revealing their hidden gems and maximizing their capabilities.
Whether you’re a seasoned NLP aficionado or just starting out, this tutorial is jam-packed with valuable insights, tricks, and practical examples to enhance your expertise and utilize the power of Transformers and BERT.
So, grab your superhero cape (or keyboard) and come along on this thrilling journey through the realm of deep learning and natural language processing! It’s an adventure you definitely won’t want to miss. Let’s get started!
Course Overview
Let’s start by diving into the fundamentals of RNNs and then gradually move on to the most recent deep learning architectures designed for tackling NLP challenges. Here’s what we’ll go through:
- Basic RNNs
- Word Embeddings: Explanation and How to Acquire Them
- LSTMs
- GRUs
- Bi-Directional RNNs
- Encoder-Decoder Models (Seq2Seq Models)
- Attention Models
- Transformers – “Attention is All You Need”
- BERT
This detailed guide aims to provide you with a solid understanding of these methods. By the end, you’ll be well-versed in all these concepts if you stick with it.
Please remember that the purpose of this notebook is not to achieve a high LB (Leaderboard) score but to act as a beginner-friendly manual for comprehending deep learning techniques applied in NLP. Furthermore, after covering these topics, I’ll introduce a basic solution for this competition.
So, get ready and let’s embark on this educational journey together!
Let’s Start Coding
This kernel has been a work of more than 10 days. If you find it useful and my efforts appreciable, please upvote it; it motivates me to write more quality content.
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU, SimpleRNN
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
Using TensorFlow backend.
Configuring TPUs
We will utilize TPUs in this notebook since we need to build a BERT model.
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)
Running on TPU  grpc://10.0.0.2:8470
REPLICAS:  8
train = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
To make things simpler for training the models, we’ll focus on a smaller subset of the dataset, consisting of only 12,000 data points.
Additionally, we’ll treat this problem as a Binary Classification Problem by removing the other columns.
train.drop(['severe_toxic','obscene','threat','insult','identity_hate'],axis=1,inplace=True)
train = train.loc[:12000, :]
train.shape
(12001, 3)
Let’s determine the maximum word limit for comments, which will assist us in padding afterwards:
train['comment_text'].apply(lambda x:len(str(x).split())).max()
1403
Writing a function to compute the AUC score on the validation set:
def roc_auc(predictions, target):
    '''
    This method returns the AUC Score when given the Predictions and Labels
    '''
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    roc_auc = metrics.auc(fpr, tpr)
    return roc_auc
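As a quick sanity check, this helper should agree with scikit-learn's built-in shortcut; a toy example with made-up labels and predictions, using the imports above:

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

print(roc_auc(y_pred, y_true))                # 0.75
print(metrics.roc_auc_score(y_true, y_pred))  # 0.75, the one-liner equivalent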
Data Preparation
xtrain, xvalid, ytrain, yvalid = train_test_split(train.comment_text.values, train.toxic.values,
                                                  stratify=train.toxic.values,
                                                  random_state=42,
                                                  test_size=0.2, shuffle=True)
Before We Begin
Before we dive into our tutorial, I want to make sure everyone’s on the same page. If you’re completely new to NLP and haven’t worked with text data before, don’t worry! I’ve got you covered with some excellent resources to get you started:
- Spooky NLP and Topic Modelling Tutorial: Kernel Link
- Approaching Almost Any NLP Problem on Kaggle: Kernel Link
- What’s Cooking?: Kernel Link
These kernels will serve as a fantastic starting point for your NLP journey.
Additionally, if you need a refresher on basic neural networks, check out these helpful resources:
- Neural Networks from Scratch: YouTube Playlist
- Deep Learning Basics: YouTube Playlist
- Backpropagation Explained: YouTube Playlist
- Gradient Descent: YouTube Playlist
Once you’re feeling comfortable, you can learn how to visualize text data and more with these insightful kernels:
- Twitter Sentiment Extraction Analysis: Kernel Link
- Stop the S! Toxic Comments EDA: Kernel Link
These resources will provide a solid foundation for understanding NLP and neural networks. Let’s get ready to learn and grow together!
Simple RNN
Basic Overview
What is an RNN?
Recurrent Neural Networks (RNNs) are a type of neural network that remember information from previous steps and use it to make decisions in the current step.
In regular neural networks, each input and output are treated separately, without any connection to previous or future steps. However, in tasks like predicting the next word in a sentence, we need to consider the words that came before.
RNNs help with this by using a hidden layer to remember information from previous steps, allowing them to better understand and predict sequences of data.
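To make that recurrence concrete, here is a minimal NumPy sketch of a single RNN step (the weight names W_xh, W_hh, and b_h are illustrative, not from any library):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state mixes the current input with the previous state,
    # which is how the network "remembers" earlier words.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: 300-dim word vectors, 100-dim hidden state
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(300, 100))
W_hh = rng.normal(size=(100, 100))
b_h = np.zeros(100)

h = np.zeros(100)
for x_t in rng.normal(size=(5, 300)):  # a toy 5-word "sentence"
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

The same weights are reused at every step; only the hidden state h changes as the sequence is read.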
Why RNNs?
RNNs excel in tasks where data order is crucial, such as predicting the next word in a sentence or forecasting future values in a time series. They possess a unique ability to retain information from previous steps, enhancing prediction accuracy by taking context into account.
RNNs stand out for their capability to accommodate sequences of varying lengths, making them adaptable for a wide range of applications. They are widely utilized in fields like natural language processing, speech recognition, and image captioning.
The reason RNNs are favored is due to their versatility, flexibility, and proficiency in comprehending data sequences!
For a deeper understanding of Recurrent Neural Networks (RNNs), check out these resources:
- Medium Article: Understanding the Recurrent Neural Network
- YouTube Video Series: RNN – Understanding the Basics
- Online Book Chapter: RNN – Dive into Deep Learning
These resources offer valuable insights and explanations to help you grasp the concepts of RNNs more thoroughly.
Code Implementation
So first I’ll implement the model, and then I’ll explain the code step by step.
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 1500

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

# zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
%%time
with strategy.scope():
    # A simpleRNN without any pretrained embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1, 300, input_length=max_len))
    model.add(SimpleRNN(100))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 1500, 300) 13049100 _________________________________________________________________ simple_rnn_1 (SimpleRNN) (None, 100) 40100 _________________________________________________________________ dense_1 (Dense) (None, 1) 101 ================================================================= Total params: 13,089,301 Trainable params: 13,089,301 Non-trainable params: 0 _________________________________________________________________ CPU times: user 620 ms, sys: 370 ms, total: 990 ms Wall time: 1.18 s
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)  # multiplying the batch size by the replica count to use all TPU cores
Epoch 1/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.3714 - accuracy: 0.8805
Epoch 2/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.2858 - accuracy: 0.9055
Epoch 3/5
9600/9600 [==============================] - 40s 4ms/step - loss: 0.2748 - accuracy: 0.8945
Epoch 4/5
9600/9600 [==============================] - 40s 4ms/step - loss: 0.2416 - accuracy: 0.9053
Epoch 5/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.2109 - accuracy: 0.9079
<keras.callbacks.callbacks.History at 0x7fae866d75c0>
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % roc_auc(scores, yvalid))
AUC: 0.69
scores_model = []
scores_model.append({'Model': 'SimpleRNN', 'AUC_Score': roc_auc(scores, yvalid)})
Code Explanation
After watching the videos and checking out the provided links, you’ve likely discovered that when working with RNNs, we feed sentences into the model word by word. Each word is transformed into a one-hot vector, with dimensions matching the total number of words in the vocabulary plus one.
Now, let’s talk about the Keras Tokenizer. It gathers all the distinct words from the text dataset and constructs a dictionary where each word is a key and its frequency is the value. This dictionary is then sorted based on word frequency in a descending order.
Subsequently, each word is assigned a unique numerical index. For instance, if ‘the’ is the most common word in the dataset, it will be given index 1. The vector representing ‘the’ will be a one-hot vector with a value of 1 at index 1 and zeros elsewhere.
By examining the first element of xtrain_seq, you’ll notice that each word is now denoted by its corresponding numerical index. This tokenization step is crucial for preparing the text data as input to the RNN model.
xtrain_seq[:1]
[[664, 65, 7, 19, 2262, 14102, 5, 2262, 20439, 6071, 4, 71, 32, 20440, 6620, 39, 6, 664, 65, 11, 8, 20441, 1502, 38, 6072]]
Now, you might be wondering: What is padding, and why is it done?
Here’s the answer:
- Quora Discussion: Effect of Sequence Padding on Neural Network Training
- Machine Learning Mastery Article: Data Preparation for Variable-Length Input Sequences
- Coursera Lecture: Understanding Padding
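As a quick illustration of what padding does (a toy example, unrelated to our dataset), pad_sequences left-pads shorter sequences with zeros by default:

from keras.preprocessing import sequence

toy_seqs = [[664, 65, 7], [19, 2262]]
print(sequence.pad_sequences(toy_seqs, maxlen=5))
# [[   0    0  664   65    7]
#  [   0    0    0   19 2262]]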
Additionally, sometimes special tokens like EOS (end of string) and BOS (beginning of string) are used during tokenization. Here’s why:
- Stack Overflow Discussion: Reasons for Padding in NLP Tasks
The code token.word_index provides the vocabulary dictionary created by Keras.
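A toy example of the frequency-based indexing described above (the sentences are invented for illustration):

from keras.preprocessing import text

toy_token = text.Tokenizer()
toy_token.fit_on_texts(['the cat sat', 'the dog sat down'])
print(toy_token.word_index)
# {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'down': 5}

The most frequent words get the smallest indices, which is exactly what the num_words argument relies on when you cap the vocabulary.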
Building the Neural Network:
To understand the dimensions of input and output given to an RNN in Keras, check out this helpful article:
- Medium Article: Understanding Input and Output Shape in LSTM Keras
Now, let’s break down the model:
The first line, model = Sequential(), tells Keras that we’ll be building our network sequentially.
- We start by adding an Embedding layer, which converts each word’s one-hot vector into a 300-dimensional vector, similar to Word2Vec.
- Next, we add a SimpleRNN layer with 100 units, without any dropout or regularization.
- Finally, we add a single neuron with a sigmoid activation that predicts the result from the outputs of the 100 RNN units.
After defining the model, we compile it using the Adam optimizer.
Comments on the Model:
While our model reaches a training accuracy of around 0.91, the validation AUC is only about 0.69, so we’re clearly overfitting. This was the simplest possible model, and there are many hyperparameters we can tune to improve performance, such as adjusting the number of RNN units, using batch normalization, or adding dropout.
Even so, getting a working text classifier with this little effort highlights the power of deep learning; the bigger gains come next, once we add pretrained embeddings and stronger architectures.
Word Embeddings
When we were building our simple RNN models, we discussed the concept of word embeddings. So, what exactly are word embeddings and how do we obtain them?
To obtain word embeddings, a standard approach is to use pre-trained GloVe or fastText models. Without delving into too many technical details, I’ll explain how we can create sentence vectors and utilize them to build a machine learning model.
Personally, I’m a fan of GloVe vectors, word2vec, and fastText. In this notebook, I’ll be utilizing the GloVe vectors. You can download them from this link or search for GloVe in the Kaggle datasets and add the file.
# load the GloVe vectors in a dictionary:
embeddings_index = {}
f = open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt', 'r', encoding='utf-8')
for line in tqdm(f):
    values = line.split(' ')
    word = values[0]
    coefs = np.asarray([float(val) for val in values[1:]])
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
2196018it [06:43, 5439.09it/s]
Found 2196017 word vectors.
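With embeddings_index loaded, one simple way to build the sentence vectors mentioned above is to average the GloVe vectors of the words in a sentence; a minimal sketch (the helper name sent2vec is mine, and it reuses np and embeddings_index from the cells above):

def sent2vec(sentence, dim=300):
    # Average the GloVe vectors of all in-vocabulary words;
    # out-of-vocabulary words are simply skipped.
    words = str(sentence).split()
    vecs = [embeddings_index[w] for w in words if w in embeddings_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# e.g. sent2vec("this is a comment").shape -> (300,)

Such averaged vectors can feed any classical classifier, though below we will instead pass the vectors word by word through an embedding layer.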
LSTMs
Basic Overview
Simple RNNs were a significant improvement over traditional machine learning algorithms for sequence tasks, achieving state-of-the-art results at the time.
However, they face challenges in capturing long-term dependencies within sequences, particularly in long sentences. To overcome this limitation, LSTM (Long Short-Term Memory) networks were introduced by Hochreiter and Schmidhuber in 1997.
Why LSTMs?
- The introduction to LSTM networks addresses the issue of vanishing gradients in RNNs and explains how LSTMs mitigate this problem. (Link: Vanishing Gradients with RNNs – Coursera Lecture)
- The Analytics Vidhya article provides an overview of LSTM networks and their importance in deep learning, offering insights into their architecture and applications. (Link: Introduction to LSTM – Analytics Vidhya)
What are LSTMs?
- The Coursera lecture delves into the specifics of LSTM networks, explaining their architecture and how they address the challenge of capturing long-term dependencies in sequences. (Link: Long Short-Term Memory (LSTM) – Coursera Lecture)
- The Distill.pub article explores the concept of memorization in RNNs, providing insights into how LSTM networks enable better memory retention and learning. (Link: Memorization in RNNs – Distill.pub)
- The Towards Data Science guide offers an illustrated explanation of LSTMs and GRUs, breaking down their architecture and operation in a step-by-step manner. (Link: Illustrated Guide to LSTMs and GRUs – Towards Data Science)
Code Implementation
We have already tokenized and padded our text, so it is ready to feed into the LSTM.
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
100%|██████████| 43496/43496 [00:00<00:00, 183357.18it/s]
%%time
with strategy.scope():
    # A simple LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1, 300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_2 (Embedding) (None, 1500, 300) 13049100 _________________________________________________________________ lstm_1 (LSTM) (None, 100) 160400 _________________________________________________________________ dense_2 (Dense) (None, 1) 101 ================================================================= Total params: 13,209,601 Trainable params: 160,501 Non-trainable params: 13,049,100 _________________________________________________________________ CPU times: user 1.33 s, sys: 1.46 s, total: 2.79 s Wall time: 3.09 s
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)
Epoch 1/5
9600/9600 [==============================] - 117s 12ms/step - loss: 0.3525 - accuracy: 0.8852
Epoch 2/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.2397 - accuracy: 0.9192
Epoch 3/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1904 - accuracy: 0.9333
Epoch 4/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1659 - accuracy: 0.9394
Epoch 5/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1553 - accuracy: 0.9470
<keras.callbacks.callbacks.History at 0x7fae84dac710>
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % roc_auc(scores, yvalid))
AUC: 0.96
scores_model.append({'Model': 'LSTM','AUC_Score': roc_auc(scores,yvalid)})
Code Explanation
To begin, we generate an embedding matrix for our vocabulary using pre-trained GloVe vectors. When constructing the embedding layer, we use this matrix as the layer’s weights instead of training embeddings on the vocabulary directly, setting trainable to False.
The remainder of the model remains unchanged, except for the substitution of SimpleRNN units with LSTM units.
Model Evaluation
Upon evaluation, we observe that the model no longer exhibits signs of overfitting and achieves an AUC score of 0.96, which is highly commendable.
Additionally, we observe a reduction in the gap between accuracy and AUC. In this instance, we employed dropout to mitigate overfitting of the data.
GRUs
GRU (Gated Recurrent Unit), introduced by Cho et al. in 2014, addresses the vanishing gradient problem encountered in standard recurrent neural networks (RNNs).
Similar to LSTM (Long Short-Term Memory), GRU is designed to handle long-term dependencies in sequences.
While both GRU and LSTM share similarities and often yield comparable results, GRU is noted for its simplicity and speed, making it an attractive alternative.
However, there is no clear winner between the two architectures.
For a detailed understanding of GRU networks, refer to the following resources:
- Understanding GRU Networks – Towards Data Science
- Gated Recurrent Unit (GRU) – Coursera Lecture
- Gated Recurrent Unit Networks – GeeksforGeeks
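Before the Keras version below, here is a minimal NumPy sketch of a single GRU step to make the gating concrete (the weight names are illustrative, not from any library):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(xh @ W_z)  # update gate: how much of the state to refresh
    r = sigmoid(xh @ W_r)  # reset gate: how much past state to expose
    h_cand = np.tanh(np.concatenate([x_t, r * h_prev]) @ W_h)  # candidate state
    return (1 - z) * h_prev + z * h_cand  # interpolate old state and candidate

With only two gates (versus the LSTM’s three) and no separate cell state, a GRU has fewer parameters, which is where its speed advantage comes from.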
Code Implementation
%%time
with strategy.scope():
    # GRU with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1, 300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(SpatialDropout1D(0.3))
    model.add(GRU(300))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
Model: "sequential_3" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_3 (Embedding) (None, 1500, 300) 13049100 _________________________________________________________________ spatial_dropout1d_1 (Spatial (None, 1500, 300) 0 _________________________________________________________________ gru_1 (GRU) (None, 300) 540900 _________________________________________________________________ dense_3 (Dense) (None, 1) 301 ================================================================= Total params: 13,590,301 Trainable params: 541,201 Non-trainable params: 13,049,100 _________________________________________________________________ CPU times: user 1.3 s, sys: 1.29 s, total: 2.59 s Wall time: 2.79 s
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)
Epoch 1/5
9600/9600 [==============================] - 191s 20ms/step - loss: 0.3272 - accuracy: 0.8933
Epoch 2/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.2015 - accuracy: 0.9334
Epoch 3/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.1540 - accuracy: 0.9483
Epoch 4/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.1287 - accuracy: 0.9548
Epoch 5/5
9600/9600 [==============================] - 188s 20ms/step - loss: 0.1238 - accuracy: 0.9551
<keras.callbacks.callbacks.History at 0x7fae5b01ed30>
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % roc_auc(scores, yvalid))
AUC: 0.97
scores_model.append({'Model': 'GRU','AUC_Score': roc_auc(scores,yvalid)})
scores_model
[{'Model': 'SimpleRNN', 'AUC_Score': 0.6949714081921305},
 {'Model': 'LSTM', 'AUC_Score': 0.9598235453841757},
 {'Model': 'GRU', 'AUC_Score': 0.9716554069114769}]
Bi-Directional RNNs
Bidirectional recurrent neural networks (RNNs) are a powerful tool for processing input sequences. They have the unique ability to analyze sequences in both forward and backward directions simultaneously.
This allows them to capture contextual information from both past and future inputs, which is extremely beneficial for tasks like natural language processing (NLP) and time series analysis.
To achieve this dual-direction processing, bidirectional RNNs consist of two separate layers. One layer processes the inputs in the forward direction, while the other layer processes them in the backward direction. The outputs from these two layers are then concatenated or combined to generate the final output sequence.
However, it’s important to manage sequence boundaries effectively to prevent future information from leaking into current time steps during training. This ensures that the model learns to understand the sequence context accurately.
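In Keras, this two-layer setup is handled by the Bidirectional wrapper; a tiny shape check (toy dimensions, reusing the imports from earlier in this notebook):

m = Sequential()
# 300 units per direction -> the concatenated output is 600-dimensional
m.add(Bidirectional(LSTM(300), input_shape=(1500, 300)))
print(m.output_shape)  # (None, 600)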
If you’re interested in diving deeper into bidirectional RNNs and their practical applications, I recommend exploring the following resources:
- Coursera Lecture on Bidirectional RNN: This lecture on Coursera provides an in-depth explanation of bidirectional RNNs. You can find it here.
- Understanding Bidirectional RNN in PyTorch: This article on Towards Data Science offers a comprehensive understanding of bidirectional RNNs and their implementation using PyTorch. You can read it here.
- Bidirectional RNN on d2l.ai: The d2l.ai website provides a detailed explanation of bidirectional RNNs in their chapter on recurrent neural networks. You can find it here.
These resources will not only help you gain a thorough understanding of bidirectional RNNs but also provide practical guidance on implementing them in various domains. Happy exploring!
Code Implementation
%%time
with strategy.scope():
    # A simple bidirectional LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1, 300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
Model: "sequential_4" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_4 (Embedding) (None, 1500, 300) 13049100 _________________________________________________________________ bidirectional_1 (Bidirection (None, 600) 1442400 _________________________________________________________________ dense_4 (Dense) (None, 1) 601 ================================================================= Total params: 14,492,101 Trainable params: 1,443,001 Non-trainable params: 13,049,100 _________________________________________________________________ CPU times: user 2.39 s, sys: 1.62 s, total: 4 s Wall time: 3.41 s
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)
Epoch 1/5
9600/9600 [==============================] - 322s 34ms/step - loss: 0.3171 - accuracy: 0.9009
Epoch 2/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1988 - accuracy: 0.9305
Epoch 3/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1650 - accuracy: 0.9424
Epoch 4/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1577 - accuracy: 0.9414
Epoch 5/5
9600/9600 [==============================] - 319s 33ms/step - loss: 0.1540 - accuracy: 0.9459
<keras.callbacks.callbacks.History at 0x7fae5a4ade48>
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % roc_auc(scores, yvalid))
AUC: 0.97
scores_model.append({'Model': 'Bi-directional LSTM','AUC_Score': roc_auc(scores,yvalid)})
Code Explanation
The code is almost unchanged; we simply wrapped the LSTM layer we used earlier in a Bidirectional wrapper. We achieve similar accuracy and AUC scores as before, and with this we have covered all the common RNN architectures.
With this, we conclude Part 1 of our notebook. Get ready for the next part where we will explore more complex and state-of-the-art models. If you have followed along and understood the concepts discussed so far, you should be able to grasp these advanced models.
I recommend completing Part 1 before diving into the upcoming techniques, as they can be quite challenging without a strong foundation.
Seq2Seq Model Architecture
Overview
RNNs come in various types, each designed for different purposes. Check out this informative video explaining the different model architectures: Different Types of RNNs.
Instead of providing code implementations, I’ll direct you to resources where the code has already been implemented and explained in detail:
- Basic Models – Coursera Lecture: Offers insights into different Seq2Seq Models.
- Introduction to Sequence-to-Sequence Learning in Keras, Neural Machine Translation with Keras: Explanation and implementation of basic Encoder-Decoder Model.
- How to Implement Seq2Seq LSTM Model in Keras: Details a more advanced Seq2Seq Model.
- Machine Translation and Dataset – d2l.ai, Encoder-Decoder – d2l.ai: Implementation of Encoder-Decoder Model from scratch.
- Introduction to Seq2Seq by fast.ai: Comprehensive overview of Seq2Seq.
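To complement these resources, here is a bare-bones sketch of the encoder-decoder idea in the Keras functional API (a schematic with toy dimensions, not a tuned model):

from keras.models import Model
from keras.layers import Input, Dense
from keras.layers.recurrent import LSTM

# Encoder: compress the source sequence into a fixed-size state
enc_in = Input(shape=(None, 300))
_, state_h, state_c = LSTM(256, return_state=True)(enc_in)

# Decoder: generate the target sequence, initialized with the encoder's state
dec_in = Input(shape=(None, 300))
dec_out = LSTM(256, return_sequences=True)(dec_in, initial_state=[state_h, state_c])
out = Dense(10000, activation='softmax')(dec_out)  # toy target vocabulary of 10k words

seq2seq = Model([enc_in, dec_in], out)

The single fixed-size state is exactly the bottleneck that attention, covered next, was designed to relieve.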
# Visualization of Results obtained from various Deep learning models
results = pd.DataFrame(scores_model).sort_values(by='AUC_Score', ascending=False)
results.style.background_gradient(cmap='Blues')
| | Model | AUC_Score |
| --- | --- | --- |
| 2 | GRU | 0.971655 |
| 3 | Bi-directional LSTM | 0.966693 |
| 1 | LSTM | 0.959824 |
| 0 | SimpleRNN | 0.694971 |
fig = go.Figure(go.Funnelarea(
    text=results.Model,
    values=results.AUC_Score,
    title={"position": "top center", "text": "Funnel-Chart of Model AUC Scores"}
))
fig.show()
Attention Models
This part can be quite challenging, but mastering the intuition and functionality of the attention block is crucial. Once you grasp this concept, understanding transformers and transformer-based architectures like BERT will become much easier.
I highly recommend dedicating sufficient time to understand this section thoroughly. Follow these resources in the provided order to avoid confusion, and try to create your own representation of an attention block:
- Attention Model Intuition – Coursera Lecture (Watch this video only, not the next one)
- Sequence-2-Sequence Model with Attention Mechanism
- Attention and Its Different Forms
- Augmented RNNs – Distill.pub
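Once you have been through those resources, it helps to see how little code the core idea needs; a schematic NumPy sketch of scaled dot-product attention:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity of each query with every key, scaled for numerical stability
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over the keys -> attention weights that sum to 1 per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the values
    return weights @ V

# Toy shapes: 4 queries attending over 6 key/value pairs of dimension 64
Q, K, V = np.random.rand(4, 64), np.random.rand(6, 64), np.random.rand(6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)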
Code Implementation
- Comprehensive Guide: Attention Mechanism in Deep Learning (Basic Level)
- Seq2Seq Translation Tutorial in PyTorch – Implementation from Scratch in Pytorch
Transformers: Attention Is All You Need
We’ve finally reached a crucial point in our learning journey, where we’ll dive into the technology that revolutionized NLP and paved the way for state-of-the-art techniques: Transformers. These were introduced in the groundbreaking paper “Attention Is All You Need” by researchers at Google. If you’ve grasped the concepts of attention models, understanding transformers will be a breeze. Here’s a comprehensive guide to Transformers:
Understanding Transformers:
- Illustrated Transformer: This resource offers a detailed explanation of transformers with illustrative examples.
Code Implementation:
- Attention – Harvard NLP: Provides code implementation of the transformer architecture introduced in the Google paper.
BERT and Its Implementation:
Now, let’s delve into BERT, another groundbreaking architecture. Follow these resources in order:
- Illustrated BERT: Gain an in-depth understanding of BERT with clear explanations and illustrations.
After exploring the above resource, you’ll have a solid understanding of how transformer architectures like BERT are leveraged by state-of-the-art models. These architectures can be used in two ways:
Using Pre-trained BERT without Tuning:
- A Visual Guide to Using BERT for the First Time: Explains how to use pre-trained BERT for predictions without fine-tuning.
Tuning BERT for Your Task:
- Tuning BERT for Your Task – YouTube: Learn how to fine-tune BERT for your specific task.
For our implementation, we’ll use the first example as a foundation, leveraging the Hugging Face and Keras libraries. However, unlike the first example, we’ll fine-tune our model for our task.
Acknowledgements: We’ll be following the footsteps of this Kaggle kernel by xhlulu, which provides valuable insights and guidance.
Steps Involved:
- Data Preparation: Tokenization and encoding of data.
- Configuring TPUs.
- Building a Function for Model Training and adding an output layer for classification.
- Training the model and analyzing the results.
# Loading Dependencies
import os
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from kaggle_datasets import KaggleDatasets
import transformers

from tokenizers import BertWordPieceTokenizer
# LOADING THE DATA
train1 = pd.read_csv("/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")
valid = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')
Encoder for the data. To understand what encode_batch does, read the Hugging Face tokenizer documentation here: https://huggingface.co/transformers/main_classes/tokenizer.html
def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encoder for encoding the text into sequence of integers for BERT Input
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []

    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)
# IMP DATA FOR CONFIG
AUTO = tf.data.experimental.AUTOTUNE

# Configuration
EPOCHS = 3
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192
Tokenization
For the details of these calls, please refer to the Hugging Face documentation linked above.
# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer
Downloading: 100%
996k/996k [00:00<00:00, 4.84MB/s]
Tokenizer(vocabulary_size=119547, model=BertWordPiece, add_special_tokens=True, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True, strip_accents=True, lowercase=False, wordpieces_prefix=##)
x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN)

y_train = train1.toxic.values
y_valid = valid.toxic.values
100%|██████████| 874/874 [00:35<00:00, 24.35it/s]
100%|██████████| 32/32 [00:01<00:00, 20.87it/s]
100%|██████████| 250/250 [00:11<00:00, 22.06it/s]
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)
def build_model(transformer, max_len=512):
    """
    Builds the classification model on top of a pretrained transformer
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    # Use the hidden state of the first token ([CLS]) as a summary of the sequence
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)

    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

    return model
Starting Training
If you want to use another model, just replace TFDistilBertModel (and the matching pretrained-weights name) in the transformers call below and adjust accordingly.
%%time
with strategy.scope():
    transformer_layer = (
        transformers.TFDistilBertModel
        .from_pretrained('distilbert-base-multilingual-cased')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
Downloading: 100%
618/618 [00:00<00:00, 1.11kB/s]
Downloading: 100%
911M/911M [00:25<00:00, 36.0MB/s]
Model: "model" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_word_ids (InputLayer) [(None, 192)] 0 _________________________________________________________________ tf_distil_bert_model (TFDist ((None, 192, 768),) 134734080 _________________________________________________________________ tf_op_layer_strided_slice (T [(None, 768)] 0 _________________________________________________________________ dense (Dense) (None, 1) 769 ================================================================= Total params: 134,734,849 Trainable params: 134,734,849 Non-trainable params: 0 _________________________________________________________________ CPU times: user 34.4 s, sys: 13.3 s, total: 47.7 s Wall time: 50.8 s
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)
Train for 1746 steps, validate for 63 steps
Epoch 1/3
1746/1746 [==============================] - 255s 146ms/step - loss: 0.1221 - accuracy: 0.9517 - val_loss: 0.4484 - val_accuracy: 0.8479
Epoch 2/3
1746/1746 [==============================] - 198s 114ms/step - loss: 0.0908 - accuracy: 0.9634 - val_loss: 0.4769 - val_accuracy: 0.8491
Epoch 3/3
1746/1746 [==============================] - 198s 113ms/step - loss: 0.0775 - accuracy: 0.9680 - val_loss: 0.5522 - val_accuracy: 0.8500
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(
    valid_dataset.repeat(),
    steps_per_epoch=n_steps,
    epochs=EPOCHS*2
)
Train for 62 steps
Epoch 1/6
62/62 [==============================] - 18s 291ms/step - loss: 0.3244 - accuracy: 0.8613
Epoch 2/6
62/62 [==============================] - 25s 401ms/step - loss: 0.2354 - accuracy: 0.8955
Epoch 3/6
62/62 [==============================] - 7s 110ms/step - loss: 0.1718 - accuracy: 0.9252
Epoch 4/6
62/62 [==============================] - 7s 111ms/step - loss: 0.1210 - accuracy: 0.9492
Epoch 5/6
62/62 [==============================] - 7s 114ms/step - loss: 0.0798 - accuracy: 0.9686
Epoch 6/6
62/62 [==============================] - 7s 110ms/step - loss: 0.0765 - accuracy: 0.9696
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)
499/499 [==============================] - 41s 82ms/step
What are transformers in deep learning?
Transformers are a unique type of deep learning model that is primarily used for comprehending and generating language. They excel in understanding the meaning of words within a sentence and effectively combining them to convey valuable information.
Unlike older models, transformers do not analyze words individually. Instead, they consider the entire sentence as a whole, determining the significance of each word and how they interrelate. This approach enables them to operate swiftly and comprehend intricate language structures.
Transformers have gained significant prominence in recent years due to their ability to simplify and enhance various tasks, such as language translation and text summarization. They are akin to the new superheroes of language comprehension!
Why are transformers better than CNNs?
Transformers are preferred over CNNs in tasks such as language processing due to their ability to capture long-range dependencies in sequential data more effectively.
While CNNs are great at recognizing local patterns, transformers’ attention mechanism enables them to grasp a wider context, making them more suitable for tasks related to language comprehension and generation.
Nevertheless, the decision between transformers and CNNs varies depending on the particular task and dataset being used.
What are transformers in NLP?
Transformers were first introduced in the paper “Attention is All You Need” by Vaswani et al. and have completely transformed the field of natural language processing (NLP).
By replacing traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), Transformers utilize an attention mechanism to process input data simultaneously, enabling them to better capture long-range dependencies.
These models have now become the foundation of numerous cutting-edge NLP models such as BERT, GPT, and T5.
What is the transformer in GPT?
GPT (Generative Pre-trained Transformer) utilizes the transformer architecture for various natural language processing tasks.
It specifically uses a variant called the decoder-only transformer. The model is trained on extensive text data and then fine-tuned for specific tasks like language modeling, text generation, and question answering, predicting the next word in a sequence based on the context.
Conclusion
I decided to share my learning journey with the community to help others benefit too. The encouragement and generosity I’ve experienced here inspired me to give back.
I’ve shared all the resources I utilized to grasp these ideas, aiming to make NLP competitions more accessible for all.
It took me 10 days to learn everything, but feel free to learn at your own speed. Don’t be disheartened by the intricacy of the methods. By the end of this journey, you’ll feel proud and it will all be worth it.