To be(rt) or not to be(rt)

Valerie Lim · Published in Analytics Vidhya · May 26, 2020


A beginner-friendly walkthrough of the current SOTA NLP model


There are a variety of NLP techniques, from simple bag-of-words approaches (e.g. TF-IDF), to static word embeddings (e.g. word2vec, GloVe), to contextualized (i.e. dynamic) word embeddings. Static word embeddings generate the same embedding for a word regardless of context. For instance, the two sentences “this movie is not good at all” and “this offer is too good to be true” both contain the word ‘good’, but it carries a different meaning in each sentence; static word embeddings would still return the same embedding for ‘good’. The current state-of-the-art NLP technique, BERT (Bidirectional Encoder Representations from Transformers), instead uses transformers, which introduced an attention mechanism that processes the entire text input simultaneously to learn contextual relations between words (or sub-words). As such, BERT returns different embeddings for the word ‘good’ in the two sentences. At the same time, unlike static word embeddings, which only output vectors for downstream tasks, BERT gives you both the trained model and the vectors. This means we can fine-tune a pre-trained transformer model for our own use cases.
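As a quick sanity check, here is a minimal sketch (assuming the Hugging Face `transformers` and `torch` packages) showing that the contextual embedding of ‘good’ differs between the two sentences above:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["this movie is not good at all", "this offer is too good to be true"]
good_id = tokenizer.convert_tokens_to_ids("good")

embeddings = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]           # (seq_len, 768)
    position = inputs["input_ids"][0].tolist().index(good_id)   # locate 'good' in this sentence
    embeddings.append(hidden[position])

# The two vectors for 'good' are not identical: the embedding depends on context.
cos = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity between the two 'good' embeddings: {cos.item():.3f}")
```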

There are different versions of the same model from TF Hub, google-research, Hugging Face, etc. I will be using Hugging Face since it makes it very easy to switch between different models (e.g. BERT, DistilBERT, RoBERTa, XLNet; I cover the differences between these variants in section ii of this article) and the models are compatible with both PyTorch and Keras.
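For example, switching between architectures is essentially a one-line change of the checkpoint name (a sketch using standard Hugging Face Hub identifiers):

```python
from transformers import AutoTokenizer, AutoModel

# Swap the checkpoint name to swap the architecture; tokenizer and model stay in sync.
for checkpoint in ["bert-base-uncased", "distilbert-base-uncased", "roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {n_params / 1e6:.0f}M parameters")
```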

This tutorial is based on the Tweet Sentiment Extraction dataset from Kaggle. Each training example has four variables: an id, the `full text`, the `sentiment label`, and the `selected text` that best represents the sentiment label. Given a new input with only the full text and sentiment label, the model extracts the support phrase from the full text that best represents the corresponding sentiment. Sentiment analysis is prevalent in many aspects of NLP, such as understanding the tone of users’ messages to a chatbot so that it can respond appropriately to the user’s mood, and capturing the support phrases that led to the sentiment label helps in downstream analysis of services and breakdowns.
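A quick peek at the training data (a sketch; the column names in the comment are what I assume the competition’s train.csv uses, so check them against your download):

```python
import pandas as pd

# Assumes train.csv has been downloaded from the Kaggle competition page.
train = pd.read_csv("train.csv")
print(train.columns.tolist())   # e.g. ['textID', 'text', 'selected_text', 'sentiment']
print(train.head())
```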

i) A general pipeline for a transformer model

  1. Text preprocessing
  2. Model Training
  3. Fine-tuning
  4. Inference

1. Text preprocessing

Let’s preprocess our data so that it matches the data DistilBERT was trained on.
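Concretely, that means tokenizing with DistilBERT’s own tokenizer so the text is split into the sub-words, special tokens, and padding the model expects. A minimal sketch (passing the sentiment label as a second segment is my own design choice here, not the only option):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize one training example: full text plus its sentiment label.
encoded = tokenizer(
    "this offer is too good to be true",   # full text
    "positive",                            # sentiment label
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(encoded["input_ids"].shape, encoded["attention_mask"].shape)  # (1, 128) each
```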

2. Model Training

For every token in the input sentence, DistilBERT outputs a 768-dimensional vector, i.e. an embedding (I use the default hidden layer size, but this is configurable). The first output from DistilBERT is the set of embeddings for all input tokens. It represents DistilBERT’s ‘understanding’ of each token’s meaning in the context of the sentence, and serves as the input for the fine-tuning step.
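A sketch of this step, using the standard distilbert-base-uncased checkpoint from the Hugging Face Hub:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

# Encode one (text, sentiment) pair as in the preprocessing step.
inputs = tokenizer("this offer is too good to be true", "positive",
                   padding="max_length", truncation=True, max_length=128,
                   return_tensors="pt")

with torch.no_grad():
    token_embeddings = distilbert(**inputs).last_hidden_state

print(token_embeddings.shape)   # (1, 128, 768): one 768-d vector per token
```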

3. Fine-tuning

The embeddings serve as inputs for a second model, which I build to adapt DistilBERT to my use case. To extract the support phrase, this model predicts the start and end indices of its tokens (which are spatial, i.e. they encode where the words are located) based on the full text and sentiment label. Hence, I use a 1D convolution layer to preserve this spatial information. After flattening and applying a softmax, the outputs are distributions over the start and end token indices, trained against the one-hot encodings of the true indices.
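A minimal sketch of such a head in PyTorch (the exact layer sizes and kernel width are my assumptions; the idea is a 1D convolution over the token embeddings followed by a softmax over positions):

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Predicts start/end token positions of the support phrase from DistilBERT embeddings."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.start_conv = nn.Conv1d(hidden_size, 1, kernel_size=3, padding=1)
        self.end_conv = nn.Conv1d(hidden_size, 1, kernel_size=3, padding=1)

    def forward(self, token_embeddings):               # (batch, seq_len, hidden)
        x = token_embeddings.transpose(1, 2)           # Conv1d wants (batch, hidden, seq_len)
        start_logits = self.start_conv(x).squeeze(1)   # (batch, seq_len)
        end_logits = self.end_conv(x).squeeze(1)       # (batch, seq_len)
        # Softmax over positions: a distribution over candidate start/end tokens,
        # trained against the one-hot encoded true indices (e.g. with cross-entropy).
        return start_logits.softmax(dim=-1), end_logits.softmax(dim=-1)

head = SpanHead()
start_probs, end_probs = head(torch.randn(2, 128, 768))  # dummy DistilBERT embeddings
print(start_probs.shape, end_probs.shape)                # torch.Size([2, 128]) each
```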

4. Inference

I use the fine-tuned model to predict the location of the support phrase from the full text and sentiment label, then extract the phrase by decoding the tokens between the predicted start and end indices.
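A sketch of the inference step, putting the pieces above together (`extract_support_phrase` is a hypothetical helper name):

```python
import torch

def extract_support_phrase(tokenizer, distilbert, head, text, sentiment):
    """Return the predicted support phrase for one (text, sentiment) pair."""
    inputs = tokenizer(text, sentiment, truncation=True, max_length=128,
                       return_tensors="pt")
    with torch.no_grad():
        embeddings = distilbert(**inputs).last_hidden_state
        start_probs, end_probs = head(embeddings)
    start = int(start_probs.argmax(dim=-1))
    end = int(end_probs.argmax(dim=-1))
    if end < start:                                    # guard against an inverted prediction
        start, end = end, start
    token_ids = inputs["input_ids"][0][start:end + 1]
    return tokenizer.decode(token_ids, skip_special_tokens=True)

# Usage, with the tokenizer, DistilBERT model, and head from the sketches above:
# print(extract_support_phrase(tokenizer, distilbert, head,
#                              "this movie is not good at all", "negative"))
```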

An illustration of how a prediction is made

ii) Comparison of BERT models

1. BERT

  • Given a masked token, BERT uses both sides of the context (before and after the target token) to predict the masked token, and learns its text representation this way.

Limitations

  • The [MASK] token exists during the training phase but does not exist at prediction time, creating a mismatch between pre-training and inference.
  • Cannot handle long text sequences. By default, BERT supports up to 512 tokens; common workarounds are to ignore the text after 512 tokens, or to split the tokens into two or more inputs and predict on each separately (see the sketch below).
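Both workarounds are available through the Hugging Face tokenizer (a sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "word " * 2000   # far beyond the 512-token limit

# Workaround 1: ignore everything after 512 tokens.
truncated = tokenizer(long_text, truncation=True, max_length=512)

# Workaround 2: split into overlapping 512-token chunks and predict on each separately.
chunks = tokenizer(long_text, truncation=True, max_length=512, stride=64,
                   return_overflowing_tokens=True)

print(len(truncated["input_ids"]))   # 512
print(len(chunks["input_ids"]))      # number of chunks
```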

2. DistilBERT

  • Uses a technique called (knowledge) distillation, which approximates Google’s BERT, i.e. the large neural network, with a smaller one. The idea is that once a large neural network has been trained, its full output distributions can be approximated by a smaller network.

3. RoBERTa

  • To improve the training procedure, RoBERTa removes the Next Sentence Prediction (NSP) task from BERT’s pre-training and introduces dynamic masking so that the masked token changes during the training epochs.

4. XLNet

  • Overcomes the unidirectional way of encoding text by using Permutation Language Modeling (PLM), in which all tokens are predicted, but in random order. This is in contrast to BERT’s masked language model, where only the masked (15% of) tokens are predicted, and to traditional language models, where all tokens are predicted in sequential order rather than random order. This helps the model learn bidirectional relationships and therefore better handle dependencies and relations between words.
  • PLM does not change the original sequence order; the permutation is applied only through the attention masks.
  • Handles long text sequences by reusing hidden states across segments (an idea from Transformer-XL).

I hope this article gave you an intuitive understanding of how you can use BERT for a wide range of NLP applications, such as text classification, question answering, and text summarization. Head over to Hugging Face’s docs for more detail.

I am currently on the lookout for Data Scientist opportunities, to join a team that allows me to apply my data science knowledge to real-life business challenges and deliver business value. If you found this article useful, hit the applause button below; it would mean a lot to me and it would help others see the story. If your Data Science team is expanding, feel free to reach out to me via LinkedIn. I’m happy to share more about my profile :)

Check out my write-ups for my other Data Science projects here too.

Take care and stay safe :)

