
NLP Text Preprocessing in Python: 3 Methods

Ways to improve the NLU classification task in chatbot development using the Rasa conversational AI framework and Python.

To get a well-performing chatbot with accurate intent classification and question-answering, you usually need a massive amount of training data. Generally speaking, that data is collected from human inputs (dialogues and chats with the bot) and used to retrain the underlying model.

At the beginning of the chatbot development process, however, you may face a lack of training data, which results in low intent classification accuracy. A few workarounds exist for this problem:

  1. Pre-trained model
  2. Training data generator
  3. Crowdsourcing

These three methods can greatly improve the NLU (Natural Language Understanding) training process in your chatbot development project and aid text preprocessing. Below we demonstrate how each of them can increase intent detection accuracy.

For this demo, we use Rasa, an open-source conversational AI framework. You may need to install the framework along with a few dependencies. The source code for this demo is available in the repository cloned below.

!git clone https://github.com/interds/3-methods-of-nlu-data-pre-processing.git
%cd ./3-methods-of-nlu-data-pre-processing
!apt-get install python3-venv
!python -m venv --system-site-packages ./venv
!source ./venv/bin/activate
!pip install rasa[transformers]
!pip install -U ipython # fix create_prompt_application
!pip install pandas
!pip install chatette
!pip install transformers
!pip install tensorflow_datasets

Initial model

Rasa's boilerplate generated by 'rasa init' is enough to demonstrate the initial model in our chatbot development effort.
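
For reference, the affirm intent in the boilerplate training data (data/nlu.md) contains only a handful of examples, roughly like this (the exact set may vary between Rasa versions):

## intent:affirm
- yes
- indeed
- of course
- that sounds good
- correct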

We train and evaluate the model with the following config:

language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
!rasa train -c config-simple.yml --fixed-model-name simple --quiet

Training Core model...
2020-06-22 20:46:54.811928: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Core model training completed.
Training NLU model...
/usr/local/lib/python3.6/dist-packages/rasa/utils/common.py:363: UserWarning: You specified 'DIET' to train entities, but no entities are present in the training data. Skip training of entities.
NLU model training completed.
Your Rasa model is trained and saved at '/content/models/simple.tar.gz'.
!rasa test nlu -c config-simple.yml -u test_data.md -m models/simple.tar.gz --out results/simple --quiet

import pandas as pd

report = pd.read_json("results/simple/intent_report.json", orient="values")
simple_f1 = report["weighted avg"]["f1-score"]
data = [["simple", simple_f1]]
pd.DataFrame(data, columns=["Model", "F1-score"])

2020-06-22 21:00:29.891699: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
100% 14/14 [00:00<00:00, 108.22it/s]
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

    Model  F1-score
0  simple  0.614286

Expected F1-score = 0.752381

The test data contains examples that are lexically different from those in the training data, so it is expected that our simple pipeline fails to recognize them properly:

## intent:affirm
- alright
- sure
- ok

1. Pre-trained model

A pre-trained language model can be used for NLU tasks without any task-specific changes to the model architecture. Pre-trained models can also continue pre-training on custom data, starting from an existing checkpoint.

Pre-training from scratch is compute-intensive and requires massively parallel compute infrastructure. Training such general models further on task-specific data, called fine-tuning, is far less computationally demanding and is used more often. Well-known examples of such models are Google's BERT, XLNet, and OpenAI's GPT-2.

Let's see how a pre-trained model can help with intent classification. We use the following config:

language: en
pipeline:
  - name: HFTransformersNLP
    model_weights: "bert-base-uncased"
    model_name: "bert"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 100
!rasa train -c config-bert.yml --fixed-model-name bert --quiet
Core stories/configuration did not change. No need to retrain Core model.
Training NLU model...
Downloading: 100% 232k/232k [00:00<00:00, 1.93MB/s]
Downloading: 100% 433/433 [00:00<00:00, 299kB/s]
Downloading: 100% 536M/536M [00:08<00:00, 63.4MB/s]
2020-06-22 20:48:18.155538: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
/usr/local/lib/python3.6/dist-packages/rasa/utils/common.py:363: UserWarning: You specified 'DIET' to train entities, but no entities are present in the training data. Skip training of entities.
NLU model training completed.
Your Rasa model is trained and saved at '/content/models/bert.tar.gz'.
!rasa test nlu -c config-bert.yml -u test_data.md -m models/bert.tar.gz --out results/bert --quiet

report = pd.read_json("results/bert/intent_report.json", orient="values")
bert_f1 = report["weighted avg"]["f1-score"]
data = [["simple", simple_f1], ["bert", bert_f1]]
pd.DataFrame(data, columns=["Model", "F1-score"])

2020-06-22 20:49:03.856455: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
100% 14/14 [00:03<00:00,  4.04it/s]

   Model    F1-score
0  simple   0.614286
1  bert     0.930612

Expected F1-score = 0.930612

As we can see, using the pre-trained BERT model improves intent detection accuracy without any modification of the training data. This happens because the model already has knowledge of word synonyms, which helps it recognize matches.

Fine-tuning your AI chatbot

To fine-tune the model for your chatbot, follow the instructions on sentence (and sentence-pair) classification tasks from Google's BERT repository. In general, you need to download a text corpus or convert your own text data to BERT's input format, then run the fine-tuning command. You can prepare a local copy of the base model as a starting point with the following script:

from transformers import TFBertModel, BertTokenizer

# Download the pre-trained BERT weights and save them locally as a starting point
model = TFBertModel.from_pretrained("bert-base-uncased")
model.save_pretrained("./model-fine-tuned-1/")

# Save the matching tokenizer (vocabulary) to the same directory
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./model-fine-tuned-1/")

Follow the text preprocessing steps for fine-tuning. An example of fine-tuning the BERT model on the MRPC classification task is given below:

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue
 
python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

When ready, the model from the resulting folder can be used in your pipeline, and it should yield a higher F1-score than the original one.
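
For example, the BERT pipeline shown earlier could point to the saved directory instead of the published weights. This is only a minimal sketch; whether a local path works for model_weights depends on your Rasa and transformers versions, so treat it as an assumption to verify:

language: en
pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    # assumption: a local directory with fine-tuned weights can be passed here
    model_weights: "./model-fine-tuned-1/"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 100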

Here is another fine-tuning example, adapted from the transformers library documentation:

import tensorflow as tf
import tensorflow_datasets
from transformers import BertTokenizer, TFBertForSequenceClassification, glue_convert_examples_to_features

# Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
data = tensorflow_datasets.load('glue/mrpc')

# Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)

# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)

# Save the fine-tuned model for later use
model.save_pretrained('./model-fine-tuned-2/')
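
If you want to sanity-check the saved checkpoint before wiring it into your pipeline, it can be loaded back with the same transformers API (a minimal sketch):

from transformers import TFBertForSequenceClassification

# Reload the fine-tuned weights from the local directory saved above
reloaded = TFBertForSequenceClassification.from_pretrained('./model-fine-tuned-2/')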

2. Training data generator

To quickly get more training data for the NLU task, you can use a training data generator: a program that takes templates and produces many examples for model training. Well-known generators are Chatito (written in Node.js) and Chatette (written in Python). Since our stack is Python-based, we use Chatette below.

Instead of writing training examples directly, you write one or several template files in a specific format, then run the generator, which parses the templates and outputs ready-to-use examples. This is one of the essential preprocessing steps in text mining. To train a powerful model, you can use the generator not just for intent detection but also, and even more so, for named entity recognition.

Let's generate samples for our affirm intent with Chatette. Write a template file "affirm.chatette" with the following content:

```
~[yes]
    yes
    indeed
    of course
    that sounds good
    correct
    got it
    alright
    sure
    ok

~[thanks]
    thanks
    thank you

%[affirm](training:100)
    ~[yes][,?] ~[thanks?]
```

Next, generate an NLU data file:

!python -m chatette -f -s YZkEus -a rasa-md -o results data-generate/affirm.chatette
!cp results/train/output.md data-generate/nlu/affirm.md

Executing Chatette with seed 'YZkEus'.
[DBG] Parsing file: /content/data-generate/affirm.chatette
[DBG] Generating training examples...
[DBG] Generating testing examples...
[DBG] Generation over
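
The generated file copied to data-generate/nlu/affirm.md contains Rasa Markdown training examples that look roughly like this (the exact combinations depend on the seed):

## intent:affirm
- got it
- of course, thanks
- alright, thank you
- that sounds good
- sure, thanks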

Next, we train and evaluate our simple model with the newly generated data.

!rasa train -c config-simple.yml --data data-generate --fixed-model-name generated --quiet

Core stories/configuration did not change. No need to retrain Core model.
Training NLU model...
2020-06-22 20:49:26.699066: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
/usr/local/lib/python3.6/dist-packages/rasa/utils/common.py:363: UserWarning: You specified 'DIET' to train entities, but no entities are present in the training data. Skip training of entities.
NLU model training completed.
Your Rasa model is trained and saved at '/content/models/generated.tar.gz'.
!rasa test nlu -c config-simple.yml -u test_data.md -m models/generated.tar.gz --out results/generated --quiet

report = pd.read_json("results/generated/intent_report.json", orient="values")
generated_f1 = report["weighted avg"]["f1-score"]
data = [["simple", simple_f1], ["generated", generated_f1]]
pd.DataFrame(data, columns=["Model", "F1-score"])

2020-06-22 21:06:57.539434: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
100% 14/14 [00:00<00:00, 110.23it/s]

     Model      F1-score
0    simple     0.614286
1    generated  1.000000

Expected F1-score = 1.000000

As we can see, in this case we improve accuracy by generating examples, without modifying the model config itself. Now imagine that, with 3 aliases of 10 examples each, like

%[example]
    ~[alias1] ~[alias2] ~[alias3]

you can generate 10 × 10 × 10 = 1,000 training samples!

With many intents used to build your chatbot, the number of intent classification errors grows. To fix this, you need more examples, especially for lexically close intents, and the training data generator is an easy way to provide them.

3. Crowdsourcing

To get more high-quality training data, you can outsource jobs to distributed workers using platforms such as Amazon Mechanical Turk, Microworkers, or Clickworker.

MTurk is a crowdsourcing marketplace for simple and repetitive tasks that need to be handled manually. In a short time and at a reduced cost, you can get human-written training data for your initial model, augmenting your training data collection and ultimately accelerating chatbot development.
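
As an illustration, a paraphrasing task can be posted programmatically with boto3's MTurk client. This is only a hedged sketch: the task layout ID, phrase parameter, reward, and sandbox endpoint are placeholders you would adapt to your own project, and it assumes AWS credentials are already configured.

import boto3

# Sandbox endpoint for testing; switch to the production endpoint when ready
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# HITLayoutId refers to a task layout created in the MTurk requester UI
# (hypothetical placeholder), e.g. a form asking workers to paraphrase a phrase.
response = mturk.create_hit(
    Title="Paraphrase a short chatbot phrase",
    Description="Write three natural paraphrases of the given phrase",
    Reward="0.05",                     # reward per assignment, in USD, as a string
    MaxAssignments=20,                 # number of workers who will complete the task
    LifetimeInSeconds=24 * 60 * 60,    # how long the task stays available
    AssignmentDurationInSeconds=300,   # time allotted to a single worker
    HITLayoutId="REPLACE_WITH_YOUR_LAYOUT_ID",
    HITLayoutParameters=[{"Name": "phrase", "Value": "yes, that sounds good"}],
)
print("Created HIT:", response["HIT"]["HITId"])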

Key takeaways

In this article, we have considered three ways to improve the intent detection accuracy of an NLU model.

The first one, the pre-trained model, is the most computation-heavy, yet it allows us to train with fewer examples without losing accuracy.

The second one, the training data generator, is the simplest method, though it still requires writing templates in a specific format. It reduces the amount of manual text input and increases the number of samples available for training the model.

The third one, crowdsourcing, gives the most accurate output, written by humans, but typically involves extra cost to put in place.
