IT Strategy

How to Build a Question Answering System Using Deep Learning

In this article, we’ll learn two basic methods of implementing AI-powered Question Answering (QA) systems using Python and NLP BERT models.

In this article, we'll learn two basic methods of implementing Question Answering systems using Python development.

Question Answering (QA) system aims at satisfying users who are looking to answer a specific question in natural language. It works like search engines but with different result representations: a search engine returns a list of links to answering resources, while a QA system gives a direct answer to a question.

The information-retrieval process in QA systems is divided into three stages: question processing, ranking, and answer extraction. Question processing and ranking can be performed using algorithmic functions or machine learning.

Check out a related article:

QA systems with the approximate match function are as simple as the following.

Question answering system approximate

More complex QA systems use NLP techniques to understand what people talk and write about. NLP, or Natural Language Processing, is the ability of a computer program to understand human language as it is spoken or written.

Basic QA system pipeline

The pipeline of a basic QA system with a pre-trained NLP model includes two stages - preparation of data and processing as follows below:

Question Answering system pretrained NLP models

Prerequisites

To run these examples, you need Python 3. Also, install Jupyter Lab and a few Python modules.

pip install jupyterlab
pip install python-Levenshtein
pip install bert-serving-server bert-serving-client

Data

For demo purposes, we use a small set of question-answer pairs. To build a high-quality QA system, you should use many question samples, specialized data storage, or a database for fast lookup, and, as you'll see at the end, it is a good idea to master training in NLP models.

In our examples, we will use the knowledge base without any modification, but you are free to insert additional question samples to improve answering quality. Let's load our data:

import pandas as pd
data = pd.read_csv('qa.csv')

# this function is used to get printable results
def getResults(questions, fn):
    def getResult(q):
        answer, score, prediction = fn(q)
        return> [q, prediction, answer, score]

        return pd.DataFrame(list(map(getResult, questions)), columns=["Q", "Prediction", "A", "Score"])

test_data = [
    "What is the population of Egypt?",
    "What is the poulation of egypt",
    "How long is a leopard's tail?",
    "Do you know the length of leopard's tail?",
    "When polar bears can be invisible?",
    "Can I see arctic animals?",
    "some city in Finland"
]

data
#QuestionAnswer
0Who determined the dependence of the boiling o...Anders Celsius
1Are beetles insects?Yes
2Are Canada 's two official languages English a...yes
3What is the population of Egypt?more than 78 million
4What is the biggest city in Finland?Greater Helsinki
5What is the national currency of Liechtenstein?Swiss franc
6Can polar bears be seen under infrared photogr...Polar bears are nearly invisible under infrare...
7When did Tesla demonstrate wireless communicat...1893
8What are violins made of?different types of wood
9How long is a leopard's tail?60 to 110cm

The most straightforward QnA system using Python

Below is an example of a very naive QA system where a user's query needs to be equal to or part of some question.

imporе re

def getNaiveAnswer(q):
    # regex helps to pass some punctuation signs
    row = data.loc[data['Question'].str.contains(re.sub(r"[^\w'\s)]+", "", q),case=False)]
    if len(row) > 0:
        return row["Answer"].values[0], 1, row["Question"].values[0]
         return "Sorry, I didn't get you.", 0, ""

getResults(test_data, getNaiveAnswer)
#QPredictionAScore
0What is the population of Egypt?What is the population of Egypt?more than 78 million1
1What is the poulation of egyptSorry, I didn't get you.0
2How long is a leopard's tail?How long is a leopard's tail?60 to 110cm1
3Do you know the length of leopard's tail?Sorry, I didn't get you.0
4When polar bears can be invisible?Sorry, I didn't get you.0
5Can I see arctic animals?Sorry, I didn't get you.0
6some city in FinlandSorry, I didn't get you.0

This system has a notable drawback to not find a match if there some grammar mistakes. Even if we use some string pre-processing of source and query texts, like punctuation symbols removal, lowercasing, etc., the result has poor quality. This way of question matching is very inefficient. Let's improve it to become error-prone with approximate string matching.

Approximating QA system

Let's use approximate string matching to make our system admit grammar mistakes and some text differences. In computer science, many methods to do approximate string matching exist. For our demo purposes, we use one of the implementations of fuzzy string searching called Levenshtein distance. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.

Let's implement our system with the Levenshtein Python module. It contains a set of approximate string matching functions; you can try any other Python modules of your choice.

from
Levenshtein import ratio
def getApproximateAnswer(q):
    max_score = 0
    answer = ""
    prediction = ""
        for idx, row in data.iterrows():
        score = ratio(row["Question"], q)
        if score >= 0.9:  # I'm sure, stop here 
            return  row["Answer"], score, row["Question"]
             elif score > max_score: # I'm unsure, continue
            max_score = score
            answer = row["Answer"]
            prediction = row["Question"]
    if max_score > 0.8:
        return answer, max_score, prediction
         return "Sorry, I didn't get you.", max_score, prediction
getResults(test_data, getApproximateAnswer)
#QPredictionAScore
0What is the population of Egypt?What is the population of Egypt?more than 78 million1.000000
1What is the poulation of egyptWhat is the population of Egypt?more than 78 million0.935484
2How long is a leopard's tail?How long is a leopard's tail?60 to 110cm1.000000
3Do you know the length of leopard's tail?How long is a leopard's tail?Sorry, I didn't get you.0.657143
4When polar bears can be invisible?Can polar bears be seen under infrared photogr...Sorry, I didn't get you.0.517647
5Can I see arctic animals?What is the biggest city in Finland?Sorry, I didn't get you.0.426230
6some city in FinlandWhat is the biggest city in Finland?Sorry, I didn't get you.0.642857

The second question with two grammar mistakes is answered, getting a score below 1.0 but acceptably high. For now, our system is better, it can do spell-checking, but they still have trouble with questions written in the native language. Let's try to adjust the max_ratio coefficient of our function to be more tolerant.

from Levenshtein import ratio

def getApproximateAnswer2(q):
    max_score = 0
    answer = ""
    prediction = ""
    for idx, row in data.iterrows():
        score = ratio(row["Question"], q)
        if score >= 0.9: # I'm sure, stop here
            return row["Answer"], score, row["Question"]
        elif score > max_score: # I'm unsure, continue
            max_score = score
            answer = row["Answer"]
            prediction = row["Question"]

    if max_score > 0.3: # treshold is lowered
        return answer, max_score, prediction
    return "Sorry, I didn't get you.", max_score, prediction

getResults(test_data, getApproximateAnswer2)
#QPredictionAScore
0What is the population of Egypt?What is the population of Egypt?more than 78 million1.000000
1What is the poulation of egyptWhat is the population of Egypt?more than 78 million0.935484
2How long is a leopard's tail?How long is a leopard's tail?60 to 110cm1.000000
3Do you know the length of leopard's tail?How long is a leopard's tail?60 to 110cm0.657143
4When polar bears can be invisible?Can polar bears be seen under infrared photogr...Polar bears are nearly invisible under infrare...0.517647
5Can I see arctic animals?What is the biggest city in Finland?Greater Helsinki0.426230
6some city in FinlandWhat is the biggest city in Finland?Greater Helsinki0.642857

Let's examine the results. For now, our system has more answers. We've got responses for questions written with different words. But look at result #5. It looks like a false positive, and the answers don't match our questions semantically. We need some balanced ratio depending on the data set, choosing between language understanding and correctness. This Python code is straightforward, but it is impractical on large volumes because of the overall iteration dataset.

For now, you have an idea about how to improve answering quality. Maybe you could do that by adjusting coefficients, inserting more question samples, using a set of functions simultaneously, splitting the sentence into words, and doing matching on word level too, and more.

But in fact, you shouldn't do that. There are advanced libraries developed by Google and Facebook, which already deal with the subject matter, and they are doing it pretty well. Let's go to the next level with NLP models.

NLP-powered / BERT QA system

We use BERT-as-service to implement our following QA function. BERT or Bidirectional Encoder Representations from Transformers is a new method of pre-training language representations developed by Google. Bert-as-service uses BERT as a sentence encoder, allowing you to map sentences into fixed-length representations in a few lines of Python code.

Installation

Install the server and client via pip (consult the documentation for details):

pip install bert-serving-server bert-serving-client

Download a Pre-trained BERT Model. We use BERT-Base Cased, but you can try another model that fits better. Download and unpack the archive.

Start service, pointing model_dir to the folder with your downloaded model. Also, you need to set the maximum question sentence length if the default value of 25 doesn't fit your texts:

bert-serving-start -model_dir /tmp/cased_L-12_H-768_A-12/ -num_worker=4 -max_seq_len=64

Ranking (or pre-processing)

Before we use the service, we need to encode our knowledge base to BERT format.

from bert_serving.client import BertClient
import numpy as np

def encode_questions():
    bc = BertClient()
    questions = data["Question"].values.tolist()
    print("Questions count", len(questions))
    print("Start to calculate encoder....")
    questions_encoder = bc.encode(questions)
    np.save("questions", questions_encoder)
    questions_encoder_len = np.sqrt(
        np.sum(questions_encoder * questions_encoder, axis=1)
    )
    np.save("questions_len", questions_encoder_len)
    print("Encoder ready")

encode_questions()
Questions count 10
Start to calculate encoder....
Encoder ready

Run

from bert_serving.client import BertClient
import numpy as np

class BertAnswer():
    def __init__(self):
        self.bc = BertClient()
        self.q_data = data["Question"].values.tolist()
        self.a_data = data["Answer"].values.tolist()
        self.questions_encoder = np.load("questions.npy")
        self.questions_encoder_len = np.load("questions_len.npy")

    def get(self, q):
        query_vector = self.bc.encode([q])[0]
        score = np.sum((query_vector * self.questions_encoder), axis=1) / (
            self.questions_encoder_len * (np.sum(query_vector * query_vector) ** 0.5)
        )
        top_id = np.argsort(score)[::-1][0]
        if float(score[top_id]) > 0.94:
            return self.a_data[top_id], score[top_id], self.q_data[top_id]
        return "Sorry, I didn't get you.", score[top_id], self.q_data[top_id]

bm = BertAnswer()

def getBertAnswer(q):
    return bm.get(q)

getResults(test_data, getBertAnswer)
#QPredictionAScore
0What is the population of Egypt?What is the population of Egypt?more than 78 million1.000000
1What is the poulation of egyptWhat is the population of Egypt?more than 78 million0.967848
2How long is a leopard's tail?How long is a leopard's tail?60 to 110cm1.000000
3Do you know the length of leopard's tail?How long is a leopard's tail?60 to 110cm0.970769
4When polar bears can be invisible?Can polar bears be seen under infrared photogr...Polar bears are nearly invisible under infrare...0.975287
5Can I see arctic animals?Can polar bears be seen under infrared photogr...Polar bears are nearly invisible under infrare...0.964607
6some city in FinlandWhat is the biggest city in Finland?Sorry, I didn't get you.0.932894

Our function correctly answered most questions. But we have unanswered question #6. We can play with the score threshold and add additional question samples to improve understanding in our case, but in general, we need a better way - to perform fine-tuning of the model.

Let's try the fine-tuned BERT model in the next step.

Fine-tune pre-trained BERT QA systems

Pre-trained BERT models often show pretty good results on many tasks. However, to release the true power of BERT, fine-tuning domain-specific data is necessary.

We follow the instruction in "Sentence (and sentence-pair) classification tasks." Clone the repository:

git clone https://github.com/google-research/bert.git

Download this script and run it to download "GLUE data":

python download_glue_data.py

Then run fine-tuning process:

export BERT_BASE_DIR=/tmp/cased_L-12_H-768_A-12
export GLUE_DIR=/tmp/glue_data

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/ \
  --do_lower_case=False

The fine-tuned model is stored at /tmp/mrpc_output/. Look inside it and find our fine-tuned model checkpoint, which is named like model.ckpt-343. Remember to use it as a parameter to BERT-server.

Now start BertServer by putting three pieces together:

bert-serving-start -model_dir /tmp/cased_L-12_H-768_A-12/ -num_worker=4 -max_seq_len=64 \
  -tuned_model_dir=/tmp/mrpc_output/ -ckpt_name=model.ckpt-343

After the server starts, you should find this line in the log:

I:GRAPHOPT:checkpoint (override by the fine-tuned model): /tmp/mrpc_output/model.ckpt-343
...
I:VENTILATOR:all set, ready to serve request!

Now let's repeat pre-processing and run steps:

from bert_serving.client import BertClient
import numpy as np

def encode_questions2():
    bc = BertClient()
    questions = data["Question"].values.tolist()
    print("Questions count", len(questions))
    print("Start to calculate encoder....")
    questions_encoder = bc.encode(questions)
    np.save("questions2", questions_encoder)
    questions_encoder_len = np.sqrt(
        np.sum(questions_encoder * questions_encoder, axis=1)
    )
    np.save("questions_len2", questions_encoder_len)
    print("Encoder ready")

encode_questions2()
Questions count 10 Start to calculate encoder.... Encoder ready 
from bert_serving.client import BertClient
import numpy as np

class TunedBertAnswer():
    def __init__(self):
        self.bc = BertClient()
        self.q_data = data["Question"].values.tolist()
        self.a_data = data["Answer"].values.tolist()
        self.questions_encoder = np.load("questions2.npy")
        self.questions_encoder_len = np.load("questions_len2.npy")

    def get(self, q):
        query_vector = self.bc.encode([q])[0]
        score = np.sum((query_vector * self.questions_encoder), axis=1) / (
            self.questions_encoder_len * (np.sum(query_vector * query_vector) ** 0.5)
        )
        top_id = np.argsort(score)[::-1][0]
        if float(score[top_id]) > 0.94:
            return self.a_data[top_id], score[top_id], self.q_data[top_id]
        return "Sorry, I didn't get you.", score[top_id], self.q_data[top_id]

bm2 = TunedBertAnswer()

def getTunedBertAnswer(q):
    return bm2.get(q)

getResults(test_data, getTunedBertAnswer)
QPredictionAScore
0What is the population of Egypt?What is the population of Egypt?more than 78 million1.000000
1What is the poulation of egyptWhat is the population of Egypt?more than 78 million0.968978
2How long is a leopard's tail?How long is a leopard's tail?60 to 110cm1.000000
3Do you know the length of leopard's tail?How long is a leopard's tail?60 to 110cm0.967964
4When polar bears can be invisible?Can polar bears be seen under infrared photogr...Polar bears are nearly invisible under infrare...0.960934
5Can I see arctic animals?Can polar bears be seen under infrared photogr...Polar bears are nearly invisible under infrare...0.964808
6some city in FinlandWhat is the biggest city in Finland?Greater Helsinki0.946184

We've got all the questions answered for now. Please note that this is just an example. On other questions bases, it is possible to get better or worse results, so we need to examine accessible technologies in each case.

Conclusion

The quantity and content of examples have a significant impact on the pre-trained model when the score is near the threshold. Try to change some test questions, and you'll get a different result; even some correct answers may go. So the pre-trained model can handle many input variants, but it doesn't solve all possible cases.

To make a sound QnA system, we need many question examples, trying to raise the prediction score to 1 for the possible input questions. Training on own domain-specific data instead of general may give a better prediction result too.

In the examples above, we demonstrated how the quality of a QA system is influenced by AI technology, from a single function to a pre-trained NLP model. We built a basic Question Answering system with natural language understanding in a few lines of Python code. Such systems can be used standalone to serve Frequently Asked Questions search, documentation search, etc.

QA system can be used to improve the quality of chatbots significantly. A large number of questions and answers can introduce some difficulty in training, but the QA system can serve this task quite well.

IT and talent solutions
Intersog is a leading IT solutions provider in the United States and other countries. The company has provided agile software engineering and staffing services for Fortune 500 corporations, tech startups, and SaaS businesses for over eighteen years. Why Intersog.