How to Build a Question Answering System: a Step-by-Step Tutorial

In this article, we'll learn two basic methods of implementing Question Answering systems using Python development.

A Question Answering (QA) system aims at satisfying users who are looking to answer a specific question in natural language. It works like search engines but with different result representations: a search engine returns a list of links to answering resources, while a QA system gives a direct answer to a question.

The information-retrieval process in QA systems is divided into three stages: question processing, ranking, and answer extraction. Question processing and ranking can be performed using algorithmic functions or machine learning.

Check out a related article:

QA systems with the approximate match function are as simple as the following.

More complex QA systems use NLP techniques to understand what people talk and write about. NLP, or Natural Language Processing, is the ability of a computer program to understand human language as it is spoken or written.

Basic QA system pipeline

The pipeline of a basic QA system with a pre-trained NLP model includes two stages - preparation of data and processing as follows below:

Question Answering system pretrained NLP models

Prerequisites

To run these examples, you need Python 3. Also, install Jupyter Lab and a few Python modules.

pip install jupyterlab
pip install python-Levenshtein
pip install bert-serving-server bert-serving-client

Data

For demo purposes, we use a small set of question-answer pairs. To build a high-quality QA system, you should use many question samples, specialized data storage, or a database for fast lookup, and, as you'll see at the end, it is a good idea to master training in NLP models.

In our examples, we will use the knowledge base without any modification, but you are free to insert additional question samples to improve answering quality. Let's load our data:

Hire talented developers from LATAM, Canada, and Europe

Check out a related article:

import pandas as pd
data = pd.read_csv('qa.csv')

# this function is used to get printable results
def getResults(questions, fn):
    def getResult(q):
        answer, score, prediction = fn(q)
        return> [q, prediction, answer, score]

        return pd.DataFrame(list(map(getResult, questions)), columns=["Q", "Prediction", "A", "Score"])

test_data = [
    "What is the population of Egypt?",
    "What is the poulation of egypt",
    "How long is a leopard's tail?",
    "Do you know the length of leopard's tail?",
    "When polar bears can be invisible?",
    "Can I see arctic animals?",
    "some city in Finland"
]

data

#	Question	Answer
0	Who determined the dependence of the boiling o...	Anders Celsius
1	Are beetles insects?	Yes
2	Are Canada 's two official languages English a...	yes
3	What is the population of Egypt?	more than 78 million
4	What is the biggest city in Finland?	Greater Helsinki
5	What is the national currency of Liechtenstein?	Swiss franc
6	Can polar bears be seen under infrared photogr...	Polar bears are nearly invisible under infrare...
7	When did Tesla demonstrate wireless communicat...	1893
8	What are violins made of?	different types of wood
9	How long is a leopard's tail?	60 to 110cm

The most straightforward QnA system using Python

Below is an example of a very naive QA system where a user's query needs to be equal to or part of some question.

imporе re

def getNaiveAnswer(q):
    # regex helps to pass some punctuation signs
    row = data.loc[data['Question'].str.contains(re.sub(r"[^\w'\s)]+", "", q),case=False)]
    if len(row) > 0:
        return row["Answer"].values[0], 1, row["Question"].values[0]
         return "Sorry, I didn't get you.", 0, ""

getResults(test_data, getNaiveAnswer)

#	Q	Prediction	A	Score
0	What is the population of Egypt?	What is the population of Egypt?	more than 78 million	1
1	What is the poulation of egypt		Sorry, I didn't get you.	0
2	How long is a leopard's tail?	How long is a leopard's tail?	60 to 110cm	1
3	Do you know the length of leopard's tail?		Sorry, I didn't get you.	0
4	When polar bears can be invisible?		Sorry, I didn't get you.	0
5	Can I see arctic animals?		Sorry, I didn't get you.	0
6	some city in Finland		Sorry, I didn't get you.	0

This system has a notable drawback to not find a match if there some grammar mistakes. Even if we use some string pre-processing of source and query texts, like punctuation symbols removal, lowercasing, etc., the result has poor quality. This way of question matching is very inefficient. Let's improve it to become error-prone with approximate string matching.

Approximating QA system

Let's use approximate string matching to make our system admit grammar mistakes and some text differences. In computer science, many methods to do approximate string matching exist. For our demo purposes, we use one of the implementations of fuzzy string searching called Levenshtein distance. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.

Let's implement our system with the Levenshtein Python module. It contains a set of approximate string matching functions; you can try any other Python modules of your choice.

from
Levenshtein import ratio
def getApproximateAnswer(q):
    max_score = 0
    answer = ""
    prediction = ""
        for idx, row in data.iterrows():
        score = ratio(row["Question"], q)
        if score >= 0.9:  # I'm sure, stop here 
            return  row["Answer"], score, row["Question"]
             elif score > max_score: # I'm unsure, continue
            max_score = score
            answer = row["Answer"]
            prediction = row["Question"]
    if max_score > 0.8:
        return answer, max_score, prediction
         return "Sorry, I didn't get you.", max_score, prediction
getResults(test_data, getApproximateAnswer)

#	Q	Prediction	A	Score
0	What is the population of Egypt?	What is the population of Egypt?	more than 78 million	1.000000
1	What is the poulation of egypt	What is the population of Egypt?	more than 78 million	0.935484
2	How long is a leopard's tail?	How long is a leopard's tail?	60 to 110cm	1.000000
3	Do you know the length of leopard's tail?	How long is a leopard's tail?	Sorry, I didn't get you.	0.657143
4	When polar bears can be invisible?	Can polar bears be seen under infrared photogr...	Sorry, I didn't get you.	0.517647
5	Can I see arctic animals?	What is the biggest city in Finland?	Sorry, I didn't get you.	0.426230
6	some city in Finland	What is the biggest city in Finland?	Sorry, I didn't get you.	0.642857

The second question with two grammar mistakes is answered, getting a score below 1.0 but acceptably high. For now, our system is better, it can do spell-checking, but they still have trouble with questions written in the native language. Let's try to adjust the max_ratio coefficient of our function to be more tolerant.

from Levenshtein import ratio

def getApproximateAnswer2(q):
    max_score = 0
    answer = ""
    prediction = ""
    for idx, row in data.iterrows():
        score = ratio(row["Question"], q)
        if score >= 0.9: # I'm sure, stop here
            return row["Answer"], score, row["Question"]
        elif score > max_score: # I'm unsure, continue
            max_score = score
            answer = row["Answer"]
            prediction = row["Question"]

    if max_score > 0.3: # treshold is lowered
        return answer, max_score, prediction
    return "Sorry, I didn't get you.", max_score, prediction

getResults(test_data, getApproximateAnswer2)

#	Q	Prediction	A	Score
0	What is the population of Egypt?	What is the population of Egypt?	more than 78 million	1.000000
1	What is the poulation of egypt	What is the population of Egypt?	more than 78 million	0.935484
2	How long is a leopard's tail?	How long is a leopard's tail?	60 to 110cm	1.000000
3	Do you know the length of leopard's tail?	How long is a leopard's tail?	60 to 110cm	0.657143
4	When polar bears can be invisible?	Can polar bears be seen under infrared photogr...	Polar bears are nearly invisible under infrare...	0.517647
5	Can I see arctic animals?	What is the biggest city in Finland?	Greater Helsinki	0.426230
6	some city in Finland	What is the biggest city in Finland?	Greater Helsinki	0.642857

Let's examine the results. For now, our system has more answers. We've got responses for questions written with different words. But look at result #5. It looks like a false positive, and the answers don't match our questions semantically. We need some balanced ratio depending on the data set, choosing between language understanding and correctness. This Python code is straightforward, but it is impractical on large volumes because of the overall iteration dataset.

For now, you have an idea about how to improve answering quality. Maybe you could do that by adjusting coefficients, inserting more question samples, using a set of functions simultaneously, splitting the sentence into words, and doing matching on word level too, and more.

But in fact, you shouldn't do that. There are advanced libraries developed by Google and Facebook, which already deal with the subject matter, and they are doing it pretty well. Let's go to the next level with NLP models.

NLP-powered / BERT QA system

We use BERT-as-service to implement our following QA function. BERT or Bidirectional Encoder Representations from Transformers is a new method of pre-training language representations developed by Google. Bert-as-service uses BERT as a sentence encoder, allowing you to map sentences into fixed-length representations in a few lines of Python code.

Would you like to learn more

about talent solutions

Installation

Install the server and client via pip (consult the documentation for details):

pip install bert-serving-server bert-serving-client

Download a Pre-trained BERT Model. We use BERT-Base Cased, but you can try another model that fits better. Download and unpack the archive.

Start service, pointing model_dir to the folder with your downloaded model. Also, you need to set the maximum question sentence length if the default value of 25 doesn't fit your texts:

bert-serving-start -model_dir /tmp/cased_L-12_H-768_A-12/ -num_worker=4 -max_seq_len=64

Ranking (or pre-processing)

Before we use the service, we need to encode our knowledge base to BERT format.

from bert_serving.client import BertClient
import numpy as np

def encode_questions():
    bc = BertClient()
    questions = data["Question"].values.tolist()
    print("Questions count", len(questions))
    print("Start to calculate encoder....")
    questions_encoder = bc.encode(questions)
    np.save("questions", questions_encoder)
    questions_encoder_len = np.sqrt(
        np.sum(questions_encoder * questions_encoder, axis=1)
    )
    np.save("questions_len", questions_encoder_len)
    print("Encoder ready")

encode_questions()

Questions count 10
Start to calculate encoder....
Encoder ready

Run

from bert_serving.client import BertClient
import numpy as np

class BertAnswer():
    def __init__(self):
        self.bc = BertClient()
        self.q_data = data["Question"].values.tolist()
        self.a_data = data["Answer"].values.tolist()
        self.questions_encoder = np.load("questions.npy")
        self.questions_encoder_len = np.load("questions_len.npy")

    def get(self, q):
        query_vector = self.bc.encode([q])[0]
        score = np.sum((query_vector * self.questions_encoder), axis=1) / (
            self.questions_encoder_len * (np.sum(query_vector * query_vector) ** 0.5)
        )
        top_id = np.argsort(score)[::-1][0]
        if float(score[top_id]) > 0.94:
            return self.a_data[top_id], score[top_id], self.q_data[top_id]
        return "Sorry, I didn't get you.", score[top_id], self.q_data[top_id]

bm = BertAnswer()

def getBertAnswer(q):
    return bm.get(q)

getResults(test_data, getBertAnswer)

#	Q	Prediction	A	Score
0	What is the population of Egypt?	What is the population of Egypt?	more than 78 million	1.000000
1	What is the poulation of egypt	What is the population of Egypt?	more than 78 million	0.967848
2	How long is a leopard's tail?	How long is a leopard's tail?	60 to 110cm	1.000000
3	Do you know the length of leopard's tail?	How long is a leopard's tail?	60 to 110cm	0.970769
4	When polar bears can be invisible?	Can polar bears be seen under infrared photogr...	Polar bears are nearly invisible under infrare...	0.975287
5	Can I see arctic animals?	Can polar bears be seen under infrared photogr...	Polar bears are nearly invisible under infrare...	0.964607
6	some city in Finland	What is the biggest city in Finland?	Sorry, I didn't get you.	0.932894

Our function correctly answered most questions. But we have unanswered question #6. We can play with the score threshold and add additional question samples to improve understanding in our case, but in general, we need a better way - to perform fine-tuning of the model.

Hire talented developers from LATAM, Canada, and Europe

Let's try the fine-tuned BERT model in the next step.

Fine-tune pre-trained BERT QA systems

Pre-trained BERT models often show pretty good results on many tasks. However, to release the true power of BERT, fine-tuning domain-specific data is necessary.

We follow the instruction in "Sentence (and sentence-pair) classification tasks." Clone the repository:

git clone https://github.com/google-research/bert.git

Download this script and run it to download "GLUE data":

python download_glue_data.py

Then run fine-tuning process:

export BERT_BASE_DIR=/tmp/cased_L-12_H-768_A-12
export GLUE_DIR=/tmp/glue_data

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/ \
  --do_lower_case=False

The fine-tuned model is stored at /tmp/mrpc_output/. Look inside it and find our fine-tuned model checkpoint, which is named like model.ckpt-343. Remember to use it as a parameter to BERT-server.

Now start BertServer by putting three pieces together:

bert-serving-start -model_dir /tmp/cased_L-12_H-768_A-12/ -num_worker=4 -max_seq_len=64 \
  -tuned_model_dir=/tmp/mrpc_output/ -ckpt_name=model.ckpt-343

After the server starts, you should find this line in the log:

I:GRAPHOPT:checkpoint (override by the fine-tuned model): /tmp/mrpc_output/model.ckpt-343
...
I:VENTILATOR:all set, ready to serve request!

Now let's repeat pre-processing and run steps:

Would you like to learn more

about talent solutions

from bert_serving.client import BertClient
import numpy as np

def encode_questions2():
    bc = BertClient()
    questions = data["Question"].values.tolist()
    print("Questions count", len(questions))
    print("Start to calculate encoder....")
    questions_encoder = bc.encode(questions)
    np.save("questions2", questions_encoder)
    questions_encoder_len = np.sqrt(
        np.sum(questions_encoder * questions_encoder, axis=1)
    )
    np.save("questions_len2", questions_encoder_len)
    print("Encoder ready")

encode_questions2()

Questions count 10 Start to calculate encoder.... Encoder ready

from bert_serving.client import BertClient
import numpy as np

class TunedBertAnswer():
    def __init__(self):
        self.bc = BertClient()
        self.q_data = data["Question"].values.tolist()
        self.a_data = data["Answer"].values.tolist()
        self.questions_encoder = np.load("questions2.npy")
        self.questions_encoder_len = np.load("questions_len2.npy")

    def get(self, q):
        query_vector = self.bc.encode([q])[0]
        score = np.sum((query_vector * self.questions_encoder), axis=1) / (
            self.questions_encoder_len * (np.sum(query_vector * query_vector) ** 0.5)
        )
        top_id = np.argsort(score)[::-1][0]
        if float(score[top_id]) > 0.94:
            return self.a_data[top_id], score[top_id], self.q_data[top_id]
        return "Sorry, I didn't get you.", score[top_id], self.q_data[top_id]

bm2 = TunedBertAnswer()

def getTunedBertAnswer(q):
    return bm2.get(q)

getResults(test_data, getTunedBertAnswer)

	Q	Prediction	A	Score
0	What is the population of Egypt?	What is the population of Egypt?	more than 78 million	1.000000
1	What is the poulation of egypt	What is the population of Egypt?	more than 78 million	0.968978
2	How long is a leopard's tail?	How long is a leopard's tail?	60 to 110cm	1.000000
3	Do you know the length of leopard's tail?	How long is a leopard's tail?	60 to 110cm	0.967964
4	When polar bears can be invisible?	Can polar bears be seen under infrared photogr...	Polar bears are nearly invisible under infrare...	0.960934
5	Can I see arctic animals?	Can polar bears be seen under infrared photogr...	Polar bears are nearly invisible under infrare...	0.964808
6	some city in Finland	What is the biggest city in Finland?	Greater Helsinki	0.946184

We've got all the questions answered for now. Please note that this is just an example. On other questions bases, it is possible to get better or worse results, so we need to examine accessible technologies in each case.

Conclusion

The quantity and content of examples have a significant impact on the pre-trained model when the score is near the threshold. Try to change some test questions, and you'll get a different result; even some correct answers may go. So the pre-trained model can handle many input variants, but it doesn't solve all possible cases.

To make a sound QnA system, we need many question examples, trying to raise the prediction score to 1 for the possible input questions. Training on own domain-specific data instead of general may give a better prediction result too.

In the examples above, we demonstrated how the quality of a QA system is influenced by AI technology, from a single function to a pre-trained NLP model. We built a basic Question Answering system with natural language understanding in a few lines of Python code. Such systems can be used standalone to serve Frequently Asked Questions search, documentation search, etc.

QA system can be used to improve the quality of chatbots significantly. A large number of questions and answers can introduce some difficulty in training, but the QA system can serve this task quite well.

Post Views: 1,239

How to Build a Question Answering System Using Deep Learning

Basic QA system pipeline

Prerequisites

Data

The most straightforward QnA system using Python

Approximating QA system

NLP-powered / BERT QA system

Installation

Ranking (or pre-processing)

Run

Fine-tune pre-trained BERT QA systems

Conclusion

Related Posts

IT Strategy
GenAI Meets Product: How Smarter Teams Are Building Better Software

IT Strategy
UX for AI Products: Why Developer Experience Starts with Design

IT Strategy
Domo Origato, Mr. Roboto! A CTO’s Guide to Trends in AI-Driven WealthTech

Learn why Fortune 500 and other companies trust Intersog

Learn why Fortune 500 and other companies trust Intersog

Learn why Fortune 500 and other companies trust Intersog

Learn why Fortune 500 and other companies trust Intersog

Learn why Fortune 500 and other companies trust Intersog

How to Build a Question Answering System Using Deep Learning

Basic QA system pipeline

Prerequisites

Data

The most straightforward QnA system using Python

Approximating QA system

NLP-powered / BERT QA system

Installation

Ranking (or pre-processing)

Run

Fine-tune pre-trained BERT QA systems

Conclusion

Related Posts

IT StrategyGenAI Meets Product: How Smarter Teams Are Building Better Software

IT StrategyUX for AI Products: Why Developer Experience Starts with Design

IT StrategyDomo Origato, Mr. Roboto! A CTO’s Guide to Trends in AI-Driven WealthTech

IT Strategy
GenAI Meets Product: How Smarter Teams Are Building Better Software

IT Strategy
UX for AI Products: Why Developer Experience Starts with Design

IT Strategy
Domo Origato, Mr. Roboto! A CTO’s Guide to Trends in AI-Driven WealthTech