A Review on Question Generation from Natural Language Text

Abstract

In this investigation, we attempt to classify the question generation task more comprehensively from three different perspectives, namely the type of input context text, the target answer, and the generated question. We conduct an in-depth study of existing models from different dimensions to analyze their basic ideas, main design principles, and training strategies, and we compare these models through benchmark tasks to gain an empirical understanding of the existing technology. Furthermore, we discuss what is missing from the current literature as well as promising and anticipated future directions.

Introduction

Template-based methods are divided into three categories

Recently, we have witnessed a boom of deep neural models in the field of QG. Neural QG models provide a fully data-driven, end-to-end trainable framework in which content selection and question construction can be jointly optimized. Compared with earlier rule-based methods, neural QG models show great advantages in both question fluency and diversity. Without loss of generality, most neural methods formulate the QG task as a sequence-to-sequence (Seq2Seq) problem and design different types of encoders and decoders to improve the quality of the generated questions. Perhaps the first neural QG model was introduced in 2017 by Reference [57], which achieved better performance than traditional rule-based methods through a Seq2Seq model [14] with a vanilla attention-based RNN. Later, many works made the RNN-based Seq2Seq framework more powerful by leveraging question types [61, 145], answer position features [135, 264], answer separation [88, 203], and self-attention mechanisms [25, 108]. Furthermore, popular frameworks such as pre-training frameworks [53], variational autoencoders [228], graph-based frameworks [36], and adversarial networks [16] have also attracted widespread attention for question generation. In addition to the widely used maximum likelihood estimation [94], some works also use multi-task learning [227], reinforcement learning [66], transfer learning [129], and other effective training strategies to optimize neural QG models.
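To make the Seq2Seq formulation concrete, the following is a minimal, illustrative PyTorch sketch of a vanilla attention-based RNN encoder-decoder for QG. All module names, dimensions, and the particular attention form are assumptions for illustration, not the setup of any cited paper.

```python
# Illustrative sketch of an attention-based RNN Seq2Seq model for QG (not from any cited work).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) token ids of the context sentence/paragraph
        outputs, hidden = self.rnn(self.embed(src_ids))
        return outputs, hidden  # outputs: (batch, src_len, 2*hid_dim)

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRUCell(emb_dim + 2 * hid_dim, hid_dim)
        self.attn = nn.Linear(hid_dim, 2 * hid_dim)
        self.out = nn.Linear(hid_dim + 2 * hid_dim, vocab_size)

    def step(self, prev_id, state, enc_outputs):
        # prev_id: (batch,) previous question token; state: (batch, hid_dim) decoder state.
        query = self.attn(state).unsqueeze(1)                    # (batch, 1, 2*hid_dim)
        scores = (query * enc_outputs).sum(-1)                   # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)                  # attention over context tokens
        context = (weights.unsqueeze(-1) * enc_outputs).sum(1)   # (batch, 2*hid_dim)
        state = self.rnn(torch.cat([self.embed(prev_id), context], dim=-1), state)
        logits = self.out(torch.cat([state, context], dim=-1))   # next-word distribution
        return logits, state
```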

Leveraging Context Information for Natural Question Generation (2018) [203]
Uses contextual information for question generation

We have made exciting progress on the QG model so far. QG workshops and tutorials have attracted widespread interest from the research community [19, 89, 187, 189]. Standard benchmark datasets [21, 172, 216], evaluation tasks [9, 193] and open source toolkits [83, 245] have been created to facilitate research and rigorous comparisons. Despite these exciting results, the QG field lacks a comprehensive taxonomy to better understand existing QG tasks. Furthermore, there is little understanding and few guidelines on the design principles and learning strategies of different models. Therefore, it is the right moment to look back, summarize the current state and gain some insights into future developments.

Application scenarios

Question answering (QA)
Machine reading comprehension
Intelligent tutoring

Problem statement

Answer-independent QG tasks remove the constraint of knowing the target answer before generating the question. In practical applications such as intelligent tutoring systems, humans or machines often need to create questions from natural language text without explicitly annotated answers.

Dataset

Existing methods

Traditional Seq2Seq
Answer-agnostic question generation


Answer-aware question generation

Based on pre-trained models
Based on graphs
Generative models (GAN)

Seq2Seq with answer spans as input

To determine what information to focus on when generating questions, most Seq2Seq models utilize answer position features to incorporate answer spans. As shown in Figure 3, various works attempt to augment each word vector of the input context text with additional answer indicator features, indicating whether the word is within the answer span. Existing implementations of this functionality can generally be divided into BIO tagging schemes and binary indicators.

Adding answer markers
BIO
B represents the beginning of the answer
I represents the continuation of the answer
O represents a word that does not form part of the answer
Zhang and Bansal [255] leverage BIO tags and POS and NER linguistic features to enhance word embeddings of ELMo or BERT for paragraph-level input

Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering (2019) [255]
For binary indicators, the answer position feature is set to 1 if a word token occurs in the answer and 0 otherwise.
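A minimal sketch of both schemes, assuming the context is already tokenized and the answer span is given as inclusive token indices; the helper name is hypothetical.

```python
# Sketch: answer position features (BIO tags and binary indicators) for a tokenized context.
def answer_position_features(tokens, answer_start, answer_end):
    """Return BIO tags and binary indicators for an answer span
    given as inclusive token indices [answer_start, answer_end]."""
    bio, binary = [], []
    for i, _ in enumerate(tokens):
        if i == answer_start:
            bio.append("B")          # beginning of the answer
        elif answer_start < i <= answer_end:
            bio.append("I")          # inside (continuation of) the answer
        else:
            bio.append("O")          # outside the answer
        binary.append(1 if answer_start <= i <= answer_end else 0)
    return bio, binary

tokens = "the eiffel tower was completed in 1889".split()
print(answer_position_features(tokens, 6, 6))
# (['O', 'O', 'O', 'O', 'O', 'O', 'B'], [0, 0, 0, 0, 0, 0, 1])
```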

Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus (2020) [134]

Accelerating Real-Time Question Answering via Question Generation (2020)
Uses a QA model to learn the answer distribution in order to obtain answer information

Answer-relative position embeddings

Unlike answer indicator features, which only weakly capture the relative distance between the answer and its context words, many works propose to explicitly model the relative distance between context words and the answer. In this way, the model can put more emphasis on the context words surrounding the answer.
For example, Gao et al. [70] feed relative position embeddings that capture proximity cues, together with word embeddings of the input sentences, into the encoder of a typical Seq2Seq framework. Sun et al. [210] proposed an answer-focused and position-aware neural QG model, in which relative distances are encoded as position embeddings to help the model copy context words that are close and relevant to the answer.

Answer-focused and Position-aware Neural Question Generation (2018)
Difficulty Controllable Generation of Reading Comprehension Questions (2018)
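A small illustrative sketch of this idea, assuming token-level distances to the answer span are clipped and looked up in an embedding table; the clipping threshold and dimensions are arbitrary choices, not those of the cited papers.

```python
# Sketch: embedding each token's relative distance to the answer span (illustrative).
import torch
import torch.nn as nn

def relative_distances(seq_len, answer_start, answer_end, max_dist=40):
    # Distance 0 inside the answer, growing (and clipped) as tokens get farther away.
    dists = []
    for i in range(seq_len):
        if i < answer_start:
            d = answer_start - i
        elif i > answer_end:
            d = i - answer_end
        else:
            d = 0
        dists.append(min(d, max_dist))
    return torch.tensor(dists)

position_embed = nn.Embedding(41, 16)          # one vector per clipped distance (0..40)
dists = relative_distances(seq_len=12, answer_start=5, answer_end=6)
pos_vecs = position_embed(dists)               # (12, 16), concatenated with word embeddings
```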

Seq2Seq with abstract answers as input


Much work has focused on encoding the target answer and the context text separately. Note that some researchers also encode the context text and the answer span separately. Here we group all these models under Seq2Seq with abstract answers as input.

For example, Hu et al. [88] first identify shared aspects in a given QA pair and then encode aspect and answer information separately.
Wang et al. [231] designed a discriminator based on weak supervision to encode answers and paragraphs separately to capture the relationship between them and focus on the answer-related parts of the paragraph.
Chen et al. [35] first adopt a deep alignment network (DAN) to explicitly model the global interaction between passages and answers at multiple granularity levels, and then use an RNN-based decoder to generate questions.

Song et al. [203] first encode the contextual input and answers through two independent LSTMs, and then match the answers to passages before generating questions.
Kim et al. [98] proposed an answer-separated Seq2Seq, which first replaces the target answer in the original passage with a special marker (i.e., an answer mask), and then encodes the masked passage and the answer separately to better utilize the information from both sides.
Improving Neural Question Generation using Answer Separation (2018) [98]
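A minimal sketch of the answer-masking preprocessing step behind answer separation; the mask token name and the whitespace tokenization are assumptions for illustration.

```python
# Sketch: replace the answer span in the passage with a special mask token,
# then feed the masked passage and the answer to separate encoders.
def separate_answer(passage_tokens, answer_start, answer_end, mask_token="<ANS>"):
    masked = passage_tokens[:answer_start] + [mask_token] + passage_tokens[answer_end + 1:]
    answer = passage_tokens[answer_start:answer_end + 1]
    return masked, answer

passage = "marie curie won the nobel prize in 1903".split()
masked, answer = separate_answer(passage, 7, 7)
# masked -> ['marie', ..., 'in', '<ANS>'], answer -> ['1903']
```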

Transformer-based models

For example, Chan and Fan [28] studied QG with answer span information using pre-trained BERT, which consists of Transformer layers. Wang et al. [224] proposed to treat the answer span as the hidden pivot of QG and adopt Transformers as the encoder and decoder modules. Chai and Wan [25] proposed using answer span information to generate questions in a semi-autoregressive manner, where both the encoder and the decoder follow the Transformer architecture. Some works [60, 223] fine-tune a pre-trained BART language model [123], which combines bidirectional and autoregressive Transformers, to generate questions. Specifically, Wang et al. [223] concatenate answers and source articles with a special marker token in the middle, while Durmus et al. [60] mask important text spans (i.e., gold answers) in the input sentences. Note that in References [60, 223], QG is used to evaluate the overall quality of abstractive summaries, which is a novel and interesting direction for researchers in the QG field.

Neural Question Generation with Answer Pivot (2020) [224]
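As an illustration of the answer-plus-context input format used with pre-trained Seq2Seq models, here is a hedged sketch using the Hugging Face transformers API. The checkpoint, the separator placement, and the decoding settings are assumptions rather than the exact setup of the cited papers, and without QG fine-tuning the output will not be a well-formed question.

```python
# Sketch: answer + context concatenation fed to a pre-trained BART model (assumptions noted above).
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

answer = "1889"
context = "The Eiffel Tower was completed in 1889."
# Place the answer before the context, separated by a special token (an illustrative choice).
source = answer + " </s> " + context

inputs = tokenizer(source, return_tensors="pt", truncation=True)
generated = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
# This snippet only shows the interface; QG fine-tuning is still required to produce real questions.
```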

Answer-agnostic question generation with Seq2Seq

As shown in Figure 5, Du et al. [57] proposed the first neural QG model, which uses the RNN-based Seq2Seq framework [37] for sentence- and paragraph-level input text without utilizing answer information. An attention mechanism [14] is applied to help the decoder focus on the most relevant parts of the input text when generating questions. Subsequently, many works [73, 235] adopted the RNN-based Seq2Seq framework for answer-agnostic QG with sentence-level input text.
Guo et al. [73] proposed to generate questions from a given sentence using an RNN-based Seq2Seq model similar to Reference [57].
Wu et al. [235] proposed a question type-driven framework for answer-agnostic QG, which first predicts question types and then generates questions that follow specific question type patterns.
To prevent the generated questions from repeating words, Chali and Baghaee [26] incorporated a coverage mechanism into an RNN-based Seq2Seq framework that takes sentences as input text.
To handle the rare or unknown word problem, Tang et al. [212] and Tang et al. [214] incorporated a copy mechanism and a post-processing replacement mechanism into the Seq2Seq framework, respectively (a sketch of the copy mechanism is given below).
Furthermore, for paragraph-level input text, Duan et al. [58] proposed to train two Seq2Seq models, the former learns to generate the most relevant question templates from paragraphs, and the latter learns to fill the gaps of the templates with topic phrases through a copy mechanism.
For keyword-level input text, Reddy et al. [177] utilized the RNN-based Seq2Seq framework to generate questions from a given set of keywords.
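A minimal sketch of the copy (pointer-generator style) mechanism mentioned above, for a single decoding step; tensor names and shapes are illustrative, not those of any cited implementation.

```python
# Sketch: mixing the decoder's vocabulary distribution with a copy distribution
# induced by the attention weights over source tokens (pointer-generator style).
import torch

def copy_augmented_distribution(vocab_logits, attn_weights, src_ids, p_gen):
    # vocab_logits: (batch, vocab); attn_weights: (batch, src_len);
    # src_ids: (batch, src_len) source token ids; p_gen: (batch, 1) generation probability.
    vocab_dist = p_gen * torch.softmax(vocab_logits, dim=-1)
    copy_dist = torch.zeros_like(vocab_dist)
    copy_dist.scatter_add_(1, src_ids, (1 - p_gen) * attn_weights)  # route mass to source ids
    return vocab_dist + copy_dist
```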

The key assumption of the above models is that the input text contains a question-worthy concept that has been identified ahead of time. In practice, however, the model should learn automatically what is worth asking.

Recently, as shown in Figure 6, many works adopt a two-stage framework in which content selection and question construction are jointly optimized in an end-to-end manner [55, 56, 109, 152, 208, 226]. Du and Cardie [55] proposed a hierarchical neural sentence-level sequence labeling model to identify question-worthy sentences from a given paragraph, and then incorporated this sentence selection component into a previously proposed neural QG system [57]. Similarly, Subramanian et al. [208] first identify key phrases in passages or documents about which humans are likely to ask questions.
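A small illustrative sketch of the first stage, framing content selection as sentence-level sequence labeling over pre-computed sentence encodings; the architecture and dimensions are assumptions, not those of Reference [55].

```python
# Sketch: stage-1 content selection as sentence-level sequence labeling (illustrative).
import torch
import torch.nn as nn

class SentenceSelector(nn.Module):
    """Scores each sentence of a paragraph as question-worthy or not;
    selected sentences are then passed to a Seq2Seq question generator."""
    def __init__(self, sent_dim=512, hid_dim=256):
        super().__init__()
        self.context_rnn = nn.LSTM(sent_dim, hid_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hid_dim, 1)

    def forward(self, sent_vecs):
        # sent_vecs: (batch, num_sents, sent_dim) pre-computed sentence encodings
        contextual, _ = self.context_rnn(sent_vecs)
        return torch.sigmoid(self.classifier(contextual)).squeeze(-1)  # (batch, num_sents)
```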

Pre-trained Seq2Seq model

Graph-based model

Traditional Seq2Seq models only capture the surface linear structure of the context and cannot model long-distance relationships between sentences. To address this issue, some recent research focuses on graph-based neural QG models, inspired by the use of graphs [100, 127] to model highly structured objects such as entity relations and molecules. These methods combine the representation capabilities of deep neural networks with the structural modeling capabilities of relational sentence graphs to model long-distance relationships between sentences.

As shown in Figure 8, most such methods first construct a graph from the input context and then use graph-based models to efficiently learn graph embeddings from the constructed text graph.
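A hedged sketch of this two-step recipe, using a crude sentence-overlap graph and a single GCN-style propagation step in place of the richer entity or semantic graphs used in the cited works.

```python
# Sketch: building a context graph and running one GCN-style propagation step (illustrative).
import torch

def build_sentence_graph(sentences):
    """Connect sentences that share at least one token -- a crude stand-in
    for the entity/semantic graphs used by graph-based QG models."""
    n = len(sentences)
    adj = torch.eye(n)                       # self-loops
    token_sets = [set(s.lower().split()) for s in sentences]
    for i in range(n):
        for j in range(i + 1, n):
            if token_sets[i] & token_sets[j]:
                adj[i, j] = adj[j, i] = 1.0
    return adj

def gcn_layer(node_feats, adj, weight):
    # Normalized neighborhood aggregation: D^-1 * A * X * W
    deg = adj.sum(dim=1, keepdim=True)
    return torch.relu((adj / deg) @ node_feats @ weight)

sentences = ["Marie Curie won the Nobel Prize.", "The prize was awarded in 1903."]
adj = build_sentence_graph(sentences)
node_feats = torch.randn(len(sentences), 64)     # stand-in sentence encodings
weight = torch.randn(64, 64)
graph_embeddings = gcn_layer(node_feats, adj, weight)
```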

Generative models

Model training strategy

MLE (maximum likelihood estimation)
reinforcement learning
multi-task learning
transfer learning
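As an illustration of how these strategies differ at the loss level, below is a minimal sketch contrasting maximum likelihood estimation with a REINFORCE-style reward (for example, sentence-level BLEU) and a common mixed objective; the mixing weight and reward choice are assumptions, not taken from any cited work.

```python
# Sketch: MLE loss vs. a REINFORCE-style loss, and a mixed training objective (illustrative).
import torch

def mle_loss(token_log_probs):
    # token_log_probs: (batch, tgt_len) log p(y_t | y_<t, x) of the gold question tokens
    return -token_log_probs.sum(dim=1).mean()

def rl_loss(sampled_log_probs, reward, baseline):
    # reward: sentence-level metric (e.g., BLEU) of a sampled question vs. the reference
    advantage = reward - baseline
    return -(advantage * sampled_log_probs.sum(dim=1)).mean()

def mixed_loss(token_log_probs, sampled_log_probs, reward, baseline, gamma=0.7):
    # gamma balances the reward-driven and likelihood-driven terms (an arbitrary choice here).
    return gamma * rl_loss(sampled_log_probs, reward, baseline) + (1 - gamma) * mle_loss(token_log_probs)
```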

Model comparison

SQuAD

We follow the data split of Du et al. [57] and outline previously published results on the SQuAD dataset.
This split contains 70,484/10,570/11,877 (train/dev/test) examples.
The split of Zhou et al. [261] contains 86,635/8,965/8,964 (train/dev/test) examples.

As mentioned in Reference [57], nearly 30% of questions in SQuAD rely on information beyond a single sentence. Also, their experimental results show that encoding paragraphs results in a small performance degradation compared to encoding only sentences.
Later, many works explored how to consider paragraph-level contextual information (subscript P in Tables 7 and 8) or document-level contextual information (subscript D in Tables 7 and 8) to improve the performance of QG systems.

Questions rely on information beyond a single sentence
But naively encoding the whole paragraph does not work well

The model proposed in Reference [134] achieved the best performance in terms of BLEU-1, BLEU-2, BLEU-3, and ROUGE-L on SQuAD under the data split of Reference [57]. The reason may be that it transforms the one-to-many mapping problem into a one-to-one mapping problem, which makes the generation process more controllable and improves the quality of the generated questions.

Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus (2020) [134]; state of the art under the Du et al. split
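For reference, a minimal sketch of how corpus-level BLEU-1 to BLEU-4 can be computed for generated questions with NLTK; the tokenization here is plain whitespace splitting, which may differ from the preprocessing behind the published numbers.

```python
# Sketch: corpus-level BLEU-1..4 for generated questions using NLTK (illustrative setup).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [["what year was the eiffel tower completed ?".split()]]   # one reference per example
hypotheses = ["when was the eiffel tower completed ?".split()]          # generated questions

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)
    score = corpus_bleu(references, hypotheses, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```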

A Recurrent BERT-based Model for Question Generation (2019) [28]; state of the art under the Zhou et al. split
Chan and Fan [28] achieved the best performance in terms of BLEU-1, BLEU-2, and BLEU-3 on SQuAD under the data split of Reference [261]. Specifically, the authors introduced different neural architectures built on top of the pre-trained BERT language model to generate questions, which once again demonstrates the effectiveness of pre-trained models.

NewsQA

Table 9 shows an overview of previously published results on the NewsQA dataset. Specifically, the representative QG models on NewsQA are answer-aware. In general, based on the reported results, we observe that the overall performance on NewsQA is worse than on SQuAD. The main reason is that the average answer length of NewsQA is greater than that of SQuAD, and longer answers usually demand more key information, which makes it more difficult to generate questions. Furthermore, real questions in NewsQA tend to have less strict syntax and more diverse wording [135].
Luu Anh Tuan, Darsh J. Shah, and Regina Barzilay. 2020. Capturing Greater Context for Question Generation; state of the art on NewsQA

HotpotQA

Table 10 shows an overview of previously published results on the HotpotQA dataset. Unlike the SQuAD and NewsQA datasets, which contain simple questions involving single-hop relationships, the HotpotQA dataset contains complex, semantically related questions that span multiple documents and require multi-hop reasoning. It is therefore more challenging than existing single-hop QG tasks. Recently, multi-hop QG has received increasing attention due to its widespread applications in future intelligent systems. For example, in education systems, such questions require higher-order cognitive skills that are crucial for assessing students' knowledge and stimulating self-learning. Published experimental results show promising progress on multi-hop QG, and there is a need to further advance the generation of such deep questions, considering how human intelligence embodies the skills of curiosity and integration.

Specifically, we find that the graph-based method [141] outperforms Seq2Seq models in terms of BLEU-1 to BLEU-3. The reason may be that it utilizes different levels of granularity.
Xiyao Ma, Qile Zhu, Y. Zhou, Xiaolin Li, and Dapeng Wu. 2020. Asking complex questions with multi-hop answer-focused reasoning. arXiv abs/2009.07402; state of the art on HotpotQA

Empirical comparison of question generation based on abstract answers

To better understand the performance of different neural QG models, we survey previously published results on the MS MARCO dataset (Table 11), where questions are generated based on abstract answers.

However, most works [140, 210, 214, 217, 259, 261, 264] extract a subset of MS MARCO in which answers are sub-spans of the paragraphs and use it to train QG models that generate factoid questions rather than non-factoid questions.
Some works [57, 58, 212] generate questions directly from input sentences based on the assumption that the sentences contain the correct answer span. Therefore, directly comparing the results published in these papers is not entirely fair; they can only provide a rough picture of existing work.

In future work, more effort should be put into generating questions related to abstract answers that summarize the information stated in the paragraphs.

Since only subsets of the dataset are used, direct comparisons are of limited value

Future research directions

Diverse question generation

Pre-training tailored for question generation

Question generation with higher cognitive level

Question generation for information search
In addition to the “user asks, system responds” paradigm, another solution is that the system can clarify the user’s information intention by proactively asking questions. This helps users refine their information needs and then increases the chances of retrieving satisfactory results.