Langchain-Chatchat project: Datasets used by P-Tuning v2 (4.2)

This article walks through the five task families covered by the P-tuning-v2 paper and code: GLUE, NER, QA, SRL, and SuperGLUE, focusing on the datasets used for each.

1. GLUE tasks
GLUE (General Language Understanding Evaluation) is a multi-task natural language understanding benchmark and analysis platform created by New York University, the University of Washington, and other institutions. GLUE contains nine English NLU tasks covering natural language inference, textual entailment, sentiment analysis, semantic similarity, and more. They fall into three broad categories: single-sentence tasks, similarity and paraphrase tasks, and inference tasks. All tasks are binary classification, except STS-B (a regression task) and MNLI (three classes) [1][2][3]. The nine tasks are described individually below.

The task_to_keys dictionary in the P-tuning-v2/tasks/glue/dataset.py file is as follows:

task_to_keys = {
    "cola": ("sentence", None),          # single sentence; None means there is no second text field
    "mnli": ("premise", "hypothesis"),   # premise / hypothesis pair
    "mrpc": ("sentence1", "sentence2"),  # sentence pair
    "qnli": ("question", "sentence"),    # question / sentence pair
    "qqp": ("question1", "question2"),   # question pair
    "rte": ("sentence1", "sentence2"),   # sentence pair
    "sst2": ("sentence", None),          # single sentence
    "stsb": ("sentence1", "sentence2"),  # sentence pair
    "wnli": ("sentence1", "sentence2"),  # sentence pair
}
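As a rough sketch of how such a mapping is typically consumed, the following mirrors the Hugging Face run_glue.py recipe (the backbone name and padding settings here are assumptions, not the exact P-tuning-v2 configuration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed backbone

def preprocess(examples, task="mnli", max_length=128):
    # Look up which raw text fields this task uses (the task_to_keys dict above).
    key1, key2 = task_to_keys[task]
    # Single-sentence tasks pass one field; pair tasks pass both to the tokenizer.
    texts = (examples[key1],) if key2 is None else (examples[key1], examples[key2])
    return tokenizer(*texts, truncation=True, max_length=max_length, padding="max_length")

Applied with datasets.Dataset.map(preprocess, batched=True), this turns raw GLUE examples into model-ready input IDs and attention masks.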

1.CoLA (The Corpus of Linguistic Acceptability)
A grammatical-acceptability dataset released by New York University. The task is to judge whether a given sentence is grammatically acceptable, so CoLA is a single-sentence text classification task.

2.SST (The Stanford Sentiment Treebank)
A sentiment analysis dataset released by Stanford University for classifying the sentiment of movie reviews; a single-sentence text classification task. SST-2 is the binary version and SST-5 a five-class version with finer-grained sentiment polarity; GLUE uses SST-2.

3.MRPC (Microsoft Research Paraphrase Corpus)
Released by Microsoft. The task is to judge whether two given sentences have the same meaning; a sentence-pair binary classification task.

4.STS-B (Semantic Textual Similarity Benchmark)
Drawn mainly from the SemEval STS tasks over the years (the data is also included in SentEval). The semantic similarity of two sentences is scored on a 0-5 scale, so it is essentially a regression problem, though it can also be approached with classification methods, in which case it becomes a five-class sentence-pair classification task.

5.QQP (Quora Question Pairs)
Released by Quora; the task is to judge whether two questions are semantically equivalent. A sentence-pair binary classification task.

6.MNLI (Multi-Genre Natural Language Inference)
Also released by New York University, this is a textual entailment task: given a premise, judge whether a hypothesis holds. Because MNLI's main selling point is text collected from many different genres, the dataset comes in two versions, matched and mismatched: in the former the training and test data come from the same sources, in the latter from different sources. A sentence-pair three-class classification task (entailment, contradiction, neutral).

7.QNLI (Question Natural Language Inference)
Derived from the SQuAD 1.0 dataset. Given a question and a sentence, judge whether the sentence contains the answer to the question. A sentence-pair binary classification task.

8.RTE (Recognizing Textual Entailment)
Like MNLI, a textual entailment task. The difference is that MNLI is three-way, while RTE only asks whether one sentence entails the other; a sentence-pair binary classification task.

9.WNLI (Winograd Natural Language Inference)
A textual entailment task derived from the Winograd Schema Challenge: judge whether the second sentence, in which an ambiguous pronoun has been replaced by a candidate referent, is entailed by the first. A sentence-pair binary classification task.

The official GLUE Tasks page now also lists an additional diagnostic dataset (Diagnostics Main), an analysis set for probing NLI models.

2. NER tasks
The relevant script is P-tuning-v2/tasks/ner/dataset.py.
1.conll2003 data set
(1) Introduction
CoNLL-2003 is an English named entity recognition dataset released with the 2003 shared task of the Conference on Computational Natural Language Learning (CoNLL). It contains entity category and position information for English news text. The entity categories are persons, locations, organizations, and miscellaneous entities, and positions are given as token-level IOB annotations marking the beginning and continuation of each entity. The dataset consists of a training set, a development set, and a test set, used to train and evaluate named entity recognition models.
(2) Download address
Link: https://www.cnts.ua.ac.be/conll2003/ner/
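For a quick look at the data, CoNLL-2003 is also mirrored on the Hugging Face Hub; a minimal sketch assuming the datasets library (recent versions may ask you to trust remote dataset code):

from datasets import load_dataset

ds = load_dataset("conll2003")
ex = ds["train"][0]
tag_names = ds["train"].features["ner_tags"].feature.names  # "O", "B-PER", "I-PER", ...
print(ex["tokens"])
print([tag_names[t] for t in ex["ner_tags"]])  # IOB tags aligned with the tokens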
2.conll2004 data set
(1) Introduction
The CoNLL04 dataset consists of news articles from the Wall Street Journal and the Associated Press. It defines four entity types: location (Loc), organization (Org), person (Peop), and other (Other), and five relation categories: Located_In, OrgBased_In, Live_In, Kill, and Work_For.
(2) Download address
Link: https://www.clips.uantwerpen.be/conll2003/ner/
3.ontonotes data set
(1) Introduction
OntoNotes 5.0 is the final release of the OntoNotes project, a collaboration between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute. The project's goal was to annotate a large corpus of texts of various genres (news, telephone conversations, weblogs, Usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word senses linked to an ontology, and coreference).
(2) Download address
Links: OntoNotes Release 4.0: https://catalog.ldc.upenn.edu/LDC2011T03; OntoNotes Release 5.0: https://catalog.ldc.upenn.edu/LDC2013T19

3. QA tasks
The relevant script is P-tuning-v2/tasks/qa/dataset.py.
1.SQuAD 1.1 data set
SQuAD is an extractive QA dataset proposed by Rajpurkar et al. It contains over 100,000 (question, passage, answer) triples, with passages drawn from 536 Wikipedia articles. Crowdworkers posed up to five questions per passage, and every answer is a span of text from the corresponding passage. https://huggingface.co/datasets/squad
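The sample screenshots are omitted here; to inspect the training and validation splits yourself, a minimal sketch assuming the Hugging Face datasets library:

from datasets import load_dataset

squad = load_dataset("squad")
print(squad)                 # train: ~87.6k examples, validation: ~10.6k examples
ex = squad["train"][0]
print(ex["question"])
print(ex["answers"])         # {'text': [...], 'answer_start': [...]}: answers are spans of the passage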

2.SQuAD 2.0 data set
Compared with the 100,000 questions in SQuAD 1.1, SQuAD 2.0 adds over 50,000 adversarially written questions that do not necessarily have an answer in the passage. https://huggingface.co/datasets/squad_v2
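Again omitting the screenshots: the same sketch works for SQuAD 2.0, where the key difference is that unanswerable questions carry empty answer lists (same assumptions as above):

from datasets import load_dataset

squad_v2 = load_dataset("squad_v2")
ex = next(e for e in squad_v2["validation"] if not e["answers"]["text"])
print(ex["question"])        # an unanswerable question
print(ex["answers"])         # {'text': [], 'answer_start': []}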

4. SRL tasks
The relevant script is P-tuning-v2/tasks/srl/dataset.py. The goal of semantic role labeling (SRL) is to identify who did what to whom, when, and where in a sentence; a schematic annotation is sketched below. The main English datasets are the annotated data provided by CoNLL-2005 and CoNLL-2012: the CoNLL-2005 data comes from the Penn Treebank, and the CoNLL-2012 data comes from OntoNotes v5.0.
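To make "who did what to whom, when and where" concrete, here is a hypothetical PropBank-style annotation for one sentence (an illustration only; the role inventory follows PropBank conventions, not the actual CoNLL file format):

# "John gave Mary a book yesterday." with predicate "gave" (give.01)
srl_example = {
    "tokens": ["John", "gave", "Mary", "a", "book", "yesterday", "."],
    "predicate": "gave",
    "roles": {
        "ARG0": "John",           # who (the giver)
        "ARG1": "a book",         # what (the thing given)
        "ARG2": "Mary",           # to whom (the recipient)
        "ARGM-TMP": "yesterday",  # when
    },
}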
1.conll2005 data set
Link: https://github.com/strubell/preprocess-conll05
2.conll2012 data set
Link: https://cemantix.org/conll/2012/data.html

5. SuperGLUE tasks
The relevant script is P-tuning-v2/tasks/superglue/dataset.py. SuperGLUE is a benchmark suite widely used to test the performance of natural language understanding models. Developed by New York University and other institutions as a more difficult successor to GLUE, it is one of the most challenging test suites in natural language understanding and aims to drive progress in NLP. SuperGLUE contains eight sub-datasets: BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC. For details, see the paper: https://w4ngatang.github.io/static/papers/superglue.pdf.

task_to_keys = {
    "boolq": ("question", "passage"),         # boolq: given a question and a passage, predict a yes/no answer
    "cb": ("premise", "hypothesis"),          # cb: given a premise and a hypothesis, predict whether the premise entails the hypothesis
    "rte": ("premise", "hypothesis"),         # rte: given a premise and a hypothesis, predict whether the premise entails the hypothesis
    "wic": ("processed_sentence1", None),     # wic: two sentences and a polysemous word; the script builds a combined field; predict whether the word has the same sense in both sentences
    "wsc": ("span2_word_text", "span1_text"), # wsc: a sentence with a pronoun (span2) and a noun phrase (span1); predict whether the pronoun refers to the phrase
    "copa": (None, None),                     # copa: a premise and two alternatives; predict which alternative is the more plausible cause/effect
    "record": (None, None),                   # record: a news article and a cloze question about it; predict the masked entity
    "multirc": ("paragraph", "question_answer") # multirc: a paragraph plus a combined question-answer field; predict whether the answer is correct
}
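A minimal sketch of reading one of these tasks through the mapping above, assuming the Hugging Face datasets copy of SuperGLUE (recent datasets versions may require trusting remote dataset code):

from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")
key1, key2 = task_to_keys["boolq"]   # ("question", "passage")
ex = boolq["train"][0]
print(ex[key1])                      # the yes/no question
print(ex[key2][:200])                # start of the supporting passage
print(ex["label"])                   # 1 = yes/true, 0 = no/false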

1.BoolQ data set
BoolQ (Boolean Questions) is a QA task: given a question and a passage, predict whether the answer to the question is yes or no.
2.CB data set
CB (CommitmentBank) is a corpus of short texts. Given a premise and a hypothesis, judge whether the premise entails the hypothesis (a three-class task: entailment, contradiction, neutral).
3.RTE data set
The RTE (Recognizing Textual Entailment) dataset comes from a series of annual textual entailment competitions; the task is to judge whether one given sentence entails the other.
4.WiC data set
WiC (Word-in-Context) is a word sense disambiguation task cast as binary classification over sentence pairs. Given two text fragments and a polysemous word that appears in both, the task is to determine whether the word is used with the same sense in the two sentences.
5.WSC data set
WSC (Winograd Schema Challenge) appears in GLUE recast as an NLI task (WNLI). Here, given a sentence containing a pronoun and a candidate noun phrase, the task is to determine whether the pronoun refers to that noun phrase.
6.COPA data set
COPA (Choice of Plausible Alternatives) is a causal reasoning task. Given a question and two candidate answers, choose the answer that better fits the question's context.
7.ReCoRD data set
ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset) is a multiple-choice QA task. Each example consists of a news article and a cloze-style question about it in which one entity is masked out; the model must predict the masked entity from a list of candidate entities in the passage.
8.MultiRC data set
MultiRC (Multi-Sentence Reading Comprehension) is a QA task where each example consists of a context passage, a question about the passage, and a list of possible answers, with the model predicting which answers are correct and which are incorrect.

References:
[1] GLUE paper: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (https://aclanthology.org/W18-5446/)
[2] GLUE official website: https://gluebenchmark.com/
[3] Introduction to common NLP tasks: https://www.cnblogs.com/guozw/p/13369757.html
[4] Summary of commonly used NER datasets: https://zhuanlan.zhihu.com/p/606788093
[5] SuperGLUE dataset: https://www.modelscope.cn/datasets/modelscope/super_glue/summary