Question answering system based on medical knowledge graph

1. Project source

Since I used Rasa to build a dialogue system before, I have always wanted to move away from the Rasa open-source framework and build a dialogue system with similar functionality from the ground up. After all, no matter how slick a framework is, nothing beats building it yourself.

I happened to see a project shared by @王乐 in the Rasa group: a medical diagnosis knowledge question answering system based on a knowledge graph. I watched the video first, then downloaded the code and implemented it myself, going back to the video whenever something was unclear. Now that I basically understand the project, I am writing this summary to share it; later I plan to extend it horizontally on the existing basis.

Since the original author already provides a video explanation (there is a video link on the project homepage), my summary tries not to repeat what is already covered there.

So it is recommended to read this article together with the author's video~

2. Project structure[1]

The current implementation is a minimal demo. The author may later add technology stacks such as Redis, Elasticsearch, and tf-serving, but for me the current version is enough.

This article is based on the current demo and will be updated as the project evolves.

What this project has implemented:

  • Chit-chat (small talk)
  • Multi-turn question answering based on the knowledge graph

Let's first look at the dialogue flow, or the text-processing flow, organized according to the code logic:

Take the user input “Excuse me, what should I do if I have heart disease?” as an example:

NLU module

1) The input first goes into classification model 1, which judges whether the intent is chit-chat. The candidate intents are: greet, goodbye, deny, isbot, accept, diagnosis:

  • If one of the first four intents is hit, the input goes to Chitchat_bot, which randomly selects a reply from the prepared corpus and returns it to the user, and the conversation ends;
  • If accept is hit, it is used for question clarification;
  • If diagnosis is hit, the input goes to 2) Medical_bot;

In this example, diagnosis is hit, so we enter 2) Medical_bot.

2) In Medical_bot, the input first goes into classification model 2 for second-level intent recognition, covering 13 medical intents; it then goes into the NER model for entity recognition and extraction.

The result of intent recognition is

{'confidence': 0.8997645974159241, 'intent': 'treatment method'}

The result of entity recognition is

[{'entities': [{'type': 'disease', 'word': 'heart disease'}], 'string': 'Excuse me, what should I do if I have heart disease'},
 {'entities': [{'recog_label': 'dict', 'type': 'disease', 'word': 'heart disease'}], 'string': 'Excuse me, what should I do if I have heart disease'}]

The previous two steps are equivalent to the NLU module of a task-oriented dialogue bot.

DST module

3) After getting the intent and entities, the entities are first used to fill the slots.

Let's talk about the relationship and difference between NLU and DST. They are actually closely related: both play a part in the slot-filling process, but their roles differ:

  • The NLU module classifies the user's input and marks the entities in it. Here is an excerpt of the entity recognition result above:
{'entities': [{'type': 'disease', 'word': 'heart disease'}], 'string': 'Excuse me, what should I do if I have heart disease'}

In this project, the entity's type is marked (corresponding to a node type in the knowledge graph) together with the corresponding text (word), but nothing is filled yet; at this point the entity has only been found.

  • The DST module finds a slot value for each slot in the slot list based on the dialogue history
"Treatment method":{
        "slot_list" : ["Disease"],
        "slot_values": None,
        "cql_template" : ["MATCH(p:disease) WHERE p.name='{Disease}' RETURN p.cure_way",
                        "MATCH(p:Disease)-[r:recommand_drug]->(q) WHERE p.name='{Disease}' RETURN q.name",
                        "MATCH(p:Disease)-[r:recommand_recipes]->(q) WHERE p.name='{Disease}' RETURN q.name"],
        "reply_template" : "'{Disease}' Disease treatment methods, available medicines, and recommended dishes are:\\
",
        "ask_template" : "Are you asking about the cure for the disease '{Disease}'?",
        "intent_strategy" : "",
        "deny_response":"I didn't understand what you said~"
    },

This is the information (form) under the “Treatment method” intent. There is only one slot, “Disease”, in the slot list (this form is introduced in detail below).

In each round of dialogue, the DST module checks all the dialogue history so far and determines which text can be filled in as the value of a specific slot in the slot list. This process is called tracking (Dialogue State Tracking).

In this project, this step is done by traversing the slot list and matching the entity recognition results.
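As a rough illustration of that traversal (not the project's actual code; the field names simply mirror the entity result and the intent form shown above):

def fill_slots(intent_info, ner_results):
    """Illustrative only: fill each slot in the intent's slot_list from the NER output."""
    slot_values = {}
    for slot in intent_info["slot_list"]:               # e.g. ["Disease"]
        for result in ner_results:                      # each NER result item
            for entity in result.get("entities", []):
                if entity["type"].lower() == slot.lower():
                    slot_values[slot] = entity["word"]  # e.g. "heart disease"
    intent_info["slot_values"] = slot_values or None
    return intent_info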

PL module

4) The reply strategy is then determined by the confidence of the intent. There are three simple cases (a minimal sketch follows the list):

  • >= 0.8: query the answer in Neo4j (the knowledge graph) according to the recognized intent and return it to the user
  • 0.4 ~ 0.8: ask the user a clarifying question
  • < 0.4: return a fallback reply
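A minimal sketch of that confidence-based policy (the thresholds come from the description above; the strategy labels are illustrative, not necessarily the project's actual values):

def choose_strategy(confidence):
    """Map the intent confidence to a reply strategy (thresholds as described above)."""
    if confidence >= 0.8:
        return "accept"     # query Neo4j with the intent's cql_template and answer directly
    elif confidence >= 0.4:
        return "clarify"    # ask the user back using ask_template
    else:
        return "fallback"   # return a default fallback reply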

The DST module plus the PL module make up the dialogue management (DM) module of a task-oriented dialogue bot. In this project the boundary between them is not particularly sharp; the main implementation logic is in the semantic_parser function of the modules.py file.

Many people in academia and industry are researching this part; a popular direction is to use reinforcement learning to select the policy. There are related papers if you are interested.

To achieve the above process, the following steps need to be completed:

1) Build a knowledge graph as the underlying data support

The data set used to build the knowledge graph here is the medical data set used by liuhuanyong in the QABasedOnMedicaKnowledgeGraph project.

There are 8 types of entities in total: drugs, recipes, foods, examinations, departments, pharmaceutical companies, diseases, and symptoms, with more than 40,000 entities altogether.

11 types of entity relationships:

  • disease - food to avoid
  • disease - food suitable to eat
  • disease - recommended recipe
  • disease - common drug
  • disease - recommended drug
  • disease - examination
  • department - department (belongs to)
  • drug vendor - drug
  • disease - symptom
  • disease - comorbidity
  • disease - department

For students who have never built a knowledge graph: when the data set is already structured, don't imagine building a knowledge graph as something particularly complicated and difficult. The difficulty of a knowledge graph lies in:

  • The data collection stage: extracting structured information from unstructured data, such as identifying entities (NER) in web page text and extracting the relationships between them
  • The graph design stage: defining entities, attributes, and relationships, which requires some domain background knowledge to design the graph well

In this project, the data set is structured and the entities, attributes, and relationships are all defined, so don't think of this step as difficult.

In actual business, the difficulty of building a knowledge graph depends on your specific business scenario and cannot be generalized.

2) Classification model 1

The task of classification model 1 is multi-intent classification: judging whether the user's intent is chit-chat. This is the first level of intent classification (the project performs two levels of intent classification in total). An LR + GBDT multi-model fusion approach is used here.
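One common form of LR + GBDT fusion feeds one-hot encoded GBDT leaf indices into a logistic regression. The sketch below shows that idea with made-up features and labels; it is not necessarily the exact fusion scheme used in this project.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up sentence features (e.g. TF-IDF vectors) and intent labels;
# 0..5 could stand for greet / goodbye / deny / isbot / accept / diagnosis.
X = np.random.rand(200, 50)
y = np.random.randint(0, 6, size=200)

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)

# apply() gives the leaf index each sample reaches in every tree
leaves = gbdt.apply(X).reshape(len(X), -1)
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves)

lr = LogisticRegression(max_iter=1000).fit(encoder.transform(leaves), y)

def predict_intent(features):
    """Predict the intent id for a single feature vector."""
    leaf = gbdt.apply(features.reshape(1, -1)).reshape(1, -1)
    return int(lr.predict(encoder.transform(leaf))[0])

print(predict_intent(X[0]))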

3) Classification model 2

In Medical_bot, the input first goes to classification model 2 to identify the specific medical diagnosis intent: 13 medical diagnosis intents in total, such as definition, etiology, treatment duration, prevention, and treatment methods. Bert + TextCNN is used here for multi-intent classification.
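A rough sketch of a Bert + TextCNN classifier in PyTorch with HuggingFace transformers; the checkpoint name, filter sizes, and other hyperparameters are my assumptions, not the project's exact code:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertTextCNN(nn.Module):
    def __init__(self, num_intents=13, kernel_sizes=(2, 3, 4), num_filters=64):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        hidden = self.bert.config.hidden_size
        # one 1-D convolution per kernel size over the token representations
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_intents)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) -> (batch, hidden, seq_len) for Conv1d
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)
        pooled = [torch.relu(conv(h)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertTextCNN()
batch = tokenizer(["Excuse me, what should I do if I have heart disease"],
                  return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
probs = torch.softmax(logits, dim=-1)   # intent confidences used later by the PL module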

4) NER model

In Medical_bot, the second step is to identify and extract medical-domain entities with the NER model, that is, to extract from the user input the names of entities contained in the knowledge graph. As mentioned earlier, there are 8 entity types in this project.

However, in a large project the number of entities can reach tens of thousands, and an ordinary matching method has to backtrack every time a match fails, which is slow. The time complexity of an Aho-Corasick (AC) automaton is ideally O(n), where n is the length of the user's input string.

So one technical route is: use an AC automaton to extract all substrings (or the longest substrings) that match entities in the knowledge base, plus an NER model to extract entities from the user input, then merge the two sets of results and keep the top-n entities with the highest recall scores for the subsequent linking step.

This project uses the AC automaton (extracting all matching substrings) together with NER entity recognition.
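A minimal sketch of the dictionary-matching side using the pyahocorasick package (the entity list and the output format below are only illustrative):

import ahocorasick

# entity names taken from the knowledge graph (made-up sample)
disease_names = ["heart disease", "hypertension", "diabetes"]

automaton = ahocorasick.Automaton()
for idx, name in enumerate(disease_names):
    automaton.add_word(name, (idx, name))
automaton.make_automaton()

def match_entities(text):
    """Return every dictionary entity that occurs as a substring of text."""
    return [
        {"recog_label": "dict", "type": "disease", "word": name}
        for _, (idx, name) in automaton.iter(text)
    ]

print(match_entities("Excuse me, what should I do if I have heart disease"))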

3. Project operation

To make it easier for everyone to practice, I uploaded my code to GitHub; the address is KBQA-study. Compared with the original code:

  • Added comments to some code for easier debugging and reading
  • Modified the names of some parameters and functions
  • The original author connected to WeChat through itchat and interacted via WeChat. My own WeChat account could not log in, so I used sanic, a lightweight web framework, to build a web interface instead. I wanted to connect to Telegram, but I really didn't have time, so that plan is on hold for now. Which channel you interact through is not the most important thing anyway; the focus is the algorithm part.

My operating environment:

Win10 + 16GB memory + Pycharm

To run this project, you need to do the following:

The first step is to download the required data and models

The data is required; the models you can later train yourself. The download link is on the GitHub homepage.

After downloading, put the data into the corresponding folder

Everything is there except the Bert pre-trained files, which are too big, so download them yourself online.

The second step is to build the knowledge graph

1) Install Neo4j on your computer

refer to:

  • Install Neo4j under Win10
  • Tutorial on the installation and use of the graph database Neo4j (Win & Linux)
  • Teach you how to quickly start the knowledge map – Neo4J Tutorial

2) Open a CMD window and run neo4j.bat console to start the service

3) In Pycharm, run build_kg_utils.py under the build_kg folder to build the knowledge graph

  • This process is relatively slow; my computer takes about 2 hours. If your memory is small, close some non-essential applications
  • Replace the paths and account credentials in the script with your own

The third step is to start the services

1) Start the intent recognition model service and NER model service

The startup commands are in the following two files

  • run_intent_recog_service.bat
  • run_ner_service.bat

Double-click these two files to run them and keep the windows open; on Linux, turn them into shell scripts and run those instead;

However, on my computer the windows always close immediately after double-clicking. If you run into a similar situation, try the following:

Open two Terminals in Pycharm and run the two commands in run_intent_recog_service.bat and run_ner_service.bat respectively;

The window usually closes right away when an error is reported; running the commands in a terminal lets you see the error message and fix it.

2) Start the knowledge graph service

If you did not close the service after building the knowledge graph, this step can be skipped; if it was closed, restart it:

Open a CMD window and run neo4j.bat console to start the service

3) Start the main program

I created a new local.py file myself, and wrote the sanic service code in it

You can try to use the WeChat version first:

Open a Terminal in Pycharm, enter: python itchat_app.py

If you can't log in, try: python local.py, then open http://your IP:12348/swagger/#/default in your browser to interact with the bot my way.

4. Project summary

As far as the current implementation of the project is concerned, it mainly involves the following knowledge points

1) Construction of knowledge graph

  • Extract all entities and relationships from the structured data set and build triples
  • Use Neo4j to create nodes and relationships (edges); this step requires familiarity with some common Neo4j statements (a minimal sketch follows)
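As a minimal sketch of what creating nodes and relationships can look like, here is a Cypher MERGE example through the official neo4j Python driver; the URI, credentials, labels, and sample triple are placeholders, and the project itself may use a different client library:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

with driver.session() as session:
    # MERGE avoids duplicate nodes when the same entity appears in many triples
    session.run(
        "MERGE (d:Disease {name: $disease}) "
        "MERGE (m:Drug {name: $drug}) "
        "MERGE (d)-[:recommand_drug]->(m)",
        disease="heart disease",
        drug="nitroglycerin",
    )
driver.close()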

2) Text classification

Text classification shows up in the project's two intent recognition steps:

  • The first intent recognition: determine whether the user intent is chit-chat, using LR + GBDT multi-model fusion
  • The second intent recognition: identify the specific medical diagnosis intent, using Bert + TextCNN

If you are in a learning mindset, you shouldn't stop at what the project covers. After all, text classification is not only an introductory NLP task but also a major branch of the field. You can extend this project horizontally: try other models on this project's data set to see how they perform, and also try other data sets. In short, get a firm grasp of the text classification task.

3) Sequence labeling

  • Named Entity Recognition

The named entity recognition part of this project uses a BiLSTM + CRF model, with the AC automaton providing supplementary corrections.
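A rough skeleton of a BiLSTM + CRF tagger using the pytorch-crf package; the vocabulary size, dimensions, and tag count are assumptions rather than the project's actual configuration:

import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=128, num_tags=17):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim, num_tags)     # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, token_ids):
        out, _ = self.lstm(self.embed(token_ids))
        return self.fc(out)

    def loss(self, token_ids, tags, mask):
        # pytorch-crf returns the log-likelihood, so negate it to get a loss
        return -self.crf(self._emissions(token_ids), tags, mask=mask)

    def decode(self, token_ids, mask):
        # Viterbi decoding: best tag sequence for each sentence
        return self.crf.decode(self._emissions(token_ids), mask=mask)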

  • Slot filling

Slot filling usually appears in task-oriented bots to collect information from users: first look at which slots the intent has, then run entity recognition on the user input to extract values for those slots. In essence, slot filling is a sequence labeling task.

The related materials I read before were all about theory and never explained how slots are actually filled; through this project I basically figured it out. The key is the config.py file and the semantic_parser() function in modules.py. Having used Rasa before, this immediately reminded me of Rasa's domain file: the whole domain file is really one big dictionary storing all the slot and intent information.

Let's look at how this project implements multi-turn question answering. The following is excerpted from the config.py file:

semantic_slot = {
    "definition": {
        "slot_list": ["Disease"],
        "slot_values": None,
        "cql_template": "MATCH(p:Disease) WHERE p.name='{Disease}' RETURN p.desc",
        "reply_template": "'{Disease}' is like this:\n",
        "ask_template": "Are you asking about the definition of '{Disease}'?",
        "intent_strategy": "",
        "deny_response": "Sorry for not understanding what you mean~"
    },
    "etiology": {
        "slot_list": ["Disease"],
        "slot_values": None,
        "cql_template": "MATCH(p:Disease) WHERE p.name='{Disease}' RETURN p.cause",
        "reply_template": "The cause of '{Disease}' is:\n",
        "ask_template": "Are you asking about the cause of the disease '{Disease}'?",
        "intent_strategy": "",
        "deny_response": "I don't understand what you said, you can ask me in another way~"
    },
    "prevention": {
        "slot_list": ["Disease"],
        "slot_values": None,
        "cql_template": "MATCH(p:Disease) WHERE p.name='{Disease}' RETURN p.prevent",
        "reply_template": "About the disease '{Disease}', you can prevent it like this:\n",
        "ask_template": "Are you asking about preventive measures for the disease '{Disease}'?",
        "intent_strategy": "",
        "deny_response": "Uh~ I don't seem to understand what you are talking about~"
    },
}

Here definition, etiology, and prevention represent three different medical diagnosis intents. Each intent maps to a dictionary whose key-value pairs store information related to that intent. Take "definition" as an example (a small usage sketch follows the list):

  • Slot list
"slot_list": ["Disease"]
  • Slot values: once filled, this becomes a dictionary storing the slot-value pairs for the slots in the slot list
"slot_values": None
  • Graph query statement
"cql_template": "MATCH(p:Disease) WHERE p.name='{Disease}' RETURN p.desc"
  • Reply template
"reply_template": "'{Disease}' is like this:\n"
  • Clarification prompt: if the intent confidence is between 0.4 and 0.8, this template is used to ask the user to clarify the question
"ask_template": "Are you asking about the definition of '{Disease}'?"
  • Reply strategy
"intent_strategy": ""
  • Reply when intent recognition fails
"deny_response": "Sorry for not understanding what you mean~"

Note that every intent here has a "Disease" slot. As long as this slot is filled at the very beginning, you can keep having multi-turn dialogue about the filled slot value afterwards, because part of the slot-filling logic is to carry over the slot value from the previous turn.
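A tiny sketch of that carry-over idea (purely illustrative; the variable names are made up): if the current turn yields no disease entity, the value remembered from the previous turn is reused:

dialogue_state = {"Disease": None}   # persists across turns

def update_disease_slot(ner_entities):
    for entity in ner_entities:
        if entity["type"] == "disease":
            dialogue_state["Disease"] = entity["word"]   # current turn wins
    return dialogue_state["Disease"]                     # may come from an earlier turn

# Turn 1: "what should I do if I have heart disease" -> slot filled
update_disease_slot([{"type": "disease", "word": "heart disease"}])
# Turn 2: "how do I prevent it" -> no entity, previous value is reused
print(update_disease_slot([]))   # -> "heart disease"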

4) Model deployment

Both classification model 2 and the NER model provide their services as interfaces; through this you can get a feel for model deployment.
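The project's model services are started from the .bat files above; purely to illustrate the general idea of serving a model behind an interface, here is a minimal sanic example (sanic is the framework I used for local.py; the route, port, and payload format here are made up):

from sanic import Sanic, response

app = Sanic("ner_service")

@app.post("/ner")
async def ner(request):
    text = request.json.get("text", "")
    # entities = ner_model.predict(text)   # call the real NER model here
    entities = [{"type": "disease", "word": "heart disease"}]   # placeholder result
    return response.json({"text": text, "entities": entities})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)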

For me, the most valuable part of this project is the slot-filling step. Combined with my earlier experience using Rasa, I now understand better how task-oriented dialogue systems are implemented, and I quietly marveled again at how powerful the Rasa framework is.

Where this project can be extended

At the algorithm level, this project already covers two of the four major NLP task families: sequence labeling and text classification.

Currently the knowledge graph serves as the underlying database that stores the answers. You could consider adding text generation: use an NLG model to generate answers and return them to users when appropriate. When to call the NLG model is entirely up to your own design.

Then there is sentence matching, also called sentence-pair relationship judgment. The key is to find a suitable scenario and data set and then integrate it. For specific methods, see this article: 21 classic deep learning sentence-pair relationship models | Code & Tips

In this way, a single question answering project covers a knowledge graph plus the four major NLP tasks, which should be a pretty good project for students learning NLP; it is even richer than the projects in some training courses.

I'll stop here for now; I hope this helps those who need it.

If there are any mistakes in the text, please point them out~

thank you~

Reference

  1. ^https://arxiv.org/abs/2105.04387


GitHub – DeqianBai/KBQA-study: Question Answering System Based on Medical Knowledge Graph

GitHub – z814081807/DeepNER: Tianchi Traditional Chinese Medicine Manual Entity Recognition Challenge Champion Scheme; Chinese Named Entity Recognition; NER; BERT-CRF & BERT-SPAN & BERT-MRC; Pytorch

[Share] Ahocorasick algorithm to filter sensitive words
