A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram

Authors: Ming-Liang Zhang, Fei Yin, Cheng-Lin Liu

Published in: IJCAI 2023

Conclusion

The authors propose a new diagram-text fusion solver, PGPSNet, which incorporates textual clauses parsed from diagrams, and build a large-scale, finely annotated GPS dataset, PGPS9K. Through effective modality representation and efficient modality fusion, PGPSNet fully exploits fundamental structural and semantic information for geometric reasoning. Furthermore, interpretable solution programs and data augmentation supply GPS with key geometric knowledge, such as geometric theorems and geometric representations. Experimental results demonstrate the potential of the neural solver.

Abstract

Geometric problem solving (GPS) is a high-level mathematical reasoning task that requires multimodal fusion and the ability to apply geometric knowledge. Neural solvers show great potential in GPS, but remain weak in diagram representation and modality fusion.
The authors convert the diagram into basic textual clauses (basic textual components; that is, the diagram is turned into a text description of what is implicit in it, such as which segments exist and where intersections lie) to effectively describe the characteristics of diagrams, and propose a new neural solver (a neural solver uses neural networks to solve mathematical problems), called PGPSNet, to fuse the multimodal information effectively. Combining structural and semantic pre-training, data augmentation, and self-limited decoding, PGPSNet is endowed with rich knowledge of geometric theorems and geometric representations, thus facilitating geometric understanding and reasoning.

Introduction

A geometric problem consists of a textual problem and a geometric diagram: the textual problem describes the conditions in natural language and sets the reasoning goal (the known conditions and the question are expressed in text), while the geometric diagram carries rich structural and semantic information beyond the textual problem to aid in solving.

For reference, traditional symbolic geometric solvers parse the diagram and the textual problem into a unified formal language. Based on knowledge of geometric theorems, they usually perform symbolic reasoning through path search and condition matching, generating new condition states until the search target is found (solving by continual trial). Although symbolic solvers have better interpretability than neural solvers, they are carefully designed with complex rules and are difficult to scale. Moreover, some symbolic solvers solve problems slowly with many redundant steps, and their search process does not match human solutions.

Neural geometric solvers co-embed the diagram and the textual problem with hybrid encoders and self-supervised auxiliary tasks, and generate solutions in a sequential fashion. Although neural solvers achieve impressive results with simple pipelines, recent work shows that there is still a large performance gap compared to symbolic solvers. One of the main reasons is that neural solvers employ a framework similar to general multimodal reasoning tasks on natural images, which cannot exploit diagrams effectively. Since the primitives in a geometric diagram are slender, overlapping, and related by complex spatial relationships, feature-map-based, region-proposal-based, or image-patch-based frameworks cannot extract refined features, and may even destroy structural and semantic information.

[Figure: a geometric problem with its diagram, the textual clauses parsed from the diagram (bottom), and the theorem-based solution program (top)]

PGPSNet is a multimodal learning framework whose inputs include not only the diagram and the textual problem, but also textual clauses parsed from the diagram. It generates theorem-based, interpretable solution programs to solve geometric problems.

Considering the insufficient representation of diagrams in neural solvers and the difficulty of cross-modal fusion, the authors represent geometric diagrams with textual clauses, including structural clauses and semantic clauses, as shown at the bottom of the figure above. Compared with visual images, clauses are highly syntactically structured and naturally carry less redundant information. Textual clauses are better suited to describing fine-grained and multi-level information in geometric diagrams. Unlike the formal languages in symbolic solvers, which consist of complex multi-order logic forms, the authors' clauses describe only basic relations: structural clauses describe the connection relations between geometric primitives (the composition of points, lines, and circles), while semantic clauses describe the semantic relations between non-geometric primitives and geometric primitives (length information on edges, angle information, etc.). The authors do not pursue higher-level relations built by geometric rules, as those are not in line with the goals of neural solvers; they hope the model itself can match geometric patterns and construct high-level relations from basic ones.

The dataset PGPS9K contains 9,022 geometric problems paired with diagrams. Compared with existing datasets, PGPS9K annotates both the diagram and the solving process. Considering the complexity of GPS solutions, the authors designed a new theorem-based annotation form for the solution program. This solution program describes a problem-solving process in which each step is an application of a theorem (axiom), as shown at the top of the figure above, rather than specific solution steps built from basic arithmetic operations (that is, the style of operation changes: it is no longer a step-by-step arithmetic derivation, but a sequence of theorem applications). The authors' solver thus carries rich geometric knowledge, has better interpretability, and also reduces the burden of model learning.

To take full advantage of the multimodal information, the authors propose a new diagram-text fusion neural solver, named PGPSNet, as shown in the figure above. In addition to the geometric diagram and the textual problem, PGPSNet incorporates the structural and semantic clauses and generates solution programs. To fuse the different parts of the textual modalities, the authors also propose a structural and semantic pre-training strategy based on the masked language modeling (MLM) task, which improves the model's understanding of structural and semantic content through explicit modeling. To overcome the limitations of data scale and pre-training corpora, the authors design five data augmentation strategies based on the diversity and equivalence of geometric representations. In addition, a self-limited GRU decoder is constructed to shrink the representation space and the search space of operands, speeding up training and inference.

Article contributions: (1) The authors use textual clauses to efficiently express refined structural and semantic information in geometric diagrams. (2) They propose a new neural solver, PGPSNet, that fuses multimodal information through structural and semantic pre-training, data augmentation, and self-limited decoding. (3) They construct a large-scale dataset, PGPS9K, annotated with fine-grained diagram annotations and interpretable solution programs. (4) PGPSNet significantly outperforms existing neural solvers.

Dataset

Features of PGPS9K:

(1) Theorem-based: solving problems in PGPS9K requires applying knowledge of geometric theorems to carry out algebraic calculations and finally obtain numerical results;

(2) Relying on diagrams: more than 90% of the problems must be solved using the diagram, because necessary conditions, such as variable contents and geometric structures, are conveyed visually rather than in text;

(3) Abstraction: diagrams are composed of basic geometric elements (points, lines, circles) and non-geometric elements (text, symbols); apart from abstract geometric conditions, the textual problems involve no complex semantic scenarios;

(4) Fine-grained: problems sharing the same diagram differ in conditions or goals, and subtle differences in the textual problem often lead to completely different solutions;

(5) Redundant conditions: many conditions in the semantic clauses or textual problems are not needed for the actual solution.

These five features allow PGPS9K to focus on the challenges of geometric reasoning and mitigate text-induced bias.

Annotation Form

Diagram annotation extracts the structural and semantic information in diagrams, while the solution program defines the problem-solving steps. Diagram annotations employ primitive-level labels, which include primitive content and primitive relationships represented as tuples. These are then transformed into two kinds of textual representation: the structural representation and the semantic representation.

[Figure: example of the diagram annotation and the corresponding annotation forms]

Structural clauses are limited to the connection relations between geometric primitives (not including auxiliary lines and similar information, only the relations originally present in the diagram); they describe which points lie on a line or on a circle, with the points arranged in order. These connection relations reveal the most basic structure of the diagram (they may not be mentioned in the textual problem but can be read off the diagram). Semantic clauses describe, in natural language, the basic relations between geometric primitives and non-geometric primitives. These relations are integral to problem solving and complement each other across the diagram and the textual problem (image and text together complete the description of the problem). The definition and form of the textual clauses remain open; the general design principle is to describe the complete characteristics of the diagram to support GPS. The toy listing below illustrates what such parsed clauses look like.
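As a toy illustration (assembled from the running example used elsewhere in these notes, not copied verbatim from the paper), the clauses parsed for one problem might be grouped like this:

```python
# Illustrative textual clauses for the running example (hypothetical grouping).
structural_clauses = [
    "line B C D",          # points B, C, D are collinear, in this order
    "○A lieson E B D",     # points E, B, D lie on circle A
]
semantic_clauses = [
    "BD ⊥ EA on C",        # segments BD and EA are perpendicular, meeting at C
    "CA = 3", "EC = 2",    # length conditions read from the diagram
]
```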
The annotation provides a geometric solution process with several derivation steps, built from 34 operators (OP) and 55 operands (PN), where one operator and its associated operands form a step. Each operator corresponds to a geometric theorem, and the operands involved are ordered according to the corresponding theorem formula (that is, in the order in which the theorem uses them).
Operands are divided into four types: problem variables N, which appear in the textual problem or the semantic clauses; process variables V, which are generated during the solving process; arguments ARG, the alphabet unknowns [a-z]; and constants C. For example, the Pythagorean theorem relates the legs and the hypotenuse of a right triangle; its formula is a^2 + b^2 = c^2, so it is expressed as "Gougu(a, b, c)".
Compared with other annotation methods, this annotation eliminates the basic arithmetic operators (+, -, *, /), and thus has the advantages of structure, knowledge guidance, and interpretability (as shown in the figure above), making the solution program more concise and reducing the difficulty of model learning. Furthermore, the process variable V is introduced both as an unknown within a step and as a transfer variable between steps, unifying the forward and reverse applications of one theorem. For example, with the Pythagorean theorem, "Gougu(V, *, *)" and "Gougu(*, *, V)" can be set to solve for a leg and for the hypotenuse, respectively. In conjunction with this annotation form, the authors also created a program executor that performs symbolic algebraic operations on multiple unknowns according to the theorem formulas, using the SymPy Python library; a minimal sketch follows.
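A minimal sketch of a theorem-based executor for the "Gougu" operator, assuming SymPy as named above (the function name and interface are illustrative, not the authors' code):

```python
import sympy as sp

def execute_gougu(a, b, c):
    """Apply a^2 + b^2 = c^2 and solve for the single unknown operand."""
    eq = sp.Eq(a**2 + b**2, c**2)
    unknown = next(x for x in (a, b, c) if isinstance(x, sp.Symbol))
    return sp.solve(eq, unknown)

# "Gougu(V0, 2, 3)": solve for the leg V0 given leg 2 and hypotenuse 3.
V0 = sp.Symbol("V0", positive=True)
print(execute_gougu(V0, sp.Integer(2), sp.Integer(3)))  # [sqrt(5)]
```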

PGPSNet Model

[Figure: the PGPSNet model architecture]

To fully exploit the multimodality of geometric problems, the authors propose a new neural solver, PGPSNet, shown in the figure above. The input of PGPSNet includes the geometric diagram D and the question text T, where the question text consists of the structural clauses T_stru, the semantic clauses T_sem, and the textual problem T_prob, i.e., T = {T_stru, T_sem, T_prob} = {t_i}_{i=1..n}. The diagram is encoded with a convolutional neural network (CNN), while the question text is processed by the structurally and semantically pre-trained language model. The tokens of the two modalities are then concatenated and fed to a bidirectional GRU encoder for fused encoding. Next, a self-limited GRU decoder decodes them into the solution program Y. Finally, the program executor evaluates the solution program to obtain the numerical result of the geometric problem (the whole calculation process is produced first, and the numerical computation is performed at the end). A structural sketch of this pipeline is given below.
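A minimal structural sketch of this pipeline in PyTorch. Everything here is an illustrative assumption rather than the authors' implementation: module sizes are arbitrary, the pre-trained language model is replaced by a plain embedding, and attention and the self-limitation mechanism are omitted:

```python
import torch
import torch.nn as nn

class PGPSNetSketch(nn.Module):
    def __init__(self, text_vocab=600, program_vocab=89, d=256):
        super().__init__()
        # CNN: coarse global diagram features, reduced to a single visual token.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d))
        self.embed = nn.Embedding(text_vocab, d)   # stand-in for the pre-trained LM
        self.encoder = nn.GRU(d, d // 2, bidirectional=True, batch_first=True)
        self.decoder = nn.GRUCell(d, d)
        self.proj = nn.Linear(d, program_vocab)    # 34 operators + 55 operands
        self.program_embed = nn.Embedding(program_vocab, d)

    def forward(self, diagram, text_ids, max_steps=8):
        vis = self.cnn(diagram).unsqueeze(1)                # (B, 1, d) visual token
        txt = self.embed(text_ids)                          # (B, n, d) text tokens
        H, _ = self.encoder(torch.cat([vis, txt], dim=1))   # fused context H^E
        h = H.mean(dim=1)                                   # crude init; attention omitted
        inp = self.program_embed(torch.zeros(diagram.size(0), dtype=torch.long))
        steps = []
        for _ in range(max_steps):                          # greedy autoregressive decoding
            h = self.decoder(inp, h)
            tok = self.proj(h).argmax(dim=-1)
            steps.append(tok)
            inp = self.program_embed(tok)
        return torch.stack(steps, dim=1)                    # (B, max_steps) program tokens

model = PGPSNetSketch()
program = model(torch.randn(2, 3, 128, 128), torch.randint(0, 600, (2, 40)))
print(program.shape)  # torch.Size([2, 8])
```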

Structural and Semantic Pre-training

[Figure: the process of structural and semantic pre-training]

In the figure, [M] denotes the MASK token. The class tags [G], [N], [ARG], [P], and [ANGID] represent general tokens (commonly used linguistic or mathematical symbols, analogous to an "other" entity tag), variables, parameters, point IDs, and angle IDs, respectively. The part labels [S], [C], and [T] mark tokens of the structure, condition, and target parts, respectively.

Long, trivial, and unordered texts still pose great difficulties for modality fusion and semantic understanding. The authors therefore design a structural and semantic pre-training method based on the masked language modeling task for text-modality learning, as shown in the figure above.

Masked LM task: 30% of the text tokens are replaced with the mask token, leaving the rest of the text unchanged. The model is trained to recover the masked tokens via a unified text-generation approach. For example, from the structural clauses "line BCD" and "line ECA", it can be inferred that the mask token in the semantic clause "BD ⊥ EA on [M]" is "C", which helps the model learn geometric knowledge about line intersections. In some cases the model cannot infer the masked content exactly, but geometric knowledge still narrows the candidates: taking the textual question "Find [M]D." as an example, the mask token is likely "C" or "B" according to the structural clause "line BCD". In summary, pre-training equips the model with higher-level geometric cognition, an essential skill for GPS.
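A toy illustration of the masking step (only the 30% ratio comes from the text above; the tokenization and sampling details are assumptions):

```python
import random

def mask_tokens(tokens, ratio=0.3, mask="[M]"):
    """Replace a random `ratio` fraction of tokens with the mask token."""
    out = list(tokens)
    n = max(1, round(ratio * len(out)))
    for i in random.sample(range(len(out)), k=n):
        out[i] = mask
    return out

clause = "B D ⊥ E A on C".split()
print(mask_tokens(clause))  # e.g. ['B', 'D', '⊥', '[M]', 'A', 'on', '[M]']
```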

[Formula: input embedding of the language model]

Each token in the text is assigned a category label and a part label. The category labels indicate a token's semantic category among five classes: general tokens, variables, parameters, point IDs, and angle IDs. The part label indicates which part of the question the token belongs to; the question is divided into three parts: structure, condition, and goal. The input embedding e_i of the language model combines the token embedding with the embeddings of the position encoding, the category label, and the part label, as in the formula above.
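The original formula is only available as an image; a plausible reconstruction, assuming a simple additive combination of the four embeddings described above, is:

```latex
e_i = e_i^{\mathrm{token}} + e_i^{\mathrm{pos}} + e_i^{\mathrm{cls}} + e_i^{\mathrm{part}}
```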

Encoder and Decoder

In the model, the CNN encoder only extracts coarse-grained global visual features of the diagram, such as the geometric style, to quickly narrow down possible operations and accelerate model convergence. The GRU encoder takes the diagram encoded as a visual token, integrates it with the text tokens enhanced by the structural and semantic language model, and produces the fused encoding context as the encoder output: H^E = {h_i^E}_{i=1..n+1}.

A sequential solution program is generated in an autoregressive manner by a self-limited GRU decoder. Using the self-limited decoder has two benefits, captured by the following two formulas:

[Formula: candidate token encoding e(y) of the decoder input]

In the candidate encoding of the decoder input, V_V, V_C, V_N, and V_ARG denote the vocabularies of process variables, constants, problem variables, and augmentation variables, and loc(y, T) is the position of token y in the question text T (note that these positions are one-dimensional indices, not 2D coordinates).
One benefit of the self-limited decoder is that it reduces the input token embedding space and the amount of computation. In a conventional attention decoder, the input embeddings form a randomly initialized (trainable) matrix; the self-limited decoder instead copies the embeddings of in-text tokens directly from the encoder output. That is, the decoder input at the current time step is no longer a free word-embedding vector but a vector selected from the encoder output, which provides more contextual information and helps predict the output of the next time step.
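A plausible reconstruction of the candidate encoding, assuming a two-case definition consistent with the description above (learned embeddings for symbolic tokens, copied encoder states for in-text tokens):

```latex
e(y) =
\begin{cases}
E(y), & y \in V_V \cup V_C \\
h^{E}_{\mathrm{loc}(y,\,T)}, & y \in V_N \cup V_{ARG}
\end{cases}
```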

[Formula: probability of the generated token y]

The probability of the generated token y, with y ∈ {V_V, V_C, V_N ∩ T, V_ARG ∩ T}, is computed with a scoring function Score, where h^D is the hidden output of the decoder, c is the context vector generated from H^E by the attention mechanism, and e(y) is the candidate token embedding.
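A plausible reconstruction of that prediction, assuming a standard softmax over the restricted candidate set:

```latex
P(y) = \frac{\exp\!\big(\mathrm{Score}(h^D, c, e(y))\big)}
            {\sum_{y'} \exp\!\big(\mathrm{Score}(h^D, c, e(y'))\big)},
\qquad y, y' \in V_V \cup V_C \cup (V_N \cap T) \cup (V_{ARG} \cap T)
```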

In the self-limited decoder, each output token is generated based on the encoder output and the tokens generated so far. When generating each token, the self-limited GRU selects the most probable token from a predefined candidate set according to the current context; this set typically contains basic mathematical symbols and the tokens of the current problem. By restricting each step's choice to this candidate set, the self-limited decoder shrinks the token search space, reducing the decoder's computational complexity and improving its accuracy. One common way to implement such a restriction is sketched below.
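A toy sketch of restricting the output distribution to a candidate set by masking logits (a common implementation pattern, assumed here rather than taken from the paper):

```python
import torch

logits = torch.randn(89)                 # scores over the full program vocabulary
allowed = torch.tensor([0, 3, 7, 42])    # candidate tokens valid at this step
mask = torch.full_like(logits, float("-inf"))
mask[allowed] = 0.0                      # keep allowed scores, suppress the rest
probs = torch.softmax(logits + mask, dim=-1)
print(probs.nonzero(as_tuple=True)[0])   # tensor([ 0,  3,  7, 42])
```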

Data Augmentation

Token Replacement

Token replacement: replaceable tokens include three types: points, angle marks, and augmentation items. Once a token is changed, all identical tokens in the textual clauses and the textual question must be replaced uniformly. For example, replacing point B with point V yields the new text "line V C D", "○A lieson E V D", "VD ⊥ EA on C", "Find VD".
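A toy sketch of this replacement on the example above (a real implementation would operate on tokens rather than raw strings):

```python
def replace_token(texts, old, new):
    """Uniformly rename one primitive label across all clauses and the question."""
    return [t.replace(old, new) for t in texts]

problem = ["line B C D", "○A lieson E B D", "BD ⊥ EA on C", "Find BD"]
print(replace_token(problem, "B", "V"))
# ['line V C D', '○A lieson E V D', 'VD ⊥ EA on C', 'Find VD']
```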

Connection Rotation

Connection rotation: a connection relation in the structural clauses can be re-expressed by changing the order of its points while keeping the meaning unchanged. For example, "line B C D" is equivalent to the reversed "line D C B"; likewise, "E B D" and "E D B" lying on circle A express the same relation.

Clauses Shuffle

Clause shuffle: the semantic clauses are shuffled, which assigns new IDs to the problem variables, and the corresponding solution program is modified at the same time. When the semantic clauses are reordered as "CA = 3 (N0), BD ⊥ EA on C, EC = 2 (N1)", the solution program becomes "… Gougu N0 V1 V0 …".
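A toy sketch of the re-mapping this requires (the clause and program contents are illustrative):

```python
import random

clauses = ["EC = 2", "CA = 3"]             # reading order assigns N0, N1
program = ["Gougu", "N0", "V0", "N1"]      # illustrative program, not from the paper

order = list(range(len(clauses)))
random.shuffle(order)
shuffled = [clauses[i] for i in order]     # new reading order
remap = {f"N{orig}": f"N{new}" for new, orig in enumerate(order)}
program = [remap.get(tok, tok) for tok in program]
print(shuffled, program)
```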

Diagram Flip

Diagram flip: the visual text in the diagram is negligible, since its content has already been parsed into the semantic clauses. Therefore, a flipped or rotated diagram is equivalent to the original diagram for the purposes of the problem.

Supplement

Textual Clauses

[Table 4: templates for textual clauses]

Templates for textual clauses. The symbols '&', '*', '$', '%' represent point, line, variable, and angle IDs, respectively.

Textual clauses are linguistic descriptions of primitive relations in geometric diagrams. They include structural clauses and semantic clauses: structural clauses describe the connections among geometric primitives (points, lines, and circles), and semantic clauses describe the relations between geometric primitives and non-geometric primitives (text and symbols). Table 4 shows the complete templates of textual clauses, including 3 kinds of structural clauses and 6 kinds of semantic clauses. These textual clauses are fundamental and necessary, and can be translated directly from the relational tuples of the diagram annotation without using advanced geometric rules. The design of the clauses is open, but the neural solver does not pursue high-level logical clauses, although they might contribute to problem solving.

Solution Program

[Table: operator and operand sets of the solution program]

The table above defines the building blocks of the solution program: 34 operators and 55 operands, where the operands comprise 11 problem variables, 7 process variables, 26 augmentation items, and 11 constants.

Assembly details are listed in the table above. Solution programs for geometric problems still face issues similar to those of general mathematical problems. (1) Uncertainty caused by commutative operands: the operands in some theorem formulas are commutative; for example, in the Pythagorean theorem a^2 + b^2 = c^2, the two legs are interchangeable. The annotation regularizes the solution program by specifying two levels of precedence for commutative operands: the first is the category level, "augmentation items > process variables > problem variables > constants", and the second is the index level, in ascending order. (2) Uncertainty in the order of equivalent steps: some calculation steps have no inherent order, so the authors manually keep the order of predefined steps the same for the same problem type. (3) Multiple solutions: some geometric problems can be solved in several ways; the authors choose the solution corresponding to the most concise solution program. A toy sketch of the operand regularization is given below.
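A toy sketch of regularizing commutative operands by that two-level precedence (the token grammar here is an assumption):

```python
def category_rank(op):
    """Category precedence: augmentation items > process vars > problem vars > constants."""
    if op.islower():                      # alphabet arguments a-z (augmentation items)
        return 0
    return {"V": 1, "N": 2, "C": 3}[op[0]]

def canonicalize(operands):
    return sorted(operands, key=lambda op: (category_rank(op), op))

print(canonicalize(["N1", "V0", "N0"]))   # ['V0', 'N0', 'N1']
```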

Related Knowledge

Neural Solvers

Neural solvers refer to methods that use neural networks to solve certain mathematical problems. By training a neural network to learn specific mathematical regularities, it can be applied to problems such as differential equations, inverse problems, and optimization. Compared with traditional numerical methods, neural networks can offer better performance and efficiency on nonlinear problems.

Self-Limited GRU Decoder

The self-limited GRU decoder converts the contextual representation generated by the encoder into a sequential solution program. It is an autoregressive model based on the Gated Recurrent Unit (GRU), whose gating mechanism controls the flow of information, helping to avoid overfitting and the generation of meaningless tokens. Unlike a traditional attention-based decoder, the matrix used to embed the decoder's in-text input tokens does not need to be initialized randomly; it is taken directly from the encoder output, which reduces the learning cost and improves the model's learning efficiency.

Symbolic Geometric Solvers

In symbolic computing and symbolic geometry solving, a "symbol" usually refers to algebraic notation: letters, numbers, or other marks that represent variables. In symbolic computing, variables and expressions are represented by symbols rather than specific values, and mathematical operations are performed on those symbols. For example, the polynomial p(x) = x^3 + 2x^2 + 3x + 4 uses the symbol x to represent the variable; a specific value is substituted only at evaluation time, e.g., p(2) = 26.
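A small SymPy illustration of this distinction between symbolic manipulation and numeric evaluation:

```python
import sympy as sp

x = sp.Symbol("x")
p = x**3 + 2*x**2 + 3*x + 4   # the polynomial stays symbolic
print(sp.factor(p - 26))       # algebra on symbols: (x - 2)*(x**2 + 4*x + 11)
print(p.subs(x, 2))            # numeric evaluation: 26
```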

Region Proposal-Based Methods

Region-proposal-based methods are a family of object detection approaches in deep learning. Object detection is a computer-vision task whose goal is to determine the locations and classes of objects in an image or video.

Object detection methods based on region proposals usually consist of two stages: region proposal and object classification. In the region proposal stage, a network generates candidate boxes (bounding boxes) covering image regions that may contain objects. Typically, the region proposal network uses a convolutional neural network (CNN) to extract image features and produces the candidate boxes through methods such as sliding windows or selective search.

In the object classification stage, each candidate box is classified by another pre-trained network (such as a convolutional neural network) to determine whether it contains a target object and to assign it to the corresponding category.
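A minimal runnable example using torchvision's Faster R-CNN, a representative region-proposal-based detector (weights are left uninitialized here just to show the interface):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two-stage detector: a region proposal network plus a box classification head.
model = fasterrcnn_resnet50_fpn(weights=None).eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 256, 256)])   # a list of CHW images
print(predictions[0].keys())  # dict_keys(['boxes', 'labels', 'scores'])
```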
