Elasticsearch data modeling

Data modeling

Nesting type: Nested

The nested type is a specialized version of the object type that lets arrays of complex objects be indexed so that each object can be queried independently of the others. Elasticsearch has no concept of inner objects. Therefore, when ES stores a plain object field, it flattens the object hierarchy into a simple list of field names and values.

For example:

PUT my-index-000001/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" : "Smith"
    },
    {
      "first" : "Alice",
      "last" : "White"
    }
  ]
}

After this document is indexed, the JSON objects in the user array are stored internally in the following flattened form:

{
  "group" : "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" : [ "smith", "white" ]
}

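The user.first and user.last fields are flattened into multi-valued fields, and the association between each first and last value is lost. A query for first = Alice and last = Smith would therefore match this document, even though no such user exists.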
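The loss of association can be reproduced with a small in-memory sketch. This is not how Elasticsearch is implemented; the `flatten` helper is hypothetical, and lowercasing is only a crude stand-in for text analysis. It shows why a check against the flattened fields matches a user that does not exist, while a per-object check (what nested preserves) does not.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Flatten objects and arrays of objects into dotted, multi-valued fields."""
    flat = defaultdict(list)
    for key, value in doc.items():
        name = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into inner objects and merge their values.
                for sub_name, sub_values in flatten(item, name + ".").items():
                    flat[sub_name].extend(sub_values)
            else:
                # Lowercasing is a stand-in for analysis, not the real analyzer.
                flat[name].append(str(item).lower())
    return dict(flat)


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}

flat = flatten(doc)

# "first = alice AND last = smith" evaluated on the flattened fields.
flattened_match = "alice" in flat["user.first"] and "smith" in flat["user.last"]

# The same condition evaluated object by object.
per_object_match = any(
    u["first"].lower() == "alice" and u["last"].lower() == "smith"
    for u in doc["user"]
)

print(flat)
print(flattened_match, per_object_match)
```

The order of values inside a multi-valued field carries no meaning here; only membership matters.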

Use nested to create mappings for complex types:

PUT <index_name>
{
  "mappings": {
    "properties": {
      "<nested_field_name>": {
        "type": "nested"
      }
    }
  }
}

Query:

GET <index_name>/_search
{
  "query": {
    "nested": {
      "path": "<nested_field_name>",
      "query": {
        ...
      }
    }
  }
}
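As an illustration of the query template above, the following sketch builds a nested query body for the my-index-000001 example, assuming the user field has been mapped as nested. The helper name `nested_query` is hypothetical, and the bool/must inner query is just one possible choice for the elided part; only the resulting JSON would be sent to Elasticsearch.

```python
import json


def nested_query(path, must_clauses, score_mode="avg"):
    """Build a nested query body whose inner query is a bool/must over `path`."""
    return {
        "query": {
            "nested": {
                "path": path,
                "score_mode": score_mode,
                "query": {"bool": {"must": must_clauses}},
            }
        }
    }


# Both conditions must hold inside the same nested object.
body = nested_query(
    "user",
    [
        {"match": {"user.first": "Alice"}},
        {"match": {"user.last": "Smith"}},
    ],
)

print(json.dumps(body, indent=2))
```

With a nested mapping, this query should not match document 1, because Alice and Smith belong to different objects in the user array.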

Options:

path: The path to the nested object to search

score_mode: Score calculation method

- avg (default): Use the average relevance score of all matching sub-objects.

- max: Use the highest relevance score among all matching sub-objects.

- min: Use the lowest relevance score among all matching sub-objects.

- none: Do not use the relevance score of the matched sub-object. This query assigns a score of 0 to the parent document.

- sum: Sum the relevance scores of all matching sub-objects.
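The score_mode options can be summarized with a short sketch. It only imitates the arithmetic described in the list above; the function name `combine_scores` and the sample scores are made up for illustration.

```python
def combine_scores(child_scores, score_mode="avg"):
    """Combine the scores of matching nested objects into one parent score."""
    if not child_scores:
        raise ValueError("no matching nested objects, so the parent does not match")
    if score_mode == "avg":
        return sum(child_scores) / len(child_scores)
    if score_mode == "max":
        return max(child_scores)
    if score_mode == "min":
        return min(child_scores)
    if score_mode == "sum":
        return sum(child_scores)
    if score_mode == "none":
        # Child scores are ignored; the parent gets a score of 0.
        return 0.0
    raise ValueError(f"unknown score_mode: {score_mode}")


scores = [1.0, 2.0, 3.0]  # hypothetical scores of three matching sub-objects
results = {
    mode: combine_scores(scores, mode)
    for mode in ("avg", "max", "min", "none", "sum")
}
print(results)
```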

Parent-child relationship: Join

The join data type is a special field that creates parent/child relationships between documents in the same index. The relations section defines the set of possible relations within the documents, each relation being a parent name and a child name. A parent/child relationship can be defined as follows:

PUT <index_name>
{
  "mappings": {
    "properties": {
      "<join_field_name>": {
        "type": "join",
        "relations": {
          "<parent_name>": "<child_name>"
        }
      }
    }
  }
}

Usage scenarios

The join type should not be used like a join in a relational database. Both has_child and has_parent queries have a significant negative impact on query performance, and the join field triggers the building of global ordinals.

The only scenario where join makes sense is when the index data contains a one-to-many relationship in which one entity far outnumbers the other. For example: one teacher has ten thousand students.

Note

When indexing parent-child data, the routing parameter must be passed to specify which shard stores the document, because a parent document and its child documents must be on the same shard. The same routing value must therefore also be provided when getting, deleting, or updating a child document.

Each index may contain only one join field mapping.

An element can have multiple children but only one parent.

New relations can be added to an existing join field.

A child can also be added to an existing element, but only if that element is already a parent.
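The routing requirement in the notes above can be sketched as follows. The field name `relation`, the relation names `teacher` and `student`, the shard count, and the helper functions are all hypothetical. The crc32 hash is only a stand-in, since Elasticsearch uses its own hash function internally. The point is that a child routed by its parent ID always lands on the parent's shard.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing, num_shards=NUM_SHARDS):
    """Pick a shard from a routing value (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def parent_doc(doc_id, join_field, parent_name, source):
    """A parent document; by default it is routed by its own ID."""
    return {
        "_id": doc_id,
        "routing": doc_id,
        "_source": {**source, join_field: {"name": parent_name}},
    }


def child_doc(doc_id, parent_id, join_field, child_name, source):
    """A child document; it must be routed by the parent's ID."""
    return {
        "_id": doc_id,
        "routing": parent_id,
        "_source": {
            **source,
            join_field: {"name": child_name, "parent": parent_id},
        },
    }


teacher = parent_doc("t1", "relation", "teacher", {"title": "Teacher A"})
students = [
    child_doc(f"s{i}", "t1", "relation", "student", {"title": f"Student {i}"})
    for i in range(3)
]

shards = {shard_for(d["routing"]) for d in [teacher, *students]}
print(shards)  # a single shard number, shared by the parent and all children
```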

Data modeling

Concept

A data model is an abstraction that describes a certain phenomenon or state in the real world. For example, we used an FSA to describe Teacher Zhou’s day, which abstracts the real world into a model. The real world has many important relationships: a blog post has comments, a bank account has multiple transactions, a customer has multiple bank accounts, an order has multiple order details, and a file directory has multiple files and subdirectories.

Relational database relationships:

Each entity (or row, in the relational world) can be uniquely identified by a primary key.

Entities are normalized. The data for a unique entity is stored only once, and related entities store only its primary key. An entity’s data therefore has to be changed in only one place.

Entities can be joined at query time, which allows searches across entities.

Changes to a single entity are atomic, consistent, isolated, and durable. (More details can be found in ACID Transactions.)

Most relational databases support ACID transactions across multiple entities.

But relational databases have their limitations, including limited support for full-text search. Joining entities at query time is expensive, and the more joins there are, the more expensive the query becomes. Joins across servers are so expensive that they are practically unusable, yet the amount of data a single server can hold is limited.

Elasticsearch, like most NoSQL databases, treats the world as flat. An index is a flat collection of independent documents, and a single document should contain all the information needed to decide whether it matches a search request.

Data changes to a single document in Elasticsearch are ACIDic, while transactions involving multiple documents are not. When a transaction partially fails, index data cannot be rolled back to the previous state.

Flatness has the following advantages:

The indexing process is fast and lock-free.

The search process is fast and lock-free.

Because each document is independent of each other, large-scale data can be distributed across multiple nodes.

But relationships are still very important. At some point, we need to bridge the gap between the flat world and the real-world relational model. Four techniques are commonly used to manage relational data in Elasticsearch:

Application-side joins

Data denormalization

Nested objects

Parent/child relationships

Objects and Entities

The relationship between objects and entities is the mapping between the real world and the data model. The POJO domain model commonly used in Java development is an example of this relationship:

Hierarchical domain model specification:

DO (Data Object): Corresponds one-to-one to the database table structure; the DAO layer passes the data source object upward.

DTO (Data Transfer Object): Data transfer object, the object that the Service or Manager layer passes outward.

BO (Business Object): Business object, an object output by the Service layer that encapsulates business logic.

AO (Application Object): Application object, an abstract reusable object model between the Web layer and the Service layer. It is very close to the presentation layer and has a low degree of reuse.

VO (View Object): Display layer object, usually the object that the Web layer passes to the template rendering engine.

POJO (Plain Ordinary Java Object): In this manual, POJO refers specifically to simple classes with only setter/getter/toString, including DO/DTO/BO/VO, etc.

Query: Data query object; each layer receives query requests from the layer above. Note that a query with more than two parameters must be wrapped in a query object; passing it in a Map is prohibited.

Domain model naming convention:

Data object: xxxDO, xxx is the name of the data table.

Data transfer object: xxxDTO, xxx is the name related to the business field.

Display object: xxxVO, xxx is generally the name of the web page.

POJO is the collective name for DO/DTO/BO/VO, and it is forbidden to name it xxxPOJO.

Data modeling process

Concept: requirements => abstraction. The actual user requirements are abstracted into a data model. For example, when we store an inverted list, we abstract the requirement of “storing the inverted list” into the FST data model.

Logic: abstract => concrete. Still taking “storing the inverted list” as an example, once the FST model has been designed, we need to turn the abstraction into concrete code and objects, so that the implementation becomes something tangible.

Physical: concrete => implementation. As above, once the logical model exists, the concrete objects and attributes are written out as real data files and saved to disk.

Meaning

My personal summary, which is not exhaustive:

Development: Simplify the development process to increase efficiency

Product: Improve data storage efficiency and improve query performance

Management: Sufficient preparation in the early stage to reduce the possibility of problems in the later stage

Cost: combine various factors to reduce overall operation and management costs

Contents included in data modeling

Relationship processing (index relations):

- Application-side joins: We can (partially) emulate a relational database by implementing joins in the application. The main advantage of application-side joins is that the data stays normalized: a user’s name only has to be changed in one place, the `user` document. The disadvantage is that additional queries must be run at search time in order to join the documents.

- Denormalized data: The way to get the best search performance out of Elasticsearch is to denormalize the data deliberately at index time. Keeping redundant copies of data in each document that needs them removes the need for joins at query time.

- Sparse fields: Avoid documents with sparse fields

- Concurrency issues: global lock, document lock, tree lock (exclusive lock, shared lock), optimistic lock, pessimistic lock
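The trade-off between application-side joins and denormalization in the list above can be illustrated with plain Python data structures standing in for indices. All names and sample documents here are hypothetical. The sketch only counts how many lookups a search needs and how many documents a rename has to touch.

```python
# Normalized layout: posts store only the user's ID.
users = {"1": {"name": "John Smith"}}
posts_normalized = [{"id": "p1", "title": "Relationships", "user_id": "1"}]

# Denormalized layout: each post carries a redundant copy of the user's name.
posts_denormalized = [
    {"id": "p1", "title": "Relationships", "user": {"id": "1", "name": "John Smith"}}
]


def search_app_side_join(name):
    """Two steps: find matching user IDs, then find posts by those IDs."""
    user_ids = {uid for uid, user in users.items() if user["name"] == name}
    return [p["id"] for p in posts_normalized if p["user_id"] in user_ids]


def search_denormalized(name):
    """One step: the name is already inside every post."""
    return [p["id"] for p in posts_denormalized if p["user"]["name"] == name]


def rename_user(user_id, new_name):
    """Return how many denormalized posts had to be rewritten."""
    users[user_id]["name"] = new_name  # the normalized layout needs only this
    touched = 0
    for post in posts_denormalized:
        if post["user"]["id"] == user_id:
            post["user"]["name"] = new_name
            touched += 1
    return touched


before_join = search_app_side_join("John Smith")
before_denorm = search_denormalized("John Smith")
rewritten = rename_user("1", "John Doe")
after_join = search_app_side_join("John Doe")
after_denorm = search_denormalized("John Doe")
print(before_join, before_denorm, rewritten, after_join, after_denorm)
```

The search on the denormalized layout needs one pass, while the rename has to rewrite every post that embeds the user. This is why redundant fields should be ones that rarely change.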

Object type: Put simply, this is the wide-table approach: field redundancy produces a coarse-grained index that takes full advantage of the flat structure, at the expense of indexing performance and flexibility. Prerequisites for use: the redundant fields should rarely change, and the approach is best suited to a small number of relationships. When the business database itself is not denormalized, a generic incremental synchronization solution cannot easily sync the data into ES as a secondary index; custom, business-specific development is required, in which the application performs the joins and assembles the entities.

Nested objects: Indexing performance and query performance cannot both be maximized, so a trade-off must be made. Nested documents embed the related entities inside a single document (similar to the one-to-many hierarchical structure of JSON). This sacrifices indexing performance (a change to any attribute requires reindexing the whole document) in exchange for query performance: the related entities are returned together with the document. It is best suited to a small number of relationships. Nested documents cannot be reached with ordinary queries; the dedicated forms (nested query, nested aggregation, nested sort, etc.) must be used. In many scenarios, the complexity of using nested documents lies in the indexing stage, where the relationships have to be organized and assembled.

Parent-child relationship: Parent-child documents sacrifice some query performance in exchange for indexing performance, and are suitable for one-to-many relationships. Parent and child entities are represented by two kinds of documents that are indexed independently of each other. The parent-child document ID mapping is stored in doc values, which are fast when the mapping fits entirely in memory and scale by spilling to disk when the mapping is very large. As an alternative to parent-child, the terms query supports a terms lookup, which takes the list of related entity IDs from a field of another document: if a document with a known ID (possibly in another index) holds those values in some field, the query can fetch them directly instead of listing them inline. For details, see the example in the official documentation, which uses the follower relationship stored on a user document to find the Weibo posts published by that user’s followers.
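The terms lookup mentioned above can be sketched in two parts: a helper that builds the query body, and an in-memory imitation of how the lookup is resolved. The index names `users` and `posts`, the field names, and the sample documents are hypothetical. The `index`, `id`, and `path` keys are the terms lookup parameters of the terms query.

```python
def terms_lookup_query(field, index, doc_id, path):
    """Build a terms query that fetches its values from another document."""
    return {
        "query": {
            "terms": {field: {"index": index, "id": doc_id, "path": path}}
        }
    }


def resolve_terms_lookup(query, indices, target_docs):
    """Imitate the lookup: read the values at `path`, then filter the targets."""
    field, lookup = next(iter(query["query"]["terms"].items()))
    source = indices[lookup["index"]][lookup["id"]]
    values = source[lookup["path"]]  # `path` is a top-level field in this sketch
    wanted = set(values if isinstance(values, list) else [values])
    return [doc for doc in target_docs if doc.get(field) in wanted]


# A user document that stores the IDs of its followers.
indices = {"users": {"2": {"followers": ["1", "3"]}}}

# Posts, each tagged with the ID of the user who published it.
posts = [
    {"id": "a", "user": "1"},
    {"id": "b", "user": "2"},
    {"id": "c", "user": "3"},
]

query = terms_lookup_query("user", "users", "2", "followers")
hits = [doc["id"] for doc in resolve_terms_lookup(query, indices, posts)]
print(hits)
```

Only the posts written by followers of user 2 are returned, without the client having to fetch the follower list first.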

Extensibility:

- Shard allocation awareness

- Index templates

- Index lifecycle management

- Hot-cold architecture

- Shard management and planning

- Rollover indices and aliases

- Cross-cluster search