Elasticsearch: Using DISSECT and GROK for data processing in ES|QL

Table of Contents

DISSECT or GROK? Or both?

Use DISSECT to process data

Dissect pattern

Terminology

Example

DISSECT key modifiers

Right padding modifier (->)

Append modifier (+)

Append with order modifier (+ and /n)

Named skip key (?)

Reference keys (* and &)

Use GROK to process data

Grok pattern

Regular expressions

Custom patterns

Example

Grok debugger

Limitations

Your data may contain unstructured strings that you want to structure. This makes it easier to analyze the data. For example, a log message might contain IP addresses that you want to extract so that you can find the most active IP addresses.

For developers who have used Logstash or ingest pipelines, DISSECT and GROK will be familiar. You can refer to the following articles:

  • Elasticsearch: Deep understanding of Dissect ingest processor

  • Elasticsearch: Difference between Dissect and Grok processors

  • Logstash: Use dissect to import documents in CSV format

  • Logstash: Grok pattern example for log parsing

Elasticsearch can structure your data at index time or query time. At index time, you can use the Dissect and Grok ingest processors, or the Logstash Dissect and Grok filters. At query time, you can use the ES|QL DISSECT and GROK commands.

DISSECT or GROK? Or both?

DISSECT works by breaking up strings using delimiter-based patterns. GROK works similarly but uses regular expressions. This makes GROK more powerful, but generally slower. DISSECT works well when the data is reliably repeated. When you really need the power of regular expressions, such as when the structure of the text varies from line to line, GROK is the better choice.

You can also use DISSECT and GROK together for mixed use cases, for example when part of a line repeats reliably but the rest of the line does not. DISSECT can deconstruct the repeated section, and GROK can process the remaining field values with regular expressions, as in the sketch below.
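
A minimal sketch of the combined approach (the log line, column names, and patterns here are hypothetical): DISSECT peels off the stable prefix, then GROK applies a regular expression to the leftover text.

ROW a = "1.2.3.4 [2023-01-23T12:15:00.000Z] Connected error=REFUSED"
| DISSECT a "%{clientip} [%{@timestamp}] %{rest}"
| GROK rest "%{WORD:status} error=%{WORD:error}"
| KEEP clientip, @timestamp, status, error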

Use DISSECT to process data

The DISSECT processing command matches a string against a delimiter-based pattern and extracts the specified keys into columns.

For example, the following pattern:

%{clientip} [%{@timestamp}] %{status}

matches log lines of the following format:

1.2.3.4 [2023-01-23T12:15:00.000Z] Connected

and adds the following columns to the input table:

clientip:keyword | @timestamp:keyword        | status:keyword
1.2.3.4          | 2023-01-23T12:15:00.000Z  | Connected
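
To try this pattern in an ES|QL query (with the log line in a hypothetical column a):

ROW a = "1.2.3.4 [2023-01-23T12:15:00.000Z] Connected"
| DISSECT a "%{clientip} [%{@timestamp}] %{status}"
| KEEP clientip, @timestamp, status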

Dissect pattern

A dissect pattern is defined by the parts of the string that will be discarded. In the previous example, the first discarded part is a single space. Dissect finds this space and assigns everything before it to clientip. Next, dissect matches [ and ] and assigns everything between them to @timestamp. Paying special attention to the parts of the string you want to discard will help you build successful dissect patterns.

An empty key %{} or a named skip key can be used to match a value but exclude it from the output.

All matching values are output as the keyword string data type. Use type conversion functions to convert to another data type.

Dissect also supports key modifiers that can change the default behavior of dissect. For example, you can instruct dissect to ignore certain fields, append fields, skip padding, etc.

Terminology

Name            | Description
dissect pattern | Describes the set of fields and delimiters in the text format. Also known as a dissection. A dissection is described using a set of %{} sections: %{a} - %{b} - %{c}
field           | The text from %{ to } (inclusive).
delimiter       | The text between } and the next %{ characters. Any set of characters other than %{, 'not }', or } is a delimiter.
key             | The text between %{ and }, excluding the ?, +, and & prefixes and the ordinal suffix.

Examples:

  • %{?aaa} – the key is aaa
  • %{+bbb/3} – the key is bbb
  • %{&ccc} – the key is ccc

Example

The following example parses a string containing a timestamp, some text, and an IP address:

ROW a = "2023-01-23T12:15:00.000Z - some text - 127.0.0.1"
| DISSECT a "%{date} - %{msg} - %{ip}"
| KEEP date, msg, ip
date:keyword             | msg:keyword | ip:keyword
2023-01-23T12:15:00.000Z | some text   | 127.0.0.1

By default, DISSECT outputs keyword string columns. To convert to other types, use type conversion functions:

ROW a = "2023-01-23T12:15:00.000Z - some text - 127.0.0.1"
| DISSECT a "%{date} - %{msg} - %{ip}"
| KEEP date, msg, ip
| EVAL date = TO_DATETIME(date)
msg:keyword | ip:keyword | date:date
some text   | 127.0.0.1  | 2023-01-23T12:15:00.000Z

DISSECT key modifiers

Key modifiers can change the default behavior of dissect. Key modifiers are placed to the left or right of the key name and are always inside %{ and }. For example, %{+keyname->} has the append and right padding modifiers.

Dissect key modifiers

Modifier  | Name               | Position       | Example                     | Description
->        | Skip right padding | (far) right    | %{keyname1->}               | Skips any repeated characters to the right
+         | Append             | left           | %{+keyname} %{+keyname}     | Appends two or more fields together
+ with /n | Append with order  | left and right | %{+keyname/2} %{+keyname/1} | Appends two or more fields together in the order specified
?         | Named skip key     | left           | %{?ignoreme}                | Skips the matched value in the output. Same behavior as %{}
* and &   | Reference keys     | left           | %{*r1} %{&r1}               | Sets the output key as the * value and the output value as the & value

Right padding modifier (->)

The right padding modifier skips any repeated characters to the right of the key, so a pattern can match even when the amount of padding varies.

Right padding modifier example:

Pattern

%{ts->} %{level}

Input

1998-08-10T17:15:42,466          WARN

Result

  • ts = 1998-08-10T17:15:42,466
  • level = WARN
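
The same example as an ES|QL query (input in a hypothetical column a):

ROW a = "1998-08-10T17:15:42,466          WARN"
| DISSECT a "%{ts->} %{level}"
| KEEP ts, level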

The right padding modifier can be used with an empty key to help skip unwanted data. For example, the same input string, but wrapped in brackets, requires an empty right-padded key to achieve the same result.

Right padding modifier example with empty key:

Pattern

[%{ts}]%{->}[%{level}]

Input

[1998-08-10T17:15:42,466]            [WARN]

Result

  • ts = 1998-08-10T17:15:42,466
  • level = WARN
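
And the equivalent ES|QL query:

ROW a = "[1998-08-10T17:15:42,466]            [WARN]"
| DISSECT a "[%{ts}]%{->}[%{level}]"
| KEEP ts, level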

Append modifier (+)

Dissect supports appending two or more results together in the output. Values are appended left to right. An append separator can be specified. In this example, the append separator is defined as a space.

Append modifier example:

Pattern

%{+name} %{+name} %{+name} %{+name}

Input

john jacob jingleheimer schmidt

Result

  • name = john jacob jingleheimer schmidt
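
In ES|QL, the separator is set with the DISSECT command's APPEND_SEPARATOR option. A sketch of this example (input in a hypothetical column a):

ROW a = "john jacob jingleheimer schmidt"
| DISSECT a "%{+name} %{+name} %{+name} %{+name}" APPEND_SEPARATOR=" "
| KEEP name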

Append with order modifier (+ and /n)

Dissect supports appending two or more results together in the output. Values are appended in the order defined by /n. An append separator can be specified. In this example, the append separator is defined as a comma.

Append with order modifier example:

Pattern

%{+name/2} %{+name/4} %{+name/3} %{+name/1}

Input

john jacob jingleheimer schmidt

Result

  • name = schmidt,john,jingleheimer,jacob
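
The same example as an ES|QL query, with a comma as the APPEND_SEPARATOR:

ROW a = "john jacob jingleheimer schmidt"
| DISSECT a "%{+name/2} %{+name/4} %{+name/3} %{+name/1}" APPEND_SEPARATOR=","
| KEEP name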

Named skip key (?)

Dissect supports ignoring matches in the final result. This can be done with an empty key %{}, but for readability it may be desirable to give the skipped key a name.

Named skip key modifier example:

Pattern

%{clientip} %{?ident} %{?auth} [%{@timestamp}]

Input

1.2.3.4 - - [30/Apr/1998:22:00:52 +0000]

Result

  • clientip = 1.2.3.4
  • @timestamp = 30/Apr/1998:22:00:52 +0000
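
As an ES|QL query (input in a hypothetical column a):

ROW a = "1.2.3.4 - - [30/Apr/1998:22:00:52 +0000]"
| DISSECT a "%{clientip} %{?ident} %{?auth} [%{@timestamp}]"
| KEEP clientip, @timestamp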

Reference keys (* and &)

Dissect supports using parsed values as the key/value pair for structured content. Imagine a system that only partially logs key/value pairs. Reference keys allow you to preserve that key/value relationship.

Reference key modifier example:

Pattern

[%{ts}] [%{level}] %{*p1}:%{&p1} %{*p2}:%{&p2}

Input

[2018-08-10T17:15:42,466] [ERR] ip:1.2.3.4 error:REFUSED

Result

  • ts = 2018-08-10T17:15:42,466
  • level = ERR
  • ip = 1.2.3.4
  • error = REFUSED
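
A sketch in ES|QL, assuming reference keys behave the same here as in the dissect ingest processor (input in a hypothetical column a):

ROW a = "[2018-08-10T17:15:42,466] [ERR] ip:1.2.3.4 error:REFUSED"
| DISSECT a "[%{ts}] [%{level}] %{*p1}:%{&p1} %{*p2}:%{&p2}"
| KEEP ts, level, ip, error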

Use GROK to process data

The GROK processing command matches a string against a regular expression-based pattern and extracts the specified keys into columns.

For example, the following pattern:

%{IP:ip} \[%{TIMESTAMP_ISO8601:@timestamp}\] %{GREEDYDATA:status}

matches log lines of the following format:

1.2.3.4 [2023-01-23T12:15:00.000Z] Connected

and adds the following columns to the input table:

@timestamp:keyword        | ip:keyword | status:keyword
2023-01-23T12:15:00.000Z  | 1.2.3.4    | Connected

Grok pattern

The syntax for a grok pattern is %{SYNTAX:SEMANTIC}.

The SYNTAX is the name of the pattern that matches your text. For example, 3.44 is matched by the NUMBER pattern and 55.3.244.1 is matched by the IP pattern. The syntax is how you match.

The SEMANTIC is the identifier you give to the piece of text being matched. For example, 3.44 could be the duration of an event, so you might simply call it duration. Further, the string 55.3.244.1 might identify the client making a request.

By default, matched values are output as the keyword string data type. To convert a semantic's data type, suffix it with the target data type. For example %{NUMBER:num:int}, which converts the num semantic from a string to an integer. Currently the only supported conversions are int and float. For other types, use type conversion functions.

For an overview of the available patterns, see GitHub. You can also retrieve a list of all patterns with the REST API, as shown below.
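
For example, the grok processor API of Elasticsearch returns the list of built-in patterns:

GET _ingest/processor/grok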

Regular expressions

Grok is based on regular expressions. Any regular expression is also valid in grok. Grok uses the Oniguruma regular expression library. For the complete supported regular expression syntax, see the Oniguruma GitHub repository.

Note: Special regular expression characters such as [ and ] need to be escaped with \. For example, in the previous pattern:

%{IP:ip} \[%{TIMESTAMP_ISO8601:@timestamp}\] %{GREEDYDATA:status}

In ES|QL queries, the backslash character itself is a special character and needs to be escaped with another \. For this example, the corresponding ES|QL query becomes:

ROW a = "1.2.3.4 [2023-01-23T12:15:00.000Z] Connected"
| GROK a "%{IP:ip} \[%{TIMESTAMP_ISO8601:@timestamp}\] %{GREEDYDATA:status}"

Custom patterns

If grok doesn’t have the pattern you need, you can use the Oniguruma syntax for named capture, which lets you match a piece of text and save it as a column:

(?<field_name>the pattern here)

For example, the queue id for a postfix log is a 10 or 11 character hexadecimal value. This can be captured into a column called queue_id using:

(?<queue_id>[0-9A-F]{10,11})
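
A runnable ES|QL sketch of this custom pattern (the sample value and the column name a are made up for illustration):

ROW a = "4AE91C39D8F"
| GROK a "(?<queue_id>[0-9A-F]{10,11})"
| KEEP queue_id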

Example

The following example parses a string containing timestamps, IP addresses, email addresses, and numbers:

ROW a = "2023-01-23T12:15:00.000Z 127.0.0.1 some.email@foo.com 42"
| GROK a "%{TIMESTAMP_ISO8601:date} %{IP:ip} %{EMAILADDRESS:email} %{NUMBER:num}"
| KEEP date, ip, email, num
date:keyword             | ip:keyword | email:keyword      | num:keyword
2023-01-23T12:15:00.000Z | 127.0.0.1  | some.email@foo.com | 42

By default, GROK outputs keyword string columns. int and float types can be converted by appending :type to the semantic in the pattern. For example %{NUMBER:num:int}:

ROW a = "2023-01-23T12:15:00.000Z 127.0.0.1 some.email@foo.com 42"
| GROK a "%{TIMESTAMP_ISO8601:date} %{IP:ip} %{EMAILADDRESS:email} %{NUMBER:num:int}"
| KEEP date, ip, email, num
date:keyword             | ip:keyword | email:keyword      | num:integer
2023-01-23T12:15:00.000Z | 127.0.0.1  | some.email@foo.com | 42

For other type conversions, use type conversion functions:

ROW a = "2023-01-23T12:15:00.000Z 127.0.0.1 some.email@foo.com 42"
| GROK a "%{TIMESTAMP_ISO8601:date} %{IP:ip} %{EMAILADDRESS:email} %{NUMBER:num:int}"
| KEEP date, ip, email, num
| EVAL date = TO_DATETIME(date)
ip:keyword | email:keyword      | num:integer | date:date
127.0.0.1  | some.email@foo.com | 42          | 2023-01-23T12:15:00.000Z

Grok debugger

To write and debug grok patterns, you can use the Grok Debugger. It provides a UI for testing patterns against sample data. Under the hood, it uses the same engine as the GROK command.

Limitations

The GROK command does not support configuring custom patterns or multiple patterns. The GROK command is not subject to the grok watchdog settings.
