Elasticsearch: ES|QL hands-on

In my previous article “Elasticsearch: An Introduction to the ES|QL Query Language”, I gave a brief introduction to Elasticsearch’s latest query language, ES|QL. In today’s article, we use some examples to demonstrate the powerful search and analysis capabilities of ES|QL in detail.

Installation

If you have not installed your own Elasticsearch and Kibana, please refer to the following link to install it:

How to install Elasticsearch on Linux, MacOS and Windows
Kibana: How to install Kibana in the Elastic stack on Linux, MacOS, and Windows

During installation, we choose Elastic Stack 8.x for installation. It is particularly worth pointing out: ES|QL is only available in Elastic Stack 8.11 and later versions. You need to download Elastic Stack 8.11 and later versions to install.

When starting Elasticsearch for the first time, we can see the following output:

We need to write down the password of the Elasticsearch superuser elastic.

Write data

First, we enter the following command in Kibana to create an index called nyc_taxis:

PUT nyc_taxis { "mappings": { "dynamic": "strict", "_source": { "mode": "stored" }, "properties": { "cab_color": { "type": "keyword" }, "dropoff_datetime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }, "dropoff_location": { "type": "geo_point" }, "ehail_fee": { "type": "scaled_float", "scaling_factor": 100 }, "extra": { "type": "scaled_float", "scaling_factor": 100 }, "fare_amount": { "type": "double" }, "improvement_surcharge": { "type": "scaled_float", "scaling_factor": 100 }, "mta_tax": { "type": "scaled_float", "scaling_factor": 100 }, "passenger_count": { "type": "integer" }, "payment_type": { "type": "keyword" }, "pickup_datetime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }, "pickup_location": { "type": "geo_point" }, "rate_code_id": { "type": "keyword" }, "store_and_fwd_flag": { "type": "keyword" }, "surcharge": { "type": "scaled_float", "scaling_factor": 100 }, "tip_amount": { "type": "double" }, "tolls_amount": { "type": "scaled_float", "scaling_factor": 100 }, "total_amount": { "type": "scaled_float", "scaling_factor": 100 }, "trip_distance": { "type": "scaled_float", "scaling_factor": 100 }, "trip_type": { "type": "keyword" }, "vendor_id": { "type": "keyword" }, "vendor_name": { "type": "text" } } } }

Next, we can download the data set file esql.json at the address GitHub – liu-xiao-guo/esql. We can use the following command to write data:

curl --cacert /Users/liuxg/elastic/elasticsearch-8.11.0/config/certs/http_ca.crt -u elastic:o6G_pvRL=8P*7on + o6XH -s -H "Content-Type: application/x -ndjson" -XPOST https://localhost:9200/nyc_taxis/_bulk --data-binary @esql.json

You need to rewrite the path of the above certificate http_ca.crt according to your installation directory. You need to make corresponding adjustments according to the elastic user’s password.

After running the above command:

GET nyc_taxis/_count

The above command returns:

{ "count": 100, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 } }

We can see 100 data. We create a data view for this data:

In this way we have created an index pattern for nyc_taxis.

ES|QL hands-on practice

First let’s do a simple exercise.

Query data

Let’s select the time range and select Try ES|QL:

We found that by default, the query statement in the Query bar is like this:

from nyc_taxis | limit 10

This is equivalent to:

GET nyc_taxis/_search?size=10

For ease of display, we enlarge the editing box:

This will make our content easier to see clearly.

We make the following query:

from nyc_taxis | limit 100 | project pickup_datetime, total_amount

Above, we use project to return the fields we want. Of course we can use keep to do the same thing:

from nyc_taxis | limit 100 | keep pickup_datetime, total_amount

We can also enter the following command in Kibana’s Dev Tools:

POST /_query?format=json { "query": """ from nyc_taxis | limit 100 | keep pickup_datetime, total_amount """ }

We can also change its output format to txt:

POST /_query?format=txt { "query": """ from nyc_taxis | limit 100 | keep pickup_datetime, total_amount """ }

We can also output to csv format:

POST /_query?format=csv { "query": """ from nyc_taxis | limit 10 | keep pickup_datetime, total_amount """ }

We can sort the results using sort:

We can see that the results are sorted in descending order by total_amount.

Above, we can see that for the index nyc_taxis, it does not have a @timestamp time field. So what should we do? We can achieve this through field alias. We execute the following command:

PUT nyc_taxis/_mapping { "properties": { "@timestamp": { "type": "alias", "path": "pickup_datetime" } } }

After executing the above command, we refresh the page again:

Someone may want to ask, what is the corresponding DSL query statement? If you are familiar with DSL, the above statement will have the same result as the following query:

GET nyc_taxis/_search?filter_path=**.hits { "size": 100, "_source": false, "fields": [ "pickup_datetime", "tolls_amount" ], "sort": [ { "total_amount": { "order": "desc" } } ] }

Next, let’s query the results where fare_amount is greater than 20:

from nyc_taxis | where fare_amount > 20

from nyc_taxis | where fare_amount > 20 | where payment_type == "1"

The results shown above are not very clear. We can use keep to view:

from nyc_taxis | where fare_amount > 20 | where payment_type == "1" | keep fare_amount, payment_type

We can add more filters:

from nyc_taxis | where fare_amount > 20 | where payment_type == "1" | where tip_amount > 5 | keep fare_amount, payment_type, tip_amount

We can limit the first 5 results through limit (there are 6 results shown above):

Above I intentionally wrote limit as LIMIT in uppercase letters. We can see that it actually has no impact. In other words, keywords have nothing to do with capitalization. We can also sort the results:

from nyc_taxis | where fare_amount > 20 | where payment_type == "1" | where tip_amount > 5 | LIMIT 5 | Sort tip_amount desc | keep fare_amount, payment_type, tip_amount

The above query is the same as the following DSL query:

GET nyc_taxis/_search { "size": 5, "_source": [ "fare_amount", "payment_type", "tip_amount" ], "query": { "bool": { "filter": [ { "range": { "fare_amount": { "gt": 20 } } }, { "term": { "payment_type": "1" } }, { "range": { "tip_amount": { "gt": 5 } } } ] } }, "sort": [ { "tip_amount": { "order": "desc" } } ] }

Obviously, our ES|QL syntax is simpler and clearer. What’s more, it performs faster!

Next, let’s generate new fields from existing fields. This is the runtime fields we talked about before. We want to calculate how much it costs per mile:

from nyc_taxis | eval cost_per_mile = total_amount/trip_distance | keep total_amount, trip_distance, cost_per_mile

If we use the previous runtime fields to implement, it will look like this:

GET nyc_taxis/_search?filter_path=**.hits { "_source": false, "runtime_mappings": { "cost_per_mile": { "type": "double", "script": { "source": "emit(doc['total_amount'].value/doc['trip_distance'].value)" } } }, "fields": [ "total_amount", "trip_distance", "cost_per_mile" ] }

From the comparison above, we can see that ES|QL is very concise and easy to understand.

For the above query, we can also add filters to filter:

from nyc_taxis | eval cost_per_mile = total_amount/trip_distance | where trip_distance > 10 | keep total_amount, trip_distance, cost_per_mile

We next filter further on the generated field cost_per_mile:

from nyc_taxis | eval cost_per_mile = total_amount/trip_distance | where trip_distance > 10 | keep total_amount, trip_distance, cost_per_mile | where cost_per_mile > 3.5

From the results shown, we can see that we only have two results.

We can sort further:

from nyc_taxis | eval cost_per_mile = total_amount/trip_distance | where trip_distance > 10 | keep total_amount, trip_distance, cost_per_mile | where cost_per_mile > 3.5 | sort cost_per_mile desc

We can also use ES|QL to search on text fields, although there are currently some limitations. Please see the article “Elasticsearch: Limitations of ES|QL” for details. We can refer to some of them:

Text fields behave like keyword fields

Although ES|QL supports text fields, ES|QL does not handle these fields the same way the Search API does. ES|QL queries do not query or aggregate analyzed strings. Instead, the ES|QL query will try to get the subfields of the keyword family type’s text field and query/aggregate them. If the keyword subfield cannot be retrieved, ES|QL will obtain the string from the document’s _source. If the _source cannot be retrieved, such as when using a synthetic source, null is returned.

Note that ES|QL retrieval of the keyword subfield may have unintended consequences. ES|QL queries on text fields are case-sensitive. Additionally, subfields may have been mapped using a normalizer, which converts the original string. Or it might have been mapped using ignore_above, which can truncate the string. None of these mapping operations are applied to ES|QL queries, which may result in false positives or false negatives.

To avoid these problems, best practice is to be clear about the fields you query on, and query keyword subfields rather than text fields.

This will be further improved in future versions.

Next, let’s try to do a text search. for example:

from nyc_taxis | where vendor_name like "Uber" | keep vendor_name

Above, we can see that documents with vendor_name “uber” are not searched. This is because ES|QL searches the text field as a keyword. In order to be able to search Uber and uber documents at the same time, we can use the following method:

from nyc_taxis | where vendor_name like "?ber" | keep vendor_name

We can also do the following searches:

from nyc_taxis | where vendor_name rlike "[uU]ber" | keep vendor_name

In fact, in our documents, we also have some documents whose vendor_name is “Uber special”, “Uber taxi”, etc., but we did not see it in our search above. In order to search for these documents, we You can use the following search terms:

from nyc_taxis | where vendor_name rlike "[uU]ber.*" | keep vendor_name

Next we aggregate the data:

Aggregate data

We want to know what is the maximum passenger_count value for each payment_type. We can do this using stats:

from nyc_taxis | stats max_passengers=max(passenger_count) by payment_type | keep payment_type, max_passengers

This is similar to our previous DSL as follows:

GET nyc_taxis/_search?filter_path=aggregations { "size": 0, "aggs": { "max_passengers": { "terms": { "field": "payment_type" }, "aggs": { "max_count": { "max": { "field": "passenger_count" } } } } } }

The result returned by the above command is:

{ "aggregations": { "max_passengers": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "1", "doc_count": 71, "max_count": { "value": 6 } }, { "key": "2", "doc_count": 27, "max_count": { "value": 5 } }, { "key": "3", "doc_count": 1, "max_count": { "value": 1 } }, { "key": "4", "doc_count": 1, "max_count": { "value": 1 } } ] } } }

Obviously, our ES|QL query will be much simpler and clearer.

We can also add other aggregations. For example, we want to get the number of payment types payment_type in each max_passengers:

from nyc_taxis | stats max_passengers=max(passenger_count) by payment_type | keep payment_type, max_passengers | stats type_count=count(payment_type) by max_passengers

As shown above, in the display area, it only displays the latest gathering status.

We can also do date_histogram aggregation over time:

from nyc_taxis | eval bucket=AUTO_BUCKET(@timestamp, 12, "2014-12-22T00:00:00.00Z", "2015-11-26T00:00:00.00Z") | stats count(*) by bucket

This is similar to our previous DSL:

GET nyc_taxis/_search?filter_path=aggregations { "size": 0, "aggs": { "monthly_count": { "date_histogram": { "field": "@timestamp", "fixed_interval": "30d" } } } }

We can perform statistics on payment_types:

from nyc_taxis | stats payment_types = count(*) by payment_type | sort payment_types desc

This is similar to the following statistics for DSL:

GET nyc_taxis/_search?filter_path=aggregations { "size":0, "aggs": { "payment_types": { "terms": { "field": "payment_type" } } } }

Visualize in Kibana

We can also use ES|QL in visualizations:

We can generate the corresponding visualization ourselves in Discover. Click the save icon above:

This makes it easy to generate our visualization.

We can also edit it:

Okay, that’s all for today. Hopefully we all learn how to use ES|QL as a tool to provide efficiencies in our work in the future.

The knowledge points of the article match the official knowledge files, and you can further learn relevant knowledge. MySQL entry skill treeSQL advanced skillsCTE and recursive query 78094 people are learning the system