How to generate terabytes of test data for Elasticsearch 8.X?

1. Practical issues

  • I just want to insert a large volume of test data, not test performance. Is there an automated way to generate TB-level test data?

  • Are there tools for this? Or something like a ready-made test data set?

    –The question comes from the Elasticsearch Chinese community

    https://elasticsearch.cn/question/13129

2. Problem analysis

Similar questions come up frequently in the community. In real business scenarios, before large-scale data becomes available, simulated data often has to be constructed for purposes such as performance testing.

Real business scenarios are generally not short of data; it comes from sources including but not limited to:

  • Data produced by business systems

  • Data collected from the Internet, devices, etc.

  • Other data-generating scenarios…

Returning to the question itself: how do you construct test data in Elasticsearch 8.X?

One solution offered by a community expert is to reindex the sample data back and forth between two indices, doubling the data volume with each operation.

The sample data referred to here is the following three sample data sets.

[Screenshot: the three sample data sets]
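
For reference, here is a minimal sketch of that doubling idea using the official Python client (the index name and connection details are placeholders, not the expert's exact commands). The key point is to write the documents back without their _id, so Elasticsearch assigns fresh IDs and the count doubles instead of the documents being overwritten:

from elasticsearch import Elasticsearch, helpers

# Placeholder connection details; adjust to your cluster
es = Elasticsearch("https://localhost:9200",
                   basic_auth=("elastic", "password"),
                   verify_certs=False)

INDEX = "bulk_test_index"  # hypothetical index already seeded with sample data

for i in range(5):  # each pass doubles the document count: 2x, 4x, 8x, ...
    # scan() reads through a scroll, i.e. a snapshot of the index, so the
    # copies written during a pass are not read back within that same pass.
    actions = ({"_index": INDEX, "_source": hit["_source"]}
               for hit in helpers.scan(es, index=INDEX))
    helpers.bulk(es, actions)
    es.indices.refresh(index=INDEX)
    print("pass", i + 1, "docs:", es.count(index=INDEX)["count"])

Five passes turn 10,000 seed documents into 320,000; the growth is exponential, which is what makes this approach workable for very large volumes.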

So are there any other solutions? This article gives two options.


3. Solution 1: Construct data with elasticsearch-faker

3.0 elasticsearch-faker tool introduction

elasticsearch-faker is a command-line tool for generating fake data for Elasticsearch.

It uses templates to define the structure of the data to be generated, with placeholders in the template standing in for dynamic content such as random usernames, numbers, and dates.

These placeholders are filled with randomly generated data supplied by the Faker library. When executed, the tool generates documents based on the specified template and uploads them to an Elasticsearch index, where they can be used in testing and development to verify the behavior of Elasticsearch queries and aggregations.
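
The placeholder names map to provider methods of the Faker library. To get a feel for the values they produce, here is a quick standalone check in Python (assuming pip install faker; the method names match the placeholders used in the template in step 2 below):

from faker import Faker

fake = Faker()

# Roughly the fields the template in step 2 asks for
print(fake.user_name())      # e.g. 'smithlauren'
print(fake.random_number())  # e.g. 207
print(fake.date_time())      # a random datetime
print(fake.text())           # a few sentences of filler text
print(fake.word())           # e.g. 'mean'
print(fake.uuid4())          # e.g. 'c4f5c8dc-3d97-44ee-93da-...'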

3.1 Step 1: Install the toolset

https://github.com/thombashi/elasticsearch-faker#installation

pip install elasticsearch-faker


3.2 Step 2: Create the startup script es_gen.sh

#!/bin/bash

# Credentials and the HTTPS CA certificate fingerprint of the target cluster
export ES_BASIC_AUTH_USER='elastic'
export ES_BASIC_AUTH_PASSWORD='psdXXXXX'
export ES_SSL_ASSERT_FINGERPRINT='XXddb83f3bc4f9bb763583d2b3XXX0401507fdfb2103e1d5d490b9e31a7f03XX'

# Generate documents from the Jinja2 template; -n sets the number of documents
elasticsearch-faker --verify-certs generate --doc-template doc_template.jinja2 https://172.121.10.114:9200 -n 1000

Alongside it, create the template file doc_template.jinja2.

The template looks like this:

{
  "name": "{{ user_name }}",
  "userId": {{ random_number }},
  "createdAt": "{{ date_time }}",
  "body": "{{ text }}",
  "ext": "{{ word }}",
  "blobId": "{{ uuid4 }}"
}

3.3 Step 3: Execute the script es_gen.sh

[root@VM-0-14-centos elasticsearch-faker]# ./es_gen.sh
document generator #0: 100%|███████████████████████1000/1000 [00:00<00:00, 1194.47docs/s]
[INFO] generate 1000 docs to test_index

[Results]
target index: test_index
Completed in 10.6 secs
current store.size: 0.8 MB
current docs.count: 1,000
generated store.size: 0.8 MB
average size[byte]/doc: 831
generated docs.count: 1,000
generated docs/secs: 94.5
bulk size: 200

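The run above averaged 831 bytes per document, which gives a rough basis for sizing TB-scale tests. A back-of-the-envelope calculation in Python (assuming that average holds, and ignoring replicas and index overhead):

# How many ~831-byte documents add up to 1 TB of primary data?
avg_doc_bytes = 831          # average reported by elasticsearch-faker above
target_bytes = 1 * 1024**4   # 1 TB

print(f"{target_bytes / avg_doc_bytes:,.0f} documents")  # ~1.3 billion

So reaching TB scale with this tool means raising -n into the billions or invoking the script repeatedly, and replica shards multiply the on-disk footprint further.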

3.4 Step 4: View the imported data in Kibana


"hits": [
      {
        "_index": "test_index",
        "_id": "2ff2971b-bc51-44e6-bbf7-9881050d5b78-0",
        "_score": 1,
        "_source": {
          "name": "smithlauren",
          "userId": 207,
          "createdAt": "1982-06-14T03:47:00.000 + 0000",
          "body": "Risk cup tax. Against growth possible something international our ourselves. Pm owner card sell responsibility oil.",
          "ext": "mean",
          "blobId": "c4f5c8dc-3d97-44ee-93da-2d93be676b8b"
        }
      },
      {<!-- -->
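
The same check can also be scripted outside Kibana. A minimal sketch with the official Python client (connection details are placeholders matching the es_gen.sh script above):

from elasticsearch import Elasticsearch

# Placeholder connection details; adjust to your cluster
es = Elasticsearch("https://172.121.10.114:9200",
                   basic_auth=("elastic", "psdXXXXX"),
                   verify_certs=False)

print(es.count(index="test_index")["count"])                  # expect 1,000
print(es.search(index="test_index", size=1)["hits"]["hits"])  # sample document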

4. Solution 2: Generate random sample data with the Logstash generator plugin


4.1 Preparing the environment

Make sure Elasticsearch 8.X and Logstash 8.X are installed in your environment. Elasticsearch should be configured correctly and running over HTTPS.

In addition, ensure that the relevant Elasticsearch certificates have been correctly configured in Logstash.

4.2 Generate sample data

We’ll use Logstash’s generator input plugin to create the data and the ruby filter plugin to generate UUIDs and random strings.

4.3 Logstash configuration

Create a configuration file named logstash-random-data.conf and fill in the following content:

input {
  generator {
    lines => [
      '{"regist_id": "UUID", "company_name": "RANDOM_COMPANY", "regist_id_new": "RANDOM_NEW"}'
    ]
    count => 10
    codec => "json"
  }
}

filter {
  ruby {
    code => '
      require "securerandom"
      event.set("regist_id", SecureRandom.uuid)
      event.set("company_name", "COMPANY_" + SecureRandom.hex(10))
      event.set("regist_id_new", SecureRandom.hex(10))
    '
  }
}

output {
  elasticsearch {
    hosts => ["https://172.121.110.114:9200"]
    index => "my_log_index"
    user => "elastic"
    password => "XXXX"
    cacert => "/www/elasticsearch_0810/elasticsearch-8.10.2/config/certs/http_ca.crt"
  }
  stdout { codec => rubydebug }
}

4.4 Analyzing the configuration file

1. Input

  a. The generator plugin produces a stream of events.

  b. lines contains a JSON string template that defines the structure of each event.

  c. count specifies the number of events to generate.

  d. codec is set to json so Logstash parses each line as a JSON document.

2. Filter

  a. The ruby filter executes a snippet of Ruby code.

  b. The snippet generates a UUID for regist_id.

  c. company_name and regist_id_new are populated with random hex strings (see the Python sketch after this list).

3. Output

  a. Specifies the Elasticsearch host, index, user credentials, and CA certificate.

  b. The stdout output is for debugging; it prints every event Logstash processes.
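
For illustration, here is what the Ruby snippet produces, sketched in Python (an equivalent of the field values only, not of the Logstash pipeline; secrets.token_hex(10) yields 20 hex characters, just like SecureRandom.hex(10)):

import uuid
import secrets

# Python equivalents of the SecureRandom calls in the ruby filter above
doc = {
    "regist_id": str(uuid.uuid4()),                      # SecureRandom.uuid
    "company_name": "COMPANY_" + secrets.token_hex(10),  # "COMPANY_" + SecureRandom.hex(10)
    "regist_id_new": secrets.token_hex(10),              # SecureRandom.hex(10)
}
print(doc)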

4.5 Running Logstash

After saving the configuration file, run the following command in the terminal to start Logstash and generate data:

$ bin/logstash -f logstash-random-data.conf

When it runs, the rubydebug codec prints each generated event to the console, and the indexed documents can then be viewed in Kibana under the my_log_index index.

With Logstash, we can easily generate large amounts of random sample data for Elasticsearch testing and development. The approach is not only efficient but also flexible: adjusting the template and the count produces data in whatever shape and volume is needed.

5. Summary

Everything above was verified against Elasticsearch 8.10.2.

Beyond the two solutions given in this article, there are many other options: esrally can generate test data, Python's Faker library can be used to construct sample data directly, and sites such as Common Crawl and Kaggle provide large public data sets that can serve as a source of test data.

Have you encountered similar problems? How did you solve them? Feel free to leave a message and share.

Recommended reading

  • First release on the entire network! From 0 to 1 Elasticsearch 8.X pass video

  • Breaking News | Familiar with Elasticsearch 8.X Methodology Awareness Checklist

  • How to learn Elasticsearch systematically?

  • For those Elasticsearch problems that ChatGPT4 cannot solve, please leave them to us!
