Elasticsearch series: Logstash, a powerful tool for log management and data analysis

Elasticsearch is an open source, Lucene-based distributed search and analysis engine designed for use in cloud computing environments, enabling real-time, scalable search, analysis, and exploration of full-text and structured data. It is highly scalable and can search and analyze large amounts of data in a short time.

Elasticsearch is not only a full-text search engine; it also provides distributed multi-user capabilities, real-time analysis, and the ability to process complex search statements, which makes it widely applicable in scenarios such as enterprise search and log and event data analysis.

This article introduces the Elastic Stack component Logstash: what it is, how it works, how to install it, and how to use it in some simple scenarios.

Article directory

        • 1. Introduction and principles of Logstash
          • 1.1. Introduction to Logstash
          • 1.2. Working principle of Logstash
          • 1.3. Logstash execution model
          • 1.4. Download and install Logstash
        • 2. Logstash configuration instructions
          • 2.1. Introduction to Logstash configuration
          • 2.2. Pipeline configuration file-input
          • 2.3. Pipeline configuration file-filtering
          • 2.4. Pipeline configuration file-output
          • 2.5. Settings configuration file
        • 3. Logstash usage examples
          • 3.1. Logstash Hello world
          • 3.2. Log format processing
          • 3.3. Import data into Elasticsearch
1. Introduction and principles of Logstash
1.1. Introduction to Logstash

Logstash is an open source data collection engine with real-time pipeline capabilities that can be used to unify data from disparate sources and send it to the destination of your choice. Logstash supports multiple types of input data, including log files, system message queues, databases, etc. It can perform various conversions and processing on the data, and then send the data to various targets, such as Elasticsearch, Kafka, email notifications, etc.

Key features of Logstash include:

  1. Multiple input sources: Logstash supports multiple types of input data, including log files, system message queues, databases, etc.

  2. Data processing: Logstash can perform various conversions and processing of data, such as filtering, parsing, formatting, etc.

  3. Multiple output targets: Logstash can send data to various targets such as Elasticsearch, Kafka, email notifications, etc.

  4. Plug-in mechanism: Logstash provides a wealth of plug-ins that can easily extend its functionality.

  5. Integration with Elasticsearch and Kibana: Logstash is part of the Elastic Stack (formerly ELK Stack) and has good integration with Elasticsearch and Kibana for easy data search, storage, and visualization.

1.2. Working principle of Logstash

The working principle of Logstash can be divided into three main steps: input, filter and output.

  1. Input: Logstash supports multiple types of input data, including log files, system message queues, databases, etc. In the configuration file, you can specify one or more input sources.
  2. Filter: After the input data is collected, Logstash can perform various transformations and processing on the data. For example, you can use the grok plugin to parse unstructured log data and convert it into structured data. You can also use the mutate plugin to modify data, such as adding new fields, deleting fields, changing field values, etc.
  3. Output: Processed data can be sent to one or more destinations. Logstash supports multiple types of output targets, including Elasticsearch, Kafka, email notifications, etc.

These three steps are executed sequentially in Logstash’s event processing pipeline. Each event (for example, a line of log data) goes through three steps: input, filtering, and output. During the filtering phase, if an event is dropped by the filter, it will not be sent to the output destination.
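
As a minimal illustrative sketch of this three-stage pipeline (the added field name and value below are arbitrary, chosen only for the example), a pipeline configuration file simply declares the three sections:

input {
  stdin { }                                       # read events from standard input
}

filter {
  mutate {
    add_field => { "pipeline_stage" => "demo" }   # illustrative field added during filtering
  }
}

output {
  stdout { codec => rubydebug }                   # print the processed event to standard output
}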

The above is the basic working principle of Logstash. It should be noted that the configuration of Logstash is very flexible, and you can choose the appropriate input sources, filters, and output targets according to actual needs.

1.3. Logstash execution model

Logstash’s execution model mainly includes the following steps:

  1. Start a thread per Input: Logstash starts a thread for each input plug-in, and these threads run in parallel to get data from their respective data sources.

  2. Data is written to a queue: The data obtained by the input plug-ins is written to a queue. By default this is a bounded queue held in memory, so if Logstash stops unexpectedly, the data in the queue is lost. To prevent data loss, Logstash provides two features:

    • Persistent Queues: This feature will store the queue on disk so that even if Logstash stops unexpectedly, the data in the queue will not be lost.

    • Dead Letter Queues: This feature saves events that could not be processed. Note that it is currently only supported for the Elasticsearch output plugin.

  3. Multiple Pipeline Workers process data: Logstash will start multiple Pipeline Workers, each Worker will take a batch of data from the queue, and then execute filters and output plug-ins. The number of Workers and the amount of data processed each time can be set in the configuration file.

This model enables Logstash to efficiently handle large amounts of data, and configurations can be tuned to optimize performance.
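
For example, the queue and worker behaviour described above can be tuned in logstash.yml. The values below are only an illustrative sketch, not recommended settings:

# Use a disk-backed (persistent) queue instead of the default in-memory queue
queue.type: persisted
queue.max_bytes: 1gb

# Keep events that the Elasticsearch output repeatedly rejects
dead_letter_queue.enable: true

# Number of pipeline workers and events processed per batch
pipeline.workers: 4
pipeline.batch.size: 125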

1.4. Logstash download and installation

Logstash can be downloaded from Elastic’s official download page. On this page you can download the various components of the Elastic Stack, including Elasticsearch, Kibana, Logstash, Beats, etc. The page provides download links for the latest version of each component, as well as for historical versions: Past Releases of Elastic Stack Software | Elastic

Here we select Logstash and make sure the selected Logstash version matches the Elasticsearch version we are using.

After selecting the version, click “Download” to start the download; once it completes, unzip the archive to the desired location.

2. Logstash configuration instructions
2.1. Introduction to Logstash configuration

Logstash configuration is mainly divided into two parts: Pipeline configuration file and Settings configuration file.

  1. Pipeline configuration file: This is the core configuration of Logstash, used to define the data processing process, including input, filter and output. Each section can use a variety of plug-ins to accomplish specific tasks. For example, the input part can use the file plugin to read data from a file, the filtering part can use the grok plugin to parse the logs, and the output part can use the elasticsearch plugin to send data to Elasticsearch.
  2. Settings configuration file: This is the global configuration of Logstash, usually set in the logstash.yml file. These configurations include the name of the Logstash instance, data storage path, configuration file path, auto-reload configuration, number of worker threads, etc.

The Settings file is written in YAML, while the Pipeline configuration file uses Logstash’s own configuration syntax; both can be edited with a plain text editor. When Logstash starts, it first reads the Settings configuration file, then loads and runs the Pipeline configuration file.
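
In addition, when a single Logstash instance needs to run more than one pipeline, they can be declared in the pipelines.yml file in the config directory. The pipeline IDs and paths below are only an illustrative sketch:

- pipeline.id: apache-logs
  path.config: "/etc/logstash/conf.d/apache.conf"
  pipeline.workers: 2

- pipeline.id: app-metrics
  path.config: "/etc/logstash/conf.d/metrics.conf"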

2.2. Pipeline configuration file-input

In the Logstash Pipeline configuration file, the input section defines the source of data. Logstash provides a variety of input plugins to read data from various data sources.

Here are some commonly used input plugins:

file: Read data from a file. Commonly used configuration items include path (file path) and start_position (start reading position).

input {
  file {
    path => "/path/to/your/logfile"
    start_position => "beginning"
  }
}

beats: Receive data from Beats clients (such as Filebeat, Metricbeat, etc.). Commonly used configuration items include port (listening port number).

input {
  beats {
    port => 5044
  }
}

http: Receive data via HTTP request. Commonly used configuration items include port (listening port number).

input {
  http {
    port => 8080
  }
}

jdbc: Read data from a database. Commonly used configuration items include jdbc_driver_library (JDBC driver path), jdbc_driver_class (JDBC driver class name), jdbc_connection_string (database connection string), jdbc_user (database username) and jdbc_password (database password).

input {
  jdbc {
    jdbc_driver_library => "/path/to/your/jdbc/driver"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/yourdatabase"
    jdbc_user => "yourusername"
    jdbc_password => "yourpassword"
  }
}

kafka: Read data from Kafka. In the configuration below, the bootstrap_servers parameter specifies the address and port of the Kafka server, and the topics parameter specifies which topics to read data from.

input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["your_topic"]
  }
}

The kafka input plug-in has many other configuration options that you can set according to your needs. For example, you can set the group_id parameter to specify the consumer group, and the auto_offset_reset parameter to control where consumption starts when there is no initial offset or the current offset no longer exists on the broker.
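
For instance, a sketch of a kafka input that sets these two options might look like the following (the consumer group name here is hypothetical):

input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["your_topic"]
    group_id => "logstash_consumers"   # consumer group this instance joins (hypothetical name)
    auto_offset_reset => "earliest"    # start from the earliest offset when none has been committed
  }
}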

Specific configuration items and possible values can be found in Logstash’s official documentation.

The above are some commonly used input plug-ins and their configurations. You can choose appropriate plug-ins and configurations based on actual needs. It should be noted that you can define multiple inputs in a configuration file and Logstash will process all inputs in parallel.

2.3. Pipeline configuration file-filtering

In the Logstash Pipeline configuration file, the filter section defines the rules for data processing. Filter plug-ins can perform various operations on data, such as parsing, transforming, adding and deleting fields, etc.

Here are some commonly used filter plug-ins and their operations:

grok: The grok filter is used to parse unstructured log data and convert it into structured data. It uses pattern matching to parse text, where each pattern is a combination of a name and a regular expression. For example:

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

In this configuration, the grok filter will try to match the contents of the message field to the COMBINEDAPACHELOG pattern, which is a predefined pattern used for parsing Apache logs.
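
grok also lets you combine its built-in patterns to parse custom formats. The sketch below assumes a hypothetical log line such as "2023-10-16 12:00:00 ERROR connection refused"; the target field names are chosen only for illustration:

filter {
  grok {
    # parse a line like "2023-10-16 12:00:00 ERROR connection refused"
    match => { "message" => "%{TIMESTAMP_ISO8601:log_time} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}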

mutate: The mutate filter is used to modify event data, such as adding new fields, deleting fields, changing field values, etc. For example:

filter {
  mutate {
    add_field => { "new_field" => "new_value" }
  }
}

In this configuration, the mutate filter adds a new field named new_field with the value new_value to each event.
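
mutate supports several other operations besides add_field; here is a brief illustrative sketch (the field names are hypothetical):

filter {
  mutate {
    rename => { "old_field" => "renamed_field" }   # rename a field
    convert => { "response" => "integer" }         # change a field's data type
    remove_field => [ "unwanted_field" ]           # drop fields that are no longer needed
  }
}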

date: The date filter is used to parse date and time information and convert it into Logstash’s @timestamp field. For example:

filter {
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

In this configuration, the date filter attempts to match the contents of the timestamp field to the specified date and time format.
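
If needed, the date filter can also write the parsed result to a field other than @timestamp, or assume a specific timezone for timestamps that carry no offset. A minimal sketch (the timezone value is only an example):

filter {
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"        # field to store the parsed date in (this is the default)
    timezone => "Asia/Shanghai"   # timezone to assume when the source time has no offset
  }
}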

The above are some commonly used filter plug-ins and their operations. You can choose appropriate plug-ins and configurations based on actual needs. It should be noted that you can define multiple filters in a configuration file, and Logstash will execute these filters in sequence in the order in the configuration file.

2.4. Pipeline configuration file-output

In the Logstash Pipeline configuration file, the output section defines where the processed data should be sent. Logstash provides a variety of output plugins to send data to various destinations.

Here are some commonly used output plugins:

elasticsearch: Send data to Elasticsearch. Commonly used configuration items include hosts (the address and port of the Elasticsearch server) and index (index name).

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "your_index"
  }
}

file: Write data to a file. Commonly used configuration items include path (file path).

output {
  file {
    path => "/path/to/your/file"
  }
}

stdout: Output data to standard output. Commonly used configuration items include codec (encoding format), and commonly used values include rubydebug (output in Ruby debugging format).

output {
  stdout {
    codec => rubydebug
  }
}

kafka: Send data to Kafka. Commonly used configuration items include bootstrap_servers (the address and port of the Kafka server) and topic_id (topic name).

output {
  kafka {
    bootstrap_servers => "localhost:9092"
    topic_id => "your_topic"
  }
}

The above are some commonly used output plug-ins and their configurations. You can choose appropriate plug-ins and configurations based on actual needs. Note that you can define multiple outputs in a configuration file and Logstash will send each event to all outputs.
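
If you want different events to go to different destinations instead of every output, you can wrap outputs in conditionals. A minimal sketch, assuming a hypothetical type field on the events:

output {
  if [type] == "apache_access" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "apache-access"
    }
  } else {
    stdout { codec => rubydebug }
  }
}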

2.5. Settings configuration file

The Settings configuration file of Logstash is usually logstash.yml, which is the global configuration file of Logstash and is used to set some basic parameters for Logstash operation.

The following are some common configuration items:

  1. node.name: Set the name of the Logstash instance. The default value is the host name of the current host.

    node.name: test
    
  2. path.data: Set the path where Logstash stores persistent data. The default value is the data folder in the Logstash installation directory.

    path.data: /var/lib/logstash
    
  3. path.config: Set the path of the Pipeline configuration file.

    path.config: /etc/logstash/conf.d/*.conf
    
  4. config.reload.automatic: If set to true, Logstash will automatically detect changes to the Pipeline configuration file and reload the configuration.

    config.reload.automatic: true
    
  5. pipeline.workers: Set the number of worker threads to process events, usually set to the number of CPU cores of the machine.

    pipeline.workers: 2
    
  6. pipeline.batch.size: Set the number of events per batch. Increasing this value can improve throughput, but it will also increase processing latency.

    pipeline.batch.size: 125
    
  7. pipeline.batch.delay: Set the maximum wait time between two batches (in milliseconds).

    pipeline.batch.delay: 50
    

The above are some common Logstash Settings configuration items. You can modify these configurations according to actual needs. Specific configuration items and possible values can be found in Logstash’s official documentation.

3. Logstash usage examples
3.1. Logstash Hello world

First, let’s do a very basic Logstash usage example. In this example, Logstash uses standard input as the input source and standard output as the output destination, and does not specify any filters.

  1. Switch to the root directory of Logstash on the command line and execute the following command to start Logstash:
cd logstash-8.10.2
bin/logstash -e 'input { stdin { } } output { stdout {} }'

In this command, the -e parameter specifies the Pipeline configuration directly on the command line: input { stdin { } } means using standard input as the input source, and output { stdout {} } means using standard output as the output destination.

  2. After Logstash starts successfully, you can enter some text in the console, such as “hello world”, and Logstash will process this text as event data.

  3. Logstash automatically adds some fields to each event, such as @version, host and @timestamp, and then outputs the processed event to standard output.

For example, after you enter “hello world” on the console, you may see the following output:

{
    "@version": "1",
    "host": "localhost",
    "@timestamp": "2018-09-18T12:39:38.514Z",
    "message": "hello world"
}

In this example, Logstash simply takes the data from standard input, adds a few simple fields, and writes the data to standard output. This is the most basic way to use it. In fact, Logstash can also do a lot of complex data processing and conversion.

3.2. Log format processing

We can see that although the above example uses standard input as the input source and outputs data to standard output, the log content as a whole is stored in the message field, which is extremely inconvenient for subsequent storage and query. You can specify a grok filter for this pipeline to process the log format.

  1. Create a pipeline configuration file named first-pipeline.conf in the Logstash root directory and add the filter configuration as follows
input { stdin { } }
filter {
    grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
}
output {
   stdout { codec => rubydebug }
}

Here, codec => rubydebug is used to pretty-print the output.

  2. Verify the configuration (note the path to the specified configuration file):
./bin/logstash -f first-pipeline.conf --config.test_and_exit
  3. Start Logstash:
./bin/logstash -f first-pipeline.conf --config.reload.automatic

The --config.reload.automatic option enables automatic reloading of the configuration when the pipeline file changes.

  4. Expected outcome:

Our configuration uses the grok filter to parse Apache logs in the COMBINEDAPACHELOG format. Here is an example log that conforms to this format:

127.0.0.1 - - [28/Sep/2021:10:00:00 +0800] "GET /test.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"

This log records information about an HTTP request, including the client IP address, request time, request method and URL, HTTP version, response status code, number of bytes in the response body, Referer and User-Agent, etc.

We can take this log as input and Logstash will process this log using our configuration. The processed results will be output to standard output in Ruby debugging format.
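
For reference, the event printed in rubydebug format would look roughly like the following; the exact field names and layout depend on the Logstash version and grok pattern set, so treat this as an illustration rather than exact output:

{
       "clientip" => "127.0.0.1",
      "timestamp" => "28/Sep/2021:10:00:00 +0800",
           "verb" => "GET",
        "request" => "/test.html",
    "httpversion" => "1.1",
       "response" => "200",
          "bytes" => "2326",
        "message" => "127.0.0.1 - - [28/Sep/2021:10:00:00 +0800] \"GET /test.html HTTP/1.1\" 200 2326 ...",
       "@version" => "1",
     "@timestamp" => 2023-10-16T08:00:00.000Z
}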

3.3. Import data into Elasticsearch

As an important part of the Elastic Stack, Logstash’s most commonly used function is importing data into Elasticsearch. Doing so is very convenient: you only need to add an elasticsearch output in the pipeline configuration file.

  1. First, you must have a deployed and running Elasticsearch
  2. Add Elasticsearch configuration in first-pipeline.conf as follows
input { stdin { } }
filter {
    grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
}
output {
    elasticsearch {
        hosts => [ "localhost:9200" ]
        index => "logstash"
    }
}
  3. Start Logstash:
./bin/logstash -f first-pipeline.conf --config.reload.automatic

The --config.reload.automatic option enables automatic reloading of the configuration when the pipeline file changes.

  4. Expected outcome:

Our configuration uses the grok filter to parse Apache logs in the COMBINEDAPACHELOG format. Here is an example log that conforms to this format:

"127.0.0.1 - - [28/Sep/2021:10:00:00 + 0800] "GET /test.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"

This log records information about an HTTP request, including the client IP address, request time, request method and URL, HTTP version, response status code, number of bytes in the response body, Referer and User-Agent, etc.

We can take this log as input, and Logstash will process it using our configuration and send the resulting event to the logstash index in Elasticsearch.

Query Elasticsearch to confirm whether the data is uploaded normally:

curl -XGET 'http://localhost:9200/logstash/_search?pretty&q=response:200'
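
If you are unsure whether the index was created at all, you can first list the indices and then fetch a few documents. Two illustrative commands (the index name logstash matches the pipeline configuration above):

# list all indices and confirm that "logstash" exists
curl -XGET 'http://localhost:9200/_cat/indices?v'

# fetch a few documents from the index
curl -XGET 'http://localhost:9200/logstash/_search?pretty&size=5'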

You can also view the indexed data in Kibana.
