An approach to system I/O performance analysis under Linux


How to quickly analyze and locate I/O performance problems

1. File system I/O performance indicators

First, consider storage space usage: capacity, usage, and remaining space. We usually call this disk space usage, but note that it is only the usage reported by the file system, not the actual raw disk usage, because the file system's metadata also takes up disk space. Moreover, if RAID is configured, the usage seen from the file system will differ from the actual disk space consumed, depending on the RAID level.

In addition to the space for the data itself, another easily overlooked resource is inode usage: capacity, usage, and the remaining amount. If the file system contains too many small files, you may run out of inodes even while disk space remains.
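Both can be checked quickly with df. A minimal sketch (the device name /dev/sdb1 is just an example):

# Space usage: size, used, available, mount point
$ df -h /dev/sdb1
# Inode usage: total, used, and free inodes
$ df -i /dev/sdb1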

Second, cache usage, including the page cache, the inode cache, the directory entry (dentry) cache, and the caches of each specific file system. These use memory to temporarily cache file data or file system metadata, thereby reducing the number of disk accesses.
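These caches can be inspected directly from /proc. A minimal sketch (reading /proc/slabinfo requires root):

# Page cache and reclaimable slab memory
$ grep -E '^(Buffers|Cached|SReclaimable)' /proc/meminfo
# Dentry and inode slab caches
$ grep -E 'dentry|inode_cache' /proc/slabinfo
# Or watch the slab caches interactively, sorted by usage
$ slabtop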

Finally, the performance indicators of file I/O: IOPS (r/s, w/s), response time (latency), throughput (B/s), and so on. When examining these indicators, you also need to combine them with the actual read/write workload (file sizes, number of files, I/O types, etc.) for a comprehensive analysis.

2. Disk I/O performance indicators

Disk I/O performance mainly comes down to four core indicators: utilization, IOPS, response time, and throughput, plus one mentioned earlier, the buffer.

When examining these indicators, pay attention to the specific I/O scenario, such as the read/write type (sequential or random), the read/write ratio, the read/write size, and the storage type (RAID or not, RAID level, local or network storage).
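Most of these indicators can be read directly from iostat. A minimal sketch (column names vary slightly between sysstat versions):

# Extended per-device statistics, refreshed every second:
# %util is utilization, r/s and w/s are IOPS,
# r_await/w_await are response times, rkB/s and wkB/s are throughput
$ iostat -d -x 1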

However, there is one big taboo when examining these indicators: comparing I/O indicators across different scenarios.

3. Performance Tools

First category: df, top, iostat, pidstat;

Second category: /proc/meminfo, /proc/slabinfo, slabtop;

Third category: strace, lsof, filetop, opensnoop.
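The third category works at the level of individual processes and files. A minimal sketch (the PID 12345 is a placeholder; filetop and opensnoop come from the bcc toolkit):

# Trace the file-related system calls of a running process
$ strace -f -e trace=file -p 12345
# List the files, devices, and sockets the process has open
$ lsof -p 12345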

4. The relationship between performance indicators and performance tools

5. How to quickly analyze I/O performance bottlenecks

Simply put, it is a matter of finding connections. The various performance indicators are correlated, and to understand those correlations you need to know how each indicator works. When a performance problem occurs, the basic analysis flow is (a command-level sketch follows the list):

First use iostat to discover disk I/O performance bottlenecks;

Then use pidstat to locate the process that causes performance bottlenecks;

Then analyze the behavior of process I/O;

Finally, combine the principles of the application to analyze the source of these I/Os.
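A minimal command-level sketch of this flow (the PID 12345 is a placeholder taken from the pidstat output):

# 1. Find the bottlenecked disk: watch %util, await, and throughput
$ iostat -d -x 1
# 2. Find the process generating the I/O
$ pidstat -d 1
# 3. Inspect that process's I/O behavior
$ strace -f -p 12345
$ lsof -p 12345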

This flow ties together the most commonly used file system and disk I/O performance analysis tools into a single analysis process.

Several ideas for disk I/O performance optimization

1. I/O Benchmark

Before optimizing, we need to know the goal of I/O performance optimization. In other words, what values of the I/O indicators we observe (IOPS, throughput, response time, etc.) would be appropriate? To evaluate the optimization effect objectively and reasonably, first benchmark the disk and the file system to obtain their limiting performance.

fio (Flexible I/O Tester) is a commonly used benchmarking tool for file system and disk I/O. It provides many customizable options that can be used to test the I/O performance of a raw disk or a file system in various scenarios, including different block sizes, different I/O engines, and with or without cache.

There are many options for fio, here are a few commonly used ones:

# random read
fio -name=randread -direct=1 -iodepth=64 -rw=randread -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb
# random write
fio -name=randwrite -direct=1 -iodepth=64 -rw=randwrite -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb
# sequential read
fio -name=read -direct=1 -iodepth=64 -rw=read -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb
# sequential write
fio -name=write -direct=1 -iodepth=64 -rw=write -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

A few of the parameters deserve explanation:

direct, whether to bypass the system cache; direct=1 means bypass it.

iodepth, the upper limit of I/O requests issued simultaneously when using asynchronous I/O (AIO).

rw, the I/O mode: sequential read/write or random read/write.

ioengine, the I/O engine; fio supports synchronous (sync), asynchronous (libaio), memory-mapped (mmap), network, and other I/O engines.

bs, the I/O block size; the default is 4k.

filename, the test target, which can be a disk path or a file path. Note, however, that testing writes against a disk path destroys the file system on that disk, so back up your data before testing.

The following shows an example of fio testing sequential reads:

read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=16.7MiB/s,w=0KiB/s][r=4280,w=0 IOPS][eta 00m:00s]
read: (groupid=0, jobs=1): err= 0: pid=17966: Sun Dec 30 08:31:48 2018
   read: IOPS=4257, BW=16.6MiB/s (17.4MB/s)(1024MiB/61568msec)
    slat (usec): min=2, max=2566, avg= 4.29, stdev=21.76
    clat (usec): min=228, max=407360, avg=15024.30, stdev=20524.39
     lat (usec): min=243, max=407363, avg=15029.12, stdev=20524.26
    clat percentiles (usec):
     | 1.00th=[ 498], 5.00th=[ 1020], 10.00th=[ 1319], 20.00th=[ 1713],
     | 30.00th=[ 1991], 40.00th=[ 2212], 50.00th=[ 2540], 60.00th=[ 2933],
     | 70.00th=[ 5407], 80.00th=[ 44303], 90.00th=[ 45351], 95.00th=[ 45876],
     | 99.00th=[ 46924], 99.50th=[ 46924], 99.90th=[ 48497], 99.95th=[ 49021],
     | 99.99th=[404751]
   bw (KiB/s): min= 8208, max=18832, per=99.85%, avg=17005.35, stdev=998.94, samples=123
   iops : min= 2052, max= 4708, avg=4251.30, stdev=249.74, samples=123
  lat (usec) : 250=0.01%, 500=1.03%, 750=1.69%, 1000=2.07%
  lat (msec) : 2=25.64%, 4=37.58%, 10=2.08%, 20=0.02%, 50=29.86%
  lat (msec) : 100=0.01%, 500=0.02%
  cpu : usr=1.02%, sys=2.97%, ctx=33312, majf=0, minf=75
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwt: total=262144,0,0, short=0,0,0, dropped=0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64
 
Run status group 0 (all jobs):
   READ: bw=16.6MiB/s (17.4MB/s), 16.6MiB/s-16.6MiB/s (17.4MB/s-17.4MB/s), io=1024MiB (1074MB), run=61568-61568msec
 
Disk stats (read/write):
  sdb: ios=261897/0, merge=0/0, ticks=3912108/0, in_queue=3474336, util=90.09%

In this example, focus on a few lines: slat, clat, lat, plus bw and iops. The first three all describe I/O latency, but with differences:

slat refers to the time from when the I/O is submitted until it actually starts executing (submission latency);

clat refers to the time from I/O submission to I/O completion (completion latency);

lat refers to the time from when fio creates the I/O to when the I/O completes (total latency).

Note that for synchronous I/O, submission and completion are a single action, so slat effectively measures the full I/O time and clat is close to 0; with asynchronous I/O, lat is approximately equal to slat + clat.

Next look at bw, the throughput. In the output above, the average throughput is about 16.6 MiB/s (17005 KiB/s / 1024).

Finally, iops is simply the number of I/Os per second; the average IOPS in the output above is 4251.

Usually, an application's reads and writes run in parallel and the size of each I/O varies, so the scenarios above cannot accurately simulate the application's I/O pattern. Fortunately, fio supports I/O replay: first use blktrace to record the I/O accesses on the disk device, then let fio replay the blktrace record.

# Trace disk I/O with blktrace; be sure to trace the disk the application uses
$ blktrace /dev/sdb
# View the files recorded by blktrace
$ ls
sdb.blktrace.0 sdb.blktrace.1
# Convert the trace into a binary file
$ blkparse sdb -d sdb.bin
# Replay the log with fio
$ fio --name=replay --filename=/dev/sdb --direct=1 --read_iolog=sdb.bin


2. I/O performance optimization ideas

application optimization

The application sits at the top of the I/O stack; it can adjust the I/O mode (sequential or random, synchronous or asynchronous) through system calls, and it is also the ultimate source of the data. Several ways to optimize application I/O performance:

First, append writes can be used instead of random writes, reducing addressing overhead and speeding up I/O.

Second, with buffered I/O you can make full use of the system cache and reduce the number of actual disk I/Os.

Third, build your own cache inside the application, or use an external cache system like Redis. This way you control the cached data and its life cycle yourself, and reduce the impact that other applications' cache usage has on you. For example, the C standard library's fopen and fread use the standard library's own buffer to reduce disk operations, whereas direct system calls such as open and read can only use the operating system's page cache and buffers.

Fourth, when you need to frequently read and write the same disk space, you can use mmap instead of read/write to reduce the number of memory copies.

Fifth, in scenarios where synchronous writing is required, try to combine write requests instead of writing each request to disk synchronously, that is, use fsync() instead of O_SYNC.

Sixth, when multiple applications share the same disk, to ensure that no single application can monopolize the I/O, it is recommended to use the I/O subsystem of cgroups to limit the IOPS and throughput of a process or process group.
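A minimal sketch assuming cgroup v2 with the io controller enabled (the cgroup name app1, device number 8:16, limits, and PID are all placeholders; find a device's major:minor with ls -l /dev/sdb):

# Create a cgroup and cap it at 1000 read IOPS and 10 MB/s of writes
$ mkdir /sys/fs/cgroup/app1
$ echo "8:16 riops=1000 wbps=10485760" > /sys/fs/cgroup/app1/io.max
# Move the target process into the cgroup
$ echo 12345 > /sys/fs/cgroup/app1/cgroup.procs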

Finally, when using the CFQ scheduler, you can use ionice to adjust a process's I/O scheduling priority, in particular to raise the I/O priority of core applications. It supports three priority classes: Idle, Best-effort, and Realtime. The latter two also support priority levels 0-7; the smaller the value, the higher the priority.
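A minimal ionice sketch (the command and PID are placeholders):

# Run a backup job in the Idle class so it yields I/O to everything else
$ ionice -c 3 tar -czf /backup/data.tar.gz /data
# Move a running process to Best-effort, level 0 (highest within the class)
$ ionice -c 2 -n 0 -p 12345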

file system optimization

When an application accesses an ordinary file, it reads and writes the disk indirectly through the file system, so there are also many file-system-level optimization methods.

First, choose an appropriate file system for the actual workload. For example, Ubuntu uses ext4 by default and CentOS uses xfs by default. Compared with ext4, xfs supports larger disk partitions and a larger number of files; xfs handles disks larger than 16TB. Its disadvantage is that an xfs partition cannot be shrunk, while an ext4 partition can.

Second, after selecting the file system, optimize its configuration options, including file system features (such as ext_attr, dir_index), journaling modes (journal, ordered, writeback), and mount options (such as noatime). For example, the tune2fs tool can adjust file system features and is also commonly used to view the file system superblock; the journal mode and mount options can be adjusted via /etc/fstab or the mount command.
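A minimal sketch (the device and mount point are examples):

# View the superblock, including the enabled file system features
$ tune2fs -l /dev/sdb1
# Enable the dir_index feature
$ tune2fs -O dir_index /dev/sdb1
# Remount with noatime so reads no longer update access times
$ mount -o remount,noatime /dev/sdb1 /mnt/data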

Third, optimize the file system cache. For example, you can tune pdflush's dirty page writeback interval (dirty_expire_centisecs and dirty_writeback_centisecs) and the dirty page limits (dirty_background_ratio and dirty_ratio). You can also tune how readily the kernel reclaims the dentry and inode caches via vfs_cache_pressure (/proc/sys/vm/vfs_cache_pressure, default 100); the larger the value, the more readily they are reclaimed.
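A minimal sketch (the values are illustrative, not recommendations):

# Start background writeback at 5% dirty pages, block writers at 10%
$ sysctl -w vm.dirty_background_ratio=5
$ sysctl -w vm.dirty_ratio=10
# Make the kernel more willing to reclaim dentry and inode caches
$ sysctl -w vm.vfs_cache_pressure=200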

Finally, when persistence is not required, you can use the in-memory file system tmpfs for better I/O performance. tmpfs stores data directly in memory, not on disk. For example, /dev/shm is a tmpfs configured by default on most Linux systems, sized at half of total system memory by default.
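A minimal sketch (the size and mount point are examples):

# Mount a 1 GB tmpfs for scratch data that need not survive a reboot
$ mount -t tmpfs -o size=1g tmpfs /mnt/scratch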

disk optimization

Data persistence ultimately lands on the physical disk, and the disk is also the bottom layer of the entire I/O stack. From the disk's perspective, there are also many optimization methods:

The first, and easiest, is to replace the HDD with an SSD.

Second, use RAID to combine multiple disks into one logical disk, forming a redundant array of independent disks, which improves both data reliability and access performance.
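A minimal software-RAID sketch using mdadm (the device names and RAID level are examples):

# Combine four disks into a RAID 10 array
$ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]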

Third, choose the most suitable I/O scheduling algorithm based on the disk's characteristics and the application's I/O pattern.
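A minimal sketch (the available scheduler names depend on the kernel version):

# Show the available schedulers; the active one is shown in brackets
$ cat /sys/block/sdb/queue/scheduler
# Switch the scheduler, e.g. to mq-deadline
$ echo mq-deadline > /sys/block/sdb/queue/scheduler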

Fourth, disk-level isolation can be performed for application data. For example, separate disks can be configured for applications with heavy I/O pressure such as logs and databases.

Fifth, in scenarios with many sequential reads, you can increase the disk's read-ahead. The read-ahead size of /dev/sdb can be adjusted in two ways: one, the kernel option /sys/block/sdb/queue/read_ahead_kb (default 128 KB); the other, the blockdev tool, for example blockdev --setra 8192 /dev/sdb. Note that blockdev's unit is 512-byte sectors, so its value is always twice read_ahead_kb.
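A minimal sketch (4096 KB is an example value):

# Set read-ahead via sysfs, in KB
$ echo 4096 > /sys/block/sdb/queue/read_ahead_kb
# The equivalent via blockdev, in 512-byte sectors (8192 x 512 B = 4096 KB)
$ blockdev --setra 8192 /dev/sdb
# Verify the setting
$ blockdev --getra /dev/sdb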

Sixth, optimize the kernel's block device I/O options. For example, adjust the disk queue length via /sys/block/sdb/queue/nr_requests: increasing it appropriately can raise disk throughput, at the cost of higher I/O latency.
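A minimal sketch (256 is an example value):

# Inspect and raise the request queue depth
$ cat /sys/block/sdb/queue/nr_requests
$ echo 256 > /sys/block/sdb/queue/nr_requests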

Finally, hardware errors in the disk itself can also cause sharp drops in I/O performance. For example, check dmesg for hardware I/O failure logs, use tools such as badblocks or smartctl to detect disk hardware problems, and use e2fsck to check for file system errors. If problems are found, repair them with tools such as fsck.
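A minimal sketch of these checks (the devices are examples; unmount a file system before running e2fsck on it):

# Look for I/O error messages from the kernel
$ dmesg | grep -i "I/O error"
# Query the disk's SMART health status
$ smartctl -H /dev/sdb
# Read-only scan for bad blocks, with progress output
$ badblocks -sv /dev/sdb
# Force a full check of an ext4 file system
$ e2fsck -f /dev/sdb1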