Troubleshooting ideas and steps for abnormalities and freezes in Linux servers

Table of Contents

  • Foreword
  • 1. View memory usage
  • 2. View disk usage
  • 3. top command
    • 3.1 jmap analyzes heap memory configuration information and usage
    • 3.2 jstack analyzes the execution of threads
    • 3.3 jstat to view the percentage of the heap in each area
  • 4. Other instructions
  • Summary

Foreword

Linux servers can misbehave or freeze for many reasons. Common causes include:
1. CPU usage is too high: the system responds slowly or even freezes. Common causes include processes stuck in infinite loops and CPU-intensive tasks.

2. Memory usage is too high: when physical memory runs short, the system falls back on the swap partition, which slows responses and can lead to freezes. Common causes include memory leaks and processes that use too much memory.

3. Insufficient network bandwidth: network transfers slow down or stall. Common causes include network congestion and traffic that exceeds the capacity of the link.

4. Disk I/O is too high: the system responds slowly or even freezes. Common causes include slow disk read/write speed and file system corruption.

5. Too many processes: when too many processes run at once, they compete for system resources, which slows responses and can lead to freezes.

6. Improper system configuration: common causes include poorly chosen kernel parameters, insufficient hardware, and incorrect network configuration.

Common commands such as top and ps help analyze these problems, find the root cause, and guide targeted optimization.

1. View memory usage

Display system memory usage, including physical memory, swap space, and kernel buffers and cache:
free -h

Observe memory usage continuously, with new output every 3 seconds:
free -h -s 3

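Focus mainly on how much memory remains. The free column counts only completely unused memory; the available column, which also counts reclaimable cache, is the better estimate of what new processes can use.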
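
As a minimal sketch of checking the remaining memory in a script, the helper below computes available memory as a percentage of total memory. It assumes the kernel exposes MemAvailable in /proc/meminfo (Linux 3.14 and later); the name mem_available_pct is hypothetical, not a standard tool.

```shell
# Print available memory as a whole-number percentage of total memory.
# Reads text in /proc/meminfo format on stdin.
# mem_available_pct is an illustrative helper name, not a standard command.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) {
                printf "%d\n", avail * 100 / total
            } else {
                exit 1
            }
        }
    '
}

# Typical use on a live system:
#   mem_available_pct < /proc/meminfo
```

Reading from stdin keeps the helper easy to try against a saved copy of /proc/meminfo from the affected server.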

The size of a jar package has no direct relationship to the memory the server uses, and a heap size specified for java -jar (for example with -Xmx) is not directly reflected in free -h, because the JVM typically takes physical memory from the operating system only as the heap is actually used.
The running memory of a Java service is the memory the Java process occupies while it runs.
While a Java program runs, the JVM loads classes and resources from the jar package into memory, so the jar size can affect startup time and, to some extent, memory usage.

2. View disk usage

Logs, jar package files, and database backup files accumulate over time. When they fill the disk, system performance drops and services can crash.

df -h


Check the usage of the root filesystem (/).
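
A rough sketch of automating this check: the helper below lists every filesystem whose usage is at or above a threshold. It parses the POSIX output format of df -P; the name full_filesystems and the default threshold of 80 are assumptions for illustration, and mount points containing spaces are not handled.

```shell
# Print "<mount point> <use%>" for each filesystem at or above a threshold.
# Reads `df -P` output on stdin; the threshold (default 80) is the first argument.
# full_filesystems is an illustrative helper name.
full_filesystems() {
    awk -v limit="${1:-80}" '
        NR > 1 {
            pct = $5              # Capacity column, e.g. "95%"
            sub(/%/, "", pct)     # strip the percent sign
            if (pct + 0 >= limit + 0) {
                print $6, pct "%"
            }
        }
    '
}

# Typical use on a live system:
#   df -P | full_filesystems 90
```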

3. top command

When in doubt, run top. Common parameters:

  • top -p 8080,8081 monitors only the processes with the given PIDs
  • top -c displays the complete command line

After starting top without parameters, you can also use its interactive commands:

1 – Toggle between the CPU summary and the status of each logical CPU
f/F – Add or remove display fields
k – Terminate a process. top prompts for the PID of the process and for the signal to send. The default, signal 15, asks the process to exit normally; if that fails, use signal 9 to force the process to end. This command is disabled in secure mode.
u – Show only the processes of a given user
n – Set the number of processes displayed in the process list
s – Change the screen refresh interval, in seconds
P – Sort the process list by CPU usage (%CPU)
c – Toggle between command names and full command lines
o or O – Change the order of the displayed fields (in older versions of top; recent procps-ng versions use these keys to add a filter)
l – Toggle the first summary line (load average and uptime)
t – Toggle the second and third summary lines (Tasks and CPU status)
m – Toggle the fourth and fifth summary lines (Mem and Swap)
N – Sort the process list by PID
M – Sort the process list by memory usage (%MEM)
T – Sort the process list by cumulative CPU time (TIME+)
i – Toggle the display of idle and zombie processes
S – Toggle cumulative time mode
h – Display help
q – Quit top
W – Write the current settings to the ~/.toprc file
b – Toggle highlighting of running processes (R state)
x – Toggle highlighting of the sort column
"shift + >" or "shift + <" – Move the sort column to the right or left


The summary area of the top output shows, row by row:

  1. System time – uptime – number of logged-in users – load average over the last 1, 5, and 15 minutes. When the load average exceeds the number of CPU cores, the system is heavily loaded; stop or optimize the processes that consume CPU, or add hardware such as CPU, memory, or disk.
  2. Total processes – running – sleeping – stopped – zombie
  3. us: Percentage of CPU time spent in user space, that is, by applications.
    sy: Percentage of CPU time spent in kernel space.
    ni: Percentage of CPU time spent on user processes whose nice value has been changed.
    id: Percentage of CPU time spent idle.
    wa: Percentage of CPU time spent waiting for I/O operations to complete.
    hi: Percentage of CPU time spent handling hardware interrupts.
    si: Percentage of CPU time spent handling software interrupts.
  4. Total memory – memory in use – free memory – buffer/cache memory
  5. Total swap – swap in use – free swap – cached memory (recent versions of top show avail Mem, an estimate of available memory, in this position)

System memory and swap space are closely related. When physical memory runs short, the operating system moves infrequently used memory pages to swap space to free physical memory and keep the system running. Swap space can therefore be seen as an extension of memory that helps the system manage memory resources. Watch the used swap value: if it keeps changing, and especially if it keeps growing, physical memory is insufficient.
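
One way to confirm ongoing swap activity, sketched under the assumption that /proc/vmstat provides the pswpin and pswpout counters (pages swapped in and out since boot), is to read the counters twice and compare them. The helper name swap_counters is hypothetical.

```shell
# Print the cumulative swap-in and swap-out page counters, separated by a space.
# Reads text in /proc/vmstat format on stdin.
# swap_counters is an illustrative helper name.
swap_counters() {
    awk '
        $1 == "pswpin"  { in_pages = $2 }
        $1 == "pswpout" { out_pages = $2 }
        END { print in_pages + 0, out_pages + 0 }
    '
}

# Typical use on a live system: take two samples a few seconds apart.
#   swap_counters < /proc/vmstat; sleep 10; swap_counters < /proc/vmstat
```

If both numbers rise between the two samples, the system is actively swapping.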

Each row of the process list has the following columns:

  • PID: Process ID, which uniquely identifies a process.
  • USER: The user who owns the process.
  • PR: Priority of the process; the smaller the value, the higher the priority.
  • NI: Nice value of the process; the smaller the value, the higher the priority.
  • VIRT: Virtual memory size in KB, including the code segment, data segment, stack segment, and shared libraries used by the process.
  • RES: Resident memory size in KB, that is, the physical memory the process currently occupies.
  • SHR: Shared memory size in KB used by the process.
  • S: Process state, such as running (R), sleeping (S), stopped (T), or zombie (Z).
  • %CPU: Share of CPU time used by the process.
  • %MEM: Share of total physical memory used by the process.
  • TIME+: CPU time consumed by the process, in both user mode and kernel mode.
  • COMMAND: Command line of the process, that is, the command it runs and its parameters.

Related commands

View the number of logical CPU cores
grep -c '^processor' /proc/cpuinfo
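
To tie the core count back to the load average, here is a small sketch that compares the two. The name load_status is hypothetical, and "load above core count means high load" is the rule of thumb from this article, not a hard limit.

```shell
# Print "high" if the load average exceeds the number of cores, otherwise "ok".
# Arguments: load average (for example 3.52) and number of logical cores.
# load_status is an illustrative helper name.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) {
            print "high"
        } else {
            print "ok"
        }
    }'
}

# Typical use on a live system (the first field of /proc/loadavg is the 1-minute load):
#   load_status "$(cut -d ' ' -f1 /proc/loadavg)" "$(grep -c '^processor' /proc/cpuinfo)"
```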

3.1 jmap analyzes heap memory configuration information and usage

The output of jmap -heap has two main parts: Heap Configuration, which shows the heap memory configuration of the Java process, and Heap Usage, which shows its heap memory usage. The -heap option belongs to the jmap shipped with JDK 8 and earlier; from JDK 9 on, the same report comes from jhsdb jmap --heap --pid PID.

1. Use the top command to find the PID of the Java process with high memory usage (RES column).
RES (resident set size) is the physical memory the process currently occupies, including its code segment, data segment, stack segment, and shared libraries. For a Java process, RES covers the heap plus non-heap areas such as Metaspace and thread stacks, so it can be larger than the configured heap.

2. View and analyze heap memory usage

jmap -heap PID

Heap Configuration
  • MinHeapFreeRatio: Minimum percentage of free heap after a GC. If free space falls below this value, the JVM expands the heap.
  • MaxHeapFreeRatio: Maximum percentage of free heap after a GC. If free space rises above this value, the JVM shrinks the heap.
  • MaxHeapSize: Maximum heap size. The heap does not grow beyond this value.
  • NewSize: Size of the young generation.
  • MaxNewSize: Maximum size of the young generation.
  • OldSize: Size of the old generation.
  • NewRatio: Ratio of the old generation to the young generation. For example, NewRatio=2 means young:old = 1:2.
  • SurvivorRatio: Ratio of Eden to one Survivor space. For example, SurvivorRatio=8 means Eden:Survivor = 8:1.
  • MetaspaceSize: Initial Metaspace size, the threshold at which a GC is first triggered to unload classes.
  • CompressedClassSpaceSize: Size of the compressed class space.
  • MaxMetaspaceSize: Maximum Metaspace size.
  • G1HeapRegionSize: Heap region size used by the G1 collector.
Heap Usage
  • PS Young Generation: Memory usage of the young generation (the PS prefix appears when the Parallel Scavenge collector is in use).
  • Eden Space: Memory usage of the Eden space.
  • Survivor Space: Memory usage of the Survivor spaces.
  • PS Old Generation: Memory usage of the old generation.
For Heap Usage, we can see the capacity, used size, free size, and usage percentage of each area. This information helps us understand the memory usage of the Java process for performance tuning or troubleshooting.
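
To make the NewRatio and SurvivorRatio arithmetic concrete, here is a rough sketch that estimates generation sizes from a heap size. The name heap_split is hypothetical, and the results are only approximations: the JVM aligns sizes, may resize generations adaptively, and collectors such as G1 do not use fixed generation sizes.

```shell
# Estimate generation sizes (in MB) from a heap size, NewRatio and SurvivorRatio.
# Arguments: heap size in MB, NewRatio, SurvivorRatio.
# heap_split is an illustrative helper name; integer division rounds down.
heap_split() {
    heap_mb=$1
    new_ratio=$2
    survivor_ratio=$3
    young=$(( heap_mb / (new_ratio + 1) ))         # young : old = 1 : NewRatio
    old=$(( heap_mb - young ))
    survivor=$(( young / (survivor_ratio + 2) ))   # Eden : S0 : S1 = ratio : 1 : 1
    eden=$(( young - 2 * survivor ))
    echo "young=$young old=$old eden=$eden survivor=$survivor"
}

# Example: a 3072 MB heap with NewRatio=2 and SurvivorRatio=8
#   heap_split 3072 2 8
```

The divisor survivor_ratio + 2 reflects that the young generation holds one Eden space and two Survivor spaces.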

3.2 jstack analyzes the execution of threads

The output of jstack PID shows what each thread in the Java process is doing, including thread state, call stack, lock information, and monitor information. This makes it possible to quickly locate thread-related performance problems such as deadlocks, infinite loops, and blocked threads. jstack is also useful for diagnosing problems in production, since it gives developers clues that help them locate and solve problems faster.

1. Use the top command to find the PID of the process with high CPU usage
2. Run jstack PID

The following information is printed:

  1. Thread status information, including thread ID, thread name, and thread state.
  2. Thread stack information: the call stack of each thread, that is, the methods the thread is currently executing, with their classes and line numbers.
  3. Lock information: the locks held by each thread and the threads waiting to acquire them.
  4. Monitor information: the monitors held by each thread and the threads waiting to acquire them.
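
A common follow-up, sketched here as an assumption about workflow rather than a prescribed method, is to run top -H -p PID to find the busiest thread, convert its decimal thread ID to the hexadecimal nid shown in the jstack output, and search the dump for it. The helper names tid_to_nid and thread_state_counts are hypothetical.

```shell
# Convert a decimal thread ID (as shown by top -H) to the hexadecimal
# nid value that appears in jstack output, for example nid=0x3039.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Count threads per state in a jstack dump read from stdin.
# Relies on the "java.lang.Thread.State: <STATE>" lines of the dump.
thread_state_counts() {
    awk '
        $1 == "java.lang.Thread.State:" { count[$2]++ }
        END { for (state in count) print state, count[state] }
    ' | sort
}

# Typical use on a live system (PID and TID are placeholders):
#   jstack PID | grep -A 20 "nid=$(tid_to_nid TID)"
#   jstack PID | thread_state_counts
```

A large number of BLOCKED threads in the state summary is usually the first hint of lock contention.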

3.3 jstat to view the percentage of heap in each area

jstat -gcutil PID 5000

The command above prints one sample every 5000 milliseconds. Its output shows how full each memory area of the Java process is, along with the number and duration of GC operations, which helps optimize the memory usage and GC performance of the program.

S0: Percentage of Survivor space 0 in use.
S1: Percentage of Survivor space 1 in use.
E: Percentage of the Eden space in use.
O: Percentage of the Old generation in use.
M: Percentage of the Metaspace in use.
CCS: Percentage of the Compressed Class Space in use.
YGC: Number of Young GCs executed.
YGCT: Total time spent in Young GC, in seconds.
FGC: Number of Full GCs executed.
FGCT: Total time spent in Full GC, in seconds.
GCT: Total time spent in GC, in seconds.
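
As a minimal sketch for scripting against this output, the helper below extracts one named column from the most recent row. It looks the column up by its header name, since the set of columns can differ between JDK versions; the name gcutil_field is hypothetical.

```shell
# Print the value of one named column from the last row of jstat -gcutil output.
# Reads the jstat output (header line followed by data rows) on stdin.
# gcutil_field is an illustrative helper name.
gcutil_field() {
    awk -v name="$1" '
        NR == 1 {
            # Locate the requested column in the header line.
            for (i = 1; i <= NF; i++) {
                if ($i == name) col = i
            }
            next
        }
        col { value = $col }
        END {
            if (col) {
                print value
            } else {
                exit 1
            }
        }
    '
}

# Typical use on a live system (PID is a placeholder; take 3 samples, 5 s apart):
#   jstat -gcutil PID 5000 3 | gcutil_field O
```

An O value that stays high while FGC keeps increasing suggests that Full GCs are not reclaiming much memory.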

4. Other instructions

1. View the memory usage of Java processes

top -o %MEM -b -n 1 | grep java | awk '{print "PID: "$1" \t virtual memory: "$5" \t physical memory: "$6" \t shared memory: "$7 " \t CPU usage: "$9"% \t Memory usage: "$10"%"}'

2. Count the threads of Java processes (the [j]ava pattern keeps grep from counting its own process)

ps -eLf | grep -c '[j]ava'

3. Check which ports are in use (on systems without netstat, ss -ntlp shows the same information)

netstat -ntlp

4. List all processes. The two commands differ mainly in syntax and columns: ps -ef uses System V syntax and shows the parent PID (PPID) of each process, while ps aux (often written ps -aux) uses BSD syntax and also shows %CPU, %MEM, virtual and resident memory size, and process state.

ps -ef
ps aux

Summary

Linux server abnormalities and freezes have both hardware and software causes. If the hardware is fine, troubleshoot with the top command, the jps command, jmap (heap memory configuration and usage), jstack (thread execution), jstat (percentage of the heap used in each area), and the service logs.