Wireshark TS | Tracing an Occasional System Access Failure, Step by Step

Foreword

One day, a colleague at the R&D center reported that access to a system was abnormal: from time to time the system simply could not be opened. Since this was a production issue affecting normal business operations, it naturally called for an emergency response, so we worked with the R&D colleagues to investigate and handle it. The whole process was full of twists and turns, and the case is interesting enough to be worth summarizing.

Description of the problem

After sorting out the information about the business system failure together with the R&D colleagues, the situation was as follows:

  1. When business users access the system, it occasionally fails to open; the problem occurs intermittently;
  2. The access path runs from the Internet DMZ zone to the intranet zone;
  3. The same business system works normally in the test environment;
  4. The business system is outsourced, and the vendor's technical support staff were not able to provide much relevant or useful information.

Because the system works fine in the test environment but fails in production, the R&D colleagues suspected from the very start that the network was to blame. But how can one jump to that conclusion? A blame that drops out of the sky should not be picked up casually. After a quick look at the problem, Ping and the other monitoring of the servers at both ends showed nothing abnormal, which set my mind half at ease; it did not look like a network problem.

Since NPM (network performance monitoring) is deployed in the key DMZ zone (and the mirror point happens to be on the access switch where the client sits), the traffic was traced back by the reported failure time and the service IP conversation pair, and the actual analysis started from there.

TS01.png
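For context, the kind of pre-filtering that the NPM export performs can be approximated with the standard Wireshark CLI tools. A rough sketch, where the raw capture name and the server address 198.51.100.10 are placeholders rather than values from this case:

# cut the raw mirror-port capture down to the reported failure window (editcap -A/-B take absolute times)
editcap -A "2022-11-18 16:45:00" -B "2022-11-18 16:50:00" mirror_raw.pcapng window.pcapng
# then keep only the client/server conversation for analysis
tshark -r window.pcapng -Y 'ip.addr == 192.168.0.1 && ip.addr == 198.51.100.10'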

Problem Analysis

Phase 1

The packet trace file exported from the NPM device for retrospective analysis has the following basic information:

λ capinfos test.pcapng
File name: test.pcapng
File type: Wireshark/... - pcapng
File encapsulation: Ethernet
File timestamp precision: nanoseconds (9)
Packet size limit: file hdr: (not set)
Packet size limit: inferred: 70 bytes - 1518 bytes (range)
Number of packets: 6463
File size: 2174 kB
Data size: 1906 kB
Capture duration: 5990.797626192 seconds
First packet time: 2022-11-18 16:45:07.192079642
Last packet time: 2022-11-18 18:24:57.989705834
Data byte rate: 318 bytes/s
Data bit rate: 2545 bits/s
Average packet size: 294.95 bytes
Average packet rate: 1 packets/s
SHA256: e97bbdffd9f98d0805737cb93d6d8255acd045241aa716a8af3966b0ae5ca76f
RIPEMD160: 0329186f9145dcf38fac739077219a3d93298b34
SHA1: 9a3f06a04163f388b8800889d46fe3db04453c26
Strict time order: True
Capture comment: Sanitized by TraceWrangler v0.6.8 build 949
Number of interfaces in file: 1
Interface #0 info:
                     Encapsulation = Ethernet (1 - ether)
                     Capture length = 2048
                     Time precision = nanoseconds (9)
                     Time ticks per second = 1000000000
                     Time resolution = 0x09
                     Number of stat entries = 0
                     Number of packets = 6463

The trace file was downloaded from the NPM retrospective analysis, pre-filtered to the IP conversation of interest, and anonymized with TraceWrangler. The total capture time is 5990 seconds, there are 6463 packets, and the data rate of 2545 bps is very low.

  1. For an introduction to the TraceWrangler anonymization tool, see the earlier article “Wireshark Tips and Tricks | How to Anonymize Data Packets”;
  2. Another noteworthy item is Packet size limit: inferred: 70 bytes - 1518 bytes (range). Normally, with a snaplen set, the capture length would be one uniform value for every packet rather than a range; the range here comes from how the NPM retrospective-analysis device handles its capture settings (see the example after this list).
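For comparison, a uniform snaplen is what the standard capture tools normally produce; a minimal sketch, with the interface and file names as placeholders:

# capture with a fixed snaplen so every frame is truncated to the same length
tshark -i eth0 -s 128 -w capped.pcapng
# or truncate an existing trace after the fact
editcap -s 128 test.pcapng test_capped.pcapng
# capinfos on the result then reports a single packet size limit rather than a range
capinfos test_capped.pcapng | grep "Packet size limit"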

The expert information is shown below. There are a few protocol-dissection errors, plus warnings such as “previous segment not captured” and “ACKed segment that wasn't captured”. Given the mirroring setup, these ACKed unseen segment events are more likely packets that simply were not captured, and are not a real problem.

image.png
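As an aside, the same expert summary can be pulled straight from the trace on the command line; a sketch against the test.pcapng shown earlier:

# Wireshark expert information grouped by severity
tshark -r test.pcapng -q -z expert
# or list only the frames flagged with the two TCP analysis warnings mentioned above
tshark -r test.pcapng -Y "tcp.analysis.lost_segment || tcp.analysis.ack_lost_segment"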

Expanding the details of the packets around the failure window 16:45 – 16:48 reported by the business, it is obvious that the client 192.168.0.1 only sends packets at long intervals.

image.png

After adding a TCP stream index column, you can see that these packets belong to different TCP streams (this is tied to the business application, so I won't go into detail here), and even within a single stream the gaps between successive packets are large.

image.png

While we were still discussing this, the business reported another “cannot open” exception, so the capture window was extended further back, filtered on the client's source address, and sorted by inter-packet interval from largest to smallest; essentially all the intervals were large. Looking at the overall pattern, though, the fixed intervals in some streams (such as 60 s) look more like the gRPC call logic of the application itself, which does not seem to be the root cause.

image.png
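The same filter-and-sort pass can be sketched on the command line as well (using the trace from phase 1; the fields are frame number, displayed inter-packet gap, and stream index):

# list the client's packets with their gaps and stream index, largest gaps first
tshark -r test.pcapng -Y "ip.src == 192.168.0.1" -T fields -e frame.number -e frame.time_delta_displayed -e tcp.stream | sort -k2 -rn | head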

Based on the information at this stage, the advice given to the R&D colleagues was to check the client 192.168.0.1: it fell silent at the failure point and sent no packets at all, which made it the prime suspect.

Phase 2

The R&D colleagues went back with these questions and suggestions and spent two or three days with the vendor, but could not find any substantive cause, while the problem kept occurring from time to time. They then came back, puzzled, saying that the client request in the DMZ zone really had been sent but the server never received it, and they stressed that the first failure each day always happened the first time someone entered the system. On the application side, the symptom was that clicking the page sent a request, the interface stayed pending forever, and the error log showed Connection timed out.

image.png

Based on this, another packet trace was pulled. From the packets' point of view, there was still nothing generated around 11-20 15:09:33 when the fault occurred. Considering that the capture point is on the access switch the client connects to, the client kept insisting it had sent the request, yet the directly connected switch never saw the packet go out. Packets dropped right at the first hop? . . .

image.png

Since the capture on the switch contained no packets at all between 15:09:06 and 15:09:33, I stuck to my earlier conclusion of checking the client application, and asked them to capture directly on the client the next time the failure occurred, to prove whether the client request had actually been sent.
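A sketch of how that empty window can be double-checked in the switch-side trace (the file name is a placeholder; recent Wireshark releases accept this frame.time comparison syntax):

# anything from the client inside the suspect window? an empty result supports "the client sent nothing"
tshark -r npm_1120.pcapng -Y 'ip.src == 192.168.0.1 && frame.time >= "2022-11-20 15:09:06" && frame.time <= "2022-11-20 15:09:33"'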

Phase 3

And then it came: R&D arrived with locally captured packets, taken simultaneously at both the client and the server (filtered one-way, client -> server). According to them, when the system fails to open it is plainly visible that the client sent a request but the server never received it.

Comparing the packets sent by the client with those received by the server:

  1. Matching up the same TCP stream by ip.id, the sending side has another TCP stream (ip.id 0xc05c – 0xc063) squeezed between the consecutive packets 0x3590 – 0x3592 and 0x3593 – 0x3594; the receiving side, however, shows only the packets 0x3590 – 0x3594, with no trace of that other TCP stream (a command-line sketch of this comparison follows the figure below);
  2. That missing TCP stream consists of the PSH/ACK packets sent by the client 192.168.0.1; because the server never received them it could not return an ACK, so the client retransmitted a total of 6 times and then ended the connection with a FIN;
  3. Only after that did the earlier TCP stream resume transmitting packets from 0x3593 onward, after a gap of nearly 20 s.

image.png
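For this kind of client-side vs. server-side comparison, the ip.id values can also be dumped and diffed on the command line; a sketch with placeholder file names:

# dump ip.id, stream index and TCP flags for the client's packets in each capture
tshark -r client_side.pcapng -Y "ip.src == 192.168.0.1" -T fields -e ip.id -e tcp.stream -e tcp.flags.str > client.txt
tshark -r server_side.pcapng -Y "ip.src == 192.168.0.1" -T fields -e ip.id -e tcp.stream -e tcp.flags.str > server.txt
# lines present only in client.txt are packets the server never saw
diff client.txt server.txt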

An awkward moment; could this really be a slap in the face? A network switch dropping packets? That would be very unscientific. Since the client-side and server-side captures show this interaction, would the NPM retrospective analysis show the same thing? As the figure below shows, the NPM traceback does indeed show the same phenomenon, and the extra data stream sent by the client is visible there as well, which means the upstream switch forwarded those packets normally. Could the loss then be somewhere further along the network path? Maybe.

image.png

At this point, after slowly sorting through the symptoms and packets from all three phases, the following conclusions emerged:

  1. The conclusion from phases 1 and 2 was that the client was at fault: its sending intervals were long, and at the suspected failure times it sent no request at all. Client problem or not, the server would at least respond to any request it actually received;
  2. From the phase 3 captures on the client, the upstream switch, and the server, the client really did send and retransmit the request, yet the server never received it. Given how intermittent the fault is, the network switch dropping packets makes no sense, because a switch would not single out one particular TCP stream to drop;
  3. Whenever the fault occurs, it is always this other data stream that starts transmitting. It does not appear out of thin air: it begins directly with a PSH/ACK, which means the connection had existed all along. So what did this stream's earlier interaction look like?!

With these questions in mind, I went back through roughly an hour of NPM data, the packets from 17:00 up to the failure time 18:05, and found something important: over that entire period the data stream is empty, with no packet interaction at all; only at the failure time does the client generate a few one-way packets (ip.id 0xc05c – 0xc063), and with no response it terminates the connection with a FIN.
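A rough way to confirm that idle hour from the retrospective trace (the file name is a placeholder) is to look at the conversation statistics for the window:

# TCP conversation statistics involving the client; near-zero frame and byte counts confirm the stream sat silent
tshark -r npm_1700_1805.pcapng -q -z conv,tcp,ip.addr==192.168.0.1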
At this point it suddenly clicked: I had been ignoring a key link in the path, the firewall. With the firewall in mind, I reviewed the entire troubleshooting process and finally pinned down the cause, summarized as follows:

  1. The application uses long-lived connections, but no keep-alive mechanism is enabled at either the application level or the TCP level. Once a connection stays idle long enough to hit the firewall's maximum session idle time, the firewall removes the session, while the client and server, with no proper teardown mechanism, still believe the connection is alive (see the sketch after this list);
  2. This matches the business feedback that the first failure each day occurs the first time someone enters the system: the old long-lived connection keeps sending packets, but the firewall no longer holds the session and discards them, so the server never receives them, and after repeated failed retransmissions the client closes the connection with a FIN;
  3. After the business users log out and log back in, the system opens normally again; but if it then sits idle for a long time with no activity, the firewall deletes the session once more and the next attempt to use the system fails again. That is exactly the "occasional" access failure the business reported.
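For reference, a minimal sketch of the TCP-level keep-alive knobs on a Linux host (assuming Linux; the application would additionally have to enable SO_KEEPALIVE on its sockets, and the idle value only needs to stay below the firewall's one-hour limit mentioned in the summary below):

# Linux defaults: the first probe is sent only after 7200 s of idle, well past a 3600 s firewall session timeout
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
# example tuning: first keep-alive after 10 minutes idle, probe every 60 s, give up after 5 failed probes
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=5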

Problem summary

Because the firewall is not maintained by the network team, the issue was handed over to the security team. We later learned that the session idle timeout on the firewall (one hour by default) was adjusted separately for this application's IP conversation pair, and the business system has been running normally since.