Problems encountered when maintaining a domain name cache inside an application

In a recent project, I ran into a series of problems caused by relying on a DNS server to resolve the domain name of an external business cluster.
The remote business cluster provides services over HTTP/HTTPS and contains multiple business nodes. In the current solution, a domain name configured on the DNS server points to the multiple nodes of the cluster.
The DNS server is provided by the customer.
The product’s feature code uses libcurl to access the HTTP/HTTPS services provided by the business cluster.
During local development and verification, this networking setup worked without any problems.
However, once we entered the large-scale testing phase, a serious problem was exposed: because the product runs with relatively high concurrency, a large number of business requests are triggered within a short time after the feature is turned on. This had the following consequences:

  1. libcurl initiated a large number of domain name resolution requests. The DNS server did not have enough concurrent capacity and replied to only some of them.
  2. A large number of requests failed while establishing the connection because domain name resolution failed.
  3. Because the product has built-in retry logic, the failures triggered even more requests.
  4. The DNS server was overwhelmed and rejected even more resolution requests.
  5. Other services that rely on the same DNS server to resolve domain names were affected as well.

After the testing team reported these issues, the development team took the following measures:

  1. In the existing retry logic, increase the sleep time and use exponential backoff to gradually lengthen the interval between retries, so that retries no longer arrive in bursts. In actual testing, this strategy alleviated the problem but did not solve it.
  2. Change how libcurl is used: enable its DNS caching feature and increase the cache expiration time. In actual testing, the effect of this strategy was unclear.
  3. Modify the product’s business code to reduce concurrency. This strategy had an obvious effect, and the number of failed requests dropped significantly. However, several other features in the product code also use libcurl to access remote cluster services, so the problem still existed for them.
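
The first measure can be sketched as follows. This is a minimal illustration of exponential backoff with a cap and some random jitter, not the product's actual retry code; the function names backoff_delay_ms and jittered_delay_ms, the base delay, and the cap are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Delay before retry number `attempt` (0 = first retry):
 * base_ms doubled once per attempt, never exceeding max_ms.
 * The name and parameters are illustrative only. */
uint64_t backoff_delay_ms(unsigned attempt, uint64_t base_ms, uint64_t max_ms)
{
    assert(base_ms > 0 && base_ms <= max_ms);

    uint64_t delay = base_ms;
    while (attempt > 0 && delay < max_ms) {
        /* Clamp before doubling so the multiplication cannot overflow. */
        delay = (delay > max_ms / 2) ? max_ms : delay * 2;
        attempt--;
    }
    return delay;
}

/* Pick a delay between about half of delay_ms and the full delay_ms,
 * using a caller-supplied random number, so that many clients do not
 * retry at the same instant. */
uint64_t jittered_delay_ms(uint64_t delay_ms, uint32_t random_value)
{
    uint64_t half = delay_ms / 2;
    return (delay_ms - half) + random_value % (half + 1);
}
```

With a 100 ms base and a 5000 ms cap, successive retries would wait 100, 200, 400, 800 ms and so on until the cap is reached. Passing a fresh random number (for example from rand()) to jittered_delay_ms spreads concurrent retries out over time, which matters when the retries themselves are part of the load on the DNS server.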

These measures only treat the symptoms and do not fundamentally solve the problem.
The development team discussed the issue again and came to the following conclusions:

  • Reducing concurrency is not feasible.
    For one thing, it may cause the current feature to miss its performance specifications. For another, many earlier features also use libcurl, and applying the same concurrency reduction to each of them would mean a huge modification and verification workload. That was unacceptable to the management team and exhausting for the development and testing teams, so it is unsustainable.
  • The concurrency capability of the DNS server is unpredictable.
    In the delivery scenario, the DNS server is provided by the customer. Its concurrency capability cannot be predicted, and its performance cannot be improved in a short time. If this product consumes the processing capacity of the customer’s DNS server, it may block the customer’s other services. Therefore, the problem has to be faced head-on.

After discussion, the development team made the following decisions:

  • Consolidate the libcurl usage scattered across the existing business code into a unified HTTP client shared by every feature.
  • Add a cache of domain name resolution results to that client.
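
As a rough sketch of the second decision, the cache can be as simple as a small table that maps a host name to its last resolved address together with an expiry time. Everything below (the dns_cache structure, the slot count, the TTL parameter, the function names) is hypothetical and only illustrates the idea; it is not the product's implementation. The current time is passed in explicitly to keep the logic easy to test.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define CACHE_SLOTS 16
#define HOST_MAX    256   /* DNS names are at most 253 characters */
#define ADDR_MAX    46    /* enough for a textual IPv6 address */

struct dns_entry {
    char   host[HOST_MAX];
    char   addr[ADDR_MAX];
    time_t expires_at;    /* 0 means the slot has never been used */
};

struct dns_cache {
    struct dns_entry slots[CACHE_SLOTS];   /* must start zero-initialized */
};

/* Returns 1 and copies the cached address into `out` on a fresh hit,
 * 0 on a miss or when the entry has expired. */
int dns_cache_get(const struct dns_cache *c, const char *host, time_t now,
                  char *out, size_t outlen)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        const struct dns_entry *e = &c->slots[i];
        if (e->expires_at > now && strcmp(e->host, host) == 0) {
            snprintf(out, outlen, "%s", e->addr);
            return 1;
        }
    }
    return 0;
}

/* Stores or refreshes an entry that stays valid for `ttl` seconds. */
void dns_cache_put(struct dns_cache *c, const char *host, const char *addr,
                   time_t now, time_t ttl)
{
    assert(ttl > 0);

    struct dns_entry *victim = &c->slots[0];
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct dns_entry *e = &c->slots[i];
        if (strcmp(e->host, host) == 0) {
            victim = e;               /* refresh the existing entry */
            break;
        }
        if (e->expires_at < victim->expires_at)
            victim = e;               /* else reuse the slot expiring soonest */
    }
    snprintf(victim->host, sizeof victim->host, "%s", host);
    snprintf(victim->addr, sizeof victim->addr, "%s", addr);
    victim->expires_at = now + ttl;
}
```

On a cache miss the client would resolve the name once, call dns_cache_put, and let later requests reuse the address until it expires. In a concurrent client the table would also need a lock around both functions, which is left out here for brevity.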

This solution required little work and quickly passed the functional and performance verification organized by the testing team.
I thought we were in the clear, but when the testing team ran the fault injection cases, they found a new problem: the domain name resolution call took too long, with a single call taking more than 20 seconds to return.

This new problem is related to how domain name resolution is implemented in the solution above. In other words, it is a pitfall of the C functions for domain name resolution.
The C functions we knew of for domain name resolution are as follows:

  • gethostbyname is not thread-safe, so it was ruled out.
  • gethostbyname_r is the thread-safe (reentrant) version of gethostbyname and is the one currently in use.
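
For reference, a minimal sketch of calling gethostbyname_r (in its glibc form) is shown below. The wrapper name resolve_ipv4 is made up for this example and is not the product's code. Note that the parameter list carries a result structure, a scratch buffer, and an error code, but nothing that bounds how long the call may block.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE       /* gethostbyname_r is a GNU extension */
#endif
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Resolves `host` and writes its first IPv4 address as text into `out`.
 * Returns 0 on success and -1 on failure. The call blocks for as long
 * as the system resolver takes; there is no timeout argument. */
int resolve_ipv4(const char *host, char *out, socklen_t outlen)
{
    assert(host != NULL && out != NULL);

    struct hostent he;
    struct hostent *result = NULL;
    int herr = 0;
    size_t buflen = 1024;
    char *buf = malloc(buflen);
    if (buf == NULL)
        return -1;

    int rc;
    /* ERANGE means the scratch buffer was too small: grow it and retry. */
    while ((rc = gethostbyname_r(host, &he, buf, buflen,
                                 &result, &herr)) == ERANGE) {
        buflen *= 2;
        char *bigger = realloc(buf, buflen);
        if (bigger == NULL) {
            free(buf);
            return -1;
        }
        buf = bigger;
    }

    /* rc == 0 with result == NULL means "no entry found". */
    int ok = rc == 0 && result != NULL
          && result->h_addrtype == AF_INET
          && result->h_addr_list[0] != NULL
          && inet_ntop(AF_INET, result->h_addr_list[0], out, outlen) != NULL;

    free(buf);
    return ok ? 0 : -1;
}
```

A caller would pass a buffer of at least INET_ADDRSTRLEN bytes for the textual address. Because the call is synchronous, any delay in the resolver shows up directly as latency in the calling thread.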

Actual testing showed that gethostbyname_r has a serious problem: its parameters do not include a timeout, so the time a call takes cannot be controlled.
After consulting a lot of material, I found that the long delay actually comes from the DNS resolution process itself, so the designers of the API can hardly be blamed for an oversight.
Although a timeout cannot be passed when calling gethostbyname_r, I stumbled upon /etc/resolv.conf. By adjusting the attempts and timeout values in its options line (options attempts:5 timeout:6), we found that the excessive time taken by gethostbyname_r in DNS failure scenarios could be alleviated to some extent.
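
Editing /etc/resolv.conf changes resolver behaviour for every process on the machine. According to resolv.conf(5), the same options can also be supplied per process through the RES_OPTIONS environment variable. The sketch below assumes glibc and uses a hypothetical helper name, limit_resolver_wait; it is not what the project did, and the values a caller passes are examples rather than recommendations. It should run before the process makes its first resolver call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Sets RES_OPTIONS for the current process so that the resolver waits
 * `timeout_s` seconds per query and sends each query at most `attempts`
 * times. resolv.conf(5) documents that timeout is capped at 30 and
 * attempts at 5. Returns 0 on success, -1 on failure. */
int limit_resolver_wait(int timeout_s, int attempts)
{
    assert(timeout_s >= 1 && attempts >= 1);

    char opts[64];
    int n = snprintf(opts, sizeof opts, "timeout:%d attempts:%d",
                     timeout_s, attempts);
    if (n < 0 || (size_t)n >= sizeof opts)
        return -1;

    /* setenv copies the string, so the local buffer may go out of scope. */
    return setenv("RES_OPTIONS", opts, 1);
}
```

Calling limit_resolver_wait(2, 1) early in main, for instance, would ask the resolver to give up on an unresponsive name server much sooner than the system-wide setting, without touching a file that other services on the host depend on.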

After a series of twists and turns, the current solution is not perfect, but it has at least been accepted by the product team, so the optimization work has come to an end for now.

The following are the configuration files and tools used while solving the problems above.

An example of the contents of the /etc/resolv.conf file is as follows:

$ cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

nameserver 127.0.0.53
search DHCP HOST
options attempts:5 timeout:6

An example of the /etc/hosts file is as follows:

$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 jackie-ubuntu

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

The execution output of the dig command is as follows:

$ dig www.baidu.com

; <<>> DiG 9.16.1-Ubuntu <<>> www.baidu.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56073
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;www.baidu.com. IN A

;; ANSWER SECTION:
www.baidu.com. 1105 IN CNAME www.a.shifen.com.
www.a.shifen.com. 29 IN A 36.152.44.96
www.a.shifen.com. 29 IN A 36.152.44.95

;; Query time: 4 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: November 12 11:15:58 CST 2023
;; MSG SIZE rcvd: 101

An example of the output of dig with the +trace option is as follows:

$ dig +trace www.baidu.com

; <<>> DiG 9.16.1-Ubuntu <<>> +trace www.baidu.com
;; global options: +cmd
. 2687 IN NS f.root-servers.net.
. 2687 IN NS m.root-servers.net.
. 2687 IN NS e.root-servers.net.
. 2687 IN NS l.root-servers.net.
. 2687 IN NS c.root-servers.net.
. 2687 IN NS h.root-servers.net.
. 2687 IN NS j.root-servers.net.
. 2687 IN NS a.root-servers.net.
. 2687 IN NS k.root-servers.net.
. 2687 IN NS g.root-servers.net.
. 2687 IN NS i.root-servers.net.
. 2687 IN NS d.root-servers.net.
. 2687 IN NS b.root-servers.net.
;; Received 262 bytes from 127.0.0.53#53(127.0.0.53) in 12 ms

com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
com. 172800 IN NS c.gtld-servers.net.
com. 172800 IN NS d.gtld-servers.net.
com. 172800 IN NS e.gtld-servers.net.
com. 172800 IN NS f.gtld-servers.net.
com. 172800 IN NS g.gtld-servers.net.
com. 172800 IN NS h.gtld-servers.net.
com. 172800 IN NS i.gtld-servers.net.
com. 172800 IN NS j.gtld-servers.net.
com. 172800 IN NS k.gtld-servers.net.
com. 172800 IN NS l.gtld-servers.net.
com. 172800 IN NS m.gtld-servers.net.
com. 86400 IN DS 30909 8 2 E2D3C916F6DEEAC73294E8268FB5885044A833FC5459588F4A9184CF C41A5766
com. 86400 IN RRSIG DS 8 1 86400 20231124170000 20231111160000 46780 . rTOgKkj9DMvMzyk+rKDz7dsie4Xx1jwuBlZdH9ntLEikavNoZMRN7SxE iweiVanZo1q9hhrSxAn8O1Sc KkRwwHTSCiSvQnZ8bzy4ToM3I832VIiR Oir+C+K7GtufMaxNCOMD14s7Zg24qLf9CmQT+id3eIBMP4Sjuq4MSIsu tgSXJS6EI1OumSojANeO9mq1khc5cxLaeOqJfRb10Vvujl73jZpaXxE9 J4/GehjpG6YR04/37geOwOSaVwx6c3PndgT0L33O/maN/Tjng2UUhHtW lOh8gIVxFYRipqdDZ1XJQK+x5o4o8Oh3YN3Vd1I5rrKJhEfwecej7nyI fe5BKA==
;; Received 1173 bytes from 199.7.83.42#53(l.root-servers.net) in 20 ms

baidu.com. 172800 IN NS ns2.baidu.com.
baidu.com. 172800 IN NS ns3.baidu.com.
baidu.com. 172800 IN NS ns4.baidu.com.
baidu.com. 172800 IN NS ns1.baidu.com.
baidu.com. 172800 IN NS ns7.baidu.com.
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 86400 IN NSEC3 1 1 0 - CK0Q2D6NI4I7EQH8NA30NS61O48UL8G5 NS SOA RRSIG DNSKEY NSEC3PARAM
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 86400 IN RRSIG NSEC3 8 2 86400 20231118052525 20231111041525 63246 com. HHilTlIgM2bBSkNCrfIYydeweb7FpcSd/HCPjoMq9c DoI45LnU1trxYf GtncYfSgPxd01lt7BuBdTBRjFX2kHEWQNAjqKR+wj9ohk9mqvk3naenD eWVPwSEZjYdV+LPjL7rXvMWq6GRZXFG2OC0oR37mS4PCPT/pWYTARo7m 66PiR/ixIP8UPkUbxjZTHFuDsR+lywg8Od0OsTopTj5+rw==
HPVV1UNKTCF9TD77I2AUR73709T975GH.com. 86400 IN NSEC3 1 1 0 - HPVVAN8CFKHHHMEIDVJHFNQEOI5G6C89 NS DS RRSIG
HPVV1UNKTCF9TD77I2AUR73709T975GH.com. 86400 IN RRSIG NSEC3 8 2 86400 20231115212015 20231108201015 63246 com. jtcfLKFI33jWF+yVS/iT/x73j+JM7gult+JQD6xH Y0yl4ZWp2ktqwCrA wk8ybADERvnpDU/u9LKgBVGkT7rIDnheKGcXzKe5Lgjilu9aHWjIiyny J/kwYkBe+PRm13QmKOuUh/DvWhj63Ru5g==
;; Received 849 bytes from 192.48.79.30#53(j.gtld-servers.net) in 284 ms

www.baidu.com. 1200 IN CNAME www.a.shifen.com.
;; Received 100 bytes from 36.155.132.78#53(ns3.baidu.com) in 8 ms

The execution output of the nslookup command is as follows:

$ nslookup www.baidu.com
Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
www.baidu.com canonical name = www.a.shifen.com.
Name: www.a.shifen.com
Address: 36.152.44.96
Name: www.a.shifen.com
Address: 36.152.44.95
Name: www.a.shifen.com
Address: 2409:8c20:6:1d55:0:ff:b09c:7d77
Name: www.a.shifen.com
Address: 2409:8c20:6:1135:0:ff:b027:210c
