About disabling IPv6 DNS requests in a cluster

The article is reproduced from: https://blog.51cto.com/mahmut/8141303

Environment:

System: Tongxin Youyue 1060A

Cluster: Tongxin Youque bare metal deployment

Source of the problem

DNS configuration of the Youque cluster's bastion node

<code>[root@bastion ~]# cat /etc/coredns/Corefile
.:53 {
    template IN A apps.utccp.example.com {
        match .*apps\.utccp\.example\.com
        answer "{{ .Name }} 60 IN A 10.12.24.125"
        fallthrough
    }
    hosts {
        10.12.24.125 api.utccp.example.com
        10.12.24.125 api-int.utccp.example.com
        10.12.24.125 bastion.utccp.example.com
        10.12.24.127 master1.utccp.example.com
        10.12.24.128 master2.utccp.example.com
        10.12.24.129 master3.utccp.example.com
        fallthrough
    }
    prometheus
    cache 160
    forward . 114.114.114.114
    log
}</code>

Issue found:

Environment: Youque default configuration, with data collected over 48 hours. Total number of requests: 660,100

Type	Count	Notes	Proportion
IPv4 requests	398826	Normal requests	0.60419
IPv6 requests	206733	The cluster defaults to IPv4 and has no IPv6 network	0.31318
Repeated base domain requests	54541	The cluster base domain is duplicated in the name, so the query returns `NXDOMAIN` and cannot be resolved	0.08262

From the data above, abnormal requests (IPv6 plus repeated base domain) make up about 40% of the total.
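A breakdown by record type like the one above can be approximated from the output of the CoreDNS log plugin. The following is a minimal sketch, not necessarily how the original figures were produced. It assumes the default log format, in which the quoted part of each line starts with the query type. The function names `count_qtypes` and `sample_log` and the sample lines are made up for illustration.

```shell
# Tally DNS queries by record type from CoreDNS "log" plugin output on stdin.
# Assumption: default log format, where the quoted section starts with the
# query type, e.g. "AAAA IN api.utccp.example.com. udp 39 false 512".
count_qtypes() {
    awk -F'"' 'NF >= 3 { split($2, q, " "); n[q[1]]++ }
               END { for (t in n) print t, n[t] }' | LC_ALL=C sort
}

# Hypothetical sample lines, not taken from the original capture.
sample_log() {
cat <<'EOF'
[INFO] 10.12.24.127:40112 - 1001 "A IN api.utccp.example.com. udp 39 false 512" NOERROR qr,aa,rd 76 0.000101s
[INFO] 10.12.24.127:40112 - 1002 "AAAA IN api.utccp.example.com. udp 39 false 512" NOERROR qr,aa,rd 39 0.000087s
[INFO] 10.12.24.128:51930 - 1003 "A IN master1.utccp.example.com. udp 43 false 512" NOERROR qr,aa,rd 84 0.000095s
EOF
}

# Print one "TYPE COUNT" line per record type.
sample_log | count_qtypes
```

On a real node the function would read the CoreDNS log instead of the sample, wherever that log is collected in the given deployment.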

Problem analysis

First, find the source of the IPv6 requests. In an offline cluster, DNS requests come almost entirely from the containers of the cluster components by default, so DNS requests from system services can be ignored.

Second, find the cause of the repeated base domain. By default, the hostname of each node includes the base domain. The hostname is configured with hostnamectl set-hostname, which may be where the problem comes from.
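To pick the repeated base domain requests out of a list of query names, a small check such as the one below may help. It is only a sketch: it assumes a repeated name has the base domain appended twice (for example `api.utccp.example.com.utccp.example.com.`), and the function name `has_repeated_base` is hypothetical.

```shell
# Succeed if the query name ends with the base domain repeated twice,
# e.g. api.utccp.example.com.utccp.example.com.
# Assumption: this is the shape of a "repeated base domain" request.
has_repeated_base() {
    name=${1%.}     # drop the trailing root dot, if present
    base=$2
    case "$name" in
        *".$base.$base") return 0 ;;
        *) return 1 ;;
    esac
}

# Example with an invented pair of names.
for n in api.utccp.example.com. api.utccp.example.com.utccp.example.com.; do
    if has_repeated_base "$n" utccp.example.com; then
        echo "$n repeated"
    else
        echo "$n ok"
    fi
done
```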

Disabling IPv6 DNS resolution requests (client side)

Turn off IPv6 in NetworkManager.

<code># nmcli connection modify enp1s0 ipv6.method disabled
# systemctl restart NetworkManager</code>

IPv6 DNS requests still reach the bastion node.

Disable IPv6 through kernel parameters (run on every node)

<code># sysctl -w net.ipv6.conf.all.disable_ipv6=1
# sysctl -w net.ipv6.conf.all.disable_policy=1</code>

After these two steps, IPv6 DNS requests still reach the bastion node.

Turn off IPv6 in avahi-daemon

Set use-ipv6=no in /etc/avahi/avahi-daemon.conf, then restart the service.

<code># vim /etc/avahi/avahi-daemon.conf
use-ipv6=no
# systemctl restart avahi-daemon</code>

Still not working.

Set single-request-reopen in /etc/resolv.conf

<code>options single-request-reopen</code>

<code># systemctl restart NetworkManager</code>

Still not working.

Modify the OVS configuration

<code># vim /etc/openvswitch/ovs-vswitchd.conf.db
other_config:ipv6_prefix=[]
# systemctl restart openvswitch.service</code>

Still not working.

Comment out the IPv6 localhost entry in /etc/hosts

<code>[root@worker1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6</code>

Still not working.

Prefer IPv4 in /etc/gai.conf

<code>precedence ::ffff:0:0/96 100</code>

Still not working.

Remove the kernel module

<code># modprobe -r ipv6</code>

The ipv6 module is built into the kernel, and built-in modules cannot be unloaded.

After searching the documentation, I found the explanation in RFC 4472, Section 5.1:

<code>5.1. DNS Lookups May Query IPv6 Records Prematurely

The system library that implements the getaddrinfo() function for
looking up names is a critical piece when considering the robustness
of enabling IPv6; it may come in basically three flavors:

1. The system library does not know whether IPv6 has been enabled in
   the kernel of the operating system: it may start looking up AAAA
   records with getaddrinfo() and AF_UNSPEC hint when the system is
   upgraded to a system library version that supports IPv6.

2. The system library might start to perform IPv6 queries with
   getaddrinfo() only when IPv6 has been enabled in the kernel.
   However, this does not guarantee that there exists any useful
   IPv6 connectivity (e.g., the node could be isolated from the
   other IPv6 networks, only having link-local addresses).

3. The system library might implement a toggle that would apply some
   heuristics to the "IPv6-readiness" of the node before starting to
   perform queries; for example, it could check whether only link-
   local IPv6 address(es) exists, or if at least one global IPv6
   address exists.

First, let us consider generic implications of unnecessary queries
for AAAA records: when looking up all the records in the DNS, AAAA
records are typically tried first, and then A records. These are
done in serial, and the A query is not performed until a response is
received to the AAAA query. Considering the misbehavior of DNS
servers and load-balancers, as described in Section 3.1, the lookup
delay for AAAA may incur additional unnecessary latency, and
introduce a component of unreliability.</code>

Add a GRUB kernel parameter to disable IPv6

<code>ipv6.disable=1</code>

<code>[root@worker1 ~]# ss -tunlp
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
udp UNCONN 0 0 0.0.0.0:111 0.0.0.0:* users:(("rpcbind",pid=793,fd=6))
udp UNCONN 0 0 0.0.0.0:33062 0.0.0.0:* users:(("avahi-daemon",pid=799,fd=16))
udp UNCONN 0 0 127.0.0.1:323 0.0.0.0:* users:(("chronyd",pid=817,fd=6))
udp UNCONN 0 0 0.0.0.0:4789 0.0.0.0:*
udp UNCONN 0 0 0.0.0.0:5353 0.0.0.0:* users:(("avahi-daemon",pid=799,fd=15))
udp UNCONN 0 0 0.0.0.0:55162 0.0.0.0:* users:(("rpcbind",pid=793,fd=7))
tcp LISTEN 0 128 0.0.0.0:111 0.0.0.0:* users:(("rpcbind",pid=793,fd=8))
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=1208,fd=3))
tcp LISTEN 0 5 127.0.0.1:631 0.0.0.0:* users:(("cupsd",pid=1327,fd=10))</code>

Although everything related to IPv6 addresses is now turned off, IPv6 DNS resolution requests still cannot be blocked.

A web search showed that this was addressed in glibc 2.36, which added the no-aaaa option for /etc/resolv.conf. From the glibc change log:

<code>* The "no-aaaa" DNS stub resolver option has been added. System
administrators can use it to suppress AAAA queries made by the stub
resolver, including AAAA lookups triggered by NSS-based interfaces
such as getaddrinfo. Only DNS lookups are affected: IPv6 data in
/etc/hosts is still used, getaddrinfo with AI_PASSIVE will still
produce IPv6 addresses, and configured IPv6 name servers are still
used. To produce correct Name Error (NXDOMAIN) results, AAAA queries
are translated to A queries. The new resolver option is intended
primarily for diagnostic purposes, to rule out that AAAA DNS queries
have adverse impact. It is incompatible with EDNS0 usage and DNSSEC
validation by applications.</code>
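Whether a node can use this option depends on its glibc version. The sketch below compares a version string against 2.36. The function name `supports_no_aaaa` is hypothetical, and the comparison relies on `sort -V` being available (as in GNU coreutils).

```shell
# Succeed if the given glibc version is at least 2.36, the release
# that added the "no-aaaa" resolver option.
# Assumption: sort supports -V (version sort), as GNU coreutils does.
supports_no_aaaa() {
    required="2.36"
    lowest=$(printf '%s\n%s\n' "$required" "$1" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

# Example with two literal versions.
for v in 2.28 2.36; do
    if supports_no_aaaa "$v"; then
        echo "glibc $v: no-aaaa available"
    else
        echo "glibc $v: no-aaaa not available"
    fi
done
```

On a node, the installed version can be read with `getconf GNU_LIBC_VERSION`, which prints a line such as `glibc 2.28`; the second field is what the function expects.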

The glibc version on the system I am using is too old, and it cannot be upgraded casually. The goal is to avoid the timeout caused by requesting A and AAAA records at the same time: according to the /etc/resolv.conf manual, when A and AAAA queries are sent together and the A response arrives but the AAAA response never does, the resolver waits out a 5-second timeout, which is alarming. Hence the workaround below.

Server side (workaround)

Since I cannot solve the problem in glibc, I start with CoreDNS on the bastion node.

Explanation: by default the client requests A and AAAA records at the same time. If the server answers the AAAA query immediately, possibly even faster than the A query, the 5-second timeout is avoided.

Two solutions currently come to mind.

Use rewrite to turn AAAA queries into A queries (safe, recommended)

<code>.:53 {
    rewrite stop type AAAA A
    template IN A apps.utccp.example.com {
        match .*apps\.utccp\.example\.com
        answer "{{ .Name }} 60 IN A 10.12.24.125"
        fallthrough
    }
    hosts {
        10.12.24.125 api.utccp.example.com
        10.12.24.125 api-int.utccp.example.com
        10.12.24.125 bastion.utccp.example.com
        10.12.24.127 master1.utccp.example.com
        10.12.24.128 master2.utccp.example.com
        10.12.24.129 master3.utccp.example.com
        fallthrough
    }
    prometheus
    cache 160
    forward . 114.114.114.114
    log
}</code>

Return NXDOMAIN for AAAA queries (not recommended, because AAAA lookups will produce error messages)

<code>.:53 {
    template IN AAAA {
        rcode NXDOMAIN
    }
    template IN A apps.utccp.example.com {
        match .*apps\.utccp\.example\.com
        answer "{{ .Name }} 60 IN A 10.12.24.125"
        fallthrough
    }
    hosts {
        10.12.24.125 api.utccp.example.com
        10.12.24.125 api-int.utccp.example.com
        10.12.24.125 bastion.utccp.example.com
        10.12.24.127 master1.utccp.example.com
        10.12.24.128 master2.utccp.example.com
        10.12.24.129 master3.utccp.example.com
        fallthrough
    }
    prometheus
    cache 160
    forward . 114.114.114.114
    log
}</code>

Due to time constraints, the other issue, the repeated base domain, is left for the next article.

Reference links:

resolv.conf manual: https://man7.org/linux/man-pages/man5/resolv.conf.5.html
DNS standard (RFC 4472): https://www.rfc-editor.org/rfc/rfc4472.html
OpenShift documentation: https://docs.openshift.com/container-platform/4.13/rest_api/operator_apis/dns-operator-openshift-io-v1.html#spec-upstreamresolvers
Bug report filed by a kind person: https://bugzilla.redhat.com/show_bug.cgi?id=1027452
glibc change log: https://lists.gnu.org/archive/html/info-gnu/2022-08/msg00000.html
glibc source code mirrored by a kind person: https://github.com/bminor/glibc/tree/ibm/2.28/master
glibc patch proposed by a kind person: https://sourceware.org/pipermail/libc-alpha/2022-June/139341.html
Youdao translation: https://fanyi.youdao.com
