Disabling IPv6 DNS requests in a cluster

The article is reproduced from: https://blog.51cto.com/mahmut/8141303

Environment:

System: Tongxin Youyue 1060A

Cluster: Tongxin Youque bare metal deployment

Source of the problem

DNS configuration of the Youque cluster's base node (bastion)


<code>[root@bastion ~]# cat /etc/coredns/Corefile
.:53 {
    template IN A apps.utccp.example.com {
        match .*apps\.utccp\.example\.com
        answer "{{ .Name }} 60 IN A 10.12.24.125"
        fallthrough
    }
    hosts {
        10.12.24.125 api.utccp.example.com
        10.12.24.125 api-int.utccp.example.com
        10.12.24.125 bastion.utccp.example.com
        10.12.24.127 master1.utccp.example.com
        10.12.24.128 master2.utccp.example.com
        10.12.24.129 master3.utccp.example.com
        fallthrough
    }
    prometheus
    cache 160
    forward . 114.114.114.114
    log
}</code>
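The match line in the template block is a regular expression, so the backslashes in front of the dots matter. CoreDNS compiles the pattern with Go's regexp package; for a pattern this simple, Python's re module behaves the same way, so the sketch below (Python is used only for illustration, and both query names are made-up examples) shows what the escaped pattern accepts and what a variant with the backslashes lost would also let through.

```python
import re

# Pattern copied from the Corefile template block.
ESCAPED = re.compile(r".*apps\.utccp\.example\.com")
# Same pattern with the backslashes lost: every "." now matches any character.
UNESCAPED = re.compile(r".*apps.utccp.example.com")


def matches(pattern, qname):
    """Return True if the pattern matches anywhere in the query name."""
    return pattern.search(qname) is not None


# Hypothetical query names, written with the trailing dot that CoreDNS sees.
wildcard_app = "console.apps.utccp.example.com."
lookalike = "console.appsXutccpYexample.com."

print(matches(ESCAPED, wildcard_app))   # True
print(matches(ESCAPED, lookalike))      # False
print(matches(UNESCAPED, lookalike))    # True
```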
       
       
Issue found:

Environment: Youque default configuration, with data collected over 48 hours of operation. Total number of requests: 660,100

Type                      Count    Proportion  Note
IPv4 requests             398826   0.60419     Normal requests
IPv6 requests             206733   0.31318     The cluster defaults to IPv4 and has no IPv6 network
Duplicated base domain    54541    0.08262     The cluster base domain is repeated in the name, which returns NXDOMAIN and cannot be resolved

The data above shows that abnormal requests (IPv6 requests plus duplicated base domain requests) make up about 40% of the total.
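As a quick check of the arithmetic, the proportions can be recomputed from the three counts reported above. This is only a sketch; the dictionary keys are labels chosen here, not names from the cluster.

```python
# Request counts reported above for the 48-hour window.
counts = {
    "ipv4": 398826,
    "ipv6": 206733,
    "duplicated_base_domain": 54541,
}

total = sum(counts.values())
shares = {kind: n / total for kind, n in counts.items()}
# Everything that is not a normal IPv4 request counts as abnormal.
abnormal = shares["ipv6"] + shares["duplicated_base_domain"]

print(total)  # 660100
for kind, share in shares.items():
    print(f"{kind}: {share:.5f}")
print(f"abnormal: {abnormal:.1%}")  # abnormal: 39.6%
```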

Problem analysis

  1. First, identify the source of the IPv6 requests.
    In an offline cluster, DNS requests come almost entirely from the containers of the cluster components by default; DNS requests from system services are negligible.
  2. Then, find the cause of the duplicated base domain.
    By default, each node's hostname already includes the base domain. The hostname is set with hostnamectl set-hostname, which may be where the problem comes from.
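Because the Corefile enables the log plugin, one way to find out which clients send the abnormal requests is to classify the query log by client and category. The sketch below assumes the log lines follow CoreDNS's default query log format; the three sample lines are invented for illustration and are not taken from the real cluster.

```python
import re
from collections import Counter

# Base domain used throughout this article (with the trailing dot).
BASE_DOMAIN = "utccp.example.com."
# A name in which the base domain was appended a second time.
DUPLICATED = BASE_DOMAIN * 2

# Assumed default CoreDNS log layout:
#   [INFO] <client>:<port> - <id> "<type> <class> <name> <proto> ..." <rcode> ...
LOG_LINE = re.compile(
    r'(?P<client>\d+\.\d+\.\d+\.\d+):\d+ - \d+ '
    r'"(?P<qtype>\S+) IN (?P<name>\S+) '
)


def classify(lines):
    """Count queries per (client, category) from CoreDNS log lines."""
    stats = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue  # not a query log line
        name = m["name"].lower()
        if name.endswith(DUPLICATED):
            category = "duplicated_base_domain"
        elif m["qtype"] == "AAAA":
            category = "ipv6"
        elif m["qtype"] == "A":
            category = "ipv4"
        else:
            category = "other"
        stats[(m["client"], category)] += 1
    return stats


# Hypothetical sample lines.
sample = [
    '[INFO] 10.12.24.127:40112 - 1001 "A IN api-int.utccp.example.com. udp 43 false 512" NOERROR qr,aa,rd 84 0.000101s',
    '[INFO] 10.12.24.127:40112 - 1002 "AAAA IN api-int.utccp.example.com. udp 43 false 512" NOERROR qr,aa,rd 43 0.000093s',
    '[INFO] 10.12.24.128:51820 - 1003 "A IN master1.utccp.example.com.utccp.example.com. udp 61 false 512" NXDOMAIN qr,rd,ra 61 0.012s',
]

stats = classify(sample)
for (client, category), n in sorted(stats.items()):
    print(client, category, n)
```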

Disabling IPv6 DNS resolution requests (client side)

  1. Turn off IPv6 in NetworkManager.


<code># nmcli connection modify enp1s0 ipv6.method disabled
# systemctl restart NetworkManager</code>
       
       

IPv6 DNS requests still reach the base node.

  2. Disable IPv6 through kernel parameters; run this on every node


<code># sysctl -w net.ipv6.conf.all.disable_ipv6=1
# sysctl -w net.ipv6.conf.all.disable_policy=1</code>
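To confirm on each node that the setting took effect, the disable_ipv6 flags can be read back from /proc/sys. The helper below is a minimal sketch; the function name and the proc_root parameter (there only so the function can be pointed at a test directory) are choices made for this example.

```python
from pathlib import Path


def ipv6_disable_flags(proc_root="/proc"):
    """Return {interface: True/False} from the per-interface disable_ipv6 flags."""
    conf_dir = Path(proc_root) / "sys" / "net" / "ipv6" / "conf"
    if not conf_dir.is_dir():
        # The tree is expected to be absent when the kernel runs
        # without an IPv6 stack at all.
        return {}
    flags = {}
    for iface_dir in sorted(conf_dir.iterdir()):
        flag_file = iface_dir / "disable_ipv6"
        if flag_file.is_file():
            flags[iface_dir.name] = flag_file.read_text().strip() == "1"
    return flags


# Usage on a node: print(ipv6_disable_flags())
```

A value of True for "all" together with every interface entry means the sysctl change is active; it says nothing about whether applications still send AAAA queries.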
       
       

After these two steps, IPv6 DNS requests still reach the base node.

  3. Turn off IPv6 in avahi-daemon


<code># vim /etc/avahi/avahi-daemon.conf
(set the following option)
use-ipv6=no

# systemctl restart avahi-daemon</code>


Still not working

  4. Set the following option in /etc/resolv.conf


<code>options single-request-reopen

systemctl restart NetworkManager</code>


Still not working

  5. Modify the OVS configuration


<code># vim /etc/openvswitch/ovs-vswitchd.conf.db
other_config:ipv6_prefix=[]

systemctl restart openvswitch.service</code>


Still not working

  6. Comment out the IPv6 localhost entry in /etc/hosts


<code>[root@worker1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6</code>


Still not working

  7. Give IPv4 addresses precedence in /etc/gai.conf


<code>precedence ::ffff:0:0/96 100</code>
      
      

Still not working

  8. Remove the kernel module


<code>modprobe -r ipv6</code>

This fails: ipv6 is built into the kernel, and built-in modules cannot be unloaded.

  9. After searching the documentation, I found the relevant description in RFC 4472, Section 5.1


<code>5.1. DNS Lookups May Query IPv6 Records Prematurely

The system library that implements the getaddrinfo() function for
looking up names is a critical piece when considering the robustness
of enabling IPv6; it may come in basically three flavors:

1. The system library does not know whether IPv6 has been enabled in
   the kernel of the operating system: it may start looking up AAAA
   records with getaddrinfo() and AF_UNSPEC hint when the system is
   upgraded to a system library version that supports IPv6.

2. The system library might start to perform IPv6 queries with
   getaddrinfo() only when IPv6 has been enabled in the kernel.
   However, this does not guarantee that there exists any useful
   IPv6 connectivity (e.g., the node could be isolated from the
   other IPv6 networks, only having link-local addresses).

3. The system library might implement a toggle that would apply some
   heuristics to the "IPv6-readiness" of the node before starting to
   perform queries; for example, it could check whether only link-
   local IPv6 address(es) exists, or if at least one global IPv6
   address exists.

First, let us consider generic implications of unnecessary queries
for AAAA records: when looking up all the records in the DNS, AAAA
records are typically tried first, and then A records. These are
done in serial, and the A query is not performed until a response is
received to the AAAA query. Considering the misbehavior of DNS
servers and load-balancers, as described in Section 3.1, the lookup
delay for AAAA may incur additional unnecessary latency, and
introduce a component of unreliability.</code>
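The AF_UNSPEC hint mentioned in the RFC is what makes the resolver ask for both record types. A program that passes AF_INET to getaddrinfo() instead restricts the lookup to IPv4. The sketch below illustrates this with Python's socket module; it resolves localhost so that it runs without a DNS server, and the resolve helper is a name made up for this example. It only helps for software you control, which is not the case for most cluster components.

```python
import socket


def resolve(host, port, family):
    """Return the sorted set of (family name, address) pairs for host."""
    infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_STREAM)
    return sorted(
        {(socket.AddressFamily(fam).name, sockaddr[0]) for fam, _, _, _, sockaddr in infos}
    )


# AF_UNSPEC asks for IPv4 and IPv6 results, so the resolver may issue both
# A and AAAA queries; AF_INET restricts the lookup to IPv4 addresses.
ipv4_only = resolve("localhost", 80, socket.AF_INET)
print(ipv4_only)
```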

  10. Add a GRUB kernel parameter to disable IPv6

<code>ipv6.disable=1</code>

After rebooting with this parameter, ss no longer shows any IPv6 sockets:

<code>[root@worker1 ~]# ss -tunlp
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
udp UNCONN 0 0 0.0.0.0:111 0.0.0.0:* users:(("rpcbind",pid=793,fd=6))
udp UNCONN 0 0 0.0.0.0:33062 0.0.0.0:* users:(("avahi-daemon",pid=799,fd=16))
udp UNCONN 0 0 127.0.0.1:323 0.0.0.0:* users:(("chronyd",pid=817,fd=6))
udp UNCONN 0 0 0.0.0.0:4789 0.0.0.0:*
udp UNCONN 0 0 0.0.0.0:5353 0.0.0.0:* users:(("avahi-daemon",pid=799,fd=15))
udp UNCONN 0 0 0.0.0.0:55162 0.0.0.0:* users:(("rpcbind",pid=793,fd=7))
tcp LISTEN 0 128 0.0.0.0:111 0.0.0.0:* users:(("rpcbind",pid=793,fd=8))
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=1208,fd=3))
tcp LISTEN 0 5 127.0.0.1:631 0.0.0.0:* users:(("cupsd",pid=1327,fd=10))</code>
            
            

Although everything related to IPv6 addresses is now turned off, the IPv6 DNS resolution requests still cannot be stopped.

  11. A Google search showed that this was addressed in glibc 2.36, which added the no-aaaa option for /etc/resolv.conf (see the glibc change log in the reference links).


<code>* The "no-aaaa" DNS stub resolver option has been added. System
  administrators can use it to suppress AAAA queries made by the stub
  resolver, including AAAA lookups triggered by NSS-based interfaces
  such as getaddrinfo. Only DNS lookups are affected: IPv6 data in
  /etc/hosts is still used, getaddrinfo with AI_PASSIVE will still
  produce IPv6 addresses, and configured IPv6 name servers are still
  used. To produce correct Name Error (NXDOMAIN) results, AAAA queries
  are translated to A queries. The new resolver option is intended
  primarily for diagnostic purposes, to rule out that AAAA DNS queries
  have adverse impact. It is incompatible with EDNS0 usage and DNSSEC
  validation by applications.</code>
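Whether a node can use no-aaaa therefore depends on its glibc version. A minimal sketch of that check follows; the helper names are made up for this example, and since a distribution may backport resolver options to an older glibc, the version comparison is only a first approximation.

```python
import os


def parse_glibc_version(text):
    """Turn a string such as 'glibc 2.28' into a (major, minor) tuple."""
    name, _, version = text.partition(" ")
    if name != "glibc":
        raise ValueError(f"not a glibc version string: {text!r}")
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def supports_no_aaaa(text):
    """Upstream glibc added the no-aaaa resolver option in version 2.36."""
    return parse_glibc_version(text) >= (2, 36)


def local_glibc_version():
    """Ask the C library for its version string (glibc-based Linux only)."""
    return os.confstr("CS_GNU_LIBC_VERSION")


print(supports_no_aaaa("glibc 2.28"))  # False
print(supports_no_aaaa("glibc 2.36"))  # True
```

On a node, supports_no_aaaa(local_glibc_version()) applies the same test to the installed library.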
            
            

The glibc version on the system I am using is older than that, and it cannot be upgraded casually. The whole point is to avoid the timeout caused by requesting A and AAAA records at the same time: according to the /etc/resolv.conf manual, when both requests are sent and the A response arrives but the AAAA response does not, the client waits for a 5-second timeout (alarming). Hence the following.

Server side (workaround)

Since I cannot fix this in glibc, I start with CoreDNS on the base node instead.

Explanation: by default the client requests the A and AAAA records at the same time. If the server answers the AAAA request immediately, possibly even faster than the A request, the client never runs into the 5-second timeout.
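To see how quickly the server answers each record type, one can send a single A query and a single AAAA query and time the replies. The following is a minimal sketch with a hand-built DNS packet (RFC 1035 layout); the function names are hypothetical, and the reply is only timed, not parsed or matched against the query ID.

```python
import socket
import struct
import time

# Record type numbers from the DNS specification.
QTYPE = {"A": 1, "AAAA": 28}


def build_query(name, qtype, query_id=0x1234):
    """Build a minimal DNS query packet for the given name and record type."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Question: encoded name, root label, record type, class IN (1).
    question = labels + b"\x00" + struct.pack("!HH", QTYPE[qtype], 1)
    return header + question


def time_query(server, name, qtype, timeout=5.0):
    """Send one UDP query; return seconds until any reply, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(build_query(name, qtype), (server, 53))
        try:
            sock.recvfrom(512)
        except socket.timeout:
            return None
        return time.monotonic() - start
```

For example, time_query("10.12.24.125", "api.utccp.example.com", "AAAA") would time an AAAA lookup against the base node from the Corefile; a result of None means no reply arrived within the timeout.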

Two solutions currently come to mind:

  1. Use rewrite to turn AAAA queries into A queries, so that no AAAA records are served (safe, recommended)


<code>.:53 {
    rewrite stop type AAAA A

    template IN A apps.utccp.example.com {
        match .*apps\.utccp\.example\.com
        answer "{{ .Name }} 60 IN A 10.12.24.125"
        fallthrough
    }
    hosts {
        10.12.24.125 api.utccp.example.com
        10.12.24.125 api-int.utccp.example.com
        10.12.24.125 bastion.utccp.example.com
        10.12.24.127 master1.utccp.example.com
        10.12.24.128 master2.utccp.example.com
        10.12.24.129 master3.utccp.example.com
        fallthrough
    }
    prometheus
    cache 160
    forward . 114.114.114.114
    log
}</code>
  2. Make AAAA queries return NXDOMAIN (not recommended, because AAAA lookups will then report errors)


<code>.:53 {
    template IN AAAA {
        rcode NXDOMAIN
    }

    template IN A apps.utccp.example.com {
        match .*apps\.utccp\.example\.com
        answer "{{ .Name }} 60 IN A 10.12.24.125"
        fallthrough
    }
    hosts {
        10.12.24.125 api.utccp.example.com
        10.12.24.125 api-int.utccp.example.com
        10.12.24.125 bastion.utccp.example.com
        10.12.24.127 master1.utccp.example.com
        10.12.24.128 master2.utccp.example.com
        10.12.24.129 master3.utccp.example.com
        fallthrough
    }
    prometheus
    cache 160
    forward . 114.114.114.114
    log
}</code>

Due to time constraints, the other problem, the duplicated base domain, is left for the next article.

Reference links:
  1. resolv.conf manual: https://man7.org/linux/man-pages/man5/resolv.conf.5.html
  2. DNS standard (RFC 4472): https://www.rfc-editor.org/rfc/rfc4472.html
  3. OpenShift documentation: https://docs.openshift.com/container-platform/4.13/rest_api/operator_apis/dns-operator-openshift-io-v1.html#spec-upstreamresolvers
  4. Bug raised by a kind person: https://bugzilla.redhat.com/show_bug.cgi?id=1027452
  5. glibc change log: https://lists.gnu.org/archive/html/info-gnu/2022-08/msg00000.html
  6. glibc source code cloned by a kind person: https://github.com/bminor/glibc/tree/ibm/2.28/master
  7. glibc patch proposed by a kind person: https://sourceware.org/pipermail/libc-alpha/2022-June/139341.html
  8. Youdao translation: https://fanyi.youdao.com