7.DNS resolution failed, the oracle database could not be connected, the listener restarted and hung

The pro_db database cannot be connected, and the listener restarts and hangs to analyze the problem
Solution: 1. Remove the /etc/resolv.conf configuration file so that the database cannot be restored after DNS resolution
mv /etc/resolv.conf /etc/resolv.conf.bak
2. Add options timeout:1 to /etc/resolv.conf

Look at the nsswitch.conf configuration file, the configuration is indeed to first resolve the host name through the /etc/hosts file, and then look for DNS resolution through /etc/resolv.conf
cat /etc/nsswitch.conf
hosts: files dns

Question: Why is it configured to first resolve the host name through /etc/hosts and then look for DNS, but the host name cannot be resolved (or the resolution is very slow), causing the data monitoring to hang? Is it because the DNS resolution is configured that the monitor hangs and cannot be connected?

final conclusion:
The listener in version 11.2.0.1 of oracle first chooses to resolve the host name from DNS, and then tries to resolve it from the hosts file. When the /etc/resolv.conf file does not exist, it will be resolved from /etc/hosts, when the DNS server address cannot be accessed
When the database connection is slow, the listener will respond with a delay or hang once the DNS server resolves the problem and cannot find the ip. You can remove the /etc/resolv.conf file or set the timeout in the file
echo “options timeout:1” >> /etc/resolv.conf to resolve.

1. Reproduce the failure phenomenon in the standby database

1.1) View dns and host configuration (/etc/host.conf /etc/resolv.conf /etc/hosts)
Modify the DNS server in /etc/resolv.conf to an unreachable address, so that DNS resolution fails

[root@pro_db1-DG ~]# cat /etc/host.conf
multi on

[root@pro_db1-DG ~]# cat /etc/resolv.conf
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
#nameserver 172.20.113.27
#nameserver 172.20.113.28
nameserver 172.20.113.29

[root@pro_db1-DG ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 pro_db1-DG
172.20.124.139 laomatest


1.2) View listener configuration
[oracle@pro_db1-DG ~]$ cat /mnt/data_1t/app/oracle/product/11.2.0/VMANAGE/network/admin/listener.ora
SID_LIST_LISTENER=
  (SID_LIST =
    (SID_DESC =
      (ORACLE_HOME = /mnt/data_1t/app/oracle/product/11.2.0/VMANAGE)
      (SID_NAME = VMANAGE)
    )
  )

LISTENER =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = pro_db1-DG)(PORT = 1521))
  )

ADR_BASE_LISTENER = /mnt/data_1t/app/oracle
#TRACE_LEVEL_listener=admin


2. Open the trace of client sqlnet.ora
$ vi $ORACLE_HOME/network/admin/sqlnet.ora
TRACE_LEVEL_CLIENT = 16
TRACE_FILE_CLIENT = client
TRACE_DIRECTORY_CLIENT = /tmp
TRACE_TIMESTAMP_CLIENT = ON
DIAG_ADR_ENABLED=off

Open the trace of the listener (the listener has not been restarted so it is not implemented)
listener.ora open trace
DIAG_ADR_ENABLED_LISTENER=off
TRACE_LEVEL_LISTENER = 16
TRACE_TIMESTAMP_LISTENER = ON
TRACE_DIRECTORY_LISTENER = /tmp
TRACE_FILELEN_LISTENER = 1024
TRACE_FILENO_LISTENER = 1

3. sqlplus accesses the database through monitoring, and the connection is slow (it takes 12 seconds)

[oracle@pro_db1-DG ~]$ time sqlplus vmanage/y4yhl9t@vmanages

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL>

real 0m12.592s
user 0m0.045s
sys 0m0.016s


4. Analyze the cause of slowness through logs
>>> Use strace to trace the listener process and its subprocess calls ps -ef | grep lsnr; strace -frT -o /tmp/listener.strc -p 106787
>>> strace and lsof trace the call of sqlplus process strace -frT -o /tmp/sqlplus.trc sqlplus vmanage/y4yhl9t@vmanages
>>> View the trace vi /tmp/client_1830.trc of the client sqlnet
>>> View the trace of the listener (not opened)

log analysis
4.1) The trace of client sqlnet reflects that there is a 10-second delay between the first connect packet and the second nttrd packet
client_1830.trc shows the delay between 1st Connect Packet and Resend packet on nttrd function

(714936352) [31-JAN-2018 14:56:26:271] nscon: entry
(714936352) [31-JAN-2018 14:56:26:271] nscon: recving a packet
(714936352) [31-JAN-2018 14:56:26:271] nsprecv: entry
(714936352) [31-JAN-2018 14:56:26:271] nsprecv: reading from transport...
(714936352) [31-JAN-2018 14:56:26:271] nttrd: entry
(714936352) [31-JAN-2018 14:56:36:301] nttrd: socket 8 had bytes read=8
(714936352) [31-JAN-2018 14:56:36:301] nttrd: exit


4.2.1) strace traces the calling of sqlplus. It took 12s to write the file descriptor fd=9 in /tmp/sqlplus.trc, which took the longest time. Among them, inet_addr("172.20.114.30") was opened. This step happened to be address resolution

27011 0.000144 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9 <0.000036>
27011 0.000094 fcntl(9, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000027>
27011 0.000086 connect(9, {<!-- -->sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("172.20.114.30")}, 16) = -1 EINPROGRESS (Operation now in progress) < 0.000093>
27011 0.000177 times({<!-- -->tms_utime=0, tms_stime=2, tms_cutime=0, tms_cstime=0}) = 2406186220 <0.000026>
27011 0.000085 mmap(NULL, 802816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362452000 <0.000029>
27011 0.000085 poll([{<!-- -->fd=9, events=POLLOUT}], 1, 60000) = 1 ([{<!-- -->fd=9, revents=POLLOUT}]) < 0.000028>
27011 0.000083 getsockopt(9, SOL_SOCKET, SO_ERROR, [-3925634571361583104], [4]) = 0 <0.000026>
27011 0.000071 fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000025>
27011 0.000098 fcntl(9, F_SETFL, O_RDWR) = 0 <0.000019>
27011 0.000064 getsockname(9, {<!-- -->sa_family=AF_INET, sin_port=htons(50243), sin_addr=inet_addr("172.20.114.30")}, [16]) = 0 <0.000025>
27011 0.000079 getsockopt(9, SOL_SOCKET, SO_SNDBUF, [-3925628283529411012], [4]) = 0 <0.000025>
27011 0.000067 getsockopt(9, SOL_SOCKET, SO_RCVBUF, [-3925628283529374176], [4]) = 0 <0.000026>
27011 0.000070 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0 <0.000027>
27011 0.000075 fcntl(9, F_SETFD, FD_CLOEXEC) = 0 <0.000025>
27011 0.000179 rt_sigaction(SIGPIPE, {<!-- -->SIG_IGN, ~[ILL ABRT BUS FPE SEGV USR2 XCPU XFSZ SYS RTMIN RT_1], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x35bf40f4a0}, {<!-- -->SIG_DFL , [], 0}, 8) = 0 <0.000029>
27011 0.000135 write(9, "\0\352\0\0\1\0\0\0\1:\1,\fA \0\177\377\177\10\0\0\1\0\ 0\260\0:\0\0\10\0"..., 234) = 234 <0.000046>
27011 0.000117 read(9, "\0\10\0\0\v\0\0\0", 8208) = 8 <12.048260>
27011 12.048380 write(9, "\0\352\0\0\1\0\0\0\1:\1,\fA \0\177\377\177\10\0\0\1\0\ 0\260\0:\0\0\10\0"..., 234) = 234 <0.000058>
27011 0.000156 read(9, "\0 \0\0\2\0\0\0\1:\fA \0\177\377\1\0\0\0\0 AA\0\0\0\ 0\0\0\0\0", 8208) = 32 <0.000122>
27011 0.000384 write(9, "\0\234\0\0\6\0\0\0\0\0\336\255\276\357\0\222\0\0\0\0\0\ 4\0\0\4\0\3\0\0\0\0\0"..., 156) = 156 <0.000091


4.2.2) lsof checks the opening and calling of file descriptor fd=9, file descriptor 9 is a pipeline
[oracle@pro_db1-DG tmp]$ ps -ef | grep sqlplus | grep -v grep
oracle 2049 31214 0 12:46 pts/4 00:00:00 sqlplus

[oracle@pro_db1-DG tmp]$ lsof -p 2049
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
tnslsnr 3489 oracle 8u IPv6 282128771 0t0 TCP *:ncube-lm (LISTEN)
tnslsnr 3489 oracle 9w FIFO 0,8 0t0 282128761 pipe
tnslsnr 3489 oracle 10u unix 0xffff880e3f283080 0t0 282128774 /var/tmp/.oracle/s#3489.1
tnslsnr 3489 oracle 11u unix 0xffff880e3f283680 0t0 282128776 /var/tmp/.oracle/s#3489.2
tnslsnr 3489 oracle 12u IPv4 282128793 0t0 TCP localhost:15237->localhost:smux (CLOSE_WAIT)
tnslsnr 3489 oracle 13u IPv6 282165915 0t0 TCP localhost:ncube-lm->localhost:45928 (ESTABLISHED)
tnslsnr 3489 oracle 14u IPv6 453571765 0t0 TCP 172.20.114.30:ncube-lm->172.20.114.30:9881 (ESTABLISHED)
tnslsnr 3489 oracle 16w FIFO 0,8 0t0 453571767 pipe
tnslsnr 3489 oracle 17r FIFO 0,8 0t0 453571768 pipe

-------- FD file descriptor, u indicates that the file is opened in read/write mode, not read-only (r) or write-only (w) mode.
  

4.3) View the strace log of the listener process
4.3.1) View the host name resolution call sequence of the listener process through strace as follows, the host name resolution call happens to be in the pipeline fd=9:
/etc/resolv.conf > /etc/host.conf > /etc/hosts > libnss_dns.so.2 > libresolv.so.2

18880 0.000173 close(9) = 0 <0.000037>
18880 0.000098 open("/etc/resolv.conf", O_RDONLY) = 9 <0.000035>
18880 0.000081 fstat(9, {<!-- -->st_mode=S_IFREG|0644, st_size=305, ...}) = 0 <0.000027>
18880 0.000070 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3a1ea34000 <0.000029>
18880 0.000064 read(9, "# Generated by NetworkManager\
\
\
"..., 4096) = 305 <0.000030>
18880 0.000076 read(9, "", 4096) = 0 <0.000025>
18880 0.000064 close(9) = 0 <0.000039>
18880 0.000099 munmap(0x7f3a1ea34000, 4096) = 0 <0.000033>
18880 0.000086 uname({<!-- -->sys="Linux", node="pro_db1-DG", ...}) = 0 <0.000019>
18880 0.000108 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9 <0.000033>
18880 0.000080 connect(9, {<!-- -->sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) <0.000032>
18880 0.000085 close(9) = 0 <0.000030>
18880 0.000071 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9 <0.000028>
18880 0.000060 connect(9, {<!-- -->sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) <0.000029>
18880 0.000078 close(9) = 0 <0.000027>
18880 0.000073 open("/etc/host.conf", O_RDONLY) = 9 <0.000031>
18880 0.000076 fstat(9, {<!-- -->st_mode=S_IFREG|0644, st_size=9, ...}) = 0 <0.000027>
18880 0.000071 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3a1ea34000 <0.000028>
18880 0.000069 read(9, "multi on\
", 4096) = 9 <0.000029>
18880 0.000073 read(9, "", 4096) = 0 <0.000025>
18880 0.000064 close(9) = 0 <0.000026>
18880 0.000064 munmap(0x7f3a1ea34000, 4096) = 0 <0.000025>
18880 0.000077 futex(0x35bf39f384, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000026>
18880 0.000071 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 9 <0.000031>
18880 0.000081 fstat(9, {<!-- -->st_mode=S_IFREG|0644, st_size=205, ...}) = 0 <0.000026>
18880 0.000068 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3a1ea34000 <0.000026>
18880 0.000057 read(9, "127.0.0.1 localhost localhost."..., 4096) = 205 <0.000027>
18880 0.000088 read(9, "", 4096) = 0 <0.000048>
18880 0.000092 close(9) = 0 <0.000026>

18880 0.000078 open("/mnt/data_1t/app/oracle/product/11.2.0/VMANAGE/lib/libnss_dns.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000035>
18880 0.000087 open("/mnt/data_1t/app/oracle/product/11.2.0/VMANAGE/lib/libnss_dns.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000030>
18880 0.000079 open("/lib/libnss_dns.so.2", O_RDONLY) = 9 <0.000034>

18880 0.000068 open("/usr/lib/libnss_dns.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000039>
18880 0.000086 open("/etc/ld.so.cache", O_RDONLY) = 9 <0.000029>
18880 0.000073 fstat(9, {<!-- -->st_mode=S_IFREG|0644, st_size=123514, ...}) = 0 <0.000027>
18880 0.000070 mmap(NULL, 123514, PROT_READ, MAP_PRIVATE, 9, 0) = 0x7f3a1d064000 <0.000030>
18880 0.000061 close(9) = 0 <0.000026>
18880 0.000074 open("/lib64/libnss_dns.so.2", O_RDONLY) = 9 <0.000033>

18880 0.000088 open("/lib/libresolv.so.2", O_RDONLY) = 9 <0.000033>
18880 0.000077 read(9, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\ 240 &\0\0004\0\0\0"..., 832) = 832 <0.000026>
18880 0.000065 close(9) = 0 <0.000026>


 
4.3.2) Look again, the delay operation in the listener process of strace is between FD 9 and 15, and the address connected in FD =9 is the DNS server address configured in /etc/resolv.conf 172.20.113.29
That is, the fd=9 call needs to go to the DNS server to resolve the host name, because 172.20.113.29 cannot be accessed, resulting in a delay in resolution, and then go to fd=15 to resolve locally, and finally establish a connection with the client

18880 0.000148 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 9 <0.000039>
18880 0.000093 connect(9, {<!-- -->sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.20.113.29")}, 16) = 0 <0.000035>
18880 0.000091 poll([{<!-- -->fd=9, events=POLLOUT}], 1, 0) = 1 ([{<!-- -->fd=9, revents=POLLOUT}]) < 0.000028>
18880 0.000079 sendto(9, "\256>\1\0\0\1\0\0\0\0\0\0\tpro_db1-DG\0\0\34\0\1", 27, MSG_NOSIGNAL, NULL, 0) = 27 <0.000038>
18880 0.000086 poll([{<!-- -->fd=9, events=POLLIN}], 1, 5000 <unfinished ...>
3490 4.394932 <... nanosleep resumed> 0x7f29fc234e10) = 0 <5.000157>
3490 0.000082 socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP) = 15 <0.000042>
3490 0.000089 fcntl(15, F_SETFD, FD_CLOEXEC) = 0 <0.000026>
3490 0.000070 setsockopt(15, SOL_SOCKET, SO_LINGER, {<!-- -->onoff=1, linger=900}, 8) = 0 <0.000029>
3490 0.000076 connect(15, {<!-- -->sa_family=AF_INET6, sin6_port=htons(6150), inet_pton(AF_INET6, "::1", & amp;sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 ECONNREFUSED (Connection refused) <0.000108>
3490 0.000172 shutdown(15, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected) <0.000029>
3490 0.000076 close(15) = 0 <0.000037>
3490 0.000076 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 15 <0.000032>
3490 0.000069 fcntl(15, F_SETFD, FD_CLOEXEC) = 0 <0.000023>
3490 0.000061 setsockopt(15, SOL_SOCKET, SO_LINGER, {<!-- -->onoff=1, linger=900}, 8) = 0 <0.000023>
3490 0.000064 connect(15, {<!-- -->sa_family=AF_INET, sin_port=htons(6150), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused) <0.000053>
3490 0.000111 shutdown(15, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected) <0.000030>
3490 0.000082 close(15) = 0 <0.000041>


5. After changing the DNS server address back to the normal address, the listener returns to normal, and the sqlplus login database is normal
#vi /etc/resolv.conf
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
nameserver 172.20.113.27
nameserver 172.20.113.28

[oracle@dgvxl2249 ~]$ time sqlplus vmanage/y4yhl9t@vmanagess

real 0m0.952s >>>>>>>> It only takes 0.9 seconds to log in and return to normal
user 0m0.038s
sys 0m0.022s

Other test comparisons:
The standby machine of pro_db is 172.20.221.137 and the database version is also 11.2.0.1, but the problem cannot be reproduced with a single network card
The host of plm is 172.20.220.164, the database version is 11.2.0.4, single network card —— the problem cannot be reproduced

Reference documents:
DNS Issue: Connections To Oracle 11g are Slow or Delayed (Doc ID 561429.1)
Bug 9593134 – DNS or NIS mis-configuration can cause slow database connections (Doc ID 9593134.8)

Attached:
Check the listener version, 11.2.0.1

[oracle@pro_db1-DG ~]$ lsnrctl version
LSNRCTL for Linux: Version 11.2.0.1.0 - Production on 31-JAN-2018 15:03:48