Pitfalls and solutions encountered when using Docker

1. Docker services getting crossed at startup

Starting two groups of services with the docker-compose command, we found that the services got crossed with each other!

[Cause of the problem] Two groups of services were started with docker-compose from two directories with different names. With group A's services already running, starting group B's services caused some of group A's services to be restarted, which was very strange! Because of this problem, the group A and group B services could not run at the same time. At first I thought it was a bug in the tool; once I found the cause, it suddenly made sense.

# The service directory structure is as follows
A: /data1/app/docker-compose.yml
B: /data2/app/docker-compose.yml

[Solution] The reason the A and B service groups get crossed is that docker-compose attaches labels to the containers it starts, and then uses those labels to decide which containers belong to which project and should be managed by it. The label to watch here is com.docker.compose.project, whose value is the name of the innermost directory containing the startup configuration file; in the layout above, that is app. Since both groups resolve to the project name app, docker-compose treats them as the same project at startup, which leads to the behavior described above. For a deeper understanding, you can read the corresponding source code.
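
To verify this, you can read the label directly off a running container (the container name is a placeholder):

# Check which compose project a container belongs to
$ docker inspect --format '{{ index .Config.Labels "com.docker.compose.project" }}' <container_name>
app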

# You can adjust the directory structure as follows
A: /data/app1/docker-compose.yml
B: /data/app2/docker-compose.yml

A: /data1/app-old/docker-compose.yml
B: /data2/app-new/docker-compose.yml

Alternatively, use the -p parameter of the docker-compose command to specify the project name manually and avoid the problem.

# Specify the project name
$ docker-compose -f ./docker-compose.yml -p app1 up -d
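
The project name can also be set with the COMPOSE_PROJECT_NAME environment variable, or persisted in a .env file next to the compose file:

# Equivalent: set the project name via environment variable
$ COMPOSE_PROJECT_NAME=app1 docker-compose -f ./docker-compose.yml up -d

# Or persist it next to docker-compose.yml
$ echo 'COMPOSE_PROJECT_NAME=app1' > .env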

2. Docker command invocation error

Scripts often need to run docker commands, but you have to pay attention to the usage details!

[Cause of the problem] A script run by CI to update an environment failed during execution with the error below; from the output you can see that the command was not being run on a TTY device.
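
The error output from docker in this case is the familiar:

# CI job output
the input device is not a TTY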

I immediately checked the script and found that the failing line was a docker exec command, roughly as follows. The strange thing was that running it manually, or invoking the script directly, worked fine; it only failed when called by CI. Look at the command again and notice the -it parameters.

# The script calls the docker command
docker exec -it <container_name> psql -Upostgres …

Looking at these two parameters of the exec command side by side makes the cause almost obvious.

Number  Parameter           Explanation
1       -i / --interactive  Keep STDIN open even if not attached; enable this if the command needs input
2       -t / --tty          Allocate a pseudo-terminal; a bridge connecting the user's terminal to the container's stdin and stdout

[Solution] The -t parameter of docker exec means "allocate a pseudo-TTY", and CI does not run its jobs on a TTY, so the -t parameter triggers the error. Removing it (and -i as well, since no interactive input is needed here) fixes the script.
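
A minimal fix for the script above (the elided arguments are kept as-is):

# Script-friendly invocation: no pseudo-TTY, no interactive stdin
docker exec <container_name> psql -Upostgres …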

3. Docker scheduled task exception

Docker commands can also misbehave inside Crontab scheduled tasks!

[Cause of the problem] I found a problem today: a MySQL database was being backed up from inside a docker container, triggered by a Crontab scheduled task, yet the resulting backup was empty, while running the same command manually worked fine. Very strange.

# Crontab timed task
0 */6 * * * \
    docker exec -it <container_name> sh -c \
        'exec mysqldump --all-databases -uroot -ppassword …'

[Solution] It was later found to be caused by the -i parameter of the docker command above. Crontab does not run its commands interactively, so the option has to be removed. In summary: you need the -t option if you need terminal echo, and the -i option if you need an interactive session; a cron job needs neither.
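
The corrected entry would then look like this (elided arguments kept as-is):

# Crontab entry without -it: cron provides neither a TTY nor interactive stdin
0 */6 * * * \
    docker exec <container_name> sh -c \
        'exec mysqldump --all-databases -uroot -ppassword …'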

4. Docker environment variables and quotation marks

Should environment variables in Compose files be written with or without quotation marks?

[Cause of the problem] Anyone who has used Compose has probably faced this question when writing the service startup configuration file: should environment variables be added with single quotes, double quotes, or no quotes at all? Over time we tend to mix all three and assume the effect is the same, but more and more pitfalls turned up, and the behavior seemed ever more obscure.

In any case, I ran into many services failing to start simply because quotation marks had been added, and concluded that quotation marks should not be used at all. Running bare: unprecedented relief! It was only when I saw the corresponding issue on GitHub that the case was finally closed.

# Referencing TEST_VAR in Compose: the value cannot be found
TEST_VAR="test"

# Referencing TEST_VAR in Compose: the value is found
TEST_VAR=test

# It turns out docker itself handles the quotation marks correctly
docker run -it --rm -e TEST_VAR="test" test:latest

[Solution] The conclusion: when Compose parses the YAML configuration file, the quotation marks are kept as part of the value, so the original TEST_VAR="test" is effectively read as 'TEST_VAR="test"', and looking the variable up by the unquoted value fails. The rule, therefore: whether environment variables are set directly in the configuration file or through an env_file, do not quote them.

One caveat: if a value is in date format (2022-01-01) and the file is read with Python's yaml.load module, it will be parsed as a date type. If you want to keep the original text, wrap it in ' or " to force it into a string.
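
A quick illustration in a Compose environment block (the variable name is illustrative):

# An unquoted date-like value is parsed by YAML loaders as a date, not a string
environment:
  RELEASE_DATE: "2022-01-01"   # quoted: stays the string "2022-01-01"
  # RELEASE_DATE: 2022-01-01   # unquoted: yaml.load turns this into a date object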

5. Docker image deletion reports an error

Can't delete the image; in the end, something still references it!

[Cause of the problem] While clearing disk space on the server, deleting an image produced the message below. It suggests the deletion must be forced, but forcing it turns out not to work either.

# delete image
$ docker rmi 3ccxxxx2e862
Error response from daemon: conflict: unable to delete 3ccxxxx2e862 (cannot be forced) - image has dependent child images

# force delete
$ docker rmi -f 3ccxxxx2e862
Error response from daemon: conflict: unable to delete 3ccxxxx2e862 (cannot be forced) - image has dependent child images

[Solution] It turned out that the cause is TAGs: other images reference this image as a parent. Use the command below to view the image's dependencies, then delete the dependent images by their TAGs first.

# Query dependencies - image_id is the ID of the image being checked
$ docker image inspect --format='{{.RepoTags}} {{.Id}} {{.Parent}}' $(docker image ls -q --filter since=<image_id>)

# Delete images according to TAG
$ docker rmi -f c565xxxxc87f
# delete the dangling image
$ docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
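
Newer Docker versions (1.13+) also ship a built-in convenience command for the dangling-image cleanup above:

# Equivalent cleanup of dangling images
$ docker image prune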

6. Docker non-root user switch

When switching the user a Docker service runs as, you still need to watch out for permission issues!

[Cause of the problem] We know that using the root user inside a Docker container is unsafe and easily leads to privilege-escalation issues, so under normal circumstances we use an ordinary user instead of root to start and manage services. When switching the user for a service today, the Nginx service kept failing to start with the permission errors below, because the configuration file did not relocate the /var-related directories.

# Nginx error message
nginx: [alert] could not open error log file: open() "/var/log/nginx/error.log" failed (13: Permission denied)
2020/11/12 15:25:47 [emerg] 23#23: mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)

[Solution] The problem was the nginx.conf configuration file, which was still the stock one. Point every file the Nginx service needs at startup (error log, pid file, temp directories) at locations the non-root user is permitted to write, as below, and the problem is solved.

user www-data;
worker_processes 1;

error_log /data/logs/master_error.log warn;
pid /dev/shm/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    gzip on;
    sendfile on;
    tcp_nopush on;
    keepalive_timeout 65;

    client_body_temp_path /tmp/client_body;
    fastcgi_temp_path /tmp/fastcgi_temp;
    proxy_temp_path /tmp/proxy_temp;
    scgi_temp_path /tmp/scgi_temp;
    uwsgi_temp_path /tmp/uwsgi_temp;

    include /etc/nginx/conf.d/*.conf;
}
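
An alternative sketch, if you would rather keep the default paths: give the non-root user ownership of them at image build time (this assumes a Debian-based Nginx image and the www-data user):

# Dockerfile: let www-data write the default Nginx directories
RUN chown -R www-data:www-data /var/cache/nginx /var/log/nginx \
    && touch /var/run/nginx.pid \
    && chown www-data:www-data /var/run/nginx.pid
USER www-data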

7. Docker binds to IPv6

When the Docker service starts, binding the address to IPv6 fails with an error message!

[Cause of the problem] After a patch was applied to the physical machine and the service restarted, a docker-compose service that used to start normally began reporting the error below. It is not clear whether the operating system configuration was changed, or some Docker configuration such as /etc/docker/daemon.json or the docker service startup file.

# Docker error message
$ docker run -p 80:80 nginx:alpine
Error starting userland proxy: listen tcp6 [::]:80: socket: address family not supported by protocol.

[Solution] From the error message above you can see the service port is being bound on tcp6, but the socket call fails because the system does not support it. Checking the operating system's IPv6 settings, we found the system had disabled all IPv6 addresses.

# Operating system configuration
$ cat /etc/sysctl.conf | grep ipv6
net.ipv6.conf.all.disable_ipv6=1

[Method 1] The simplest solution is to bind the service port explicitly to IPv4 in the docker-compose.yml file, as shown below.

version: "3"

services:
  app:
    restart: on-failure
    container_name: app_web
    image: app:latest
    ports:
      - "0.0.0.0:80:80/tcp"
    volumes:
      - "./app_web:/data"
    networks:
      - app_network

networks:
  app_network:

[Method 2] Alternatively, modify the /etc/docker/daemon.json file to stop Docker from mapping ports onto IPv6, achieving the same effect without modifying the startup configuration of every service again.

# modify the configuration
$ vim /etc/docker/daemon.json
{
  "ipv6": false,
  "fixed-cidr-v6": "2001:db8:1::/64"
}

# restart service
$ sudo systemctl restart docker

[Method 3] By default, Docker maps ports to both IPv4 and IPv6, and sometimes a bind ends up only on IPv6, leaving the service unreachable. IPv4 addresses are still the common case, so the simplest approach is to turn IPv6 off on the host entirely.

# Modify system configuration
echo '1' > /proc/sys/net/ipv6/conf/lo/disable_ipv6
echo '1' > /proc/sys/net/ipv6/conf/all/disable_ipv6
echo '1' > /proc/sys/net/ipv6/conf/default/disable_ipv6

# Restart the network
$ /etc/init.d/networking restart

# Finally check if IPv6 is turned off
$ ip addr show | grep inet6

8. Docker container startup timeout

When the Docker service started, it timed out and was terminated!

[Cause of the problem] When starting containers with docker-compose, after a long wait (about 2-3 minutes) the message below appeared. Reading it, you can see the failure is a read timeout, and that an environment variable can be set to raise the timeout.

$ docker-compose up -d
ERROR: for xxx UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

[Solution] After setting the environment variables as prompted and starting again, everything came up normally, though startup still felt a little slow.

$ sudo vim /etc/profile
export COMPOSE_HTTP_TIMEOUT=500
export DOCKER_CLIENT_TIMEOUT=500
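
Reload the profile so the variables take effect in the current shell, then retry:

# Apply and retry
$ source /etc/profile
$ docker-compose up -d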

The slow startup itself was tracked down next: the container maps a host directory that is quite large, so I/O was the suspect. Checking the server's current I/O with iotop showed many rg processes, each at around 100%. They turned out to be directory-indexing processes started by the vscode remote server. Pitfalls everywhere!

$ sudo iotop
 4269 be/4 escape 15.64 K/s 0.00 B/s 0.00 % 98.36 % rg --files --hidden
 4270 be/4 escape 28.15 K/s 0.00 B/s 0.00 % 97.46 % rg --files --hidden
 4272 be/4 escape 31.27 K/s 0.00 B/s 0.00 % 97.39 % rg --files --hidden
 4276 be/4 escape 34.40 K/s 0.00 B/s 0.00 % 96.98 % rg --files --hidden

9. Docker port network restrictions

If the services themselves are all healthy but cannot be reached, it is usually a network problem!

[Cause of the problem] After the service was brought up, the login redirect went straight to a 502 error. Configuration and other related causes were ruled out (the relevant tests were done), which made it very strange!

# deploy service architecture
nginx(80) -> web1(8080)
          -> web2(8081)

# The error message is as follows
nginx connect() failed (113: No route to host) while connecting to upstream

[Solution] The error message says there is no route to the specified host. The firewall turned out to be running, and the logs showed the traffic was being filtered out. With the cause found, the fix is simple: either add firewall rules, or turn the firewall off.

# Check for open ports
$ sudo firewall-cmd --permanent --zone=public --list-ports

# Open the port that needs to be routed
$ sudo firewall-cmd --permanent --zone=public --add-port=8080/tcp
$ sudo firewall-cmd --permanent --zone=public --add-port=8081/tcp

# The configuration takes effect immediately
$ sudo firewall-cmd --reload

# Turn off the firewall
$ sudo systemctl stop firewalld.service

# disable autostart
$ sudo systemctl disable firewalld.service

10. Docker cannot pull the image

A newly initialized machine cannot pull images from the private registry!

[Cause of the problem] After the machine was initialized, the command below was used to log in to the private docker registry, but the image could not be pulled, even though the same pull succeeded on other machines. Very strange!

# Log in to the private registry
$ echo '123456' | docker login -u escape --password-stdin docker.escapelife.site

# Exception message prompt
$ sudo docker pull docker.escapelife.site/app:0.10
Error response from daemon: manifest for docker.escapelife.site/app:0.10 not found: manifest unknown: manifest unknown

[Solution] Embarrassing: I thought I had found a hidden bug, but after much investigation it turned out the image tag was simply wrong. It should have been 0.0.10, but I wrote 0.10. Recording it here as a reminder: if you hit the error above, the image (tag) simply does not exist.

# After logging in to the private registry, a docker config file is generated in the user's home directory
# It records the registry login credentials (base64-encoded, so not really secure)
$ cat .docker/config.json
{
    "auths": {
        "docker.escapelife.site": {
            "auth": "d00u11Fu22B3355VG2xasE12w=="
        }
    }
}
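
The auth value is simply the base64 encoding of username:password, which is why this file should be protected; for example, with the credentials from the login command above:

# base64 of "username:password"
$ echo -n 'escape:123456' | base64
ZXNjYXBlOjEyMzQ1Ng==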

11. Docker keeps the container from exiting

How to keep a container service started with docker-compose hanging around without exiting?

[Cause of the problem] Sometimes a problem (bug) prevents a service from starting normally, and the container restarts endlessly (restart: on-failure), which makes troubleshooting hard.

$ docker ps -a
4e6xxx9a4 app:latest "/xxx/…" 26 seconds ago Restarting (1) 2 seconds ago

[Solution] At this point we pick a command to keep the service hanging, based on how the service image defines its startup (command vs entrypoint). The principle is the same as entering the container with /bin/bash, so I won't explain it at length here.

# similar principle
docker run -it --rm --entrypoint=/bin/bash xxx/app:latest

# Using the command directive
tty: true
command: tail -f /dev/null

# Using the entrypoint directive
tty: true
entrypoint: tail -f /dev/null

Similarly, when deploying services with docker-compose or on a k8s platform, startup problems sometimes mean we want the service not to exit immediately, so we can debug manually and find the cause. So here I record the suspension method for each deployment approach.

# Compose

version: "3"
services:
  app:
    image: ubuntu:latest
    tty: true
    entrypoint: /usr/bin/tail
    command: "-f /dev/null"

# K8S

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
spec:
  containers:
    - name: ubuntu
      image: ubuntu:latest
      command: ["/bin/bash", "-c", "--"]
      args: ["while true; do sleep 30; done;"]
      # command: ["sleep"]
      # args: ["infinity"]

12. Docker does not use the default network segment

In some cases, an internally planned subnet conflicts with dockerd's default address pools, causing exceptions!

[Cause of the problem] Today a whole set of services (spanning multiple machines) was deployed on new machines. After deployment, the services could not be reached through the front-end Nginx: requests sent to the ports opened on the machines were never forwarded. This was strange, since the ports had been opened and nothing should have failed.

$ nc -v 172.16.100.12 8000
nc: connect to 172.16.100.12 port 8000 (tcp) failed: Connection refused

[Solution] The server port appeared blocked. Suspecting the dockerd service, I stopped it and started a plain Python web server directly on the machine (the Linux machine ships with Python 2.7.x); the front-end Nginx could then reach the port. It turned out that the subnet planned for the internal services conflicted with dockerd's default address pool, so dockerd had rewritten the machine's firewall rules, causing the exception above.

$ python -m SimpleHTTPServer 8000
Serving HTTP on 0.0.0.0 port 8000...

$ nc -v 172.16.100.12 8000
Connection to 172.16.100.12 8000 port [tcp/*] succeeded!

Now that the cause is known, the fix is simple: don't use the default address pool! Following the Mirantis documentation on default-address-pools, we can configure it and then restart the dockerd service.

# modify the configuration
$ sudo cat /etc/docker/daemon.json
{
  "default-address-pools":[{"base":"192.168.100.0/20","size":24}]
}

# restart service
$ sudo systemctl restart docker

# Start the service to verify whether it is in effect
$ ip a
$ docker network inspect app | grep Subnet

This is where our subnetting skills get tested: how do you divide a given network segment reasonably and efficiently? Ahem, it really stumps me. An online subnet calculator helps here to pick a reasonable base and size.

# error message
Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network

# Based on the calculation, we can divide the pool reasonably
# Given the segment 10.210.200.0 + 255.255.255.0 (a /24) to divide into subnets
$ sudo cat /etc/docker/daemon.json
{
  "default-address-pools":[{"base":"10.210.200.0/24","size":28}]
}

Here, base tells us which network segment the pools are carved from and where the prefix ends (at the second octet, /16, or the third, /24?), while size tells us how many IP addresses each divided subnet gets. From 10.210.200.0/24 we know the segment holds only 254 usable addresses (using it directly as a single pool would certainly not be enough), and we want docker to carve it up; if each subnet is to get 16 addresses, size must be written as 28.
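
The arithmetic behind those numbers, for reference:

# /24 pool split into /28 subnets:
#   addresses per subnet: 2^(32-28) = 16 (14 usable hosts)
#   number of docker networks available: 2^(28-24) = 16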

13. Docker adds a private registry

In some cases, our servers need to pull from an internal private container registry!

[Cause of the problem] If a new machine needs to use a private registry but has not yet been configured for it, pulling an image produces the error below.

# Error when pulling from or logging in to the private registry
$ docker pull 192.168.31.191:5000/nginx:latest
x509: certificate signed by unknown authority

[Solution] The fix is very simple: configure the registry address as shown below, restart the service, and log in to the private registry again.

# Add configuration
$ sudo cat /etc/docker/daemon.json
{
    "insecure-registries": ["192.168.31.191:5000"]
}

# restart docker
$ sudo systemctl restart docker

# Just log in again
$ docker login 192.168.31.191:5000 -u <username> -p <password>

14. Docker time zone synchronization

Solving the problem of the Docker container's time zone being out of sync with the host!

[Cause of the problem] Sometimes a newly created container's clock differs from the host's, so service logs, scheduled tasks, and the like do not fire at the expected times, which is very troublesome.

# Time on the host (CST - UTC+8 - Beijing time)
[root@server ~]# date
Fri Apr 27 22:49:47 CST 2022

# Time inside the container (UTC - GMT - standard time)
[root@server ~]# docker run --rm nginx date
Fri Apr 27 14:49:51 UTC 2022

[Solution] The host has its time zone set but the Docker container does not, resulting in the 8-hour difference between the two. Any of the following methods fixes it.

# start with docker run
$ docker run -d --name 'app' \
    -v /etc/localtime:/etc/localtime \
    escape/nginx:v1

# Build with Dockerfile
ENV TZ=Asia/Shanghai
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

# Start with docker-compose
environment:
  TZ: Asia/Shanghai
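
A quick check after applying any of the methods (container name taken from the run example above):

# The container's time should now match the host
$ docker exec app date
$ date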