An All-in-One Network for Dehazing and Beyond

Abstract

A convolutional neural network (CNN) based image dehazing model, called All-in-One Dehazing Network (AOD-Net), is proposed. It is designed based on a reformulated atmospheric scattering model. Instead of estimating the transmission matrix and the atmospheric light separately as most previous models do, AOD-Net directly generates the clean image through a lightweight CNN. This novel end-to-end design makes it easy to embed AOD-Net into other deep models, such as Faster R-CNN, to improve high-level task performance on hazy images. Experimental results on synthetic and natural hazy images show that the proposed algorithm outperforms existing algorithms in terms of peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and subjective visual quality. Furthermore, when AOD-Net is concatenated with Faster R-CNN and the joint pipeline is trained end-to-end, we witness a substantial improvement in object detection performance on hazy images.

I. INTRODUCTION

Haze introduces complicated, non-uniform noise into images captured by cameras, owing to the presence of aerosols such as dust, mist, and smoke. It greatly degrades the visibility of outdoor images: contrast is reduced and surface colors become faint. Moreover, hazy images jeopardize the effectiveness of many subsequent high-level computer vision tasks, such as object detection and recognition. Dehazing is therefore widely regarded as a challenging instance of (ill-posed) image restoration and enhancement. Similar to other problems such as image denoising and super-resolution [37], [15], earlier dehazing work [23], [30], [38], [12] assumed that multiple images of the same scene were available. Removing haze from a single image has since gained dominant popularity because it is more practical in realistic settings [7]. This paper studies single-image dehazing.

A. Prior Work

Hazy image generation follows a well-known physical model (see Section II-A for details), which serves as prior knowledge for haze removal. Besides estimating the global atmospheric light, it has been recognized that the key to haze removal is to recover the transmission matrix. [7] proposed a physically grounded method by estimating the albedo of the scene. [9], [34] discovered the effective dark channel prior (DCP) to compute the transmission matrix more reliably, followed by a series of works [13], [24], [36]. [20] enforced boundary constraints and contextual regularization to obtain sharper restored images. An accelerated method for automatic atmospheric light recovery was proposed in [33]. [45] established a color attenuation prior and a linear scene depth model for hazy images, and learned the model parameters in a supervised manner. [16] jointly estimated scene depth and recovered the clear latent image from a hazy video sequence. [1] proposed a non-local prior (haze-line) based algorithm, which assumes that each color cluster in a clear image becomes a haze-line in RGB space. All these methods depend on the physical model and various sophisticated image statistical assumptions. However, since the estimation of physical parameters from a single image is often inaccurate, the dehazing performance of the above methods is not always satisfactory. Recently, with the prevailing success of convolutional neural networks (CNNs) in computer vision tasks, they have also been introduced to image dehazing. DehazeNet [3] proposed a trainable model to estimate the transmission matrix from a hazy image. [27] further used a multi-scale CNN (MSCNN) to first generate a coarse-scale transmission matrix and then refine it.

B. Key Challenges and Bottlenecks

1) Absence of End-to-End Dehazing:

Most deep learning methods for image restoration and enhancement fully embrace end-to-end modeling: a model is trained to directly regress the clean image from the corrupted one. Examples include image denoising [42], deblurring [31] and super-resolution [41]. In contrast, there has so far been no end-to-end deep model for dehazing that directly regresses a clean image from a hazy one. While this may seem odd at first glance, one needs to recognize that haze essentially introduces non-uniform, signal-dependent noise: the attenuation of scene radiance caused by haze depends on the physical distance between the surface and the camera (i.e., the per-pixel depth). This differs from most image degradation models, which assume signal-independent noise, where every pixel undergoes the same parametric degradation process and the restoration model can therefore be learned as a static mapping function. This does not directly apply to dehazing: the degradation varies with the signal, and the restoration model must be input-adaptive as well.

Existing methods share the belief that, in order to recover a clean scene from haze, the key is to estimate an accurate medium transmission map [1], [3], [27]. The atmospheric light is then computed separately via empirical rules, and the clean image is recovered based on the physical model. Although intuitive, such a procedure does not directly measure or minimize the reconstruction distortion. The errors in the two separate estimation steps, for the transmission matrix and for the atmospheric light, accumulate and potentially amplify each other. As a result, the conventional separate pipeline leads to suboptimal image restoration quality.

2) Missing Link with High-Level Vision Tasks:

Current dehazing models rely on two sets of evaluation criteria: (1) for synthetic hazy images whose clean ground truths are known, PSNR and SSIM are usually computed to measure restoration fidelity; (2) for real natural hazy images without ground truth, subjective visual quality is the only available way to compare dehazing results. However, unlike image denoising and super-resolution, where the suppression of visual artifacts is clearly visible (e.g., on textures and edges), the visual differences between state-of-the-art dehazing models [1], [3], [27] usually manifest in global illumination and color tone, and are often too subtle to tell apart.

General image restoration and enhancement, which belong to low-level vision tasks, are usually treated as preprocessing steps for mid- and high-level vision tasks. It is known that the performance of high-level computer vision tasks, such as object detection and recognition, deteriorates in the presence of various degradations and is therefore largely affected by the quality of image restoration and enhancement. However, to our knowledge, the correlation between dehazing algorithms (and their results) and the performance of high-level vision tasks has not yet been explored.

C. Main Contributions

In this paper, we propose All-in-One Dehazing Network (AOD-Net), a CNN-based dehazing model with two key innovations to address the above two challenges:

· We are the first to propose an end-to-end trainable dehazing model that directly produces clean images from hazy images, rather than relying on any separate, intermediate parameter estimation step. AOD-Net is designed based on a reformulated atmospheric scattering model, and thus retains the same physical grounding as existing works [3], [27]. However, it reflects our different belief that the physical model can be formulated in a “more end-to-end” fashion, with all of its parameters estimated in one unified model.

· For the first time, we quantitatively study how dehazing quality affects subsequent high-level vision tasks, which provides a new objective criterion for comparing dehazing results. Furthermore, AOD-Net can be seamlessly embedded with other deep models to form one pipeline that performs high-level tasks on hazy images with an implicit dehazing process. Thanks to our unique all-in-one design, such a pipeline can be jointly tuned end to end for further improved performance, which is infeasible if AOD-Net is replaced with other deep dehazing alternatives [3], [27].

AOD-Net is trained on synthetic hazy images and tested on both synthetic and real natural images. Experiments demonstrate that AOD-Net outperforms several state-of-the-art methods not only in terms of PSNR and SSIM (see Figure 1), but also in visual quality (see Figure 2). AOD-Net is a lightweight and efficient model, taking only 0.026 seconds to process a 480 × 640 image on a single GPU. When concatenated with Faster R-CNN [26], AOD-Net substantially outperforms other dehazing models in improving object detection performance on hazy images, and the performance margin grows further when we jointly tune the AOD-Net and Faster R-CNN pipeline end to end.

This paper extends a previous conference version [14]. The most notable improvement lies in Section IV, where we provide an in-depth discussion on evaluating and enhancing dehazing for object detection and introduce a joint training scheme with rich details and analysis. We also conduct a more detailed and comprehensive analysis of the AOD-Net architecture (e.g., Section III-D). Additionally, we include broader comparison results.

II. AOD-NET: THE ALL-IN-ONE DEHAZING MODEL

This section presents the proposed AOD-Net. We first introduce the transformed atmospheric scattering model on which AOD-Net is built, and then describe its architecture in detail.

A. Physical Model and Transformed Formula

The atmospheric scattering model has long been the classic description of hazy image generation [19], [21], [22]:

I(x) = J(x) t(x) + A (1 − t(x)),   (1)

where I(x) is the observed hazy image and J(x) is the scene radiance (i.e., the ideal “clean image”) to be recovered. There are two critical parameters: A denotes the global atmospheric light, and t(x) is the transmission matrix, defined as:

t(x) = e^(−β d(x)),   (2)

where β is the scattering coefficient of the atmosphere and d(x) is the distance between the object and the camera. The model (1) can be rewritten with the clean image as the output:

J(x) = (1 / t(x)) I(x) − A (1 / t(x)) + A.   (3)

Existing works such as [27], [3] follow the same three-step procedure: 1) estimating the transmission matrix t(x) from the hazy image I(x) with a sophisticated deep model; 2) estimating A by some empirical method; and 3) estimating the clean image J(x) via (3). Such a procedure leads to a suboptimal solution that does not directly minimize the image reconstruction error: the separate estimates of t(x) and A cause accumulated or even amplified errors when they are combined to compute (3).

Our core idea is to unify the two parameters t(x) and A into one variable, namely K(x) in (4), and to directly minimize the reconstruction error in the pixel domain. To this end, the expression in (3) is reformulated into the following transformed formula:

J(x) = K(x) I(x) − K(x) + b,   where   K(x) = ((1 / t(x)) (I(x) − A) + (A − b)) / (I(x) − 1),   (4)

in which b is a constant bias (b = 1 by default). In this way, both 1/t(x) and A are absorbed into K(x), which itself depends on the input I(x).
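
As a sanity check on the reformulation, the following minimal NumPy sketch (not part of AOD-Net itself; the array shapes and the toy values of t(x) and A are illustrative assumptions) confirms that recovering J(x) via K(x) in (4) reproduces the physical-model recovery in (3):

import numpy as np

# Toy clean image and ground-truth parameters (illustrative values only).
rng = np.random.default_rng(0)
J_true = rng.uniform(0.0, 1.0, size=(4, 4, 3))   # clean scene radiance
t = rng.uniform(0.2, 0.9, size=(4, 4, 1))        # transmission matrix t(x)
A = 0.8                                          # global atmospheric light
b = 1.0                                          # constant bias in (4)

I = J_true * t + A * (1.0 - t)                   # hazy image via (1)

J_eq3 = I / t - A / t + A                        # recovery via (3)

K = ((I - A) / t + (A - b)) / (I - 1.0)          # unified variable K(x) in (4)
J_eq4 = K * I - K + b                            # recovery via (4)

print(np.allclose(J_eq3, J_eq4), np.allclose(J_eq4, J_true))  # True True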

B. Network Design

As shown in Fig. 4(a), the proposed AOD-Net consists of two modules: a K-estimation module that estimates K(x) from the input I(x), followed by a clean image generation module that uses K(x) as its input-adaptive parameters to estimate J(x).

The K-estimation module is the critical component of AOD-Net, responsible for estimating the depth and relative haze level. As shown in Fig. 4(b), we use five convolutional layers and form multi-scale features by fusing filters of different sizes. [3] used parallel convolutions with different filter sizes in its second layer, and [27] concatenated the coarse-scale network features with the intermediate layer of its fine-scale network. Inspired by them, the “concat 1” layer of AOD-Net concatenates the features from the “conv 1” and “conv 2” layers. Similarly, “concat 2” concatenates those from “conv 2” and “conv 3”, and “concat 3” concatenates those from “conv 1”, “conv 2”, “conv 3” and “conv 4”. Such a multi-scale design captures features at different scales, and the intermediate connections also compensate for the information loss during convolution. Notably, each convolutional layer of AOD-Net uses only three filters, making AOD-Net much more lightweight than existing deep methods, e.g., [3], [27]. Following the K-estimation module, the clean image generation module consists of an element-wise multiplication layer and several element-wise addition layers that produce the restored image by computing (4). A TensorFlow implementation of this architecture appears in the Code section at the end of the paper.

To demonstrate why it is important to learn t(x) and A jointly, we compare against a naive baseline that first estimates A with the conventional method in [9] and then learns t(x) from (3) with an end-to-end deep network by minimizing the reconstruction error (see Section III for the synthetic setting). As observed in Figure 3, the baseline tends to overestimate A, resulting in overexposure-like visual effects. AOD-Net clearly produces more realistic lighting conditions and structural details, since the joint estimation of 1/t(x) and A allows them to mutually refine each other. Inaccurate estimates of other hyperparameters (e.g., the gamma correction) can also be absorbed and compensated within the all-in-one formulation.

III. EVALUATIONS ON DEHAZING

A. Datasets and Implementation

We create synthetic hazy images using (1), based on the ground-truth images with depth metadata from the indoor NYU2 Depth database [32]. We set different atmospheric lights A by choosing each channel uniformly at random from [0.6, 1.0], and select β ∈ {0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6}. From the NYU2 database, we take 27,256 images as the training set and 3,170 non-overlapping images as test set A. We also build test set B from 800 full-size synthetic images based on the Middlebury stereo database. Additionally, we test on natural hazy images to evaluate the generalization of our model.
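
A minimal sketch of this synthesis step is shown below, assuming a clean RGB image and its depth map are already loaded as NumPy arrays (the loading itself and any depth scaling are omitted and would depend on the dataset format):

import numpy as np

def synthesize_hazy(clean, depth, betas=(0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6),
                    a_range=(0.6, 1.0), rng=None):
    """Generate hazy versions of a clean image via the model in (1).

    clean: float array in [0, 1], shape (H, W, 3).
    depth: float array, shape (H, W), scene depth d(x).
    """
    rng = rng or np.random.default_rng()
    hazy_images = []
    for beta in betas:
        # Per-channel atmospheric light A, each channel uniform in [0.6, 1.0].
        A = rng.uniform(a_range[0], a_range[1], size=(1, 1, 3))
        # Transmission t(x) = exp(-beta * d(x)), broadcast over color channels.
        t = np.exp(-beta * depth)[..., np.newaxis]
        hazy = clean * t + A * (1.0 - t)          # equation (1)
        hazy_images.append(np.clip(hazy, 0.0, 1.0))
    return hazy_images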

During training, the weights are initialized with Gaussian random variables. We use the ReLU neuron, which we found more effective than the BReLU neuron proposed in [3] for our particular setting. The momentum and the decay parameter are set to 0.9 and 0.0001, respectively. We use a batch size of 8 images (480 × 640) and a learning rate of 0.001. We adopt the simple mean squared error (MSE) loss and find that it improves not only PSNR but also SSIM and visual quality.
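
A sketch of this training configuration in TensorFlow is given below; interpreting the decay parameter 0.0001 as the L2 weight decay already attached to the layers in the Code section is our assumption, and the dataset pipeline is omitted:

import tensorflow as tf

model = AODNet(stddev=0.02, weight_decay=1e-4)    # AODNet class as defined in the Code section
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
model.compile(optimizer=optimizer, loss=tf.keras.losses.MeanSquaredError())

# hazy_batch, clean_batch: float32 tensors of shape (8, 480, 640, 3) with values in [0, 1]
# model.fit(train_dataset, epochs=10)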

The AOD-Net model takes about 10 training epochs to converge and usually performs well enough after 10 epochs; in this paper, we train the model for 40 epochs. We also find it helpful to clip the gradients to the range [−0.1, 0.1], a technique popularized for stabilizing recurrent network training [25].
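
Value-based gradient clipping of this kind can be sketched with a custom training step as below (a minimal illustration under the configuration described above, not the authors' original training setup):

import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function
def train_step(model, optimizer, hazy_batch, clean_batch):
    with tf.GradientTape() as tape:
        pred = model(hazy_batch, training=True)
        loss = loss_fn(clean_batch, pred)
        if model.losses:                          # add L2 regularization terms, if any
            loss = loss + tf.add_n(model.losses)
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip each gradient element to the range [-0.1, 0.1].
    grads = [tf.clip_by_value(g, -0.1, 0.1) for g in grads]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss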

B. Quantitative Results on Synthetic Images

We compare the proposed model with several state-of-the-art dehazing methods: Fast Visibility Restoration (FVR) [35], Dark Channel Prior (DCP) [9], Boundary Constrained Context Regularization (BCCR) [20], Automatic Atmospheric Light Recovery (ATM) [33], Color Attenuation Prior (CAP) [45], Non-Local Image Dehazing (NLD) [1], [2], DehazeNet [3], and MSCNN [27]. Previous experiments reported few quantitative results on restoration quality, due to the lack of haze-free ground truth when testing on real hazy images. Our synthesized hazy images come with ground-truth clean images, allowing us to compare the PSNR and SSIM of the dehazed results.

Tables I and II show the average PSNR and SSIM results on test sets A and B, respectively. Since AOD-Net is optimized end-to-end under the MSE loss, its higher PSNR than the others is unsurprising. More notably, AOD-Net obtains a large SSIM advantage over all competitors, even though SSIM is not directly used as the optimization criterion. Since SSIM measures more than pixel-wise errors and is known to reflect human perception more faithfully, we investigated which component of the AOD-Net results accounts for this consistent improvement.

We conduct the following investigation: each image in test set B is decomposed into the sum of a mean image and a residual image. The mean image has every pixel set to the same value, namely the per-channel mean over the image. It is easy to show that the MSE between two images equals the MSE between their mean images plus the MSE between their residual images. The mean image roughly corresponds to the global illumination and is related to A, while the residual captures local structural variations, contrast, etc. We observe that AOD-Net produces residual MSEs (averaged over test set B) similar to several competitive methods such as DehazeNet and CAP. However, the MSE of the mean part of the AOD-Net results is significantly lower than that of DehazeNet and CAP, as shown in Table III. This suggests that AOD-Net recovers A (the global illumination) more accurately, thanks to our joint parameter estimation under the end-to-end reconstruction loss. Since the human eye is far more sensitive to large changes in global illumination than to small local distortions, it is no surprise that the AOD-Net visual results also look notably better, while some other methods tend to look unrealistically bright.
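
The MSE decomposition used above can be verified with a short NumPy sketch (random arrays stand in for a dehazed result and its ground truth):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=(480, 640, 3))     # e.g., a dehazed image
y = rng.uniform(size=(480, 640, 3))     # e.g., its ground-truth clean image

def mean_image(img):
    # Every pixel replaced by the per-channel mean of the image.
    return np.broadcast_to(img.mean(axis=(0, 1), keepdims=True), img.shape)

mx, my = mean_image(x), mean_image(y)
rx, ry = x - mx, y - my                 # residual images (zero per-channel mean)

mse = lambda a, b: np.mean((a - b) ** 2)
# Total MSE = MSE of the mean parts + MSE of the residual parts (cross terms vanish).
print(np.isclose(mse(x, y), mse(mx, my) + mse(rx, ry)))   # True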

The above advantage also manifests itself when computing the luminance (l) term of SSIM [39], and partly explains our strong SSIM results. Another major source of SSIM gain appears to be the contrast (c) term. For example, on 5 randomly selected test images, the average contrast term of the AOD-Net results on test set B is 0.9989, noticeably higher than ATM (0.7281), BCCR (0.9574), FVR (0.9630), NLD (0.9250), DCP (0.9457), MSCNN (0.9697), DehazeNet (0.9076), and CAP (0.9760).
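
For reference, the contrast term of SSIM compares the standard deviations of two images; a minimal per-image version following the standard SSIM definition [39] is sketched below (the constant C2 assumes a dynamic range of 1, and SSIM is normally computed over local windows rather than whole images):

import numpy as np

def ssim_contrast(x, y, data_range=1.0, k2=0.03):
    """Contrast term c(x, y) = (2*sx*sy + C2) / (sx**2 + sy**2 + C2)."""
    c2 = (k2 * data_range) ** 2
    sx, sy = np.std(x), np.std(y)
    return (2 * sx * sy + c2) / (sx ** 2 + sy ** 2 + c2)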

C. Qualitative Visual Results

a) Synthetic images: Figure 5 shows the dehazing results for synthetic images from test set A. We observe that AOD-Net results generally have sharper contours and richer colors, and are more visually faithful to the ground truth.

b) Challenging natural images: Although trained on synthetic indoor images, AOD-Net is found to generalize well to outdoor images. We evaluate the state-of-the-art methods on natural image examples, collected by the authors of [9], [8], [3], for which dehazing is more challenging than for general outdoor images. The challenge lies in the dominance of highly cluttered objects, fine textures, or illumination variations. As shown in Figure 6, FVR suffers from over-enhancement artifacts. DCP, BCCR, ATM, NLD, and MSCNN produce unrealistic color tones on one or more images, for example the DCP, BCCR, and ATM results in the second row (note the sky color), or the BCCR, NLD, and MSCNN results (note the stone color). CAP, DehazeNet, and AOD-Net give the most competitive visual results with plausible details. On careful inspection, however, we still observe that CAP sometimes leaves residual haze over image textures and DehazeNet darkens certain regions. AOD-Net recovers richer and more saturated colors (compare the third and fourth rows of results) while suppressing most artifacts.

c) White scenery natural images: White scenes or objects have long been a major obstacle for haze removal. Many effective priors (such as [9]) fail on white objects, because the transmission values approach zero for objects whose color is similar to the atmospheric light. Both DehazeNet [3] and MSCNN [27] rely on carefully chosen filtering operations for post-processing, which improves their robustness to white objects but inevitably sacrifices visual details.

Although AOD-Net does not explicitly handle white scenes, our end-to-end optimization scheme appears to provide stronger robustness here. Figure 7 shows two hazy images of white scenes and their dehazing results by various methods. Intolerable artifacts are evident in the DCP results, especially in the sky region of the first row. The problem is alleviated, but still present, in the CAP, DehazeNet, and MSCNN results, while AOD-Net shows almost no artifacts. Furthermore, CAP tends to leave texture details on white objects hazy, while MSCNN creates the opposite over-enhancement artifact: compare the cat-head region. AOD-Net removes the haze without introducing false color tones or distorted object contours.

d) Little harm to haze-free images: Although trained on hazy images, AOD-Net is verified to have the highly desirable property of causing little negative effect when the input image is haze-free. This confirms the robustness and effectiveness of our K-estimation module. Figure 8 shows the results on two challenging clean images from Colorlines [8].

e) Image anti-halation: We also try AOD-Net, without retraining, on another image enhancement task: image anti-halation. Halation occurs when light spreads beyond its proper boundaries, creating an undesirable fog-like effect in bright areas of a photograph. Although related to dehazing, it follows a different physical model; nevertheless, AOD-Net's anti-halation results are also very good: see Fig. 9 for examples.

D. Effectiveness of Multi-Scale Features

In this section, we specifically analyze the usefulness of the inter-layer concatenations in the K-estimation module, which fuse multi-scale features produced by filters of different sizes. Our conjecture is that, while empirical, the current concatenation scheme promotes a smoother feature transition from lower to higher levels by repeatedly feeding several preceding lower layers into the next one. For comparison, we design a baseline without inter-layer concatenations: “conv 1 - conv 2 - conv 3 - conv 4 - conv 5 (K)”. On test set A, it achieves an average PSNR of 17.0517 dB and an SSIM of 0.7688; on test set B, an average PSNR of 22.3359 dB and an SSIM of 0.9032. These results are generally inferior to AOD-Net (except for a slightly higher PSNR on test set B), and in particular both SSIM values drop significantly.
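
For illustration, such a plain cascade baseline could be sketched in the same TensorFlow style as the AODNet class in the Code section, simply chaining the five convolutions without any concatenation (the filter sizes mirror that implementation and are our assumption):

import tensorflow as tf

class PlainCascadeNet(tf.keras.Model):
    """Baseline K-estimation module without inter-layer concatenations."""

    def __init__(self):
        super().__init__()
        conv = lambda k: tf.keras.layers.Conv2D(3, k, padding='same', activation='relu')
        # conv 1 - conv 2 - conv 3 - conv 4 - conv 5 (K), no concat layers.
        self.convs = [conv(1), conv(1), conv(5), conv(7), conv(3)]
        self.relu = tf.keras.layers.ReLU(max_value=1.0)

    def call(self, inputs):
        k = inputs
        for layer in self.convs:
            k = layer(k)
        return self.relu(k * inputs - k + 1.0)    # clean image generation, eq. (4)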

E. Running Time Comparison

The lightweight structure of AOD-Net leads to faster dehazing. We select 50 images from test set A and run all models on the same machine (Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz with 16 GB memory), without GPU acceleration. The average per-image running time of each model is shown in Table IV. While most competing implementations are in (slower) Matlab, the Pycaffe-based DehazeNet [11] allows a fair comparison with ours. The results show that AOD-Net is highly efficient, with a per-image overhead of only about 1/10 that of DehazeNet.
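
Per-image CPU timing of the kind reported above can be measured along these lines (a rough sketch; the warm-up and averaging choices are ours):

import time
import numpy as np

def average_runtime(model, images, warmup=2):
    """Average per-image forward time in seconds for a list of (480, 640, 3) arrays."""
    for img in images[:warmup]:                          # warm up (graph building, caches)
        model(img[np.newaxis].astype(np.float32))
    start = time.perf_counter()
    for img in images:
        model(img[np.newaxis].astype(np.float32))
    return (time.perf_counter() - start) / len(images)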

IV. BEYOND RESTORATION: EVALUATING AND IMPROVING DEHAZING ON OBJECT DETECTION

High-level computer vision tasks, such as object detection and recognition, concern visual semantics and have received tremendous attention [26], [43]. However, the performance of these algorithms may be largely compromised by various degradations in practical applications. The conventional approach applies a separate image restoration step before feeding the image to the target task. Recently, [40], [17] verified that jointly optimizing the restoration and recognition steps significantly improves over the traditional two-stage approach. However, previous works [44], [5], [4] mainly studied the impact of, and remedies for, common degradations (such as noise, blur, and low resolution) on image classification tasks. To the best of our knowledge, no prior work has quantitatively investigated how the presence of haze affects high-level vision tasks, nor how its impact can be alleviated through joint optimization.

We study object detection in the presence of haze as an example of how high-level vision tasks can interact with dehazing. We choose the Faster R-CNN model [26] as a strong baseline and test it on synthetic and natural hazy images. We then concatenate the AOD-Net model with the Faster R-CNN model to form a unified pipeline for joint optimization. The general conclusion from our experiments is that object detection becomes less reliable as the haze grows heavier. Under all haze conditions (light, medium, or heavy), our jointly tuned model consistently improves detection, outperforming both the plain Faster R-CNN and non-joint approaches.
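
Conceptually, the joint pipeline prepends AOD-Net to the detector and backpropagates the detection loss through both models; the sketch below illustrates this wiring only, where build_faster_rcnn and its detection_loss are hypothetical placeholders for an actual Faster R-CNN implementation (not provided here):

import tensorflow as tf

def train_joint_step(aod_net, detector, optimizer, hazy_images, gt_boxes, gt_labels):
    """One joint optimization step: dehaze, detect, backpropagate through both models."""
    with tf.GradientTape() as tape:
        dehazed = aod_net(hazy_images, training=True)                     # implicit dehazing
        det_loss = detector.detection_loss(dehazed, gt_boxes, gt_labels)  # hypothetical API
    variables = aod_net.trainable_variables + detector.trainable_variables
    grads = tape.gradient(det_loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return det_loss

# aod_net = AODNet()                      # from the Code section
# detector = build_faster_rcnn(...)       # hypothetical builder for a Faster R-CNN model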

V. DISCUSSION AND CONCLUSIONS

This paper proposes AOD-Net, an all-in-one pipeline that directly reconstructs haze-free images via an end-to-end CNN. We compare AOD-Net with a variety of state-of-the-art methods on both synthetic and natural hazy images, using objective (PSNR, SSIM) and subjective criteria. Extensive experimental results confirm the superiority, robustness, and efficiency of AOD-Net. Furthermore, we present a first-of-its-kind study of how AOD-Net can improve object detection and recognition performance on natural hazy images through joint pipeline optimization. Our jointly tuned model consistently improves detection in the presence of haze, outperforming both the plain Faster R-CNN and non-joint approaches. Nevertheless, since dehazing is closely related to depth estimation from images, there remains room to improve AOD-Net by incorporating depth priors or refined depth estimation modules.

Code

import tensorflow as tf


class AODNet(tf.keras.Model):
    """K-estimation module followed by the clean image generation module of eq. (4)."""

    def __init__(self, stddev: float = 0.02, weight_decay: float = 1e-4):
        super(AODNet, self).__init__()

        def conv(kernel_size):
            # All convolutional layers use 3 filters, Gaussian initialization and L2 decay.
            return tf.keras.layers.Conv2D(
                filters=3, kernel_size=kernel_size, strides=1, padding='same',
                activation=tf.nn.relu, use_bias=True,
                kernel_initializer=tf.keras.initializers.RandomNormal(stddev=stddev),
                kernel_regularizer=tf.keras.regularizers.L2(weight_decay)
            )

        self.conv_layer_1 = conv(1)
        self.conv_layer_2 = conv(1)
        self.conv_layer_3 = conv(5)
        self.conv_layer_4 = conv(7)
        self.conv_layer_5 = conv(3)
        # Clamp the generated clean image to [0, 1].
        self.relu = tf.keras.layers.ReLU(max_value=1.0)

    def call(self, inputs, training=None):
        # K-estimation module: multi-scale features via inter-layer concatenations.
        conv_1 = self.conv_layer_1(inputs)
        conv_2 = self.conv_layer_2(conv_1)
        concat_1 = tf.concat([conv_1, conv_2], axis=-1)
        conv_3 = self.conv_layer_3(concat_1)
        concat_2 = tf.concat([conv_2, conv_3], axis=-1)
        conv_4 = self.conv_layer_4(concat_2)
        concat_3 = tf.concat([conv_1, conv_2, conv_3, conv_4], axis=-1)
        k = self.conv_layer_5(concat_3)
        # Clean image generation module: J(x) = K(x) * I(x) - K(x) + b, with b = 1.
        j = tf.math.multiply(k, inputs) - k + 1.0
        output = self.relu(j)
        return output
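
A minimal usage example of the class above (the shapes and the dummy input are illustrative):

import numpy as np

model = AODNet()
hazy = tf.convert_to_tensor(np.random.rand(1, 480, 640, 3), dtype=tf.float32)
dehazed = model(hazy)                 # same shape as the input, values clamped to [0, 1]
print(dehazed.shape)                  # (1, 480, 640, 3)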