Slicing Aided Hyper Inference (SAHI) for small object detection

Small object detection refers to identifying and localizing objects of relatively small size in digital images. Such objects have limited spatial extent and low pixel coverage, which makes them difficult to detect owing to their small appearance and low signal-to-noise ratio.

Small object detection has many applications:

  • Surveillance and Security: Identify and track small objects in crowded areas to enhance public safety.
  • Autonomous Driving: Accurately detect pedestrians, cyclists, and traffic signs in real time to ensure safe navigation.
  • Medical Imaging: Localize abnormalities and lesions in medical images for early disease diagnosis and treatment planning.
  • Remote Sensing and Aerial Imagery: Recognize small objects in satellite images for urban planning and environmental monitoring.
  • Industrial Inspection: Perform quality control and defect detection during manufacturing to ensure high-quality products.
  • Microscopy and Life Sciences: Analyze cell structures and microorganisms for research in biology and genetics.
Why can't regular object detectors detect small objects?

There are many object detection algorithms in the current generation, such as Faster R-CNN, YOLO, SSD, RetinaNet, and EfficientDet. Typically, these models are trained on the COCO (Common Objects in Context) dataset, a large dataset covering a wide range of object categories and annotations, which makes it popular for training object detectors. Yet it turns out that these models struggle to detect small objects. Have you ever wondered why?

Challenges of small object detection

Limited receptive field The receptive field is the spatial extent of the input image that influences the output of a specific neuron or filter in a convolutional neural network (CNN). In an ordinary object detector, the receptive field may be limited, meaning that the network cannot gather enough contextual information around small objects. As a result, detectors may struggle to accurately detect and localize these objects. Figure 2 shows the receptive field of a neural network.

Figure 2: Receptive field of a deep neural network
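
The theoretical receptive field grows layer by layer with the kernel sizes and strides of the network. As a minimal sketch (the layer configuration below is illustrative, not taken from any particular detector), it can be computed as follows:

# Minimal sketch: theoretical receptive field of a stack of conv layers.
# Each entry is (kernel_size, stride); the values are illustrative only.
layers = [(3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]

receptive_field = 1  # a single input pixel sees only itself
jump = 1             # accumulated effective stride

for kernel, stride in layers:
    receptive_field += (kernel - 1) * jump
    jump *= stride

print(f"Theoretical receptive field: {receptive_field}x{receptive_field} pixels")

For the layers above this prints 21x21 pixels: an object occupying, say, 10x10 pixels contributes only a fraction of what each deep neuron sees, leaving little context for the detector to exploit.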

Feature representation Object detectors rely on features learned by CNN architectures to identify objects. However, inherent limitations of these feature representations can hinder the detection of small objects, because the learned features may not adequately capture their subtle and intricate details. As a result, the detector may be unable to distinguish small objects from the background or from other similar-looking objects. Figure 3 shows the feature representation in a convolutional neural network.

Figure 3: Feature representation in a convolutional neural network

Scale variation Small objects exhibit significant scale differences compared to the larger objects in an image. Object detectors trained on datasets that mainly consist of larger objects, such as ImageNet or COCO, may have difficulty generalizing to smaller objects because of this difference in scale. In Figure 4, a scale change is applied to an image. Such variation makes it harder to match the learned object representations, degrading detection performance for small objects.

Figure 4: Scale variation applied to an image

Training data bias Object detection models are typically trained on large-scale datasets, which tend to be biased toward larger objects because of their prevalence. This bias can inadvertently hurt the detector's performance on smaller objects: the model may not have been exposed to enough diverse training examples of small objects, resulting in reduced robustness and lower detection accuracy for small object instances. Figure 5 shows a scatter plot of a dataset with two object classes; there are significantly more data points for class “0” than for class “1”.

Localization challenges Accurately localizing small objects is difficult because of the limited spatial resolution of the feature maps in CNN architectures. The fine-grained details required for precise localization may be lost or become indistinguishable at lower resolutions. Small objects may also be occluded by larger objects or cluttered backgrounds, which further exacerbates the problem. These factors can prevent ordinary object detectors from accurately localizing and detecting small objects.

Localization of small objects in large images

Existing small object detection methods

In computer vision, a number of techniques have been designed to address the challenges of accurately detecting small objects. These methods use various strategies and algorithms to improve detection performance, especially for smaller objects. Here are some commonly used methods:

Small object detection methods

Image pyramid This technique creates multiple scaled versions of the input image through downsampling or upsampling. These scaled versions, or pyramid levels, provide different image resolutions, and the object detector can apply its detection algorithm at each level to handle objects of different scales. In Figure 8, an image pyramid technique is applied to an image of the sun. This approach allows small objects to be detected by searching for them at the lower (higher-resolution) pyramid levels, where they are more salient and distinguishable.

Figure 8: The image pyramid technique
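
A minimal sketch of building such a pyramid with OpenCV follows; the file name and the number of levels are illustrative assumptions:

import cv2

# Minimal sketch: build a 4-level Gaussian image pyramid with OpenCV.
image = cv2.imread("input.jpg")  # hypothetical input file
pyramid = [image]

for _ in range(3):
    # Each pyrDown call blurs and halves the previous level.
    pyramid.append(cv2.pyrDown(pyramid[-1]))

for level, img in enumerate(pyramid):
    print(f"Level {level}: {img.shape[1]}x{img.shape[0]}")

# A detector can then be run on every level; small objects that vanish
# at the coarse levels remain visible at the high-resolution base level.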

Sliding window method This method involves sliding a fixed-size window over the image at different positions and scales. At each window position, the object detector applies a classification model to determine whether an object is present. By considering different window sizes and positions, the detector can search the entire image for small objects. However, sliding-window methods can become computationally expensive, especially when dealing with large images or multiple scales.
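
The core loop is simple to sketch; the window size, stride, file name, and the classify() call below are illustrative assumptions:

import cv2

def sliding_windows(image, window_size=128, stride=64):
    """Yield (x, y, crop) for every window position over the image."""
    height, width = image.shape[:2]
    for y in range(0, height - window_size + 1, stride):
        for x in range(0, width - window_size + 1, stride):
            yield x, y, image[y:y + window_size, x:x + window_size]

image = cv2.imread("input.jpg")  # hypothetical input file
for x, y, crop in sliding_windows(image):
    # A classifier would score each crop here, e.g.:
    # score = classify(crop)  # classify() is a hypothetical model call
    pass

The number of windows grows with the image area and with the number of scales considered, which is why this approach becomes expensive for high-resolution imagery.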

Multi-scale feature extraction Object detectors can leverage multi-scale feature extraction techniques to capture information at different levels of detail. This involves processing images at multiple resolutions or applying convolutional layers with different receptive fields. By combining features at different scales, the detector can effectively capture both large and small objects in the scene. This approach helps preserve the fine-grained details relevant for detecting small objects.
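
As a minimal PyTorch sketch (the channel counts and layer choices are illustrative assumptions), features computed with two different receptive fields can be fused by concatenation:

import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Toy sketch: extract features at two receptive fields and fuse them."""
    def __init__(self, in_channels=3, out_channels=16):
        super().__init__()
        # Small receptive field: preserves fine detail for small objects.
        self.fine = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Larger receptive field via dilation: captures wider context.
        self.coarse = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                                padding=2, dilation=2)

    def forward(self, x):
        # Concatenate along the channel axis so later layers see both scales.
        return torch.cat([self.fine(x), self.coarse(x)], dim=1)

block = MultiScaleBlock()
features = block(torch.randn(1, 3, 256, 256))
print(features.shape)  # torch.Size([1, 32, 256, 256])

Feature pyramid networks (FPNs) used in modern detectors apply the same idea at the whole-network scale, combining deep semantic features with shallow high-resolution ones.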

Data augmentation is one of the best-known techniques in computer vision, and it can improve small object detection performance by generating additional training samples. Augmentation methods such as random cropping, resizing, rotation, or adding artificial noise create variation in the dataset, allowing the detector to learn robust features for small objects. Augmentation can also simulate different object scales, viewing angles, and occlusions, helping the detector generalize better to real-world scenarios.
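
A minimal sketch of such augmentations with OpenCV and NumPy (the crop size and noise level are illustrative, and the input image is assumed to be larger than the crop):

import cv2
import numpy as np

def augment(image, crop_size=480):
    """Minimal sketch: random crop, random flip, and additive noise."""
    h, w = image.shape[:2]
    # Random crop simulates objects appearing at different positions and scales.
    y = np.random.randint(0, h - crop_size + 1)
    x = np.random.randint(0, w - crop_size + 1)
    out = image[y:y + crop_size, x:x + crop_size]
    # Random horizontal flip.
    if np.random.rand() < 0.5:
        out = cv2.flip(out, 1)
    # Additive Gaussian noise simulates sensor noise.
    noise = np.random.normal(0, 10, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)

In a real detection pipeline, the bounding-box annotations must be transformed together with the image; libraries such as Albumentations handle this bookkeeping automatically.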

Transfer learning This method leverages the knowledge gained from training on large-scale datasets (such as ImageNet) and applies it to the object detection task. Pre-trained models, especially deep convolutional neural network (CNN) architectures, capture rich hierarchical features that are beneficial for small object detection. By fine-tuning a pre-trained model on the target dataset, an object detector can reuse the learned representations, adapt quickly to the new task, and detect small objects more reliably.
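
As a hedged sketch, fine-tuning a COCO-pretrained YOLOv8 model with the Ultralytics API might look like this; the dataset YAML path and the training settings are illustrative placeholders:

from ultralytics import YOLO

# Start from COCO-pretrained weights rather than random initialization.
model = YOLO("yolov8s.pt")

# Fine-tune on a custom small-object dataset; "custom_data.yaml" and the
# hyperparameters below are placeholders, not prescribed values.
model.train(data="custom_data.yaml", epochs=50, imgsz=1280)

A larger imgsz than the default can help small objects retain enough pixels to be learnable, at the cost of more memory and compute.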

Slicing Aided Hyper Inference (SAHI): a revolutionary pipeline for small object detection

Introducing SAHI, a cutting-edge pipeline designed specifically for small object detection. SAHI harnesses the power of slicing-aided inference and fine-tuning to change the way small objects are detected. What sets SAHI apart is that it integrates seamlessly with any object detector, without requiring any additional fine-tuning. This breakthrough allows quick and easy adoption without compromising performance.

Slicing-aided fine-tuning

Popular object detection frameworks such as Detectron2, MMDetection, and YOLOv8 provide weights pre-trained on widely used datasets such as ImageNet and MS COCO. Pre-training allows models to be fine-tuned efficiently with smaller datasets and shorter training durations, eliminating the need to train from scratch on large datasets. However, the datasets typically used for pre-training consist of low-resolution images in which relatively large objects cover a large portion of the image. As a result, pre-trained models perform well on similar inputs but struggle to detect small objects in the high-resolution images captured by modern drones and surveillance cameras. To overcome this limitation, slicing-aided fine-tuning augments the dataset by extracting patches from the fine-tuning images. This technique is shown in Figure 14.

Each image is divided into overlapping patches of dimensions M×N, where M and N are hyperparameters chosen from the predefined ranges [Mmin, Mmax] and [Nmin, Nmax]. During fine-tuning, these patches are resized while maintaining the aspect ratio, producing augmented images in which objects are relatively larger than in the original image. The original images are also used during fine-tuning, which preserves the ability to detect large objects.
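
A minimal sketch of the patch-extraction step (the size ranges and overlap ratio below are illustrative assumptions, not the exact values used by SAHI):

import numpy as np

def extract_patches(image, m_range=(300, 500), n_range=(300, 500), overlap=0.2):
    """Yield overlapping MxN patches with M, N sampled from the given ranges."""
    height, width = image.shape[:2]
    m = np.random.randint(m_range[0], m_range[1] + 1)  # patch height
    n = np.random.randint(n_range[0], n_range[1] + 1)  # patch width
    step_y = int(m * (1 - overlap))
    step_x = int(n * (1 - overlap))
    for y in range(0, max(height - m, 0) + 1, step_y):
        for x in range(0, max(width - n, 0) + 1, step_x):
            yield image[y:y + m, x:x + n]

Each yielded patch would then be resized (preserving aspect ratio) and its annotations clipped to the patch boundaries before being added to the training set.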

As shown in Figure 15, this slicing approach is also used in the inference step. First, the original query image is divided into overlapping patches of dimensions M×N. Each patch is then resized while maintaining the aspect ratio, and an independent object detection forward pass is applied to each overlapping patch. Optionally, a full inference pass can also be performed on the original image to detect larger objects. Finally, non-maximum suppression (NMS) is used to merge the predictions from the overlapping patches (and from the full-image pass, if used) back into the original image coordinates. During NMS, boxes whose Intersection over Union (IoU) exceeds a predefined matching threshold are treated as matches, and within each match, detections with probability below a threshold are discarded. This ensures that only the most confident, non-overlapping detections are retained.
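
To make the merging step concrete, here is a minimal NumPy sketch of greedy IoU-based NMS (the threshold is illustrative, and patch predictions are assumed to have already been shifted into original-image coordinates):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes in original-image coordinates."""
    order = np.argsort(scores)[::-1]  # highest-confidence boxes first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # Intersection of the best box with all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + areas - inter)
        # Drop boxes that overlap the chosen box beyond the threshold.
        order = order[1:][iou <= iou_threshold]
    return keep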

Inferring with YOLOv8 using SAHI

In this last section of the article, a pre-trained YOLOv8-S model is used to run object detection inference on images. We will also look at side-by-side comparisons of small object detection results obtained without and with SAHI. A download link to the notebook used in this experiment is provided.

Model initialization

from sahi import AutoDetectionModel

detection_model = AutoDetectionModel.from_pretrained(
    model_type='yolov8',
    model_path=yolov8_model_path,  # path to the saved YOLOv8 weights
    confidence_threshold=0.3,
    device='cpu',  # or 'cuda:0' to enable GPU acceleration
)

In the code snippet above, detection_model is initialized. In this experiment, model_type is yolov8, model_path points to the directory where the model is saved, and confidence_threshold is set to 0.3. If you have a machine with an NVIDIA GPU, you can enable CUDA acceleration by setting the device flag to 'cuda:0'; otherwise, leave it as 'cpu', but keep in mind that inference will be slower.

Performing sliced inference

Slicing parameters such as slice_height, slice_width, overlap_height_ratio, and overlap_width_ratio need to be set according to the size of the input image. This is mostly a trial-and-error process, as no single default works best for all types of images. As the number of slices increases, more computing power is required; this is where CUDA acceleration helps the most.

 "demo_data/small-vehicles1.jpeg",
    detection_model,
    slice_height=512,
    slice_width=512,
    overlap_height_ratio=0.2,

Visualizing the prediction objects

Using the following code, you can view the detected objects in the input image and, additionally, export the output image with the detections drawn. The snippet below reads the input image, converts it from BGR to RGB, and displays the output using the visualize_object_predictions() method. Object labels are hidden for better visualization; this is done using the hide_labels parameter.

import cv2
from numpy import asarray
from sahi.utils.cv import visualize_object_predictions

img = cv2.imread("demo_data/cars.jpg", cv2.IMREAD_UNCHANGED)
img_converted = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; convert to RGB
numpydata = asarray(img_converted)
visualize_object_predictions(
    numpydata,
    object_prediction_list=result.object_prediction_list,
    hide_labels=1,  # hide class labels for a cleaner visualization
)

Inference results

Conclusion

The introduction of the Slicing Aided Hyper Inference (SAHI) framework marks a major step forward in small object detection. With its innovative pipeline, seamless integration with any object detector, and significant performance improvements, SAHI changes the way we detect and identify small objects. Extensive experimental evaluation on the VisDrone and xView datasets has demonstrated SAHI's effectiveness in improving the average precision (AP) of various detectors without any modification. Slicing-aided fine-tuning further improves detection accuracy, resulting in substantial cumulative gains in AP. The potential of SAHI to significantly improve object detection performance is hard to overstate; its versatility and adaptability to different scenarios make it a game-changer in the field of computer vision.