Detect objects + estimate distance and direction | Source code attached


This article demonstrates how to use computer vision to build an application that detects objects from voice commands, estimates their approximate distance, and uses location information to assist blind users. The main goal of this project is to process real-time data, in the spirit of wearable devices such as Meta's smart glasses and Envision, to enhance users' lives and improve their daily experiences.

This tutorial covers the following steps:

  • Import libraries and define class parameters.

  • Define the speech recognition and processing functions.

  • Detect the requested object, find its position, calculate its approximate distance, and notify the user.

Import libraries and define class parameters

  • Import the speech_recognition library to capture audio from the microphone and convert speech to text

  • Import the cv2 (OpenCV) library to capture video from a webcam and apply various operations to it

  • Import NumPy for mathematical operations

  • Import the Ultralytics library to use the pretrained YOLOv8 model

  • Import pyttsx3 for text-to-speech conversion

  • Import the math library for trigonometric calculations and mathematical operations

import speech_recognition as sr
import cv2
import numpy as np
from ultralytics import YOLO
import pyttsx3
import math




class_names = ["person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train", "truck" , "boat",
              "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
              "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", " backpack", "umbrella",
              "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", \ "baseball bat",
              "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup",
              "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", " broccoli",
              "carrot", "hot dog", "pizza", "donut", "cake", "chair", "sofa", "pottedplant", \ "bed",
              "diningtable", "toilet", "tvmonitor", "laptop", "mouse", "remote", "keyboard", "telephone",
              "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", " scissors",
              "teddy bear", "hair drier", "toothbrush"]


# Approximate real-world widths of common household objects, in meters
object_dimensions = {
    "bird" : "0.10",
    "cat" : "0.45",
    "backpack" : "0.55",
    "umbrella" : "0.50",
    "bottle" : "0.20",
    "wine glass" : "0.25",
    "cup" : "0.15",
    "fork" : "0.15",
    "knife" : "0.25",
    "spoon" : "0.15",
    "banana" : "0.20",
    "apple" : "0.07",
    "sandwich" : "0.20",
    "orange" : "0.08",
    "chair" : "0.50",
    "laptop" : "0.40",
    "mouse" : "0.10",
    "remote" : "0.20",
    "keyboard" : "0.30",
    "phone" : "0.15",
    "book" : "0.18",
    "toothbrush" : "0.16"
}

The ‘class_names’ variable stores the classes of the COCO dataset that the YOLOv8 model is trained on, and the ‘object_dimensions’ variable stores their average real-world widths in meters. I chose these particular objects because this application is intended for a domestic environment. If you want to use your own dataset, you will need to train a custom object-detection model and modify these variables accordingly.
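If you prefer not to hard-code the class list, the Ultralytics API also exposes the names the loaded model was trained on; a minimal sketch (note that the pretrained model's own labels can differ slightly in spelling from the list above, e.g. "motorcycle" vs "motorbike"):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                # pretrained COCO weights
class_names = list(model.names.values())  # model.names maps class id -> label
print(len(class_names))                   # 80 classes for the COCO-pretrained model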

Define speech recognition and processing functions

To capture the searched-for object from a phrase (such as “Where is my book?”, “Find book!”, or “Book.”), I assume the object is the last word of the sentence and define a function called ‘get_last_word’ that returns it.

def get_last_word(sentence):
    # Assumes the requested object is the last word of the command;
    # strip trailing punctuation so "Find book!" still yields "book"
    words = sentence.split()
    return words[-1].strip(".,!?") if words else ""
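A quick sanity check of the helper:

print(get_last_word("where is my book"))  # -> book
print(get_last_word("find book!"))        # -> book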

A function named ‘voice_command’ captures a voice command and returns the object to be searched for along with its average real-world size.

def voice_command():
    recognizer = sr.Recognizer()


    with sr.Microphone() as source:
        print("Waiting for voice command...")
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)


    target_object = ""
    real_width = 0.15


    try:
        command = recognizer.recognize_google(audio, language="en-US")
        print("Recognized command:", command)
        last_word = get_last_word(command.lower())
        if last_word:
            print("Last word:", last_word)


        target_object = last_word.lower()
        
        if target_object in object_dimensions:
            real_width = float(object_dimensions[target_object])
            print(real_width)
        else:
            print(f"No length information found for {target_object}, using the default value of 0.15.")
    except sr.UnknownValueError:
        print("Voice cannot be understood.")
    except sr.RequestError as e:
        print("Voice recognition error; {0}".format(e))


    return target_object, real_width

A function called ‘voice_notification’ announces the result to the user by voice.

def voice_notification(obj_name, direction, distance):
    engine = pyttsx3.init()
    text = "{} is at {}. It is {:.2f} meters away.".format(obj_name, direction, distance)
    engine.say(text)
    engine.runAndWait()
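pyttsx3 uses the system's default voice; if it speaks too fast for your users, the engine exposes rate and volume properties. The values below are illustrative, not part of the original project:

engine = pyttsx3.init()
engine.setProperty("rate", 150)    # words per minute; defaults are typically around 200
engine.setProperty("volume", 0.9)  # 0.0 to 1.0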

Next, the pretrained YOLOv8 model is loaded; weights and documentation are available from the Ultralytics website: https://docs.ultralytics.com/models/yolov8/#overview.
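Before wiring the model into the main loop, you can sanity-check it on a single image; a minimal sketch, where "test.jpg" is a placeholder path:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")           # weights are downloaded on first use
results = model.predict("test.jpg")  # any local test image
for r in results:
    for box in r.boxes:
        print(model.names[int(box.cls)], float(box.conf))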

The main function calculates the distance from the camera to the object requested in the voice command and announces the object's distance and clock-face direction to the user.
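The distance estimate used below is a rough similar-triangles (pinhole-camera) approximation that treats the focal length in pixels as roughly equal to the frame width, so it yields an order-of-magnitude figure rather than a calibrated measurement. A worked example:

def estimate_distance(real_width_m, frame_width_px, bbox_width_px):
    # distance ~ real_width * focal_length_px / bbox_width_px,
    # with focal_length_px approximated by the frame width
    return (real_width_m * frame_width_px) / bbox_width_px

# A 0.18 m wide book spanning 120 px of a 640 px frame:
print(estimate_distance(0.18, 640, 120))  # -> 0.96 meters, roughly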

def main():
    # Load the YOLO model
    model = YOLO("yolov8n.pt")
    
    # Get video frame dimensions for calculating
    cap = cv2.VideoCapture(0)
    frame_width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    frame_height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    center_x = int(frame_width // 2)
    center_y = int(frame_height // 2)
    radius = min(center_x, center_y) - 30  # Radius of the circle on which the clock marks are drawn
    
    #The target object the user wants to search for via voice command and its real-world average size
    target_object, real_width = voice_command()


    while True:
        success, img = cap.read()
        if not success:
            break
        
        # Predict objects using the YOLO model
        results = model.predict(img, stream=True)
        
        # Draw clock
        for i in range(1, 13):
            angle = math.radians(360 / 12 * i - 90)
            x = int(center_x + radius * math.cos(angle))
            y = int(center_y + radius * math.sin(angle))


            # Major hours (3, 6, 9, 12) get thicker, longer tick marks
            if i % 3 == 0:
                thickness = 3
                length = 20
            else:
                thickness = 1
                length = 10

            # Draw the tick mark and the hour number
            x_inner = int(center_x + (radius - length) * math.cos(angle))
            y_inner = int(center_y + (radius - length) * math.sin(angle))
            cv2.line(img, (x_inner, y_inner), (x, y), (0, 255, 0), thickness)
            cv2.putText(img, str(i), (x - 10, y + 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), thickness)
        
        # detect and process objects recognized by model
        for r in results:
            boxes = r.boxes


            for box in boxes:
                x1, y1, x2, y2 = box.xyxy[0]
                x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)


                cls = int(box.cls)


                # Substring match so that e.g. "phone" also matches "cell phone"
                if target_object and target_object in class_names[cls].lower():
                    camera_width = x2 - x1
                    # Pinhole approximation: treat the focal length in pixels as ~ frame width
                    distance = (real_width * frame_width) / camera_width


                    obj_center_x = (x1 + x2) // 2
                    obj_center_y = (y1 + y2) // 2


                    camera_middle_x = frame_width // 2
                    camera_middle_y = frame_height // 2


                    vector_x = obj_center_x - camera_middle_x
                    vector_y = obj_center_y - camera_middle_y


                    # Angle of the object relative to the frame center (0 degrees points right)
                    angle_deg = math.degrees(math.atan2(vector_y, vector_x))
                    if angle_deg < 0:
                        angle_deg += 360


                    if 0 <= angle_deg < 30:
                        direction = "3 o'clock"
                    elif 30 <= angle_deg < 60:
                        direction = "4 o'clock"
                    elif 60 <= angle_deg < 90:
                        direction = "5 o'clock"
                    elif 90 <= angle_deg < 120:
                        direction = "6 o'clock"
                    elif 120 <= angle_deg < 150:
                        direction = "7 o'clock"
                    elif 150 <= angle_deg < 180:
                        direction = "8 o'clock"
                    elif 180 <= angle_deg < 210:
                        direction = "9 o'clock"
                    elif 210 <= angle_deg < 240:
                        direction = "10 o'clock"
                    elif 240 <= angle_deg < 270:
                        direction = "11 o'clock"
                    elif 270 <= angle_deg < 300:
                        direction = "12 o'clock"
                    elif 300 <= angle_deg < 330:
                        direction = "1 o'clock"
                    elif 330 <= angle_deg < 360:
                        direction = "2 o'clock"
                    else:
                        direction = "Unknown Clock Position"


                    cv2.putText(img, direction, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
                    cv2.putText(img, "Distance: {:.2f} meters".format(distance), (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
                    cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 255), 3)
                    
                    # Note: pyttsx3 blocks until speech finishes, which pauses the video loop
                    voice_notification(target_object, direction, distance)


        
        cv2.imshow("Webcam", img)


        k = cv2.waitKey(1)
        if k == ord("q"):
            break


    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
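As a sanity check on the direction logic, the 12-branch chain in main() is equivalent to one line of sector arithmetic; here is a small standalone sketch (remember that image y coordinates grow downward, so a negative vector_y means the object is above the center):

import math

def clock_direction(vector_x, vector_y):
    angle_deg = math.degrees(math.atan2(vector_y, vector_x))
    if angle_deg < 0:
        angle_deg += 360
    # 0 degrees points right (3 o'clock); each 30-degree sector is one hour
    return "{} o'clock".format((int(angle_deg // 30) + 2) % 12 + 1)

print(clock_direction(1, 0))   # -> 3 o'clock (object right of center)
print(clock_direction(0, -1))  # -> 12 o'clock (object above center)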

Conclusions and recommendations

Hopefully this article shows how computer vision can augment human vision and make everyday life easier, and inspires you to explore further. You can also use your own dataset to customize this project to your specific needs and add features that suit your goals. For example, you could add optical character recognition (OCR) to read printed text aloud, which could help someone find a product in a store or listen to a book, among many other applications.
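As a rough illustration of the OCR idea (a hedged sketch assuming pytesseract and the Tesseract engine are installed; "label.jpg" is a placeholder path, and this is not part of the project above):

import cv2
import pytesseract
import pyttsx3

img = cv2.imread("label.jpg")                 # placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # Tesseract prefers high-contrast input
text = pytesseract.image_to_string(gray)

if text.strip():
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()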
