Gesture volume control with OpenCV

Foreword: Hello everyone, my name is Dream. Today we'll learn how to use OpenCV to implement gesture-based volume control. Everyone is welcome to discuss and learn together~

1. Introduction to required libraries and functions

This experiment uses the OpenCV and mediapipe libraries for gesture recognition, and controls the computer volume with the distance between two fingertips.

Imported libraries:

  • cv2: the OpenCV library, used to read the camera video stream and for image processing.
  • mediapipe: used for hand keypoint detection and gesture recognition.
  • ctypes and comtypes: used to interact with the operating system's audio interface.
  • pycaw: used to control the computer volume.

Features:

  1. Initialize the mediapipe and volume control module to obtain the volume range.
  2. Turn on the camera and read the video stream.
  3. Process each frame of image:
    • Convert image to RGB format.
    • Use mediapipe to detect hand key points.
    • If hand keypoints are detected:
      • Mark finger key points and gesture connections in the image.
      • Analyze finger key point coordinates.
      • Calculate the gesture distance based on the coordinates of the thumb and index finger tips.
      • Convert gesture distance to volume and control computer volume.
    • Display the processed image.
  4. Repeat the previous steps until you manually stop the program or turn off the camera.

Precautions:

  • Before running the code, you need to install the relevant libraries (opencv, mediapipe, pycaw).
  • The audio device needs to be connected and accessible.
  • max_num_hands is set to 2; when multiple hands are detected, each detected hand is processed in turn, so the last hand processed determines the final volume.
  • Among the detected hand keypoints, the keypoint with index 4 is used as the tip of the thumb, and the keypoint with index 8 is used as the tip of the index finger (see the sketch below).
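
For reference, MediaPipe exposes these landmark indices as a named enum, so the magic numbers 4 and 8 can be checked (or avoided) directly. A minimal standalone sketch, independent of the tutorial code:

import mediapipe as mp

# HandLandmark is an IntEnum shipped with mediapipe's hands solution.
hand_landmark = mp.solutions.hands.HandLandmark
print(hand_landmark.THUMB_TIP.value)         # 4
print(hand_landmark.INDEX_FINGER_TIP.value)  # 8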

cv2.VideoCapture() function parameter problem


On a computer, the default call cv2.VideoCapture(0) works as-is. But when running on a Raspberry Pi, the parameter needs to be changed to:

cap = cv2.VideoCapture(1)

When calling the computer's camera with cv2.VideoCapture(0), the following warning is reported after the program ends:

[ WARN:0] SourceReaderCB::~SourceReaderCB terminating async callback

To avoid it, the call needs to be changed to:

cv2.VideoCapture(0, cv2.CAP_DSHOW)
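
If you want a single script that runs on both platforms, the backend can be chosen at runtime. A minimal sketch (my own helper, not part of the original tutorial), assuming the Windows-only warning above:

import sys
import cv2

def open_camera(index=0):
    # On Windows, the DirectShow backend avoids the SourceReaderCB warning
    # that the default backend prints when the program exits.
    if sys.platform == "win32":
        return cv2.VideoCapture(index, cv2.CAP_DSHOW)
    return cv2.VideoCapture(index)

cap = open_camera(0)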

2. Import the required modules

# Import OpenCV
import cv2
# Import mediapipe
import mediapipe as mp
# Import the computer volume control module
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

# Import other dependency packages
import time
import math
import numpy as np

3. Initialize HandControlVolume class

class HandControlVolume:
    def __init__(self):
        """
        Initialize an instance of the HandControlVolume class

        Initialize the mediapipe object for hand keypoint detection and gesture recognition.
        Get the computer volume interface and get the volume range.
        """
        # Initialize mediapipe
        self.mp_drawing = mp.solutions.drawing_utils
        self.mp_drawing_styles = mp.solutions.drawing_styles
        self.mp_hands = mp.solutions.hands

        # Get the computer volume range
        devices = AudioUtilities.GetSpeakers()
        interface = devices.Activate(
            IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
        self.volume = cast(interface, POINTER(IAudioEndpointVolume))
        self.volume.SetMute(0, None)
        self.volume_range = self.volume.GetVolumeRange()
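
Note that GetVolumeRange() returns the endpoint's volume limits in decibels (plus the step size), not a 0–100 scale; this is why the mapping code later interpolates into this range. A quick standalone sketch to inspect what your own audio device reports (the exact numbers vary by device):

from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))

# Prints (min_dB, max_dB, step_dB) -- e.g. (-65.25, 0.0, 0.03125) on many devices
print(volume.GetVolumeRange())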

4. Main function

1. Calculate refresh rate

  1. Initialize the refresh rate calculation and record the current time as the initial time.

  2. Use OpenCV to open the video stream from the camera device; the default device ID is 0.

  3. Define the target display size resize_w × resize_h; each frame read from the camera is resized to this size.

  4. Before using the hands object, use the with statement to create a context and set the relevant parameters for hand detection and tracking, including minimum detection confidence, minimum tracking confidence, and the maximum number of hands.

  5. Enter the loop while the video stream is open. Use the cap.read() function to read one frame from the stream; the returned success indicates whether the read succeeded, and image is the frame that was read.

  6. If the read fails, print a prompt message and continue to the next iteration; otherwise resize the frame to the specified size.

# Main function
    def recognize(self):
        # Calculate refresh rate
        fpsTime = time.time()

        # OpenCV reads video stream
        cap = cv2.VideoCapture(0)
        # Video resolution
        resize_w = 640
        resize_h = 480

        # Screen display initialization parameters
        rect_height = 0
        rect_percent_text = 0

        with self.mp_hands.Hands(min_detection_confidence=0.7,
                                 min_tracking_confidence=0.5,
                                 max_num_hands=2) as hands:
            while cap.isOpened():
                success, image = cap.read()
                # Check the result before resizing: on a failed read,
                # image is None and cv2.resize would raise an error.
                if not success:
                    print("Empty frame.")
                    continue

                image = cv2.resize(image, (resize_w, resize_h))

2. Improve performance

  1. Set the image’s writable flag image.flags.writeable to False for memory optimization.

  2. Convert the image from BGR format to RGB format because the input processed by the MediaPipe model requires RGB format.

  3. Flip the image horizontally (a mirroring operation) so the on-screen image behaves like a mirror.

  4. Use the MediaPipe model to process the image and get the result.

  5. Set the image’s writable flag image.flags.writeable to True to re-enable writing to the image.

  6. Convert images from RGB format back to BGR format for subsequent display and processing.

These operations improve the performance and efficiency of the program. Setting the image's writeable flag to False lets MediaPipe treat the frame as read-only and avoid unnecessary memory copies; the format conversion and mirroring satisfy the input requirements of the MediaPipe model; and converting back to BGR at the end restores compatibility with OpenCV's display functions.
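
To see what the writeable flag actually does, here is a tiny standalone sketch: once the flag is cleared, NumPy rejects in-place writes, which is what allows the buffer to be passed around without defensive copies.

import numpy as np

frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame.flags.writeable = False
try:
    frame[0, 0] = 255          # in-place write is now rejected
except ValueError as e:
    print(e)                   # "assignment destination is read-only"
frame.flags.writeable = True   # re-enable writes for later drawing calls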

 # Improve performance
                image.flags.writeable = False
                # Convert to RGB
                image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                # Mirror
                image = cv2.flip(image, 1)
                # mediapipe model processing
                results = hands.process(image)

                image.flags.writeable = True
                image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

3. Determine whether there is a palm

  1. Determine whether results.multi_hand_landmarks exists, that is, whether the palm is detected. If it exists, continue executing the code below.

  2. Traverse each hand_landmarks in results.multi_hand_landmarks, that is, traverse each detected palm.

  3. Use the self.mp_drawing.draw_landmarks function to mark the detected palm on the image, including the key points of the fingers and the connecting lines between the fingers.

# Determine whether there is a palm
                if results.multi_hand_landmarks:
                    # Traverse each palm
                    for hand_landmarks in results.multi_hand_landmarks:
                        # Mark fingers on the screen
                        self.mp_drawing.draw_landmarks(
                            image,
                            hand_landmarks,
                            self.mp_hands.HAND_CONNECTIONS,
                            self.mp_drawing_styles.get_default_hand_landmarks_style(),
                            self.mp_drawing_styles.get_default_hand_connections_style())

4. Analyze the fingers and store the coordinates of each finger

First, parse the finger coordinates and store them in the landmark_list list. Then, from these coordinates, compute the fingertip positions of the thumb and index finger, as well as the coordinates of their midpoint. Next, draw the thumb tip, the index fingertip, and the line connecting them, and use the Pythagorean theorem to calculate the distance between the two fingertips.

  1. Create an empty landmark_list list to store finger coordinates.

  2. Iterate through each element of the hand keypoints, store the id, x, y, and z coordinates of each keypoint in a list, and then add the list to the landmark_list.

  3. Check that landmark_list is not empty; if it is not, continue executing the following code.

  4. Get the list item of thumb tip coordinates from landmark_list, and then calculate the pixel coordinates on the image.

  5. Get the list item of the index finger tip coordinates from landmark_list, and then calculate the pixel coordinates on the image.

  6. Calculate the coordinates of the midpoint between the tip of the thumb and the tip of the index finger.

  7. Draw the tip points of your thumb and index finger, as well as the midpoint.

  8. Draw a line between your thumb and index finger.

  9. Use the Pythagorean theorem to calculate the length between the tip of the thumb and the tip of the index finger, and save it in line_len.

 # Analyze fingers and store the coordinates of each finger
                        landmark_list = []
                        for landmark_id, finger_axis in enumerate(
                                hand_landmarks.landmark):
                            landmark_list.append([
                                landmark_id, finger_axis.x, finger_axis.y,
                                finger_axis.z
                            ])
                        if landmark_list:
                            # Get the coordinates of the thumb tip
                            thumb_finger_tip = landmark_list[4]
                            thumb_finger_tip_x = math.ceil(thumb_finger_tip[1] * resize_w)
                            thumb_finger_tip_y = math.ceil(thumb_finger_tip[2] * resize_h)
                            # Get the coordinates of the index finger tip
                            index_finger_tip = landmark_list[8]
                            index_finger_tip_x = math.ceil(index_finger_tip[1] * resize_w)
                            index_finger_tip_y = math.ceil(index_finger_tip[2] * resize_h)
                            # midpoint
                            finger_middle_point = (thumb_finger_tip_x + index_finger_tip_x) // 2, (
                                    thumb_finger_tip_y + index_finger_tip_y) // 2
                            # print(thumb_finger_tip_x)
                            thumb_finger_point = (thumb_finger_tip_x, thumb_finger_tip_y)
                            index_finger_point = (index_finger_tip_x, index_finger_tip_y)
                            # Draw 2 points on the fingertips
                            image = cv2.circle(image, thumb_finger_point, 10, (255, 0, 255), -1)
                            image = cv2.circle(image, index_finger_point, 10, (255, 0, 255), -1)
                            image = cv2.circle(image, finger_middle_point, 10, (255, 0, 255), -1)
                            # Draw a line connecting 2 points
                            image = cv2.line(image, thumb_finger_point, index_finger_point, (255, 0, 255), 5)
                            # Pythagorean theorem calculation length
                            line_len = math.hypot((index_finger_tip_x - thumb_finger_tip_x),
                                                  (index_finger_tip_y - thumb_finger_tip_y))

5. Get the maximum and minimum volume of the computer

Get the minimum and maximum volume of the computer, map the fingertip distance to the volume range and to the on-screen bar, and then set the mapped volume value as the computer volume. The specific process is as follows:

  1. self.volume_range[0] and self.volume_range[1] obtain the minimum volume and maximum volume of the computer respectively.

  2. The np.interp function maps the fingertip distance line_len from the input range [50, 300] to the output range [min_volume, max_volume], giving the volume value vol.

  3. np.interp likewise maps line_len from [50, 300] to [0, 200], giving the height of the volume bar, rect_height.

  4. np.interp also maps line_len from [50, 300] to [0, 100], giving the percentage readout rect_percent_text.

  5. self.volume.SetMasterVolumeLevel method sets the volume value vol to the volume of the computer.

# Get the maximum and minimum volume of the computer
                            min_volume = self.volume_range[0]
                            max_volume = self.volume_range[1]
                            # Map fingertip length to volume
                            vol = np.interp(line_len, [50, 300], [min_volume, max_volume])
                            # Map the fingertip length to the rectangular display
                            rect_height = np.interp(line_len, [50, 300], [0, 200])
                            rect_percent_text = np.interp(line_len, [50, 300], [0, 100])

                            # Set computer volume
                            self.volume.SetMasterVolumeLevel(vol, None)
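
One caveat: GetVolumeRange() is expressed in decibels, so interpolating line_len linearly into [min_volume, max_volume] does not produce a perceptually linear volume ramp; most of the audible change happens near the top of the pinch range. A possible variant (my own suggestion, not part of the original tutorial) is to map the distance to a 0.0–1.0 slider fraction and use pycaw's SetMasterVolumeLevelScalar instead:

import numpy as np

# Hypothetical drop-in for the two volume lines above: map the pinch
# distance to a fraction of the master volume slider instead of to dB.
def pinch_to_scalar(line_len):
    return float(np.interp(line_len, [50, 300], [0.0, 1.0]))

print(pinch_to_scalar(50))   # 0.0  (fingers together -> minimum volume)
print(pinch_to_scalar(175))  # 0.5  (halfway pinch    -> half the slider)
print(pinch_to_scalar(300))  # 1.0  (fully spread     -> maximum volume)

# Inside recognize(), the call would then be:
#   self.volume.SetMasterVolumeLevelScalar(pinch_to_scalar(line_len), None)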

6. Display rectangle

  • The cv2.putText function displays the percentage value of the volume bar on the image;
  • the cv2.rectangle function draws the bar outline and fills it with color;
  • cv2.putText also displays the refresh rate (FPS) of the current frame on the image;
  • the cv2.imshow function displays the processed image;
  • the cv2.waitKey function waits for key input, and the program exits when the ESC key is pressed or the window is closed;
  • finally, the recognize method of the HandControlVolume class is called to start gesture recognition.

# Display rectangle
                cv2.putText(image, str(math.ceil(rect_percent_text)) + "%", (10, 350),
                            cv2.FONT_HERSHEY_PLAIN, 3, (255, 0, 0), 3)
                image = cv2.rectangle(image, (30, 100), (70, 300), (255, 0, 0), 3)
                image = cv2.rectangle(image, (30, math.ceil(300 - rect_height)), (70, 300), (255, 0, 0), -1)

                # Display refresh rate FPS
                cTime = time.time()
                fps_text = 1 / (cTime - fpsTime)
                fpsTime = cTime
                cv2.putText(image, "FPS: " + str(int(fps_text)), (10, 70),
                            cv2.FONT_HERSHEY_PLAIN, 3, (255, 0, 0), 3)
                # Display screen
                cv2.imshow('MediaPipe Hands', image)
                if (cv2.waitKey(5) & 0xFF) == 27 or cv2.getWindowProperty('MediaPipe Hands', cv2.WND_PROP_VISIBLE) < 1:
                    break
            cap.release()


# Start program
if __name__ == '__main__':
    control = HandControlVolume()
    control.recognize()

5. Practical demonstration



Through the demonstration we can see that the farther apart the index finger and thumb appear on screen, the higher the volume, and the closer together they are, the lower the volume: the computer volume is controlled entirely by the gesture.

6. Source code sharing

import cv2
import mediapipe as mp
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume
import time
import math
import numpy as np


class HandControlVolume:
    def __init__(self):
        # Initialize mediapipe
        self.mp_drawing = mp.solutions.drawing_utils
        self.mp_drawing_styles = mp.solutions.drawing_styles
        self.mp_hands = mp.solutions.hands

        # Get the computer volume range
        devices = AudioUtilities.GetSpeakers()
        interface = devices.Activate(
            IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
        self.volume = cast(interface, POINTER(IAudioEndpointVolume))
        self.volume.SetMute(0, None)
        self.volume_range = self.volume.GetVolumeRange()

    # Main function
    def recognize(self):
        # Calculate refresh rate
        fpsTime = time.time()

        # OpenCV reads video stream
        cap = cv2.VideoCapture(0)
        # Video resolution
        resize_w = 640
        resize_h = 480

        # Screen display initialization parameters
        rect_height = 0
        rect_percent_text = 0

        with self.mp_hands.Hands(min_detection_confidence=0.7,
                                 min_tracking_confidence=0.5,
                                 max_num_hands=2) as hands:
            while cap.isOpened():
                success, image = cap.read()
                # Check the result before resizing: on a failed read,
                # image is None and cv2.resize would raise an error.
                if not success:
                    print("Empty frame.")
                    continue

                image = cv2.resize(image, (resize_w, resize_h))

                # Improve performance
                image.flags.writeable = False
                # Convert to RGB
                image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                # Mirror
                image = cv2.flip(image, 1)
                # mediapipe model processing
                results = hands.process(image)

                image.flags.writeable = True
                image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
                # Determine whether there is a palm
                if results.multi_hand_landmarks:
                    # Traverse each palm
                    for hand_landmarks in results.multi_hand_landmarks:
                        # Mark fingers on the screen
                        self.mp_drawing.draw_landmarks(
                            image,
                            hand_landmarks,
                            self.mp_hands.HAND_CONNECTIONS,
                            self.mp_drawing_styles.get_default_hand_landmarks_style(),
                            self.mp_drawing_styles.get_default_hand_connections_style())

                        # Analyze fingers and store the coordinates of each finger
                        landmark_list = []
                        for landmark_id, finger_axis in enumerate(
                                hand_landmarks.landmark):
                            landmark_list.append([
                                landmark_id, finger_axis.x, finger_axis.y,
                                finger_axis.z
                            ])
                        if landmark_list:
                            # Get the coordinates of the thumb tip
                            thumb_finger_tip = landmark_list[4]
                            thumb_finger_tip_x = math.ceil(thumb_finger_tip[1] * resize_w)
                            thumb_finger_tip_y = math.ceil(thumb_finger_tip[2] * resize_h)
                            # Get the coordinates of the index finger tip
                            index_finger_tip = landmark_list[8]
                            index_finger_tip_x = math.ceil(index_finger_tip[1] * resize_w)
                            index_finger_tip_y = math.ceil(index_finger_tip[2] * resize_h)
                            # midpoint
                            finger_middle_point = (thumb_finger_tip_x + index_finger_tip_x) // 2, (
                                    thumb_finger_tip_y + index_finger_tip_y) // 2
                            # print(thumb_finger_tip_x)
                            thumb_finger_point = (thumb_finger_tip_x, thumb_finger_tip_y)
                            index_finger_point = (index_finger_tip_x, index_finger_tip_y)
                            # Draw 2 points on the fingertips
                            image = cv2.circle(image, thumb_finger_point, 10, (255, 0, 255), -1)
                            image = cv2.circle(image, index_finger_point, 10, (255, 0, 255), -1)
                            image = cv2.circle(image, finger_middle_point, 10, (255, 0, 255), -1)
                            # Draw a line connecting 2 points
                            image = cv2.line(image, thumb_finger_point, index_finger_point, (255, 0, 255), 5)
                            # Pythagorean theorem calculation length
                            line_len = math.hypot((index_finger_tip_x - thumb_finger_tip_x),
                                                  (index_finger_tip_y - thumb_finger_tip_y))

                            # Get the maximum and minimum volume of the computer
                            min_volume = self.volume_range[0]
                            max_volume = self.volume_range[1]
                            # Map fingertip length to volume
                            vol = np.interp(line_len, [50, 300], [min_volume, max_volume])
                            # Map the fingertip length to the rectangular display
                            rect_height = np.interp(line_len, [50, 300], [0, 200])
                            rect_percent_text = np.interp(line_len, [50, 300], [0, 100])

                            # Set computer volume
                            self.volume.SetMasterVolumeLevel(vol, None)

                # Display rectangle
                cv2.putText(image, str(math.ceil(rect_percent_text)) + "%", (10, 350),
                            cv2.FONT_HERSHEY_PLAIN, 3, (255, 0, 0), 3)
                image = cv2.rectangle(image, (30, 100), (70, 300), (255, 0, 0), 3)
                image = cv2.rectangle(image, (30, math.ceil(300 - rect_height)), (70, 300), (255, 0, 0), -1)

                # Display refresh rate FPS
                cTime = time.time()
                fps_text = 1 / (cTime - fpsTime)
                fpsTime = cTime
                cv2.putText(image, "FPS: " + str(int(fps_text)), (10, 70),
                            cv2.FONT_HERSHEY_PLAIN, 3, (255, 0, 0), 3)
                # Display screen
                cv2.imshow('MediaPipe Hands', image)
                if (cv2.waitKey(5) & 0xFF) == 27 or cv2.getWindowProperty('MediaPipe Hands', cv2.WND_PROP_VISIBLE) < 1:
                    break
            cap.release()
            cv2.destroyAllWindows()


if __name__ == '__main__':
    control = HandControlVolume()
    control.recognize()