Three.js + TensorFlow.js: build a real-time face point cloud

This article focuses on the steps required to implement a real-time face mesh point cloud using Three.js and TensorFlow.js. It assumes previous knowledge of asynchronous JavaScript and Three.js basics, so those won’t be covered.

The source code for the project can be found in this Git repository. It will be helpful to look at the code as you read this article, as some basic implementation steps will be skipped.

This article also implements the project in an object-oriented way, which involves a lot of abstraction, so a basic understanding of classes in TypeScript is an advantage.

1. Set up the Three.js scene

Since the goal of this tutorial is to render a face point cloud, we need to start by setting up a Three.js scene.

The data and methods required to set up the scene are encapsulated in a factory class named ThreeSetUp in sceneSetUp.ts. This class is responsible for creating all the necessary scene objects, such as the renderer, camera, and scene. It also registers the resize handler for the canvas element. The class has the following public methods:

  • getSetUp: This method returns an object containing the camera, scene, renderer, and canvas size information.
getSetUp(){
    return {
      camera: this.camera,
      scene: this.scene,
      renderer: this.renderer,
      sizes: this.sizes,
    }
  }
  • applyOrbitControls: This method adds OrbitControls to our setup and returns the function we need to call to update the controls.
applyOrbitControls(){
    const controls = new OrbitControls(
      this.camera, this.renderer.domElement!
    )
    controls.enableDamping = true
    return ()=> controls.update();
  }
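The constructor and resize handler of ThreeSetUp are not shown here; the following is a minimal sketch of what they might look like (the canvas selector, camera parameters, and field layout are assumptions, not the repository’s exact code):

import * as THREE from 'three'

export default class ThreeSetUp {
  sizes = { width: window.innerWidth, height: window.innerHeight }
  scene = new THREE.Scene()
  camera = new THREE.PerspectiveCamera(75, this.sizes.width / this.sizes.height, 0.1, 100)
  renderer: THREE.WebGLRenderer

  constructor() {
    // The canvas element the renderer draws to (selector is an assumption)
    const canvas = document.querySelector('canvas.webgl') as HTMLCanvasElement
    this.renderer = new THREE.WebGLRenderer({ canvas })
    this.renderer.setSize(this.sizes.width, this.sizes.height)

    // Keep the camera and renderer in sync with the window size
    window.addEventListener('resize', () => {
      this.sizes.width = window.innerWidth
      this.sizes.height = window.innerHeight
      this.camera.aspect = this.sizes.width / this.sizes.height
      this.camera.updateProjectionMatrix()
      this.renderer.setSize(this.sizes.width, this.sizes.height)
    })
  }

  // getSetUp() and applyOrbitControls() as shown above
}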

Our main implementation class, FacePointCloud, will instantiate the ThreeSetUp class and call these two methods to get the setup elements and apply the orbit controls.

2. Generate video data from the webcam

In order to obtain face mesh tracking information, we need a pixel input to provide to the face mesh detector. In this case we will use the device’s webcam to generate that input. We will also use an HTML video element (without adding it to the DOM) to read the media stream from the webcam and load it in a way our code can interact with.

After this step, we will set up an HTML canvas element (also without adding it to the DOM) and render the video output to it. This also gives us the option of generating Three.js textures from the canvas and using them as materials (we won’t be implementing that in this tutorial). We will use the canvas element as the input to the face mesh detector.

To handle reading the media stream from the webcam and loading it into the HTML video element, we will create a class called WebcamVideo. This class will handle creating the HTML video element and calling the navigator API to obtain user permission and load the stream from the device’s webcam.
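The class fields are not spelled out in the article; the following is a minimal sketch of how WebcamVideo might be laid out, using the field names that appear in the snippets below (the constraint values and the optional callback signature are assumptions):

export default class WebcamVideo {
  // Video element that will hold the webcam stream (never added to the DOM)
  videoTarget = document.createElement('video')
  // Constraints passed to getUserMedia (a fuller example appears below)
  videoConstraints: MediaStreamConstraints = { video: true, audio: false }

  // Optional callback fired once the stream starts playing (an assumption)
  constructor(private onReceivingData: () => void = () => {}) {
    this.init()
  }

  private init() {
    // getUserMedia call shown below
  }

  private onLoadMetadata() {
    // autoplay / playsinline handling shown below
  }
}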

When this class is instantiated, the private init method is called. It has the following code:

private init(){
  navigator.mediaDevices.getUserMedia(this.videoConstraints)
    .then((mediaStream) => {
      this.videoTarget.srcObject = mediaStream
      this.videoTarget.onloadedmetadata = () => this.onLoadMetadata()
    })
    .catch((err) => {
      alert(err.name + ': ' + err.message)
    })
}

This method calls the getUserMedia method on the mediaDevices property of the navigator object. getUserMedia takes a constraints object (describing the requested video settings) as a parameter and returns a promise. The promise resolves to a MediaStream object containing the video data from the webcam. In the promise’s resolution callback, we set the source of the video element to the returned MediaStream.
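The article does not list the videoConstraints object itself; a typical example of what it might contain (the exact values are assumptions) is:

const videoConstraints: MediaStreamConstraints = {
  audio: false,
  video: {
    facingMode: 'user',     // prefer the front-facing camera
    width: { ideal: 500 },  // requested resolution; values are assumptions
    height: { ideal: 500 },
  },
}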

In the promise resolution callback, we also add a loadedmetadata listener on the video element. The listener’s callback triggers the object’s onLoadMetadata method, which has the following side effects:

  • Autoplays the video
  • Makes sure the video plays inline
  • Calls the optional callback we passed to the object when constructing it
private onLoadMetadata(){
  this.videoTarget.setAttribute('autoplay', 'true')
  this.videoTarget.setAttribute('playsinline', 'true')
  this.videoTarget.play()
  this.onReceivingData()
}

At this point, we have a WebcamVideo object that is responsible for creating the video element containing the live webcam data. The next step is to draw the video output on a canvas object.

For this we will create a WebcamCanvas class that uses the WebcamVideo class. It will create an instance of WebcamVideo and use it to draw the video’s output to a canvas via the canvas context’s drawImage() method. This is implemented in the updateFromWebCam method.

updateFromWebCam(){
  this.canvasCtx.drawImage(
    this.webcamVideo.videoTarget,
    0,
    0,
    this.canvas.width,
    this.canvas.height
  )
}

We have to keep calling this function in the render loop to keep updating the canvas with the current frame of the video.
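The rest of the WebcamCanvas class is not shown in the article; here is a minimal sketch of its constructor, using the field names referenced elsewhere in the article (canvas, canvasCtx, webcamVideo, receivingStreem). The canvas size and the import path are assumptions:

import WebcamVideo from './webcamVideo' // path is an assumption

export default class WebcamCanvas {
  canvas = document.createElement('canvas')
  canvasCtx: CanvasRenderingContext2D
  webcamVideo: WebcamVideo
  receivingStreem = false // flipped once the webcam stream starts playing

  constructor() {
    // Mark the stream as available when WebcamVideo fires its callback
    this.webcamVideo = new WebcamVideo(() => { this.receivingStreem = true })
    this.canvas.width = 500  // size is an assumption
    this.canvas.height = 500
    this.canvasCtx = this.canvas.getContext('2d')!
  }

  // updateFromWebCam() as shown above
}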

At this point, we have prepared the pixel input as a canvas element that displays the webcam.

3. Use TensorFlow.js to create a face mesh detector

Creating the face mesh detector and generating detection data is the main part of this tutorial. We will use the TensorFlow.js face landmarks detection model. First, install the relevant packages:

npm add @tensorflow/tfjs-core @tensorflow/tfjs-converter
npm add @tensorflow/tfjs-backend-webgl
npm add @tensorflow-models/face-detection
npm add @tensorflow-models/face-landmarks-detection

After installing all relevant packages, we will create a class that handles the following:

  • Loading the model
  • Getting the detector object
  • Storing the detector on the class
  • Exposing a public detection method for use by other objects

We create a file called faceLandmark.ts to implement this class. The imports at the top of the file are:

import '@mediapipe/face_mesh'
import '@tensorflow/tfjs-core'
import '@tensorflow/tfjs-backend-webgl'
import * as faceLandmarksDetection from '@tensorflow-models/face-landmarks-detection'

These modules are required to create and run the detector object.

We create the FaceMeshDetector class as follows:

export default class FaceMeshDetector {
  detectorConfig: Config;
  model: faceLandmarksDetection.SupportedModels.MediaPipeFaceMesh;
  detector: faceLandmarksDetection.FaceLandmarksDetector | null;

  constructor(){
    this.model = faceLandmarksDetection.SupportedModels.MediaPipeFaceMesh;
    this.detectorConfig = {
      runtime: 'mediapipe',
      refineLandmarks: true,
      solutionPath: 'https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh',
     }
    this.detector = null;
  }

  private getDetector(){
    const detector = faceLandmarksDetection.createDetector(
      this.model,
      this.detectorConfig as faceLandmarksDetection.MediaPipeFaceMeshMediaPipeModelConfig
    );
    return detector;
  }

  async loadDetector(){
    this.detector = await this.getDetector()
  }

  async detectFace(source: faceLandmarksDetection.FaceLandmarksDetectorInput){
    const data = await this.detector!.estimateFaces(source)
    const keypoints = (data as FaceLandmark[])[0]?.keypoints
    if(keypoints) return keypoints;
    return [];
  }
}

The main method in this class is getDetector, which calls the createDetector method from the faceLandmarksDetection module we imported from TensorFlow.js. createDetector takes the model we specified in the constructor:

this.model = faceLandmarksDetection.SupportedModels.MediaPipeFaceMesh;

and a detection configuration object specifying detector parameters:

this.detectorConfig = {
  runtime: 'mediapipe',
  refineLandmarks: true,
  solutionPath: 'https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh',
}

createDetector returns a promise that resolves to a detector object. The private getDetector method is then used in the public async loadDetector method, which sets the class’s this.detector property to the resolved detector.
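Before any detection can happen, the detector has to be loaded; a minimal usage sketch (the variable name is illustrative):

const faceMeshDetector = new FaceMeshDetector()
// loadDetector() must resolve before detectFace()/estimateFaces() is called
await faceMeshDetector.loadDetector()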

The FaceMeshDetector class also implements the public detectFace method:

async detectFace(source){
  const data = await this.detector!.estimateFaces(source)
  const keypoints = (data as FaceLandmark[])[0]?.keypoints
  if(keypoints) return keypoints;
  return [];
}

This method takes a source parameter, which is the pixel input. Here we will use the canvas element created above as the tracking source. The method is called as follows:

faceMeshDetector.detectFace(this.webcamCanvas.canvas)

This method calls the estimateFaces method on the detector. If a face is detected in the webcam output, estimateFaces returns an array of objects containing the detection data. Each object has a property called keypoints, which holds an array with one entry for each of the 478 points the model detects on the face. Each entry has x, y, and z properties, which are the coordinates of that point on the canvas. For example:

[
  {
    box: {
      xMin: 304.6476503248806,
      xMax: 502.5079975897382,
      yMin: 102.16298762367356,
      yMax: 349.035215984403,
      width: 197.86034726485758,
      height: 246.87222836072945
    },
    keypoints: [
      {x: 406.53152857172876, y: 256.8054528661723, z: 10.2, name: "lips"},
      {x: 406.544237446397, y: 230.06933367750395, z: 8},
      ...
    ],
  }
]

It is worth noting that these points are returned as coordinates in canvas space, which means the origin (x: 0, y: 0) is at the upper-left corner of the canvas. This will be relevant later, when we have to convert the coordinates into Three.js scene space (whose origin is at the center of the scene).

At this point we have a pixel input source, and a face mesh detector that will give us the detected points. Now, we can move on to the Three.js part!

4. Create an empty point cloud

In order to generate a face mesh in Three.js, we have to load the face mesh points from the detector and use them as the position attribute of a Three.js Points object. For the Three.js face mesh to reflect the movement in the video (i.e. to react in real time), we have to update this position attribute every time the face detection data from the detector changes.

To achieve this, we will create another factory class called PointCloud. It creates an empty Points object and exposes a public method that can be used to update the attributes of that Points object (such as the position attribute). The class looks like this:

export default class PointCloud {
  bufferGeometry: THREE.BufferGeometry;
  material: THREE.PointsMaterial;
  cloud: THREE.Points<THREE.BufferGeometry, THREE.PointsMaterial>;
  
  constructor() {
    this.bufferGeometry = new THREE.BufferGeometry();
    this.material = new THREE.PointsMaterial({
      color: 0x888888,
      size: 0.0151,
      sizeAttenuation: true,
    });
    this.cloud = new THREE.Points(this.bufferGeometry, this.material);
  }
  updateProperty(attribute: THREE.BufferAttribute, name: string){
    this.bufferGeometry.setAttribute(
      name,
      attribute
    );
    this.bufferGeometry.attributes[name].needsUpdate = true;
  }
}

This class initializes an empty BufferGeometry, the points material, and the Points object that consumes both. Adding this Points object to the scene will not change anything yet, because the geometry has no position attribute, in other words, no vertices.

The PointCloud class also exposes the updateProperty method, which accepts a buffer attribute and an attribute name. It calls the geometry’s setAttribute method and sets the attribute’s needsUpdate flag to true. This allows Three.js to reflect changes to the buffer attribute on the next requestAnimationFrame iteration.

We will use this updateProperty method to change the shape of the point cloud based on the points received from the TensorFlow.js detector.
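For example, a flat [x, y, z, x, y, z, ...] array could be fed into the cloud like this (the data here is purely illustrative):

// Three dummy vertices packed as consecutive x, y, z values
const flat = new Float32Array([0, 0, 0, 0.5, 0.5, 0.5, -0.5, 0.5, 0])
pointCloud.updateProperty(new THREE.BufferAttribute(flat, 3), 'position')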

The point cloud is now ready to receive new position data. So, it’s time to tie everything together!

5. Feed tracking information to the point cloud

To tie everything together, we’ll create an implementation class that wires up the classes, methods, and steps needed to make everything work. This class is called FacePointCloud. In its constructor it instantiates the following classes:

  • ThreeSetUp, to obtain the scene setup objects
  • WebcamCanvas, to obtain the canvas object that displays the webcam content
  • faceLandMark (our FaceMeshDetector class), to load the tracking model and obtain the detector
  • PointCloud, to set up an empty point cloud and later update it with the detection data
constructor() {
  this.threeSetUp = new ThreeSetUp()
  this.setUpElements = this.threeSetUp.getSetUp()
  this.webcamCanvas = new WebcamCanvas();
  this.faceMeshDetector = new faceLandMark()
  this.pointCloud = new PointCloud()
}

This class also has a method called bindFaceDataToPointCloud, which performs the main part of our logic: it gets the data provided by the detector, converts it into a form Three.js can understand, creates a Three.js buffer attribute from it, and uses that to update the point cloud.

async bindFaceDataToPointCloud(){
  const keypoints = await this.faceMeshDetector.detectFace(this.webcamCanvas.canvas)
  const flatData = flattenFacialLandMarkArray(keypoints)
  const facePositions = createBufferAttribute(flatData)
  this.pointCloud.updateProperty(facePositions, 'position')
}

So we pass the canvas pixel source to the detectFace method and then process the returned data in the utility function flattenFacialLandMarkArray. This step is important because there are two issues to solve:

First, as we mentioned above, the face detection model returns points in the following form:

keypoints: [
  {x: 0.542, y: 0.967, z: 0.037},
  ...
]

while the buffer attribute expects the data to be a flat array of numbers:

number[] or [0.542, 0.967, 0.037, .....]

Second, the coordinate systems differ: the origin of the canvas is in its upper-left corner, with y increasing downward, while the origin of the Three.js scene is at its center, with y increasing upward.

So, with these two issues in mind, we implement the flattenFacialLandMarkArray function to solve them. The code for this function looks like this:

function flattenFacialLandMarkArray(data: vector[]){
  let array: number[] = [];
  data.forEach((el)=>{
    el.x = mapRangetoRange(500 / videoAspectRatio, el.x,
      screenRange.height) - 1
    
    el.y = mapRangetoRange(500 / videoAspectRatio, el.y,
      screenRange.height, true) + 1
    el.z = (el.z / 100 * -1) + 0.5;
    
    array = [
      ...array,
      ...Object.values(el),
    ]
  })
  return array.filter((el) => typeof el === 'number');
}

The flattenFacialLandMarkArray function takes the keypoints we receive from the face detector and flattens them into a number[] instead of an object[]. Before pushing the numbers into the output array, it maps them from the canvas coordinate system to the Three.js coordinate system via the mapRangetoRange function, which looks like this:

function mapRangetoRange(from: number, point: number, range: range, invert: boolean = false): number{
  let pointMagnitude: number = point/from;
  if(invert) pointMagnitude = 1-pointMagnitude;
  const targetMagnitude = range.to - range.from;
  const pointInRange = targetMagnitude * pointMagnitude +
    range.from;
  
  return pointInRange
}
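The createBufferAttribute helper used in bindFaceDataToPointCloud is not shown in the article; a minimal sketch, assuming the flat array holds consecutive x, y, z triplets:

import * as THREE from 'three'

function createBufferAttribute(flatData: number[]): THREE.BufferAttribute {
  // One vertex per three consecutive values (itemSize = 3)
  return new THREE.BufferAttribute(new Float32Array(flatData), 3)
}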

We can now create the initialization function and animation loop. This is implemented in the initWork method of the FacePointCloud class as follows:

async initWork() {
  const { camera, scene, renderer } = this.setUpElements
  camera.position.z = 3
  camera.position.y = 1
  camera.lookAt(0,0,0)
  const orbitControlsUpdate = this.threeSetUp.applyOrbitControls()
  const gridHelper = new THREE.GridHelper(10, 10)
  scene.add(gridHelper)
  scene.add(this.pointCloud.cloud)
  
  await this.faceMeshDetector.loadDetector()
  
  const animate = () => {
    requestAnimationFrame(animate)
    if (this.webcamCanvas.receivingStreem){
      this.bindFaceDataToPointCloud()
    }
    this.webcamCanvas.updateFromWebCam()
    orbitControlsUpdate()
    renderer.render(scene, camera)
  }
  
  animate()
}

We can see how this init function ties everything together: it gets the Three.js setup elements, positions the camera, and adds a GridHelper and the point cloud to the scene.

It then loads the detector of the faceLandMark instance and sets up our animation function. In the animation function, we first check whether the WebcamCanvas element is receiving the stream from the webcam, and if so we call the bindFaceDataToPointCloud method, which internally calls the face detection function, converts the data into a BufferAttribute, and updates the point cloud’s position attribute.
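Finally, everything can be kicked off from the project’s entry point along these lines (a sketch; the variable name is an assumption):

const facePointCloudApp = new FacePointCloud()
facePointCloudApp.initWork()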

Now, if you run your code, you should see the real-time face point cloud in your browser!

Original link: Three.js builds face point cloud in real time – BimAnt