An attempt at a WebRTC communication mechanism in IoT scenarios

Specific implementation steps

Run https://github.com/Jhuster/RTCStartupDemo to see how it behaves in practice.
Replace the repository's RTCSignalClient with the gateway's communication mechanism, check whether the project source code (on GitHub) can be reused, and refactor the project. The final run date is 2023/05/28.

WebRTC related knowledge

Transport protocols used by WebRTC:

ICE (Interactive Connectivity Establishment): A framework that lets two devices (peers) communicate across NAT (Network Address Translation) and firewalls. ICE uses the STUN (Session Traversal Utilities for NAT) and TURN (Traversal Using Relays around NAT) protocols to find the best communication path. The possible paths are: direct communication over a public IP:port, direct communication over a LAN IP:port, or relaying through a TURN server.
DTLS (Datagram Transport Layer Security): This is a security protocol used to establish a secure connection between two peers to prevent data from being intercepted or tampered with.
SRTP (Secure Real-time Transport Protocol): Once a secure connection is established via DTLS, audio and video streams are sent securely between two peers via SRTP.
SCTP (Stream Control Transmission Protocol): The transport protocol that carries RTCDataChannel data. In WebRTC it runs on top of DTLS.

Servers involved in WebRTC

During data transmission, WebRTC uses P2P (peer-to-peer) connections, which generally do not require a server. The packets only need to be forwarded by the ISPs (Internet Service Providers) along the path according to their own routing tables.

In practice, however, establishing a P2P connection does require servers. WebRTC uses three kinds: a signaling server, a STUN (Session Traversal Utilities for NAT) server, and a TURN (Traversal Using Relays around NAT) server. Their functions are as follows:

Signaling Server: WebRTC uses a signaling server to exchange information between two peers, including session control messages (such as opening and closing connections), error messages, media metadata (such as encoding and format), network data (such as network address and port), and security parameters (such as those used to establish a secure connection). The signaling process is not directly covered by the WebRTC specification, so any reliable transport mechanism can be used (e.g. WebSocket, XMPP, SIP). In the IoT scenario specifically, the gateway can act as the signaling server.
STUN server (Session Traversal Utilities for NAT): The main function of the STUN server is to help devices behind NAT (Network Address Translation) discover their public IP addresses and ports. This information is included in signaling messages so that other peers know how to communicate with it directly.
TURN server (Traversal Using Relays around NAT): When direct point-to-point communication cannot be established due to NAT or firewall restrictions, the TURN server acts as a relay and forwards data to the corresponding peer. The use of a TURN server will add some latency, but in some cases it is the only way to make communication possible.
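
Since the signaling format is left to the application, a gateway-based signaling channel needs its own message envelope. The following is a minimal sketch of one possible envelope, not the format used by any particular gateway: the message types, the field names (`type`, `from`, `to`, `payload`), and the device names are all assumptions made for illustration.

```python
import json

# Hypothetical message types for a gateway-based signaling channel.
VALID_TYPES = {"offer", "answer", "candidate", "bye"}
REQUIRED_FIELDS = {"type", "from", "to", "payload"}


def encode_signal(msg_type: str, sender: str, receiver: str, payload: dict) -> str:
    """Serialize one signaling message to a JSON string."""
    if msg_type not in VALID_TYPES:
        raise ValueError(f"unknown signaling type: {msg_type}")
    return json.dumps(
        {"type": msg_type, "from": sender, "to": receiver, "payload": payload}
    )


def decode_signal(raw: str) -> dict:
    """Parse and validate a signaling message received from the gateway."""
    msg = json.loads(raw)
    missing = REQUIRED_FIELDS - msg.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if msg["type"] not in VALID_TYPES:
        raise ValueError(f"unknown signaling type: {msg['type']}")
    return msg


# Usage: the device names and the SDP placeholder are illustrative only.
wire = encode_signal("offer", "camera-01", "phone-42", {"sdp": "<offer SDP text>"})
print(decode_signal(wire)["type"])
```

Whatever transport the gateway already provides can then carry these strings unchanged, since WebRTC only needs them delivered reliably to the other peer.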

STUN Server and TURN Server Workflow

The signaling server is the easiest of the three to understand. Put simply, peers use a signaling server that we build ourselves to exchange IP addresses, supported audio and video formats, and related information so that they can establish a P2P connection. The workflow of the STUN and TURN servers is explained in more detail below.

When using a STUN server:

Both Alice and Bob’s computers are behind a NAT (Network Address Translation) device, such as their router. Their computers use private IP addresses and only the router has a public IP address. In this case, Alice and Bob’s computers have no way of knowing their public IP addresses and ports, which are required to establish a point-to-point connection.

At this point, each computer can send a request to the STUN server, which responds with that computer's public IP address and port. Alice and Bob's computers then exchange this information with each other through the signaling server, after which they can try to establish a direct point-to-point connection for audio and video communication.
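
The request and response described above are small binary messages. As a rough sketch of what goes over the wire, the code below builds a STUN Binding Request and decodes the XOR-MAPPED-ADDRESS attribute of a Binding Success Response, following the message layout in RFC 5389. It handles IPv4 only and does no network I/O. The function names are my own, not part of any library.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442            # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
ATTR_XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: bytes) -> bytes:
    """Build a 20-byte STUN Binding Request with no attributes."""
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def parse_binding_response(data: bytes, transaction_id: bytes):
    """Return (ip, port) from the XOR-MAPPED-ADDRESS attribute (IPv4 only)."""
    msg_type, length, cookie = struct.unpack("!HHI", data[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding Success Response")
    if data[8:20] != transaction_id:
        raise ValueError("transaction ID mismatch")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", data[offset:offset + 4])
        value = data[offset + 4:offset + 4 + attr_len]
        is_ipv4 = len(value) >= 8 and value[1] == 0x01
        if attr_type == ATTR_XOR_MAPPED_ADDRESS and is_ipv4:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            # Port and address are XOR-ed with the magic cookie on the wire.
            port = xport ^ (MAGIC_COOKIE >> 16)
            addr = struct.pack("!I", xaddr ^ MAGIC_COOKIE)
            return socket.inet_ntoa(addr), port
        # Attribute values are padded to a multiple of 4 bytes.
        offset += 4 + attr_len + (-attr_len % 4)
    raise ValueError("no XOR-MAPPED-ADDRESS attribute found")


# Usage: build a request with a random transaction ID.
tid = os.urandom(12)
request = build_binding_request(tid)
print(len(request))
```

To query a real server, the request would be sent over a UDP socket to a STUN server of your choosing and the reply passed to `parse_binding_response` together with the same transaction ID.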

When using a TURN server:

In some cases, even with the public IP address and port provided by the STUN server, Alice and Bob may not be able to directly establish a point-to-point connection. For example, they may be on a network that blocks point-to-point connections due to NAT type or firewall settings.

At this point, they can use the TURN server. Alice's computer sends its audio and video data to the TURN server, which forwards the data to Bob, and vice versa. Although this adds some latency, it lets Alice and Bob hold the audio and video call.
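
The fallback from a direct path to a relayed one can be pictured as a simple preference order. The sketch below is a deliberate simplification: real ICE ranks candidate pairs rather than single types, and the `reachable_types` input is a hypothetical stand-in for the outcome of connectivity checks. The numeric preferences are the values recommended in RFC 8445.

```python
# Type preferences recommended by RFC 8445: higher means "prefer this path".
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def choose_path(reachable_types):
    """Pick the most preferred candidate type that passed connectivity checks.

    `reachable_types` is a hypothetical input: the candidate types for which
    a connectivity check succeeded.
    """
    usable = [t for t in reachable_types if t in TYPE_PREFERENCE]
    if not usable:
        return None  # no working path: the connection cannot be established
    return max(usable, key=TYPE_PREFERENCE.__getitem__)


# STUN-discovered address works, so the relay is not needed.
print(choose_path({"srflx", "relay"}))
# Only the TURN relay works, so traffic goes through it.
print(choose_path({"relay"}))
```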

Establishment of WebRTC transport connection

The connection establishment process uses the three servers mentioned above and consists of three stages: Offer, Answer, and IceCandidate. Once all three stages have been processed, transmission of the actual data can begin.

[Figure: flow chart of the Offer, Answer, and IceCandidate exchange]
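
The three stages can be walked through with a toy model. The sketch below is not the WebRTC API: `Peer` is a hypothetical class that only tracks signaling state, and the SDP and candidate values are placeholder strings. The state names (`stable`, `have-local-offer`, `have-remote-offer`) are borrowed from the real signaling states to show the order in which the steps must happen.

```python
class Peer:
    """Toy peer that tracks signaling state; not a real RTCPeerConnection."""

    def __init__(self, name):
        self.name = name
        self.state = "stable"
        self.local_sdp = None
        self.remote_sdp = None
        self.remote_candidates = []

    def _expect(self, state):
        if self.state != state:
            raise RuntimeError(f"{self.name}: expected {state}, got {self.state}")

    def create_offer(self):
        self._expect("stable")
        self.local_sdp = f"offer-from-{self.name}"
        self.state = "have-local-offer"
        return self.local_sdp

    def receive_offer(self, sdp):
        self._expect("stable")
        self.remote_sdp = sdp
        self.state = "have-remote-offer"

    def create_answer(self):
        self._expect("have-remote-offer")
        self.local_sdp = f"answer-from-{self.name}"
        self.state = "stable"
        return self.local_sdp

    def receive_answer(self, sdp):
        self._expect("have-local-offer")
        self.remote_sdp = sdp
        self.state = "stable"

    def add_remote_candidate(self, candidate):
        # Candidates are only usable once the remote description is known.
        if self.remote_sdp is None:
            raise RuntimeError("remote description not set yet")
        self.remote_candidates.append(candidate)

    def ready(self):
        return self.state == "stable" and bool(self.remote_candidates)


def negotiate(caller, callee):
    """Run the three stages; each call stands for one signaling message."""
    callee.receive_offer(caller.create_offer())      # stage 1: Offer
    caller.receive_answer(callee.create_answer())    # stage 2: Answer
    callee.add_remote_candidate("caller-candidate")  # stage 3: IceCandidate
    caller.add_remote_candidate("callee-candidate")


alice, bob = Peer("alice"), Peer("bob")
negotiate(alice, bob)
print(alice.ready(), bob.ready())
```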

The differences between Offer, Answer and ICE Candidate:

Data differences:
- SDP (Session Description Protocol): The SDP information collected when creating Offers and Answers mainly describes the media capabilities and network status of one party. An SDP description might look like this:

o=- 4185532051989611207 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE audio video
a=msid-semantic: WMS 0a4051f6-a736-4607-b645-b240ba7b95cf
m=audio 9 UDP/TLS/RTP/SAVPF 111 103 9 102 0 8 105 13 110 113 126
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:lAHT
a=ice-pwd:zi34AQt9daqojFjk0ayBcVsO
a=ice-options:trickle renomination
a=fingerprint:sha-256 6D:E4:4B:15:DC:26:F2:AB:88:51:B9:24:21:78:19:9B:03:FD:69:C6:3D:58:24:59:14:24:8A:A1:60:C0:BD:00
a=setup:actpass
a=mid:audio
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:2 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01
a=sendrecv
a=rtcp-mux
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1
a=rtpmap:103 ISAC/16000
a=rtpmap:9 G722/8000
a=rtpmap:102 ILBC/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:105 CN/16000
a=rtpmap:13 CN/8000
a=rtpmap:110 telephone-event/48000
a=rtpmap:113 telephone-event/16000
a=rtpmap:126 telephone-event/8000
a=ssrc:161843853 cname:5qtwHvMz9P2Je6uK
a=ssrc:161843853 msid:0a4051f6-a736-4607-b645-b240ba7b95cf ARDAMSa0
a=ssrc:161843853 mslabel:0a4051f6-a736-4607-b645-b240ba7b95cf
a=ssrc:161843853 label:ARDAMSa0
m=video 9 UDP/TLS/RTP/SAVPF 96 97 98 99 125 100 101 127
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:lAHT
a=ice-pwd:zi34AQt9daqojFjk0ayBcVsO
a=ice-options:trickle renomination
a=fingerprint:sha-256 6D:E4:4B:15:DC:26:F2:AB:88:51:B9:24:21:78:19:9B:03:FD:69:C6:3D:58:24:59:14:24:8A:A1:60:C0:BD:00
a=setup:actpass
a=mid:video
a=extmap:14 urn:ietf:params:rtp-hdrext:toffset

- ICE candidates: The ICE candidates collected in the onIceCandidate() method mainly include possible IP addresses and port information. An ICE candidate might look like this:

{
 candidate: "candidate:842163049 1 udp 1686052607 1.2.3.4 46154 typ srflx raddr 10.0.0.17 rport 46154 generation 0",
 sdpMid: "audio",
 sdpMLineIndex: 0
}
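
The candidate string packs several fields into one line. As a hedged illustration, the sketch below splits it into named fields and decomposes the priority value using the formula from RFC 8445: priority = 2^24 × type preference + 2^8 × local preference + (256 − component ID). The function names are my own.

```python
def parse_candidate(line: str) -> dict:
    """Split an ICE candidate attribute string into named fields."""
    fields = line.split()
    # Layout: candidate:<foundation> <component> <transport> <priority>
    #         <ip> <port> typ <type> [<name> <value> ...]
    well_formed = (
        len(fields) >= 8
        and fields[0].startswith("candidate:")
        and fields[6] == "typ"
    )
    if not well_formed:
        raise ValueError("not an ICE candidate line")
    parsed = {
        "foundation": fields[0].split(":", 1)[1],
        "component": int(fields[1]),
        "transport": fields[2].lower(),
        "priority": int(fields[3]),
        "ip": fields[4],
        "port": int(fields[5]),
        "type": fields[7],
    }
    # Remaining tokens are name/value pairs such as raddr, rport, generation.
    extensions = fields[8:]
    parsed.update(zip(extensions[0::2], extensions[1::2]))
    return parsed


def ice_priority(type_pref: int, local_pref: int, component: int) -> int:
    """Combine the three parts into a priority (RFC 8445 formula)."""
    return (type_pref << 24) + (local_pref << 8) + (256 - component)


def split_priority(priority: int):
    """Recover (type_pref, local_pref, component) from a priority value."""
    return priority >> 24, (priority >> 8) & 0xFFFF, 256 - (priority & 0xFF)


cand = parse_candidate(
    "candidate:842163049 1 udp 1686052607 1.2.3.4 46154 "
    "typ srflx raddr 10.0.0.17 rport 46154 generation 0"
)
print(cand["type"], cand["ip"], cand["port"], cand["raddr"])
print(split_priority(cand["priority"]))
```

For the candidate shown above, the priority 1686052607 decomposes into a type preference of 100 (the value RFC 8445 recommends for server-reflexive candidates), a local preference of 32542, and component 1.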

Summary of specific differences:
- SDP (Session Description Protocol): SDP mainly describes the media capabilities and network status of one party. This includes:
  - Media capabilities: For example, audio and video encoding formats (such as Opus, ISAC, G722, PCMU, PCMA, etc.) and related parameters (such as sampling rate, number of channels, etc.).
  - Network status: For example, IP address, port, transport protocol (such as UDP/TLS/RTP/SAVPF), etc.
  - Other settings: For example, ICE username (ufrag), password (pwd), options (options), DTLS fingerprint (fingerprint), setup, etc.
- ICE candidates: ICE candidates mainly include possible IP addresses and port information. This includes:
  - Network address: Such as IP address and port.
  - Network status: For example, transport protocol (such as UDP), priority, etc.
  - Type: For example, the type of candidate (such as host, srflx, prflx, relay), indicating whether the candidate is a direct local address, a server-reflexive address, a peer-reflexive address, or a relay address.
  - Association information: such as sdpMid and sdpMLineIndex, used to associate ICE candidates to specific media streams in the SDP description.
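
To make the SDP side of this summary concrete, here is a minimal sketch that splits an SDP description into media sections and pulls out the codecs plus the ICE and DTLS settings listed above. It is an illustration only, not a complete SDP parser: it ignores most line types, and the `KEPT_ATTRIBUTES` selection is my own choice.

```python
# Attributes kept per media section; an illustrative selection only.
KEPT_ATTRIBUTES = {"ice-ufrag", "ice-pwd", "ice-options", "fingerprint", "setup", "mid"}


def parse_sdp(sdp: str) -> dict:
    """Split SDP text into session lines and per-media codec/ICE/DTLS data."""
    result = {"session": [], "media": []}
    current = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key == "m":
            # m=<kind> <port> <protocol> <payload types...>
            kind, port, protocol, *payload_types = value.split()
            current = {
                "kind": kind,
                "port": int(port),
                "protocol": protocol,
                "payload_types": payload_types,
                "codecs": {},
                "attributes": {},
            }
            result["media"].append(current)
        elif current is None:
            result["session"].append(line)  # lines before the first m= line
        elif key == "a":
            name, _, attr_value = value.partition(":")
            if name == "rtpmap":
                payload_type, codec = attr_value.split(" ", 1)
                current["codecs"][payload_type] = codec
            elif name in KEPT_ATTRIBUTES:
                current["attributes"][name] = attr_value
    return result


SAMPLE = """o=- 4185532051989611207 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111 103
c=IN IP4 0.0.0.0
a=ice-ufrag:lAHT
a=setup:actpass
a=mid:audio
a=rtpmap:111 opus/48000/2
a=rtpmap:103 ISAC/16000
"""

audio = parse_sdp(SAMPLE)["media"][0]
print(audio["kind"], audio["protocol"], audio["codecs"])
```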

Specific examples of WebRTC transports

Public Network (Internet):

When Alice and Bob want to hold an audio and video call over the Internet through WebRTC, they first need to exchange SDP (Session Description Protocol) information through the signaling server. This includes their respective IP address and port information, as well as the audio and video encoding formats they support.

The SDP they exchange also carries the fingerprint of each side's DTLS (Datagram Transport Layer Security) certificate. Once ICE has found a working path, Alice and Bob perform a DTLS handshake over it, and the SRTP keys are derived from that handshake. Alice encrypts her audio and video packets with these keys and sends them over the Internet to Bob. Because the packets are encrypted and authenticated, an eavesdropper who intercepts them cannot read them, and any tampering is detected. Bob decrypts the packets with the matching keys to recover the original audio and video data.
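
The `a=fingerprint` line in the SDP example earlier is what ties the DTLS handshake to the signaling exchange: each peer checks that the certificate presented during the handshake hashes to the fingerprint it received in the SDP. The sketch below shows that check under simplifying assumptions. It supports sha-256 only, the function names are my own, and the certificate bytes are a placeholder rather than a real DER-encoded certificate.

```python
import hashlib


def dtls_fingerprint(cert_der: bytes) -> str:
    """Format the SHA-256 hash of a certificate as colon-separated hex pairs."""
    digest = hashlib.sha256(cert_der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fingerprint_matches(sdp_line: str, cert_der: bytes) -> bool:
    """Check an `a=fingerprint:sha-256 ...` SDP line against a certificate."""
    prefix = "a=fingerprint:sha-256 "
    if not sdp_line.startswith(prefix):
        return False
    advertised = sdp_line[len(prefix):].strip().upper()
    return advertised == dtls_fingerprint(cert_der)


# Placeholder bytes standing in for a DER-encoded certificate.
fake_cert = b"placeholder certificate bytes"
line = "a=fingerprint:sha-256 " + dtls_fingerprint(fake_cert)
print(fingerprint_matches(line, fake_cert))
print(fingerprint_matches(line, b"certificate from an impostor"))
```

If the fingerprints do not match, the peer on the other end of the DTLS handshake is not the one that took part in signaling, and the connection should be rejected.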

Local Area Network:

Within a company’s LAN, two computers (for example, A and B) can communicate directly through WebRTC for audio and video. In this case, since they are in the same LAN, they can communicate directly through the LAN IP without the need for IP and port discovery through a STUN or TURN server.

Likewise, they perform a DTLS handshake and use the resulting keys to encrypt and decrypt SRTP packets. Computer A encrypts its audio and video packets and sends them to computer B over the LAN. Because the packets are encrypted and authenticated, they cannot be read if intercepted, and any tampering is detected. Computer B decrypts the received packets with the matching keys to recover the original audio and video data.

Analysis of advantages and disadvantages of using WebRTC

Advantages
- Powerful functionality with a relatively simple API. It supports both LAN and WAN communication and real-time transmission of audio and video. It also supports echo cancellation, noise suppression, packet loss concealment, jitter buffering, automatic gain control, voice activity detection, and more, along with multiple codecs (such as VP8, VP9, H.264, Opus, G.711, G.722).
- The framework is mature, and each transport module is backed by a corresponding protocol definition. During connection establishment, ICE performs NAT traversal to set up the P2P connection, and DTLS secures the channel between the two peers and protects data integrity. Once the connection is established, media is encrypted with SRTP, and data channel messages are carried by SCTP on top of DTLS.
- It adapts to a wide range of network environments. NAT traversal through STUN/TURN servers makes communication possible in many complex network setups, and P2P transmission avoids routing traffic through an intermediate server whenever a direct path exists.
- The framework is open source and its underlying implementation can be viewed and understood.
Disadvantages
- Media is transported over UDP, which does not guarantee that packets arrive or that they arrive in order, so audio and video quality may degrade when network conditions are poor.
- The architecture is relatively mature and complex, and it is difficult to modify the source code for customization. It requires an in-depth understanding of its underlying implementation, and developers need to have sufficient experience in network programming and audio and video development.
- When it comes to WAN support, it takes a certain amount of time and resources to develop and debug, especially when dealing with issues related to NAT, firewalls, etc. If traffic relaying through a TURN server is required, certain operating costs may be incurred.
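
To illustrate the first disadvantage, the sketch below shows one simple way a receiver can cope with reordered and lost UDP packets: hold a few packets back, release them in sequence order, and skip a missing packet once too many later ones are waiting. This is a simplified illustration, not WebRTC's actual jitter buffer. The class name and the `depth` parameter are assumptions, and sequence number wraparound is ignored.

```python
class ReorderBuffer:
    """Release payloads in sequence order, skipping gaps when the buffer fills."""

    def __init__(self, depth=3, first_seq=0):
        self.depth = depth          # how many packets may wait for a missing one
        self.next_seq = first_seq   # next sequence number to release
        self.pending = {}           # seq -> payload, packets waiting their turn

    def push(self, seq, payload):
        """Add one packet; return the payloads that are now ready, in order."""
        if seq >= self.next_seq:
            self.pending.setdefault(seq, payload)  # duplicates are ignored
        # Packets older than next_seq arrived too late and are dropped.
        ready = []
        while self.pending:
            if self.next_seq in self.pending:
                ready.append(self.pending.pop(self.next_seq))
                self.next_seq += 1
            elif len(self.pending) > self.depth:
                # Give up on the missing packet and jump to the oldest held one.
                self.next_seq = min(self.pending)
            else:
                break
        return ready


buf = ReorderBuffer(depth=2)
for seq, data in [(0, "a"), (2, "c"), (1, "b"), (4, "e"), (5, "f"), (6, "g")]:
    print(seq, buf.push(seq, data))
```

In the run above, packet 1 arrives late but is still delivered in order, while packet 3 never arrives and is skipped once three later packets are waiting.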

Summary

WebRTC is a powerful, mature, out-of-the-box P2P audio and video communication framework. It is well supported on multiple platforms and is well suited to integration into applications or to secondary development. As an open source framework, it also exposes its source code, so developers can customize it deeply to meet special needs. Such customization, however, usually requires solid experience in network programming and audio and video development. In addition, using WebRTC across a wide-area network may incur extra operating costs, especially when a TURN server is needed to relay traffic. Overall, WebRTC is a powerful and flexible tool, but using it well requires a real investment of expertise and resources.

References

https://github.com/Jhuster/RTCStartupDemo

https://github.com/ddssingsong/webrtc_server_java

WebRTC-Android Exploration – Basic postures for creating audio and video call programs – Nuggets

Android | WebRTC

How does WebRTC work?

Original article: An attempt at WebRTC communication mechanism in IoT scenarios – Zhihu
