WebRTC client to JVB connection recovery


#1

Hi,

I’d like to clarify the best practices for resolving connectivity issues between a client app and a JVB instance.
As a result of this clarification there might (I hope) be several pull requests to Jitsi with fixes from my side.

A little bit of background for the questions below.
I’m working on a native mobile client app which uses the WebRTC SDK for mobile under the hood.
I also have a tiny server app which translates client requests to create/join/hold/leave a conference into corresponding requests to a JVB instance.
The tiny server app runs the JVB instance within the same process, so the server has direct access to the whole public interface of JVB.
Basically, the tiny server takes an incoming join request with SDP from the client and transforms it into the corresponding request to JVB.

Everything works as expected when network conditions are good. The fun starts when the client loses connectivity during the call, switches between networks, or a network becomes available or unavailable during the call.

There are several scenarios I’ve considered and tested as “real world” scenarios with a mobile client.

Scenario 1 (works):
Let’s assume there is a client which has two network interfaces (could be 3G and WiFi or something else), so during call initiation WebRTC allocates ports on both interfaces, which results in 2 local ICE candidates being reported to the client.
When such a client connects to JVB, it is able to establish 2 ICE connections with JVB.
Under the hood inside JVB, the ice4j Agent receives STUN ping requests from both interfaces and sends responses; the ice4j Agent also sends STUN ping requests back to both network interfaces.
Both client candidates are considered “authorized” from that moment, because the STUN ping “handshake” succeeded.
Then the ice4j Agent transitions to the terminated state after a configurable timeout (3 seconds by default).
Once in the terminated state, the Agent no longer accepts STUN pings from addresses which it hasn’t seen before, so no new “authorized” address will be discovered.
In this scenario, when the user loses connection on one of the networks, data starts flowing over the other network, so there is no problem here as long as the network addresses do not change during the call.
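
The behaviour described above can be pictured with a toy model. This is only a sketch of the description in this post, not the ice4j API: `ToyIceAgent`, its methods, and the addresses are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIceAgent:
    """Toy stand-in for the agent behaviour described in the post (hypothetical names)."""
    terminated: bool = False
    authorized: set = field(default_factory=set)

    def on_stun_ping(self, remote_addr: str) -> bool:
        """Handle an incoming STUN ping; return True if the address is authorized afterwards."""
        if remote_addr in self.authorized:
            return True
        if self.terminated:
            # The ping is still answered, but no check is sent back,
            # so the "handshake" never completes for a new address.
            return False
        self.authorized.add(remote_addr)
        return True

    def terminate(self) -> None:
        """Models the transition to the terminated state after the timeout."""
        self.terminated = True

    def accepts_media_from(self, remote_addr: str) -> bool:
        return remote_addr in self.authorized


agent = ToyIceAgent()
agent.on_stun_ping("10.0.0.2:5000")     # e.g. the WiFi candidate
agent.on_stun_ping("100.64.0.7:5000")   # e.g. the 3G candidate
agent.terminate()                        # timeout elapsed
late_ok = agent.on_stun_ping("192.168.1.9:5000")  # address first seen after termination
```

Both early addresses stay usable after termination, while the late one is never added to the authorized set.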

Scenario 2 (does not work):
Consider it as an extension to Scenario 1.
Now the client adds a new network interface after the call was established.
When WebRTC’s continual candidate gathering mode is enabled, WebRTC monitors network interfaces and is able to discover new candidates on new interfaces during a call.
In this case a 3rd candidate is allocated on the new interface and WebRTC attempts to send a STUN ping to Jitsi to see if the connection works.
According to Wireshark, the ice4j Agent does respond to such a STUN ping, but does not send a STUN ping request back, because the Agent is already terminated. The new 3rd candidate is therefore not considered “authorized”, because the STUN “handshake” was not completed.
If the initial two candidates then become unavailable and only the new 3rd one is actually connectable, no data will flow between the client and JVB, because JVB rejects all traffic from the “unauthorized” candidate whose STUN ping “handshake” is incomplete.
One workaround for this problem was found: delaying Agent termination “forever”.
This makes the ice4j Agent compatible with WebRTC’s continual gathering, and data flows seamlessly when the client switches between networks.
One problem with this approach is that the ice4j Agent creates a TerminationThread which is not properly killed when the “forever” timeout outlives the Agent.
I’ve fixed this and proposed a pull request with the fix, https://github.com/jitsi/ice4j/pull/150; now, if the timeout is “forever”, the termination thread is properly “cancelled”.
I also have an ongoing enhancement in this area, but want to start with the baby step of pull request #150.
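
The general idea of a delayed termination that can be cancelled when its owner goes away can be sketched as follows. This is not the code from the pull request and not the ice4j implementation; `TerminationTimer` and the convention that `None` means “forever” are assumptions made for the example.

```python
import threading


class TerminationTimer:
    """Hypothetical helper that delays a 'terminate' callback and can be cancelled."""

    def __init__(self, on_terminate, delay_seconds):
        # Assumption for this sketch: delay_seconds=None stands for "forever",
        # in which case no timer thread is created at all.
        self._on_terminate = on_terminate
        self.fired = False
        self._timer = None
        if delay_seconds is not None:
            self._timer = threading.Timer(delay_seconds, self._fire)
            self._timer.daemon = True

    def start(self) -> None:
        if self._timer is not None:
            self._timer.start()

    def _fire(self) -> None:
        self.fired = True
        self._on_terminate()

    def cancel(self) -> None:
        """Call when the owning agent is freed, so nothing outlives it."""
        if self._timer is not None:
            self._timer.cancel()


events = []

forever = TerminationTimer(lambda: events.append("terminated"), delay_seconds=None)
forever.start()
forever.cancel()   # owner freed; there is no thread to leak

delayed = TerminationTimer(lambda: events.append("terminated"), delay_seconds=60)
delayed.start()
delayed.cancel()   # owner freed before the delay elapsed; callback never runs
```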

Scenario 3 (does not work):
Consider the case where WebRTC continual gathering is not enabled (for example because WebRTC comes from a browser, not the native SDK).
In such a case the only way known to me to handle a network switch is to trigger “ice-restart”, because otherwise WebRTC will not attempt to allocate candidates on network interfaces which appeared after the call was established.
When an ICE restart is triggered, ICE candidate regathering happens on the client, and the ICE pwd and ufrag attributes in the SDP change.
Because the ICE pwd and ufrag have changed, it is necessary to inform JVB about this change, otherwise the ice4j Agent will not authorize STUN pings from the new candidates.
I’ve made several attempts to make ice-restart work in JVB (by sending different Colibri requests), but none of them worked without local fixes in JVB itself (maybe I’ve used the JVB API wrong, which is why I’m asking for clarification about ways to handle these scenarios).
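
On the server side, the credential change that an ICE restart introduces can be detected by comparing the `a=ice-ufrag` and `a=ice-pwd` attributes of the old and new SDP. A minimal sketch, assuming bundled media so that the first occurrence of each attribute is representative; the function names and the sample SDP fragments are made up for illustration.

```python
def ice_credentials(sdp: str):
    """Return (ufrag, pwd) taken from the first a=ice-ufrag / a=ice-pwd lines."""
    ufrag = pwd = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if ufrag is None and line.startswith("a=ice-ufrag:"):
            ufrag = line[len("a=ice-ufrag:"):]
        elif pwd is None and line.startswith("a=ice-pwd:"):
            pwd = line[len("a=ice-pwd:"):]
    return ufrag, pwd


def is_ice_restart(old_sdp: str, new_sdp: str) -> bool:
    """Treat any change of ufrag or pwd as an ICE restart."""
    return ice_credentials(old_sdp) != ice_credentials(new_sdp)


# Sample fragments (not complete SDP), for illustration only.
OLD_SDP = "v=0\r\na=ice-ufrag:abcd\r\na=ice-pwd:0123456789abcdef012345\r\n"
NEW_SDP = "v=0\r\na=ice-ufrag:wxyz\r\na=ice-pwd:fedcba9876543210fedcba\r\n"

restart_detected = is_ice_restart(OLD_SDP, NEW_SDP)
```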

These are the attempts that were made to handle an ice-restart coming from the client:

Attempt 1:
I tried to expire the existing channels and transport, then create new ones with a new channel bundle which has the updated ufrag and pwd.
This attempt failed, because the connection was still interrupted on the WebRTC side: it received a DTLS “disconnect” alert when the existing channels were expired, and WebRTC did not recover the connection with the new ice4j Agent. Maybe it’s a WebRTC bug in this case, maybe not; I haven’t investigated this deeply.
Does anyone know of, or has anyone experienced, something like that with WebRTC + JVB?

Attempt 2:
I also tried to update the ufrag and pwd of the existing channel bundle via a Colibri request, but the JVB code is currently written in such a way that it skips updating ufrag/pwd if the agent is terminated (though for some reason it does update the fingerprint, which does not change due to an ICE restart).
I’m not sure if it’s valid to update the existing channel bundle’s pwd/ufrag to implement ice restart; I couldn’t find information about the proper way of implementing it.
I hope someone here will clarify how an “ice-restart” coming from the client should be handled.

Attempt 3 (does work, but I really don’t like it):
In this case, when a network interruption is detected, “ice-restart” is not triggered on the old peer connection; instead the old peer connection is closed and a new one is created. The client initiates a regular join request to the tiny server, which creates the corresponding request to JVB. During this join it is detected that it is actually a “re-join”, so the old channels are immediately expired and new channels and a new channel bundle are created. In this case the ICE connection is successfully established over the currently (newly) available network interface. Basically it is almost the same as Attempt 1, except that a new peer connection is created on the client instead of triggering “ice-restart”.
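
The “re-join” handling in Attempt 3 boils down to bookkeeping on the tiny server: remember which channel bundle belongs to which participant, and expire the old one when the same participant joins again. A minimal sketch with made-up names; `ToyBridge` is a stand-in and does not reflect the JVB API.

```python
class ToyBridge:
    """Stand-in for the bridge (hypothetical; not the JVB API)."""

    def __init__(self):
        self._next_id = 0
        self.live = set()

    def create_channels(self) -> str:
        """Create new channels plus a new channel bundle and return the bundle id."""
        self._next_id += 1
        bundle_id = f"bundle-{self._next_id}"
        self.live.add(bundle_id)
        return bundle_id

    def expire(self, bundle_id: str) -> None:
        self.live.discard(bundle_id)


class TinyServer:
    def __init__(self, bridge: ToyBridge):
        self.bridge = bridge
        self.bundles = {}  # participant id -> bundle id

    def join(self, participant_id: str) -> str:
        old = self.bundles.get(participant_id)
        if old is not None:
            # Same participant joining again: treat it as a re-join
            # and expire the old channels immediately.
            self.bridge.expire(old)
        new = self.bridge.create_channels()
        self.bundles[participant_id] = new
        return new


bridge = ToyBridge()
server = TinyServer(bridge)
first = server.join("participant-1")
second = server.join("participant-1")   # re-join after a network switch
```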

So, could someone more experienced with Jitsi Videobridge please clarify which of these scenarios are currently supported and, if they are supported, how they are implemented in Jitsi Meet + Jitsi Videobridge?

There are some existing topics here about this problem, but there is no indication that the issues were solved:

  1. [jitsi-dev] ICE4J: Continuous gathering of ICE candidates.
  2. Github: Re-Connect on Fail

Thanks in advance,
Yura.


#2

Hi,

This is the area we’re currently working on in Jitsi Meet, but I understand you have your own client.

At the moment I can only give you a hint on “Attempt 1”, while I keep reading the other threads you’ve linked.

So from my observations Chrome will not correctly reconnect the DTLS layer if the DTLS fingerprint remains unchanged after ice restart. And that’s the case here, because the JVB uses a single DTLS certificate and expires it only after a day or so. What you can do is make another setRemoteDescription/setLocalDescription (sRD/sLD) cycle with fake DTLS prior to setting the new transport information. This will “reset” the DTLS and the ice-restart should work.
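
One way to read the “fake DTLS” step is as SDP munging: before the extra sRD/sLD cycle, the `a=fingerprint` lines of the description are replaced with a placeholder. The sketch below only shows the string rewriting and assumes that interpretation; the placeholder value and the function name are made up, and whether a given browser accepts such a description is not verified here.

```python
import re

# Placeholder value for illustration: a well-formed but meaningless sha-256 fingerprint.
PLACEHOLDER_FINGERPRINT = "sha-256 " + ":".join(["00"] * 32)


def with_placeholder_fingerprint(sdp: str) -> str:
    """Replace every a=fingerprint line, keeping the original line endings."""
    return re.sub(
        r"(?m)^a=fingerprint:[^\r\n]*",
        "a=fingerprint:" + PLACEHOLDER_FINGERPRINT,
        sdp,
    )


# Sample fragment (not complete SDP), for illustration only.
sample = "v=0\r\na=fingerprint:sha-256 AA:BB:CC\r\na=setup:actpass\r\n"
munged = with_placeholder_fingerprint(sample)
```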

WebRTC’s continuous ICE candidate gathering looks interesting. How is that enabled in WebRTC?

Regards,
Pawel


#3

This is the area we’re currently working on in Jitsi Meet, but I understand you have your own client.

Yes, my own client and my own tiny server “wrapper” around a Jitsi Videobridge instance. Many of the problems I’m encountering in my own client might already be fixed in Meet or the Videobridge wrapper, but the reconnect scenario also did not work well in Meet when I tested it several months ago.

How is that enabled in WebRTC ?

There is no way known to me to enable it in a browser other than compiling your own version of Chromium or Firefox.

But there is a special enum to specify the continual ICE candidate gathering policy in RTCConfiguration in WebRTC for Android, iOS and native.
There was an attempt to introduce something called autonomous re-gathering to WebRTC, but for some reason no work has been merged yet.

So from my observations Chrome will not correctly reconnect the DTLS layer if the DTLS fingerprint remains unchanged after ice restart.

Hm, that sounds like a bug in WebRTC, because initiating “ice-restart” does change ufrag/pwd, which is required by the spec. The spec does not say anything about the fingerprint in this case, but in the current WebRTC implementation it remains the same.

Thanks,
Yura.


#4

I see. Maybe that’s fine, because it’s really useful for mobile and maybe not as much for web. Thanks!

I agree. Anyway that’s how things work there currently.


#5

I continued with ICE-restart experiments while PR #115 is not yet accepted.
In Attempt 2, which I did previously, I tried to update the ufrag/pwd of the existing channel bundle. This does not work without fixing the code in IceTransportManager by moving setRemoteUfragAndPwd up to where setRemoteFingerprints is called.
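
The ordering problem can be illustrated with a toy model. This is not JVB code: it only mirrors the behaviour described in this thread (the fingerprint is updated, then the update returns early for a terminated agent before ufrag/pwd are touched), and all names are made up.

```python
from dataclasses import dataclass


@dataclass
class ToyTransport:
    """Hypothetical stand-in for a channel bundle's transport state."""
    ufrag: str
    pwd: str
    fingerprint: str
    agent_terminated: bool = False


def update_as_described(t: ToyTransport, ufrag: str, pwd: str, fingerprint: str) -> None:
    """Order described in the thread: credentials are skipped once the agent is terminated."""
    t.fingerprint = fingerprint
    if t.agent_terminated:
        return
    t.ufrag, t.pwd = ufrag, pwd


def update_with_reordering(t: ToyTransport, ufrag: str, pwd: str, fingerprint: str) -> None:
    """Local fix as described: credentials are set next to the fingerprint, before the early return."""
    t.fingerprint = fingerprint
    t.ufrag, t.pwd = ufrag, pwd
    if t.agent_terminated:
        return
    # Connectivity establishment would start here (omitted in this sketch).


before = ToyTransport("old-ufrag", "old-pwd", "fp", agent_terminated=True)
after = ToyTransport("old-ufrag", "old-pwd", "fp", agent_terminated=True)
update_as_described(before, "new-ufrag", "new-pwd", "fp")
update_with_reordering(after, "new-ufrag", "new-pwd", "fp")
```

In the first case the transport keeps the old credentials after an ICE restart; in the second it picks up the new ones.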

What I did not try initially is to create a new channel bundle and then update the channel bundle of the existing audio and data channels.
Unfortunately this also does not work when calling Videobridge.handleColibriConferenceIQ.
Here is a unit test I’ve used to test this behavior.
I don’t know if it does not work by design, or is just not properly handled by handleColibriConferenceIQ. Maybe someone can clarify this?

Another variation of Attempt 1 just came to my mind:
BTW, how invalid is it to create new channels and a new channel bundle when the client requests an ICE restart, but not expire the existing channels? This will result in two audio channels associated with the same SSRC; is this supported by JVB?


#6

It is by design that a channel cannot be moved to a different channel bundle after creation.
So the only option is to either kill & re-create the channels (which causes a problem on the client side due to the DTLS issue mentioned earlier) or to properly implement updating the transport information of the channel bundle (setRemoteUfragAndPwd). I’ll investigate this option further.