Jicofo ends up in an unusable state

Hi there,

We noticed a jicofo which did not assign the user a JVB when joining the conference. While investigating we discovered the following:

  1. We noticed a high rate of errors referring to bridges which were already (gracefully) shutdown:
Jicofo 2021-03-03 20:24:43.341 SEVERE: [127] org.jitsi.jicofo.bridge.JvbDoctor.log() Unexpected error returned by the bridge: jvbbrewery@domain/jvb, err: <iq to='focus@auth.studio.stream.j │
│ w.org/focus73143829748' from='jvbbrewery@domain/jvb' id='q42Z2-105300' type='error'><error type='cancel'><item-not-found xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xmlns='urn:ietf: │
│ params:xml:ns:xmpp-stanzas'>Recipient not in room</text></error></iq>
Jicofo 2021-03-03 20:24:47.107 SEVERE: [287] org.jitsi.jicofo.bridge.JvbDoctor.log() Unexpected error returned by the bridge: jvbbrewery@domain/jvb, err: <iq to='focus@auth.studio.stream.j │
│ w.org/focus73143829748' from='jvbbrewery@domain/jvb' id='q42Z2-105303' type='error'><error type='cancel'><item-not-found xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xmlns='urn:ietf: │
│ params:xml:ns:xmpp-stanzas'>Recipient not in room</text></error></iq>
  1. The affected Jicofo had very high rate of SEVERE errors (about 4000 in about 5h):
Jicofo 2021-03-03 20:29:43.778 SEVERE: [77] org.jitsi.jicofo.bridge.JvbDoctor.log() Unexpected error returned by the bridge: jvbbrewery@domain/jvb, err: <iq to='focus@domain/focus73143829748' from='jvbbrewery@domain/jvb' id='q42Z2-106462' type='error'><error type='cancel'><item-not-found xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'>Recipient not in room</text></error></iq>

Investigation results

  1. JVBs for which the Jicofo errors are saying: Unexpected error returned by the bridge have an error condition at the time of the log
  2. Jicofo’s health check localhost:8888/about/health returns 200
  3. Scaling for the JVB pods is enabled and it scales up and down
  4. We were not able to find errors in Prosody logs
  5. No errors are returned back to the UI when entering the conference

Questions

  1. What could cause this behavior?
  2. How could we detect this behavior in order to mark Jicofo as unhealthy?

Any other suggestion or ideas as to what might be causing this and how to fix it are more than welcome!

1 Like

Which jicofo is this?

We’re deploying it from this package: jicofo=1.0-691-hf-1

And for the container we use this image: jitsi/base-java:stable-4627-1

Not sure what you think about the following idea, but reading Jicofo source I was thinking on extending the way how JvbDoctor is handling XMPP errors. Error we’re experiencing is logged here:

What if we’d add logic, which would call the listener.healthCheckFailed(bridgeJid) in the following circumstances:

  1. Error is not internal_server_error and also not service_unavailable (since that’s already handled)
  2. Unexpected error for a particular bridge has been returned n-times in the past x-minutes (e.g. maybe 10 times in the last minute?)

If you think that’s a good idea we might be able to contribute the code.

The response you get is from prosody, saying this bridge is no longer in the room. You cannot reach the jvb at that jid so there is no point of communication with it … I wonder why jicofo still see is in the list of bridges … Do you have a log that the bridge left …?

Prosody is 0.11 right?

Thanks for the input regarding the code modification suggestion.

Yes, prosody is 0.11

Yes! That’s the question we’d love to have answer to.

Below are the logs from the two bridges which went down, but are still in Jicofo’s list.

mgjkj-log.txt (95.4 KB)
n825n-log.txt (20.9 KB)

Ah, this is old, we have fixed similar problems in the bridge should be in latest stable.

1 Like

@damencho

Thank you very much for your reply. This is great!

Do you think this problem:

is also related with us using an old base-java image?

@damencho I can reproduce this with latest docker version stable-5390-3.

The bridge was removed (for scaling reason) and Jicofo removed it but still trying to do some health check on it (until Jicofo is restarted of course).

Jicofo 2021-03-27 11:10:25.737 INFO: [21] org.jitsi.jicofo.xmpp.BaseBrewery.log() Removed brewery instance: jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw
Jicofo 2021-03-27 11:10:25.737 INFO: [21] org.jitsi.jicofo.xmpp.BaseBrewery.log() A bridge left the MUC: jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw
Jicofo 2021-03-27 11:10:25.737 INFO: [21] org.jitsi.jicofo.bridge.BridgeSelector.log() Removing JVB: jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw
Jicofo 2021-03-27 11:10:31.067 SEVERE: [190] org.jitsi.jicofo.bridge.JvbDoctor.log() Unexpected error returned by the bridge: jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw, err: <iq to='focus@auth.jitsi.local/focus8578944804785143' from='jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw' id='5SGr6-520969' type='error'><error type='cancel'><item-not-found xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'>Recipient not in room</text></error></iq>
Jicofo 2021-03-27 11:10:41.068 SEVERE: [273] org.jitsi.jicofo.bridge.JvbDoctor.log() Unexpected error returned by the bridge: jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw, err: <iq to='focus@auth.jitsi.local/focus8578944804785143' from='jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw' id='5SGr6-520996' type='error'><error type='cancel'><item-not-found xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'>Recipient not in room</text></error></iq>
...
Jicofo 2021-03-27 19:24:11.066 SEVERE: [71] org.jitsi.jicofo.bridge.JvbDoctor.log() Unexpected error returned by the bridge: jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw, err: <iq to='focus@auth.jitsi.local/focus8578944804785143' from='jvbbrewery@internal-muc.jitsi.local/jvb-6c8dfbb4c-7mnsw' id='5SGr6-600939' type='error'><error type='cancel'><item-not-found xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'>Recipient not in room</text></error></iq>

I wonder whether same will happen with latest from unstable … we had pushed several fixes around bridges going down …

We’ll update to unstable Jicofo and see what happens.

@damencho seems fixed in latest stable!

1 Like

Is the image you tested latest-5765-1?

yes stable-5765-1

Awesome, thanks so much for confirming!