[ One of the CPU cores gets stuck on a JVB server ]

Hi there!
I have an environment with 6 JVB servers connected to the same jvbbrewery room.

After upgrading the JVBs to the latest stable version 2.1-351-g0bfaac1c-1 (Jicofo was upgraded as well), I ran into a problem: from time to time, one CPU core on one of the JVBs gets stuck. I saw it with the htop Linux utility.

The participants on the affected JVB experience connectivity problems (such as packet loss and video and audio freezes) during this time.

It can only be fixed by restarting the affected JVB.
Because this is a production installation, I was forced to downgrade to the previous version.

Does anyone know about this issue or have any recommendations?

@bbaldino any ideas?

It could be related to a bug we fixed recently which should make its way to stable today. But to know for sure you’d have to do a heap dump (logs may also give a clue).

Thanks for your quick reply.
I have a huge amount of logs in Logstash from all JVBs. It would be useful if you could provide some keywords to search for, so I can check whether something is really there.

Also, I have a dump from the strace Linux utility. It might be useful (link to download: https://drive.google.com/file/d/1OgOsmH2lGKa0T8jOKAU0iDQ3qBEbn0Gi/view?usp=sharing)

There’s not necessarily any one thing to look for. If you can repro the issue and get logs for that time there might be a clue.

I collected the logs for the period when the problems were present. I hope they help.

The logs cover 13.10 - 13.40 UTC.
The problem was reported at 13.23.
The JVB was restarted at 13.31.

Link to download: https://drive.google.com/file/d/1U7SooVMnglTwMoAZdAH_WgJIg5JNb3TP/view?usp=sharing

Something is definitely up in those logs, there are tons of:

TransportCcEngine.tccReceived#163: TCC packet contained received sequence numbers: 10380. Couldn't find packet detail for the seq nums: 10380. Latest seqNum was 3086, size is 1000. Latest RTT is 32.276334 ms

type logs. That RTT doesn’t look bad, so something else must be weird there. Which client are you using? What about your jicofo version, is that up-to-date as well?
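For anyone who wants to search their own logs for this pattern, here is a minimal Python sketch that pulls the numbers out of a line like the one quoted above. The regular expression is modelled on that single sample line only. The assumption that several missing sequence numbers would be comma-separated, and the names `TCC_MISS` and `parse_tcc_miss`, are illustrative and not part of JVB.

```python
import re

# Pattern modelled on the single sample line quoted in this thread.
# Assumption: multiple missing sequence numbers would be comma-separated.
TCC_MISS = re.compile(
    r"Couldn['’]t find packet detail for the seq nums: (?P<missing>[\d, ]+)\. "
    r"Latest seqNum was (?P<latest>\d+), size is (?P<size>\d+)\. "
    r"Latest RTT is (?P<rtt>[\d.]+) ms"
)


def parse_tcc_miss(line):
    """Return the fields of a 'Couldn't find packet detail' line, or None."""
    m = TCC_MISS.search(line)
    if m is None:
        return None
    missing = [int(s) for s in m.group("missing").split(",") if s.strip()]
    return {
        "missing": missing,
        "latest": int(m.group("latest")),
        "size": int(m.group("size")),
        "rtt_ms": float(m.group("rtt")),
    }


def count_tcc_misses(lines):
    """Count how many lines in an iterable match the pattern."""
    return sum(1 for line in lines if parse_tcc_miss(line) is not None)
```

Running `count_tcc_misses` over the log lines of each JVB separately would show whether the affected bridge logs this far more often than the healthy ones.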

We use your web client, which was also updated to the stable version. Jicofo was updated to the same version as well.
We got the web part from GitHub, and the jvb/jicofo part from the Debian repository.
Actually, we always update all components (web, jvb, jicofo) to the current stable version at the same time, so I guess we can rule out a version mismatch.

Please let me know if I can help further in finding the cause. It’s really important to understand whether I can update to the current latest version and avoid such problems in the future.

I think we have seen something like this with an ICE restart, but that was fixed in Jicofo, I think. Basically the client went through an ICE restart and ended up resetting its TCC sequence numbers, so they were “out of sync” with the bridge. We made a change in Jicofo to clear the endpoint state when this happened to avoid that. This fix was a while ago, though, so if you’re on latest stable it should be there. Do you know if there was any kind of ICE restart that happened here?
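To illustrate what "out of sync" means with the numbers from the log line earlier in the thread, here is a rough Python sketch. It assumes that transport-wide sequence numbers are 16-bit and wrap around, and that the bridge only remembers details for the most recent `size` packets it sent. That second point is an interpretation of "size is 1000" in the log message, not something confirmed from the JVB code, and `seq_delta` and `in_history` are hypothetical helper names.

```python
def seq_delta(a, b):
    """Signed distance from b to a in 16-bit sequence-number space."""
    d = (a - b) & 0xFFFF
    return d - 0x10000 if d >= 0x8000 else d


def in_history(seq, latest, size):
    """True if seq is one of the `size` most recent numbers up to `latest`.

    Assumption: the sender keeps packet details only for that window.
    """
    behind = seq_delta(latest, seq)  # how far seq lies behind latest
    return 0 <= behind < size


# Values taken from the log line quoted earlier in the thread.
print(seq_delta(10380, 3086))         # feedback is far ahead of what was sent
print(in_history(10380, 3086, 1000))  # so it falls outside the remembered window
```

Under these assumptions, feedback for 10380 refers to a packet about 7000 sequence numbers ahead of anything the bridge has sent, which would fit the two sides counting from different starting points rather than ordinary loss or delay.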

Unfortunately, I don’t have that information. I can’t imagine how it could have happened.