Autoscaling TURN servers: are simple scaling policies really enough?

A question for the Jitsi team. I’m studying and learning a lot from the recently committed jitsi-infra repo.
This caught my eye:

If I understand correctly, it seems that the scaling policy used by meet.jit.si for the Coturn pool is fairly simple, based on CPU usage.

But I thought TURN was stateful, so how does downscaling work? As with JVBs, we can’t just shut them down while users are on them, can we? Aren’t those users going to lose their connection for a while, until they can switch to another TURN server?

I was expecting that the jitsi-autoscaler could also collect information from the coTURN servers (with the jitsi-autoscaler-sidecar, for example) and be able to gracefully shut them down (some scheme like removing them from the round-robin DNS and waiting for the users to drain, for example). But I can’t find this kind of code in the infra repo. Is simple autoscaling really not a problem in practice when scaling down?

Ping @Aaron_K_van_Meerten

I fear this is misleading: we actually end up using fixed counts of TURN server instances rather than autoscaling them. We should likely just remove the autoscaling configuration here; it was added while we were still learning about OCI instance pools in general and thought we needed an autoscaling configuration for all pools. If you’ll note, this policy will scale up to the maximum anytime the CPU is above 1%, and only scale down if the CPU is below 0% (which should be impossible).

We have considered adding coTURN to the autoscaler, since we do need better graceful-shutdown capability for the TURN servers. However, we were hampered by the fact that out of the box coturn doesn’t easily expose the total session count in a simple REST call, so we’d need to either re-compile it with Prometheus support or add some other method for tracking sessions. There’s also no easy way in our infrastructure to put a TURN server in “drain” mode. We’re using round-robin DNS for initial discovery by users, so we’d need to remove the server from that list and then wait at least the TTL of the DNS record to properly drain it.
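
Conceptually, the DNS side of a drain would look something like the sketch below. This is purely illustrative: the record name is a placeholder, and the actual removal call depends entirely on the DNS provider’s API.

#!/bin/bash
# Sketch of the "remove from DNS, then wait out the TTL" step.

RECORD="turn1.example.com"

# Read the record's remaining TTL: clients that already resolved it may
# keep connecting to this server until their cached answer expires.
TTL=$(dig +noall +answer "$RECORD" A | awk '{print $2; exit}')

# ... remove this server's A record from the round-robin set here ...

# Only after the TTL has passed can we assume no new sessions will arrive
# (fall back to five minutes if the lookup failed).
sleep "${TTL:-300}"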

So this is on the list of nice-to-have features that we’ve not yet achieved. For the moment we simply run a fixed number of TURN servers in each region, and adjust that fixed count as we see traffic go up or down in those regions.

4 Likes

Dream crushed. :laughing:
But it makes much more sense now, thanks for the insight!

I’m a bit surprised that you consider the autoscaling of TURN a nice-to-have feature, but I guess it’s a bit different at your scale, with a worldwide deployment. In our case we operate in only one time zone, and at night our traffic drops almost to zero. So it’s really tempting to find a way to scale down the TURN servers.

The lack of a REST API for coturn is really a pain. Personally, we use the telnet interface to get the current number of sessions, as a poor man’s REST API.

About draining the TURN servers: we managed to do it on AWS, but it was quite hard. We use a lifecycle hook to intercept the scale-down, then remove the IP from the round-robin DNS (a tricky step that needs to be properly serialized, or it can mess up the round-robin DNS if done concurrently), then wait for the graceful_shutdown.sh script below to complete before sending the complete-lifecycle-action.

graceful_shutdown.sh
#!/bin/bash

echo "Graceful shutdown started"

# Returns local session count by calling the telnet interface and extracting the session count.
function getSessionCount {
    /usr/share/coturn/get_sessions.sh > /tmp/turnsessions
    grep 'Total sessions' /tmp/turnsessions | grep -o -E '[0-9]+'
}

sessionCount=$(getSessionCount)
while [[ $sessionCount -gt 0 ]] ; do
    echo "There are still $sessionCount sessions"
    sleep 10
    sessionCount=$(getSessionCount)
done

echo "no more session"
get_sessions.sh.j2
#!/usr/bin/expect -f

# Log in to coturn's telnet CLI (5766 is coturn's default CLI port) and
# print the current TLS sessions; the output includes the "Total sessions"
# line that graceful_shutdown.sh greps for.
log_user 0
spawn telnet 127.0.0.1 5766
expect "Enter password:"
# The secret is templated in by Ansible (hence the .j2 extension) and must
# match the cli-password set in turnserver.conf.
send "{{ jitsi_meet_turn_secret }}\n"
expect "> "
log_user 1
send "pu tls\n"
expect "> "

puts $expect_out(0,string)
exit 0
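
For completeness, the glue around those two scripts is roughly the sketch below. The hook name, ASG name, and change-batch file are placeholders, and the real version also needs the serialization around the DNS update mentioned above.

#!/bin/bash
# Illustrative sketch of the handler that runs when the scale-down
# lifecycle hook fires. All names are placeholders.

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# 1. Remove this instance's IP from the round-robin DNS record. This must
#    not run concurrently on two instances, or the rewrite of the record
#    set can race and corrupt the round-robin entries.
aws route53 change-resource-record-sets \
    --hosted-zone-id "$HOSTED_ZONE_ID" \
    --change-batch file:///etc/coturn/remove-self.json

# 2. Wait for existing sessions to drain (the script above).
/usr/local/bin/graceful_shutdown.sh

# 3. Tell the Auto Scaling group it may now terminate the instance.
aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name coturn-drain \
    --auto-scaling-group-name coturn-asg \
    --lifecycle-action-result CONTINUE \
    --instance-id "$INSTANCE_ID"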

But we are in the process of migrating to OCI and couldn’t find a lifecycle hook equivalent. I guess we are also going to start with a fixed count of TURN servers.

Also, congrats on the infra-provisioning/configuration/customization repos. I spent the past few days studying them, and the huge amount of effort put into automating Jitsi deployments is really quite impressive.

I realized today that I read your reply and never said thanks! I appreciate your scripts for coturn, as well as the compliments on the infra repos. I’m quite happy to have more eyes on them, so let me know if you see anything you’d like to see in there (or better yet, make a PR!)

1 Like

For anyone looking at autoscaling CoTURN who is using Kubernetes, a few useful points:

  • You can use custom metrics in your HorizontalPodAutoscaler to pull stats from Prometheus. This lets you scale based on actual load (e.g. packet rate) rather than merely CPU, which is especially useful on smaller instance types where the limiting factor for a TURN server is network packet rate rather than CPU usage.
  • You can use external-dns with a headless service to automatically put new CoTURN pods into DNS and remove them when they start to shut down.
  • You can use a k8s preStop hook to poll the metrics and wait until there are no active sessions: external-dns removes the DNS entry at the start of pod termination, stopping new sessions from starting, and the preStop hook then keeps the pod running until no sessions remain, at which point the pod terminates (see the sketch after this list).
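
As a rough sketch of how these pieces fit together, see below. This is not a tested config: the hostname, image, metrics port, the turn_total_allocations metric, and the coturn_packet_rate HPA metric are all assumptions that depend on your coturn build, Prometheus adapter, and external-dns setup.

kubectl apply -f - <<'EOF'
# Headless service: external-dns publishes each ready pod's address under
# this hostname and withdraws it when the pod starts terminating.
apiVersion: v1
kind: Service
metadata:
  name: coturn
  annotations:
    external-dns.alpha.kubernetes.io/hostname: turn.example.com
spec:
  clusterIP: None
  selector:
    app: coturn
  ports:
  - name: turns
    port: 5349
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coturn
spec:
  replicas: 2
  selector:
    matchLabels:
      app: coturn
  template:
    metadata:
      labels:
        app: coturn
    spec:
      # Real TURN pods typically also need hostNetwork (or otherwise
      # routable pod IPs) so the published addresses are reachable.
      terminationGracePeriodSeconds: 3600   # allow long session drains
      containers:
      - name: coturn
        image: coturn/coturn
        ports:
        - containerPort: 5349
        lifecycle:
          preStop:
            exec:
              # By the time this runs, external-dns has begun removing the
              # pod from DNS; hold termination until sessions reach zero.
              command:
              - /bin/sh
              - -c
              - |
                while true; do
                  n=$(wget -qO- localhost:9641/metrics \
                      | awk '/^turn_total_allocations/ {print int($2); exit}')
                  [ "${n:-0}" -le 0 ] && exit 0
                  sleep 10
                done
---
# HPA driven by a custom per-pod metric served by a Prometheus adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coturn
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coturn
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: coturn_packet_rate     # assumed adapter-exposed metric
      target:
        type: AverageValue
        averageValue: "50000"
EOF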

These techniques in combination give a very easily managed and robust setup, which thanks to k8s is also cloud-provider-independent (we run our setup without modifications on five different clouds).

4 Likes