I have a working setup of Jitsi along with Jibri in Kubernetes. I have configured an HPA to scale Jibri up and down based on its idle/busy status. My problem is that, since Kubernetes can send SIGTERM to any pod on scale-down, a pod with a recording still in progress can be terminated, which hurts stability. Has anyone faced a similar issue and found a solution?
As described in the other thread about the same issue, that is not how scale-down of a StatefulSet or a ReplicaSet (Deployment) works. If you have observed k8s selecting lower-CPU-usage pods on scale-down, it’s only by sheer luck.
There are several solutions discussed in the other thread, including pod-deletion-cost (the k8s native solution to this, but only available in v1.22+) and a custom pod controller.
Kubernetes doesn’t support scaling down specific pods. Kubernetes scale-down is always random, and it can kill active pods where a recording is running. There are custom scalers that we can use to scale down pods by checking the busy/idle status.
It’s not random. As I described in the thread linked in the message just before yours, there is a well-defined algorithm used when selecting the pod to delete on scale down. However, this algorithm doesn’t consider CPU usage (and of course doesn’t consider the busy state of Jibri). As of k8s 1.22 though, it does include pod-deletion-cost, which you can set yourself. So if you’re running k8s 1.22, a simple solution that doesn’t require much custom code is to set a higher deletion cost on the pod when Jibri goes busy, and either use single use mode or set it back to the default cost when it goes idle.
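To make the pod-deletion-cost idea concrete, here is a minimal Python sketch that builds the patch body for the `controller.kubernetes.io/pod-deletion-cost` annotation. The function name, the constant name, and the busy cost of 10000 are my own choices for illustration, not anything from Jibri or the other thread. It only builds the patch; how you detect busy/idle and send the patch is up to your setup.

```python
# Annotation the ReplicaSet controller reads when ranking pods for deletion.
# Pods with a lower cost are preferred for deletion; the implicit default is 0.
DELETION_COST_ANNOTATION = "controller.kubernetes.io/pod-deletion-cost"


def deletion_cost_patch(busy: bool, busy_cost: int = 10000) -> dict:
    """Build a merge-patch body for a Jibri pod (hypothetical helper).

    busy=True  -> set a high deletion cost so idle pods are removed first.
    busy=False -> set the annotation to None, which removes it in a merge
                  patch and returns the pod to the default cost of 0.
    """
    # The annotation value must be an int32 rendered as a string.
    if not (-2**31 <= busy_cost <= 2**31 - 1):
        raise ValueError("deletion cost must fit in a signed 32-bit integer")
    value = str(busy_cost) if busy else None
    return {"metadata": {"annotations": {DELETION_COST_ANNOTATION: value}}}


if __name__ == "__main__":
    print(deletion_cost_patch(busy=True))
    print(deletion_cost_patch(busy=False))
```

With the official Python client, a body like this could be passed to `CoreV1Api.patch_namespaced_pod(name, namespace, body)`, assuming the caller's service account is allowed to patch pods. Keep in mind the deletion cost only influences which pod is picked on scale-down; it is not a hard guarantee that a busy pod is never terminated.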
Before k8s 1.22, the jibri-pod-controller linked in that same thread is a good solution.
The concept is simple: When a Jibri starts to record or livestream, change its k8s labels so that they don’t match the Deployment’s selector.
This detaches it from the controller, so:
a) the controller will immediately launch another Pod to replace it in the ReplicaSet, and
b) the busy Jibri pod will run to completion regardless of any scaling or rollout activity.
Then you just set your Deployment replica count to how many spare Jibris you want to run. This approach should be used with Jibri’s “single use mode”.
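Here is a rough Python sketch of the label change described above. It assumes the Deployment's selector uses plain `matchLabels` and includes a label that exists only for this purpose; the `jibri-state` key and its `idle`/`busy` values are hypothetical names I made up for the example, as are the helper functions. The matching and merging helpers just simulate locally what the API server and controller would do, so you can see why the pod drops out of the ReplicaSet.

```python
def selector_matches(match_labels: dict, pod_labels: dict) -> bool:
    """True if the pod carries every key/value pair in the selector's matchLabels."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


def detach_patch(state_key: str = "jibri-state", busy_value: str = "busy") -> dict:
    """Merge-patch body that flips the (hypothetical) state label to 'busy'."""
    return {"metadata": {"labels": {state_key: busy_value}}}


def apply_label_patch(pod_labels: dict, patch: dict) -> dict:
    """Simulate merge-patch semantics on labels: None deletes, anything else overwrites."""
    merged = dict(pod_labels)
    for key, value in patch["metadata"]["labels"].items():
        if value is None:
            merged.pop(key, None)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # Assumed selector of the Jibri Deployment (illustrative values).
    selector = {"app": "jibri", "jibri-state": "idle"}
    pod_labels = {"app": "jibri", "jibri-state": "idle"}

    print(selector_matches(selector, pod_labels))  # pod is owned by the ReplicaSet
    pod_labels = apply_label_patch(pod_labels, detach_patch())
    print(selector_matches(selector, pod_labels))  # pod no longer matches the selector
```

In a real cluster the body from `detach_patch()` would be sent to the pod (for example via `CoreV1Api.patch_namespaced_pod` in the official Python client) at the moment the recording or stream starts. Since the detached pod is no longer managed by anything, single use mode is what makes it exit once the recording is finished.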
An example implementation is linked in the other thread (I linked to it a few posts back), but a robust implementation would be integrated into whatever system is starting your recordings/livestreams, because ideally you would patch the k8s labels at the same time as Jibri starts to record/stream. Relying on the webhook (as the example implementation does) means that scaling activity can still impact a busy Jibri in the time between when it starts to record/stream and when the webhook happens.
Looks good. I have a custom Jibri Kubernetes scaler configured which automatically scales up and down based on busy/idle status, so this will be straightforward. I shall make it public and share the link soon.