Auto-Scaling Jibri in AWS

Hello,

I want to autoscale Jibri inside AWS. I have already set everything up and solved the problem with the unique nicknames, which are now generated automatically on startup. I can deploy as many Jibri nodes as I need, and it works perfectly.

What I need now is to autoscale Jibri. I created an Auto Scaling group, but my question is which rules or policies I should scale on. When I monitor the average CPU usage of all instances, it doesn't stop deploying, because the CPU usage is of course above the value I set whenever Jibri is running.
What I need is that when one instance is busy, another one gets deployed. If it goes idle, it should be destroyed. Can this be achieved with CloudWatch?

Thanks.

You need a script that reads the busy/free state from Jibri's HTTP API and pushes it to CloudWatch.

Do you have any documentation on this for me? Do I have to integrate it into a script?

My idea was to raise an alert that triggers the deployment of new instances. Is there a script that gets called when a recording starts? Then I could put it in there.

Your script can poll the Jibri health endpoint, which gives you the health state as well as the IDLE/BUSY/EXPIRED status. See http_api.md in the jitsi/jibri repository on GitHub.
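A minimal sketch of such a poll, assuming Jibri's internal HTTP API listens on localhost:2222 and that the health response contains a `busyStatus` field as described in http_api.md. The variable and function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumption: Jibri's HTTP API is reachable here; adjust host/port to your config.
JIBRI_HEALTH_URL="${JIBRI_HEALTH_URL:-http://localhost:2222/jibri/api/v1.0/health}"

# Extract the busyStatus value (e.g. IDLE, BUSY, EXPIRED) from health JSON on stdin.
# Uses sed so no JSON tool is required; prints nothing if the field is missing.
parse_busy_status() {
  sed -n 's/.*"busyStatus"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p'
}

# Query the local Jibri and print its busy status.
# Returns non-zero if the endpoint cannot be reached.
jibri_busy_status() {
  local body
  body="$(curl -sf --max-time 5 "$JIBRI_HEALTH_URL")" || return 1
  printf '%s\n' "$body" | parse_busy_status
}
```

On an instance, `jibri_busy_status` would then print a single word such as IDLE or BUSY that the rest of the script can act on.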

I suspect it will be more reliable to push the IDLE states as a CloudWatch metric and alarm on that (e.g. when there are fewer than N idle instances) for autoscaling, rather than triggering alerts individually from within each Jibri instance.

I see your point, thanks. So you want to push the idle state to CloudWatch. I also thought about that, but where I ran into issues was setting the value. Do you add 1 to the metric when the instance is idle, or do you reset the value?

I also wondered about the script. I need a script that gets the idle status and pushes it. I already wrote one that checks whether Jibri is busy or idle. But I can't run the script every X minutes, right? Otherwise it would push that value every X minutes, and that would obviously not work. So is it possible to listen for state changes and push the value as soon as it updates? And the same for going busy, do you have an idea?
Thanks.

The way I understand it, CloudWatch alarms work better with consistent data than with sporadic data, since data points are aggregated over a given period.

I have not done this, so I cannot give you concrete or proven solutions. What comes to mind is for your script to publish the idle/busy state regularly (e.g. every 10 seconds) using your ASG name as the dimension. As long as you use the same period in your CloudWatch alarm, each period will then contain a single data point from each of your Jibri instances.

What you publish and how you aggregate that data depends on how you wish to trigger your scaling. For example:

  1. If you publish 1 for IDLE and 0 for BUSY, then aggregating with Sum gives you the total number of idle Jibris. This allows you to define scaling policies that try to maintain a minimum number of available instances.
  2. If you aggregate using Average, then what you get is the fraction of idle instances, which allows you to define scaling policies where the desired number of free instances is a percentage of total instances.
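To make the two aggregations concrete, here is a small local illustration that only does the arithmetic. The data points are hypothetical: eight instances reporting in the same period, six of them idle.

```shell
#!/usr/bin/env bash
# Hypothetical data points from 8 instances in one period: 1 = IDLE, 0 = BUSY.
datapoints="1 1 0 1 1 1 0 1"

# Sum is the number of idle instances; Average is the fraction that is idle.
summarize() {
  printf '%s\n' "$1" | tr ' ' '\n' |
    awk '{ s += $1; n++ } END { printf "sum=%d avg=%.2f\n", s, (n ? s / n : 0) }'
}

summarize "$datapoints"   # prints: sum=6 avg=0.75
```

With Sum, an alarm threshold is an absolute count ("fewer than 2 idle"); with Average, it is a ratio ("less than 25% idle") that scales with the size of the group.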

Again, I haven’t tried this so I cannot guarantee anything, but hopefully this helps you move forward. I will watch this thread to see how you fare and perhaps learn from it too. Good luck.

Makes sense, I will try to apply it! I will keep you updated, thank you so much.

Just one question: when I let one instance report 1 all the time for busy and I display Sum, then I would have 1+1+1+1+1, or am I wrong? And if I publish 0, will 1 be subtracted from the Sum? Because when I have 8 idle instances and one of them gets busy, I only have 7 idle ones left.

I’m not sure I understand your question. Sorry.

If you publish 1 for Busy and 0 for Idle, and say you have 8 Jibri instances each pushing one data point every 10 seconds, then if my understanding is correct I’d expect the CloudWatch graph (with Statistic=Sum, Period=10) to show 0 (0+0+0+0+0+0+0+0) for every data point while all are idle, and 1 (1+0+0+0+0+0+0+0) once one of them becomes busy.

So the period has to be the rate of executing the script?

Could you send me an example command?

aws cloudwatch put-metric-data --metric-name PageViewCount --namespace IdleJibriNodes --value 1 --timestamp 2016-10-20T12:00:00.000Z

How do I set the timestamp to the actual time the script gets executed?

Yes, the rate of publication should match the period you use to create your CloudWatch graph/alarm or the aggregation for SUM would be wrong (but Average might still be ok?).

You’ll also need to make sure every instance is sending a value (0 or 1) at the same rate, with the value being determined by idle state as returned by Jibri http api.
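A sketch of the publishing side under those rules, assuming the AWS CLI is installed on the instance and allowed to call `put-metric-data`. The namespace, metric name and ASG name below are placeholders, not values from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical names: adjust namespace, metric and ASG name to your setup.
NAMESPACE="${NAMESPACE:-Jibri}"
METRIC_NAME="${METRIC_NAME:-IdleJibriNodes}"
ASG_NAME="${ASG_NAME:-my-jibri-asg}"

# Map a Jibri busy status to the value to publish: 1 for IDLE, 0 for anything else.
state_to_value() {
  if [ "$1" = "IDLE" ]; then echo 1; else echo 0; fi
}

# Publish one data point for the given busy status ($1).
# Without --timestamp, CloudWatch uses the time the data was received.
publish_idle_metric() {
  aws cloudwatch put-metric-data \
    --namespace "$NAMESPACE" \
    --metric-name "$METRIC_NAME" \
    --dimensions "AutoScalingGroupName=$ASG_NAME" \
    --value "$(state_to_value "$1")"
}
```

Calling `publish_idle_metric IDLE` (or with whatever status the health check returned) once per period from every instance gives one data point per instance per period, which is what makes Sum meaningful. For periods shorter than one minute, `put-metric-data` also accepts `--storage-resolution 1` to store the metric at high resolution.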

Timestamp would be the time you capture the state on your instances. Actual command depends on which language you’re using to write your script?

From your example that looks like an AWS CLI call. Assuming bash with GNU date, you could probably get the datetime in that format using something like date -u '+%Y-%m-%dT%H:%M:%S.%3NZ' (the -u matters, since the trailing Z means UTC).

I’m afraid I do not have canned solutions that I can share with you.

I just removed the timestamp parameter, so it takes the current time. I use crontab, and the smallest interval there is 60 seconds. I will try it with that, and I have set the Period to 1 minute as well.

The Period is set to one minute and 4 instances are currently deployed, but it shows me “8”.

Any idea?

I am checking the status and then pushing the data. I also lock the instance so it doesn't get deleted while it is recording.

I cannot believe it! It seems to work! It just shows double the actual number of Jibri nodes, but that doesn't matter!
Thank you so much! Right now there are 5 instances deployed and it shows me “10”. I configured it to scale up when the value is < 6, and that also worked!
Amazing!
But what about scaling down? I saw you can set the maximum lifetime to 1 day, but is there any way to destroy the instances?

You can create a scale-down policy that reduces the number of instances when there have been too many idle nodes for too long.
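A rough sketch of what that could look like with the AWS CLI, assuming an idle metric published with value 1 per idle instance once per minute and the ASG name as dimension. All names, thresholds and durations here are hypothetical and untested against a real group.

```shell
#!/usr/bin/env bash
# Hypothetical names and thresholds; adjust to your setup.
ASG_NAME="${ASG_NAME:-my-jibri-asg}"
NAMESPACE="${NAMESPACE:-Jibri}"
METRIC_NAME="${METRIC_NAME:-IdleJibriNodes}"
MAX_IDLE="${MAX_IDLE:-3}"

# Create a simple scale-in policy (remove one instance) and print its ARN.
create_scale_in_policy() {
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$ASG_NAME" \
    --policy-name jibri-scale-in \
    --policy-type SimpleScaling \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment -1 \
    --cooldown 300 \
    --query PolicyARN --output text
}

# Alarm when the per-minute Sum of idle data points stays above MAX_IDLE
# for 10 consecutive minutes. $1 = ARN of the policy to trigger.
create_too_many_idle_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name jibri-too-many-idle \
    --namespace "$NAMESPACE" \
    --metric-name "$METRIC_NAME" \
    --dimensions "Name=AutoScalingGroupName,Value=$ASG_NAME" \
    --statistic Sum \
    --period 60 \
    --evaluation-periods 10 \
    --threshold "$MAX_IDLE" \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$1"
}
```

The two pieces would be wired together with `create_too_many_idle_alarm "$(create_scale_in_policy)"`. Without scale-in protection, the ASG decides which instance to terminate, so a busy Jibri could be picked.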

You can also scale in from inside the Jibris. For example, set scale-in protection on all of them in the ASG.
Then, when a specific Jibri has been idle for too long (a cron script on the instance can check the API on localhost), make an AWS call from inside it to remove its own scale-in protection.
The ASG will then be able to terminate it and scale in to return to the desired number of instances.
(You can fine-tune this by checking whether the number of instances is higher than the desired one. If it’s not, don’t bother removing the protection, as you won’t be scaling in anyway.)
The same cron check can also set the protection whenever the Jibri is not idle, to prevent accidental termination of a busy instance.
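A simplified sketch of that cron check, assuming the instance role may call `autoscaling:SetInstanceProtection`. The ASG name and function names are placeholders, and tracking how long the Jibri has been idle is left out: this version drops the protection as soon as the status is IDLE.

```shell
#!/usr/bin/env bash
ASG_NAME="${ASG_NAME:-my-jibri-asg}"   # hypothetical ASG name

# Print the protection flag for a given Jibri busy status:
# only an IDLE instance gives up its scale-in protection.
protection_flag() {
  if [ "$1" = "IDLE" ]; then
    echo "--no-protected-from-scale-in"
  else
    echo "--protected-from-scale-in"
  fi
}

# Apply the matching flag to one instance.
# $1 = EC2 instance ID, $2 = Jibri busy status
update_scale_in_protection() {
  aws autoscaling set-instance-protection \
    --auto-scaling-group-name "$ASG_NAME" \
    --instance-ids "$1" \
    "$(protection_flag "$2")"
}
```

The instance ID can be read from the EC2 instance metadata service on the instance itself, and the busy status from the Jibri health endpoint on localhost.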

I already added that. They get protected as soon as a recording starts, and the scale-in protection is removed as soon as the instance goes idle. A few instances have been running for multiple hours and none of them gets deleted. Also, the desired capacity isn't static: AWS keeps increasing it, and because of that they don't get deleted.