SREs Inside Your Cluster

Santosh Borse
3 min read · Nov 15, 2021

In the traditional way of managing applications, Site Reliability Engineers (SREs) are responsible for operations such as deploying, configuring, and installing application components, commonly called Day 1 operations.

Figure: the traditional way of managing applications

SREs are also responsible for Day 2 operations such as maintenance, responding to alerts, troubleshooting, and handling upgrades and updates. They know how to react to specific situations: a rise in a certain type of error, a spike in traffic, and other conditions particular to your application components.

Along with SREs, some teams also have varying degrees of automation to perform repetitive tasks (called toil in SRE terminology), such as taking backups or sending notifications.

Let’s start with a real-world example. There is a speech-to-text application where one instance (called a pod in Kubernetes) can handle a maximum of 13 concurrent requests to convert speech audio into text. Say the current deployment has 10 replicas running, which can handle a maximum of 130 concurrent requests. When the concurrent request count goes beyond 130, some users start seeing error 503 (Service Unavailable), and when the error count crosses a certain threshold, the SREs get paged.

After assessing the situation, the SRE team increases the number of pods from 10 to 20, so the application can now handle up to 260 concurrent requests.
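A minimal sketch of what that manual fix looks like, assuming a Deployment named speech-to-text (the name, labels, and image below are hypothetical): the SRE bumps spec.replicas from 10 to 20 and re-applies the manifest.

```yaml
# Hypothetical Deployment excerpt for the speech-to-text service.
# The SRE manually changes spec.replicas from 10 to 20 and re-applies it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: speech-to-text
spec:
  replicas: 20            # was 10; each pod handles ~13 concurrent requests
  selector:
    matchLabels:
      app: speech-to-text
  template:
    metadata:
      labels:
        app: speech-to-text
    spec:
      containers:
        - name: speech-to-text
          image: example.com/speech-to-text:1.0   # placeholder image
```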

This setup has a lot of drawbacks:

  1. It is manual and reactive in nature
  2. Usage spikes are hard to predict
  3. When demand goes down, the SRE has to scale the application back down again; otherwise it wastes resources

With that problem in mind, let’s look at how Kubernetes works. A Kubernetes object is a “record of intent”: once you create the object, the Kubernetes system constantly works to ensure the object exists as specified. For example, if you create a Deployment with a replica count of 100, Kubernetes will continuously work to make sure there are 100 pods running.
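You can see this record of intent in the object itself: the spec holds the desired state you declared, the status holds the state the controllers have observed, and the reconciliation loop keeps driving the two together. A trimmed Deployment excerpt (field values are illustrative):

```yaml
# Desired state, declared by you:
spec:
  replicas: 100
# Observed state, reported by the Deployment controller:
status:
  replicas: 100
  availableReplicas: 100   # the loop keeps working until this matches spec.replicas
```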

Figure: the Kubernetes reconciliation loop

And the good news is that Kubernetes is extensible.

Just as a programming language gives you existing classes (for example, String and HashMap) that you are free to use, or lets you define your own to extend its functionality, Kubernetes lets you use built-in objects such as Deployment, Service, and ConfigMap, and/or define your own objects via a Custom Resource Definition (CRD), with their logic implemented in a custom controller.

The Operator SDK lets you develop a Kubernetes-native application, called an Operator, that defines custom resources and provides the custom controller logic in Go, Ansible, or Helm.

In short, Operators are software extensions to Kubernetes (more info: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/).

So in the new world, the SRE and development teams have built a Speech Auto Scale Operator, which monitors the concurrent session count and scales the application up and down accordingly. This operator defines a CustomResourceDefinition called SpeechAutoScaler.
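A minimal sketch of what that CRD could look like, assuming a hypothetical speech.example.com API group and illustrative spec fields minReplicas, maxReplicas, and maxSessionsPerPod (these names are assumptions, not the actual operator’s API):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: speechautoscalers.speech.example.com   # must be <plural>.<group>
spec:
  group: speech.example.com
  scope: Namespaced
  names:
    kind: SpeechAutoScaler
    singular: speechautoscaler
    plural: speechautoscalers
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                minReplicas:
                  type: integer
                maxReplicas:
                  type: integer
                maxSessionsPerPod:   # e.g. 13 for the speech-to-text service
                  type: integer
```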

Now the SRE team can express the desired replica range and session count per pod via a plain YAML manifest file, and the operator takes care of the deploy, monitor, scale-up, and scale-down tasks. This frees up your SREs to do more creative work, and the operator acts as an SRE inside the cluster.
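For illustration, the manifest the SRE applies might look like this (assuming the hypothetical CRD sketched above):

```yaml
apiVersion: speech.example.com/v1alpha1
kind: SpeechAutoScaler
metadata:
  name: speech-to-text-autoscaler
spec:
  minReplicas: 10          # never drop below the baseline capacity
  maxReplicas: 40          # cap the scale-out to control cost
  maxSessionsPerPod: 13    # each pod handles at most 13 concurrent requests
```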
