SRE Calculators

In this post I cover the three calculators I relied on during my time in SRE and continue to use as a Cloud Engineer.

What is SRE?

Site Reliability Engineering - “SRE is what you get when you treat operations as if it’s a software problem.” - SRE Google

Unlike the traditional DevOps philosophy, in which development and operations sit in the same team that handles all application operations, Site Reliability focuses on the practical management of operations while teams implement features or new developments.

SREs need to balance reliability expectations against the need to modify applications and services to meet feature outcomes and business changes.

Is SRE a People Job or a Technical Role?

When you aspire to treat all aspects of operations as if they were a software problem, the boundaries of the role end up blurred. It has taken me from troubleshooting applications, program start-up behaviours, initialisation, multi-cloud environments, and cluster networking to the joys of Istio.

The best SREs I have worked with have been a weird mix: insanely technical, but also helpful and supportive, with great people skills.

They have also been the most curious people in existence. For example, they have:

Deep-dived into and timed how K8s cleans up processes in CPU threads when hitting resource limits.
Investigated how Pods start failing restarts under very specific conditions.
Investigated why no networking was available on new hosts joining the cluster.

My curiosity has led me down the paths of how to measure outages, impacts to Kubernetes/platform health, how we measure health for platform applications, and how we can target high reliability with so many components.

SLO Calculator

Enter your SLO target (e.g., 99.9) to calculate the allowed downtime for weekly, monthly, and yearly periods.

SLO Target (%):

If my ~~acceptable~~ JavaScript (again, Platform Engineer here) ever breaks, please see https://uptime.is/.
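For anyone who wants to check the numbers by hand, the arithmetic behind an SLO calculator like this can be sketched in a few lines of Python. This is a minimal sketch, not the JavaScript that runs on this page: the function name `allowed_downtime_seconds` and the period lengths (7-day week, 30-day month, 365-day year) are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed period lengths in days; other calculators may define a month
# or a year slightly differently.
PERIOD_DAYS = {"weekly": 7, "monthly": 30, "yearly": 365}


def allowed_downtime_seconds(slo_percent):
    """Return the allowed downtime in seconds for each period."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("SLO target must be between 0 and 100")
    # The error budget is the fraction of time the service may be down.
    error_budget = (100 - slo_percent) / 100
    return {
        name: days * SECONDS_PER_DAY * error_budget
        for name, days in PERIOD_DAYS.items()
    }
```

Under those assumptions, a 99.9 target works out to roughly 10 minutes of downtime a week, 43 minutes a month, and just under 9 hours a year.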

Toil Calculator

The following calculator helps to evaluate toil in day-to-day tasks, compare the estimate to the cost to automate them, and determine the break-even point for teams; this helps when justifying more complex automation.

Task Name:

Time per manual task (minutes):

Frequency per week:

Time to automate (hours):

Task	Weekly Time (hours)	Monthly Time (hours)	Yearly Time (hours)	Break-even (weeks)	Action
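The columns above can be derived from the three inputs with some simple arithmetic. The Python sketch below shows one plausible way to do it. The names `ToilEstimate` and `estimate_toil` are hypothetical, and the conversions (52 weeks per year, monthly time as one twelfth of yearly time) are assumptions that may differ from the calculator on this page.

```python
import math
from dataclasses import dataclass

WEEKS_PER_YEAR = 52   # assumption
MONTHS_PER_YEAR = 12


@dataclass
class ToilEstimate:
    task: str
    weekly_hours: float
    monthly_hours: float
    yearly_hours: float
    break_even_weeks: float


def estimate_toil(task, minutes_per_run, runs_per_week, hours_to_automate):
    """Estimate manual time spent on a task and when automation pays off."""
    weekly = minutes_per_run * runs_per_week / 60
    yearly = weekly * WEEKS_PER_YEAR
    monthly = yearly / MONTHS_PER_YEAR
    # Break-even is the number of weeks of manual work that equals the
    # one-off automation cost; a task that takes no time never breaks even.
    break_even = hours_to_automate / weekly if weekly > 0 else math.inf
    return ToilEstimate(task, weekly, monthly, yearly, break_even)
```

As an illustration with made-up numbers, a 15-minute task performed 10 times a week costs 2.5 hours a week, so 20 hours of automation work pays for itself after 8 weeks.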

Recovery Calculator

How long would it take to recover from natural disasters, apocalyptic events, or a full migration from one cloud to another in the event of a sustained outage?

Event Name:

Likelihood (per year, e.g., 0.01 for once every 100 years):

MTTD (hours):

MTTR (hours):

Event	Likelihood (per year)	MTTD (hours)	MTTR (hours)	Expected Annual Downtime (hours)	Action
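One reasonable reading of the "Expected Annual Downtime" column is the yearly likelihood multiplied by the total time to detect and recover. The Python sketch below assumes exactly that formula. It is an assumption about how the column is computed, not a statement of how the calculator on this page is implemented, and the function name is hypothetical.

```python
def expected_annual_downtime_hours(likelihood_per_year, mttd_hours, mttr_hours):
    """Expected hours of downtime per year for a single event type.

    Assumes downtime per occurrence is MTTD + MTTR and that the
    likelihood is the expected number of occurrences per year.
    """
    if likelihood_per_year < 0 or mttd_hours < 0 or mttr_hours < 0:
        raise ValueError("inputs must be non-negative")
    return likelihood_per_year * (mttd_hours + mttr_hours)
```

With made-up inputs, a once-in-100-years event (0.01) that takes 2 hours to detect and 48 hours to recover from contributes about half an hour of expected downtime per year, which makes it easy to compare against far more frequent but shorter outages.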

Not a Calculator, But a Thought.

These calculators make very good team discussion topics, provide tools to reduce team burnout, help identify risks to recovery, and support arguments for the right kinds of automation.

In previous workplaces I would start by putting the craziest events into the recovery calculator to help frame the topic and break the ice, things like:

Google has decided to no longer do business with Australia, or no longer exists. How long do we need to move to a new cloud provider?
The planet is engulfed by tidal rain and flooding. After humans adapt to breathe underwater, how long would it take to create undersea technology that we could rebuild the application on?

These questions and discussions help frame recovery expectations against the likelihood of the events occurring. They also help the team feel more at ease with what they will actually be required to support in an outage and how quickly things actually need to be solved.

The end goal for these calculators, and the way I use them, is to defend engineers’ time and to prioritize both toil-reduction work and work that affects software/application reliability. I hope to write about SLI/SLO types and the ways I have helped create them in the future.