Enterprise IT Needs to Learn From Google’s Site Reliability Engineering Philosophy

A Cloud Guru News

The Site Reliability Engineer is one of the most prestigious and accomplished engineering positions at Google

It was 2003 when Ben Treynor realized that a process-oriented, multi-tiered IT ops organization wasn’t going to cut it for Google — so he developed an approach known as Site Reliability Engineering (SRE).

“Fundamentally, it’s what happens when you ask a software engineer to design an operation function.”

Ben Treynor, VP of Engineering at Google

Treynor’s philosophy is underpinned by the fact that software spends most of its life running — not in design or implementation — and that reliability is a systematic attribute that requires continuous engineering attention.

The sheer vastness of technical territory (applications, data, infrastructure) that SRE traverses to guarantee uptime, performance, and low MTTR means that Site Reliability Engineer is one of the most prestigious and accomplished engineering positions at Google.

Today, Google’s Site Reliability Engineering team employs over 1200 engineers, who hold themselves proudly accountable for the stability, availability, and performance of Google’s planet-scale systems.

Today’s Assumptions Are Tomorrow’s Risks

The digital revolution can be overwhelming for CIOs/CTOs, who are expected to supply innovation, transform their environment, and support complex systems, all while delivering bottom-line improvements.

With expanding digital portfolios powered by technological paradigms such as Microservices, Machine Learning, Containers, Cloud, and Continuous Delivery, it is natural to assume that newer products and services will operate more efficiently. These assumptions about operational efficiency can quickly turn into risks, which over time can create serious organizational fault lines.

Non-internet companies incur 40–90% of a system's total costs after its birth.

Software is never launched with desired levels of stability or performance because it is almost impossible to predict and architect for the uncertainty associated with a live production situation. Additionally, software begins to accrue technical debt the day after launch, which if left untreated can severely cripple its intent and utility.

Movements such as Continuous Delivery automate the shipping of software. Once the product is running live, however, traditional IT is usually not prepared to offer incredible levels of service availability and performance.

Why IT Operations is Ill-Equipped for a Digital Era

Operating Models

Many IT operations teams have been built atop the fundamentals of IT Service Management. In other words, lots of well compartmentalized teams, defined entry and exit points, connected via streamlined processes and measurement techniques. Whilst managing digital portfolios requires structure, speed cannot be achieved if a complex issue has to follow a lengthy route — across multiple teams — from detection to resolution to prevention.

IT Operations Empowerment

The usual currency between Development and IT Operations is of the subjective and opinionated kind. This is why we don’t often hear IT operations speak about how well connected and prepared they are to support new changes.

The disconnect is partly due to a lack of upstream involvement, and Agile hasn’t done much to close the gap. The undertone is that operations must always accept whatever is sent its way. In digital models, the balance between speed and stability is vital, and it can only be struck if power is shared equally between both sides in the form of quantitative, accountability-based relationships.

ITIL Methodology and Human Capital

The prevailing dominance of ITIL methodologies has in turn led to the creation of dedicated job families in IT operations — IT Support Analyst, Change Manager, Incident Manager, and so on. These positions have served their purpose in preserving structure and governance.

Going digital means newer job families for IT operations, which are broader, purpose-driven, and multi-disciplinary in nature.

Today’s digital workloads require knowledge beyond just operating the system. Teams maintaining these systems are expected to reliably fix complex problems then and there. This is why Google expects SREs to code, in addition to automating and executing infrastructure tasks.

However, for non-internet companies whose IT portfolios also comprise large packaged solutions and other proprietary technology, it is safe to assume that some form of “maintenance” is inevitable. How much human intervention is required for maintenance tasks is an entirely different question.

IT Telemetry and Monitoring

The mantra goes: you can’t effectively respond to obscure and undetectable events. Telemetry is the automatic collection and transmission of data to centralized locations for analysis. Most IT shops employ threshold-based monitoring solutions, and tons of them. Hence, they live every day alongside dozens of monitoring graphs and metrics, or in other words, a fragmented view of their ecosystem’s health.

Digital companies must combine and leverage their digital exhaust — machine data from applications, databases, servers, storage, network, cloud platforms, and so on. The data is used not only to quickly recover from incidents, but to proactively prevent them.

Every IT asset generates data during its life that needs to be captured and studied.

For example, Netflix uses advanced statistical techniques such as time-series analysis and unsupervised machine learning to detect outliers. Undoubtedly, this type of telemetry is the most vital technical enabler for ensuring relentless levels of service reliability.
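To make the contrast with fixed-threshold monitoring concrete, here is a minimal Python sketch of one simple unsupervised technique: the modified z-score built on the median absolute deviation (MAD). It is an illustration only and does not describe how Netflix or any other company implements outlier detection. The function names, the sample latency values, the 12-point window, and the 200 ms limit are all assumptions made for this example; the 0.6745 constant and the 3.5 cutoff are the values conventionally used with this score.

```python
from statistics import median


def static_threshold_alerts(values, limit):
    """Classic threshold monitoring: flag every point above a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def mad_outliers(values, window=12, cutoff=3.5):
    """Flag points that deviate strongly from their own trailing window.

    Each point is scored against the median and the median absolute
    deviation (MAD) of the previous `window` points, so no labels and
    no hand-tuned absolute limit are needed.
    """
    outliers = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = median(history)
        mad = median(abs(v - center) for v in history)
        if mad == 0:
            # Perfectly flat history: the score is undefined, so skip the point.
            continue
        score = 0.6745 * abs(values[i] - center) / mad
        if score > cutoff:
            outliers.append(i)
    return outliers


# Hypothetical request latencies in milliseconds, one reading per interval.
latency_ms = [100, 102, 99, 101, 98, 103, 100, 97,
              102, 101, 99, 100, 101, 180, 100, 99]

print(static_threshold_alerts(latency_ms, limit=200))
print(mad_outliers(latency_ms))
```

On this sample, the static limit of 200 ms stays silent, while the MAD-based check flags the 180 ms reading at index 13, because it judges each point against recent behavior rather than against a fixed number.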

Improving IT Culture

Let’s be real — working in IT operations is like holding a position in the Emergency Room. You are unexpectedly hit with catastrophes to which your response must be immediate and decisive.

Individuals in the team must understand their own role and the roles of those around them, leverage their toolbox, operate as a fluid organism, and constantly learn in order to grow in effectiveness. Because environmental volatility becomes second nature, team members expect high intensity along with stimulating work. Individuals also expect a blame-free workplace, because as long as we are human, we will make mistakes.

Cultural attributes such as innovation, reward, learning, forgiveness, and a sense of communal identity serve as extremely crucial facets of any successful team.

For digitally evolving companies, this topic is complex and vast. It demands detailed attention to an unexciting area of IT that for the most part has remained subject to offshoring, outsourcing, and commoditization.

As CIOs/CTOs begin to transform their internal wiring for delivering software, they must pay equal attention to running software. Because software isn’t usable unless it is reliable.


Arjun Shah is a Senior Manager and published DevOps thought leader in Accenture Consulting. He has led several award-winning technology transformation programs for high-performing organizations in their pursuit of growth, disruption, and sustainability.

