Site Reliability Engineer

Posted on: July 16, 2025

Job description

Responsibilities

The way of working outlined here (https://github.com/OctopusDeploy/People/tree/main/Engineering/Site-Reliability-Engineering) is your natural way of getting things done.
You excel in an environment focused on availability, reliability, and observability.
You are skilled in systems engineering and may have specialized expertise in specific areas.
You find value in applying safety culture lessons from other industries to your work.
You are adept at leading postmortems and designing deployment and monitoring pipelines.
You have a passion for automating builds, tests, deployments, infrastructure, and operational tasks.
You embrace a "you built it, you run it" culture, with a commitment to quality and system availability, participating in a humane on-call program.
You are self-motivated, work independently with high-quality output, and seek help or new tasks when needed.
You collaborate effectively to solve problems, combining passion, pragmatism, and empathy.
You are results-oriented, adaptive to business direction changes, and encourage the same approach in others.
You thrive on candid feedback, solving complex problems, and helping fellow engineers succeed while working on valuable projects.

Job requirements

Please note: this is to give you an idea of our tools; we don't expect expertise in everything.

A mix of internally developed applications and third-party software (e.g. TeamCity).
Run in Azure with a mix of App Services, AKS clusters, and Azure Functions.
We mostly use Linux containers, with a few Windows containers.
Container workloads are run on AKS.
Docker Hub and Artifactory container registries.

We have adopted OpenTelemetry for many of our build systems.
We are adopting OpenTelemetry for more use cases company-wide, delivering a full telemetry pipeline.
Sumo Logic and Honeycomb for analysis.

Working on building new capabilities to increase reliability (we don’t want you staring at monitoring dashboards all day).
Working where you work best, in a home office designed by you, using a device of your choosing, with or without music, in an atmosphere you create for yourself.
Handling a request from an internal team, helping solve a challenging build, test, or packaging issue, or offering advice to an engineer to help them fall into the pit of success.
Pairing with another engineer on a Zoom call to solve a complex technical problem or explore and define the problem space for future innovation.
Responding to an actionable alert and working to maintain the reliability of the platform used across the company.
Improving our documentation to help engineers discover solutions for themselves and reduce lead time.
Writing a blog post about something interesting for other engineers.
Facilitating an incident review or preparing a presentation on what was learned from a recent incident.
Proactively reducing future toil by building automation.