Site Reliability Engineer

Posted on: July 16, 2025

Job description

Octopus Deploy is looking for a Senior Site Reliability Engineer (SRE) who can:

  • Use their SRE skills to keep systems running with high reliability.
  • Help improve and iterate on our existing reliability practices.
  • Bring new ideas/practices to increase reliability and reduce toil.
  • Spearhead implementation of new capabilities.
  • Share SRE expertise with other teams in the company.

Responsibilities

You will be a great fit for this role if:

  • The way of working outlined here (https://github.com/OctopusDeploy/People/tree/main/Engineering/Site-Reliability-Engineering) is your natural way of getting things done.
  • You excel in an environment focused on availability, reliability, and observability.
  • You are skilled in systems engineering and may have specialized expertise in specific areas.
  • You find value in applying safety culture lessons from other industries to your work.
  • You are adept at leading postmortems and designing deployment and monitoring pipelines.
  • You have a passion for automating builds, tests, deployments, infrastructure, and operational tasks.
  • You embrace a "you built it, you run it" culture, are committed to quality and system availability, and participate in a humane on-call program.
  • You are self-motivated, work independently with high-quality output, and seek help or new tasks when needed.
  • You collaborate effectively to solve problems, combining passion, pragmatism, and empathy.
  • You are results-oriented, adaptive to business direction changes, and encourage the same approach in others.
  • You thrive on candid feedback, solving complex problems, and helping fellow engineers succeed while working on valuable projects.

Job requirements

Our Tech Stack:

  • Please note: this list is meant to give you an idea of our tools; we don't expect expertise in everything.

Octopus Server:

  • Our primary focus and flagship product.
  • Written in .NET and uses a SQL database.

CI/CD:

  • TeamCity is our build system for Octopus Server.
  • GitHub Actions is used for some internal tools.
  • Octopus Deploy handles CD.

Workloads:

  • A mix of internally developed applications and third-party software (e.g. TeamCity).
  • Run in Azure on a mix of App Services, AKS clusters, and Azure Functions.
  • We mostly use Linux containers, with a few Windows containers.
  • Container workloads run on AKS.
  • Docker Hub and Artifactory are our container registries.

Infrastructure as Code (IaC):

  • We use Terraform as our primary IaC tool.
  • IaC workloads run in Octopus Deploy, with a few running as GitHub Actions.

Observability:

  • We have adopted OpenTelemetry for many of our build systems.
  • We are adopting OpenTelemetry for more use cases company-wide, delivering a full telemetry pipeline.
  • We use Sumo Logic and Honeycomb for analysis.

A typical day might include:

  • Working on building new capabilities to increase reliability (we don’t want you staring at monitoring dashboards all day).
  • Working where you work best, in a home office designed by you, using a device of your choosing, with or without music, in an atmosphere you create for yourself.
  • Handling a request from an internal team, helping solve a challenging build, test or packaging issue, or offering advice to an engineer to help them fall into the pit of success.
  • Pairing with another engineer on a Zoom call to solve a complex technical problem or explore and define the problem space for future innovation.
  • Responding to an actionable alert and working to maintain the reliability of the platform used across the company.
  • Improving our documentation to help engineers discover solutions for themselves and reduce lead time.
  • Writing a blog post about something interesting for other engineers.
  • Facilitating an incident review or preparing a presentation on what was learned from a recent incident.
  • Proactively reducing future toil by building automation.