Octopus Deploy is looking for a Senior Site Reliability Engineer (SRE) who can:
Use their SRE skills to keep systems running with high reliability.
Help improve and iterate on our existing reliability practices.
Bring new ideas/practices to increase reliability and reduce toil.
Spearhead implementation of new capabilities.
Share SRE expertise with other teams in the company.
Responsibilities
You will be a great fit for this role if:
The way of working outlined here (https://github.com/OctopusDeploy/People/tree/main/Engineering/Site-Reliability-Engineering) is your natural way of getting things done.
You excel in an environment focused on availability, reliability, and observability.
You are skilled in systems engineering and may have specialized expertise in specific areas.
You find value in applying safety culture lessons from other industries to your work.
You are adept at leading postmortems and designing deployment and monitoring pipelines.
You have a passion for automating builds, tests, deployments, infrastructure, and operational tasks.
You embrace a "you built it, you run it" culture, with a commitment to quality and system availability, participating in a humane on-call program.
You are self-motivated, work independently with high-quality output, and seek help or new tasks when needed.
You collaborate effectively to solve problems, combining passion, pragmatism, and empathy.
You are results-oriented, adaptive to business direction changes, and encourage the same approach in others.
You thrive on candid feedback, solving complex problems, and helping fellow engineers succeed while working on valuable projects.
Job requirements
Our Tech Stack:
Please note: this is to give you an idea of our tools; we don't expect expertise in everything.
Octopus Server:
Our primary focus and flagship product.
Written in .NET and uses a SQL database.
CI/CD:
TeamCity is our build system for Octopus Server.
GitHub Actions is used for some internal tools.
CD - Octopus Deploy.
Workloads:
A mix of internally developed applications and third-party software (e.g. TeamCity).
Run in Azure with a mix of App Services, AKS clusters, and Azure Functions.
We mostly use Linux containers, with a few Windows containers.
Container workloads are run on AKS.
Docker Hub and Artifactory container registries.
Infrastructure as Code (IaC):
We use Terraform as our primary IaC tool.
IaC workloads run in Octopus Deploy, with a few running as GitHub Actions.
Observability:
We have adopted OpenTelemetry for many of our build systems.
We are adopting OpenTelemetry for more use cases company-wide, delivering a full telemetry pipeline.
Sumo Logic and Honeycomb for analysis.
A typical day might include:
Working on building new capabilities to increase reliability (we don’t want you staring at monitoring dashboards all day).
Working where you work best, in a home office designed by you, using a device of your choosing, with or without music, in an atmosphere you create for yourself.
Handling a request from an internal team, helping solve a challenging build, test or packaging issue, or offering advice to an engineer to help them fall into the pit of success.
Pairing with another engineer on a Zoom call to solve a complex technical problem or explore and define the problem space for future innovation.
Responding to an actionable alert and working to maintain the reliability of the platform used across the company.
Improving our documentation to help engineers discover solutions for themselves and reduce lead time.
Writing a blog post about something interesting for other engineers or preparing a presentation on what was learned from a recent incident.
Facilitating an incident review.
Proactively reducing future toil by building automation.