Site Reliability Engineer (SRE)

Company: Singapore Exchange

Location: Singapore, SG

Unit: Group Technology Office

Job Type: Permanent (HC)

Requisition ID: 3356

Job Summary

SGX is hiring Site Reliability Engineers who treat operations as a software problem. You'll keep production healthy, but more importantly you'll build the automation, tooling, and agentic workflows that make running our systems boring and predictable. This is an engineering role: if your instinct on a recurring issue is to write code that removes it, you'll fit in well.

We operate in a regulated capital-markets environment, so the bar for reliability, security, and operational rigour is high.

Job Responsibilities

Own production reliability (SLOs, capacity, incident response, postmortems) and turn every incident into a durable fix in code or automation.
Build the platform and tooling that make services easy to deploy, observe, and operate: CI/CD, infrastructure-as-code, observability stacks, runbooks-as-code.
Apply AI agentically across operations (triage, root-cause analysis, remediation, change review) and contribute to our internal agentic ecosystem.
Design and integrate the systems underneath our services: messaging (e.g. Kafka), orchestration (e.g. Kubernetes), and performance-sensitive infrastructure.
Partner with product engineers on release readiness, rollout strategy, and production hardening before things ship.
Continuously reduce toil: measure it, attack it with code, and raise the floor on what "easy to maintain" looks like.

Job Requirements

5+ years in SRE, platform, or infrastructure engineering, with a clear track record of replacing manual work with code
Strong programming ability in at least one modern language (e.g. Go, Python, Kotlin, TypeScript, or Rust); you write production code, not just glue scripts
AI-native ways of working: real experience orchestrating agents for ops workflows, not just using AI for autocomplete
Deep hands-on experience with Kubernetes, IaC (Terraform or equivalent), CI/CD, and modern observability (metrics, logs, traces)
Production experience on a major cloud: GCP preferred, AWS acceptable
Solid foundations in distributed systems and the failure modes that matter in production
Incident-response maturity: calm under pressure, sharp on root cause, disciplined about follow-through
Comfort in complex, regulated environments

Nice to Have

Familiarity with the FIX protocol or capital-markets domain
Experience building internal developer platforms or self-service tooling consumed by other engineers