Site Reliability Engineer (SRE)

Company:  Singapore Exchange
Location:  Singapore, SG

Unit:  Group Technology Office
Job Type:  Permanent (HC)
Requisition ID:  3356

Job Summary

SGX is hiring Site Reliability Engineers who treat operations as a software problem. You'll keep production healthy, but more importantly you'll build the automation, tooling, and agentic workflows that make running our systems boring and predictable. This is an engineering role - if your instinct on a recurring issue is to write code that removes it, you'll fit in well.

We operate in a regulated capital-markets environment, so the bar for reliability, security, and operational rigour is high.

Job Responsibilities

  • Own production reliability (SLOs, capacity, incident response, postmortems) and turn every incident into a durable fix in code or automation.
  • Build the platform and tooling that make services easy to deploy, observe, and operate: CI/CD, infrastructure-as-code, observability stacks, runbooks-as-code.
  • Apply AI agentically across operations (triage, root-cause analysis, remediation, change review) and contribute to our internal agentic ecosystem.
  • Design and integrate the systems underneath our services: messaging (e.g. Kafka), orchestration (e.g. Kubernetes), and performance-sensitive infrastructure.
  • Partner with product engineers on release readiness, rollout strategy, and production hardening before things ship.
  • Continuously reduce toil: measure it, attack it with code, and raise the floor on what "easy to maintain" looks like.

Job Requirements

  • 5+ years in SRE, platform, or infrastructure engineering, with a clear track record of replacing manual work with code
  • Strong programming ability in at least one modern language (e.g. Go, Python, Kotlin, TypeScript, Rust); you write production code, not just glue scripts
  • AI-native ways of working: real experience orchestrating agents for ops workflows, not just using AI for autocomplete
  • Deep hands-on experience with Kubernetes, IaC (Terraform or equivalent), CI/CD, and modern observability (metrics, logs, traces)
  • Production experience on a major cloud: GCP preferred, AWS acceptable
  • Solid foundations in distributed systems and the failure modes that matter in production
  • Incident-response maturity: calm under pressure, sharp on root cause, disciplined about follow-through
  • Comfort in complex, regulated environments

Nice to Have

  • Familiarity with the FIX protocol or the capital-markets domain
  • Experience building internal developer platforms or self-service tooling consumed by other engineers


Job Segment: Cloud, Technology