Site Reliability Engineer (SRE)
Singapore, SG
Job Summary
SGX is hiring Site Reliability Engineers who treat operations as a software problem. You'll keep production healthy, but more importantly you'll build the automation, tooling, and agentic workflows that make running our systems boring and predictable. This is an engineering role: if your instinct on a recurring issue is to write code that removes it, you'll fit in well.
We operate in a regulated capital-markets environment, so the bar for reliability, security, and operational rigour is high.
Job Responsibilities
- Own production reliability (SLOs, capacity, incident response, postmortems) and turn every incident into a durable fix in code or automation.
- Build the platform and tooling that make services easy to deploy, observe, and operate: CI/CD, infrastructure-as-code, observability stacks, runbooks-as-code.
- Apply AI agentically across operations (triage, root-cause analysis, remediation, change review) and contribute to our internal agentic ecosystem.
- Design and integrate the systems underneath our services: messaging (e.g. Kafka), orchestration (e.g. Kubernetes), and performance-sensitive infrastructure.
- Partner with product engineers on release readiness, rollout strategy, and production hardening before things ship.
- Continuously reduce toil: measure it, attack it with code, and raise the floor on what "easy to maintain" looks like.
Job Requirements
- 5+ years in SRE, platform, or infrastructure engineering, with a clear track record of replacing manual work with code
- Strong programming ability in at least one modern language (e.g. Go, Python, Kotlin, TypeScript, Rust); you write production code, not just glue scripts
- AI-native ways of working: real experience orchestrating agents for ops workflows, not just using AI for autocomplete
- Deep hands-on experience with Kubernetes, IaC (Terraform or equivalent), CI/CD, and modern observability (metrics, logs, traces)
- Production experience on a major cloud: GCP preferred, AWS acceptable
- Solid foundations in distributed systems and the failure modes that matter in production
- Incident-response maturity: calm under pressure, sharp on root cause, disciplined about follow-through
- Comfort in complex, regulated environments
Nice to Have
- Familiarity with the FIX protocol or the capital-markets domain
- Experience building internal developer platforms or self-service tooling consumed by other engineers
Job Segment:
Cloud, Technology