Key Responsibilities
- Design and implement scalable telemetry pipelines for metrics, logs, traces, and events across distributed systems.
- Develop and maintain observability standards, NMS tooling, dashboards, alerting frameworks, and SLOs in collaboration with product and platform teams.
- Champion best practices in instrumentation, monitoring, and incident response across engineering teams.
- Integrate and optimise observability tools (e.g., OpenTelemetry, Prometheus, Grafana, Splunk, Elastic) within the NPS ecosystem.
- Collaborate cross-functionally to ensure observability is embedded into the SDLC and CI/CD pipelines.
- Drive adoption of observability platforms through enablement, documentation, and training.
- Continuously evaluate emerging technologies and practices to evolve our observability capabilities.
Required Skills and Experience
- Proven experience in observability, SRE, or platform engineering roles within complex, distributed environments.
- Strong hands-on expertise with telemetry tools such as OpenTelemetry, Prometheus, Grafana, Splunk, Elastic, Loki, or Jaeger.
- Proficiency in at least one programming language (e.g., Python, Go, Java) and experience with infrastructure-as-code tools (e.g., Terraform, Helm).
- Deep understanding of cloud-native architectures (Kubernetes, microservices, service meshes).
- Experience defining and managing SLOs, SLIs, and alerting strategies.
- Strong problem-solving skills and a passion for improving system reliability and developer experience.
Join us and be part of a team that values innovation, quality, and continuous improvement.