Are you passionate about next-gen observability, automation, and operational excellence? As our Systems & Monitoring Engineer, you’ll architect and own the monitoring stack for our Hedera-based ecosystem, blending classic NOC best practices with the unique challenges of DLT and Web3. You’ll be the technical backbone ensuring uptime, resilience, and regulatory compliance for our global support teams.
What You’ll Do
Web3 Observability:
· Design, deploy, and maintain monitoring solutions (Prometheus, Grafana) for DLT-specific metrics (consensus finality, node health, on-chain activity).
· Build custom exporters and dashboards for real-time, actionable insights.
· Distinguish infrastructure health from protocol health so that alerts point to the right layer and stay actionable.
Incident Response & Compliance:
· Integrate and manage PagerDuty for rapid, automated incident response.
· Implement processes compliant with DORA (the EU Digital Operational Resilience Act), including automated “kill switches” and regular disaster recovery drills.
· Maintain clear, actionable runbooks for support teams.
Automation & Infrastructure as Code:
· Deploy and manage Mirror Nodes and RPC relays using Terraform/Ansible across AWS/GCP.
· Build CI/CD pipelines for support tooling and state proof verification.
· Automate critical response actions for rapid threat mitigation.
NOC Leadership:
· Serve as the L3 escalation point for complex incidents (e.g., “ghost transactions” and API anomalies).
· Perform root cause analysis using log platforms (Splunk, Datadog) and collaborate with cross-functional teams on fixes.
What You Bring
· 4+ years in DevOps, SRE, or NOC roles, including 1–2 years in Web3/blockchain environments.
· Deep expertise in Prometheus/Grafana, Linux, Docker/Kubernetes, and scripting (Python, Go, Bash).
· Proven experience with cloud platforms (AWS/GCP) and IaC tools (Terraform, Ansible).
· Strong understanding of Hedera Hashgraph or EVM-based chains, and the ability to interpret ledger APIs.
· Familiarity with ITIL/ITSM, DORA, SOC 2, or ISO 27001 frameworks.