Senior Site Reliability Engineer

Posted on Indeed on Jun 02, 2021

GridIron IT is seeking a Senior Site Reliability Engineer to work on a REMOTE basis.

**US citizenship required

**No third parties, please

Responsibilities:


  • Conceive, design, and build infrastructure tooling that improves the availability, scalability, latency, and efficiency of services across the entire product surface area
  • Manage end-to-end availability and performance of key services and build automation to prevent problem recurrence
  • Build visibility into SLIs, SLOs, SLAs, and dependency graphs to reduce operational burden
  • Implement observability and instrumentation patterns that alert on symptoms and help reduce or prevent outages
  • Proactively identify risks and develop engineering processes, tooling, or work streams that reduce those risks
  • Evangelize and mentor service owners on reliability, resiliency, and scalability for new services and features
  • Collaborate with service owners to improve the production landscape for existing services
  • Facilitate and participate in an on-call rotation, and hold retrospective root cause analysis meetings that use blameless postmortems to identify remediations

Qualifications:


  • You take a systematic approach to analyzing, troubleshooting, and diagnosing system problems in order to locate and resolve them.
  • You can code to automate management of servers and software. When a problem needs a software solution, you roll up your sleeves and get to work.
  • You design for scale. You manage systems and applications as cattle rather than snowflakes. You design systems to auto-scale and auto-heal.
  • You have a breadth of engineering skills with an interest in service reliability, automation, monitoring, and capacity planning.
  • You understand modern architectures, e.g. microservices and event-driven architecture (EDA), and you are wary of overcomplexity and over-engineering
  • You enjoy working with the latest monitoring and metrics platforms, e.g. New Relic, Prometheus, InfluxDB, Grafana, and Splunk
  • Deep knowledge of AWS technologies, e.g. CLI, Aurora, S3, IAM, EC2, ECS, ECR, KMS, CloudWatch, Lambda, Route 53, SQS, SNS, and CodeDeploy
  • Previous experience working within an SRE culture, improving reliability with automation, chaos testing, and process improvement
  • Experience designing and operating distributed systems and cloud infrastructure at scale
  • Strong written communication skills, since we are a remote company
  • Experience supporting 24/7 infrastructure, including on-call rotations

Job Type: Full-time

Pay: $145,000.00 - $150,000.00 per year

Benefits:


  • 401(k)
  • 401(k) matching
  • Dental insurance
  • Disability insurance
  • Health insurance
  • Health savings account
  • Life insurance
  • Paid time off
  • Retirement plan
  • Vision insurance

Schedule:


  • On call

Education:


  • Bachelor's (Preferred)

Experience:


  • Cloud infrastructure: 5 years (Preferred)
  • Designing and operating distributed systems: 5 years (Preferred)
  • SLIs, SLOs, and SLAs: 5 years (Preferred)

Work Location:

  • Fully Remote
