Senior Site Reliability Engineer

Unanet
Posted on Stack Overflow on Jul 15, 2021

As a member of our Cloud Support group, you will help define our transformation towards an enterprise SaaS solution, hosting numerous top-tier customers. With a quickly growing customer base, we need creative and dynamic engineers to help architect innovative solutions that ensure the best possible experience for our customers.

You will join a team of talented, fast-moving engineers and administrators involved in nearly every aspect of the SaaS delivery and customer experience lifecycle. We are looking for an engineer with a strong software development background, one who has experience building services to ensure our Cloud Support is lean, proactive, and efficient.

Your success will hinge on your ability to apply a software engineer mentality to the functions of operations as well as a firm grasp of automation, cloud architectures, event monitoring, health checks, and metrics gathering. You should be passionate about solving problems and developing creative solutions leveraging automation.

What You’ll Do

  • Provision, configure, and maintain the production environment so that it can run several application stacks in the cloud and scale to the thousands of customers using our products, as well as our internal Product team

  • Automate the deployment and maintenance of cloud platform technologies in the upper environments, ensuring changes are also reflected in the lower environments

  • Aid in improving the overall product through development of management automation and metrics analysis in the upper environment

    • Integrate current scripts, automations, and functions spread across multiple tools into a coherent Cloud Control system

    • Collaborate with Cloud Development on adding database deployment capability to the release pipeline (automate schema changes across all databases)

    • Evaluate metrics as customers are moved to the new production environment

    • Create a metrics-based performance dashboard for production that includes predictive warnings which can be addressed prior to an incident occurring

    • Prepare for multi-tenant product solutions in coordination with cloud development

  • Implement and oversee log management, data warehouse, and database operations, including management of Logging/Audit services

  • Ensure all monitoring systems (infrastructure- and application-level) are in place, and report on availability and system health

  • Collect and distill existing management and monitoring tools, scripts, and functions into a single coherent package for easy consumption with the ability to drill down to details

  • Implement strategies around disaster recovery and security for all sub-systems in infrastructure (e.g., web servers, database, queues, storage, network)

  • Build strategic and tactical plans for continued improvement of cloud architecture and operations

  • Perform capacity management and load and scalability planning

  • Help drive process improvements for service management, including outage/incident management, rollbacks, health checks and reporting

  • Assist management in development and optimization of operational cost models

  • Assist in the establishment of 24x7 performance monitoring, reporting and response protocols

  • With the support of Cloud Development and Product Development, provide on-call support outside of normal work hours/days

Your First 90 Days

In your first 30 Days, as your familiarity with the product and pipeline grows, your responsibilities and influence will grow as well. You will immerse yourself in all facets of the daily operation of the production cloud environment, including provisioning new customers, deploying software builds, reviewing metrics and alerts, troubleshooting, and blameless incident postmortems. Further, you will collaborate with members of both our Product Development and Cloud Development teams to ensure operations can support new functionality.

Within your first 60 Days, you will collaborate with our Director of Cloud Support to define the transition of Cloud Support to a true SRE practice. Working with the rest of the Cloud Support team, you will identify procedures that are currently handled manually or are not fully automated, and then begin automating them. Working with our Cloud Development team, you will identify and close the gaps between lower and upper delivery environments, leading to a truly scalable product offering. You, along with the rest of our Cloud Support team, will be responsible for supporting production environments.

Within your first 90 Days, you will drive changes to the operational and development roadmap as we continue onboarding new and existing customers into our hosted production environments. You will work with the Director of Cloud Support to define a training plan that addresses skill gaps within the Cloud Support team and supports the transition to a true SRE practice. You will also begin gathering and refining requirements for a new Cloud Control system that expands and integrates with our existing Cloud API. This system will incorporate functions and automations related to the daily operation of our production cloud environment.

About You

  • 2+ years of hands-on experience as a production SRE, managing an environment of 500+ containers over 50+ namespaces

  • 4+ years of hands-on development experience with applications and RESTful APIs architected for cloud

  • Performance optimization experience, including troubleshooting and resolving network and server latency issues, performing hardware evaluation/selection tasks, performance vs. cost vs. time analysis

  • 1+ year(s) of experience with Kubernetes and Terraform

  • 1+ year(s) of experience with automation or scripting languages (e.g., Go, Python, Shell)

  • Database skills, with the ability to automate database processes into the pipeline

  • Working knowledge of Agile Development practices (e.g., Scrum, TDD)

  • Detail-oriented, with excellent documentation skills, and ability to successfully manage multiple priorities

  • Troubleshooting skills that range from diagnosing hardware/software issues to large scale failures within a complex infrastructure

Your Differentiators

  • Bachelor’s Degree in Computer Science

  • Experience implementing production Docker/Kubernetes environments

  • Hands-on experience with building and maintaining a continuous integration and delivery pipeline

  • Experience with Relational Databases (e.g., Oracle, Aurora or Postgres)

  • Experience with Splunk (or other log aggregation tools), Grafana, and Prometheus

  • Experience presenting complex information directly to customers, considering their technical experience level

Our Values

  • We are a Team. Employees, customers, and partners working together.

  • We are Customer-Focused. Customers are the heart of everything we do.

  • We are Driven. Seeking exceptional outcomes.

  • We Own our Success. Every employee has a stake in our company.

  • We do the right thing and have fun in the process.

Unanet is proud to be an Equal Opportunity Employer. Applicants will be considered for positions without regard to race, religion, sex, national origin, age, disability, veteran status or any other consideration made unlawful by applicable federal, state or local laws.
