6 months ago

Site Reliability Engineer at Shopify

78% 40 hours / week United States (Remote)
Flexible working hours
Food or lunch benefits
Paid health insurance
Paid parental leave
Employee training program

Shopify has many critical components, and sometimes they fail. When that happens, the Resiliency team ensures we get back to green as fast as possible.

We will be setting the foundation for building and running resilient systems at Shopify. This is a team of engineers with in-depth operational knowledge of the entire Shopify stack, who will act as first responders and leaders during an incident.

Our job is to get to a resolution as quickly as possible and guide teams to build a more resilient Shopify. We will build the tools and systems used to quickly resolve incidents, and will look to automate away the manual toil.

Commerce happens 24/7, and we need to build a team that can respond whenever necessary. We are hiring for a distributed team to provide availability in Honolulu, Hawaii (UTC-10).

What you’ll do:

  • Respond to automated alerts and execute playbooks.
  • Manage ongoing incidents, using your understanding of Shopify to involve the right teams and resolve as quickly as possible.
  • Clean up the noise in our signals so we can understand the state of the system and debug problems easily.
  • Set the standards with teams for building resilient, debuggable systems.
  • Ensure we never fail for the same reason twice.
  • Follow up on each incident to ensure the appropriate action items are in place and prioritized.

About you:

  • You have experience handling on-call shifts for mission-critical systems.
  • You have been responsible for the tools and processes used to debug and correct failures in those systems.
  • You strongly reject the idea that being on call has to be a terrible, disruptive experience.
  • You are a generalist developer who is comfortable with multiple languages, such as C, Rust, Ruby, and Go.
  • You have done hands-on development with cloud infrastructure (AWS, GCE, Azure, Kubernetes, Docker).

Nice to have but not necessary:

  • You have handled multiple IMOC/on-call shifts and have navigated more than one incident through to the RCA process.
  • You have experience working with a variety of open-source software including nginx, redis, memcached and MySQL.
  • You have familiarity with network and web protocols, from IP to HTTP.