Senior Site Reliability Engineer, REMOTE, Blockchain
Our client is looking for a person who loves automating manual work and solving complicated problems, says no to downtime, enjoys working in an energetic and free-thinking environment, feels comfortable challenging opinions, and, most importantly, shares the desire to build the distributed Web.
Our client is looking for a Site Reliability Engineer to work alongside all core engineering teams and help handle the operational load of a fast-growing team. The company must deliver availability, performance, efficiency, monitoring, and emergency response, all while enabling decentralization of the client infrastructure. The person hired will join the distributed on-call rotation and help create a self-sustaining blockchain infrastructure.
This is a high-productivity and highly dynamic startup environment, so you will need to be comfortable operating quickly but precisely amid changing needs. There is an opportunity to apply your creativity to almost any aspect of blockchain development.
Requirements
- Advanced Python coding skills
- Solid understanding of UNIX internals
- Sharp troubleshooting skills and the conviction that no problem is impossible to solve
- Experience with cloud provisioning tooling such as Terraform, Packer, Ansible, and Docker
- Experience with monitoring infrastructure such as Grafana, Prometheus, and Datadog
- Experience with CI infrastructure such as Travis, CircleCI, or Jenkins
- Experience keeping services up 24/7
- Expertise in large-scale distributed systems
- Excellent written and verbal communication skills in English
- Proven ability to be effective on a distributed team
- Passion for open source
Nice to Have
- Experience with the Rust programming language
- Experience with multiple cloud providers (AWS, Azure, and Google Cloud Platform)
- Knowledge of blockchain technologies
Responsibilities
- Share the 24/7 on-call rotation with the engineering team (including weekends, compensated with days off during the week)
- Help build self-driving services that run and repair themselves
- Help define SLOs and mission-critical metrics
- Drive our incident management and response processes
- Build an emergency response playbook with monitoring and alerting
- Work with our core blockchain, middleware, and apps teams to deliver secure, highly available services
- Collaborate with a geographically distributed team, work in the open as part of the client's open source project, and engage with the client's global community