The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
Updated Jun 14, 2019
A curated list of Site Reliability and Production Engineering resources.
Updated Jun 14, 2019
A curated list of Chaos Engineering resources.
Updated Jun 5, 2019
An easy-to-use and powerful chaos engineering experiment toolkit. (An easy-to-use, powerful chaos experiment injection tool open-sourced by Alibaba.)
Go
Updated Jun 15, 2019
A collection of postmortem templates
Updated May 30, 2019
Web UI for Jaeger
JavaScript
Updated Jun 14, 2019
What to Read to Learn More About DevOps
Updated Jun 9, 2019
Curated list of good SRE interview questions.
Updated Jun 7, 2019
Google Site Reliability Engineering book converted to audio
Updated Mar 22, 2017
A party card game for engineers caring about reliability. Based on Cards Against Humanity.
A collection of SRE tools
Updated Jan 26, 2018
Calculate how much downtime should be permitted in your SLA
HTML
Updated Jun 9, 2018
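The downtime calculation this entry describes can be sketched as follows. This is only an illustration of the usual availability arithmetic, not code from the listed project; the function name `allowed_downtime_seconds` and the 30-day default period are assumptions.

```python
def allowed_downtime_seconds(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in seconds, for a given availability target.

    Hypothetical helper: the name and the 30-day default are assumptions
    made for this sketch, not taken from the listed repository.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    period_seconds = period_days * 24 * 60 * 60
    # The budget is the fraction of the period that is allowed to be unavailable.
    return period_seconds * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

For a 99.9% target over 30 days, the budget works out to roughly 43.2 minutes.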
A role-playing game for incident management training
HTML
Updated Apr 8, 2019
My opinionated list of products and tools used for high-scalability projects
Updated May 6, 2019
The Skinny Distributed Lock Service
Go
Updated Jun 12, 2019
A collection of templates ported from the SRE Workbook
Updated Aug 24, 2018
A list of common Disaster Recovery (DR) scenarios for software companies
Updated Dec 23, 2018
A combined introduction to operating systems and computer networks
Updated Feb 2, 2017
🔖 Daily-updated reading list for designing High Scalability 🍒, High Availability 🔥, High Stability 🗻 back-end system…
The agent for Komlog, a PaaS that helps observability teams better understand their systems.
Python
Updated Nov 14, 2017
Terraform provider for Arachnys' Cabot. Create, manage, and manipulate status checks and alerts for services.
Go
Updated Sep 15, 2017
Control health checks and toggle upstream node status in load balancers with ease.
Go
Updated May 22, 2017
Go
Updated Apr 24, 2018
Endpoint monitoring and DNS failover agent written in Go
Go
Updated Dec 8, 2017
A curated list of awesome Site Reliability and Production Engineering resources.
Overall map of topics to cover for my “Engineering for Site Reliability” blog series.
Updated Jan 1, 2019
A resource website dedicated to Reliability Engineering
CSS
Updated Mar 16, 2019
Calculate the tolerable downtime of your service
HTML
Updated Jun 21, 2018
Resume of M. Adam Kendall, Software Engineer
Updated Jun 12, 2019
Deterministic Subsetting as defined in the SRE book
Python
Updated May 29, 2018
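For readers unfamiliar with the technique, deterministic subsetting can be sketched roughly as below. This follows the algorithm as described in the SRE book, adapted to Python 3; it is not the code of the listed repository. The function name `deterministic_subset` is an assumption, and a per-round `random.Random` instance is used here instead of reseeding the global generator.

```python
import random
from typing import List, Sequence


def deterministic_subset(backends: Sequence[str], client_id: int, subset_size: int) -> List[str]:
    """Pick the backend subset for one client (sketch, names are hypothetical)."""
    if subset_size <= 0 or subset_size > len(backends):
        raise ValueError("subset_size must be between 1 and len(backends)")
    subset_count = len(backends) // subset_size

    # Clients are grouped into rounds; every client in a round uses the
    # same shuffled backend list, seeded by the round number.
    round_id = client_id // subset_count
    shuffled = list(backends)
    random.Random(round_id).shuffle(shuffled)

    # Within a round, each client takes a distinct slice of the shuffled list.
    subset_id = client_id % subset_count
    start = subset_id * subset_size
    return shuffled[start:start + subset_size]


if __name__ == "__main__":
    backends = [f"backend-{i}" for i in range(12)]
    for client_id in range(4):
        print(client_id, deterministic_subset(backends, client_id, 3))
```

Because clients in the same round share one shuffle and take disjoint slices, each round spreads its connections evenly across the backends.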