Battle-Tested Reliability Strategies — Incidentally Reliable with Abhishek Ghosh

Friday, Aug 16, 2024 | 2 minute read | Updated at Friday, Aug 16, 2024

Podcast - Incidentally Reliable

Published Aug 16, 2024

Summary:

Introduction and Background: The episode features a guest who has worked at Microsoft, Google, and Pinterest, and who discusses the roles held and the lessons learned in each position.
Experience at Microsoft: The guest joined Microsoft after graduating and worked primarily as a program manager on internet-scale distributed systems, including the transition from Live Search to Bing.
Role at Google: The guest then moved to Google Cloud Platform, focusing on Google’s observability products, integrating acquisitions such as Stackdriver, and exploring reliability and Site Reliability Engineering (SRE).
Subsequent Experience at Pinterest: These earlier roles prepared the guest for a leadership position at Pinterest. There, they focused on metrics to improve user experience and system reliability, including novel uptime metrics tied to user interactions.
Discussion on Metrics and Incident Management: The metrics discussed included user uptime, the revenue impact of incidents, and commit safety. The philosophy behind metric selection was to measure impact rather than raw incident counts, so that the metrics drive the desired behavior from engineering teams.
Tools and Technologies: The guest discussed build-versus-buy decisions for tooling. At Pinterest, the scale justified building custom solutions, but in general they leaned towards buying because of the complexity and effort involved in maintaining proprietary tools.
Role of Data in Decision-Making: The guest stressed the importance of interpreting data meaningfully, not just collecting it, to guide reliable engineering practices and system improvements.
Leadership and SRE Culture: The guest emphasized leadership’s role in shaping a proactive SRE culture that prioritizes prevention over reaction, and the continual learning that comes from handling incidents and customer feedback.

This summary captures the key discussions and insights shared during the podcast, highlighting the guest’s professional journey and philosophical approach to reliability engineering and leadership in tech environments.

Listen to the episode: YouTube
