Battle-Tested Reliability Strategies — Incidentally Reliable with Abhishek Ghosh

Friday, Aug 16, 2024 | 2 minute read | Updated at Friday, Aug 16, 2024

Podcast - Incidentally Reliable

Published Aug 16, 2024

Summary:

  1. Introduction and Background: The podcast features a guest with experience at major companies like Microsoft, Google, and Pinterest, discussing their roles and insights from different positions.

  2. Experience at Microsoft: The guest joined Microsoft after graduating and worked primarily as a program manager on internet-scale distributed systems, including the transition from Live Search to Bing.

  3. Role at Google: The guest then moved to Google Cloud Platform, focusing on Google’s observability products, integrating acquisitions such as Stackdriver, and exploring reliability and SRE (Site Reliability Engineering).

  4. Subsequent Experience at Pinterest: The guest’s earlier experience prepared them for a leadership role at Pinterest, where they focused on metrics for improving user experience and system reliability, including innovative uptime metrics tied to user interactions.

  5. Discussion on Metrics and Incident Management: Metrics used included user uptime, the impact of incidents on revenue, and commit safety. The philosophy behind metric selection emphasized measuring impact rather than raw incident counts, so that the metrics drive the desired behavior from engineering teams.

  6. Tools and Technologies: The guest discussed build-versus-buy decisions for tooling, especially at Pinterest, where the scale justified building custom solutions. In general, however, they leaned towards buying because of the complexity and effort involved in maintaining proprietary tools.

  7. Role of Data in Decision-Making: Discussed the importance of not just collecting data but also interpreting it meaningfully to guide reliable engineering practices and system improvements.

  8. Leadership and SRE Culture: Emphasized leadership’s role in shaping a proactive SRE culture that focuses on prevention over reaction, and the continual learning that comes from handling incidents and customer feedback.

This summary captures the key discussions and insights shared during the podcast, highlighting the guest’s professional journey and philosophical approach to reliability engineering and leadership in tech environments.
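
To make the idea in item 5 more concrete, here is a minimal Python sketch of one way a "user uptime" style metric could be computed. The summary does not say how the guest or Pinterest actually defines it, so the definition used here (successful user interactions divided by attempted ones), the `Window` type, the `user_uptime` function, and all the numbers are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """One measurement window, for example one minute of traffic (hypothetical)."""
    attempted: int   # user interactions attempted in the window
    succeeded: int   # interactions that completed successfully


def user_uptime(windows: list[Window]) -> float:
    """Fraction of attempted user interactions that succeeded.

    This is an assumed definition, not the one described in the episode.
    """
    attempted = sum(w.attempted for w in windows)
    if attempted == 0:
        return 1.0  # no traffic means no user was affected
    succeeded = sum(w.succeeded for w in windows)
    return succeeded / attempted


# Week A: three small incidents, each during a low-traffic window.
week_a = [Window(1000, 1000)] * 7 + [Window(100, 90)] * 3
# Week B: a single incident during a peak-traffic window.
week_b = [Window(1000, 1000)] * 9 + [Window(1000, 500)]

print(f"week A: 3 incidents, user uptime {user_uptime(week_a):.4f}")
print(f"week B: 1 incident,  user uptime {user_uptime(week_b):.4f}")
```

With these made-up numbers, week A has more incidents but a higher user uptime than week B, which is the kind of difference a raw incident count would hide.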

Listen to the episode: YouTube

About this site

This site is a list of summaries of Ops and SRE related podcast episodes.

I built this to fulfill a personal need.

There are so many podcasts with valuable content out there but it’s impossible for me to listen to them in their entirety. These summaries give me a starting point to decide which of them has stuff that I need to know more about. Based on that I go and listen to the episode.

The summaries are auto-generated by an LLM from the episodes, so it’s possible there are minor errors. I try my best to correct any that I notice. Please reach out to let me know if you come across any.

I would encourage users of this site to go and listen to the actual podcast episodes that they find interesting based on the summaries.

I am not affiliated with any of the podcasts or their authors.

All feedback is welcome. My contact info