Slight Reliability Episode 87 - Measuring the value of SRE with Artem Yakimenko

Thursday, Jul 25, 2024 | 2 minute read | Updated at Thursday, Jul 25, 2024

Podcast - Slight Reliability

Published Jul 25, 2024

Summary:

  1. Podcast Title and Host: The podcast is titled “Slight Reliability” and hosted by Stephen Townshend.
  2. Guest Background: The guest, Artem Yakimenko, is from Culture Amp and has a background in operations and reliability engineering, including time at Google working on Google Cloud Storage and Customer Reliability.
  3. Discussion Topic: The main discussion revolves around making reliability meaningful in organizations, specifically how it impacts business value and customer experiences.
  4. Guest’s Role and Activities: Despite being an engineering director, Artem stays hands-on by taking support shifts to understand team challenges and deployment issues. He says this gives him valuable insight into daily operational challenges.
  5. Measuring Reliability: Artem argues that while it’s challenging to quantify the impact of reliability engineering in simple monetary terms, one can use metrics such as uptime, service level agreements (SLAs), and service level objectives (SLOs), combined with qualitative analysis of support ticket data, to gauge the impact on user experience.
  6. Concrete Example: Artem gives the example of analyzing keywords in support tickets, such as “latency” or “slow”, to identify and prioritize areas for reliability improvement. This surfaces the most prevalent issues affecting users, which can then be addressed to improve system performance and user satisfaction.
  7. Importance of Financial Ties: The discussion also touches on the business aspect, emphasizing the need for reliability efforts to be aligned with revenue-generating activities. Artem suggests that improvement efforts should focus on areas critical to customer experience and, consequently, revenue generation.
  8. Reliability in Business Context: Lastly, the discussion covers how to express the value of reliability engineering to business stakeholders, so that they understand its importance not only in preventing downtime but also in fostering a positive company reputation and maintaining customer trust and satisfaction.

This summary provides an overview of the key themes discussed in the podcast: integrating technical roles with business outcomes, and continuous improvement based on practical, data-driven insights.
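
The ticket-keyword analysis mentioned in item 6 could be sketched roughly as follows. This is only an illustration, not the method described in the episode: the function name, the sample tickets, and the keywords other than “latency” and “slow” are assumptions made up for the example.

```python
import re
from collections import Counter

# "latency" and "slow" come from the summary above; the other keywords
# are illustrative assumptions.
KEYWORDS = ("latency", "slow", "timeout", "error")


def count_keyword_tickets(tickets, keywords=KEYWORDS):
    """Count how many tickets mention each keyword at least once."""
    counts = Counter()
    for text in tickets:
        # Lowercase and split into words so matching is case-insensitive
        # and "slow" does not match inside an unrelated longer word.
        words = set(re.findall(r"[a-z]+", text.lower()))
        for kw in keywords:
            if kw in words:
                counts[kw] += 1
    return counts


# Hypothetical ticket texts, invented for this sketch.
sample_tickets = [
    "Dashboard is very slow to load after login",
    "High latency on report export, also slow search",
    "Cannot reset my password",
    "Export fails with a timeout",
]

counts = count_keyword_tickets(sample_tickets)
for kw, n in counts.most_common():
    print(f"{kw}: {n} of {len(sample_tickets)} tickets")
```

Counting tickets rather than raw word occurrences keeps one long, repetitive ticket from dominating the ranking. A real analysis would likely also need stemming or synonyms (for example “sluggish” or “lag”), which this sketch leaves out.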

Listen to the episode: YouTube

About this site

This site is a list of summaries of Ops and SRE related podcast episodes.

I built this to fulfill a personal need.

There are so many podcasts with valuable content out there but it’s impossible for me to listen to them in their entirety. These summaries give me a starting point to decide which of them has stuff that I need to know more about. Based on that I go and listen to the episode.

The summaries are auto-generated by an LLM from the episodes, so it’s possible there are minor errors. I try my best to correct any that I notice. Please reach out to let me know if you come across any.

I would encourage users of this site to go and listen to the actual podcast episodes that they find interesting based on the summaries.

I am not affiliated with any of the podcasts or their authors.

All feedback is welcome. My contact info