97 things every SRE should know

97 things every SRE should know : collective wisdom from the experts (Ninety-seven things every Site Reliability Engineer should know) / edited by Emil Stolarsky and Jaime Woo - Mumbai : O'Reilly, ©2021 - xvii, 231 p. : ill. ; 24 cm.

Includes bibliographical references and index.

New to SRE. Site reliability engineering in six words / Do we know why we really want reliability? / Building self-regulating processes / Four engineers of an SRE seder / The reliability stack / Infrastructure: it's where the power is / Thinking about resilience / Observability in the development cycle / There is no magic / How Wikipedia is served to you / Why you should understand (a little) about TCP / The importance of a management interface / When it comes to storage, think distributed / The role of cardinality / Security is like an onion / Use your words / Where to SRE / Dear future team / Sustainability and burnout / Don't take advice from graybeards / Facing that first page / Zero to one. SRE, at any size, is cultural / Everyone is an SRE in a small organization / Auditing your environment for improvements / With incident response, start small / Solo SRE: effecting large-scale change as a single individual / Design goals for SLO measurement / I have an error budget - now what? / How to change things / Methodological debugging / How startups can build an SRE mindset / Bootstrapping SRE in enterprises / It's okay not to know, and it's okay to be wrong / Storytelling is a superpower / Get your work recognized: write a brag document / One to ten.
Making work visible / An overlooked engineering skill / Unpacking the on-call divide / The maestros of incident response / Effortless incident management / If you're doing runbooks, do them well / Why I hate our playbooks / What machines do well / Integrating empathy into SRE tools / Using ChatOps to implement empathy / Move fast to unbreak things / You don't know for sure until it runs in production / Sometimes the fix is the problem / Legendary / Metrics are not SLIs (the measure everything trap) / When SLOs attack: pathological SLOs and how to fix them / Holistic approach to product reliability / In search of the lost time / Unexpected lessons from office hours / Building tools for internal customers that they actually want to use / It's about the individuals and interactions / The human baseline in SRE / Remotely productive or productively remote / Of margins and individuals / The importance of margins in systems / Fewer spreadsheets, more napkins / Sneaking in your DevOps deliciously / Effecting SRE cultural changes in enterprise / To all the SREs I've loved / Complex: the most overloaded word in technology / Ten to hundred. 
The best advice I can give to teams / Create your supporting artifacts / The order of operations for getting SLO buy-in / Heroes are necessary, but hero culture is not / On-call rotations that people want to join / Study of human factors and team culture to improve pager fatigue / Optimize for MTTBTB (mean time to back to bed) / Mitigating and preventing cascading failures / On-call health: the metric you could be measuring / The SRE as a diplomat / Test your disaster plan / Why training matters to an SRE practice and SRE matters to your training program / The power of uniformity / Bytes per user value / Make your engineering blog a priority / Don't let anyone run code in your context / Trading places: SRE and product / You see teams, I see product / The performance emergency fund / Important but not urgent: roadmaps for SREs / The future of SRE. That 50% thing / Following the path of safety-critical systems / The importance of formal specification / Risk and rot in sociotechnical systems / SRE in crisis / Expected risk limitations / Beyond local risk: accounting for Angry Birds / A word from software safety nerds / Incidents: a window into gaps / The third age of SRE / Alex Hidalgo -- Niall Murphy -- Denise Yu -- Jacob Scott -- Alex Hidalgo -- Charity Majors -- Justin Li -- Charity Majors and Liz Fong-Jones -- Bouke van der Bijl -- Effie Mouzeli -- Julia Evans -- Salim Virji -- Salim Virji -- Charity Majors and Liz Fong-Jones -- Lucas Fontes -- Tanya Reilly -- Fatema Boxwala -- Frances Rees -- Denise Yu -- John Looney -- Andrew Louis -- Matthew Huxtable -- Matthew Huxtable -- Joan O'Callaghan -- Thai Wood -- Ashley Poole -- Ben Sigelman -- Alex Hidalgo -- Joan O'Callaghan -- Avishai Ish-Shalom and Nati Cohen -- Tamara Miner -- Vanessa Yiu -- Todd Palino -- Anita Clarke -- Julia Evans and Karla Burnett -- Lorin Hochstein -- Murali Suriar -- Jason Hand -- Andrew Louis -- Suhail Patel, Miles Bryant, and Chris Evans -- Spike Lindsey -- Frances Rees -- Michelle Brush
-- Daniella Niyonkuru -- Daniella Niyonkuru -- Michelle Brush -- Ingrid Epure -- Jake Pittis -- Elise Gale -- Brian Murphy -- Narayan Desai -- Kristine Chen and Bart Ponurkiewicz -- Ingrid Epure -- Tamara Miner -- Vinessa Wan -- Vinessa Wan -- Effie Mouzeli -- Avleen Vig -- Kurt Andersen -- Kurt Andersen -- Jacob Bednarz -- Vinessa Wan -- Vanessa Yiu -- Felix Glaser -- Laura Nolan -- Nicole Forsgren -- Daria Barteneva and Eva Parish -- David K. Rensin -- Lei Lopez -- Miles Bryant, Chris Evans, and Suhail Patel -- Daria Barteneva -- Spike Lindsey -- Rita Lu -- Caitie McCaffrey -- Johnny Boursiquot -- Tanya Reilly -- Jennifer Petoff -- Chris Evans, Suhail Patel, and Miles Bryant -- Arshia Mufti -- Anita Clarke -- John Looney -- Shubheksha Jalan -- Avleen Vig -- Dawn Parzych -- Laura Nolan -- Tanya Reilly -- Heidy Khlaaf -- Hillel Wayne -- Laura Nolan -- Niall Murphy -- Blake Bisset -- Blake Bisset -- J. Paul Reed -- Lorin Hochstein -- Björn "Beorn" Rabenstein.

"Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ. Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field."--

ISBN: 9789385889516

LCCN: 2022276249

Nat. Bib. No.: GBC0K1553 bnb

Nat. Bib. Agency Control No.: 020047577 Uk

Subjects--Topical Terms:
Reliability (Engineering)
Engineering--Management.
Cyberinfrastructure--Management.
Allied operations

LC Class. No.: TA169 / .N56 2020

Dewey Class. No.: 620.001 / STO-9