Lex Neva has been publishing SRE Weekly since 2015. If you’re interested enough in SRE to be reading this post, you’re probably familiar with it. If not, there are plenty of great articles to catch up on! Lex selects approximately 10 entries from around the Internet for each issue, covering everything from SRE best practices to the social side of systems to major outages in the news.
I’ve always thought that Lex must be among the most well-read people in SRE, and possibly number one. I met with Lex on a call and was excited to chat with him about how SRE Weekly came to be, how it is put together, and his outlook on SRE.
Origin of SRE Weekly
It felt appropriate to start our conversation by asking about the start of SRE Weekly: why did he take on this project? As with many cool projects, Lex was inspired to “be the change he wanted to see.” He was an avid reader of DevOps Weekly, but wanted something similar to exist for SRE. With so much great, educational material being created in the SRE space, shouldn’t there be something to help people find the best of it?
“I wanted every week to have a list of things related to SRE, and no such thing existed, and I’m like… oh,” Lex explained. “I almost fell sideways into it. I thought it was going to be a huge time sink, but it actually turned out to be a lot of fun.”
How SRE Weekly is Created
When thinking about the logistics of SRE Weekly, one question probably comes to mind: how? How does he have time to read all those articles? SRE is a methodology of methodologies, a practice that encourages the creation and improvement of practices. Lex certainly epitomizes this with his efficient way of finding and digesting dozens of articles a week.
First, he finds new articles. RSS feeds are his favorite tool for this. Once he has a buffer of new articles queued up, he uses an Android application called @voice to listen to them with text-to-speech at 2.5x speed! Building up the ability to understand an article at that pace is a challenge, but for someone dealing with the writing output of an entire community, it’s well worth it.
Lex doesn’t have any strict requirements for choosing which articles to include. He is interested in articles that bring new ideas or perspectives, but likes to include well-written introductory articles from time to time to get people up to speed. Pieces that focus on the social side of the socio-technical spectrum also interest him, especially when they highlight the diversity of voices in SRE.
Incident retrospectives are another style of post that Lex likes to highlight. Companies posting public statements about the outages they’ve experienced and what they’ve learned is a trend Lex wants to encourage. Although they may seem to tell the story of only one incident at one company, good incident retrospectives can carry a more universal lesson. “An incident is like an unexpected situation that can teach us something – if it’s something that surprised you about your system, maybe it can also teach someone else about their system.”
Lex described how, in the aviation industry, there was a massive leap in reliability when competing airlines began sharing what they learned after crashes. He feels that any potential competitive advantage should be secondary to working together to keep people safe. “The more you share about your incidents, the more we may realize that everyone makes mistakes, that we are all human,” Lex says. Promoting incident retrospectives is how he can encourage this beneficial trend.
Lex’s View of SRE
As someone with a front-row seat to SRE’s growth, Lex was the right person to ask about the trends he has seen and how he expects them to grow and change. We touched on many topics, but I’ll cover three major ones here:
Going Beyond the Google SRE Book
Since its publication in 2016, the Google SRE book has been the canonical text when it comes to SRE. In recent years, however, the idea that the book should not be the be-all and end-all has become more prominent. At SREcon 21, one of the book’s authors, Niall Murphy, made that very point live on camera!
Lex has seen this shift in approach in much of the recent writing he reads, and he is happy to see a more diverse understanding of what SRE can be: “Even if Google came up with the term SRE, a lot of companies were doing this kind of thing for even longer,” Lex said. “I want SRE to mean not just the technical core of building reliable code – although that’s also important – but to encompass everything that goes into building a reliable system.”
As SRE becomes more popular, companies of all sizes are seeing the benefits and want to get involved. Not all of these companies can mobilize the same resources as Google. In fact, practically only Google is at the level of Google! Lex is looking to learn more about the challenges of doing SRE at other scales, such as at startups, where there are no spare resources.
Expanding What an SRE Can Be
As we break away from the Google SRE book, we also begin to break away from the traditional description of what a site reliability engineer needs to do. “SRE is still going through growing pains,” Lex said. “We’re still trying to figure out who we are. But that’s not a bad thing. I’ve accepted that there’s a lot under the umbrella.”
We often assume that the “engineer” in “site reliability engineer” is like a software engineer, someone who primarily writes code. But Lex encourages a more holistic view: SRE is about engineering reliability into a system, which involves much more than just writing code. He is seeing more writing and approaches from SREs for whom writing code is a small percentage of their duties, or even 0%.
“They’re focusing more on the people side of things, incident response, and coming up with policies that build reliability in their company… and I think there’s room for that in SRE because that’s at the heart of it right now. It’s still engineering; it’s still an engineering mindset. If you only do the technical side of things, you’re really missing out.”
Diversifying the Perspective of SRE
As well as diversifying the role of SREs, Lex expects to see greater diversity among SREs themselves. In our closing discussion, I asked Lex what message he would broadcast to everyone in this space if he could. “It’s all about the people,” he said. “These complex systems that we are building, there will always be people in them. They are an important part of the infrastructure, just like the servers.”
Even though what we build in SRE can seem to be governed only by technical interactions, people are intrinsic to what makes those systems reliable. This is not a negative: people are not just “error-makers.” People are the ones who give a system its strength and flexibility. With that in mind, Lex highlights what can make the social side of the system better: diversity and inclusion.
“Inclusion is critical to the reliability of our socio-technical systems because we need to understand the perspectives of all our users, not just those of people like us. That means thinking about race, gender expression, class, neurodiversity, everything. This is an area where we need to do better.” Lex hopes to highlight the richness of this diversity in SRE Weekly.
As people standing near the inception of SRE, we are given both a challenge and an opportunity: working together to create and develop the practice. To truly understand and engineer reliability in what we do, we need to actively discuss our goals and how we are achieving them. We hope that you will take the time to reflect on the lessons that many great SRE writers share through places like SRE Weekly.