
Episode 14: Managing Mistakes
===

Morgan VanDerLeest: [00:00:00] Hello everyone, and welcome back to the PDD podcast. We're back to our regular format where we pick a listener's question and spend the episode addressing it. Today's question again deals with a situation where it's difficult to strike the right balance between proactiveness, reactiveness, and managing our team's emotional response to both. Specifically for today: what do we do when someone on our team, several people, or even we ourselves make a costly mistake at work? Let me go ahead and read the question. Dear PDD, I recently joined a series B startup as a director of engineering, and I find myself confused. On my first week at the new job, I attended a postmortem review for an incident. The postmortem did not address some very glaring issues, ranging from systemic gaps to irresponsible action by individuals. But when I brought it up with my VP, they said that, quote, we believe in a blameless culture.

Morgan VanDerLeest: And also I expect someone at your level to know that the gaps you've identified are so foundational that we can't just decide among ourselves to fix them. So there's no point [00:01:00] in bringing them up. A week later, an incident stemming from the same root cause happened again. This time the CEO intervened and demanded consequences. And I was asked to terminate the engineer whose actions were most closely related to the issue. This just feels wrong, but I'm also realizing I don't necessarily know any better. What is, in practice, the difference between accountability and blame? And also, I'm not sure what I would do in my VP's place if the postmortem did identify a critical issue, but engineering is not necessarily authorized to prioritize fixing it.

Morgan VanDerLeest: Any ideas?

Eddie Flaisler: You have no idea how much empathy I have for this person. You know, on the one hand, the rule of thumb for any incoming leader is to meet the team where it is, right? So, withhold judgment. On the other hand, you were hired to lead engineering and something is visibly broken. Do you just sit there and not touch it just because you're new?

Eddie Flaisler: You know, I have a very close manager friend. Don't look at me like that, Morgan, it's not me. This friend, also on their first [00:02:00] week on the job, coincidentally realized the system they inherited was prone to a distributed deadlock. They were asked to sit down and settle in before they started lighting fires, and a few short weeks later that deadlock happened, and this critical service had a huge outage that severely impacted the company's credibility.

Morgan VanDerLeest: How come you have so much drama in your life? Anyhow, we definitely have a lot to talk about. Cue the intro. Let's do it. I am Morgan.

Eddie Flaisler: And I am Eddie.

Morgan VanDerLeest: Eddie was my boss.

Eddie Flaisler: Yes.

Morgan VanDerLeest: And this is PDD: People Driven Development. You know, the thing about mistakes is that it's very difficult to talk about them collectively, or at a high level. Every sad story has its own idiosyncrasies. To quote Leo Tolstoy, happy families are all alike, but every unhappy family is unhappy in its own way. So, we were thinking of spending at least part of today's episode working through a concrete example of an incredibly complex situation to make this conversation a lot more tangible. It's a difficult story, [00:03:00] but one with many learnings, the disaster of Space Shuttle Columbia.

Eddie Flaisler: On February 1st, 2003, Space Shuttle Columbia disintegrated during re-entry into Earth's atmosphere, tragically killing all seven astronauts on board. It remains one of the most significant disasters in aerospace history. We are not here to critique or pass judgment on the monumental efforts of the thousands of people who worked tirelessly over decades to make this mission possible.

Eddie Flaisler: Instead, we are going to spend some time looking at a few key observations made in the Columbia Accident Investigation Board report, because it's a master class in navigating an overwhelming array of constraints while trying to do the right thing.

Eddie Flaisler: The board found that in all likelihood the accident happened because a piece of insulating foam came off the fuel tank during launch and hit the left wing. This foam was supposed to protect the tank from ice, but it broke off and caused a hole in the wing's heat shield. During re-entry, when the shuttle was coming back to Earth, the wing got very hot: through the [00:04:00] hole in the heat shield, superheated air entered the wing and melted its internal structure. This weakened the wing until it broke apart, causing the entire shuttle to break up. From the moment ground control realized there were issues on Columbia's left wing, it took less than eight minutes until the orbiter disintegrated, meaning that even if NASA had been capable of organizing a safe rescue mission, which the board report concluded it hadn't been, it wouldn't have had time.

Eddie Flaisler: Early detection would have been the only way to avoid loss of human life.

Morgan VanDerLeest: At the time of Columbia's launch, insulating foam detaching from the Space Shuttle's tank was considered a normal deviation. In fact, over 80 percent of previous missions for which imagery was available had experienced this issue. However, a conscious decision was made to accept these events as normal and not as posing a serious threat, because historically foam strikes had not resulted in catastrophic damage. The report also identified additional organizational barriers and failures leading to the disaster, including lack of communication channels for the lower level [00:05:00] engineers who recognized the potential danger to effectively convey their concerns, lack of responsiveness when these concerns did make it to senior leadership, absence of a truly independent safety organization with the authority to halt potentially dangerous missions, and finally, misdirected resource constraints and budget cuts that ultimately created pressure to maintain the flight schedule and minimize delays while overlooking safety concerns.

Eddie Flaisler: And then there was also the ISS.

Morgan VanDerLeest: Hang on, the International Space Station? Eddie, if I remember correctly, Columbia did not go to the space station?

Eddie Flaisler: It did not, but the report found a different connection. So for those unfamiliar, the International Space Station, or ISS, is a collaboration between five space agencies: NASA from the United States, Roscosmos from Russia, ESA representing Europe, JAXA from Japan, and CSA from Canada. It was first occupied in November 2000, and its U.S. segment was technically completed in 2011. NASA was very motivated to have the space station fully [00:06:00] operational. They were under immense pressure. They had multiple agreements with international partners dictating specific schedules. They had already spent billions of dollars. And, no less importantly, this collaboration was a post-Cold War symbol of global cooperation, which held significant meaning for some people. This needed to be successful. Now, to complete the work, NASA needed additional funding and political backing from parties who were not necessarily motivated to see the ISS succeed and needed to be convinced it was on the right track.

Eddie Flaisler: But in order to achieve this, NASA had to demonstrate, among other things, that their space shuttle program was reliable and capable, because, you know, the space shuttles were used to transport critical components to and from the station.

Eddie Flaisler: For that reason, they needed to maintain a high flight rate for the entire shuttle fleet. This included Columbia's mission STS-107, which launched without being ready, even though its primary mission was scientific [00:07:00] research and not time sensitive in any way.

Morgan VanDerLeest: Eddie, why is this starting to sound a little too familiar?

Eddie Flaisler: Because it is. Let's step back for a few minutes from discussing Columbia. We'll get back to the analysis. I recently stumbled upon a very interesting GitHub repository. It belongs to a guy named Dan Luu. The GitHub username is danluu. Anyhow, that repository contains about 200 public postmortems published in recent years by something like 60 plus companies.

Eddie Flaisler: That's 6 0. So, practically every engineering powerhouse you can think of. From FAANG to Cloudflare, Datadog, PagerDuty, Slack, Spotify, Roblox, the list goes on. For the purpose of this episode, I did this exercise where I wrote a script to scrape the text from old postmortems and analyze it for themes in root cause analysis.

Eddie Flaisler: So basically I was interested in what type of production problems do we typically encounter and how prevalent they are.
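(Editor's note: Eddie's actual script isn't published with the episode, so what follows is only a rough sketch of the kind of theme-counting he describes. The directory layout and the keyword lists are assumptions for illustration, not his real categories.)

```python
# Rough sketch of a postmortem theme analysis like the one described above.
# Assumes the postmortem texts have already been downloaded as plain text
# into ./postmortems/*.txt. Keyword lists are illustrative guesses.
from collections import Counter
from pathlib import Path

THEMES = {
    "capacity/scaling": ["capacity", "exhausted", "scal", "quota", "resource limit"],
    "dependency failure": ["third party", "third-party", "upstream", "dns", "network provider"],
    "logic/design error": ["bug", "race condition", "regression", "edge case"],
    "config change": ["configuration", "config change", "rollout", "feature flag"],
    "monitoring gap": ["alert", "monitoring", "observability", "undetected"],
    "security/fail-closed": ["certificate", "expired", "auth", "lockout"],
    "human error/process": ["manual", "operator", "runbook", "human error"],
}

def classify(text: str) -> list[str]:
    """Return every theme whose keywords appear in the postmortem text."""
    lowered = text.lower()
    return [theme for theme, words in THEMES.items()
            if any(w in lowered for w in words)]

counts: Counter[str] = Counter()
for path in Path("postmortems").glob("*.txt"):
    counts.update(classify(path.read_text(errors="ignore")))

for theme, n in counts.most_common():
    print(f"{theme}: {n}")
```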

Morgan VanDerLeest: Interesting. 200 public postmortems. So 200 [00:08:00] situations where there was substantial customer impact, because otherwise you typically don't publish a public postmortem. You've got my attention. Go on.

Eddie Flaisler: While I'm sure there are many ways to slice this data, I found that the root causes most cleanly fit into four buckets. Now, I'm gonna oversimplify the hell out of this.

Eddie Flaisler: But for the purpose of this discussion, I'm going to call the bottom two buckets, in terms of prevalence, the nature of the beast. These buckets are, one, issues with capacity exhaustion, scalability challenges that are not architecture related, and failed resource planning.

Eddie Flaisler: Two, failure of either third party components or a very nuanced issue with upstream dependencies such as network infrastructure.

Morgan VanDerLeest: Every SRE on the planet wants to kill you right now.

Eddie Flaisler: I know, hear me out. Capacity planning and designing systems and particularly infrastructure to scale is an art form. It is incredibly complex and thoughtful work that can lead to beautiful results. It's not a given that we'll fail to scale. [00:09:00] That's not what I mean by nature of the beast. My point is that every single person who's worked in this domain long enough knows that this is also the area with the most visibility into dollar cost of engineering decisions.

Eddie Flaisler: I can configure my cloud provider for unlimited instances of everything with the highest tiers and a ton of reserved capacity, and I will go bankrupt within a month at most. So you often need to make a lot of guesstimations on the use of the thing you're building while staying cost conscious. I'm not saying that's an excuse to be negligent, but I'm going to assume that when something doesn't scale in production one day, it's because estimations were wrong, and not because you deployed a single instance with minimal resources to address a problem statement that warranted dozens of clusters around the world.

Eddie Flaisler: That is why we're not talking about this category today. I want to stick to decisions that are less problematic by design.

Morgan VanDerLeest: I know we're not going to get into this, but just as a side note here, this is why it's also helpful to have your engineers not solely focused on the feature that they're delivering, but also on how this thing impacts the business, so [00:10:00] that you can bring that cost understanding into planning projects, releases, and the actual work that's done. It's not just a pure engineering-to-engineering comparison. It's also: what is this going to cost the business to do? Again, we don't need to get into any more detail than that, but engineers do need to be cost conscious.

Eddie Flaisler: That is exactly right. It's so foundational to the work.

Morgan VanDerLeest: Indeed. So what about the dependency failure?

Eddie Flaisler: So for this one, the decision to not focus on it in our analysis is less obvious. Defensive design when coding around third party dependencies is something I expect from every senior engineer. So I don't take for granted that it failed, but I will say, and maybe I have a lot of empathy here because I've struggled with that myself,

Eddie Flaisler: that when you're integrating with software that isn't yours, software whose code you typically don't have access to, and where more often than not you might not even have the matching skill set to understand how it works, it's really difficult to avoid issues. I gave network infrastructure as an example because I myself have been [00:11:00] in situations where no one was pressuring me for time, but as an application engineer, I was simply not equipped to fully appreciate the packet level issue I was seeing. So again, let's stick to situations where it's not easy to excuse problems.

Morgan VanDerLeest: Alright, I'm on board with taking these two buckets out of the conversation for now. How many of the postmortems had root causes in these buckets anyway?

Eddie Flaisler: 29 percent. That's it. The other 71 percent is what I want us to spend our time on today, and it's also related to the Columbia case.

Morgan VanDerLeest: Go for it.

Eddie Flaisler: Okay, 71 percent of the root causes had five themes. Here they are, not in any particular order. One, logical error in code or design, which was not caught due to either missing tests or lapses in review. Two, configuration changes that did follow due process, but were still incorrect. Three, insufficient monitoring and alerting, which could have enabled us to catch typical issues before they become disastrous.

Morgan VanDerLeest: Hashtag Columbia.

Eddie Flaisler: That's right. Four, either a breach in [00:12:00] security or a fail-closed situation which accidentally made the system inaccessible. And five, human error that stemmed from either lack of process, lack of discipline to adhere to the process, or ineffective communication patterns between people and organizations.

Eddie Flaisler: For example, making a change in the production environment directly from the cloud management console.

Morgan VanDerLeest: Hold it, Eddie. The stemmed part in five actually sounds weird. I mean, sure, the person wasn't disciplined enough, but accessing the cloud console for production with admin privileges, and in an established company, that sounds like the lack of minimal guardrails is more of the problem.

Eddie Flaisler: That's exactly right. And because all five themes are so deeply intertwined, I believe they actually fall into just two buckets.

Morgan VanDerLeest: Ah, I can think of the maturity of the software and the level of deliberation in the organization.

Eddie Flaisler: I like the sound of that. Say more.

Morgan VanDerLeest: Well, as you said, each one of these [00:13:00] themes is deeply connected with all the others. The production access example touches discipline and process, but also guardrails. If configuration changes followed due process but still brought down the system, did we truly have the level of observability and the deployment architecture to minimize impact, with things like effectively tested staging and canary? Maybe we did, but chances are we didn't. And also, why wasn't the developer disciplined? Why did some message about quality concerns need to surface to the top and didn't? Why was an alert that actually was in place missed? Is it alert fatigue coming from too much noise, or something else? So there's an aspect of things that were in our sphere of control to do in the software, and we didn't, and an aspect of the way we work together not necessarily being conducive to the best outcomes.

Eddie Flaisler: I think you're spot on with this. And that actually leads to the reason I wanted us to take a step back from Columbia and dive into this. You see, Morgan, I like to start from the assumption that people are not lazy, incompetent, or both. There are always mishires, but [00:14:00] if we are deliberate, to use your term, about performance management and people analytics, which, by the way, we covered in a lot of depth in a previous episode,

Eddie Flaisler: you end up with individuals who are here because they want to do a good job and they have the minimum skillset required to do that.

Morgan VanDerLeest: I noticed you mentioned performance management, but not hiring.

Eddie Flaisler: Yes, because I woke up a long time ago from the illusion that we as an industry present company included, actually know how to properly vet for the right talent. There are still some proven techniques, but I prefer to leave that conversation to some other time.

Morgan VanDerLeest: Fair. Continue.

Eddie Flaisler: Okay, so if you start from this assumption about people, engineers and managers alike, you end up with just a handful of reasons for immature software or ineffective ways of working in an organization. One, we don't know what we don't know. So we can be very skilled and well intentioned, and still, we can't know everything, so we don't realize that something can pose a problem, or even if we do, we think it's just something we need to accept [00:15:00] because we are not aware of a solution. We don't know this design pattern which can solve that, or that library, or this protocol that is very effective in catching errors when designing a distributed system. We just don't know. Two is a situation where we, the people directly responsible for the design and implementation, know something might create a problem and what needs to be done to address it, but we don't think it's important enough to spend time on.

Eddie Flaisler: Three, we know it's a problem. We know how to address it. We want to address it. But there's no appetite for that from those who manage our time and budget. And I'm totally including in this the type of work a manager will do as well. I've encountered situations where leaders wanted to change how incident response was handled, or wanted to execute a shift left on quality as we discussed in the developer productivity episode, and they couldn't.

Eddie Flaisler: There was no alignment on that from above, even though it was a very good solution for a known issue.

Morgan VanDerLeest: One of the biggest things this brings to mind for me is that constant [00:16:00] tug of war between speed and quality. Because that's essentially what this all boils down to. To take a separate but hopefully helpful example, and we might have actually brought this up before: manufacturing plants in the late 1900s. There was a big emphasis on safety toward the latter half of the century, getting closer to the 70s, 80s, 90s, and by putting the emphasis there, they ended up driving a lot more productivity and success, because they were catching issues earlier in the process, issues that would otherwise stop work when something broke or someone got hurt. Now, software engineering is not the same as manufacturing. You don't have us handling heavy machinery that could severely injure or, goodness, kill someone. But it's a similar thought process for how we look at the way work happens. And if we just force trying to get the work done as quickly as possible, you cut corners.

Morgan VanDerLeest: You don't realize that you know things, or you think that you don't have the budget or the [00:17:00] time to do something right. How many meetings have you been in where we know the right way to do this, but we're not going to do it that way because we need to get it done faster? And then we wonder why there are so many issues with things. We've built it into our processes; we've essentially codified poor quality into the way that we work. And then we're flabbergasted when a big incident happens, or when something occurs and we let down our customers, because we're not doing that defensive coding and that really craftsman-like approach throughout our normal process. We are essentially willing to accept the fact that big issues are going to happen because we moved so quickly in the past.

Eddie Flaisler: I find it very interesting that you already brought that up because I was definitely planning on digging deeper into situations like this when we talk about what to do with each of the three reasons.

Morgan VanDerLeest: So you basically identified three meta root causes for the five root causes your analysis found. Why is that relevant?

Eddie Flaisler: It's relevant because our listener asked us [00:18:00] about distinguishing between accountability and blame and about what to do when engineering is aware of an issue but doesn't have the agency to address it. This question doesn't come in a vacuum. Listen to the story. The story is that in this organization they're in right now, you address mistakes, meaning an act or decision that proved to be incorrect and produced undesirable results, by either doing the tactical fix and otherwise pretending it never happened,

Eddie Flaisler: or identifying a head to put on the chopping block. This seems to be a pretty common pattern at workplaces, and understandably it causes a lot of confusion and resentment within the team. The thing is that whenever I dig deeper into an organization with such a pattern, I don't normally find ill intent or lack of intention to properly address mistakes.

Eddie Flaisler: What I do find is lack of nuance because not all mistakes are created equal. And I can tell you that every single time I had an organization adopt this mindset where we consider the meta root cause when thinking how to handle a mistake, [00:19:00] it removed a lot of confusion and frustration.

Morgan VanDerLeest: All right. I think we will need to dig a bit deeper into this because I'm not sure everyone is following. But before I do, let's use this framework of the three meta root causes on the Columbia disaster, just to give people a more concrete example of what you're saying. So according to the investigation board report, the main contributor to the Columbia disaster was your number three, no appetite to preemptively fix from those who manage our time and budget. And there was also a little bit of two, we don't think it's important, from middle management as well.

Eddie Flaisler: Excellent. Given that you understand this, now you tell me, what was the most powerful recommendation the board could have made to minimize chances of something like this recurring?

Morgan VanDerLeest: Some checks and balances for the leadership team.

Eddie Flaisler: And more specifically,

Morgan VanDerLeest: Ahhh, risk. Establishing a risk function.

Eddie Flaisler: A risk function. Because you see, Morgan, very few decisions are all good or all bad. I see nothing wrong with the heads of [00:20:00] NASA wanting to meet their obligations to international partners, to not let billions go to waste and to achieve something of historical significance.

Eddie Flaisler: But the question is, at what cost? And this is precisely what risk modeling comes to solve. The science is there. Computational physics can estimate the risk of material breaking. But you need to do that calculation. And more importantly, you need to have the authority to say, sorry, not good enough, no go. Now you see how thinking about the meta root cause led you to ask questions about organizational structure and what functions exist, instead of blurting something like everyone who touched this needs to be fired.

Eddie Flaisler: And to NASA's credit, that's actually what happened. Changes were mostly systemic. Not many people lost their jobs because of Columbia.

Morgan VanDerLeest: This is starting to make a lot more sense. I like how we went from meta root cause three to the decision of what to do. Let's talk a bit about meta reasons one and two as well.

Eddie Flaisler: Let's, but [00:21:00] before we do, I think we have to make number three relevant to companies that aren't at NASA scale and can't just spin up a group of physicists and mathematicians.

Morgan VanDerLeest: That's fair.

Eddie Flaisler: Okay, so the thing to realize about the no appetite meta root cause is that, again, it doesn't imply ill intent, negligence, or I don't know, someone who is just focused on their narrow agenda and refuses to see anyone else.

Eddie Flaisler: It doesn't necessarily imply any of that. It can also simply mean that the people with the full business picture in their heads, who are aware of all the constraints, are making a decision to bet on taking from Peter to feed Paul, on not investing in tech debt right now, because if I don't show something dazzling to my prospective investors sometime soon, we're all going home. There is nothing wrong with that. It's not a managerial failure. What is a managerial failure is making this decision subconsciously instead of articulating it to yourself and to your team.

Eddie Flaisler: Because when you just do it subconsciously, two things [00:22:00] happen. One, you never realize you need to come up with a contingency plan for when things go south, just like you described, Morgan. What do you do about customers? What do you do about stakeholders, partners, board members? And two, when something does happen, you need to justify it to yourself.

Eddie Flaisler: And since, as the saying goes, we judge ourselves by our intentions and others by their behavior, you end up building a story in your head that you're surrounded by incompetence, and that unfairly breaks your trust in the team.

Eddie Flaisler: Subsequently, you'll start behaving in ways that will break the team's trust in you. So the bottom line is, if you can't afford a crew of risk modelers, totally fine, but make sure to fully own decisions made with missing information, and prepare for the worst, always.

Morgan VanDerLeest: So this for me comes down to simplicity and how the years of zero interest rate policy really changed the way that software engineering businesses worked. We had a period of time where again, money [00:23:00] lost all value. And so we said, screw it. Risk is fine. We'll just throw money at the thing. And now we're in this scenario where we're having to drag our way back out of this mindset of risk doesn't matter. It does. It's always mattered, but it just made a little bit more business sense for a while because money was cheap. Not the case anymore. Probably not going to be the case for a long time again, until we forget about this in 10 to 12 years, but this was never gone. Risk has always been a thing. We just consciously chose to ignore it versus building it into our regular processes.

Eddie Flaisler: Totally.

Morgan VanDerLeest: All right, what about the second meta root cause? We, the people doing the work, don't think that's important enough. I would say that can definitely prove to be a mistake and not just a decision we need to accept, no?

Eddie Flaisler: Absolutely. And what's worse is that deciding something isn't important is such a core part of a professional's job.

Morgan VanDerLeest: Prioritize, prioritize, prioritize.

Eddie Flaisler: Bingo. There is nothing wrong with deciding something isn't worth your investment. [00:24:00] Again, assuming the right drive and competence and that you're doing something deemed more important in the allotted time.

Eddie Flaisler: Now, if the prioritization exercise led to a mistake, the exercise needs to be questioned for areas of opportunity. Is the right data available? If it's not, that's okay, but are we utilizing the wisdom of crowds technique we previously talked about to compensate for missing data?

Eddie Flaisler: And if we did gather as much public opinion as we could and the results were still misleading, is it because shit happens, or because for some reason we don't hear what people actually think?

Morgan VanDerLeest: Psychological safety alert.

Eddie Flaisler: I'll let you do the honors.

Morgan VanDerLeest: We've talked about how important psychological safety is to high performance, and also how challenging it is to achieve and how easy it is to break. A leading researcher in this field is Timothy Clark. His framework describes four stages of psychological safety: inclusion safety, learner safety, contributor safety, and challenger safety. Essentially, first you work on a sense of belonging and acceptance within the team. [00:25:00] Then you create an environment where individuals feel safe to ask questions and seek feedback. Then you provide opportunities for individuals to contribute their unique perspectives, ideas, and expertise without fear of judgment or retribution. And finally, you help people feel comfortable to challenge the status quo and drive positive change by demonstrating open-mindedness, receptiveness to feedback, and a willingness to adapt. These steps represent a natural evolution of your team members towards that person who can give you the input you need. But each step requires work, and no step can be skipped.

Eddie Flaisler: Which really makes me wonder what's with all the recent elimination of DEI programs in the industry. I'm getting really confused by the argument that, what was it? The legal and policy landscape surrounding DEI in the U.S. is changing, which is popping up in all these press releases.

Eddie Flaisler: Not sure how that's relevant. Anyhow, that's for another episode.

Morgan VanDerLeest: You know, my fortune teller prediction is that this is a thing we're going to deal with potentially for the next couple of years. [00:26:00] The leaders right now that are cutting these programs are the folks that were successful in the ZIRP time period.

Morgan VanDerLeest: And they're going to do damage to the teams and groups that work within them. Until the pain of that is felt in a couple of years, and different folks end up trickling up into management and leadership after that. That's my fortune cookie belief here.

Eddie Flaisler: I find this really astute, and it also reminds me of a conversation I was having recently. We talked about how there's a stark difference between the work culture and expectations of employees in the current or recent generations and those of older generations, right?

Eddie Flaisler: And like, they expect different treatment, they expect different work conditions, different work scope, different ways to show appreciation. And there's always that complaint coming from older generations: where is the motivation? Where is the willingness to work hard, to invest in the business even though it's not yours? And the way this is very relevant to today's episode is that this [00:27:00] situation, where managers are faced with a younger generation who has very different expectations than what people had when these managers started managing, is the response to everything that has happened in previous generations.

Eddie Flaisler: Everything culminated in this moment. So the quote unquote mistakes are being paid for right now.

Morgan VanDerLeest: Yes, very much so. I love that. It's very easy to over index on the thing that's right in front of your face or the thing that's the easiest to see, and we kind of forget that there were, you know, weeks, months, years of effort, culture, mistakes, cutting corners that led to right now. So if we're talking about mistakes, you can't just look at the thing that just happened. How did the culture come about that created the situation we're in now, too?

Eddie Flaisler: There's always a price to pay.

Morgan VanDerLeest: Always. So Eddie, I'm curious about something with regard to this "not a good use of our time" reason. What if you just don't have anyone to ask for their opinion, either because there's [00:28:00] literally no one else in your domain, because you're the only engineer working on this, or for any other reason you can't involve others, let's say with a difficult financial decision?

Eddie Flaisler: I think that's exactly where values, principles, and associated behaviors come in handy.

Morgan VanDerLeest: Please don't say as we discussed in the values episode.

Eddie Flaisler: I mean, but we did. Anyhow, this is where they come in handy and why it's so important to build a culture that actually embodies them and doesn't just ignore them or, worse, weaponize them. Aligning on organizational values, principles, and associated behaviors with the broader team is the leader's opportunity to get a playbook for what the right thing to do is when others can't be in the room to act as a sounding board.

Eddie Flaisler: If, for example, your organization cares deeply about self-serve and stewardship versus ownership, so the idea is that teams should be able to easily unblock themselves and help unblock others, then there's no question that if you need to decide what tooling to get rid of to meet the constraints of your [00:29:00] new reduced budget, anything related to live documentation or discoverability is very low on that list.

Eddie Flaisler: You would start by cutting something else.

Morgan VanDerLeest: I love that. It's not that there is necessarily a right answer to the thing, but this is our understood answer of how we tackle problems and you get more cohesiveness across your organization. Love it.

Eddie Flaisler: Exactly.

Morgan VanDerLeest: Yeah, Eddie, it seems like we kind of outlined a flow chart here around decision making. Can you repeat that and do it succinctly, please?

Eddie Flaisler: So, how about I make up a mnemonic device and we name it after me because I'm not sufficiently full of myself as it is. Let's call it Eddie's DWARF Waterfall. D is for data. W is for wisdom of crowds.

Eddie Flaisler: AR is for acceptance and reassurance. And F is for fundamentals. So you start with data. If you don't have that, you rely on wisdom of crowds. If that proves ineffective, you ensure acceptance and reassurance, so you can get more signal next time. And if you're stuck making the decision alone, you follow fundamentals, [00:30:00] values and principles.

Morgan VanDerLeest: Oh my God, Eddie, that sounds really bad.

Eddie Flaisler: Which is why everyone will remember it.

Morgan VanDerLeest: I'm gonna have nightmares about Eddie's DWARF waterfall. Anyhow, before we move on: you keep repeating the assumption of competence and drive. What if low drive is a cross-team problem because of burnout, low trust in the business leadership, stuff like that? And that does prove to be a cause of negligence that increases mistakes. That's not unheard of.

Eddie Flaisler: It sure isn't. Look, this is a very difficult situation, and I don't think I know the right answer. I will say that, at least in my experience, you don't handle this collectively. You handle it surgically. Even if it's more than one person, when it gets to a situation where someone is in a mental and emotional state that causes them to make repeated mistakes, you work specifically with the individual.

Morgan VanDerLeest: Say more about that.


Eddie Flaisler: Well, to me, when someone who is competent and has proven themselves as engaged and prolific in the past suddenly isn't, [00:31:00] the isn't comes in two flavors, which, assuming you've known the person for quite a while, I argue you can tell apart: can't and won't. Can't is visible burnout. They're doing their best, but they're tired and lack focus and stamina for, I don't know, reasons related or unrelated to work.

Eddie Flaisler: Won't is I'm not interested. You know, I worked with an engineer once who went out to a movie during their on-call shift, left the phone at home, the service crashed, the secondary had to jump on it because, you know, the primary was nowhere to be found, and when I brought it up in our one on one, he gave me the most bored look and said, Yeah, I, uh, didn't feel like taking the phone. It's a weekend, you know? That's won't.

Eddie Flaisler: Now, for can't, I have a typical all in or all out approach that I share with the person. Don't look at me like that. It's not as heartless as it sounds.

Eddie Flaisler: My point is that I will help you with paid time off. I will help you with utilization of company's benefits to the extent that I can. I [00:32:00] will do everything in my power to help you.

Eddie Flaisler: I truly want you to take care of yourself because we will all benefit from that. But you need to decide that you take that time off, that you use those benefits like short term disability. I can't do that for you. If you decide not to and you continue working in the same capacity you are now, that means you're all in and you need to somehow get yourself together and start showing up.

Eddie Flaisler: Not sure how, but you have made a decision. You need to own it.

Morgan VanDerLeest: I experienced this with you. I remember you telling me about an engineer at the time. We can do everything that we can to support, but unless they're going to take the time off and actually take advantage of these benefits, what are you going to do?

Eddie Flaisler: Totally.

Morgan VanDerLeest: I will say you may know the difference, or recognize the difference, between can't and won't, but that doesn't mean either is necessarily easy to deal with. I think, even if you ignore the fact that this podcast and what we believe in is people first, at the end of the day, managers are people and you care about the people on your team. Often we're going to try to do whatever we can to help. [00:33:00] But at the end of the day, you're a hundred percent right. You can't make somebody do something. In fact, we shouldn't make somebody do something. You can offer them the options that they have available, and that's really all you can do. Okay, and how about the won't?

Eddie Flaisler: Won't is a little more nuanced. First, as you know, I never expect people to blindly trust the system and to show up while being abused. However, when something is wrong, I do expect the professionalism to do two things. One, think and articulate to me what's wrong. I can't do anything if I don't know what bothers you.

Eddie Flaisler: It sounds obvious, but unfortunately it happened to me more than once that someone who my entire leadership team considered a star and we collectively did everything we could think of to grow them, to compensate them well, and to make them feel appreciated, showed up at my proverbial doorstep one day to let me know they were resigning.

Eddie Flaisler: And they had this glee in their eyes that only deep resentment can bring. They refused to tell me what was wrong. [00:34:00] And one of the things that taught me is that articulating problems is a skill. Not everyone has it. Two, you know, I always say, take it out on me, but don't take it out on the team. If the fact you don't care means I get an earful from my boss once or twice, whatever.

Eddie Flaisler: If the fact you don't care meant others on your team had to work the night or the weekend, and you still don't show any type of remorse, goodbye. That doesn't work for me.

Morgan VanDerLeest: You know, I'd like to push back on the point about articulating problems being a skill. Yes, I agree, it's a skill. But if you don't have the trust built up with that person for them to share their problems anyway, you're not going to get what you want. And often, if you've got the trust built up, they can say the wrong thing, and they know that they can say the wrong thing, but that will get you closer to what the actual problem is. And that's helpful. You need to have an environment with your reports where they can say a thing out of frustration, have you not hold it against them, and then get to the next stage of that, which [00:35:00] is actually, oh, this is going on, that's actually what I'm upset about. And you can't get to that, or most people can't get to that, unless you express it in some way.

Eddie Flaisler: That I will not deny.

Morgan VanDerLeest: All right, at this point, I think we did a decent job covering the last two meta root causes. Let's conclude our conversation with the first one: we don't know what we don't know.

Eddie Flaisler: I feel like this one is the one meta root cause people actually talk a lot about. If you think about it, postmortems were created exactly because someone identified that the best thing you can do when an issue happens, sometimes more than fixing it, is learning from it. Postmortems are a great tool as long as you actually identify action items and execute on them, unlike what our listener is describing, where it's obvious that the postmortem review was held just to hold a postmortem review. When the conclusion from investigating an incident is that we didn't know what we didn't know, for example, we did not realize this data store would behave like this under multiple concurrent connections, or we were not aware that this probabilistic algorithm we found on Stack Overflow, [00:36:00] I just don't want to say ChatGPT again, might have great runtime but an error probability higher than we thought, you have to take action that is not only corrective, but also addresses the category of the problem. You now know that multiple concurrent connections do weird things to data stores sometimes.

Eddie Flaisler: Maybe we need to reconsider our architectural redundancy scheme to make sure we don't overwhelm anything. Maybe we need better observability if we got to a point where something actually crashed before we realized something was off. Or if the call to the datastore was so taxing, are we sure our queries are optimized?

Eddie Flaisler: You see what I mean? The learning isn't, next time read the documentation better so you know it doesn't like multiple connections.
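(Editor's note: for the concurrent-connections example above, one common corrective pattern is to bound how many queries can hit the data store at once. This is a minimal sketch only; the query_datastore stand-in and the limit of 20 are illustrative assumptions, not any specific driver's API or the approach from the episode.)

```python
# Minimal sketch: cap concurrent calls to a data store so a burst of traffic
# can't open an unbounded number of connections at once.
import asyncio

MAX_CONCURRENT_QUERIES = 20

async def query_datastore(sql: str) -> list:
    # Stand-in for the real driver call; assumed for illustration.
    await asyncio.sleep(0.01)
    return []

async def bounded_query(sem: asyncio.Semaphore, sql: str) -> list:
    # Wait for a free slot instead of piling yet another connection onto the store.
    async with sem:
        return await query_datastore(sql)

async def main() -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)
    # Even with 500 callers, the data store never sees more than 20 queries at once.
    results = await asyncio.gather(*(bounded_query(sem, "SELECT 1") for _ in range(500)))
    print(len(results), "queries completed")

if __name__ == "__main__":
    asyncio.run(main())
```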

Morgan VanDerLeest: And this is a perfect example of why a growth mindset is so, so important in any setting where we try to create something. Because you need to be open to the fact that learnings from a failure might be larger in scope or deeper than knowing not to repeat the specific mistake that we made this time.

Eddie Flaisler: You know, I don't normally micromanage people's language, including when they use [00:37:00] profanity, as long as it's not targeted or hurtful to someone. But I had an engineer once who loved using the word infeasible. This is infeasible, that is infeasible. One day I cracked and told them that the next time they used that word, the promotion they wanted so badly would join the infeasible list.

Eddie Flaisler: I know I'm not proud of it, but you see my point.

Morgan VanDerLeest: Good Lord, Eddie. But at the same time, your language, your attitude, the way you show up in meetings and to other people influences the way that you work and think. So if you're constantly approaching everything with, oh, that's infeasible, it bleeds into the way that your subconscious works and into the way that your team thinks things are going to work, and things are going to actually become infeasible for your team because they can't think of any other way.

Eddie Flaisler: And it did, which is exactly why I brought that up.

Morgan VanDerLeest: But still, keep it in your head next time. Honestly, I hate to be the one to bring this up, but the listener did talk about the line between accountability and blame, and that made me wonder about learnings that are more around individual behavior. Let's assume we're done investigating the [00:38:00] incident, and while we did identify technical or systemic causes, there are also one or more people who chose a course of action that you, as the organization's leader, consider inappropriate. We gave the example of someone accessing production from the UI and making a change there. Let's play that forward, because I feel like people may or may not agree that that should be held against the developer.

Eddie Flaisler: Okay, so first, across all situations and circumstances, if a manager believes that is the case, then the matter needs to go through a feedback process, like any other behavior which resulted in a poor outcome. Of course, offline and discreetly. We're not in the business of shaming anyone, no matter what happened.

Eddie Flaisler: By the way, I said process, and that doesn't exempt anyone who doesn't work for a large corporation. No person likes to hear they did something wrong, even if the company size is three. If you're a thoughtful leader, you provide feedback in a structured manner. Second, I think what you're asking me here is if there's a universal definition of what is good engineering judgment.

Eddie Flaisler: I don't know [00:39:00] about universal, but I can tell you mine. I expect engineers to be disciplined individuals who realize the gravity of their actions and the blast radius of a potential error. Since gravity needs to be experienced, I can give a pass to interns or someone in their first year at their first job developing live software, but even that is for the first strike only.

Eddie Flaisler: Otherwise, while I don't expect you to know all the things that can go wrong, I do expect you to understand that things might, which means that if you're making a change that deviates from the dictated protocol, you don't make the decision on your own, regardless of seniority. Now, if you stuck to the protocol, demonstrated thoughtfulness, but still ended up with a bad outcome because of a knowledge gap, that becomes a gray area.

Eddie Flaisler: I like to treat people fairly, and the thing is that, as a software engineer, there are so many different things you can end up touching and not touching, that I can't tell you I would agree with any standardized definition of you should know this.

Morgan VanDerLeest: That's a fair point. Do you have an example where you felt [00:40:00] conflicted about whether or not the engineer should have been expected to know something?

Eddie Flaisler: I had a very interesting situation where two of my top engineers, who honestly were some of the most skilled technologists I've met and far exceeded my own caliber, designed a retry mechanism for some job orchestrator. It was really fancy, with storage and everything.

Eddie Flaisler: So in case the service is not responding, the requests don't get lost and are replayed when the service is ready again. It worked great, but then one day the orchestrator was down for some time, and when we were finally able to bring it back up, the retry mechanism caused a thundering herd problem.

Eddie Flaisler: All the retries were triggered at the same time. It overwhelmed the orchestrator, and it died again. We basically self-DDoS'd. I'm going to leave out of the conversation whether or not good load testing would have caught that, because the relevant question is this: techniques like exponential backoff and randomization, which exist to prevent exactly this situation, should theoretically be the bread and butter of whoever works on distributed systems.

Eddie Flaisler: But they didn't know that. Does that take away from [00:41:00] all the incredibly complex work they've done successfully for the organization? This is why I don't think a manager has a choice but to look at the broader performance picture when deciding what to do with an engineer whose choices were found to be closely related to an incident, and to handle that separately from the postmortem remediations.
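(Editor's note: since Eddie names the technique, here is a minimal sketch of retries with exponential backoff and randomization, often called full jitter. The retry_with_backoff helper and its parameters are illustrative assumptions, not the orchestrator code from the story.)

```python
# Minimal sketch of retries with exponential backoff and "full jitter",
# so replayed requests don't all fire at once when a service recovers.
import random
import time

def retry_with_backoff(call, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Call `call()` until it succeeds, sleeping a random delay whose upper
    bound grows exponentially with each failure, which spreads retries out."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Cap the exponential window, then pick a random point inside it.
            window = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))

# Usage: each retrying caller computes its own random delay, so when the
# orchestrator comes back up the replayed jobs trickle in instead of
# arriving together and knocking it over again.
if __name__ == "__main__":
    def flaky():
        if random.random() < 0.7:
            raise ConnectionError("orchestrator still unavailable")
        return "ok"

    print(retry_with_backoff(flaky))
```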

Morgan VanDerLeest: You mentioned this earlier, and I think something that's really important to tie together here is the importance of looking at everything through that learning lens. Because at the end of the day, if you want to have the best possible team and outcomes, you cannot rely on every person individually needing to go through and learn things on their own, from their own personal experience. We really should be leaning on: if one person learns something, that spreads to as many people as possible, or as makes sense, because of the way that we have built our culture and processes. That's a good reason why we have postmortems, so that the individual who was most closely associated with a thing isn't the only person who knows the problem. More people know it, more people are aware [00:42:00] of it, more people are able to bring that new knowledge to future situations. That should apply to everything. Everything that we do should be based much more around, how do I learn from what this person did so that I don't have to go through the same thing that they did? And a big emphasis for a manager on a team should be making sure that you're building the environment so that the person who quote unquote failed feels comfortable enough to share about it, so other people can learn from it, and so that other people feel like they should go out and try to learn from the people who didn't have things go so well for them. And by really building that into your culture and processes and the way that you think, you get to a situation where all of these individuals, who if they each had to follow the exact same path would take forever to get there, because you'd have to just make mistakes forever to learn all these things, are now learning from other people's mistakes. You're benefiting quickly, and you're going to get to the point where you have this new knowledge mesh layer on your team of folks who, you know, are going to recover more quickly, avoid problems in the first place, call each other out as something may happen. And you're going to see a [00:43:00] lot more success and more positive outcomes for your team and your company.

Eddie Flaisler: And that is why we always say that the best mentorship is authentic mentorship. The best mentors I've had were those who did not pretend to know everything or be perfect. And as I've watched them tackle failure, I learned the most.

Morgan VanDerLeest: Such a great point. Eddie, before we wrap up for the day, I thought we could talk a bit about the place of mistakes in engineering cultures. Silicon Valley is the birthplace of the principle move fast and break things, which kind of became emblematic of the broader tech industry ethos. How does that align with the super detailed, somewhat conservative approach we're describing here for managing mistakes?

Eddie Flaisler: Ring the boxing bell, it's time for a rant.

Eddie Flaisler: The problem with move fast and break things is that it's an important principle, but not in equal parts. Velocity is very important, but its optimization has to happen within the constraints of your market and the reality of the business.

Eddie Flaisler: I believe this principle was born at Facebook, right? And what is Facebook? A [00:44:00] viral, heavily financially backed, B2C company, which was also started at a time when a web page actually loading was in and of itself a monumental achievement, and by the time the bar increased, the entire planet was already hooked.

Eddie Flaisler: Yeah, they can totally break things with fairly negligible business consequences. But when you're an enterprise targeted company bound by contractual and regulatory obligations and know that the next time you're down, your biggest customer switches vendor, then maybe not so much.

Eddie Flaisler: None of this is to say that mistakes should be avoided at all costs. Not only is this a needlessly stress-inducing mindset, but being terrified of mistakes is also a missed opportunity for a manager. Of course you need to manage risk well and to be proactive, but leaving something to the discretion of the engineers is not only important for their growth and sense of ownership, but also for your ability to build trust in them.

Morgan VanDerLeest: Totally because if you never give them the opportunity to prove they can act well independently, [00:45:00] your belief that they can't will never be challenged.

Eddie Flaisler: That's right.

Morgan VanDerLeest: I think we've had enough for today. Thank you, Eddie.

Eddie Flaisler: Thank you, Morgan.

Morgan VanDerLeest: And to the listeners, if you enjoyed this, don't forget to share and subscribe on your podcast player of choice. We'd love to hear your feedback. Did anything resonate with you? More importantly, did we get anything completely wrong? Let us know. Share your thoughts on today's conversation to people driven development, that's one word, peopledrivendevelopment@gmail.com, or you can find us on X or Twitter @pddpod. Bye y'all.

Eddie Flaisler: Bye bye.

Creators and Guests

Eddie Flaisler (Host)
Eddie is a classically-trained computer scientist born in Romania and raised in Israel. His experience ranges from implementing security systems to scaling up simulation infrastructure for Uber’s autonomous vehicles, and his passion lies in building strong teams and fostering a healthy engineering culture.

Morgan VanDerLeest (Host)
Trying to make software engineering + leadership a better place to work. Dad. Book nerd. Pleasant human being.