DevOps Institute

[E21] Adopting Observability Principles

DevOps Basics, Humans of DevOps, SRE

August 18, 2020

Charity Majors, Co-Founder and Engineer of honeycomb.io, shares her journey of discovery and realization about how to adopt observability principles from control theory to ITOps, including how observability has shifted the way developers manage and support their code. Charity also talks about the culture of observability and why it’s important.

The lightly edited transcript can be found below.

Intro:

You’re listening to The Humans of DevOps Podcast, a podcast focused on advancing the humans of DevOps through skills, knowledge, ideas and learning or the SKIL Framework. Here’s your host, DevOps Institute CEO, Jayne Groll.

Jayne Groll:

Hi, everyone. It’s Jayne Groll, CEO of the DevOps Institute. Welcome to another episode of The Humans of DevOps podcast. I’m excited today to have Charity Majors of Honeycomb.io with me. We’re going to talk a lot about observability. Hey, Charity.

Charity Majors:

Hi. Thanks for having me. It’s nice to be here.

Jayne Groll:

Thank you for joining us. Charity, why don’t you tell us a little bit about your background? I know you coined observability. So as part of that, why don’t you tell us kind of your thought process in terms of the origins?

Charity Majors:

Yeah, totally. I am a dropout music major from Idaho who hadn’t touched a computer until I went to college. I’ve made my niche as an engineer sort of in operations. I’m an ops engineer. My niche has always been I like to join startups when they’re early and young, and I help them grow up, right? They have a product, but it’s not really production ready. I like firefighting. I have ADHD. I’m kind of an adrenaline junkie. So I’ve done that a few times over the course of my career.

Charity Majors:

Most recently before Honeycomb, I did this for a company called Parse, which was a mobile backend-as-a-service, kind of like Heroku for mobile. We were later acquired by Facebook. It was a real roller coaster. We were using some technologies that were probably the right decision when we chose them, but we quickly outgrew them and they became real liabilities.

Charity Majors:

But the core problem that I was wrestling with around the time we got acquired was I was coming to the horrified conclusion that we had built a system, doing all the quote unquote right things from the smartest engineers in the world, and it was basically undebuggable. Just wasn’t even possible.

Charity Majors:

If we had seen a problem before, then we could recognize it quickly the next time. But the nature of this beast was that almost every time the pager went off, it was something different. Something we’d never encountered before. With logs, you can find the log line if you know what you’re looking for, and if you happened to emit it. Metrics can tell you the overall shape of, “Is it healthy? Are there spikes of errors?” Whatever. But you can’t drill down to see which events have errors or what else they have in common with other errors.

Charity Majors:

We tried shipping some home brew stuff. We tried all of these things. After we got acquired by Facebook, we started trying some of the internal Facebook tools. We eventually tried this tool called Scuba, which was this butt ugly, aggressively hostile to users tool. But it did one thing really well, which was it let us slice and dice in near real time on dimensions of high cardinality.

Charity Majors:

If we have a million app IDs, right? Million applications. Any one of them could hit the iTunes store and just go nuts at any time. So using Scuba, we were able to go, “There’s a spike of errors. Let’s break down by app IDs, see which ones have the most errors. Okay, this one has the most errors and those errors started there. So why?” And then you can start looking at, “Well, are all of the endpoints erroring, or is it just some of them? Maybe it’s just the write endpoint.”

Charity Majors:

So that tells you something. So you start, you’re like, “Okay, so is it all of the write endpoints that are erroring? Is it all of the queries that are writes that are erroring, or is it just some of them? Maybe it’s just the ones that are going to a particular shard or particular AZ. Or maybe it’s a particular query type that is associated with a change in the client library that just got shipped.” Or it could be anything. The point is the vast number of things that could go wrong.
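To illustrate the drill-down described here, the following is a minimal Python sketch. It is not Scuba or Honeycomb code: the event fields (app_id, endpoint, shard, status), the sample data, and the break_down helper are all assumptions made up for the example. It only shows the idea of grouping error events by one dimension, picking the worst value, and then narrowing by the next dimension.

```python
from collections import Counter

# Hypothetical request events; every field name and value is an assumption.
EVENTS = [
    {"app_id": "app-17", "endpoint": "/write", "shard": "s3", "status": 500},
    {"app_id": "app-17", "endpoint": "/write", "shard": "s3", "status": 500},
    {"app_id": "app-17", "endpoint": "/read", "shard": "s1", "status": 200},
    {"app_id": "app-42", "endpoint": "/write", "shard": "s2", "status": 200},
    {"app_id": "app-42", "endpoint": "/read", "shard": "s1", "status": 500},
    {"app_id": "app-99", "endpoint": "/read", "shard": "s1", "status": 200},
]


def is_error(event):
    """Treat any 5xx status as an error."""
    return event["status"] >= 500


def break_down(events, dimension, **filters):
    """Count error events per value of `dimension`, after equality filters."""
    matching = (
        e for e in events
        if is_error(e) and all(e.get(k) == v for k, v in filters.items())
    )
    return Counter(e[dimension] for e in matching)


# Step 1: which app has the most errors?
by_app = break_down(EVENTS, "app_id")
worst_app, _ = by_app.most_common(1)[0]

# Step 2: within that app, which endpoints are erroring?
by_endpoint = break_down(EVENTS, "endpoint", app_id=worst_app)
worst_endpoint, _ = by_endpoint.most_common(1)[0]

# Step 3: within that app and endpoint, which shard?
by_shard = break_down(EVENTS, "shard", app_id=worst_app, endpoint=worst_endpoint)

print(dict(by_app), dict(by_endpoint), dict(by_shard))
```

Each step is the same question asked with one more filter. With a million app IDs, this only works if the underlying store keeps the raw events and can group on high-cardinality fields quickly, which a list in memory obviously does not demonstrate.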

Charity Majors:

And then there are the sets of problems where it’s like, “We’re using a shared pool of API workers, a shared pool of push notification service, and we’re using shared databases. So the spike of errors that… Maybe all of the apps are erroring. Which one caused it? Which one caused it to go slow for everyone?”

Charity Majors:

These are the problems we were wrestling with. When we got stuff into Scuba, the time for us to answer these questions and find the answer dropped like a rock. From it’s very open-ended, we’re just going to start looking and God knows how long it’ll take, right? To seconds. Not even minutes. It wasn’t even an engineering problem anymore. It became a support problem. You just go click, click, click, click, click. There it is. Every time.

Charity Majors:

So this made a big impression on me, of course. But I’m an ops. As soon as it was fixed, I was onto the next thing. I didn’t really stop to think too much at the time about what exactly was it that enabled us to do just orders of magnitude better at debugging these systems, until I was leaving Facebook. I was planning to go be an engineering manager at Slack or Stripe, and I suddenly went, “Oh shit, I don’t know how to engineer anymore without this stuff that was built,” because it had become so core to how I perceive my systems. It’s like my five senses. I’m aware of it every day. It’s how I decide what to build. It’s how I know that what I shipped is working. All of these, it’s like seeing the world in HD, right? Instead of having a bandana over your eyes and just whacking at the pinata and hoping you’re in the right direction, which is what so much of systems debugging feels like, right? Instead if you can just see and just look, and it’s transformational.

Charity Majors:

I had never planned to start a company, but clearly this needed to be done, so we did it. Over the course of the first year, we had to write a storage engine from scratch, query planner from scratch, all this stuff. That wasn’t nearly as difficult as trying to figure out how to talk about it, right? It was very clear to me from the start we were not building a monitoring tool. Because monitoring is very… It’s thresholds, and it’s after the fact, and it’s, “Is the system roughly good enough?” Right? Metrics are the perfect tool. Dashboards are the perfect tool for that.

Charity Majors:

It was, I think, halfway through the first year that I was just so frustrated. I was staying up for days at a time. I couldn’t sleep. I was like, “How the hell do I talk about this? Every term in data is so overloaded.” That’s when I happened to Google observability. I had heard the term in college, and Twitter had a team called observability at one point. When I looked up the definition, it has this really beautiful heritage. It comes from mechanical engineering. It comes from control theory in which observability is the mathematical dual of controllability. It means can you understand the inner workings of this system just by observing it from the outside? Right?

Charity Majors:

That’s when I had just brain explosions. I was like, “Oh my God. This is what we’re trying to do,” right? We’re trying to help people cope with these unknown unknowns. These new problems that they’d never seen before. They don’t have this rich… All these dashboards and all these playbooks, because that’s a new thing. You’ve never seen it, right? Which means that you need to be able to understand and reason about the inner state of the system no matter what state it gets into, whether or not you’ve ever seen it before, quickly.

Charity Majors:

Observability, and when you translate it into software, that’s what it is. It’s the ability to answer any questions, understand the system state, key point, without shipping new code to handle it. Right? Because that implies that you could have predicted it, that you would have needed that data, right? And the whole point is you can’t predict it, and you just need to stuff a bunch of stuff in there that might someday be useful. Because when you need it, the world is not going to pause while you find what you’re looking for, and then just ship out something to show it to you. It’s like precognition. You don’t get to be a precog. So it’s all about shipping a lot of data. That’s where observability comes from.

Charity Majors:

Now, sadly, and I’ve written a lot about the technical definition of observability as I just defined it. There are a lot of things that proceed… If you accept that definition, then there are characteristics of observability tools that are different from monitoring tools, et cetera. The source of truth in monitoring is the metric, right? It’s a number with some tags appended.

Charity Majors:

The data structure that is the source of truth in observability is the arbitrarily wide structured event that contains all of the context for this request. The parameters that were passed in, anything you can see or infer about the environment, the language internals, the business logic, things like shopping cart ID, et cetera. It’s all held continuous in that one blob, which you initialize at the… When the request enters the service, you initialize it, pre-populate it with anything, you let it execute. And then right before it’s about to error or exit, you ship that off to your observability tool in that one particular blob. This just opens up so many vistas.
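The lifecycle described here (initialize on entry, add context while the request executes, ship one blob on exit or error) can be sketched in Python as follows. This is an illustration under assumptions, not Honeycomb's SDK: the WideEvent class, the handle_request function, the field names, and the sink callable are all hypothetical.

```python
import json
import time


class WideEvent:
    """One arbitrarily wide, structured event per request per service."""

    def __init__(self, **initial_fields):
        self.fields = dict(initial_fields)
        self._start = time.monotonic()

    def add(self, **fields):
        """Attach any context that might be useful later."""
        self.fields.update(fields)

    def finish(self, sink):
        """Record the duration and hand the single blob to `sink`."""
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        sink(json.dumps(self.fields, sort_keys=True))


def handle_request(request, sink):
    # Initialize the event when the request enters the service.
    event = WideEvent(
        service="checkout",  # hypothetical service name
        endpoint=request["path"],
        user_id=request["user_id"],
    )
    try:
        # Keep adding business context as the request executes.
        event.add(cart_id=request["cart_id"], item_count=len(request["items"]))
        if not request["items"]:
            raise ValueError("empty cart")
        event.add(status=200)
        return "ok"
    except Exception as exc:
        event.add(status=500, error=type(exc).__name__, error_message=str(exc))
        raise
    finally:
        # Ship exactly one event right before the request exits or errors.
        event.finish(sink)


# Usage: a list stands in for the observability backend.
shipped = []
handle_request(
    {"path": "/checkout", "user_id": "u-1", "cart_id": "c-9", "items": ["sku-1"]},
    shipped.append,
)
print(shipped[0])
```

Because every field lives on the same record, a later query can group or filter by any of them (user_id, cart_id, error) without new code being shipped first.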

Charity Majors:

It’s interesting if you think about it. This became a big deal around the time of microservices. Why? Because you can attach a debugger to your process, right? But then as soon as the request hops the network, you lose all that context and you have to start again from scratch. Observability is kind of reconstituting that, that single process that contained all of the context that you could want to step through.

Charity Majors:

I regret to say that while the rest of the industry has eagerly adopted a lot of our messaging and our words around this, they have not yet… The other people out there who have branded themselves as observability tools, do not actually have some or all of the things that I think define observability. So I hope that this is a temporary state, because honestly, I don’t think we need another term to mean monitoring. I would like it to mean something. But as for now, your Datadogs, your New Relics, your blah, blah, blah, everybody who’s shipping themselves as an observability tool, in my opinion, are not.

Jayne Groll:

You’ve said a couple of things that I thought were really particularly… Everything you said was amazing, but I think there were a couple of key words in there that I’d kind of like to drill in on. First of all, monitoring is very reactive, right?

Charity Majors:

Yes.

Jayne Groll:

I come from an ops state, but a long time ago in very, very traditional monitoring, right?

Charity Majors:

Mm-hmm (affirmative).

Jayne Groll:

So a bunch of guys, mostly guys, sat in the data center, watching these big dashboards while looking for some anomaly that has already happened, right? Maybe they could predict a threshold, but usually they didn’t get the alarm until the threshold was crossed, and the window was really-

Charity Majors:

Yeah, usually those just get pruned over time, right? You’re never done fiddling with your thresholds.

Jayne Groll:

Well, exactly, because it becomes a little bit of a moving target. But even if the threshold is threatened or breached, there’s still a reaction. It’s kind of like it becomes an emergency.

Charity Majors:

Yeah, and it’s a very black box way of interacting with your code, right? It’s rooted in the days when the devs had their team and the ops had their team, and neither would ever meet or talk to each other, right? Observability is much more about getting into the beating heart of the code and asking it to explain itself to you, right?

Charity Majors:

Monitoring is heavily biased towards outages, and problems, and emergencies, right? Like you said, it’s very reactive. Observability is not. Yes, when the site’s down, that’s what you want. But it’s also just about understanding. A lot of the things that you want to know about your code are not so much, “Is it a problem? Is it a bug I need to find?” It’s just like, “How does this work?” I am a firm proponent of the fact that if you’re not watching it in production, you don’t know how your code works.

Jayne Groll:

That’s interesting, particularly interesting. I said it’s all interesting. Because it is almost establishing a relationship.

Charity Majors:

Super is.

Jayne Groll:

So you’re worried about monitoring, right? When you look at monitoring, it’s that whole red, amber, green mentality. If everything’s green, grab a cup of coffee, right? Today, they’re not sitting in the data center, but it is a different… It’s [crosstalk 00:12:33].

Charity Majors:

Yeah, and I’m not trying to knock monitoring. There is a real need and place for it. A lot of the monitoring tools, I wish they would be a little bit prouder of themselves. Instead of being, “We’re observability too.” It’s like, “No, you’re fucking monitoring and you’re good at it, and we need it.” It’s a different paradigm. You do not have to be all things for all people.

Charity Majors:

I think that monitoring is somewhat more of an infrastructure paradigm, and observability is something more of a development paradigm. But there’s overlap. It’s a data tool. Both of these tools can be used and abused in many ways other than that which they’re intended. But in terms of right tool for the job, monitoring is great for infrastructure, and observability is great for the people who are writing and shipping code every day.

Jayne Groll:

That’s also interesting because I see there’s a codependency there, right?

Charity Majors:

Yeah.

Jayne Groll:

Monitoring can be an input to observability, because again, you’re looking outside in, but at some point, you’re understanding the internal state. As I said, I perceive it, and please tell me if I’m not looking at it the right way, that it is a relationship. But part of that, it’s like your heart’s beating, but [crosstalk 00:13:45]-

Charity Majors:

Yeah, it absolutely is. I think that shipping code, as my friends over at Intercom say, shipping is the heartbeat of your company, and observability is what allows you to be in this constant conversation with your users who are using that code right now.

Jayne Groll:

Yeah, so let’s talk a little bit about culture, because it is a paradigm shift. I don’t know that it’s a leap. Hopefully, it isn’t too-

Charity Majors:

It’s evolutionary. Yeah.

Jayne Groll:

Sorry?

Charity Majors:

It’s an evolutionary advancement.

Jayne Groll:

Yeah. But hopefully, it’s something that organizations, when they sort of get the fact that it’s also contextual, can start talking about not only the tools… Tools are great. But one of the things in DevOps and SRE is sometimes organizations think they can buy their way in. Like if I buy X number of tools, I get it. But there’s a cultural aspect. So talk a little bit about the culture of observability.

Charity Majors:

Yeah. I think about these things in terms of systems, right? We work and exist in a sociotechnical system. Things that are part of the system are the… Production is part of it. The people who work on it are part of it. The tools, and best practices, and defaults, and deploy scripts, and all of those tools are also part of it. I feel like part of the cultural shift that I think that you’re tuning into is the shift towards helping people own their own code.

Charity Majors:

I believe that a big part of every senior person’s job should be paying attention to these feedback loops that exist in the system. A great example, terrible example, choose your wording, is on call schedules. Ops has a notorious reputation for masochism, right?

Jayne Groll:

Yeah. A long history.

Charity Majors:

Right? I will own that. We have not done ourselves well. Which means that when we start to look at asking other engineers to be on call, for example, they rightly recoil. But the point is that this old-fashioned on call rotation where it just burns you out and you’re getting woken up all the time, and it’s like going to war for a week, systems don’t actually have to be that bad. We should not accept that as okay. That is a fire alarm. Fire, and you need to fix it. There’s a lot of sort of learned helplessness in our industry around this, but I am telling you right now, it is not inevitable and it can be fixed.

Charity Majors:

But in order to fix that, you have to hook up the loop so that the people who know how to fix it are the people who are getting the information, telling them that there’s something to fix, right? The simplest way, of course, is to have an on call rotation where the people who are writing the code are also on call for it.

Charity Majors:

I don’t believe it’s reasonable to expect anyone to get woken up more than two or three times a year. I think that’s reasonable for everyone who doesn’t have a very small child. I also don’t think that anyone should have two sources of middle of the night wake-up calls. But I think that for adults, if you’re writing for a 24/7 available service, you should know that is a reasonable expectation, I would say, is to own your code as long as it’s not…

Charity Majors:

Because this has to be a handshake, right? Engineers agree to own their own code. Management is on the hook to make sure that they’re given the time to fix it away from shipping features, right? It’s management’s job to make sure that it doesn’t burn people out. Management needs to be watching at the rate of out of hours pages and all this stuff.

Charity Majors:

The point is that the only way to make it better is to work together on this and to get more engineers, pull them much more into the fray, because I’m too old to get woken up all the time either. I don’t want to do it anymore. But we can make it better, but it does mean that… The person who’s best equipped to fix a piece of code is always going to be the person who wrote it. And it’s always going to be the best time as soon after they wrote it as possible while it’s fresh in their head, they remember what they were trying to do, right?

Charity Majors:

So observability in fact comes into play here, because it’s what I call observability-driven development. It’s a play on TDD, of course. Most successful software paradigm of my life. I’m not saying you shouldn’t do TDD, you should. But the point is that TDD has been so successful by abstracting away everything outside of your laptop. Everything interesting, they’re just like, “Well, it’s fake. We’re just going to mock out the databases, mock out the network.” And it only tests pure logic, which means that it’s very limited. It does not mean that your code will work. It means that your code is probably logically consistent. You don’t know if it’s going to work until you look at it in production.

Charity Majors:

Staging, I honestly think is more and more just a waste of time, as we now have more and more tools for safely shipping these things in a limited way to production in a way that doesn’t impact our users. Using feature flags, using test users in production. There’s all this shit. There’s lots of writing about it.
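One common way to ship in a limited way to production with feature flags is a percentage rollout plus an allowlist of test users. The Python sketch below is an illustration only: the flag name, the FLAGS structure, and the bucket and is_enabled helpers are hypothetical and do not come from any particular feature flag product.

```python
import hashlib

# Hypothetical flag configuration; names and numbers are assumptions.
FLAGS = {
    "new_checkout_flow": {
        "rollout_percent": 5,               # share of real users who get the new path
        "always_on_for": {"test-user-1"},   # test users in production
    },
}


def bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name, user_id, flags=FLAGS):
    """Return True if the flag is on for this user; unknown flags are off."""
    flag = flags.get(flag_name)
    if flag is None:
        return False
    if user_id in flag["always_on_for"]:
        return True
    return bucket(flag_name, user_id) < flag["rollout_percent"]


# Usage: the test user always sees the new code path.
print(is_enabled("new_checkout_flow", "test-user-1"))
```

Hashing the flag name together with the user ID keeps each user's assignment stable across requests, so the same small slice of traffic exercises the new code while everyone else stays on the old path.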

Charity Majors:

The point is there’s this very vital feedback loop that everyone should be trying to keep as tight and short as possible. Which is you write the code, commit it, you merge it to master. At that point, hopefully, it automatically kicks off a run of the CI/CD system, outputs an artifact, gets deployed, and then you go and you look at it. While you’re writing the code, you need to be instrumenting it with half an eye towards your future self, “How am I going to know if this actually works in production?”

Charity Majors:

No one should ever accept a pull request if they can’t answer the question, “How will I know if this is working in production?” Right? You should have enough telemetry that once it’s up… Ideally, the rate of errors that you catch immediately will… If you get this down to 15 minutes or less, then you can make it muscle memory for people. You can make it a habit, so that they write, they merge, they look. And they just ask themselves, “Is it doing what I expected it to do? And does anything else look weird?”

Charity Majors:

If you’re doing that every day, you’re going to have a very good sense of what normal looks like, right? That moment right there is where you can catch 80% or more of all problems before they have a chance to go out there and wreak havoc, or generate user complaints, or all those things that build up over time into this rat-infested hairball of a system that nobody really understands, has ever understood. And yet, most people are out there just shipping more code they don’t understand onto that fucking hairball, and then they wonder why their pants are on fire. It’s crazy. You should understand the shit that you’re shipping. That’s my rant.

Jayne Groll:

No, no, no. I really do appreciate that, because I think that as IT evolved… You’ve got to remember we’re still a pretty young industry, right?

Charity Majors:

Yeah.

Jayne Groll:

So we’re kind of like a little bit of belligerent teenagers sometimes. So I think that the ownership, and I think we’re getting closer and closer to own your code, but also develop some of the key skills. You said look. I mean, it’s such a simple word, right? It’s such a simple word, but it is look and see if it’s working. Build in the telemetry before you ship it, and make it muscle memory. I mean, all of that, Charity, it’s not hard, right? It’s not hard.

Charity Majors:

It’s not hard. In fact, it’s easier. It is far easier than trying to guess if it’s working based on the complex sky castle in your head that you’ve drawn together, trying to… Nobody has an accurate image of production in their head ever, right? If you can just stop trying to rely on your imagination, if you can just look at it in a tool where you’ve got this shared view of the world that you and everyone else can see the same goddamn thing and it’s updated in real time, you can move so much more quickly and with so much more confidence.

Jayne Groll:

Yes, you’re right, confidence. But also the systems are not so complex because of the fact that we can look. So we have very, very complex environments, but it feels like the evolution also is trying to make it easier to look. Instead of write it, ship it, move on to the next, right? There’s an interim step there. So I’m excited about it.

Jayne Groll:

I wish we could talk about it forever, but we are going to get a chance to hear more from you in a few weeks, because on August 20th, DevOps Institute is doing a SKILup Day all about observability and-

Charity Majors:

Yeah, and I’m going to be doing my talk on observability-driven development.

Jayne Groll:

Yeah, and I’m so excited about it. Honeycomb is going to be a sponsor, so I’m particularly excited about that. Our chat lounge, I think a lot of the people that are going to attend, and we’ve seen really healthy registrations, are people that are really wanting to learn. They’re curious. Listening to you speak about this, Charity, the theme that seems to come through is curiosity. Be curious, right?

Charity Majors:

Yes, yes.

Jayne Groll:

Be curious. [crosstalk 00:24:01].

Charity Majors:

The tools that we have had have not rewarded curiosity. I have seen so many times an engineer goes on call for the first time and they look at the graphs and there’s a big, scary error, and they’re like, “What’s that?” They’re curious, right? And the people around them, they’re just like, “Happens all the time. Nobody knows why,” and they kill that curiosity.

Charity Majors:

The reason they do it is because they themselves have lost days if not weeks, going on snipe hunts just trying to understand these concepts, and they’re trying to save you from it, right? But over time, that kills you, right? If we can just find ways to reward that curiosity and make it so that you can answer those in a short amount of time like, “What is that? That’s what it is,” right? We just become so much more powerful engineers.

Jayne Groll:

Good. Yeah, I absolutely agree with you. I think we’ve beaten it out, right?

Charity Majors:

We do. We have.

Jayne Groll:

We’ve kind of beaten it out of the basket.

Charity Majors:

But you know what? We all became engineers because we’re curious people. I believe that if we start rewarding that curiosity and you start hooking your brain up to those dopamine drips of, “I figured it out. I fixed it. I solved it,” it grows back.

Jayne Groll:

Yeah, I am really excited. As I said, I’m particularly excited to kind of check in in the engagement. Because as I said, I don’t think you’ve been to a SKILup Day yet, but that chat lounge really lights up, and so I’m excited about it.

Charity Majors:

Very cool.

Jayne Groll:

Anyhow, thank you for spending some time with me today.

Charity Majors:

Absolutely. My pleasure. Thank you, Jayne.

Jayne Groll:

I so appreciate it. For those of you who are listening, as I was inferring, on August 20th, DevOps Institute is hosting the August SKILup Day exclusively about observability. Charity’s going to be sharing her knowledge, hopefully visiting us in the speakers’ chat lounge. You’ll have the opportunity to really gain a lot more knowledge about this very important-

Charity Majors:

I love questions, so bring your questions.

Jayne Groll:

I’m telling you, I think you’re really going to enjoy this. Even though it’s virtual, there’s a lot of engagement going on. Go up to the DevOps Institute website, you’ll see SKILup Day, register for observability. Charity, we’re going to quote unquote, see you on August 20th.

Charity Majors:

Yeah, I know. You’ll see my face.

Jayne Groll:

Anyhow, thanks again. Stay well.

Charity Majors:

Give back. Thanks, Jayne.

Jayne Groll:

Everyone else, really, you’ve been listening to another episode of The Humans of DevOps podcast. I’m here with Charity Majors of Honeycomb.io. See you all on August 20th. Stay safe.

Charity Majors:

Bye-bye.

Jayne Groll:

Bye.

Outro:

Thanks for listening to this episode of The Humans of DevOps Podcast. Don’t forget to join our global community to get access to even more great resources like this. Until next time, remember, you are part of something bigger than yourself. You belong.
