Jun 04, 2026
Observability for business: how to stop hearing about outages from your customers

TL;DR
Every incident has three phases: you don’t know something is happening, you know but don’t know why, you fix it. Most companies think observability shortens the third one. They’re wrong.
Observability dramatically shortens the first phase. That’s where you lose the most money, because it lasts the longest and you don’t even know there’s something to work on.
Without observability, you find out about bugs when a customer calls your CEO. With observability, an alert hits Slack 10 minutes after deployment.
The main trap: most companies buy Datadog and think they have observability. They don’t. A tool is not engineering culture or a sensible process.
Every incident has three phases. Observability hits the most expensive one
Picture a typical path without observability. We deploy a change to production at 2:00 PM, and by 2:05 the first users hit a problem. Within a minute most of them simply leave for the competition or abandon the checkout flow, and only a fraction stay around long enough to even consider reporting it. By 5:00 PM there’s a support ticket waiting, but the developers are already off work and support is wrapping up for the day too. If this happens on a Friday, nobody actually looks at the issue until Monday morning, and the whole report-to-fix cycle stretches over days. Phone tag eats hours, and the whole chain depends on users actually wanting to report the bug instead of just leaving for a competitor.
Without observability, time to discover is measured in hours; with observability, it’s measured in minutes. We should learn about the common bugs within 10 minutes of deployment: not every bug, but the typical ones that usually cause trouble.
So the key insight is this: observability dramatically shortens time to discover. The biggest gain isn’t in the fix itself; it’s that instead of waiting hours, we know within 10 minutes that something is off. And that first phase is where you lose the most money, because as long as you don’t know there’s a problem, it keeps growing and nobody is working on it.
Observability vs. monitoring. Where business misses the difference
Here’s the problem: developers and technical folks throw monitoring, alerting, and observability into one black box. The developer tells the business “I was working on observability”, the business doesn’t understand any of it and lets it go. That’s the wrong dynamic. The business should understand observability, and the business should be the one dictating what we observe.
Monitoring is a different beast: purely technical. It checks CPU usage, RAM, database query rates. To put it concretely, monitoring answers questions like “what was the CPU usage?”, while observability answers “how many transactions of a given type happened in a given window, and how many of them succeeded?”. The first kind of answer is for the IT team; the second is for the business, and the second is the one you should actually want.
Five business levers. Why observability is not a developer’s toy
Time to discover drops
I’ve said this before, but it’s worth repeating because it’s the most important lever: alerts on user impact. Ten minutes after deployment, you already know that users have trouble completing a purchase. Time to discover drops dramatically, and the moment you detect the problem, you can start working on the fix.
Time to repair drops
Because detection is automatic and rich in detail, time to repair also drops; the entry point into the task is simply higher quality. You can trace real production cases, so the question shifts from “what happened?” to “how do we fix this now?”.
Smaller blast radius
Fewer affected users, though indirectly. Observability tells you that users in a particular segment have a problem (those in Warsaw, those on Android, those on a specific subscription plan), and AI can layer on top to spot which combinations of details actually matter. Knowing exactly who’s affected lets you roll back the change for that group, or turn the feature off with a feature flag if your system is set up for it.
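To make the segment idea concrete, here is a minimal sketch of breaking failures down by one attribute at a time. The event shape and the field names (city, platform, ok) are assumptions made up for this illustration, not the schema of any particular tool.

```python
from collections import defaultdict

# Hypothetical join attempts; field names are assumptions for this sketch.
events = [
    {"city": "Warsaw", "platform": "android", "ok": False},
    {"city": "Warsaw", "platform": "ios", "ok": True},
    {"city": "Krakow", "platform": "android", "ok": False},
    {"city": "Krakow", "platform": "ios", "ok": True},
]


def failure_rate_by(events, key):
    """Return {segment value: share of failed events} for one attribute."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for event in events:
        totals[event[key]] += 1
        if not event["ok"]:
            failures[event[key]] += 1
    return {segment: failures[segment] / totals[segment] for segment in totals}


# Compare attributes side by side to see which one separates
# failing users from healthy ones.
for key in ("city", "platform"):
    print(key, failure_rate_by(events, key))
```

In this toy data the city split looks the same everywhere, while the platform split isolates Android. That is the kind of signal that tells you which group to roll back or switch off with a feature flag.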
Better incident decisions
Sometimes the problem only affects a single customer, and in that case it’s enough to call them and say “Hey, we know it’s not working for you, we’re on it, don’t worry”. Isn’t that better than that customer calling your CEO to tell him the system is down?
Better product decisions
The same data that catches incidents also tells you whether a new feature is actually being used, in which segment, and whether shipping it broke something else. That closes the loop between “we shipped it” and “it was worth shipping”.
Case study: Orlikfy
Orlikfy is the world’s first football pitch app that handles organizing, joining, and settling pickup football games end to end. In plain terms: instead of juggling thirty local Facebook groups along the lines of “Local Pitch Crew”, “Pitch Players Krakow”, or “Football Wroclaw”, you have a single app where the organizer posts a game on the map, players find it themselves and sign up with one click, and payments are tracked in one place.
This is a classic two-sided marketplace. On one side you have the organizer, the person who books the pitch, posts the event, and makes sure the squad gets filled. On the other you have the player, someone who wants to play football on Wednesday at 8:00 PM and is looking for a game to join. From a business standpoint, that means two critical user flows:
Game creation by the organizer: posting an event on the map along with parameters like time, place, and price per person.
Game joining by the player, including the payment, since without payment the organizer can’t cover the pitch booking.
If either of these flows breaks, the app stops delivering value in that moment. The organizer can’t post a game, so players have nothing to join; or players can’t sign up, so the organizer fails to fill the squad and ends up cancelling the event. These are exactly the processes that observability has to watch first. Everything else (chat with players, marking who paid, inviting past players to a new game) is secondary, because without a working core none of it matters.
What we observe specifically
We track the ratio of attempts to successes: how many people clicked “join the game”, and how many actually got in with a paid booking. Same for game creation: how many attempts versus how many events actually went live. Sometimes things fail for legitimate reasons (the game was already full when a player tried to join, or the bank declined the payment), and that’s fine. We have a baseline of 3-5% errors and that’s acceptable; for various reasons, some errors will always happen.
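As a rough illustration of the attempts-versus-successes ratio, the sketch below computes an error rate for one flow and compares it with the upper end of the 3-5% baseline. The FlowCounts type, the function name, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

BASELINE_MAX = 0.05  # upper end of the 3-5% baseline described above


@dataclass
class FlowCounts:
    attempts: int   # e.g. clicks on "join the game" in a time window
    successes: int  # e.g. joins that ended with a paid booking


def error_rate(counts: FlowCounts) -> Optional[float]:
    """Share of attempts that did not succeed, or None when there was no traffic."""
    if counts.attempts == 0:
        # No traffic is a separate signal; don't report it as 0% errors.
        return None
    return (counts.attempts - counts.successes) / counts.attempts


join_game = FlowCounts(attempts=200, successes=192)  # made-up sample window
rate = error_rate(join_game)
print(f"join_game error rate: {rate:.1%}, within baseline: {rate <= BASELINE_MAX}")
```

Returning None for an empty window is a design choice in this sketch: a window with no attempts at all deserves its own check rather than looking like a perfectly healthy flow.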
The threshold where something is wrong
What matters is the change, not the absolute number. If that level rises to 30% errors after a new release, something is clearly off. Even without a new release, if the error rate jumps by 30 percentage points within a ten-minute window, that’s still a problem. Observability tells us “Hey, something happened in this important business process” and pushes a message to the team’s Slack. Who picks it up and when is a separate topic; what matters here is that the message lands in the first place.
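The alerting decision itself can be very small. The sketch below assumes error rates are already computed per ten-minute window and expressed in percent; the function names and the message format are illustrative, and actually delivering the text to Slack is left out.

```python
def should_alert(baseline_pct: float, current_pct: float, max_jump_pp: float = 30.0) -> bool:
    """True when the error rate rose by at least max_jump_pp percentage points."""
    return (current_pct - baseline_pct) >= max_jump_pp


def alert_message(flow: str, baseline_pct: float, current_pct: float) -> str:
    """Build the text a (hypothetical) notifier would post to the team channel."""
    return (
        f"[{flow}] error rate {current_pct:.0f}% in the last 10 minutes "
        f"(baseline {baseline_pct:.0f}%)"
    )


# Made-up numbers: a 4% baseline and 38% errors after a release.
if should_alert(baseline_pct=4.0, current_pct=38.0):
    print(alert_message("join_game", 4.0, 38.0))
```

The check compares against the baseline rather than a fixed ceiling, which matches the point that the change matters more than the absolute number.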
Weekly cyclicality, or why hard SLOs are a bad idea at the start
Here’s another thing. I’m not a fan of hard SLOs, those rigid targets like “we must stay under 1% errors”, at least not at the start. I prefer reasoning through deviation from a historical baseline, and Orlikfy is a good example of why.
In Orlikfy we see very clear weekly cyclicality, because pickup football is a deeply calendar-driven product. Most people play on Monday afternoons, while Friday afternoons are the quietest, since people are heading out for the evening, visiting family, or leaving for the weekend. Now, if I see 100 newly created games on a Friday between 5:00 and 5:10 PM, is that good or bad? On its own, the number tells me nothing. The business would love 3,000, but our reality today is 200. And if it suddenly drops from 200 to 100 on a Friday, is that bad or good? Equally unclear, especially if we just ended a marketing campaign and the drop is a natural cooldown.
The way we solved this in Orlikfy is comparison-based: when we analyze Monday at 5:00 PM, we look at how many games were created on the last three Mondays at 5:00 PM and compare. If the deviation exceeds, say, 30% in either direction, we take a look. An upward swing is probably fine, while a downward swing is probably not, unless we just ended a marketing campaign and already know why. The business has to be able to make this call: is this change in the numbers good or bad? And that judgment will shift over time. If we run a campaign for three months, growth is expected; when we stop, a drop is expected too. But if we didn’t change anything and we still see games created on Monday afternoon dropping by 30%, that’s a signal worth investigating.
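A minimal sketch of this comparison, assuming we already have the counts for the same weekday and time slot from the last three weeks. The function names, the labels, and the sample numbers are hypothetical; the 30% threshold is the “say, 30%” from the text and would be agreed with the business.

```python
from statistics import mean


def deviation_from_baseline(current: float, same_slot_history: list[float]) -> float:
    """Relative deviation of the current value from the mean of analogous slots."""
    baseline = mean(same_slot_history)
    if baseline == 0:
        raise ValueError("baseline is zero; relative deviation is undefined")
    return (current - baseline) / baseline


def classify(deviation: float, threshold: float = 0.30, campaign_just_ended: bool = False) -> str:
    """Turn a deviation into a label the business can act on."""
    if deviation <= -threshold:
        # A drop is expected right after a campaign, suspicious otherwise.
        return "expected drop" if campaign_just_ended else "investigate"
    if deviation >= threshold:
        return "probably fine"
    return "normal"


last_three_mondays = [210, 190, 200]  # made-up counts for Monday 5:00 PM
today = 130
deviation = deviation_from_baseline(today, last_three_mondays)
print(f"{deviation:+.0%} vs. baseline: {classify(deviation)}")
```

The campaign_just_ended flag stands in for the business judgment described above: the same number can be an incident or a natural cooldown, and someone has to supply that context.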
Without observability vs. with observability
Without observability, we discover the issue with a delay measured in tens of hours: a change goes out at 2:00 PM, users hit a problem at 2:05 and walk away, by 5:00 PM we have a support ticket but the developers are gone, and only on Monday morning does anyone look at it. The whole sequence drags on for days.
With observability, we get a Slack alert 10 minutes after deployment. We immediately know which flow is broken, which user segment is affected, and which deployment kicked it off. Everything else is just detail.
Observability + AI = time to repair drops twice over
You can’t write an article in 2026 without mentioning AI, but in this case AI genuinely earns its place.
First, observability surfaces the bug fast and creates a ticket automatically (in Jira, on Slack, wherever your workflow lives), and it does so proactively. The developer walks up to a report that already includes what, where, who, which users, and which segment. The alert carries full context: the affected user segment, a deployment trace identifying which deployment started the issue, and even a diff of what changed in the code between deployments.
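One way to picture that context is as a small structure travelling with the alert. The sketch below is an assumption about what such a payload could look like; the field names and identifiers are invented for illustration and do not come from Jira, Slack, or any observability vendor.

```python
from dataclasses import dataclass


@dataclass
class AlertContext:
    flow: str                   # which business process is affected
    error_rate_pct: float       # current error rate in percent
    segment: dict[str, str]     # who is affected, e.g. platform or city
    suspected_deployment: str   # deployment after which the errors started
    previous_deployment: str    # last known-good deployment, for the diff

    def to_ticket_text(self) -> str:
        """Render the context as plain text for a ticket or chat message."""
        segment = ", ".join(f"{k}={v}" for k, v in sorted(self.segment.items()))
        return (
            f"Flow: {self.flow}\n"
            f"Error rate: {self.error_rate_pct:.0f}%\n"
            f"Segment: {segment}\n"
            f"Started after: {self.suspected_deployment}\n"
            f"Diff against: {self.previous_deployment}"
        )


# All values below are placeholders.
context = AlertContext(
    flow="join_game",
    error_rate_pct=38.0,
    segment={"platform": "android", "city": "Warsaw"},
    suspected_deployment="deploy-B",
    previous_deployment="deploy-A",
)
print(context.to_ticket_text())
```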
There’s no need to pull more context out of memory like “if registration is broken, I have to check X, Y, and Z”. Observability bubbles it up from the logs, where we can see exactly that an error fired here and there, and both a human and an AI can pick that up and analyze it. Pair it with AI and, in an ideal world, you can start making code changes automatically: the developer comes in 30 minutes after the bug appeared and finds a ready pull request waiting, complete with a description of what happened and a postmortem of why. They accept it, ship the fix, and move on to the harder question of how to prevent this category of bug in the future.
It’s not as if observability flips the switch to self-healing on day one. You have to give the team room to roll something like this out, and it won’t happen on its own. But typical, simple bugs absolutely should become auto-fixable in this kind of setup.
How to roll this out: Observability implementation strategy
Step 1: write down 3 to 5 critical business processes
Identify the critical business processes: user journeys, endpoints, whatever you call them. Pick the ones you make money on, and the ones whose breakdown hurts the most. Three to five, no more. Don’t try to observe everything at once, or you’ll end up drowning in alert noise that nobody reads.
Step 2: instrument those processes
A developer goes in and makes those processes genuinely observable: they add the logs, set up the alerts, and wire the pipeline so that an error log firing more often than the threshold pushes a message to Slack. We leave the technical details to the team, but the business should be clear on the outcome: each chosen process has a real-time health indicator that’s actually visible.
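For a sense of what “adding the logs” can mean in practice, here is a minimal sketch using Python’s standard logging and json modules. The logger name, the event fields, and the flow names are assumptions; the pipeline that counts these lines and pushes alerts is deliberately not shown.

```python
import json
import logging

logger = logging.getLogger("business_events")  # hypothetical logger name


def log_business_event(flow: str, outcome: str, **context) -> str:
    """Emit one structured log line per business event and return it."""
    record = {"flow": flow, "outcome": outcome, **context}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


# One line per attempt and one per result, so a downstream pipeline
# can count attempts against successes for each flow.
log_business_event("join_game", "attempt", platform="android")
log_business_event("join_game", "failure", platform="android", reason="payment_declined")
```

Keeping the flow and outcome as explicit fields is what lets the same log lines answer business questions (how many joins succeeded) instead of only technical ones.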
Step 3: set the thresholds
Where do you find the “more often than X” threshold in the first place? That’s the genuinely hard part, especially when there’s no observability culture in the organization yet. Ask the business “what percentage of failed transactions do we accept?” and the gut answer will be “zero”. But in the real world, zero doesn’t exist: some errors will always happen.
This is exactly where I recommend a deviation-based approach instead of hard SLOs: compare the current value to the average from analogous periods (the same days of the week, the same hours), like the last three Mondays at 5:00 PM in the Orlikfy example. The deviation threshold is set together with the business, and you alert when it’s crossed. Seasonality and marketing campaigns are then baked right into the model.
Want a starting point? Run a self-assessment first
Before you write down those three to five critical processes and start instrumenting them, it’s worth seeing where your product actually stands today across the technical foundations that observability sits on top of: architecture, testing, CI/CD, observability itself, data, and security.
That’s exactly what our Technical Health Checklist is for. Sixty concrete checks across six areas, designed to be run by your own team in an afternoon. You mark each as Yes, Partially, No, or N/A, and the built-in dashboard scores your maturity and tells you which areas need attention first.
The trap most companies fall into
Most companies make the same mistake. They buy Datadog or some other observability tool, and assume the problem is solved. It isn’t. A tool isn’t a practice. A tool isn’t engineering culture. A tool isn’t a sensible SDLC process. Those are three separate things, and you have to build each one deliberately.
The second trap is locking yourself into a single log provider or observability vendor. When you pick a solution, it’s worth thinking about how easily you’d be able to swap it out later, because no single vendor will always fit every context your company finds itself in a year or two from now.
Observability culture through Definition of Ready and Definition of Done
Rolling out a tool isn’t enough on its own. In my opinion, the only working way to make observability part of the culture is to wire it into the standard delivery process. To do that, you need two artefacts in place: Definition of Ready and Definition of Done.
Definition of Ready says when a task is ready for the IT team to start working on it; Definition of Done says when it’s finished. It isn’t enough to add observability only at the Definition of Done stage. You have to add it one step earlier, at Definition of Ready.
Concretely, when you’re discussing a new feature with the business, before any code is written, you should already be asking: how will we observe in the logs whether users are actually using this, whether the change broke something else, and whether the business goal is hit? Asking those questions this early lets you understand what you’re really doing, because you’ll already know how you’re going to measure it. You’ll also react to bugs faster, because you’ve already discussed how they would surface.
From there, it becomes a continuous process: you react to alerts, discuss whether they were helpful, cut down the noise when there’s too much of it, and add new observability whenever an incident comes through support that you should have caught earlier. You review and refine, and observability becomes one of the key elements of every change you make to the system.
A one-liner to remember
It isn’t about the system never going down. It’s about being the first to know when it’s down, knowing exactly who it affects, and fixing it before the user writes a “1-star, doesn’t work, uninstalling” review.
Observability dramatically reduces the cost of a production incident. The biggest impact is on time to discover, but time to repair shrinks too, especially when paired with AI. And on top of that, by managing the change well, you can also shrink the number of affected users along the way, though that’s more of a side benefit than the headline.
Not sure where observability stands in your organization? Or maybe you can see that IT keeps saying “we have Datadog”, but alerts on real business incidents still don’t reach you? Contact us for an audit. We’ll help you map the critical user journeys, design the business metrics that matter, and roll observability out so it plugs into your existing delivery process instead of becoming yet another developer’s toy.