Strategizing Observability in Complex IT Environments
Many articles on the internet explain Observability concepts and how to apply them using open source solutions or commercial monitoring tools.
Those articles were really important for me, and I continue to use some of them to learn more and to keep up with the news in this area.
But during my journeys transforming operations and implementing Observability, I learned some important things about how to create a strategy for customers that use different kinds of technologies: not only cloud native applications, but also Private Cloud and traditional data centers.
The idea of this article is to give a big picture of some important aspects of strategizing an Observability culture in organizations with these hybrid scenarios, covering multiple kinds of technologies, application languages, operating systems, middleware and so on.
I will cover these Observability Journeys in more detail in the next articles.
1 — Deeply understand your customer
It is not a cliché: you really need to understand the customer. Not only the environment and the technical stuff, but also their strategy, the way they plan to do business, and how that will shape the way they want to strategize the IT environment. Understand their plans to move to the Cloud: whether they will use a Hybrid Cloud approach, whether they are planning to use multi-cloud and, if so, which kinds of workloads they will send to each Cloud provider. PaaS? IaaS? SaaS? FaaS (Serverless)? For PaaS or SaaS, is the customer using a dedicated or a shared instance?
All these decisions will impact the Observability strategy, because the idea is to have a fully observable environment, not one that covers only Cloud Native applications or a part of the environment. To really implement a good Observability and SRE culture, customers need to see every aspect of the IT environment that can impact the business, including where the problems are that degrade application performance and cause, for example, revenue loss.
For SaaS solutions, where the customer monitors the availability of what the Cloud provider delivers, or the performance of the SaaS application when accessed from the internal network, simple Synthetic monitoring (artificial transactions in which scripts simulate application usage) will probably cover the needs.
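To make the idea of a synthetic transaction concrete, here is a minimal sketch in Python. It assumes a transaction can be modeled as an ordered list of named steps, each one a function that returns True on success. The names `run_synthetic_transaction`, `StepResult` and `http_ok` are hypothetical helpers created for this illustration; they do not come from any specific monitoring tool.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float


def run_synthetic_transaction(
    steps: List[Tuple[str, Callable[[], bool]]]
) -> List[StepResult]:
    """Run each scripted step in order, time it, and stop at the first failure."""
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            ok = bool(action())
        except Exception:
            # Any exception (timeout, DNS error, HTTP error) counts as a failed step
            ok = False
        results.append(StepResult(name, ok, time.perf_counter() - start))
        if not ok:
            break  # later steps depend on earlier ones, so do not continue
    return results


def http_ok(url: str, timeout: float = 5.0) -> Callable[[], bool]:
    """Build a step that passes when the URL answers with an HTTP 2xx status."""
    def step() -> bool:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    return step
```

A scheduler could call it from inside the internal network with something like `run_synthetic_transaction([("open login page", http_ok("https://saas.example.com/login"))])`, where the URL is a placeholder, and raise an alert when a step fails or takes longer than expected.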
For PaaS, SaaS and FaaS you need to do the same evaluation: who is responsible for the infrastructure, the customer or a third party?
2 — Solution Design
In this phase we need to design a solution that satisfies the customer requirements. Initially, do not treat cost as a blocker, so the team is free to imagine the best solution to cover all customer requirements.
In complex scenarios, you need to consider the solutions the customer already has and the integration between all the solution components: for example, how to collect traces, metrics, logs and alerts and correlate all the data for better visualization.
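One common way to correlate signals coming from different tools is to make them share a key. The sketch below assumes every record is a dictionary carrying a `trace_id` and a timestamp `ts`; both field names are assumptions made for this example, and real tools will use their own schemas.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def correlate_by_trace(*signal_streams: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records from any number of signal streams by their shared trace_id."""
    grouped = defaultdict(list)
    for stream in signal_streams:
        for record in stream:
            trace_id = record.get("trace_id")
            if trace_id is not None:
                # Records without the shared key cannot be correlated and are skipped
                grouped[trace_id].append(record)
    # Sort each group by timestamp so the sequence of events is readable
    return {
        tid: sorted(recs, key=lambda r: r.get("ts", 0))
        for tid, recs in grouped.items()
    }
```

Called as `correlate_by_trace(spans, logs, alerts)`, it returns one time-ordered list per trace, which is the kind of view a dashboard needs to show an alert next to the log lines and spans that explain it.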
It is important to cover all of the customer's business-critical components with full stack coverage, so the solution can contribute to root cause analysis and postmortems, improve MTTR and MTTD, and reduce operational costs while improving application availability.
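To track whether MTTD and MTTR are really improving, it helps to compute them the same way every time. The sketch below assumes each incident records three timestamps (start of impact, detection, resolution) and measures both indicators from the start of impact. Some teams measure MTTR from detection instead, so treat this definition as an assumption and agree on one with the customer. The `Incident` type is hypothetical.

```python
from datetime import datetime
from statistics import mean
from typing import List, NamedTuple


class Incident(NamedTuple):
    started: datetime   # when the impact began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mttd_minutes(incidents: List[Incident]) -> float:
    """Mean time to detect: average of (detected - started), in minutes."""
    return mean((i.detected - i.started).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean time to resolve: average of (resolved - started), in minutes."""
    return mean((i.resolved - i.started).total_seconds() for i in incidents) / 60
```

Both functions raise `statistics.StatisticsError` on an empty list, which is a deliberate choice here: a period with no incidents has no meaningful average.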
It is also important to define the conceptual model, such as the Golden Signals, so you can design your solution to collect all of those signals and provide real Observability.
Another important aspect is to be aware of open source projects in this area, for example OpenTelemetry, which can help teams create tool-agnostic solutions.
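As an illustration, the four Golden Signals (latency, traffic, errors and saturation) can be derived from a window of request records plus one resource reading. This is a minimal sketch under a few assumptions: latency is reported as a nearest-rank 95th percentile, errors are HTTP 5xx responses, and saturation is represented by a CPU utilization value supplied by the caller. The `Request` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(
    requests: List[Request], window_seconds: float, cpu_utilization: float
) -> dict:
    """Summarize one time window of traffic into the four Golden Signals."""
    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    # Nearest-rank p95: rank = ceil(0.95 * n), computed with integer arithmetic
    p95 = durations[max(0, (95 * n + 99) // 100 - 1)] if n else 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n if n else 0.0,
        "saturation": cpu_utilization,
    }
```

Whatever tools are chosen, the design question is the same: for each business-critical component, can the solution produce these four numbers?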
3 — Balance the needs with the budget to have the final solution
In this phase we need to balance all the customer requirements against the tools available to deliver them.
There are many open source solutions for Application Performance Management (APM), Synthetic monitoring, Real User Monitoring, Infrastructure monitoring, Log monitoring and so on.
But we always need to keep in mind that some commercial tools provide multiple functions that save time both when implementing and when operating the solution. Some tools can instrument applications automatically and create infrastructure maps with topologies; this can reduce operational costs and bring benefits that outweigh the license costs you hoped to save in this phase.
So, invest time in Proofs of Concept (PoC) to evaluate multiple tools and solutions.
If cost is really constraining the design, another good option is to implement both open source and commercial tools, prioritizing the commercial solutions for the most critical apps or for requirements that the open source solutions cannot cover.
4 — Promote a very exciting Observability Journey
One of the most important factors in a successful Observability strategy is having the IT teams really engaged and collaborating on the solution implementation.
Each application has its particularities. Some are based on a frontend, and the most important thing is to make that frontend observable. Others have only a backend, but their performance can impact other frontend applications. And some have multiple components, where the performance of one component will not impact the overall application performance.
So, if the application teams are not really engaged in the project, we will never be able to implement a very good Observability solution. They are the ones who know the application components in detail: the pain points and the points of failure. This knowledge is very important for planning and implementing the most appropriate monitoring solution.
So, it is really important to explain the benefits of the project to them clearly, showing the areas that can be improved, and to provide them with training sessions, create very good documentation, and record some videos explaining the strategy and the concepts.
Another important thing here is to promote showcases, where the teams that implemented the model can show the benefits they obtained. This promotes the solution to other teams that have similar problems.
5 — Continuous Maturity improvement
In the end, you need to promote a culture of continuously improving Observability using Site Reliability Engineering principles, evaluating opportunities to improve every Observability area and always treating the business requirements as the most important factor.
Keep raising maturity, and keep teams from using the Observability solutions only as reactive monitoring tools that cover individual components rather than the full stack.