Symphony of Success: How a Fortune 100 Company Revolutionized IT with Observability, AIOps, Sustainability, and FinOps

Tiago Dias Generoso

--

Introduction

In the ever-shifting landscape of modern business, the journey to success is not merely a path of innovation; it’s an exploration of the transformative capabilities of technology. Today, we invite you to join us on an extraordinary journey. This case study encapsulates the remarkable transformation of a visionary multinational IT company through the strategic implementation of observability practices.

Prepare to be enthralled by a narrative rich with challenges surmounted, innovation harnessed, and operational frameworks redefined. These collective efforts have culminated in performance gains that defy expectations, serving as a noteworthy benchmark for the industry.

As we embark on this narrative, we aim to provide insights into how the fusion of technology and human ingenuity can orchestrate success that resonates on a global scale. This is a story of determination, evolution, and a vision of what the future of IT can be. We invite you to join us on this journey as we unravel the secrets behind the transformative power of observability.

Throughout this article, I’ve included links to other in-depth articles I’ve authored to keep this one concise. If you’re looking to dive deeper into these topics, I recommend exploring these additional resources.

My personal challenges

In 2017, I found myself at a professional crossroads, working as a system management architect primarily focusing on enhancing infrastructure reliability, reducing incidents, and promoting automation. My expertise was centered around platforms and infrastructure, with most of our environment operating as monolithic systems. However, our company embarked on a significant cloud transformation project, one that aimed to create a hybrid cloud platform using Red Hat technology, revolutionizing our internal infrastructure. The overarching goals were to foster agility, innovation, and efficiency and substantially reduce operational costs.

Amidst this pivotal transformation, an unexpected opportunity arose. I was invited to join the team responsible for driving the transformation from the Observability and Site Reliability Engineering (SRE) perspective. The objective was clear: empower the organization to scrutinize its systems, swiftly identify and resolve issues, examine applications, and elevate its intelligence.

At the time, the realm of SRE was relatively new to me. My experience had been primarily rooted in infrastructure architecture, with a deep understanding of platforms, networks, and system management. My exposure to applications, cloud-native technologies, and SRE practices had been limited, leaving me feeling unprepared for the challenges that lay ahead.

The organization posed a formidable challenge — leading a team of approximately 60 professionals to chart the course of this ambitious project. My mandate included selecting the right tools, pinpointing requirements, gathering insights from diverse stakeholders, comprehending the intricate web of our environment, and understanding the structural dynamics of our teams. Furthermore, I needed to gauge the maturity of our teams in the realms of SRE, Observability, Cloud-native solutions, and more.

To be candid, I was initially overwhelmed. The sheer scale of the project and its inherent complexity were daunting. Despite my seniority, the task ahead was unlike anything I had encountered in my career. Yet, this was my challenge, a journey into the unknown that would ultimately test my mettle and redefine my professional trajectory.

Company Challenge

The company faced a huge task: managing an expansive network of 1,600 business applications while catering to the diverse needs of over 280,000 global users. Facing this challenge, the Chief Information Officer (CIO) organization embarked on a transformation. The mission was twofold: ensure seamless operations and accelerate the company’s digital transformation journey.

As they transitioned toward a hybrid cloud ecosystem, fueled by the power of Red Hat technology, complexity reached new heights. Shifting from monolithic applications to intricate components and the labyrinth of pods and containers necessitated a paradigm shift.

In summary, the organization faced a host of interconnected challenges that spanned Agile practices, enterprise architecture, Observability, and AIOps integration. These challenges were not only complex but also demanded comprehensive solutions that would eventually lay the foundation for our successful digital transformation journey.

Some other challenges:

  • Siloed Team Structure and Handoffs
  • Complex Processes and Organizational Structure
  • Inefficiencies and Slow Response Times
  • Lack of cloud cost control
  • Lack of sustainability controls

The target was clear: a shift from reactive firefighting to a proactive stance that maximizes resource utilization efficiency, maintains high performance, reduces costs, and contributes to sustainable IT.

The Observability Odyssey with New Relic, Dynatrace and Instana

Recognizing the challenge, the company embarked on a transformative journey towards an observability strategy. Their primary objective: gain profound insights into the hybrid cloud infrastructure, streamline operations, and optimize efficiency across applications and resources. The shift from monolithic to microservices architecture called for a new lens, one placing observability at the heart of monitoring and optimizing the intricate component interplay.

In their pursuit of operational excellence, the company’s trajectory from reactive to proactive operations underwent a pivotal transformation with the strategic integration of Application Performance Management (APM) solutions. This pivotal addition fortified the observability strategy, infusing it with deeper insights, control, and optimization.

The amalgamation of APM solutions into the observability framework set the stage for ceaseless innovation. As the company evolved, APM capabilities synergised seamlessly with Artificial Intelligence for IT Operations (AIOps), further automating issue detection, root cause analysis, and predictive maintenance.

Building upon this, I took the helm to lead our team in selecting the most suitable tools available to meet our specific requirements. We ultimately adopted a triad of robust solutions: New Relic, Dynatrace, and Instana.

At the outset, our company had an existing partnership with New Relic. However, due to strategic decisions unrelated to technical considerations, we eventually decided to end this partnership. Notably, we had already implemented New Relic across a significant portion of our environment, predominantly using synthetic monitoring to ensure availability.

As we navigated this transition, prompted by the end of our partnership and our evolving need to provide Observability across a diverse ecosystem encompassing Cloud Native, Hybrid Cloud, and Mainframe AS/400, we embarked on a quest to identify a tool that could comprehensively address these requirements. Our chosen solution was Dynatrace.

Subsequently, the company acquired Instana, ushering in a new transformation phase. We commenced the transition from Dynatrace to Instana. Ultimately, all three platforms, New Relic, Dynatrace, and Instana, were able to meet our needs effectively. The key lay in adapting each of these tools to our unique environment.

For a deeper dive into the criteria that guided our selection of APM tools, I invite you to explore the comprehensive article I authored: Observability Tooling Decision Guide. This guide offers valuable insights into the considerations and factors that shaped our tooling decisions.

Application Resource Management Tool: Automating Excellence

Acknowledging the limitations of manual resource allocation optimization in a dynamic multi-tenant environment, the company embraced IBM Turbonomic’s cutting-edge hybrid cloud cost optimization solution. This watershed moment propelled the company into the realm of automation-driven efficiency. With Turbonomic’s capabilities, resource allocation became automated, and applications were optimized based on real-time data and demand.

The impact was immediate and transformative. Overallocated resources were identified, and allocation optimization, once a complex endeavor, became streamlined and automated. This transition demanded a cultural shift, relinquishing manual control in favor of trust in automation. Yet, comprehensive insights from Turbonomic’s observability solution played a pivotal role in navigating this transformation.
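To make the idea of identifying overallocated resources more concrete, here is a minimal Python sketch. It is not Turbonomic's algorithm: the `Workload` record, the 95th-percentile rule, and the 20% headroom are illustrative assumptions chosen only to show how observed usage can be compared against what a workload requests.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical record: CPU request and observed usage samples, in millicores."""
    name: str
    cpu_request_m: int
    cpu_usage_m: list


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def recommend_cpu_request(workload, headroom_pct=20):
    """Assumed policy: 95th-percentile usage plus a fixed headroom."""
    p95 = percentile(workload.cpu_usage_m, 95)
    return p95 * (100 + headroom_pct) // 100


def is_overallocated(workload, headroom_pct=20):
    """Flag a workload whose request exceeds the recommended value."""
    return workload.cpu_request_m > recommend_cpu_request(workload, headroom_pct)


# Example: usage hovers between 100m and 290m, but 1000m is requested.
usage = list(range(100, 300, 10))
oversized = Workload("billing-api", cpu_request_m=1000, cpu_usage_m=usage)
print(recommend_cpu_request(oversized), is_overallocated(oversized))
```

A real resource management tool weighs many more signals (memory, latency targets, placement constraints), so this only illustrates the basic comparison behind a rightsizing action.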

You can see details on the implementation here: A Deep Dive into Application Performance and Efficiency with Instana and Turbonomic.

This strategic solution propelled our Observability maturity to a significantly higher level, enabling us to seamlessly deliver on two critical fronts: FinOps and Sustainable IT. Furthermore, we seamlessly integrated these solutions with other essential tools, including Apptio and Envizi.

Through a methodical step-by-step approach, we gradually ascended to what we refer to as “Business Observability Maturity.” While it may not be flawless and demands ongoing refinements, this progress signifies a substantial elevation in our maturity level. It empowers us to better correlate vital business insights with the wealth of data provided by our observability solutions.

For a deeper understanding of our Observability Maturity Model, I encourage you to explore the article I authored: Observability Maturity Model. This resource delves into the nuances of our maturity journey and the strategies that underpin our approach.

Cultural Evolution and Agile Transformation

Shifting towards automation and observability demanded a cultural transformation. The company reshaped its teams, fostering collaboration and agility. Change management strategies, driven by comprehensive documentation, interactive sessions, training modules, and inspiring success stories, ensured the entire organization embraced the principles of observability and SRE.

The path to automation was not without its challenges. Embracing automated processes required a cultural shift, relinquishing manual control in favor of trust in automation. The comprehensive insights furnished by observability solutions, however, played a pivotal role in overcoming this hurdle. The team’s embrace of this full-stack visibility transformed their troubleshooting approach from the traditional server-by-server mode to a holistic perspective of the entire environment.

This culture shift was not without its challenges, as the company’s existing organizational structure was deeply rooted in siloed teams scattered across multiple locations and time zones.

Recognizing the inefficiencies generated by this fragmented setup, the company embarked on a transformative journey to redesign their squads. The old model, comprising hundreds of isolated teams, was revamped to create a more cohesive and cross-functional structure. The new agile organization model consisted of two distinct squad types, each leveraging services provided by platform squads to streamline operations.

I explained this Agile transformation in more detail in this article:

Forging SRE and Observability Squads in a Complex and Imperfect Organization — A Real-World Use Case

Observability Architecture

With a commitment to fostering an open and dynamic architecture that encourages innovation and embraces best practices, our ongoing architectural updates are closely aligned with evolving market offerings and the ever-changing needs of our customers.

Our architectural reference aims to craft an environment that seamlessly integrates proprietary agents with generic agents (OpenTelemetry) or manual instrumentation. This strategic choice prevents vendor lock-in, offering greater flexibility and adaptability. Furthermore, it includes the incorporation of a security layer to manage API calls to our tools, enhancing data access for external entities. We are also keen on delivering a workflow solution that orchestrates all interactions and a database to house essential static external data required for creating dashboards.
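One way to picture the vendor-neutral instrumentation layer described above is a thin tracing interface that application code calls, with backends plugged in behind it. The Python sketch below is illustrative only: it is not the OpenTelemetry API, and `Tracer`, `SpanExporter`, and `InMemoryExporter` are hypothetical names standing in for a real SDK and real vendor exporters.

```python
import time
from contextlib import contextmanager
from typing import Protocol


class SpanExporter(Protocol):
    """Anything that can receive a finished span (hypothetical interface)."""
    def export(self, span: dict) -> None: ...


class InMemoryExporter:
    """Stand-in for a vendor backend; it simply collects finished spans."""
    def __init__(self) -> None:
        self.spans = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class Tracer:
    """Application code depends on this class only, never on a vendor agent."""
    def __init__(self, exporters) -> None:
        self._exporters = list(exporters)

    @contextmanager
    def span(self, name, **attributes):
        record = {"name": name, "attributes": attributes, "status": "ok"}
        start = time.perf_counter()
        try:
            yield record
        except Exception:
            record["status"] = "error"
            raise
        finally:
            # Every configured backend receives the same span record.
            record["duration_s"] = time.perf_counter() - start
            for exporter in self._exporters:
                exporter.export(record)


# Two backends receive the same span; swapping one out needs no app changes.
backend_a, backend_b = InMemoryExporter(), InMemoryExporter()
tracer = Tracer([backend_a, backend_b])
with tracer.span("checkout", tenant="demo"):
    pass
print(len(backend_a.spans), len(backend_b.spans))
```

The design choice this illustrates is that instrumentation lives in the application once, while the choice of backend stays a configuration concern.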

The workflow below illustrates the interactions within our observability solution. APM tools, notably Instana, are pivotal in collecting and processing application and infrastructure data. They are instrumental in generating topologies and traces, enhancing root-cause analysis (RCA) and issue identification, and routing data to the logging tool for dashboarding.

Additionally, they send alerts to IT operations management and IT service management tools for event correlation, ticketing, event automation through Ansible, and user notifications. Our solution further extends its reach by providing data to the ARM solution (Turbonomic), empowering application and platform teams to operate infrastructure with greater efficiency and lower hosting costs.
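The alert flow described above (correlate events, then automate, ticket, or notify) can be sketched as a small routing function. This is a simplified Python illustration under stated assumptions: the `Alert` fields, the `PLAYBOOKS` mapping, the playbook file names, and the routing rules are all hypothetical and do not reflect the actual configuration of our ITOM, ITSM, or Ansible tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str    # tool that raised the alert, e.g. "instana"
    entity: str    # affected application or host
    problem: str   # short problem identifier
    severity: str  # "critical", "warning" or "info"


# Hypothetical mapping from a known problem to an automation playbook.
PLAYBOOKS = {
    "disk_full": "cleanup_disk.yml",
    "service_down": "restart_service.yml",
}


def correlate(alerts):
    """Group alerts that describe the same problem on the same entity."""
    events = {}
    for alert in alerts:
        events.setdefault((alert.entity, alert.problem), []).append(alert)
    return events


def route(alerts):
    """Return one (action, entity, detail) tuple per correlated event."""
    actions = []
    for (entity, problem), group in correlate(alerts).items():
        severities = {a.severity for a in group}
        if problem in PLAYBOOKS:
            # Known problem: hand it to automation instead of a human.
            actions.append(("automate", entity, PLAYBOOKS[problem]))
        elif "critical" in severities:
            actions.append(("ticket", entity, problem))
        else:
            actions.append(("notify", entity, problem))
    return actions


alerts = [
    Alert("instana", "app-a", "service_down", "critical"),
    Alert("synthetic", "app-a", "service_down", "critical"),
    Alert("instana", "db-1", "high_latency", "critical"),
    Alert("instana", "web-2", "cert_expiring", "warning"),
]
for action in route(alerts):
    print(action)
```

Note how the two alerts for `app-a` collapse into a single event, which is the point of correlating before ticketing.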

In alignment with our functional requirements, our solution is meticulously designed to:

  • Identify and select the requisite tools to comprehensively address our requirements.
  • Offer a spectrum of observability capabilities, including application performance management (APM), synthetic performance monitoring, real-user monitoring (RUM), and logging solutions tailored for diverse environments.
  • Seamlessly integrate these tools into a unified observability solution.
  • Trigger notifications to the relevant teams in real time when issues arise.
  • Implement automation to manage infrastructure issues efficiently.
  • Ensure the optimal allocation and management of infrastructure resources.

Additionally, several pivotal architectural decisions shape our solution:

  • We will adopt a hybrid approach, combining Software as a Service (SaaS) and on-premises solutions, with a clear preference for SaaS wherever applicable.
  • Our solution will effectively balance multi-tenant and single-tenant usage, underpinned by robust role-based access controls.
  • The architecture will minimize the creation of new infrastructure components, and any that are necessary will be designed with high availability.
  • Legacy components or tools that are slated for decommissioning soon will remain the same, maintaining a focus on efficiency and streamlined operations.
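As an illustration of the role-based access control decision above, the following Python sketch shows how a role granted per tenant can gate actions in a shared tool. The role names, permission names, and the `grants` structure are hypothetical; real observability platforms ship their own permission models.

```python
# Hypothetical roles and the permissions each one carries.
ROLE_PERMISSIONS = {
    "viewer": {"read_dashboards"},
    "editor": {"read_dashboards", "edit_alerts"},
    "admin": {"read_dashboards", "edit_alerts", "manage_users"},
}


def is_allowed(grants, user, tenant, permission):
    """Check a permission for a user within one tenant.

    `grants` maps (user, tenant) pairs to a role name, so the same user
    can hold different roles in different tenants.
    """
    role = grants.get((user, tenant))
    if role is None:
        return False
    return permission in ROLE_PERMISSIONS.get(role, set())


grants = {
    ("ana", "payments"): "admin",
    ("ana", "hr"): "viewer",
}
print(is_allowed(grants, "ana", "payments", "manage_users"))
print(is_allowed(grants, "ana", "hr", "edit_alerts"))
```

Keying grants by tenant is what lets a single multi-tenant instance serve several squads without exposing one team's data to another.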

In this pursuit, our architecture embraces the dynamic nature of the industry while preserving a robust framework to meet the evolving needs of our organization and customers.

For more details about the architecture, please take a look at the following articles I authored:

Dynatrace Architecture Design Guidelines

Producing Observability Design to Support a Hybrid Cloud Strategy

Results

The company’s journey of observability-driven success was a stepwise evolution. Starting with infrastructure monitoring, they systematically layered application availability monitoring, monitoring for heterogeneous applications, and a fusion of SRE (Site Reliability Engineering) and AIOps (Artificial Intelligence for IT Operations) capabilities.

Through dedicated efforts and a commitment to excellence, our team has achieved remarkable results in optimizing our operations.

Business Metrics
  • 95% reduction in toil: we’ve successfully streamlined our processes, allowing us to focus on what truly matters.
  • 18,000 tickets are now being effortlessly resolved, accounting for 35% of our total workload.
  • Enhancing observability implementation has yielded outstanding outcomes, with a remarkable 70% increase in speed.
  • Root cause analysis has accelerated by an impressive 50%, enabling us to swiftly pinpoint issues and implement effective solutions.
  • 50 squads now successfully implementing SRE principles.
  • 3.8 TB decrease in cumulative memory limits, 64% decrease in CPU requests and a staggering 45,000 automated resourcing actions each month.
  • Observability for over 1,600 applications and supporting 12,000 user accounts.

Figure: Sustainability Reports using Turbonomic and Dynatrace
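A quick back-of-the-envelope check helps put two of the business metrics listed above in perspective. The Python snippet below derives the figures they imply, under two assumptions that are mine and not stated in the results: that "total workload" is counted in tickets over the same period, and that a month has 30 days.

```python
resolved_tickets = 18_000    # tickets resolved, from the metrics list
resolved_share_pct = 35      # share of total workload, from the metrics list
actions_per_month = 45_000   # automated resourcing actions, from the metrics list

# Assumption: "total workload" is measured in tickets over the same period.
implied_total_tickets = round(resolved_tickets * 100 / resolved_share_pct)

# Assumption: a 30-day month.
actions_per_day = actions_per_month // 30

print(implied_total_tickets)  # roughly 51,000 tickets in total
print(actions_per_day)        # about 1,500 automated actions per day
```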

Conclusion — A Symphony of Success: Observability, Sustainable IT, and FinOps

In conclusion, the showcased company’s journey stands as a testament to the power of adaptability in today’s dynamic business landscape. Shifting from reactive to proactive operations through the synergy of technology and cultural transformation marks a significant evolution in their operational strategy. This narrative emphasizes that embracing change and driving innovation are as important as technological advancement and the synergy of Observability, AIOps, Sustainable IT, and FinOps practices.

Observability, AIOps, sustainability commitment, and financial prudence emerge as pillars that guide organizations towards operational efficiency and responsible growth. As technology relentlessly advances, these pillars offer a clear path to a smarter, eco-conscious, and financially sound future.

The company’s triumphant journey emerged from the harmonious synergy of observability, AIOps, Sustainable IT, and FinOps practices. Observability’s data-driven insights formed the bedrock on which this symphony played out. The fusion of AIOps added intelligence, propelling the company toward efficiency-driven innovation. Aligning with Sustainable IT and FinOps, the company optimized cloud spending while minimizing its environmental footprint, showcasing the compatibility of technological advancement and sustainability.

Tiago Dias Generoso is a Distinguished IT Architect | Senior SRE | Master Inventor based in Pocos de Caldas, Brazil. The above article is personal and does not necessarily represent the employer’s positions, strategies or opinions.

Distinguished IT Architect | Senior SRE specialized in Observability with 20+ years of experience helping organizations strategize complex IT solutions. Kyndryl