Modern organizations are rapidly adopting increasingly sophisticated cloud infrastructure to ensure business continuity and operational efficiency. A crucial part of this digital ecosystem is operational health events, which include operational issues and software lifecycle notifications, among others. Managing these events poorly can lead to unexpected downtime, higher costs, and lost revenue.
Managing operational events in the cloud is a considerable challenge, especially for companies with complex structures. These organizations may manage thousands of accounts and a wide range of services and resources, and they can face an overwhelming volume of daily operational events that makes manual management impractical. Traditional automation approaches offer some relief, but they are often costly to develop and maintain, and they depend on intricate mapping rules and rigid triage logic.
To mitigate these challenges, an AI-driven operational assistant has been developed that automatically responds to operational events. The assistant uses Amazon Bedrock, AWS Health, AWS Step Functions, and other AWS services to filter out irrelevant events, suggest actions, create and manage issue tickets in integrated IT service management tools, and query knowledge bases for relevant information about operational events. The solution automates complex tasks and streamlines the resolution of operational events in the cloud, improving both business continuity and operational efficiency.
In this context, operational events are occurrences that can affect the performance, resilience, security, or cost of workloads in an organization’s cloud environment. Examples in AWS include AWS Health events about service availability, AWS Security Hub findings about security vulnerabilities, and AWS Cost Anomaly Detection alerts.
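As a rough illustration of the event types described above, the sketch below groups incoming events by origin before any triage happens. It assumes the events arrive as Amazon EventBridge-style dictionaries with a `source` field. The category names, the `OperationalEvent` class, and the `normalize` helper are hypothetical and are not taken from the solution itself.

```python
from dataclasses import dataclass

# Assumed mapping from an EventBridge-style "source" value to a category.
# The category labels on the right are illustrative, not part of any AWS API.
SOURCE_CATEGORIES = {
    "aws.health": "service-health",
    "aws.securityhub": "security-finding",
    "aws.ce": "cost-anomaly",
}


@dataclass
class OperationalEvent:
    """Hypothetical normalized view of an operational event."""
    category: str
    account: str
    summary: str


def normalize(raw: dict) -> OperationalEvent:
    """Map a raw event dictionary onto the common shape used for triage."""
    category = SOURCE_CATEGORIES.get(raw.get("source", ""), "unknown")
    return OperationalEvent(
        category=category,
        account=raw.get("account", ""),
        summary=raw.get("detail-type", ""),
    )


# Example input (made-up values for illustration only).
sample = {
    "source": "aws.health",
    "account": "111122223333",
    "detail-type": "AWS Health Event",
}
event = normalize(sample)
```

A fixed lookup table like this is the kind of mapping rule that has to be updated by hand whenever a new source or format appears, which is the rigidity the AI-based approach is meant to reduce.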
Managing operational events efficiently involves a series of steps: notification, triage, tracking, action execution, and archiving and reporting at scale. Traditional programmatic automation struggles to cover all of these tasks. AI has been integrated into this solution to make it more flexible and adaptable to organizational changes, service expansions, and new data source formats.
This approach not only streamlines the management of operational events but also strengthens an organization’s ability to maintain operational continuity and reduce the risk of extra cost and downtime. With an AI-based operations assistant, organizations can handle the high volume of operational events in complex, cloud-centric environments with minimal human oversight.
via: MiMub in Spanish