Failure as a Service: Unleashing the Power of Failure for Software System Enhancement


Adina Anderson

3 min read


In software testing, a failure is an incorrect result produced when a defect in an application or product is executed under specific environmental conditions. In today's interconnected world, where online chat and communication play a vital role, failures can also stem from network connectivity problems, server disruptions, or software glitches that interrupt the flow of a conversation. These factors underline the need for comprehensive testing and analysis to keep software systems reliable and functional, chat features included.

Breaking the Cycle of Failure in Services

Do you find this scenario familiar? A large retail company (or bank or fast food chain) hires individuals willing, albeit temporarily, to work for slightly higher-than-minimum wages in their customer contact positions. The company simplifies these jobs, reducing them to repetitive and mundane tasks that require minimal training. Unfortunately, this approach does little to foster dedication to the work or loyalty to the company. As a result, the predictable outcome is high employee turnover and growing customer dissatisfaction.

Regrettably, traditional management responses to this situation often exacerbate the problem. The high turnover reinforces the belief that minimizing efforts in selection, training, and building commitment is a sound decision.

Introducing Failure as a Service (FaaS)

Hosting software applications on the internet usually requires provisioning and administering virtual or physical servers, operating systems, and web servers. Failure as a Service (FaaS) is a concept within cloud computing services (CCS) in which failure drills are offered as a cloud service, so that deployed applications can be exercised against injected failures on a routine basis. The acronym is shared with Function as a Service, the model that lets businesses design, run, and manage applications directly from the cloud without the time-consuming maintenance and infrastructure work normally involved in developing and launching an application, and that enables a "serverless" architecture.

Analyzing Failure: Tools and Methodologies

The purpose of failure analysis is to identify the underlying root cause of a failure, ideally with the intention of eliminating it and implementing preventive measures. The failure analysis procedure involves several essential steps:

  1. Defining the problem and collecting relevant data.
  2. Identifying damage modes and mechanisms.
  3. Testing to determine the actual mechanisms leading to the failure.
  4. Identifying potential root causes.
  5. Confirming cause-effect relationships.
  6. Conducting tests to validate the actual root cause.
  7. Implementing corrective actions.

An integral part of the failure analysis process is the identification of failure modes and their potential effects, which is accomplished through the use of failure mode and effect analysis (FMEA).
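To make the FMEA step a little more concrete, here is a minimal sketch of the commonly used Risk Priority Number (RPN), which multiplies severity, occurrence, and detection scores to rank failure modes. The 1 to 10 scale is a widespread convention rather than something this article prescribes, and the failure modes and scores below are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int     # 1 (negligible) .. 10 (catastrophic)
    occurrence: int   # 1 (rare) .. 10 (almost certain)
    detection: int    # 1 (always detected) .. 10 (practically undetectable)

    def rpn(self) -> int:
        """Risk Priority Number: severity * occurrence * detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("FMEA scores must be in the range 1..10")
        return self.severity * self.occurrence * self.detection


def rank_failure_modes(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical failure modes for a chat service; every score is made up.
modes = [
    FailureMode("message broker unreachable", severity=8, occurrence=3, detection=2),
    FailureMode("session token expires mid-chat", severity=5, occurrence=6, detection=4),
    FailureMode("database disk full", severity=9, occurrence=2, detection=5),
]

for mode in rank_failure_modes(modes):
    print(f"{mode.rpn():4d}  {mode.name}")
```

The highest RPN is the failure mode a team would typically look at first, although in practice the ranking is only as good as the scores that go into it.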

Enter Chaos Monkey

In 2011, Netflix introduced Chaos Monkey, an open-source tool designed to randomly disrupt AWS resources at scheduled intervals so that the resulting failures can be closely monitored. Its primary objective is to expose system weaknesses that could lead to major outages so they can be addressed proactively. Chaos Monkey is not a service in itself; users of cloud services have to deploy it themselves. It is now part of the Simian Army, a collection of testing tools. However, because its failure drills are random, it is hard to measure and handle their outcomes accurately.
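The idea of a random failure drill can be sketched in a few lines. The example below is not Chaos Monkey's code and does not touch AWS; the `Instance` class, the `run_failure_drill` function, and the `kill_probability` parameter are all hypothetical names used for illustration. It also shows one possible way to ease the measurement problem mentioned above: seeding the random number generator so that a drill can be replayed.

```python
import random


class Instance:
    """A stand-in for a server instance; not a real cloud resource."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def terminate(self):
        self.running = False


def run_failure_drill(instances, kill_probability, rng):
    """Terminate each running instance with the given probability.

    Returns the names of the terminated instances so the drill can be
    logged and compared against monitoring data afterwards.
    """
    terminated = []
    for instance in instances:
        if instance.running and rng.random() < kill_probability:
            instance.terminate()
            terminated.append(instance.name)
    return terminated


def make_fleet():
    """Build a fresh fleet of five simulated instances."""
    return [Instance(f"web-{i}") for i in range(1, 6)]


# A fixed seed makes the "random" drill repeatable.
first = run_failure_drill(make_fleet(), kill_probability=0.4, rng=random.Random(2011))
second = run_failure_drill(make_fleet(), kill_probability=0.4, rng=random.Random(2011))
print(first)
print(first == second)  # the same seed yields the same set of terminations
```

Recording the seed alongside the drill results means a surprising outcome can be reproduced later instead of being written off as bad luck.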

Introducing Trouble Maker

To remove the dependency on the cloud and suit enterprise environments, we developed Trouble Maker as an alternative to Netflix's Chaos Monkey. Trouble Maker targets Java-based web and microservice-based applications and randomly disrupts their application services. It also provides a web console for running stability tests on servers. Here's a diagram illustrating its functioning:

Trouble Maker communicates with a servlet registered in the Java-based client microservice and queries a Service Registry to find the locations of the services to be targeted.
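The registry lookup described above can be illustrated with a small sketch. It is written in Python for brevity and is not Trouble Maker's implementation: the `ServiceRegistry` class, the `/disruption-hook` path, the `kill` action, and the addresses are all assumptions made up for this example.

```python
import random


class ServiceRegistry:
    """In-memory stand-in for a service registry (hypothetical)."""

    def __init__(self):
        self._locations = {}

    def register(self, service, url):
        self._locations.setdefault(service, []).append(url)

    def locate(self, service):
        return list(self._locations.get(service, []))

    def services(self):
        return sorted(self._locations)


def pick_target(registry, rng):
    """Choose one registered service instance at random.

    Returns a (service, url) pair, or None when nothing is registered.
    """
    candidates = [
        (service, url)
        for service in registry.services()
        for url in registry.locate(service)
    ]
    return rng.choice(candidates) if candidates else None


def disruption_request(target, action):
    """Describe the call a disruption tool might send to a client-side hook.

    The "/disruption-hook" path and the action name are invented here.
    """
    service, url = target
    return {"service": service, "endpoint": f"{url}/disruption-hook", "action": action}


registry = ServiceRegistry()
registry.register("orders", "http://10.0.0.5:8080")
registry.register("orders", "http://10.0.0.6:8080")
registry.register("billing", "http://10.0.0.9:8080")

target = pick_target(registry, random.Random(7))
print(disruption_request(target, action="kill"))
```

Keeping target selection separate from the request itself makes it easy to log which instance was chosen before anything is actually disrupted.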


Failure analysis of complex systems with many interconnected components is challenging, particularly when probabilistic events influence system performance. A probabilistic, multifactor representation that covers both technical and non-technical factors and events can help. Complex systems can be modeled in various ways, but Bayesian Networks (BNs) have proven advantageous because they define the interrelationships among system components explicitly. Quantifying a BN relies on data from diverse sources, including logical inference, expert engineering judgment, empirical mathematical models, historical and operational data, and detailed simulations.
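As a rough illustration of how a Bayesian Network supports failure analysis, the sketch below models two independent causes, a network fault and a server fault, that both influence whether an outage occurs. The structure and every probability are invented for this example; in a real analysis they would come from the data sources listed above. Inference is done by brute-force enumeration, which is only practical for tiny networks.

```python
from itertools import product

# Prior probabilities of each cause (made-up numbers).
P_NETWORK_FAULT = 0.10
P_SERVER_FAULT = 0.05

# P(outage | network_fault, server_fault), also made up.
P_OUTAGE_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.70,
    (False, True): 0.80,
    (False, False): 0.01,
}


def joint(network_fault, server_fault, outage):
    """Joint probability of one full assignment of the three variables."""
    p_n = P_NETWORK_FAULT if network_fault else 1 - P_NETWORK_FAULT
    p_s = P_SERVER_FAULT if server_fault else 1 - P_SERVER_FAULT
    p_o = P_OUTAGE_GIVEN[(network_fault, server_fault)]
    return p_n * p_s * (p_o if outage else 1 - p_o)


def probability_of_outage():
    """Marginal P(outage), summing over both causes."""
    return sum(joint(n, s, True) for n, s in product([True, False], repeat=2))


def probability_server_fault_given_outage():
    """Posterior P(server fault | outage) via Bayes' rule."""
    numerator = sum(joint(n, True, True) for n in (True, False))
    return numerator / probability_of_outage()


p_outage = probability_of_outage()
p_server_given_outage = probability_server_fault_given_outage()
print(f"P(outage) = {p_outage:.4f}")
print(f"P(server fault | outage) = {p_server_given_outage:.4f}")
```

With these illustrative numbers, observing an outage raises the probability of a server fault from the 0.05 prior to roughly 0.35, which is the kind of cause-ranking evidence a failure analyst is after.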
