Applications have to run in high-consequence environments. They have to serve hundreds of thousands of users 24/7. Our clients spend millions on hardware and software and depend heavily on the revenue these applications generate. An unnecessary outage of such an application is fatal.
Software architects play an important role in setting up an architecture that can cope with these high demands. At JAOO, Michael Nygard gave a talk, “Failure comes in Flavors”, that gave very good insight into the risks and opportunities of today’s applications. The talk was divided into two sessions. The first session covered the bad news: the stability threats. He discussed several situations that pose a threat to the long and happy life of an application. The second session was a happier one. It covered the patterns that should be applied to the application architecture to prevent these threats.
In this post, I will elaborate on some of the stability threats and pick one specific pattern to resolve them: the circuit breaker.
Before we go on about the stability threats, we need a common understanding about what stability is.
Michael Nygard defines stability as “the consistent, long-term availability of features”, which can be divided into four different qualities:
- Severability; instead of crashing completely, only lose the functional areas that are struck by failure
- Resilience; automatically recover from failures
- Recoverability; make sure only crashed components need to be restarted, not the entire application
- Tolerance; absorb shocks from other components instead of propagating them
Note that stability doesn’t mean that nothing will ever go wrong. It just states that when something does go wrong, only the features of the applications directly related to the struck component become unavailable, and only until the component comes up again.
The life of a web application developer is interesting. We build applications that are hosted in the most hostile environment imaginable: the internet. People can travel at light speed, nearly invisible, and have all the power to bring your application down, willingly or accidentally. In fact, more applications crash because of accidental abuse than because of hackers.
This brings us to the first threat: the user. Unfortunately, the user pays for our client’s income (and ours too). We’ll have to deal with them.
Several aspects of users’ behavior pose a threat to our application:
- Users don’t wait. If a page doesn’t show up quickly enough, they’ll click the link again, starting two threads to work for them and only worsening the situation
- Users write about cool stuff and come to your site after reading that stuff. This may cause a sudden burst of users, possibly overloading your site
- Buyers will use all integration points, causing traffic in every little corner of your application
- Some users aren’t actually users, but bots, and may cause unused sessions to be created for them
- Users only visit your site when they’re awake, causing traffic surges at daytime, while your processors wait for orders at night.
A threat briefly mentioned above is the integration point. This is the point where your application meets another. An example of this is the database or an external payment provider.
Integration points are dangerous for a couple of reasons:
- External systems typically have to be reached over a network connection, making the integration dependent on even more components
- Ever had the problem of a firewall dropping your packets? It’s a nice way of not being able to set up a connection without getting any feedback about why this happens.
- Some external systems just keep you waiting. If your payment provider is busy, it will be happy to let you wait. But will your user wait? How many threads will clog up at the integration point, waiting for a response?
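To put a rough number on that last question: Little's law says the average number of calls in flight equals the arrival rate multiplied by the time each call takes. The sketch below applies it with made-up figures (50 requests per second, a pool of 200 threads, a 30-second timeout). None of these numbers come from a real system; they only illustrate the order of magnitude.

```python
def threads_waiting(requests_per_second: float, seconds_per_call: float) -> float:
    """Little's law: average number of calls in flight at the integration point."""
    return requests_per_second * seconds_per_call


# Hypothetical figures, chosen only for illustration.
POOL_SIZE = 200
RATE = 50.0  # requests per second that need the payment provider

healthy = threads_waiting(RATE, 0.2)   # provider answers in 200 ms
stalled = threads_waiting(RATE, 30.0)  # provider hangs until a 30 s timeout

print(f"healthy: about {healthy:.0f} of {POOL_SIZE} threads busy")
print(f"stalled: about {stalled:.0f} threads needed, pool exhausted: {stalled > POOL_SIZE}")
```

With a slow backend the demand for threads grows with the waiting time, so a pool that is nearly idle on a good day can be exhausted within seconds on a bad one.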
I could go on for a long time, but these are the two threats that I want to cover in this post. If you want more, read Michael Nygard’s book “Release It!”.
Fortunately, there is a series of patterns that can be applied to increase an application’s stability. One of these patterns is the circuit breaker. You’re most likely familiar with the electrical circuit breaker. When there is a leak or an overload, the circuit breaker opens and stops the electrical current. You’ll be in the dark, but at least your television, washing machine and stereo are saved from destruction.
Now, imagine your database has a hard time keeping up with the number of requests. It will probably take longer to process each request, causing even more requests to pile up. In the end, your database server might break, taking your application with it. That is when we need a circuit breaker to stop the current from going to our database. I’m talking about digital current here, not electricity. Let’s keep those fans turning on the hardware!
A software circuit breaker should do a few things. Most importantly, it should monitor each call performed on a specific backend. When a certain number of calls fail or take too long, the circuit breaker opens, blocking the current. Each request coming in while the circuit breaker is open results in an immediate failure, rethrowing the exception received from the last call. This has two positive effects: your thread returns more quickly and the load on the external system is relieved, allowing it to recover.
After a certain period of time, the circuit breaker will let a single request pass through to the backend. If that call fails, nothing changes. If the call succeeds, the circuit breaker will close, allowing the requests to pass through to the external system.
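The behavior described above fits in a few dozen lines. What follows is a minimal Python sketch, not the implementation from our repository. The class name, the parameter names and their defaults are all my own assumptions, and I assume that "a certain number of calls" means consecutive failed or slow calls. The clock is injectable so the breaker can be tested without sleeping.

```python
import threading
import time


class CircuitBreaker:
    """Sketch of a software circuit breaker (hypothetical names and defaults)."""

    def __init__(self, failure_threshold=3, slow_call_seconds=1.0,
                 retry_after_seconds=10.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._slow_call = slow_call_seconds
        self._retry_after = retry_after_seconds
        self._clock = clock              # injectable so tests need not sleep
        self._lock = threading.Lock()
        self._failures = 0               # consecutive failed or slow calls
        self._opened_at = None           # None means the breaker is closed
        self._last_error = None

    @property
    def is_open(self):
        return self._opened_at is not None

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self._opened_at is not None:
                if self._clock() - self._opened_at < self._retry_after:
                    raise self._last_error   # fail fast, backend untouched
                # The wait is over: let this one request through as a trial and
                # restart the timer so other requests keep failing fast.
                self._opened_at = self._clock()
        started = self._clock()
        try:
            result = func(*args, **kwargs)
        except Exception as error:
            self._record_failure(error)
            raise
        if self._clock() - started > self._slow_call:
            self._record_failure(TimeoutError("backend call was too slow"))
        else:
            self._record_success()
        return result

    def _record_failure(self, error):
        with self._lock:
            self._failures += 1
            self._last_error = error
            if self._failures >= self._threshold:
                # Open the breaker, or keep it open after a failed trial.
                self._opened_at = self._clock()

    def _record_success(self):
        with self._lock:
            self._failures = 0
            self._opened_at = None           # close the breaker


breaker = CircuitBreaker(failure_threshold=2, retry_after_seconds=5.0)


def query_database():
    raise ConnectionError("database overloaded")  # simulated failing backend


for attempt in range(3):
    try:
        breaker.call(query_database)
    except ConnectionError as error:
        print(f"attempt {attempt}: {error} (breaker open: {breaker.is_open})")
```

In the loop at the bottom, the third attempt never reaches `query_database`: the breaker is open by then and rethrows the stored exception immediately. A slow call that does return still hands its result to the caller; it is only counted against the backend.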
The number of use cases for a circuit breaker is most likely limited only by our imagination. A nice one I could think of is to limit functionality when load becomes too high. Typically, this would be the functionality that is “nice to have”, but not essential to your application. To do this, your code would have to inspect the state of the circuit breaker when deciding which options are made available to your user. Do keep in mind, though, that limiting functionality might cause your circuit breaker to sit and wait for requests that never come, so it never closes. You’ll need another mechanism for detecting when it can be closed.
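As an illustration of inspecting the breaker state, here is a small Python sketch. The feature names and the `BreakerState` stand-in are hypothetical; a real circuit breaker would expose its state in its own way.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BreakerState:
    """Stand-in for a real circuit breaker; only the open/closed flag matters here."""
    is_open: bool = False


def available_features(recommendations_breaker: BreakerState) -> List[str]:
    """Decide which options to offer the user, based on the breaker state."""
    features = ["browse", "checkout"]        # essential: always offered
    if not recommendations_breaker.is_open:
        features.append("recommendations")   # nice to have: dropped under stress
    return features


print(available_features(BreakerState(is_open=False)))
print(available_features(BreakerState(is_open=True)))
```

Note that once "recommendations" is hidden, no user request will exercise that backend anymore, which is exactly why a separate mechanism (a periodic health check, for instance) is needed to close the breaker again.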
Implementing the circuit breaker
I’ve been playing around with a small circuit breaker implementation. You can find the code in our code repository. It’s really small, but it seems really powerful.
There are two main components in the application.
- The interceptor is responsible for intercepting calls to the external system and sending the requests and responses to the circuit breaker [see code]
- The circuit breaker itself decides if requests should be sent and what should happen when responses are received [see code]
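The repository code is not reproduced here. To give an idea of how the two roles can be kept apart, here is a hypothetical Python sketch in which a decorator plays the interceptor and a small class makes the decisions. All names are my own, and the recovery step after a waiting period is left out to keep it short.

```python
import functools


class SimpleBreaker:
    """Decision-making half: opens after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.last_error = None

    def allow_request(self):
        return self.failures < self.threshold

    def on_success(self):
        self.failures = 0

    def on_failure(self, error):
        self.failures += 1
        self.last_error = error


def guarded_by(breaker):
    """Interceptor half: wraps a call and reports every outcome to the breaker."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not breaker.allow_request():
                raise breaker.last_error     # breaker is open: skip the backend
            try:
                result = func(*args, **kwargs)
            except Exception as error:
                breaker.on_failure(error)
                raise
            breaker.on_success()
            return result
        return wrapper
    return decorator


payment_breaker = SimpleBreaker(threshold=2)


@guarded_by(payment_breaker)
def charge(amount):
    raise ConnectionError("payment provider unreachable")  # simulated outage
```

The calling code only sees `charge`; it does not need to know a circuit breaker is involved, and the breaker logic can be changed without touching the interceptor.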
To find out if a circuit breaker is actually useful, I’ve done a few load tests. I’ve run these tests against the same application twice: once with the circuit breaker and once without. The results are promising.
When activated, the circuit breaker causes failures to happen a lot faster than when there is none. This means threads are released much faster, allowing the next unfortunate user to receive the error. Of course, it doesn’t help your user much, but at least you’ll be able to tell him something is wrong instead of giving him a timeout after 30 seconds.
Don’t believe me? Have a look at the screenshots of the JMeter tests below:
In this post, I’ve covered two of the stability threats that can be prevented by a correct implementation of the circuit breaker pattern. Our svn code repository contains example code of an implementation of a circuit breaker. And finally, I’ve shown some load test results to demonstrate the advantage of applying a circuit breaker to your architecture.