Applications have to run in high-consequence environments. They have to serve hundreds of thousands of users 24/7. Our clients spend millions on hardware and software and depend heavily on the revenue generated by these applications. An unnecessary outage of these applications is fatal.
Software architects play an important role in setting up an architecture that can cope with these high demands. At JAOO, Michael Nygard gave a talk, “Failure Comes in Flavors”, that gave very good insight into the risks and opportunities of today’s applications. The talk was divided into two sessions. The first session covered the bad news: the stability threats. He discussed several situations that pose a threat to the long and happy life of an application. The second session was a happier one. It covered the patterns that should be applied to the application architecture to prevent these threats.
In this post, I will elaborate on some of the stability threats and pick one specific pattern to resolve them: the circuit breaker.
Before we go on about the stability threats, we need a common understanding about what stability is.
Michael Nygard defines stability as “the consistent, long-term availability of features”. It can be divided into four different qualities:
- Severability; instead of crashing completely, only lose the functional areas that are struck by failure
- Resilience; automatically recover from failures
- Recoverability; make sure only crashed components need to be restarted, not the entire application
- Tolerance; absorb shocks from other components instead of propagating them
Note that stability doesn’t mean that nothing will ever go wrong. It just states that when something does go wrong, only the features of the application directly related to the struck component become unavailable, and only until the component comes up again.
The life of a web application developer is interesting. We build applications that are hosted in the most hostile environment imaginable: the internet. People can travel at light speed, nearly invisible, and have all the power to bring your application down, willingly or accidentally. In fact, more applications crash because of accidental abuse than because of hackers.
This brings us to the first threat: the user. Unfortunately, the user pays for our client’s income (and ours too). We’ll have to deal with them.
Several aspects of user behavior pose a threat to our application:
- Users don’t wait. If a page doesn’t show up quickly enough, they’ll click the link again, starting two threads to work for them and only worsening the situation
- Users write about cool stuff and come to your site after reading that stuff. This may cause a sudden burst of users, possibly overloading your site
- Buyers will use all integration points, causing traffic in every little corner of your application
- Some users aren’t actually users, but bots, and may cause unused sessions to be created for them
- Users only visit your site when they’re awake, causing traffic surges at daytime, while your processors wait for orders at night.
A threat briefly mentioned above is the integration point. This is the point where your application meets another. An example of this is the database or an external payment provider.
Integration points are dangerous for a couple of reasons:
- External systems typically have to be reached over a network connection, making the integration dependent on ever more components
- Ever had the problem of a firewall dropping your packets? It’s a nice way of not being able to set up a connection without getting any feedback about why this happens.
- Some external systems just keep you waiting. If your payment provider is busy, they’ll be happy to keep you waiting. But will your user wait? How many threads will clog up at the integration point, waiting for a response?
I could go on for a long time, but these are the two threats that I want to cover in this post. If you want more, read Michael Nygard’s book “Release It!”.
Fortunately, there is a series of patterns that can be applied to increase an application’s stability. One of these patterns is the circuit breaker. You’re most likely familiar with the electrical circuit breaker. When you have an electricity leak or high power throughput, the circuit breaker will open and stop the electrical current. You’ll be in the dark, but at least your television, washing machine and music installation are saved from destruction.
Now, imagine your database has a hard time keeping up with the number of requests. It will probably take longer to process each request, causing even more requests to pile up. In the end, your database server might break, taking your application with it. That is when we need a circuit breaker to stop the current from going to our database. I’m talking about digital current here, not electricity. Let’s keep those fans turning on the hardware!
A software circuit breaker should do a few things. Most importantly, it should monitor each call made to a specific backend. When a certain number of calls fail or take too long, the circuit breaker opens, blocking the current. Each request that comes in while the circuit breaker is open fails immediately, rethrowing the exception received from the last call. This has two positive effects: your thread returns more quickly and the load on the external system is relieved, allowing it to recover.
After a certain period of time, the circuit breaker will let a single request pass through to the backend. If that call fails, nothing changes. If the call succeeds, the circuit breaker will close, allowing the requests to pass through to the external system.
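The behavior described above can be sketched in a few lines of Java. This is only an illustration, not the code from the repository: the class name `SimpleCircuitBreaker`, the parameters `failureThreshold` and `retryAfterMillis`, and the choice to count only consecutive exceptions (ignoring slow calls) are all assumptions made to keep the example short.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of a circuit breaker; names and thresholds are illustrative.
class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures that open the breaker
    private final long retryAfterMillis;  // wait before a trial call is let through

    private int consecutiveFailures;
    private boolean open;
    private long openedAt;
    private Exception lastFailure;

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    <T> T call(Callable<T> backendCall) throws Exception {
        synchronized (this) {
            if (open) {
                long now = System.currentTimeMillis();
                if (now - openedAt < retryAfterMillis) {
                    throw lastFailure; // fail fast without touching the backend
                }
                openedAt = now; // this call is the trial; others keep failing fast
            }
        }
        try {
            T result = backendCall.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure(e);
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        open = false; // a successful call closes the breaker
    }

    private synchronized void onFailure(Exception e) {
        lastFailure = e;
        if (open || ++consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = System.currentTimeMillis(); // (re)start the waiting period
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

A caller would wrap each backend call, for example `breaker.call(() -> dao.findOrders(customerId))`, where `dao` and `findOrders` stand in for your own integration point. A fuller version would also treat calls that exceed a time limit as failures.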
The number of use cases for a circuit breaker is most likely limited by our imagination. A nice one I could think of was to limit functionality when load becomes too high. Typically, this would be the functionality that is “nice to have”, but not essential to your application. To do this, your code would have to inspect the state of the circuit breaker when deciding which options are made available to your user. Do keep in mind though, that limiting functionality might cause your circuit breaker to sit and wait for requests which don’t come, never closing it. You’ll need another mechanism for detecting when it can be closed.
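To make the idea of limiting functionality concrete, here is a small Java sketch. Everything in it is hypothetical: the `FeatureMenu` class, the option names, and the `BooleanSupplier` that stands in for whatever “is the breaker open?” query your circuit breaker offers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical example: hide a "nice to have" feature while its breaker is open.
class FeatureMenu {
    private final BooleanSupplier recommendationsBreakerOpen;

    FeatureMenu(BooleanSupplier recommendationsBreakerOpen) {
        this.recommendationsBreakerOpen = recommendationsBreakerOpen;
    }

    List<String> visibleOptions() {
        List<String> options = new ArrayList<>();
        options.add("checkout"); // essential, always offered
        if (!recommendationsBreakerOpen.getAsBoolean()) {
            options.add("recommendations"); // optional, only offered while the breaker is closed
        }
        return options;
    }
}
```

Because hidden options generate no traffic, something else (a scheduled probe, for instance) would have to send the occasional trial request that lets the breaker close again.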
Implementing the circuit breaker
I’ve been playing around with a small circuit breaker implementation. You can find the code in our code repository. It’s really small, but it seems really powerful.
There are two main components in the application.
- The interceptor is responsible for intercepting calls to the external system and sending the requests and responses to the circuit breaker [see code]
- The circuit breaker itself decides if requests should be sent and what should happen when responses are received [see code]
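The split between the two components could look roughly like the sketch below. It is not the repository code: it uses a plain JDK dynamic proxy as the interceptor, and the names `Breaker`, `BreakerInterceptor`, `PaymentProvider`, `allowRequest`, `recordSuccess` and `recordFailure` are made up for this illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Hypothetical external system, used only to have something to intercept.
interface PaymentProvider {
    String pay(String orderId);
}

// Decides whether requests may pass and what happens when responses come back.
class Breaker {
    private final int threshold;          // consecutive failures that open the breaker
    private final long retryAfterMillis;  // wait before a trial request is allowed
    private int failures;
    private long openedAt;
    private Throwable lastFailure;

    Breaker(int threshold, long retryAfterMillis) {
        this.threshold = threshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    synchronized boolean allowRequest() {
        if (failures < threshold) {
            return true; // closed: everything passes
        }
        long now = System.currentTimeMillis();
        if (now - openedAt >= retryAfterMillis) {
            openedAt = now; // let one trial request through, keep blocking the rest
            return true;
        }
        return false;
    }

    synchronized void recordSuccess() {
        failures = 0;
    }

    synchronized void recordFailure(Throwable t) {
        failures++;
        lastFailure = t;
        openedAt = System.currentTimeMillis();
    }

    synchronized Throwable lastFailure() {
        return lastFailure;
    }
}

// Intercepts every call on the proxied interface and reports to the breaker.
class BreakerInterceptor implements InvocationHandler {
    private final Object target;
    private final Breaker breaker;

    BreakerInterceptor(Object target, Breaker breaker) {
        this.target = target;
        this.breaker = breaker;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (!breaker.allowRequest()) {
            throw breaker.lastFailure(); // blocked: repeat the last failure
        }
        try {
            Object result = method.invoke(target, args);
            breaker.recordSuccess();
            return result;
        } catch (InvocationTargetException e) {
            breaker.recordFailure(e.getCause());
            throw e.getCause(); // rethrow the backend's own exception
        }
    }

    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface, Breaker breaker) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface},
                new BreakerInterceptor(target, breaker));
    }
}
```

With this in place, `BreakerInterceptor.wrap(realProvider, PaymentProvider.class, new Breaker(5, 10_000))` would return a guarded `PaymentProvider`; the values 5 and 10 000 are arbitrary. An annotation-driven interceptor could follow the same division of labor.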
To find out if a circuit breaker is actually useful, I’ve done a few load tests. I’ve run these tests against the same application twice: once with the circuit breaker and once without. The results are promising.
When activated, the circuit breaker will cause failures to happen a lot faster than when there is none. This means threads are released much faster, allowing the next unfortunate user to receive the error. Of course, it doesn’t help your user much, but at least you’ll be able to tell him something is wrong instead of giving him a timeout after 30 seconds.
Don’t believe me? Have a look at the screenshots of the JMeter tests below:
250 concurrent threads with a targeted throughput of 50 requests per second. The circuit breaker variant doesn’t have any failed calls, while the normal implementation has more than 40% failures.
250 concurrent threads with a targeted throughput of 100 requests per second. This is where the circuit breaker variant seems to have its first failures.
And finally, a 1000 thread burst, just to see what happens if you spam Tomcat with 1000 concurrent, continuously refreshing users. That’s 80% failures for the normal implementation against 32% for the circuit breaker.
In this post, I’ve covered two of the stability threats that can be prevented by a correct implementation of the circuit breaker pattern. Our SVN code repository contains example code of an implementation of a circuit breaker. And finally, I’ve shown some load test results to demonstrate the advantage of applying a circuit breaker to your architecture.
17 thoughts on “Bring some stability to your architecture”
thanks for your reply, and sorry for my late response. For some reason, I wasn’t notified about your comment. The circuit breaker is one of the stability patterns that Michael Nygard mentions in his book “Release It”. See http://www.michaelnygard.com/ for a link to his book. Happy reading.
very interesting read, thanks! you wrote, that the ‘circuit breaker’ is just one of a series of patterns for increasing stability. can you point me to some website (or book) which explains other stability patterns?
I will let you know about results on this. This was just a prototype for a fix, so it may take a while before I can give you any results
Robert vd Steen
Erik van Oosten said: “Could you make the annotation work on the class level as well?”.
Well, yes you could, but I don’t think it is a good idea to do this in general. Not all methods on each class are “dangerous” and should be monitored. Furthermore, as we discussed offline, currently the last exception is repeated when the circuit breaker is open. Different method calls might throw different exceptions, meaning that unexpected exceptions could arise from method calls, causing trouble. In that case it might be a good idea to throw a generic runtime exception (e.g. SystemNotAvailableException) instead of repeating the last one.
nice to hear your “success story” at the blue swan. Have you had the chance to loadtest and profile the circuit breaker in your environment? If so, would you like to share the (global) results?
Say hi to the blue friend for me!
By Advice I have now implemented the circuitbreaker back here at your old blue friends. Works very nice.
Robert vd Steen
Thanks, I understand now that the setAndCheck will only return true once.
Could you make the annotation work on the class level as well?
The next request is blocked by the circuit breaker. Only when the “one” result comes back and is successful will the circuit breaker close and re-establish the connection.
Yes I understand that. But what do you do with the next request? In particular when that first request is still in progress.
Erik, that has nothing to do with the state. In my implementation, I send every first request after 10 seconds to the backend in the “open” state. Using the “half-open” state for the duration of that one request doesn’t add any value.
If many requests hammer continuously and simultaneously at your circuit breaker, leaving out the half-open state will make your circuit breaker re-evaluate many concurrent incoming requests at the same time. That does not sound like a good idea.
The half-open state described by Michael is the state where the next incoming call is passed through, to test if the connection has been fixed. Personally, I don’t really see that as a state, just an event that happens during the “open” state.
I have renamed the circuit breakers in the example code. If you check out the example project (http://gridshore.googlecode.com/svn/trunk/StabilityPatterns), you’ll find the up-to-date code samples.
I can’t find an example in the repository. Which file should I look at?
A circuit breaker as described by Michael Nygard also has the half-open state. Yet, I cannot find this state in the code. What is your comment on that?
you’re definitely right. It was late yesterday, and a small red light on my laptop wasn’t allowing me to go on much longer without getting out of my comfy sofa and find the charger.
The example code also shows how you can expose the state of the circuit breaker using JMX. Make sure you start your application container (e.g. tomcat) using the command line arguments -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9004 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
This will allow you to point your jmxconsole to localhost:9004 and actually see the Catalina and circuit breaker MBeans. If you don’t supply these parameters, you will only see standard JVM MBeans.
I’ll dedicate a separate post to JMX, since there is more to say about this topic than fits in a paragraph.
Ben, sorry, wrong choice of words… It’s fixed.
I like the idea of the circuit breaker. Moreover, I know that the project I’m working on could benefit greatly from having a few, so I’m going to look into it in a short while. Thanks. 😀
Just one question though: if the results of your load test look so promising, why are they appalling?
Nice post, need to embed it in one of my applications. What’s up with the JMX, any code and screens on that as well?