Bring some stability to your architecture

Applications have to run in high-consequence environments. They have to serve hundreds of thousands of users 24/7. Our clients spend millions on hardware and software and depend heavily on the revenue generated by these applications. An unnecessary outage of these applications is fatal.

Software architects play an important role in setting up an architecture that can cope with these high demands. At JAOO, Michael Nygard gave a talk, “Failure Comes in Flavors”, that gave very good insight into the risks and opportunities of today’s applications. The talk was divided into two sessions. The first session covered the bad news: the stability threats. He discussed several situations that pose a threat to the long and happy life of an application. The second session was a happier one. It covered the patterns that should be applied to the application architecture to prevent these threats.

In this post, I will elaborate on some of the stability threats and pick one specific pattern to resolve them: the circuit breaker.

Stability

Before we go on about the stability threats, we need a common understanding about what stability is.

Michael Nygard defines stability as “the consistent, long-term availability of features”. It can be divided into four different qualities:

Severability; instead of crashing completely, only lose the functional areas that are struck by failure
Resilience; automatically recover from failures
Recoverability; make sure only crashed components need to be restarted, not the entire application
Tolerance; absorb shocks from other components instead of propagating them

Note that stability doesn’t mean that nothing will ever go wrong. It just states that when something does go wrong, only the features of the application directly related to the struck component become unavailable, and only until the component comes up again.

Stability threats

The life of a web application developer is interesting. We build applications that are hosted in the most hostile environment imaginable: the internet. People can travel at light speed, are nearly invisible, and have all the power to bring your application down, willingly or accidentally. In fact, more applications crash because of accidental abuse than because of hackers.

This brings us to the first threat: the user. Unfortunately, the user pays for our client’s income (and ours too). We’ll have to deal with them.

Several aspects of the users’ behavior pose a threat to our application:

Users don’t wait. If a page doesn’t show up quickly enough, the user will click the link again, starting two threads to work for him, which only worsens the situation
Users write about cool stuff and come to your site after reading that stuff. This may cause a sudden burst of users, possibly overloading your site
Buyers will use all integration points, causing traffic in every little corner of your application
Some users aren’t actually users, but bots, and may cause unused sessions to be created for them
Users only visit your site when they’re awake, causing traffic surges at daytime, while your processors wait for orders at night.

A threat briefly mentioned above is the integration point. This is the point where your application meets another. An example of this is the database or an external payment provider.

Integration points are dangerous for a couple of reasons:

External systems typically have to be reached over a network connection, making the integration dependent on even more components
Ever had the problem of a firewall dropping your packets? It’s a nice way of not being able to set up a connection without getting any feedback about why this happens.
Some external systems just keep you waiting. If your payment provider is busy, they’ll be happy to keep you waiting. But will your user wait? How many threads will clog up at the integration point, waiting for a response?

I could go on for a long time, but these are the two threats that I want to cover in this post. If you want more, read Michael Nygard’s book “Release It”.

Stability patterns

Fortunately, there is a series of patterns that can be applied to increase an application’s stability. One of these patterns is the circuit breaker. You’re most likely familiar with the electrical circuit breaker. When you have an electricity leak or high power throughput, the circuit breaker will open and stop the electrical current. You’ll be in the dark, but at least your television, washing machine and music installation are saved from destruction.

Now, imagine your database has a hard time keeping up with the number of requests. It will probably take longer to process each request, causing even more requests to pile up. In the end, your database server might break, taking your application with it. That is when we need a circuit breaker to stop the current from going to our database. I’m talking about digital current here, not electricity. Let’s keep those fans turning on the hardware!

A software circuit breaker should do a few things. Most importantly, it should monitor each call performed on a specific backend. When a certain number of calls fail or take too long, the circuit breaker opens, blocking the current. Each request coming in while the circuit breaker is open will result in an immediate failure, throwing the exception that was received from the last call. This has two positive effects: your thread returns more quickly and the load on the external system is relieved, allowing it to recover.

After a certain period of time, the circuit breaker will let a single request pass through to the backend. If that call fails, nothing changes. If the call succeeds, the circuit breaker will close, allowing the requests to pass through to the external system.
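
To make the behaviour described above concrete, here is a minimal sketch in Java. It is not the code from the repository. The class name SimpleCircuitBreaker, the "consecutive failures" threshold and the retryAfterMillis parameter are assumptions made for illustration, and detecting calls that "take too long" is left out to keep the example short.

```java
import java.util.concurrent.Callable;

/** Illustrative circuit breaker; all names and thresholds are hypothetical. */
class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening (assumption)
    private final long retryAfterMillis;  // time to stay open before one trial call (assumption)

    private int consecutiveFailures = 0;
    private boolean open = false;
    private boolean trialInProgress = false;
    private long openedAt = 0L;
    private Exception lastFailure;

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    <T> T call(Callable<T> backendCall) throws Exception {
        synchronized (this) {
            if (open) {
                boolean retryDue = System.currentTimeMillis() - openedAt >= retryAfterMillis;
                if (!retryDue || trialInProgress) {
                    // Fail fast: repeat the last exception without touching the backend.
                    throw lastFailure;
                }
                // The waiting period is over: let exactly one request through.
                trialInProgress = true;
            }
        }
        try {
            T result = backendCall.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure(e);
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        open = false;            // a successful call closes the breaker
        trialInProgress = false;
    }

    private synchronized void onFailure(Exception e) {
        lastFailure = e;
        consecutiveFailures++;
        trialInProgress = false;
        if (open || consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = System.currentTimeMillis(); // (re)start the waiting period
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

A caller would wrap each backend call, for example breaker.call(() -> paymentClient.pay(order)), where paymentClient is a stand-in for whatever integration point you want to protect.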

The number of use cases for a circuit breaker is most likely limited by our imagination. A nice one I could think of was to limit functionality when load becomes too high. Typically, this would be the functionality that is “nice to have”, but not essential to your application. To do this, your code would have to inspect the state of the circuit breaker when deciding which options are made available to your user. Do keep in mind though, that limiting functionality might cause your circuit breaker to sit and wait for requests which don’t come, never closing it. You’ll need another mechanism for detecting when it can be closed.
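
As a rough illustration of limiting functionality based on the breaker state, the sketch below hides a "nice to have" option while a breaker is open. FeatureMenu, the option names and the BooleanSupplier used to read the breaker state are all made up for this example; a real application would plug in its own breaker and its own feature list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical menu builder that drops optional features while a breaker is open. */
class FeatureMenu {
    // Reports whether the breaker guarding the optional backend is currently open.
    private final BooleanSupplier optionalBackendBreakerOpen;

    FeatureMenu(BooleanSupplier optionalBackendBreakerOpen) {
        this.optionalBackendBreakerOpen = optionalBackendBreakerOpen;
    }

    List<String> availableOptions() {
        List<String> options = new ArrayList<>();
        // Essential functionality is always offered.
        options.add("search");
        options.add("checkout");
        // "Nice to have" functionality is only offered while the breaker is closed.
        if (!optionalBackendBreakerOpen.getAsBoolean()) {
            options.add("recommendations");
        }
        return options;
    }
}
```

Note that this sketch only reads the state. Because hidden features generate no traffic, something else (a timer or a background health check, for instance) would still have to send the occasional request so the breaker gets a chance to close.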

Implementing the circuit breaker

I’ve been playing around with a small circuit breaker implementation. You can find the code in our code repository. It’s really small, but seemed really powerful.

There are 2 main components in the application.

The interceptor is responsible for intercepting calls to the external system and sending the requests and responses to the circuit breaker [see code]
The circuit breaker itself decides if requests should be sent and what should happen when responses are received [see code]
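
The split between the two components can be sketched with a plain JDK dynamic proxy. This is a stand-in technique chosen to keep the example self-contained, not necessarily how the repository code intercepts calls. The BreakerDecisions interface, the CircuitBreakerInterceptor class and the exception thrown when the breaker is open are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

/** Hypothetical contract: what the interceptor asks and reports to the circuit breaker. */
interface BreakerDecisions {
    boolean allowRequest();
    void recordSuccess();
    void recordFailure(Throwable cause);
}

/** Intercepts every call on the target and routes request/response through the breaker. */
class CircuitBreakerInterceptor implements InvocationHandler {
    private final Object target;
    private final BreakerDecisions breaker;

    CircuitBreakerInterceptor(Object target, BreakerDecisions breaker) {
        this.target = target;
        this.breaker = breaker;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (!breaker.allowRequest()) {
            // The breaker decided not to send the request to the external system.
            throw new IllegalStateException("Circuit breaker is open for " + method.getName());
        }
        try {
            Object result = method.invoke(target, args);
            breaker.recordSuccess();
            return result;
        } catch (InvocationTargetException e) {
            // Unwrap the reflection wrapper so callers see the original exception.
            breaker.recordFailure(e.getCause());
            throw e.getCause();
        }
    }

    /** Wraps the target in a proxy that implements the given interface. */
    static <T> T wrap(T target, Class<T> type, BreakerDecisions breaker) {
        Object proxy = Proxy.newProxyInstance(
                CircuitBreakerInterceptor.class.getClassLoader(),
                new Class<?>[] {type},
                new CircuitBreakerInterceptor(target, breaker));
        return type.cast(proxy);
    }
}
```

The calling code keeps working against the interface of the external system and never sees the breaker; only the wiring (the call to wrap) changes.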

To find out if a circuit breaker is actually useful, I’ve done a few load tests. I’ve run these tests against the same application twice: once with the circuit breaker and once without. The results are promising.

When activated, the circuit breaker will cause failures to happen a lot faster than without one. This means threads are released much faster, allowing the next unfortunate user to receive the error. Of course, it doesn’t help your user much, but at least you’ll be able to tell him something is wrong instead of giving him a timeout after 30 seconds.

Don’t believe me? Have a look at the screenshots of the JMeter tests below:

250 concurrent threads with a targeted throughput of 50 requests per second. The circuit breaker variant doesn’t have any failed calls, while the normal implementation has more than 40% failures.

250 concurrent threads with a targeted throughput of 100 requests per second. This is where the circuit breaker variant seems to have its first failures.

And finally, a 1000-thread burst, just to see what happens if you spam Tomcat with 1000 concurrent, continuously refreshing users. It’s 80% failure for the normal implementation against 32% for the circuit breaker.

Conclusion

In this post, I’ve covered two of the stability threats that can be prevented by a correct implementation of the circuit breaker pattern. Our svn code repository contains example code of an implementation of a circuit breaker. And finally, I’ve shown some load test results to demonstrate the advantage of applying a circuit breaker to your architecture.


17 thoughts on “Bring some stability to your architecture”

Allard (post author)
April 14, 2009 at 7:13 pm

Hi Harald,

thanks for your reply, and sorry for my late response. For some reason, I wasn’t notified about your comment. The circuit breaker is one of the stability patterns that Michael Nygard mentions in his book “Release It”. See http://www.michaelnygard.com/ for a link to his book. Happy reading.

Allard
harald
April 9, 2009 at 9:14 am

very interesting read, thanks! you wrote, that the ‘circuit breaker’ is just one of a series of patterns for increasing stability. can you point me to some website (or book) which explains other stability patterns?

thanks again,
harald
Robert vd Steen
November 13, 2008 at 12:23 pm

Hey allard,

I will let you know about results on this. This was just a prototype for a fix, so it may take a while before I can give you any results

Regards,

Robert vd Steen
Allard (post author)
November 13, 2008 at 8:21 am

Erik van Oosten said: “Could you make the annotation work on the class level as well?”.

Well, yes you could, but I don’t think it is a good idea to do this in general. Not all methods on each class are “dangerous” and should be monitored. Furthermore, as we discussed offline, currently the last exception is repeated when the circuit breaker is open. Different method calls might throw different exceptions, meaning that unexpected exception could arise from method calls, causing trouble. In that case it might be a good idea to throw a generic runtime exception (e.g. SystemNotAvailableException) instead of repeating the last one.
Allard (post author)
November 13, 2008 at 8:17 am

Hi Robert,

nice to hear your “success story” at the blue swan. Have you had the chance to loadtest and profile the circuit breaker in your environment? If so, would you like to share the (global) results?

Say hi to the blue friend for me!

Regards,

Allard
Robert vd Steen
November 12, 2008 at 3:55 pm

Hey Allard,

By Advice I have now implemented the circuitbreaker back here at your old blue friends. Works very nice.

Thanks,

Robert vd Steen
Erik van Oosten
November 11, 2008 at 9:43 am

Thanks, I understand now that the setAndCheck will only return true once.

Could you make the annotation work on the class level as well?
Allard (post author)
November 5, 2008 at 5:12 pm

The next request is blocked by the circuit breaker. Only when the “one” result comes back and is successful, the circuit breaker will close and re-establish the connection.
Erik van Oosten
November 5, 2008 at 11:33 am

Yes I understand that. But what do you do with the next request? In particular when that first request is still in progress.
Allard (post author)
November 4, 2008 at 8:38 pm

Erik, that has nothing to do with the state. In my implementation, I send every first request after 10 seconds to the backend in the “open” state. Using the “half-open” state for the duration of that one request doesn’t add any value.
Erik van Oosten
November 4, 2008 at 4:26 pm

If many requests hammer continuously and simultaneously at your circuit breaker, leaving out the half-open state will make your circuit breaker re-evaluate many concurrent incoming requests at the same time. That does not sound like a good idea.
Allard (post author)
November 2, 2008 at 8:50 pm

The half-open state described by Michael is the state where the next incoming call is passed through, to test if the connection has been fixed. Personally, I don’t really see that as a state, just an event that happens during the “open” state.

I had renamed the circuit breakers in the example code. If you check out the example project (http://gridshore.googlecode.com/svn/trunk/StabilityPatterns), you’ll find the up-to-date code samples.
Erik van Oosten
November 2, 2008 at 8:36 pm

I can’t find an example in the repository. Which file should I look at?

A circuit breaker as described by Michael Nygard also has the half-open state. Yet, I can not find this state in the code. What is your comment on that?
Allard (post author)
October 31, 2008 at 8:19 am

Jettro,

you’re definitely right. It was late yesterday, and a small red light on my laptop wasn’t allowing me to go on much longer without getting out of my comfy sofa and find the charger.

The example code also shows how you can expose the state of the circuit breaker using JMX. Make sure you start your application container (e.g. tomcat) using the command line arguments -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9004 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

This will allow you to point your jmxconsole to localhost:9004 and actually see the Catalina and circuit breaker MBeans. If you don’t supply these parameters, you will only see standard JVM MBeans.

I’ll dedicate a separate post to JMX, since there is more to say about this topic than fits in a paragraph.
Allard (post author)
October 31, 2008 at 8:07 am

Ben, sorry, wrong choice of words… It’s fixed.
Ben
October 31, 2008 at 1:22 am

Allard,

I like the idea of the circuit breaker. Moreover, I know that the project I’m working on could benefit greatly from having a few, so I’m going to look into it in a short while. Thanks. 😀

Just one question though: if the results of your load test look so promising, why are they appalling?
jettro
October 30, 2008 at 10:41 pm

Nice post, need to embed it in one of my applications. What’s up with the JMX, any code and screens on that as well?
