AWS outage destroys EBS-based AMIs in Europe region

I always recommend creating your own EBS-based AMIs (e.g. for running complex software such as Oracle Fusion Middleware). This holds true for the classic AMIs as well as for the converted Oracle VM templates. Never rely on the existence of AMIs provided by Oracle because:

- Oracle can change, update, or remove them at any time.

- They often don’t exist for certain AWS regions, are S3-based, or exist only for 32-bit OEL instead of 64-bit.

- Also, the AMIs provided often don’t exist for a specific version of Oracle products.

So always create your own copy! Yet here is something to consider:

AWS broke an EBS-based AMI of mine by deleting arbitrary blocks in the image. This is particularly annoying since there is no easy way to create an offline copy of an EBS-based AMI. You could rsync the running image to a local computer, but there is absolutely no support for doing this in a user-friendly way from the AWS console.
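To make the rsync idea a bit more concrete, here is a minimal sketch in Python that only assembles the rsync command line for pulling the file system of a running instance to a local directory. The host name, key file and target directory are made-up placeholders, and the exclude list is my assumption about what you would not want to copy. Note that this is a file-level copy, not a bootable image, so it is no substitute for a real offline AMI backup.

```python
import shlex

def build_rsync_command(host, key_file, target_dir, user="root"):
    """Assemble an rsync command that pulls '/' from a running instance.

    All arguments are placeholders for illustration; nothing is executed here.
    """
    # Pseudo file systems and scratch directories that should not be copied
    # (assumed list, adjust to your installation).
    excludes = ["/proc/*", "/sys/*", "/dev/*", "/tmp/*", "/mnt/*", "/lost+found"]
    cmd = [
        "rsync",
        "-aAXHz",         # archive mode, ACLs, xattrs, hard links, compression
        "--numeric-ids",  # keep uid/gid numbers instead of mapping names
        "-e", f"ssh -i {shlex.quote(key_file)}",
    ]
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    cmd += [f"{user}@{host}:/", target_dir]
    return cmd

# Hypothetical values, not taken from a real setup.
cmd = build_rsync_command("ec2-host.example.com", "/home/me/aws-key.pem", "/backup/soa-ami/")
print(shlex.join(cmd))
```

The list could be handed to `subprocess.run(cmd, check=True)` to actually start the transfer. I would stop the Oracle processes on the instance first, otherwise the copy may contain half-written files.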

The good: They informed me in time (being in Sydney if something happens in the EU regions gives you an advantage) and sent an apology. They also replaced the deleted blocks with empty blocks.

The bad: It cost me several days to create this AMI, a full-blown, EBS-based OEL installation of Oracle SOA Suite 11.1.1.5 (I still have to check if it will be usable after a file system check).

For a more detailed explanation of what happened, take a look at Amazon’s summary of the events. It boils down to an error in the EBS software that coincided with a power outage in Dublin.

Hello Amazon: Why don’t you provide an easy way to have an offline backup of EBS-based AMIs for disaster recovery?

Amazon’s AWS outage – did the Cloud Fail?

There was a major outage in one of Amazon’s regions affecting several availability zones last Thursday.

- For a summary of the events and their impact, see this RightScale blog entry (I guess it was written by Thorsten, but I am not sure). The RightScale blog has since been updated with more details of the event.

- George Reese, the grand homme of Cloud Computing, calls this event a shining moment for clouds. Don’t get me wrong. I am a big fan of George, not only because he is following me on Twitter :) . He gave a podcast interview repeating that you need to design for the cloud by designing for failure instead of sticking with your traditional architecture.

- Amazon did a poor job communicating what happened. Failures are a part of business but they have to be dealt with accordingly. Add this to your lessons learned list about Clouds. At least I did. Here is their summary.

- In my Cloud Computing book there is a whole chapter about RightScale (who provided the best analysis so far) as well as a section about disaster recovery and another one on designing for clouds (“why it is not enough to simply run WebLogic on AWS”). There is also a free chapter available for download at Oracle’s Archbeat site.

IMHO this event teaches us that it is not enough to know how to simply run WebLogic on AWS or any other IaaS cloud provider such as Rackspace. By the way, this is one of the reasons why my book has more than the initially planned 120 pages …

munz & more: Oracle Gold Partner

I forked out the equivalent value of a trip to the Caribbean (starting from Europe, not from the U.S.!) to become an Oracle Gold partner.

What does it mean?

  • munz & more will continue to be critical and independent
  • opening tickets for bugs should become easier now
  • the partner network offers OPN licenses for presentations

WebLogic 10, 11g and WebLogic 12c and Apache mod_wl Plugin Problems

Unfortunately there is a problem with the Apache plugin for WLS10.3.1. It is rather annoying because it spoils the WebLogic cluster experience as such and even worse: it spoils my distributed cluster demo. I reproduced it working with different teams using WLS 10.3.1 / Windows 32 bit / Apache with mod_wl_20 and also mod_wl_22. And typically I encourage all my customers to create a support case concerning this.

Let me explain what happens. Typically the symptoms are the following: In a distributed cluster setup with three managed servers m1, m2, and m3 you configure the plugin for load balancing, which initially seems to work fine. You can monitor your requests being distributed in a round-robin way, e.g. m2, m3, m1, m2, m3, m1, m2, m3, … Then, if you kill one managed server, let’s say m2, the pattern changes to m1, m3, m1, m1, m1, m1, m1, … So the load balancing breaks.
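If you want to put a number on what you see in the logs, a small sketch like the following helps. It takes the sequence of managed servers that answered your requests (how you collect it is up to you, the function name is just my choice) and computes the share of requests each server received. The two sample sequences are the ones described above.

```python
from collections import Counter

def share_per_server(observed):
    """Return the fraction of requests handled by each managed server."""
    counts = Counter(observed)
    total = len(observed)
    return {server: count / total for server, count in counts.items()}

# Sequences as observed through the Apache plugin (taken from the text above).
healthy = ["m2", "m3", "m1", "m2", "m3", "m1", "m2", "m3"]
after_kill = ["m1", "m3", "m1", "m1", "m1", "m1", "m1"]

for label, sequence in (("all servers up", healthy), ("m2 killed", after_kill)):
    shares = share_per_server(sequence)
    print(label, {server: round(share, 2) for server, share in sorted(shares.items())})
```

With all three servers up, the shares are roughly equal. After killing m2, m1 handles six out of seven requests, although m3 is still alive and should get about half of them.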

UPDATE:

There is now a bug mentioned in the release notes of WLS 10.3.2 that points in the same direction, but the description is rather vague.

UPDATE2:

Last week I could verify the same behaviour on two clusters with WLS 10.3.3 under Windows. I’d recommend that you try it with your Windows installation. It is reproducible even on one machine running three managed servers.

UPDATE3 WebLogic 10.3.5 and WebLogic 12c:

I thought it would be interesting to verify whether the error shows up in WebLogic 12c as well. Contrary to the documentation, there is currently no proxy plugin in the WebLogic 12c (12.1.1) distribution, so I used the 10.3.5 version for Linux 32 bit. And guess what? The behavior is different! It is not perfect (as it used to be in the good old WLS9 days) but Oracle does have a sense of humor apparently. When running 3 managed servers I observe m1,m1,m1,m2,m2,m2,m3,m3,m3,m1,m1,m1,m2… Interesting, hmm? So it load balances but addresses one server n times in a row. Now here is the good surprise: After killing e.g. m2, I see the following pattern: m1,m1,m1,m3,m3,m3,m1,m1,m1,m3… So unlike the previous 10.x versions there is still load balancing, but the plugin keeps assigning a particular server more than once.
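The "n times in a row" pattern is easier to see when the observed sequence is collapsed into runs. Here is a minimal sketch using `itertools.groupby` from the Python standard library. The helper name is my own, and the sequences are the ones quoted above, cut off where my observation ends.

```python
from itertools import groupby

def run_lengths(observed):
    """Collapse consecutive hits on the same server into (server, count) pairs."""
    return [(server, len(list(group))) for server, group in groupby(observed)]

# Sequences observed with the 10.3.5 plugin (taken from the text above).
three_up = "m1,m1,m1,m2,m2,m2,m3,m3,m3,m1,m1,m1,m2".split(",")
m2_killed = "m1,m1,m1,m3,m3,m3,m1,m1,m1,m3".split(",")

print(run_lengths(three_up))
print(run_lengths(m2_killed))
```

Every complete run has length 3 in both cases (the last run is simply truncated), and after killing m2 the runs still alternate between m1 and m3. So the plugin keeps balancing, just in blocks of three requests per server.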