VMware's Cloud Foundry Experiences Early Hiccups

by Sam Dean - May. 03, 2011

Many new cloud computing platforms--including OpenStack, which is backed by heavy-hitting tech titans--have been beating the war drums for the last couple of years, but VMware's Cloud Foundry, which is billed as allowing cloud apps to be deployed and scaled in seconds, has been commanding the most attention lately. The recently launched effort differs from VMware's usual playbook in a number of ways, including its focus on open source, but it also has the potential to leverage what VMware does especially well: virtualization. Right out of the gate, though, Cloud Foundry has been plagued by downtime incidents. VMware has been admirably transparent about them, but the outages--along with Amazon's recent cloud services hiccups--raise questions about the dependability and business continuity of cloud platforms.

VMware's Dekel Tankel, in a blog post, has acknowledged that Cloud Foundry had downtime problems on both April 25th and 26th. He writes, regarding the first outage:

"The root cause of the failure has been identified. It was a partial outage of a power supply in a storage cabinet, which impacted access to a single LUN. While not a “normal event”, it is something that can and will happen from time to time, and it is the reason that highly available systems are always built at both the hardware and software layers. In this case, our software, our monitoring systems, and our operational practices were not in synch. The net impact of these three events was that the Cloud Controller saw a partial loss of connectivity to a single LUN. This is an event that we did not properly handle and the net result is that the Cloud Controller declared a loss of connectivity to a piece of storage that it needs in order to process many control operations."

Regarding the second outage, he writes:

"Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

Keeping in mind that Cloud Foundry, along with many other cloud-focused platforms, is intended to host business applications, it's hard to miss from these reports how tiny errors cascade into large problems. Amazon recently had to issue an apology to its Amazon Web Services users regarding similarly cascading downtime problems.

It would be wrong to conclude that commercially available cloud stacks are not dependable for hosting business applications. After all, every computing platform experiences downtime. However, today's cloud services are a patchwork quilt of separate tools and supporting hardware platforms that are susceptible to cascading errors. Furthermore, VMware fully acknowledges that user error caused its second outage.

PCMag's John Dvorak, who often decries the error rates of cloud platforms, gets it right in a recent column:

"I expect to see more cloud failures in the future, but with each one, something new will be learned and a few companies will emerge with great services. But when people expect me to complain about the cloud in this instance, they are expecting too much. One of my main complaints about the cloud is not the fact that it can crash like anything else. It's the fact that the user is at the mercy of the company providing the services."

As Dvorak notes, the key problem these cloud downtime incidents raise is that users who host applications in the cloud give up control to the service provider. Mirroring and fault tolerance are areas where more work needs to be done before the various cloud services players will be taken more seriously. Cloud Foundry has the potential to make open source tools a big part of the mix in hosting cloud applications, but VMware--despite its admirable level of transparency--will need to find ways to prevent cascade effects from the kinds of simple problems its cloud platform has recently experienced.
