The Beginning of the Real Mess in the Clouds

We made a seed investment at the end of last year in a company called Standing Cloud.  They are working on a set of application development and deployment tools for cloud computing that build on some of the ideas in the posts What Happened To The 4GL? and Clouditude.  They’ve spent the first quarter of 2009 testing a series of hypotheses we had about issues with cloud computing, as well as building a first release of their cloud deployment tool.

Things in cloud computing land are much worse than our hypotheses, which were already suitably cloudy.  Following is a simple example of a real issue that even a non-techie will be able to understand.

When starting cloud server instances via an API, it’s important to realize that the instance may become "available" from the perspective of the cloud provider system before all of its standard services are running.  In particular, if your automated process will connect to the instance via ssh or http, it will be at least a few seconds after the instance appears before the applicable service is available. This problem generally does not arise if you start servers manually, because by the time you see that it is running, copy the IP address or domain name, and type the command or browser URL to connect, the services are usually ready.  Possible solutions include:

  • Wait a safe, predetermined amount of time.  This is the simplest, but obviously is not robust.
  • Attempt to open a socket on the applicable port (e.g., 22 or 80), and do so in a loop, with a brief sleep between attempts.  These attempts will fail until the service starts, and you should have the loop time out after some period in case there is a deeper problem with the instance.
  • In a similar loop, directly attempt to connect to the service.  Depending on the SSH API you are using, this may have additional overhead or abstraction that is better avoided, but it is robust and likely to work.
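
The second option above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Standing Cloud's actual code: the function name `wait_for_port` and the default timeout and sleep values are made up for the example, and you would tune them for your provider.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    The name and the default values are assumptions made for this sketch.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Connection attempts raise OSError until the service is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Brief sleep between attempts, as described in the list above.
            time.sleep(interval)
    # Give up: there may be a deeper problem with the instance.
    return False
```

A caller would do something like `wait_for_port(instance_ip, 22)` right after the provider reports the instance as available, and treat a `False` result as a failed launch.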

A related, but more insidious issue is that the sshd authentication services are not necessarily available as soon as sshd is available on the port.  Thus it is possible to connect to the service and have authentication fail, even though everything should be in order. A sufficient wait period or a loop is once again the solution.  However, if the loop wait period is not sufficient, you may attempt too many failed authentications and lock yourself out of the system.  Thus this approach has no fully robust solution aside from an empirically safe wait period either prior to authentication or in the loop wait.
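
A rough sketch of that trade-off, again in Python. Everything here is hypothetical: `try_auth` stands in for whatever single login attempt your SSH library provides, and the attempt limit and wait times are placeholders that you would have to determine empirically and keep below the server's lockout threshold.

```python
import time


def authenticate_with_retry(try_auth, max_attempts=3, initial_wait=10.0,
                            retry_wait=15.0, sleep=time.sleep):
    """Call try_auth() at most max_attempts times.

    Returns the number of the attempt that succeeded, or 0 if none did.
    try_auth is a caller-supplied function (an assumption of this sketch)
    that makes one authentication attempt and returns True or False.
    """
    # Empirically safe wait before the first attempt.
    sleep(initial_wait)
    for attempt in range(1, max_attempts + 1):
        if try_auth():
            return attempt
        if attempt < max_attempts:
            # Wait long enough that the next failure is unlikely; every
            # failed attempt counts toward a possible lockout.
            sleep(retry_wait)
    return 0
```

Capping the number of attempts does not make the approach robust. It only ensures that a wait period that turns out to be too short costs a failed launch rather than a locked-out account.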

Clearly these problems are tricky to diagnose if you are not aware of their idiosyncrasies.

A more robust but also more complex overall solution would be to incorporate a service on-board the instance that starts up at boot and checks the status of sshd from the inside, then makes an otherwise unused port available when the system is fully ready for connection and authentication. Of course, this requires that you can boot from a pre-configured image rather than a stock image, and it also requires that another port be open on the firewall.
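
As a minimal sketch of that idea, the on-board service could look something like the Python below. The function names and the readiness port are assumptions for illustration. The check used here only confirms that sshd answers locally with its identification string (which begins with "SSH-"); it does not prove that authentication is ready, so a real service would need a stronger check.

```python
import socket
import time


def sshd_banner_ready(host="127.0.0.1", port=22, timeout=2.0):
    """True if the SSH port answers with a banner starting with 'SSH-'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            return conn.recv(64).startswith(b"SSH-")
    except OSError:
        # Not listening yet, or no banner within the timeout.
        return False


def announce_when_ready(ready_port, check=sshd_banner_ready, interval=1.0,
                        max_checks=None):
    """Poll check(); once it passes, open a listener on ready_port.

    Returns the listening socket (the caller must keep it open), or None
    if max_checks failed checks occur first. ready_port is a hypothetical
    otherwise unused port that must also be open on the firewall.
    """
    checks = 0
    while not check():
        checks += 1
        if max_checks is not None and checks >= max_checks:
            return None
        time.sleep(interval)
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", ready_port))
    listener.listen(5)
    return listener
```

The deployment tool then polls the readiness port rather than port 22, so it never makes an authentication attempt before the instance has declared itself ready.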

The set of things like this is endless.  In addition, each cloud computing environment has different nuances and each stack configuration adds more complexity to the mix.  There are so many fun puns that apply that I get giddy just thinking about all the storms ahead and the need for umbrellas.

  • Sateesh Narahari

    Not sure if this problem is specific to cloud services. The problem exists even in provisioning scenarios within a data center. If you provision, let's say, a new instance of WebSphere server, there are dependent services that need to start (say you start with bare-metal provisioning, then provision the OS, the necessary services such as a DB, and then the application server); each component needs to wait for its dependent services to be available.

    Manually starting services is not the best approach, even in data centers. Even administrators who start services manually use either batch scripts or some other form of automation. Manually copying IP addresses and the like is error-prone.

  • Maybe you have already seen this.
    The interview is pretty good. This company really uses the cloud. I would bet he has your problem solved.
    Listen to the end of the interview. The question about the future of cloud computing should really interest you.

  • David Campbell

    Brad, the problems you highlight here are pretty easy to mitigate.

    See for a taste of the more interesting challenges that the security community is presently tackling with respect to cloud computing. Formidable issues, but to summarize Dan, the technology has so many benefits vis a vis scalability and performance that we'll overcome the obstacles in short order.



    • Yes, but this is merely one simple example. In 90 days we have come up with what is rapidly approaching what feels like an infinite set of issues like this. Unless you constrain the parameters of the cloud environment significantly (not interesting), you end up with a remarkably long class of problems.

  • Dave

    I'd add to Brad's point that most such things are easy to mitigate once you diagnose them, but unfortunately there is not (for example) an error message that says "ssh daemon has not completed startup." The idiosyncrasies are rampant, and furthermore each cloud service seems to have different such issues. Finally, although system administrators and security experts would find these issues straightforward, dealing with such things is anathema to the application developer.

  • greg davoll

    Not that I find the problem to be overly trivial, but isn’t this being blown a bit out of proportion? I mean, haven’t we (the software industry) solved similar issues on the mainframe, midrange, Unix, and Windows for decades? Sounds like you're seeing the sausage made on this one…

    • I think the difference in cloud (vs mainframe, etc) is that we are now assuming sysadmins are not needed to help provision or manage operating systems instances.
      Everything (in theory! 🙂) will be controlled by APIs that app developers will call.

      So, in that case, how does one abstract some of the sysadmin minutiae and expose it to the application developer? I think that may be one of the questions that needs to be answered if we are really trying to get to a sysadmin-less environment where developers can control the entire application and infrastructure stack.

  • The problems described in this post loosely mirror those in, for example, a slightly heterogeneous Linux server farm where you have different revs of RedHat with slightly different system libraries and slightly different SATA and network controllers on the Dell boxes. We're at firmware version 2.01B instead of 2.01A on the hard drives…

    In fact, you can be suffering from the above explicitly, plus vagaries in the cloud-level services that show up in similar ways.

    So, we could say that we've moved one step backward since across the total solution stack someone (like the cloud provider) is dealing with all of the same problems as before, plus new ones.

    The promise, though, is that cloud software will perhaps become so much more advanced that you end up with fewer problems than you had without cloud computing. Some of this might be opsware-type solutions (as referenced in the latter article). On the macro level it will push people towards better real-time tracking of system faults and corresponding solutions that get rolled back into standard setups. These are things we could be doing already, but often fall short of.

    Specifically in regards to the SSH issue, hot starts should just work.

    I would model the different issues with hot starts the way the database crowd, specifically Oracle (just for a standard), model ACID transactions. For instance, the system you describe sounds like it is supporting a 'dirty' or 'phantom' hot start capability.

    The is-service-up question sounds a lot like asking 'did this transaction commit successfully?'

    There are going to be a lot of strategies for optimizing the spawning and running of services. For instance, why not try starting 3 servers across a grid and use the one that is up first?

    Advocating for that kind of technology will result in a better solution, but will also create a market environment where there will be higher demand for management applications.
