The Beginning of the Real Mess in the Clouds

Mar 14, 2009

We made a seed investment at the end of last year in a company called Standing Cloud. They are working on a set of application development and deployment tools for cloud computing that build on some of the ideas in the posts What Happened To The 4GL? and Clouditude. They’ve spent the first quarter of 2009 testing out a series of hypotheses we had concerning issues with Cloud Computing as well as building an first release of their cloud deployment tool.

Things in cloud computing land are much worse than our hypotheses, which were already suitably cloudy. Following is a simple example of a real issue that even a non-techie will be able to understand.

When starting cloud server instances via an API, it’s important to realize that the instance may become "available" from the perspective of the cloud provider system before all of its standard services are running. In particular, if your automated process will connect to the instance via ssh or http, it will be at least a few seconds after the instance appears before the applicable service is available. This problem generally does not arise if you start servers manually, because by the time you see that it is running, copy the IP address or domain name, and type the command or browser URL to connect, the services are usually ready. Possible solutions include:
Wait a safe, predetermined amount of time. This is the simplest, but obviously is not robust.
Attempt to open a socket on the applicable port (e.g., 22 or 80), and do so in a loop, with a brief sleep between attempts. These attempts will fail until the service starts, and you should have the loop time out after some period in case there is a deeper problem with the instance.
In a similar loop, directly attempt to connect to the service. Depending on the SSH API you are using, this may have additional overhead or abstraction that is better avoided, but it is robust and likely to work.
A related, but more insidious issue is that the sshd authentication services are not necessarily available as soon as sshd is available on the port. Thus it is possible to connect to the service and have authentication fail, even though everything should be in order. A sufficient wait period or a loop is once again the solution. However, if the loop wait period is not sufficient, you may attempt too many failed authentications and lock yourself out of the system. Thus this approach has no fully robust solution aside from an empirically safe wait period either prior to authentication or in the loop wait.
Clearly these problems are tricky to diagnose if you are not aware of their idiosyncrasies.
A more robust but also more complex overall solution would be to incorporate a service on-board the instance that starts up at boot and checks the status of sshd from the inside, then makes an otherwise unused port available when the system is fully ready for connection and authentication. Of course, this requires that you can boot from a pre-configured image rather than a stock image, and it also requires that another port be open on the firewall.

The set of things like this is endless. In addition, each cloud computing environment has different nuances and each stack configuration adds more complexity to the mix. There are so many fun puns I that apply that I get giddy just thinking about all the storms ahead and the need for umbrellas.