Of Bridges, Towers, Subways and Web Servers

Picture it:
It’s a perfectly fine morning, and you’re on your way to work. You’ve beaten the odds in your daily challenge of toddler management and gotten out of the house in under the estimated 40 minutes, and now you’re heading to the subway. You get to the station, only to find out that “due to necessary maintenance work, the train will be rerouted through Queens….” Anyone who has gone through this and hasn’t cursed under their breath must have the patience of a saint.

More on subways… For the last few weeks the escalator at the subway stop nearest my daughter’s preschool has been out for maintenance work. This must be one of the deeper underground subway stops in New York City, because without that escalator there are 50 steps (yes, we’ve counted them) to climb before we can even get to the turnstile. Then there are another 27 before we get to the street. I should be grateful for the extra exercise, but with a space-cadet three-year-old who likes to twirl and jump around on the steps, my trip up those extra 50 can sometimes be a harrowing experience.

Then I started reading “To Engineer Is Human: The Role of Failure in Successful Design” by Henry Petroski, and it gave me a different perspective on those things that seem like the annoying hassles of everyday life. The book discusses the tragedies and near-tragedies of bridges, towers, and airplanes that have been poorly designed or maintained. Sometimes the failure is caused by designers trying too hard to create the most lightweight and graceful structure; the Tacoma Narrows Bridge is a perfect example, and one that probably anyone who’s taken a high school physics course is familiar with.

Sometimes the failure occurs because the engineers and construction crew decide that elements of the original design are too difficult to build to exact specifications and take shortcuts, as in the Kansas City Hyatt Regency Hotel skywalk collapse in 1981.

In other cases the failure is due to crucial elements in the structure malfunctioning under extreme weather conditions; the horrible disaster of the space shuttle Challenger in 1986 is an unfortunate example.

Structural failure can also come from fatigue: a crucial element of the tower, bridge, plane, subway car or escalator wears down over time until it can no longer perform to its specifications. But these accidents can be prevented by routine maintenance, meaning close inspection of all critical elements to ensure that they can still perform properly. The maintenance work being done on the train we’re taking or the escalator we wanted to ride helps ensure that we will be able to continue to ride safely and comfortably in the future. That’s not easy to accept when your commute is doubled, but better safe than sorry.

So, how does this relate to web applications and web servers? It’s definitely possible for programmers, designers, and quality-control staff (and yes, even project managers) to take shortcuts, which can sometimes result in the failure of an application or website. Taking on a risky project without being fully prepared for those risks can also result in failure, as can subjecting your system to conditions it wasn’t designed for. Of course, failure in web servers and web applications is nowhere near as tragic as the failure of a bridge or the space shuttle (although with computers controlling so much of our lives these days, the calculations they make can sometimes have a direct impact on structural success or failure).

What about fatigue? Do websites and web servers suffer from fatigue in the same way a bridge, plane or tower can? Well, maybe not in exactly the same way. It seems unlikely that a web server will have a crack or a tear in its outer casing that causes the system to fail (although loose connections between cables, network cards, etc. can definitely cause instability). Still, web servers need to be rebooted, patched and monitored to ensure that all critical systems are operating properly. Web applications need to be scaled or redesigned when the demands on the application grow beyond what was originally planned for. Can this be a hassle for the people who use those systems? Of course! Nobody likes their website going down for system maintenance, their e-mail being unavailable, etc. We rely so much on the internet that once it’s taken away our productivity comes to a standstill. But this is a necessary fact of life when using technology, and we should all remind ourselves that it’s better for proactive maintenance to be done and potential issues resolved early, before they become bigger problems.

The book doesn’t only talk about failures. It also details amazing successes of structural design, and discusses how many engineering successes come from learning from the mistakes of the past. We need to follow this same practice in the software world: accept and learn from the failures of the past, use them to build our new successes, and make sure that the health of our websites and servers is among our top priorities.

 

Including Uncertainty in Estimates of Software Projects, Fort Building, and Anything Involving a Toddler

Early in a project, many of the specifics are still very unclear: the nature of the software being built, the detailed requirements, the project plan and the staffing. Because there are so many variables early on, it is crucial to include a large degree of uncertainty or variability in the project estimate. This is not about being purposely misleading or avoiding commitment to an exact number with your stakeholders; it is about accepting the reality that software projects leave so much to be defined early on. To commit to an exact number at the very beginning would be misleading yourself and your stakeholders, presenting a false sense of confidence in something that still has so much yet to be defined.

Steve McConnell, CEO and Chief Software Engineer at Construx Software, presents the idea of a “Cone of Uncertainty” in his book “Software Estimation: Demystifying the Black Art” (2006).

 


In McConnell’s Cone of Uncertainty diagram, the horizontal axis shows significant project milestones. The vertical axis shows the degree of error that has been found in estimates created by skilled estimators at various points in the project. What is obvious from the diagram is that estimates created early on in the project are subject to a high degree of error: they can be off by as much as a factor of four in either direction (from 0.25x to 4x). As the details of the project become defined and understood, the cone narrows. Obviously the most accurate estimate is made at the very end of the project, but the challenge in the software world is to find a point somewhere in between, where we know enough about the project to make the best estimate possible while still allowing major stakeholders to plan financially. More about the Cone of Uncertainty, and other estimation resources, can be found here:
http://www.construx.com/Page.aspx?hid=1648
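To make the cone a little more concrete, here is a minimal sketch of how a single-point estimate could be turned into a range at each milestone. Only the 0.25x to 4x row comes from the description above; the milestone names and the narrower multipliers are the figures commonly quoted for the cone and should be treated as assumptions, as should the function name `estimate_range` and the 20-week example.

```python
# Cone of Uncertainty sketch: turn a single-point estimate into a range.
# Each row is (milestone, low multiplier, high multiplier).
# Only the first row (0.25x to 4x) is taken from the post; the other rows
# are commonly quoted cone figures and are assumptions here.
CONE = [
    ("Initial concept", 0.25, 4.0),
    ("Approved product definition", 0.5, 2.0),
    ("Requirements complete", 0.67, 1.5),
    ("User interface design complete", 0.8, 1.25),
    ("Detailed design complete", 0.9, 1.1),
]


def estimate_range(point_estimate, milestone):
    """Return (low, high) for a point estimate made at the given milestone."""
    for name, low, high in CONE:
        if name == milestone:
            return (point_estimate * low, point_estimate * high)
    raise ValueError(f"unknown milestone: {milestone}")


if __name__ == "__main__":
    # Hypothetical example: a 20-week point estimate at each milestone.
    for name, _, _ in CONE:
        low, high = estimate_range(20, name)
        print(f"{name}: {low:.1f} to {high:.1f} weeks")
```

With these numbers, a 20-week estimate made at the initial concept stage really means "somewhere between 5 and 80 weeks", which is exactly why committing to one exact number that early gives a false sense of confidence.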

In his book, Steve McConnell explains several different techniques for making software estimates. He also recently wrote a very interesting and entertaining blog post in which he drew parallels between building a fort in his backyard and the problems people run into with software estimates. The general idea, very humorously explained, is that at the beginning of a project it’s easy to assume that everything will go as planned and the work will proceed smoothly and on time, but it’s very common for things to take longer than expected. In his case, it was the little construction project in his backyard.

I haven’t built any forts lately, but I’ve managed many projects at Flightpath, some of which have taken longer than the original estimate for one reason or another. I also see this concept clearly illustrated in my day-to-day life outside of Flightpath. I find that it’s almost impossible to make any type of time or schedule commitment when a toddler is involved. I’m fortunate to be the mother (or project manager) of two little girls, and have the pleasure of bringing the older one to preschool every morning. What should take only 15 minutes can sometimes take up to 40…and this is why:

1. The 2nd and 3rd bowls of cereal (6 mins)
2. Trip to the potty before leaving home, which can sometimes include the mandatory reading of the Dr. Seuss book while waiting for the potty business to complete. (8 mins)
3. Sneakers that get taken off and put back on again, only to get taken off one more time (and of course put back on again) before the final trip out the door. (2 mins)
4. Unexpected meltdown about which jacket to wear, and wanting to wear rain boots and bring an umbrella on a perfectly sunny day. (6 mins)

You catch my drift…
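For anyone who wants to check my math, the morning above can be sketched as a best case and a worst case. This is just the arithmetic from the list, with hypothetical names (`BASELINE_MINUTES`, `DELAYS`, `morning_range`), and it assumes the worst morning is the one where every delay happens.

```python
# Morning drop-off as an estimate with a range.
# The minutes come from the list above; the names are made up for illustration.
BASELINE_MINUTES = 15  # what the trip "should" take

DELAYS = {
    "2nd and 3rd bowls of cereal": 6,
    "potty trip with Dr. Seuss": 8,
    "sneakers off and on again": 2,
    "jacket and rain boots meltdown": 6,
}


def morning_range(baseline, delays):
    """Return (best case, worst case) in minutes.

    Best case: no delays at all. Worst case: every delay happens.
    """
    return baseline, baseline + sum(delays.values())


if __name__ == "__main__":
    best, worst = morning_range(BASELINE_MINUTES, DELAYS)
    print(f"Plan for {best} to {worst} minutes")
```

The four delays add up to 22 minutes, so the 15-minute trip stretches to 37, which is about where my “up to 40” comes from.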

So, I’m learning more about how to properly include uncertainty in my estimates at the appropriate points in a project, both in the projects I manage at Flightpath and in the mini-projects I manage at home every morning. It’s a good thing that our preschool allows us a 30-minute window for morning drop-off!