So you're just breaking into the infrastructure world and now you're discovering the true meaning of an SLA. Your new client is telling you that they need "four nines reliability". Well great, but what does that mean exactly?
First, get over to this website and determine just how long your website can afford to be down. In this case (99.99% uptime) you're allowed a little under 5 minutes a month, about 4.3 minutes over a 30-day month, before you're in trouble.
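If you'd rather check the math yourself, the downtime budget is simple arithmetic. Here's a minimal sketch; the function name `downtime_budget_minutes` and the 30-day month are my own assumptions for illustration, not anything your SLA defines for you.

```python
def downtime_budget_minutes(availability_percent, period_days=30):
    """Allowed downtime, in minutes, for an availability target over a period.

    Assumes a flat period of `period_days` days (30 by default); a real SLA
    may measure per calendar month, per quarter, or per year.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare the usual "nines" side by side.
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime -> {budget:.2f} minutes of downtime per 30 days")
```

Check how your own SLA defines the measurement window and whether planned maintenance counts against it, because both change the number.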
So, great, you tell yourself. You can do this: with 2 WFEs and 2 application servers in the back end, plus a SQL cluster node, the odds of a failure knocking everything out are about a zillion to one, right? (Note: not actual numbers.)
Whoa...hold on there newb, you are not wise in the ways of SLA. Remember a few points.
- Virtual machines have made it spectacularly easy for one piece of hardware to take out multiple machines at once. Make damn sure your virtualized machines have some sort of backup capability in and of themselves. VMware typically excels at this, and is my favorite tool for virtualizing environments, period...be wise in the way of the VMware Server...
- While the odds of a dual server failure are slim, ask yourself: what are the odds of a total site failure? Think earthquakes, tornadoes, floods, etc. Evaluate the environment your datacenter is in and start thinking about how fast you can recover from a site failure. I'll give you another hint: VMware rocks here as well...but there are still some manual steps occasionally involved that may take you longer than 5 minutes...formulate your plans.
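To see why shared hardware and site failures matter so much, here's a rough sketch of the usual availability arithmetic: redundant servers multiply their failure odds together, while everything they all depend on multiplies straight into the total. Every number below is made up for illustration, the SQL cluster is treated as two nodes (my assumption), and the redundancy formula assumes the servers fail independently, which is exactly what a shared VM host takes away from you.

```python
from math import prod


def parallel(*availabilities):
    """Redundant components: the tier is down only if every member is down.

    Assumes failures are independent of each other.
    """
    return 1 - prod(1 - a for a in availabilities)


def series(*availabilities):
    """Chained dependencies: the system is up only if every component is up."""
    return prod(availabilities)


# Hypothetical availabilities, NOT measured figures.
SERVER = 0.99      # one WFE, app server, or SQL node on its own
VM_HOST = 0.999    # a single physical host that all the VMs share
SITE = 0.9995      # the datacenter itself (power, network, disasters)

wfe_tier = parallel(SERVER, SERVER)
app_tier = parallel(SERVER, SERVER)
sql_tier = parallel(SERVER, SERVER)

# The tiers depend on each other, so they chain in series.
stack = series(wfe_tier, app_tier, sql_tier)

# Shared dependencies apply to the whole stack at once.
with_shared = series(stack, VM_HOST, SITE)

if __name__ == "__main__":
    print(f"one redundant tier:         {wfe_tier:.6f}")
    print(f"three tiers chained:        {stack:.6f}")
    print(f"plus shared host and site:  {with_shared:.6f}")
```

The redundant pairs look great in isolation, but the single shared host and the single site pull the whole stack back below four nines. Plug in your own numbers before you promise anything.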
So don't just say four nines reliability, MEAN it. Critical data is at stake here...