SteveJ-on-IT: Datacentre Hardware organisation

Senior Google staffers wrote The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, which I thought showed break-through thinking.

The importance of this piece is it wasn't theoretical, but a report of What Works in practice, particularly at 'scale'.

Anything that Google, one of the star performers of the Internet Revolution, does differently is worthy of close examination. What do they know that the rest of us don't get?

While the book is an extra-ordinary blueprint, I couldn't but help asking a few questions:

Why do they stick with generic-design 1RU servers when they buy enough for custom designs?
How could 19-inch racks, designed for mechanical telephone exchanges a century ago, still be a good, let alone best, packaging choice when you build Wharehouse sized datacentres?
Telecommunications sites use DC power and batteries. Why take AC, convert to DC, back to AC, distribute AC to every server with inefficient, over-dimensioned power-supplies?

Part of the management problem with datacentres is minimising input costs whilst maximising 'performance' (throughput and latency).

There are three major costs, in my naive view:

real-estate and construction costs
power costs - direct and ancillary/support (especially HVAC).
server and related hardware costs

Software, Licensing and "Operation, Administration and Maintenance" (OAM) costs may also be 'material' in the Accounting sense, I don't have that information or sources describing them.
[HVAC: Heating, Ventilation, Air Conditioning]

The usual figure-of-merit for Datacentre Power Costs is "Power Use Effectiveness"(PUE), the proportion of power that actually ends up being consumed by the Data Processing (DP) systems.
Google and specialist hosting firms get very close to the Green IT "Holy Grail" of a PUE of One. Most commercial small-medium Datacentres have PUE's of 3-5, according to web sources.

IBM Zurich, in partnership with ETH has done work water-cooling servers that beats a PUE of one...
2007, 2008 and 2009 press releases on their Zero Emission Datacentre.

Their critical insight is that the cooling fluid can start hot, so the waste ('rejected') heat can be used elsewhere. Minimally for hot water, maybe heating buildings in cold climates. This approach depends on nearby consumers for low-grade heat, either residential or commercial/manufacturing demands.

Water is 3,500 times more dense than air. It's much more efficient to use water rather than air as the working fluid in heat exchangers. The cost (plus noise) of moving tons of air around, both in capital costs, space and operation/maintenance is very high. Fans are a major contributor to wasted energy, consuming around 30% of total input power (according to web sources).

But in all that, where's the measure of:

DP Power productively applied (% effectiveness, Η (capital Eta)), and
DP Power used vs needed to service demand (% efficiency, η).

%-Effective, Η, is about power-use of hardware unavailable for production, poor architecture, poor OAM practices or duplicated-but-not-contributing 'mirrors' as in traditional active/passive redundancy. Network Appliance (rightly) make a lot of their active/active configuration.

Pay for two, use two not one. A better than 50% reduction is capital and operational costs right there, because these things scale super-linearly.

%-Efficiency, η, is the proportion of production DP power on-line used to serve the demand. Not dissimilar to "% CPU" for a single system, but needs to include power for storage and networking components, as well as CPU+RAM.

One of the critical deficiencies of PUE is it doesn't account for power-provision losses.
Of each 10MW that's charged at the sub-station, what proportion arrives at the DP-element connector?
[Motherboard, Disk Drive, switch, ...] There are many other issues with Datacentre design, here's a 2010 Top Ten.

When is a Wharehouse not a Wharehouse?

Google coined the term, Warehouse Scale Computing, so why aren't modern wharehousing techniques being applied, especially in "Greenfield" sites?

Why would anyone use raised flooring (vs solid or concrete) in a new datacentre?
One of the unwanted side-effects of raised floors is "zinc whiskers". Over time, the soft metal edging on the tiles suffers tiny growths, 'whiskers'. These get dislodged and when they end up in power supplies, create serious problems.
Raised floors make very good sense for small server rooms in office buildings, not wharehouse sized Datacentres.

Automated wharehousing technologies with tall racking and robotic "pick and place", with small aisles, seem to be quite applicable to Datacentres. Or at least trundling large sub-assemblies around with fork-lifts. These sub-assemblies don't need to be bolted in, saving further install and remove/refit time.

Which goes to the question of racking systems.
Servers are built to an archaic standard - both in width and height.
Modern industrial and commercial racking systems are very different - and because they are in mass-production, available immediately.

Which doesn't exactly go against the trend to use shipping containers to house DP elements, but suggests that real Wharehouses might not choose container-based solutions.

As the cost of Power, and even its availability, changes with economic and political decisions, it's efficient and effective management and use in Datacentres will increase in importance.

Can a Datacentre store power at off-peak times and, like other very large power consumers, go off-grid, or at least reduce power-use in these days of the "smart grid"?
Could those "standby" generators be economically used during times of excess power demand? It creates a way to turn a dead investment into an income producing asset.

A useful result of treating a Wharehouse Datacentre as a true wharehouse is that modulo the additional power and cooling, they are conventional wharehouses. Allowing either existing buildings to be bought and used with minimal conversion, or old/surplus Datacentre buildings to be sold and used "as-is".

Datacentre Cooling

Whilst "Free Cooling" (using environmental air directly, without any cooling plant) is a very useful and under-utilised Datacentre technique applicable to all scales of installation, it's not the only technique available.

One of the HVAC technologies that's common in big shopping malls, but never mentioned in relation to Datacentres is "Thermal Energy Storage", or Off-peak Ice production and Storage.

Overnight, when power is cheap and conditions favour it, you make ice.
During your peak demand period (afternoon), your A/C plant mainly uses the stored ice.
This has the dual benefit of using cheap power and of significantly reducing the size cooling plant required.
In hot climates requiring very large ice storage, they don't even have to insulate the tanks. The ratio of volume to surface area for large tanks means the losses for daily use are small.

"Hot Aisle" and related techniques are well known and in common use. Their use is assumed here.

Datacentre Power Provisioning

Reticulating 110-240V AC to each server requires multiple levels of inefficient redundancy:

UPS's are run in N+1 configuration, to allow a single failure. Additional units are installed, but off-line, to cover failures and increased demand.
Ditto for backup Generators.
Every critical DP-element (not all servers are 'critical', but all routers and Storage Arrays are), requires redundant power-supplies to cater for either PSU or supply failure. You need to draw a line once between what's protected and what isn't, a major gamble or risk. Retooling PSU's in installed servers isn't economic, and mostly impossible.

That's three levels of N+1 redundancy, plus UPS's and generators that must be regularly live-load tested, which in itself is hugely problematic.

Plus you've got a complex AC power distribution system, with at least dual-feeds to each rack, to run, test and maintain. Especially interesting if you need to upgrade capacity. I've never seen a large installation come back without significant problems after major AC power maintenance.

The Telecommunications standard is dual 48V DC supplies to each rack and a number of battery rooms sized for extended outages, with backup generators to charge the batteries.
Low voltage DC has problems, significantly the power-loss (voltage drop) in long bus-bars. Minimising power-losses without incurring huge copper conductor costs is an optimisation challenge.

Lead-acid batteries can deliver twice the power for a little under half the time (they 'derate' at higher loads), so maintenance activities are simple affairs with little installed excess capacity (implied over-capitalisation waste) because a single battery bank can easily supply multiples of it's charging current.

There are three wins from delivering low-volatage DC to servers:

Power feeds can be joined with simple diodes and used to feed redundant PSU's. Cheap, simple and very reliable. Operational cost is a 0.7V or 1.4V supply-drop through the diode, so higher voltages are more efficient.
Internal fans are not needed, less heat is dissipated (according to web sources). I'm unsure of the efficiency of DC-DC converters at well below optimal load, if you over-specify PSU's.
This suggests sharing redundant PSU's between multiple servers and matching loads to PSUs.
the inherent "headroom" of batteries means high start-up currents, or short-term "excessive" power demand are easily covered.

The biggest win for a DC Datacentre supply is the integrated power control we see in every laptop:
the battery, PSU and HVAC systems can dynamically interact with the DP-elements to either minimise power use (and implied heat production) or maximise power to productive DP-elements.

The 3 dimensional optimisation problem of Input Power, Load-based Resource Control and Cooling Capacity can be properly solved. That's got to be worth a lot...

The "standard" 1RU Server

The problems I've got with typical 1RU servers (and by extension, with the usual 'blades') are:

There are 6 different types of heat load/producers in a server. Providing a single airflow for cooling means over-servicing lower-power devices and critical placement modelling/design:

CPU, ~100W per unit. Actively cooled.
RAM, 1-10W per unit. Some need heatsinks
Hard Disks, 2-20W. Some active cooling required.
motherboard 'glue' chips, low-power, passive cooling, no heat sinks.
GPU's, network and PCI cards. Some active cooling required.
PSU. Own active cooling, trickier with isolation cage.

1.75", the Rack Unit, is either too tall or too short for everything.
It's not close to optimal packing density.
Neither is the "standard height rack" of 42RU optimised for large, single-floor Wharehouse buildings. It's also too tall for standard shipping containers with raised flooring.
The fans and heat-sinks needed to cool high-performance CPU's within the 1.7" available are tricky to design and difficult to provide redundantly. Blowers are less efficient at the higher air speed caused by the restricted form-factor. Maintenance and wear is increased as well.
1Gbps ethernet has been around for a decade or more and is far from the fastest commodity networking option available. SAS cards are cheap and available, using standard SATA chips running at 3 or 6Gbps down each of 4 'lanes' with the common SFF 8470 external connector. At $50 extra for 12-24Gbps, why aren't these in common use? At least until 10Gbps copper Ethernet becomes affordable for rack-scale switching.
HDDs, (disks) are a very bad fit for the 1RU form factor, both for space-efficiency and cooling.
3.5" drives are 1" thick. They need to lie flat, unstacked, but then take considerable internal real-estate, especially if more drive bays are allocated than used.
2.5" drives still have to lie flat, but at 9.5-12.5mm thick, 2-3 can be stacked.
3.5" drives running at 10-15,000 RPM consume ~20W. They need good airflow or will fail.
Low-power 2.5" drives (2-5W) resolve some of those issues, but need to be away from the hot CPU exhaust.

One technology solution that starts to address the "blade cooling problem" is the sealed immersed-board liquid-coolant system from the UK startup, "Icetope", also in this referring article.

Sealing a motherboard puts all the liquid-tolerant components in the same environment, providing you've chosen your "liquid" well.
The IBM "ZED" approach above is directed at just the major heat load, the CPU's and uses plain water.
IBM is taking the stance that simple coolant loops, air-cooling of low-power devices and simple field-maintenance are important factors.

Which approach is better for Wharehouse-scale computing? Real-life production data is needed.
As both are new, we have to wait 3-5 years.

The obvious component separations are:

PSU's, separated from the motherboard and shared between multiple motherboards. PSU capacity can be very closely matched to their optimal load and cooling capacity and all DP-elements can benefit from full-path redundancy.
Disk drives can be housed in space- and cooling-optimised drawers.

There are already commercial solutions that pack 42 * 3.5" drives into 4RU drawers, either SATA or SAS. It's cheap and simple to provide a bulk connector, then run cabling to adjacent motherboards.

This shows my antipathy to the usual commercial architecture of large, shared storage-arrays with expensive and fragile Fibre Channel SAN's. They are expensive to buy and run, suffer multiple reliability and performance limitations, can't be optimised for any one service and can't be selectively powered down.
Google chooses not to use them, a very big hint for dynamic, scalable loads.

What could Google Do?

Very specifically, I'm suggesting Wharehouse-scale Datacentres like Google's with dynamic, scalable loads could achieve significant PUE and η gains (especially in a Greenfield site) by:

purpose designed tall racking, possibly robot-only aisles/access
DC power distribution
shared in-rack redundant DC-DC PSU's
liquid-cooled motherboards
packed-drawer disks, and
maybe non-ethernet interconnection fabric.

I suspect the total power demand for a notionally 20MW site might be reduced 25-50%.

Some other links to 'The Hot Aisle" blog by Steve O'Donnell:

SteveJ-on-IT

2010/08/29

Datacentre Hardware organisation

No comments: