
2011/11/06

The importance of Design Rules

This started with an aside in "Crypto" by Steven Levy (2000), about the failure of Rivest's first attempt at creating an RSA crypto chip: the design worked perfectly on the simulator, but the fabricated chip didn't work.
[p134] Adleman blames the failure on their overreliance on Carver Mead's publications...
Carver Mead (Caltech) and Lynn Conway (Xerox PARC) revolutionised VLSI design and production around 1980, publishing "Introduction to VLSI Systems" and providing students and academics with access to fabrication lines. This has been widely written about:
e.g. in "The Power of Modularity", a short piece on the birth of the microchip from Longview Institute, and a 2007 Computerworld piece on the importance of Mead and Conway's work.

David A. Patterson wrote of a further, related effect in Scientific American, September 1995, p63, "Microprocessors in 2020":

Every 18 months microprocessors double in speed. Within 25 years, one computer will be as powerful as all those in Silicon Valley today

Most recently, microprocessors have become more powerful, thanks to a change in the design approach.
Following the lead of researchers at universities and laboratories across the U.S., commercial chip designers now take a quantitative approach to computer architecture.
Careful experiments precede hardware development, and engineers use sensible metrics to judge their success.
Computer companies acted in concert to adopt this design strategy during the 1980s, and as a result, the rate of improvement in microprocessor technology has risen from 35 percent a year only a decade ago to its current high of approximately 55 percent a year, or almost 4 percent each month.
Processors are now three times faster than had been predicted in the early 1980s;
it is as if our wish was granted, and we now have machines from the year 2000.
Copyright 1995 Scientific American, Inc.
The important points are:
  • These acts, capturing expert knowledge in formal Design Rules, were intentional and deliberate.
  • These rules weren't an arbitrary collection thrown together; they formed a three-part approach: 1) dimensionless, scalable design rules, 2) the partitioning of tasks, and 3) system integration and testing activities. (A small sketch of point 1 follows this list.)
  • The impact, through a compounding rate effect, has been immense: e.g. via the Moore's Law doubling time, it brought CPU improvements forward by 20 years.
  • The Design Rules have become embedded in software design and simulation tools, allowing new silicon devices to be designed much faster, with more complexity and with orders of magnitude fewer errors and faults.
  • It's a very successful model that's been replicated in other areas of I.T.
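
To make point 1 above concrete, here is a minimal sketch (in Python) of what a dimensionless, scalable design rule looks like in practice. The rule values, feature names and lambda sizes below are illustrative placeholders of mine, not Mead and Conway's actual figures.

    # A minimal sketch of dimensionless ("lambda-based") design rules, in the
    # spirit of Mead & Conway. The rule numbers below are illustrative only.
    LAMBDA_RULES = {
        "min_wire_width": 2,     # minimum width, in multiples of lambda
        "min_wire_spacing": 3,   # minimum spacing, in multiples of lambda
    }

    def check_layout(features, lambda_nm):
        """Check feature dimensions (in nm) against the scalable lambda rules."""
        violations = []
        for rule, rule_lambdas in LAMBDA_RULES.items():
            minimum_nm = rule_lambdas * lambda_nm
            for feature, actual_nm in features.get(rule, []):
                if actual_nm < minimum_nm:
                    violations.append((feature, rule, actual_nm, minimum_nm))
        return violations

    # The same rule set serves a 3 um process and a 0.35 um process:
    # only lambda changes, the design rules themselves do not.
    layout = {"min_wire_width": [("clk_net", 5500.0)],
              "min_wire_spacing": [("clk_net/data_net", 8000.0)]}
    print(check_layout(layout, lambda_nm=3000))  # violations at lambda = 3 um
    print(check_layout(layout, lambda_nm=350))   # clean at lambda = 0.35 um

The point of the sketch is that the expert knowledge is captured once, in scale-free form, and the tooling applies it mechanically for any process generation.
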
So I'm wondering: why don't vendors push this model in other areas?
Does it not work, does it not scale, or is it just not considered 'useful' or 'necessary'?

There are some tools that contain embedded expert knowledge, e.g. for server storage configuration. But they are tightly tied to particular vendors and product families.

Update 13-Nov-2011: What makes/defines a Design Rule (DR)?

Design Rules fall in the middle ground between the "Rules-of-Thumb" of the Art/Craft of Practice and the authoritative, abstract models/equations of Science.

They define the middle ground of Engineering:
 more formal than Rules-of-Thumb, but more general and directly applicable than the theories, models and equations of pure Science, and suitable for creating and costing Engineering designs.

This "The Design Rule for I.T./Computing" approach is modelled after the VLSI technique used for many decades, but is not a slavish derivation of it.

Every well understood field of Engineering has one definitive/authoritative "XXX Engineering Handbook": a publication that covers all the sub-fields/specialities, collects all the formal Knowledge, Equations, Models, Relationships and Techniques, and provides Case Studies, Tutorials, the necessary Tables/Charts and worked examples, plus basic material from ancillary, related or supporting fields.

The object of these "Engineering Handbooks" is that any capable, competent, certified Engineer in the field can rely on their material to solve the problems, projects or designs that come their way: a single reference they can rely upon for their field.

Quantifying specific costs and materials/constraints comes from vendor/product specifications, contracts or price lists. These numbers feed the detailed calculations and pricing, done with the techniques/models/equations given in The Engineering Handbook.
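
As a hedged illustration of how Handbook-style techniques and vendor price-list numbers might combine, a whole-of-life costing sketch in Python follows; the prices, the 8% discount rate and the 20 TB usable capacity are all hypothetical.

    # A sketch of whole-of-life (CapEx + discounted OpEx) costing.
    # All figures below are hypothetical, taken from an imaginary price list.
    def whole_of_life_cost(capex, opex_per_year, years, discount_rate=0.08):
        """CapEx plus discounted OpEx over the service life (net present cost)."""
        npv_opex = sum(opex_per_year / (1 + discount_rate) ** y
                       for y in range(1, years + 1))
        return capex + npv_opex

    # Hypothetical vendor inputs: a storage shelf at $25,000 up front, with
    # $4,000/year for power, support and administration, over a 5 year life.
    total = whole_of_life_cost(capex=25_000, opex_per_year=4_000, years=5)
    usable_gb = 20_000   # 20 TB usable
    print(f"Whole-of-life: ${total:,.0f}  ->  ${total / usable_gb:.2f}/GB")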

A collection of "Design Rules for I.T. and Computing" may serve the same need.

What are the requirements of a DR? (A sketch of a DR as a structured record follows this list.)
  • Explicitly list aspects covered and not covered by the DR:
     e.g. Persistent Data Storage vs Permanent Archival Storage
  • Constraints and Limits of the DR:
    What are the largest, smallest or most complex systems to which it applies?
  • Complete: all Engineering factors named and quantified.
  • Inputs and Outputs: Power, Heat, Air/Water, ...
  • Scalable: How to scale the DR up and down.
  • Accounting costs: Whole of Life, CapEx and Opex models.
  • Environmental Requirements: 
  • Availability and Serviceability:
  • Contamination/Pollution: Production, Supply and Operation.
  • Waste generation and disposal.
  • Consumables, Maintenance, Operation and Administration
  • Training, Staffing, User education.
  • Deployment, Installation/Cutover, Removal/Replacement.
  • Compatibility with systems, components and people.
  • Optimisable in multiple dimensions.  Covers all the aspects traded off in Engineering decisions:
    • Cost: per unit and 'specific metric' (e.g. $/GB),
    • Speed/Performance:  how it's defined, measured, reported and compared.
    • 'Space' (Speed and 'Space' in the sense of the Algorithm time/space trade-off)
    • Size, Weight, and other Physical characteristics
    • 'Quality' (of design and execution, not the simplistic "fault/error rate")
    • Product compliance with specification and repeatability of 'performance' (manufacturing defects, variance, problems, ...)
    • Usability
    • Safety/Security
    • Reliability/Recovery
  • Other factors will be needed to achieve a model/rule that is:
     {Correct, Consistent, Complete, Canonical (i.e. minimal size)}
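
One way to picture the requirements above is as the fields of a structured record. The Python sketch below is only my shorthand for the list: the field names and the example "Persistent Data Storage" values are invented for illustration, not a standard schema.

    # A Design Rule captured as a structured record. The field names are a
    # shorthand for the requirements listed above; the values are invented.
    from dataclasses import dataclass, field

    @dataclass
    class DesignRule:
        name: str
        covers: list            # aspects explicitly in scope
        excludes: list          # aspects explicitly out of scope
        limits: dict            # largest/smallest/most complex applicable system
        inputs_outputs: dict    # power, heat, air/water, ...
        scaling: str            # how to scale the rule up and down
        cost_model: dict        # whole-of-life, CapEx and OpEx
        tradeoff_axes: list = field(default_factory=lambda: [
            "cost", "speed/performance", "space", "size/weight", "quality",
            "compliance", "usability", "safety/security", "reliability/recovery"])

    storage_dr = DesignRule(
        name="Persistent Data Storage",
        covers=["online persistent storage"],
        excludes=["permanent archival storage"],
        limits={"max_usable": "1 PB", "min_usable": "1 TB"},
        inputs_outputs={"power_kW": 4.0, "heat_kW": 4.0},
        scaling="add shelves of identical drives; cost scales near-linearly",
        cost_model={"capex": "$/GB at purchase", "opex": "$/GB per year"},
    )
    print(storage_dr.name, "excludes:", storage_dr.excludes)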

2011/09/14

A new inflection point? Definitive Commodity Server Organisation/Design Rules

Summary:

For the delivery of general-purpose and wide-scale Compute/Internet Services there now seems to be a definitive hardware organisation for servers, typified by the eBay "pod" contract.

For decades there have been well documented "Design Rules" for producing Silicon devices with specific technologies/fabrication techniques. This is an attempt to capture some rules for current server farms. [Update 06-Nov-11: "Design Rules" are important: Patterson, in a Sept. 1995 Scientific American article, notes that the adoption of a quantitative design approach in the 1980s lifted the rate of microprocessor improvement from 35% p.a. to 55% p.a. After a decade, processors were 3 times faster than forecast.]

Commodity Servers have exactly three possible CPU configurations, based on "scale-up" factors (a decision-rule sketch follows the list):
  • single CPU, with no coupling/coherency between App instances. e.g. pure static web-server.
  • dual CPU, with moderate coupling/coherency. e.g. web-servers with dynamic content from local databases. [LAMP-style].
  • multi-CPU, with high coupling/coherency. e.g. "Enterprise" databases with complex queries.
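
Read as a decision rule, the list above collapses to a small function; the coupling classes and example workloads come straight from the list, the Python function itself is just a sketch.

    # The three-way CPU configuration rule as a decision function.
    def cpu_config(coupling):
        """Map an application's coupling/coherency class to a server config."""
        if coupling == "none":       # e.g. pure static web-serving
            return "single CPU"
        if coupling == "moderate":   # e.g. LAMP-style dynamic content, local DB
            return "dual CPU"
        if coupling == "high":       # e.g. 'Enterprise' DB with complex queries
            return "multi-CPU"
        raise ValueError("unknown coupling class: %r" % coupling)

    for workload, coupling in [("static web", "none"),
                               ("LAMP app", "moderate"),
                               ("enterprise DB", "high")]:
        print(workload, "->", cpu_config(coupling))
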
If you're not running your Applications and Databases in Virtual Machines, why not?
[Update 06-Nov-11: Because Oracle insists some feature sets must run on raw hardware. Sometimes vendors won't support your (preferred) VM solution.]

VM products are close to free and offer incontestable Admin and Management advantages, like 'teleportation' or live-migration of running instances and local storage.

There is a special non-VM case: cloned physical servers. This is how I'd run a mid-sized or large web-farm.
This requires careful design, a substantial toolset, competent Admins and a resilient Network design. Layer 4-7 switches are mandatory in this environment.

There are 3 system components of interest:
  • The base Platform: CPU, RAM, motherboard, interfaces, etc
  • Local high-speed persistent storage. i.e. SSD's in a RAID configuration.
  • Large-scale common storage. Network attached storage with filesystem, not block-level, access.
Note that complex, expensive SAN's and their associated disk-arrays are no longer economic. Any speed advantage is dissolved by locally attached SSD's, leaving only complexity, resilience/recovery issues and price.
Consequently, "Fibre Channel over Ethernet", with its inherent contradictions and problems, is unnecessary.

Designing individual service configurations can be broken down into steps (a combined sizing sketch follows the list):
  • select the appropriate CPU config per service component
  • specify the size/performance of local SSD per CPU-type.
  • architect the supporting network(s)
  • specify common network storage elements and rate of storage consumption/growth.
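
A sketch tying the four steps together for a single service. Every sizing factor here (the 1.5x SSD headroom, the network choices, the 12-month storage horizon) is a placeholder of mine, not a recommendation; a real Design Rule would pin each of them down.

    # A sketch of the four sizing steps for one service. All factors are
    # placeholders to show the shape of the calculation.
    def size_service(name, coupling, working_set_gb, growth_gb_per_month,
                     horizon_months=12):
        return {
            "service": name,
            # step 1: CPU configuration per service component (see rule above)
            "cpu_config": {"none": "single CPU", "moderate": "dual CPU",
                           "high": "multi-CPU"}[coupling],
            # step 2: local SSD sized to the hot working set, with headroom
            "local_ssd_gb": round(working_set_gb * 1.5),
            # step 3: supporting network, with L4-7 switching for web tiers
            "network": "dual GigE + L4-7 switch" if coupling == "none"
                       else "dual 10GigE",
            # step 4: common network storage, provisioned ahead of growth
            "shared_storage_gb": working_set_gb + growth_gb_per_month * horizon_months,
        }

    print(size_service("web front-end", coupling="none",
                       working_set_gb=40, growth_gb_per_month=5))
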
Capacity Planning and Performance Analysis are mandatory in this world.

As a professional, you're looking to provide "bang-for-buck" for someone else who's writing the cheques. Over-dimensioning is as much a 'sin' as running out of capacity. Nobody ever got fired for spending just enough, and spending just enough is what maximises profit.
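
A small sketch of that trade-off: given a growth rate and a procurement lead time, when must the next tranche of capacity be ordered? The 60% utilisation, 5% monthly growth and 3-month lead time below are hypothetical.

    # When to order more capacity: late enough to avoid over-dimensioning,
    # early enough that it lands before the current capacity runs out.
    import math

    def months_until_order(utilisation, monthly_growth, lead_time_months):
        """Latest month to place the order, assuming compound load growth."""
        months_to_full = math.log(1.0 / utilisation) / math.log(1.0 + monthly_growth)
        return months_to_full - lead_time_months

    # Hypothetical: 60% utilised today, 5% growth per month, 3 months to
    # procure, install and cut over the next tranche.
    print("Order in %.1f months" % months_until_order(0.60, 0.05, 3))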

Getting it right as often as possible is the central professional engineering problem, followed by limiting the impact of Faults, Failures and Errors, including under-capacity.

The quintessential advantage, for professionals, of developing standard, reproducible designs is the flexibility to respond to unanticipated loads/demands and the speed with which new equipment can be brought on-line and, conversely, retired and removed.

Security architectures and choice of O/S + Cloud management software is outside the scope of this piece.

There are many multi-processing architectures, each best suited to particular workloads.
They are outside the scope of this piece, but locally attached GPU's are about to become standard options.
Most servers will acquire what were once known as vector processors, and applications using this capacity will start to become common. This trend may need its own Design Rule(s).

Different, though potentially similar design rules apply for small to mid-size Beowulf clusters, depending on their workload and cost constraints.
Large-scale or high-performance compute clusters or storage farms, such as the IBM 120 Petabyte system, need careful design by experienced specialists. With any technology, "pushing the envelope" requires special attention from the best people you have to even have a chance of success.

Unsurprisingly, this organisation looks a lot like the current fad, "Cloud Computing", and the last fad, "Services Oriented Architecture".



Google and Amazon dominated their industry segments partly because they figured out the technical side of their business early on. They understood how to design and deploy datacentres suitable for their workload, how to manage Performance and balance Capacity and Cost.

Their "workloads", and hence server designs, are very different:
  • Google serves pure web-pages, with almost no coupling/communication between servers.
  • Amazon has front-end web-servers backed by complex database systems.
Dell is now selling a range of "Cloud Servers" purportedly based on the systems they supply to large Internet companies.