2010/05/09

Microsoft Troubles - IX: the story unfolds, with Apple closing in on Microsoft in size.

Three pieces in the trade press showing how things are unfolding.

Om Malik points out that Intel's and Microsoft's fortunes are closely intertwined.
Jean-Louis Gassée suggests that "Personal Computing" (on those pesky Personal Computers) is downsizing and changing.
Joe Wilcox analyses Microsoft's latest results and contrasts them a little with Apple's.

2010/05/03

Everything Old is New Again: Cray's CPU design

I found myself writing, during a commentary on the evolution of SSD's in servers, that large, slow memory of the kind Seymour Cray used (not cache) would affect the design of Operating Systems. The new scheduling paradigm:
Allocate a thread to a core and let it run until it finishes and waits for (network) input, or until it needs to read from or write to the network.
This leads into how Seymour Cray dealt with Multi-Processing: he used multi-level CPU's:
  • Application Processors (AP's): many bits, many complex features like Floating Point and other fancy stuff, but no kernel-mode features and no access to protected regions of hardware or memory, and
  • Peripheral Processors (PP's): really a single very simple, very high-speed processor, multiplexed to look like 10 small, slower processors, which performed all kernel functions and controlled the operation of the AP's.
Not only did this organisation result in very fast systems (Cray's designs were the fastest in the world for around 2 decades), but very robust and secure ones as well: the NSA and other TLA's used them extensively.

The common received wisdom is that interrupt handling is the definitive way to interface unpredictable hardware events with the O/S and the rest of the system, and that polling devices, the old way, is inefficient and expensive.

A fixed-overhead scheme costs more in compute cycles than an on-demand, or queuing, system until the utilisation rate is very high. Then the cost of all that flexibility (or 'Variety', in W. Ross Ashby's Cybernetics terms) comes home to roost.

Piers Lauder of Sydney University and Bell Labs improved total system throughput of a VAX-11/780 running Unix V8 under continuous full (student/teaching) load by 30% by changing the serial-line device driver from 'interrupt handling' to polling.

All those expensive context-switches went away, to be replaced by a predictable, fixed overhead.
Yes, when the system was idle or under low load, it spent a little more time polling, but only marginally more.
And if the system isn't flat-out, what's the meaning of an efficiency metric?
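
To make the trade-off concrete, here's a back-of-envelope sketch in Python. The per-character and per-poll costs are numbers I've assumed purely for illustration - they are not Piers Lauder's measurements - but the crossover behaviour is the point:

# Back-of-envelope model of per-character interrupt cost vs a fixed polling sweep.
# The costs below are assumed for illustration; they are not measured figures.

INTERRUPT_COST_US = 100   # context switch + handler per character (assumed)
POLL_COST_US = 20         # one sweep over all serial-line FIFOs (assumed)
POLLS_PER_SEC = 1000      # fixed polling rate (assumed)

def cpu_overhead_us_per_sec(chars_per_sec):
    """Return (interrupt_overhead, polling_overhead) in microseconds of CPU per second."""
    interrupt = chars_per_sec * INTERRUPT_COST_US   # grows with load
    polling = POLLS_PER_SEC * POLL_COST_US          # constant, load-independent
    return interrupt, polling

for load in (10, 100, 1000, 10000):                 # characters arriving per second
    i, p = cpu_overhead_us_per_sec(load)
    print(f"{load:>6} chars/s: interrupts use {i / 1e6:.1%} of the CPU, polling {p / 1e6:.1%}")

With these made-up numbers the crossover sits at a few hundred characters per second; above that the fixed overhead is the cheaper scheme, which is exactly the regime a fully loaded teaching machine lives in.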

Dr Neil J Gunther has written about this effect extensively, with his Universal Scalability Law and other articles showing the equivalence, at the limit of their performance, of the seemingly disparate approaches of Vector Processing and SMP systems.

My comment about big, slow memory changing Operating System scheduling can be combined with the Cray PP/AP organisation.

In the modern world of CMOS, micro-electronics and multi-core chips, we are still facing the same Engineering problem Seymour Cray was attempting to solve optimally:
For a given technology, how do you balance maximum performance with the Power/Heat Wall?
More power gives you more speed, but this creates more Heat, which results in self-destruction, the "Halt and Catch Fire" problem. Silicon junctions/transistors are subject to thermal runaway: as they get hotter, they consume more power and get hotter still. At some point that becomes a vicious cycle (a positive feedback loop) and it's game over. Good chip/system designs balance on just the right side of this knife edge.
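
As a toy illustration of that feedback loop (every figure below is invented and describes no real chip or heatsink): leakage power grows with temperature, the heatsink removes heat in proportion to the rise above ambient, and the loop either settles or runs away depending on which slope wins.

# Toy positive-feedback model of thermal runaway. All figures are invented for
# illustration; nothing here describes a real chip or heatsink.

AMBIENT_C = 40.0
COOLING_W_PER_C = 1.0     # watts the heatsink removes per degree above ambient (assumed)
BASE_POWER_W = 60.0       # switching power, roughly temperature-independent (assumed)
LEAKAGE_W_PER_C = 0.4     # extra leakage watts per degree above ambient (assumed)
JUNCTION_LIMIT_C = 150.0  # "Halt and Catch Fire" point (assumed)

def settle(max_steps=1000):
    """Iterate the power/heat loop; return the equilibrium temperature, or None on runaway."""
    temp = AMBIENT_C
    for _ in range(max_steps):
        power = BASE_POWER_W + LEAKAGE_W_PER_C * (temp - AMBIENT_C)
        new_temp = AMBIENT_C + power / COOLING_W_PER_C
        if new_temp > JUNCTION_LIMIT_C:
            return None                      # the vicious cycle wins: game over
        if abs(new_temp - temp) < 0.01:
            return new_temp                  # the feedback loop has settled
        temp = new_temp
    return temp

print(settle())   # ~140 C with these numbers; raise LEAKAGE_W_PER_C past COOLING_W_PER_C and it returns None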

How could the Cray PP/AP organisation be applied to current multi-core chip designs?
  1. Separate the CPU designs for kernel-mode and Application Processors.
    A single chip need only have a single kernel-mode CPU controlling a number of Application CPU's. With its constant overhead cost already "paid for", scaling of Application performance is going to be very close to linear right up until the limit.
  2. Application CPU's don't have forced context switches. They roar along as fast as they can for as long as they can, or the kernel scheduler decides they've had their fair share.
  3. System Performance and Security both improve by using different instruction sets and processor architectures for kernel-mode and Application processing. While a virus/malware might be able to compromise an Application, it can't migrate into the kernel unless the kernel itself is buggy. The Security Boundary and Partitioning Model is very strong.
  4. There doesn't have to be competition between the kernel-mode CPU and the AP's for cache memory 'lines'. In fact, the same memory cell designs/organisations used for L1/L2 cache can be provided as small (1-2MB) amounts of very fast direct access memory. The modern equivalent of "all register" memory.
  5. Because the kernel-mode CPU and AP's don't contend for cache lines, each will benefit hugely in raw performance.
    Another, more subtle, benefit is that the kernel can avoid both the 'snoopy cache' (shared between all CPU's) and the VM system. It means a much simpler, much faster and smaller (= cooler) design.
  6. The instruction set for the kernel-mode CPU will be optimised for speed, simplicity and minimal transistor count. You can forget about speculative execution and other really heavy-weight solutions necessary in the AP world.
  7. The AP instruction set must be fixed and well-known, while the kernel-mode CPU instruction set can be tweaked or entirely changed for each hardware/fabrication iteration. The kernel-mode CPU runs what we'd now call either a hypervisor or a micro-kernel. Very small, very fast and with just enough capability. A side effect is that the chip manufacturers can do what they do best - fiddle with the internals - and provide a standard hypervisor for other O/S vendors to build upon.
Cheaper, Faster, Cooler, more robust and Secure and able to scale better.

What's not to like in this organisation?
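
Here's a sketch of the scheduling half of the idea - purely my own toy model, not Cray's design nor any real chip's firmware. A single kernel-side scheduler hands whole jobs to Application Processors, and each AP runs its job to completion without a forced context switch:

# Toy model of the PP/AP split: one kernel-mode scheduler assigns whole units of
# work to Application Processors; each AP runs its job to completion, unpreempted.
# Job names and run times are made up for the example.

from dataclasses import dataclass
from collections import deque

@dataclass
class AppProcessor:
    core_id: int
    busy_until: float = 0.0      # simulated time at which this AP next falls idle

def schedule(jobs, num_aps=4):
    """Kernel-side scheduler: give each queued job to the first AP to become free."""
    aps = [AppProcessor(i) for i in range(num_aps)]
    queue = deque(jobs)          # (name, run_time) pairs
    finish_times = {}
    while queue:
        name, run_time = queue.popleft()
        ap = min(aps, key=lambda a: a.busy_until)   # next AP to fall idle
        ap.busy_until += run_time                   # runs to completion on that core
        finish_times[name] = ap.busy_until
    return finish_times

print(schedule([("db-query", 3.0), ("render", 5.0), ("compress", 2.0),
                ("ml-step", 4.0), ("log-scan", 1.0)]))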

A Good Question: When will Computer Design 'stabilise'?

The other night I was talking to my non-Geek friend about computers and he formulated what I thought was A Good Question:
When will they stop changing??
This was in reaction to me talking about my experience in suggesting a Network Appliance, a high-end Enterprise Storage device, as shared storage for a website used by a small research group.
It comes with a 5 year warranty, which leads to the obvious question:
will it be useful, relevant or 'what we usually do' in 5 years?
I think most of the elements in current systems are here to stay, at least for the evolution of Silicon/Magnetic recording. We are staring at 'the final countdown', i.e. hitting physical limits of these technologies, not necessarily their design limits. Engineers can be very clever.

The server market has already fragmented into "budget", "value" and "premium" species.
The desktop/laptop market continues to redefine itself - and more 'other' devices arise. The 100M+ iPhones already out there, in particular, demonstrate this.

There's a new major step in server evolution just breaking:
Flash memory for large-volume working and/or persistent storage.
What now may be called internal or local disk.
This implies a major re-organisation of even low-end server installations:
Fast local storage and large slow network storage - shared and reliable.
When the working set of Application data in databases and/or files fits on (affordable) local flash memory, response times improve dramatically because all that latency is removed. By definition, data outside the working set isn't a rate-limiting step, so its latency only slightly affects system response time. However, the throughput of the network storage, the other side of the Performance Coin, has to match or beat that of the local storage, or it will become the system bottleneck.
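
Some rough arithmetic makes the point. The latencies below are assumptions of mine, chosen only to be about the right order of magnitude:

# Rough average-latency arithmetic for a working set held on local flash.
# Both latency figures are assumptions, chosen only for order of magnitude.

FLASH_LATENCY_MS = 0.1           # local flash access (assumed)
NETWORK_STORE_LATENCY_MS = 10.0  # networked storage array access (assumed)

def avg_latency_ms(working_set_hit_ratio):
    miss = 1.0 - working_set_hit_ratio
    return working_set_hit_ratio * FLASH_LATENCY_MS + miss * NETWORK_STORE_LATENCY_MS

for hit in (0.90, 0.99, 0.999):
    print(f"working-set hit ratio {hit:.1%}: average access latency {avg_latency_ms(hit):.2f} ms")

At a 90% hit ratio the slow network storage still dominates the average; at 99% the two contribute about equally, and by 99.9% the local flash latency is all that matters.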

An interesting side question:
How will Near-Zero-Latency local storage impact system 'performance', both response times (a.k.a. latency) and throughput?

I conjecture that both system latency and throughput will improve markedly, possibly super-linearly, because one of the bugbears of Operating Systems, the context switch, will be removed. Systems have to expend significant effort/overhead in 'saving their place', deciding what to do next, and then, when the data is finally ready/available, stopping what they were doing and starting again where they left off.

The new processing model, especially for multi-core CPU's, will be:
Allocate a thread to a core and let it run until it finishes and waits for (network) input, or until it needs to read from or write to the network.
Near zero-latency storage removes the need for complex scheduling algorithms and associated queuing. It improves both latency and throughput by removing a bottleneck.
It would seem that Operating Systems might benefit from significant redesign to exploit this effect, in much the same way that RAM is now large and cheap enough that system 'swap space' is either an anachronism or unused.
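
A crude per-request sketch of why (all costs below are assumed, purely for illustration): once the I/O wait is smaller than the cost of switching away and back, blocking and rescheduling stops paying for itself.

# Per-request sketch of context-switch overhead vs simply waiting out the I/O.
# All costs are assumed for illustration; the model ignores any useful work the
# core might do for other threads while this one is blocked.

CONTEXT_SWITCH_US = 10.0   # save state, pick the next thread, restore state (assumed)
COMPUTE_US = 50.0          # CPU work between I/O requests (assumed)

def time_per_request_us(io_latency_us, switch_on_io):
    """Wall-clock time for one compute + I/O cycle, as seen by the requesting thread."""
    if switch_on_io:
        # block on the I/O, switch away and switch back: two context switches per request
        return COMPUTE_US + io_latency_us + 2 * CONTEXT_SWITCH_US
    # run to completion: just wait out the (tiny) I/O latency in place
    return COMPUTE_US + io_latency_us

for latency in (5000.0, 100.0, 5.0):   # spinning disk, networked flash, local flash (microseconds, assumed)
    blocked = time_per_request_us(latency, switch_on_io=True)
    direct = time_per_request_us(latency, switch_on_io=False)
    print(f"I/O latency {latency:>7.1f} us: block-and-switch {blocked:.1f} us, run-to-completion {direct:.1f} us")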

The evolution of USB flash drives saw prices/Gb halving every year. I've recently seen 4Gb SDHC cards at the supermarket for ~$15, whereas in 2008 I paid ~$60 for a 4Gb USB drive.

Rough server pricing for RAM in 2010 is A$65/Gb ±$15.
List prices by Tier 1/2 vendors for a 64Gb SSD are $750-$1000 (around 2-4 times cheaper from 'white box' suppliers).
I've seen these firmware-limited to 50Gb to bring performance and reliability up to something comparable with current production HDD specs.
That works out to $12-$20/Gb, depending on which base size and price are used.

Disk drives are ~A$125 for 7200rpm SATA and $275-$450 for 15K SAS drives.
With 2.5" drives priced in-between.
I.e. $0.125/Gb for 'big slow' SATA disks and ~$1/Gb for fast SAS disks.

Roll forward 5 years to 2015 and 'SSD' might've doubled in size three times, plus seen the unit price drop. Hard disks will likely follow the same trend of 2-3 doublings.
Say a 400Gb SSD for $300: $0.75/Gb
2.5" drives might be up to 2-4Tb in 2015 (from 500Gb in 2010) and cost $200: $0.05-0.10/Gb
RAM might be down to $15-$30/Gb.
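
The arithmetic behind those guesses, as a sketch - the doubling counts and end prices are my assumptions, not anyone's roadmap:

# Back-of-envelope projection using the 2010 figures above. The doubling counts
# and end prices are assumptions of mine, not vendor roadmaps.

def project(capacity_gb, price, doublings, price_factor):
    """Project capacity and $/Gb after some capacity doublings and a price change."""
    future_capacity = capacity_gb * 2 ** doublings
    future_price = price * price_factor
    return future_capacity, future_price / future_capacity

# 2010 SSD: 50Gb usable at ~$875 (middle of $750-$1000); assume 3 doublings and the unit price falling to ~$300
print(project(50, 875, 3, 300 / 875))    # ~400Gb at ~$0.75/Gb in 2015

# 2010 2.5" disk: 500Gb at ~$200; assume 2 or 3 doublings at a similar unit price
print(project(500, 200, 2, 1.0))         # ~2Tb at ~$0.10/Gb
print(project(500, 200, 3, 1.0))         # ~4Tb at ~$0.05/Gb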

A caveat with disk storage pricing: 10 years ago RAID 5 became necessary for production servers to avoid permanent data loss.
We've now passed another event horizon: Dual-parity, as a minimum, is required on production RAID sets.

On production servers, the price of storage has to factor in the multiple overheads of building high-reliability storage from unreliable parts: redundant {disks, controllers, connections}, parity and hot-swap disks and even fully mirrored RAID volumes, plus software, licences and their Operations, Admin and Maintenance. A problem electronics engineers solved 50+ years ago with N+1 redundancy.

Multiple parity is now needed because, in the time taken to recreate a failed drive, there's a significant chance of a second drive failure and total data loss. [Something NetApp has been pointing out and addressing for some years.] The reason for this is simple: the time to read/write a whole drive has steadily increased since ~1980. Areal density, i.e. linear recording density (bits per inch) times track density (tracks per inch), has increased faster than the read/write speed, which is roughly linear recording density times rotational speed.
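
The trend is easy to see with some representative figures (approximate values from memory, not exact drive specs):

# The widening rebuild window, using representative (approximate, from memory) figures
# for capacity and sustained transfer rate; not exact specifications of any drive.

drives = [
    # (era, capacity_gb, sustained_mb_per_sec)
    ("c.1990", 1, 1),
    ("c.2000", 40, 30),
    ("c.2010", 2000, 100),
]

for era, capacity_gb, rate_mb_s in drives:
    seconds = capacity_gb * 1000 / rate_mb_s          # whole-drive sequential read/write time
    print(f"{era}: {capacity_gb}Gb at {rate_mb_s}MB/s -> {seconds / 3600:.1f} hours to touch every block")

A rebuild measured in many hours is a long time for a RAID set to run without protection.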

Which makes running triple-mirrors a much easier entry point, or some bright spark has to invent a cheap-and-cheerful N-way data replication system. Like a general use Google File System.

Another issue is that current SSD offerings don't impress me.

They make great local disks or non-volatile buffers in storage arrays, but are not yet, in my opinion, quite ready for 'prime time'.

I'd like to see 2 things changed:
  • RAID-3 organisation with field-replaceable mini-drives, hot-swap preferred.
  • PCI, not SAS or SATA, connection. I.e. they appear as directly addressable memory.

This way the hardware can access flash as large, slow memory and the Operating System can fabricate that into a filesystem if it chooses; plus, if it has some knowledge of the on-chip flash memory controller, it can work much better with it. It saves multiple sets of interfaces and protocol conversions.

Direct access flash memory will always be cheaper and faster than SATA or SAS pseudo-drives.
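
As a sketch of what 'directly addressable' means to software, the snippet below uses an ordinary file plus mmap as a stand-in for a PCI-attached flash window. The path and size are hypothetical, and real direct-access flash would of course need driver and O/S support:

# Sketch of flash presented as directly addressable memory, using an ordinary file
# and mmap as a stand-in for a PCI-attached flash window. The path, size and layout
# are hypothetical; a real device would need driver and O/S support.

import mmap
import os

PATH = "/tmp/flash-window.img"   # stand-in for a memory-mapped flash region (hypothetical)
SIZE = 1 << 20                   # a 1Mb window is plenty for the sketch

# Create a backing file of the right size, then map it into the address space.
with open(PATH, "wb") as f:
    f.truncate(SIZE)

fd = os.open(PATH, os.O_RDWR)
try:
    window = mmap.mmap(fd, SIZE)
    window[0:16] = b"persistent bytes"   # plain loads/stores instead of block I/O
    window.flush()                       # push the change to the backing store
    print(window[0:16])
    window.close()
finally:
    os.close(fd)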

We would then see the following hierarchy of memory in servers:

  • Internal to server
    • L1/2/3 cache on-chip
    • RAM
    • Flash persistent storage
    • optional local disk (RAID-dual parity or triple mirrored)
  • External and site-local
    • network connected storage array, optimised for size, reliability, streaming IO rate and price not IO/sec. Hot swap disks and in-place/live expansion with extra controllers or shelves are taken as a given.
    • network connected near-line archival storage (MAID - Massive Array of Idle Disks)
  • External and off-site
    • off-site snapshots, backups and archives.
      Which implies a new type of business similar to Amazon's Storage Cloud.
The local network/LAN is going to be Ethernet (1Gbps or 10Gbps Ethernet, a.k.a. 10GE), or Infiniband if 10GE remains very expensive. Infiniband delivers 3-6Gbps over short distances on copper; external SAS currently uses the "multi-lane" connector to deliver four channels per cable. This is exactly right for use in a single rack.

I can't see a role for Fibre Channel outside storage arrays, and these will go if Infiniband speed and pricing continues to drop. Storage Arrays have used SCSI/SAS drives with internal copper wiring and external Fibre interfaces for a decade or more.
Already the premium network vendors, like Cisco, are selling "Fibre Channel over Ethernet" switches (FCoE using 10GE).

Nary a tape to be seen. (Hooray!)

Servers should tend to be 1RU, either full-width or half-width, though there will still be 3-4 styles of server:
  • budget: mostly 1-chip
  • value: 1 and 2-chip systems
  • lower power value systems: 65W/CPU-chip, not 80-90W.
  • premium SMP: fast CPU's, large RAM and many CPU's (90-130W ea)
If you want removable backups, stick 3+ drives in a RAID enclosure and choose between USB, firewire/IEEE 1394, e-SATA or SAS.

Being normally powered down, you'd expect extended lifetimes for disks and electronics.
But they'll need regular (3-6-12 months) read/check/rewrite cycling or the data will degrade and be permanently lost. Random 'bit-flipping' due to thermal activity, cosmic rays/particles and stray magnetic fields is the price we pay for very high density on magnetic media.
Which is easy to do if they are kept in a remote access device, not unlike "tape robots" of old.
Keeping archival storage "on a shelf" implies manual processes for data checking/refresh, and that is problematic to say the least.
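
A sketch of what that read/check/rewrite cycle could look like in software. The paths and manifest layout are hypothetical, and it assumes checksums were recorded when the pack was written:

# Sketch of a periodic read/check/rewrite pass over an archive pack. The mount point
# and manifest layout are hypothetical; it assumes checksums were recorded when the
# pack was written.

import hashlib
import json
from pathlib import Path

ARCHIVE = Path("/archive/pack01")      # hypothetical mount point of one backup pack
MANIFEST = ARCHIVE / "manifest.json"   # {"relative/path": "sha256 hex digest", ...}

def sha256_of(path):
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub():
    """Return the files whose current checksum no longer matches the manifest."""
    expected = json.loads(MANIFEST.read_text())
    return [name for name, digest in expected.items()
            if sha256_of(ARCHIVE / name) != digest]

# Run every 3-12 months; anything reported here must be rewritten from another good copy.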

3-5 2.5" drives will make a nice 'brick' for these removable backup packs.
Hopefully commodity vendors like Vantec will start selling multiple-interface RAID devices in the near future. Using current commodity interfaces should ensure they are readable at least a decade into the future. I'm not a fan of hardware RAID controllers in this application because if it breaks, you need to find a replacement - which may be impossible at a future date. (fails 'single point of failure' test).

Which presents another question if you use a software RAID and filesystem layout: will it still be available in your O/S of the future?
You're keeping copies of your applications, O/S, licences and hardware to recover/access archived data, aren't you? So this won't be a question... If you don't intend to keep the environment and infrastructure necessary to access archived data, you need to rethink what you're doing.

These enclosures won't be expensive, but shan't be cheap and cheerful:
Just what is your data worth to you?
If it has little value, then why are you spending money on keeping it?
If it is a valuable asset, potentially irreplaceable, then you must be prepared to pay for its upkeep in time, space and dollars. Just like packing old files into archive boxes and shipping them to a safe off-site facility, it costs money, and it isn't over once they are out of your sight.

Electronic storage is mostly cheaper than paper, but it isn't free and comes with its own limits and problems.

Summary:
  • SSD's are best suited and positioned as local or internal 'disks', not in storage arrays.
  • Flash memory is better presented to an Operating System as directly accessible memory.
  • Like disk arrays and RAM, flash memory needs to seamlessly cater for failure of bits and whole devices.
  • Hard disks have evolved to need multiple parity drives to keep the risk of total data loss acceptably low in production environments.
  • Throughput of storage arrays, not latency, will become their defining performance metric.
    New 'figures of merit' will be (see the rough worked example after this summary):
    • Volumetric: Gb per cubic-inch
    • Power: Watts per Gb
    • Throughput: Gb per second per read/write-stream
    • Bandwidth: Total Gb per second
    • Connections: number of simultaneous connections.
    • Price: $ per Gb available and $ per Gb/sec per server and total
    • Reliability: probability of 1 byte lost per year per Gb
    • Archive and Recovery features: snapshots, backups, archives and Mean-Time-to-Restore
    • Expansion and Scalability: maximum size (Gb, controllers, units, I/O rate) and incremental pricing
    • Off-site and removable storage: RAID-5 disk-packs with multiple interfaces are needed.
  • Near Zero-latency storage implies reorganising and simplifying Operating Systems and their scheduling/multi-processing algorithms. Special CPU support may be needed, like for Virtualisation.
  • Separating networks {external access, storage/database, admin, backups} becomes mandatory for performance, reliability, scaling and security.
  • Pushing large-scale persistent storage onto the network requires a commodity network faster than 1Gbps Ethernet. This will either be 10Gbps Ethernet or multi-lane 3-6Gbps Infiniband.
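
For those 'figures of merit', a rough worked example - every number below is invented for a hypothetical array, not a real product:

# Worked example of a few of the figures of merit. Every number describes a
# hypothetical array invented for the illustration, not a real product.

from dataclasses import dataclass

@dataclass
class StorageArray:
    usable_gb: float
    volume_cubic_inches: float
    power_watts: float
    total_gb_per_sec: float
    price_dollars: float

    def figures_of_merit(self):
        return {
            "Gb per cubic inch": self.usable_gb / self.volume_cubic_inches,
            "Watts per Gb": self.power_watts / self.usable_gb,
            "Total Gb per second": self.total_gb_per_sec,
            "$ per Gb": self.price_dollars / self.usable_gb,
        }

# Hypothetical 2RU array: 24 x 2Tb drives with dual parity, ~44Tb usable
array = StorageArray(usable_gb=44_000, volume_cubic_inches=1_700,
                     power_watts=600, total_gb_per_sec=1.0, price_dollars=30_000)
print(array.figures_of_merit())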
Which leads to another question:
What might Desktops look like in 5 years?

Other Reading:
For a definitive theoretical treatment of aspects of storage hierarchies, Dr. Neil J Gunther, ex-Xerox PARC, now Performance Dynamics, has been writing about "The Virtualization Spectrum" for some time.

Footnote 1:
Is this idea of multi-speed memory (small/fast and big/slow) new or original?
No: Seymour Cray, the designer of the world's fastest computers for ~2 decades, based his designs on it. It appears to me to be an old idea whose time has come again.

From a 1995 interview with the Smithsonian:
SC: Memory was the dominant consideration. How to use new memory parts as they appeared at that point in time. There were, as there are today, large, relatively slow dynamic memory parts and much faster, smaller static parts. The compromise between using those types of memory remains the challenge today to equipment designers. There's a factor of four in terms of memory size between the slower part and the faster part. It's not at all obvious which is the better choice until one talks about specific applications. As you design a machine you're generally not able to talk about specific applications because you don't know enough about how the machine will be used to do that.
There is also a great PPT presentation on Seymour Cray by Gordon Bell entitled "A Seymour Cray Perspective", probably written as a tribute after Cray's untimely death in an auto accident.

Footnote 2:
The notion of "all files on the network" and invisible multi-level caches was built in 1990 at Bell Labs in their Unix successor, "Plan 9" (named for one of the worst movies of all time).
Wikipedia has a useful intro/commentary, though the original on-line docs are pretty accessible.

Ken Thompson and co built Plan 9 around 3 elements:
  • A single protocol (9P) of around 14 elements (read, write, seek, close, clone, cd, ...)
  • The Network connects everything.
  • Four types of device: terminals, CPU servers, Storage servers and the Authentication server.
Ken's original storage server had 3 levels of transparent storage (in sizes unheard of at the time):
  • 1Gb of RAM (more?)
  • 100Gb of disk (in an age where 1Gb drives were very large and exotic)
  • 1Tb of WORM storage (write-once optical disk. Unheard of in a single device)
The usual comment was, "you can go away for the weekend and all your files are still in either memory or disk cache".

They also pioneered permanent point-in-time archives on disk in something appearing to the user as similar to NetApp's 'snapshots' (though they didn't replicate inode tables and super-blocks).

My observations in this piece can be paraphrased as:
  • re-embrace Cray's multiple-memory model, and
  • embrace commercially the Plan 9 "network storage" model.

Promises and Appraising Work Capability and Proficiency

Max Wideman, PMI Distinguished Contributor and Person of the Year, and Canadian author of several Project Management books plus a slew of published papers, not only responded to, and published, some comments and conversations between us, but then edited up some more emails into a Guest Article on his site.

Many thanks to you, Max, for all your fine work and for seeing something useful in what I penned.