2010/05/03

Everything Old is New Again: Cray's CPU design

I found myself writing, during a commentary on the evolution of SSDs in servers, that large, slow memory of the kind Seymour Cray used (instead of cache) would affect the design of Operating Systems. The new scheduling paradigm:
Allocate a thread to a core and let it run until it finishes, waits for (network) input, or needs to read/write the network.
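
A minimal sketch of that paradigm, in C. All the types and names here are invented for illustration:

    /* Run-to-completion scheduling, one run queue per core.
     * No timer interrupts, no forced context switches. */
    #include <stdio.h>

    typedef struct task {
        const char  *name;
        int          work_left;     /* stand-in for real computation */
        struct task *next;
    } task_t;

    /* Run each task until it finishes or blocks on (network) I/O. */
    static void core_run(task_t *run_queue)
    {
        for (task_t *t = run_queue; t; t = t->next) {
            while (t->work_left > 0)
                t->work_left--;     /* run flat-out, caches stay warm */
            printf("%s finished or blocked on I/O, core moves on\n", t->name);
        }
    }

    int main(void)
    {
        task_t b = { "task-b",  500, NULL };
        task_t a = { "task-a", 1000, &b };
        core_run(&a);
        return 0;
    }
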
This leads into how Seymour Cray dealt with Multi-Processing: he used multi-level CPUs:
  • Application Processors (APs): many bits and many complex features, like Floating Point and other fancy stuff, but no kernel-mode features and no access to protected regions of hardware or memory, and
  • Peripheral Processors (PPs): really a single very simple, very high-speed processor, multiplexed to look like 10 small, slower processors, which performed all kernel functions and controlled the operation of the Application Processors.
Not only did this organisation result in very fast systems (Cray's designs were the fastest in the world for around two decades), but very robust and secure ones as well: the NSA and other TLAs used them extensively.
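
That PP multiplexing trick (the "barrel") is simple enough to sketch: one fast execution unit rotates through ten register sets, one per cycle, so it behaves like ten slower, independent processors. The types and the stand-in "instruction" here are invented:

    #include <stdio.h>

    #define NUM_PPS 10

    typedef struct {
        int pc;     /* program counter for this virtual PP */
        int acc;    /* accumulator */
    } pp_state_t;

    static void pp_step(pp_state_t *pp, int id)
    {
        pp->acc += id;   /* stand-in for executing one instruction */
        pp->pc++;
    }

    int main(void)
    {
        pp_state_t barrel[NUM_PPS] = {0};

        /* Each trip around the barrel gives every virtual PP one cycle. */
        for (int cycle = 0; cycle < 30; cycle++) {
            int id = cycle % NUM_PPS;
            pp_step(&barrel[id], id);
        }

        for (int i = 0; i < NUM_PPS; i++)
            printf("PP%d: pc=%d acc=%d\n", i, barrel[i].pc, barrel[i].acc);
        return 0;
    }
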

The received wisdom is that interrupt handling is the definitive way to interface unpredictable hardware events to the O/S and the rest of the system, and that polling devices, the old way, is inefficient and expensive.

A fixed-overhead scheme costs more in compute cycles than an on-demand, or queuing, system until the utilisation rate is very high. Then the cost of all that flexibility (or Variety, in W. Ross Ashby's Cybernetics term) comes home to roost.
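
A back-of-envelope model makes the crossover concrete. The cycle counts below are illustrative guesses, not measurements:

    #include <stdio.h>

    int main(void)
    {
        double intr_cost = 2000.0;    /* cycles per interrupt (context switch etc.) */
        double poll_cost =   50.0;    /* cycles per poll of the device */
        double poll_rate = 10000.0;   /* polls per second, load-independent */

        /* Polling burns a constant budget; interrupts scale with load. */
        double polling_budget = poll_rate * poll_cost;    /* cycles/sec */
        double crossover      = polling_budget / intr_cost;

        printf("polling overhead: %.0f cycles/sec regardless of load\n",
               polling_budget);
        printf("interrupts cost more once load exceeds %.0f events/sec\n",
               crossover);
        return 0;
    }
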

Piers Lauder, of Sydney University and Bell Labs, improved the total system throughput of a VAX-11/780 running Unix V8 under continuous full (student/teaching) load by 30%, simply by changing the serial-line device driver from interrupt handling to polling.

All those expensive context switches went away, replaced by a predictable, fixed overhead.
Yes, when the system was idle or lightly loaded it spent a little more time polling, but the cost was marginal.
And if the system isn't flat-out, what's the meaning of an efficiency metric?
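
A polled receive path of that general kind might look like the sketch below. The "UART" is simulated with a buffer so the sketch compiles and runs; on real hardware the ready/read calls would be memory-mapped register accesses:

    #include <stdio.h>

    /* Simulated device state (hypothetical, for illustration). */
    static const char *rx_fifo = "hello\n";
    static size_t rx_pos;

    static int  uart_rx_ready(void) { return rx_fifo[rx_pos] != '\0'; }
    static char uart_rx_read(void)  { return rx_fifo[rx_pos++]; }

    /* Called on every clock tick: drain all pending characters in one
     * pass. Fixed, predictable overhead per tick, and no per-character
     * context switch. */
    static void serial_poll(void)
    {
        while (uart_rx_ready())
            putchar(uart_rx_read());
    }

    int main(void)
    {
        for (int tick = 0; tick < 3; tick++)   /* a few scheduler ticks */
            serial_poll();
        return 0;
    }
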

Dr Neil J Gunther has written about this effect extensively, with his Universal Scalability Law and other articles showing the equivalence of the seemingly disparate approaches of Vector Processing and SMP systems in the limit of their performance.
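
The USL itself is compact enough to compute directly: relative capacity C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1)), where sigma captures contention and kappa coherency delay. The coefficients in this sketch are made up for illustration:

    #include <stdio.h>

    /* Gunther's Universal Scalability Law: relative capacity of N
     * processors given contention (sigma) and coherency (kappa) terms. */
    static double usl(double n, double sigma, double kappa)
    {
        return n / (1.0 + sigma * (n - 1.0) + kappa * n * (n - 1.0));
    }

    int main(void)
    {
        for (int n = 1; n <= 64; n *= 2)
            printf("N=%2d  C(N)=%.2f\n", n, usl(n, 0.05, 0.001));
        return 0;
    }
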

My comment about big, slow memory changing Operating System scheduling can be combined with the Cray PP/AP organisation.

In the modern world of CMOS, micro-electronics and multi-core chips, we still face the same Engineering problem Seymour Cray was trying to solve:
For a given technology, how do you balance maximum performance with the Power/Heat Wall?
More power gives you more speed; more speed creates more Heat, which results in self-destruction, the "Halt and Catch Fire" problem. Silicon junctions/transistors are subject to thermal runaway: as they get hotter, they consume more power and get hotter still. At some point that becomes a vicious cycle (a positive feedback loop) and it's game over. Good chip/system designs balance on just the right side of this knife edge.
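
The feedback loop is easy to demonstrate with a toy model. The constants below are invented purely to show the knife edge, not real device physics:

    #include <stdio.h>

    int main(void)
    {
        double ambient = 40.0;    /* deg C */
        double r_th    = 0.5;     /* deg C per watt of dissipation */
        double p0      = 80.0;    /* watts at ambient temperature */
        double alpha   = 0.01;    /* fractional power increase per deg C */

        double t = ambient;
        for (int i = 0; i < 20; i++) {
            double power = p0 * (1.0 + alpha * (t - ambient));
            t = ambient + r_th * power;    /* hotter -> more power -> hotter */
            printf("step %2d: %6.1f W, %6.1f C\n", i, power, t);
        }
        /* While r_th * p0 * alpha < 1 this settles (here around 107 C);
         * raise alpha to 0.03 and the loop diverges: game over. */
        return 0;
    }
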

How could the Cray PP/AP organisation be applied to current multi-core chip designs?
  1. Separate the CPU designs for kernel-mode and Application Processors.
    A single chip need only have a single kernel-mode CPU controlling a number of Application CPUs. With its constant overhead cost already "paid for", Application performance will scale very close to linearly right up to the limit.
  2. Application CPUs don't suffer forced context switches. They roar along as fast as they can for as long as they can, or until the kernel scheduler decides they've had their fair share.
  3. System Performance and Security both improve by using different instruction sets and processor architectures for the kernel and the applications. While a virus/malware might be able to compromise an Application, it can't migrate into the kernel unless the kernel itself is buggy. The Security Boundary and Partitioning Model is very strong.
  4. There doesn't have to be competition between the kernel-mode CPU and the APs for cache memory 'lines'. In fact, the same memory-cell designs/organisations used for L1/L2 cache can be provided as small (1-2MB) amounts of very fast direct-access memory: the modern equivalent of "all register" memory (used in the sketch after this list).
  5. Because the kernel-mode CPU and APs don't contend for cache lines, each benefits hugely in raw performance.
    Another, more subtle, benefit is that the kernel can avoid both the 'snoopy cache' (shared between all CPUs) and the VM system. That means a much simpler, much faster and smaller (= cooler) design.
  6. The instruction set for the kernel-mode CPU will be optimised for speed, simplicity and minimal transistor count. You can forget about speculative execution and the other really heavy-weight solutions necessary in the AP world.
  7. The AP instruction set must be fixed and well-known, while the kernel-mode CPU instruction set can be tweaked, or entirely changed, for each hardware/fabrication iteration. The kernel-mode CPU runs what we'd now call either a hypervisor or a micro-kernel: very small, very fast and with just enough capability. A side effect is that the chip manufacturers can do what they do best - fiddle with the internals - and provide a standard hypervisor for other O/S vendors to build upon.
Cheaper, Faster, Cooler, more robust, more Secure, and able to scale better.
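
Here is a rough sketch of the kernel-core/AP interface implied above: APs post syscall requests into mailboxes in the fast shared memory, and the kernel core services them in a fixed polling sweep. Everything here (names, layout, syscall numbering) is hypothetical, and the demo is single-threaded where a real chip would run the two sides on separate cores:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_AP 8

    typedef struct {
        atomic_int request;   /* 0 = empty, otherwise a syscall number */
        uint64_t   arg;
        atomic_int done;
    } mailbox_t;

    /* Would live in the small, very fast "all register" memory. */
    static mailbox_t mbox[NUM_AP];

    /* AP side: post a request; a real AP would then wait on 'done'
     * (ideally with a cheap monitor/wait, not a hot spin). */
    static void ap_post(int ap, int nr, uint64_t arg)
    {
        mbox[ap].arg = arg;
        atomic_store(&mbox[ap].done, 0);
        atomic_store(&mbox[ap].request, nr);
    }

    /* Kernel-core side: the constant-overhead polling sweep argued for
     * above. Every AP mailbox is checked once per pass, no interrupts. */
    static void kernel_sweep(void)
    {
        for (int ap = 0; ap < NUM_AP; ap++) {
            int nr = atomic_exchange(&mbox[ap].request, 0);
            if (nr) {
                printf("kernel core: syscall %d from AP%d, arg=%llu\n",
                       nr, ap, (unsigned long long)mbox[ap].arg);
                atomic_store(&mbox[ap].done, 1);
            }
        }
    }

    int main(void)
    {
        ap_post(2, 7, 42);    /* AP2 asks for (say) a network read */
        ap_post(5, 3, 99);    /* AP5 asks for a timer */
        kernel_sweep();
        return 0;
    }
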

What's not to like in this organisation?
