2010/05/03

A Good Question: When will Computer Design 'stabilise'?

The other night I was talking to my non-Geek friend about computers and he formulated what I thought was A Good Question:
When will they stop changing??
This was in reaction to my describing how I'd suggested a Network Appliance, a high-end Enterprise Storage device, as shared storage for a website used by a small research group.
It comes with a 5-year warranty, which leads to the obvious question:
will it be useful, relevant or 'what we usually do' in 5 years?
I think most of the elements in current systems are here to stay, at least for the remaining evolution of silicon and magnetic recording. We are staring at 'the final countdown', i.e. approaching the physical limits of these technologies, though not necessarily their design limits. Engineers can be very clever.

The server market has already fractured into "budget", "value" and "premium" species.
The desktop/laptop market continues to redefine itself, and more 'other' devices keep appearing; the 100M+ iPhones already out there demonstrate this.

There's a new major step in server evolution just breaking:
Flash memory for large-volume working and/or persistent storage.
What now may be called internal or local disk.
This implies a major re-organisation of even low-end server installations:
Fast local storage and large slow network storage - shared and reliable.
When the working set of Application data in databases and/or files fits on (affordable) local flash memory, response times improve dramatically because all that latency is removed. By definition, data outside the working set isn't a rate-limiting step, so its latency only slightly affects system response time. However, the network storage's throughput, the other side of the Performance Coin, has to match or beat that of the local storage, or it will become the system bottleneck.
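
To put rough numbers on this, here's a back-of-envelope sketch (Python). The latencies and working-set hit ratio are illustrative assumptions, not measurements:

    def avg_latency(hit_ratio, local_us, remote_us):
        # Weighted average latency when hit_ratio of accesses are served locally.
        return hit_ratio * local_us + (1.0 - hit_ratio) * remote_us

    flash_us  = 100.0    # assumed local flash read latency (microseconds)
    array_us  = 5000.0   # assumed networked disk-array latency (microseconds)
    hit_ratio = 0.95     # assumed fraction of accesses that hit the working set

    print("working set on local flash: %.0f us average access" %
          avg_latency(hit_ratio, flash_us, array_us))
    print("everything on the array   : %.0f us average access" %
          avg_latency(0.0, flash_us, array_us))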

An interesting side question:
How will near-zero-latency local storage impact system 'performance', both response times (a.k.a. latency) and throughput?

I conjecture that both system latency and throughput will improve markedly, possibly super-linearly, because one of the bug-bears of Operating Systems, the context switch, will largely be removed. Systems expend significant effort/overhead in 'saving their place', deciding what to do next, and then, when the data is finally ready/available, stopping what they were doing and starting again where they left off.

The new processing model, especially for multi-core CPU's, will be:
Allocate a thread to a core and let it run until it finishes, waits for (network) input, or needs to read/write the network.
Near-zero-latency storage removes the need for complex scheduling algorithms and their associated queuing. It improves both latency and throughput by removing a bottleneck.
It would seem that Operating Systems might benefit from significant redesign to exploit this effect, in much the same way that RAM has become large and cheap enough that system 'swap space' is either an anachronism or goes unused.
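
A toy model of that effect, with assumed per-switch and per-request costs chosen only to show the shape of the argument:

    switch_cost_us = 10.0   # assumed cost of one context switch (save/restore state, refill caches)
    cpu_work_us    = 200.0  # assumed CPU time a request actually needs
    io_waits       = 5      # assumed number of blocking storage waits per request

    # Classic model: each blocking wait costs two context switches (off the CPU and back on).
    blocking_us = cpu_work_us + io_waits * 2 * switch_cost_us
    # Run-to-completion model: near-zero-latency storage, so no blocking waits at all.
    run_to_completion_us = cpu_work_us

    print("CPU cost per request, blocking I/O     : %.0f us" % blocking_us)
    print("CPU cost per request, run-to-completion: %.0f us" % run_to_completion_us)
    print("throughput gain per core               : %.2fx" % (blocking_us / run_to_completion_us))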

The evolution of USB flash drives saw price per Gb halving every year. I've recently seen 4Gb SDHC cards at the supermarket for ~$15, whereas in 2008 I paid ~$60 for a 4Gb USB drive.

Rough server pricing for RAM in 2010 is A$65/Gb ±$15.
List prices from Tier 1/2 vendors for a 64Gb SSD are $750-$1000 (around 2-4 times cheaper from 'white box' suppliers).
I've seen these firmware-limited to 50Gb to bring performance and reliability up to current production HDD specs.
That works out to $12-$20/Gb, depending on which base capacity and price are used.

Disk drives are ~A$125 for 7200rpm SATA and $275-$450 for 15K SAS drives, with 2.5" drives priced in between.
I.e. ~$0.125/Gb for 'big slow' disks and ~$1/Gb for fast SAS disks.
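
Reducing those 2010 prices to dollars per Gb (the drive capacities are assumptions implied by the figures above, not quotes):

    prices_2010 = [
        ("SSD, 64Gb raw at $750",         750.0,   64),
        ("SSD, 50Gb usable at $1000",    1000.0,   50),
        ("7200rpm SATA, ~1000Gb, A$125",  125.0, 1000),
        ("15K SAS, ~300Gb, $275",         275.0,  300),
    ]
    for name, price, size_gb in prices_2010:
        print("%-30s $%6.2f/Gb" % (name, price / size_gb))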

Roll forward 5 years to 2015 and 'SSD' might have doubled in capacity three times and seen the unit price drop. Hard disks will likely follow a similar trend of 2-3 doublings.
Say a 400Gb SSD for $300: $0.75/Gb
2.5" drives might be up to 2-4Tb in 2015 (from 500Gb in 2010) and cost $200: $0.05-0.10/Gb
RAM might be down to $15-$30/Gb.
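
The same projection restated as arithmetic; the doubling counts and 2015 unit prices are just the guesses above, not forecasts of my own:

    projections_2015 = [
        # label,               2010 Gb, doublings, assumed 2015 unit price
        ("SSD",                    50,   3,  300.0),   # 50Gb -> 400Gb
        ("2.5in disk (2 dbl.)",   500,   2,  200.0),   # 500Gb -> 2Tb
        ("2.5in disk (3 dbl.)",   500,   3,  200.0),   # 500Gb -> 4Tb
    ]
    for label, gb_2010, doublings, price in projections_2015:
        gb_2015 = gb_2010 * 2 ** doublings
        print("%-20s %5d Gb  $%.2f/Gb" % (label, gb_2015, price / gb_2015))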

A caveat with disk storage pricing: 10 years ago RAID 5 became necessary for production servers to avoid permanent data loss.
We've now passed another event horizon: Dual-parity, as a minimum, is required on production RAID sets.

On production servers, the price of storage has to factor in the multiple overheads of building high-reliability storage from unreliable parts: redundant {disks, controllers, connections}, parity and hot-swap disks, even fully mirrored RAID volumes, plus software, licences and their Operations, Admin and Maintenance. This is a problem electronics engineers solved 50+ years ago with N+1 redundancy.

Multiple Parity is now needed because in the time taken to recreate a failed drive, there's a significant chance of a second drive failure and total data loss. [Something NetApp has been pointing out and addressing for some years.] The reason for this is simple: the time to read/write a whole drive has steadily increased since ~1980. Areal density, the product of recording density (bits per inch) and track density (tracks per inch), has grown faster than read/write speed, which scales roughly as recording density times rotational speed.
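
One common way to put numbers on that risk: during a rebuild every surviving drive must be read end-to-end, and a single unrecoverable read error (URE) at that point has the same effect as a second drive failure. A sketch, with assumed capacity, rebuild rate and URE rate:

    import math

    drive_tb        = 1.0      # assumed drive capacity (TB)
    surviving_disks = 7        # e.g. an 8-drive RAID-5 set with one drive failed
    ure_per_bit     = 1e-14    # a commonly quoted consumer-class URE rate
    rebuild_mb_s    = 80.0     # assumed sustained rebuild rate (MB/s)

    bits_read     = surviving_disks * drive_tb * 1e12 * 8
    p_clean       = math.exp(-ure_per_bit * bits_read)      # Poisson approximation
    rebuild_hours = (drive_tb * 1e6 / rebuild_mb_s) / 3600.0

    print("rebuild window        : %.1f hours" % rebuild_hours)
    print("P(rebuild hits a URE) : %.0f%%" % (100.0 * (1.0 - p_clean)))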

Which makes running triple-mirrors a much easier entry point, unless some bright spark invents a cheap-and-cheerful N-way data replication system, like a general-use Google File System.

Another issue is that current SSD offerings don't impress me.

They make great local disks or non-volatile buffers in storage arrays, but are not yet, in my opinion, quite ready for 'prime time'.

I'd like to see 2 things changed:
  • RAID-3 organisation with field-replaceable mini-drives, hot-swap preferred.
  • A PCI connection, not SAS or SATA; i.e. they appear as directly addressable memory.

This way the hardware can access flash as large, slow memory and the Operating System can fabricate that into a filesystem if it chooses - plus if it has some knowledge of the on-chip flash memory controller, it can work much better with it. It saves multiple sets of interfaces and protocol conversions.

Direct-access flash memory will always be cheaper and faster than SATA or SAS pseudo-drives.
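
From user space, 'directly addressable' just means bytes addressed by offset with no block protocol in between. A minimal sketch using Python's mmap, with an ordinary file standing in for a PCI-attached flash region (the file name and size are purely illustrative):

    import mmap, os

    path = "flash_region.img"
    size = 1 << 20                            # 1 MiB stand-in region

    with open(path, "wb") as f:               # create the stand-in region
        f.truncate(size)

    with open(path, "r+b") as f:
        region = mmap.mmap(f.fileno(), size)
        region[0:16] = b"persistent data\n"   # write by byte address
        print(region[0:16])                   # read it straight back
        region.flush()                        # push changes to the backing store
        region.close()

    os.remove(path)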

We would then see the following hierarchy of memory in servers:

  • Internal to server
    • L1/2/3 cache on-chip
    • RAM
    • Flash persistent storage
    • optional local disk (RAID-dual parity or triple mirrored)
  • External and site-local
    • network-connected storage array, optimised for size, reliability, streaming IO rate and price, not IO/sec. Hot-swap disks and in-place/live expansion with extra controllers or shelves are taken as a given.
    • network-connected near-line archival storage (MAID - Massive Array of Idle Disks)
  • External and off-site
    • off-site snapshots, backups and archives.
      Which implies a new type of business similar to Amazon's Storage Cloud.
The local network/LAN is going to be Ethernet (1Gbps or 10Gbps Ethernet, a.k.a. 10GE), or Infiniband if 10GE remains very expensive. Infiniband delivers 3-6Gbps over short distances on copper; external SAS currently uses the "multi-lane" connector to deliver four channels per cable. This is exactly right for use in a single rack.
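
To see why 1Gbps Ethernet won't do once bulk storage moves onto the network, consider how long it takes to move one (projected 2015) local SSD's worth of data over each candidate link. The 80% line-rate efficiency and the 4-lane Infiniband figure are assumptions:

    dataset_gb = 400          # gigabytes; one full 2015-sized local SSD
    efficiency = 0.8          # assumed usable fraction of raw line rate

    links_gbps = [
        ("1Gbps Ethernet",             1.0),
        ("10Gbps Ethernet (10GE)",    10.0),
        ("Infiniband, single lane",    4.0),   # mid-point of the 3-6Gbps above
        ("Infiniband, 4-lane cable",  16.0),
    ]
    for name, gbps in links_gbps:
        seconds = dataset_gb * 8 / (gbps * efficiency)
        print("%-26s %6.0f s (%.1f minutes)" % (name, seconds, seconds / 60.0))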

I can't see a role for Fibre Channel outside storage arrays, and even those will go if Infiniband speeds keep rising and prices keep dropping. Storage Arrays have used SCSI/SAS drives with internal copper wiring and external Fibre interfaces for a decade or more.
Already the premium network vendors, like Cisco, are selling "Fibre Channel over Ethernet" switches (FCoE using 10GE).

Nary a tape to be seen. (Hooray!)

Servers should tend to be 1RU, either full-width or half-width, though there will still be 3-4 styles of server:
  • budget: mostly 1-chip
  • value: 1 and 2-chip systems
  • lower power value systems: 65W/CPU-chip, not 80-90W.
  • premium SMP: fast CPU's, large RAM and many CPU's (90-130W ea)
If you want removable backups, stick 3+ drives in a RAID enclosure and choose between USB, FireWire/IEEE 1394, eSATA or SAS.

Being normally powered down, these backup packs should see extended lifetimes for both disks and electronics.
But they'll need regular (3-6-12 months) read/check/rewrite cycling or the data will degrade and be permanently lost. Random 'bit-flipping' due to thermal activity, cosmic rays/particles and stray magnetic fields is the price we pay for very high density on magnetic media.
Which is easy to do if they are kept in a remote access device, not unlike "tape robots" of old.
Keeping archival storage "on a shelf" implies manual processes for data checking/refresh, and that is problematic to say the least.
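
For what it's worth, the 'read/check' half of that cycle can be as simple as keeping a checksum manifest alongside the archive and re-verifying it on a schedule. A minimal sketch; the manifest name and layout are my own convention, not an established tool:

    import hashlib, json, os, sys

    MANIFEST = "manifest.sha256.json"     # my convention: {relative_path: digest}

    def checksum(path, chunk=1 << 20):
        # Stream the file through SHA-256 in 1 MiB chunks.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def verify(archive_dir):
        # Re-read every archived file and compare against the stored manifest.
        with open(os.path.join(archive_dir, MANIFEST)) as f:
            manifest = json.load(f)
        return [p for p, digest in manifest.items()
                if checksum(os.path.join(archive_dir, p)) != digest]

    if __name__ == "__main__":
        damaged = verify(sys.argv[1])
        print("damaged files:", damaged or "none")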

3-5 2.5" drives will make a nice 'brick' for these removable backup packs.
Hopefully commodity vendors like Vantec will start selling multiple-interface RAID devices in the near future. Using current commodity interfaces should ensure they remain readable at least a decade into the future. I'm not a fan of hardware RAID controllers in this application: if the controller breaks you need to find a replacement, which may be impossible at a future date (it fails the 'single point of failure' test).

Which raises another question for a software RAID and filesystem layout: will it still be available in your O/S of the future?
You're keeping copies of your applications, O/S, licences and hardware to recover/access archived data, aren't you? So this won't be a question... If you don't intend to keep the environment and infrastructure necessary to access archived data, you need to rethink what you're doing.

These enclosures won't be expensive, but shan't be cheap and cheerful:
Just what is your data worth to you?
If it has little value, then why are you spending money on keeping it?
If it is a valuable asset, potentially irreplaceable, then you must be prepared to pay for its upkeep in time, space and dollars. Just as packing old files into archive boxes and shipping them to a safe off-site facility costs money, the expense isn't over once they are out of your sight.

Electronic storage is mostly cheaper than paper, but it isn't free and comes with its own limits and problems.

Summary:
  • SSD's are best suited and positioned as local or internal 'disks', not in storage arrays.
  • Flash memory is better presented to an Operating System as directly accessible memory.
  • Like disk arrays and RAM, flash memory needs to seamlessly cater for failure of bits and whole devices.
  • Hard disks have evolved to need multiple parity drives to keep the risk of total data loss acceptably low in production environments.
  • Throughput of storage arrays, not latency, will become their defining performance metric.
    New 'figures of merit' will be:
    • Volumetric: Gb per cubic-inch
    • Power: Watts per Gb
    • Throughput: Gb per second per read/write-stream
    • Bandwidth: Total Gb per second
    • Connections: number of simultaneous connections.
    • Price: $ per Gb available and $ per Gb/sec per server and total
    • Reliability: probability of 1 byte lost per year per Gb
    • Archive and Recovery features: snapshots, backups, archives and Mean-Time-to-Restore
    • Expansion and Scalability: maximum size (Gb, controllers, units, I/O rate) and incremental pricing
    • Off-site and removable storage: RAID-5 disk-packs with multiple interfaces are needed.
  • Near Zero-latency storage implies reorganising and simplifying Operating Systems and their scheduling/multi-processing algorithms. Special CPU support may be needed, like for Virtualisation.
  • Separating networks {external access, storage/database, admin, backups} becomes mandatory for performance, reliability, scaling and security.
  • Pushing large-scale persistent storage onto the network requires a commodity network faster than 1Gbps ethernet. This will either be 10Gbps ethernet or multi-lane 3-6Gbps Infiniband.
Which leads to another question:
What might Desktops look like in 5 years?

Other Reading:
For a definitive theoretical treatment of aspects of storage hierarchies, Dr. Neil J Gunther, ex-Xerox PARC, now Performance Dynamics, has been writing about "The Virtualization Spectrum" for some time.

Footnote 1:
Is this idea of multi-speed memory (small/fast and big/slow) new or original?
No: Seymour Cray, the designer of the world's fastest computers for ~2 decades, based his designs on it. It appears to me to be an old idea whose time has come again.

From a 1995 interview with the Smithsonian:
SC: Memory was the dominant consideration. How to use new memory parts as they appeared at that point in time. There were, as there are today large dynamic memory parts and relatively slow and much faster smaller static parts. The compromise between using those types of memory remains the challenge today to equipment designers. There's a factor of four in terms of memory size between the slower part and the faster part. Its not at all obvious which is the better choice until one talks about specific applications. As you design a machine you're generally not able to talk about specific applications because you don't know enough about how the machine will be used to do that.
There is also a great PPT presentation on Seymour Cray by Gordon Bell entitled "A Seymour Cray Perspective", probably written as a tribute after Cray's untimely death in an auto accident.

Footnote 2:
The notion of "all files on the network" and invisible multi-level caches was built in 1990 at Bell Labs in their Unix successor, "Plan 9" (named for one of the worst movies of all time).
Wikipedia has a useful intro/commentary, though the original on-line docs are pretty accessible.

Ken Thompson and co built Plan 9 around 3 elements:
  • A single protocol (9P) of around 14 elements (read, write, seek, close, clone, cd, ...)
  • The Network connects everything.
  • Four types of device: terminals, CPU servers, Storage servers and the Authentication server.
Ken's original storage server had 3 levels of transparent storage (in sizes unheard of at the time):
  • 1Gb of RAM (more?)
  • 100Gb of disk (in an age when 1Gb drives were very large and exotic)
  • 1Tb of WORM storage (write-once optical disk. Unheard of in a single device)
The usual comment was, "you can go away for the weekend and all your files are still in either memory or disk cache".

They also pioneered permanent point-in-time archives on disk, in something that appeared to the user much like NetApp's 'snapshots' (though they didn't replicate inode tables and super-blocks).

 My observations in this piece can be paraphrased as:
  • re-embrace Cray's multiple-memory model, and
  • embrace commercially the Plan 9 "network storage" model.
