
2014/07/12

RAID++ and Storage Pools: Leveraging GPT partitions for Asymmetric Media Logical Volumes. Pt 1.

This is an exploration of Storage problems posed by directly connected or (ethernet) networked drives, not SAN-connected managed disks served by Enterprise Storage Arrays.

The Problem

One of the most important features of the Veritas Logical Volume Manager (LVM) circa 1995 was the ~1MB disk label that contained a full copy of the LVM metadata for the drive/volume and allowed drives to be renamed by the system or physically shuffled, intentionally or not.

Today we have a standard, courtesy of UEFI, for GUID Partition Tables (GPT) on Storage Devices, supported by all major Operating Systems. Can this provide similar, or additional, capability?

2014/07/10

RAID++ and Storage Pools: We may be asking the wrong questions

The implied contract between Storage Devices, once HDD's only, and systems is a rather weak one.
Storage Devices return blocks of data on a "Best Efforts" basis; failure & error handling are minimalist or non-existent.
There's no implicit contract with the many other components now needed to move data off the Storage Device and into Memory: HBA's, cables, adaptors, switches etc. The move to Ethernet and larger networks compounds the problem: networks are nowhere near error-free. This matters when routinely moving around Exabytes and more: errors and failures are guaranteed for any human-scale observation period.

Turning this weak assurance into usable levels of Reliability and Data Durability is currently left to a rather complex set of layers, which can have subtle & undetectable failure modes or, in "Recovery" mode, unusably poor performance and limited or no resilience against additional failures. We need to improve our models and move past current RAID schemes to routinely support thousands of small drives and new Storage Class Memory.

Scaling Storage to Petabyte and Exabyte sized Pools of mixed technologies needs some new thinking.
New mixed technologies now provide us with multiple Price-Size-Performance components, requiring very careful analysis to optimise Systems against owner criteria.

There is no one true balance between DRAM, PCI-Flash, SSD's, fast-HDD, slow-HDD and near-line/off-line HDD or tape and Optical Disk. What there is, is a willingness of an owner to pay. Presumably they have a preference to pay enough, but not significantly more, for their desired or required "performance", either as "response time" latency or "throughput". Very few clients can afford, or need, to store everything in DRAM with some sort of backup system. It's the highest performance and highest priced solution possible, but is only necessary or desirable in very constrained problems.

DRAM is around $10/GB, Flash and SSD about $1/GB and HDD's from $0.04 to $0.30/GB for raw disk.

Here's a possible new contract between Storage Devices and Clients/Systems:
Data is returned Correct, Complete and Verifiable, in whole or part, between the two Endpoints.
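
A sketch of what that contract might look like in code, purely illustrative (the names, the hash choice and the device read() call are my assumptions, not a real API):

import hashlib
from dataclasses import dataclass

@dataclass
class VerifiedBlock:
    """A block of data plus the integrity evidence needed to check it."""
    offset: int
    data: bytes
    sha256: str  # digest computed at the storage endpoint, before transit

def read_verified(device, offset: int, length: int) -> VerifiedBlock:
    """Hypothetical endpoint read: return the data together with its checksum.
    'device' is assumed to expose a raw read(offset, length) -> bytes."""
    data = device.read(offset, length)
    return VerifiedBlock(offset, data, hashlib.sha256(data).hexdigest())

def accept(block: VerifiedBlock) -> bytes:
    """Client-side check: the data is Correct, Complete and Verifiable,
    or it is rejected and the caller can retry or repair."""
    if hashlib.sha256(block.data).hexdigest() != block.sha256:
        raise IOError(f"checksum mismatch at offset {block.offset}")
    return block.data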

2014/06/22

RAID++: So, you can't afford the extra cost of Data Protection at $0.10-$0.20 per GB?

Summary: You and your business probably now depend on computers and smartphones/tablets for most of your daily work and other activities. If you don't pay up-front to protect your data, you'll pay for it many times over at a later date, when, not if, you have a drive fail and lose all data.

When data is $0.20/GB (or even $1/GB) and wages are $35-60/hour and it will take a minimum of 1 day to reconstruct data, more likely a week+, spending a little money up-front for Data Protection seems prudent to me.
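
A rough back-of-envelope using the figures above (the wage and reconstruction-time figures are from this post; the 1TB working set is an assumed round number):

# Back-of-envelope: up-front protection cost vs. reconstruction labour.
# Assumed: a 1 TB data set; wage and time figures are from the text above.
data_gb = 1000                       # assumed data set size
protection_cost = data_gb * 0.20     # $0.20/GB for protected storage

wage_per_hour = (35 + 60) / 2        # mid-point of $35-60/hour
hours_low, hours_high = 8, 5 * 8     # one day to a week+ of reconstruction

rebuild_low = wage_per_hour * hours_low
rebuild_high = wage_per_hour * hours_high

print(f"Up-front protection   : ${protection_cost:,.0f}")
print(f"Reconstruction labour : ${rebuild_low:,.0f} to ${rebuild_high:,.0f}")
# ~$200 up-front versus ~$380-$1,900 in labour, ignoring any data that
# simply cannot be recreated - often the larger cost.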

The 1987 Berkeley RAID paper was written at a time when few people had PC's and storage cost $40,000/GB in current dollars. The economics of swapping space for computation were compelling at the time; nowadays, very few people have even $1,000 invested in Disk Storage, let alone $250,000.

Good desktops or laptops are now available in the $500-$1,000 range, with Commodity Drives costing $0.04-$0.10/GB and Enterprise Drives from $0.12-$0.65/GB, and more for high-spec variants. Times are very different: raw prices have fallen ~500,000-fold, Bit Error Rates (BER/UBER) are up ~100 times, Mean Time Between Failures (MTBF) has increased 10-100 fold, raw read/write rates have increased 100-300 times, while access times (rotation & seek) are only 2-5 times different. As is estimated disk utilisation: at some point after 2000 the average drive went from 90%-100% full to ~75%, at least for Desktops. This suggests that drives are now "Big Enough" and not a System Constraint, at least not for Capacity. The advent of affordable, large Flash Memory with reasonable read/write speeds and uniform access times has removed one of the big constraints of storage: random I/O per second.

Researching RAID designs, I was surprised by I.T. Professionals and home users alike who baulk at the cost of reasonable Data Protection: even $100 for a single USB drive, or under $1,500 for a 4-drive NAS unit, is definitely "too expensive". Do they have such volumes of data that the cost of extra drives is overwhelming? Or is the data worth so little, or cost so little, that it's not worth protecting?

2013/01/04

Storage: Smaller is better in Hard Disks

I upgraded my TimeMachine ® drive two days ago on my Apple Desktop [2009 Intel Mac Mini, OS/X Snow Leopard].

I went to a local office-supply chain-store and bought a Seagate USB3 drive for $90, replacing my older 'Buffalo' USB2 drive, maybe 12-18 months old. While the Mac Mini only has USB2 ports and I can't realise the higher bus speed, it will work on newer computers as well.

My new drive is 1TB, 2.5", covered by the statutory 12mth warranty. Better to replace without pressure, than when things are broken... The old drive can go on the shelf as a long-term roll-back.

This is a trick I learnt from a friend with a number of Macbooks and iMacs:
permanently attach an external drive for TimeMachine... Cheap and Effective!
2.5" is important.
The consumer drive variants (not 'enterprise', designed for servers) are built to be mobile (laptops & portable devices) and so are more robust, shock-resilient, etc.

But that's not the 'secret sauce': it's power consumption, and its relative, heat dissipation.

2.5" drives can be fully powered by USB 2, in the normal/correct specification [not the 10W+ that the iPad demands, but well under 5W, even 2.5W]. They run cool, quietly, on low-power.

Not needing a "wall-wart" is really important: it can never be lost, never blow up, never "leak mains power" or cause earth-loop problems (my Western Digital 3.5" drive sparks against the frame when I plug it in. Scary.)

There's a reason based in physics for 2.5" drives using a lot less power:
the aerodynamic drag of the platters, the main power consumer (turned into heat), is affected by 3 factors: the number of platters, the rotational rate and the size of the platters.
Firstly, they only have 1 or 2 platters, versus 4 (max 5?) you'll find in 3.5" drives.
"low performance" 2.5" drives at 5400rpm spin slower than 3.5" drives, mostly7200rpm  -  a ratio of 1.33:1.
Aerodynamic drag increases with the cube (third power) of speed/rotational rate. The cube of 1.33 is 2.37.

Just these two factors reduce potential power demand by about five times, albeit for one-quarter the capacity compared to a 4TB 3.5" drive. But 1TB is more than enough to maintain snapshots of my 300GB drive and is a good fit for me.

For the next factor:
Disk Drive form-factors are related: each size has platters with half the area of the next larger size, so the diameter/radius ratio is 0.71 (or 1.41 the other way), the square root of one half...
The 3rd factor overshadows all others:
The drag is proportional to the fifth (5th) power of diameter.
For a diameter ratio of 0.71 (or 1.41), the fifth power is 0.18 (or 5.66).
Just halving the platter area, and halving capacity per platter, reduces power needs by another five times or so.

Without trying, 2.5" drives use about 20 times less power, for only 1/2 to 1/4 the capacity.
For the same capacity but a slower spin-rate, they use about 10 (ten) times less power.
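
A quick check of that arithmetic, using drag ~ platters x rpm^3 x diameter^5 (the platter counts and spin rates are the ones above; the constant of proportionality cancels in the ratio):

# Drag power scales as platters * rpm^3 * diameter^5; the constant
# cancels when we take a ratio between the two form factors.
def relative_drag(platters, rpm, diameter):
    return platters * rpm**3 * diameter**5

big   = relative_drag(platters=4, rpm=7200, diameter=1.0)       # 3.5" baseline
small = relative_drag(platters=2, rpm=5400, diameter=1/2**0.5)  # one size down

print(f"3.5\" vs 2.5\" drag ratio: {big / small:.1f}x")
# Platter count: 2x, rpm cubed: ~2.4x, diameter^5: ~5.7x, giving ~27x,
# i.e. roughly the "20 times less power" claimed above.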

Four 2.5" drives use just a little less space than a single 3.5" drive [102mm x 140mm x 19-25.4mm] vs [102mm x 147mm x 26.1mm], ignoring connectors.

The Small Form Factor Committee very nicely designed the footprint of each device size in the same ratio, meaning the length of the next smaller size is the width of the current size. If you've ever noticed, you can lay a 2.5" drive sideways across a 3.5" drive. Four 2.5" drives, in 2 stacks of 2, will fit within a box that would hold a single 3.5" drive. Adding connectors is trickier - they are at right-angles to each other in the two form factors.

A side benefit: you have 4 sets of independent electronics, buffers and heads working for you.

You get better aggregate throughput and seek performance from 4 smaller drives, OR you can sacrifice 25% of capacity for radically improved fault-resilience and set up RAID-5 over the 4 drives.

But 2.5" drives cost around 35-100% more per GB, depending on models.
It depends what value you put on your time and your data...
I think $90/TB vs $75/TB is insignificant compared to the benefits, even for home use.

If it's business or professional-use data with a modest drive count, the economics of 2.5" drives are a slam-dunk (or "no-brainer"). If you have petabytes to store, you've got other problems to solve.

At this point you might be wondering "If small is good, and much smaller is much better, why don't we have smaller Form Factors in common use?"

We do.

1.8 inch drives are manufactured in volume, but not sold in "computers", only embedded in mobile devices. They are tricky to manufacture, with the same problems as making watches.
The Small Form Factor Committee never specified a standard, especially for thickness, for these drives.

There have been previous credible attempts at micro-miniature HDD's: around 2000, IBM built the largest-capacity "Compact Flash" drive available. It was a hard disk. While they were both the highest capacity and the best price/MB, they weren't popular with working Professional Photographers: they were too fragile... One of the extreme advantages of solid-state memory, like "Flash", is its robustness, especially when turned off. If you keep static electricity away from them, they can be nearly indestructible. The data will fade away ("Flash" EPROM is not permanent) long before they succumb to mechanical damage...

But within a couple of years the market collapsed because real Flash memory overtook them in capacity, price/MB and transfer rate.

Not long after RAID-5 was proposed in 1989, there was a good academic paper that suggested by 2000 we'd have a new type of storage: many single platter 1 inch drives soldered onto boards.

The economics of very small form-factor disk drives turned out never to be compelling.

If nobody is building the drives in volume, the price cannot be competitive with substitutes, like Flash for smaller units and 2.5" and 3.5" drives for larger units.
If there is no demand for a product, then no manufacturer will invest in the R&D and manufacturing facilities to ramp-up to volume production.

It sounds like a chicken-and-egg problem: nobody builds small drives so nobody designs products based on them so there is no reason to build small drives.

But that's not the whole story: drive manufacturers can do the Maths and know their production economics and capabilities very well.
If they thought they could produce rugged, small drives that were cheaper and higher-capacity than Flash, then one of them would've tried it. If they'd made a profit, everyone would've followed, just like the 5.25 inch to 3.5 inch and 3.5 inch to 2.5 inch changeovers.

The stumbling block is per-device manufacturing cost.

Disks are complex mechanical devices, assembled individually to very fine tolerances. Their production costs are dominated by that, not raw materials, and the process doesn't scale well with volume. While disk capacity has doubled every year for 15 years, the minimum unit price has either remained the same or risen.

Very small format drives, even in high volume, would, per unit, cost a sizeable fraction of 2.5" or 3.5" drives.

Compact Flash chips, whilst small in capacity (1-32GB) compared to even 2.5 inch drives (1000GB), cost cents per chip to produce, and the technology to scale volume production of them is very well known and researched; more importantly, it lends itself to scaling up.

Flash currently sells for $1-$10/GB and 32GB is $10-20 retail, or less. SSD's are more expensive per GB than SD cards and USB Flash memory devices for a number of reasons related to speed, reliability/durability and wear-rate.

No 2.5" drive can be smaller than mid-capacity Flash memory or SSD: no user would buy one.
Or, the smallest 2.5" drives you can sell for $50-$60 are 250GB, or $0.20/GB.

If you could make a very small format disk drive for as little as $30 retail, it has to compete with Flash.
It'd have to be over 128GB to sell, or at least half the capacity of a small 2.5 inch drive, or one-eighth the size of high-capacity 2.5 inch drives.

Which gives a smallest economic form-factor of 1.25 inch, this year.
Next year, when Flash memory prices drop another 30%, it's marginal, and the year after, when Flash cost/GB has halved, completely uneconomic. A two-year product life is well below the payback period.
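
To illustrate the payback problem, a toy projection (the $0.50/GB Flash starting price is roughly the SD-card retail pricing quoted earlier; the 30% annual decline and the $30/128GB drive are this post's round numbers):

# Hypothetical $30, 128GB small-form-factor drive vs. Flash at ~$0.50/GB,
# with Flash $/GB falling ~30% per year. Round numbers only, not market data.
drive_price, drive_gb = 30.0, 128
flash_per_gb = 0.50

for year in range(4):
    flash_price = flash_per_gb * drive_gb
    margin = flash_price / drive_price
    print(f"Year {year}: 128GB of Flash ~${flash_price:.0f} "
          f"({margin:.1f}x the drive's price)")
    flash_per_gb *= 0.7          # ~30% cheaper each year

# Year 0: ~2x headroom; Year 1: marginal; Year 2: Flash matches the drive's
# fixed mechanical cost - a two-year window, well below any payback period.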

There haven't been any discoveries or developments in magnetic disks that will change these economics within 10 years: we won't ever see high-volume production of computer drives smaller than 2.5 inch.



Drive Dimensions:
2.5":  70mm x 102mm x 9mm [thickness varies, 7mm to 12.7mm. 15mm for 'enterprise']
3.5": 102mm x 147mm x 26.1mm [std "1-inch" thickness]

Sources:
[http://www.sffcommittee.com/ie/index.html]
[ftp://ftp.seagate.com/sff/8000_PRJ.HTM]
[ftp://ftp.seagate.com/sff/SFF-8200.PDF] 2.5"
[ftp://ftp.seagate.com/sff/SFF-8300.PDF] 3.5"

2011/09/14

A new inflection point? Definitive Commodity Server Organisation/Design Rules

Summary:

For the delivery of general-purpose and wide-scale Compute/Internet Services there now seems to be a definitive hardware organisation for servers, typified by the eBay "pod" contract.

For decades there have been well documented "Design Rules" for producing Silicon devices using specific technologies/fabrication techniques. This is an attempt to capture some rules for current server farms. [Update 06-Nov-11: "Design Rules" are important: Patterson in a Sept. 1995 Scientific American article notes that the adoption of a quantitative design approach in the 1980's led to an improvement in microprocessor speedup from 35%pa to 55%pa. After a decade, processors were 3 times faster than forecast.]

Commodity Servers have exactly three possible CPU configurations, based on "scale-up" factors:
  • single CPU, with no coupling/coherency between App instances. e.g. pure static web-server.
  • dual CPU, with moderate coupling/coherency. e.g. web-servers with dynamic content from local databases. [LAMP-style].
  • multi-CPU, with high coupling/coherency. e.g. "Enterprise" databases with complex queries.
If you're not running your Applications and Databases in Virtual Machines, why not?
[Update 06-Nov-11: Because Oracle insists some feature sets must run on raw hardware. Sometimes vendors won't support your (preferred) VM solution.]

VM products are close to free and offer incontestable Admin and Management advantages, like 'teleportation' or live-migration of running instances and local storage.

There is a special non-VM case: cloned physical servers. This is how I'd run a mid-sized or large web-farm.
This requires careful design, a substantial toolset, competent Admins and a resilient Network design. Layer 4-7 switches are mandatory in this environment.

There are 3 system components of interest:
  • The base Platform: CPU, RAM, motherboard, interfaces, etc
  • Local high-speed persistent storage. i.e. SSD's in a RAID configuration.
  • Large-scale common storage. Network attached storage with filesystem, not block-level, access.
Note that complex, expensive SAN's and their associated disk-arrays are no longer economic. Any speed advantage is dissolved by locally attached SSD's, leaving only complexity, resilience/recovery issues and price.
Consequently, "Fibre Channel over Ethernet", with its inherent contradictions and problems, is unnecessary.

Designing individual service configurations  can be broken down into steps:
  • select the appropriate CPU config per service component
  • specify the size/performance of local SSD per CPU-type.
  • architect the supporting network(s)
  • specify common network storage elements and rate of storage consumption/growth.
Capacity Planning and Performance Analysis is mandatory in this world.
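
One illustrative way to record those decisions per service component, purely a sketch (the tier names, sizes and growth figures are made-up placeholders, not recommendations):

# Minimal record of the design steps above, per service component.
from dataclasses import dataclass

@dataclass
class ComponentDesign:
    name: str
    cpu_config: str          # "single", "dual" or "multi" (see the list above)
    local_ssd_gb: int        # local high-speed persistent storage
    network_storage_gb: int  # current share of the common network storage
    growth_gb_per_month: int # feeds capacity planning

web_tier = ComponentDesign("static-web", "single", 120, 0, 0)
app_tier = ComponentDesign("lamp-app", "dual", 240, 2_000, 50)
db_tier  = ComponentDesign("enterprise-db", "multi", 960, 20_000, 500)

def months_until_full(c: ComponentDesign, allocated_gb: int) -> float:
    """Crude capacity-planning check: when does this component outgrow
    its allocation on the common network storage?"""
    if c.growth_gb_per_month == 0:
        return float("inf")
    return (allocated_gb - c.network_storage_gb) / c.growth_gb_per_month

for tier, quota_gb in [(web_tier, 500), (app_tier, 4_000), (db_tier, 32_000)]:
    print(f"{tier.name}: ~{months_until_full(tier, quota_gb):.0f} months of headroom")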

As a professional, you're looking to provide "bang-for-buck" for someone else who's writing the cheques. Over-dimensioning is as much a 'sin' as running out of capacity. Nobody ever got fired for spending just enough, hence maximising profits.

Getting it right as often as possible is the central professional engineering problem.
Followed by, limiting the impact of Faults, Failures and Errors - including under-capacity.

The quintessential advantage to professionals of developing standard, reproducible designs is the flexibility to respond to unanticipated loads/demands, the speed with which new equipment can be brought on-line and, conversely, retired and removed.

Security architectures and choice of O/S + Cloud management software is outside the scope of this piece.

There are many multi-processing architectures, each best suited to particular workloads.
They are outside the scope of this piece, but locally attached GPU's are about to become standard options.
Most servers will acquire what were once known as vector processors, and applications using this capacity will start to become common. This trend may need its own Design Rule(s).

Different, though potentially similar design rules apply for small to mid-size Beowulf clusters, depending on their workload and cost constraints.
Large-scale or high-performance compute clusters or storage farms, such as the IBM 120 Petabyte system, need careful design by experienced specialists. With any technology, "pushing the envelope" requires special attention by the best people you have,  to even have a chance of success.

Unsurprisingly, this organisation looks a lot like the current fad, "Cloud Computing", and the last fad, "Services Oriented Architecture".



Google and Amazon dominated their industry segments partly because they figured out the technical side of their business early on. They understood how to design and deploy datacentres suitable for their workload, how to manage Performance and balance Capacity and Cost.

Their "workloads", and hence server designs, are very different:
  • Google serves pure web-pages, with almost no coupling/communication between servers.
  • Amazon's front-end web-servers are backed by complex database systems.
Dell is now selling a range of "Cloud Servers" purportedly based on the systems they supply to large Internet companies.





2010/05/03

A Good Question: When will Computer Design 'stabilise'?

The other night I was talking to my non-Geek friend about computers and he formulated what I thought was A Good Question:
When will they stop changing??
This was in reaction to me talking about my experience in suggesting a Network Appliance, a high-end Enterprise Storage device, as shared storage for a website used by a small research group.
It comes with a 5 year warranty, which leads to the obvious question:
will it be useful, relevant or 'what we usually do' in 5 years?
I think most of the elements in current systems are here to stay, at least for the evolution of Silicon/Magnetic recording. We are staring at 'the final countdown', i.e. hitting physical limits of these technologies, not necessarily their design limits. Engineers can be very clever.

The server market has already fractured into "budget", "value" and "premium" species.
The desktop/laptop market continues to redefine itself - and more 'other' devices arise. The 100M+ iPhones already out there, in particular, demonstrate this.

There's a new major step in server evolution just breaking:
Flash memory for large-volume working and/or persistent storage.
What now may be called internal or local disk.
This implies a major re-organisation of even low-end server installations:
Fast local storage and large slow network storage - shared and reliable.
When the working set of Application data in databases and/or files will fit on (affordable) local flash memory, response times improve dramatically because all that latency is removed. By definition, data outside the working set isn't a rate-limiting step, so its latency only slightly affects system response time. However, the throughput of the large network storage, the other side of the Performance Coin, has to match or beat that of the local storage, or it will become the system bottleneck.
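
A rough illustration of why the working set is what matters (the latencies and hit-rate are assumed round numbers, not measurements):

# Average data-access latency with and without the working set on local flash.
# Assumed round numbers: ~0.1 ms for local flash, ~10 ms for a networked
# disk, and 95% of accesses hitting the working set.
flash_ms, disk_ms, hit_rate = 0.1, 10.0, 0.95

all_on_disk = disk_ms
working_set_on_flash = hit_rate * flash_ms + (1 - hit_rate) * disk_ms

print(f"All data on networked disk : {all_on_disk:.2f} ms per access")
print(f"Working set on local flash : {working_set_on_flash:.2f} ms per access")
# ~10 ms vs ~0.6 ms: the misses outside the working set barely move the
# average, which is the point made above.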

An interesting side question:
How will Near-Zero-Latency local storage impact system 'performance', both response times (a.k.a. latency) and throughput?

I conjecture that both system latency and throughput will improve markedly, possibly super-linearly, because one of the bug-bears of Operating Systems, the context switch, will be removed. Systems have to expend significant effort/overhead in 'saving their place', deciding what to do next, then when the data is finally ready/available, to stop what they were doing and start again where they left off.

The new processing model, especially for multi-core CPU's, will be:
Allocate a thread to a core and let it run until it finishes, waits for (network) input, or needs to read/write to the network.
Near zero-latency storage removes the need for complex scheduling algorithms and associated queuing. It improves both latency and throughput by removing a bottleneck.
It would seem that Operating Systems might benefit from significant redesign to exploit this effect, in much the same way that RAM is now large and cheap enough that system 'swap space' is now either an anachronism or unused.

The evolution of USB flash drives saw prices/GB halving every year. I've recently seen 4GB SDHC cards at the supermarket for ~$15, whereas in 2008 I paid ~$60 for a 4GB USB drive.

Rough server pricing for RAM in 2010 is A$65/GB ±$15.
List prices from Tier 1/2 vendors for a 64GB SSD are $750-$1,000 (around 2-4 times cheaper from 'white box' suppliers).
I've seen these firmware-limited to 50GB to give performance and reliability comparable to current production HDD specs.
That's $12-$20/GB, depending on which base size and prices are used.

Disk drives are ~A$125 for 7200rpm SATA and $275-$450 for 15K SAS drives,
with 2.5" drives priced in-between.
I.e. $0.125/GB for 'big slow' disks and ~$1 per GB for fast SAS disks.

Roll forward 5 years to 2015 and SSD's might've doubled in size three times, plus seen the unit price drop. Hard disks will likely follow the same trend of 2-3 doublings.
Say a 400GB SSD for $300: $0.75/GB.
2.5" drives might be up to 2-4TB in 2015 (from 500GB in 2010) and cost $200: $0.05-0.10/GB.
RAM might be down to $15-$30/GB.

A caveat with disk storage pricing: 10 years ago RAID 5 became necessary for production servers to avoid permanent data loss.
We've now passed another event horizon: Dual-parity, as a minimum, is required on production RAID sets.

On production servers, price of storage has to factor in the multiple overheads of building high-reliability storage (redundant {disks, controllers, connections}, parity and hot-swap disks and even fully mirrored RAID volumes plus software, licenses and their Operations, Admin and Maintenance) from unreliable parts. A problem solved by electronics engineers 50+ years ago with N+1 redundancy.

Multiple Parity is now needed because in the time taken to recreate a failed drive, there's a significant chance of a second drive failure and total data loss. [Something NetApp has been pointing out and addressing for some years.] The reason for this is simple: the time to read/write a whole drive has steadily increased since ~1980. Capacity grows with areal density, the product of recording density (bits per inch) and track density (tracks per inch), while sequential read/write speed grows only with recording density times rotational speed, so capacity has outpaced transfer rate.
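
A back-of-envelope feel for the numbers (drive size, rebuild rate and error rate are assumed, typical-of-the-era values, not any vendor's specs):

import math

# Rebuild exposure for a single-parity set. Assumed: a 2 TB drive,
# ~100 MB/s sustained rebuild rate, and an unrecoverable read error
# (URE) rate of 1 in 1e14 bits.
capacity_bytes = 2e12
rebuild_mb_s   = 100
ure_per_bit    = 1e-14

rebuild_hours = capacity_bytes / (rebuild_mb_s * 1e6) / 3600
# Rebuilding a 4-drive RAID-5 means reading the 3 surviving drives in full:
bits_read = 3 * capacity_bytes * 8
p_clean = math.exp(-ure_per_bit * bits_read)   # (1-p)^n ~ exp(-n*p) for tiny p

print(f"Minimum rebuild time    : {rebuild_hours:.1f} hours")
print(f"P(no URE during rebuild): {p_clean:.2f}")
# ~5.6 hours of flat-out reading, and only ~62% chance of finishing without
# a single unrecoverable sector - which is why dual parity is now required.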

Which makes running triple-mirrors a much easier entry point, or some bright spark has to invent a cheap-and-cheerful N-way data replication system. Like a general use Google File System.

Another issue is that current SSD offerings don't impress me.

They make great local disks or non-volatile buffers in storage arrays, but are not yet, in my opinion, quite ready for 'prime time'.

I'd like to see 2 things changed:
  • RAID-3 organisation with field-replaceable mini-drives. hot-swap preferred.
  • PCI, not SAS or SATA connection. I.e. they appear as directly addressable memory.

This way the hardware can access flash as large, slow memory and the Operating System can fabricate that into a filesystem if it chooses - plus if it has some knowledge of the on-chip flash memory controller, it can work much better with it. It saves multiple sets of interfaces and protocol conversions.

Direct-access flash memory will always be cheaper and faster than SATA or SAS pseudo-drives.

We would then see following hierarchy of memory in servers:

  • Internal to server
    • L1/2/3 cache on-chip
    • RAM
    • Flash persistent storage
    • optional local disk (RAID-dual parity or triple mirrored)
  • External and site-local
    • network connected storage array, optimised for size, reliability, streaming IO rate and price not IO/sec. Hot swap disks and in-place/live expansion with extra controllers or shelves are taken as a given.
    • network connected near-line archival storage (MAID - Massive Array of Idle Disks)
  • External and off-site
    • off-site snapshots, backups and archives.
      Which implies a new type of business similar to Amazon's Storage Cloud.
The local network/LAN is going to be Ethernet (1Gbps or 10Gbps Ethernet, a.k.a. 10GE), or Infiniband if 10GE remains very expensive. Infiniband delivers 3-6Gbps over short distances on copper; external SAS currently uses the "multi-lane" connector to deliver four channels per cable. This is exactly right for use in a single rack.

I can't see a role for Fibre Channel outside storage arrays, and these will go if Infiniband speed and pricing continues to drop. Storage Arrays have used SCSI/SAS drives with internal copper wiring and external Fibre interfaces for a decade or more.
Already the premium network vendors, like CISCO, are selling "Fibre Channel over Ethernet" switches (FCoE using 10GE).

Nary a tape to be seen. (Hooray!)

Servers should tend to be 1RU either full-width or half-width, though there will still be 3-4 styles of servers:
  • budget: mostly 1-chip
  • value: 1 and 2-chip systems
  • lower power value systems: 65W/CPU-chip, not 80-90W.
  • premium SMP: fast CPU's, large RAM and many CPU's (90-130W ea)
If you want removable backups, stick 3+ drives in a RAID enclosure and choose between USB, firewire/IEEE 1394, e-SATA or SAS.

Being normally powered down, you'd expect extended lifetimes for disks and electronics.
But they'll need regular (3-6-12 months) read/check/rewrite cycling or the data will degrade and be permanently lost. Random 'bit-flipping' due to thermal activity, cosmic rays/particles and stray magnetic fields is the price we pay for very high density on magnetic media.
Which is easy to do if they are kept in a remote access device, not unlike "tape robots" of old.
Keeping archival storage "on a shelf" implies manual processes for data checking/refresh, and that is problematic to say the least.
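
A minimal sketch of that read/check/rewrite cycle, assuming the archive is a directory tree with a checksum manifest ("manifest.json", a name I've made up) stored alongside it:

# Periodic archive scrub: recompute checksums and compare against a stored
# manifest, reporting anything that has silently changed. Assumes a manifest
# of {relative_path: sha256} kept with the archive; building it initially
# uses the same directory walk.
import hashlib, json, os, sys

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def scrub(root):
    with open(os.path.join(root, "manifest.json")) as f:
        manifest = json.load(f)
    damaged = []
    for rel, expected in manifest.items():
        if sha256_of(os.path.join(root, rel)) != expected:
            damaged.append(rel)      # candidate for rewrite from a good copy
    return damaged

if __name__ == "__main__":
    bad = scrub(sys.argv[1])
    print(f"{len(bad)} damaged file(s)")
    print("\n".join(bad))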

3-5 2.5" drives will make a nice 'brick' for these removable backup packs.
Hopefully commodity vendors like Vantec will start selling multiple-interface RAID devices in the near future. Using current commodity interfaces should ensure they are readable at least a decade into the future. I'm not a fan of hardware RAID controllers in this application: if the controller breaks, you need to find a replacement, which may be impossible at a future date (it fails the 'single point of failure' test).

Which presents another question if using software RAID and a particular filesystem layout: will it still be available in your O/S of the future?
You're keeping copies of your applications, O/S, licences and hardware to recover/access archived data, aren't you? So this won't be a question... If you don't intend to keep the environment and infrastructure necessary to access archived data, you need to rethink what you're doing.

These enclosures won't be expensive, but shan't be cheap and cheerful:
Just what is your data worth to you?
If it has little value, then why are you spending money on keeping it?
If it is a valuable asset, potentially irreplaceable, then you must be prepared to pay for its upkeep in time, space and dollars. Just as packing old files into archive boxes and shipping them to a safe off-site facility costs money, it isn't over once they are out of your sight.

Electronic storage is mostly cheaper than paper, but it isn't free and comes with its own limits and problems.

Summary:
  • SSD's are best suited and positioned as local or internal 'disks', not in storage arrays.
  • Flash memory is better presented to an Operating System as directly accessible memory.
  • Like disk arrays and RAM, flash memory needs to seamlessly cater for failure of bits and whole devices.
  • Hard disks have evolved to need multiple parity drives to keep the risk of total data loss acceptably low in production environments.
  • Throughput of storage arrays, not latency, will become their defining performance metric.
    New 'figures of merit' will be:
    • Volumetric: GB per cubic-inch
    • Power: Watts per GB
    • Throughput: GB per second per read/write-stream
    • Bandwidth: Total GB per second
    • Connections: Number of simultaneous connections
    • Price: $ per GB available and $ per GB/sec, per server and in total
    • Reliability: probability of 1 byte lost per year per GB
    • Archive and Recovery features: snapshots, backups, archives and Mean-Time-to-Restore
    • Expansion and Scalability: maximum size (GB, controllers, units, I/O rate) and incremental pricing
    • Off-site and removable storage: RAID-5 disk-packs with multiple interfaces are needed.
  • Near Zero-latency storage implies reorganising and simplifying Operating Systems and their scheduling/multi-processing algorithms. Special CPU support may be needed, like for Virtualisation.
  • Separating networks {external access, storage/database, admin, backups} becomes mandatory for performance, reliability, scaling and security.
  • Pushing large-scale persistent storage onto the network requires a commodity network faster than 1Gbps ethernet. This will either be 10Gbps ethernet or multi-lane 3-6Gbps Infiniband.
Which leads to another question:
What might Desktops look like in 5 years?

Other Reading:
For a definitive theoretical treatment of aspects of storage hierarchies, Dr. Neil J Gunther, ex-Xerox PARC, now Performance Dynamics, has been writing about "The Virtualization Spectrum" for some time.

Footnote 1:
Is this idea of multi-speed memory (small/fast and big/slow) new or original?
No: Seymour Cray, the designer of the world's fastest computers for ~2 decades, based his designs on it. It appears to me to be an old idea whose time has come again.

From a 1995 interview with the Smithsonian:
SC: Memory was the dominant consideration. How to use new memory parts as they appeared at that point in time. There were, as there are today large dynamic memory parts and relatively slow and much faster smaller static parts. The compromise between using those types of memory remains the challenge today to equipment designers. There's a factor of four in terms of memory size between the slower part and the faster part. Its not at all obvious which is the better choice until one talks about specific applications. As you design a machine you're generally not able to talk about specific applications because you don't know enough about how the machine will be used to do that.
There is also a great PPT presentation on Seymour Cray by Gordon Bell entitled "A Seymour Cray Perspective", probably written as a tribute after Cray's untimely death in an auto accident.

Footnote 2:
The notion of "all files on the network" and invisible multi-level caches was built in 1990 at Bell Labs in their Unix successor, "Plan 9" (named for one of the worst movies of all time).
Wikipedia has a useful intro/commentary, though the original on-line docs are pretty accessible.

Ken Thompson and co built Plan 9 around 3 elements:
  • A single protocol (9P) of around 14 elements (read, write, seek, close, clone, cd, ...)
  • The Network connects everything.
  • Four types of device: terminals, CPU servers, Storage servers and the Authentication server.
Ken's original storage server had 3 levels of transparent storage (in sizes unheard of at the time):
  • 1GB of RAM (more?)
  • 100GB of disk (in an age when 1GB drives were very large and exotic)
  • 1TB of WORM storage (write-once optical disk; unheard of in a single device)
The usual comment was, "you can go away for the weekend and all your files are still in either memory or disk cache".

They also pioneered permanent point-in-time archives on disk in something appearing to the user as similar to NetApp's 'snapshots' (though they didn't replicate inode tables and super-blocks).

 My observations in this piece can be paraphrased as:
  • re-embrace Cray's multiple-memory model, and
  • embrace commercially the Plan 9 "network storage" model.