2014/05/22

RAID++: What invention or change is needed now to take us the next 25 years?

Digital data storage is an industry that's seen disruptions roughly every 25 years: we're either overdue for the next one, or The Next Revolution has already arrived and nobody's noticed.

What will the Next Big Thing in Storage look like? What, if anything, will succeed RAID, if it hasn't already? Are there lessons we can learn by examining the last big disruption in Storage, circa 1990: RAID arrays?

The one certain lesson from 1956 and 1990 is that nobody can guess, not even remotely, what Data Storage will look like in 25 years' time. Any prediction of the 2040 market will be wildly inaccurate, but we can talk about current market forces and technologies and where the trends point for the next 5 and 10 years.

In 1956, IBM released the first magnetic disk (not 'drum') drive, the RAMAC 350, providing 5MB on a stack of fifty 24-inch disks. 1,000 of these units were sold, at $50,000 each. In today's money, that's around $80 million per Gigabyte, compared to the current $0.04 to $0.12 per GB: a roughly billion-fold reduction in price over nearly six decades, or about 5 doublings every decade. The price per bit averaged a halving every two years: a most impressive technological achievement.
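
A quick back-of-the-envelope check of those figures, as a minimal sketch in Python; the only assumption beyond the numbers quoted above is taking $0.08/GB as the 2014 midpoint:

```python
# Back-of-the-envelope check of the RAMAC-to-2014 price decline.
# Inputs are the figures quoted above; $0.08/GB is an assumed 2014 midpoint.
import math

price_1956 = 80e6        # $/GB, inflation-adjusted RAMAC price
price_2014 = 0.08        # $/GB, midpoint of the $0.04-$0.12 range
years = 2014 - 1956      # 58 years

reduction = price_1956 / price_2014          # ~1e9, i.e. a billion-fold
halvings = math.log2(reduction)              # ~30 halvings of price per GB
per_decade = halvings / (years / 10)         # ~5 doublings per decade
years_per_halving = years / halvings         # ~2 years per halving

print(f"{reduction:.1e}x cheaper: {halvings:.0f} halvings, "
      f"{per_decade:.1f} per decade, one every {years_per_halving:.1f} years")
```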

Compared to the $100,000-$500,000 per unit paid for Enterprise Storage from 1950 to 2000, today's sub-$100 drives, at cents per GB, render raw storage effectively Free and Infinite.

What's changed in the last 10 years is that the rate at which drive capacity increases, and hence price per GB drops, has slowed from 100%/year at its peak to 7%-12% now. The purchasing assumptions of yesteryear no longer work: "Wait 3 years and you can store everything you have now, for less" no longer applies. In 5 years, you might be lucky to pay less per GB. The 2011 floods in Thailand showed that supply is not a given and that future prices can't be taken for granted.

There has also been an explosion in the number of identifiable market segments (from one, for all computers, in 1956), each segment having different product demands and price-points. The embedded-device 1.8" drive market is unrelated to mobile computing, which is unrelated to Enterprise Storage, the modern name for 1980's "mainframes" and high-end Unix computers.

The collapse since 2007 in the rate of capacity increase is compounded by the increasing speed of Internet connections, with continued high, exponential growth in both data downloaded and data stored. At exactly the time customers and the Industry need the yearly 100% expansion in capacity, it isn't available.

The most startling insight for many is that Data Storage devices, whether domestic, commercial, Enterprise or Cloud, are not capital items. Storage isn't a Balance Sheet Asset; it's a consumable, albeit with a 5-year life. Drive vendors understand this, but customers seldom act this way.

Technology cannot be separated from Competitive forces and Commercial drivers: Computing and Information Technology is primarily a business, not an Engineering or Research Discipline.

Successful I.T. technologies are driven by money, not the engineering. It's also how they fail.

What customers will buy and how much they are prepared to pay for it at any one time completely determines the market price and volumes.

It's a demand-side business: supply doesn't drive purchases, consumer demand does.
Purchases are driven by customer need, price-point and compatibility with existing systems and staff capability. Data Storage is only a minor component of whole Computing/I.T. costs for Enterprises of all sizes. Building "a better mouse-trap" in this market won't yield one sale unless you can convince customers it meets a real need of theirs at an attractive price.

Storage devices can't be compared solely on purchase price; they can only be compared through Total Cost of Ownership (TCO) over the complete purchase, install, operations, migration, replacement and disposal process, with backups, archives and DR Plan protection/recovery included.

Given the turnover in devices and increasing demand for Data Storage, every site now has to cope with constant upgrades. The major factors in purchases should now be: How much can I grow this unit, and is that enough for the next 3-5 years? How much will I have paid in 5 years' time, when I come to replace this unit?

The correct accounting basis for comparing Data Storage is yearly TCO: per GB, per IO/sec, or per whatever other metric the customer values.

While it may cost $100-$500 to buy a single drive, using that drive will cost 5-10 times as much over its lifetime. Adding the resources to provide reliable access to that data may raise the costs by another factor of 10. Being a Digital Packrat and never cleaning up and properly housekeeping Data is becoming prohibitively expensive.
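
As a minimal illustration of that yearly-TCO-per-GB accounting, here's a sketch in Python; the drive price, capacity, 5-year life and the midpoints of the multipliers above are assumptions for the example, not measured figures:

```python
# Illustrative yearly TCO per GB for a single drive.
# Drive price, capacity and service life are assumed; the multipliers
# are the 5-10x lifetime and ~10x reliable-access factors discussed above.
drive_price = 300            # $ purchase price (assumed mid-range drive)
capacity_gb = 4000           # 4TB
service_years = 5

lifetime_factor = 7.5        # 5-10x purchase price over the drive's lifetime
reliability_factor = 10      # redundancy, replicas, backups, DR protection

total_cost = drive_price * lifetime_factor * reliability_factor
yearly_tco_per_gb = total_cost / capacity_gb / service_years

print(f"raw purchase: ${drive_price / capacity_gb:.3f}/GB")
print(f"5-year TCO:   ${total_cost:,.0f}  (~${yearly_tco_per_gb:.2f}/GB/year)")
```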

We know from PCIe-Flash cards (Fusion-IO) and "Pure Flash" Enterprise Devices that the high-end market is prepared to compare on TCO and whole system performance, not price per GB.
Raw "Flash", or NAND EEPROM, has around 10-times the raw price per GB of current Hard Disk Drives, HDD's, yet they are winning substantial sales.

This suggests two things:
  • The fully-loaded systems cost per GB and cost per IO/sec of HDD storage is higher than Flash. HDD's, when built into large RAID Arrays, aren't cheap: they're between $1-$10/GB, not the raw cost of $0.10/GB (a rough comparison follows this list).
  • Cost per GB is not a useful prime comparison metric anymore for most users.
    Enterprise Data Storage isn't capacity constrained now.
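
The rough comparison promised above, as a minimal sketch; the $/GB figures are those quoted in the text, while the per-device IOPS and device size are assumed, illustrative values:

```python
# Loaded vs raw cost, per GB and per random IO/sec.
# $/GB figures are from the text; IOPS and device size are assumptions.
hdd_raw_gb = 0.10        # $/GB, bare drive
hdd_loaded_gb = 5.00     # $/GB, midpoint of the $1-$10/GB RAID-array range
flash_raw_gb = 1.00      # $/GB, ~10x the raw HDD price

hdd_iops = 200           # assumed random IOPS for one spinning drive
flash_iops = 50_000      # assumed random IOPS for one flash device
device_gb = 1_000        # assumed usable capacity per device

print(f"HDD in a RAID array: ${hdd_loaded_gb:.2f}/GB, "
      f"${hdd_loaded_gb * device_gb / hdd_iops:.2f} per IO/sec")
print(f"Flash, raw:          ${flash_raw_gb:.2f}/GB, "
      f"${flash_raw_gb * device_gb / flash_iops:.4f} per IO/sec")
```
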
The RAMAC's performance in 1956 was slow by today's standards: 600 msec access times, 10 kbit/sec transfer and 1,200 RPM, versus current 2-4 msec seek times, 1-1.5 Gbps transfer rates and 5,400-15,000 RPM.

Disk performance has improved by 10-100 times for access and by around 100,000 times for transfer rate, radically changing the performance balance of drives. This shows up in the time it takes to "scan", or read/write, an entire drive. A 4TB drive takes ~10 hours to read/write in 'streaming' mode, but 75-150 hours via random 64KB read/writes. This longer time is the rebuild time for a failed drive in a RAID array, without allowing for any data updates (RAID-5 & 6 carry a 3-6 fold penalty to write new data), any other Application access, or any congestion of the internal hardware and links of the RAID array. Real-world experiences of RAID arrays taking over a week to rebuild a failed drive, and losing half their performance for the duration, are not uncommon. This is a problem because RAID-5 is very fragile: if there is even one unrecoverable read error during the rebuild, you lose all the data in that Volume. It doesn't even take a second failed drive to lose the lot.
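
The scan-time arithmetic behind those numbers, as a minimal sketch; the ~110 MB/s streaming rate and the 120-230 random-IOPS range are rough assumed figures for a 2014-era 4TB drive:

```python
# Time to read or write an entire 4TB drive: streaming vs random 64KB I/O.
# Streaming rate and the random-IOPS range are assumed, 2014-era figures.
capacity = 4e12                  # bytes
stream_rate = 110e6              # bytes/sec sustained streaming (assumed)
io_size = 64 * 1024              # 64KB random operations

to_hours = lambda s: s / 3600

print(f"streaming scan: {to_hours(capacity / stream_rate):.0f} hours")   # ~10h

for iops in (230, 120):          # assumed best/worst random-IOPS cases
    seconds = (capacity / io_size) / iops
    print(f"random 64KB at {iops} IOPS: {to_hours(seconds):.0f} hours")  # ~74-141h
```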

Disk Drives, or more properly, Digital Data Storage, have many characteristics of interest to customers, besides the basic performance specs: {capacity, sustained transfer rate, latency or seek time and its inverse, random IO per sec}.

Reviewing the literature, historic technical specifications and the industry press suggests that the Industry, both Vendors and Consumers, is still focussed solely on Price and Capacity, either per-unit or as Price per GB, as the prime comparative metric for Data Storage.

There are many other operational dimensions besides Performance and Price. By around 1995, all large businesses had become completely dependent on Computers/I.T. for back-office systems; by 2005, with the Internet Revolution, that dependence had extended to Front-Office Operations.

Reliable Data Storage isn't optional anymore in a 24/7, fully on-line world. Your GB have to work, have to be complete & correct and have to be available, or your business will suffer badly, even fail.

With the advent of Mobile Devices, Cloud Computing and affordable, ubiquitous Virtual Machine environments, this trend has pushed even further. For many markets, like portable MP3 players, Flash at $1/GB has become the only economic solution, forcing consumers to look for alternate means, like Cloud Storage, to meet their Reliability and Data Protection/Archive needs. Here are some of the Dimensions now considered when purchasing Digital Data Storage:
  • Unit Price and Systems Price (includes racking, controllers, cables & connectors, switches and ancillary/support equipment like power-supplies, fans etc)
  • Power and Cooling needs,
  • Machine Room Footprint or Device Volume.
    • Large, heavy devices needing 200-tonne cranes to lift into place are no longer saleable.
  • Duty Cycle: The Load Factor per hour, day or week the drive is expected to handle.
  • Drive Reliability: Mean Time Between Failures (MTBF)
  • Media Durability: Expected Media/Device Lifetime in Hours or Operations.
  • Data Protection: Mean Time to Data Loss
  • Data Durability: Mean Time to 5% Media Degradation
  • Data Reliability/Certainty: Undetected Bit/Block Error Rate (UBER)
  • Many more for specific Applications and Environments:
    • Weight and Floor-loading. Over-weight devices can't be used in commercial buildings.
    • Disaster Recovery, Business Continuity and Contingency Planning constraints and interactions.
      • If a site is destroyed, how quickly can the business be trading again and be fully operational?
      • What is the additional cost for Continuous Data Protection and off-site replicas?
    • Media replacement costs, Off-load, Upgrade and Migration options:
      • Can the device be upgraded in place? Is it sealed? Must it be replaced 'with a forklift' rather than upgraded in place?
      • Floppy disks wore out in use and writable Optical Disks faded, creating a yearly consumables budget. Do new technologies, like USB Flash drives, work the same?
      • If it takes 3 months to load and off-load all data on a device, it's unusable for most clients.
      • If you're an Airline or even a large on-line business with a 24/7 zero-break operation, how do you migrate to a new datacentre or install new Storage?
    • Special Operational Requirements, such as Cryogenic Cooling, refrigeration/coolant or high temperatures.
    • Multiple suppliers, to avoid "lock-in" and monopolistic price-gouging. [The IBM 'SLED' customer experience.]
    • Support and Supply: Can the vendor supply product to meet all demand with low & predictable lead times, support the products customers purchase, and are enough trained operators and admins available to early or late adopters?
    • Extended Vendor support and replacement: some applications demand no uncertified hardware changes in a system service life of 25+ years.
    • Quality as consistency of performance & durability for demanding applications.
    • Robustness against change in hostile environments, like Space or Deep Sea.
    • Endurance in extreme environments: very hot, very cold, very wet or dry or high G-force or extreme vibration. [Mobile devices are 'challenging' for normal HDD's.]
    • Media Longevity for Archival uses (100+ years)
    • Security and Safety: can data be read or copied from discarded media/devices, or can the device catastrophically fail and injure people? Exploding drives are a bad look.
    • Usability, Operability, Manageability: Is the hardware cheap in use? Is it compatible with existing systems, applications and other hardware?
    • Scalability and Inter-device interference: How many devices can I put in a single enclosure, while still being able to do in-place, live upgrades?
It's beyond the scope of this piece to fully examine the causes of the last Big Storage Disruption in 1990, the rise of (Enterprise) RAID Arrays.

IBM, for almost 3 decades from 1964 with its S/360 architecture, had 60% by Revenue of the Computing market. In the 1970's, the "mini" computer arrived, creating a whole new low-cost market and putting pressure on IBM's traditional business. IBM introduced the PC XT in 1983, leading to real pressure on both the mainframe and mini-computer businesses. The 1980's also saw the arrival of high-end "minis" and RISC, with the rise of the DEC VAX and SUN's SPARC, used extensively for databases and business applications.

The S/360 line, a revolutionary idea at the time, was years late and famously over-budget. It was truly a "bet the company" decision by the IBM Board, just as the 747, arriving in 1969, was later for Boeing. Both companies, against the odds, succeeded, and succeeded beyond the wildest dreams of managers, shareholders and the salesmen on quotas and bonuses. Sometimes big business bets pay off, but that's the exception, not the rule.

By the time the RISC research group at Berkeley published their RAID paper in 1987, IBM had ~80% of an $11.5B/yr Mainframe Storage market. The 3380 line cost $25,000-$50,000/GB in 1988. Its successor, the 3390, was released in 1989 at around $25,000/GB, roughly twice what EMC ended up charging for its Symmetrix 4200 line, built from 1GB, 5.25" Seagate ST4120 SCSI drives costing $3,000-$4,000/GB.

EMC's first 24GB RAID-0 (no protection) systems sold for $200,000-$300,000, no more expensive than IBM's 3380/90 strings, yet considerably faster, smaller (by a factor of five) and consuming a small fraction of the power, providing significant savings in machine room floorspace and avoiding cooling upgrades being forced by more power-hungry CPUs. EMC fuelled its growth through high Gross Margins (60+%), strong product development (becoming a product leader and consistently first-to-market), diversifying into other markets, forming supplier and reseller alliances and making very good strategic acquisitions. Mainly, they succeeded because they listened very closely to customer concerns and demands and responded quickly to those concerns.

The Berkeley paper had dubbed existing mainframe drives "SLEDs", Single Large Expensive Drives, and coined the acronym RAID, for Redundant Arrays of Inexpensive Disks, later changed by the industry to Independent, as no high-end products were 'inexpensive', nor would high-end customers buy 'cheap': they wanted Fast, Good and Reliable.

The drivers of the 1990 Storage Disruption were two-fold:
  • a halving in price per GB, with a massive lift in performance and operational factors, and
  • Radically increased Reliability and hence better Data Protection. 
    • Seagate drives had 40,000hr MTBF and IBM 3380/90 more like 52,000hrs (vs the 800,000-1.2M hours of current drives). [Katz et al., 1989, "Introduction to redundant arrays of inexpensive disks"]
    • But with RAID-1 or RAID-5 and hot-spares, EMC could effectively deliver a near-infinite MTBF (a rough calculation follows this list).
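
A rough illustration of that claim, using the standard simplified Mean-Time-To-Data-Loss approximation for a RAID-5 group (a textbook model, not EMC's own); the group size and rebuild time are assumptions, and the formula ignores the unrecoverable-read-error fragility noted earlier:

```python
# Simplified RAID-5 MTTDL: MTBF^2 / (N * (N-1) * MTTR), assuming independent
# drive failures and ignoring read errors during rebuild.
drive_mtbf_hours = 40_000     # 1989-era Seagate figure quoted above
group_size = 8                # assumed drives per RAID-5 group
rebuild_hours = 24            # assumed rebuild time with a hot-spare

mttdl_hours = drive_mtbf_hours**2 / (group_size * (group_size - 1) * rebuild_hours)

hours_per_year = 8766
print(f"single drive: ~{drive_mtbf_hours / hours_per_year:.1f} years between failures")
print(f"RAID-5 group: ~{mttdl_hours / hours_per_year:,.0f} years to data loss")
```
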
By 1994, EMC had overtaken IBM as the dominant vendor (~35% each), in a market that had shrunk significantly in revenue terms and was to continue shrinking to half its original size, leaving IBM closer to 10% and EMC nearer 50% market share. IBM did attempt to counter the RAID phenomenon by partnering with Storage Technology (StorTek, STK), rebadging their "Iceberg", then later signing an exclusive redistribution deal with STK, only to be challenged by the US Justice Dept. as anticompetitive and monopolistic.

From 1991 to 1993, IBM declared losses of $16B, with $8B in one year, the largest corporate loss in US history up to that time. This was IBM's first loss since the 1911 merger that created the company. From the 1964 introduction of the S/360, IBM had made record profits every year. The sudden and unanticipated collapse of IBM's business model took them and the business world by surprise. The sudden collapse of their mainframe Data Storage business didn't help their overall financial results, but when combined with a large-scale market rejection of overly high charges across all lines of IBM's business, it led to a massive, co-ordinated business disaster for them. For over a decade, IBM had been unwilling to cannibalise its own lines of business or to meet the market and reduce its Gross Margins. IBM managers had, they thought, a captive market and could "dial in" the year's profits beforehand. As customers fled in increasing numbers, the prices charged to the smaller remaining group had to increase faster and faster, until finally customers found "substitutes" faster than IBM could raise prices.

The collapse of the mainframe SLED market cannot be viewed in isolation from the financial implosion of IBM. IBM's recovery wasn't certain; a lesser management team would've failed.

EMC did everything right, following exactly the path IBM took from 1950 to 1964: listening to the customer, innovating, being first-to-market, selling by "Business Benefit" not "technical numbers" and, most importantly, maintaining high Gross Margins while still fitting in with customer price-points. It then reaped the benefits of an early lead for the last two decades.

EMC has continued to diversify, buying companies like VMware, the innovator and market leader in the Intel x86 Virtual Machine segment. They are far from a spent force, but are they following IBM towards a sudden collapse in their market? That kind of collapse never shows up in the preceding years' Annual Reports.
