SteveJ-on-IT: RAID++ and Storage Pages: We may be asking the wrong questions

2014/07/10

RAID++ and Storage Pages: We may be asking the wrong questions

The implied contract between Storage Devices, once only HDDs, and systems is a rather weak one.

Storage Devices return blocks of data on a "Best Efforts" basis; failure and error handling is minimal or non-existent.

There's no implicit contract with the many other components that are now needed to move data off the Storage Device and into Memory: HBAs, cables, adaptors, switches etc. The move to Ethernet and larger Networks compounds the problem: networks are far from error-free. This matters when routinely moving around Exabytes and more: errors and failures are guaranteed over any human-scale observation period.
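To put a rough number on "guaranteed", here is a back-of-the-envelope sketch. The error rate is an assumption, not a figure from this post: one unrecoverable error per 10^14 bits read is a rate commonly quoted on consumer HDD datasheets, and real end-to-end rates across HBAs, cables and networks will differ. The function name `expected_errors` is illustrative only.

```python
# Assumed figure, not from the post: one unrecoverable error per 1e14 bits,
# a rate commonly quoted on consumer HDD datasheets.
ASSUMED_BIT_ERROR_RATE = 1e-14

PETABYTE = 1e15  # bytes, decimal units
EXABYTE = 1e18   # bytes, decimal units


def expected_errors(bytes_moved: float,
                    bit_error_rate: float = ASSUMED_BIT_ERROR_RATE) -> float:
    """Expected number of bit errors when moving `bytes_moved` bytes."""
    bits_moved = bytes_moved * 8
    return bits_moved * bit_error_rate


for label, size in (("1 PB", PETABYTE), ("1 EB", EXABYTE)):
    print(f"{label}: about {expected_errors(size):,.0f} expected errors")
```

Under that assumed rate, moving a Petabyte yields on the order of 80 errors and an Exabyte on the order of 80,000, so at this scale the question is how errors are detected and handled, not whether they occur.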

Turning this weak assurance into usable levels of Reliability and Data Durability is currently left to a rather complex set of layers, which can have subtle and undetectable failure modes or, in "Recovery" mode, unusably poor performance and limited or no resilience against additional failures. We need to improve our models to move past current RAID schemes and routinely support thousands of small drives and new Storage Class Memory.

Scaling Storage to Petabyte and Exabyte sized Pools of mixed technologies needs some new thinking.
New mixed technologies now provide us with multiple Price-Size-Performance components, requiring very careful analysis to optimise Systems against owner criteria.

There is no one true balance between DRAM, PCI-Flash, SSDs, fast-HDD, slow-HDD and near-line/off-line HDD, tape or Optical Disk. What there is instead is an owner's willingness to pay. Presumably they prefer to pay enough, but not significantly more, for their desired or required "performance", measured either as "response time" (latency) or as "throughput". Very few clients can afford, or need, to store everything in DRAM with some sort of backup system. It's the highest performance and highest priced solution possible, but is only necessary or desirable for very constrained problems.

DRAM is around $10/GB, Flash and SSD about $1/GB, and HDDs from $0.04 to $0.30/GB for raw disk.
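As a rough illustration of what those prices mean at Pool scale, the sketch below prices a raw Pool in each tier. The per-GB figures are the ones quoted above; the use of decimal units (1 PB = 1,000,000 GB) and the names `PRICE_PER_GB` and `pool_cost` are assumptions made for the example. It ignores redundancy, enclosures, power and everything else that goes into a real system.

```python
# Raw media prices in $/GB as quoted in the post (2014), as (low, high) ranges.
PRICE_PER_GB = {
    "DRAM": (10.00, 10.00),
    "Flash/SSD": (1.00, 1.00),
    "HDD": (0.04, 0.30),
}

GB_PER_PB = 1_000_000  # decimal units, an assumption for this sketch


def pool_cost(petabytes: float, tier: str) -> tuple[float, float]:
    """Return the (low, high) raw media cost in dollars for a pool of one tier."""
    low, high = PRICE_PER_GB[tier]
    gigabytes = petabytes * GB_PER_PB
    return low * gigabytes, high * gigabytes


for tier in PRICE_PER_GB:
    low, high = pool_cost(1, tier)
    if low == high:
        print(f"1 PB of {tier}: ${low:,.0f}")
    else:
        print(f"1 PB of {tier}: ${low:,.0f} to ${high:,.0f}")
```

At these prices a Petabyte of DRAM costs about ten times as much as Flash and somewhere between 30 and 250 times as much as raw HDD, which is why the balance between tiers comes down to what the owner is willing to pay for.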

Here's a possible new contract between Storage Devices and Clients/Systems:

Data is returned Correct, Complete and Verifiable, in whole or part, between the two Endpoints.

To explain and unpack that.

Between Endpoints, or End-End Checks.

Systems use many logical abstractions, like Filesystems and Virtual Memory, that don't just make things easy for Programmers, but also allow systems designers to improve Performance (e.g. caching) and to transparently detect and correct Errors. This is only possible with very strong models, especially when formally analysed, and when "orthogonal" concerns can be separated out.
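A minimal sketch of what an End-End Check could look like, assuming the writing Endpoint attaches a length and a SHA-256 digest to each block and the reading Endpoint verifies both. The names `SealedBlock`, `seal` and `verify` are hypothetical, invented for this illustration; this is not a description of any existing storage protocol.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SealedBlock:
    """A payload plus the metadata the far Endpoint needs to check it."""
    payload: bytes
    length: int   # lets the reader check the block is Complete
    digest: str   # lets the reader check the block is Correct


def seal(payload: bytes) -> SealedBlock:
    """Writing Endpoint: record length and SHA-256 digest with the data."""
    return SealedBlock(payload, len(payload), hashlib.sha256(payload).hexdigest())


def verify(block: SealedBlock) -> bytes:
    """Reading Endpoint: return the payload only if it passes both checks."""
    if len(block.payload) != block.length:
        raise ValueError("incomplete block: length mismatch")
    if hashlib.sha256(block.payload).hexdigest() != block.digest:
        raise ValueError("corrupt block: digest mismatch")
    return block.payload


original = seal(b"storage page contents")
assert verify(original) == b"storage page contents"

# Simulate a single flipped bit somewhere between the two Endpoints.
damaged_payload = bytes([original.payload[0] ^ 0x01]) + original.payload[1:]
damaged = replace(original, payload=damaged_payload)
try:
    verify(damaged)
except ValueError as err:
    print(f"detected in transit: {err}")
```

The point of the design is that no intermediate layer (HBA, cable, switch, cache) needs to be trusted: whatever happens in between, the reading Endpoint either gets data it can verify or gets an explicit failure.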
