Within five years, all "Single Large Expensive Disks" (SLEDs) were out of production. Will we see Flash disks in Storage Arrays and low-latency SANs out of production by 2017?
A more interesting "real world" demo by Fusion-io in early 2012 was loading MS-SQL in 16 virtual machines running Windows 2008 under VMware. They achieved a 2.8-fold improvement in throughput with a possible (unstated) 5-10 fold access-time improvement.
Updated 16:00, 31-Jan-2012. A little more interpretation of the demo descriptions and detailed PCI bus analysis.
Fusion used a total of 64 cards in 8 servers running against a "custom load generator": roughly 16 million IO/sec per card for the 1 billion IO/sec aggregate.
There are two immediate problems:
- How did they get the I/O results off the servers? Presumably Ethernet and TCP/IP. [No: an internal load generator was used, with no external I/O.]
- The specs on those cards (2.4TB ioDrive2 Duo) only rate them for 0.93M (write) or 0.86M (read) sequential IO/sec with 512-byte blocks, a roughly 16-fold shortfall.
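The size of the shortfall follows from simple arithmetic; the 1-billion aggregate is taken from the datasheet's 125M ops/sec per server across 8 servers. A quick check:

```python
# Claimed demo rate vs. the ioDrive2 Duo spec-sheet rate (512-byte sequential).
total_iops = 125e6 * 8         # 125M ops/sec per server, 8 servers
cards = 64
per_card = total_iops / cards  # claimed per-card rate

spec_write = 0.93e6            # spec sheet: 512B sequential writes, IO/sec
spec_read = 0.86e6             # spec sheet: 512B sequential reads, IO/sec

print(per_card / 1e6)          # ~15.6 million IO/sec per card
print(per_card / spec_write)   # ~16.8x the write spec
print(per_card / spec_read)    # ~18.2x the read spec
```

So the claimed per-card rate is 16-18 times the block-interface spec, depending on the read/write mix.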
The card spec sheet also quotes throughput of 3GB/sec (24Gbps) for large read and 2.5GB/sec (20Gbps) for large writes (1MB I/O).
There was no information on the read/write ratio of the load, nor if they were random or sequential.
A piece on Read Caching, which also speeds up writes by 3-10 fold, shows Fusion are savvy systems designers: they use a 70:30 (R:W) workload as representative of real-world loads.
That nothing was said about the workload suggests it may have been pure read or pure write, whichever was faster with ACM under Linux. If the cards' ACM performance tracks the quoted specs (via a block-mode interface), this would be pure sequential write.
The workload must have been 50:50 (R:W) to allow full utilisation of the single shared PCI bus on each system; any other mix would have saturated one direction while leaving capacity idle in the other.
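The 50:50 requirement can be sketched: on a full-duplex bus each direction is capped independently, so the busier direction saturates first. A minimal model, assuming the 32Gbps per-direction cap used elsewhere in this analysis:

```python
# Achievable aggregate throughput on a full-duplex link where each
# direction is capped independently (cap in Gbps per direction).
def max_aggregate(read_fraction, cap_per_direction=32.0):
    # The busier direction is the bottleneck.
    busiest = max(read_fraction, 1.0 - read_fraction)
    return cap_per_direction / busiest

print(max_aggregate(0.5))   # 64.0 - only a 50:50 mix reaches 64Gbps
print(max_aggregate(0.7))   # ~45.7 - a 70:30 mix tops out earlier
print(max_aggregate(1.0))   # 32.0 - pure reads use one direction only
```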
Also, as this is as much a demonstration of ACM, integrated with the Virtual Memory system to service "page faults", the transfers to/from the Fusion cards were probably whole VM pages. The VM page size isn't stated.
In Linux, pages default to 4KB or 8KB depending on architecture, but are configurable to at least 256MB. Again, these are savvy techs with highly competent kernel developers involved, so an optimum page size for the Fusion card architecture, potentially 1-4MB, was presumably chosen for the demo. [Later, 64KB is used for PCI bus calculations.]
The 64GB/sec total I/O throughput gives 8GB/sec, or 64Gbps, per server.
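These figures follow directly from the datasheet's 125M 64-byte operations per second per server:

```python
# Throughput arithmetic from the datasheet figures.
ops_per_server = 125e6   # operations/sec per server, from the datasheet
op_bytes = 64            # bytes per operation
servers = 8

per_server_Bps = ops_per_server * op_bytes
print(per_server_Bps / 1e9)              # 8.0  GB/sec per server
print(per_server_Bps * 8 / 1e9)          # 64.0 Gbps per server
print(per_server_Bps * servers / 1e9)    # 64.0 GB/sec total
```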
The HP ProLiant DL370 servers used in the tests, according to the detailed specs, only support 9 PCI-e v2 cards; most slots are 4-lane ('x4'), at 4Gbps per lane in each direction.
Reading the "datasheet" carefully, and inferring from the diagram (which shows no external connections between systems and no external load-generation host), an internal load generator was used, one per host. There may have been some co-ordination of the load generators between hosts, such as partitioning work units. From the datasheet commentary:
"Custom load generator that exercises memory-mapped I/O at a rate of approximately 125 million operations per second on each server. Each operation is a 64-byte packet."
A little more information is available in a blog entry.
PCI Express v2 is ~4Gbps per lane, or 64Gbps for x16, each direction.
Even though not used in the demo, HP supports dual 10Gbps Ethernet cards (PCI-e v2 x8) in the DL370 G6, but with a maximum of two per system. This suggests a normal operational limit for the PCI backplane of 20-40Gbps per direction. The aggregate 64Gbps is achievable if split into 32Gbps in each direction.
From the DL370 specs, the system has 8 available full-length slots:
- 2 * x16,
- 1 * x8 and
- 5 * x4.
The per-server load is 64Gbps, spread amongst 8 cards, or 8Gbps/card, which is only 2 lanes (x2).
The per-card bandwidth is well within even an x4 slot's capacity, provided the single shared PCI bus isn't saturated.
Per direction, the x4 PCI-e lanes would only support 16Gbps and the x8 32Gbps.
The simple average, (5 * 16 + 3 * 32)/8 = 22Gbps per direction, is insufficient for the 32Gbps per-direction load.
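The 22Gbps average can be checked numerically. Following the text's (5 * 16 + 3 * 32)/8, the two x16 slots are counted at 32Gbps, the same per-direction figure as an x8 slot:

```python
# Average per-direction slot bandwidth, PCIe v2 at ~4Gbps per lane.
# The two x16 slots are counted at 32Gbps (the x8 figure), matching
# the text's (5 * 16 + 3 * 32)/8 average.
lane_gbps = 4
effective_lanes = [8, 8, 8, 4, 4, 4, 4, 4]   # 2 x16 (capped), 1 x8, 5 x4
per_slot = [n * lane_gbps for n in effective_lanes]
avg = sum(per_slot) / len(per_slot)
print(avg)   # 22.0 Gbps per direction
```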
The two x16 slots and x8 slot could support the maximum transfer rate of the Fusion cards, 32Gbps per direction, or an aggregate of 64Gbps. The cards' spec sheet allows 20-24Gbps large (1MB) transfers per card, which with some load-generator tuning, could've resulted in 60Gbps aggregate from just 3 cards.
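The three-card scenario is simple arithmetic on the spec sheet's 20-24Gbps per-card range for large transfers:

```python
# Aggregate throughput if only the three fast slots carry the load,
# at the spec sheet's large-transfer (1MB) rate per card.
low, high = 20, 24           # Gbps per card, from the spec sheet
print(3 * low, 3 * high)     # 60 to 72 Gbps from just three cards
```

So with tuning, three well-placed cards approach or exceed the 64Gbps aggregate on their own.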
If the total I/O load, 64Gbps, is split evenly between cards, each card must process an aggregate 8Gbps, with equal read/write loads, or 4Gbps per direction.
If 64KB pages are read/written, each server will need to transfer 64K (65,536) pages per second per direction, or 8K pages per second per direction per card.
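The 64K pages/sec figure works out exactly if Gbps is read with binary prefixes (2**30 bits per Gbit), an assumption consistent with the transfer times quoted in the following calculations:

```python
# 64K pages/sec per direction corresponds exactly to 32Gbps
# per direction (binary-prefix Gbps) at a 64KB page size.
per_direction_bps = 32 * 2**30   # 32 Gbps, binary prefix
page_bits = 64 * 1024 * 8        # one 64KB page
print(per_direction_bps / page_bits)   # 65536.0 pages/sec
```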
The x4 slots, with 16Gbps available in each direction (aggregate of 32Gbps), will transfer 64KB in 15.28 usec.
The x8 slots will transfer a 64KB page in half that time, 7.63 usec.
The average 64KB transfer time for the mix of cards (5 * x4, 3 * x8) in the system is:
((5 * 15.28) + (3 * 7.63)) / 8 = 12.40 usec, or 80,659 64KB pages per second per direction, leaving 25% headroom for other traffic and bus-controller overheads.
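The transfer-time arithmetic can be reproduced, assuming binary-prefix Gbps (2**30 bits) and each slot's aggregate (both-direction) bandwidth, which matches the quoted figures to rounding:

```python
# 64KB transfer time per slot type, using each slot's aggregate
# bandwidth and binary-prefix Gbps, then the page rate implied
# by the 5-x4 / 3-x8 slot mix.
page_bits = 64 * 1024 * 8

def usec(gbps_aggregate):
    return page_bits / (gbps_aggregate * 2**30) * 1e6

t_x4 = usec(32)   # x4: 16Gbps each direction, 32Gbps aggregate
t_x8 = usec(64)   # x8: 32Gbps each direction, 64Gbps aggregate
print(round(t_x4, 2), round(t_x8, 2))   # 15.26 7.63

avg_usec = (5 * t_x4 + 3 * t_x8) / 8
print(round(avg_usec, 2))               # 12.4 usec average
print(round(1e6 / avg_usec))            # 80660 pages/sec
```

The small difference from the quoted 15.28 usec and 80,659 pages/sec is rounding in the intermediate figures.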
The required 32Gbps per direction seems feasible.