SteveJ-on-IT: How they got to 1 Billion IO per second.

Is Fusion-io's demonstration of "1 Billion IO operations per second" the same sort of game-changer that the 1987/8 RAID paper by Patterson, Katz and Gibson was?

Within 5 years all "Single Large Expensive Disks" (SLEDs) were out of production. Will we see Flash disks in Storage Arrays and low-latency SANs out of production by 2017?

A more interesting "real world" demo by Fusion-io in early 2012 was loading MS-SQL in 16 virtual machines running Windows 2008 under VMware. They achieved a 2.8-fold improvement in throughput with a possible (unstated) 5-10 fold access-time improvement.

Updated 16:00, 31-Jan-2012. A little more interpretation of the demo descriptions and detailed PCI bus analysis.

Fusion used a total of 64 cards in 8 servers running against a "custom load generator", or 16 million IO/sec per card.
There are two immediate problems:

How did they get the IO results off the servers? Presumably ethernet and TCP/IP. [No, internal load generator, no external I/O.]
The specs on those cards (2.4TB ioDrive2 Duo) only rate them for 0.93M or 0.86M sequential IO/sec (write, read) with 512-byte I/O, a more than 16-fold shortfall.
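
The size of that gap can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above (1 billion IO/sec, 64 cards, and the spec-sheet ratings); the variable names are illustrative.

```python
# Back-of-envelope check of the per-card rate implied by the demo.
# All figures come from the text above; the names are illustrative only.
TOTAL_IOPS = 1_000_000_000      # claimed aggregate IO/sec
CARDS = 64                      # 8 servers x 8 cards

per_card_iops = TOTAL_IOPS / CARDS

# Spec-sheet sequential ratings at 512 bytes, in IO/sec.
SPEC_WRITE_IOPS = 0.93e6
SPEC_READ_IOPS = 0.86e6

shortfall_write = per_card_iops / SPEC_WRITE_IOPS
shortfall_read = per_card_iops / SPEC_READ_IOPS

print(f"per card: {per_card_iops / 1e6:.2f}M IO/sec")
print(f"vs write spec: {shortfall_write:.1f}x, vs read spec: {shortfall_read:.1f}x")
```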

The IOs used in the demo were small, 64 bytes each, and they bypassed the Linux block driver sub-system by using their direct memory-mapping scheme, "Auto Commit Memory" (ACM), rather than their read-only cache, "ioTurbine".
The card spec sheet also quotes throughput of 3GB/sec (24Gbps) for large read and 2.5GB/sec (20Gbps) for large writes (1MB I/O).

There was no information on the read/write ratio of the load, nor if they were random or sequential.
In a piece on Read Caching also speeding up writes by 3-10 fold, Fusion show they are savvy systems designers: they use a 70:30 (R:W) workload as representative of real-world loads.

That nothing was said about the workload suggests it may have been pure read or pure write, whichever was faster with ACM under Linux. If the cards' ACM performance tracks the quoted specs (via a block mode interface), this would be pure sequential write.

The workload must have been 50:50 to allow full utilisation of the single shared PCI bus on each system, otherwise the bus would've saturated.

Also, as this is as much a demonstration of ACM, integrated with the Virtual Memory system to cause "page faults", the transfers to/from the Fusion cards were probably in whole VM pages. The VM page size isn't stated.
In Linux, pages default to 4KB or 8KB, but are configurable to at least 256MB. Again, these are savvy techs with highly competent kernel developers involved, so an optimum page size for the Fusion card architecture, potentially 1-4MB, was presumably chosen for the demo. [Later, 64KB is used for PCI bus calculations.]

The Fusion write-up does not say how they checked the IOs were correctly written. With 153.6TB total storage and 64GB/sec in the test workload, the tests could've run 2,400 seconds (40min) before filling the cards. Perhaps they read back the contents and compared that to the generated input, though nothing is said. In the best of all worlds, there would've been a real-time read-back check, i.e. a 50:50 R:W workload.
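
The fill-time estimate follows directly from the capacity and throughput figures. A minimal sketch, assuming every operation is a 64-byte write (the real read/write mix was not disclosed) and decimal TB/GB units:

```python
# Rough estimate of how long the demo could run before the cards fill.
# Assumption: every operation is a 64-byte write; decimal units throughout.
CARDS = 64
CARD_CAPACITY_TB = 2.4          # 2.4TB ioDrive2 Duo
OPS_PER_SEC = 1_000_000_000     # claimed aggregate rate
OP_BYTES = 64

total_capacity_bytes = CARDS * CARD_CAPACITY_TB * 1e12
throughput_bytes_per_sec = OPS_PER_SEC * OP_BYTES

fill_seconds = total_capacity_bytes / throughput_bytes_per_sec

print(f"capacity:     {total_capacity_bytes / 1e12:.1f} TB")
print(f"throughput:   {throughput_bytes_per_sec / 1e9:.0f} GB/sec")
print(f"time to fill: {fill_seconds:.0f} s ({fill_seconds / 60:.0f} min)")
```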

The 64GB/sec total I/O throughput gives 8GB/sec, or 64Gbps, per server.
The HP ProLiant DL370 servers used in the tests, according to the detailed specs, only support 9 PCI-e v 2 cards, most slots are 4 lane ('x4'), at 4Gbps per lane, bi-directionally. ~~With 8 slots taken by the Fusion-io cards, only one (x16?) slot was available for the network card needed to supplement the 4 1Gbps on-board ethernet ports.~~
~~80-100Gbps ethernet capacity would normally be needed to support 64Gbps of IP traffic.~~

Reading the "datasheet" carefully, including inferring from the diagram which has no external connections between systems and no external load-gen host, an internal load-generator was used, one per host. There may have been some co-ordination between hosts of the load generators, such as partitioning work units. From the datasheet commentary:

Custom load generator that exercises memory-mapped I/O at a rate of approximately 125 million operations per second on each server. Each operation is a 64-byte packet.
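
Those datasheet figures are consistent with the headline number and with the per-server throughput used here. A small check, using only the values quoted above (decimal units assumed):

```python
# Cross-check the datasheet's per-server rate against the headline claim.
OPS_PER_SERVER = 125_000_000    # memory-mapped operations per second per server
OP_BYTES = 64                   # each operation is a 64-byte packet
SERVERS = 8

server_bytes_per_sec = OPS_PER_SERVER * OP_BYTES
server_gbps = server_bytes_per_sec * 8 / 1e9      # bytes -> bits -> Gbps

total_ops_per_sec = OPS_PER_SERVER * SERVERS
total_gbytes_per_sec = server_bytes_per_sec * SERVERS / 1e9

print(f"per server: {server_bytes_per_sec / 1e9:.0f} GB/sec = {server_gbps:.0f} Gbps")
print(f"all servers: {total_ops_per_sec:,} ops/sec, {total_gbytes_per_sec:.0f} GB/sec")
```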

A little more information is available in a blog entry.

PCI Express v2 is ~4Gbps per lane, or 64Gbps for x16, each direction. ~~Potentially just enough if an on-board ToE was used either with a 8-way 10Gbps card, a dual-port 40Gbps card or a single-port 100Gbps card.~~

~~From this, we can't be sure if the load generator was internal to the Linux servers or external.~~

Even though not used, HP support dual 10Gbps ethernet cards, PCI-e v2 x8, in the DL370 G6, but with a maximum of two per system. This suggests a normal operational limit of the PCI backplane of 20-40Gbps per direction. The aggregate 64Gbps is achievable if split into 32Gbps in each direction.

~~The Fusion cards are "half-length" and are x8 (so will work with x1, x2, and x4 slots as well).~~
From the DL370 specs, the system has 8 available full-length slots:

2 * x16,
1 * x8 and
5 * x4.

~~The other half-length slot is probably x8.~~

The per-server load is 64Gbps, spread amongst 8 cards, or 8Gbps/card, which is only 2 lanes (x2).
The per-card bandwidth would be possible if the single shared PCI bus wasn't saturated.
Per direction, the x4 PCI-e lanes would only support 16Gbps and the x8 32Gbps.
The simple average (5 * 16 + 3 * 32)/8, or 22Gbps, is insufficient for the load.
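
The slot arithmetic above can be reproduced as follows. This is a sketch under the text's own assumptions: roughly 4Gbps per PCI-e v2 lane per direction, and x8 cards, so the two x16 slots are treated as running at x8.

```python
# Per-direction slot bandwidth for the eight full-length DL370 slots.
GBPS_PER_LANE = 4               # PCI-e v2, per direction, approximate
CARD_MAX_LANES = 8              # assumption from the text: the cards are x8
SLOT_LANES = [16, 16, 8, 4, 4, 4, 4, 4]

# A card cannot use more lanes than it has, so cap each slot at x8.
effective_lanes = [min(lanes, CARD_MAX_LANES) for lanes in SLOT_LANES]
slot_gbps = [lanes * GBPS_PER_LANE for lanes in effective_lanes]
average_gbps = sum(slot_gbps) / len(slot_gbps)

# Load each card must carry if 64Gbps per server is spread evenly.
PER_SERVER_GBPS = 64
per_card_gbps = PER_SERVER_GBPS / len(SLOT_LANES)
lanes_needed = per_card_gbps / GBPS_PER_LANE

print(f"slot bandwidth (Gbps, per direction): {slot_gbps}")
print(f"simple average: {average_gbps:.0f} Gbps")
print(f"per-card load: {per_card_gbps:.0f} Gbps, i.e. {lanes_needed:.0f} lanes")
```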

The two x16 slots and x8 slot could support the maximum transfer rate of the Fusion cards, 32Gbps per direction, or an aggregate of 64Gbps. The cards' spec sheet allows 20-24Gbps large (1MB) transfers per card, which with some load-generator tuning, could've resulted in 60Gbps aggregate from just 3 cards.

If the I/O total load, 64Gbps, is split evenly between cards, each card must process an aggregate 8Gbps, with equal read/write loads, or 4Gbps per direction.

If 64KB pages are read/written, then each server will need to process 64K (65,536) pages per second per direction (32Gbps per direction divided by 64KB).

The x4 slots, with 16Gbps available in each direction (aggregate of 32Gbps), will transfer 64KB in 15.26 usec.
The x8 slots will transfer 64KB in half that time, 7.63 usec.

The average 64KB transfer time for the mix of cards (5 * x4, 3 * x8) in the system is:

( (5 * 15.26) + (3 * 7.63)) / 8, or 12.40 usec,

or 80,659 64KB pages per second per direction, leaving 25% headroom for other traffic and bus controller overheads.

The required 32Gbps per direction seems feasible.
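
The page-rate estimate can be re-run with the sketch below. Units matter here: link rates are in bits per second while the page size is in bytes, so the conversion is made explicit. Two assumptions are labelled in the code: binary giga (to match 64K = 65,536), and, following the text, each slot's aggregate (both-direction) bandwidth for the transfer time. Using the per-direction figure instead would double the transfer times.

```python
# Page-rate arithmetic for 64KB pages over the mix of PCI-e slots.
GBIT = 2**30                    # assumption: binary giga, to match 64K = 65,536
PAGE_BITS = 64 * 1024 * 8       # one 64KB page, in bits

def transfer_usec(link_gbps):
    """Microseconds needed to move one 64KB page over a link of link_gbps."""
    return PAGE_BITS / (link_gbps * GBIT) * 1e6

# Required rate: 32Gbps per direction, as in the text.
required_pages = 32 * GBIT / PAGE_BITS

# Assumption (as in the text): aggregate slot bandwidth, 32Gbps for x4, 64Gbps for x8.
x4_usec = transfer_usec(32)
x8_usec = transfer_usec(64)

# Simple average over the 5 x4 and 3 x8 slots.
mean_usec = (5 * x4_usec + 3 * x8_usec) / 8
achievable_pages = 1e6 / mean_usec
headroom = achievable_pages / required_pages - 1

print(f"x4: {x4_usec:.2f} usec, x8: {x8_usec:.2f} usec, mean: {mean_usec:.2f} usec")
print(f"required: {required_pages:,.0f} pages/sec, achievable: {achievable_pages:,.0f}")
print(f"headroom: {headroom:.0%}")
```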

~~This either says the DL370 has multiple PCI-e buses, not mentioned in the spec sheet, or something else happened.~~

1 comment:

Neil Gunther said...: Trying to legitimize a "mine is bigger than yours" press release is a fool's errand.

Using numbers supplied by a servile trade-rag reporter, whose job it is to collect readership eyeballs, doesn't help matters. Sensationalist journalism rules.

* What workload?
* Run how?
* Measured how?
* Measured where?
* What I/O?
* Reads, writes, random, serial?
* What kernel mods?
* Etc., etc?

This is nothing more than a PR stunt and is precisely what's wrong with clandestine benchmarking. I won't even mention the likelihood that Fusion-io is shopping their technology for acquisition.

Nothing less than a bona fide SPEC or TPC benchmark disclosure should be acceptable. Even those don't guarantee anything, but at least we know what the workloads and run rules are, and it would also cause many technical eyeballs to scrutinize the required full disclosure report looking for fails in Fusion-io's claim. A very different audience from trade-rag readers.

The other bad thing is, you know the number you're trying to explain. In the course of trying to explain it (in the absence of disclosure), you end up making all kinds of assumptions, applying averages and bending datasheet specs; steps which may or may not be correct. You really don't know so, in the end, you explain nothing.

The most constructive thing to do would be to publicly criticize Fusion-io for not providing a detailed technical disclosure to support their claim. Use the social graph as an anti-PR weapon. That might even embarrass them into providing such, whereby you could check your back-of-the-envelope estimates.

7:54 am GMT+11

2012/01/31
