2009/08/25

When 'ping' fails

In Networking, the 'performance objects' are links, usually end-to-end, consisting of many elements - like ethernet segments, switches, routers/firewalls, long-distance circuits and security scanner devices, laid on top of Telco/backbone services that provide dynamic and asymmetrical routes.

The most frequently used measure is 'ping' time - ICMP “echo request” packets, often 64 bytes long.
Firewalls, routers and underlying networks filter/block classes of traffic, implement traffic & flow rules and attempt "Quality of Service". There are many common rulesets in use:
  • blocking 'ping' outright for 'security reasons'. Stops trivial network scanning.
  • QoS settings for different traffic classes. VoIP/SIP gets high priority, FTP traffic is low, your guess on HTTP, DNS, time (NTP) and command line access - SSH, Telnet, ...
  • traffic profiling on IP type (ICMP, UDP, TCP) and packet size (vs link MTU).
  • traffic prioritisation based on source/destination.
Real-time voice needs many small packets, wants low latency/delay and no jitter, and can stand some packet loss or data errors. FTP relies on TCP to detect lost packets & retransmit them. It likes big packets and attempts to increase the bandwidth it consumes through TCP 'slow-start' etc.

The only time 'ping' is accurate is within a simple ethernet segment - no rules, no firewalls, no 'traffic engineering', no link losses, no collisions, ...
Otherwise, it's a dangerous and potentially very misleading measure.

'Time of Flight' for UDP & ICMP packets is only measurable when you control both ends of the link and can match headers. Not something most people can or want to do.

TCP packets can be accurately timed - sessions are well identified, packets can be uniquely identified and they are individually acknowledged. It is possible to accurately and definitively monitor & measure round-trip times and effective throughput (raw & corrected) of links and connections at any point where TCP packets are inspected - host, router, firewall, MPLS end-point, ...
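
As a rough illustration of timing TCP rather than ICMP, here is a minimal Python sketch that times the TCP three-way handshake from the client side. It is an active probe from a host, not the passive in-path inspection described above, and the function name `tcp_connect_rtt` is made up for this example. The result includes local stack overhead, so treat it as an approximation of one round trip.

```python
import socket
import time

def tcp_connect_rtt(host, port, timeout=5.0):
    """Return the seconds taken to complete a TCP connect() to host:port.

    connect() returns once the SYN/SYN-ACK exchange has completed, so the
    elapsed time approximates one round trip plus local overhead.
    (Hypothetical helper, for illustration only.)
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start
```

Because this uses a real TCP connection to a real service port, it is subject to whatever rules apply to that traffic class, rather than the rules applied to ICMP.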
I'm not aware of this being widely used in practice, but that's a lack of knowledge on my part.

This is not a tutorial in Networking or TCP/IP.
Neither am I a Networking expert. I'm demonstrating that even with my knowledge, "tools ain't tools" (meaning not all tools and methods are equal),
and that using just 2 metrics, 'bandwidth' & 'latency' to characterise links is simplistic and fraught. As professionals, we have to be able to "dig under the covers" to diagnose & fix subtle faults.

Consider the case of TCP/IP over a noisy satellite phone link, the type you buy from Intelsat for ships or remote areas. The service notionally delivers a data channel of 64kbps, but is optimised for a digital voice circuit, not IP packets. The end-end link has per-bit low latency on/off the link, long transmission delays, nil jitter and limited error-correction (forward-error-correction (FEC), no retransmit) - nice for telephony. These links also have buckets of errors - which voice, especially simple PCM, easily tolerates & can even be smoothed or interpolated out with simple equipment - which will be there to handle echo cancellation.

Say you're on a ship at the end of one of these links.
People are complaining that email 'takes forever' and they can't download web pages.
You run a ping test out to various points - and everything is Just Fine.

What next?

The most usual response is 'Blame the Victim' and declare the link to be fine and 'the problem' to be too many people sending large messages, 'clogging up the link', and too much web surfing. You might set quotas and delete large emails. That might work, or at least improve things marginally.

Radio links, especially back-to-back ones, each crossing 36,000km to geostationary satellites, are noisy.
If the BER is 1:100000 (1:10^5, under the hoped for 1:10^6) and you're using the default ethernet MTU of 1500 bytes, you'll get an error every 1-2 seconds. No worries, eh?
1500 bytes = 12,000 bits, or 5.3 packets/second at 64kbps. That's an error in roughly 1 in 8 packets. Hardly noticeable.
TCP/IP has a minimum overhead of 40bytes/packet (less with Van Jacobson compression).
The data payload per packet is 1460 bytes for the ethernet default MTU.
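
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are invented for the example, and the link is assumed to run at exactly 64kbps with bit errors spread evenly.

```python
LINK_BPS = 64_000       # notional channel rate from the text
BER = 1e-5              # one bit error per 10^5 bits
MTU_BYTES = 1500        # default ethernet MTU
OVERHEAD_BYTES = 40     # minimum TCP/IP header overhead

def packets_per_second(link_bps, mtu_bytes):
    """Full-size packets the link can carry each second."""
    return link_bps / (mtu_bytes * 8)

def seconds_between_errors(link_bps, ber):
    """Average time between bit errors on a fully loaded link."""
    return 1 / (link_bps * ber)

def expected_errors_per_packet(ber, mtu_bytes):
    """Average number of bit errors landing in one packet."""
    return ber * mtu_bytes * 8

if __name__ == "__main__":
    print(packets_per_second(LINK_BPS, MTU_BYTES))       # packets/second
    print(seconds_between_errors(LINK_BPS, BER))         # seconds per error
    print(expected_errors_per_packet(BER, MTU_BYTES))    # errors per packet
    print(MTU_BYTES - OVERHEAD_BYTES)                    # payload bytes
```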

Sending a 1.25MB file (10Mbit), that's ~856 raw sent packets and 103 errors, or ~12% retransmissions. Of those 103 resends, 12 get errors as well and are resent, and 1.5 errors of those go onto a 3rd round...
Or ~115 errored packets, or 14% errors on raw packets. Just a minor problem.
There's a probability of >1 error per packet, but I don't have the maths to solve for that.
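
For what it's worth, the maths is not hard if you assume bit errors are independent (a simplification: real radio links tend to produce errors in bursts). A packet of n bits gets through clean with probability (1 - BER)^n. A minimal sketch, with invented function names:

```python
def packet_error_probability(ber, packet_bytes):
    """P(at least one bit error in the packet), independent bit errors assumed."""
    bits = packet_bytes * 8
    return 1.0 - (1.0 - ber) ** bits

def multi_error_probability(ber, packet_bytes):
    """P(two or more bit errors in the packet), from the binomial distribution."""
    bits = packet_bytes * 8
    p_none = (1.0 - ber) ** bits
    p_exactly_one = bits * ber * (1.0 - ber) ** (bits - 1)
    return 1.0 - p_none - p_exactly_one

if __name__ == "__main__":
    for size in (64, 1500, 9000):
        print(size,
              round(packet_error_probability(1e-5, size), 4),
              round(multi_error_probability(1e-5, size), 4))
```

Under this assumption the simple estimate of BER x bits slightly overstates the per-packet error rate for 1500-byte packets, and overstates it more as packets get larger, because several errors can land in the same packet.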

The effective bandwidth (throughput) of the link, using 970 * 1500-byte packets to move 1.25MB in 182 seconds, is 55kbps. Quite acceptable.
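
The estimate above can be reproduced with a small model. This sketch assumes every errored packet is simply resent once per round, so each packet needs 1/(1 - p) transmissions on average. It ignores ACK traffic, timeouts and TCP backing off after a loss, all of which make a real link worse. The function name and parameters are made up for the example.

```python
import math

def transfer_estimate(file_bytes, mtu_bytes, p_error, link_bps, overhead_bytes=40):
    """Estimate packets sent, elapsed seconds and throughput for one file.

    p_error is the chance that any one packet is corrupted and must be resent.
    Returns (raw_packets, total_packets, seconds, throughput_bps).
    """
    payload = mtu_bytes - overhead_bytes
    raw_packets = math.ceil(file_bytes / payload)
    # Sum of the geometric series 1 + p + p^2 + ... = 1 / (1 - p)
    total_packets = raw_packets / (1.0 - p_error)
    seconds = total_packets * mtu_bytes * 8 / link_bps
    throughput_bps = file_bytes * 8 / seconds
    return raw_packets, total_packets, seconds, throughput_bps

if __name__ == "__main__":
    # 1.25MB file, 1500-byte MTU, 12% packet error rate, 64kbps link
    print(transfer_estimate(1_250_000, 1500, 0.12, 64_000))
```

With the 12% error rate used in the text, this comes out close to the 182 seconds and 55kbps quoted above. Raising `p_error` shows how quickly the total climbs.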

What if there's a corroded connector or tired tracking gimbal and you get a 3dB change in SNR and the BER doubles? (That's a guess, not science.)
856 raw packets and 250 1st round errors, 62 2nd, 15 3rd, 4 4th, 1 5th = 333 resends. An almost 40% increase in the total number of packets needed to send the file. 223 seconds and 45kbps.
Doubling the BER again (4:10^5 or 0.004%) increases the 1st round error rate to 400 packets, or 47% - 750 retransmits in 10 rounds. 301 seconds and 33.1kbps. Half-speed.

Back to the original BER, if you were running 'jumbo frames' (9000 bytes) locally & these went down the link as is, you get about 0.9 packets/sec and have a 72% chance of an error in a packet. One in four of the packets would get through unscathed. 140 'jumbo frames' are sent raw, 350 packets are needed with 16 rounds of retransmission.
The file takes 400 seconds at 25kbps - a hefty penalty for forgetting to configure a switch.

The problem is that packet size amplifies error rates.
A change in BER from 0.001% to 0.004%, undetectable by the human ear, halves the throughput of TCP/IP.
Using an MTU size of 168 (128 data + 40 TCP/IP overhead) gives good performance at a BER of 1:10^4, trading 25% protocol overhead for link robustness.
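
The trade-off between header overhead and error exposure can be sketched as a simple goodput model: the fraction of each packet that is payload, times the chance the packet arrives clean. It assumes independent bit errors and no TCP back-off, and the function names are invented for the example.

```python
def goodput_bps(link_bps, ber, mtu_bytes, overhead_bytes=40):
    """Useful payload bits per second delivered for a given MTU."""
    p_clean = (1.0 - ber) ** (mtu_bytes * 8)        # packet arrives error-free
    payload_fraction = (mtu_bytes - overhead_bytes) / mtu_bytes
    return link_bps * payload_fraction * p_clean

def best_mtu(link_bps, ber, candidates, overhead_bytes=40):
    """Pick the candidate MTU with the highest modelled goodput."""
    return max(candidates,
               key=lambda mtu: goodput_bps(link_bps, ber, mtu, overhead_bytes))

if __name__ == "__main__":
    sizes = [168, 576, 1500, 9000]
    for ber in (1e-6, 1e-5, 1e-4):
        row = [round(goodput_bps(64_000, ber, mtu)) for mtu in sizes]
        print(ber, row, "best:", best_mtu(64_000, ber, sizes))
```

In this model, small packets win on a noisy link and large packets win on a clean one, which is the point of the example: the right MTU depends on the BER.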

'ping', using default 512 bit packets, won't detect the error.
But who'd think the MTU was a problem when standard diagnostics were reporting 'all clear'?

Summary: In difficult conditions, the BER doesn't have to worsen much for link throughput to significantly degrade.

This example is about simple-minded tools and drawing incorrect conclusions.
The Latency of the link was constant, but the Effective Bandwidth, throughput, changed because of noise or link errors.

Surely that proves the Myth: Latency & Bandwidth are unrelated.

Nope, it proves that link speed and throughput bear a complex relationship to each other.

If you had been measuring TCP/IP statistics (throughput & round-trip-time) at the outbound router, or using 'ping' with 1500-byte packets, you'd have seen the average latency rising as throughput dropped. All those link-layer errors & subsequent retransmits were causing packets to take longer.

But a simple low-bandwidth radio link isn't the whole story.
It's a "thin, long pipe" in some parlances.
What was special about that link was:
  • no link contention, data rate was guaranteed transfer rate.
  • synchronous data
