Tuesday, 12 June 2018

Negative latency


Negative latency is impossible. There is no such thing.

OK. So, why are you meandering about it then?

The impossible made other people millions of dollars over the years when my team and I gave them negative latency.

But you said it was impossible?

The fun stuff always is.

What are we actually measuring?

The details of what we’re measuring, and why they matter, are so often overlooked.

In “The Accidental HFT firm” I glossed over measuring the latency although it featured prominently in the tale.

I mentioned the “ < 2ms ” being scrawled on the window of my old employer as the team’s new latency target. I then mentioned that the first cut of what I termed the “fast trading engine” (FTE) reduced this measure of latency to less than five microseconds. This was a fair bit under the two millisecond target. I also mentioned that my employer at the time did not take into account the network stack or operating system overhead. The measurements we are talking about here are just the application latency measurements. A few years later, six I think, I interviewed a new employee from the same old place and they were still measuring latency the same incorrect way.

From “The Accidental HFT firm,”

“This was a time in which 2-3 GHz processors were new. Our performance measurement approach was a bit lame. We measured performance internally within the application rather than properly from the external network. This was fine when talking about two millisecond time frames as the network stack was in the 50 to 100 microsecond zone in those days of yore. The key point to remember here is that when you can do more than a billion instructions a second, that translates to more than two million instructions for two milliseconds. Frankly, it was just criminally wrong that even a millisecond, over a million instructions, was burnt on simple trading tasks. If it takes you a million sequential steps to decide on a trade, retire now! A modern processor can do thousands of instructions in a microsecond. Every nanosecond is sacred.”

Let’s look at a basic diagram:

Simple latency diagram for a trading engine

So, the latency referred to is:

$$ t_{app\ latency} = t_{app\ out} - t_{app\ in} $$

FWIW this is normally measured with a high resolution timer, such as one provided by the operating system, or by using C++’s chrono library,

#include <chrono>
#include <cstdint>

auto start_time = std::chrono::high_resolution_clock::now();

// do work...

auto current_time = std::chrono::high_resolution_clock::now();
uint64_t nanos = std::chrono::duration_cast<std::chrono::nanoseconds>(
    current_time - start_time).count();
or by rolling your own with the Intel rdtsc or rdtscp instruction, such as,

#include <cstdint>

// Read the CPU's time stamp counter; returns raw ticks, not nanoseconds.
inline uint64_t rdtsc() {
  uint32_t low, high;
  asm volatile ("rdtsc" : "=a" (low), "=d" (high));
  return (uint64_t) high << 32 | low;
}
which has a bevy of caveats in terms of accuracy, whether to flush the pipeline or not, overhead, technology, and history. Don’t get me started ;-)
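If you do go this way, you still need to convert ticks to wall time. A trivial sketch, assuming tsc_hz is a figure you have calibrated yourself against a known clock (that calibration being one of the caveats):

// Convert a tsc delta to nanoseconds. tsc_hz is an assumed, separately
// calibrated figure (e.g. measured against CLOCK_MONOTONIC over an interval).
inline uint64_t tsc_to_nanos(uint64_t tsc_delta, double tsc_hz) {
  return static_cast<uint64_t>(tsc_delta * 1e9 / tsc_hz);
}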

As mentioned, this measurement is both pretty useful and pretty useless. On the one hand, it represents what a software engineer may control. On the other hand, if you can do your work in 500 nanoseconds and the OS and network stack take 50 microseconds, aren’t you barking up the wrong tree?

So $t_{app\ latency}$ is interesting but not too useful. What we really want to measure is the so-called tick to trade latency; let’s call it $t_{tick\ to\ trade}$:

$$ t_{tick\ to\ trade} = t_{Order} - t_{MD} $$

Simple tick to trade latency diagram for a trading engine

Here we want to measure the latency from port to port, or wire to wire. We wish to use some kind of arrangement where we may scoop the market data packet up off the line on the way in to record $t_{MD}$ and then record the order as it goes out to market, or to our simulator in test, as $t_{Order}$.

Most people who read meanderful understand this already but I need to set this up so we can be precise. Reality matters in the world of negative latency.

What is zero latency?

It is unlikely that your processing overhead will be zero, but if it was, you’d expect:

$$ t_{tick\ to\ trade} = t_{Order} - t_{MD} = 0 \\ \therefore t_{Order} = t_{MD} $$

Now, you’d think it impossible for those to be equal, especially if we are measuring correctly to the nanosecond.

Packets and wires are things too!

Consider a 100Mbps Ethernet UDP market data feed coming from the KRX like it used to do. Say your packet is 500 bytes long, or 4,000 bits. At 100 Mbps, a bit takes 1/100,000,000 seconds, or 10 nanoseconds. Your packet takes 4,000 x 10 nanoseconds or 40 microseconds of wire time.
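As a sanity check, here is that arithmetic as a trivial helper (the 500 bytes and 100 Mbps are just this example’s numbers):

// Serialisation ("wire") time of a packet: bits divided by line rate.
// One Mbps is one bit per microsecond, so bits / Mbps gives microseconds.
inline double wire_time_us(double bytes, double link_mbps) {
  return (bytes * 8.0) / link_mbps;
}
// wire_time_us(500, 100.0)  -> 40 microseconds at 100 Mbps
// wire_time_us(500, 0.256)  -> 15,625 microseconds at 256 kbps (we'll need this later)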

A fibre optic cable allows 5 microseconds of propagation every 1 km. If you have a cable greater than 8 km long, your packet, for a while, would only exist on the cable. It’s cute to think of the cable storing the packet in this way. It reminds me of the mercury delay lines used in the early computers. Your data centre's fibre length will be shorter than 8km. Some of the packet will be on the cable, some will be on the transmitting host, and some will be in the receiving host.

You’d like your timestamp to be more accurate than 40 microseconds, which raises the question of how the timestamping and packet buffering actually work on the host.

A packet on the wire

In the diagram above, the packet is 40 microseconds long. Parts may be on the receiver, on the wire, and in the sender all at the same time. Similarly, when we send an order in response, part of the order packet ends up in our machine and some lives on the wire.

Market data and order packets in flight simultaneously

The question here is when, and for which part of the packet, the timestamp is taken on the way in, and what about on the way out? We're executing thousands of instructions per microsecond on our host. Details matter.

What would tcpdump do?

tcpdump, the common packet capture utility, delegates the timestamp responsibility to libpcap. libpcap’s normal behaviour is to take a clock reading for when the packet has arrived, as part of its delegation to the kernel driver. Pcap will also mark, via the driver, when the packet starts to go out. Wireshark delegates the timestamping to libpcap just as tcpdump does. Windows is similar in spirit.

This is interesting and concerning, as we know the packet is 40 microseconds in length due to its 500 bytes, or 4,000 bits, at 100Mbps. Our $t_{tick\ to\ trade}$ reported by our packet capture, with an impossible zero processing delay, means that the first bit of the packet has been in the machine for 40,000 nanoseconds before a response was even considered. Yikes! The application will only be notified, in the usual UDP or TCP case, once the whole packet has arrived, and perhaps later if you don’t have the right network and kernel options set.

For completeness, you can find the gory details of how this works here, in the kernel documentation on network timestamping, but I suggest you skip the next indented bit:

SOF_TIMESTAMPING_RX_HARDWARE:
Request rx timestamps generated by the network adapter.

SOF_TIMESTAMPING_RX_SOFTWARE:
Request rx timestamps when data enters the kernel. These timestamps are generated just after a device driver hands a packet to the kernel receive stack.

SOF_TIMESTAMPING_TX_HARDWARE:
Request tx timestamps generated by the network adapter. This flag can be enabled via both socket options and control messages.

SOF_TIMESTAMPING_TX_SOFTWARE:
Request tx timestamps when data leaves the kernel. These timestamps are generated in the device driver as close as possible, but always prior to, passing the packet to the network interface. Hence, they require driver support and may not be available for all devices. This flag can be enabled via both socket options and control messages.

SOF_TIMESTAMPING_TX_SCHED:
Request tx timestamps prior to entering the packet scheduler. Kernel transmit latency is, if long, often dominated by queuing delay. The difference between this timestamp and one taken at SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent of protocol processing. The latency incurred in protocol processing, if any, can be computed by subtracting a userspace timestamp taken immediately before send() from this timestamp. On machines with virtual devices where a transmitted packet travels through multiple devices and, hence, multiple packet schedulers, a timestamp is generated at each layer. This allows for fine grained measurement of queuing delay. This flag can be enabled via both socket options and control messages.

SOF_TIMESTAMPING_TX_ACK:
Request tx timestamps when all data in the send buffer has been acknowledged. This only makes sense for reliable protocols. It is currently only implemented for TCP. For that protocol, it may over-report measurement, because the timestamp is generated when all data up to and including the buffer at send() was acknowledged: the cumulative acknowledgment. The mechanism ignores SACK and FACK. This flag can be enabled via both socket options and control messages.

At least the pcap timestamps are measuring some of the stack, even if not all of it. I hope you find this consideration of which bits of bits a bit disturbing. What is the point of a clever 500 nanosecond processing time, even with a fast 1 to 2 microsecond SolarFlare card, if the packet is being coalesced into a buffer for 40,000 damned nanoseconds before you get told?! We have to do better than this.

A good start to measuring $t_{tick\ to\ trade}$ better is to use an external measurement device that doesn’t rely on pcap behaviour and settings. You can see from the timestamping options above that the stack supports co-operating hardware. Pcap will extract a hardware timestamp by curious packet magic; often the checksum of the Ethernet frame is used. Perhaps your solution may just append the timestamping data to the end of the Ethernet frame.
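Before spending on dedicated hardware, you can at least ask libpcap to use the adapter’s stamps if your NIC supports them. A small sketch; whether the adapter actually honours the request is another matter:

#include <pcap/pcap.h>
#include <cstdio>

// Open a capture handle asking for NIC (adapter) timestamps at nanosecond
// precision. If unsupported we simply get whatever the platform provides.
pcap_t* open_with_adapter_stamps(const char* dev) {
  char err[PCAP_ERRBUF_SIZE];
  pcap_t* h = pcap_create(dev, err);
  if (!h) { std::fprintf(stderr, "%s\n", err); return nullptr; }
  pcap_set_tstamp_type(h, PCAP_TSTAMP_ADAPTER);
  pcap_set_tstamp_precision(h, PCAP_TSTAMP_PRECISION_NANO);
  pcap_set_snaplen(h, 65535);
  pcap_set_promisc(h, 1);
  if (pcap_activate(h) != 0) { pcap_close(h); return nullptr; }
  return h;
}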

A timing card, such as an Endace DAG will stamp your packets from the wire:

Endace DAG example

It is the normal convention for hardware to punch the timestamp in at the start of the packet. This is different to the normal receive (rx) stamp in the kernel, as you now know. It is also the same end, the start, that pcap uses on the transmit (tx) side when sending the order. To a degree, anyway, if we ignore some network stack overhead fuzziness.

If we do this and get our packet traces up in Wireshark, we’ll see at least a 40 microsecond difference in the $t_{tick\ to\ trade}$ latency for hardware compared to software pcap. The hardware timestamping is what we really want. We need this improved understanding of our latencies.

Endace cards default to timestamping at the start of the packet, as is the usual convention. Most of their cards do have the option to stamp at the end of the packet instead, which is sometimes useful. For example, you may want to properly measure the overhead of software pcap: if you timestamp via both software and hardware, and have a mechanism to correlate the time bases, you can learn useful things about your network stack.

We’ve just moved the goalposts and made our life harder, with zero latency becoming 40 microseconds. It's much harder to wrangle our thinking into a negative latency way forward now. Truth in measurement is better than fooling ourselves. One of the beautiful aspects of proprietary trading I've always enjoyed is that only the science and engineering matter. Everything has a purpose. Marketing, colour, and philosophical fluff may entertain some, but they are not important in the ultimate meritocracy that is proprietary trading. With many thousands of very clever people opposing your quest, intellectual honesty is paramount. Honest, fair dinkum, negative latency is an achievable thing. You’ll soon know how to do this for a variety of circumstances.

One problem with the Endace DAG cards is that not only are they expensive, you’ll also need network taps or splitters, plus a number of ports, to capture anything. You will also have to keep jiggling your connections around to reconnect and measure different parts of your network. Add to that the simple fact that the cards timestamp to within a window of around 6ns, which translates, due to jitter and clock boundary conditions, to being really only accurate to within 10 to 20 nanoseconds. This is a pain in the accuracy ass, especially when dealing with 50 nanosecond minimum length packets on a 10Gbps wire or fibre. What a pain!

A much better and more economical solution is one of the Metamako devices that can split, tap, and timestamp accurately to 1ns, all on many ports at once.

Metamako MetaConnect 48

Also, you can repatch by UI or API remotely. If you do this stuff seriously and you’re not using Metamako gear for your trading, you’re doing something wrong. Deutsche Börse is doing it right:

You can get by with tcpdump but it is awkward and prone to unreliability. If you do go down that path, remember the packet’s “length” in time so you can make the adjustments manually, and do pin the capture task on its own processor with suitably high priority. Rolling your own analysis tools is useful. I highly recommend libtins as a higher level way of easing yourself into some pleasant C++ to augment the use of libpcap, if you're so inclined beyond the plain plane.
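If you do roll your own, a libtins sketch of that manual adjustment might look like this (the interface name and the 100 Mbps figure are assumptions for this example):

#include <tins/tins.h>
#include <iostream>

int main() {
  Tins::SnifferConfiguration cfg;
  cfg.set_promisc_mode(true);
  Tins::Sniffer sniffer("eth0", cfg);            // assumed interface name

  sniffer.sniff_loop([](Tins::Packet& pkt) {
    // libpcap-style stamp: taken after the whole packet has arrived.
    auto ts = pkt.timestamp();
    // Crude manual adjustment for the packet's own "length in time" at 100 Mbps.
    double wire_us = pkt.pdu()->size() * 8.0 / 100.0;
    std::cout << ts.seconds() << "." << ts.microseconds()
              << "  adjust by -" << wire_us << " us\n";
    return true;                                 // keep capturing
  });
}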

Back to getting negative latency… are we now measuring the right thing?

Nope, it is actually worse than you think

We now know that for this 500 byte, 100Mbps example we are 40us behind the game when the OS tells us the packet has arrived.

It is much worse though. We have to understand two things here: what matters in getting our order to hit before others do, and how the data actually gets from the exchange to us and back to the exchange again.

All exchanges in the world are currently software based. Some use Infiniband and some use Ethernet in their inner rings. What matters is getting your order first to the matching engine that is running the auction for your symbol. What does getting there first mean? Getting all of your packet, down to the last byte, to the matching engine. You may have to traverse a number of switches, machines / gateways, firewalls, cables, protocol translations, routers, rate limiters, all sorts, but the final arbiter is normally the matching engine.

CME’s auction is a little different for the iLink gateways in that a timestamp is put in place at the Market Segment Gateways, using a SolarFlare card's WODA hardware, on message receipt. This is the vital point where the ordering for the auction is confirmed. (As has been confirmed to me June 13.)

What is important is that a full message, normally a network packet, has to arrive at some exchange element, normally the matching engine, but sometimes the gateway, and that sets your priority.

Your market data is a rear view mirror onto that data. It doesn’t matter if it is the SIP or a direct data feed. All you are seeing is a version of history that used to exist at the auction site. There is no certainty about the current state of the auction when you trade, just the hopes and dreams of a trader ready to be crushed by the boot of reality.

In the example we are using I’m drawing on the old KRX as a proxy. I’m using round numbers rather than the real ones but the idea is there.

Let’s have a closer look at the premises at the broker’s site:

The market data path is now complicated. In this example, it is coming in on fibre, travelling via some telecommunications gear to come out as interleaved channelised layers that are fed through some T1 or E1 modems to come out as Ethernet. Fundamentally though, the data is only coming in at 256kbps. How long is our packet really?

4,000 bits at 1/256,000 seconds, or ~3.9 microseconds, per bit gives a packet 15.625 milliseconds in length. That’s a helluva holy whole hole in the thinking about timing. If you click on the diagram above you will see the trading engine market data input time is now labelled $t_{MD0}$.

At $t_{MD2}$ we’re getting a 256kbps feed. If that ends up going out at 100Mbps then the modem will store the packet until it is complete and then send it out on the Ethernet wire. With no overhead, it would take 15,625 microseconds of storing before forwarding the packet on to the next step. Plus we need to add the 40 microseconds for the wire time. Next, add the overhead of the low-end Cisco store and forward switch (a 29xx series) being used, say 20 microseconds, plus another 40 microseconds of wire time.

What we would traditionally measure as zero latency with tcpdump on the trading server is really 40 us of latency with an external measurement from the front of the packet. Which is really at least 100 us at $t_{MD1}$. Or 15,725 us at $t_{MD2}$. And perhaps this means 15,825 us of latency as measured at $t_{MD3}$, the telco gear's interface to the building’s fibre (generously).
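Stacking up this example’s round numbers, the “zero” we measure on the host looks like this from each tap point:

$$ \begin{aligned} \text{at } t_{MD0} &: 40\ \mu s \\ \text{at } t_{MD1} &: 40 + 20 + 40 = 100\ \mu s \\ \text{at } t_{MD2} &: 100 + 15{,}625 = 15{,}725\ \mu s \\ \text{at } t_{MD3} &: 15{,}725 + 100 \approx 15{,}825\ \mu s \end{aligned} $$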

Our market data latency is worse than we think here. Let’s sit on that for a minute and think about it later. Recollecting the waste is making my head hurt.

Transmission tricks

Turning now to the other side of the equation: the TX. We normally start to measure as the packet starts going out of the trading engine. Our zero latency via libpcap is really at least 40 microseconds with proper external measurement. Let’s assume that even though we are talking 100Mbps Ethernet there is a constraining 256kbps channel for the order line, as there was in a version of history at the KRX. This may be a physical channel, a real rate limited channelised group via some E1 or T1 line, or a rate limiting bucket on a Cisco device somewhere in the bowels of the KRX campus datacentre, which we’ll treat as the same even though they are not.

Let's assume our order packet is also 500 bytes, or 4,000 bits, to keep the numbers simple. It too would take 15,625 microseconds to transit the channel at 256kbps.

Stuff that. We know there is a TCP/IP header plus some KRX standard crapola that we need to send before each message. It is just TCP, so there is no need to send an order message all in the same packet; let’s send one TCP packet followed by another to complete the order message. Here we are paying for an extra TCP/IP header but saving on the KRX order message. Let’s say it is 20 bytes, or 160 bits, of the order we save. That’s about 3.9us per bit, tongue to the side, tilt the head, and we save 625 microseconds if we send that section of the order before we even know we have to send an order.

Is it a true saving? Yes. Our packets will be joined by the matching engines, really the gateway process in this case, and they will really, truly, cross-my-heart be there 625 microseconds earlier.
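A minimal sketch of the idea over a plain TCP order session (the function names and byte counts are illustrative, not the real KRX wire format):

#include <sys/socket.h>
#include <netinet/tcp.h>
#include <cstddef>

// Push the unchanging preamble bytes before we know whether we'll trade.
// TCP_NODELAY stops Nagle holding the small segment back.
void speculate_preamble(int fd, const char* preamble, size_t len) {
  int one = 1;
  setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
  send(fd, preamble, len, 0);  // starts trickling down the 256kbps channel now
}

// Once the trading decision is made, send only the remaining bytes. The
// exchange side reassembles the TCP stream; your priority is set when the
// last byte of the whole message arrives, not the first.
void complete_order(int fd, const char* tail, size_t len) {
  send(fd, tail, len, 0);
}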

Calculating the effective latency

Without any RX trick, we have just shaved 625 microseconds off our latency. Let's say we’re as slow as a wet week in Huonville. We take as long as 20 microseconds to process the market data and produce an order, or really an order completion.

Let's look at the timing. That is 40 microseconds of waiting for the packet in our normal buffer, plus 20 microseconds of glacial processing, less the 625 microseconds we have saved via order speculation, for a nett result of:

-565 microseconds.

Wait, what? Is that really a true comparison of a tick to trade where we measure $t_{MD}$ to $t_{Order}$?

It is. Folks we’ve gone negative. We have time travelled thanks to the speculation on the start of the order packet. This is because what matters is when the matching engine's reference point receives the last byte of the message, not the first byte.

It is an implied negative latency. This is the only real way to compare it to your normal tick to trade timing which would be a zero latency or 40 microsecond latency under magical no-overhead conditions. With negative latency, your order really will arrive faster and your hit rate really does go up. It is a thing; not a theoretical construct.

For this case with ye olde KRX, my hit rates actually went up more than expected. This was likely because the KRX used store and forward switches, so smaller packets traversed them quicker. Also, sending speculative packets lightened your bandwidth at critical times, getting you a little further ahead at each store and forward stage: a nice upwardly beneficial percolating effect.

Now you can read more about the packet slicing, the tapping at $t_{MD3}$ tricks, the TCP out of order sequence numbering and IP fragmentation tricks, and the crazy ARIA cipher CBC hacking I got up to, in The accidental HFT firm.

The real truth was that the RX and TX trickery saved us milliseconds in those ancient times. This was gradually whittled away as the exchange increased the bandwidth at various stages, but it remained very healthy for many years. I expect I’ll talk about a bunch of further tricks in the book.

A backlash against negative latency

When I started up Zeptonics after the accidental HFT firm, we marketed a KRX gateway that had better and different tricks to those of the old HFT firm. We also expanded the market data capture to slice after the bid field arrived, before the ask field. As you now know, by hacking the telco fibre coming into the building we could get a very nice early read as we rode the earliest possible bit. There were 75 microseconds between the bid and ask fields even though they were both early in the packet. It was useful to know the bid in advance.

Near the start of 2010, Lime Brokerage in NY started marketing their broking service as a nett negative latency service. It wasn’t really. The fine print was comparing their service to what they thought you could do, as they surmised they would have less overhead. “Less overhead than you” is what Lime called negative latency. The argument makes a bit of sense: they are saying that using their service instead of building your own saves you latency, whereas a normal broker would be slower than your custom system. That kind of makes sense, but I disagree with the use of the term. They would have slowed me down for sure.

Here I am using negative latency to mean faster than if you took no time at all to respond, not Lime's misrepresentation. This is both accurate and reasonable even if it’s a little hard to get your head around. More money from better hit rates doesn’t lie.

Lime sullied the "negative latency" term a little. When Zeptonics started on a similar marketing track we advertised, initially, a negative latency of at least 100 microseconds. Reality was better, but selling 100 microseconds was hard enough. We heard a lot of people saying to us, “Yeah, right, F * off…” Fair enough really. It is an odd idea. Plug your own trading engine into this gateway and suddenly you'll have less latency to the exchange and improved hit rates. What the funk? Do something extra and it takes less time? Sounds a bit daft, doesn't it? If something sounds too good to be true, it must be, right? Just imagine if we had tried to tell them the truth that it was a multiple of 100 microseconds faster than zero. Have I got an ICO for you!

So, we started referring to it as “trade acceleration”, which feedback told us was less confronting:

“ZeptoAccess KRX provides direct trading access to the Korea Exchange (KRX) at a speed that is 100 microseconds faster than a direct wire connection. This somewhat counter-intuitive result - a gateway that’s faster than having no gateway at all - is made possible by harnessing patent-pending “trade acceleration” technology developed by Zeptonics.”

Here is a snapshot from the wayback machine from, well, wayback, or 2012 to be precise:

So why is a crusty old exchange relevant to me?

I can hear you mumbling, “So, you need some old crusty exchange with ancient T1/E1 lines like the KRX? Dude, move on already. Stop wasting my time.”

No. Many people have done similar. No, not wasted your time, but achieved some trade acceleration through technological trickery.

Nomura had a nice FPGA solution for the Tokyo Stock Exchange. They used a slightly different technique. The protocol was 100Mbps or 1Gbps Ethernet to a virtual TSE session via TCP/IP. They would speculate and use a CRC invalidation to kill the packet at the Ethernet layer if they didn’t have an order. Network stacks were a bit slower in those days. I heard on the grapevine that Nomura’s tick to trade was around 2.3 microseconds, much faster than anyone else could achieve at the time. Pretty slow by my standards at least.

As an example, think about trading at BATS in the ye olde daze of 1Gbps. BATS had a native FIX 4.2 protocol back then. In 2009 they were the fastest exchange on the planet with a 443 microsecond RTT. How quaint ;-)

A FIX 4.2 single order message starts off like this:

The standard header has a few compulsory fields, such as FIX.4.2 as the BeginString, a BodyLength, MsgType, SenderCompID, and TargetCompID, plus whatever unchanging fields follow the standard header in the rest of the order message.
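For illustration only, with invented comp IDs and the SOH delimiter written as \x01, the static front of a New Order Single might be kept ready as a constant (assuming fixed-size orders, so BodyLength doesn’t move):

// Hypothetical static prefix of a FIX 4.2 New Order Single (35=D).
// BeginString, BodyLength, MsgType and the comp IDs don't change between
// our orders, so these bytes can go out before the decision is made.
const char kStaticPrefix[] =
    "8=FIX.4.2\x01" "9=0178\x01" "35=D\x01" "49=MYFIRM\x01" "56=BATS\x01";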

It is not inconceivable to save 25 bytes in an Ethernet frame. That is not much at 1Gbps: it is only 200 bits, or 200 nanoseconds. You’ll need an FPGA to do it, but being 200 nanoseconds ahead of everyone else may be worth it. Perhaps you don’t need to wait for 100 bytes of the 200 byte market data frame either, so you can save another 800 bits, or 800 nanoseconds. Now we’re suddenly talking about a negative latency, or trade acceleration, of a full microsecond. Real money.

Today's markets are mostly 10G which is much harder. Short packets are 50ns on the wire. A net save on a similar protocol to the BATS example would only be 100 nanoseconds, but perhaps worth it to some. Savings are savings. Faster is faster. Speed is money. Three word sentences end here.

For modern FPGAs in trading, time in the SERDES, serialising on and off the wire, tends to dominate rather than the processing work itself. Tick to trade for a good FPGA system on 10Gbps Ethernet is under 100 nanoseconds, mainly in the SERDES. Perhaps you'll find negative latency remains possible if your protocol and IT team are kind to you.

I hope this helps explain what negative latency is. It doesn’t exist. It is just that we can move our frame of reference so our latency bounds are shifted. The implied latency in comparison is negative, purely because of the limitations of our original point of view. Good engineers will optimise the problem you give them. Give them a broader view so they don’t just optimise an isolated problem that keeps you behind in the latency race. The best engineers step back, if allowed, to view the full extent of the battlefield.

Think outside the trading engine box.



Happy trading,

–Matt.



Feedback on making this easier and better to read would be most welcome. I expect I’ll expand on this, and add to it in a more narrative form, in the planned book, The accidental HFT firm. If you’re interested in such details and real world war stories, please think about supporting the Kickstarter for the book.
