Saturday, 11 March 2017

Using the matrix - exchange and enterprise network improvement

To achieve the lowest of low latency market data distribution you need to use layer one switching devices, otherwise known as matrix switches. Sometimes they are called crossbar switches. In the old days of the telephone exchange such things were built with relays and the like, but the world builds nice electronic ones now. They are not packet switches. A matrix switch is all about making a circuit from one place to another. Here is a little simplified diagram:
Example crosspoint schematic
If we close the Y1 to X1 switch we have a connection between them and a signal can flow, whether it is a 10G Ethernet packet or an analogue voice call. It's just a wire. If the circuit supports one source and many destinations, we can close the Y1/X1 switch and the Y1/X3 switch so the signal from Y1 flows to both X1 and X3 almost instantly. That is the essence of the matrix, if you choose to live in it.
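To make the fan-out idea concrete, here is a minimal sketch, with invented names, that models a crosspoint matrix as a set of closed switches. Duplication is just bookkeeping; once a crosspoint is closed, the path is a wire:

```python
# Minimal model of a crosspoint matrix: a closed (input, output) switch
# means the signal on that input appears on that output. One input may
# fan out to many outputs; each output has at most one driving input.

class CrosspointMatrix:
    def __init__(self):
        self.closed = {}  # output port -> input port driving it

    def connect(self, inp, out):
        # Closing a crosspoint is configuration-time work only; the
        # signal path itself is a wire, so fan-out adds no per-packet cost.
        self.closed[out] = inp

    def outputs_of(self, inp):
        return sorted(out for out, i in self.closed.items() if i == inp)

m = CrosspointMatrix()
m.connect("Y1", "X1")
m.connect("Y1", "X3")
print(m.outputs_of("Y1"))  # Y1 now drives both X1 and X3
```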

The step after asking a telephone operator to patch your wire through to its destination was the development of fancy schmancy dialing systems that had matrix switches at their core. Here is a picture of a Western Electric crossbar switch manufactured in 1970:

6-wire crossbar switch. Manufactured by Western Electric, April 1970.
N Reynolds of Western Electric invented the crossbar selector in 1913. Here is a picture of a technician fooling around at East 30th St, NY, NY in 1938:

It kind of reminds me of a steam punk version of a modern matrix switch, except with no steam or punk, I guess. Anyway, here is a picture from 1955 showing us the modern data centre of its day, care of AT&T:

Matrix switches are important and have been around for over a hundred years. Quite a few of us don't appreciate their long history. I was certainly late to the party.

There is some irony to mull over in the idea that the crossbar switch of 1913 was capable of lower latency than a modern data centre’s packet switch infrastructure. Yes, Arista, Mellanox, Juniper, and Cisco packet switches are all slower than one hundred year old technology. The switching was precomputed by the relays choosing the path. That path was then simply a direct connection. No microseconds or nanoseconds wasted on choosing paths. Certainly the clink, clank, thunk of the path setting was pretty slow, but once the path is in place, you’re off to the races, as fast as your electrons can travel.

There are a couple of handfuls of vendors who provide this kind of layer one matrix switch. Such switches have a strong history of use in video and audio environments, besides the obvious telecommunication use cases. Zeptonics introduced a device specifically targeting financial applications a few years ago, but times have moved on and better devices exist from excellent new generation vendors such as Metamako.

Market data

In a packet switched environment, multicast UDP is typically used for market data delivery. Whether the multicast is dense or sparse, it adds load to the switching device that may also interfere with other traffic. Even without contention, there will be one or two orders of magnitude difference between packet switched multicast and matrix switched traffic. Matrix switches are simply really good at doing nothing. If your destination is preordained by the network's omniscient subscription God, then let it be.

Some packet switches offer spanning, which may cost as little as 50ns for duplication between ports, but this is still an order of magnitude more than the typical matrix switch port to port cost of around 5ns. That 5ns cost is largely related to trace lengths within the device, with only 0.1 to 2ns due to the matrix switching chip. Distance matters at such speed.
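A quick back-of-envelope sketch shows why trace length dominates that 5ns figure. The numbers here are illustrative assumptions, not measurements: signals in copper traces and cables propagate at roughly two-thirds the speed of light, and the exact velocity factor depends on the medium.

```python
# Rough propagation arithmetic: how much of a ~5 ns port-to-port
# figure is just distance? Assumes a ~0.66 velocity factor, which is
# typical for PCB traces and cables but varies by medium.

C_CM_PER_NS = 30.0       # speed of light in vacuum, cm per ns
VELOCITY_FACTOR = 0.66   # assumed fraction of c in copper

def trace_delay_ns(length_cm):
    """Propagation delay for a given trace/cable length."""
    return length_cm / (C_CM_PER_NS * VELOCITY_FACTOR)

chip_ns = 0.5                   # assumed switching-chip contribution
trace_ns = trace_delay_ns(80)   # assume ~80 cm of internal trace path
print(f"trace: {trace_ns:.1f} ns, total: {chip_ns + trace_ns:.1f} ns")
```

Under these assumptions the trace alone accounts for roughly 4ns of the budget, which is why shaving centimetres matters at this scale.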

So, the best and most efficient way to send data from one point to many, as you need to do in a market data environment, is to simply use a matrix switch, or layers of such, and fan out the data.

There are two main types of modern matrix switch: the optical and the electronic. The optical usually uses a MEMS chip, think little magical mirrors, that directs beams around the place. Optical devices are a bit slow to set up, typically milliseconds, but usually lead the electronic devices in the bandwidth race. Often they have more scale, that is, ports, but you pay for the privilege as they are not a cheap thing to make. It is a bit easier to stamp out a complex bit of silicon, as society is better geared up for that. The electronic variety is typically faster in set up, microseconds, but a bit lower in bandwidth, 25G being new for silicon but a bit older for optical MEMS. The electronic variety usually has the advantage of multi-tap or multicast, where you can go from one to many, or all, which is harder to arrange with optical. Also, electrickery usually fares better than optical with regard to signal integrity, simply because we are better at massaging electrickery than we are at catching rainbows.

One of the magical things I used to use such matrix switches for was unit based performance tests. As the connectivity of the matrix is scriptable, when code was checked into the version control repository, a series of tests would be run to check the performance. The network config was part of the unit test, and the network was reconfigured for each test to get real wire to wire measurements for components. This is a very handy way to keep the nanoseconds under control and stop speed bumps being inadvertently introduced. A Graphite database we used could show all the nanoseconds of code evolution over time. Alarms would ring and emails would go out if a handful of extraneous nanoseconds suddenly appeared, which is surprisingly easy to do. The key to this was having the network be completely configurable for each unit test. That is a joy of the matrix switch in the lab.

If I were to set about designing your enterprise or data centre network, I would always use a matrix switch just to save on maintenance by having it as a flexible, no remote hands, patch panel. You can save yourself a lot of expensive data centre walks or transport fares. That kind of use case is usually enough to justify the cost of the device to start with. It is one of the few obvious ways to save money in network design.

If you are distributing market data and you care about latency, you really should always use a matrix switch, otherwise you’re not doing your job properly. Stick a ye olde modern packet switch in the way, and now you’re just going slow. Don’t do that. Packet switches are good for humans, but not so good for algorithms. If an algorithm is hooked up to a packet switch and another algo is hooked up to a matrix switch, the algo hooked up to the packet switch will lose. Don't lose: use a matrix switch.

One thing exchange failures have taught us is that vendors are not to be trusted ;-) Homogeneity kills. A vendor wants you to use all of their equipment and be homogeneous. Sometimes that makes sense but often it doesn't. We need our own architects to bypass the BS and build real resilience into our network architectures. For this reason, I would do my redundancy with a packet switched multi-cast UDP network. That is, I’d do both layer one and layer two or three, with the packet switching being the back-up path. To me, heterogeneity in both design and vendors matters. You won’t get that kind of advice from your vendor, which is why you need to rely on your own team.

For finance there is one stand-out vendor for matrix switches: Metamako make by far the best product, for two main reasons. The first is that their switches understand common packet structures. As I’ve explained, the layer one matrix is essentially just a point to point wire, but in addition to that, the Metamako gear understands a few things, such as 1G and 10G Ethernet, so you can get some packet information which helps a great deal in building and monitoring the network. Just about all the other vendors give you a cable equivalent and you’re somewhat blind to the packets going across. Some will also do similar signal integrity work to what Metamako do, but this is not the same as counting packets.

The second win from the Metamako gear is the embedded latency measurement. In the old daze you’d have to use an expensive Endace DAG card, or equivalent, and tap a specific line to it to capture packets with time-stamp annotations. The Metamako gear lets you add time-stamps to any port you want and benchmark many ports simultaneously. Depending on how you think of it, such time-stamping capability is either a massive cost saving or a massive capability upgrade. The downside is that a few hundred nanoseconds are added to a time-stamped packet, but you can add a tap, as is the nature of the matrix switch, and have an undelayed reproduction in addition to the delayed time-stamped line. A big advantage of the electronic matrix chips is that when you add an additional replicant of a line, the original line is undisturbed, so you can replicate data in production without risk to the original path. Very cool. You’d better get it right though, as it is unwise to mess with a production network, and with great flexibility comes great danger. But when you need Felix's bag of tricks, you really need it.

I was triggered to write this meandering piece as Metamako just released a 2RU 96 port matrix switch:
Metamako's MetaConnect 96

This is a pretty serious bit of kit that encourages large enterprise fanouts. With only two layers of MetaConnect 96s you have support for over nine thousand end points of replicated market data. Neat. Exchanges, banks, and brokers should take note. Three layers would give you the possibility of over 800,000 end points at a cost of around 18 nanoseconds plus wire time. Wire length becomes the obvious constraint rather than the switch.
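The fan-out arithmetic behind those figures is simple enough to sketch. The ~6ns per-hop cost is an assumption consistent with the "around 18 nanoseconds" for three layers above:

```python
# Fan-out arithmetic for layered matrix switches: with a P-port device,
# each extra layer multiplies reachable end points by P, at a cost of
# one port-to-port hop per layer. Hop cost is an assumed round number.

PORTS = 96
HOP_NS = 6  # assumed per-device port-to-port cost, ns

def fanout(layers, ports=PORTS, hop_ns=HOP_NS):
    """(reachable end points, total switch latency in ns) for N layers."""
    return ports ** layers, layers * hop_ns

for layers in (1, 2, 3):
    endpoints, latency = fanout(layers)
    print(f"{layers} layer(s): {endpoints:,} end points, ~{latency} ns + wire time")
```

Two layers give 9,216 end points and three give 884,736, matching the "over nine thousand" and "over 800,000" figures, with wire length rather than switching becoming the binding constraint.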

Future networks

In a future world, I hope we’ll see what I like to call SDN++ networks, next gen software defined networks, that not only support advanced flexible routing, virtual networks, and packet switching but also directed circuits via matrix switches. Perhaps we’ll then see support for on demand bandwidth, such as for VM migrations, plus the automation of resilience planning as well as the expected latency and bandwidth optimization planning.

Resilience planning is especially interesting to me. Just as you may use instrument delta and gamma simplifications for fast VaR risk calculations, you should be thinking of using device and link failure sensitivities, network deltas, to plan your instant and automated redundancy responses.
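One way to read that analogy concretely: bump each link to failure, one at a time, and record the latency delta of the best remaining path, much as you would bump an input in a risk sensitivity run. A minimal sketch on an invented four-node topology:

```python
# "Network deltas" sketch: for each single-link failure, recompute the
# best-path latency and record the delta versus the base case. The
# topology and latency figures are invented for illustration.

import heapq

LINKS = {  # (a, b): latency in ns, treated as bidirectional
    ("feed", "sw1"): 5, ("feed", "sw2"): 5,
    ("sw1", "algo"): 5, ("sw2", "algo"): 120,  # sw2 leg is the slow backup
}

def best_path_ns(links, src, dst):
    """Dijkstra shortest path over the bidirectional link map."""
    graph = {}
    for (a, b), w in links.items():
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, w in graph.get(node, []):
            if d + w < dist.get(nxt, float("inf")):
                dist[nxt] = d + w
                heapq.heappush(heap, (d + w, nxt))
    return float("inf")

base = best_path_ns(LINKS, "feed", "algo")
for link in LINKS:
    surviving = {k: v for k, v in LINKS.items() if k != link}
    delta = best_path_ns(surviving, "feed", "algo") - base
    print(f"fail {link}: +{delta} ns")
```

The links with large deltas are exactly where automated rewiring, or pre-provisioned circuits, buy the most resilience per dollar.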

P4 is a step in the right direction, but it is not enough. Plexxi is one such hybrid approach, but it is also limited and not enough; it is neat and heading in the right direction, yet seems to have lost sight of the forest for the trees. The future will belong not just to packets but also to circuits. That is, the key will be the orchestration of not just packets, or flow tables, but also the planning of links and the transient rewiring of links to support bandwidth, latency, and resiliency within the context of competing priorities.

Feeding into such a framework should also be aggregation. Metamako’s fastest Mux application is a 69ns service that is handily faster than, say, a fully connected layer 2 switch could be with the same technology. A mux is faster than a fully connected switch simply because it is simpler. Such things are important when you have a latency-critical requirement for an important aggregation problem, such as a financial service talking to an exchange or broker. So imagine a future where you have not only flow tables in different devices to optimise, but flexible circuits, packet switches, and aggregators; plus all the monitoring, timing, and measurement foo you wish to throw around. Then consider clients, servers, operating systems, network cards, and network devices all having flexible circuit and switching capabilities. Such a rich environment provides awesome opportunities for optimisation and improvement. We write VHDL and have software optimise our circuit layouts as part of modern chip design, but we still hand specify our networks due to the artistry involved. Hmmm. There is an obvious destination if we can find the right path.

As the old saying goes, many problems can be solved by adding a layer of indirection, hence packet switching. Reducing layers of indirection by circuits is also a noble act. Let's do that too.

I really want this future SDN++ network. No one yet is planning such a beast. Modern matrix switches and better monitoring with measurement is a step in the right direction but there is much more to come. It feels we’re on the threshold of exciting changes that will bring real, practical benefits to the data centre.

Happy trading,

