Sunday, 23 November 2014

SC14 view from afar

Supercomputing 14 in New Orleans has wrapped up. Despite the stasis in the upper echelons of the Top500, some meanderings of interest to traders continue to emerge from HPC.

D E Shaw's Anton 2
(The Anton 2 Chip: A 2nd Generation ASIC for Molecular Dynamics Simulation, Hot Chips 2014)

The BIG announcement of the week for me was Xilinx's SDAccel. Every latency-focused trading firm is going to have to evaluate this framework from a functionality and productivity point of view. There have been quite a few different toolkits for doing this kind of work over the years but this feels like a game changer.

I've previously been involved with Impulse-C and found it an interesting and useful product from a good company. However, it was largely not used in my shop as the types and pragmas required in the code essentially removed the C'ness, which made it yet another C-like foreign language in practical terms. We found it more productive to just go straight to VHDL code at the end of the day, so Impulse-C, though impressive, became shelf-ware. Perhaps if we had had more of a computational burden it might have been more useful, but much of the work at the time was tricky I/O and thus VHDL mattered more.

The nicest implementation of C-like hardware languages I've reviewed over the years was Handel-C. Lovely CSP / Occam / Pi-calculus like constructs. Quite elegant code. Not C though, but this was not that important for me at the time. The killer was that the platform implementation targets were limited and the performance of the code was somewhat, er, um, challenged. Limited performance is not what you're looking for when biting off hardware implementations. Impulse-C targeted VHDL or Verilog so it had the advantage that the specific platform compiler could optimize. Handel-C tried to do more and targeted the RTL level directly but it didn't quite do the terrific job the native compilers could do. I think they lost an opportunity there.

There have been a bunch of others over the years. For example, Mitrion-C was bundled for a while with Cray's hybrid FPGA supercompute platform (Cray XD1(TM) supercomputer) and with SGI's RASC. No standout successes. C-like programming for FPGA platforms is hard.

Hopefully, Xilinx's SDAccel can pick up some success. I think the big game changer is both the support and weight of Xilinx behind the platform and the availability of partial reconfiguration on the fly. It remains to be seen whether the intrusions of specific types, code styles and the dreaded pragmas overwhelm the benefits a high level language brings, but it sounds quite the advance to me. I'm optimistic. An impressive aspect of the demos was the occasional improvement possible over raw VHDL from both clock and area perspectives. When combined with SDNet and their QPI toolkit, extremely compelling architectures may be possible with small development teams. Let's hope the C-like approach works for FPGAs this time. All trading firms will need to have a look at this kind of framework if truly low latency is what makes them tick.

Some other tech from SC14

Intel had some interesting announcements, not all at SC14. The further development of the Knights platform to Knights Hill is very compelling, especially when combined with HMCC memory. With the progression from Corner to Landing to Hill, there is plenty of geographical growth possible. I can't wait to see what the Mountain platform or 2030 Galaxy platform looks like ;-)

The release of HMCC 2.0 and the support of various vendors, including Xilinx, is exciting for a few reasons, but the main one for me is the indication of continued growth in 3D, or 2D stacking, manufacturing. Intel was also talking about this for Flash memory, just as Samsung has recently done, and Intel presaged packaging solutions with around 10TB of Flash in SSDs in only a couple of years. NVIDIA also previously announced stacking of memory with their Pascal GPU platform. NVIDIA's Volta, announced for Summit and Sierra, will continue that. Continued development of innovative packaging will drive densities higher and costs down, though at a cost to accessibility: it seems this is a bit like the new PCB, but one accessible only to the very few.

D E Shaw's Anton 2 chip was an exciting platform to read about. I'm not sure if it won the Gordon Bell Prize, but if it didn't, I think it probably should have (* update: it did). One notable takeaway from reading their paper is that so much more is still possible with further overlap, out of order speculation and processes beyond 40nm and 2B gates. I can't wait to see Anton 3. It is good to see a firm prepared to build a custom ASIC. Their last one, the original Anton, was done by an e-beam house in Japan but the markings on this chip are from Korea, so I wonder who they are using to fab there? Whilst it is for molecular dynamics, I'm sure every trader would love to have such awesome power at their fingertips for "Hardware Support for Fine-Grained Event-Driven Computation." Though I'm sure their research motives are altruistic, I'm also sure a clever trading firm would find a use for that ;-)

100Gb connectivity continues to mature with announcements from Mellanox of a sub-90ns 36-port 100Gbps IB switch and with the Invea-tech platform, previously mentioned, being demonstrated at SC14. I must say I find the 25G Ethernet movement, also on demo at SC14 running "under 100ns" platform to platform, a little more practical as a future path for now.

Mellanox, in addition to their 100Gbps IB/Eth card, also joined Solarflare in having an FPGA NIC solution. It is an alluring solution that looks like it may be better than the Solarflare for latency as it supports FPGA fabric on either or both of the network-facing and PCIe-facing sides. The press doesn't declare the flavour of FPGA nor the tools available. We'll have to wait to understand it better to see if it can compete with a vanilla network connected FPGA NIC for low latency trading.

The Power8 CAPI platform had its first official platform release with Nallatech at the show. A good latency reduction technique to be aware of. I think I'd be focusing on the Xilinx / Intel QPI solution myself though.

Some of my old friends at Metamako were at SC14 next to the Xilinx booth with their compelling platform. If you want the fastest way to get two packets from two wires onto the one wire at 1G / 10G speeds, say for facing an exchange, risk gateway or other financial choke point, such as a mandated firewall, then their new MetaMux 32 reigns supreme at ~100ns. This is in addition to their fancy-schmancy layer 1 reconfig tricks. I think the biggest use case for me for Metamako is simply using their MetaConnect platform just for timestamping though. You could save a bunch of money using one of their MetaConnects, or perhaps even the MetaMux, to timestamp and collect packet information rather than using expensive timestamping cards, but I'm not sure people realise how cool that is from a non-intrusive tap reconfig and cost saving point of view. I'd really like a MetaConnect to script up for dynamic network reconfig and running performance unit tests for financial apps in the test lab.

A mystery for me is the intriguing Intel Omni-Scale platform. Just enough details have been released to know that it is going to be interesting but not enough to really know what it is all about. So, keep your eye out for developments there. Intel have long had the opportunity to kill off much of the NPU, custom processor and even perhaps the entire NIC market by incorporating compelling networking interfaces into the coherent level of their microprocessors. It is not clear when and how this may happen with Omni-Scale. Vendors relying on NIC revenue should be starting to shift a little uncomfortably in their seats. A big leap in network latency reduction could be at hand but there remains an opportunity for a firm like Cavium or AMD to gazump Intel on the low latency network and compute combination. Direct fabric integration, not the kludgy PCIe SoC kind, is long overdue for mainstream processors. Maybe this is at last on its way.

Perhaps not so relevant to trading apps, the announcement of Summit and Sierra was intriguing with their reliance on NVIDIA's Volta with stacked RAM and NVLink. It is certainly a nice win for the IBM Power 9 platform even if the headline performance is really reliant on NVIDIA rather than the Power 9. It also just shows how hard an ExaFLOP is, at least for LINPACK, as neither of them quite gets to ExaFLOP performance, so the ExaFLOP race remains on.

Happy trading,


{NB: Scant public details from Intel on Knights Hill (optical connect, stacked memory, faster) and Omni-Scale/Path.}
