Sunday, 30 November 2014

Ode to Teaching

It's another end to a school year. For months, the world over, an army of teachers has been toiling in the confines of the education factories, leading up to a final preparatory push and some exhaustion.

There is often much personal and career sacrifice that goes with the decision to become a teacher, and to continue as one. It is not a profession renowned for its pay, chefs on duty, foosball tables in the breakout room, nor, sadly, community respect. Stock options are fairly rare. Dedicated professionals are haunted by the cruel and excruciating epithet echoed by far too many: "Can't do, teach."

A warm thank you from their students sometimes makes all the difference.

A lovely reward for a year's work.

As many teachers do, at year end my wife appreciatively received some small gifts and kind words. Carolyn often pecks on her keyboard late into the night. She rises early for the gym, the cross-town commute, and the 7:30am "extra" tutes. Every day is a pretty long day. The occasional aggressive and threatening parent at a parent-teacher interview challenges the spirit. Each year, every semester, every class, is an opportunity of relentless importance. The wheels never stop. I'm proud of my caring wife and her choice to be a teacher. It's been a tough couple of years, with twenty years of her teaching superannuation being stolen, a cancer scare, and suicide attempts and subsequent hospitalisation in the immediate family. Some days aren't easy. Sometimes a note of thanks, like this one, makes that life choice of being a teacher completely worthwhile.

Hopefully my wife doesn't see this too quickly so that I don't get told to take this note down. She's not one for a fuss or embarrassment. Yes, this card is a fabulous reflection on my wife, but it also shines the same light on the dedication of all her fellow professionals. When we read the unadulterated thoughts of appreciative students, we can all appreciate the important and fine work that too often stays hidden in the classroom. It's a wonderful profession for which we should all give thanks.

I hope this is a nice reminder that in the lives of our children and in the progression and advancement of our society:

Teaching Matters


A respectful token to energize the soul


One card for one soul can provide more heat than a fistful of dollars. 

If you know a teacher, take a moment and say a simple thanks. A little kindness can travel far.


Sunday, 23 November 2014

SC14 view from afar

Supercomputing 14 in New Orleans has wrapped up. Despite the stasis in the Top500 echelons, some meanderings of interest to traders continue to emerge from HPC.

DE Shaw's Anton 2
(The Anton 2 Chip: A 2nd Generation ASIC for Molecular Dynamics Simulation, Hot Chips 2014)


The BIG announcement for me for the week was Xilinx's SDAccel. Every latency-focused trading firm is going to have to evaluate this framework from a functionality and productivity point of view. There have been quite a few different toolkits for doing this kind of work over the years, but this feels like a game changer.

I've previously been involved with Impulse-C and found it an interesting and useful product from a good company. However, it was largely not used in my shop as the types and pragmas required in the code essentially removed the C'ness, which made it yet another C-like foreign language in practical terms. We found it more productive to just go straight to VHDL at the end of the day, so Impulse-C, though impressive, became shelf-ware. Perhaps if we had more of a computational burden it might have been more useful, but much of the work at the time was tricky I/O and thus VHDL mattered more.
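
As an illustration of that intrusion, here is a minimal sketch of the flavour of code these C-to-gates flows tend to expect. It is written in the Vivado HLS C++ dialect (the one SDAccel builds on) rather than Impulse-C itself, and the function name, ports and pragma choices are my own assumptions, but the way vendor-specific types and directives carry most of the meaning is the point:

    #include "ap_int.h"      // Xilinx arbitrary-precision integer types
    #include "hls_stream.h"  // Xilinx streaming FIFO abstraction

    // A trivial streaming add: read a 64-bit word, add a constant, write it out.
    // The pragmas, not the C, largely determine the hardware that gets generated.
    void stream_add(hls::stream<ap_uint<64> > &in,
                    hls::stream<ap_uint<64> > &out,
                    ap_uint<64> delta)
    {
    #pragma HLS INTERFACE axis port=in   // map the streams onto AXI-Stream ports
    #pragma HLS INTERFACE axis port=out
    #pragma HLS PIPELINE II=1            // ask for one result per clock cycle

        ap_uint<64> word = in.read();    // blocking FIFO read, not a pointer dereference
        out.write(word + delta);
    }

Once the arguments are streams, the widths are ap_uint and the loop behaviour is set by pragmas, it compiles as C but it no longer reads or refactors like C, which is exactly the shelf-ware risk.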

The nicest of the C-like hardware languages I've reviewed over the years was Handel-C. Lovely CSP / Occam / pi-calculus-like constructs. Quite elegant code. Not C though, but that was not so important for me at the time. The killer was that the platform implementation targets were limited and the performance of the code was somewhat, er, um, challenged. Limited performance is not what you're looking for when biting off hardware implementations. Impulse-C targeted VHDL or Verilog, so it had the advantage that the specific platform compiler could optimise. Handel-C tried to do more and targeted the RTL level directly, but it didn't quite do the terrific job the native compilers could do. I think they lost an opportunity there.

There have been a bunch of others over the years. For example, Mitrion-C was bundled with Cray's hybrid FPGA supercompute platform for a while (the Cray XD1(TM) supercomputer) and with SGI's RASC. No standouts for success. C-like programming for FPGA platforms is hard.

Hopefully, Xilinx's SDAccel can pick up some success. I think the big game changers are the support and weight of Xilinx behind the platform and the availability of partial reconfiguration on the fly. It remains to be seen whether the intrusion of specific types, code styles and the dreaded pragmas overwhelms the benefits a high-level language brings, but it sounds quite the advance to me. I'm optimistic. An impressive aspect from the demos was the occasional improvement possible over raw VHDL from both clock and area perspectives. When combined with SDNet and their QPI toolkit, extremely compelling architectures may be possible with small development teams. Let's hope the C-like approach works for FPGAs this time. All trading firms will need to have a look at this kind of framework if truly low latency is what makes them tick.
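
On the host side, SDAccel is pitched as an OpenCL-based flow with pre-built FPGA binaries, so the productivity question is largely whether the software half stays ordinary. The sketch below is my assumption of that flow using only standard OpenCL 1.x calls; the binary name "stream_add.xclbin", the kernel name and the missing error handling are placeholders, not anything lifted from the SDAccel documentation:

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cl_int err;
        cl_platform_id platform;
        cl_device_id device;

        // Pick the first platform and the first accelerator device (the FPGA card).
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        // Load a pre-compiled FPGA binary; there is no run-time synthesis step.
        FILE *f = fopen("stream_add.xclbin", "rb");   // placeholder binary name
        if (!f) return 1;
        fseek(f, 0, SEEK_END);
        size_t len = (size_t)ftell(f);
        fseek(f, 0, SEEK_SET);
        unsigned char *bin = (unsigned char *)malloc(len);
        if (fread(bin, 1, len, f) != len) return 1;
        fclose(f);

        const unsigned char *bins[] = { bin };
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &len, bins, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "stream_add", &err);  // kernel name assumed

        // ... create buffers, set kernel args, clEnqueueTask(), read back results ...

        clReleaseKernel(k);
        clReleaseProgram(prog);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        free(bin);
        return 0;
    }

If that is genuinely all the host plumbing required, then the hard work stays where it always was: in the kernel code and in timing closure, not in the software glue.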

Some other tech from SC14


Intel had some interesting announcements, not all at SC14. The further development of the Knights platform to Knights Hill is very compelling, especially when combined with HMC memory. With the progression from Corner to Landing to Hill, there is plenty of geographical growth possible. I can't wait to see what the Mountain platform or 2030 Galaxy platform looks like ;-)

The release of the HMC 2.0 specification, with support from various vendors including Xilinx, is exciting for a few reasons, but the main one for me is the indication of continued growth in 3D, or 2D, stacking in manufacturing. Intel was also talking about this for Flash memory, just as Samsung has recently done, and Intel presaged packaging solutions with around 10TB of Flash in SSDs in only a couple of years. NVidia also previously announced stacking of memory with their Pascal GPU platform. NVidia's Volta, announced for Summit and Sierra, will continue that. Continued development of innovative packaging will drive densities higher and costs down, though at a cost to accessibility, as it seems this is a bit like the new PCB but only accessible to the very few.

D E Shaw's Anton 2 chip was an exciting platform to read about. I'm not sure if it won the Bell prize, but if it didn't, I think it probably should have (* update: it did). One notable takeaway from reading their paper is that so much more is still possible with further overlap, out-of-order speculation and process nodes beyond 40nm and 2B gates. I can't wait to see Anton 3. It is good to see a firm be prepared to build a custom ASIC. Their last one, the original Anton, was done by an e-beam house in Japan, but the markings on this chip are from Korea, so I wonder who they are using to fab there? Whilst it is for molecular dynamics, I'm sure every trader would love to have such awesome power at their fingertips for "Hardware Support for Fine-Grained Event-Driven Computation." Though I'm sure their research motives are altruistic, I'm also sure a clever trading firm would find a use for that ;-)

100Gb connectivity continues to mature, with announcements from Mellanox of a sub-90ns 36-port 100Gbps IB switch and with the Invea-tech platform, previously mentioned, being demonstrated at SC14. I must say I find the 25G Ethernet movement, also on demo at SC14 running "under 100ns" platform to platform, a little more practical as a future path for now.

Mellanox, in addition to their 100Gbps IB/Eth card, also joined SolarFlare in having an FPGA NIC solution. It is an alluring solution that looks like it may be better than the SolarFlare one for latency, as it supports FPGA fabric on either or both of the network- and PCIe-facing sides. The press doesn't declare the flavour of FPGA or the tools available. We'll have to wait to understand it better to see if it can compete with a vanilla network-connected FPGA NIC for low latency trading.

The Power8 CAPI platform had its first official platform release with Nallatech at the show. A good latency reduction technique to be aware of. I think I'd be focusing on the Xilinx / Intel QPI solution myself though.

Some of my old friends at Metamako were at SC14 next to the Xilinx booth with their compelling platform. If you want the fastest way to get two packets from two wires onto the one wire at 1G / 10G, say facing an exchange, risk gateway or other financial choke point such as a mandated firewall, then their new MetaMux 32 reigns supreme at around 100ns. This is in addition to their fancy-schmancy layer 1 reconfig tricks. I think the biggest use case for me for Metamako is simply using their MetaConnect platform just for timestamping though. You could save a bunch of money using one of their MetaConnects, or perhaps even the MetaMux, to timestamp and collect packet information rather than using expensive timestamping cards, but I'm not sure people realise how cool that is from a non-intrusive tap reconfig and cost-saving point of view. I'd really like a MetaConnect to script up for dynamic network reconfig and running performance unit tests for financial apps in the test lab.

A mystery for me is the intriguing Intel Omni-Scale platform. Just enough details have been released to know that it is going to be interesting but not enough to really know what it is all about. So, keep your eye out for developments there. Intel have long had the opportunity to kill off much of the NPU, custom processor and even perhaps the entire NIC market by incorporating compelling networking interfaces into the coherent level of their microprocessors. It is not clear when and how this may happen with Omni-Scale. Vendors relying on NIC card revenue should be starting to shift a little uncomfortably in their seats. A big leap in network latency reduction could be at hand but there remains an opportunity for a firm like Cavium or AMD to gazump Intel on the low latency network and compute combination. Direct fabric integration, not the kludgy PCIe SoC kind, is long overdue for mainstream processors. Maybe this is at last on its way.

Perhaps not so relevant to trading apps, the announcement of Summit and Sierra was intriguing with their reliance on nVidia's Volta with stacked RAM and NVLink. It is certainly a nice win for the IBM Power 9 platform, even if the headline performance is really reliant on nVidia rather than the Power 9. It also just shows how hard an ExaFLOP is, at least for LinPack, as neither quite gets to ExaFLOP performance, so the ExaFLOP race remains on.

Happy trading,

--Matt.


{NB: Scant public details from Intel on Knights Hill (optical connect, stacked memory, faster) and Omni-Scale/Path.}

Monday, 17 November 2014

Alpha Data - PCIe 3.0 Xilinx UltraScale 10/40G

I was browsing the exhibitors at SC14, which is on this week in New Orleans. I noticed that I had missed this nice-looking Xilinx Kintex UltraScale ADM-PCIE-KU3 from Alpha Data, which was news way back in April 2014.
Source: Alpha Data - Data sheet for ADM-PCIE-KU3
Kintex UltraScale is now going mainstream, so perhaps just noticing it nowish is cutely serendipitous. I'm not sure about the position of the SMA connector for timing, as it looks a little inconvenient in this picture, though the picture was an early release. Hopefully it has moved. PCIe 3.0 allowing direct access to Intel uP cache memory (Data Direct I/O), from Sandy Bridge onwards, would allow lower latency than competing PCIe 2.x solutions. PCIe 2 solutions really should be avoided in this day and age if you're going to the trouble of eking out the little bits of latency you can.

Simple is good. It is indeed a pretty simple board with up to 8 x 10G lines via breakout cables, or 2 x 40G, plus additional RAM. I'm not sure many in trading would find the SATA ports useful; some may for direct recording to disk, but I can't imagine many bothering.

It looks a good choice and worth checking out at their booth at SC14 this week if you're after a pretty clean Kintex UltraScale design for PCIe 3.0 goodness.

I still think I'd prefer a Zynq solution, such as the HiTech Global board HTG-Z7-PCIE-HH or one of the not-yet-released Arria 10 ARM SoC based boards (e.g. BittWare's A10PHQ). Still, I'd think a Zynq 7045 would be preferable to the 7100 on the HTG, as I'd happily give up FPGA fabric for better transceivers, so I'm yet to find the goldilocks board for me. I'm interested in hearing about boards that you might have found that are just right for your network-oriented application. Drop me a mail.

Happy trading,

--Matt.

_____________
PS: Nearly two-year-old FPGA summary from this blog
PPS: Terasic FPGA board previously mentioned

PPPS:

The mystery regarding the SMA / PPS connector placement mentioned above is solved. Here is a picture of the Alpha Data card from SC14 via Xilinx's blog:

Picture of Alpha Data card from Xilinx Forum

This shows the SMA being exposed to the outside world via the PCIe bracket. It's a bit of an ugly hack but, hey, it works. Overall the card looks a pretty satisfying solution, especially when teamed with the particularly awesome SDAccel framework Xilinx announced at SC14. This particular card was one of the launch devices. SDAccel is a game changer, especially for trading and HFT. All low latency trading firms will need to review SDAccel from both a functionality and productivity point of view.

--Matt.



Saturday, 8 November 2014

Don't buy an ExaNIC from ExaBlaze

[Update: Solarflare is suing Exablaze for patent infringement]

I've seen more dubious PR from the nasty people at Zomojo/ExaBlaze recently.

I thought it worthwhile warning people once more before they get sucked into ExaBlaze's reality distortion field.

Specifically: why their ExaNIC cards are not a smart idea; and, why you should steer well away from such wasteful expenditure.

It would be easy to be seduced by the pitch for the ExaNIC, but the pitch is hollow, with parts that are simply straight-out lies. It's not your fault if you've been deceived into purchasing one.

There are two potential reasons for looking at an ExaNIC:
   1) A low latency network card, and
   2) An even lower latency FPGA trading solution.

Neither makes sense. Let's meander through why.

Is it a good NIC?

In the first case, the ExaNIC card is feature-poor and has little advantage over much better solutions from Mellanox or SolarFlare. Mellanox[2] and SolarFlare[1][3] make quality NICs with rich feature sets, kernel bypass and trusted implementations with open source drivers. Also, Chelsio and Myricom are worthy of some consideration if you have specific features you need.

You should be concerned about the signal integrity of the cards from ExaBlaze, as the initial versions had poor signal integrity on the networking side, which is a difficult problem to spot and solve. It's fine when it works but potentially devastating when you're having to solve problems. ExaBlaze push silly claims about latencies, quoting 60-byte packets when the minimum frame length on 1G and 10G is 64 bytes. A frame shorter than 64 bytes is illegal and referred to as a "runt" frame. To top it all off, the ExaNIC cards are expensive. Sure, buy one to play with if you must, but it makes no sense to invest in the card as a NIC solution.
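
For reference, the 64-byte floor is just the Ethernet framing arithmetic: 6 bytes destination MAC + 6 bytes source MAC + 2 bytes EtherType + 46 bytes minimum payload + 4 bytes FCS. Here is a minimal sketch of the check, purely my own illustration and not from any vendor SDK:

    #include <cstddef>

    // Minimum legal Ethernet frame on the wire:
    //   6 (dst MAC) + 6 (src MAC) + 2 (EtherType) + 46 (min payload) + 4 (FCS) = 64 bytes.
    // Anything shorter is a "runt" and should be dropped and counted,
    // not quoted as a latency benchmark packet size.
    constexpr std::size_t kEthHeaderLen  = 14;  // dst + src + EtherType
    constexpr std::size_t kEthMinPayload = 46;
    constexpr std::size_t kEthFcsLen     = 4;
    constexpr std::size_t kEthMinFrame   = kEthHeaderLen + kEthMinPayload + kEthFcsLen;  // 64

    inline bool is_runt(std::size_t wire_len_bytes)
    {
        return wire_len_bytes < kEthMinFrame;
    }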

If a quantum of a hundred nanoseconds is really important to you then, regardless of any NIC, you should be considering an FPGA solution to avoid the PCIe latency cost.

So, what about the ExaNIC as an FPGA solution for trading?

Just don't.

As an FPGA solution it is a much simpler and clearer case to weigh up. It is one of the slowest 10G or 1G Xilinx FPGA solutions you could possibly use. Yes, one of the slowest. If you are going to go all the way to an FPGA solution, which is an expensive way to go, then make sure you do it right.

The fundamental limiting factor in the ExaNIC x4 is that the network-facing SERDES are off-chip and slower than using native on-chip Xilinx transceivers. ExaNIC as an FPGA path is a dumb solution for low latency. There are plenty of other excellent FPGA boards out there, so you have no reason to waste your money. If you have been fooled into purchasing an ExaNIC on the mistaken belief that it was a fast way to connect to the market and trade, then you may be entitled to your money back. They are lying to their customers about their high-latency architecture by referring to it as low-latency.

"It hung in the air in exactly the way that bricks don't."

Look at the following picture from the ExaBlaze web site:

The standard way an FPGA NIC card works, except an ExaNIC doesn't. The SERDES for ExaNIC x4 are not on the FPGA, they are external, an extra hop, and higher latency. Don't be deceived.
The diagram clearly shows the network SERDES to be on-chip with respect to the FPGA. This is a lie for the ExaNIC x4 and it may also be a lie for the ExaNIC x2 if it is a similar design. The SERDES on the ExaNIC x4 is an external mux-demux chip and a slower solution than an on-chip FPGA transceiver. Why would a vendor use a higher latency architecture? To use a cheaper FPGA and save money. Don't be fooled. If you have fallen for the ExaBlaze misrepresentation, go get your money back is all I can suggest.

In summary, as NICs go, there is better quality, functionality and more reasonable pricing from the Mellanox and SolarFlare NICs, so avoid the ExaNIC. As an FPGA solution, it is just a dumb idea to invest in an architecturally slow solution. You can't make it fast. It is slow by design. Get a real FPGA board from a trustworthy vendor, as there are plenty of good alternatives. The combined ARM & FPGA SoC solutions as NICs are my favourites at the moment, even though the ARMs are a little underwhelming. The SoC FPGA-fabric-to-uP hop beats PCIe latencies for interesting hybrid solutions.

You should try to buy equipment from a trustworthy vendor. Matt Chapman annoyed a lot of people with his little video at Inside HPC, where he dances around a product that Zeptonics developed and he claimed as his own. What a schmuck. Greg Robinson, after he was sacked as CEO of ITG Australia, used IP stolen from ITG to compete at Zomojo with ITG's prop trading (Canada / NY interlisted arb specifically), and told multiple staff (so it's not just my claim) not to talk to Bill Burdett about it, as Bill was still on the board of directors of NYSE-listed ITG Inc. Bill is a Zomojo beneficial shareholder. Then you have my little continuing dispute with them, where I hope for a miracle one day that will see my faith in justice restored with some kind of judicial remedy that passes the smell test. Exablaze / Zomojo lied to either the court or their customers: they told customers Chapman was no longer associated with the trading at Zomojo, yet at the same time claimed to the court that he was an officer of Zomojo. Those statements are mutually exclusive, so both couldn't be true. Zomojo / Exablaze lied to the court many times. Zomojo and Exablaze are nasty people. Do you want to buy from a vendor that not only trades against you but misleads you? Caveat emptor.

If you've bought an ExaNIC, you should use their misleading representations as a reason to return the NICs and get your money back. Fortunately for you, there are plenty of good people in the industry making better product. It's simple. Exablaze serves no useful purpose. Use better products.

Happy trading,

--Matt.


[Update Sunday 2014-11-23]

PS: My speculation that the ExaNIC x2 works the same way as the ExaNIC x4, with an external _SLOW_ mux-demux SERDES requiring an extra hop, is confirmed. Xilinx point out the x2 board uses a Kintex-7 160T FPGA, which has a maximum of 8 high speed transceivers. As the card uses 8 lanes of PCIe 2.0 (yeah, no PCIe 3.0), it has no transceivers left for facing the network. So, it is confirmed: the x2 has a similar architecture to the ExaNIC x4, and ExaBlaze are lying to their customers about the design on their public website. The architecture is slow by design. Use a better product.
______

[1] Some Solarflare 10GbE Server Adapter Features

• LSO, LRO, GSO large packet offloads
• TCP/UDP/IP checksum offloads
• Line rate packet filtering
• Receive Side Scaling (RSS)
• Accelerated Receive Flow Steering (RFS)
• NetQueue, VMQ, SR-IOV
• 256 multicast filters
• Jumbo Frames (9KB)
• 4096 VLANs/port
• PXE boot, iSCSI boot
• IEEE 802.3ae – 10 Gigabit Ethernet
• IEEE 802.3an – 10GBASE-T
• IEEE 802.3ad – Link Aggregation and Failover
• IEEE 802.1Q, 802.1p – VLAN tags, priority
• IEEE 802.3x – Pause

Operating Systems: RHEL 5, 6, 7; MRG; SLES 10, 11; SLERT; other Linux; Windows Server 2003, 2003 R2, 2008, 2008 R2, 2012, 2012 R2; OS X v10.6, 10.7, 10.8, 10.9; Solaris x86 10, 11; ESX 3.5, 4.x, ESXi 5.x; KVM; Windows Hyper-V; XenServer 5.x, 6.0. All server adapters support: SR-IOV, 127 VFs per port, 1024 vNICs per port.

[2] Some Mellanox NIC features

ETHERNET
– IEEE Std 802.3ae 10 Gigabit Ethernet
– IEEE Std 802.3ba 40 Gigabit Ethernet
– IEEE Std 802.3ad Link Aggregation
– IEEE Std 802.3az Energy Efficient Ethernet
– IEEE Std 802.1Q, .1P VLAN tags and priority
– IEEE Std 802.1Qau Congestion Notification
– IEEE Std 802.1Qbg
– IEEE P802.1Qaz D0.2 ETS
– IEEE P802.1Qbb D1.0 Priority-based Flow Control
– IEEE 1588v2
– Jumbo frame support (9600B)
OVERLAY NETWORKS
– VXLAN and NVGRE: a framework for overlaying virtualized Layer 2 networks over Layer 3 networks; network virtualization hardware offload engines
HARDWARE-BASED I/O VIRTUALIZATION
– Single Root IOV
– Address translation and protection
– Dedicated adapter resources
– Multiple queues per virtual machine
– Enhanced QoS for vNICs
– VMware NetQueue support
ADDITIONAL CPU OFFLOADS
– RDMA over Converged Ethernet
– TCP/UDP/IP stateless offload
– Intelligent interrupt coalescence
FLEXBOOT™ TECHNOLOGY
– Remote boot over Ethernet
– Remote boot over iSCSI
PROTOCOL SUPPORT
– Open MPI, OSU MVAPICH, Intel MPI, MS-MPI, Platform MPI
– TCP/UDP
– iSER, NFS RDMA
– uDAPL
PCI EXPRESS INTERFACE
– PCIe Base 3.0 compliant, 1.1 and 2.0 compatible
– 2.5, 5.0, or 8.0GT/s link rate x8
– Auto-negotiates to x8, x4, x2, or x1
– Support for MSI/MSI-X mechanisms
CONNECTIVITY
– Interoperable with 10/40GbE Ethernet switches; interoperable with 56GbE Mellanox switches
– Passive copper cable with ESD protection
– Powered connectors for optical and active cable support
– QSFP to SFP+ connectivity through QSA module
OPERATING SYSTEMS/DISTRIBUTIONS
– Citrix XenServer 6.1
– RHEL/CentOS 5.X and 6.X, Novell SLES10 SP4; SLES11 SP1, SLES11 SP2, OEL, Fedora 14, 15, 17, Ubuntu 12.04
– Windows Server 2008/2012/2012 R2
– FreeBSD
– OpenFabrics Enterprise Distribution (OFED)
– OpenFabrics Windows Distribution (WinOF)
– VMware ESXi 4.x and 5.x

[3] Some specific SolarFlare features

Product Number: SFN7322F
Standards & Compliance: IEEE 1588 v2; IEEE 802.3ae; IEEE 802.3ad; IEEE 802.1Q; IEEE 802.1p; IEEE 802.3x; RoHS compliant
Power: 5.9W (typical)

Precision Packet Time Stamping: 7.5ns resolution
Stable Precision Oscillator: Stratum 3 compliant; short-term drift < 3.7×10⁻⁷ in 24 hours
Server Clock Synchronization Accuracy: sub-200ns
1PPS-input circuit: rising edge active, TTL into 50Ω
1PPS-output circuit: rising edge on-time, TTL into 50Ω
I/O Virtualization: 2048 guest OS protected vNICs; 240 virtual functions; 16 physical functions
PCI Express: PCIe 3.0 x8 @ 8.0 GT/s
SFC9120 10G Ethernet Controller: supports high-performance 10GbE
SFP+ Support: optical & copper SFP/SFP+ modules; Direct-Attach, Fiber (10G or 1G), 1G/10G combo
1000BASE-T SFP Support: 1G 1000BASE-T SFP modules
Low Latency: cut-through architecture / intelligent interrupt coalescing
Receive Side Scaling (RSS): distributes IPv4, IPv6 loads across all CPU cores; MSI-X minimizes interrupt overhead
Hardware Offloads: TSO, LRO, GSO; IPv4/IPv6; TCP, UDP checksums
Adapter Teaming/Link Aggregation: LACP for redundant links & increased bandwidth (compatible with MLAG)
Jumbo Frames: 9216 byte MTU for performance
Enhanced Tuning: adaptive interrupt moderation
IP Flow Filtering: hardware directs packets based on IP, TCP, UDP headers
Advanced Packet Filtering: 4096 multicast filters; 4096 VLANs/port; adaptive TCP/UDP/IP, MAC, VLAN, RSS, RFS filtering; Accelerated Receive Flow Steering (RFS)
Intel QuickData™: uses host DMA engines to accelerate I/O
Remote Boot: PXE, iSCSI boot; unattended installation
Management: SNMP, ACPI v3.0
Virtualization Support: VMware ESXi; Microsoft Hyper-V; XenServer; Linux KVM; SR-IOV
Operating Systems: RHEL 5, 6, 7, MRG; SLES 10, 11, SLERT; other Linux; Windows Server 2008 R2, 2012, 2012 R2