Friday, 30 June 2017

FPGAs and AI processors: DNN and CNN for all

Here is a nice hidden node from a traditional 1990's style gender identification neural net I did a few weeks ago.

A 90's style hidden node image in a simple gender identifier net
Source: my laptop
My daughter was doing a course as part of her B Comp Eng, the degree after her acting degree. Not being in the same city, I thought maybe I could look at her assignment and help in parallel. Unsurprisingly, she neither needed nor wanted my help. No man-splaining necessary from the old timer father. Nevertheless, it was fun to play with the data Bec pulled down on faces. Bec's own gender id network worked fine for 13 out of 14 photos of herself fed into the trained net. Nice.

I was late to the party and first spent time with neural nets in the early nineties. As a prop trader at Bankers Trust in Sydney, I used a variety of software including a slightly expensive graphical tool from NeuroDimension that also generated C++ code for embedding. It had one of those parallel port copy protection dongles that were a pain. I was doing my post-grad at a group at uni that kept changing its name from something around connectionism, to adaptive methods, and then data fusion. I preferred open source and the use of NeuroDimension waned. I ported the Stuttgart Neural Network Simulator, SNNS, to the new MS operating system, Windows NT (with OS/3 early alpha branding ;-) ), and briefly became the support guy for that port. SNNS was hokey code with messy static globals but it worked pretty fly for a white guy.

My Master of Science research project was a kind of cascade correlation-like neural net, the Multi-rate Optimising Order Statistic Equaliser (MOOSE), for intraday Bund trading. The MOOSE was a bit of work designed for acquiring fast LEO satellite signals (McCaw's Teledesic), repurposed for playing with Bunds as they migrated from LIFFE to DTB. As a prop trader at an investment bank, I could buy neat toys. I had the world's fastest computer at the time: an IBM MicroChannel box with dual 200MHz Pentium Pro processors, SCSI, and some megabytes of RAM. Pulling 800,000 points into my little C++ stream/DAG processor seemed like black magic in 1994. Finite differencing methods let me do oodles of O(1) incremental linear regressions and the like for 1000-fold speed-ups. It seemed good at the time. Today, your phone would laugh in my general direction.
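For flavour, here is a minimal sketch of that incremental trick (not the original MOOSE code, obviously): keep running sums over a window so each new tick updates a least-squares slope in O(1), rather than re-fitting the whole window every time.

```python
from collections import deque

class RollingRegression:
    """O(1)-per-tick linear regression over a fixed window via running sums."""

    def __init__(self, window):
        self.window = window
        self.xs = deque()
        self.ys = deque()
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        self.xs.append(x)
        self.ys.append(y)
        self.sx += x; self.sy += y
        self.sxx += x * x; self.sxy += x * y
        if len(self.xs) > self.window:      # evict the oldest point in O(1)
            ox, oy = self.xs.popleft(), self.ys.popleft()
            self.sx -= ox; self.sy -= oy
            self.sxx -= ox * ox; self.sxy -= ox * oy

    def slope(self):
        n = len(self.xs)
        return (n * self.sxy - self.sx * self.sy) / (n * self.sxx - self.sx * self.sx)

r = RollingRegression(window=3)
for x, y in [(1, 2), (2, 4), (3, 6), (4, 8)]:   # y = 2x, so the slope is 2
    r.update(x, y)
print(round(r.slope(), 6))                      # → 2.0
```

The same pattern extends to rolling means, variances, and correlations, which is where the bulk of the speed-up came from.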

There was plenty of action in neural nets back in those days. Not much of it was overly productive but it was useful. I was slightly bemused to read Eric Schmidt's take on machine learning and trading in Lindsay Fortado and Robin Wigglesworth's FT article "Machine learning set to shake up equity hedge funds",
Eric Schmidt, executive chairman of Alphabet, Google’s parent company, told a crowd of hedge fund managers last week that he believes that in 50 years, no trading will be done without computers dissecting data and market signals.
“I’m looking forward to the start-ups that are formed to do machine learning against trading, to see if the kind of pattern recognition I’m describing can do better than the traditional linear regression algorithms of the quants,” he added. “Many people in my industry think that’s amenable to a new form of trading.”
Eric, old mate, you know I was late to the party in the early nineties, what does that make you?

Well, things are different now. I like to think of the new neural renaissance, and have written about it, as The Age of Perception. It is not intelligence; it is just good at patterns. It is still a bit hopeless at language ambiguities. It will also be a while before it understands the underlying values and concepts needed for deep financial understanding. 

Deep learning is simultaneously both overhyped and underestimated. It is not intelligence, but it will help us get there. It is overhyped by some as an AI breakthrough that will give us cybernetic human-like replicants. We still struggle with common knowledge and ambiguity in simple text for reasoning. We have a long way to go. The impact of relatively simple planning algorithms and heuristics along with the dramatic deep learning based perception abilities from vision, sound, text, radar, et cetera, will be as profound as every person and their dog now understands. That's why I call it, The Age of Perception. It is as if the supercomputers in our pockets have suddenly awoken with their eyes quickly adjusting to the bright blinking blight that is the real world. 

The impact will be dramatic and lifestyle changing for the entire planet. Underestimate the impact at your peril. No, we don't have a date with a deep Turing conversationalist that will provoke and challenge our deepest thoughts - yet. That will inevitably come, but it is not on the visible horizon. Smart proxies aided by speech, text and Watson-like Jeopardy databases will give us a very advanced Eliza, but no more. Autonomous transport, food production, construction, yard and home help will drive dramatic lifestyle and real-estate value changes.

Apart from this rambling meander, my intention here was to collect some thoughts on the chips driving the current neural revolution. Not the most exciting thought for many, but it is a useful exercise for me.

Neural network hardware

Neural processing is not a lot different today compared to twenty years ago. Deep is more of a brand than a difference. The activation functions have been simplified, which suits hardware better. Mainly there is more data and a better understanding of how to initialise the weights, handle many layers, parallelise, and improve robustness via techniques such as dropout. Fukushima's Neocognitron architecture from 1980 is not much different from today's deep learner or CNN, though it helped that Yann LeCun later showed how to make such an architecture learn. 
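To make the "simplified activations" point concrete, here is a small illustrative sketch: the smooth sigmoid of the nineties versus the hardware-friendly ReLU, plus inverted dropout for robustness. Nothing here is from any particular framework; it is just the textbook definitions.

```python
import math
import random

def sigmoid(z):
    """Classic 90s activation: smooth, but saturates and needs an exp()."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Modern simplification: a comparison and a wire - cheap in silicon."""
    return max(0.0, z)

def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest
    so the expected activation is unchanged at inference time."""
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

print(sigmoid(0.0))                      # → 0.5
acts = [relu(z) for z in [-1.5, 0.2, 3.0]]
print(acts)                              # → [0.0, 0.2, 3.0]
print(dropout(acts, p=0.5, rng=random.Random(42)))
```

The ReLU's piecewise-linear shape is a big part of why fixed-function ML hardware got simpler rather than more exotic.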

Back in the nineties there were also plenty of neural hardware platforms, such as CNAPS (1990), with its 64 processing units and 256kB of memory doing 1.6 GCPS (giga connections per second) at 8/16-bit, or 12.8 GCPS at 1-bit. You can read about Synapse-1, CNAPS, SNAP, the CNS Connectionist Supercomputer, Hitachi WSI, My-Neupower, LNeuro 1.0, UTAK1, GNU Implementation (no, not GNU GNU, General Neural Unit), UCL, Mantra 1, Biologically-Inspired Emulator, INPG Architecture, BACHUS, and ZISC036 in "Overview of neural hardware" [Heemskerk, 1995, draft].

Phew, that seems a lot, but it excludes the software and accelerator board/CPU combos, such as ANZA plus, SAIC SIGMA-1, NT6000, Balboa 860 coprocessor, Ni1000 Recognition Accelerator Hardware (Intel), IBM NEP, NBC, Neuro Turbo I, Neuro Turbo II, WISARD, Mark II & IV, Sandy/8, GCN (Sony), Topsi, BSP400 (400 microprocessors), DREAM Machine, RAP, COKOS, REMAP, General Purpose Parallel Neurocomputer, TI NETSIM, and GeNet. Then there were quite a few analogue and hybrid analogue implementations, including Intel's Electrically Trainable Analog Neural Network (80170NX). You get the idea: there was indeed a lot back in the day.

All a go go in 1994:

Optimistically, Moore's Law was telling us TeraCPS performance was just around the corner:
"In the next decade micro-electronics will most likely continue to dominate the field of neural network implementation. If progress advances as rapidly as it has in the past, this implies that neurocomputer performances will increase by about two orders of magnitude. Consequently, neurocomputers will be approaching TeraCPS (10^12 CPS) performance. Networks consisting of 1 million nodes, each with about 1,000 inputs, can be computed at brain speed (100-1000 Hz). This would offer good opportunities to experiment with reasonably large networks."
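The arithmetic in that quote checks out: a million nodes with a thousand inputs each is a billion connections, and evaluating them at brain speed lands between 0.1 and 1 TeraCPS.

```python
nodes = 1_000_000            # "networks consisting of 1 million nodes"
inputs_per_node = 1_000      # "each with about 1,000 inputs"
connections = nodes * inputs_per_node   # 10^9 connections in the network
tera = 10**12

for hz in (100, 1000):       # "brain speed (100-1000 Hz)"
    print(hz, "Hz ->", connections * hz / tera, "TeraCPS")
# → 100 Hz -> 0.1 TeraCPS
# → 1000 Hz -> 1.0 TeraCPS
```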
The first neural winter was the cruel subversion of research dollars that followed Minsky and Papert's dissing of Rosenblatt's perceptron dream, with incorrect hand-wavy generalisations about hidden layers, an episode that arguably contributed to Rosenblatt's untimely death. In 1995 another neural winter was kind of underway, although I didn't really know it at the time. As a frog in the saucepan, I didn't notice the boil. This second winter was fired up by a lack of exciting progress and general boredom. 

The second neural winter ended with the dramatic improvement in ImageNet classification from the University of Toronto's SuperVision entry, better known as AlexNet, in 2012, thanks to Geoffrey Hinton's winter survival skills. That result was then blown away by Google's GoogLeNet Inception model in 2014. So, the Age of Perception started in 2012 by my reckoning. Mark your diaries. We're now five years in.

Google did impressive parallel CPU work with lossy updates across a few thousand regular machines. Professor Andrew Ng and friends made the scale approachable by enabling dozens of GPUs to do the work of thousands of CPUs. Thus, we were saved from the prospect of neural processing being only for the well funded. Well, kind of: the state of the art now sometimes needs thousands of GPUs or special-purpose chips. 

More data and more processing have been quite key. Let's get to the point and list some of the platforms that are key to the Age of Perception's big data battle:

GPUs from Nvidia

These are hard to beat. The subsidisation that comes from the large video processing market drives tremendous economies of scale. The new Nvidia V100 can do 15 TFlops of SP, or 120 TFlops with its new Tensor core architecture, which pairs an FP16 multiply with an FP32 accumulate to suit ML. Nvidia packs 8 boards into its DGX-1 for 960 Tensor TFlops. 
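The FP16-multiply/FP32-accumulate split is not a gimmick: accumulating long dot products purely in FP16 loses the plot once the running sum dwarfs each increment. A small numpy sketch of why the wide accumulator matters (numpy stands in for the hardware here):

```python
import numpy as np

# Accumulate 10,000 copies of 0.1 stored as FP16.
vals = np.full(10000, np.float16(0.1))

acc16 = np.float16(0.0)
for v in vals:                      # FP16 accumulator: once the sum is big,
    acc16 = np.float16(acc16 + v)   # each 0.1 rounds away to nothing

acc32 = np.float32(0.0)
for v in vals:                      # FP16 inputs, FP32 accumulator,
    acc32 += np.float32(v)          # tensor-core style

print(float(acc16), float(acc32))   # FP16 stalls far short of ~1000
```

The FP16 accumulator stalls at the point where 0.1 falls below half the FP16 spacing, while the FP32 accumulator lands near the true sum of roughly 1000.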

GPUs from AMD

AMD has been playing catch-up with Nvidia in the ML space. The soon-to-be-released AMD Radeon Instinct MI25 promises 12.3 TFlops of SP or 24.6 TFlops of FP16. If your calculations are amenable to Nvidia's Tensor cores, then AMD can't compete. Nvidia also has nearly twice the memory bandwidth, at 900GB/s versus AMD's 484GB/s. 

Google's TPUs

Google's original TPU had a big lead over GPUs and helped power DeepMind's AlphaGo victory over Lee Sedol in a Go tournament. The original 700MHz TPU is described as having 92 TOPS for 8-bit calculations or 23 TOPS for 16-bit, whilst drawing only 40W. This was much faster than GPUs on release and is now slower than Nvidia's V100 in raw terms, though not on a per-watt basis. The new TPU2 is referred to as a TPU device; it has four chips and can do around 180 TFlops. Each chip's performance has been doubled to 45 TFlops for 16-bit. You can see the gap to Nvidia's V100 is closing. You can't buy a TPU or TPU2. Google is making them available for use in their cloud, with TPU pods containing 64 devices for up to 11.5 PetaFlops of performance. The giant heatsinks on the TPU2 are some cause for speculation, but the market is changing from devices to units with groups of devices, and also such groups within the cloud.
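The quoted TPU2 numbers hang together, as a quick sanity check shows:

```python
chip_tflops = 45                        # TPU2 chip, 16-bit ops
chips_per_device = 4
device_tflops = chips_per_device * chip_tflops   # one "TPU device"
devices_per_pod = 64
pod_pflops = devices_per_pod * device_tflops / 1000

print(device_tflops, "TFlops per device")   # → 180 TFlops per device
print(pod_pflops, "PFlops per pod")         # → 11.52 PFlops per pod
```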

Wave Computing

Wave's Aussie CTO, Dr Chris Nicol, has produced a wonderful piece of work with Wave's asynchronous data flow processor in their Compute Appliance. I was introduced to Chris briefly a few years ago in California by Metamako Founder Charles Thomas. They both used to work on clockless async stuff at NICTA. Impressive people those two. 

I'm not sure Wave's appliance was initially targeting ML, but their ability to run TensorFlow at 2.9 PetaOPS/s on their 3RU appliance is pretty special. Wave refers to their processors as DPUs, and an appliance has 16 DPUs. Wave uses processing elements it calls Coarse Grained Reconfigurable Arrays (CGRAs). It is unclear what bit width the 2.9 PetaOPS/s refers to. From their white paper, the ALUs can do 1b, 8b, 16b and 32b:
"The arithmetic units are partitioned. They can perform 8-b operations in parallel (ideal for DNN inferencing) as well as 16-b and 32-b operations (or any combination of the above). Some 64-b operations are also available and these can be extended to arbitrary precision using software."
Here is a bit more on one of the 16 DPUs included in the appliance,
"The Wave Computing DPU is an SoC that contains 16,384 PEs, configured as a CGRA of 32x32 clusters. It includes four Hybrid Memory Cube (HMC) Gen 2 interfaces, two DDR4 interfaces, a PCIe Gen3 16-lane interface and an embedded 32-b RISC microcontroller for SoC resource management. The Wave DPU is designed to execute autonomously without a host CPU."
On TensorFlow ops, 
"The Wave DNN Library team creates pre-compiled, relocatable kernels for common DNN functions used by workflows like TensorFlow. These can be assembled into Agents and instantiated into the machine to form a large data flow graph of tensors and DNN kernels."
"...a session manager that interfaces with machine learning workflows like TensorFlow, CNTK, Caffe and MXNet as a worker process for both training and inferencing. These workflows provide data flow graphs of tensors to worker processes. At runtime, the Wave session manager analyzes data flow graphs and places the software agents into DPU chips and connects them together to form the data flow graphs. The software agents are assigned regions of global memory for input buffers and local storage. The static nature of the CGRA kernels and distributed memory architecture enables a performance model to accurately estimate agent latency. The session manager uses the performance model to insert FIFO buffers between the agents to facilitate the overlap of communication and computation in the DPUs. The variable agents support software pipelining of data flowing through the graph to further increase the concurrency and performance. The session manager monitors the performance of the data flow graph at runtime (by monitoring stalls, buffer underflow and/or overflow) and dynamically tunes the sizes of the FIFO buffers to maximize throughput. A distributed runtime management system in DPU-attached processors mounts and unmounts sections of the data flow graph at run time to balance computation and memory usage. This type of runtime reconfiguration of a data flow graph in a data flow computer is the first of its kind."
Yeah, me too. Very cool.

The exciting thing about this platform is that it is coarser than FPGA in architectural terms and thus less flexible, but likely to perform better. Very interesting.
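As a toy illustration of the data flow execution model described above (nothing to do with Wave's actual toolchain): kernels fire as soon as all their operands have arrived, with no program counter ordering them. An evaluator for such a graph fits in a few lines:

```python
# Toy static data flow graph: a node fires when all of its inputs are ready,
# which is roughly the model a data flow machine executes in hardware.
def run_dataflow(graph, inputs):
    """graph: {node_name: (fn, [input names])}; returns values of all nodes."""
    values = dict(inputs)
    pending = dict(graph)
    while pending:
        for name, (fn, deps) in list(pending.items()):
            if all(d in values for d in deps):       # all operands arrived
                values[name] = fn(*(values[d] for d in deps))
                del pending[name]
    return values

graph = {
    "scaled": (lambda x: [v * 2 for v in x], ["x"]),
    "biased": (lambda v: [a + 1 for a in v], ["scaled"]),
    "total":  (lambda v: sum(v), ["biased"]),
}
out = run_dataflow(graph, {"x": [1, 2, 3]})
print(out["total"])     # (1*2+1) + (2*2+1) + (3*2+1) → 15
```

Wave's session manager does the hard part this toy ignores: placing kernels onto physical PEs and sizing the FIFO buffers between them.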

KnuEdge's KnuPath

I tweeted about KnuPath back in June 2016. Their product page has since gone missing in action. I'm not sure what they are up to with the $100M they put into their MIMD architecture. It was described at the time as having 256 tiny DSP, or tDSP, cores on each ASIC along with an ARM controller suitable for sparse matrix processing in a 35W envelope. 

(Source: HPC Wire)
The performance is unknown, but at the time they compared their chip to a then-current Nvidia part and claimed 2.5 times the performance. We know Nvidia is now more than ten times faster with its Tensor cores, so KnuEdge will have a tough job keeping up. A MIMD or DSP approach will have to execute awfully well to take some share in this space. Time will tell. 

Intel's Nervana

Intel purchased Nervana Systems, which was developing both a GPU/software approach and its Nervana Engine ASIC. Comparable performance figures are unclear. Intel is also planning to integrate the technology into the Phi platform via the Knights Crest project. NextPlatform suggested the 2017 target on 28nm may be 55 TOPS for some width of op. Intel has scheduled a NervanaCon for December, so perhaps we'll see the first fruits then.

Horizon Robotics

This Chinese start-up has a Brain Processing Unit (BPU) in the works. Dr Kai Yu has the right kind of pedigree, as he was previously the head of Baidu's Institute of Deep Learning. Earlier this year a BPU emulation on an Arria 10 FPGA was shown in a YouTube clip. There is little public information on this platform.


Eyeriss

Eyeriss is an MIT project that developed a 65nm ASIC with unimpressive raw performance. The chip is about half the speed of an Nvidia TK1 on AlexNet. The neat aspect is that such middling performance was achieved by a 278mW reconfigurable accelerator, thanks to its row-stationary approach. Nice.


Graphcore

Graphcore raised $30M of Series-A late last year to support the development of their Intelligence Processing Unit, or IPU. Their web presence is a bit sparse on details, with hand-wavy facts such as >14,000 independent processor threads and >100x memory bandwidth. Some snippets have snuck out, with NextPlatform reporting over a thousand true cores on the chip with a custom interconnect, and a PCIe board carrying 16 processing elements. It sounds kind of dataflowy. Unconvincing PR aside, the team has a strong rep and the investors are not naive, so we'll wait and see.


Tenstorrent

Tenstorrent is a small Canadian start-up in Toronto claiming an order of magnitude improvement in efficiency for deep learning, like most. No real public details, but they are on the Cognitive 300 list.


Cerebras

Cerebras is notable due to its backing from Benchmark and the fact that its founder was the CEO of SeaMicro. It appears to have raised $25M and remains in stealth mode.


Thinci

Thinci is developing vision processors from Sacramento, with employees in India too. They claim to be at the point of first silicon, Thinci-tc500, with benchmarking and the winning of customers already happening. Apart from "doing everything in parallel" we have little to go on.


Koniku

Koniku's web site is counting down and has 72 days showing until my new reality. I can hardly wait. They have raised very little money and, after watching their YouTube clip embedded in this Forbes page, you too will likely not be convinced, but you never know. Harnessing biological cells is certainly different. It sounds like a science project, but, then this,
"We are a business. We are not a science project," Agabi, who is scheduled to speak at the Pioneers Festival in Vienna, next week, says, "There are demands that silicon cannot offer today, that we can offer with our systems."
The core of the Koniku offer is the so-called neuron-shell, inside which the startup says it can control how neurons communicate with each other, combined with a patent-pending electrode which allows it to read and write information inside the neurons. All this is packed in a device as large as an iPad, which they hope to shrink to the size of a nickel by 2018.


Adapteva

Adapteva is a favourite little tech company of mine to watch, as you'll see in this previous meander, "Adapteva tapes out Epiphany-V: A 1024-core 64-bit RISC processor." Andreas Olofsson taped out his 1024-core chip late last year and we await news of its performance. Epiphany-V has new instructions for deep learning and we'll have to see if this memory-controller-less design with 64MB of on-chip memory will have appropriate scalability. The impressive efficiency of Andreas's design and build may make this a chip we can all actually afford, so let's hope it performs well.


Knowm

Knowm talks about Anti-Hebbian and Hebbian (AHaH) plasticity and memristors. Here is a paper covering the subject, "AHaH Computing–From Metastable Switches to Attractors to Machine Learning." It's a bit too advanced for me. With a quick glance I can't tell the difference between this tech and hocus-pocus, but it looks sciency. I'm gonna have to see this one in the flesh to grok it. The idea of neuromemristive processors is intriguing. I do like a good buzzword in the morning.


Mythic

Mythic is promising a battery-powered neural chip with 50x lower power. Not many real details out there. The chip is the size of a button, but aren't most chips?
"Mythic's platform delivers the power of desktop GPU in a button-sized chip"
Perhaps another one suitable for drones and phones, but likely to be eaten or sidelined by a phone.


Phones are an obvious place for ML hardware to crop up. We want to identify the dog type, flower, leaf, cancerous mole, translate a sign, understand the spoken word, etc. Our pocket supercomputers would like all the help they can get for the Age of Perception.

Qualcomm has been fussing around ML for a while with the Zeroth SDK and Snapdragon Neural Processing Engine. The NPE certainly works reasonably well on the Hexagon DSP that Qualcomm uses. The Hexagon DSP is far from a very wide parallel platform, and Yann LeCun has confirmed that Qualcomm and Facebook are working together on a better way, in Wired's "The Race To Build An AI Chip For Everything Just Got Real":
"And more recently, Qualcomm has started building chips specifically for executing neural networks, according to LeCun, who is familiar with Qualcomm's plans because Facebook is helping the chip maker develop technologies related to machine learning. Qualcomm vice president of technology Jeff Gehlhaar confirms the project. "We're very far along in our prototyping and development," he says."
Perhaps we'll see something soon beyond the Kryo CPU, Adreno GPU, Hexagon DSP, and Hexagon Vector Extensions. It is going to be hard to be a start-up in this space if you're competing against Qualcomm's machine learning.

Pezy-SC and Pezy-SC2

These are the 1024-core and 2048-core processors that Pezy develops. The Pezy-SC 1024-core chip powered the top 3 systems on the Green500 list of supercomputers back in 2015. The Pezy-SC2 is the follow-up chip that was meant to be delivered by now; I do see a talk about it in June, but details are scarce yet intriguing:
"PEZY-SC2 HPC Brick: 32 of PEZY-SC2 module card with 64GB DDR4 DIMM (2.1 PetaFLOPS (DP) in single tank with 6.4Tb/s)"
It will be interesting to see what 2,048 MIMD MIPS Warrior 64-bit cores can do. In the June 2017 Green500 list, an Nvidia P100 system took the number one spot and there is a Pezy-SC2 system at number 7. So the chip seems alive, but details are thin on the ground. Motoaki Saito is certainly worth watching.


Kalray

Despite many promises, Kalray has not progressed its chip offering beyond the 256-core beast I covered back in 2015, "Kalray - new product meander." Kalray is advertising the product as suitable for embedded self-driving car applications, though I can't see the architecture being an ideal CNN platform in its current form. Kalray has a Kalray Neural Network (KaNN) software package and claims better efficiency than GPUs, with up to 1 TFlop/s on chip.

Kalray's NN fortunes may improve with an imminent product refresh, and just this month Kalray completed a new funding round that raised $26M. The new Coolidge processor is due in mid-2018, with 80 or 160 cores along with 80 or 160 co-processors optimised for vision and deep learning.

This is quite a change in architecture from their >1000 core approach and I think it is most sensible.

IBM TrueNorth

TrueNorth is IBM's Neuromorphic CMOS ASIC developed in conjunction with the DARPA SyNAPSE program.
It is a manycore processor network on a chip design, with 4096 cores, each one simulating 256 programmable silicon "neurons" for a total of just over a million neurons. In turn, each neuron has 256 programmable "synapses" that convey the signals between them. Hence, the total number of programmable synapses is just over 268 million (2^28). In terms of basic building blocks, its transistor count is 5.4 billion. Since memory, computation, and communication are handled in each of the 4096 neurosynaptic cores, TrueNorth circumvents the von-Neumann-architecture bottlenecks and is very energy-efficient, consuming 70 milliwatts, about 1/10,000th the power density of conventional microprocessors. [Wikipedia]
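Those numbers multiply out exactly as quoted:

```python
cores = 4096
neurons_per_core = 256
synapses_per_neuron = 256

neurons = cores * neurons_per_core          # just over a million neurons
synapses = neurons * synapses_per_neuron    # just over 268 million synapses

print(neurons, synapses, synapses == 2**28)   # → 1048576 268435456 True
```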
Previously criticised for running spiking neural networks rather than being fit for deep learning, IBM developed a new algorithm for running CNNs on TrueNorth,
Instead of firing every cycle, the neurons in spiking neural networks must gradually build up their potential before they fire...Deep-learning experts have generally viewed spiking neural networks as inefficient—at least, compared with convolutional neural networks—for the purposes of deep learning. Yann LeCun, director of AI research at Facebook and a pioneer in deep learning, previously critiqued IBM’s TrueNorth chip because it primarily supports spiking neural networks... 
...the neuromorphic chips don't inspire as much excitement because the spiking neural networks they focus on are not so popular in deep learning.
To make the TrueNorth chip a good fit for deep learning, IBM had to develop a new algorithm that could enable convolutional neural networks to run well on its neuromorphic computing hardware. This combined approach achieved what IBM describes as “near state-of-the-art” classification accuracy on eight data sets involving vision and speech challenges. They saw between 65 percent and 97 percent accuracy in the best circumstances.
When just one TrueNorth chip was being used, it surpassed state-of-the-art accuracy on just one out of eight data sets. But IBM researchers were able to boost the hardware’s accuracy on the deep-learning challenges by using up to eight chips. That enabled TrueNorth to match or surpass state-of-the-art accuracy on three of the data sets.
The TrueNorth testing also managed to process between 1,200 and 2,600 video frames per second. That means a single TrueNorth chip could detect patterns in real time from as many as 100 cameras at once..." [IEEE Spectrum]
Power efficiency is quite brilliant on TrueNorth and makes it very worthy of consideration.

Brainchip's Spiking Neuron Adaptive Processor (SNAP)

SNAP will not do deep learning and is a curiosity without being a practical drop in CNN engineering solution, yet. IBM's stochastic phase-change neurons seem more interesting if that is a path you wish to tread.

Apple's Neural Engine

Will it or won't it? Bloomberg reports that it will, as a secondary processor, but there is little detail. Not only is this an important area for Apple, it also helps Apple avoid and compete with Qualcomm.


Cambricon

The Chinese Academy of Sciences has invested $1.4M in this chip effort. It is an instruction set architecture for NNs with data-level parallelism, customised vector/matrix instructions, and on-chip scratchpad memory. It claims 91 times the performance of an x86 CPU and 3 times a K40M, at 1%, or 1.695W, of the peak power use. See "Cambricon-X: An Accelerator for Sparse Neural Networks" and "Cambricon: An Instruction Set Architecture for Neural Networks."

Groq

Ex-Googlers have formed Groq Inc. Perhaps another TPU?


Deep Vision

Deep Vision is building low-power chips for deep learning. Perhaps these papers by the founders hold clues: "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing" [2013] and "Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing" [2015].

Deep Scale.

Reduced Energy Microsystems are developing lower power asynchronous chips to suit CNN inference. REM was Y Combinator's first ASIC venture according to TechCrunch.

Leapmind is busy too.


Microsoft has thrown its hat into the FPGA ring, "Microsoft Goes All in for FPGAs to Build Out AI Cloud." Wired did a nice story on the MSFT use of FPGAs too, "Microsoft Bets Its Future on a Reprogrammable Computer Chip"
"On Bing, which has an estimated 20 percent of the worldwide search market on desktop machines and about 6 percent on mobile phones, the chips are facilitating the move to the new breed of AI: deep neural nets."
I have some affinity for this approach. Xilinx and Intel's (née Altera) FPGAs are powerful engines. Xilinx naturally claim their FPGAs are best for INT8, and one of their white papers makes that case.

Both vendors have good support for machine learning with their FPGAs.
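The INT8 pitch rests on the standard quantisation recipe: scale FP32 weights into 8-bit integers and carry the scale factor along separately. A generic sketch (the textbook symmetric scheme, not Xilinx's specific one):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantisation of a weight tensor to INT8."""
    maxabs = float(np.max(np.abs(w)))              # map the largest weight
    q = np.clip(np.round(w * (127.0 / maxabs)), -127, 127).astype(np.int8)
    return q, maxabs / 127.0                       # keep the scale for later

w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
recon = q.astype(np.float32) * scale               # dequantise to check error
print(q.tolist(), float(np.max(np.abs(recon - w))))
```

The reconstruction error stays below half an INT8 step, which is why inference, unlike training, tolerates 8-bit arithmetic so well and plays to the FPGA's DSP-slice strengths.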

Whilst performance per watt is impressive for FPGAs, the vendors' larger chips have long had earth-shatteringly high prices. Xilinx's VU9P lists at over US$50k at Avnet.

Finding a balance between price and capability is the main challenge with the FPGAs.

One thing to love about the FPGA approach is the ability to make some quite wonderful architectural decisions. Say you want to improve your memory streaming of floating-point data by compressing it before it hits off-board DRAM or HBM and decompressing it in real time: there is a solution if you try hard enough, "Bandwidth Compression of Floating-Point Numerical Data Streams for FPGA-Based High-Performance Computing"
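That paper's scheme is more sophisticated, but the flavour of trading precision for memory bandwidth can be shown in a couple of lines, assuming you can tolerate simple bfloat16-style truncation: keep only the top 16 bits of each float32 and halve the bytes moved.

```python
import numpy as np

def compress(a):
    """Keep only sign, exponent and the top 7 mantissa bits of each float32,
    halving the bytes moved to external memory (the bfloat16-style trick)."""
    return (a.view(np.uint32) >> np.uint32(16)).astype(np.uint16)

def decompress(h):
    """Re-pad the low mantissa bits with zeros and reinterpret as float32."""
    return (h.astype(np.uint32) << np.uint32(16)).view(np.float32)

a = np.array([3.14159, -0.001, 1.0e6], dtype=np.float32)
b = decompress(compress(a))
print(float(np.max(np.abs((b - a) / a))))   # relative error below 2**-7
```

In hardware this is just wiring; the interesting FPGA work is doing it losslessly or adaptively at line rate.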

This kind of dynamic architectural agility would be a hard thing to pull off with almost any other technology.

Too many architectural choices may be considered a problem, but I kind of like that problem myself. Here is a nice paper on closing the performance gap between custom hardware and FPGA processors with an FPGA-based horizontally microcoded compute engine that reminds me of the old DISC, or discrete instruction set computer, from many moons ago: "Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT"


Trying to forecast a winner in this kind of race is a fool's errand. Qualcomm will be well placed simply due to their phone dominance. Apple will no doubt succeed with whatever they do. Nvidia's V100 is quite a winner with its Tensor units. I'm not sure I can see Google's TPU surviving in a relentless long-term silicon march despite its current impressive performance. I'm fond of the FPGA approach but I can't help but think they should release DNN editions at much cheaper price points so that they don't get passed by the crowd. Intel and AMD will have their co-processors. As all the major players are mucking in, much of it will come down to supporting standard toolkits, such as TensorFlow, and then we will not have to care too much about the specifics, just the benchmarks.

From the smaller players, as much as I like and am cheering for the Adapteva approach I think their memory architecture may not be well suited to DNN. I hope I'm wrong.

Wave Computing is probably my favourite approach after FPGAs. Their whole asynchronous data flow approach is quite awesome. It appears REM is doing something similar, but I think they may be too late. Will Wave Computing be able to hold their head up in face of all the opposition? Perhaps as their asynchronous CGRA has an inherent advantage. Though I'm not sure they need just DNNs to succeed as their tech has much broader applicability.

Neuromorphic spiking processor thing-a-ma-bobs are probably worth ignoring for now but keep your eye on them due to their power advantage. Quantum crunching may make it all moot anyway. The exception to this rule is probably IBM's TrueNorth thanks to its ability to not just do spiking networks but to also run DNNs efficiently.

For me, Random Forests are friendly. They are much harder to screw up ;-)
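In that friendly spirit, here is a from-scratch toy forest (real work would reach for a proper library): bootstrap samples, a random feature per tree, and majority voting is the whole trick, which is exactly why it is hard to screw up.

```python
import random

def train_forest(data, n_trees=25, seed=0):
    """Toy random forest: depth-1 trees (stumps), each trained on a
    bootstrap sample using one randomly chosen feature."""
    rng = random.Random(seed)
    majority = lambda ys: max(set(ys), key=ys.count) if ys else 0
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]      # bootstrap resample
        feat = rng.randrange(len(data[0][0]))          # random feature choice
        thresh = sum(x[feat] for x, _ in sample) / len(sample)
        left = majority([y for x, y in sample if x[feat] <= thresh])
        right = majority([y for x, y in sample if x[feat] > thresh])
        forest.append((feat, thresh, left, right))
    return forest

def predict(forest, x):
    votes = [l if x[f] <= t else r for f, t, l, r in forest]
    return max(set(votes), key=votes.count)            # majority vote

# Both features carry the signal: label is 1 when the first feature is large.
data = [([i, 9 - i], int(i > 5)) for i in range(10)]
forest = train_forest(data)
print(predict(forest, [0, 9]), predict(forest, [9, 0]))
```

Each individual stump is weak and noisy; the averaging over bootstrap samples is what makes the ensemble forgiving of sloppy tuning.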

Happy trading,


Wednesday, 28 June 2017

U.S. Equity Market Structure Part I: A Review of the Evolution of Today’s Equity Market Structure and How We Got Here

If you have three hours and forty-six minutes of time to kill then you should probably not watch the Committee on Financial Services testimony from earlier today, Huonville time:

This is the Committee web reference reproduced:

Hearing entitled “U.S. Equity Market Structure Part I: A Review of the Evolution of Today’s Equity Market Structure and How We Got Here” 
Tuesday, June 27, 2017 10:00 AM in 2128 Rayburn HOB 
Capital Markets, Securities, and Investment

Click here for the Committee Memorandum.
Witness List
Panel I
Panel II
I think Mr Larry Tabb summed up the prospects for reform nicely in one of his recent Market Structure Weekly video pieces,
"Tabb dissects the debate over US equities market structure and Reg NMS, and the difficulties in reaching a consensus."
That is, there is unlikely to be any consensus anytime soon. However, some rays of hope did appear. I interpreted there to be general support for:
  • getting more companies public;
  • tick size variation;
  • depth of market being added to SIP and further SIP improvements; and,
  • support for better disclosure on market performance and routeing.
Otherwise, most of the committee testimony pointed to differences of opinion. Even though most parties suggested a thorough review should take place, Mr Joe Saluzzi suggested this should not happen and only certain aspects should be reviewed. Mr Saluzzi spoke well but dropped a little clanger when he misled the committee and told them that SIP feeds could be used for pricing PFOF when that has been against regulation since Nov 2015.

The only truly bad behaviour was from Mr Brad Katsuyama. His referring to rebates as kickbacks, and his talk of $2.5B being syphoned off in kickbacks as part of a corrupt system, was at best an inconsiderate use of language and at worst libel. I've discussed this previously here:
Near the end, Mr Chris Concannon showed some backbone and started to dig into Mr Katsuyama's falsehoods with a muted degree of fury, but the time-pocketed format didn't really allow much debate.

The delusion IEX continues to suffer from is quite impressive. They really don't understand the harm they are doing to the market and their own customers. I've covered that to death in the past and it is getting tired, so here are the highlights from older meanderings:
Mr David Weisberger wrote a more pointed criticism of Mr Katsuyama's testimony, "ViableMkts ANNOTATION of the Testimony of Investors Exchange Chief Executive Officer Bradley Katsuyama."

You have to give credit to Mr Katsuyama, he really believes he is doing the right thing. He doesn't understand the harm he is doing:
  • Lack of price discovery via a preponderance of dark liquidity;
  • Speed-bump flaws that expose client orders to others before the clients receive their own notifications, enabling latency arbitrage in a way that is worse than at any other exchange;
  • Expensive transaction costs for the majority of their orders;
  • Complex order types instead of the "three only" simple order types they originally made the case for in "Flash Boys";
  • Unfairness of a lack of co-lo where traders can game POP access and get latency advantages;
  • The need for sophisticated infrastructure including multiple sites with RF or laser required to maximise information and minimise leakage from IEX;
  • A large degree of false positives from a poor one size fits all Crumbling Quote Indicator (CQI) that will lose priority too easily;
  • The CQI preventing the ability to trade in a moving market - increasing risk;
  • Excessive potential to miss fills and let the market move away;
  • Preventing innovation with the wrong kind of flawed innovation;
  • Misleading market statistics due to their dark reliance and lack of trade on market ticks;
  • Poor displayed liquidity, with only CHX having shorter queues, showing the difficulty with, and fragility of, IEX's displayed market; and,
  • MM-Peg latency issues, despite it being a post only order that is not expected to trade.
IEX has real problems, but not that you would know from their marketing.

On the consensus points, Mr John Comerford from Instinet chose to focus on the problem with the one-size-fits-all tick size. He pointed out that the current tick size was only really appropriate for a third of the market; tick sizes for other stocks were either too big or too small. This was a great focal point and one that didn't see much disagreement. Mr Tom Wittman supported the idea of "intelligent tick sizes" that Nasdaq had also raised at the last EMSAC. This is an obvious thing that needs to be done.
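To make the "intelligent tick size" idea concrete, here is a minimal sketch of one way such a schedule might work, sizing the tick as a rough fraction of a stock's typical quoted spread. The tick ladder and the quarter-of-spread target are entirely hypothetical illustrations, not Nasdaq's or any exchange's actual proposal.

```python
# Illustrative sketch only: an "intelligent tick" chosen per stock from its
# average quoted spread. The allowed ticks and the 1/4-spread target are
# invented for illustration.

ALLOWED_TICKS = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]  # dollars

def intelligent_tick(avg_spread: float) -> float:
    """Pick the largest allowed tick no bigger than ~1/4 of the average spread."""
    target = avg_spread / 4.0
    candidates = [t for t in ALLOWED_TICKS if t <= target]
    return candidates[-1] if candidates else ALLOWED_TICKS[0]

# A penny suits a stock spread of a few cents, but is too coarse for a
# tight large-cap and too fine for a wide small-cap:
for spread in (0.0008, 0.04, 0.30):
    print(spread, "->", intelligent_tick(spread))
```

The point of the sketch is simply that a one-line function per stock, rather than a single NMS-wide constant, is all "intelligent" needs to mean here.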

PFOF was politely contentious. Without tick size adjustment there is no real way that public exchanges can compete against dark sub-penny increments, including PFOF. Retail would be worse off if PFOF were simply eliminated, see:
Sub-pennies rule!
Another important point that seemed to engender consensus was the need for better information and analysis around routeing and trade reporting. That would be a good thing to move forward as too much remains in the dark or is too onerous to analyse.

I was a bit surprised by the olive branch that seemed to be held out by the exchanges on the SIP. There seemed to be non-opposition to adding depth to the SIP. That would be an advance. Perhaps it is a deferment to try to take the heat off their ongoing market data fee argument?

One exchange was a bit misleading with the idea that direct feeds from exchanges were subject to competition. That claim was a bit cheeky. Mr Saluzzi quite correctly disputed that idea. There remains much consternation around market data costs and fees. The exchanges will stoutly defend this territory.

Mr Ari Rubenstein from GTS made a host of decent points. The one that showed an unfortunate bias was the claim that the BATS closing auction would harm the market. That is a hard proposition to support. Mr Rubenstein's position as a large DMM at NYSE is an obvious conflict.

Mr Jeff Brown spoke well as a representative of Schwab but lost his way for a moment in the defence of PFOF. I'm not fond of PFOF but do accept that it delivers unassailable benefits to retail thanks to the sub-penny rule despite intermediation of best execution responsibilities. This should be better articulated. Mr Comerford's tick adjustments will be the way public exchanges assail that fortress for the public good, in time.

Mr Thomas Farley lost his way on listing standards which was understandable but otherwise handled most questions deftly. He raised an excellent point about not enjoying getting the blame for SEC deferment for proposals to committees. The exchanges expressed a preference for the SEC to do the work so the exchanges don't get the blame for unpopular decisions. It seems both the SEC and the exchanges would prefer the tenure that comes from being able to blame someone else. This perhaps needs a rebalance. Exchange SRO responsibilities were also contentious.

I disagreed with Mr Matt Lyons from The Capital Group on rebates but he put his case well. Mr Saluzzi agreed. Mr Katsuyama undermined this argument with his hysterical and harmful approach to demonising rebates with his silly kickback diatribe. Others made the strong case for the need for rebates for liquidity, especially for small to medium stocks.

There was a good conversation around the lack of companies going public. This certainly needs more attention but the difficulties in preventing the private market from gazumping the public market should not be underestimated. The "why bother" question is not easy. Forcing companies to be public is not realistic. Now that particular genie is out of the bottle, getting rid of impediments may help but it may be too late,
"the new higher level was not reduced when the fine was removed" [p 15]
The CAT received both support and disdain. I'm in a camp that says it is needed but I'm slightly horrified by the "Crazy CAT" implementation NMS has been lumbered with:
Crazy CAT approved by SEC.
Exchange resistance to market data feed expense mitigation in the face of overwhelming opposition looks like fair game for regulatory reform. PFOF should be a thing that goes away, but it needs to wait for the public market's ability to perform just as well, which for now it cannot.

It is going to be difficult to progress, especially with evident support for the alternate facts in "Flash Boys" from some of the committee questioning. Disinformation has a long shadow.

Where there is civil debate, there is hope.

Happy trading,


Monday, 26 June 2017

Finra ATS Tier 1 statistical update

As a few things are afoot, it may be handy to get our heads around the current anatomy of the US ATS market. Let's meander through this dark corner.

We'll just look at the statistics for tier one stocks as these are the most timely reports.

There is no change to the relatively stable rankings of the top three pools. UBS's ATS and CS's Crossfinder remain way out in front. DB had a poor week, with its #3 position in the greatest peril for some time; both JP Morgan and Barclays have been the closest to DB for some months.

Goldman Sachs' transition to their new ATS has largely been completed, with the newer platform rising ten places to #11 this week. KCG dropped three places to #14. LiquidNet H2O gained four spots. NYFX Cowen Exec Services dropped six places to #21.

In ATS news this week, it was announced that Instinet is purchasing State Street's ATS. You can see Instinet's current pool is ranked tenth with 105M shares traded, with State Street ranked twentieth. If they were combined, which is not being suggested yet, they would have a rank of #9. The big difference between the two is that State Street's pool has an average trade size of 12,482 shares compared to 229 for Instinet's current CBX pool.
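A quick back-of-envelope on that combination: the blended average trade size is total shares over total trades, not the mean of the two averages. The 105M CBX volume and the 229 and 12,482 average trade sizes are from above; State Street's weekly volume (taken here as roughly 24M shares, between the neighbouring ranks in the table) is an assumption for illustration as it isn't quoted.

```python
# Volume-weighted combination of two pools' average trade sizes.
# State Street's ~24M share volume is an assumed figure for illustration.

def combined_avg_trade_size(volumes, avg_sizes):
    """Total shares divided by total trades across the pools."""
    trades = sum(v / a for v, a in zip(volumes, avg_sizes))
    return sum(volumes) / trades

cbx_vol, cbx_avg = 105e6, 229    # Instinet CBX (figures from the post)
ss_vol, ss_avg = 24e6, 12_482    # State Street (volume assumed)

print(round(combined_avg_trade_size([cbx_vol, ss_vol], [cbx_avg, ss_avg])))
```

The blended figure lands near 280 shares, far closer to CBX's 229 than State Street's 12,482, because the small-ticket CBX flow utterly dominates the trade count.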

DealerWeb (360,125) and LiquidNet (40,853) lead the average trade block sizes.

Luminex's paltry 5.3M shares traded, and fifth last ranking, clearly demonstrates that markets require diversity. Markets work despite the motivations of the participants. That's their ultimate beauty. Diversity matters and homogeneity risks growth. Luminex only managed 162 trades for the week. I'm not sure you need technology beyond a notebook and pen for that. At least the average block size at 32,879 was high, being the third largest. This emphasises that liquidity is a carefully orchestrated dance of mutual benefits. A dance of offer, parry, hedge, replenish. Quite the tango that is oft misunderstood as war rather than for being the carefully calibrated artistry that it truly is.

Rank | MPID | ATS name                    | T1 share % | Volume  | Avg trade size
   1 | UBSA | UBS ATS                     |      17.61 | 486.2 M |     172
   2 | CROS | CROSSFINDER                 |      13.97 | 385.6 M |     189
   3 | DBAX | SUPERX                      |       7.14 | 197.2 M |     195
   4 | MSPL | MS POOL (ATS-4)             |       6.64 | 183.4 M |     260
   6 | LATS | BARCLAYS ATS ("LX")         |       5.90 | 163.0 M |     214
   7 | EBXL | LEVEL ATS                   |       5.59 | 154.4 M |     208
   8 | MLIX | INSTINCT X                  |       5.01 | 138.2 M |     228
   9 | BIDS | BIDS TRADING                |       4.47 | 123.5 M |     788
  11 | SGMT | GOLDMAN SACHS & CO. LLC     |       3.48 |  96.1 M |     203
  12 | ITGP | POSIT                       |       3.47 |  95.9 M |     308
  13 | KCGM | KCG MATCHIT                 |       3.31 |  91.3 M |     184
  14 | MSTX | MS TRAJECTORY CROSS (ATS-1) |       2.15 |  59.3 M |     177
  15 | XSTM | CROSSSTREAM                 |       1.58 |  43.6 M |     391
  16 | DLTA | DEALERWEB                   |       1.45 |  40.0 M | 360,125
  18 | CXCX | CITI CROSS                  |       1.13 |  31.2 M |     230
  19 | LQNA | LIQUIDNET H2O               |       0.92 |  25.5 M |  17,565
  22 | LQNT | LIQUIDNET ATS               |       0.86 |  23.7 M |  40,853
  23 | XIST | INSTINET CROSSING           |       0.64 |  17.8 M |   5,196
  24 | PDQX | CODA MARKETS, INC.          |       0.50 |  13.8 M |     230
  25 | CBLC | CITIBLOC                    |       0.31 |   8.6 M |  19,651
  26 | MSRP | MS RETAIL POOL (ATS-6)      |       0.26 |   7.0 M |     186
  28 | WDNX | XE                          |       0.05 |   1.3 M |   1,636
  29 | AQUA | AQUA                        |       0.02 |   0.6 M |   6,488
  30 | BCDX | BARCLAYS DIRECTEX           |       0.01 |   0.2 M |  29,471

(click to enlarge)

The top 5 pools represent over half the ATS volume traded. The top ten pools' collective share has been steadily rising to the current accounting of three quarters of all ATS volume. This was assisted by IEX's dark pool transitioning to being the SEC's first dark public exchange, with that transition corresponding to the short period of the largest rise.

(click to enlarge)

The average trade size of the top 15 pools mainly resides in the minimal 100-300 shares per trade range with only XSTM CrossStream and BIDS being the consistent larger exceptions. The largest pool, UBS, typically has the smallest average trade size as you may see in the following chart. You may note the strange red line in the bottom right of the chart representing the new Goldman Sachs platform leaping into life.

(click to enlarge)

That previous chart makes it a bit hard to see if any of the top pools, apart from BIDS, have increased their average trade size. An alternative view of the top ten pools below shows their average trade size for the week compared to their average trade size over time, to make it easier to see variations in size compared to their own normal:

(click to enlarge)

Well, the size variation was meant to be somewhat easier to understand in that chart, for some strange definition of easier.
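For concreteness, the normalisation behind that comparison can be sketched simply: each pool's weekly average trade size divided by its own long-run average, so a value above 1.0 means bigger-than-usual tickets for that pool. The sample history below is made up for illustration.

```python
# Week's average trade size relative to a pool's own historical norm.
# A value of 1.25 means tickets ran 25% larger than usual that week.

def relative_size(weekly_avg, history):
    """Weekly average trade size as a multiple of the pool's own mean."""
    baseline = sum(history) / len(history)
    return weekly_avg / baseline

# Hypothetical pool with ~200-share tickets printing 250 this week.
history = [190, 205, 198, 210, 197]
print(round(relative_size(250, history), 2))
```

Normalising each pool against itself is what lets a 200-share pool and a 40,000-share block pool share one chart without the block pool swamping the scale.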

There do seem to be too many pools and exchanges. I can't help but wonder if there shouldn't be tighter policing of the proliferation by treating the NMS space more like radio spectrum and considering the venue space as a scarce resource. The bad old days of NYSE dominance showed one exchange to rule them all was not the best idea, but surely the US does not need more than forty exchanges and ATS pools.

I also remain of the belief that the SEC should carefully consider the two types of pools we see in this ATS mix. There is quite a different utility to a large block trading pool and a pool with a small average trade size. They are different beasts. Perhaps the SEC needs to explicitly partition their rule space for such species.

I'm not sure a small average trade sized pool with lots of volume should exist for many years if it is not a public exchange. I'm biased against such parasitic pools due to their lack of participation in price discovery. Parasitic pools, like index funds, may have some utility but it should be clearly articulated what their efficiency or utility really is. It is not always clear what such low average trade size pools offer apart from being an embryonic step to being a public exchange. If there are some benefits gained by the low trade sized ATS pools due to easier rule enforcement then perhaps the rules for exchanges should be changed to allow the same efficiencies. If such rules aren't suitable for a public exchange, then perhaps they have no place for an ATS either.

Perhaps time limited ATS licenses should be granted for the low average trade sized ATS? Go big, or go home. Be an exchange in five years or stop clogging up NMS plumbing. All systems need a cleanse from time to time.

Happy trading,


OTC Transparency data is provided via and is copyrighted by FINRA 2017

Thursday, 22 June 2017

IEX MM-Peg follow up

It has been pointed out to me by more than one person that, though they are not fans of IEX, they would like to see the MM-Peg order allowed as submitted. I poured scorn on this order type here, "IEX's new order's unintended consequences."

My scorn stands but I understand the dilemma best captured by Mr Adam Nunes here,

The issue that this order type is addressing is maintaining a continuous presence in the market: the rather ridiculous requirement, set as a one hundred percent obligation, for official market makers in the US.

Now, this order type is not really ever expected to trade. It is close to a spoof in that regard except for the idea that you'd be happy if it did trade. Such a happy intention takes it away from being a spoof, but the silliness remains. That is, buying 8% below the NBBO or selling 8% above the NBBO would likely be welcome.

The issues around the timing of the order are real in that it may bake in a systemic advantage or disadvantage at that price level, far from the market where it doesn't really matter. This may then set a precedent allowing IEX to extend such a latency problem all the way to the BBO which would be a bigger problem.

The right answer would be for the SEC to only require market making obligations for some high but not crazy percentage of the time, say 95%. Then this order type, that is never expected to trade, would not be required. We need to fix the issues rather than skirt around the edges with such MM-Peg artifices.

I do wish we could stick to a small set of atomic primitives from which all order types may be created. Then participants could ignore the more complex order types if they chose to. Until then, we'll all have to be "puzzle masters."
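The atomic-primitives wish can be sketched in a few lines: express any order as a combination of a handful of orthogonal attributes rather than a bespoke named type per exchange filing. The attribute names and values below are purely illustrative, not any exchange's actual API.

```python
# Toy sketch of order types as compositions of a few orthogonal primitives.
# Attribute names and values are invented for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Order:
    side: str                   # "buy" or "sell"
    qty: int
    limit: Optional[float]      # None => market order
    peg: Optional[str] = None   # None, "mid", "primary", "market-maker"
    displayed: bool = True
    time_in_force: str = "day"  # "day", "ioc", "fok"

# Familiar named types fall out as combinations rather than new inventions:
market = Order("buy", 100, limit=None, time_in_force="ioc")
midpoint_peg = Order("buy", 100, limit=25.00, peg="mid", displayed=False)
mm_peg = Order("sell", 100, limit=None, peg="market-maker", displayed=True)

print(midpoint_peg.peg, mm_peg.displayed)
```

Participants who wanted simplicity could then ignore whole regions of the combination space, instead of having to reverse-engineer each exchange's latest bespoke order type.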

Happy trading,


Tuesday, 20 June 2017

IEX's new order's unintended consequences

IEX offered up for the SEC's consideration a new order type last week, "Proposed rule change to introduce a new market maker peg order."

The new Market Maker Peg Order, or MM-Peg, is not an unreasonable order type. I've long been on the record as opposing unnecessary order types and this fits that category. It is similar to order types on other exchanges. The innovation is limited. However, MM-Peg adds to the order proliferation pollution problem that IEX has long promised it would avoid. Here is an excerpt from Flash Boys concerning the puzzle masters,

(click to enlarge)

Back in 2014, IEX was promoting the idea of simple order types,
"Only four types of orders – IEX eschews certain types of orders that were created to accommodate the HFT crowd, such as the Post-Only order and “Hide Not Slide” order. Instead it offers only four basic types of orders – market, limit, Mid-Point Peg, and IEX Check (Fill or Kill). The Mid-Point Peg gives the investor a price between the current bid and offer for the stock."
Well, we've moved on from there with the Discretionary Peg and its complex conditions and changing formulae with its high false positive rates. The crumbling quote factor has been added to the Primary Peg. And now we behold the MM-Peg, a displayed peg that has priority over non-displayed. Not a big deal in itself as it is just a small incremental extension. A bit of an outhouse, really. IEX is simply replicating the same utility payoff for order type development that has got us into this NMS order type mess in the first place. All may make sense in isolation, but who wants to fly in such an NMS Rube Goldberg contraption?

IEX is no different from other exchanges with such order type development. The market's order proliferation problem needs some kind of "START"-style agreement under which these arms are controlled. The only real beneficiaries of the current proliferation are the sophisticated market participants that have the resources and skills to puzzle out all the order types and apply them to their problems as solutions. HFTs might just fit into that category. IEX's biggest traders are HFTs. This may be the outcome they are looking for.

So, this little bit of hypocrisy on IEX's part cuts a little deep to their core values. This order proliferation has long been something the "puzzle masters" have protested loudly against. Not a big deal as a piece of incrementalism but, nevertheless, surprising, as order type proliferation is a real problem to which IEX is succumbing. Why is it surprising? Well, IEX has railed against a number of things, such as rebates and co-location, both of which may actually benefit markets, and yet on order types they continue to transgress their own values. Curious.

The big issue

The main issue I see with the MM-Peg is that it may bake in a strategic advantage in latency for particular types of customers: "The Market Maker Peg Order would be limited to registered market makers" [page 6].

I read it that the repricing still has to go through some guise of 350-microsecond delay, perhaps even the original magic shoe box,
"Furthermore, pursuant to Rule 11.190(b)(13), each time a Market Maker Peg Order is automatically adjusted by the System, all inbound and outbound communications related to the modified order instruction will traverse an additional POP between the Market Maker Peg Order repricing logic, and the Order Book."
However, this isn't the problem directly. The problem is how the latency may compare to co-located access from NY5 where the POP is. That is, how does it compete against the exchange's own customers?

The exchange's network architecture should have reasonably good/low jitter due to it being 10G Ethernet. It is hard to do that really badly, so let's assume IEX haven't stuffed that up. The latency difference between customers in NY5 on the customer facing side of the POP and the internal MM-Peg repricing mechanism may then be implied to be significant for all or some set of customers. That is, significant in terms of expected jitter. That difference may be advantageous or disadvantageous due to those reified architectural differences. That is, the timing is largely baked in.
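The comparison at stake can be phrased as a measurement: does the internal MM-Peg repricing path's latency distribution sit above or below a co-located customer's path through the POP, and by how much relative to jitter? A hypothetical illustration, with all numbers invented:

```python
# Invented latency samples (microseconds) for two paths; the question is
# whether one path's distribution sits systematically above the other's
# relative to its jitter. Numbers are made up for illustration only.

def p(samples, q):
    """Crude nearest-rank percentile."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(q * len(s)))]

colo_path = [41, 42, 42, 43, 44, 45, 47, 50, 52, 60]     # co-located customer
mmpeg_path = [38, 39, 39, 40, 40, 41, 41, 42, 43, 44]    # internal repricing

for name, xs in (("colo", colo_path), ("mm-peg", mmpeg_path)):
    print(name, "median:", p(xs, 0.5), "p90:", p(xs, 0.9),
          "jitter:", p(xs, 0.9) - p(xs, 0.5))
```

If the gap between the two medians is large compared to the jitter, as in this made-up sample, the advantage or disadvantage is effectively deterministic, which is exactly the baked-in structural concern.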

If MM-Peg were to have a latency benefit, that would be bad, as you would be forced to use it and eschew other order types, but only if you could: you may not be a registered market maker and so be at a structural disadvantage. The other side of the coin is a baked-in disadvantage, implying you never want to use an MM-Peg. Then again, on some day it may magically improve due to some technical rejigging. What if it changes without you knowing and suddenly your trading is at a surprising disadvantage? I'm imagining Haim Bodek breathing fire. I agree with him. This is a poor situation.

Either faster or slower is problematic for IEX's customers. It is a no-win situation - caveat emptor.

And, just to add fuel to the fire, SIP customers may be notified of the requotes before the IEX customer waiting at the IEX POP.

Happy trading,



Note: much ado about nothing. This is all about an order type that lives "at least" 8% (IEX Rule 11.151(a)(6)) away from the BBO if it is an S&P 500 or Russell 1000 name. Busy work that is an unprincipled precedent. You have to wonder why they'd bother with it.

PS: Kipp Rogers points out that it was only three order types back in the Flash Boy daze of 2014: