Tuesday, 12 December 2017

That's not a bubble. That's a bubble...

(With apologies to Crocodile Dundee)

You think Bitcoin is a bubble?

I'll show you a bubble:


(source, alternative source, Shiller)
"Just kids having fun." Bitcoin has a $100-400B market cap, depending on the precise second. Global markets are about to cross the $100,000 Billion threshold. 

Focus.

Happy trading,

--Matt.

PS: Did you notice 1987?

Monday, 11 December 2017

Curious Bitcoin curiosities

Time for a cup of reality?
Say yesterday, you meandered into a diner near Wall and Broad as you plotted your future Bitcoin futures strategy. It was a rare diner that takes bitcoin for food. Your lucky day!

You get your Morning Joe and it costs you three buckaroos.

Magic happens. You pay by bitcoin.

Or did you?

Did you wait the average ten minutes for your transaction to be processed?

I thought not. It wasn't a real bitcoin transaction. It was just a promise, probably through someone like Coinbase, the big mama of bitcoin broking. There's nothing wrong with that; it is just not really the libertarian free-wheeling anarchy that most people think it is. It's just another layered financial service.

If your underlying vendors comply with Know Your Customer (KYC) and Anti-Money Laundering (AML) regulations then you're known, watched, traced, observed. Every bitcoin transaction is completely traceable up until the exit points stray into illegal anonymity. Tumblers may tumble, but your Merkle tree with its double SHA-256 is an inviolable truth despite its leading zeros.

That ten minutes you would have to wait is baked into the system design. If miners find blocks faster than every ten minutes, the difficulty retargets (every 2016 blocks) and they are asked to find more leading zeros to restore the ten-minute average. That is how it works.
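For the mechanically minded, here is a toy sketch in Python of how that works. The real protocol works against a compact "bits" target rather than a literal count of zeros, and clamps the retarget, so treat this as illustrative only:

```python
import hashlib

def block_hash(header: bytes) -> bytes:
    # Bitcoin hashes the 80-byte block header twice with SHA-256
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

def meets_target(header: bytes, target: int) -> bool:
    # A block "wins" when its hash, read as a big integer, is below
    # the target -- smaller targets mean more leading zero bits to find
    return int.from_bytes(block_hash(header), "big") < target

def retarget(old_target: int, actual_seconds: int) -> int:
    # Every 2016 blocks the target is rescaled so the window takes
    # two weeks again: faster miners => smaller target => more zeros
    expected = 2016 * 600  # 2016 blocks at ten minutes each
    return old_target * actual_seconds // expected

# If the last 2016 blocks took only one week, the target halves
# (difficulty doubles), pushing the average back toward ten minutes.
old = 1 << 220
new = retarget(old, 2016 * 300)
assert new == old // 2
```

The miners' slot machine is just `meets_target` run over and over with different nonces in the header.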

Do you know how much your $3 coffee cost in fees?

Let's look at that last 24 hours:

Capture from https://blockchain.info/stats
Yep, the cost per transaction was $121.55 in fees yesterday. Your coffee cost $3 plus $121.55, for a total of $124.55. And that excludes the processing fee, which may have been absorbed by the retailer. Makes retail FX spreads look like a gender-neutral young choir participant, no?

But, but, you didn't really pay that much, I hear you say? That's right. That is the clever trick. Whilst the miners earned nearly $40M yesterday for processing just 160 blocks of wee little hashes, only 18% of that was the optional fee embedded in the protocol; the vast majority of the rich rewards were just new issuance. Free bitcoin for all lucky miners! A dilutive gift from the protocol God. An economics PhD is not required for further analysis - just a couch.
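The couch-based analysis takes only a few lines. These are the rounded figures from above, so back-of-the-envelope only:

```python
# Back-of-the-envelope split of the day's miner revenue, using the
# rounded figures quoted above (blockchain.info numbers, not gospel)
daily_revenue = 40_000_000    # USD earned by miners that day
fee_share     = 0.18          # portion that was optional transaction fees
blocks        = 160           # blocks mined that day
cost_per_tx   = 121.55        # blockchain.info's "cost per transaction"

fees_paid   = daily_revenue * fee_share      # ~$7.2M paid by users
new_issue   = daily_revenue - fees_paid      # ~$32.8M freshly issued coin
per_block   = daily_revenue / blocks         # ~$250k for each block won
implied_txs = daily_revenue / cost_per_tx    # ~329k transactions that day

print(f"fees ${fees_paid:,.0f}, issuance ${new_issue:,.0f}, "
      f"${per_block:,.0f} per block, ~{implied_txs:,.0f} transactions")
```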

Look at the awesome power of the bitcoin processing:

Bitcoin Hashrate on a log scale from https://blockchain.info/charts/hash-rate?timespan=all&scale=1
That's right, 12.653 billion gigahashes per second. Colour me impressed! That is, a bunch of hardware throwing random bits at a hashing function to combine with the tree that is called a chain, so that 160 blocks could complete during that day. Pesky little hard-to-find runs of zeros. It takes about 2,500 binary operations to do a SHA-256 double hash and see if you have enough zeros on the left of the slot machine to be a winner. That makes a supercomputer doing roughly 31.6 x 10^21 binary operations a second, or about 17.1 x 10^24 ops per block. Phew. That's 17.1 yotta-ops, ~17,100,000 exaops, or 17,100,000,000,000,000,000,000,000 ops, to process a block.
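If you want to check my yotta arithmetic, here it is, using this post's own rough figure of 2,500 binary ops per double hash:

```python
# Reproducing the back-of-the-envelope above (figures from the text)
hashrate_ghs   = 12.653e9                # 12.653 billion GH/s
hashes_per_sec = hashrate_ghs * 1e9      # ~1.27e19 double hashes / s
ops_per_hash   = 2500                    # rough binary ops per double SHA-256
ops_per_sec    = hashes_per_sec * ops_per_hash    # ~3.16e22 ops / s
secs_per_block = 24 * 3600 / 160         # 160 blocks that day -> 540 s each
ops_per_block  = ops_per_sec * secs_per_block     # ~1.71e25 ops per block
```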

Puts the power bill into perspective, no?

(don't click to enlarge this one)
Feel the POWER ... bill!


The email world thought about proof-of-work before bitcoin came along. It was a thought bubble about preventing spam: perhaps if you make sending an email cost a tiny amount, spammers sending millions of emails might think twice? Hmmm. Fortunately, we are not so burdened with our email.

Well, at least it is secure?

There are some shady and simply unknown characters in the mining industry. It is best to keep your head down rather than invite regulation if you can, so let's emphasise unknown.

In bitcoin land, the majority and the longest chain, or tree branch, or directed acyclic graph, or whatever the thingy with the hash in the Merkle tree wants to be called today, is the law. The most popular and biggest wins. As long as there are lots of people it is hard to overrule the majority opinion. So how many parties would it take to subvert the transaction truth?

Four:

Mining pool share for the last 24 hours
(source: https://blockchain.info/pools) 

Four is enough for a majority. I'm sure they're fine upstanding citizens, all with an AAA rating from the Moodster and Poor Standards no doubt. In the past, there were times when a single dudette controlled the majority of the mining flow - more than 50% of the power. Bitcoin survived that despite the violation of its safety assumption. It shouldn't have to trust the humanity of one; that was never the point.
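A toy calculation shows how few pools it takes. The shares below are illustrative only, not the live blockchain.info numbers, and the hypothetical `smallest_majority` helper simply greedily stacks the biggest pools until they pass 50%:

```python
# Hypothetical pool shares (illustrative only -- the real 24h numbers
# move around; see blockchain.info/pools for the live chart)
pools = {"PoolA": 0.25, "PoolB": 0.14, "PoolC": 0.10, "PoolD": 0.09,
         "PoolE": 0.08, "others": 0.34}

def smallest_majority(shares: dict) -> list:
    # Greedily take the biggest pools until they exceed 50% of hashrate
    coalition, total = [], 0.0
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        if name == "others":
            continue  # "others" is many small pools, not one actor
        coalition.append(name)
        total += share
        if total > 0.5:
            break
    return coalition if total > 0.5 else []

print(smallest_majority(pools))  # four pools needed for a majority
```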

It is a bit hard to work out a discounted cash flow model on all of this, especially as we don't know how it ends. Or rather, we know it does end: no more coins will be issued, and there will be no block-reward incentive to mine. There is a finite horizon, the bitcoin event horizon, its own y2k or 2038 problem, with the twist that there is no tomorrow.

Not to worry, your bitcoin will likely be stolen before then. IBM and others are hoping quantum supremacy will arrive next year. Maybe not, but it is not decades away. Whilst your hashes will be safe, your wallets will be open for all to pilfer thanks to the quantumtastically hackable public-key encryption that protects your transactions. Even the ASX's so-called perfect forward secrecy will be broken by the very same Digital Asset monster.

At least everyone has their own copy of the database? Well, it grew to over 100GB in 2016. Feels efficient that everyone should have a copy, no? No wonder the financial system keeps growing, from 2% of the economy a bit over a hundred years ago to the 6-10% of GDP it represents in most modern states. Fintech be damned. We keep finding inefficient ways to shoot ourselves in the foot.

So, now your cup of coffee is making someone else rich with a C note of fees; the transaction could be forged in its recording; its processing fate rests on four players from a self-appointed few agreeing not to collude; you're polluting the world and killing people via your excess electrickery consumption; and you're supporting a criminally inefficient transaction system, with bloated databases replicated everywhere, that is favoured by child pornographers, mobsters, cyber hackers, and terrorists. You and your bloody morning coffee!

*Shrug*

In cryptography we trust. In Bitcoin we place our faith. Pass on the ICO. Amen.

Happy "efficient" trading,

--Matt.

PS: Worth reading on whether your exchange is for real: Kipp Rogers on Bitcoin exchanges, "Some Questions for Crypto-Exchanges"

Monday, 4 December 2017

Is IEX a Giffen Good?

I'll kill the suspense on this rhetorical question: no, IEX is not such a good. It is more of an evil.

IEX behaves a little like a Giffen Good though. The price is more expensive than other offerings but the demand continues to rise. It remains dark and expensive, like a restaurant, with around 81% of the order flow to IEX not being displayed in November.


IEX Lit / total handled %


November 19.0%
October 18.2%
September 20.0%
August 20.6%
July 20.3%
June 20.0%
May 17.5%
April 18.7%
March 19.8%
Not much is actually displayed and traded at this so-called public exchange

It's a pretty poor outcome for a public exchange. I'd hope the SEC finds it an embarrassment. Public exchanges should promote price discovery. IEX's Dark Fader subverts price discovery.

Whilst IEX's market share remains small, it continues to rise modestly, despite the luxury pricing, in true Giffen Good style:

IEX market share
(Source data: IEX)
IEX's growth still shows that higher-volume days have fewer displayed transactions. The darkness grows. Jedis required.



In some ways, it is unsurprising, as the execution quality is crap, er, less than ideal for displayed orders.

The marketing remains the strength of IEX. It promotes fairy tales such as short queues being great for your trading. IEX neglects to mention that queues are even shorter at CHX. Schmarketing.

IEX trumpets its execution quality via the stats on BATS but fails to mention that its largely dark execution is not comparable via such a methodology. Scratch the surface and the execution quality of IEX is not just bad, it is shockingly bad.

In a galaxy far far away, Kipp Rogers pointed out that IEX's customer and trade concentration allowed some strong possibilities of trade identification, "Pershing Square and Information Leakage on IEX." But it gets worse. It turns out that traders at BATS, NYSE, and Nasdaq co-lo facilities are likely to know about IEX trades before IEX's own customers do. This is by design: the SIPs are faster at disseminating quotes than IEX's infamous magic shoebox delay. IEX continues to promote the idea that they protect customers from latency arbitrage when they are actually the exchange that allows the greatest latency arbitrage possibility. Schmarketing...

In the mostly fictional Flash Boys, Michael Lewis promoted the two Netscape Jims, Barksdale and Clark, and their investment in IEX, as well as the Barksdale family business, Spread Networks. Forbes reported that Jim Barksdale, a key figure in Lewis' earlier book "The New New Thing", invested $225 million in Spread, with outside financing suggested to be $75 million,
"Spread won't disclose cost, but Jason Cohen, the chief operating officer of Allied Fiber, which is building a nationwide network, says laying cable through easy terrain runs $200,000 per mile. Half of Spread's route, however, is through tough virgin terrain, pushing forbes' estimate of its cost toward $300 million. Jim Barksdale put up all of the capital other than $75 million financed by outside investors." (Source: Forbes Sep 2010 "Wall Street's Speed War")
Last week, Zayo picked up Spread Networks for $127 million. This may be less of a loss than it appears, as Spread had some rather lucrative and long contracts signed by firms that were not always aware that microwave was faster. Also, it has not been disclosed whether the purchase included any liabilities. Still, Spread doesn't sound like the best-performing investment, despite the Lewis marketing machine.

IEX's expensive trading fees may provide a good deal of comfort to IEX investors. Imagine paying ten times the price for worse execution quality. If you can pitch that successfully to customers and provide them with a story that they won't be latency arb'd by others and yet make the latency arb worse, you just might go down with the Wolf on Wall Street as the ultimate pen marketer. Schmarketing.

Then imagine the slightly ridiculous message that having short queues, i.e. no one to trade with, is good for your trading because you won't be at the back of the queue. It's hard to imagine this fictional world in which other exchanges don't exist.

Mr David Weisberger points this out on his blog, "IEX ignores DATA (again) to market their exchange,"
"Notice that IEX is dead last in executed percentage, with a fill rate of 1.8%, more than 75% below NYSE and 65% lower than NASDAQ.  Readers should note that this is the best data available[1] as it counts only shares accepted by the markets which are priced at the NBBO when the order was received."
Mr Weisberger also points out that IEX has the largest effective spread by one apples-to-apples measure in "The Not-So-Amazing IEX “Flea Circus” Continues",

Exchange                    Effective/Quoted Spread    Executed Shares Reported
New York Stock Exchange     94                         5,998,168,176
Arca                        93.6                       3,569,541,813
Nasdaq                      93                         3,494,330,238
Bats                        94.3                       2,726,036,147
IEX                         98.3                       497,251,536
"This data, provided by BestXStats, is based on all marketable orders in NYSE listed stocks sent to the exchanges and reported pursuant to Rule 605 in July. It only includes orders without trading restrictions that are from 100 to 9,999 shares. This is the most relevant metric for determining execution quality, as it is the only “apples to apples” comparison between the exchanges. Considering that IEX is materially inferior to the other primary exchanges in execution quality in NYSE listed stocks..."
Mr Weisberger does a pretty good job on showing why you really shouldn't route to IEX if you take your best execution obligations seriously and wish to survive an SEC audit in "IEX Marketing Sinks to a New Low",
"Additionally, IEX only has displayed liquidity 10% of the time when the inverted venues don’t have displayed quantity.  This implies that their displayed liquidity is not competitive with the listing exchanges, which additional analysis confirms.  To underscore this point, according to data provided by MayStreet and analyzed by ViableMkts, IEX had a total NBBO participation metric of 0.48% for the month of June in NYSE Listed Securities.  This compares to 30.76% for the NYSE, 8.92% for Nasdaq, 8.39% for ARCA, and a combined 10.95% for the combination of EDGX and Bats.  For Nasdaq listed equities, IEX has a NBBO participation metric of 0.77% compared to 33.91% for Nasdaq, 9.52% for ARCA, 11.69% for EdgX and 5.13% for Bats.   (as described earlier this week, this metric is derived for each exchange by multiplying the percentage of time spent at the NBBO for each exchange by each exchange’s displayed volume when at the NBBO, and dividing that by the aggregate average displayed volume at the NBBO."
Schmarketing.

Ah, that feels better. It's been a while since I've had a whinge about IEX. I guess my inner engineer just doesn't like schmarketing.

Happy trading,

--Matt.
_________________

Older IEX meanderings

Sunday, 13 August 2017

IEX fee regression

Public exchanges are meant to promote efficient price discovery and risk management. IEX thwarts such efficiencies.

IEX's new fee scheme further damages both price discovery and risk management. Let's meander through this.

The Fall of Icarus, 17th century, Musée Antoine Vivenel
Here is the filing IEX lodged for the fee change:  SR-IEX-2017-27,
"a proposed rule change to increase the fees assessed under specified circumstances for execution of orders that take liquidity during periods when the IEX System has determined that a “crumbling quote” exists" [p3]
That is, IEX is hiking the fee for taking prices (erroneously called taking liquidity) to the maximum fee allowed by the SEC, $0.0030 per share, when their crumbling quote indicator (CQI) goes off: 350 microseconds plus a two-millisecond window, or 2.35 milliseconds into the future from your external point of view.

Previously, IEX used rebates, a subsidised price of zero in this case, for displayed price-taking orders that complied with particular volume constraints. Well, kind of: if you hit a displayed price, yes, free; but maybe, maybe not, if hitting non-displayed prices. Specifically, taking non-displayed prices costs $0.0009 unless,
"Taking Non-Displayed Liquidity with a Displayable Order and at least 90% of TMVD was identified by IEX as Providing Displayed Liquidity (i.e., the Member’s execution reports reflect that the sum of executions with Fee Code L and a Last Liquidity Indicator (FIX tag 851) of '1' (Added Liquidity), divided by the sum of executions with Fee Code L, is at least 90% for the calendar month​)"  [IEX web]
At least that is the old text. IEX has also filed a rule change for this to be specific to a particular MPID: SR-IEX-2017-25,
"Taking Non-Displayed Liquidity with a Displayable Order and at least 90% of TMVD, on a per MPID basis, was identified by IEX as Providing Displayed Liquidity (i.e., the Member’s execution reports reflect that the sum of executions with Fee Code L and a Last Liquidity Indicator (FIX tag 851) of '1' (Added Liquidity), divided by the sum of executions with Fee Code L, is at least 90% for the calendar month)" [p20]
The definition of "TMVD" was also changed to include an MPID reference,
""TMVD" means total monthly volume displayable calculated as the sum of executions from each of the Member's MPID’s (on a per MPID basis) displayable orders during the calendar month." [p19]
The MPID change is to be effective from September 1st, 2017.
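For the curious, the 90% test quoted above boils down to something like the following sketch. The field names are illustrative, not IEX's actual schema, and I count fills rather than shares for simplicity:

```python
# A sketch of the 90% displayed-liquidity test quoted above, applied
# per MPID per calendar month (field names are hypothetical)
def qualifies_for_free_take(executions: list) -> bool:
    # executions: one MPID's fills for the month, each a dict with
    # 'fee_code' and 'last_liquidity' (FIX tag 851: '1' = added)
    fee_l = [e for e in executions if e["fee_code"] == "L"]
    if not fee_l:
        return False
    added = sum(1 for e in fee_l if e["last_liquidity"] == "1")
    # Free taking only if >= 90% of Fee Code L fills added liquidity
    return added / len(fee_l) >= 0.90
```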

Interestingly, in SR-IEX-2017-25, IEX admits it has been charging members incorrectly, as it had been using an MPID-based formula all along instead of the published and approved member-based method. Did IEX report the billing violation to the SEC as a separate event? Should the SEC step in and fine IEX for incorrect billing?
"IEX reviewed Member invoices since its launch as an exchange in August 2016 through June 30, 2017 to assess whether any Members were charged fees that differed from those described in the Fee Schedule. In other words, IEX recalculated the Non-Displayed Match Fee and the 90% threshold exception on a “per Member” basis (which is how the Fee Schedule currently reads) instead of on a “per MPID” basis (which is how IEX in practice had been calculating that fee). This assessment identified that nine Members were charged such differential fees in particular months, in some cases more than the fees described in the Fee Schedule and in some cases less than the fees described in the Fee Schedule. In total, seven Members were charged and paid $18,948.54 in excess fees and eight Members were not charged $44,175.28 in fees that should have been charged. Five Members were overcharged and undercharged in different months." [p14]
To add insult to injury, IEX is going after those people it has been incorrectly undercharging for the last twelve months. Bumper bills in September,
"IEX will charge..each impacted member for the net amount..underpaid and will be included in the August 2017 monthly invoices to be sent in September 2017" [p14]
I'm not sure how I'd define great customer service, but this would not be it.

Let's meander back to the main issue of charging the maximum fee possible under CQI conditions. It is not quite as simple as just charging the maximum fee whenever the CQI fires; IEX applies some threshold relief. The wording is a little poor for a formal document, but the idea is that the big fee applies if you do at least one million shares a month, and then only to the taken prices in excess of 5% of your total executions, measured on an MPID basis.
"At the end of each calendar month, executions with Fee Code Q that exceed the CQRF Threshold are subject to the Crumbling Quote Remove Fee. Otherwise, to the extent a Member receives multiple Fee Codes on an execution, the lower fee shall apply."
" “CQRF Threshold” means the Crumbling Quote Remove Fee Threshold. The threshold is equal to 5% of the sum of a Member’s total monthly executions on IEX if at least 1,000,000 shares during the calendar month, measured on an MPID basis."
"Executions with Fee Code Q that exceed the CQRF Threshold are subject to the Crumbling Quote Remove Fee."
Apart from trying to make NYSE American's task harder, IEX's goal is to prevent adverse selection against price providers.

IEX reports,
"Across all approximately 8,000 symbols available for trading on IEX, the CQI is on only 1.24 seconds per symbol per day on average (0.005% of the time during regular market hours), but 30.4% of marketable orders are received during those time periods, which indicates that certain types of trading strategies are seeking to aggressively target liquidity providers during periods of quote instability. " [p26]
That is, IEX is looking to dramatically increase fees on 30.4% of marketable orders. If you read IEX's statement above you might find yourself nodding along. IEX overstates this. Remember, the CQI applies for 2,000,000 nanoseconds after it is triggered. When the CQI is a true positive, this means that if you want to trade on IEX when the price changes, then you pay a premium.

That is, IEX applies the highest price it legally can to discourage trading around the time the price changes. That is a harsh penalty that impairs the efficiency of both price discovery and risk management. I guess it is just the important times, those times prices change. Why would a trader want to trade at important times? Such an attitude goes against the explicit goals the SEC has memorialised many times with regard to the purpose of the National Market System. Then again, if you're a Franken-pool that prefers dark trading, why not further spread a microstructure that is destructive to the public market's interest?

Another important feature of such a beast is that you can't always really decide in advance if your order will be subject to the CQI as IEX has the benefit of last-look, or looking into the future, within the exchange. You may only know after the event that a CQI applies, but not as you place the trade. Trading with an unknown fee may be less than optimal for some institutions. Best execution obligations are certainly harder. Perhaps it is best not to trade at IEX if you may be inadvertently violating best-ex.

Another amusing aside to the silliness of it all comes from the poor implementation of the CQI. When IEX changed to their new IEX Signal implementation of the CQI, they reported,
"On our example day of December 15, 2016, ... This new candidate formula would have produced about 2 million true positives and 2.1 million false positives." [The Evolution of the Crumbling Quote Signal, Allison Bishop, p28]
IEX has a pretty dumb one-size-fits-all CQI implementation that has more false positives than true positives - according to IEX itself. False-positive domination means the CQI is normally fake news. That is, the majority of the time IEX charges you the SEC's maximum legal fee, its rationale is invalid. You couldn't make this up if it wasn't true.
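The precision arithmetic, from IEX's own numbers:

```python
# IEX's own figures for 15 December 2016: the signal fired wrongly
# more often than it fired rightly
true_pos  = 2_000_000
false_pos = 2_100_000
precision = true_pos / (true_pos + false_pos)
print(f"{precision:.1%} of CQI firings were real")  # less than half
```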

There are also two humorous outcomes relating to IEX's routing implementation. IEX's routed orders may be subject to the excessive CQI fee for a marketable order, particularly large institutional orders. Even funnier, it may be a best-execution requirement that, if there are shares available elsewhere, the IEX router should not route to IEX, as avoiding IEX's high price-taking fees saves clients money. If IEX does not comply, then you'd hope the SEC will take action against IEX's violation of best-ex obligations. It would be funny to see IEX fined for routing orders to itself. That'd definitely be worth a chuckle.

IEX leaks information by design. It is more subject to latency arbitrage due to its SIP leakage and lack of fair co-lo. It not only prevents trading at price change time with its dark fading orders but now wants to discourage price discovery and risk management with high fees for when trading is needed most - at times of change. What happened to simple and few order types with simple and transparent pricing?

The IEX cult is becoming a lot more like the Flat Earth Society. I wonder if the mainstream media will ever call out IEX's misleading hypocrisy for the hubristic bullshit it truly is?

Happy trading,

--Matt.

_________

PS: IEX, instead of being greedy by taking the fee for yourself, you could generously provide some of it to the price provider. Would that be a compensatory rebate or a kickback?

Friday, 30 June 2017

FPGAs and AI processors: DNN and CNN for all

Here is a nice hidden node from a traditional 1990's style gender identification neural net I did a few weeks ago.

A 90's style hidden node image in a simple gender identifier net
Source: my laptop
My daughter was doing a course as part of her B Comp Eng, the degree after her acting degree. Not being in the same city I thought maybe I could look at her assignment and help in parallel. Unsurprisingly, she didn't need nor want my help. No man-splaining necessary from the old timer father. Nevertheless, it was fun to play with the data Bec pulled down on faces. Bec's own gender id network worked fine for 13 out of 14 photos of herself fed into the trained net. Nice.

I was late to the party and first spent time with neural nets in the early nineties. As a prop trader at Bankers Trust in Sydney, I used a variety of software including a slightly expensive graphical tool from NeuroDimension that also generated C++ code for embedding. It had one of those parallel port copy protection dongles that were a pain. I was doing my post-grad at a group at uni that kept changing its name from something around connectionism, to adaptive methods, and then data fusion. I preferred open source and the use of NeuroDimension waned. I ported the Stuttgart Neural Network Simulator, SNNS, to the new MS operating system, Windows NT (with OS/3 early alpha branding ;-) ), and briefly became the support guy for that port. SNNS was hokey code with messy static globals but it worked pretty fly for a white guy.

My Master of Science research project was a kind of cascade correlation-like neural net, Multi-rate Optimising Order Statistic Equaliser (MOOSE), for intraday Bund trading. The MOOSE was a bit of work designed for acquiring fast LEO satellite signals (McCaw's Teledesic), repurposed for playing with Bunds as they migrated from LIFFE to DTB. As a prop trader at an investment bank, I could buy neat toys. I had the world's fastest computer at the time: an IBM MicroChannel dual Pentium Pro 200MHz processors plus SCSI with some megabytes of RAM. Pulling 800,000 points into my little C++ stream/dag processor seemed like black magic in 1994. Finite differencing methods let me do oodles of O(1) incremental linear regressions and the like to get 1000 fold speed-ups. It seemed good at the time. Today, your phone would laugh in my general direction.
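For the curious, the finite-differencing trick looks something like this: keep running sums over a sliding window and the regression update becomes O(1) per tick regardless of window length. A sketch of the idea, my reconstruction rather than the original MOOSE code:

```python
from collections import deque

class RollingRegression:
    """Least-squares slope/intercept over the last n points, O(1) per tick.

    The running sums are finite-differenced: add the incoming point,
    subtract the point falling off the window, never re-scan the window.
    """
    def __init__(self, n: int):
        self.n = n
        self.window = deque()
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x: float, y: float):
        self.window.append((x, y))
        self.sx += x; self.sy += y
        self.sxx += x * x; self.sxy += x * y
        if len(self.window) > self.n:
            ox, oy = self.window.popleft()
            self.sx -= ox; self.sy -= oy
            self.sxx -= ox * ox; self.sxy -= ox * oy

    def slope_intercept(self):
        k = len(self.window)
        denom = k * self.sxx - self.sx * self.sx
        if denom == 0:
            return 0.0, self.sy / k if k else 0.0
        slope = (k * self.sxy - self.sx * self.sy) / denom
        return slope, (self.sy - slope * self.sx) / k
```

Feed it a tick stream and each update touches four sums instead of re-fitting the whole window; that is where the 1000-fold speed-ups came from.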

There was plenty of action in neural nets back in those days. Not much of it was overly productive but it was useful. I was slightly bemused to read Eric Schmidt's take on machine learning and trading in Lindsay Fortado and Robin Wigglesworth's FT article "Machine learning set to shake up equity hedge funds",
Eric Schmidt, executive chairman of Alphabet, Google’s parent company, told a crowd of hedge fund managers last week that he believes that in 50 years, no trading will be done without computers dissecting data and market signals.
“I’m looking forward to the start-ups that are formed to do machine learning against trading, to see if the kind of pattern recognition I’m describing can do better than the traditional linear regression algorithms of the quants,” he added. “Many people in my industry think that’s amenable to a new form of trading.”
Eric, old mate, you know I was late to the party in the early nineties, what does that make you?

Well, things are different now. I like to think of it and have written about the new neural renaissance as The Age of Perception. It is not intelligence, it is just good at patterns. It is still a bit hopeless at language ambiguities. It will also be a while before it understands the underlying values and concepts for deep financial understanding. 

Deep learning is simultaneously both overhyped and underestimated. It is not intelligence, but it will help us get there. It is overhyped by some as an AI breakthrough that will give us cybernetic human-like replicants. We still struggle with common knowledge and ambiguity in simple text for reasoning. We have a long way to go. The impact of relatively simple planning algorithms and heuristics along with the dramatic deep learning based perception abilities from vision, sound, text, radar, et cetera, will be as profound as every person and their dog now understands. That's why I call it, The Age of Perception. It is as if the supercomputers in our pockets have suddenly awoken with their eyes quickly adjusting to the bright blinking blight that is the real world. 

The impact will be dramatic and lifestyle changing for the entire planet. Underestimate the impact at your peril. No, we don't have a date with a deep Turing conversationalist that will provoke and challenge our deepest thoughts - yet. That will inevitably come, but it is not on the visible horizon. Smart proxies aided by speech, text and Watson-like Jeopardy databases will give us a very advanced Eliza, but no more. Autonomous transport, food production, construction, yard and home help will drive dramatic lifestyle and real-estate value changes.

Apart from this rambling meander, my intention here was to collect some thoughts on the chips driving the current neural revolution. Not the most exciting thought for many, but it is a useful exercise for me.

Neural network hardware


Neural processing is not a lot different today compared to twenty years ago. Deep is more of a brand than a difference. The activation functions have been simplified which suits hardware better. Mainly there is more data and a better understanding of how to initialise the weights, handle many layers, parallelise, and improve robustness via techniques such as dropout. The Neocognitron architecture from 1980 is not much different to today's deep learner or CNN, but it helped that Yann LeCun allowed it to learn. 

Back in the nineties there were also plenty of neural hardware platforms, such as CNAPS (1990) with its 64 processing units and 256kB of memory doing 1.6 GCPS (giga connections per second) at 8/16-bit, or 12.8 GCPS at 1-bit. You can read about Synapse-1, CNAPS, SNAP, CNS Connectionist Supercomputer, Hitachi WSI, My-Neupower, LNeuro 1.0, UTAK1, GNU Implementation (no, not GNU GNU, General Neural Unit), UCL, Mantra 1, Biologically-Inspired Emulator, INPG Architecture, BACHUS, and ZISC036 in "Overview of neural hardware" [Heemskerk, 1995, draft].

Phew, that seems a lot, but it excluded the software and accelerator board/CPU combos, such as ANZA plus, SAIC SIGMA-1, NT6000, Balboa 860 coprocessor, Ni1000 Recognition Accelerator Hardware (Intel), IBM NEP, NBC, Neuro Turbo I, Neuro Turbo II, WISARD, Mark II & IV, Sandy/8, GCN (Sony), Topsi, BSP400 (400 microprocessors), DREAM Machine, RAP, COKOS, REMAP, General Purpose Parallel Neurocomputer, TI NETSIM, and GeNet. Then there were quite a few analogue and hybrid analogue implementations, including Intel's Electrically Trainable Analog Neural Network (801770NX). You get the idea, there was indeed a lot back in the day.

All a go go in 1994:


Optimistically, Moore's Law was telling us a TeraCPS was just around the corner,
"In the next decade micro-electronics will most likely continue to dominate the field of neural network implementation. If progress advances as rapidly as it has in the past, this implies that neurocomputer performances will increase by about two orders of magnitude. Consequently, neurocomputers will be approaching TeraCPS (10^12 CPS) performance. Networks consisting of 1 million nodes, each with about 1,000 inputs, can be computed at brain speed (100-1000 Hz). This would offer good opportunities to experiment with reasonably large networks."
The first neural winter was the cruel subversion of research dollars by Minsky and Papert's dissing of Rosenblatt's perceptron dream with incorrect hand-wavy generalisations about hidden layers that ultimately led to Rosenblatt's untimely death. In 1995 another neural winter was kind of underway although I didn't really know it at the time. As a frog in the saucepan, I didn't notice the boil. This second winter was fired up by a lack of exciting progress and general boredom. 

The second neural winter ended with the dramatic improvements in ImageNet processing by the University of Toronto's SuperVision entry, AlexNet, in 2012, thanks to Geoffrey Hinton's winter survival skills. This result was then blown apart by Google's GoogLeNet Inception model in 2014. So, the Age of Perception started in 2012 by my reckoning. Mark your diaries. We're now five years in.

Google did impressive parallel CPU work with lossy updates across a few thousand regular machines. Professor Andrew Ng and friends made the scale approachable by enabling dozens of GPUs to do the work of thousands of CPUs. Thus, we were saved from the prospect of neural processing being only for the well funded. Well, kind of, now the state of the art sometimes needs thousands of GPUs or specific chips. 

More data and more processing have been quite key. Let's get to the point and list some of the platforms that are key to the Age of Perception's big data battle:

GPUs from Nvidia

These are hard to beat. The subsidisation that comes from the large video processing market drives tremendous economies of scale. The new Nvidia V100 can do 15 TFlops of SP, or 120 TFlops with its new Tensor core architecture, which performs an FP16 multiply with an FP32 accumulate or add to suit ML. Nvidia packs eight boards into its DGX-1 for 960 Tensor TFlops.
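That FP16-multiply/FP32-accumulate pattern is easy to mimic in NumPy. A toy sketch (my illustration, not Nvidia's actual datapath) of why accumulating in FP32 matters:

```python
import numpy as np

np.random.seed(0)

# Toy version of the Tensor-core pattern: multiply in FP16,
# accumulate the products in FP32 to limit rounding error.
a = np.random.rand(4096).astype(np.float16)
b = np.random.rand(4096).astype(np.float16)

products = a * b                     # FP16 multiplies
acc_fp16 = np.float16(0)
for p in products:                   # naive all-FP16 accumulation
    acc_fp16 = np.float16(acc_fp16 + p)
acc_fp32 = products.astype(np.float32).sum()  # FP32 accumulation

# High-precision reference dot product for comparison.
ref = (a.astype(np.float64) * b.astype(np.float64)).sum()
print(abs(float(acc_fp16) - ref), abs(float(acc_fp32) - ref))
```

The FP32 accumulator tracks the reference far more closely; with thousands of terms, an all-FP16 running sum drifts by whole units while the FP32 path stays within rounding noise of the products.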

GPUs from AMD

AMD has been playing catch-up with Nvidia in the ML space. The soon-to-be-released AMD Radeon Instinct MI25 promises 12.3 TFlops of SP or 24.6 TFlops of FP16. If your calculations are amenable to Nvidia's Tensor cores, then AMD can't compete. Nvidia also has twice the memory bandwidth at 900 GB/s versus AMD's 484 GB/s.

Google's TPUs

Google's original TPU had a big lead over GPUs and helped power DeepMind's AlphaGo victory over Lee Sedol in a Go tournament. The original 700MHz TPU is described as having 92 TOPS for 8-bit calculations or 23 TOPS for 16-bit whilst drawing only 40W. This was much faster than GPUs on release but is now slower than Nvidia's V100, though not on a per-watt basis. The new TPU2 is referred to as a TPU device: four chips that together do around 180 TFlops. Each chip's performance has been doubled to 45 TFlops for 16-bit. You can see the gap to Nvidia's V100 is closing. You can't buy a TPU or TPU2; Google is making them available for use in their cloud, with TPU pods containing 64 devices for up to 11.5 PetaFlops of performance. The giant heatsinks on the TPU2 are cause for some speculation, but the market is clearly changing from single devices to groups of devices, and to such groups offered within the cloud.
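The pod figure follows directly from the device numbers, and the per-watt story is worth a quick check too. A sketch, where the V100's ~300W TDP is my assumption and the comparison mixes 8-bit TOPS against FP16 TFlops, so it is rough at best:

```python
# TPU2: around 180 TFlops per 4-chip device, 64 devices per pod.
tpu2_device_tflops = 180
pod_tflops = 64 * tpu2_device_tflops
print(pod_tflops / 1000, "PFlops")   # 11.52 PFlops, the quoted pod figure

# Rough perf-per-watt: original TPU (92 TOPS at 40W, 8-bit)
# versus V100 Tensor cores (120 TFlops at an assumed ~300W TDP).
tpu1 = 92 / 40     # TOPS per watt
v100 = 120 / 300   # TFlops per watt
print(round(tpu1, 2), "vs", round(v100, 2))
```

Even with the V100 ahead on raw throughput, the original TPU still looks several times better per watt on this crude reckoning, which is the point made above.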

Wave Computing

Wave's Aussie CTO, Dr Chris Nicol, has produced a wonderful piece of work with Wave's asynchronous data flow processor in their Compute Appliance. I was introduced to Chris briefly a few years ago in California by Metamako Founder Charles Thomas. They both used to work on clockless async stuff at NICTA. Impressive people those two. 

I'm not sure Wave's appliance was initially targeting ML, but their ability to run TensorFlow at 2.9 PetaOPS/sec on their 3RU appliance is pretty special. Wave refers to their processors as DPUs and an appliance has 16 DPUs. Wave uses processing elements it calls Coarse Grained Reconfigurable Arrays (CGRAs). It is unclear what bit width the 2.9 PetaOPS/s refers to. From their white paper, the ALUs can do 1b, 8b, 16b and 32b:
"The arithmetic units are partitioned. They can perform 8-b operations in parallel (ideal for DNN inferencing) as well as 16-b and 32-b operations (or any combination of the above). Some 64-b operations are also available and these can be extended to arbitrary precision using software."
Here is a bit more on one of the 16 DPUs included in the appliance,
"The Wave Computing DPU is an SoC that contains a 16,384 PEs, configured as a CGRA of 32x32 clusters. It includes four Hybrid Memory Cube (HMC) Gen 2 interfaces, two DDR4 interfaces, a PCIe Gen3 16-lane interface and an embedded 32-b RISC microcontroller for SoC resource management. The Wave DPU is designed to execute autonomously without a host CPU."
On TensorFlow ops, 
"The Wave DNN Library team creates pre-compiled, relocatable kernels for common DNN functions used by workflows like TensorFlow. These can be assembled into Agents and instantiated into the machine to form a large data flow graph of tensors and DNN kernels."
"...a session manager that interfaces with machine learning workflows like TensorFlow, CNTK, Caffe and MXNet as a worker process for both training and inferencing. These workflows provide data flow graphs of tensors to worker processes. At runtime, the Wave session manager analyzes data flow graphs and places the software agents into DPU chips and connects them together to form the data flow graphs. The software agents are assigned regions of global memory for input buffers and local storage. The static nature of the CGRA kernels and distributed memory architecture enables a performance model to accurately estimate agent latency. The session manager uses the performance model to insert FIFO buffers between the agents to facilitate the overlap of communication and computation in the DPUs. The variable agents support software pipelining of data flowing through the graph to further increase the concurrency and performance. The session manager monitors the performance of the data flow graph at runtime (by monitoring stalls, buffer underflow and/or overflow) and dynamically tunes the sizes of the FIFO buffers to maximize throughput. A distributed runtime management system in DPU-attached processors mounts and unmounts sections of the data flow graph at run time to balance computation and memory usage. This type of runtime reconfiguration of a data flow graph in a data flow computer is the first of its kind."
Yeah, me too. Very cool.

The exciting thing about this platform is that it is coarser than FPGA in architectural terms and thus less flexible, but likely to perform better. Very interesting.
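The session-manager behaviour described above, with kernels as agents joined by FIFO buffers so communication and computation overlap, can be mimicked in miniature with threads and queues. All the names and sizes here are mine and purely illustrative:

```python
import queue
import threading

# Toy dataflow graph: two "agents" (kernels) joined by a FIFO,
# so the producer keeps computing while the consumer drains the buffer.
fifo = queue.Queue(maxsize=8)   # the session manager would tune this size
results = []

def producer():
    for x in range(100):
        fifo.put(x * x)          # stand-in for an upstream DNN kernel
    fifo.put(None)               # end-of-stream marker

def consumer():
    while (item := fifo.get()) is not None:
        results.append(item + 1) # stand-in for a downstream kernel

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(results), results[:3])
```

The bounded queue is the interesting bit: too small and the producer stalls, too large and memory is wasted, which is exactly the trade-off Wave's runtime claims to tune dynamically by watching stalls and under/overflows.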

KnuEdge's KnuPath

I tweeted about KnuPath back in June 2016. Their product page has since gone missing in action. I'm not sure what they are up to with the $100M they put into their MIMD architecture. It was described at the time as having 256 tiny DSP, or tDSP, cores on each ASIC along with an ARM controller suitable for sparse matrix processing in a 35W envelope. 

(source: HPC Wire - click to enlarge)
The performance is unknown, but they compared their chip to a then-current Nvidia part and said they had 2.5 times the performance. We know Nvidia is now more than ten times faster with its Tensor cores, so KnuEdge will have a tough job keeping up. A MIMD or DSP approach will have to execute awfully well to take some share in this space. Time will tell.

Intel's Nervana

Intel purchased Nervana Systems, which was developing both a GPU/software approach and its Nervana Engine ASIC. Comparable performance is unclear. Intel is also planning to integrate the technology into the Phi platform via its Knights Crest project. NextPlatform suggested the 2017 target on 28nm may be 55 TOPS for some width of op. Intel has a NervanaCon scheduled for December, so perhaps we'll see the first fruits then.

Horizon Robotics

This Chinese start-up has a Brain Processing Unit (BPU) in the works. Dr Kai Yu has the right kind of pedigree, as he was previously the head of Baidu's Institute of Deep Learning. Earlier this year a BPU emulation on an Arria 10 FPGA was shown in this YouTube clip. There is little public information on this platform.

Eyeriss

Eyeriss is an MIT project that developed a 65nm ASIC with unimpressive raw performance; the chip is about half the speed of an Nvidia TK1 on AlexNet. The neat aspect is that such middling performance is achieved by a 278mW reconfigurable accelerator, thanks to its row-stationary approach. Nice.

Graphcore

Graphcore raised $30M of Series A late last year to support the development of their Intelligence Processing Unit, or IPU. Their website is a bit sparse on details beyond hand-wavy claims such as >14,000 independent processor threads and >100x memory bandwidth. Some snippets have snuck out, with NextPlatform reporting over a thousand true cores on the chip with a custom interconnect. Its PCIe board carries a 16-processor element. It sounds kind of dataflowy. Unconvincing PR aside, the team has a strong rep and the investors are not naive, so we'll wait and see.

Tenstorrent

Tenstorrent is a small Canadian start-up in Toronto claiming, like most, an order of magnitude improvement in efficiency for deep learning. No real public details, but they are on the Cognitive 300 list.

Cerebras

Cerebras is notable due to its backing from Benchmark and that its founder was the CEO of SeaMicro. It appears to have raised $25M and remains in stealth mode.


Thinci

Thinci is developing vision processors from Sacramento with employees in India too. They claim to be at the point of first silicon, Thinci-tc500, with benchmarking and customer wins already happening. Apart from "doing everything in parallel" we have little to go on.


Koniku

Koniku's website is counting down and has 72 days showing until my new reality. I can hardly wait. They have raised very little money and, after watching their YouTube clip embedded in this Forbes page, you too will likely not be convinced, but you never know. Harnessing biological cells is certainly different. It sounds like a science project, but then, this:
"We are a business. We are not a science project," Agabi, who is scheduled to speak at the Pioneers Festival in Vienna, next week, says, "There are demands that silicon cannot offer today, that we can offer with our systems."
The core of the Koniku offer is the so-called neuron-shell, inside which the startup says it can control how neurons communicate with each other, combined with a patent-pending electrode which allows it to read and write information inside the neurons. All this is packed in a device as large as an iPad, which they hope to reduce to the size of a nickel by 2018.

Adapteva

Adapteva is a favourite little tech company of mine to watch, as you'll see in this previous meander, "Adapteva tapes out Epiphany-V: A 1024-core 64-bit RISC processor." Andreas Olofsson taped out his 1024-core chip late last year and we await news of its performance. Epiphany-V has new instructions for deep learning and we'll have to see if this memory-controller-less design with 64MB of on-chip memory will have appropriate scalability. The impressive efficiency of Andreas's design and build may make this a chip we can all actually afford, so let's hope it performs well.

Knowm

Knowm talks about Anti-Hebbian and Hebbian (AHaH) plasticity and memristors. Here is a paper covering the subject, "AHaH Computing–From Metastable Switches to Attractors to Machine Learning." It's a bit too advanced for me. With a quick glance I can't tell the difference between this tech and hocus-pocus but it looks sciency. I'm gonna have to see this one in the flesh to grok it. The idea of neuromemristive processors is intriguing. I do like a good buzzword in the morning.

Mythic

Mythic promises a battery-powered neural chip with 50x lower power. There are not many real details out there. The chip is the size of a button, but aren't most chips?
"Mythic's platform delivers the power of desktop GPU in a button-sized chip"
Perhaps another one suitable for drones and phones that is likely to be eaten or sidelined by the phone SoC vendors.

Qualcomm

Phones are an obvious place for ML hardware to crop up. We want to identify the dog type, flower, leaf, cancerous mole, translate a sign, understand the spoken word, etc. Our pocket supercomputers would like all the help they can get for the Age of Perception.

Qualcomm has been fussing around ML for a while with the Zeroth SDK and Snapdragon Neural Processing Engine. The NPE certainly works reasonably well on the Hexagon DSP that Qualcomm uses. The Hexagon DSP is far from a very wide parallel platform, and Yann LeCun has confirmed that Qualcomm and Facebook are working together on a better way, in Wired's "The Race To Build An AI Chip For Everything Just Got Real":
"And more recently, Qualcomm has started building chips specifically for executing neural networks, according to LeCun, who is familiar with Qualcomm's plans because Facebook is helping the chip maker develop technologies related to machine learning. Qualcomm vice president of technology Jeff Gehlhaar confirms the project. "We're very far along in our prototyping and development," he says."
Perhaps we'll see something soon beyond the Kryo CPU, Adreno GPU, Hexagon DSP, and Hexagon Vector Extensions. It is going to be hard to be a start-up in this space if you're competing against Qualcomm's machine learning.

Pezy-SC and Pezy-SC2

These are the 1024 core and 2048 core processors that Pezy develop. The Pezy-SC 1024 core chip powered the top 3 systems on the Green500 list of supercomputers back in 2015. The Pezy-SC2 is the follow up chip that is meant to be delivered by now, and I do see a talk in June about it, but details are scarce yet intriguing,
"PEZY-SC2 HPC Brick: 32 of PEZY-SC2 module card with 64GB DDR4 DIMM (2.1 PetaFLOPS (DP) in single tank with 6.4Tb/s)"
It will be interesting to see what 2,048 MIMD MIPS Warrior 64-bit cores can do. In the June 2017 Green500 list, an Nvidia P100 system took the number one spot and there is a Pezy-SC2 system at number 7. So the chip seems alive, but details are thin on the ground. Motoaki Saito is certainly worth watching.

Kalray

Despite many promises, Kalray has not progressed their chip offering beyond the 256 core beast I covered back in 2015, "Kalray - new product meander." Kalray is advertising their product as suitable for embedded self-driving car applications though I can't see the product architecture being an ideal CNN platform in its current form. Kalray has a Kalray Neural Network (KaNN) software package and claims better efficiency than GPUs with up to 1 TFlop/s on chip.

Kalray's NN fortunes may improve with an imminent product refresh, and just this month Kalray completed a new funding round that raised $26M. The new Coolidge processor is due in mid-2018 with 80 or 160 cores along with 80 or 160 co-processors optimised for vision and deep learning.

This is quite a change in architecture from their >1000 core approach and I think it is most sensible.

IBM TrueNorth

TrueNorth is IBM's Neuromorphic CMOS ASIC developed in conjunction with the DARPA SyNAPSE program.
It is a manycore processor network on a chip design, with 4096 cores, each one simulating 256 programmable silicon "neurons" for a total of just over a million neurons. In turn, each neuron has 256 programmable "synapses" that convey the signals between them. Hence, the total number of programmable synapses is just over 268 million (2^28). In terms of basic building blocks, its transistor count is 5.4 billion. Since memory, computation, and communication are handled in each of the 4096 neurosynaptic cores, TrueNorth circumvents the von-Neumann-architecture bottlenecks and is very energy-efficient, consuming 70 milliwatts, about 1/10,000th the power density of conventional microprocessors. [Wikipedia]
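Those numbers hang together, as a quick check shows:

```python
# TrueNorth building blocks, per the description above.
cores = 4096
neurons_per_core = 256
synapses_per_neuron = 256

neurons = cores * neurons_per_core        # just over a million neurons
synapses = neurons * synapses_per_neuron  # the "268 million" figure
print(neurons, synapses, synapses == 2**28)
```

4096 x 256 gives 1,048,576 neurons (2^20), and another factor of 256 gives exactly 2^28 = 268,435,456 synapses.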
Previously criticised for running spiking neural networks rather than being fit for deep learning, IBM developed a new algorithm for running CNNs on TrueNorth,
Instead of firing every cycle, the neurons in spiking neural networks must gradually build up their potential before they fire...Deep-learning experts have generally viewed spiking neural networks as inefficient—at least, compared with convolutional neural networks—for the purposes of deep learning. Yann LeCun, director of AI research at Facebook and a pioneer in deep learning, previously critiqued IBM’s TrueNorth chip because it primarily supports spiking neural networks... 
...the neuromorphic chips don't inspire as much excitement because the spiking neural networks they focus on are not so popular in deep learning.
To make the TrueNorth chip a good fit for deep learning, IBM had to develop a new algorithm that could enable convolutional neural networks to run well on its neuromorphic computing hardware. This combined approach achieved what IBM describes as “near state-of-the-art” classification accuracy on eight data sets involving vision and speech challenges. They saw between 65 percent and 97 percent accuracy in the best circumstances.
When just one TrueNorth chip was being used, it surpassed state-of-the-art accuracy on just one out of eight data sets. But IBM researchers were able to boost the hardware’s accuracy on the deep-learning challenges by using up to eight chips. That enabled TrueNorth to match or surpass state-of-the-art accuracy on three of the data sets.
The TrueNorth testing also managed to process between 1,200 and 2,600 video frames per second. That means a single TrueNorth chip could detect patterns in real time from between as many as 100 cameras at once..." [IEEE Spectrum]
Power efficiency is quite brilliant on TrueNorth and makes it very worthy of consideration.
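The "gradually build up their potential before they fire" behaviour quoted above is the classic leaky integrate-and-fire model. A minimal sketch, with illustrative constants of my choosing rather than TrueNorth's actual parameters:

```python
# Minimal leaky integrate-and-fire neuron: the membrane potential
# leaks each step, integrates its input, and spikes on crossing a threshold.
LEAK, THRESHOLD = 0.9, 1.0

def lif(inputs):
    v, spikes = 0.0, []
    for i in inputs:
        v = v * LEAK + i          # leak, then integrate the input
        if v >= THRESHOLD:        # fire and reset
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

# A steady sub-threshold input needs several steps of build-up per spike.
print(lif([0.4] * 10))
```

With a constant 0.4 input, the potential climbs 0.4, 0.76, 1.08 and fires every third step: quite unlike a conventional neuron that produces an output every cycle, which is exactly why deep-learning frameworks find spiking models awkward.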

Brainchip's Spiking Neuron Adaptive Processor (SNAP)

SNAP will not do deep learning and is a curiosity without being a practical drop-in CNN engineering solution, yet. IBM's stochastic phase-change neurons seem more interesting if that is a path you wish to tread.

Apple's Neural Engine

Will it or won't it? Bloomberg reports it will, as a secondary processor, but there is little detail. Not only is this an important area for Apple, it also helps it avoid, and compete with, Qualcomm.

Others


Cambricon - the Chinese Academy of Sciences invests $1.4M for the chip. It is an instruction set architecture for NNs with data-level parallelism, customised vector/matrix instructions, and on-chip scratchpad memory. It claims 91 times the speed of an x86 CPU and 3 times a K40M, while drawing 1.695W, around 1% of peak power. See "Cambricon-X: An Accelerator for Sparse Neural Networks" and "Cambricon: An Instruction Set Architecture for Neural Networks."

Groq Inc, staffed by ex-Googlers. Perhaps another TPU?

Aimotive.

Deep Vision is building low-power chips for deep learning. Perhaps one of these papers by the founders has clues: "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing" [2013] and "Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing" [2015].

Deep Scale.

Reduced Energy Microsystems are developing lower power asynchronous chips to suit CNN inference. REM was Y Combinator's first ASIC venture according to TechCrunch.

Leapmind is busy too.

FPGAs


Microsoft has thrown its hat into the FPGA ring, "Microsoft Goes All in for FPGAs to Build Out AI Cloud." Wired did a nice story on the MSFT use of FPGAs too, "Microsoft Bets Its Future on a Reprogrammable Computer Chip"
"On Bing, which handles an estimated 20 percent of the worldwide search market on desktop machines and about 6 percent on mobile phones, the chips are facilitating the move to the new breed of AI: deep neural nets."
I have some affinity for this approach. Xilinx and Intel's (née Altera) FPGAs are powerful engines. Xilinx naturally claims its FPGAs are best for INT8, with one of its white papers containing the following slide,
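The INT8 pitch rests on quantisation: scaling FP32 weights and activations down to 8-bit integers so the fabric's DSP slices can chew through them, then rescaling once at the end. A toy sketch of symmetric quantisation; the scheme and values here are my illustration, not Xilinx's:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantisation of FP32 values to INT8."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.50, -1.20, 0.03, 0.70], dtype=np.float32)  # toy weights
a = np.array([1.00,  0.25, -0.80, 0.10], dtype=np.float32)  # toy activations

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

# INT8 multiplies accumulate into a wide integer, then rescale once.
int_dot = int(np.dot(qw.astype(np.int32), qa.astype(np.int32)))
approx = int_dot * sw * sa
print(approx, float(np.dot(w, a)))  # close to the FP32 dot product
```

The hardware win is that the expensive inner loop is all narrow integer multiply-accumulates; the floating-point rescale happens once per output rather than once per term.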


Both vendors have good support for machine learning with their FPGAs:

Whilst performance per watt is impressive for FPGAs, the vendors' larger chips have long carried earth-shatteringly high prices. Xilinx's VU9P lists at over US$50k at Avnet.

Finding a balance between price and capability is the main challenge with the FPGAs.

One thing to love about the FPGA approach is the ability to make some quite wonderful architectural decisions. Say you want to improve your memory streaming of floating-point data by compressing it on the way to off-board DRAM or HBM and decompressing it in real time; there is a solution if you try hard enough: "Bandwidth Compression of Floating-Point Numerical Data Streams for FPGA-Based High-Performance Computing"
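The idea of trading a little precision for memory bandwidth can be mimicked crudely by truncating each FP32 value to its top 16 bits, a bfloat16-style scheme. This is my illustration of the general trick, not the paper's actual codec:

```python
import numpy as np

def truncate_fp32(x):
    """Keep only the top 16 bits of each float32 (sign, exponent and
    7 mantissa bits), halving bandwidth at some precision cost."""
    bits = x.view(np.uint32)
    return (bits >> 16).astype(np.uint16)   # what would go to DRAM

def restore_fp32(h):
    """Re-expand the 16-bit payload back to float32 on the way in."""
    return (h.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, -0.001, 123456.0], dtype=np.float32)
compact = truncate_fp32(x)           # 2 bytes per value instead of 4
back = restore_fp32(compact)
print(back)                           # roughly the originals
print(float(np.max(np.abs((back - x) / x))))  # relative error well under 1%
```

Dropping 7 of the 23 mantissa bits keeps the full exponent range, so the worst-case relative error is about 2^-7, often a fine trade when the alternative is stalling on DRAM bandwidth.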


This kind of dynamic architectural agility would be a hard thing to pull off with almost any other technology.

Too many architectural choices may be considered a problem, but I kind of like that problem myself. Here is a nice paper on closing the performance gap between custom hardware and FPGA processors with an FPGA-based horizontally microcoded compute engine that reminds me of the old DISC, or discrete instruction set computer, from many moons ago: "Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT"




Winners


Trying to forecast a winner in this kind of race is a fool's errand. Qualcomm will be well placed simply due to their phone dominance. Apple will no doubt succeed with whatever they do. Nvidia's V100 is quite a winner with its Tensor units. I'm not sure I can see Google's TPU surviving in a relentless long-term silicon march despite its current impressive performance. I'm fond of the FPGA approach but I can't help but think they should release DNN editions at much cheaper price points so that they don't get passed by the crowd. Intel and AMD will have their co-processors. As all the major players are mucking in, much of it will come down to supporting standard toolkits, such as TensorFlow, and then we will not have to care too much about the specifics, just the benchmarks.

From the smaller players, as much as I like and am cheering for the Adapteva approach I think their memory architecture may not be well suited to DNN. I hope I'm wrong.

Wave Computing is probably my favourite approach after FPGAs. Their whole asynchronous dataflow approach is quite awesome. It appears REM is doing something similar, but I think they may be too late. Will Wave Computing be able to hold their head up in the face of all the opposition? Perhaps, as their asynchronous CGRA has an inherent advantage. Though I'm not sure they need DNNs to succeed, as their tech has much broader applicability.

Neuromorphic spiking processor thing-a-ma-bobs are probably worth ignoring for now but keep your eye on them due to their power advantage. Quantum crunching may make it all moot anyway. The exception to this rule is probably IBM's TrueNorth thanks to its ability to not just do spiking networks but to also run DNNs efficiently.

For me, Random Forests are friendly. They are much harder to screw up ;-)

Happy trading,

--Matt.