SmartAssist Technology

Five Takeaways on the State of Natural Language Processing

Thoughts following the 2015 "Text By The Bay" Conference

The first " Text By the Bay” conference, a new natural language processing (NLP) event from the “ Scala bythebay” organizers, just wrapped up tonight. In bringing together practitioners and theorists from academia and industry I’d call it a success, save one significant and glaring problem. 
Read More

Topics: Machine Learning, Data Science, Conferences and Workshops, Predictive Analytics

Asking RNNs+LTSMs: What Would Mozart Write?

Preamble: A natural progression beyond artificial intelligence is artificial creativity. I've been interested in AC for awhile and started learning of the various criteria that the scholarly community has devised to test AC in art, music, writing, etc. (I think crosswords might present an interesting Turing-like test for AC). In music, a machine-generated score which is deemed interesting, challenging, and unique (and indistinguishable from the real work of a great master), would be a major accomplishment. Machine-generated music has a long history (cf. "Computer Models of Musical Creativity" by D. Cope; Cambridge, MA: MIT Press, 2006).

Deep Learning at the character level: With the resurgence of interest in Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), I thought it would be interesting to see how far we could go in autogenerating music. RNNs have actually been around in music generation for awhile (even with LSTM; see this site and this 2014 paper from Liu & Ramakrishnan and references therein), but we're now getting into an era where we can train on a big corpus and thus train a big, complex model. Andrej Karpathy's recent blog showed how training a character-level model on Shakespeare and Paul Graham essays could yield interesting, albeit fairly garbled, text that seems to mimic the flow and usage of

Read More

Topics: Machine Learning, Data Science

Make Docker images Smaller with This Trick

The architectural and organizational/process advantages of containerization (eg., via Docker) are commonly known. However, in constructing images, especially those that serve as the base for other images, adding functionality via package installation is a double edged sword. On one hand we want our images to be most useful for the purposes they are built but—as images are downloaded, moved around our networks and live in our production environments—we pay a real speed and cost price for bloated image sizes. The obvious onus on image creators is to make them as practically small as possible without sacrificing efficicacy and extensibility. This blog shows how we shrunk our images with a pretty simple trick...

Read More

Topics: Machine Learning, Data Science, Software Engineering

ParaText: CSV parsing at 2.5 GB per second

Despite extensive use of distributed databases and filesystems in data-driven workflows, there remains a persistent need to rapidly read text files on single machines. Surprisingly, most modern text file readers fail to take advantage of multi-core architectures, leaving much of the I/O bandwidth unused on high performance storage systems. Introduced here, ParaText, reads text files in parallel on a single multi-core machine to consume more of that bandwidth. The alpha release includes a parallel Comma Separated Values (CSV) reader with Python bindings.

Read More

Topics: Machine Learning, Data Science, Software Engineering

Benchmarking Random Forest Classification

Last month, wise.io co-founder Joey wrote a blog post on the principles of doing proper benchmarks of machine learning frameworks. Here, I start to put those principles into practice, presenting the first in a series of blog posts on ML benchmarking. For those short of time, you can jump to conclusions.

For this benchmark, I focus on comparing accuracy and speed of four random forest®1 classifier implementations, including the high-performance WiseRF™. In follow-up posts we will cover random forest regression and benchmark against other machine learning algorithms. We will also benchmark memory usage across different implementations — another very important, but often overlooked aspect of benchmarking, especially when it comes to “big data.”

Tools

We created a standardized benchmarking platform to compare the accuracy and speed of the following random forest implementations:

All of the tools, with the exception of the randomForest R package, are multi-threaded and were parallelized across available cores. H2O has the option of running a distributed (multi-machine) implementation, but I considered herein the more common single-node workstation for this benchmarking exercise.

Read More

Topics: Machine Learning, Data Science