Introduction to the Introduction
When the conversation turns to Artificial Intelligence (AI), Big Data, and analytics, many exciting fantasies of the future (Robots replacing people! Martian colonies!) are passed around with little thought to the myriad possibilities that are already available and under-utilized. Artificial Intelligence is a branch of computer science and mathematics that attempts to solve discrete problems using pre-defined rules or routines (algorithms). Leaving aside a deeper discussion of what AI may bring, the problems that AI users have already tackled range from reading your license plate from a distance to issue a traffic violation (optical character recognition), to language translation (Natural Language Processing), to autonomously operating vehicles (perception, motion, and action). One subfield that is exciting to us is Natural Language Processing (NLP). We can trace the very beginnings of NLP and AI back to the genius insights of the great Alan Turing, who proposed a linguistic test, the Turing Test, to determine whether computers were “intelligent.” NLP is intimately intertwined with AI.
Where are we now?
Although we might be tempted to predict what the next five or ten years may hold for us, because it is exciting to make bold predictions that no one will remember in five years (Flying cars! Tang every day!), a few researchers (Joseph Mariani, Gil Francopoulo, Patrick Paroubek, and Frédéric Vernier) took a different approach: they went through roughly 55 years of papers on research and development in speech and language processing to uncover patterns in the theory and the analytical tools used to tackle the many problems in NLP. Their work, The NLP4NLP Corpus (I, II, III): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing (which in and of itself is a fascinating NLP project), gives us estimates of the depth of this field through a study of over 65,000 documents. It is useful to look at the academic papers because scientists and academics lead in the development of new ideas. Tech start-ups may develop very good and innovative uses of the academics’ work, but without the hard academic work there is nothing to innovate from. The recent academic work will give us clues to what is going on right now in the development of NLP-linked products and services.
The authors wisely identified three sub-fields to concentrate on: Speech, NLP, and Information Retrieval (IR). Much of the research concerns acoustics and signal processing, as can be seen in the graph below from the same report, which charts the rise of Speech Recognition (SR) as a subject over the years. We will not discuss this further, but you are welcome to ask your nearest speech recognition device for more information.
If we instead look at the word rankings from the same report, we find some interesting insights.
As much as the statistician in me wants to digress on the use of Gaussian distributions and Fourier transforms, we will stick with the NLP list. We will briefly look at each term and give a few passing thoughts on the uses it may indicate.
Where does this leave us in understanding the trends and focus of innovation?
First, the list above makes clear that much of the work in NLP is concerned with identifying and classifying words and phrases. This can help us focus our attention on the tasks involved in doing so. NLP excels at matching similar or related phrases, perhaps to find sentiment in financial reports, and at searching for specific words in large collections of documents such as a medical library. But we should not get carried away. Because we know a machine works better when there is a human attached, the most satisfying uses of machines will be those that help us. For example, with a part-of-speech (POS) tagger, an editor could highlight all the adverbs in a sentence and gain some insight into whether a journalist was being sufficiently fair in her writing. The machine does not decide or ask the question, but it helps the human make the decision faster and more accurately.
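The adverb-highlighting idea can be sketched in a few lines of Python. This is only a toy illustration, not a real POS tagger: it assumes that a word is an adverb if it ends in "-ly" or appears in a small hand-made word list. The names `KNOWN_ADVERBS`, `NOT_ADVERBS`, `is_adverb`, `adverbs_in`, and `highlight_adverbs` are all hypothetical and chosen for this example.

```python
import re

# ASSUMPTION: a tiny hand-made lexicon of adverbs that do not end in "-ly".
# A real POS tagger would rely on a trained statistical model instead.
KNOWN_ADVERBS = {"very", "quite", "almost", "never", "always", "often", "too"}

# ASSUMPTION: a few common "-ly" words that are not adverbs, excluded to
# reduce false positives from the suffix heuristic below.
NOT_ADVERBS = {"family", "reply", "supply", "friendly", "lonely", "ugly"}


def is_adverb(word: str) -> bool:
    """Crude heuristic: a listed adverb, or a longer word ending in 'ly'."""
    w = word.lower()
    if w in NOT_ADVERBS:
        return False
    return w in KNOWN_ADVERBS or (len(w) > 3 and w.endswith("ly"))


def adverbs_in(text: str) -> list:
    """Return the words in `text` that the heuristic judges to be adverbs."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if is_adverb(w)]


def highlight_adverbs(text: str, marker: str = "*") -> str:
    """Wrap each suspected adverb in `marker` so an editor can spot it."""
    def mark(match):
        word = match.group(0)
        return f"{marker}{word}{marker}" if is_adverb(word) else word

    return re.sub(r"[A-Za-z]+", mark, text)


sentence = "The minister clearly failed and was quite obviously unprepared."
print(highlight_adverbs(sentence))
# prints: The minister *clearly* failed and was *quite* *obviously* unprepared.
```

In practice the heuristic would be swapped for a trained tagger, such as the one NLTK exposes as `nltk.pos_tag`. The division of labor stays the same either way: the machine marks the words, and the editor decides what they mean.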
Businesses have much data locked up in documents, documents that are largely mute and of such volume as to be inaccessible. NLP unlocks the value in this data. One industry likely to use these tools to great effect is the one that relies primarily on words: the legal profession. Inundated with reams of impenetrable documentation stretching back years, the legal profession has formed a verbal wall that has protected its inhabitants from the pressures felt in many other industries. But that will change soon. When the financial pressure builds sufficiently, the legal profession will be transformed in a matter of months into a more outward, customer-facing business, putting many tools and processes at the disposal of clients, removing all the carefully constructed barriers to understanding, and democratizing one of the last standing guilds.
The only question is who will be first to build this tool?