Since the beginning of time, humans and some animals have been able to communicate using several abstraction techniques. Communication exists in different forms such as gestures, languages and emotions. Next to abstract mathematics, language, both spoken and written, is perhaps the most complicated and most interesting communication tool humans use. Within the general field of Artificial Intelligence (AI), which uses complicated techniques such as machine learning and deep learning, there is a sub-field that is interested in understanding how humans communicate. This is commonly referred to as Natural Language Understanding or Natural Language Processing (NLP).
When we talk about NLP, we are concerned with a narrow subset of tasks. These tasks are typically connected with designing a computer or machine to understand human language through learning. The machine uses data such as spoken words or written symbols, which are pre-processed by statistical and mathematical models and made usable by computers. The topic under discussion today is this data, and how we use and measure it to gain insight into the efficacy of NLP models.
We believe that humans understand and communicate using knowledge gained from past experience. Every experience is stored in the human brain as a complex structure, so when a human communicates they decode past data from their brain cells for use in language. The field of neurolinguistic biology is enormous and beyond the scope of this post; the point to keep in mind is that an NLP model is not the same as a brain. It is inspired by, and uses concepts similar to, how we believe a brain uses data, but in the end NLP models are constructed using one of humankind’s most powerful concepts: statistics. By using statistics to search and analyse large amounts of data, we can train computers to do stunningly complex tasks quickly and with fewer errors than a human. But what do we mean by “data”, and how do we know the data is useful? How do we judge one computer system versus another? Luckily a few researchers, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman, have tackled the task of pre-processing the data and making it useful. Their work is summarized in an academic paper called GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. I am indebted to their work in the field and draw on it for this post.
Which benchmark and why?
Benchmarks are a dispassionate method of discovering the relative performance of similar actions. Benchmarks are diverse: one can be a stock index, another the error rate on a manufacturing line. By measuring performance against a benchmark, one can identify gaps in an organization’s processes so as to improve and gain a competitive advantage. For our purposes we will look at a benchmark that measures how the machine learning community, including university and industry researchers, has performed on a common set of tasks. To measure performance on several NLP tasks, the GLUE benchmark is used. We will describe GLUE in the following sections.
GLUE is a collection of data sets designed to measure the outcomes of specific tasks. It also comes with a diagnostic data set, which the creators of the GLUE benchmark describe as “several hundred sentence pairs labelled with their entailment relations (entailment, contradiction, or neutral) in both directions and tagged with a set of linguistic phenomena involved in justifying the entailment labels.” The GLUE dataset includes: i) the Corpus of Linguistic Acceptability; ii) the Stanford Sentiment Treebank; iii) the Microsoft Research Paraphrase Corpus; iv) the Semantic Textual Similarity Benchmark; v) the Quora Question Pairs; vi) the MultiNLI Matched; vii) the MultiNLI Mismatched; viii) the Question NLI; ix) the Recognizing Textual Entailment; x) the Winograd NLI; and xi) the Diagnostics Main datasets.
What tasks are tested against this data?
Researchers can access these datasets to test how accurately their models handle specific tasks, such as deciding whether a sentence is grammatically correct or whether two sentences contradict each other.
There are four primary NLP tasks that are currently tested, but the data is flexible and one can create as many tasks as needed in order to deepen our knowledge. For now, however, the main tasks are as follows.
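The first task is sentence pair classification. The data sets used for this challenge are MNLI, QQP, QNLI, STS-B, MRPC and RTE from GLUE, together with SWAG, which is not itself part of GLUE. This task involves identifying whether the two given input sentences are a pair or not, for example whether one paraphrases or entails the other.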
The second task is single sentence classification. The GLUE data sets used for this challenge are SST-2 and CoLA. Given a sentence as input, the job is to find the label of its class.
The third task is question answering. The data set used for this challenge is SQuAD, which is not itself part of GLUE, although GLUE’s Question NLI data set is derived from it. This problem is relatively difficult compared to the previous two tasks. The task is to take a passage of text and a question about that text as input, and to return an answer based on the given passage.
The fourth task is named entity recognition. The data set used for this challenge is CoNLL-NER, which is also not part of GLUE. The job of this task is to find the entities in a given sentence. The researcher inputs a sentence or sentences, and the output is the classes identifying each entity. This is similar to a classification task.
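To make the difference between the four tasks concrete, here is a minimal Python sketch of what one input and output might look like for each. The class names, field names and example sentences are all made up for illustration; they are not the actual record formats of the data sets named above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentencePairExample:
    """Sentence pair classification: two sentences in, one label out."""
    sentence_a: str
    sentence_b: str
    label: str


@dataclass
class SingleSentenceExample:
    """Single sentence classification: one sentence in, one label out."""
    sentence: str
    label: str


@dataclass
class QuestionAnsweringExample:
    """Question answering: passage and question in, answer span out."""
    passage: str
    question: str
    answer_start: int  # character offset where the answer begins
    answer_end: int    # character offset just past the answer

    def answer_text(self) -> str:
        return self.passage[self.answer_start:self.answer_end]


@dataclass
class TaggingExample:
    """Entity tagging: a list of tokens in, one tag per token out."""
    tokens: List[str]
    tags: List[str]


# Hypothetical examples, not drawn from any real data set.
pair = SentencePairExample("A man is eating.", "Someone is eating.", "entailment")
single = SingleSentenceExample("The film was a delight.", "positive")

passage = "The cat sat on the mat."
answer = "on the mat"
start = passage.find(answer)
qa = QuestionAnsweringExample(passage, "Where did the cat sit?", start, start + len(answer))

tagged = TaggingExample(["Alice", "visited", "Paris"], ["B-PER", "O", "B-LOC"])

print(qa.answer_text())
print(list(zip(tagged.tokens, tagged.tags)))
```

The point of the sketch is only that the first two tasks return a single label, the third returns a span of the passage, and the fourth returns one label per word.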
There have been other attempts at benchmarks, such as SentEval, which also relies on a variety of existing classification tasks involving either one or two sentences as inputs. SentEval is well suited to evaluating general-purpose sentence representations in isolation. But something broader, like GLUE, is needed to test cross-sentence contextualization and alignment, so that we can drive new research and push state-of-the-art performance on these tasks.
What are we really testing?
There are four broad categories of phenomena we are looking at:
This is by no means a complete set of phenomena in a language. For example, meiosis, the use of understatement to enhance an impression on the hearer, is not tested. Remember, we are testing a specific set of phenomena that allows us to create better tools, not creating a speaking android. As mentioned, the creators of GLUE have assigned sentence pairs to each of these phenomena and labelled the relationship between the pairs as entailment, contradiction, or neutral. Each of these labels is defined in turn:
Entailment: the hypothesis states something that is definitely correct about the situation or event in the premise.
Neutral: the hypothesis states something that might be correct about the situation or event in the premise.
Contradiction: the hypothesis states something that is definitely incorrect about the situation or event in the premise.
Let’s give an example of an entailment.
If I want to test entailment in a lexical statement, the statement might be, “I climbed a mountain.” This naturally entails that “I can climb a mountain.” But if I say, “I want to climb a mountain,” it does not entail that “I can climb a mountain.” The computer needs to correctly identify whether there is an entailment between the sentences.
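As a rough illustration, the two mountain-climbing examples can be written down as labelled premise and hypothesis pairs and used to score a model. This is only a sketch: the names `Label`, `NLIPair` and `accuracy` are invented for this post, the "neutral" label on the second pair is my own assumption (the text above only says it is not an entailment), and the "always predict entailment" baseline stands in for a real model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Label(Enum):
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class NLIPair:
    premise: str
    hypothesis: str
    label: Label


pairs = [
    NLIPair("I climbed a mountain.", "I can climb a mountain.", Label.ENTAILMENT),
    # Assumed label: wanting to climb neither proves nor rules out being able to.
    NLIPair("I want to climb a mountain.", "I can climb a mountain.", Label.NEUTRAL),
]


def accuracy(examples: List[NLIPair], predictions: List[Label]) -> float:
    """Fraction of pairs whose predicted label matches the gold label."""
    correct = sum(1 for ex, pred in zip(examples, predictions) if ex.label == pred)
    return correct / len(examples)


# A trivial stand-in for a model: always answer "entailment".
baseline_predictions = [Label.ENTAILMENT for _ in pairs]
baseline_accuracy = accuracy(pairs, baseline_predictions)
print(baseline_accuracy)
```

A model that has actually learned something about the difference between "climbed" and "want to climb" should beat this baseline on the second pair.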
How good are human beings versus the benchmark?
Based on the current leaderboard ratings on the gluebenchmark.com website, we can see the following.
It is important not to fixate on the precise numbers given versus a human being. Why? Because one should never let perfect be the enemy of good. The productivity of humans varies depending on any number of physical and emotional factors. Computers have their own challenges, but emotional factors are not one of them. Sometimes steady performance on a task may be preferable. Further, computers can execute a task on an enormous corpus of work, which a human cannot do in a reasonable period of time.
So why do we not see more adoption of these models? We reach the barrier of the limits of human imagination. You may have created a tagger, parser, summarizer, ASR system or NLG module, but so what? What you do with it is what is important. Computers cannot ask the questions “why?” and “what for?” That is what humans are for: to pose the questions. Delvify helps you pose the right questions so your business is able to use the tools of NLP more effectively.
*Note on evaluation. The GLUE testing uses a generalized version of the Matthews correlation coefficient (extended from two classes to the three label classes) to measure results. This measure takes into account true and false positives and negatives, and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. In its basic form it is a correlation coefficient between the observed and predicted binary classifications. It is often preferred to an F score because the F measure has little to say about specificity (true negatives) and does not really measure the proportion of actual negatives that are correctly identified.
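The difference between the two measures is easiest to see on a small, lopsided example. The sketch below implements the standard binary Matthews correlation coefficient and the F1 score from confusion-matrix counts in plain Python. It is the two-class form only, not the three-class generalization mentioned above, and the function names and toy data are my own.

```python
import math
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary MCC; defined as 0.0 here when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score; note that true negatives do not appear anywhere."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy data: nine positives, one negative, and a "model" that always says positive.
y_true = [1] * 9 + [0]
y_pred = [1] * 10

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"F1 = {f1:.3f}, MCC = {mcc:.3f}")
```

On this toy data the always-positive model earns a high F1 score while its MCC is zero, because it never identifies the one negative example correctly. That is the sense in which MCC is the more balanced measure when class sizes differ.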