The practice of inference

This work is motivated by the problem of fake news spreading on the Web and by the goal of proposing a solution for automatic fact verification. The fact-verification problem concerns misinformation and fake news coming from sources considered unreliable.

A recently suggested approach treats this problem as a document-classification task over the following three categories: "Reliable", "Unreliable", or "Not enough information".

In related work, automatic fact verification is considered a problem that may include a natural language inference (NLI) task, i.e. the detection of textual entailment. The goal of this task is to determine whether a sentence (the hypothesis) can reasonably be inferred from another sentence (the premise).

For language inference tasks, understanding the semantic and logical relations between sentences is essential. However, this understanding is difficult for a machine because of its lack of knowledge about the outside world.

To build our model, we use the idea of prior (commonsense) knowledge augmentation. We enrich an NLI dataset with additional facts, in the form of plain text (sentences), coming from an additional general knowledge base. We then propose a model architecture applicable to the newly created dataset and build several configurations of the model in order to compare the results obtained. To be able to interpret the results quantitatively, we build a second, similar model that does not use this fact augmentation with external knowledge.

We then compare the results of the models with and without the facts. We observe that the two types of models reach comparable accuracy: 64.79% for the model without facts, against 65.34% for the experiment using fact augmentation, which is a better-performing and promising result. Integrating a pre-trained model such as BERT considerably reduces the accuracy gap, making it almost negligible: 67.77% for the BERT model without added facts compared to 68.03% for the fact-augmented configuration. However, pre-training BERT is a very costly operation.

The results obtained show that the proposed approach is viable and competitive, while opening further research directions concerning possible improvements.

In addition, we carry out a qualitative analysis of the model's performance by manually examining a sample of correctly and incorrectly classified examples. We find that, in some cases, our model is more robust thanks to the additional knowledge available to it, while in others the additional knowledge introduces noise, making the example harder for the model to classify.

This chapter provides a summary of the essential contents of the master's research project. It gives the background required for a better understanding of the research topic, states the problem, and explains the proposed approach and the methods used. Finally, it introduces the principal results and gives an overview of the contents of this report.

1.1 Background

Natural language processing (NLP) is a key component of the artificial intelligence research field. It lies at the border of two domains: computer science and computational linguistics. The goal of NLP is to make computers capable of understanding and processing any kind of written or spoken natural human language.

The field has become a highly demanding research area due to the extremely fast growth of data on the Internet. It has become impossible for humans to process all the unstructured information appearing every minute. Nowadays, people require computers to be remarkably intelligent in order to catch up with (or even exceed) the human level of language-processing quality. Among the tasks solved by NLP systems are machine translation, language modeling, natural language understanding (NLU), and others. By NLU we usually mean the capability of the computer to read a text, understand it, answer related questions, perform logical inference based on the given piece of information, and so on.

Previously, most approaches to solving such tasks were based on human-written rules (in rule-based systems) or human-designed features (in machine learning algorithms). Recently, the trend has changed: more and more NLP systems use deep neural network architectures, which are more flexible and less time-consuming (in terms of human effort) than the previous approaches. Deep learning models have already proven to be a powerful instrument for solving NLP tasks.

All the current state-of-the-art results were achieved by incorporating deep learning approaches into the systems, not only in research but also in corporate production. For example, since November 2016, Google has been using the neural machine translation approach (Wu et al., 2016), which has provided a noticeable improvement in machine translation quality. With the advancement of training methods, more difficult tasks and problems have appeared on the scene.

1.2 Motivation and Problem statement

The focus on misinformation has recently stimulated an increase in research on fact checking, which is the task of assessing the truthfulness of a claim (Thorne and Vlachos, 2018). The enhanced ability to spread both true and false information quickly via the Internet has become an important concern, as it is speculated to have affected decisions such as public votes, especially since recent studies have shown that false information reaches greater audiences (Vosoughi et al., 2018). This has created an increased demand for automated fact checking of content on the Web.

The process of fact checking, or true search, includes the identification and retrieval of relevant evidence to enable reasoning about what can be inferred from a given claim and how close it is to reality. Only a limited number of published dataset resources for fact checking is available so far. Automated fact-checking systems rely on either of two basic NLP approaches: textual entailment/natural language inference (NLI) (Dagan et al., 2006), or knowledge base construction techniques (Ji et al., 2011). In this work we try to combine these two approaches to see how this can help solve the task.

NLI. Following this direction, (Ferreira and Vlachos, 2016) suggested a fact-checking model based on the recognizing textual entailment (RTE) approach, predicting whether a sentence is "for", "against", or "observing" a given claim.

RTE-based fact-checking models assume that the textual evidence is already given, so the retrieval part of the task is omitted.

Knowledge base. Additional context can help improve classification accuracy. Knowledge graphs can support the task of fact checking as they provide a rich collection of structured, canonical information about the world. Moreover, this knowledge is stored in a machine-readable format.

1.3 Proposed approach and results

Our approach combines two different problems: natural language inference and common sense knowledge utilization in NLP models.

This project was motivated by the true search problem, which we consider as a sentence-pair classification task and which matches perfectly with NLI. The idea of the suggested solution is to augment the SNLI (Stanford Natural Language Inference (SNLI, 2019)) dataset with additional commonsense knowledge loaded from ConceptNet (Speer and Lowry-Duda, 2017), and then to propose an architecture showing a possible way to embed such additional facts into an NLI model. We use unsupervised word vector pretraining and a simple neural architecture with several layers, including a GRU memory unit and max pooling. We develop six different types of models.

Four of them include factual information (two with independent facts and two with pair-wise joint facts), and two more are not augmented at all, for comparison and evaluation. All the work and the experiments presented in this paper have been implemented in Python using PyTorch, an open source deep learning platform.

1.4 PyTorch

PyTorch was primarily developed by Facebook's artificial intelligence research team. It provides a rich ecosystem of tools and libraries which extend PyTorch and support development in computer vision, NLP, and more. PyTorch provides two high-level features:

• Tensor computing (like NumPy) with strong acceleration via graphics processing units (GPUs);
• Deep neural networks built on a tape-based autodiff system.

Tensors in PyTorch differ from tensors in mathematics; they can be treated as multidimensional array data structures.

Here, Tensors are similar to NumPy arrays; the difference is that they can be operated on by a CUDA-capable GPU. PyTorch uses a method called automatic differentiation: it records the operations that have been performed and then replays them backward to compute the gradients. This saves time per epoch, because the computation graph needed for differentiation is built during the forward pass.
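To make this concrete, the following minimal sketch (ours, for illustration only) shows autograd recording operations during the forward pass and replaying them backward:

    import torch

    # Two parameters tracked by autograd
    w = torch.randn(3, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)

    x = torch.tensor([1.0, 2.0, 3.0])        # input (no gradient needed)
    y_true = torch.tensor([10.0])

    # Forward pass: every operation is recorded on the autodiff "tape"
    y_pred = (w * x).sum() + b
    loss = ((y_pred - y_true) ** 2).sum()

    # Backward pass: the recorded graph is replayed to compute gradients
    loss.backward()
    print(w.grad, b.grad)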

Automatic differentiation is especially powerful when building neural networks. TensorBoard is used for building the visualizations, tracking quantitative parameters, and visual debugging. For example, we generated summaries every 100 to 1000 steps to monitor parameters such as the loss and accuracy values during training.
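A sketch of this kind of logging, assuming the tensorboardX package and a hypothetical log directory and metric stream (the real training loop is omitted):

    from tensorboardX import SummaryWriter

    writer = SummaryWriter(log_dir="runs/snli_facts")        # hypothetical log directory

    # placeholder (loss, accuracy) pairs standing in for real training metrics
    training_metrics = [(0.9, 0.40), (0.7, 0.55), (0.5, 0.63)]

    for step, (loss_value, acc_value) in enumerate(training_metrics):
        if step % 100 == 0:                                  # log every 100 steps
            writer.add_scalar("train/loss", loss_value, step)
            writer.add_scalar("train/accuracy", acc_value, step)
    writer.close()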

1.5 Contents of this report

The structure of this report is as follows: we first explain what problem this project intends to solve. To do so, in chapter 2, we introduce the automated fact checking problem, natural language inference, common sense utilization, and the relation between those three, along with the current state of the art and possible applications.

We then proceed to discussing the theoretical foundations needed to understand the proposed solution in chapter 3. We explain the essential concepts such as word embeddings, deep learning networks, and memory units. Following this, in chapter 4, we provide a detailed explanation of the practical implementation supporting the proposed approach. We start by discussing the fact-extraction process and the dataset, and then move on to the model architecture and training details. Subsequently, we introduce the evaluation metrics used for the analysis of the results. In chapter 5, we then present the achieved results.

Finally, chapter 6 concludes the discussion by summarizing all the work that was done and proposing future research directions as a possible continuation of this work. The last section of this chapter provides the master's project goals and the evaluation criteria analysis, as stated in the report template.

This chapter details the stated problem and presents the current state of the art regarding it. This work combines several well-known NLP tasks (the fact-checking problem, NLI, and prior knowledge utilization for language processing) in order to suggest a novel approach to the existing problems in the domain.

2.1 Fact checking problem

The amount of content on the Internet is currently exploding. According to (Marr, 2018), 2.5 quintillion bytes of new data are being created on the Web each and every day.

Apparently, some of this data might be false or only partially true. The fact that people can no longer control this endless informational stream creates a need for a comprehensive tool that allows the information encountered on the Web, or anywhere else, to be evaluated critically. The fact-checking problem deals with misinformation and fake news coming from unreliable sources. The systems approaching this problem aim at detecting this kind of fabricated information.

Automated fact checking is the task of having a machine assess the truthfulness of a claim and give humans clues for forming their own unbiased opinion regarding this claim. Although the problem is relatively new, some scientific expertise already exists on the subject, including a shared task and workshop on fact checking organized and sponsored by Cambridge University researchers. The shared task is called FEVER (Fact Extraction and VERification) (Thorne et al., 2018).

The goal of the FEVER challenge is to evaluate the ability of a system to verify information using evidence from Wikipedia. For a given claim, the task is to extract corresponding sentences from Wikipedia that help the system either support or refute the claim, and then, using that evidence, label the claim as 'support', 'refute', or 'notEnoughInfo'. Therefore, generally speaking, FEVER is a 3-class text classification task. Suggested models are evaluated by classification accuracy and evidence recall scores. More specifically, the FEVER task consists of 3 steps:

1. Document retrieval: among all Wikipedia pages, only the relevant documents need to be retrieved, i.e. those which contain potentially evidential information for supporting or refuting the given claim. Neural semantic matching networks, a logistic regression model over article titles, entity linking approaches (constituency parsing, rules), a TF-IDF score, etc. might be used for this step.

2. Sentence selection: among the documents retrieved in the previous step, only the relevant sentences need to be selected, so that the model focuses its attention on them alone. Possible approaches include neural semantic matching networks, a logistic regression model over text features, a sentence ranking model based on ESIM, or a TF-IDF score against the claim.

3. Claim verification and aggregation: performing 3-class pairwise sentence classification for each retrieved evidential sentence against the claim, and aggregating the results of each pair to obtain the final score. For this purpose, the winners of the FEVER Challenge 2018 used the following models: ESIM (Chen et al., 2017b) (biLSTM + neural aggregation), attention mechanisms, and the Transformer network with attention (Vaswani et al., 2017).

Table 2.1 shows the 4 best results achieved in the FEVER Challenge 2018. In this work we concentrate on the 3rd component of the true search task: sentence pair classification.

We develop an approach which might be rather universal for many similar tasks, and test it on an NLI dataset for the sake of simplification.

2.2 Natural language inference problem

The natural language inference (NLI) problem is a subset of natural language understanding (NLU) tasks, which aim at drawing conclusions from the given information.

NLI models attempt to infer the correct label for a pair of sentences by reading them and judging the relationship between them, choosing one of three possible options: 'entailment', 'contradiction', or 'neutral'. Therefore, it is again a 3-class sentence-pair classification task, similar to the 3rd step of the FEVER challenge discussed above. The NLI task is challenging, but with the availability of large-scale annotated datasets such as SNLI (Bowman et al., 2015) and MultiNLI (Nangia et al., 2017), researchers have managed to make a lot of progress on this problem.

The current state-of-the-art models are rather big and complicated. They employ approaches such as training deep neural networks on several tasks, for example MT-DNN (Liu et al., 2019), applying an ensemble of semantic sentence matching networks with densely-connected recurrent and co-attentive information (Kim et al., 2018), implementing generative pre-training by employing a fine-tuned LM-pretrained Transformer (Radford et al., 2018), or using discourse marker augmented networks (Pan et al., 2018), etc. Table 2.2 shows the current state of the art for the SNLI benchmark by listing the top-4 models at the moment of writing this report, taken from the official (SNLI, 2019) project web page.

The original SNLI corpus (Bowman et al., 2015) is a collection of 570k human-written pairs of sentences, labeled according to their relationship to each other as described above for the natural language inference (NLI) task, supporting 3-class classification with the possible labels 'entailment', 'contradiction', or 'neutral'. A few examples taken from the SNLI corpus are shown in table 2.3.

Premise | Hypothesis | Label
A man inspects the uniform of a figure in some East Asian country. | The man is sleeping. | contradiction
An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | neutral
A black race car starts up in front of a crowd of people. | A man is driving down a lonely road. | contradiction
A soccer game with multiple males playing. | Some men are playing a sport. | entailment

Table 2.3: A few example pairs taken from the SNLI corpus along with their labels.

2.3 Prior knowledge and knowledge base problem

When people are asked to solve a question answering task, or to infer something out of a given piece of data, they easily manage to do it even if the given information is insufficient.

Why can they do that?

Apparently, humans do not start from scratch: they already have a set of prior knowledge about the world and, most likely, about the domain the question belongs to, which they can use to augment the given information in order to solve the problem. This is where the idea of using additional information (prior knowledge) as an input to a machine learning model, to improve its ability to solve NLP tasks, comes from.

The goal of this approach is to enrich the originally provided data with more facts and knowledge in order to give the machine more context, i.e. more evidential information. Recent works have shown the effectiveness of prior knowledge utilization on a variety of tasks, including language modeling (Ahn et al., 2016), machine translation (Shi et al., 2016), word embedding (Chen et al., 2015; Liu et al., 2015), and dialogue systems (Chen et al., 2016).

The problem of using prior knowledge and knowledge bases in ML models itself consists of two parts: what data (or which source of data) to use, and how to embed it into the system. The source, type, and structure of such commonsense knowledge can differ. First of all, data collection is a tough process. While it can be automated, a satisfactory level of quality can only be achieved by humans manually filling in the commonsense knowledge base, which is extremely time- and effort-consuming.

The data can be represented as plain text, in a structured format such as the Resource Description Framework (RDF), which describes resources and their relations based on XML syntax, or as a graph. The representation of the data plays a very important role. Firstly, it should be stored in a format that is convenient for a machine to handle efficiently. Secondly, it has to support the relationships between the words.

Finally, it should help the system avoid word ambiguity, so that the machine is able to distinguish different meanings of the same words. Graph-based representations are good at addressing all the issues described above. Therefore, WordNet (Fellbaum, 1998) and ConceptNet (Speer et al., 2016) have become extremely popular sources of commonsense data. For example, (Talmor et al., 2018) investigates question answering with prior knowledge taken from ConceptNet, and in (Chen et al., 2017a) WordNet is used as a source of external knowledge for solving the natural language inference task.

3 Theoretical Foundations for the Solution

This chapter describes in abstract (theoretical) terms how the proposed approach can be implemented and how to solve related sub-problems, using the state of the art as an analysis tool.

3.1 Word and Token Embeddings

Embedding methods, also known as 'encoding' or 'vectorising', convert symbolic representations (i.e. words, emojis, categorical items, dates, times, other features, etc.) into meaningful numbers, i.e. real numbers that capture the underlying semantic relations between the symbols (Terry-Jack, 2019).

In other words, such embedded representations map words from a sparse space to a high-dimensional continuous vector space in which word vectors share properties. Pre-trained models can capture that some words have similar meanings, so their vector representations share the same (or a close) neighborhood (figure 3.1 visualizes such examples). While some word embeddings, such as one-hot vectors, assume that there is no inherent relationship between different items, for NLP tasks it is very important to have differences in the similarity measure between word vectors. For that purpose, feature vectors can be employed, so that depending on the different features of the words we can evaluate how similar the words are in meaning.

Features can also be more abstract relations, such as the context in which a word occurs (assuming words with similar contexts must have similar meanings). Such an embedding can accurately capture the sense in which the word is used (a word's usage can change depending on the time, community, and context in which it is being used), so that such an approach removes the ambiguity. Word vectors are feature extractors that encode semantic features of words in their dimensions.

Figure 3.1: Visualization of relations between vectors of word embedding representations, taken from (Ruizendaal, 2017).

The words are projected from a sparse, 1-of-$V$ encoding onto a lower-dimensional vector space, where $V$ is the vocabulary size. Likewise, in such dense representations, semantically similar words are close in Euclidean or cosine distance in the lower-dimensional vector space (Kim, 2014).

Thus, neural embedding methods can be employed for capturing word features with regard to the context. We assume that the hidden layers of a trained neural network have learnt useful features that can be used as word representations in further training, to help the network 'understand' the meaning of the words provided as input. In this work we experimented with three different popular pre-trained word embeddings, which are described in more detail below.

3.1.1 GloVe

GloVe (Pennington et al., 2014) is an unsupervised learning algorithm for obtaining vector representations of words. Its training is performed on aggregated global word-word co-occurrence statistics from a corpus. GloVe is a log-bilinear model with a weighted least-squares objective (also known as weighted linear regression). The main intuition for why the model works is the simple observation that ratios of word-word co-occurrence probabilities can encode some of a word's meaning. GloVe learns word vectors such that their dot product equals the logarithm of the words' co-occurrence probability, which is the training objective of this model.

Moreover, since the logarithm of a ratio equals the difference of logarithms, this objective associates the logarithm of ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios encode some of a word's meaning, this information is also encoded as vector differences.
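For reference, the weighted least-squares objective minimized by GloVe, as given in the original paper, can be written as:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where $X_{ij}$ counts how often word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that down-weights rare and overly frequent co-occurrences.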

3.1.2 ConceptNet Numberbatch

ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) which can be used directly as a representation of word meanings or as a starting point for further training of a machine learning model. ConceptNet Numberbatch is part of the ConceptNet open data project and is used to create representations of word meanings as vectors, similarly to GloVe, but with a larger source of information.

It is built using an ensemble that combines data from ConceptNet (Speer et al., 2016), word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and OpenSubtitles 2016, using a variation on retrofitting. These word embeddings are free, multilingual, aligned across languages, and designed to avoid representing harmful stereotypes.

Their performance at word similarity, within and across languages, was shown to be state of the art at SemEval 2017 (Speer and Lowry-Duda, 2017).

3.1.3 Bidirectional Encoder Representations from Transformers (BERT)

BERT (Devlin et al., 2018) is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT uses a Transformer encoder architecture, which can help solve many NLP tasks by simply adding a classification layer to the pre-trained model and then fine-tuning all parameters on the specific task. This approach does not apply to all tasks; however, BERT can still be extremely useful for extracting pre-trained contextual representations by using the activations from one or several layers of BERT without any fine-tuning.

The BERT model is trained on a large amount of data with a total of 110M parameters (base model) or 340M parameters (large model), which makes the pre-trained vectors derived from it extremely powerful in terms of storing the semantic information of words. Meanwhile, training BERT is extremely costly. Originally, BERT base was trained on 4 Cloud TPUs for 4 days, and BERT large on 16 Cloud TPUs, again for 4 days. In (Dettmers, 2018) it was estimated that, with a GPU cluster of V100s/RTX 2080 Tis with good networking (Infiniband +56GBit/s) and good parallelization algorithms, one can expect to train BERT large on 64 GPUs (the equivalent of 16 TPUs) or BERT base on 16 GPUs in 5 1/3 or 8 1/2 days. In a non-commercial setting, one can expect to train BERT large in 21 or 34 days and BERT base in 10 2/3 or 17 days on an 8-GPU machine with V100s/RTX 2080 Tis, using a framework such as PyTorch or TensorFlow. For a standard 4-GPU desktop with RTX 2080 Tis (much cheaper than the other options), one can expect to replicate BERT large in 68 days and BERT base in 34 days. Training our model with pre-trained BERT token representations requires at least 5 days, which is 30 times more than training the same model using GloVe. Below we show that good results can be obtained even without it.

Recently, the report (Strubell et al., 2019) has shown the harmful impact of training such big and costly models on the environment in terms of carbon dioxide emissions. According to it, training BERT has a footprint of about 1,400 pounds of carbon dioxide. This fact also encourages researchers to find alternative solutions which are less resource-consuming but still effective.

3.2 Deep learning for sentence classification

Text or sentence classification is one of the fundamental tasks in the NLP (natural language processing) field, in which a machine is asked to tag a given piece of textual information; such tasks include spam or negation detection, sentiment analysis, natural language inference, etc.

Deep learning is a set of algorithms and techniques inspired by how the human brain works. Text classification has benefited from the recent resurgence of deep learning architectures due to their potential to reach high accuracy with less need for engineered features. The main deep learning architectures used in text classification are convolutional neural networks (CNNs), recurrent neural networks (RNNs), and the Transformer. On the one hand, deep learning algorithms require much more training data than traditional machine learning algorithms, i.e. at least millions of tagged examples. On the other hand, traditional machine learning algorithms such as SVM and NB reach a certain threshold where adding more training data does not improve their accuracy.

In contrast, deep learning classifiers continue to get better the more data they are fed. Within NLP, much work with deep learning methods involves learning word vector representations through neural language models, as described above in the word embeddings section. Such models then perform composition over the learned word vectors for classification (Kim, 2014).

Figure 3.2: Illustration of (a) LSTM and (b) gated recurrent units. (a) i is the input, f the forget, and o the output gate; c and c̃ represent the memory cell and the new memory cell content. (b) r and z are the reset and update gates, h and h̃ are the activation and the candidate activation. This picture is taken from (Chung et al., 2014).

3.2.1 Convolutional Neural Networks

A CNN is a class of deep, feed-forward artificial neural networks (where connections between nodes do not form a cycle). In general, it is a variation of the multilayer perceptron designed to require minimal pre-processing. Its structure is inspired by the animal visual cortex. Convolutional neural networks utilize layers with convolving filters that are applied to local features (LeCun et al., 1998).

CNNs were invented for, and are mostly used in, the field of computer vision. However, they have recently been applied to various NLP tasks (Collobert et al., 2011) and have shown their effectiveness on them.

3.2.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) are among the most popular architectures used in NLP, because their recurrent structure overcomes the shortcomings of traditional neural networks when dealing with sequence data: text, time series, videos, DNA sequences, etc. An RNN is a type of artificial neural network in which the connections between nodes form a directed graph along a sequence of tokens. In fact, an RNN is a sequence of neural network blocks that are linked to each other like a chain.

The main value of the RNN is that it is able to exhibit dynamic temporal behavior for a time sequence. Training an RNN requires a sufficiently large amount of training data, which is hard to obtain. However, many systems suggest joint learning across multiple related tasks; this approach helps to increase the amount of training data and to make it more diverse. An RNN learns by recursively applying a transition function to its internal hidden state vectors over the input sequence. During the training of a classical RNN, the components of the gradient vectors can grow or decay exponentially over long sequences, causing the well-known exploding or vanishing gradient problem. This makes it difficult for an RNN to learn long-distance correlations.

However, gated recurrent neural networks were invented to address the issue of long-term dependencies, such as the long short-term memory (LSTM) architecture (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (GRU) (Cho et al., 2014). Figure 3.2, taken from Chung et al. (2014), illustrates the structure of the LSTM and GRU units. Unlike the traditional recurrent unit, which overwrites its content at each time step, the LSTM has a separate memory cell, which updates and exposes its content only when deemed necessary.

Figure 3.3: Gated Recurrent Unit, fully gated version (picture taken from Wikipedia).

At each step, the memory cell is updated by partially forgetting the existing memory and adding new memory content. The extent to which the existing memory is forgotten is modulated by a forget gate, and the degree to which the new memory content is added to the memory cell is modulated by an input gate. Like the LSTM unit, the GRU utilizes gating units that control the flow of information inside the unit, but the GRU does not have a separate memory cell and uses fewer gates. Bidirectional GRUs are bidirectional recurrent neural networks built from such units.

The two directions allow the use of information from both previous and later time steps to make predictions about the current state. For the GRU, the hidden state $h_t$ is computed as:

$$z_t = \sigma(x_t U^z + h_{t-1} W^z + b^z)$$
$$r_t = \sigma(x_t U^r + h_{t-1} W^r + b^r)$$
$$\tilde{h}_t = \tanh\bigl(x_t U^h + (r_t \ast h_{t-1}) W^h\bigr)$$
$$h_t = (1 - z_t) \ast h_{t-1} + z_t \ast \tilde{h}_t$$

Here $\ast$ denotes the Hadamard product (element-wise multiplication), $r$ is the reset gate, and $z$ is the update gate. In other words, the reset gate defines how to combine the new input with the previous memory, while the update gate determines the portion of the previous memory to keep (to remember). Figure 3.3 gives a more detailed visualization of the GRU architecture. Utilizing knowledge from an external source can enhance the precision of the RNN, because it embeds new (lexical and semantic) information about the words into the model.
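A minimal PyTorch sketch of these update equations (ours, for illustration only; in practice we use torch.nn.GRU directly):

    import torch

    def gru_step(x_t, h_prev, U, W, b):
        """One GRU step following the equations above.
        U, W, b are dicts of weight matrices / bias vectors for the z, r and h gates."""
        z_t = torch.sigmoid(x_t @ U["z"] + h_prev @ W["z"] + b["z"])        # update gate
        r_t = torch.sigmoid(x_t @ U["r"] + h_prev @ W["r"] + b["r"])        # reset gate
        h_tilde = torch.tanh(x_t @ U["h"] + (r_t * h_prev) @ W["h"])        # candidate state
        return (1 - z_t) * h_prev + z_t * h_tilde                           # new hidden state

    # Tiny usage example with random parameters
    E, H = 300, 64
    U = {k: torch.randn(E, H) for k in "zrh"}
    W = {k: torch.randn(H, H) for k in "zrh"}
    b = {k: torch.zeros(H) for k in "zr"}
    h = gru_step(torch.randn(E), torch.zeros(H), U, W, b)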

3.2.3 Transformer

The Transformer (Vaswani et al., 2017) is a network architecture based on attention mechanisms, dispensing with recurrence and convolutions. This model obtains superior results and requires less training time. The Transformer follows the encoder-decoder architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder.

Figure 3.4: The Transformer model architecture: encoder (left) and decoder (right). This schema is taken from the original (Vaswani et al., 2017) paper.

The encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $z = (z_1, \dots, z_n)$. Given $z$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time. At each step the model consumes the previously generated symbols as additional input while generating the next one. The encoder and decoder are both composed of modules that can be stacked on top of each other multiple times (denoted by Nx in the figure). The modules include multi-head attention and feed-forward layers.

The inputs and outputs are first embedded into an n-dimensional space. The attention function is essentially a mapping of a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
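A minimal PyTorch sketch of this scaled dot-product attention (ours; the formula itself is given in equation 3.1 below):

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # compatibility of queries with keys
        weights = torch.softmax(scores, dim=-1)             # attention weights
        return weights @ V                                   # weighted sum of the values

    # Example: 5 query positions attending over 7 key/value positions, d_k = 64
    Q, K, V = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 64)
    out = scaled_dot_product_attention(Q, K, V)              # shape (5, 64)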

In equation 3.1, $Q$ is a matrix that contains the query (the vector representation of one word in the sequence), $K$ contains all the keys (vector representations of all the words in the sequence), and $V$ contains the values, which are again the vector representations of all the words in the sequence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3.1)$$

The Transformer architecture represents an attention-based model which is able to compete with convolutional and recurrent networks, while being cheaper to train.

4 Practical implementation

This chapter gives a concrete discussion of how the proposed solution was implemented and evaluated. It describes the prior knowledge extraction and the dataset building process.

It then explains the types of pre-training applied, as well as the model architecture and our training details. Finally, it gives an overview of the evaluation metrics used in the subsequent analysis.

4.1 Facts extraction

Our model uses ConceptNet (Speer and Lowry-Duda, 2017) as a source of external data. As mentioned above, ConceptNet is a graph-based semantic network that contains commonsense knowledge about concepts (words). It is designed to help computers understand the meaning of the words that people use. ConceptNet is an open-content, multilingual, and domain-general knowledge graph which has a Web API to access its data. We utilize an interesting property of this API called "surfaceText".

It allows the linked concepts, along with the type of relationship between them, to be loaded in the form of plain text. In other words, following the formula <concept> + <relationship> + <linked concept>, we obtain a regular sentence which can then be treated as usual text and used as an input to the model. We refer to those sentences as "facts". For example, the edge from ConceptNet 5.7 shown in Figure 4.1 would form a sentence like this: "boy is related to man". In this way the machine learning algorithm becomes aware that those two concepts are close in meaning.

The choice to load the facts in surface-text form was made intentionally. The main idea behind it is that in the future we can load any selected text from the Web to gain the additional knowledge we need from various sources, e.g. Wikipedia or any reliable source of news. In this case, we would not depend on a specific form of knowledge representation, i.e. this approach allows more flexible models to be designed.

Figure 4.1: Edge details for the "boy" and "man" concepts taken from ConceptNet 5.7, which would form a sentence like: "boy is related to man".
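As an illustration, facts in surface-text form can be retrieved from the public ConceptNet Web API roughly as follows. This is a sketch only, assuming the api.conceptnet.io endpoint and its JSON schema; the exact filtering in our loader may differ, and the helper name is ours:

    import requests

    def load_facts(concept, limit=5):
        """Return up to `limit` surface-text facts for an English concept, ordered by edge weight."""
        url = "http://api.conceptnet.io/c/en/" + concept
        edges = requests.get(url, params={"limit": 50}).json().get("edges", [])
        edges = [e for e in edges if e.get("surfaceText")]             # keep edges with surface text
        edges.sort(key=lambda e: e.get("weight", 0), reverse=True)     # order by weight
        # surfaceText marks concepts with [[...]]; strip the brackets to get a plain sentence
        return [e["surfaceText"].replace("[", "").replace("]", "") for e in edges[:limit]]

    print(load_facts("boy"))   # e.g. ["boy is related to man", ...]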

To build our facts-augmented dataset, for each sentence taken from the SNLI training dataset (Bowman et al., 2015) we extract surface text from ConceptNet, following these rules:

• concepts can only be nouns or adjectives;
• for each concept (or pair of concepts) we load at most the 5 top facts, ordered by weight.

Another interesting feature which we also utilize here is the ability to load relations for pairs of concepts. So, following the same rules as described above, we also load facts for concepts taken from the premises and hypotheses of the SNLI dataset pair-wisely.

Further in this work we will refer to those as the joint facts, or facts pairs, as opposed to the simple facts loaded for each sentence independently. Below we provide an example of one training sample taken through all the steps of the process, to illustrate it in more detail.

1. Take one sentence (premise): "A man in shorts and a woman in a red long dress are walking down a road with a small white car in the background."

2. Create a list of all the concepts (nouns and adjectives only) contained in this sentence: ['man', 'shorts', 'woman', 'red', 'long', 'dress', 'road', 'small', 'white', 'car', 'background'].

3. Load facts related to those concepts as surface text from ConceptNet (max 5 facts per concept). Example facts for the word 'man': ['boy is related to man', 'man is related to person', 'man is related to boy', 'person is related to man', 'fellow is related to man'].

4. Take another sentence (hypothesis): "A couple is on the run from a robbery."

5. Create a list of all the concepts (nouns and adjectives only) contained in this sentence: ['couple', 'run', 'robbery'].

6. Load facts related to those concepts as surface text from ConceptNet (max 5 facts per concept). Example facts for the word 'couple': ['a couple can tie the knot', 'A couple is two', 'An activity a couple can do is watch a movie', 'a couple can row about anything', 'penjepit is a translation of couple'].

7. Match each concept of the premise with each concept of the hypothesis, creating word couples. Example word couples: [(man, couple), (shorts, couple), (woman, run), (red, run), (dress, robbery), etc.].

8. Load facts related to each pair of concepts, i.e. to both the premise and the hypothesis, as surface text from ConceptNet (max 5 facts per couple). Example facts for the pair of words (red, run): ['red is related to run'].

9. Keep the original SNLI label corresponding to the pair of sentences (premise and hypothesis): contradiction.
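A sketch of the concept-extraction and pairing logic described in the steps above, assuming a spaCy English model (en_core_web_sm) is installed; function names are ours:

    import itertools
    import spacy

    nlp = spacy.load("en_core_web_sm")   # English pipeline with a POS tagger

    def extract_concepts(sentence):
        """Keep only nouns and adjectives, as in the rules above."""
        return [tok.text.lower() for tok in nlp(sentence) if tok.pos_ in ("NOUN", "ADJ")]

    premise = "A man in shorts and a woman in a red long dress are walking down a road."
    hypothesis = "A couple is on the run from a robbery."

    p_concepts = extract_concepts(premise)
    h_concepts = extract_concepts(hypothesis)
    concept_pairs = list(itertools.product(p_concepts, h_concepts))   # pair-wise word couples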

4.2 Augmented dataset description

Eventually, our new SNLI dataset augmented with ConceptNet facts has the following structure (each bullet point corresponds to a separate field of the dataset):

• premise: an original SNLI sentence
• set of facts related to the premise: a list of sentences
• hypothesis: an original SNLI sentence

• set of facts related to the hypothesis: a list of sentences

• set of facts related to both the premise and the hypothesis: a list of sentences loaded in a pair-wise manner

• label: the original SNLI label: entailment/contradiction/neutral

Table 4.1 provides an example of one sample of the new augmented dataset. We were able to load facts for only a subset of the SNLI examples.

The original SNLI corpus is a collection of 570k human-written, manually labeled English sentence pairs. Our modified version of the dataset is smaller in terms of the number of examples, but much larger in terms of the tokens obtained and the vocabulary range. In total, the augmented dataset contains 169.7k samples, which were divided into training, validation, and test sets at the ratio 0.8/0.2/0.2, respectively. Further in this work we will refer to 3 different settings of the proposed approach with respect to the data used for training. Firstly, the versions of the model called facts pairs include all the fields of the augmented dataset, i.e. the premise, the hypothesis, the facts for the premise, the facts for the hypothesis, and the joint ones.

This is the setting richest in data, which is at the same time (as we will see later) the noisiest one. Secondly, the models referred to as simple facts exclude the joint facts field of the augmented dataset, i.e. they utilize only the 2 sets of facts which were loaded independently from each other.

Lastly, our reference no facts models take into account only the original SNLI dataset without any modifications, i.e. the pairs of sentences plus the label.

premise (p): Children playing checkers.
facts for the premise: Dames is a translation of checkers, jeu de dames is a translation of checkers, damas is a translation of checkers, dam is a translation of checkers, king is used in the context of checkers
hypothesis (h): Children are playing board games.
facts for the hypothesis: board is related to wood, board is related to plank, board is related to flat, You can use a board to build, game is related to board, play is related to games, a marble is used for games, board is related to games, games are fun to play
joint facts for both p and h: board is related to checkers
label: entailment

Table 4.1: An example of one sample of the new SNLI dataset augmented with facts.

4.3 Pre-training

This section describes how the raw text is pre-processed and converted to tensors, and which pre-training methods are used for obtaining word vector embeddings.

4.3.1 The models without BERT pre-training

For the models without BERT pre-training, we tokenize each sentence using the "spaCy" tokenizer (spaCy, 2019).

It segments text into words, punctuation marks, and so on, applying rules specific to each language (we use it only for English). Then, using the capabilities of the torchtext framework, we perform numericalization and padding of each sentence individually. For this purpose we utilize the torchtext.data.Field class. The Field class models common text-processing data types that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. Moreover, the Field object keeps other parameters relating to how a data type should be numericalized and tokenized, and the kind of Tensor to be produced. Most importantly, if a Field is shared between two columns in a dataset (e.g., premise and hypothesis), then these columns will have a shared vocabulary (torchtext, 2019). In our case, we need a shared vocabulary between the premise, the hypothesis, and their corresponding facts, which have a slightly different structure (they are lists of sentences).

For the facts we utilize the torchtext.data.NestedField class. A nested field holds another field, called the nesting field (in our case, the Field described above for the sentence-level representation); it accepts an untokenized string or, as in our case, a list of strings, and treats it as one field. Every token is pre-processed and padded.

This means a nested field shares the vocabulary with the sentence-level field. The numericalization results for all the facts in the list are stacked into a single tensor. This field is primarily used to implement character embeddings, but we use it for the facts representation in the sense that each sentence in the list is first tokenized into words (again using spaCy), and then the whole list is tokenized by sentence.
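A sketch of this field setup, assuming the legacy torchtext API (torchtext.data.Field and NestedField) that was current at the time; the field names and the dataset mapping below are illustrative, not the exact ones used in the implementation:

    from torchtext import data

    # Word-level field shared by premise, hypothesis and their facts
    TEXT = data.Field(tokenize="spacy", lower=True, batch_first=True)

    # Facts are lists of sentences: a NestedField wraps the sentence-level field,
    # so every fact is tokenized by TEXT and the whole list is stacked into one tensor
    FACTS = data.NestedField(TEXT)

    LABEL = data.LabelField()

    # Mapping used when loading a (hypothetical) json version of the augmented dataset
    fields = {"premise": ("premise", TEXT),
              "hypothesis": ("hypothesis", TEXT),
              "premise_facts": ("premise_facts", FACTS),
              "hypothesis_facts": ("hypothesis_facts", FACTS),
              "label": ("label", LABEL)}

    # A single shared vocabulary, initialized with GloVe vectors, would then be built with:
    # TEXT.build_vocab(train_set, vectors="glove.840B.300d"); LABEL.build_vocab(train_set)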

Eventually, we have each field of the dataset represented as a single tensor, without losing any properties of the text and with a shared vocabulary across the whole dataset. As mentioned above, we experimented with three different ways of obtaining word vector representations: GloVe (Pennington et al., 2014), ConceptNet Numberbatch (Speer et al., 2016), and BERT (Devlin et al., 2018).

All word embeddings are updated during training. For the GloVe setting, the word embeddings are initialized with the 300D GloVe 840B vectors (Pennington et al., 2014), and out-of-vocabulary words are initialized randomly. For the ConceptNet Numberbatch embeddings, we used the 16.09 official version of the vectors, with a matrix dimension of 9,161,912 by 300. ConceptNet has its own strategy for treating out-of-vocabulary words, which is as follows: remove a letter from the end of the word and check whether the result is a prefix of known words, repeating this until a single character remains.

Eventually, we obtain our data as vectors of dimensions [B x S x E] and [B x S x N x E] for the single sentences and the sets of facts, respectively, where B is the minibatch size, S is the sequence length, N is the number of sentences inside the nested field (only for facts), and E is the embedding dimension, which equals 300 in our case. These vectors are then used as the input to the next step of the algorithm, which is the gated recurrent unit.

4.3.2 The models utilizing BERT pre-training

For the models utilizing BERT pre-training, we use another class of torchtext data fields, called RawField. Since the BERT model provides its own capabilities for tokenization and numericalization, the torchtext.data.RawField class suits this well.

A RawField object does not assume any property of the data type; it only holds parameters relating to how a data type should be processed. Firstly, the BERT model has its own associated tokenizer, which performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

Then we use the raw BERT Transformer model (fully pre-trained) to obtain the numerical vector representations of the words. To do that we need to get the hidden states computed by the BERT model. As advised by the original BERT paper, we take the last 4 hidden states of the model and sum over them to obtain the best contextualization.

Additionally, during the tokenization phase we have to add the special tokens indicating the beginning of the sentence ("[CLS]") and its end ("[SEP]"), and follow the pattern specified as the required input for the BERT model. For the numericalization, we first convert all the words into their vocabulary indices, padding them with 0s to equalize the sequence lengths within the minibatch. We treat each sentence separately, so there is no need for specific segment IDs. After that we apply the BERT model to the obtained tensors and get 12 hidden states as the output.
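A sketch of this feature-extraction step, written against the current Hugging Face transformers API as a stand-in for the BERT package used in the original implementation; batching details are omitted:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    bert.eval()

    sentence = "Children are playing board games."
    # WordPiece tokenization with the special [CLS] / [SEP] markers and 0-padding
    inputs = tokenizer(sentence, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = bert(**inputs)

    # hidden_states is a tuple: the embedding layer output plus the 12 encoder layers
    hidden_states = outputs.hidden_states
    # Sum the last four hidden layers element-wise to obtain contextual token vectors
    token_vectors = torch.stack(hidden_states[-4:]).sum(dim=0)   # shape [1, seq_len, 768]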

Each of these hidden states has a dimension of [B x S x H] for the single sentences and [B x S x N x H] for the facts, where B is the minibatch size, S is the sequence length, N is the number of factual sentences, and H is the size of the BERT hidden state (768 in our case). As mentioned above, we perform element-wise summation over the last 4 hidden layers of BERT. We then use the obtained vectors as the input to the next step of the algorithm, which is the gated recurrent unit, for our BERT pre-trained models.

4.4 Model architecture

The complete model schema is shown in Figure 4.2. The pre-trained word embeddings obtained in the previous step go to the gated recurrent unit (Cho et al., 2014).

The GRU in our model plays the role of a sentence encoder. Models called "sentence encoding-based models" transform sequences of words into fixed-length vector representations. In this manner we obtain vector representations for the premise and its facts, applying the GRU separately to their embeddings. After that we take the last hidden states of the GRU output and concatenate them over the sentence-number dimension (1 for the premise and N for its facts). After concatenation we thus obtain a mixed representation of the premise and its facts as a single tensor of dimension [B x (N + 1) x 2H], where B is the minibatch size, N is the number of factual sentences (the +1 accounts for the premise), and H is the hidden size of the GRU, multiplied by 2 for the two directions. All of this goes to a max pooling layer with kernel size equal to N + 1 and stride equal to 1, i.e. from each hidden state of the representation we take the maximal number and keep it as the representative of this state in the final vector.

This vector is what we call the facts-augmented representation of the premise. In parallel, we perform the same steps for the hypothesis and its facts, so that we obtain a facts-augmented representation of the hypothesis. The two augmented representations are then concatenated over the hidden-state dimension to keep all the information accumulated in them. Finally, the obtained result is passed to a final linear layer, which returns an output of dimension [B x C], where B is the minibatch size and C is the number of classes (3 in our case). For the comparison and evaluation of the suggested model's progress, we implemented a simple SNLI model which has a structure similar to the proposed solution.
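A minimal PyTorch sketch of the fact-augmented encoder and classifier described above (ours, simplified; module names and sizes are illustrative, and dropout, padding handling, and pre-trained embedding loading are omitted):

    import torch
    import torch.nn as nn

    class FactAugmentedNLI(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, hidden=64, num_classes=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(4 * hidden, num_classes)  # two 2H representations concatenated

        def encode(self, sentence, facts):
            # sentence: [B, S]; facts: [B, N, S_f]
            _, h = self.encoder(self.embed(sentence))                     # h: [2, B, H]
            sent_vec = torch.cat([h[0], h[1]], dim=-1).unsqueeze(1)       # [B, 1, 2H]
            B, N, S_f = facts.shape
            _, hf = self.encoder(self.embed(facts.view(B * N, S_f)))      # encode every fact
            fact_vecs = torch.cat([hf[0], hf[1]], dim=-1).view(B, N, -1)  # [B, N, 2H]
            stacked = torch.cat([sent_vec, fact_vecs], dim=1)             # [B, N+1, 2H]
            return stacked.max(dim=1).values                              # max pooling over N+1 states

        def forward(self, premise, premise_facts, hypothesis, hypothesis_facts):
            p = self.encode(premise, premise_facts)        # facts-augmented premise representation
            h = self.encode(hypothesis, hypothesis_facts)  # facts-augmented hypothesis representation
            return self.classifier(torch.cat([p, h], dim=-1))   # [B, 3]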

The architecture of this simple model is shown in Figure 4.3. We omit all the steps related to the facts augmentation of the model. Therefore, we again obtain the sentence encodings for the premise and the hypothesis via the GRU and then concatenate them. The concatenation result goes to the classification linear layer, which outputs the 3-class prediction.

Figure 4.2: The proposed augmented SNLI model architecture.
Figure 4.3: Simple SNLI model for the non-augmented (no facts) models.

4.5 Training details

For all 6 experiments, cross-entropy (log loss) minimization was used as the learning objective. Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1. The cross-entropy loss increases when the predicted probability diverges from the actual label. A perfect model has a log loss equal to 0, and the model is weak otherwise. In binary classification, where the number of classes $M$ equals 2, the cross-entropy can be calculated as:

$$-\bigl(y \log(p) + (1 - y) \log(1 - p)\bigr)$$

If $M > 2$ (i.e. multi-class classification), we calculate a separate loss for each class label per observation and sum the result:

$$-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}),$$

where $M$ is the number of classes (entailment, contradiction, neutral), $\log$ is the natural logarithm, $y_{o,c}$ is a binary indicator (0 or 1) of whether class label $c$ is the correct classification for observation $o$, and $p_{o,c}$ is the predicted probability that observation $o$ belongs to class $c$.

The Adam (Kingma and Ba, 2014) optimization algorithm is used for updating the weights iteratively based on the training data. Adam differs from classical stochastic gradient descent.

Stochastic gradient descent maintains a single learning rate, which does not change during training, for all weight updates. In Adam, a learning rate is maintained for each network parameter (weight) and is separately adapted as learning unfolds. Adam combines the benefits of two other optimization algorithms, AdaGrad and RMSProp:

- The Adaptive Gradient Algorithm (AdaGrad) maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).
- Root Mean Square Propagation (RMSProp) also maintains per-parameter learning rates, which are adapted based on the average of the magnitudes of the recent gradients for the weight, i.e. how quickly it is changing.

This means the algorithm does well on noisy problems. On top of that, a learning rate scheduler is used for learning-rate adaptation. The idea of such a scheduler is to reduce the learning rate when a metric has stopped improving. It is common practice to reduce the learning rate once the learning of the model stagnates. The scheduler reads a metric (the validation loss) and, if no improvement (no decrease) is seen for a 'patience' number of epochs (we set the patience parameter to 3), the learning rate is reduced.
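A sketch of this optimizer and scheduler setup in PyTorch (the model, the loop, and the hyper-parameter values below are placeholders, not the tuned values from table 4.2):

    import torch

    model = torch.nn.Linear(10, 3)                       # placeholder model for illustration
    criterion = torch.nn.CrossEntropyLoss()              # log-loss objective over the 3 classes
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=3)

    for epoch in range(20):
        # ... training loop computing val_loss would go here ...
        val_loss = 1.0 / (epoch + 1)                     # dummy validation loss for the sketch
        scheduler.step(val_loss)                         # reduce lr if no improvement for 3 epochs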

The initial learning rate is tuned for each experimental model separately, as is the weight decay parameter of the Adam optimizer. We performed fine-tuning of the following hyper-parameters: the batch size, the dropout value (dropout), the GRU hidden size (hidden), the initial learning rate (lr) and weight decay (l2), the number of layers in the network (numlayers), and the type of word embedding (word_embedding): GloVe ('glove') or ConceptNet Numberbatch ('conceptnet_emb'). By experimenting with different sets of hyper-parameters, we found that a mini-batch size of 64 is optimal for all the models.

The rest of the parameters differ from one model to another. The full final set of parameters is given in table 4.2.

4.6 Evaluation metrics

To evaluate the quality of the suggested model, we use 2 classical metrics: classification accuracy (Wikipedia, 2019a) and the F1 measure (Wikipedia, 2019b). Additionally, as a sanity check, we perform an analysis of the misclassified examples from both the models with and without facts augmentation. The sklearn.metrics module, which implements several loss, score, and utility functions to measure classification performance, is used.

In the fields of science and engineering, accuracy is a description of systematic errors, a measure of statistical bias; low accuracy causes a difference between a result and the "true" value, which is called trueness. The accuracy_score function computes the accuracy, either as the fraction or as the count of correct predictions. We define the accuracy of our model as the fraction of correctly classified samples. Since we have 3 classes, in multi-label classification this function returns the subset accuracy.

Table 4.2: Summary of the fine-tuned parameters (dropout, hidden, lr, l2, numlayers, word_embedding) for each of the proposed models.

If all the predicted labels for a sample match the true labels, then the subset accuracy is 1, and 0 otherwise. If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the fraction of correct predictions over $n_{\text{samples}}$ is defined as:

$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i), \qquad (4.1)$$

where $1(x)$ is the indicator function. In the statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.

It considers both the precision $p$ and the recall $r$ of the test to compute the score: $p$ is the number of correct positive results (true positives) divided by the number of all positive results, and $r$ is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0, and precision and recall contribute to it equally.
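Both metrics are computed with sklearn.metrics; a minimal sketch with dummy labels (the F1 formula itself is given in equation 4.2 below):

    from sklearn.metrics import accuracy_score, f1_score

    y_true = ["entailment", "neutral", "contradiction", "entailment"]   # dummy gold labels
    y_pred = ["entailment", "neutral", "neutral", "entailment"]         # dummy predictions

    acc = accuracy_score(y_true, y_pred)                  # fraction of correctly classified samples
    f1 = f1_score(y_true, y_pred, average="weighted")     # per-class F1, weighted by class support
    print(acc, f1)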

The formula for the F1 score is:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (4.2)$$

In our multi-class case, this is the average of the F1 scores of each class, with weighting depending on the average parameter. We use the sklearn.metrics module, where the 'weighted' setting calculates the metric for each label and then finds their weighted average. In the sanity-check part, we are interested in the test samples which are correctly classified by the facts-augmented models but misclassified by the simple SNLI models, and vice versa.

Not only the quantity but also the quality of such examples matters. Since the augmented models are supposed to have more background knowledge, they are expected to be more resistant to paraphrasing, the use of synonyms and antonyms, and other similar language phenomena. On the other hand, the facts-augmented models may contain more noise in their data, which can be misleading during decision-making.

Performing this analysis, we calculate the number of pairs classified correctly and incorrectly by each of our 6 models. Additionally, we provide some textual examples of such samples. The results are described in more detail in the next chapter.

5 Results

This chapter summarizes the problem, approach, implementation, and evaluation.

It presents the principal results and discusses the expected impact and further research directions. We evaluate each of the 6 model variations to find out how many pairs of sentences were solved correctly and to gauge the overall performance of the proposed solutions.

To do so, we calculate the accuracy along with the F1 measure for each model. Table 5.1 shows the quantitative results of the experiments, and figures 5.1 and 5.2 visualize the accuracies of the models for the first several epochs of training. In this work, we focus on the difference in performance between the models augmented and not augmented with facts, which have similar (almost identical) architectures. Such a comparison gives insight into how prior knowledge augmentation can or cannot affect a model's quality, and how such an approach can help with solving a task.

We observe that our approach reaches competitive results while having a very simple architecture. We obtain 65.34% accuracy with our simple facts augmented model, which outperforms the non-augmented model (no facts) by 0.55%. The models using simple facts and joint facts pairs do not differ drastically in performance, since the joint facts are mostly an intersection of the simple facts plus some additional minor information. Depending on the type of additional facts it brings, this either gives a slight performance improvement or adds more noise to the model. In general, our best performing simple facts model seems to strike a good balance between the no facts model, which misclassifies examples containing incomplete or misleading information, and the facts pairs model, which is mostly over-noised.

The models pre-trained with BERT perform better than the models without BERT pre-training. However, pre-training with BERT significantly affects the training time, increasing it by almost 30 times, which makes the training process of the BERT models very costly. Both the facts-augmented and non-augmented models benefit from BERT pre-training by about +3%. Our best result with BERT reaches 68.03% (+2.69% absolute). Although the accuracy improves slightly, the gain is not proportional to the difference in training cost, so our model can be considered a cheaper alternative way of training this kind of model.

Note that the number of training examples in our dataset is more than three times smaller than in the original SNLI (570k in SNLI vs. 170k in our dataset), which prevents a proper comparison with its current baseline.

Furthermore, we compared the textual errors of 2 of our models (both pre-trained with BERT): the simple facts model and the basic SNLI model without augmentation (no facts). Overall, our best performing model (simple facts + BERT) correctly classifies 4k samples that are misclassified by the basic SNLI model (no facts + BERT).

We suppose that the main reason for such a result is the addition of the factual commonsense knowledge taken from ConceptNet.

Figure 5.1: Test accuracy of the 3 models without BERT pre-training, visualized with TensorboardX. Legend: facts pairs (blue), simple facts (orange), no facts (pink).

Figure 5.2: Test accuracy of the 3 BERT pre-trained models, visualized with TensorboardX. Legend: facts pairs + BERT (green), simple facts + BERT (orange), no facts + BERT (blue).

Model                   Parameters   Accuracy (%)   F1 measure (%)
facts pairs (no BERT)   1.6m         64.33          64.28
simple facts (no BERT)  1.6m         65.34          65.28
no facts (no BERT)      187.8k       64.79          64.68
facts pairs + BERT      110.5m       67.89          67.87
simple facts + BERT     110.5m       68.03          68.01
no facts + BERT         127.6m       67.77          67.76

Table 5.1: Summary of parameter numbers, accuracies and F1 measures for each of the 6 SNLI-based models: 2 models enriched with facts pairs (joint facts), 2 models enriched with independent (simple) facts, and 2 models without facts. Each pair of models consists of a setting with (bottom 3) and without (top 3) BERT pre-training.

premise (p): Some of the hair in his beard are turning grey.
facts for the premise: You are likely to find hair in someone’s head, wool is related to hair, hair is part of your head, Mammals have hair, hair can fall out, beard is related to hair, barba is a translation of beard, kumis is a translation of beard, barbe is a translation of beard, moustache is a translation of beard, squirrel is related to grey, smoke is related to grey, silver is related to grey, gris is a translation of grey
hypothesis (h): The man has grey hairs in his beard.
facts for the hypothesis: boy is related to man, man is related to person, man is related to boy, person is related to man, fellow is related to man, squirrel is related to grey, smoke is related to grey, silver is related to grey, gris is a translation of grey, beard is related to hair, barba is a translation of beard, kumis is a translation of beard, barbe is a translation of beard, moustache is a translation of beard
joint facts for both p and h: beard is related to hair, beard is a type of hair, beard is made of hair, man is related to beard
correct label: entailment
label given by simple SNLI (no facts): neutral

Table 5.2: A sample which was misclassified by the basic SNLI model (no facts), but correctly classified by our augmented SNLI (simple facts).

For example, we observed that relevant facts which explain the meaning of some terms or, at least, show the relatedness of one term to another help the algorithm make a correct decision. An example of such a sample, misclassified by the basic SNLI model without facts (no facts + BERT) and correctly classified by our augmented SNLI (simple facts + BERT), is given in Table 5.2. In this case, the retrieved facts make the relation between terms of the premise and the hypothesis explicit (for instance, that a beard is a type of hair), which supports the entailment decision. On the other hand, 3k samples were found to be incorrectly classified by our approach, while the basic SNLI (no facts) model did not make such mistakes. We figured out that in most such cases, the reason for the mistake is the presence, in the augmented dataset, of facts that are irrelevant to either sentence of a claim.

An example of such a sample is given in Table 5.3. We observe a lot of noisy information in this example. The facts related to the premise and to the hypothesis do not provide any useful information that could help make a correct classification decision, even for a human. Such facts are not necessarily wrong or misleading, but they create additional noise that is an obstacle for the classification model.
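As an illustration of how such noise might be reduced (a hypothetical heuristic, not something implemented in this project), one could keep a retrieved fact only if it shares content words with both the premise and the hypothesis:

# Hypothetical noise-reduction heuristic (not part of this project's pipeline):
# keep only those retrieved facts that mention a content word from the premise
# AND a content word from the hypothesis, discarding the rest.
import re

STOPWORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "you", "and"}

def content_words(sentence):
    """Lower-cased content words of a sentence, minus a tiny stop list."""
    return set(re.findall(r"[a-z]+", sentence.lower())) - STOPWORDS

def filter_facts(premise, hypothesis, facts):
    """Keep facts that share at least one content word with each sentence."""
    p_words, h_words = content_words(premise), content_words(hypothesis)
    kept = []
    for fact in facts:
        f_words = content_words(fact)
        if f_words & p_words and f_words & h_words:
            kept.append(fact)
    return kept

premise = "Six kids splash in water."
hypothesis = "Kids are in a pool."
facts = [
    "You are likely to find water in a pool",
    "steam is related to water",
    "rain is water",
]
print(filter_facts(premise, hypothesis, facts))
# -> ['You are likely to find water in a pool'] under this toy heuristic

Under such a heuristic, purely premise-side facts like “rain is water” would be dropped for the example in Table 5.3; a more principled retrieval strategy is discussed as a future research direction below.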

Although we encountered this problem, it can be considered a future research avenue for this approach. Namely, the way of retrieving the factual information can be improved so that only facts relevant to the particular sample (pair of sentences) are loaded, in order to reduce the noise; this can be further investigated.

premise (p): Six kids splash in water.
facts for the premise: children is related to kids, toy is related to kids, play is related to kids, Kids like to play with them, Kids like to play, You are likely to find a fish in water, river is related to water, steam is related to water, You are likely to find water in a lake, rain is water
hypothesis (h): Kids are in a pool.
facts for the hypothesis: You are likely to find water in a pool, a pool is used for swiming, a pool is for Swimming, You can use a pool to get out of the heat, swim is related to pool
joint facts for both p and h: You are likely to find water in a pool, a pool contains water, pool is related to water, water is related to pool
correct label: neutral
label given by our facts augmented model: entailment

Table 5.3: A sample which was misclassified by our facts-augmented SNLI model, but correctly classified by the basic SNLI model (no facts).

6 Conclusion

This chapter discusses lessons learned from the experiments and new problems that are raised.

It summarizes the problem, approach, implementation and evaluation described in the previous chapters, and overviews the principal results in abstract terms along with the expected impact and further research directions. The last section of this chapter explains how the project satisfies the evaluation criteria for a Master’s Research project.

6.1 The problem and proposed solution summary

This project is another attempt to solve the natural language inference problem by combining the original Stanford Natural Language Inference (SNLI) corpus with prior knowledge from ConceptNet, along with proposing a novel model architecture. The scope of this work is even broader, as it was inspired by the topic of fact checking and truth search, which is highly relevant today due to the fake news spreading on the Web. Indeed, the fact checking problem is a pair-wise classification task which can be naturally represented by a labeled dataset containing pairs of sentences.

Generally, this work combines two different problems, natural language inference and commonsense knowledge utilization in NLP models, along with their existing state-of-the-art solutions, and aims to propose a new, simplified, yet effective and universal solution. To build our model, we load the concepts related to the original SNLI premise and hypothesis, together with their relations, from the ConceptNet graph in the form of plain text, forming regular sentences. We call those sentences ’facts’.
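As a rough sketch of what such fact retrieval can look like (the project’s actual pipeline may differ), the public ConceptNet 5 web API exposes for many edges a human-readable surfaceText field that already reads like a plain-text sentence; the helper below is illustrative only.

# Illustrative sketch of retrieving plain-text "facts" for a term from the
# public ConceptNet 5 web API. Each returned edge may carry a readable
# 'surfaceText'; edges without one are skipped.
import requests

def fetch_facts(term, limit=10, lang="en"):
    """Return up to `limit` surface-text facts for a term."""
    url = f"http://api.conceptnet.io/c/{lang}/{term.lower()}"
    edges = requests.get(url, params={"limit": limit}).json().get("edges", [])
    facts = []
    for edge in edges:
        surface = edge.get("surfaceText")
        if surface:
            # surfaceText contains markup like "[[beard]] is related to [[hair]]".
            facts.append(surface.replace("[[", "").replace("]]", ""))
    return facts

if __name__ == "__main__":
    for fact in fetch_facts("beard", limit=5):
        print(fact)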

The facts are then fed to the model as an additional source of information for the classification algorithm. We built 6 different variations of the proposed architecture, fine-tuning each of them separately, and then performed experiments to evaluate their performance and quality. Our best performing model is referred to as ’simple facts’. It gives 65.34% accuracy, which is 0.55% more than the non-augmented model (no facts) obtains. Additionally, we experiment with Bidirectional Encoder Representations from Transformers (BERT) as a way of model pre-training.

BERT improves our best result by 2.69%, but its training is extremely expensive for both business and the environment. Therefore, we prefer to focus on the search for an alternative, cheaper way of training such models, and our approach seems promising. The difference between the models with and without facts augmentation is the essential criterion in evaluating the proposed solution, since both types of models are fairly simple. The most important outcome is that the models augmented with facts perform better than the SNLI models without fact augmentation. This indicates that embedding prior knowledge into a classification model may help to improve its performance. Although the obtained results are quite interesting, the current work appears to be the first step of a larger research effort, given the variety of potential improvements discovered.

6.2 Further research directions

While working on this project, we discovered several ways in which the model can be improved. Firstly, more data can help to improve the performance. As stated before, the original SNLI dataset contains 570k samples (pairs of sentences + label), while our augmented dataset has 170k samples (pairs of sentences + corresponding facts + label), which is more than 3 times smaller in terms of the number of samples (but not in terms of the amount of data).

Extracting facts for more data samples could therefore significantly improve the performance, but this is a time-consuming process which goes beyond the current research project. Additionally, multiple data sources can be mixed to serve this goal. Since we used only facts loaded from ConceptNet, we were limited in vocabulary, in the variability of the factual data, and in its domain specificity. Mixing different sources such as ConceptNet, WordNet, DBpedia, etc. can provide more flexibility to the model and improve its robustness. In the longer term, we plan to employ data from a variety of sources, not necessarily structured knowledge bases, but any selected data from the Web.

Secondly, the architectures of the proposed models were intentionally simplified in order to investigate the impact of factual augmentation on model performance.

To achieve better accuracy, more complex architectures can be built and evaluated, for example by adding more layers to the network, attention units, interactions, normalization, etc. The choice of pre-training can also affect the model’s overall scores.

For example, (Radford et al., 2018) demonstrate that large gains on NLU tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text (unsupervised), followed by discriminative fine-tuning on each specific task, e.g. on NLI.

Lastly, many state-of-the-art approaches suggest using multi-task training.

They combine different natural language understanding tasks, train a model on the resulting larger corpus, and afterwards add task-specific layers with aggregations and additional tuning when needed. For instance, the MT-DNN model (Liu et al., 2019) uses this approach. It includes four shared layers (lexicon encoder, token embedding, transformer encoder, and context embedding); on top of those 4 layers, 2 task-specific layers are added to enable the model to solve a particular task.

Therefore, there is still plenty of room for research in the stated direction. This project is a decent starting point for deeper research.

This work has shown that embedding commonsense knowledge into the dataset is a promising mechanism which helps the machine to solve a language understanding task, namely NLI.

6.3 Project evaluation criteria analysis

First of all, this section provides the list of methods and tools I (as a student) have learned during this research project. Each point of this list was explained earlier in this work in terms of its practical application to the project.

To summarize, I specify the following list of competences:

• Theoretical and practical knowledge in neural networks and deep learning;

• Natural Language Processing techniques;

• Various NLP problems, e.g. NLI, Question Answering (QA);

• Natural Language Inference and Question Answering datasets;

• Python as a programming tool for ML projects;

• PyTorch and other libraries useful for data science;

• TensorboardX for building visualizations;

• Version control systems such as Git for group projects;

• LaTeX for scientific writing;

• Scientific writing and reporting skills;

• Presentation skills.

At the end of this master’s project, I achieved significant results in developing my deep machine learning, self-study, research, and analytical skills. I am now able to:

1. Search for and study relevant literature and gain in-depth knowledge in a machine learning topic on my own;

2. Conduct research in the field of data science and report on it in a manner that meets the standards of the discipline;

3. Work together in a team on a research or an applied development project;

4. Communicate conclusions both in writing and orally to various audiences in English;

5. Enroll in a Ph.D. programme in informatics or begin a career as a professional data scientist.

This project introduces results of sufficient novelty and quality.

It proposes an original use of complex pre-existing concepts such as neural networks, graph knowledge bases, pre-trained word representations, etc. The report is structured to provide clear navigation from section to section. All the examples and architectural solutions are well illustrated. All references to external resources (scientific literature, Internet resources, etc.) can be found in the bibliography section. This thesis demonstrates a fully mastered scientific approach along with the achieved results, from theory and the state of the art to original examples.

The code can be provided on demand.

A Appendix

We attach additional visualizations in this section.

Figure A.1: The training loss breakdown for the 3 models without BERT: simple facts (light blue), no facts (blue), facts pairs (pink) – generated by TensorboardX.

Figure A.2: The training loss breakdown for the 3 BERT pre-trained models: simple facts (pink), facts pairs (blue), no facts (green) – generated by TensorboardX.

Bibliography

Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. A neural knowledge language model. CoRR, abs/1608.00318, 2016. URL http://arxiv.org/abs/1608.00318.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.

Learning phrase representations using RNN encoder–decoder for statistical machine translation.

Overview of the TAC 2011 knowledge base population track. 01 2011.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information.