MV
Dec 3, 2020

How Language discovered itself — ML techniques in NLP

Way back, when neural nets became common, the neural network language model appeared, built on basic feed-forward networks. These failed, however, due to the curse of dimensionality. The challenge with language is that it is rarely repetitive, and the same information can be portrayed in myriad ways; structure and syntax matter. Traditional neural nets are hard to use here because they cannot learn long-distance dependencies spanning multiple words. To work around this, n-gram word sequences began to be analysed.

The n-gram model follows the math below.

The probability of a sequence of words can be obtained from the probability of each word given the context of words preceding it, using the chain rule of probability:

P(w1,w2,…,wt−1,wt)=P(w1)P(w2|w1)P(w3|w1,w2)…P(wt|w1,w2,…wt−1).

Probabilistic language models approximate P(wt|w1,w2,…wt−1) using a fixed context of size n−1, i.e. using P(wt|wt−n+1,…wt−1), as in n-grams.
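To make the counting concrete, here is a minimal sketch of a bigram (n = 2) model estimated from a toy corpus; the corpus and function names are illustrative, not from the original text:

```python
from collections import defaultdict

# Toy corpus; a real model would be estimated from a large text collection.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# Bigram counts: P(w_t | w_{t-1}) ~ count(w_{t-1}, w_t) / count(w_{t-1})
bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        bigram_counts[prev][curr] += 1
        context_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][curr] / context_counts[prev]

def sentence_prob(sentence):
    """Chain-rule product, with each factor truncated to an n-1 = 1 word context."""
    p = 1.0
    for prev, curr in zip(sentence, sentence[1:]):
        p *= bigram_prob(prev, curr)
    return p

print(bigram_prob("the", "cat"))                               # 0.25
print(sentence_prob(["the", "cat", "sat", "on", "the", "mat"]))  # 0.0625
```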

However, traditional n-grams did not solve the problem of having deep memory or retaining a summary of the underlying context. This led to a family of neural networks called recurrent neural networks (RNNs), an architecture that can remember things over time.

RNN

The RNN learns a representation of context that summarizes the past word sequence in a way that preserves information predictive of the future. This learned summarization would keep higher-level abstract summaries of more remote text, and a more detailed summary of very recent words.

Compared with a plain feed-forward neural net, the recurrent connections between the hidden units allow the RNN to carry over a memory from one time step to the next.

We can think of an RNN as a dynamical system with one (or more) sets of hidden units which feed into themselves, so the network's graph has self-loops. We train the RNN by UNROLLING this graph, explicitly representing the units at all time steps. The weights and biases are shared between all time steps, although some RNN variants handle the weights and biases differently.

In an RNN language model, each word is represented as an indicator (one-hot) vector. The model predicts a distribution over the next word, and we can train it with cross-entropy loss, using the usual forward and backpropagation algorithms. This model can learn long-distance dependencies. When we generate from the model (i.e. sample from its distribution over sentences), the outputs feed back into the network as inputs.
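A toy numpy sketch of such an RNN language model (sizes and variable names are illustrative assumptions): a one-hot indicator vector feeds in, shared weights update the hidden state, and a softmax over the vocabulary gives the next-word distribution scored with cross-entropy:

```python
import numpy as np

vocab_size, hidden_size = 10, 16  # toy sizes, chosen for illustration
rng = np.random.default_rng(0)

# Shared weights across all time steps (this is the "unrolled" view).
W_xh = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden -> hidden (the self-loop)
W_hy = rng.normal(0, 0.1, (vocab_size, hidden_size))   # hidden -> output

def one_hot(i):
    v = np.zeros(vocab_size)
    v[i] = 1.0
    return v

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(word_ids):
    """Unrolled forward pass: returns a next-word distribution at each step."""
    h = np.zeros(hidden_size)
    dists = []
    for i in word_ids:
        x = one_hot(i)
        h = np.tanh(W_xh @ x + W_hh @ h)   # hidden units feed into themselves
        dists.append(softmax(W_hy @ h))    # distribution over the next word
    return dists

def loss(word_ids):
    """Cross-entropy of each observed next word under the model."""
    dists = forward(word_ids[:-1])
    return -sum(np.log(d[t]) for d, t in zip(dists, word_ids[1:]))

print(loss([1, 4, 2, 7]))
```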

Teacher forcing is an effective strategy for training recurrent neural networks: the ground-truth (expected) output from the prior time step, y(t−1), is fed as the input x(t) at the next time step, rather than the model's own prediction.
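A short sketch of the difference between teacher forcing and free-running generation; `model.step` and `model.init_hidden` are hypothetical stand-ins for any one-step RNN interface:

```python
import numpy as np

def train_step_teacher_forcing(model, sentence):
    """Feed the ground-truth previous word y_{t-1} as the input x_t."""
    h, total_loss = model.init_hidden(), 0.0
    for prev_word, target_word in zip(sentence[:-1], sentence[1:]):
        dist, h = model.step(prev_word, h)      # input is the true y_{t-1}
        total_loss += -np.log(dist[target_word])
    return total_loss

def generate(model, start_word, length):
    """At generation time the model's own output feeds back in as the next input."""
    h, word, out = model.init_hidden(), start_word, [start_word]
    for _ in range(length):
        dist, h = model.step(word, h)
        word = int(np.argmax(dist))             # or sample from dist
        out.append(word)
    return out
```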

A practical issue when training RNNs is vanishing or exploding gradients. For example, say a typical sentence length is 20 words. This means there can be a gap of 20 time steps between the time the RNN sees a relevant word and when it needs that word. The derivatives need to travel over this entire pathway, leading to signal attenuation or explosion (see the example below).

vanishing or exploding gradients

There are some ways to work around this —

  1. by gradient clipping (clip gradients to a certain range; the gradients become biased, but they don't blow up) — see the sketch after this list
  2. ReLU activation function. The hidden units are a kind of memory, so their default behavior should be to keep their previous value, meaning the function at each time step should stay close to the identity function. A nonlinear activation moves the value away from identity. Using a ReLU activation and initialising the recurrent weight matrices to the identity keeps the update close to an identity mapping (for positive values), so the signal is preserved rather than attenuated or amplified.
  3. reverse the input sequence. This way, there’s only one time step between the first word of the input and the first word of the output. The network can first learn short-term dependencies between early words in the sentence, and then long-term dependencies between later words.
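A minimal sketch of the first workaround, clipping by the global gradient norm (one common variant; the exact clipping scheme is a design choice):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their combined norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: a huge gradient gets scaled back so its global norm is 5.
grads = [np.array([100.0, -200.0]), np.array([[3.0, 4.0]])]
clipped = clip_gradients(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ~5.0
```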

But these are workarounds; we're better off redesigning the architecture, since the exploding/vanishing problem is a conceptual problem with plain RNNs. Enter LSTMs.

LSTM

The LSTM is based on the idea that a network's activations are its short-term memory and its weights are its long-term memory. The LSTM architecture retains short-term memory for a long time. Think of it as memory cells with controllers saying when to store or forget information. The LSTM replaces each single unit in an RNN with a memory block. The data feeding into the LSTM gates are the input at the current time step and the hidden state of the previous time step. They are processed by three fully-connected layers with a sigmoid activation function to compute the values of the input, forget, and output gates. As a result, the values of the three gates lie in the range (0, 1).

LSTM

LSTMs are explicitly designed to avoid the long-term dependency problem. There are two key elements —

The “cell state”, which works like a conveyor belt: it runs straight down the entire chain, making it easy for information to flow along unchanged.

“Gates”, which control what information is added to the cell state or thrown away from it.

The forget gate decides what to forget — Take the example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The input gate decides what new information we want to store in the cell state. First, a sigmoid layer decides which values we will update. Next, a tanh layer creates a vector of new candidate values that could be added to the state.

The update step drops the information about the old subject's gender and adds the new information, as decided in the previous steps.
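Putting the pieces together, a single LSTM step could look like this numpy sketch; the weight names and toy dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM memory-block update; params holds a (W, b) pair per gate."""
    z = np.concatenate([x, h_prev])  # current input + previous hidden state

    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate: what to drop from the cell state
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate: which new values to write
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate: what to expose as the hidden state
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate values

    c = f * c_prev + i * c_tilde     # update the "conveyor belt" cell state
    h = o * np.tanh(c)               # new short-term (hidden) state
    return h, c

# Toy dimensions for illustration.
x_dim, h_dim = 4, 3
rng = np.random.default_rng(1)
params = {}
for gate in ("f", "i", "o", "c"):
    params[f"W_{gate}"] = rng.normal(0, 0.1, (h_dim, x_dim + h_dim))
    params[f"b_{gate}"] = np.zeros(h_dim)

h, c = lstm_step(rng.normal(size=x_dim), np.zeros(h_dim), np.zeros(h_dim), params)
print(h.shape, c.shape)
```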

But LSTM’s do have an intrinsic problem. The hidden memory unit is difficult to control, and is computationally very expensive to calculate four different states. What is we could compress the LSTM and make it simpler, and provide more control over the entire network states without the hidden transmission layer? That led to the GRU.

GRU

GRU — an improvement over the LSTM

The GRU is similar to the LSTM but has only two gates and fewer parameters.

The “update gate” determines how much of the previous memory to keep.

The “reset gate” determines how to combine the new input with the previous memory.
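For comparison, a GRU step in the same style (again a sketch with illustrative names; sign conventions for the update gate vary between write-ups):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU update: only two gates and no separate cell state."""
    z_in = np.concatenate([x, h_prev])

    u = sigmoid(params["W_u"] @ z_in + params["b_u"])  # update gate: how much old memory to keep
    r = sigmoid(params["W_r"] @ z_in + params["b_r"])  # reset gate: how to mix new input with old memory

    z_cand = np.concatenate([x, r * h_prev])
    h_tilde = np.tanh(params["W_h"] @ z_cand + params["b_h"])  # candidate state

    return (1 - u) * h_prev + u * h_tilde  # blend old state and candidate state

# Toy usage, mirroring the LSTM sketch above.
x_dim, h_dim = 4, 3
rng = np.random.default_rng(2)
params = {f"W_{g}": rng.normal(0, 0.1, (h_dim, x_dim + h_dim)) for g in ("u", "r", "h")}
params.update({f"b_{g}": np.zeros(h_dim) for g in ("u", "r", "h")})
print(gru_step(rng.normal(size=x_dim), np.zeros(h_dim), params))
```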

Comparison of the LSTM and GRU gating mechanisms using logic gates

Sequence-to-sequence

Algorithmically, there was a lot of progress, but this did not translate to better language processing because there was a problem with the training. Training usually happens in minibatches — while in image processing, we can provide the same number and size of images, in language modeling, examples don’t have the same length.

Also, where translations were needed between languages, the length of the sentence in two languages could be very different in characters and in words.

To solve this, the encoder-decoder architecture was utilised. The network first reads and memorizes the sentence. When it sees the end token, it starts outputting the translation. The encoder and decoder are two different networks with different weights.

The encoder models the input/source sentence F = (f1, …, f|F|). The decoder hidden state is initialized with the last hidden state of the encoder. The decoder models the output/target sentence E = (e1, …, e|E|).

The encoder-decoder architecture can handle inputs and outputs that are both variable-length sequences, and is thus suitable for sequence transduction problems such as machine translation. The encoder takes a variable-length sequence as input and transforms it into a state with a fixed shape. The decoder maps the encoded state of fixed shape to a variable-length sequence.

Encoder-decoder functional form
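As a sketch of the shape of this computation, with `encoder_step` and `decoder_step` as hypothetical stand-ins for any recurrent cell (RNN, LSTM, or GRU):

```python
def encode(source_ids, encoder_step, h0):
    """Read the whole variable-length source; return the fixed-shape final state."""
    h = h0
    for word in source_ids:
        h = encoder_step(word, h)
    return h                              # everything the decoder will see about the source

def decode(context, decoder_step, start_token, end_token, max_len=50):
    """Initialize from the encoder's final state and emit words until the end token."""
    h, word, output = context, start_token, []
    for _ in range(max_len):
        word, h = decoder_step(word, h)   # each output feeds back in as the next input
        if word == end_token:
            break
        output.append(word)
    return output
```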

The encoder-decoder architecture solved for varying sentence lengths. But some sentences can be really long. Can we really store all of their information in a single vector of hidden units? This fixed-size bottleneck was identified as a source of inefficiency for long sentences.

The attention mechanism helps address this issue — this is an example of incorporating inductive bias in model architecture by letting the decoder refer to the input sentence.

Attention mechanism

The attention mechanism was introduced in the translation model from the classic paper: Bahdanau et al., Neural machine translation by jointly learning to align and translate, ICLR 2015. The basic idea is that each output word comes from one word, or a handful of words, of the input. We can learn to attend only to the relevant words by encoding each word of the source sentence into a vector and computing a set of “attention weights”; a weighted (linear) combination of these vectors maps each output word back to the input words, and the model thereby learns which words were relevant.

The model below has both an encoder and a decoder. The encoder takes the form of a bidirectional RNN: information earlier or later in the sentence can help disambiguate a word, so we need both directions. The encoder computes an annotation of each word in the input. The decoder receives a context vector c(t) at each time step, which is computed by attending to the inputs.

Attention mechanism

The context vector is computed as a weighted average of the encoder’s annotations.

The attention weights are computed as a softmax, where the inputs depend on the annotation and the decoder's state.

The attention function depends on the annotation vector, rather than the position in the sentence, leading to content-based addressing.
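In code, one content-based attention step might look like the numpy sketch below; a simple dot-product score is used here for brevity, whereas Bahdanau et al. score each annotation against the decoder state with a small feed-forward network:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(annotations, decoder_state):
    """annotations: (num_source_words, dim); decoder_state: (dim,)."""
    scores = annotations @ decoder_state   # one score per source-word annotation
    weights = softmax(scores)              # attention weights sum to 1
    context = weights @ annotations        # context = weighted average of annotations
    return context, weights

# Toy example: 5 source words, 8-dimensional annotations.
rng = np.random.default_rng(3)
annotations = rng.normal(size=(5, 8))
context, weights = attend(annotations, rng.normal(size=8))
print(weights.round(3), context.shape)
```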
