Recurrent Neural Networks: Part 1

Working with Text Data

Laxfed Paulacy

--

Third Section in a Series of Python Deep Learning Posts.

Previous Sections

Additionally, you can check out the series of posts on Apache Spark

This series covers:

  • Preprocessing text data into useful representations
  • Working with recurrent neural networks
  • Using 1D convnets for sequence processing

This section explores deep-learning models that can process text (understood as sequences of words or sequences of characters), timeseries, and sequence data in general. The two fundamental deep-learning algorithms for sequence processing are recurrent neural networks and 1D convnets, the one-dimensional version of the 2D convnets covered in the previous section.

We’ll discuss both of these approaches in this section.

Applications of these algorithms include the following:

  • Document classification and timeseries classification, such as identifying the topic of an article or the author of a book
  • Timeseries comparisons, such as estimating how closely related two documents or two stock tickers are
  • Sequence-to-sequence learning, such as decoding an English sentence into French
  • Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or negative
  • Timeseries forecasting, such as predicting the future weather at a certain location, given recent weather data

This section’s examples focus on two narrow tasks: temperature forecasting and sentiment analysis on the IMDB dataset, a task approached in a previous section.

But the techniques demonstrated for these two tasks are relevant to all the applications just listed, and many more.

Working with Text Data

Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a sequence of words, but it’s most common to work at the level of words. The deep-learning sequence-processing models introduced in the following sections can use text to produce a basic form of natural-language understanding, sufficient for applications including document classification, sentiment analysis, author identification, and even question-answering (QA) (in a constrained context). Of course, keep in mind throughout this section that none of these deep-learning models truly understand text in a human sense; rather, these models can map the statistical structure of written language, which is sufficient to solve many simple textual tasks. Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.

Like all other neural networks, deep-learning models don’t take as input raw text: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. This can be done in multiple ways:

  • Segment text into words, and transform each word into a vector.
  • Segment text into characters, and transform each character into a vector.
  • Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.

Collectively, the different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into deep neural networks. There are multiple ways to associate a vector with a token. In this section, I’ll present two major ones: one-hot encoding of tokens, and token embedding (typically used exclusively for words, and then called word embedding). The remainder of this section explains these techniques and shows how to use them to go from raw text to a Numpy tensor that you can send to a Keras network.
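To make the three tokenization schemes concrete, here is a minimal illustrative sketch (not part of the original listings) that breaks a toy sentence into word tokens, character tokens, and word bigrams:

sample = 'The cat sat on the mat.'

words = sample.split()
# ['The', 'cat', 'sat', 'on', 'the', 'mat.']

characters = list(sample)
# ['T', 'h', 'e', ' ', 'c', 'a', 't', ...]

bigrams = [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]
# ['The cat', 'cat sat', 'sat on', 'on the', 'the mat.']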

One-Hot Encoding of Words and Characters

One-hot encoding is the most common, most basic way to turn a token into a vector. It consists of associating a unique integer index with every word and then turning this integer index i into a binary vector of size N (the size of the vocabulary); the vector is all zeros except for the ith entry, which is 1.

Of course, one-hot encoding can be done at the character level as well. To drive home what one-hot encoding is and how to implement it, below are two toy examples: one at the word level, the other at the character level.

Word-level one-hot encoding:

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # Tokenize the samples via split(); in real life, you'd also strip punctuation.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word (index 0 isn't used).
            token_index[word] = len(token_index) + 1

# Only consider the first max_length words in each sample.
max_length = 10

# One-hot tensor of shape (samples, max_length, max index + 1).
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
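A quick sanity check (illustrative, not part of the original listing): with these two toy samples, the index holds 10 unique tokens numbered 1 through 10, so the one-hot tensor has 11 columns along its last axis.

print(token_index)    # {'The': 1, 'cat': 2, ..., 'homework.': 10}
print(results.shape)  # (2, 10, 11)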

Character-level one-hot encoding:

import string
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable    # All printable ASCII characters
# Map each character to a unique index, starting at 1.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

Note that Keras has built-in utilities for doing one-hot encoding of text at the word level or character level, starting from raw text data. You should use these utilities, because they take care of a number of important features such as stripping special characters from strings and only taking into account the N most common words in your dataset (a common restriction, to avoid dealing with very large input vector spaces).

Using Keras for word-level one-hot encoding:

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000)  # Only consider the 1,000 most common words.
tokenizer.fit_on_texts(samples)        # Builds the word index.
sequences = tokenizer.texts_to_sequences(samples)  # Strings -> lists of integer indices.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')  # Direct one-hot matrix.
word_index = tokenizer.word_index      # Recover the computed word index.
print('Found %s unique tokens.' % len(word_index))
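To see what the Tokenizer produced, a short illustrative inspection (not in the original listing; the exact integer indices depend on word frequencies in your data) could look like this:

print(sequences[0])           # e.g. [1, 2, 3, 4, 1, 5] -- 'the' gets a single shared index
print(one_hot_results.shape)  # (2, 1000): one row per sample, num_words columns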

A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the number of unique tokens in your vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary, you can hash words into vectors of fixed size. This is typically done with a very lightweight hashing function. The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can generate token vectors right away, before you’ve seen all of the available data). The one drawback of this approach is that it’s susceptible to hash collisions: two different words may end up with the same hash, and subsequently any machine-learning model looking at these hashes won’t be able to tell the difference between these words. The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.

Word-level one-hot encoding with hashing trick:

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Store the words as vectors of size 1,000. If you have close to 1,000 words (or more),
# you'll see many hash collisions, which will reduce the accuracy of this encoding.
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a "random" integer index between 0 and dimensionality - 1.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
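As a quick illustration of the collision issue described above (this check is not part of the original listing), you can count how many distinct words end up sharing a hash bucket; with this toy vocabulary you'll almost certainly get zero, but the count grows as the vocabulary size approaches the dimensionality. Note also that Python's built-in hash() for strings is randomized across interpreter runs unless PYTHONHASHSEED is set, so a real pipeline would use a stable hashing function.

# Illustrative collision check: group the unique words by their hash bucket.
vocab = {word for sample in samples for word in sample.split()}
buckets = {}
for word in vocab:
    buckets.setdefault(abs(hash(word)) % dimensionality, []).append(word)
n_collided = sum(len(ws) for ws in buckets.values() if len(ws) > 1)
print('%d of %d words share a bucket with another word' % (n_collided, len(vocab)))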
