NLP From Scratch: Using char-RNN to classify surnames
In this post we will build and train a basic character-level RNN to classify words. This tutorial, along with the two that follow, shows how to preprocess data for NLP modeling “from scratch”, without many of the convenience functions of torchtext, so you can see how the data required for NLP modeling is preprocessed under the hood.
A character-level RNN reads a word as a sequence of characters, outputting a prediction and a “hidden state” at each step and feeding its previous hidden state into the next step. We take the final prediction as the output, i.e. which category the word belongs to.
Specifically, we will train on thousands of surnames from 18 languages and predict which language the name belongs to based on how it is spelled, as in the following example:
$ python predict.py Hinton
(-0.47) Scottish
(-1.52) English
(-3.57) Irish

$ python predict.py Schmidhuber
(-0.19) German
(-2.48) Czech
(-2.68) Dutch
Recommended reading:
This tutorial assumes you have at least PyTorch installed and know Python and Tensors:
- https://pytorch.org/ for installation instructions
- Deep Learning with PyTorch: A 60 Minute Blitz, to get started with PyTorch quickly
- Learning PyTorch with Examples, for a broader and deeper understanding of PyTorch
- PyTorch for Former Torch Users, if you are a former Lua Torch user
It is also important to understand RNNs and how they work:
- The Unreasonable Effectiveness of Recurrent Neural Networks shows many real-life examples
- Understanding LSTM Networks is mostly about LSTMs, but also about RNNs in general
Prepare data
Note: Download the data from here, and extract it to the current directory.
The data/names directory contains 18 text files named “[Language].txt”. Each file contains one surname per line, mostly in Roman characters (but we still need to convert from Unicode to ASCII just to be cautious).
After processing we get a dictionary of lists of surnames per language, {language: [names ...]}. The generic variables category and line (language and surname respectively in this example) will be used later.
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os
import unicodedata
import string

def findFiles(path):
    return glob.glob(path)

print(findFiles('data/names/*.txt'))

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)
Out:
['data/names/French.txt', 'data/names/Czech.txt', 'data/names/Dutch.txt', 'data/names/Polish.txt', 'data/names/Scottish.txt', 'data/names/Chinese.txt', 'data/names/English.txt', 'data/names/Italian.txt', 'data/names/Portuguese.txt', 'data/names/Japanese.txt', 'data/names/German.txt', 'data/names/Russian.txt', 'data/names/Korean.txt', 'data/names/Arabic.txt', 'data/names/Greek.txt', 'data/names/Vietnamese.txt', 'data/names/Spanish.txt', 'data/names/Irish.txt']
Slusarski
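To see why the conversion works, it helps to look at what NFD normalization does to a single character. The sketch below is a standalone illustration using only the standard library; `unicode_to_ascii` is a renamed copy of the function above, and the sample strings are written with escapes so the snippet does not depend on the file encoding. It also shows a caveat: a letter with no decomposition, such as "Ł", is not mapped to "L" but dropped.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"

def unicode_to_ascii(s):
    """Same logic as unicodeToAscii above: decompose, drop accents, keep known letters."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn" and c in all_letters
    )

# "\u00e0" is "a with grave accent"; NFD splits it into "a" plus a combining mark
decomposed = unicodedata.normalize("NFD", "\u00e0")
categories = [unicodedata.category(c) for c in decomposed]

# "\u015alus\u00e0rski" is the accented spelling of "Slusarski"
stripped = unicode_to_ascii("\u015alus\u00e0rski")

# "\u0141" (L with stroke) has no decomposition, so the filter removes it entirely
no_decomposition = unicode_to_ascii("\u0141ukasz")

print(categories, stripped, no_decomposition)
```

If losing such letters matters for your data, one option is an explicit replacement table applied before normalization; that is a design choice outside this tutorial.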
After the above processing, we have category_lines, a dictionary that maps each category (language) to a list of lines (surnames). We also keep all_categories (the list of languages) and n_categories (the number of languages) for later use.
print(category_lines['Italian'][:5])
Out:
['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']
Convert names to tensors
Now that we have all the surnames, we need to convert them into tensors.
To represent a single letter, we use a “one-hot vector” of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. "b" = <0 1 0 0 0 ...>.
To represent a word (surname), we stack the one-hot vectors of its letters into a tensor of size <line_length x 1 x n_letters>.
The extra 1 dimension is there because PyTorch assumes everything is in batches; we are just using a batch size of 1 here.
import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))
print(lineToTensor('Jones').size())
Out:
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0 ., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0. , 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0. , 0., 0., 0., 0.]]) torch. Size([5, 1, 57])
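The position of the 1 in the printed tensor follows directly from the ordering of all_letters. The sketch below reproduces the encoding with plain Python lists instead of tensors, purely to make the indexing visible; `one_hot` and `line_to_rows` are hypothetical names, not part of the tutorial code. One deliberate difference: it uses `str.index`, which raises `ValueError` for an unknown character, whereas `str.find` returns -1 and would silently set the last position (the apostrophe).

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def one_hot(letter):
    """Return a list of n_letters zeros with a single 1.0 at the letter's index."""
    vector = [0.0] * n_letters
    vector[all_letters.index(letter)] = 1.0  # raises ValueError for unknown letters
    return vector

def line_to_rows(line):
    """One one-hot row per character, mirroring lineToTensor without the batch dimension."""
    return [one_hot(letter) for letter in line]

# Lowercase letters occupy indices 0-25, so uppercase "J" lands at 26 + 9
j_index = all_letters.index("J")
rows = line_to_rows("Jones")
print(n_letters, j_index, len(rows))
```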
Create a network
Before autograd, creating a recurrent neural network in Torch required cloning the parameters of a layer over multiple time steps. The hidden states and gradients that the layers used to hold are now handled entirely by the computation graph, so the programmer does not need to manage them. As a result, you can build a recurrent neural network in PyTorch just like an ordinary feed-forward network.
This RNN module (mainly copied from the PyTorch for Torch users tutorial) is just 2 linear layers that operate on the concatenated input and hidden state, with a LogSoftmax layer applied to the output.
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)
To run a single step of this network, we need to pass an input (in this case, the tensor for the current letter) and the hidden state from the previous step (initialized to zeros at first). The network returns the output (the log-probability of each language) and the next hidden state (which we keep for the next step).
input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)
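To make the data flow of one step explicit, here is a minimal sketch of the same computation in plain Python, with no PyTorch involved. The sizes, the random weights, and every function name in it are illustrative assumptions; only the structure (concatenate input and hidden state, apply two linear maps, take a log-softmax of the output) follows the RNN class above.

```python
import math
import random

def linear(weights, bias, vec):
    """Compute weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def log_softmax(scores):
    """Numerically stable log-softmax over a flat list of scores."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

def rnn_step(x, hidden, params):
    """One step: both linear layers read the concatenated input and hidden state."""
    combined = x + hidden  # list concatenation, like torch.cat along dim 1
    new_hidden = linear(params["i2h_w"], params["i2h_b"], combined)
    output = log_softmax(linear(params["i2o_w"], params["i2o_b"], combined))
    return output, new_hidden

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumed for illustration): 4 "letters", 3 hidden units, 2 categories
rng = random.Random(0)
n_in, n_hid, n_out = 4, 3, 2
params = {
    "i2h_w": random_matrix(n_hid, n_in + n_hid, rng),
    "i2h_b": [0.0] * n_hid,
    "i2o_w": random_matrix(n_out, n_in + n_hid, rng),
    "i2o_b": [0.0] * n_out,
}

# Feed two one-hot "letters", carrying the hidden state from step to step
hidden = [0.0] * n_hid
for x in ([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]):
    output, hidden = rnn_step(x, hidden, params)

print(output, hidden)
```

Only the output after the last letter is used for classification; the intermediate outputs are discarded.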
For efficiency, we don’t want to create a new Tensor for each step, so instead of calling letterToTensor multiple times, we will use lineToTensor once and slice the result. This can be further optimized by precomputing batches of tensors.
input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)
Out:
tensor([[-2.9504, -2.8402, -2.9195, -2.9136, -2.9799, -2.8207, -2.8258, -2.8399,
         -2.9098, -2.8815, -2.8313, -2.8628, -3.0440, -2.8689, -2.9391, -2.8381,
         -2.9202, -2.8717]], grad_fn=<LogSoftmaxBackward>)
As shown above, the output is a <1 x n_categories> tensor, where each item is the log-probability of that category (higher values are more likely).
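Because the last layer is a LogSoftmax, exponentiating the printed values should recover ordinary probabilities that sum to roughly 1. The sketch below checks this in plain Python, using the 18 values copied from the output above; it is a sanity check, not part of the tutorial code.

```python
import math

# Values copied from the printed output above (rounded to 4 decimals)
log_probs = [
    -2.9504, -2.8402, -2.9195, -2.9136, -2.9799, -2.8207, -2.8258, -2.8399,
    -2.9098, -2.8815, -2.8313, -2.8628, -3.0440, -2.8689, -2.9391, -2.8381,
    -2.9202, -2.8717,
]

# exp() turns each log-probability back into a probability
probs = [math.exp(v) for v in log_probs]
total = sum(probs)

# The most likely category is the one with the largest log-probability
best_index = max(range(len(log_probs)), key=lambda i: log_probs[i])

print(round(total, 3), best_index)
```

Every probability is close to 1/18, which is what you would expect from an untrained network choosing among 18 languages.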
Training
Preparing for training
Before training, we need to set up a few helper functions. The first interprets the output of the network, which is a log-probability for each category, and returns the most probable category. We can use Tensor.topk to get the index of the maximum value:
def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))
Out:
('Chinese', 5)
We’ll also need a function that quickly fetches training examples (surnames and their languages):
import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)
Out:
category = Italian / line = Pastore
category = Arabic / line = Toma
category = Irish / line = Tracey
category = Portuguese / line = Lobo
category = Arabic / line = Sleiman
category = Polish / line = Sokolsky
category = English / line = Farr
category = Polish / line = Winogrodzki
category = Russian / line = Adoratsky
category = Dutch / line = Robert
Training the network
Now, all it takes to train the network is to show it lots of examples, let it make a guess, and tell it when the guess is wrong.
For the loss function, nn.NLLLoss is appropriate, because the last layer of the RNN is nn.LogSoftmax.
criterion = nn.NLLLoss()
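For a single example, the negative log-likelihood is simply the negated log-probability that the network assigned to the correct category. The sketch below shows that arithmetic in plain Python; `nll` is a hypothetical helper written for illustration and does not call PyTorch.

```python
import math

def nll(log_probs, target_index):
    """Negative log-likelihood of the target class for one example."""
    return -log_probs[target_index]

n_categories = 18

# A network that has learned nothing spreads probability evenly over all languages
uniform = [math.log(1.0 / n_categories)] * n_categories
baseline_loss = nll(uniform, 3)

# A network that puts 90% of the probability on category 0
confident = [math.log(0.9)] + [math.log(0.1 / 17)] * 17
loss_when_right = nll(confident, 0)
loss_when_wrong = nll(confident, 1)

print(baseline_loss, loss_when_right, loss_when_wrong)
```

The uniform baseline equals ln(18), about 2.89, which gives a useful reference point: a reported loss near that value means the guess was no better than chance.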
Each training loop will:
- Create input and target tensors
- Create a zeroed initial hidden state
- Read in each letter
- Keep the hidden state for the next letter
- Compare the final output to the target
- Back-propagate
- Return the output and loss
learning_rate = 0.005  # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Subtract each parameter's gradient, scaled by the learning rate, from its value
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()
Now we just need to run the network with a lot of examples. Since the train function returns both the output and the loss, we can print its guesses and keep track of the loss for plotting. Since there are thousands of examples, we print only every print_every examples, and every plot_every examples we save the average loss over that period.
import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0
Out:
5000 5% (0m 12s) 3.1806 Olguin / Irish ✗ (Spanish)
10000 10% (0m 21s) 2.1254 Dubnov / Russian ✓
15000 15% (0m 29s) 3.1001 Quirke / Polish ✗ (Irish)
20000 20% (0m 38s) 0.9191 Jiang / Chinese ✓
25000 25% (0m 46s) 2.3233 Marti / Italian ✗ (Spanish)
30000 30% (0m 54s) nan Amari / Russian ✗ (Arabic)
35000 35% (1m 3s) nan Gudojnik / Russian ✓
40000 40% (1m 11s) nan Finn / Russian ✗ (Irish)
45000 45% (1m 20s) nan Napoliello / Russian ✗ (Italian)
50000 50% (1m 28s) nan Clark / Russian ✗ (Irish)
55000 55% (1m 37s) nan Roijakker / Russian ✗ (Dutch)
60000 60% (1m 46s) nan Kalb / Russian ✗ (Arabic)
65000 65% (1m 54s) nan Hanania / Russian ✗ (Arabic)
70000 70% (2m 3s) nan Theofilopoulos / Russian ✗ (Greek)
75000 75% (2m 11s) nan Pakulski / Russian ✗ (Polish)
80000 80% (2m 20s) nan Thistlethwaite / Russian ✗ (English)
85000 85% (2m 29s) nan Shadid / Russian ✗ (Arabic)
90000 90% (2m 37s) nan Finnegan / Russian ✗ (Irish)
95000 95% (2m 46s) nan Brannon / Russian ✗ (Irish)
100000 100% (2m 54s) nan Gomulka / Russian ✗ (Polish)
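In the log above the loss is reported as nan from iteration 30000 onward, and every guess after that point is the same category. This matches the comment on learning_rate warning that training can explode. A minimal sketch of one way to catch this early is shown below; `check_loss` is a hypothetical helper, not part of the tutorial code, and the toy loss list reuses the first two values from the log followed by a nan.

```python
import math

def check_loss(loss, iteration):
    """Raise as soon as the loss is nan or infinite, instead of training on silently."""
    if not math.isfinite(loss):
        raise FloatingPointError(
            "loss became %r at iteration %d; try a lower learning_rate" % (loss, iteration)
        )
    return loss

# Toy sequence of per-iteration losses (illustrative, not a real training run)
losses = [3.1806, 2.1254, float("nan")]

stopped_at = None
for step, value in enumerate(losses, start=1):
    try:
        check_loss(value, step)
    except FloatingPointError:
        stopped_at = step
        break

print(stopped_at)
```

In the training loop, the call would go right after train returns, for example check_loss(loss, iter), so that a diverged run stops at the first bad iteration rather than after all n_iters.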
Plotting the results
Plotting the historical loss from all_losses shows how well the network is learning:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)
Evaluating the results
To see how well the network performs on different categories, we will create a confusion matrix, with rows corresponding to the actual language and columns corresponding to the language the network guesses. To calculate the confusion matrix, a batch of samples is run through the trained network with evaluate(), which is the same as train() but without the backpropagation.
# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()
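The row normalization step turns raw counts into per-language rates, so that each row sums to 1 regardless of how many samples of that language were drawn. The sketch below shows the same idea on a small matrix of plain Python lists; `normalize_rows` and the counts are made up for illustration. It also guards against a row with no samples, which in the tensor version would divide zero by zero.

```python
def normalize_rows(matrix):
    """Divide each row by its sum; rows with no samples are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            normalized.append([0.0] * len(row))
        else:
            normalized.append([count / total for count in row])
    return normalized

# Rows are the actual category, columns the guessed category (toy counts)
counts = [
    [8, 2, 0],  # category 0: guessed correctly 8 times out of 10
    [1, 9, 0],  # category 1: guessed correctly 9 times out of 10
    [0, 0, 0],  # category 2: never sampled
]
rates = normalize_rows(counts)
print(rates)
```

With 10000 random samples spread over 18 languages an empty row is very unlikely, so the guard mostly matters when n_confusion is small.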
Bright spots off the main diagonal show which languages the network tends to guess incorrectly, such as Chinese for Korean and Spanish for Italian. The network performs well on Greek and poorly on English (probably because of more overlap with other languages).
Run on user input
def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')
Out:
> Dovesky
(nan) Russian
(nan) Arabic
(nan) Korean

> Jackson
(nan) Russian
(nan) Arabic
(nan) Korean

> Satoshi
(nan) Russian
(nan) Arabic
(nan) Korean
The script in the actual PyTorch repository splits the above code into several files:
- data.py (loads files)
- model.py (defines the RNN)
- train.py (runs training)
- predict.py (runs predict() with command line arguments)
- server.py (serves predictions as a JSON API with bottle.py)
Run train.py to train and save the network.
Run predict.py with a surname as input to see the predictions:
$ python predict.py Hazaki
(-0.42) Japanese
(-1.39) Polish
(-3.51) Czech
Run server.py and visit http://localhost:5533/yourname to get the predicted output in JSON format.
Exercises
- Try a different dataset of line -> category, for example:
  - Any word -> language
  - First name -> gender
  - Character name -> writer
  - Page title -> blog or subreddit
- Get better results with a larger and/or better shaped network:
  - Add more linear layers
  - Try the nn.LSTM and nn.GRU layers
  - Combine multiple of these RNNs into a higher level network