NLP From Scratch: Classifying Surnames with a Character-Level RNN

In this post we will build and train a basic character-level RNN to classify words. This tutorial, together with the two that follow it, shows how to preprocess data for NLP modeling "from scratch": by coding without many of the convenience functions of torchtext, you will see how the data required for NLP modeling is preprocessed under the hood.

A character-level RNN reads a word as a sequence of characters, producing a prediction and a "hidden state" at each step and feeding the previous hidden state into the next step. We take the final prediction as the output, i.e. which category the word belongs to.
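
To make this concrete, here is a minimal sketch of that loop (it assumes the rnn, lineToTensor, and initHidden definitions that appear later in this tutorial):

hidden = rnn.initHidden()
line_tensor = lineToTensor('Hinton')
# Feed one character at a time, carrying the hidden state forward to the next step
for i in range(line_tensor.size(0)):
    output, hidden = rnn(line_tensor[i], hidden)
# output now holds the log-likelihood of each language category for the whole name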

Specifically, we will train on thousands of surnames from 18 languages and predict which language the name belongs to based on how it is spelled, as in the following example:

$ python predict.py Hinton
(-0.47) Scottish
(-1.52) English
(-3.57) Irish

$ python predict.py Schmidhuber
(-0.19) German
(-2.48) Czech
(-2.68) Dutch

Recommended reading:

This tutorial assumes you have at least PyTorch installed and know Python and Tensors:

  • https://pytorch.org/ for installation instructions
  • Deep Learning with PyTorch: A 60 Minute Blitz to get started with PyTorch quickly
  • Learning PyTorch with Examples for a broader and deeper overview of PyTorch
  • PyTorch for Former Torch Users if you were a former Lua Torch user

It is also important to understand RNNs and how they work:

  • The Unreasonable Effectiveness of Recurrent Neural Networks shows a number of real-life examples
  • Understanding LSTM Networks is mostly about LSTMs, but also about RNNs in general

Prepare data

note

  • Download the data from here, and extract it to the current directory.

The data/names directory contains 18 text files named "[Language].txt". Each file contains many surnames, one per line, mostly in romanized form (but we still need to convert from Unicode to ASCII to be safe).

After processing we get a dictionary mapping each language to a list of surnames, {language: [names ...]}. The generic variable names category and line (here, a language and a surname respectively) are used for later extensibility.

from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

Out:

['data/names/French.txt', 'data/names/Czech.txt', 'data/names/Dutch.txt', 'data/names/Polish.txt', 'data/names/Scottish.txt', 'data/names/Chinese.txt', 'data/names/English.txt', 'data/names/Italian.txt', 'data/names/Portuguese.txt', 'data/names/Japanese.txt', 'data/names/German.txt', 'data/names/Russian.txt', 'data/names/Korean.txt', 'data/names/Arabic.txt', 'data/names/Greek.txt', 'data/names/Vietnamese.txt', 'data/names/Spanish.txt', 'data/names/Irish.txt']
Slusarski

After the above processing, we have the variable category_lines, a dictionary mapping each category (language) to a list of lines (surnames). We also keep all_categories (the list of languages) and n_categories (the number of languages) for later use.

print(category_lines['Italian'][:5])

Out:

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']

Convert names to tensors

After getting all last names, we need to convert them into tensors.

To represent a single letter, we use a “one-hot vector” of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. "b" = <0 1 0 0 0 ...>.

To represent a word (surname), we stack the one-hot vectors of its letters into a 2D matrix of shape <line_length x 1 x n_letters>.

The extra 1 dimension is because PyTorch assumes everything is in batches – we have a batch size of 1 here.

import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))

print(lineToTensor('Jones').size())

Out:

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
torch.Size([5, 1, 57])

Create a network

Creating a recurrent neural network in Torch used to require cloning the parameters of a layer over multiple time steps. Now the hidden states and gradients at different time steps are handled entirely by the computation graph itself, so the programmer does not need to manage them. You can therefore build a recurrent neural network just as easily as an ordinary feed-forward network.

The RNN module below (mostly copied from the PyTorch for Torch Users tutorial) is just two linear layers that operate on the input and hidden state concatenated together, with a LogSoftmax layer applied to the output.

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

To single-step through this network, we need to pass the input (in this case, a tensor for the current letter) and the hidden state from the previous step (initialized to zero first). This network will return the output (the probability for each language) and the next hidden state (which we keep for the next step).

input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)

For efficiency, we don’t want to create a new Tensor for each step, so instead of using letterToTensor multiple times, we will use lineToTensor to generate vectors and then slice them. This can be further optimized by precomputing a batch of tensors.

input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)

Out:

tensor([[-2.9504, -2.8402, -2.9195, -2.9136, -2.9799, -2.8207, -2.8258, -2.8399,
         -2.9098, -2.8815, -2.8313, -2.8628, -3.0440, -2.8689, -2.9391, -2.8381,
         -2.9202, -2.8717]], grad_fn=<LogSoftmaxBackward>)

As above, the output is a <1 x n_categories> tensor, where each item corresponds to the likelihood of that category (higher values are more likely).
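
Since the network ends with LogSoftmax, these values are log-probabilities; as a quick sanity check (a small aside using the output tensor from the snippet above), exponentiating them should give probabilities that sum to 1:

probs = torch.exp(output)   # convert log-probabilities back to ordinary probabilities
print(probs.sum())          # should be (approximately) 1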

Training

Preparation for training

Before training, we need a few helper functions. The first interprets the network's output (a likelihood for each category) and returns the most likely category. We can use Tensor.topk to get the index of the greatest value:

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))

Out:

('Chinese', 5)

We’ll also need a function that quickly fetches training examples (surnames and their languages):

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

Out:

category = Italian / line = Pastore
category = Arabic / line = Toma
category = Irish / line = Tracey
category = Portuguese / line = Lobo
category = Arabic / line = Sleiman
category = Polish / line = Sokolsky
category = English / line = Farr
category = Polish / line = Winogrodzki
category = Russian / line = Adoratsky
category = Dutch / line = Robert

Training the network

Now, all you need to do to train the network is feed it lots of examples, let the network make a guess, and tell it whether the guess was wrong.

nn.NLLLoss is an appropriate loss function here, since the last layer of the RNN is nn.LogSoftmax.

criterion = nn.NLLLoss()
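
As an aside (not part of the original script), nn.NLLLoss on log-softmax output is equivalent to nn.CrossEntropyLoss on the raw, unnormalized scores; this small sketch illustrates the relationship:

scores = torch.randn(1, n_categories)   # raw scores for one example
target = torch.tensor([3])              # an arbitrary category index
log_probs = nn.LogSoftmax(dim=1)(scores)
print(nn.NLLLoss()(log_probs, target))        # same value as...
print(nn.CrossEntropyLoss()(scores, target))  # ...this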

Each training loop will:

  • Create input and target tensors
  • Create zeroed initial hidden state
  • Read each letter and
    • Keep the hidden state for the next letter
  • Compare the final output to the target tensor
  • Back-propagate
  • Return the output and loss

learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()
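
The loop above updates the parameters by hand (plain SGD). An equivalent formulation, sketched here as an alternative rather than as part of the original script, lets torch.optim apply the update:

import torch.optim as optim

# Hypothetical variant of train() that uses an optimizer for the update
optimizer = optim.SGD(rnn.parameters(), lr=learning_rate)

def trainWithOptimizer(category_tensor, line_tensor):
    hidden = rnn.initHidden()
    optimizer.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()
    optimizer.step()   # performs p -= learning_rate * p.grad for every parameter

    return output, loss.item()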

Now we just need to run the network with a lot of examples. Since the train function returns both the output and the loss, we can print its guesses and keep track of the loss for plotting. Since there are thousands of examples, we print only every print_every examples, taking the average of the loss over that period and saving it.

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

Out:

5000 5% (0m 12s) 3.1806 Olguin / Irish ✗ (Spanish)
10000 10% (0m 21s) 2.1254 Dubnov / Russian ✓
15000 15% (0m 29s) 3.1001 Quirke / Polish ✗ (Irish)
20000 20% (0m 38s) 0.9191 Jiang / Chinese ✓
25000 25% (0m 46s) 2.3233 Marti / Italian ✗ (Spanish)
30000 30% (0m 54s) nan Amari / Russian ✗ (Arabic)
35000 35% (1m 3s) nan Gudojnik / Russian ✓
40000 40% (1m 11s) nan Finn / Russian ✗ (Irish)
45000 45% (1m 20s) nan Napoliello / Russian ✗ (Italian)
50000 50% (1m 28s) nan Clark / Russian ✗ (Irish)
55000 55% (1m 37s) nan Roijakker / Russian ✗ (Dutch)
60000 60% (1m 46s) nan Kalb / Russian ✗ (Arabic)
65000 65% (1m 54s) nan Hanania / Russian ✗ (Arabic)
70000 70% (2m 3s) nan Theofilopoulos / Russian ✗ (Greek)
75000 75% (2m 11s) nan Pakulski / Russian ✗ (Polish)
80000 80% (2m 20s) nan Thistlethwaite / Russian ✗ (English)
85000 85% (2m 29s) nan Shadid / Russian ✗ (Arabic)
90000 90% (2m 37s) nan Finnegan / Russian ✗ (Irish)
95000 95% (2m 46s) nan Brannon / Russian ✗ (Irish)
100000 100% (2m 54s) nan Gomulka / Russian ✗ (Polish)
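
The nan losses in the second half of this run suggest the loss blew up partway through training (the learning_rate comment above warns about exactly this). One common mitigation, offered here only as a suggestion and not as part of the original script, is to lower the learning rate or to clip the gradients before the manual parameter update in train():

# Inside train(), after loss.backward() and before the parameter update:
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)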

Plotting the results

Plotting the changes in losses from all_losses shows how well the network is learning:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

Evaluating the results

To see how well the network performs on different categories, we create a confusion matrix whose rows are the actual languages and whose columns are the languages the network guesses. To compute it, we run a batch of samples through the trained network with evaluate(), which is the same as train() minus the backpropagation.

# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()

Bright spots off the main diagonal show which languages the network confuses with each other, such as Chinese and Korean, or Spanish and Italian. It seems to do very well on Greek and poorly on English (perhaps because of overlap with other languages).

Run on user input

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')

Out:

> Dovesky
(nan) Russian
(nan) Arabic
(nan) Korean

> Jackson
(nan) Russian
(nan) Arabic
(nan) Korean

> Satoshi
(nan) Russian
(nan) Arabic
(nan) Korean

The scripts in the actual PyTorch example repository split the above code into several files:

  • data.py (loads files)
  • model.py (defines the RNN)
  • train.py (runs training)
  • predict.py (runs predict() with command line arguments)
  • server.py (serves the predictions as a JSON API with bottle.py)

Run train.py to train and save the network.

Run predict.py with the last name as input to see the predictions:

$ python predict.py Hazaki
(-0.42) Japanese
(-1.39) Polish
(-3.51) Czech

Run server.py and visit http://localhost:5533/yourname to get the predicted output in JSON format.
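
For reference, a minimal sketch of what such a server might look like with bottle.py (hypothetical; the actual server.py in the repository may differ):

from bottle import route, run
import json

@route('/<name>')
def serve_prediction(name):
    # Reuse evaluate() and lineToTensor() from above to score the requested name
    with torch.no_grad():
        output = evaluate(lineToTensor(name))
    topv, topi = output.topk(3, 1, True)
    predictions = [[topv[0][i].item(), all_categories[topi[0][i].item()]]
                   for i in range(3)]
    return json.dumps(predictions)

run(host='localhost', port=5533)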

Exercises

  • Try with a different dataset of line -> category, for example:
    • Any word -> language
    • First name -> gender
    • Character name -> writer
    • Page title -> blog or subreddit
  • Get better results with a bigger and/or better shaped network:
    • Add more linear layers
    • Try the nn.LSTM and nn.GRU layers (see the sketch after this list)
    • Combine multiple of these RNNs into a higher level network
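
For the nn.LSTM exercise, one possible starting point (a sketch that assumes you keep the <line_length x 1 x n_letters> input format from this tutorial; not a reference solution) is:

import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMClassifier, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, line_tensor):
        # nn.LSTM consumes the whole <line_length x 1 x n_letters> sequence at once,
        # so no manual Python loop over characters is needed
        _, (h_n, _) = self.lstm(line_tensor)
        return self.softmax(self.h2o(h_n[0]))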