Linear regression on Boston housing prices: why the loss becomes NaN, and using scatter plots to find the relationship between features and labels

Boston house price csv file

%matplotlib inline
import random
import torch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Get the data set from CSV

# Load data, the first line is a useless line, skip it directly
boston = pd.read_csv('../data/boston_house_prices.csv',skiprows=[0])
# There are 14 columns in total, the first thirteen columns are features, and the last column is price

Take the last column and set it to labels, and all the previous columns to features

# Use the last column (MEDV) as labels and the first thirteen columns as features.
# pop() removes MEDV from the DataFrame in place, leaving boston with the 13 feature columns.
labels = boston.pop('MEDV')
features = boston

Draw a scatter plot of each feature against the house price. A roughly linear pattern suggests that the feature is correlated with the label. Keep the features that are related to the labels as the final features.

# Scatter plot of each feature against the house price
data_xTitle = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# 5 rows x 3 columns = 15 subplots; only the first 13 are used
fig, a = plt.subplots(5, 3)
m = 0
for i in range(0, 5):
    if i == 4:
        # Last row: only the 13th feature is left
        a[i][0].scatter(features[data_xTitle[m]], labels, s=30, edgecolor='white')
    else:
        for j in range(0, 3):
            a[i][j].scatter(features[data_xTitle[m]], labels, s=30, edgecolor='white')
            m = m + 1
# The figure shows that CRIM, RM and LSTAT have a roughly linear relationship with y.

# CRIM, RM and LSTAT have a roughly linear relationship with y, so keep only these three columns as the final features.
features = features[['LSTAT','CRIM','RM']]
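
Reading a linear trend off a scatter plot is subjective, so a numeric check can help. The sketch below computes the Pearson correlation of each column with the label using pandas' `corrwith`. It runs on synthetic stand-in data, not the Boston CSV; the names `demo_features` and `demo_labels` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Boston CSV): one feature that drives the
# label linearly and one that is pure noise.
rng = np.random.default_rng(0)
n = 500
demo_features = pd.DataFrame({
    "linear_feature": rng.normal(size=n),
    "noise_feature": rng.normal(size=n),
})
demo_labels = 3.0 * demo_features["linear_feature"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of every feature column with the label.
# Values near +1 or -1 indicate a strong linear relationship, values near 0 a weak one.
corr = demo_features.corrwith(demo_labels)
print(corr)
```

On the real data the same call would presumably be `features.corrwith(labels)`, run before the columns are narrowed down. Keep in mind that a low correlation only rules out a linear relationship, not every relationship.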

Convert data format to tensor

features = torch.tensor(np.array(features)).to(torch.float32)
labels = torch.tensor(np.array(labels)).to(torch.float32)
features.shape, labels.shape

(torch.Size([506, 3]), torch.Size([506]))

Define linear regression, loss function, optimization function

# Define the linear regression model
def linreg(X, w, b):
    return torch.matmul(X, w) + b
# Define the loss function
def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
# Define the optimization function
def sgd(params, lr, batch_size):
    '''Mini-batch stochastic gradient descent'''
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            # Reset the gradient so it does not accumulate across batches
            param.grad.zero_()
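
As a sanity check on the hand-written optimizer, the sketch below compares one step of an `sgd` function like the one above (summed loss, gradient divided by `batch_size`) with one step of PyTorch's built-in `torch.optim.SGD` on the mean loss. The toy batch is made up, and this version of `sgd` is assumed to reset the gradients after each step.

```python
import torch

def sgd(params, lr, batch_size):
    """Mini-batch SGD step on a summed loss, with gradient reset (assumed)."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

# Toy batch (made-up numbers): 4 examples, 2 features.
X = torch.tensor([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]])
y = torch.tensor([1.0, 0.0, 2.0, -1.0])

# Path 1: hand-written step on the summed loss.
w1 = torch.zeros(2, 1, requires_grad=True)
b1 = torch.zeros(1, requires_grad=True)
l = (torch.matmul(X, w1) + b1 - y.reshape(-1, 1)) ** 2 / 2
l.sum().backward()
sgd([w1, b1], lr=0.1, batch_size=len(X))

# Path 2: built-in optimizer on the mean loss.
w2 = torch.zeros(2, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w2, b2], lr=0.1)
mean_loss = ((torch.matmul(X, w2) + b2 - y.reshape(-1, 1)) ** 2 / 2).mean()
mean_loss.backward()
optimizer.step()

# Both paths should land on the same parameters.
same_step = torch.allclose(w1, w2) and torch.allclose(b1, b2)
print(same_step)
```

Dividing the summed gradient by `batch_size` is equivalent to taking the gradient of the mean loss, which is why the learning rate keeps the same meaning across batch sizes.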

data_iter function, fetch data by batch

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # Shuffle the indices so the samples are read randomly, in no specific order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]
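
A quick usage sketch for a batch iterator of this shape, run on a tiny made-up tensor. It assumes the indices are shuffled as the comment intends, and it checks that the last batch is simply shorter when the number of examples is not a multiple of `batch_size`.

```python
import random
import torch

def data_iter(batch_size, features, labels):
    """Yield shuffled mini-batches; the final batch may be smaller."""
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

# Made-up data: 7 examples with 2 features each; the label doubles as an example id.
demo_features = torch.arange(14, dtype=torch.float32).reshape(7, 2)
demo_labels = torch.arange(7, dtype=torch.float32)

batch_sizes = [len(y) for _, y in data_iter(3, demo_features, demo_labels)]
seen = sorted(int(v) for _, y in data_iter(3, demo_features, demo_labels) for v in y)

print(batch_sizes)  # [3, 3, 1]
print(seen)         # every example appears exactly once per pass
```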

Set parameters

w = torch.normal(0, 0.01, size=(features.shape[1],1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.03
# lr = 0.0001
num_epochs = 100
net = linreg
loss = squared_loss
batch_size = 10

The shape of w is:
torch.Size([3, 1])

Start training

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        # Mini-batch loss for X and y
        l = loss(net(X, w, b), y)
        # l has shape (batch_size, 1), not a scalar, so sum all its elements
        # and backpropagate to get the gradients with respect to [w, b]
        l.sum().backward()
        # Update parameters using their gradients
        sgd([w, b], lr, batch_size)
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')

When the learning rate of the model is set to 0.03, the loss becomes NaN from the first epoch

epoch 1, loss nan
epoch 2, loss nan
epoch 3, loss nan
epoch 50, loss nan
epoch 51, loss nan
epoch 52, loss nan
epoch 100, loss nan

When the learning rate of the model is set to 0.0001, the loss is normal and the model begins to converge

epoch 1, loss 141.555878
epoch 2, loss 115.449852
epoch 50, loss 15.606522
epoch 51, loss 15.546185
epoch 100, loss 14.999675


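Why is the loss NaN when the learning rate is 0.03?

The step is too large for this model's loss function. The features are fed in unnormalized, so the gradients are large, and with a learning rate of 0.03 each update overshoots the minimum by more than the previous error. The parameters then grow with every step until the loss overflows to inf and finally NaN. With the learning rate reduced to 0.0001, the updates stay small enough that the loss decreases as the epochs increase and the model converges.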
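
The overshooting can be made concrete with a small NumPy sketch. It is a simplification under stated assumptions: it uses full-batch gradient descent instead of the mini-batch loop above, and synthetic features on made-up, unnormalized scales instead of the Boston data. For a squared loss, gradient descent is stable only when the learning rate is below 2 / lambda_max, where lambda_max is the largest eigenvalue of the Hessian X^T X / n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 506
# Synthetic features on unequal, unnormalized scales (made-up; NOT the Boston data).
X = np.column_stack([
    rng.uniform(2, 38, n),
    rng.exponential(4, n),
    rng.normal(6.3, 0.7, n),
    np.ones(n),  # bias column
])
true_w = np.array([-0.6, -0.1, 5.0, 2.0])
y = X @ true_w + rng.normal(0, 1.0, n)

def run_gd(lr, steps=200):
    """Full-batch gradient descent on mean((Xw - y)^2 / 2); returns the final loss."""
    w = np.zeros(X.shape[1])
    # Silence overflow warnings: the divergent run is expected to blow up.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            w -= lr * X.T @ (X @ w - y) / n
        return float(np.mean((X @ w - y) ** 2 / 2))

# Largest eigenvalue of the Hessian sets the stability limit for the step size.
lambda_max = np.linalg.eigvalsh(X.T @ X / n).max()
limit = 2 / lambda_max
initial_loss = float(np.mean(y ** 2 / 2))
loss_large_lr = run_gd(0.03)
loss_small_lr = run_gd(0.0001)

print(f"stability limit 2 / lambda_max = {limit:.5f}")
print(f"initial loss = {initial_loss:.3f}")
print(f"lr=0.03   -> final loss {loss_large_lr}")
print(f"lr=0.0001 -> final loss {loss_small_lr:.3f}")
```

In this sketch 0.03 lies above the limit and the loss stops being finite, while 0.0001 lies below it and the loss falls. Standardizing the features would shrink lambda_max and allow a larger learning rate, although that is a different setup from the one used in this article.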