Linear Regression – Linear Neural Network


Note:
This article consists of the author's deep learning study notes, drawing on the following two open-source deep learning resources:

  • Deep Learning (the "Flower Book")
    https://github.com/exacity/deeplearningbook-chinese
  • Dive into Deep Learning (Li Mu)
    https://zh-v2.d2l.ai/

Linear regression

Linear regression is a commonly used statistical method for studying the linear relationship between one or more independent variables and a dependent variable. It has a wide range of applications, for example:

  • In economics, linear regression can be used to estimate the demand function, production function, consumption function, etc., and analyze the impact of various factors on economic growth, inflation, unemployment rate, etc.
  • In the social sciences, linear regression can be used to explore the relationship between variables such as education level, income level, health status, and political leaning, as well as to evaluate policy effects, social welfare, and population changes.
  • In natural science, linear regression can be used to establish mathematical models of physical phenomena, chemical reactions, biological processes, etc., as well as predict future development trends, optimize experimental design, and test hypotheses.

For example, to model income data with linear regression and predict a group's average consumption level from the number of respondents (in units of 10,000) and income level (in US dollars), it is usually necessary to collect a real data set containing the income level, group size, and so on. This data set is called a training data set, each row of data is called a sample, and the quantity to be predicted is called a label.

Usually n is used to denote the number of samples in the data set. For the sample with index i, its input is

x^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top

and the corresponding label is y^{(i)}.

Linear model

  • Usually a linear model is written as follows:

    \hat{y} = w_1 x_1 + \dots + w_d x_d + b.

    Here \hat{y} is the predicted value, x is the feature vector, w_i is a weight, and b is the offset (also called the bias or intercept).

  • Or equivalently, in vector form (see the code sketch after this list):

    \hat{y} = \mathbf{w}^\top \mathbf{x} + b.
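As a minimal sketch of this vectorized prediction (the tensor names and values below are illustrative assumptions, not from the text):

import torch

# Hypothetical data: n = 4 samples with d = 2 features each
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])  # design matrix, shape (n, d)
w = torch.tensor([2.0, -3.4])   # weight vector, shape (d,)
b = 4.2                         # offset (scalar)

# y_hat = Xw + b, computed for all samples at once
y_hat = torch.matmul(X, w) + b  # shape (n,)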

Loss function

When fitting a model, a loss function is usually defined to quantify the gap between the target and the predicted value. The loss is usually a non-negative number, with smaller values indicating a smaller gap. For linear regression, the loss function is defined as follows:

l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2.

The loss above is for a single prediction. For a training set of n samples, the total loss is the average of the individual losses:

L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.
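A small sketch of these two losses in code (the function name squared_loss and the example numbers are illustrative assumptions):

import torch

def squared_loss(y_hat, y):
    """Per-sample loss l^{(i)} = (y_hat - y)^2 / 2."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

y_hat = torch.tensor([2.5, 0.0])
y = torch.tensor([3.0, -0.5])
per_sample = squared_loss(y_hat, y)  # tensor([0.1250, 0.1250])
total = per_sample.mean()            # L(w, b): the average over n samples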

Analytic solution

The solution of linear regression can be expressed in closed form by a simple formula; this kind of solution is called an analytic solution:

\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.
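A sketch of this analytic solution on synthetic data (the names and values are illustrative; in practice a least-squares solver is numerically preferable to forming the inverse explicitly):

import torch

# Synthetic data: y = X w_true + b_true + noise
n, d = 100, 2
X = torch.randn(n, d)
w_true = torch.tensor([2.0, -3.4])
b_true = 4.2
y = X @ w_true + b_true + 0.01 * torch.randn(n)

# Absorb the offset b into w by appending a column of ones to X
X1 = torch.cat([X, torch.ones(n, 1)], dim=1)

# w* = (X^T X)^{-1} X^T y
w_star = torch.linalg.inv(X1.T @ X1) @ X1.T @ y
print(w_star)  # approximately [2.0, -3.4, 4.2]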

Stochastic Gradient Descent

Usually, models are trained with a method called gradient descent, which applies to almost all deep learning models.

  • The specific method is to compute the derivative of the loss function with respect to the model parameters (called the gradient). Executed naively this is slow when the sample size is large, because every parameter update must traverse the entire data set of n samples. Instead, a small batch of samples can be selected at random on each update; this is called minibatch stochastic gradient descent.

  • In each iteration, a small batch of data \mathcal{B} is randomly drawn. The derivative of the average loss on the minibatch with respect to the model parameters is then computed, and the gradient is multiplied by a predetermined positive number \eta, called the learning rate. The following formula is the update process:

    (\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b).

  • Therefore, to summarize the above, the algorithm proceeds as follows (a full training-loop sketch appears at the end of this section):

    1. Initialize the model parameters, for example randomly;
    2. Repeatedly draw a small random batch of samples from the data set and update the parameters in the direction of the negative gradient.

For the squared loss and affine transformation, we can explicitly write it as follows:

\begin{aligned}
\mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \\
b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
\end{aligned}

When training reaches a predetermined number of iterations, or stops once certain conditions are met, the parameter estimates produced by the model are recorded, denoted \hat{\mathbf{w}} and \hat{b}.
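Putting the pieces together, here is a minimal minibatch SGD sketch for linear regression (synthetic data; the names lr, batch_size, and num_epochs are illustrative choices, not from the text):

import torch

# Synthetic data: y = X w_true + b_true + noise
n, d = 1000, 2
X = torch.randn(n, d)
w_true = torch.tensor([2.0, -3.4])
b_true = 4.2
y = X @ w_true + b_true + 0.01 * torch.randn(n)

# Step 1: initialize the model parameters
w = torch.zeros(d, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Step 2: iterate over random minibatches, following the negative gradient
lr, batch_size, num_epochs = 0.03, 10, 3
for epoch in range(num_epochs):
    for _ in range(n // batch_size):
        idx = torch.randint(0, n, (batch_size,))  # random minibatch B
        loss = ((X[idx] @ w + b - y[idx]) ** 2 / 2).mean()
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
            b -= lr * b.grad
            w.grad.zero_()
            b.grad.zero_()

# w and b now approximate the estimates \hat{w} and \hat{b}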

Code operation

Vectorization acceleration

When training a model, it usually pays to vectorize the computation over samples, relying on a linear algebra library rather than writing cumbersome and error-prone for loops.

  • Required packages:
    In particular, the d2l here is the companion code library for Li Mu's book and needs to be installed separately. The installation command is as follows:
pip install -U d2l
%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l

To compare vector computation against for-loop computation, we can write a simple vector-addition benchmark.
Here, instantiate two 10,000-dimensional all-ones vectors, and time the for loop against the linear algebra library's addition:

# vector definition
n = 10000
a = torch.ones([n])
b = torch.ones([n])

# timer definition
class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the elapsed time in the list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the total time."""
        return sum(self.times)

    def cumsum(self):
        """Return the cumulative times."""
        return np.array(self.times).cumsum().tolist()
  • For-loop addition timing:
c = torch.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'

Out: '0.09227 sec'

  • Linear-algebra-library addition timing:
timer.start()
d = a + b
f'{timer.stop():.5f} sec'

Out: '0.00103 sec'

It is not difficult to see that vectorized code brings a dramatic speedup (roughly two orders of magnitude here); besides being faster, it also leaves less room for errors.

Normal distribution and squared loss

The normal distribution is closely related to linear regression. Here we define the normal distribution and use it to interpret the squared loss objective.

  • Normal distribution definition
def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)
  • Visualize the normal distribution
# visualize several normal distributions
x = np.arange(-7, 7, 0.01)

# (mean, standard deviation) pairs
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])

(Figure: the resulting plot of the three normal density curves.)
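The connection can be made explicit. If we assume that each label arises from the linear model plus Gaussian noise, y = \mathbf{w}^\top \mathbf{x} + b + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2), then the negative log-likelihood of a label is

-\log P(y \mid \mathbf{x}) = \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2.

The first term and the factor 1/\sigma^2 do not depend on \mathbf{w} or b, so maximizing the likelihood over the data is equivalent to minimizing the squared loss.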

From linear regression to deep networks

Neural network diagram

(Figure: linear regression as a single-layer neural network.)

Linear regression is a single-layer neural network. In the network shown in the figure above, the inputs are x_1, \dots, x_d, so the number of neurons in the input layer is d, which is the feature dimensionality. The output of the network is o_1, i.e., the output dimension is 1.

For linear regression, every input is connected to every output (in this case there is only one output), and this transformation is called a fully-connected layer or a dense layer.
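In PyTorch, for instance, such a fully-connected layer corresponds to nn.Linear; a minimal sketch of the single-layer network above (the dimension d here is an illustrative choice):

import torch
from torch import nn

d = 2                  # feature dimensionality of the input layer
net = nn.Linear(d, 1)  # fully-connected layer: d inputs, one output

x = torch.randn(5, d)  # a batch of 5 samples
o = net(x)             # o = x W^T + b, shape (5, 1)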