Data analysis – numpy

Numpy

Creating arrays with numpy

import numpy as np
a = np.array([1,2,3,4,5])
b = np.array(range(1,6))
c = np.arange(1,6)

Note: a, b, and c above hold the same contents. Pay attention to the difference between range and arange: range is the Python built-in, while np.arange returns an ndarray directly. Usage of np.arange: arange([start,] stop[, step,], dtype=None)
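
The arange signature above can be tried out with a few calls. This is a minimal sketch; the variable names are illustrative and not from the original notes.

```python
import numpy as np

# stop is exclusive, just like Python's range
evens = np.arange(0, 10, 2)              # start=0, stop=10, step=2
# unlike range, arange also accepts non-integer steps
halves = np.arange(0, 2, 0.5)
# dtype forces the element type of the result
as_float = np.arange(1, 6, dtype=float)

print(evens)     # [0 2 4 6 8]
print(halves)    # [0.  0.5 1.  1.5]
print(as_float)  # [1. 2. 3. 4. 5.]
```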

The class name of the array:

In [1]: a = np.array([1,2,3,4,5])
In [2]: type(a)
Out[2]: numpy.ndarray

Type of data

In [3]: a.dtype
Out[3]: dtype('int64')

More common data types in numpy

Data type operation

Specify the data type of the created array

In [40]: a = np.array([1,0,1,0], dtype=bool)  # or use dtype='?'; the np.bool alias is not available in every NumPy version
In [41]: a
Out[41]: array([ True, False,  True, False], dtype=bool)

Modify the data type of the array

In [44]: a.astype("i1")  # or use a.astype(np.int8)
Out[44]: array([1, 0, 1, 0], dtype=int8)
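
As a small illustration of the two operations above, the sketch below creates an array with an explicit dtype and converts it with astype. The variable names and the float values are made up for the example.

```python
import numpy as np

flags = np.array([1, 0, 1, 0], dtype=bool)
small = flags.astype(np.int8)        # returns a new array; flags is unchanged

floats = np.array([1.7, -2.3, 3.9])
truncated = floats.astype(int)       # float -> int drops the fractional part

print(flags.dtype, small.dtype)      # bool int8
print(truncated)                     # [ 1 -2  3]
```

Note that astype truncates toward zero rather than rounding, so use np.round first if rounding is what you want.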

Round floating-point values to a given number of decimal places

In [53]: b
Out[53]:
array([0.0485436, 0.26320629, 0.69646413, 0.71811003, 0.3576838, 0.58919477, 0.84757749, 0.52428633, 0.486302, 0.48908838])
In [54]: np.round(b,2)
Out[54]: array([ 0.05, 0.26, 0.7 , 0.72, 0.36, 0.59, 0.85, 0.52, 0.49, 0.49])

The shape of the array

In [60]: a=np.array([[3,4,5,6,7,8],[4,5,6,7,8,9]])
In [61]: a
out[61]:
array ([[3,4,5,6,7,8],
        [4,5,6,7,8,9]])

View the shape of the array

In [62]: a.shape
out[62]: (2,6)

Modify the shape of the array

In [63]: a.reshape(3,4)
Out[63]:
array([[3,4,5,6],
       [7,8,4,5],
       [6,7,8,9]])
In [64]: a.shape
Out[64]: (2,6)

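Why is a still an array of 2 rows and 6 columns? Because reshape returns a new array and leaves a itself unchanged; assign the result (for example a = a.reshape(3,4)) to keep the new shape.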
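
The question above can be checked directly. A minimal sketch, reusing the array a from the transcript (b is an extra name introduced here for illustration):

```python
import numpy as np

a = np.array([[3, 4, 5, 6, 7, 8], [4, 5, 6, 7, 8, 9]])

b = a.reshape(3, 4)    # a new array object with a different shape
print(a.shape)         # (2, 6)
print(b.shape)         # (3, 4)

a = a.reshape(3, 4)    # rebind the name to keep the new shape
print(a.shape)         # (3, 4)
```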

Convert the array into 1-dimensional data

t1 = np.arange(12)
t3 = t1.reshape(3,4)
print(t3)
print(t3.shape)
print(t3.reshape(12,))  # or print(t3.flatten())
"""
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
(3, 4)
[ 0  1  2  3  4  5  6  7  8  9 10 11]
"""

Calculation of arrays and numbers

In [71]: a
Out[71]:
array ([[3,4,5,6,7,8],
        [4,5,6,7,8,9]])

Addition and subtraction

In [72]: a + 1
out[72]:
array([[ 4,5,6,7,8,9],
        [5,6,7,8,9,10]])

Multiplication and division

In [73]: a*3
out[73]:
array([[ 9,12,15,18,21,24],
        [12,15,18,21,24,27]])

This works because of numpy's broadcasting mechanism: during the operation, the scalar is broadcast to every element of the array, so addition, subtraction, multiplication, and division are all applied element by element.

Array and array calculation

In [78]: a
out[78]:
array([[ 3,4,5,6,7,8],
        [4,5,6,7,8,9]])

In [79]: b
out[79]:
array ([[21,22,23,24,25,26],
        [27,28,29,30,31,32]])

Addition and subtraction of arrays and arrays

In [80]: a + b
out[80]:
array([[ 24,26,28,30,32,34],
        [31,33,35,37,39,41]])

Multiplication and division of arrays and arrays

In [81]: a*b
out[81]:
array([[ 63,88,115,144,175,208],
        [108,140,174,210,248,288]])

Broadcasting principles

Two arrays are broadcast-compatible if, comparing their shapes axis by axis starting from the trailing (last) dimension, the axis lengths either match or one of them is 1. Broadcasting is then performed over the missing and/or length-1 dimensions.
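
The rule can be illustrated with three small cases. This is only a sketch; the array names and values are made up, and the try/except is just one way to show the incompatible case.

```python
import numpy as np

data = np.arange(12).reshape(3, 4)   # shape (3, 4)
row = np.array([10, 20, 30, 40])     # shape (4,): matches the trailing axis
col = np.array([[1], [2], [3]])      # shape (3, 1): the length-1 axis is stretched

print((data + row).shape)            # (3, 4)
print((data + col).shape)            # (3, 4)

# shape (3,) does not match the trailing axis of length 4, so this fails
try:
    data + np.array([1, 2, 3])
    compatible = True
except ValueError:
    compatible = False
print(compatible)                    # False
```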

Take another look at the example of numpy broadcasting:

t1=np.array([1,2])
t1_2=np.array([0,6])
t2=t1.reshape(1,2)
print(t2)
t3=t1_2.reshape(2,1)
print(t3)
t4=t1+t3
print(t4)
"""
[[1 2]]
[[0]
 [6]]
[[1 2]
 [7 8]]
"""

Read local files

np.loadtxt(fname, dtype=float, delimiter=None, skiprows=0, usecols=None, unpack=False)

parameter	description
fname	File, filename, or string source; .gz and .bz2 compressed files are also accepted
dtype	Data type that the strings in the CSV are converted to when read into the array; the default is float
delimiter	String that separates values in the file; the default is whitespace
skiprows	Number of leading rows to skip, typically 1 to skip the header row
usecols	Index or tuple of indices of the columns to read
unpack	If True, the result is transposed (rows become columns) so that each column can be assigned to its own variable; if False, all data is returned in a single array. Default False
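
To see the parameters in action without a file on disk, the sketch below reads from an in-memory string. The CSV text, its column names, and the variable names are invented for the example.

```python
import io
import numpy as np

# in-memory stand-in for a CSV file; the header and values are made up
csv_text = "views,likes,comments\n100,10,1\n200,20,2\n300,30,3\n"

# skiprows=1 skips the header row, delimiter="," splits on commas
table = np.loadtxt(io.StringIO(csv_text), dtype=int, delimiter=",", skiprows=1)
print(table.shape)    # (3, 3)

# usecols picks columns 0 and 2; unpack=True returns one array per column
views, comments = np.loadtxt(io.StringIO(csv_text), dtype=int, delimiter=",",
                             skiprows=1, usecols=(0, 2), unpack=True)
print(views)          # [100 200 300]
print(comments)       # [1 2 3]
```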

Transpose

Transposition swaps the rows and columns of an array, that is, it flips the data across the main diagonal. It is often used to make the data more convenient to process.

import numpy as np
t2=np.arange(24).reshape(4,6)
In [2]:t2
Out[2]:
array([[ 0, 1, 2, 3, 4, 5],
       [ 6, 7, 8, 9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])
In [6]: t2.transpose() #Transpose
Out[6]:
array([[ 0, 6, 12, 18],
       [ 1, 7, 13, 19],
       [ 2, 8, 14, 20],
       [ 3, 9, 15, 21],
       [ 4, 10, 16, 22],
       [ 5, 11, 17, 23]])
In [7]:t2.T #transpose
Out[7]:
array([[ 0, 6, 12, 18],
       [ 1, 7, 13, 19],
       [ 2, 8, 14, 20],
       [ 3, 9, 15, 21],
       [ 4, 10, 16, 22],
       [ 5, 11, 17, 23]])
In [8]: t2.swapaxes(1,0)  # swap axis 1 and axis 0
Out[8]:
array([[ 0, 6, 12, 18],
       [ 1, 7, 13, 19],
       [ 2, 8, 14, 20],
       [ 3, 9, 15, 21],
       [ 4, 10, 16, 22],
       [ 5, 11, 17, 23]])

Indexing and slicing

print(t2)
"""
[[4394029 320053 5931 46245]
 [7860119 185853 26679 0]
 [5845909 576597 39774 170708]
 ...
 [ 142463 4231 148 279 ]
 [2162240 41032 1384 4737]
 [ 515000 34727 195 4722]]
"""

Get rows

# Get a single row
print(t2[2])
"""
[5845909 576597 39774 170708]
"""
# Get multiple consecutive rows
print(t2[2:])
"""
[[5845909 576597 39774 170708]
 [2642103 24975 4542 12829]
 [1168130 96666 568 6666]
 ...
 [ 142463 4231 148 279 ]
 [2162240 41032 1384 4737]
 [ 515000 34727 195 4722]]
"""
# Get non-consecutive rows
print(t2[[2,8,10]])
"""
[[5845909 576597 39774 170708]
 [1338533 69687 678 5643]
 [859289 34485 726 1914]]
"""

Get columns

# Get a single column
print(t2[:,0])
"""
[4394029 7860119 5845909 ... 142463 2162240 515000]
"""
# Get multiple consecutive columns
print(t2[:,0:3])
"""
[[4394029 320053 5931]
 [7860119 185853 26679]
 [5845909 576597 39774]
 ...
 [ 142463 4231 148]
 [2162240 41032 1384]
 [ 515000 34727 195]]
"""
# Get multiple non-consecutive columns
print(t2[:,[0,3]])
"""
[[4394029 46245]
 [7860119 0]
 [5845909 170708]
 ...
 [ 142463 279 ]
 [2162240 4737]
 [ 515000 4722]]
"""

Get the value at a given row and column

# Get the value at the third row and the fourth column
a = t2[2,3]
print(a)
print(type(a))
"""
170708
<class 'numpy.int32'>
"""

Get multiple rows and columns

# Get multiple rows and columns: the 3rd to the 5th row and the 2nd to the 4th column
# The result is the intersection of the selected rows and columns
b = t2[2:5,1:4]
print(b)
"""
[[576597 39774 170708]
 [24975 4542 12829]
 [96666 568 6666]]
"""

Take multiple non-adjacent points

#Take multiple non-adjacent points
#The selected result is (0,0) (2,1) (1,3)
c = t2[[0,2,1],[0,1,3]]
print(c)
"""
[4394029 576597 0]
"""

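Indexing with two lists pairs the indices element by element, which is different from taking the intersection of rows and columns. The sketch below contrasts the two on a small made-up array; np.ix_ is one way to get the intersection and is not used in the original notes.

```python
import numpy as np

grid = np.arange(20).reshape(5, 4)

# two index lists are paired element by element: (0,0), (2,1), (1,3)
points = grid[[0, 2, 1], [0, 1, 3]]
print(points)          # [0 9 7]

# np.ix_ builds an open mesh, selecting every row/column combination
block = grid[np.ix_([0, 2], [1, 3])]
print(block)
# [[ 1  3]
#  [ 9 11]]
```
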
Modification of value

t2[:,[1,3]]=1
print(t2)
"""
[[4394029 1 5931 1]
 [7860119 1 26679 1]
 [5845909 1 39774 1]
 ...
 [ 142463 1 148 1]
 [2162240 1 1384 1]
 [ 515000 1 195 1]]
"""

Boolean indexing

Change the values less than 10 to 3

t3 = np.array(range(24)).reshape(4,6)
print(t3<10)  # the comparison produces an array of bool type
"""
[[ True  True  True  True  True  True]
 [ True  True  True  True False False]
 [False False False False False False]
 [False False False False False False]]
"""
t3 = np.array(range(24)).reshape(4,6)
t3[t3<10] = 3
print(t3)
"""
[[ 3 3 3 3 3 3]
 [ 3 3 3 3 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
"""

Ternary operator

Change values <10 to 0 and all other values (>=10) to 10

t4=np.where(t3<10,0,10)
print(t4)
"""
[[ 0 0 0 0 0 0]
 [ 0 0 0 0 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]]
"""

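np.where(condition, x, y) makes a two-way choice, so every value that fails the condition gets the second replacement. If a third case is needed (for example, leaving in-between values untouched), one option is to nest a second np.where. A minimal sketch with made-up values and thresholds:

```python
import numpy as np

values = np.array([3, 10, 15, 25])

# two-way choice: everything not below 10 (including 10 itself) becomes 10
two_way = np.where(values < 10, 0, 10)
print(two_way)      # [ 0 10 10 10]

# three-way choice: <10 -> 0, >20 -> 20, everything else unchanged
three_way = np.where(values < 10, 0, np.where(values > 20, 20, values))
print(three_way)    # [ 0 10 15 20]
```
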
clip (cropping)

Change values <10 to 10 and values >18 to 18

t5=t3.clip(10,18)
print(t5)
"""
[[10 10 10 10 10 10]
 [10 10 10 10 10 11]
 [12 13 14 15 16 17]
 [18 18 18 18 18 18]]
"""

Setting a value to nan

t3[3,3]=np.nan
print(t3)
"""
Raises an error:
ValueError: cannot convert float NaN to integer
(nan is a float, so it cannot be stored in an integer array)
"""
# So first convert t3 to float type
t3 = t3.astype(float)
t3[3,3]=np.nan
print(t3)
"""
[[ 0. 1. 2. 3. 4. 5.]
 [ 6. 7. 8. 9. 10. 11.]
 [12. 13. 14. 15. 16. 17.]
 [18. 19. 20. nan 22. 23.]]
"""

nan and inf

nan (NAN, NaN): short for "not a number".

When does nan appear in numpy? When a local file is read as float and some values are missing, or when an undefined calculation is performed (such as infinity minus infinity).

inf (inf, -inf): infinity. inf means positive infinity and -inf means negative infinity.

When does inf (or -inf) appear? For example, when a nonzero number is divided by 0. Plain Python raises an error in that case, whereas numpy returns inf or -inf and emits a warning.
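
The cases described above can be reproduced in a few lines. This is a sketch; np.errstate is used here only to silence the warnings numpy would otherwise print, and the variable names are illustrative.

```python
import numpy as np

# suppress the RuntimeWarnings that numpy would otherwise emit here
with np.errstate(divide="ignore", invalid="ignore"):
    ratios = np.array([1.0, -1.0, 0.0]) / 0.0   # inf, -inf, and 0/0 -> nan

diff = np.inf - np.inf                          # undefined, so the result is nan

print(ratios)               # [ inf -inf  nan]
print(diff)                 # nan
print(type(np.nan))         # <class 'float'>
```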

Notes on nan

Two nans are not equal

In [76]: np.nan == np.nan
Out[76]: False

np.nan != np.nan evaluates to True

In [81]: np.nan != np.nan
Out[81]: True

Use this property to count the nans in an array

In [86]: t
Out[86]: array([ 1., 2., nan])
In [87]: np.count_nonzero(t != t)
Out[87]: 1

Because nan is not equal to itself, how can we tell whether a number is nan? Use np.isnan(a), which returns a bool array. For example, to replace nan with 0:

In [89]: t
Out[89]: array([ 1., 2., nan])
In [90]: t[np.isnan(t)] = 0
In [91]: t
Out[91]: array([ 1., 2., 0.])

Any arithmetic operation between nan and another value evaluates to nan

Counting the nans in an array

np.count_nonzero(t!=t)

np.count_nonzero(np.isnan(t))

Array summation (np.sum)

If the array contains nan, its sum is nan as well

print(t3)
"""
[[ 0. 1. 2. 3. 4. 5.]
 [ 6. 7. 8. 9. 10. 11.]
 [12. 13. 14. 15. 16. 17.]
 [18. 19. 20. nan 22. 23.]]
"""
print(np.sum(t3))
"""
nan
"""

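When a nan-propagating sum is not what you want, numpy also offers nan-aware functions such as np.nansum and np.nanmean. The sketch below shows them on a small made-up array and, as one possible strategy (not from the original notes), fills each nan with the mean of its column.

```python
import numpy as np

t = np.arange(12, dtype=float).reshape(3, 4)
t[1, 2] = np.nan

print(np.sum(t))        # nan
print(np.nansum(t))     # 60.0

# replace each nan with the mean of the non-nan values in its column
col_means = np.nanmean(t, axis=0)
rows, cols = np.where(np.isnan(t))
t[rows, cols] = col_means[cols]
print(t[1, 2])          # 6.0
```
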
Ordinary sum

print(t4)
"""
[[ 0 1 2 3]
 [ 4 5 6 7 ]
 [ 8 9 10 11]]
"""
# Sum of all elements
print(np.sum(t4))
"""
66
"""
# Sum of each column
print(np.sum(t4,axis=0))
"""
[12 15 18 21]
"""
# Sum of each row
print(np.sum(t4,axis=1))
"""
[ 6 22 38]
"""

Common statistical functions

Sum: t.sum(axis=None)

Mean: t.mean(axis=None), strongly affected by outliers

Median: np.median(t, axis=None)

Maximum: t.max(axis=None)

Minimum: t.min(axis=None)

Range: np.ptp(t, axis=None), the difference between the maximum and minimum values

Standard deviation: t.std(axis=None)

By default each function returns a single statistic computed over the whole multidimensional array; if axis is specified, the statistic is computed along that axis (for example, axis=0 gives one result per column of a 2-D array).
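
A short sketch of these functions on a made-up array; the value 60 is included deliberately as an outlier to show how it pulls the mean but not the median.

```python
import numpy as np

t = np.array([[1, 2, 3],
              [4, 5, 60]])

print(t.sum())                  # 75
print(t.mean(axis=1))           # [ 2. 23.]
print(np.median(t, axis=1))     # [2. 5.]
print(t.max(axis=1))            # [ 3 60]
print(np.ptp(t, axis=1))        # [ 2 56]
```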

Array splicing

t1 = np.array(range(12)).reshape(2,6)
t2 = np.arange(12,24).reshape(2,6)
print(t1)
print(t2)
"""
[[ 0 1 2 3 4 5]
 [ 6 7 8 9 10 11]]
 
[[12 13 14 15 16 17]
 [18 19 20 21 22 23]]
"""

Vertical stitching np.vstack

t3 = np.vstack((t1,t2)) #Vertical splicing (vertically)
print(t3)
"""
[[ 0 1 2 3 4 5]
 [ 6 7 8 9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
"""

Horizontal stitching np.hstack

t4 = np.hstack((t1,t2)) #horizontal stitching (horizontally)
print(t4)
"""
[[ 0 1 2 3 4 5 12 13 14 15 16 17]
 [ 6 7 8 9 10 11 18 19 20 21 22 23]]
"""

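For 2-D arrays like these, the same results can also be obtained with np.concatenate by choosing the axis explicitly. A minimal sketch (the names stacked_rows and stacked_cols are illustrative):

```python
import numpy as np

t1 = np.arange(12).reshape(2, 6)
t2 = np.arange(12, 24).reshape(2, 6)

stacked_rows = np.concatenate((t1, t2), axis=0)   # same result as np.vstack
stacked_cols = np.concatenate((t1, t2), axis=1)   # same result as np.hstack

print(stacked_rows.shape)    # (4, 6)
print(stacked_cols.shape)    # (2, 12)
```
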
Row and column exchange of array

t5 = np.arange(12,24).reshape(3,4)
print(t5)
"""
[[12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]
"""

Row swap

t5[[1,2],:] = t5[[2,1],:]
print(t5)
"""
[[12 13 14 15]
 [20 21 22 23]
 [16 17 18 19]]
"""

Column swap

t5 = np.arange(12,24).reshape(3,4)  # start again from the original t5
t5[:,[0,2]] = t5[:,[2,0]]
print(t5)
"""
[[14 13 12 15]
 [18 17 16 19]
 [22 21 20 23]]
"""

More useful methods

Get the position of the maximum and minimum values

np.argmax(t, axis=0)

np.argmin(t,axis=1)

Create an array of all 0s: np.zeros((3,4))

Create an array of all 1s: np.ones((3,4))

Create a square array (square matrix) with a diagonal of 1: np.eye(3)
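
The helpers listed above can be tried on a small made-up array, as in this sketch:

```python
import numpy as np

t = np.array([[3, 9, 1],
              [7, 2, 8]])

print(np.argmax(t, axis=0))     # [1 0 1]  row index of the maximum in each column
print(np.argmin(t, axis=1))     # [2 1]    column index of the minimum in each row

zeros = np.zeros((3, 4))        # float64 by default
ones = np.ones((3, 4))
identity = np.eye(3)            # 1s on the diagonal, 0s elsewhere
print(zeros.shape, ones.dtype)  # (3, 4) float64
```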

Generate random numbers

np.random

function	description
.rand(d0, d1, ..., dn)	Creates an array of shape (d0, d1, ..., dn) of uniformly distributed random floats in the range [0, 1)
.randn(d0, d1, ..., dn)	Creates an array of shape (d0, d1, ..., dn) of random floats drawn from the standard normal distribution (mean 0, standard deviation 1)
.randint(low, high, (shape))	Draws random integers from the range [low, high), with high excluded, and returns an array of the given shape
.uniform(low, high, (size))	Draws samples from a uniform distribution over [low, high) and returns an array of shape size
.normal(loc, scale, (size))	Draws samples from the normal distribution centered at loc (the mean) with standard deviation scale, and returns an array of shape size
.seed(s)	Sets the random number seed to s. Because the computer generates pseudo-random numbers, setting the same seed produces the same random numbers every time
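
The effect of the seed can be checked by drawing the same array twice. This is a minimal sketch using the functions from the table; the seed value 10 and the variable names are arbitrary.

```python
import numpy as np

np.random.seed(10)
first = np.random.randint(0, 20, (3, 4))    # integers in [0, 20)

np.random.seed(10)
second = np.random.randint(0, 20, (3, 4))   # same seed, same numbers

print(np.array_equal(first, second))        # True

uniform = np.random.uniform(1.0, 2.0, (2, 3))   # floats in [1.0, 2.0)
normal = np.random.normal(0.0, 1.0, (2, 3))     # mean 0, standard deviation 1
print(uniform.shape, normal.shape)              # (2, 3) (2, 3)
```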