Sklearn cross-validation and decision tree drawing

1. Importing the relevant libraries and preliminary preparation for drawing the decision tree

The make_blobs function generates synthetic datasets for classification or clustering.
n_features indicates how many feature values each sample has
n_samples is the number of samples
centers is the number of cluster centers, which can be understood as the number of label classes
random_state is a random seed that fixes the generated data
cluster_std sets the standard deviation of each cluster (default is 1.0)
shuffle: whether to randomly shuffle the samples
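A small illustration of these parameters ahead of the formal example below (the values here are arbitrary):
from sklearn.datasets import make_blobs
#100 samples, 2 features, 3 centers, tighter clusters, fixed seed, shuffled
X_demo, y_demo = make_blobs(n_samples = 100, n_features = 2, centers = 3,
                            cluster_std = 0.5, shuffle = True, random_state = 8)
print(X_demo.shape, y_demo.shape)#(100, 2) (100,)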
#Import dataset generator
from sklearn.datasets import make_blobs
help(make_blobs)
Help on function make_blobs in module sklearn.datasets._samples_generator:

make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False)
    Generate isotropic Gaussian blobs for clustering.
    
    Read more in the :ref:`User Guide <sample_generators>`.
    
    Parameters
    ----------
    n_samples : int or array-like, default=100
        If int, it is the total number of points equally divided among
        clusters.
        If array-like, each element of the sequence indicates
        the number of samples per cluster.
    
        .. versionchanged:: v0.20
            one can now pass an array-like to the ``n_samples`` parameter
    
    n_features : int, default=2
        The number of features for each sample.
    
    centers : int or ndarray of shape (n_centers, n_features), default=None
        The number of centers to generate, or the fixed center locations.
        If n_samples is an int and centers is None, 3 centers are generated.
        If n_samples is array-like, centers must be
        either None or an array of length equal to the length of n_samples.
    
    cluster_std : float or array-like of float, default=1.0
        The standard deviation of the clusters.
    
    center_box : tuple of float (min, max), default=(-10.0, 10.0)
        The bounding box for each cluster center when centers are
        generated at random.
    
    shuffle : bool, default=True
        Shuffle the samples.
    
    random_state : int, RandomState instance or None, default=None
        Determines random number generation for dataset creation. Pass an int
        for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.
    
    return_centers : bool, default=False
        If True, then return the centers of each cluster
    
        .. versionadded:: 0.23
    
    Returns
    -------
    X : ndarray of shape (n_samples, n_features)
        The generated samples.
    
    y : ndarray of shape (n_samples,)
        The integer labels for cluster membership of each sample.
    
    centers : ndarray of shape (n_centers, n_features)
        The centers of each cluster. Only returned if
        ``return_centers=True``.
    
    Examples
    --------
    >>> from sklearn.datasets import make_blobs
    >>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
    ...                   random_state=0)
    >>> print(X.shape)
    (10, 2)
    >>> y
    array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
    >>> X, y = make_blobs(n_samples=[3, 3, 4], centers=None, n_features=2,
    ...                   random_state=0)
    >>> print(X.shape)
    (10, 2)
    >>> y
    array([0, 1, 2, 0, 2, 2, 2, 1, 1, 0])
    
    See Also
    --------
    make_classification : A more intricate variant.

data = make_blobs(n_samples = 200, centers = 2, random_state = 8)
print(data)
(array([[ 6.75445054,  9.74531933],
       [ 6.80526026, -0.2909292 ],
       [ 7.07978644,  7.81427747],
       ...,
       [ 7.24211001,  7.48506871],
       [ 8.2634157 , 10.34723435],
       [ 8.39800148,  2.8397151 ]]), array([0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       ...,
       0, 1]))
X, y = data #Separate independent variables and dependent variables
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = plt.cm.spring, edgecolors = 'k')
<matplotlib.collections.PathCollection at 0xaf15f70>

#Import iris data set
from sklearn.datasets import load_iris
iris = load_iris()
#Import boston data set
from sklearn.datasets import load_boston
boston = load_boston()
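Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; on recent versions the boston examples below can be reproduced with another regression dataset, for example:
#On scikit-learn >= 1.2, a drop-in regression dataset with .data and .target
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()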
class sklearn.preprocessing.MinMaxScaler(feature_range = (0, 1), copy = True)
Scales data to a specified range
class sklearn.preprocessing.MaxAbsScaler(copy = True)
Scales data so that the maximum absolute value of each feature is 1
#Scale the boston data to the range (10, 100)
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler(feature_range=(10,100))#instantiation
mms.fit(boston.data)#fit only computes the per-feature minimum and maximum (preparatory work)
MinMaxScaler(feature_range=(10, 100))
boston_mms = mms.transform(boston.data)
mms2 = MinMaxScaler(feature_range=(10,100), copy = False)
mms2.fit_transform(boston.data)
array([[ 10.        ,  26.2       ,  16.10337243, ...,  35.85106383,
        100.        ,  18.07119205],
       [ 10.02123303,  10.        ,  31.80718475, ...,  59.78723404,
        100.        ,  28.40231788],
       [ 10.0212128 ,  10.        ,  31.80718475, ...,  59.78723404,
         99.07635282,  15.71192053],
       ...,
       [ 10.05507032,  10.        ,  47.84090909, ...,  90.42553191,
        100.        ,  19.7102649 ],
       [ 10.10446569,  10.        ,  47.84090909, ...,  90.42553191,
         99.21705583,  21.79635762],
       [ 10.04156575,  10.        ,  47.84090909, ...,  90.42553191,
        100.        ,  25.27317881]])
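The numbers above follow MinMaxScaler's per-feature formula X_std = (X - X.min()) / (X.max() - X.min()), then X_scaled = X_std * (max - min) + min for the target range. A quick check on toy data (illustrative values, not from boston):
import numpy as np
from sklearn.preprocessing import MinMaxScaler
x = np.array([[1.0], [2.0], [4.0]])
x_std = (x - x.min()) / (x.max() - x.min())#rescale to [0, 1]
manual = x_std * (100 - 10) + 10#stretch to the target range (10, 100)
print(np.allclose(MinMaxScaler(feature_range = (10, 100)).fit_transform(x), manual))#True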
from sklearn.preprocessing import MaxAbsScaler
mas = MaxAbsScaler()#scales each feature by its maximum absolute value, giving a range of [-1, 1] ([0, 1] for non-negative data)
mas.fit_transform(boston.data)#the original data is unchanged because copy = True by default
array([[0.1       , 0.262     , 0.16103372, ..., 0.35851064, 1.        ,
        0.18071192],
       [0.10021233, 0.1       , 0.31807185, ..., 0.59787234, 1.        ,
        0.28402318],
       [0.10021213, 0.1       , 0.31807185, ..., 0.59787234, 0.99076353,
        0.15711921],
       ...,
       [0.1005507 , 0.1       , 0.47840909, ..., 0.90425532, 1.        ,
        0.19710265],
       [0.10104466, 0.1       , 0.47840909, ..., 0.90425532, 0.99217056,
        0.21796358],
       [0.10041566, 0.1       , 0.47840909, ..., 0.90425532, 1.        ,
        0.25273179]])
Normalization of data -- scaling each sample vector to unit length
sklearn.preprocessing.normalize(
X, axis = 1, copy = True,
norm = 'l2': 'l1', 'l2', or 'max', the specific norm used for normalization,
return_norm = False: whether to also return the norms that were used
)
A vector's norm measures its length; dividing each sample by its norm turns it into a unit vector.
from sklearn.preprocessing import normalize
help(normalize)
Help on function normalize in module sklearn.preprocessing._data:

normalize(X, norm='l2', *, axis=1, copy=True, return_norm=False)
    Scale input vectors individually to unit norm (vector length).
    
    Read more in the :ref:`User Guide <preprocessing_normalization>`.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The data to normalize, element by element.
        scipy.sparse matrices should be in CSR format to avoid an
        un-necessary copy.
    
    norm : {'l1', 'l2', 'max'}, default='l2'
        The norm to use to normalize each non zero sample (or each non-zero
        feature if axis is 0).
    
    axis: {0, 1}, default=1
        axis used to normalize the data along. If 1, independently normalize
        each sample, otherwise (if 0) normalize each feature.
    
    copy: bool, default=True
        set to False to perform inplace row normalization and avoid a
        copy (if the input is already a numpy array or a scipy.sparse
        CSR matrix and if axis is 1).
    
    return_norm : bool, default=False
        whether to return the computed norms
    
    Returns
    -------
    X : {ndarray, sparse matrix} of shape (n_samples, n_features)
        Normalized input X.
    
    norms : ndarray of shape (n_samples, ) if axis=1 else (n_features, )
        An array of norms along given axis for X.
        When X is sparse, a NotImplementedError will be raised
        for norm 'l1' or 'l2'.
    
    See Also
    --------
    Normalizer : Performs normalization using the Transformer API
        (e.g. as part of a preprocessing :class:`~sklearn.pipeline.Pipeline`).
    
    Notes
    -----
    For a comparison of the different scalers, transformers, and normalizers,
    see :ref:`examples/preprocessing/plot_all_scaling.py
    <sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.

X1 = [[1,1,2],[2,2,4]]
normalize(X1,
          norm = 'l2', #Select norm type
          return_norm=True #Return the norm of each vector
         )
(array([[0.40824829, 0.40824829, 0.81649658],
        [0.40824829, 0.40824829, 0.81649658]]),
 array([2.44948975, 4.89897949]))
normalize(X1,
          norm = 'l1', #Select norm type
          return_norm=True #Return the norm of each vector
         )
(array([[0.25, 0.25, 0.5 ],
        [0.25, 0.25, 0.5 ]]),
 array([4., 8.]))
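The returned norms can be verified by hand with numpy (a quick sanity check, not part of the original notebook):
import numpy as np
X1 = np.array([[1, 1, 2], [2, 2, 4]])
print(np.linalg.norm(X1, ord = 2, axis = 1))#[2.44948975 4.89897949], i.e. sqrt(6) and sqrt(24)
print(np.linalg.norm(X1, ord = 1, axis = 1))#[4. 8.]
print(X1 / np.linalg.norm(X1, axis = 1, keepdims = True))#matches the l2 output above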
Standardization that accounts for outliers:
Robust standardization
uses the median and percentiles (the interquartile range by default) instead of the mean and standard deviation, making it better suited to data known to contain outliers.
sklearn.preprocessing.robust_scale(
X, axis = 0, with_centering = True, with_scaling = True,
quantile_range = (25.0, 75.0): percentiles used to measure the spread of the data,
copy = True
)
class sklearn.preprocessing.RobustScaler(
with_centering = True, with_scaling = True,
quantile_range = (25.0, 75.0), copy = True
)
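Per feature, the transformation is (x - median) / IQR. A quick manual check on a toy column with an outlier (illustrative data, not from the original):
import numpy as np
from sklearn.preprocessing import robust_scale
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])#one obvious outlier
median = np.median(x)#3.0
q1, q3 = np.percentile(x, [25, 75])#2.0 and 4.0
print(np.allclose(robust_scale(x), (x - median) / (q3 - q1)))#True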
#Robust standardization
from sklearn.preprocessing import robust_scale
from sklearn.preprocessing import RobustScaler
robust_scale(boston.data)
array([[-0.06959315,  1.44      , -0.57164988, ..., -1.33928571,
         0.26190191, -0.63768116],
       [-0.06375455,  0.        , -0.20294345, ..., -0.44642857,
         0.26190191, -0.22188906],
       [-0.06376011,  0.        , -0.20294345, ..., -0.44642857,
         0.06667466, -0.73263368],
       ...,
       [-0.05445006,  0.        ,  0.17350891, ...,  0.69642857,
         0.26190191, -0.57171414],
       [-0.04086745,  0.        ,  0.17350891, ...,  0.69642857,
         0.09641444, -0.48775612],
       [-0.05816351,  0.        ,  0.17350891, ...,  0.69642857,
         0.26190191, -0.34782609]])
rs = RobustScaler()#instantiation
rs.fit_transform(boston.data)
array([[-0.06959315,  1.44      , -0.57164988, ..., -1.33928571,
         0.26190191, -0.63768116],
       [-0.06375455,  0.        , -0.20294345, ..., -0.44642857,
         0.26190191, -0.22188906],
       [-0.06376011,  0.        , -0.20294345, ..., -0.44642857,
         0.06667466, -0.73263368],
       ...,
       [-0.05445006,  0.        ,  0.17350891, ...,  0.69642857,
         0.26190191, -0.57171414],
       [-0.04086745,  0.        ,  0.17350891, ...,  0.69642857,
         0.09641444, -0.48775612],
       [-0.05816351,  0.        ,  0.17350891, ...,  0.69642857,
         0.26190191, -0.34782609]])
S-fold cross validation (abbreviated cv)
S is a hyperparameter: the data is divided into S folds, and the model is trained S times, each time holding out a different fold for validation.
This is fairer than a single simple split.
The extreme case is leave-one-out cross validation (LOOCV), which holds out a single data point as the validation set each time.
Holding out P data points instead is called LPOCV (leave-P-out cross validation); see the sketch below.
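Both schemes are available in scikit-learn as LeaveOneOut and LeavePOut; a minimal sketch on a toy array (the data here is made up for illustration):
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut
X_toy = np.arange(8).reshape(4, 2)#4 samples, 2 features
loo = LeaveOneOut()#one validation point per split
print(loo.get_n_splits(X_toy))#4
lpo = LeavePOut(p = 2)#P = 2 validation points per split
print(lpo.get_n_splits(X_toy))#C(4, 2) = 6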
Split into training set and test set
sklearn.model_selection.train_test_split(
*arrays: the data objects to split; several arrays can be split at the same time, but they must all have the same length,
test_size = 0.25: float, int, or None; the proportion (0-1) or absolute number of samples held out to validate the model,
    when None, it is set to the complement of train_size (0.25 if train_size is also None),
train_size = None: float, int, or None; the proportion (0-1) or number of samples used to train the model,
    when None, it is computed automatically from test_size,
random_state = None: random seed,
shuffle = True: whether to shuffle the samples before splitting,
stratify = None: array-like or None; if given, the data is split so that the class proportions of these labels are preserved
) returns: a list of the split input objects, of length 2 * len(arrays)
# Split into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = 0.3)
len(X_train)
354
len(boston.data)
506
len(y_train)
354
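The stratify parameter deserves a demonstration: for classification data it keeps the class proportions identical in both parts. A minimal sketch on iris (not part of the original run; the counts follow from stratification itself):
import numpy as np
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                           test_size = 0.3,
                                           stratify = iris.target,
                                           random_state = 8)
print(np.bincount(y_tr))#[35 35 35], each class split 70/30
print(np.bincount(y_te))#[15 15 15]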
Cross-validation combines splitting and evaluation
sklearn.model_selection provides three helpers:
cross_val_score: splits the data and scores the model in one call
    estimator: the estimator object used to fit the data
    X: array-like, the data used to fit the model
cross_validate: evaluates several metrics at the same time
cross_val_predict: returns the cross-validated predictions for each sample
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
scores = cross_val_score(reg, boston.data, boston.target, cv = 10)
scores
array([ 0.73376082, 0.4730725 , -1.00631454, 0.64113984, 0.54766046,
        0.73640292, 0.37828386, -0.12922703, -0.76843243, 0.4189435 ])
scores.mean(), scores.std()
(0.20252899006055367, 0.5952960169512383)
The boston dataset is stored in a sorted order, so consecutive folds are not representative; this is why the scores are poor and vary widely.
#Randomly arrange the data set to ensure uniformity of splitting
import numpy as np
X, y = boston.data, boston.target
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X, y = X[indices], y[indices]
reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv = 10)
scores
array([0.77212498, 0.79470905, 0.59899391, 0.80717087, 0.76007414,
       0.75699564, 0.72688181, 0.24256808, 0.6518304, 0.66100191])
scores.mean(), scores.std()
(0.6772350793373447, 0.1585378148669398)
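cross_validate and cross_val_predict were imported above but not yet used; a minimal sketch of both on the same shuffled data (the scoring strings are standard scikit-learn metric names):
from sklearn.model_selection import cross_validate, cross_val_predict
#Evaluate several metrics in one pass
res = cross_validate(reg, X, y, cv = 10,
                     scoring = ('r2', 'neg_mean_squared_error'))
print(res['test_r2'].mean(), res['test_neg_mean_squared_error'].mean())
#Out-of-fold predictions for every sample
y_pred = cross_val_predict(reg, X, y, cv = 10)
print(y_pred.shape)#(506,)
An equivalent way to get shuffled folds without reordering the data by hand is to pass cv = KFold(n_splits = 10, shuffle = True, random_state = 8) from sklearn.model_selection.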
Create a decision tree using sklearn
class sklearn.tree.DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
ct = DecisionTreeClassifier()#instantiation
ct.fit(iris.data, iris.target)#Model training
DecisionTreeClassifier()
ct.max_features_
4
ct.feature_importances_#Feature importance score
array([0.01333333, 0. , 0.06405596, 0.92261071])
ct.predict(iris.data)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
#classification_report summarizes precision, recall and f1 per class, convenient for comparing different models
#(the 1.00 scores below arise because the unpruned tree is evaluated on its own training data)
from sklearn.metrics import classification_report
print(classification_report(iris.target, ct.predict(iris.data)))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150

# Presentation of classification results: confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(iris.target, ct.predict(iris.data), labels = [2,1,0])#Customized category order output confusion matrix
cm
array([[50,  0,  0],
       [ 0, 50,  0],
       [ 0,  0, 50]], dtype=int64)
#Displayed in the form of heat map
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(cm, cmap = sns.color_palette("Blues"), annot = True)
<AxesSubplot:>


2. Drawing a decision tree, using the iris dataset as an example

#Import iris data set
from sklearn.datasets import load_iris
iris = load_iris()
import numpy as np
X, y = iris.data, iris.target
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X, y = X[indices], y[indices]
#Split iris data into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.3)
#Create decision tree
from sklearn.tree import DecisionTreeClassifier
rt = DecisionTreeClassifier()#instantiation
rt.fit(iris.data, iris.target)
DecisionTreeClassifier()
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
import graphviz
reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv = 10)
scores
array([0.86316316, 0.87764635, 0.90032253, 0.89369341, 0.94963924,
       0.96141896, 0.93654241, 0.93546444, 0.88819228, 0.95601217])
scores.mean(), scores.std()
(0.916209493818017, 0.03372720008929205)
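Note that the scores above cross-validate a linear regression on the class labels; to cross-validate the decision tree classifier itself (mean accuracy per fold), one could write:
tree_scores = cross_val_score(rt, X, y, cv = 10)#accuracy of the classifier per fold
print(tree_scores.mean(), tree_scores.std())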
rt.max_features_
4
rt.feature_importances_
array([0. , 0.01333333, 0.56405596, 0.42261071])
rt.predict(iris.data)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
from sklearn.metrics import classification_report
print(classification_report(iris.target, rt.predict(iris.data)))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150

from sklearn.tree import export_graphviz
dot_data = export_graphviz(rt,
               feature_names = iris.feature_names,
               class_names = iris.target_names)
graph = graphviz.Source(dot_data)
graph
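If graphviz is not available, sklearn.tree.plot_tree (scikit-learn >= 0.21) renders the same tree with matplotlib; a minimal alternative sketch:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize = (12, 8))
plot_tree(rt,
          feature_names = iris.feature_names,
          class_names = iris.target_names,
          filled = True)#color nodes by majority class
plt.show()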

The following is the decision tree that was drawn: