Sklearn Tutorial

What is Sklearn?

Scikit-learn is an open-source Python package for implementing machine learning models. The library supports modern algorithms such as KNN, random forest, gradient boosting, and SVC, and it is built on top of NumPy and SciPy. Scikit-learn is frequently used both by well-known software companies and in Kaggle competitions. It aids in various stages of model building, such as model selection (including parameter selection), regression, classification, clustering, and dimensionality reduction.

Scikit-learn is simple to work with and delivers good performance. Scikit-learn, however, offers only limited support for parallel processing. We can implement simple neural networks in sklearn, but it is not a wise choice for deep learning, especially when TensorFlow is an available option.

Installation of Sklearn on our System

Before installing sklearn, we first need to install the following libraries, which it depends on:

  1. NumPy
  2. SciPy

Before installing the sklearn library, verify that NumPy and SciPy are already installed on the computer. Once NumPy and SciPy have been installed correctly, the easiest way to install scikit-learn is with pip:
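
For example (the exact commands may vary with your Python environment):

pip install numpy
pip install scipy
pip install -U scikit-learn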

Importing the Dataset

The Iris Plants Dataset is the one we'll be using in this sklearn tutorial, as discussed previously. We do not need to fetch this dataset from an external server because scikit-learn already includes it. We will import the dataset directly, but first we must import the Scikit-learn and Pandas libraries using the commands below:

Code
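
A minimal sketch of the imports, assuming the conventional pandas alias:

# Import the libraries used throughout this tutorial
import sklearn
import pandas as pd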

After importing sklearn, we can quickly load the iris plants dataset from sklearn with the following command:

Code
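
A minimal sketch of loading the bundled dataset; the variable names X and y are our own choice:

from sklearn.datasets import load_iris

# Load the iris plants dataset bundled with scikit-learn
iris = load_iris()
X = iris.data      # feature matrix
y = iris.target    # class labels (0, 1, 2)
print(iris.feature_names)
print(iris.target_names)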

Splitting the Dataset

We can divide the complete dataset into two parts, a training dataset and a testing dataset, so that some unseen data is set aside for checking the model's accuracy. Once the model has been trained on the training set, the testing dataset is used to test or validate it. We can then assess how well the trained model performs.

This example will divide the data into a 70:30 ratio, meaning that 70% of the data will be used for training the model, and 30% will be used for testing the model. The dataset used in the example is the same as above.

Code
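
A sketch of a 70:30 split with train_test_split; the random_state value is an assumption, and the exact shapes in the output below depend on how the feature columns were prepared:

from sklearn.model_selection import train_test_split

# Reserve 30% of the samples as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1  # random_state chosen arbitrarily
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)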

Output:

(105, 3)
(45, 3)
(105,)
(45,)

Train the Model

We can then train a prediction model on our dataset. As previously mentioned, scikit-learn offers an extensive collection of modern machine learning algorithms with a standardised interface for fitting models, making predictions, and computing metrics such as accuracy and recall.

We'll use the KNN (k-nearest neighbours) classifier in the example we were working on. The KNN classifier assigns each sample to a class based on the classes of its most similar neighbours in the training data. We will see how to implement this machine learning algorithm in the code below.

Code
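
A sketch of fitting and evaluating a KNN classifier; the value of n_neighbors and the two sample measurements passed to predict() are illustrative assumptions, so the exact numbers may differ from the output below:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Fit a k-nearest-neighbours classifier on the training split
knn = KNeighborsClassifier(n_neighbors=3)  # n_neighbors chosen for illustration
knn.fit(X_train, y_train)

# Evaluate on the held-out test split
y_pred = knn.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Predict the species of two made-up flower measurements
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = knn.predict(sample)
print("Predictions:", [iris.target_names[p] for p in preds])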

Output:

Accuracy: 0.9777777777777777
Predictions: ['versicolor', 'setosa']

Linear Modelling

Sklearn provides the following linear models for regression and classification analysis; a short usage sketch follows the list.

  1. Linear Regression: Studies the association between a dependent variable (Y) and a given set of independent variables (X) using one of the best-known statistical models.
  2. Logistic Regression: Contrary to what its name suggests, logistic regression is a classification algorithm. It estimates discrete values (0 or 1, yes/no, true/false) from a set of independent variables.
  3. Ridge Regression: Ridge regression, or Tikhonov regularisation, is the regularisation method that carries out L2 regularisation. It alters the loss function by adding a penalty (shrinkage amount) equal to the square of the magnitude of the coefficients.
  4. Bayesian Ridge Regression: By using probability distributions rather than point estimates when formulating linear regression, Bayesian regression provides a natural way to cope with insufficient or unevenly distributed data.
  5. LASSO: LASSO is the regularisation method that carries out L1 regularisation. It alters the loss function by adding a penalty (shrinkage amount) equal to the sum of the absolute values of the coefficients.
  6. Multi-task LASSO: It allows numerous regression problems to be fitted jointly while requiring that the features chosen for each regression problem, also known as a task, be the same. Sklearn offers a linear model called MultiTaskLasso, trained with a mixed L1/L2 norm for regularisation, that estimates sparse coefficients for multiple regression problems simultaneously.
  7. Elastic-Net: The Elastic-Net regularised regression method linearly combines the L1 and L2 penalties of the Lasso and Ridge regression methods. It is useful when there are several correlated features.
  8. Multi-task Elastic-Net: It is an Elastic-Net model that allows multiple regression problems to be fitted jointly, enforcing the selected features to be the same for all the regression problems, also called tasks.

Clustering Methods

Clustering is one of the best unsupervised ML techniques for finding patterns of similarity and relationships among data samples. It then divides those samples into groups based on similar features. Clustering is significant because it determines the intrinsic grouping of the available unlabeled data.

The sklearn.cluster module, a component of the Scikit-learn package, is used to cluster unlabeled data. Scikit-learn offers the following clustering techniques under this module:

KMeans

This algorithm computes the centroids and iterates until it finds the optimal centroids. Because it requires the number of clusters to be specified, it assumes that this number is already known. The fundamental idea behind the approach is to cluster the data by splitting the samples into n groups of equal variance while minimising the inertia criterion. K-Means clustering is performed with the KMeans class of sklearn.cluster, which computes the cluster centres and the inertia value; its sample_weight parameter allows some samples to be given additional weight.

Code
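
A sketch of K-Means clustering; the random two-dimensional data, the 50 samples, and n_clusters=10 are assumptions inferred from the output below, so the exact labels will differ from run to run:

import numpy as np
from sklearn.cluster import KMeans

# 50 random two-dimensional points to cluster (assumed data)
X = np.random.rand(50, 2)

# Group the points into 10 clusters
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("The number of clusters are: ", labels)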

Output:

The number of clusters are:  [6 0 6 2 0 8 8 5 6 2 8 6 0 6 0 5 3 5 2 2 8 8 2 7 2 6 8 2 4 3 2 4 1 4 4 9 3
 2 5 6 5 8 6 9 1 6 2 8 0 1]

Spectral Clustering

Before clustering, this approach essentially performs dimensionality reduction into fewer dimensions by using the eigenvalues, or spectrum, of the similarity matrix of the data. This approach is not advisable when there are many clusters.

Code
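
A comparable sketch with SpectralClustering; again the random data and n_clusters=10 are assumptions, so the labels will not match the output below exactly:

import numpy as np
from sklearn.cluster import SpectralClustering

# 50 random two-dimensional points to cluster (assumed data)
X = np.random.rand(50, 2)

# Spectral clustering into 10 clusters
model = SpectralClustering(n_clusters=10, random_state=0)
labels = model.fit_predict(X)

print("The number of clusters are: ", labels)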

Output:

The number of clusters are:  [0 2 0 8 4 3 6 4 9 1 3 0 4 6 2 8 5 4 7 1 7 6 9 5 2 8 3 9 1 3 9 5 0 5 4 5 1
 5 8 1 7 3 6 5 0 6 1 3 6 8]

Hierarchical Clustering

By successively merging or splitting clusters, this algorithm creates nested clusters. The cluster hierarchy is represented as a dendrogram (a tree), and the algorithms fall into the two groups listed below.

Agglomerative hierarchical algorithms: In this type of hierarchical algorithm, every data point starts as its own cluster. Clusters are then merged one after another, following a bottom-up approach.

Divisive hierarchical algorithms: In this hierarchical approach, all data points start as one big cluster. The clustering procedure then splits this single big cluster into numerous smaller clusters using a top-down approach.

Code
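
A sketch of bottom-up (agglomerative) hierarchical clustering; the random data and n_clusters=10 are assumptions, so the labels will differ from the output below:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# 50 random two-dimensional points to cluster (assumed data)
X = np.random.rand(50, 2)

# Merge points bottom-up into 10 clusters
model = AgglomerativeClustering(n_clusters=10)
labels = model.fit_predict(X)

print("The number of clusters are: ", labels)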

Output:

The number of clusters are:  [3 6 3 5 6 0 0 1 3 5 0 2 6 3 6 1 4 1 5 6 0 0 5 9 5 2 0 5 6 4 5 0 8 7 6 7 4
 5 1 3 1 0 2 7 8 3 0 0 3 2]

Decision Tree Algorithm

In a decision tree, which resembles a flowchart, a node represents a feature (or attribute), a branch represents a decision rule, and every leaf node represents the outcome. The root node is the topmost node of the tree. The model learns to partition the data according to attribute values, and the tree is split repeatedly in a process called recursive partitioning. This flowchart-like structure aids in decision-making and closely mirrors human reasoning, which is why decision trees are simple to grasp and interpret.

Code
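
A sketch of a decision tree evaluated on the iris data; the use of 10-fold cross-validation is an assumption inferred from the ten scores in the output below:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# Evaluate a decision tree with 10-fold cross-validation (assumed fold count)
tree = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(tree, X, y, cv=10)

print("Accuracy scores: ", scores)
print("Mean accuracy score: ", scores.mean())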

Output:

Accuracy scores:  [1.         0.93333333 1.         0.93333333 0.93333333 0.86666667
 0.93333333 1.         1.         1.        ]
Mean accuracy score:  0.96

Gradient Boosting

We can use a gradient boosting method for both regression and classification problems. It builds a predictive model from many weaker prediction models, typically decision trees.

The Gradient Boosting Classifier requires a loss function to work. Gradient boosting classifiers can use many standard loss functions as well as custom ones, but the loss function must be differentiable.

Squared errors may be used in regression techniques, although logarithmic loss is typically used in classification algorithms. In gradient boosting systems, we don't need to explicitly derive a loss function for each incremental boosting step; instead, we can use any differentiable loss function.

Code
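
A sketch of a gradient boosting classifier; the original example's dataset and hyperparameters are not shown, so the digits dataset and the settings below are assumptions and the accuracy will not match the output exactly:

from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumed dataset: the handwritten digits bundled with scikit-learn
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Boosted ensemble of shallow decision trees (assumed hyperparameters)
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gbc.fit(X_train, y_train)

print("Accuracy scores: ", accuracy_score(y_test, gbc.predict(X_test)))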

Output:

Accuracy scores:  0.9185416666666667

Dimensionality Reduction using PCA in Sklearn

Exact PCA

Principal Component Analysis (PCA) performs linear dimensionality reduction using the Singular Value Decomposition (SVD) of the data to project it onto a lower-dimensional feature space. Before the SVD is applied during PCA reduction, the input data is centred but not scaled for each feature.

The sklearn.decomposition module is part of the Scikit-learn ML toolkit.

The PCA class, used as a transformer object, learns n components in its fit() method. It can also be used to project new data onto these components.

Code
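
A sketch consistent with the output below, using the breast cancer dataset and three principal components; the standardisation step is an assumption inferred from the explained variance ratios shown:

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset bundled with scikit-learn
cancer = load_breast_cancer()
print(cancer.keys())
print(cancer.target_names)
print(cancer.feature_names)

# Standardise the features (assumed step; PCA itself only centres the data)
scaled = StandardScaler().fit_transform(cancer.data)

# Reduce the 30 features to 3 principal components
pca = PCA(n_components=3)
data_pca = pca.fit_transform(scaled)

print("Shape of data after PCA: ", data_pca.shape)
print("Components: ", pca.components_)
print("Explained variance ratio: ", pca.explained_variance_ratio_)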

Output:

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
['malignant' 'benign']
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Shape of data after PCA:  (569, 3)
Components:  [[ 0.21890244  0.10372458  0.22753729  0.22099499  0.14258969  0.23928535
   0.25840048  0.26085376  0.13816696  0.06436335  0.20597878  0.01742803
   0.21132592  0.20286964  0.01453145  0.17039345  0.15358979  0.1834174
   0.04249842  0.10256832  0.22799663  0.10446933  0.23663968  0.22487053
   0.12795256  0.21009588  0.22876753  0.25088597  0.12290456  0.13178394]
 [-0.23385713 -0.05970609 -0.21518136 -0.23107671  0.18611304  0.15189161
   0.06016537 -0.03476751  0.19034877  0.36657546 -0.10555215  0.08997968
  -0.08945724 -0.15229262  0.20443045  0.23271591  0.1972073   0.13032154
   0.183848    0.28009203 -0.21986638 -0.0454673  -0.19987843 -0.21935186
   0.17230436  0.14359318  0.09796412 -0.00825725  0.14188335  0.27533946]
 [-0.00853123  0.0645499  -0.00931421  0.02869954 -0.10429182 -0.07409158
   0.00273384 -0.02556359 -0.04023992 -0.02257415  0.26848138  0.37463367
   0.26664534  0.21600656  0.30883896  0.15477979  0.17646382  0.22465746
   0.28858428  0.21150377 -0.04750699 -0.04229782 -0.04854651 -0.01190231
  -0.25979759 -0.2360756  -0.1730573  -0.17034416 -0.27131265 -0.23279135]]
Explained variance ratio:  [0.44272026 0.18971182 0.09393163]

Incremental PCA

Principal Component Analysis (PCA) only supports batch processing, which means that all of the data to be analysed must fit in memory. Incremental Principal Component Analysis (IPCA) is used to overcome this constraint.

The IncrementalPCA class of the sklearn.decomposition module makes it possible to use np.memmap, a memory-mapped file, so that the complete file does not have to be loaded into RAM, to call its partial_fit method on sequentially fetched chunks of data, or both.

As with PCA, the input data is centred but not scaled for each feature before the SVD is applied when decomposing data with IPCA.

Example

The example below uses the sklearn digits dataset to demonstrate the sklearn.decomposition.IncrementalPCA module.

Code
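
A sketch consistent with the output below: IncrementalPCA with 15 components and a batch size of 200 (the batch size follows the note after the output):

from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

# Load the digits dataset bundled with scikit-learn
digits = load_digits()
print(digits.keys())
print("Shape of the dataset before decomposition: ", digits.data.shape)

# Reduce the 64 pixel features to 15 components, processing 200 samples per batch
ipca = IncrementalPCA(n_components=15, batch_size=200)
data_ipca = ipca.fit_transform(digits.data)

print("Shape of the dataset after decomposition: ", data_ipca.shape)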

Output:

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
Shape of the dataset before decomposition:  (1797, 64)
Shape of the dataset after decomposition:  (1797, 15)

In this case, we can either pass the whole dataset to the fit() method, which splits it into batches internally, or partially fit the model on smaller batches of data ourselves (as we did with 200 samples per batch).

Kernel PCA

Kernel Principal Component Analysis, a modification of PCA, achieves non-linear dimensionality reduction using kernels. It supports both the transform() and inverse_transform() methods.

We can use the KernelPCA class of the sklearn.decomposition module.

Example

We will use the digits dataset of sklearn to show the use of KernelPCA. The kernel we are using is sigmoid.

Code
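
A sketch consistent with the output below: KernelPCA with a sigmoid kernel reducing the digits data to 15 components:

from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA

# Load the digits dataset bundled with scikit-learn
digits = load_digits()
print(digits.keys())
print("Shape of the dataset before decomposition: ", digits.data.shape)

# Non-linear dimensionality reduction to 15 components with a sigmoid kernel
kpca = KernelPCA(n_components=15, kernel='sigmoid')
data_kpca = kpca.fit_transform(digits.data)

print("Shape of the dataset after decomposition: ", data_kpca.shape)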

Output:

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
Shape of the dataset before decomposition:  (1797, 64)
Shape of the dataset after decomposition:  (1797, 15)

PCA using Randomized SVD

Principal Component Analysis (PCA) with randomised SVD projects the data onto a lower-dimensional feature space while preserving most of the variance, by dropping the singular vectors associated with the smaller singular values. The sklearn.decomposition.PCA class with the additional svd_solver = 'randomized' argument is very helpful in this situation.

Example

The example below will use the sklearn.decomposition.PCA class with the optional svd_solver = 'randomized' parameter to extract the top 10 principal components from sklearn's breast cancer dataset.

Code
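
A sketch of PCA with the randomised SVD solver; the preprocessing used in the original example is not shown, so the standardisation step is an assumption and the variance ratios may not match the output exactly:

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset bundled with scikit-learn
cancer = load_breast_cancer()
print("Shape of the dataset before decomposition: ", cancer.data.shape)

# Standardise the features (assumed; the original preprocessing is not shown)
scaled = StandardScaler().fit_transform(cancer.data)

# Extract the top 10 principal components using the randomised SVD solver
pca = PCA(n_components=10, svd_solver='randomized', random_state=0)
data_pca = pca.fit_transform(scaled)

print("Shape of data after PCA: ", data_pca.shape)
print("Explained variance ratio: ", pca.explained_variance_ratio_)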

Output:

Shape of the dataset before decomposition:  (569, 30)
Shape of data after PCA:  (569, 10)
Explained variance ratio:  [0.45067848 0.18239963 0.09159257 0.06781847 0.05626861 0.04135939
 0.01989181 0.01637191 0.01397121 0.01209004]
