Scikit-learn – RunModule

Scikit-learn is a Python library for traditional machine learning tasks such as regression, classification, and clustering.

It is FOSS under a BSD license.

scikit-learn official website

scikit-learn code repository

History

It was originally developed by David Cournapeau in 2007. It is maintained by a team of researchers at the French Institute for Research in Computer Science and Automation (Inria).

It is bundled in packages such as Mambaforge and Anaconda. It can be installed using package managers such as pip and conda.

Concepts

A feature is a column contained in the training dataset.

An estimator is any object that can learn from data. Estimators are created untrained and become trained estimators after calling the common .fit(X, y) method.

Supervised algorithms require both X and y arguments (features and target labels), while unsupervised algorithms only require X (features).
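
As a minimal sketch of the difference between the two kinds of fit() call, the example below trains one supervised and one unsupervised estimator. The toy data and the choice of LogisticRegression and KMeans are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data: four samples, two features (values chosen only for illustration).
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])

# Supervised: fit() receives both the features and the target labels.
clf = LogisticRegression()
clf.fit(X, y)

# Unsupervised: fit() receives only the features.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)

print(clf.classes_)      # the labels seen during training
print(km.labels_.shape)  # one cluster label per sample
```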

Main types of estimators:

Transformer
Predictor

A transformer transforms data through its transform() method. Examples of transformer operations are scaling and encoding.
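
A short sketch of a scaling transformer, assuming StandardScaler as the example and a made-up single-column dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature with three samples (illustrative values).
X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
# fit_transform() learns the mean and standard deviation, then applies the scaling.
X_scaled = scaler.fit_transform(X)

# After scaling, the column has zero mean and unit standard deviation.
print(X_scaled.mean(), X_scaled.std())
```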

A predictor makes predictions for new data. It has a predict() method that returns the estimated class (for a classifier) or value (for a regressor). Classifiers often also have a predict_proba() method that returns the probability estimate (a value between 0 and 1) for each class.
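
The following sketch shows predict() and predict_proba() side by side. The classifier (LogisticRegression) and the data are assumptions chosen for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative one-feature dataset with two classes.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)

X_new = np.array([[0.5], [2.5]])
pred = model.predict(X_new)         # one estimated class per sample
proba = model.predict_proba(X_new)  # one probability per class, per sample

print(pred)
print(proba)  # each row sums to 1
```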

The Pipeline object chains estimators so that the output of each step becomes the input of the next.

The ColumnTransformer applies different transformers to different columns and concatenates the results.
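
As a rough sketch of how the two objects combine, the example below scales two numeric columns, one-hot encodes a third, and feeds the result to a classifier. The data, the column indices, and the step names ("scale", "onehot", "preprocess", "model") are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Columns 0 and 1 are numeric; column 2 holds an integer-coded category.
X = np.array([
    [1.0, 10.0, 0],
    [2.0, 20.0, 1],
    [3.0, 30.0, 2],
    [4.0, 40.0, 0],
    [5.0, 50.0, 1],
    [6.0, 60.0, 2],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Each transformer is applied only to the columns listed for it.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), [0, 1]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), [2]),
])

# The pipeline runs the preprocessing first, then the final predictor.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)

X_t = pipe.named_steps["preprocess"].transform(X)
print(X_t.shape)  # 2 scaled columns + 3 one-hot columns
```

Calling pipe.predict(X) applies the same fitted preprocessing before the model makes its predictions.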

A dense matrix is a matrix that contains meaningful (typically non-zero) data in most of its cells.

A dense matrix in Scikit-learn is stored as a NumPy ndarray.

A sparse matrix is one that contains meaningful data in only a small number of cells; the remaining cells are zeros.

A sparse matrix in Scikit-learn is stored using the sparse matrix objects provided by SciPy (scipy.sparse).

Some estimators return a dense matrix while others return a sparse matrix.

One-hot encoding represents each category as its own binary (0/1) column.

For example, if a column contains the values red/green/blue, a one-hot encoder creates one column per category (3 in total) and assigns 1 in the column that matches the row's category and 0 in the others.

The OneHotEncoder returns a sparse matrix by default.
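
The red/green/blue example can be sketched as follows. The sample values are illustrative, and toarray() is used only to display the sparse result as a dense array.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# A single categorical column with four samples.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors)

# With default settings the result is a SciPy sparse matrix.
print(sparse.issparse(encoded))

# Categories are stored in sorted order; each one becomes a column.
categories = encoder.categories_[0]
print(categories)

# Convert to a dense NumPy array to inspect the 0/1 values.
dense = encoded.toarray()
print(dense)
```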

Evaluation

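Evaluation can be done with a single train/test split, using train_test_split(), or with k-fold cross-validation.

K-fold cross-validation splits the data into k folds (divisions), trains on k-1 of them, and evaluates on the remaining one, repeating the process so that each fold is used once for evaluation. It uses the function cross_val_score(), whose cv parameter sets the number of folds (5 by default; 10 is another common choice).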
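
A minimal sketch of both evaluation approaches, assuming the bundled iris dataset and a LogisticRegression model as stand-ins; the test_size, random_state, and max_iter values are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold-out evaluation: train on one part of the data, score on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
holdout_score = model.score(X_test, y_test)
print(holdout_score)

# 10-fold cross-validation: one score per fold.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(cv_scores.mean())
```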

Adjusting hyperparameters

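The class GridSearchCV is used to adjust hyperparameters by evaluating every combination in a parameter grid with cross-validation.

RandomizedSearchCV is also used to adjust hyperparameters, sampling a fixed number of combinations instead of trying them all. It is useful when there are many possible combinations or continuous variables.

The halving variants of these classes (HalvingGridSearchCV and HalvingRandomSearchCV) aim to use resources more efficiently.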
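
The sketch below contrasts a grid search with a randomized search. The SVC estimator, the iris dataset, and the parameter ranges are assumptions picked for illustration; the halving variants are not shown.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: every combination is tried (3 values of C x 2 kernels = 6).
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)

# Randomized search: n_iter combinations are sampled from continuous distributions.
rand = RandomizedSearchCV(
    SVC(),
    param_distributions={
        "C": loguniform(1e-2, 1e2),
        "gamma": loguniform(1e-3, 1e1),
    },
    n_iter=10,
    cv=5,
    random_state=0,
)
rand.fit(X, y)
print(rand.best_params_)
```

After fitting, best_estimator_ on either object holds a model refitted on the whole dataset with the best parameters found.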

Learning

Resources:

Scikit-learn user guide [online]. Available at: https://scikit-learn.org/stable/user_guide.html
Scikit-learn [online]. Wikipedia. Available at: https://en.wikipedia.org/wiki/Scikit-learn

Related entries

Machine learning frameworks