This tutorial demonstrates how to use the Random Forest implementation in Sklearn (scikit-learn, a Python library) to create a classifier and discover feature importance.
1. Random Forest Classifiers – A Powerful Prediction Algorithm
Classification is a big part of machine learning. The Random Forest classifier is a flexible, easy-to-use algorithm that makes predictions by combining many decision trees. A Random Forest is a large set of individual decision trees operating as an ensemble: each tree outputs a class prediction, and the class with the most votes becomes the model's prediction.
2. What is Classification?
Classification is the process of assigning the observations in a data set to classes, and it can be performed on both structured and unstructured data. Given predictor variables (inputs, X) and a categorical response variable (output, Y), the goal is to build a model for:
- Predicting the value of the response from the predictors
- Understanding the relationship between the predictors and the response
Example:
Predicting 5-year survival (yes/no) of a person based on their age, height, weight, etc.
Classification Examples:
Y: loan defaults (yes/no)
X: credit score, own or rent, age, marital status, etc.
Y: land cover of grass, trees, water, roads…
X: satellite image data of frequency bands
Y: presence/absence of disease
X: diagnostic measurements
Y: dementia status
X: scores on a battery of psychological tests
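To make the X and Y notation concrete, here is a minimal sketch of the loan-default example. The six records, the three predictor columns, and the new applicant are all made up for illustration; only the general pattern (a list of feature rows X, a list of class labels Y) carries over to real data.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical predictors X: [credit_score, owns_home (1/0), age]
X = [
    [720, 1, 45],
    [580, 0, 23],
    [690, 1, 37],
    [540, 0, 31],
    [760, 1, 52],
    [600, 0, 28],
]
# Hypothetical categorical response Y: did the loan default?
Y = ["no", "yes", "no", "yes", "no", "yes"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, Y)

# Predict the class of a new, unseen (also hypothetical) applicant
new_applicant = [[700, 1, 40]]
prediction = model.predict(new_applicant)[0]
print("Predicted default:", prediction)
```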
3. Importance of Random Forest Classifiers
Classification tells us which class an observation belongs to, and classifying observations matters in many business applications. Random Forest classifiers are valuable for predictions such as whether a specific customer will buy a product, whether a loan given to a customer will default, stock portfolio forecasting, spam and ham email classification, etc.
The Random Forest classifier is among the most accurate general-purpose classification algorithms and often outperforms many alternatives on binary classification tasks. It works on the principle that a number of weak estimators, when combined, form a strong estimator.
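The "weak estimators combine into a strong one" idea can be illustrated with a little arithmetic. The sketch below assumes a binary problem and trees that err independently of each other, each correct with the same probability; real trees in a forest are correlated, so the actual gain is smaller. The function name is made up for this example.

```python
from math import comb


def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n independent voters are right."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Each voter is only right 60% of the time; odd counts avoid tied votes
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

The printed accuracy rises as more voters are added, which is the intuition behind using hundreds of trees.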
4. Working model of Random Forest Classifiers
The basic parameters required for a Random Forest classifier are the total number of trees to generate and the decision tree parameters, such as the split criterion. It works in four steps:
- Consider a master dataset D of interest that has many rows and a number of features.
- Pick random samples of rows and random samples of features from the dataset. These are termed row sampling (RS) and feature sampling (FS). Each combined sample forms a new, smaller data set.
- Give each sample to a decision tree: construct and train one decision tree per sample and obtain a prediction result from each tree.
- Aggregate the results of all trees by majority vote. The class with the most votes is the final prediction.
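The steps above can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not how Sklearn implements it internally: Sklearn samples features at every split rather than once per tree, and averages class probabilities instead of counting hard votes. The tree count and feature-subset size below are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Step 1: a master dataset D (synthetic here, for illustration only)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=0)

n_trees = 25
n_sub_features = 3
trees = []
for _ in range(n_trees):
    # Step 2: row sampling with replacement (RS) and feature sampling (FS)
    rows = rng.integers(0, len(X), size=len(X))
    cols = rng.choice(X.shape[1], size=n_sub_features, replace=False)
    # Step 3: train one decision tree on this sample
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[rows][:, cols], y[rows])
    trees.append((tree, cols))

# Step 4: majority vote across all trees (one row of votes per tree)
votes = np.array([tree.predict(X[:, cols]) for tree, cols in trees])
forest_prediction = np.array(
    [np.bincount(sample_votes).argmax() for sample_votes in votes.T]
)
print("Training accuracy of the vote:", (forest_prediction == y).mean())
```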
A single decision tree tends to have low bias and high variance. Random Forest classification aggregates multiple decision trees by majority vote, which lowers the variance and usually gives more accurate results.
5. Random Forest Explained
If you want to learn more about how the Random Forest algorithm works, I would recommend this great YouTube video. This tutorial focuses on the Python code to run it.
6. Case Study
In this tutorial, we use the training set from Partie. Robert Edwards and his team used Random Forest to classify genomic datasets into 3 classes: Amplicon, WGS, and Others. Partie uses the percent of unique kmer, 16S, phage, and Prokaryote as features – please read the paper for more details.
7. Random Forest Sklearn Classifier
First, we are going to use the Sklearn package to train a Random Forest. Below are all the important modules and variables needed to start.
import csv
from sklearn.metrics import (precision_score,
                             recall_score,
                             roc_auc_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# set seed to make results reproducible
RF_SEED = 30
Next, we want to parse our input data, which in this case is a CSV file. The function below does the job by creating 3 lists: 1) the labels (classes) for each record, 2) the raw data to train the model, and 3) the feature names. It skips the first column of every row and takes the label from the last column.
def load_input(model_data):
    # Read input file
    data = []
    labels = []
    with open(model_data) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        feature_names = next(csv_reader)[1:-1]
        for row in csv_reader:
            data.append([float(x) for x in row[1:-1]])
            labels.append(row[-1])
    return labels, data, feature_names
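To see which file layout this parsing logic expects, here is a small stand-alone sketch that applies the same slicing to an in-memory CSV. The two records, their values, and the `run_id` and `label` column names are hypothetical; the only thing taken from the code above is that the first column is skipped and the last column holds the class.

```python
import csv
import io

# Hypothetical two-record file: first column = record ID (skipped),
# last column = class label, everything in between = numeric features.
sample_csv = io.StringIO(
    "run_id,percent_unique_kmer,percent_16S,percent_phage,"
    "percent_Prokaryote,label\n"
    "run_A,12.5,30.1,0.2,5.0,Amplicon\n"
    "run_B,85.0,0.4,1.1,40.2,WGS\n"
)

reader = csv.reader(sample_csv, delimiter=",")
feature_names = next(reader)[1:-1]  # header row minus ID and label columns
data, labels = [], []
for row in reader:
    data.append([float(x) for x in row[1:-1]])  # numeric features only
    labels.append(row[-1])  # class label

print(feature_names)
print(data)
print(labels)
```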
Now it is time to split the data into a training set and a testing set. Here we use 80% of the data to train and 20% to test. The function below also trains the random forest with 1000 trees, using all the processors available on your machine.
def split_data_train_model(labels, data):
    # 20% examples in test data
    train, test, train_labels, test_labels = train_test_split(data,
                                                              labels,
                                                              test_size=0.2,
                                                              random_state=RF_SEED)
    # training data fit; n_jobs=-1 uses all available processors
    classifier = RandomForestClassifier(n_estimators=1000,
                                        n_jobs=-1,
                                        random_state=RF_SEED)
    classifier.fit(train, train_labels)
    return test, test_labels, classifier
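If your classes are imbalanced, one optional refinement (not used in the function above) is to pass `stratify` to `train_test_split` so that both sets keep the same class proportions. The sketch below uses made-up labels and placeholder features just to show the resulting sizes.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 Amplicon, 20 WGS, 10 Others
labels = ["Amplicon"] * 70 + ["WGS"] * 20 + ["Others"] * 10
data = [[float(i)] for i in range(len(labels))]  # placeholder features

train, test, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.2, random_state=30, stratify=labels
)

print(len(train), len(test))
print(Counter(test_labels))
```

Without `stratify`, a rare class can end up under-represented in the test set purely by chance.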
The lines below will read the data, train and test the model.
labels, data, feature_names = load_input("SRA_used_for_training.csv")
test, test_labels, rf_model = split_data_train_model(labels, data)
# model performance with testing data
# class prediction
rf_predictions = rf_model.predict(test)
# probability
rf_probabilities = rf_model.predict_proba(test)
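It can help to see how the two outputs relate. The sketch below uses a synthetic 3-class dataset (not the Partie data) and smaller settings so it runs quickly. It shows that `predict_proba` returns one column per class, in the order of `classes_`, and that `predict` picks the class with the highest probability. It also uses `roc_auc_score`, which is imported earlier but otherwise unused in this tutorial; for a multi-class problem it needs the probabilities and a `multi_class` setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data, for illustration only
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=30)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

# One probability column per class, ordered like model.classes_
print(model.classes_, probabilities.shape)

# The predicted class is the one with the highest probability
from_proba = model.classes_[np.argmax(probabilities, axis=1)]
print((from_proba == predictions).all())

# Multi-class ROC AUC (one-vs-rest) is computed from the probabilities
auc = roc_auc_score(y_test, probabilities, multi_class="ovr")
print("ROC AUC:", auc)
```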
Finally, now that we have a trained model, we can compute its Precision and Recall:
# calculate precision
precision = precision_score(test_labels, rf_predictions, average="weighted")
# calculate recall
recall = recall_score(test_labels, rf_predictions, average="weighted")
print("The Model Precision: {}".format(precision))
print("The Model Recall: {}".format(recall))
As you can see below, the model has high Precision and Recall.
The Model Precision: 0.966509327401938
The Model Recall: 0.9713050607438325
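If you are wondering what `average="weighted"` does, the sketch below recomputes it by hand on ten made-up records: precision is calculated per class and then averaged, with each class weighted by how many true records it has. The labels here are hypothetical and unrelated to the results above.

```python
from collections import Counter

from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted labels for ten records
y_true = ["Amplicon"] * 5 + ["WGS"] * 3 + ["Others"] * 2
y_pred = ["Amplicon", "Amplicon", "Amplicon", "Amplicon", "WGS",
          "WGS", "WGS", "Others",
          "Others", "Amplicon"]

support = Counter(y_true)  # number of true records per class
weighted_precision = 0.0
for cls in support:
    # true labels of every record that was predicted as this class
    predicted_as_cls = [t for t, p in zip(y_true, y_pred) if p == cls]
    per_class_precision = predicted_as_cls.count(cls) / len(predicted_as_cls)
    weighted_precision += per_class_precision * support[cls] / len(y_true)

sk_precision = precision_score(y_true, y_pred, average="weighted")
sk_recall = recall_score(y_true, y_pred, average="weighted")
print(weighted_precision, sk_precision, sk_recall)
```

Note that weighted recall is the same number as plain accuracy, because each class's recall is weighted by its share of the true records.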
8. Feature Importance
Furthermore, using the code below you can find the importance of each feature in the model.
features_importance = rf_model.feature_importances_
print("Feature ranking:")
for i, feature_name in enumerate(feature_names):
    print("{}. {} ({})".format(i + 1, feature_name, features_importance[i]))
As you can see, percent_unique_kmer and percent_16S are the most important features for classifying this dataset.
Feature ranking:
1. percent_unique_kmer (0.6378746124706806)
2. percent_16S (0.24147258847640057)
3. percent_phage (0.01902732095237143)
4. percent_Prokaryote (0.10162547810054742)
['percent_unique_kmer', 'percent_16S', 'percent_phage', 'percent_Prokaryote']
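The listing above follows the column order of the input file rather than the importance itself. A small sketch that sorts the same values from most to least important is shown below; the numbers are copied from the output above, and in your own script you would use `rf_model.feature_importances_` instead.

```python
feature_names = ["percent_unique_kmer", "percent_16S",
                 "percent_phage", "percent_Prokaryote"]
# Importances copied from the output above
features_importance = [0.6378746124706806, 0.24147258847640057,
                       0.01902732095237143, 0.10162547810054742]

# Pair each name with its importance and sort from highest to lowest
ranked = sorted(zip(feature_names, features_importance),
                key=lambda pair: pair[1], reverse=True)

print("Feature ranking:")
for rank, (name, importance) in enumerate(ranked, start=1):
    print("{}. {} ({:.3f})".format(rank, name, importance))
```

The importances sum to 1, so each value can be read as that feature's share of the total.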
More Resources
Here are two of my favorite books on Machine Learning in Python, in case you want to learn more about it:
- Introduction to Machine Learning with Python: A Guide for Data Scientists by Andreas C. Müller, Sarah Guido
- Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn by Sebastian Raschka, Vahid Mirjalili
Conclusion
In summary, hopefully you now understand how random forest works and can use it to classify your dataset and figure out which features are the most important for classifying your data.
In case your response is a continuous value rather than a set of discrete classes, you can use regression to build your model. Here is a tutorial on how to use random forest to do it.
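As a rough sketch of the regression variant, Sklearn's `RandomForestRegressor` follows the same fit and predict pattern but returns numbers instead of classes, and it has no `predict_proba`. The data below is synthetic and the settings are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with a continuous target, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30
)

regressor = RandomForestRegressor(n_estimators=100, random_state=30)
regressor.fit(X_train, y_train)

predicted_values = regressor.predict(X_test)
print("R^2 on the test set:", regressor.score(X_test, y_test))
```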