Painless Random Forest Regression in Python – Step-by-Step with Sklearn


This tutorial walks through, step by step, how to use the Random Forest package in Sklearn (scikit-learn) to build a regression model in Python.

1. Random Forest Regression – An effective Predictive Analysis

Random Forest Regression is a bagging technique in which multiple decision trees are trained in parallel, without interacting with each other. It is an ensemble algorithm. Ensembles combine several models, of the same or of different kinds, and a random forest combines decision trees to solve a regression problem. Each tree is trained on a random sample of the data set, and the outputs of the trees are averaged to arrive at the final result.
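To make the averaging concrete, here is a minimal sketch on synthetic data. The data and the parameter values are assumptions made up for illustration and are not part of this tutorial's dataset. It checks that the forest's prediction equals the mean of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Each fitted decision tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the forest's own prediction with the average over its trees
forest_prediction = forest.predict(X[:5])
print(np.allclose(manual_average, forest_prediction))
```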

2. What is Regression?

Regression is a supervised machine learning task in which a model predicts a continuous target value from independent variables. It is used for finding relationships between variables and for forecasting. So, given data on predictor variables (inputs, X) and a continuous response variable (output, Y), we build a model for:

  • Predicting the value of the response from the predictors
  • Understanding the relationship between the predictors and the response

For example: predicting the systolic blood pressure of a person based on their age, height, weight, etc.

3. Regression Examples:

Y: crop yield

X: rainfall, temperature, humidity, etc.

Y: income

X: age, education, sex, occupation, etc.

Y: selling price of homes

X: size, age, location, quality, etc.

Y: test scores

X: teaching method, age, sex, ability, etc.

4. Advantages & Disadvantages of Random Forest Regression

Random Forest Regression is a widely used machine learning algorithm that often gives accurate predictions for regression problems, and it trains reasonably fast because the trees can be built in parallel. It works on the principle that a number of weak estimators, when combined, form a strong estimator. It has drawbacks, though. Its predictions are not smoothly continuous: each tree outputs the average of the training targets that fall in a leaf, so the forest's output is a step-like function of the inputs. For the same reason, it can't predict beyond the range of target values seen in the training data. It may also overfit data sets that are particularly noisy.
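The range limitation is easy to see on a toy problem. The sketch below assumes a made-up relationship y = 2x observed only for x between 0 and 10, and then asks the forest for a prediction at x = 20. Every name and value here is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (an assumption for illustration): y = 2x on the range 0..10
X_train = np.linspace(0, 10, 101).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# x = 20 lies far outside the training range; the true value would be 40
outside_prediction = forest.predict(np.array([[20.0]]))[0]

print("largest training target:", y_train.max())
print("prediction at x = 20:", outside_prediction)
```

The prediction can't exceed the largest training target, because every leaf value is an average of training targets and the forest averages those leaf values again.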

5. Working model of Random Forest Regression

It works in the following steps:

  1. Consider a training dataset D that has N rows and M features.
  2. Pick a random sample of rows and a random sample of features from the dataset. This is termed row sampling (RS) and feature sampling (FS). Together, the two samples form a data set DS.
  3. Give each sample to a decision tree: construct and train one decision tree per sample, and obtain a prediction from each tree.
  4. Aggregate the results of all the trees by taking either the mean or the median of their outputs, depending on the distribution of the outputs. The final prediction is this average over the trees' results.
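The steps above can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions, not how scikit-learn implements its forest: the function names `fit_bagged_trees` and `predict_bagged`, the synthetic data, and the parameter values are all made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=20, n_features=2, seed=0):
    """Fit one decision tree per random sample of rows and features."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    models = []
    for _ in range(n_trees):
        # Row sampling (RS): draw rows with replacement
        rows = rng.integers(0, n_rows, size=n_rows)
        # Feature sampling (FS): draw a subset of columns without replacement
        cols = rng.choice(n_cols, size=n_features, replace=False)
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X[rows][:, cols], y[rows])
        models.append((tree, cols))
    return models

def predict_bagged(models, X):
    """Aggregate by taking the mean of the trees' predictions."""
    predictions = [tree.predict(X[:, cols]) for tree, cols in models]
    return np.mean(predictions, axis=0)

# Synthetic data, made up for illustration only
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.05, size=300)

models = fit_bagged_trees(X, y)
y_hat = predict_bagged(models, X[:3])
print(y_hat)
```

One difference worth knowing: scikit-learn's RandomForestRegressor draws the candidate features at every split (the `max_features` parameter) rather than once per tree as in this sketch.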

A single, fully grown decision tree tends to have low bias but high variance. With Random Forest Regression, the outputs of multiple decision trees are averaged, so the results have lower variance and are usually more accurate. But as mentioned, it still can't predict beyond the range of the training data, and its predictions are step-like rather than smoothly continuous.

5. Random Forest Explained

If you want to learn more about how the Random Forest algorithm works, I would recommend this great YouTube video. This tutorial focuses on the Python code needed to run it.

6. Case Study

In this tutorial, we use a data set from the UCI Machine Learning Repository: the Real estate valuation data set, which contains 414 instances. We will use it to predict house prices.

7. Sklearn Random Forest Regression

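First, we are going to use the sklearn Random Forest package to train a Random Forest regressor. Below are all the important modules and variables needed to start.

Please also make sure you have installed xlrd (to read Excel files), seaborn (to plot the data), and sklearn (to run the regression model). Note that xlrd 2.0 and later read only .xls files, so reading the .xlsx file used here requires an older xlrd release (1.2.0 or earlier).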

import xlrd

import numpy as np
import seaborn
import matplotlib.pyplot as matplotlib

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from matplotlib.lines import Line2D
from scipy.stats import pearsonr

# set seed to make results reproducible
RF_SEED = 30

Next, we want to parse our input data, which in this case is an XLSX file. The function below does the job by creating 3 lists: 1) the labels (house price) for each record, 2) the raw data to train the model, and 3) the feature names.

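def load_input(excel_file):
    y_prediction = []
    data = []
    feature_names = []

    wb = xlrd.open_workbook(excel_file)
    sheet = wb.sheet_by_index(0)

    # row 0 is the header, rows 1..414 are the records
    for index_row in range(sheet.nrows):
        row = sheet.row_values(index_row)
        row = row[1:]  # drop the first column (record number)

        if index_row == 0:
            feature_names = row
        else:
            # keep only the part of the transaction date before the decimal point
            row[0] = str(row[0]).split(".")[0]
            data.append([float(x) for x in row[:-1]])
            # the last column is the label (house price)
            y_prediction.append(float(row[-1]))

    return y_prediction, data, feature_names[:-1]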
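If xlrd is not an option, the same three lists can be built from a pandas DataFrame. The sketch below is an alternative under stated assumptions, not the loader used in this tutorial: the helper name `split_features_labels` is hypothetical, the tiny table is made up, and the layout (record number first, price last) is assumed from the description above. Unlike `load_input`, it leaves the transaction date untouched.

```python
import pandas as pd

def split_features_labels(frame):
    """Return (labels, data, feature_names) for a table whose first column
    is a record number and whose last column is the label."""
    frame = frame.iloc[:, 1:]                  # drop the record-number column
    feature_names = list(frame.columns[:-1])   # every remaining column but the last
    data = frame.iloc[:, :-1].astype(float).values.tolist()
    labels = frame.iloc[:, -1].astype(float).tolist()
    return labels, data, feature_names

# Made-up table standing in for the real file. With the real data you would
# load it first, e.g. pd.read_excel("regression_dataset.xlsx"), which needs
# an Excel engine such as openpyxl installed.
example = pd.DataFrame({
    "No": [1, 2],
    "X2 house age": [32.0, 19.5],
    "X3 distance to the nearest MRT station": [84.9, 306.6],
    "Y house price of unit area": [37.9, 42.2],
})

labels, data, feature_names = split_features_labels(example)
print(feature_names)
print(data)
print(labels)
```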

Now it is time to split the data into a training set and a testing set. Here we use 80% of the data to train and 20% to test. The function below also trains the random forest with 1000 trees, using all the processors available on your machine.

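def split_data_train_model(labels, data):
    # 20% examples in test data
    train, test, train_labels, test_labels = train_test_split(
        data, labels, test_size=0.2, random_state=RF_SEED)

    # fit 1000 trees on the training split only; n_jobs=-1 uses all processors
    regressor = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=RF_SEED)
    regressor.fit(train, train_labels)

    return test, test_labels, regressor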

The lines below will read the data, train and test the model.

y_data, x_data, feature_names = load_input("regression_dataset.xlsx")
x_test, x_test_labels, regressor = split_data_train_model(y_data, x_data)

predictions = regressor.predict(x_test)

The variable predictions should now hold the predictions for the test data. Below we plot the predictions against the real answers using the scatter plot function from this tutorial.

# find the correlation between real answer and prediction
correlation = round(pearsonr(predictions, x_test_labels)[0], 5)

output_filename = "rf_regression.png"
title_name = "Random Forest Regression - Real House Price vs Predicted House Price - correlation ({})".format(correlation)
x_axis_label = "Real House Price"
y_axis_label = "Predicted House Price"

# plot data
simple_scatter_plot(x_test_labels, predictions, output_filename, title_name, x_axis_label, y_axis_label)

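As you can see below, the predicted and real house prices correlate very well (0.973), which suggests the model works well. Keep in mind that this figure only reflects performance on unseen data if the model was fitted on the training split alone.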
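Correlation only measures linear association, so predictions that are consistently off by a constant amount can still correlate perfectly with the real values. It can help to report an error metric next to it. The sketch below uses scikit-learn's `mean_absolute_error` and `r2_score` on synthetic data from `make_regression`, which is a stand-in assumed for illustration and not the house price data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house price data (illustration only)
X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30)

model = RandomForestRegressor(n_estimators=100, random_state=30)
model.fit(X_train, y_train)  # fit on the training split only
y_pred = model.predict(X_test)

# Mean absolute error is in the units of the target; R^2 is at most 1
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MAE:", round(mae, 3))
print("R^2:", round(r2, 3))
```

With the tutorial's own variables, the same two calls would take `x_test_labels` and `predictions` as arguments.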

8. Feature Importance 

Furthermore, using the code below you can find out the importance of each feature in the model.

features_importance = regressor.feature_importances_

print("Feature ranking:")
for i, data_class in enumerate(feature_names):
    print("{}. {} ({})".format(i + 1, data_class, features_importance[i]))

As you can see, the distance to the nearest MRT station and the house age are the most important features for predicting house price in this dataset.

Feature ranking:
1. X1 transaction date (0.01114107463647354)
2. X2 house age (0.18871497068490514)
3. X3 distance to the nearest MRT station (0.5954556646115858)
4. X4 number of convenience stores (0.023544042688038697)
5. X5 latitude (0.10097015953299018)
6. X6 longitude (0.08017408784600653)

In summary, hopefully you now understand how random forest works and can build a regression model to predict values for your dataset and figure out which features matter most for the prediction.

In case your target is discrete (you have classes), you can use a random forest classifier instead – here is a tutorial.
