This short tutorial shows how to compute the R squared value in Python using sklearn, which can be helpful when judging how well a fitted line explains the data in a scatter plot.
1. What is R Squared?
R-squared is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression analysis. It is also known as the coefficient of determination.
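R-squared usually falls between 0 and 1. A value of 1 indicates that the independent variable(s) perfectly explain the variation in the dependent variable. A value of 0 indicates that the model explains none of that variation, meaning it does no better than always predicting the mean of the dependent variable. The value can also be negative when the predictions fit the data worse than that mean would, which can happen when a model is scored on data it was not fitted to. In general, a higher R-squared value indicates a stronger relationship between the independent variable(s) and the dependent variable.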
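To make the definition concrete, here is a minimal sketch that computes R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name r_squared and the sample values are illustrative assumptions, not part of any library.

```python
def r_squared(y_true, y_pred):
    """R-squared as 1 - SS_res / SS_tot (hypothetical helper for illustration).

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: variation the predictions fail to explain
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variation of the observed values around their mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative observed and predicted values
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

r2_manual = r_squared(y_true, y_pred)

# Always predicting the mean of y_true gives an R-squared of 0
r2_mean = r_squared(y_true, [sum(y_true) / len(y_true)] * len(y_true))

# Predictions that are worse than the mean give a negative R-squared
r2_bad = r_squared(y_true, [7, 2, -0.5, 3])

print(r2_manual, r2_mean, r2_bad)
```

The three cases correspond to a close fit, a fit no better than the mean, and a fit worse than the mean.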
R-squared can be used as a tool for model selection, where the goal is to choose the model with the highest R-squared value that also meets other criteria such as parsimony and interpretability. It is also used to evaluate the goodness-of-fit of a regression model, with a higher R-squared value indicating a better fit. However, it is important to note that a high R-squared value does not necessarily imply a good model, as other factors such as overfitting, omitted variables, and residual normality should also be considered.
2. R Squared in Python
First, make sure you have sklearn installed. It is distributed as the scikit-learn package, which can be installed with pip (pip install scikit-learn) or with conda.
The Python code below calculates R squared.
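from sklearn.metrics import r2_score
# define the observed values (y_true) and the predicted values (y_pred)
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
# r squared estimation; r2_score expects the observed values first, then the predictions
r_squared_value = r2_score(y_true, y_pred)
print(r_squared_value)
# 0.9486081370449679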
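For the scatter plot use case, the predicted values usually come from a fitted line. The sketch below assumes a simple linear regression fitted with sklearn's LinearRegression on made-up data (the x and y values are hypothetical). It scores the fit with r2_score and compares the result with the squared Pearson correlation, which matches R-squared for a straight-line fit with an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical scatter-plot data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# sklearn expects a 2D feature matrix, so reshape x into a single column
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)
y_hat = model.predict(X)

# Observed values go first, predictions second
r2 = r2_score(y, y_hat)

# Squared Pearson correlation between x and y, for comparison
pearson_r = np.corrcoef(x, y)[0, 1]

print(r2, pearson_r ** 2, model.score(X, y))
```

The order of the arguments matters: swapping the observed and predicted values generally gives a different number, because the total sum of squares is computed from the first argument. For regressors, model.score(X, y) returns the same R-squared value without a separate call to r2_score.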