R-squared To Evaluate A Regression Model

Evaluating a classification model is fairly straightforward. You just count how many of the classifications the model got right and how many it got wrong.

Evaluating a regression model is not that straightforward, at least from my perspective. One useful metric, reported by the majority of implementations, is R-squared.

What is R-squared?

R-squared is a goodness-of-fit measure that evaluates how well your model fits the data. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

I know the terms might be a bit overwhelming, like the majority of statistical terms, but the explanation is quite simple. R-squared is the fraction of the variation of the target values around their mean that the model can explain, often quoted as a percentage.

Consider the set of target values, given by $$y_{1},y_{2},y_{3},\ldots,y_{n}$$ Now, consider the corresponding set of predicted values $$f_{1},f_{2},f_{3},\ldots,f_{n}$$ Let \( \bar{y} \) be the mean of the target values \( y_{i} \).

The total variation in the data is given by the total sum of squares, $$SS_{tot} = \sum_{i} (y_{i}-\bar{y})^{2}$$ The variation explained by the model is given by the regression sum of squares, $$SS_{reg} = \sum_{i} (f_{i}-\bar{y})^{2}$$ Consequently, the variation left unexplained by the model is given by the residual sum of squares, $$SS_{res} = \sum_{i} (y_{i}-f_{i})^{2}$$ Hence, the definition of R-squared is as follows, $$R^{2}\equiv 1-\frac{SS_{res}}{SS_{tot}}$$

For a least-squares fit with an intercept, evaluated on the data it was fitted to, \( SS_{tot} = SS_{reg} + SS_{res} \), so the value of R-squared lies between 0 and 1. A value of 1 indicates that the model fits the data perfectly, and 0 indicates that the model is unable to explain any variation from the mean. (Outside that setting, for example on held-out data, R-squared can even be negative.) Thus it seems safe to assume that the higher the value of R-squared, the better the model.
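To make the definition concrete, here is a minimal sketch in plain Python that computes R-squared as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are made up for illustration; they are not taken from any particular library.

```python
def r_squared(y, f):
    """R-squared for target values y and predicted values f (equal-length lists)."""
    y_bar = sum(y) / len(y)                                # mean of the targets
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)            # total sum of squares
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))   # residual sum of squares
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions, chosen only for illustration.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
f = [1.2, 1.9, 3.2, 3.8, 4.9]

print(round(r_squared(y, f), 3))   # close fit: 0.986
print(r_squared(y, y))             # perfect predictions: 1.0
print(r_squared(y, [3.0] * 5))     # always predicting the mean: 0.0
```

The last two calls show the two reference points: predicting every target exactly gives 1, and always predicting the mean of the targets gives 0.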

BUT, THIS IS NOT ENTIRELY TRUE.

Some of the scenarios where this metric cannot be used are: