0 R Squared


When developing more complex models it is often desirable to report a p-value for the model as a whole as well as an R-squared for the model.

  1. R-squared is always between 0 and 100%. 0% represents a model that does not explain any of the variation in the response variable around its mean; in that case the mean of the dependent variable predicts the dependent variable as well as the regression model does. 100% represents a model that explains all of the variation in the response variable around its mean.
  2. If your paper is about explaining pain levels in humans, you will tend to have a small R-squared, because many things cause pain and any single cause accounts for only a little of the variation.
  3. In finance, R-squared, also known as the coefficient of determination, measures how closely an investment's performance tracks a specific benchmark index. In other words, it shows to what degree a stock or portfolio's performance can be attributed to the benchmark index.

p-values for models

The p-value for a model determines the significance of the model compared with a null model. For a linear model, the null model is defined as the dependent variable being equal to its mean. So, the p-value for the model is answering the question: does this model explain the data significantly better than would just looking at the average value of the dependent variable?
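As a rough sketch of the comparison just described, the block below fits a model and an intercept-only null model with lm and compares them with anova. It uses the typing-speed data that appear later in this chapter; the object names (null, comparison, p.anova, p.summary) are chosen here only for illustration.

```r
### Typing-speed data from the example later in this chapter
Data = data.frame(
  Age = c(12, 12, 12, 13, 13, 15, 15, 16, 18, 22, 23, 24, 25, 27, 33, 42, 53),
  Words.per.minute = c(23, 32, 25, 27, 30, 29, 33, 37, 29, 33, 37, 33, 43, 35, 30, 25, 22))
Data$Age2 = Data$Age ^ 2

model = lm(Words.per.minute ~ Age + Age2, data = Data)
null  = lm(Words.per.minute ~ 1, data = Data)   ### mean-only model

### F test of the fitted model against the null model
comparison = anova(null, model)
p.anova    = comparison[2, "Pr(>F)"]

### The same p-value recovered from the F statistic in summary()
f         = summary(model)$fstatistic
p.summary = unname(pf(f[1], f[2], f[3], lower.tail = FALSE))

print(c(p.anova = p.anova, p.summary = p.summary))
```

Both numbers should agree, since the overall F test reported by summary for an lm fit is this same comparison against the mean-only model.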

Identifying a 'good' value of R-squared in and of itself is a bit slippery. As a rough guide, an R-squared above 0.6 is sometimes taken to make a model worth your attention, though there are other things to consider. For example, any field that attempts to predict human behaviour, such as psychology, typically has R-squared values lower than 0.5.

R-squared and pseudo R-squared

The R-squared value is a measure of how well the model explains the data. It is an example of a goodness-of-fit statistic.

R-squared for linear (ordinary least squares) models

In R, models fit with the lm function are linear models fit with ordinary least squares (OLS). For these models, R-squared indicates the proportion of the variability in the dependent variable that is explained by the model. That is, an R-squared of 0.60 indicates that 60% of the variability in the dependent variable is explained by the model.
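To make the "proportion of variability explained" reading concrete, the following sketch computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares around the mean, and checks it against the value reported by summary. It uses the typing-speed data from the example later in this chapter; ss.res, ss.tot, and r2 are illustrative names.

```r
### Typing-speed data from the example later in this chapter
Data = data.frame(
  Age = c(12, 12, 12, 13, 13, 15, 15, 16, 18, 22, 23, 24, 25, 27, 33, 42, 53),
  Words.per.minute = c(23, 32, 25, 27, 30, 29, 33, 37, 29, 33, 37, 33, 43, 35, 30, 25, 22))
Data$Age2 = Data$Age ^ 2

model = lm(Words.per.minute ~ Age + Age2, data = Data)

### Variation left unexplained by the model
ss.res = sum(residuals(model) ^ 2)

### Total variation of the response around its mean
ss.tot = sum((Data$Words.per.minute - mean(Data$Words.per.minute)) ^ 2)

r2 = 1 - ss.res / ss.tot
print(c(by.hand = r2, from.summary = summary(model)$r.squared))

### A mean-only model explains none of the variation
r2.null = summary(lm(Words.per.minute ~ 1, data = Data))$r.squared
print(r2.null)
```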

Pseudo R-squared

For many types of models, R-squared is not defined. These include relatively common models like logistic regression and the cumulative link models used in this book. For these models, pseudo R-squared measures can be calculated. A pseudo R-squared is not directly comparable to the R-squared for OLS models. Nor can it be interpreted as the proportion of the variability in the dependent variable that is explained by the model. Instead, pseudo R-squared measures are relative measures among similar models indicating how well the model explains the data.

This book uses three pseudo R-squared measures: McFadden, Cox and Snell (also referred to as ML), and Nagelkerke (also referred to as Cragg and Uhler).

In general I favor the Nagelkerke pseudo R-squared, but there is no agreement as to which pseudo R-squared measurement should be used.

p-values and R-squared values

p-values and R-squared values measure different things. The p-value indicates if there is a significant relationship described by the model, and the R-squared measures the degree to which the data is explained by the model. It is therefore possible to get a significant p-value with a low R-squared value. This often happens when there is a lot of variability in the dependent variable, but there are enough data points for a significant relationship to be indicated.
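A small simulation can illustrate this situation. The sketch below uses made-up data (not data from this chapter): a weak linear effect buried in a lot of noise, with a large number of observations. The variable names, the effect size of 0.2, and the sample size are arbitrary choices for illustration.

```r
set.seed(42)                 ### for reproducibility

n = 2000                     ### many data points
x = rnorm(n)
y = 0.2 * x + rnorm(n)       ### weak effect, lots of unexplained variability

fit = lm(y ~ x)

### p-value for the model from the overall F statistic
f       = summary(fit)$fstatistic
p.value = unname(pf(f[1], f[2], f[3], lower.tail = FALSE))

r.squared = summary(fit)$r.squared

print(c(p.value = p.value, r.squared = r.squared))
```

With settings like these, the p-value should be very small while the R-squared stays low, because the relationship is real but explains only a small share of the variability.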

Packages used in this chapter

The packages used in this chapter include:

• psych

• lmtest

• boot

• rcompanion

The following commands will install these packages if they are not already installed:


if(!require(psych)){install.packages('psych')}
if(!require(lmtest)){install.packages('lmtest')}
if(!require(boot)){install.packages('boot')}
if(!require(rcompanion)){install.packages('rcompanion')}

Example of model p-value, R-squared, and pseudo R-squared

The following example uses some hypothetical data of a sample of people for which typing speed (Words.per.minute) and age were measured. After plotting the data, we decide to construct a polynomial model with Words.per.minute as the dependent variable and Age and Age2 as the independent variables, where Age2 is the square of Age. Notice in this example that all variables are treated as interval/ratio variables, and that the independent variables are also continuous variables.

The data will first be fit with a linear model with the lm function. Passing this model to the summary function will display the p-value for the model and the R-squared value for the model.

The same data will then be fit with a generalized linear model with the glm function. This type of model allows more latitude in the types of data that can be fit, but in this example, we'll use the family=gaussian option, which will mimic the model fit with the lm function, though the underlying math is different.

Importantly, the summary of the glm function does not produce a p-value for the model nor an R-squared for the model.
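As a quick check of these two points, the sketch below fits the same formula with lm and with glm using family = gaussian, compares the coefficients, and looks for an r.squared element in each summary. The object names fit.lm and fit.glm are illustrative, and the data are the typing-speed data used in this example.

```r
### Typing-speed data used in this example
Data = data.frame(
  Age = c(12, 12, 12, 13, 13, 15, 15, 16, 18, 22, 23, 24, 25, 27, 33, 42, 53),
  Words.per.minute = c(23, 32, 25, 27, 30, 29, 33, 37, 29, 33, 37, 33, 43, 35, 30, 25, 22))
Data$Age2 = Data$Age ^ 2

fit.lm  = lm (Words.per.minute ~ Age + Age2, data = Data)
fit.glm = glm(Words.per.minute ~ Age + Age2, data = Data, family = gaussian)

### The estimated coefficients should agree
same.coefs = isTRUE(all.equal(coef(fit.lm), coef(fit.glm)))
print(same.coefs)

### summary() of the lm fit carries an R-squared; summary() of the glm fit does not
print(summary(fit.lm)$r.squared)
print(is.null(summary(fit.glm)$r.squared))
```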

For the model fit with glm, the p-value can be determined with the anova function comparing the fitted model to a null model. The null model is fit with only an intercept term on the right side of the model. As an alternative, the nagelkerke function described below also reports a p-value for the model, using the likelihood ratio test.

There is no R-squared defined for a glm model. Instead a pseudo R-squared can be calculated. The function nagelkerke produces pseudo R-squared values for a variety of models. It reports three types: McFadden, Cox and Snell, and Nagelkerke. In general I recommend using the Nagelkerke measure, though there is no agreement on which pseudo R-squared measurement to use, if any at all.

The Nagelkerke is the same as the Cox and Snell, except that the value is adjusted upward so that the Nagelkerke has a maximum value of 1.

It has been suggested that a McFadden value of 0.2–0.4 indicates a good fit.

Note that these models make certain assumptions about the distribution of the data, but for simplicity, this example will ignore the need to determine if the data meet these assumptions.

Input =('
Age Words.per.minute
12 23
12 32
12 25
13 27
13 30
15 29
15 33
16 37
18 29
22 33
23 37
24 33
25 43
27 35
33 30
42 25
53 22
')
Data = read.table(textConnection(Input),header=TRUE)
### Check the data frame
library(psych)
headTail(Data)
str(Data)

summary(Data)
### Remove unnecessary objects
rm(Input)

Plot the data


plot(Words.per.minute ~ Age,
data = Data,
pch=16,
xlab = 'Age',
ylab = 'Words per minute')


Prepare data


### Create new variable for the square of Age
Data$Age2 = Data$Age ^ 2
### Double check data frame
library(psych)
headTail(Data)


    Age Words.per.minute Age2
1    12               23  144
2    12               32  144
3    12               25  144
4    13               27  169
... ...              ...  ...
14   27               35  729
15   33               30 1089
16   42               25 1764
17   53               22 2809

Linear model


model = lm (Words.per.minute ~ Age + Age2,
data=Data)
summary(model) ### Shows coefficients,
### p-value for model, and R-squared


Multiple R-squared: 0.5009, Adjusted R-squared: 0.4296
F-statistic: 7.026 on 2 and 14 DF, p-value: 0.00771
### p-value and (multiple) R-squared value

Simple plot of data and model

For bivariate data, the function plotPredy will plot the data and the predicted line for the model. It also works for polynomial functions, if the order option is changed.


library(rcompanion)
plotPredy(data = Data,
y = Words.per.minute,
x = Age,
x2 = Age2,
model = model,
order = 2,
xlab = 'Age',
ylab = 'Words per minute')



Generalized linear model


model = glm (Words.per.minute ~ Age + Age2,
data = Data,
family = gaussian)
summary(model) ### Shows coefficients

Calculate p-value for model

In R, the most common way to calculate the p-value for a fitted model is to compare the fitted model to a null model with the anova function. The null model is usually formulated with just a constant on the right side.



null = glm (Words.per.minute ~ 1, ### Create null model
data = Data, ### with only a constant on the right side
family = gaussian)
anova (model,
null,
test='Chisq') ### Test options: 'Rao', 'LRT',
### 'Chisq', 'F', 'Cp'
### But some work with only some model types


  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1        14     243.07
2        16     487.06 -2  -243.99 0.0008882 ***
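The test option changes which reference distribution is used, and so changes the p-value. As a sketch, the block below repeats the comparison with test = 'F' for the gaussian glm and checks the result against the model p-value that summary reports for the equivalent lm fit. The object names p.glm.F and p.lm are illustrative.

```r
### Typing-speed data used in this example
Data = data.frame(
  Age = c(12, 12, 12, 13, 13, 15, 15, 16, 18, 22, 23, 24, 25, 27, 33, 42, 53),
  Words.per.minute = c(23, 32, 25, 27, 30, 29, 33, 37, 29, 33, 37, 33, 43, 35, 30, 25, 22))
Data$Age2 = Data$Age ^ 2

model = glm(Words.per.minute ~ Age + Age2, data = Data, family = gaussian)
null  = glm(Words.per.minute ~ 1,          data = Data, family = gaussian)

### F test of the fitted glm against the null glm
p.glm.F = anova(null, model, test = 'F')[2, "Pr(>F)"]

### Model p-value from the equivalent lm fit
f    = summary(lm(Words.per.minute ~ Age + Age2, data = Data))$fstatistic
p.lm = unname(pf(f[1], f[2], f[3], lower.tail = FALSE))

print(c(p.glm.F = p.glm.F, p.lm = p.lm))
```

For a gaussian model the dispersion is estimated from the data, which is one reason the F test may be considered alongside the chi-square test here.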

Calculate pseudo R-squared and p-value for model

As an alternative test for the p-value for a fitted model, the nagelkerke function will report the p-value for a model using the likelihood ratio test.

The nagelkerke function also reports the McFadden, Cox and Snell, and Nagelkerke pseudo R-squared values for the model.


library(rcompanion)
nagelkerke(model)


$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.112227
Cox and Snell (ML)                   0.500939
Nagelkerke (Cragg and Uhler)         0.501964

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq   p.value
      -2     -5.9077 11.815 0.0027184
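The three pseudo R-squared values can be reproduced from the log-likelihoods of the fitted and null models using the commonly cited formulas for these measures. The sketch below uses only base R; the names ll.model, ll.null, and the three result objects are illustrative, and the formulas are written out here as an assumption about how these measures are usually defined rather than as a copy of the nagelkerke function's internals.

```r
### Typing-speed data used in this example
Data = data.frame(
  Age = c(12, 12, 12, 13, 13, 15, 15, 16, 18, 22, 23, 24, 25, 27, 33, 42, 53),
  Words.per.minute = c(23, 32, 25, 27, 30, 29, 33, 37, 29, 33, 37, 33, 43, 35, 30, 25, 22))
Data$Age2 = Data$Age ^ 2

model = glm(Words.per.minute ~ Age + Age2, data = Data, family = gaussian)
null  = glm(Words.per.minute ~ 1,          data = Data, family = gaussian)

n        = nobs(model)
ll.model = as.numeric(logLik(model))
ll.null  = as.numeric(logLik(null))

### McFadden: one minus the ratio of log-likelihoods
McFadden = 1 - ll.model / ll.null

### Cox and Snell (ML): based on the likelihood ratio, scaled by n
CoxSnell = 1 - exp(-(2 / n) * (ll.model - ll.null))

### Nagelkerke: Cox and Snell divided by its maximum possible value
Nagelkerke = CoxSnell / (1 - exp((2 / n) * ll.null))

print(round(c(McFadden = McFadden, CoxSnell = CoxSnell,
              Nagelkerke = Nagelkerke), 6))
```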

Likelihood ratio test for p-value for model

The p-value for a model by the likelihood ratio test can also be determined with the lrtest function in the lmtest package.


library(lmtest)
lrtest(model)


  #Df  LogLik Df  Chisq Pr(>Chisq)
1   4 -46.733
2   2 -52.641 -2 11.815   0.002718 **
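The likelihood ratio test itself is short enough to write out by hand: the test statistic is twice the difference in log-likelihoods, referred to a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters. A minimal base R sketch, with illustrative object names:

```r
### Typing-speed data used in this example
Data = data.frame(
  Age = c(12, 12, 12, 13, 13, 15, 15, 16, 18, 22, 23, 24, 25, 27, 33, 42, 53),
  Words.per.minute = c(23, 32, 25, 27, 30, 29, 33, 37, 29, 33, 37, 33, 43, 35, 30, 25, 22))
Data$Age2 = Data$Age ^ 2

model = glm(Words.per.minute ~ Age + Age2, data = Data, family = gaussian)
null  = glm(Words.per.minute ~ 1,          data = Data, family = gaussian)

ll.model = logLik(model)
ll.null  = logLik(null)

### Test statistic: twice the difference in log-likelihoods
Chisq = 2 * (as.numeric(ll.model) - as.numeric(ll.null))

### Degrees of freedom: difference in number of estimated parameters
Df = attr(ll.model, "df") - attr(ll.null, "df")

p.value = pchisq(Chisq, df = Df, lower.tail = FALSE)

print(c(Chisq = Chisq, Df = Df, p.value = p.value))
```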

Simple plot of data and model


library(rcompanion)
plotPredy(data = Data,
y = Words.per.minute,
x = Age,
x2 = Age2,
model = model,
order = 2,
xlab = 'Age',
ylab = 'Words per minute',
col = 'red') ### line color



Optional analyses: Confidence intervals for R-squared values

It is relatively easy to produce confidence intervals for R-squared values or other results from model fitting, such as coefficients for regression terms. This can be accomplished with bootstrapping. Here the boot.ci function from the boot package is used.

The code below is a little complicated, but relatively flexible.

Function can contain any function of interest, as long as it includes an input vector or data frame (input in this case) and an indexing variable (index in this case). Stat is set to produce the actual statistic of interest on which to perform the bootstrap (r.squared from the summary of the lm in this case).


The code Function(Data, 1:n) is there simply to test Function on the data frame Data. In this case, it will produce the output of Function for the first n rows of Data. Since n is defined as the length of the first column in Data, this should return the value for Stat for the whole data frame, if Function is set up correctly.
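To show what the indexing variable is doing, here is a hand-rolled sketch of the same idea in base R, without the boot package. Each replicate draws row indices with replacement and passes them to the statistic function, and a percentile interval is then read off the replicates with quantile. The names Stat.fun, reps, and ci are illustrative, and the interval will only approximate the one produced by boot.ci, which uses its own interpolation.

```r
### Typing-speed data used in this example
Data = data.frame(
  Age = c(12, 12, 12, 13, 13, 15, 15, 16, 18, 22, 23, 24, 25, 27, 33, 42, 53),
  Words.per.minute = c(23, 32, 25, 27, 30, 29, 33, 37, 29, 33, 37, 33, 43, 35, 30, 25, 22))
Data$Age2 = Data$Age ^ 2

### Statistic of interest, computed on the rows selected by index
Stat.fun = function(input, index){
  Sample = input[index, ]
  summary(lm(Words.per.minute ~ Age + Age2, data = Sample))$r.squared
}

n = nrow(Data)

### Using every row once reproduces the statistic for the whole data frame
full = Stat.fun(Data, 1:n)

### Each replicate resamples the row indices with replacement
set.seed(1)
reps = replicate(1000, Stat.fun(Data, sample(n, n, replace = TRUE)))

### Percentile interval from the bootstrap replicates
ci = quantile(reps, c(0.025, 0.975), na.rm = TRUE)

print(full)
print(ci)
```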


Input =('
Age Words.per.minute
12 23
12 32
12 25
13 27
13 30
15 29
15 33
16 37
18 29
22 33
23 37
24 33
25 43
27 35
33 30
42 25
53 22
')
Data = read.table(textConnection(Input),header=TRUE)
Data$Age2 = Data$Age ^ 2

### Check the data frame

library(psych)
headTail(Data)
str(Data)

summary(Data)
### Remove unnecessary objects
rm(Input)

Confidence intervals for R-squared by bootstrap

R-squared value


library(boot)
Function = function(input, index){
Input = input[index,]
Result = lm (Words.per.minute ~ Age + Age2,
data = Input)
Stat = summary(Result)$r.squared
return(Stat)}
### Test Function
n = length(Data[,1])
Function(Data, 1:n)


[1] 0.5009385

### Produce Stat estimate by bootstrap
Boot = boot(Data,
Function,
R=5000)
mean(Boot$t[,1])


[1] 0.5754582

### Produce confidence interval by bootstrap
boot.ci(Boot,
conf = 0.95,
type = 'perc')
### Options: 'norm','basic', 'stud', 'perc', 'bca','all'


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
Intervals :
Level Percentile
95% ( 0.3796, 0.7802 )
Calculations and Intervals on Original Scale

### Other information
Boot
hist(Boot$t[,1],
col = 'darkgray')

Confidence intervals for pseudo R-squared by bootstrap

Pseudo R-squared value

The nagelkerke function produces a list. The second item in the list is a matrix named Pseudo.R.squared.for.model.vs.null. The third element of this matrix is the value for the Nagelkerke pseudo R-squared. So,

nagelkerke(Result, Null)[[2]][3]

yields the value of the Nagelkerke pseudo R-squared.


library(boot)
library(rcompanion)
Function = function(input, index){
Input = input[index,]
Result = glm (Words.per.minute ~ Age + Age2,
data = Input,
family='gaussian')
Null = glm (Words.per.minute ~ 1,
data = Input,
family='gaussian')
Stat = nagelkerke(Result, Null)[[2]][3]
return(Stat)}
### Test Function
n = length(Data[,1])
Function(Data, 1:n)


[1] 0.501964

### Produce Stat estimate by bootstrap
Boot = boot(Data,
Function,
R=1000)
### In this case, even 1000 iterations can take a while
mean(Boot$t[,1])


[1] 0.5803598


### Produce confidence interval by bootstrap
boot.ci(Boot,
conf = 0.95,
type = 'perc')
### Options: 'norm','basic', 'stud', 'perc', 'bca','all'


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
Intervals :
Level Percentile
95% ( 0.3836, 0.7860 )
Calculations and Intervals on Original Scale

### Other information
Boot
hist(Boot$t[,1],
col = 'darkgray')

References

Kabacoff, R.I. Bootstrapping. Quick-R. www.statmethods.net/advstats/bootstrapping.html.

R-squared.


R-squared is a statistical measurement that determines the proportion of a security's return, or the return on a specific portfolio of securities, that can be explained by variations in the stock market, as measured by a benchmark index.

For example, an R-squared of 0.80 shows that 80% of a security's return is the result of changes in the market, specifically that 80% of its gains are due to market gains and 80% of its losses are due to market losses. The other 20% are the result of factors particular to the security itself.

Dictionary of Financial Terms. Copyright © 2008 Lightbulb Press, Inc. All Rights Reserved.