The original one, done in SAS, is at UCLA ATS. It focuses on how to detect outliers, leverage points, and influential observations when doing regression. Here it will be replicated in R.
1: read in data and get summary info:
## since the data is not available directly, use the SPSS data file instead
library(foreign)
read.spss("http://dl.dropbox.com/u/10684315/ucla_reg/crime.sav", to.data.frame = TRUE) -> crime
head(crime)
summary(crime)
2: Scatter plot of crime, pctmetro, poverty and single
names(crime)<-tolower(names(crime))
row.names(crime)<-crime$state
plot(crime[,c("crime","pctmetro","poverty","single")])
Here, for convenience, the column names are converted from uppercase to lowercase.
The scatter plot shows that some points lie far away from the others.
Outlier: an observation whose dependent-variable value is unusual given its predictors, i.e. it has a large residual.
Leverage: an observation with an extreme value on an independent variable (predictor); such a point can strongly pull the fitted line.
Influence: an observation that is both an outlier and a high-leverage point; removing it changes the regression results noticeably. A tiny synthetic illustration follows.
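To make the distinction concrete, here is a small synthetic sketch (not from the original post): an extreme x value gives leverage, a large gap from the trend gives a residual, and both together give influence.
## synthetic example: the 21st point is extreme in x (leverage)
## and far from the y = 2x trend (outlier), hence influential
set.seed(1)
x <- c(rnorm(20), 10)
y <- c(2 * x[1:20] + rnorm(20), 0)
fit <- lm(y ~ x)
hatvalues(fit)[21]       # high leverage
rstudent(fit)[21]        # large studentized residual
cooks.distance(fit)[21]  # large Cook's distance: influential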
3: plot of crime vs pctmetro, with the points labeled by state
plot(crime$pctmetro,crime$crime)
text(crime$pctmetro,crime$crime,crime$state,cex=.5,pos=2)
From the plot it is clear that the observation for
DC might be an outlier.
Similar plots can be drawn for the other pairs of variables, as sketched below.
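For example, a minimal sketch looping over the remaining predictors with the same labeling trick (column names as created above):
## labeled scatter plots of crime against each remaining predictor
for (v in c("poverty", "single")) {
  plot(crime[[v]], crime$crime, xlab = v, ylab = "crime")
  text(crime[[v]], crime$crime, crime$state, cex = .5, pos = 2)
}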
4: regression and diagnostics with statistical methods
lm(crime~pctmetro+poverty+single,crime)->crime_reg1
summary(crime_reg1)
The value R-squared = 0.8399 means that 83.99% of the variation in the data is explained by the model. As a rough rule of thumb, a value above 0.8 indicates the model fits pretty well.
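If only this number is needed, it can be extracted from the summary object directly:
## extract R-squared from the fitted model
summary(crime_reg1)$r.squared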
5: some important values
In statistics class we learned some diagnostic quantities such as the studentized residual, leverage (the diagonal of the hat matrix), and Cook's distance. In R it is easy to get them:
## calculate leverage, studentized residual, and Cook's distance
leverage=hatvalues(crime_reg1)
r_std=rstudent(crime_reg1)
cd=cooks.distance(crime_reg1)
## plot of leverage vs student residual
plot(r_std^2, leverage)
text(r_std^2,leverage, crime$state, cex=.5, pos=2)
The plot shows that DC and MS have large residuals, and that DC also has high leverage, so DC deserves particular attention. A quick numeric check with common rule-of-thumb cutoffs is sketched below.
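Here is a sketch flagging observations beyond the commonly used cutoffs (|studentized residual| > 2, leverage > 2k/n, Cook's distance > 4/n; these thresholds are textbook conventions, not part of the original post):
## flag observations beyond rule-of-thumb cutoffs
n <- nrow(crime)
k <- length(coef(crime_reg1))       # number of estimated coefficients
names(which(abs(r_std) > 2))        # large studentized residuals
names(which(leverage > 2 * k / n))  # high leverage
names(which(cd > 4 / n))            # large Cook's distance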
6: DFBETAS: it is calculated for each observation and each independent variable (coefficient), and it measures how much the regression coefficient would change, in standard-error units, if the observation were removed.
head(dfbetas(crime_reg1))
The meaning, taking obs AK and the variable single as an example: including obs AK in the data increases the regression coefficient of single by about 0.0145 * standard_error_of_single_coefficient = 0.0145 * 15.5 ≈ 0.225.
Let's check this:
lm(crime~pctmetro+poverty+single,crime)->crime_reg1
lm(crime~pctmetro+poverty+single,crime[-1,])->crime_reg2
crime_reg1
crime_reg2
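The change can also be checked numerically: the drop in the single coefficient should be approximately the DFBETAS value scaled by the coefficient's standard error (approximate because R scales DFBETAS with the leave-one-out sigma). A sketch, assuming AK is the first row as above:
## direct change in the single coefficient when AK (row 1) is dropped
coef(crime_reg1)["single"] - coef(crime_reg2)["single"]
## change implied by DFBETAS, up to the sigma approximation
dfbetas(crime_reg1)[1, "single"] * summary(crime_reg1)$coefficients["single", "Std. Error"]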
7: Plot all dfbetas on the same plot to compare
data.frame(dfbetas(crime_reg1))->dfb
## use a common y-range so all three series fit on one plot
plot(dfb$single,col="black",ylim=range(dfb))
points(dfb$pctmetro,col="green")
points(dfb$poverty,col="red")
text(seq_along(dfb$single),dfb$single,crime$state,cex=.5,pos=2)
All in all, the 51st point (DC) should catch our attention.
A possible reason: all the other observations are states, but DC is not a state.
8: Finally, the regression without the obs from DC
lm(crime~pctmetro+poverty+single,crime[-51,])->crime_reg3
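To see how much DC mattered, a quick side-by-side of the coefficients (assuming DC is the 51st row, as above):
## compare coefficients with and without DC
cbind(with_DC = coef(crime_reg1), without_DC = coef(crime_reg3))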