Monday, April 1, 2013

Possible reasons for getting a coefficient with an unexpected sign (e.g., we expect it to be positive, but it comes out negative)

Sometimes a regression gives a coefficient with an unexpected sign. Consider sell_amount vs sell_price: common sense says the coefficient should be negative, because the amount sold should fall as the price increases. But sometimes the model gives a positive coefficient.

 

One possible reason is that we put highly correlated variables among the predictors: suppose we have both sell_price and product_price in the model, and we know product_price is highly correlated with sell_price. See Example 1 below.

 

Another possible reason is the way we recode missing data: the missing values belong at the low end of the variable's range, but we recode them to the high end (for example, if x<0 we recode x=99999, which is a very large number). See Example 2 below.

 

Example 1:

# set up random number seed
set.seed(1000)
 
# generate 100 x with x ~ N(5,1)
x=rnorm(100,5,1)
 
 
# y = 5*x plus N(0,1) noise
y=5*x+rnorm(100,0,1)
 
 
 
# x1 = 100*x plus N(0,0.2) noise
x1=x*100+rnorm(100,0,.2)
 
 
 
1) The regression coefficient for x is 4.947, which is reasonable given that the true value is 5.
#relation between x and y
lm(y~x)
 
 
2) Because x1 is approximately 100*x, the coefficient for x1 should be about 1/100 of the coefficient above:
# relation between x1 and y
lm(y~x1)
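The scaling claim in step 2 can be checked directly. The sketch below regenerates the same data so it runs on its own; the names b_x and b_x1 are just illustrative, and this is one way to make the comparison, not the only one.

```r
# Regenerate the data from Example 1 so this block is self-contained
set.seed(1000)
x  <- rnorm(100, 5, 1)
y  <- 5 * x + rnorm(100, 0, 1)
x1 <- x * 100 + rnorm(100, 0, 0.2)

# Slope from each single-predictor model
b_x  <- unname(coef(lm(y ~ x))["x"])
b_x1 <- unname(coef(lm(y ~ x1))["x1"])

# Multiplying the x1 slope by 100 should bring it close to the x slope
print(c(b_x = b_x, b_x1_times_100 = 100 * b_x1))
```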
 
 
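3) But if we put both x and x1 in the model, we can no longer expect a positive coefficient for both x and x1. This is because of multicollinearity: the two predictors carry almost the same information, so the individual estimates become unstable and their signs unreliable.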
# put both x and x1 as the predictor
lm(y~x+x1)
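One way to see the multicollinearity in step 3 is to look at the correlation between the predictors, the variance inflation factor (VIF), and the standard errors. The sketch below computes the VIF by hand as 1/(1 - R^2) from regressing x on x1; the variable names are illustrative, and the data are regenerated so the block runs on its own.

```r
# Regenerate the data from Example 1 so this block is self-contained
set.seed(1000)
x  <- rnorm(100, 5, 1)
y  <- 5 * x + rnorm(100, 0, 1)
x1 <- x * 100 + rnorm(100, 0, 0.2)

# Correlation between the two predictors (very close to 1 here)
r <- cor(x, x1)

# VIF for x: 1 / (1 - R^2), where R^2 comes from regressing x on x1
r2_x  <- summary(lm(x ~ x1))$r.squared
vif_x <- 1 / (1 - r2_x)

# Standard error of the x coefficient, alone vs. alongside x1
se_alone <- coef(summary(lm(y ~ x)))["x", "Std. Error"]
se_both  <- coef(summary(lm(y ~ x + x1)))["x", "Std. Error"]

print(c(cor = r, vif = vif_x, se_alone = se_alone, se_both = se_both))
```

A very large VIF and a much larger standard error in the two-predictor model suggest that the sign of each coefficient there should not be trusted.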
 
 
 

Example 2:

4) Another cause is the way we recode missing data. Suppose x is missing whenever x<=4, and we recode the missing values as 99999.
# if x<=4, treat it as 99999
x3=ifelse(x>4,x,99999)
 
 
5) Then the regression coefficient is negative, because the 99999 values sit at the far right of x3 while their y values are the smallest.
# relation between x3 and y
lm(y~x3)
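A possible alternative to the 99999 code is to mark those values as NA, so that lm() leaves the rows out (its default na.action is na.omit). The sketch below compares the two slopes; x4 is a hypothetical name introduced here, and dropping rows is only one option that may not suit every data set.

```r
# Regenerate x and y from Example 1 so this block is self-contained
set.seed(1000)
x <- rnorm(100, 5, 1)
y <- 5 * x + rnorm(100, 0, 1)

# Sentinel coding, as in Example 2
x3 <- ifelse(x > 4, x, 99999)

# Alternative (an assumption of this sketch): code the same values as NA
x4 <- ifelse(x > 4, x, NA)

# Slope with the sentinel value vs. slope with NA rows omitted
slope_sentinel <- unname(coef(lm(y ~ x3))["x3"])
slope_na       <- unname(coef(lm(y ~ x4))["x4"])

print(c(slope_sentinel = slope_sentinel, slope_na = slope_na))
```

With NA coding the slope is estimated only from the observed values of x, so it should stay positive and near the true value of 5.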
 

 

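The recoding issue can be treated as a special case of a non-linear relationship: after recoding, y is no longer a linear function of x3, so a straight-line fit gives a misleading slope.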
