Monday, April 1, 2013

Possible reasons for getting an unexpected coefficient sign (e.g. we expect it to be positive, but it shows as negative)

Sometimes a regression gives a coefficient with an unexpected sign. Consider sell_amount vs sell_price: common sense says the coefficient should be negative, because the amount sold should drop as the price increases. But sometimes the model gives a positive coefficient.


One possible reason is that we put highly correlated variables among the predictors: suppose we have both sell_price and product_price in the model, and we know product_price is highly correlated with sell_price. An example is shown below in Example 1.


Another possible reason is the way we recode missing data: the missing values belong at the left (low) end of the range, but we recode them to the right (high) end, for example recoding x<0 as x=99999, which is a very big number.


Example 1:

# set up random number seed
# generate 100 x with x ~ N(5,1)
# Y=5*X
# X1=100*X
1) The regression coefficient of 4.947 is reasonable, since the true value is 5.
# relation between x and y
2) Because x1=100*x, the coefficient for x1 should be 1/100 of the one above:
# relation between x1 and y
3) But if we put both x and x1 in the model, we no longer get a positive coefficient for both x and x1. This is because of multicollinearity: the two predictors carry almost the same information, so the model cannot separate their individual effects.
# put both x and x1 as predictors
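
The three fits in Example 1 can be reproduced with a short simulation. This is only a sketch in Python with NumPy, not the original code. The noise added to y and to x1 is an assumption: an x1 that is exactly 100*x would make the two predictors perfectly collinear, so the individual coefficients would not be uniquely determined. The helper names ols and simulate are made up for illustration.

```python
import numpy as np

def ols(y, *predictors):
    """Least-squares fit with an intercept; returns [intercept, slope_1, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def simulate(seed, n=100):
    """Generate x ~ N(5, 1), y close to 5*x, and x1 close to 100*x."""
    rng = np.random.default_rng(seed)          # set up random number seed
    x = rng.normal(5.0, 1.0, n)                # generate n x with x ~ N(5, 1)
    y = 5.0 * x + rng.normal(0.0, 1.0, n)      # Y = 5*X plus assumed noise
    x1 = 100.0 * x + rng.normal(0.0, 1.0, n)   # X1 = 100*X plus assumed noise
    return x, x1, y

x, x1, y = simulate(seed=0)

b_x = ols(y, x)[1]          # relation between x and y, close to 5
b_x1 = ols(y, x1)[1]        # relation between x1 and y, close to 0.05
b_both = ols(y, x, x1)[1:]  # both x and x1 as predictors

print("y ~ x      :", b_x)
print("y ~ x1     :", b_x1)
print("y ~ x + x1 :", b_both)

# Repeat the joint fit over many seeds and count how often at least one
# of the two coefficients comes out negative.
n_runs = 200
n_negative = 0
for seed in range(n_runs):
    xs, x1s, ys = simulate(seed)
    if np.any(ols(ys, xs, x1s)[1:] < 0):
        n_negative += 1
share_negative = n_negative / n_runs
print("share of runs with a negative coefficient:", share_negative)
```

Each single-predictor fit is stable, and in the joint fit the combined effect b_both[0] + 100*b_both[1] stays close to 5. Only the split between x and x1 is poorly determined, which is why one of the two signs often flips.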

Example 2:

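4) Another cause is recoding missing data. Suppose values with x<=4 are missing, and we recode the missing values as 99999.
# if x<=4, treat it as 99999
5) The regression coefficient then becomes negative because of the 99999 values: the observations with the smallest true x (and hence the smallest y) now sit at the far right of the x axis.
# relation between x3 and y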
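
The recoding effect in Example 2 can be sketched the same way. Again this is an illustration in Python with NumPy rather than the original code, and the noise term in y is an assumption. The last fit, which drops the recoded rows instead of keeping the 99999 values, is added only for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)             # set up random number seed
x = rng.normal(5.0, 1.0, 100)              # x ~ N(5, 1)
y = 5.0 * x + rng.normal(0.0, 1.0, 100)    # Y = 5*X plus assumed noise

# if x <= 4, treat it as missing and recode it as 99999
x3 = np.where(x <= 4, 99999.0, x)

# np.polyfit with degree 1 returns [slope, intercept]
slope_x = np.polyfit(x, y, 1)[0]     # relation between x and y
slope_x3 = np.polyfit(x3, y, 1)[0]   # relation between x3 and y

# For comparison: fit only the rows that were not recoded
keep = x > 4
slope_kept = np.polyfit(x[keep], y[keep], 1)[0]

print("slope using x              :", slope_x)
print("slope using recoded x3     :", slope_x3)
print("slope without recoded rows :", slope_kept)
```

The recoded points have the lowest y values but by far the largest x3 values, so they dominate the fit and pull the slope below zero. Dropping them brings the slope back near 5.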


The recoding issue can be treated as a special case of a non-linear relationship.