Friday, September 30, 2011

Assignment 1 of the UCLA ATS Webbook on Linear Regression

These questions are from the webbook 'Regression with Stata' from UCLA ATS. The original questions are meant to be answered with Stata; here I try to answer them with SAS.


Linear Regression

Question 1:  Use data set elemapi2. Make five graphs of api99: histogram, kdensity plot, boxplot, symmetry plot and normal quantile plot.
Answer 1:
1: histogram / kernel density / normal density
2: kernel density
3: boxplot
4: QQ plot
5: normal probability plot
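
The figures themselves are not reproduced here. A minimal SAS sketch that would produce these graphs, assuming elemapi2 has already been loaded into the WORK library:

proc sgplot data=elemapi2;             /* 1: histogram with density overlays */
  histogram api99;
  density api99 / type=kernel;
  density api99 / type=normal;
run;

proc sgplot data=elemapi2;             /* 2: kernel density alone */
  density api99 / type=kernel;
run;

proc sgplot data=elemapi2;             /* 3: boxplot */
  vbox api99;
run;

proc univariate data=elemapi2 noprint; /* 4-5: QQ and probability plots */
  var api99;
  qqplot api99 / normal(mu=est sigma=est);
  probplot api99 / normal(mu=est sigma=est);
run;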



Question 2: What is the correlation between api99 and meals?

Answer 2:
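The output table is omitted; a minimal sketch of the SAS step behind it:

proc corr data=elemapi2;   /* Pearson correlation of api99 and meals */
  var api99 meals;
run;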

From the output we can see that the correlation between api99 and meals is negative, about -0.908. That is, as the percentage of students receiving free meals increases, api99 decreases.

Question 3: Regress api99 on meals. What does the output tell you?

Answer 3:
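The regression output is omitted; a minimal sketch of the SAS step behind it:

proc reg data=elemapi2;
  model api99 = meals;   /* simple regression of api99 on meals */
run;
quit;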


The p-value is < .0001, so meals is significant in the regression. The coefficient is negative, which means that as meals increases by one unit, api99 decreases by 4.187.

Question 4: Create and list the fitted (predicted) values.

Answer 4: 
The first 20 predicted values are:
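
The listing itself is omitted. A sketch of how the fitted values can be created and listed in SAS; the names pred and api99_hat are illustrative:

proc reg data=elemapi2 noprint;
  model api99 = meals;
  output out=pred p=api99_hat;   /* save the fitted values */
run;
quit;

proc print data=pred(obs=20);    /* list the first 20 */
  var api99 meals api99_hat;
run;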



Question 5: Graph meals and api99 with and without the regression line.

Answer 5: 
With Regression Line:

Without Regression Line (in SAS, using the / noline option after the model statement):
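
Both figures are omitted. With ODS graphics, a sketch of the two plots in PROC SGPLOT (a REG statement draws the fit line; a plain SCATTER statement leaves it off):

proc sgplot data=elemapi2;    /* with the regression line */
  reg y=api99 x=meals;
run;

proc sgplot data=elemapi2;    /* without the regression line */
  scatter y=api99 x=meals;
run;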


Question 6: Look at the correlations among the variables api99 meals ell avg_ed using the corr and pwcorr commands. Explain how these commands are different. Make a scatterplot matrix for these variables and relate the correlation results to the scatterplot matrix.

Answer 6: 
Correlation of the variables

Scatter plot of the variables. It gives a visual representation of the correlations among the variables: api99 is positively correlated with avg_ed and negatively correlated with meals and ell.
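
Both outputs are omitted; a sketch of the corresponding SAS steps. By default PROC CORR deletes missing values pairwise, like Stata's pwcorr; the NOMISS option gives listwise deletion, like corr:

proc corr data=elemapi2 nomiss;   /* listwise deletion (corr)   */
  var api99 meals ell avg_ed;
run;

proc corr data=elemapi2;          /* pairwise deletion (pwcorr) */
  var api99 meals ell avg_ed;
run;

proc sgscatter data=elemapi2;     /* scatterplot matrix */
  matrix api99 meals ell avg_ed;
run;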


Question 7: Perform a regression predicting api99 from meals and ell. Interpret the output. 

Answer 7:
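The output is omitted; a sketch of the regression. avg_ed is included here because the interpretation below discusses it:

proc reg data=elemapi2;
  model api99 = meals ell avg_ed;
run;
quit;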

The p-values show that these predictors are significant. The output also confirms that api99 is negatively related to meals and ell and positively related to avg_ed.

Monday, September 12, 2011

Repost: Structuring SAS Programs


Structuring SAS Programs

The primary aim in writing SAS code is to get a job done. But the job is not just to produce some output. You should structure your SAS programs to:
• make finding errors as easy as possible
• make the code easy to understand later by yourself or someone else
• document the time and stage of the analysis
• document data edits and corrections

Most projects have an initial data cleaning phase, followed by an analysis to write a paper. After the paper is submitted, there is a delay until the referee reports arrive, followed by another analysis with revisions. Interruptions of weeks or months in a project are common—write your code to make it easy to pick up later.

Advice:
1. List the program name, date, author, revision date(s), and what the program does at the top. Use comments to break the code into sections (see the sketch after this list).
2. Edit the data and create new variables in a single data step at the beginning of the program, as far as possible. This makes it easy to find and to correct problems.
3. Use as few data steps as possible. Data steps are the most confusing part of a program.
4. Use comments to explain data edits and identify data problems. Hard code all data corrections in these programs. If there is email or other documentation, include this as comments in the code, so you don’t need to search your mail files later to figure out what happened.
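
A minimal sketch of this layout, with hypothetical file, path, and variable names:

/*------------------------------------------------------------
  Program : PlanB_01.sas
  Author  : (your name)
  Created : 12 Sep 2011
  Revised :
  Purpose : Read the raw data, apply all documented
            corrections, and save a permanent data set.
------------------------------------------------------------*/

libname proj 'C:\project\data';          /* hypothetical path */

/* all edits and new variables in one data step */
data proj.analysis;
  set proj.raw;
  /* correction per email from the PI, 30 Aug 2011:
     id 1042's age was keyed as 9 instead of 90 */
  if id = 1042 then age = 90;
  log_income = log(income);              /* derived variable */
run;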


This advice also applies to a collection of programs written for a project:
1. Number the programs within their names as you write them: PlanB_01.sas, PlanB_02.sas, etc.
2. Do all the data editing and creation of permanent datasets in the first program—no analysis. Hard code all data corrections in these programs. Include email correspondence as comments in the code.
3. Perform the analysis in later programs that simply call the permanent datasets. Analysis programs should only create temporary datasets (see the sketch below).
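
A sketch of a later analysis program under this scheme, again with hypothetical names:

/* PlanB_02.sas : analysis only, reads the permanent data set
   created by PlanB_01.sas and writes nothing permanent */
libname proj 'C:\project\data';

data analysis;                 /* temporary WORK copy */
  set proj.analysis;
run;

proc means data=analysis;
  var age log_income;
run;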

Multiple versions of the same data invite trouble.

Friday, September 2, 2011

Study notes of Logistic regression, GEE and GLMM

These are my study notes, with an example explaining the differences among the approaches.

Study notes (pdf)


Study notes (dvi)