Monday, July 25, 2011

ZZ: A Glossary of DOE Terminology

This page gives definitions and information for many of the basic terms used in DOE.

· Alias: When the estimate of an effect also includes the influence of one or more other effects (usually high order interactions) the effects are said to be aliased (see confounding). For example, if the estimate of effect D in a four factor experiment actually estimates (D + ABC), then the main effect D is aliased with the 3-way interaction ABC. Note: This causes no difficulty when the higher order interaction is either non-existent or insignificant.

· Analysis of Variance (ANOVA): A mathematical process for separating the variability of a group of observations into assignable causes and setting up various significance tests.

· Balanced Design: An experimental design where all cells (i.e. treatment combinations) have the same number of observations.

· Blocking: A schedule for conducting treatment combinations in an experimental study such that any effects on the experimental results due to a known change in raw materials, operators, machines, etc., become concentrated in the levels of the blocking variable. Note: the reason for blocking is to isolate a systematic effect and prevent it from obscuring the main effects. Blocking is achieved by restricting randomization.

· Center Points: Points at the center value of all factor ranges.

· Coding Factor Levels: Transforming the scale of measurement for a factor so that the high value becomes +1 and the low value becomes -1 (see scaling). After coding all factors in a 2-level full factorial experiment, the design matrix has all orthogonal columns.

Coding is a simple linear transformation of the original measurement scale. If the "high" value is Xh and the "low" value is XL (in the original scale), then the scaling transformation takes any original X value and converts it to (X - a)/b, where
a = (Xh + XL)/2 and b = (Xh - XL)/2.
To go back to the original measurement scale, just take the coded value, multiply it by "b", and add "a": X = b(coded value) + a.

As an example, if the factor is temperature and the high setting is 65°C and the low setting is 55°C, then a = (65 + 55)/2 = 60 and b = (65 - 55)/2 = 5. The center point (where the coded value is 0) has a temperature of 5(0) + 60 = 60°C.

· Comparative Designs: A design aimed at making conclusions about one a priori important factor, possibly in the presence of one or more other "nuisance" factors.

· Confounding: A confounding design is one where some treatment effects (main or interactions) are estimated by the same linear combination of the experimental observations as some blocking effects. In this case, the treatment effect and the blocking effect are said to be confounded. Confounding is also used as a general term to indicate that the value of a main effect estimate comes from both the main effect itself and also contamination or bias from higher order interactions. Note: Confounding designs naturally arise when full factorial designs have to be run in blocks and the block size is smaller than the number of different treatment combinations. They also occur whenever a fractional factorial design is chosen instead of a full factorial design.

· Crossed Factors: See factors below.

· Design: A set of experimental runs which allows you to fit a particular model and estimate your desired effects.

· Design Matrix: A matrix description of an experiment that is useful for constructing and analyzing experiments.

· Effect: How changing the settings of a factor changes the response. The effect of a single factor is also called a main effect. Note: For a factor A with two levels, scaled so that low = -1 and high = +1, the effect of A is estimated by subtracting the average response when A = -1 from the average response when A = +1 and dividing the result by 2 (division by 2 is needed because the -1 level is 2 scaled units away from the +1 level).
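With ȳ denoting average responses, the note above can be written compactly as:

```latex
\mathrm{Effect}(A) = \frac{\bar{y}_{A=+1} - \bar{y}_{A=-1}}{2}
```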

· Error: Unexplained variation in a collection of observations. Note: DOEs typically require understanding of both random error and lack of fit error.

· Experimental Unit: The entity to which a specific treatment combination is applied. Note: an experimental unit can be a

  • PC board
  • silicon wafer
  • tray of components simultaneously treated
  • individual agricultural plant
  • plot of land
  • automotive transmission
  • etc.

· Factors: Process inputs an investigator manipulates to cause a change in the output. Some factors cannot be controlled by the experimenter but may affect the responses. If their effect is significant, these uncontrolled factors should be measured and used in the data analysis. Note: The inputs can be discrete or continuous.

  • Crossed Factors: Two factors are crossed if every level of one occurs with every level of the other in the experiment.
  • Nested Factors: A factor "A" is nested within another factor "B" if the levels or values of "A" are different for every level or value of "B". Note: Nested factors or effects have a hierarchical relationship.

· Fixed Effect: An effect associated with an input variable that has a limited number of levels or in which only a limited number of levels are of interest to the experimenter.

· Interactions: Occurs when the effect of one factor on a response depends on the level of another factor(s).

· Lack of Fit Error: Error that occurs when the analysis omits one or more important terms or factors from the process model. Note: Including replication in a DOE allows separation of experimental error into its components: lack of fit and random (pure) error.

· Model: Mathematical relationship which relates changes in a given response to changes in one or more factors.

· Nested Factors: See factors above.

· Orthogonality: Two vectors of the same length are orthogonal if the sum of the products of their corresponding elements is 0. Note: An experimental design is orthogonal if the effects of any factor balance out (sum to zero) across the effects of the other factors.
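As a small worked example, the two factor columns of a 2² full factorial design are orthogonal:

```latex
A = (-1, +1, -1, +1), \qquad B = (-1, -1, +1, +1), \qquad
\sum_i A_i B_i = 1 - 1 - 1 + 1 = 0
```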

· Random Effect: An effect associated with input variables chosen at random from a population having a large or infinite number of possible values.

· Random error: Error that occurs due to natural variation in the process. Note: Random error is typically assumed to be normally distributed with zero mean and a constant variance. Note: Random error is also called experimental error.

· Randomization: A schedule for allocating treatment material and for conducting treatment combinations in a DOE such that the conditions in one run neither depend on the conditions of the previous run nor predict the conditions in the subsequent runs. Note: The importance of randomization cannot be over stressed. Randomization is necessary for conclusions drawn from the experiment to be correct, unambiguous and defensible.

· Replication: Performing the same treatment combination more than once. Note: Including replication allows an estimate of the random error independent of any lack of fit error.

· Resolution: A term which describes the degree to which estimated main effects are aliased (or confounded) with estimated two-factor interactions, three-factor interactions, etc. In general, the resolution of a design is one more than the smallest order interaction that some main effect is confounded (aliased) with. If some main effects are confounded with some two-factor interactions, the resolution is 3. Note: Full factorial designs have no confounding and are said to have resolution "infinity". For most practical purposes, a resolution 5 design is excellent and a resolution 4 design may be adequate. Resolution 3 designs are useful as economical screening designs.

· Responses: The output(s) of a process. Sometimes called dependent variable(s).

· Response Surface Designs: A DOE that fully explores the process window and models the responses. Note: These designs are most effective when there are fewer than 5 factors. Quadratic models are used for response surface designs and at least three levels of every factor are needed in the design.

· Rotatability: A design is rotatable if the variance of the predicted response at any point x depends only on the distance of x from the design center point. A design with this property can be rotated around its center point without changing the prediction variance at x. Note: Rotatability is a desirable property for response surface designs (i.e. quadratic model designs).

· Scaling Factor Levels: Transforming factor levels so that the high value becomes +1 and the low value becomes -1.

· Screening Designs: A DOE that identifies which of many factors have a significant effect on the response. Note: Typically screening designs have more than 5 factors.

· Treatment: A treatment is a specific combination of factor levels whose effect is to be compared with other treatments.

· Treatment Combination: The combination of the settings of several factors in a given experimental trial. Also known as a run.

· Variance Components: Partitioning of the overall variation into assignable components.

Sunday, July 17, 2011

zz: SAS Technical Interview Questions

You can go into a SAS interview with more confidence if you know that you are prepared to respond to the kind of technical questions that an interviewer might ask you. I do not provide the specific answers here, both because these questions can be asked in a variety of ways and because it is not my objective to help those who have little actual interest in SAS to bluff their way through a SAS technical interview. The discussion here, though, may give you an idea of whether you have fully considered the implications contained in the questions.

Key concepts

A SAS technical interview typically starts with a few of the key concepts that are essential in SAS programming. These questions are intended to separate those who have actual substantive experience with SAS from those who have used it in only a very limited or superficial way. If you have spent more than a hundred hours reading and writing SAS programs, it is safe to assume that you are familiar with topics such as these:
  • SORT procedure
  • Data step logic
  • KEEP=, DROP= dataset options
  • Missing values
  • Reset to missing, or the RETAIN statement
  • Log
  • Data types
  • FORMAT procedure for creating value formats
  • IN= dataset option

Tricky Stuff

After the interviewer is satisfied that you have used SAS to do a variety of things, you are likely to get some more substantial questions about SAS processing. These questions typically focus on some of the trickier aspects of the way SAS works, not because the interviewer is trying to trick you, but to give you a chance to demonstrate your knowledge of the details of SAS processing. At the same time, you can show how you approach technical questions and issues, and that is ultimately more important than your knowledge of any specific feature in SAS.

STOP statement

The processing of the STOP statement itself is ludicrously simple. However, when you explain the how and why of a STOP statement, you show that you understand:
  • How a SAS program is divided into steps, and the difference between a data step and a proc step
  • The automatic loop in the data step
  • Conditions that cause the automatic loop to terminate, or to fail to terminate
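One common case is a sketch like the following, where direct access with the POINT= option disables the automatic end-of-file check, so the STOP statement is the only thing that ends the automatic loop (the dataset name "mydata" here is hypothetical):

```sas
data subset;
  do obs = 10, 5, 2;
    set mydata point=obs;   /* read three specific observations directly */
    output;
  end;
  stop;   /* without STOP, the automatic loop would never terminate */
run;
```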

RUN statement placement

The output of a program may be different based on whether a RUN statement comes before or after a global statement such as an OPTIONS or TITLE statement. If you are aware of this issue, it shows that you have written SAS programs that have more than the simplest of objectives. At the same time, your comments on this subject can also show that you know:
  • The distinction between data step statements, proc step statements, and global statements
  • How SAS finds step boundaries
  • The importance of programming style
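A minimal sketch of the issue: a global statement takes effect as soon as SAS reads it, and a proc step does not execute until SAS reaches its step boundary, so a TITLE placed between PROC and RUN still applies to that step.

```sas
title "First Title";
proc print data=sashelp.class;
run;                    /* this step prints with "First Title" */

proc print data=sashelp.class;
title "Second Title";   /* reaches SAS before the RUN boundary,   */
run;                    /* so this step prints with "Second Title" */
```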

SUM or +

Adding numbers with the SUM function provides the same result that you get with the + numeric operator. For example, SUM(8, 4, 3) provides the same result as 8 + 4 + 3. Sometimes, though, you prefer to use the SUM function, and at other times, the + operator. As you explain this distinction, you can show that you understand:
  • Missing values
  • Propagation of missing values
  • Treatment of missing values in statistical calculations in SAS
  • Why it matters to handle missing values correctly in analytic processing
  • The use of 0 as an argument in the SUM function to ensure that the result is not a missing value
  • The performance differences between functions and operators
  • Essential ideas of data cleaning
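A short sketch of the core distinction (variable names are hypothetical): the + operator propagates missing values, while the SUM function ignores them.

```sas
data demo;
  x1 = 8; x2 = .; x3 = 3;
  plus_total = x1 + x2 + x3;        /* missing: . propagates through +   */
  sum_total  = sum(x1, x2, x3);     /* 11: SUM ignores the missing value */
  safe_total = sum(0, x1, x2, x3);  /* the 0 argument guarantees a       */
                                    /* non-missing result even if every  */
                                    /* x variable is missing             */
run;
```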

Statistics: functions vs. proc steps

Computing a statistic with a function, such as the MEAN function, is not exactly the same as computing the same statistic with a procedure, such as the UNIVARIATE procedure. As you explain this distinction, you show that you understand:
  • The difference between summarizing across variables and summarizing across observations
  • The statistical concept of degrees of freedom as it relates to the difference between sample statistics and population statistics, and the way this is implemented in some SAS procedures with the VARDEF= option
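A minimal sketch of the first point: the MEAN function summarizes across variables within one observation, while a procedure such as PROC MEANS summarizes across observations within one variable.

```sas
data scores;
  input x1 x2 x3;
  row_mean = mean(x1, x2, x3);   /* one mean per observation (across variables) */
  datalines;
1 2 3
4 5 6
;
run;

proc means data=scores mean;     /* one mean per variable (across observations) */
  var x1 x2 x3;
run;
```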

REPLACE= option

Many SAS programmers never have occasion to use the REPLACE= dataset option or system option, but if you are familiar with it, then you have to be aware of:
  • The distinction between the input dataset and the output dataset in a step that makes changes in a set of data
  • The general concept of name conflicts in programming theory
  • Issues of programming style related to name conflicts
  • How the system option compares to the corresponding dataset option
A question on this topic may also give you the opportunity to mention syntax check mode and issues of debugging SAS programs.

WHERE vs. IF

Sometimes, it makes no difference whether you use a WHERE statement or a subsetting IF statement. Sometimes it makes a big difference. In explaining this distinction, you have the opportunity to discuss:
  • The distinction between data steps and proc steps
  • The difference between declaration (declarative) statements and executable (action) statements
  • The significance of the sequence of executable statements in a data step
  • Some of the finer points of merging SAS datasets
  • A few points of efficiency theory (although tests do not seem to bear the theory out in this case)
  • The origin of the WHERE clause in SQL (of course, bring this up only if you’re good at SQL)
  • WHERE operators that are not available in the IF statement or other data step statements
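A sketch of the key behavioral difference, using the SASHELP.CLASS sample dataset: WHERE filters observations before they enter the step, while a subsetting IF executes in sequence with the other data step statements, so it can test a variable created within the step.

```sas
data tall;
  set sashelp.class;
  where height > 60;          /* applied as observations are read in */
run;

data tall_metric;
  set sashelp.class;
  height_m = height * 0.0254; /* convert inches to meters */
  if height_m > 1.52;         /* subsetting IF can test the new      */
                              /* variable; a WHERE statement cannot  */
run;
```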

Compression

Compressing a SAS dataset is easy to do, so questions about it have more to do with determining when it is a good idea. You can weigh efficient use of storage space against efficient use of processing power, for example. Explain how you use representative data and performance measurements from SAS to test efficiency techniques, and you establish yourself as a SAS programmer who is ready to deal with large volumes of data. If you can explain why compression is effective in SAS datasets and observations larger than a certain minimum size and why binary compression works better than character compression for some kinds of data, then it shows you take software engineering seriously.
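Mechanically, compression is just an option; a sketch (the source dataset "work.raw" is hypothetical):

```sas
data work.big(compress=binary);  /* COMPRESS=CHAR suits repeated       */
  set work.raw;                  /* characters; COMPRESS=BINARY often  */
run;                             /* does better on numeric-heavy data  */
/* The log reports the percentage reduction achieved, which is the
   measurement to weigh against the extra CPU cost. */
```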

Macro processing

Almost the only reason interviewers ask about macros is to determine whether you appreciate the distinction between preprocessing and processing. Most SAS programmers are somewhat fuzzy about this, so if you have it perfectly clear in your mind, that makes you a cut above the rest — and if not, at least you should know that this is a topic you have to be careful about. There are endless technical issues with SAS macros, such as the system options that determine how much shows up in the log; your experience with this is especially important if the job involves maintaining SAS code written with macros.
SAS macro language is somewhat controversial, so be careful how you express your opinion of it. To some managers, macro use is what distinguishes real SAS programmers from the pretenders, but to others, relying on macros all the time is a sure sign of a lazy, fuzzy-headed programmer. If you are pressed on this, it is probably safe to say that you are happy to work with macros or without them, depending on what the situation calls for.

Procedure vs. macro

The question, “What is the difference between a procedure and a macro?” can catch you off guard if it has never occurred to you to think of them as having anything in common. It can mystify you in a completely different way if you have thought of procedures and macros as interchangeable parts. You might mention:
  • The difference between generating SAS code, as a macro usually does, and taking action directly on SAS data, as a procedure usually does
  • What it means, in terms of efficiency, for a procedure to be a compiled program
  • The drastic differences in syntax between a proc step and a macro call
  • The IMPORT and EXPORT procedures, which with some options generate SAS statements much like a macro
  • The %SYSFUNC macro function and %SYSCALL macro statement that allow a macro to take action directly on SAS data, much like a procedure

Scope of macro variables

If the interviewer asks a question about the scope of macro variables or the significance of the difference between local and global macro variables, the programming concept of scope is being used to see how you handle the new ways of thinking that programming requires. The possibility that the same name could be used for different things at different times is one of the more basic philosophical conundrums in computer programming. If you can appreciate the difference between a name and the object that the name refers to, then you can probably handle all the other philosophical challenges of programming.
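A small sketch of local versus global scope (macro and variable names are hypothetical): %LOCAL confines a macro variable to the macro's own symbol table, so it cannot clobber a same-named global variable.

```sas
%let status = global;

%macro demo;
  %local status;             /* without this, the %LET below would     */
  %let status = local;       /* overwrite the global STATUS variable   */
  %put Inside macro: &status;
%mend demo;

%demo
%put After macro: &status;   /* still resolves to "global" */
```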

Run groups

Run-group procedures are not a big part of base SAS, so a question about run-group processing and the difference between the RUN and QUIT statements probably has more to do with:
  • What a procedure is
  • What a step is
  • All the work SAS has to go through as it alternately acquires a part of the SAS program from the execution queue, then executes that part of the program
  • Connecting the program and the log messages

SAS date values

Questions about SAS date values have less to do with whether you have memorized the reference point of January 1, 1960, than with whether you understand the implications of time data treated as numeric values, such as:
  • Using a date format to display the date variable in a meaningful way
  • Computing a length of time by subtracting SAS date values
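Both bullets fit in a two-line sketch: SAS date values are integer day counts, so subtraction gives elapsed days and a format controls the display.

```sas
data dates;
  start  = '01JAN2010'd;
  finish = '15MAR2010'd;
  elapsed = finish - start;     /* 73 days */
  format start finish date9.;   /* display as 01JAN2010, not the raw
                                   day count 18263                    */
run;
```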

Efficiency techniques

With today’s bigger, faster computers, efficiency is a major concern only for the very largest SAS projects. If you get a series of technical questions about efficiency, it could mean one of the following:
  • The employer is undertaking a project with an especially large volume of data
  • The designated computer is not one of today’s bigger, faster computers
  • The project is weighed down with horrendously inefficient code, and they are hoping you will be able to clean it all up
On the other hand, the interviewer may just be trying to gauge how well you understand the way SAS statements correspond to the actions the computer takes or how seriously you take the testing process for a program you write.

Debugger

Most SAS programmers never use the data step debugger, so questions about it are probably intended to determine how you feel about debugging — does the debugging process bug you, or is debugging one of the most essential things you do as a programmer?

Informats vs. formats

If you appreciate the distinction between informats and formats, it shows that:
  • You can focus on details
  • It doesn’t confuse you that two routines have the same name
  • You have some idea of what is going on when a SAS program runs

TRANSPOSE procedure

The TRANSPOSE procedure has a few important uses, but questions about it usually don’t have that much to do with the procedure itself. The intriguing characteristic of the TRANSPOSE procedure is that input data values determine the names of output variables. The implication of this is that if the data values are incorrect, the program could end up with the wrong output variables. In what other ways does a program depend on having valid or correct data values as a starting point? What does it take to write a program that will run no matter what input data values are supplied?

_N_

Questions about the automatic variable _N_ (this might be pronounced “underscore N underscore” or just “N”) are meant to get at your understanding of the automatic actions of the data step, especially the automatic data step loop, also known as the observation loop.
A possible follow-up question asks how you can store the value of _N_ in the output SAS dataset. If you can answer this, it may show that you know the properties of automatic variables and know how to create a variable in the data step.
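The follow-up has a one-line answer: copy _N_ into an ordinary variable, since automatic variables are not written to the output dataset.

```sas
data numbered;
  set sashelp.class;
  obs_number = _n_;   /* ordinary variable, so it is kept in the output */
run;
```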

PUT function

A question about the PUT function might seem to be a trick question, but it is not meant to be. Beyond showing that you aren’t confused by two things as different as a statement and a function having the same name, your discussion of the PUT function can show:
  • An understanding of what formats are
  • Your experience in creating variables in data step statements
  • A few of the finer points of SQL query optimization
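A minimal sketch of the function itself (variable names are hypothetical): the PUT function applies a format to a value and returns a character string, unlike the PUT statement, which writes text to an output destination.

```sas
data conv;
  amount = 1234.5;
  amount_char = put(amount, dollar10.2);  /* character result such as
                                             "$1,234.50"              */
run;
```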

Important SAS trivia

Some SAS trivia may be important to know in a technical interview, even though it may never come up in your actual SAS programming work.
  • MERGE is a data step statement only. There is no MERGE procedure. “PROC MERGE” is a mythical construction created years ago by Rhena Seidman, and if you are asked about it in a job interview, it is meant as a trick question.
  • It is possible to use the MERGE statement without a BY statement, but this usually occurs by mistake.
  • SAS does not provide an easy way to create a procedure in a SAS program. However, it is easy to define informats and formats and use them in the same program. Beginning with SAS 9.2, the same is true of functions.
  • The MEANS and SUMMARY procedures are identical except for the defaults for the PRINT option and VAR statement.
  • Much of the syntax of the TABULATE procedure is essentially the same as that of the SUMMARY procedure.
  • CARDS is another name for DATALINES (or vice versa).
  • “DATA _NULL_” is commonly used as a code word to refer to data step programming that creates print output or text data files.
  • The program data vector (PDV) is a logical block of data that contains the variables used in a data step. Variables are added to the program data vector in order of appearance, and this is what determines their position (or variable number) attribute.