Wednesday, December 29, 2010

PROC SORT NODUPKEY NODUP DIFFERENCE ZZ

A common interview question for SAS jobs is "What is the difference between proc sort nodup and proc sort nodupkey?". The answer the interviewer is expecting is usually "proc sort nodup gets rid of duplicate records with the same sort key but proc sort nodupkey gets rid of other records with the same sort key". However, this is not correct.
"nodup" is an alias for "noduprecs" which appears to mean "no duplicate records" but there is no way sas can know about these duplicate records unless they, by chance, land next to each other in sequence. That is a matter of chance. Take a look at "nodup" at work. Note the record with the "extra" value of 3. It's still there after the "nodup" sort.
 
data test1;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1;
run;


Obs    id1    id2    extra
 1     aa     ab       3
 2     aa     ab       1
 3     aa     ab       2
 4     aa     ab       3

Now look again where the two records with an "extra" value of 3 are next to each other in the input dataset. This time it has been removed by "nodup".
 
data test2;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 3
aa ab 2
aa ab 1
;
proc sort nodup data=test2;
by id1 id2;
run;
options nocenter;
proc print data=test2;
run;


Obs    id1    id2    extra
 1     aa     ab       3
 2     aa     ab       2
 3     aa     ab       1

If you sort "nodupkey" then you will only be left with one record with that key combination in the above case as you can see below.
 
data test3;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 3
aa ab 2
aa ab 1
;
proc sort nodupkey data=test3;
by id1 id2;
run;
options nocenter;
proc print data=test3;
run;


Obs    id1    id2    extra
 1     aa     ab       3

It is a big mistake to think sorting "nodup" will remove duplicate records. Sometime it will, sometime it won't. The only way you can be sure of removing duplicate records is to "proc sort nodupkey" and include enough key variables to be sure you will lose the duplicates you want to lose. In the case shown above, then if we knew of the same "extra" values being duplicates we wanted to remove then this variable should be included in the list of sort variables and then "nodupkey" will remove the duplicates as shown below.
  data test4;
   input id1 $ id2 $ extra ;
 cards;
 aa ab 3
 aa ab 1
 aa ab 2
 aa ab 3
 ;
 proc sort nodupkey data=test4;
 by id1 id2 extra;
 run;
 options nocenter;
 proc print data=test4;
 run;


Obs    id1    id2    extra
 1     aa     ab       1
 2     aa     ab       2
 3     aa     ab       3

 **********************************************************************************
 
Common Programming Mistake with Proc Sort NODUPRECS

Every SAS programmer would know that Proc Sort NODUPRECS would remove exact duplicates…but often forgets / misuses it and alas gets into trouble.

It is very important that the programmer should properly code the variables in the BY statement. For the procedure to actually remove the duplicates, the duplicate records should be brought together by sorting the data with ALL the variables in the BY statement.

Let me explain it through an example here…Watch the lines in RED

data prdsal3;
  set sashelp.prdsal3;
  where product="SOFA"
    and date between "01mar1998"d and "30apr1998"d
    and state="California";
  keep State Date Actual;
run;

proc print data=prdsal3;
run;
Output:

                     Obs    STATE                           ACTUAL    DATE

                    1    California               $973.50    MAR98
                       2    California                     $566.50    APR98
                       3    California                     $845.90    MAR98
                       4    California                   $2,196.70    APR98
                       5    California               $973.50    MAR98
                       6    California                     $313.50    APR98
                       7    California                   $1,428.90    MAR98
                       8    California                   $1,446.50    APR98
                       9    California                   $1,144.00    MAR98
                      10    California                   $1,577.40    APR98
                      11    California                     $856.90    MAR98
                      12    California                   $1,380.50    APR98

Here the input data has 2 duplicates obs 1 and 5..
When a Programmer mistakenly codes like this..i.e. omitting the variable actual he get the output like the one below

proc sort data=prdsal3 out=sortedrandomly noduprecs;
  by state date;
run;

proc print data=sortedrandomly;
run;
Output:

                     Obs    STATE                           ACTUAL    DATE

                       1    California               $973.50    MAR98
                       2    California                     $845.90    MAR98
                    3    California               $973.50    MAR98
                       4    California                   $1,428.90    MAR98
                       5    California                   $1,144.00    MAR98
                       6    California                     $856.90    MAR98
                       7    California                     $566.50    APR98
                       8    California                   $2,196.70    APR98
                       9    California                     $313.50    APR98
                      10    California                   $1,446.50    APR98
                      11    California                   $1,577.40    APR98
                      12    California                   $1,380.50    APR98
The output shows the sorted order by state and date…the duplicates moved to 1 and 3.
In the Output with the proc sort NODUPRECS did not eliminate the duplicates because the original duplicate records (1 and 5) were NOT sorted properly for NODUPRECS to eliminate consecutive Duplicate records…
The SAS documentation says that
NODUPRECS
checks for and eliminates duplicate observations. If you specify this option, then PROC SORT compares all variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.
Note:   See NODUPKEY for information about eliminating observations with duplicate BY values.  [cautionend]
Alias :
NODUP
Interaction:
When you are removing consecutive duplicate observations in the output data set with NODUPRECS, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.

When the Programmer codes it correctly with all the variables…the desired result is achieved…
proc sort data=prdsal3 out=prdsal3_sorted noduprecs;
  by state actual date;
run;

proc print data=prdsal3_sorted;
run;


                     Obs    STATE                           ACTUAL    DATE

                       1    California                     $313.50    APR98
                       2    California                     $566.50    APR98
                       3    California                     $845.90    MAR98
                       4    California                     $856.90    MAR98
                    5   California                $973.50    MAR98
                       6    California                   $1,144.00    MAR98
                       7    California                   $1,380.50    APR98
                       8    California                   $1,428.90    MAR98
                       9    California                   $1,446.50    APR98
                      10    California                   $1,577.40    APR98
                      11    California                   $2,196.70    APR98

The best possible to achieve this result is to use the _all_ auto variable in the BY statement instead of specifying all the variables.
proc sort data=prdsal3 out=prdsal3_byall noduprecs;
  by _all_;
run;

proc print data=prdsal3_byall;
run;


                     Obs    STATE                           ACTUAL    DATE

                       1    California                     $313.50    APR98
                       2    California                     $566.50    APR98
                       3    California                     $845.90    MAR98
                       4    California                     $856.90    MAR98
                    5    California               $973.50    MAR98
                       6    California                   $1,144.00    MAR98
                       7    California                   $1,380.50    APR98
                       8    California                   $1,428.90    MAR98
                       9    California                   $1,446.50    APR98
                      10    California                   $1,577.40    APR98
                      11    California                   $2,196.70    APR98

***********************************************************************************
NODUPKEY is like FIRST. processing. Both depend on order which is an alien concept to SQL. SQL depends on information stored as data, not in variable names and not in order.
If for example you had the variables GROUP and SEQ where SEQ is a sequence number within group, then you could use a GROUP BY GROUP clause with HAVING SEQ=1. In short there must be something in the data values that indicates which records you want.
The suggestion to use DISTINCT works when all records having the same value of GROUP have all other relevant variables with equal values. This would be equivalent to the NODUP option in PROC SORT.

简单一点说,nodupkey是比较by后面的关键值,把关键值相同的records都删除掉,这个by即起到sort排序的作用,又有确定比较关键值的作用,把by确定的关键词相同的records删掉;而nodup的by只起到sort按照那些变量排序的作用,排完序以后,把相邻的重复的records删掉,而不仅仅是by相同的那些。

SAS SET operator summary



1: A set operator works on the results of two SELECT clauses. This is unlike a join, which is implemented within the FROM clause of a single SELECT statement. 

2: Joins typically align rows and accrete (that is, accumulate or collect) columns; set operator align columns and accrete rows. That is, Joins perform side-by-side (horizontal) combinations, in contrast Set operator perform end-to-end (vertical) combination.

3: CORRESPONGING directs the SQL processor to align like-named columns.

4: When set operators are used, the programmer must make sure that aligned columns are compatible with respect to type.  

5: ALL is used to purge duplicate rows.

6: The accretion rule for the UNION operator is that a row appears in the result if it appears in either data source.

7: INTERSECT accepts only rows in both data source.

8: If F represents the number of times a particular row appears in the first source (the result of the first SELECT clause) and S represents the count from the second source, the row will appear MIN(F,S) times in the INTERSECT result.

9: If F represents the number of times a particular row appears in the first source (the result of the first SELECT clause) and S represents the count from the second source, the row will appear MAX(0,F-S) times in the EXCEPT ALL result.


Tuesday, December 28, 2010

Multiple Local Symbol Tables

Suppose the following two macros, Outer and Inner, have been compiled. The macro named Outer creates a local macro variable named variX and assigns a value of one to it. Then Outer calls another macro program named Inner. The macro named Inner creates a local macro variable named variY and assigns the value of variX to it.
    %macro outer;
%local variX;
%let variX=one;
%inner
%mend outer;

%macro inner;
%local variY;
%let variY=&variX;
%mend inner;
Let's examine what happens to the symbol tables when you submit the following code:
%let variX=zero;
%outer

  1. The macro processor receives %let variX=zero;. It checks the global symbol table for a macro variable named variX. There is none, so the macro processor creates variX and assigns a value of zero to it.
    Global Symbol Table
    variX zero


  2. The macro processor receives %outer. The macro processor retrieves the macro Outer from Work.Sasmacr, then begins executing it.

  3. The macro processor encounters %local variX;. It creates a local symbol table. The macro processor creates the macro variable variX in this local table and assigns a null value to it. This does not affect the macro variable variX that is stored in the global symbol table.
    Global Symbol Table
    variX zero
    Outer Local Symbol Table
    variX      

  4. The macro processor encounters %let variX=one;. The macro processor checks the local symbol table for variX and assigns a value of one to it.
    Global Symbol Table
    variX zero
    Outer Local Symbol Table
    variX one

  5. The macro processor receives %inner. It retrieves the macro Inner from Work.Sasmacr, then begins executing it.

  6. The macro processor encounters %local variY;. It creates a local symbol table. The macro processor creates a macro variable variY in this table and assigns a null value to it. There are now two local symbol tables in existence.
    Global Symbol Table
    variX zero
    Outer Local Symbol Table
    variX one
    Inner Local Symbol Table
    variY      

  7. The macro processor encounters %let variY=&variX;. It checks the most recently created local table for variX. There is no such macro variable in that symbol table, so the macro processor then checks the other local symbol table. It retrieves the value one from that symbol table and substitutes the value into the %LET statement. Then the macro processor checks the most recently created local symbol table for a macro variable named variY. When it finds this macro variable, it assigns the value one to it.
    Global Symbol Table
    variX zero
    Outer Local Symbol Table
    variX one
    Inner Local Symbol Table
    variY one

  8. The Inner macro finishes executing, and the local symbol table that was created within this macro is deleted. There is now only one local symbol table in existence.
    Global Symbol Table
    variX zero
    Outer Local Symbol Table
    variX one

  9. The Outer macro finishes executing, and the local symbol table that was created within this macro is deleted. There are now no local symbol tables in existence. The global symbol table has not been changed since variX was created and was assigned a value of zero.
    Global Symbol Table
    variX zero

As you can see, each macro program in the example above has its own local symbol table that exists as long as the macro executes. When a macro finishes executing, its local symbol table and all of the local macro variables that are contained in that table are erased. The global symbol table and all of the global macro variables that are contained in it remain.

Monday, December 27, 2010

Sum the columns with irregular names using DICTIONARY.COLUMNS and MACROS

/* Here is an example of using SAS DICTIONARY. If using original data set wide and variables summer1-summer4 which is a string with suffixes having the same construction, it's easy to get the sum of each column in proc sql using macros. However, if we change the name to the different one with no similar structure, then we need to use DICTIONARY to help us with it.  */


data wide;
   input summer1-summer4;
   cards;
   1 2 3 4
   1 2 3 4
   ;
run;

proc datasets;
   modify wide;
   rename summer1 = salary
          summer2 = net
                summer3 = gross
                summer4=adjusted
                ;
run;
quit;

proc sql;
  select name, type
  from dictionary.columns
  where libname='WORK' and    /* here WORK is case sensitive */
        memname='WIDE'        /* here WIDE is case sensitive */
       ;
quit;

proc sql;
   select 'sum(' || trim(name) || ') as sum_' || name
   into :selections separated by ' , '
   from dictionary.columns
   where libname='WORK' and   /* here WORK is case sensitive */
         memname='WIDE' and   /* here WIDE is case sensitive */
               type   ='num'        /* here num is case sensitive */
       ;
quit;

%put &selections;

proc sql;
   select &selections
   from wide
   ;
quit;