Wednesday, December 29, 2010

PROC SORT NODUPKEY NODUP DIFFERENCE ZZ

A common interview question for SAS jobs is "What is the difference between proc sort nodup and proc sort nodupkey?". The answer the interviewer is expecting is usually "proc sort nodup gets rid of duplicate records with the same sort key but proc sort nodupkey gets rid of other records with the same sort key". However, this is not correct.
"nodup" is an alias for "noduprecs" which appears to mean "no duplicate records" but there is no way sas can know about these duplicate records unless they, by chance, land next to each other in sequence. That is a matter of chance. Take a look at "nodup" at work. Note the record with the "extra" value of 3. It's still there after the "nodup" sort.
 
data test1;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1;
run;


Obs    id1    id2    extra
 1     aa     ab       3
 2     aa     ab       1
 3     aa     ab       2
 4     aa     ab       3

Now look again where the two records with an "extra" value of 3 are next to each other in the input dataset. This time it has been removed by "nodup".
 
data test2;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 3
aa ab 2
aa ab 1
;
proc sort nodup data=test2;
by id1 id2;
run;
options nocenter;
proc print data=test2;
run;


Obs    id1    id2    extra
 1     aa     ab       3
 2     aa     ab       2
 3     aa     ab       1

If you sort "nodupkey" then you will only be left with one record with that key combination in the above case as you can see below.
 
data test3;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 3
aa ab 2
aa ab 1
;
proc sort nodupkey data=test3;
by id1 id2;
run;
options nocenter;
proc print data=test3;
run;


Obs    id1    id2    extra
 1     aa     ab       3

It is a big mistake to think sorting "nodup" will remove duplicate records. Sometime it will, sometime it won't. The only way you can be sure of removing duplicate records is to "proc sort nodupkey" and include enough key variables to be sure you will lose the duplicates you want to lose. In the case shown above, then if we knew of the same "extra" values being duplicates we wanted to remove then this variable should be included in the list of sort variables and then "nodupkey" will remove the duplicates as shown below.
  data test4;
   input id1 $ id2 $ extra ;
 cards;
 aa ab 3
 aa ab 1
 aa ab 2
 aa ab 3
 ;
 proc sort nodupkey data=test4;
 by id1 id2 extra;
 run;
 options nocenter;
 proc print data=test4;
 run;


Obs    id1    id2    extra
 1     aa     ab       1
 2     aa     ab       2
 3     aa     ab       3

 **********************************************************************************
 
Common Programming Mistake with Proc Sort NODUPRECS

Every SAS programmer would know that Proc Sort NODUPRECS would remove exact duplicates…but often forgets / misuses it and alas gets into trouble.

It is very important that the programmer should properly code the variables in the BY statement. For the procedure to actually remove the duplicates, the duplicate records should be brought together by sorting the data with ALL the variables in the BY statement.

Let me explain it through an example here…Watch the lines in RED

data prdsal3;
  set sashelp.prdsal3;
  where product="SOFA"
    and date between "01mar1998"d and "30apr1998"d
    and state="California";
  keep State Date Actual;
run;

proc print data=prdsal3;
run;
Output:

                     Obs    STATE                           ACTUAL    DATE

                    1    California               $973.50    MAR98
                       2    California                     $566.50    APR98
                       3    California                     $845.90    MAR98
                       4    California                   $2,196.70    APR98
                       5    California               $973.50    MAR98
                       6    California                     $313.50    APR98
                       7    California                   $1,428.90    MAR98
                       8    California                   $1,446.50    APR98
                       9    California                   $1,144.00    MAR98
                      10    California                   $1,577.40    APR98
                      11    California                     $856.90    MAR98
                      12    California                   $1,380.50    APR98

Here the input data has 2 duplicates obs 1 and 5..
When a Programmer mistakenly codes like this..i.e. omitting the variable actual he get the output like the one below

proc sort data=prdsal3 out=sortedrandomly noduprecs;
  by state date;
run;

proc print data=sortedrandomly;
run;
Output:

                     Obs    STATE                           ACTUAL    DATE

                       1    California               $973.50    MAR98
                       2    California                     $845.90    MAR98
                    3    California               $973.50    MAR98
                       4    California                   $1,428.90    MAR98
                       5    California                   $1,144.00    MAR98
                       6    California                     $856.90    MAR98
                       7    California                     $566.50    APR98
                       8    California                   $2,196.70    APR98
                       9    California                     $313.50    APR98
                      10    California                   $1,446.50    APR98
                      11    California                   $1,577.40    APR98
                      12    California                   $1,380.50    APR98
The output shows the sorted order by state and date…the duplicates moved to 1 and 3.
In the Output with the proc sort NODUPRECS did not eliminate the duplicates because the original duplicate records (1 and 5) were NOT sorted properly for NODUPRECS to eliminate consecutive Duplicate records…
The SAS documentation says that
NODUPRECS
checks for and eliminates duplicate observations. If you specify this option, then PROC SORT compares all variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.
Note:   See NODUPKEY for information about eliminating observations with duplicate BY values.  [cautionend]
Alias :
NODUP
Interaction:
When you are removing consecutive duplicate observations in the output data set with NODUPRECS, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.

When the Programmer codes it correctly with all the variables…the desired result is achieved…
proc sort data=prdsal3 out=prdsal3_sorted noduprecs;
  by state actual date;
run;

proc print data=prdsal3_sorted;
run;


                     Obs    STATE                           ACTUAL    DATE

                       1    California                     $313.50    APR98
                       2    California                     $566.50    APR98
                       3    California                     $845.90    MAR98
                       4    California                     $856.90    MAR98
                    5   California                $973.50    MAR98
                       6    California                   $1,144.00    MAR98
                       7    California                   $1,380.50    APR98
                       8    California                   $1,428.90    MAR98
                       9    California                   $1,446.50    APR98
                      10    California                   $1,577.40    APR98
                      11    California                   $2,196.70    APR98

The best possible to achieve this result is to use the _all_ auto variable in the BY statement instead of specifying all the variables.
proc sort data=prdsal3 out=prdsal3_byall noduprecs;
  by _all_;
run;

proc print data=prdsal3_byall;
run;


                     Obs    STATE                           ACTUAL    DATE

                       1    California                     $313.50    APR98
                       2    California                     $566.50    APR98
                       3    California                     $845.90    MAR98
                       4    California                     $856.90    MAR98
                    5    California               $973.50    MAR98
                       6    California                   $1,144.00    MAR98
                       7    California                   $1,380.50    APR98
                       8    California                   $1,428.90    MAR98
                       9    California                   $1,446.50    APR98
                      10    California                   $1,577.40    APR98
                      11    California                   $2,196.70    APR98

***********************************************************************************
NODUPKEY is like FIRST. processing. Both depend on order which is an alien concept to SQL. SQL depends on information stored as data, not in variable names and not in order.
If for example you had the variables GROUP and SEQ where SEQ is a sequence number within group, then you could use a GROUP BY GROUP clause with HAVING SEQ=1. In short there must be something in the data values that indicates which records you want.
The suggestion to use DISTINCT works when all records having the same value of GROUP have all other relevant variables with equal values. This would be equivalent to the NODUP option in PROC SORT.

简单一点说,nodupkey是比较by后面的关键值,把关键值相同的records都删除掉,这个by即起到sort排序的作用,又有确定比较关键值的作用,把by确定的关键词相同的records删掉;而nodup的by只起到sort按照那些变量排序的作用,排完序以后,把相邻的重复的records删掉,而不仅仅是by相同的那些。

No comments:

Post a Comment