I have a file that has 10 fields, an account number and then 9 3-char code fields. I want to find any records that have duplicate codes in those code fields and remove the duplicates. I can easily write a program to trudge through and do this, but I started wondering what is the most efficient way to accomplish it?
So, for example, let's say my file looks something like this, where you can see that a few of the records have a code that is duplicated among the 9 code fields:
So in the above example, Account 12350 has the code ABC duplicated, i.e. it exists in field CODE3 and also CODE5. Account 12355 has DEF in three of it's code fields. Account 12360 has two different code values that are duplicated, bot GHI and JKL.
I can read each record, and then compare CODE1 to each of the other code fields CODE2, CODE3, ..., CODE9., and then compare CODE2 to CODE3, CODE4, ..., CODE9., and then compare CODE3 to CODE4, CODE5, ..., CODE9, etc., until I finally compare CODE 8 to CODE 9. But is there a smarter/more efficient way? Using SCANRPL? Clever use of arrays? Using SQL? Have to make sure we don't falsely identify a value that spans two code fields.
So how would you do it?
So, for example, let's say my file looks something like this, where you can see that a few of the records have a code that is duplicated among the 9 code fields:
ACCT | CODE1 | CODE2 | CODE3 | CODE4 | CODE5 | CODE6 | CODE6 | CODE8 | CODE9 |
12345 | ABC | DEF | GHI | JKL | MNO | PQR | STU | VWX | YZA |
12350 | GHI | PQR | ABC | STU | ABC | DEF | JKL | ||
12355 | DEF | DEF | ABC | DEF | MNO | PQR | STU | VWX | |
12360 | YZA | GHI | JKL | GHI | ABC | JKL |
So in the above example, Account 12350 has the code ABC duplicated, i.e. it exists in field CODE3 and also CODE5. Account 12355 has DEF in three of it's code fields. Account 12360 has two different code values that are duplicated, bot GHI and JKL.
I can read each record, and then compare CODE1 to each of the other code fields CODE2, CODE3, ..., CODE9., and then compare CODE2 to CODE3, CODE4, ..., CODE9., and then compare CODE3 to CODE4, CODE5, ..., CODE9, etc., until I finally compare CODE 8 to CODE 9. But is there a smarter/more efficient way? Using SCANRPL? Clever use of arrays? Using SQL? Have to make sure we don't falsely identify a value that spans two code fields.
So how would you do it?
Comment