ibmi-brunch-learn

Announcement

Collapse
No announcement yet.

Advice on most efficient way to find duplicate codes

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Advice on most efficient way to find duplicate codes

    I have a file that has 10 fields, an account number and then 9 3-char code fields. I want to find any records that have duplicate codes in those code fields and remove the duplicates. I can easily write a program to trudge through and do this, but I started wondering what is the most efficient way to accomplish it?

    So, for example, let's say my file looks something like this, where you can see that a few of the records have a code that is duplicated among the 9 code fields:
    ACCT CODE1 CODE2 CODE3 CODE4 CODE5 CODE6 CODE6 CODE8 CODE9
    12345 ABC DEF GHI JKL MNO PQR STU VWX YZA
    12350 GHI PQR ABC STU ABC DEF JKL
    12355 DEF DEF ABC DEF MNO PQR STU VWX
    12360 YZA GHI JKL GHI ABC JKL






    So in the above example, Account 12350 has the code ABC duplicated, i.e. it exists in field CODE3 and also CODE5. Account 12355 has DEF in three of it's code fields. Account 12360 has two different code values that are duplicated, bot GHI and JKL.

    I can read each record, and then compare CODE1 to each of the other code fields CODE2, CODE3, ..., CODE9., and then compare CODE2 to CODE3, CODE4, ..., CODE9., and then compare CODE3 to CODE4, CODE5, ..., CODE9, etc., until I finally compare CODE 8 to CODE 9. But is there a smarter/more efficient way? Using SCANRPL? Clever use of arrays? Using SQL? Have to make sure we don't falsely identify a value that spans two code fields.

    So how would you do it?
    Last edited by Viking; June 29, 2018, 01:31 PM.

  • #2
    Here is an option with SQL off the top of my head that I think will work. This is just a select but you should be able to use in the 'Set' part of update statement or put the select in a CTE and use that to update. This is assuming that there aren't too many Code columns. This also assumes that you don't care if there are blank code values in the middle, for example blanking Code5 in Acct 12350. If you need to slide them all to the left you can do that after blanking out the duplicates. You can run the Select including the original Code columns to see what the update is going to do prior to updating.

    Code:
    select 
        case when Code9 in (Code8, Code7, Code6, Code5, Code4, Code3, Code2, Code1) then '' else Code9 end as newCode9
        case when Code8 in ( Code7, Code6, Code5, Code4, Code3, Code2, Code1) then '' else Code8 end as newCode8
        case when Code7 in ( Code6, Code5, Code4, Code3, Code2, Code1) then '' else Code7 end as newCode7
        --  each row only compares to Codes prior to it in the list.
    from YourTable

    Comment


    • #3
      Here's another SQL option that will identify which code columns are the duplicates:
      Code:
      with codes (acct, code, codevalue) as
         (select acct, 'CODE1', code1 from codetable
          union all select acct, 'CODE2', code2 from codetable
          ...
          union all select acct, 'CODE9', code9 from codetable),
      codecount as
         (select acct, codevalue from codes where codevalue <> ' ' group by acct, codevalue having count(*) > 1)
      select a.* from codes a join codecount b on a.acct=b.acct and a.codevalue=b.codevalue order by a.acct

      Comment


      • #4
        Starting with a table which looks like this.
        12345 ABC DEF GHI JKL MNO PQR STU VWX YZA
        12350 GHI PQR ABC STU ABC DEF JKL
        12355 DEF DEF ABC DEF MNO PQR STU VWX
        12360 YZA GHI JKL HI ABC JKL
        Code:
        [FONT=courier new]Merge Into QTEMP.CODE400TBL As TGT
        Using
        (Select
        A.ACCT,
        Max(CODE1) As CODE1,
        Max(CODE2) As CODE2,
        Max(CODE3) As CODE3,
        Max(CODE4) As CODE4,
        Max(CODE5) As CODE5,
        Max(CODE6) As CODE6,
        Max(CODE7) As CODE7,
        Max(CODE8) As CODE8,
        Max(CODE9) As CODE9
        From
        (Select
        COLFND,ACCT,CODE,Row_Number() Over(Partition By ACCT Order By ACCT,COLFND) As PUTCOL
        From
        (Select
        Min(ORD) As COLFND,
        B.ACCT,
        B.CODE
        From
        QTEMP.CODE400TBL As A
        Join Lateral
        (Values
        (1,ACCT,CODE1),
        (2,ACCT,CODE2),
        (3,ACCT,CODE3),
        (4,ACCT,CODE4),
        (5,ACCT,CODE5),
        (6,ACCT,CODE6),
        (7,ACCT,CODE7),
        (8,ACCT,CODE8),
        (9,ACCT,CODE9)) As B (ORD,ACCT,CODE)
        On
        A.ACCT = B.ACCT And B.CODE <> ''
        Group By
        B.ACCT,B.CODE) As A
        Order By
        ACCT,COLFND) As A
        Join Lateral
        (Values
        (1,A.ACCT,A.CODE,'','','','','','','',''),
        (2,A.ACCT,'',A.CODE,'','','','','','',''),
        (3,A.ACCT,'','',A.CODE,'','','','','',''),
        (4,A.ACCT,'','','',A.CODE,'','','','',''),
        (5,A.ACCT,'','','','',A.CODE,'','','',''),
        (6,A.ACCT,'','','','','',A.CODE,'','',''),
        (7,A.ACCT,'','','','','','',A.CODE,'',''),
        (8,A.ACCT,'','','','','','','',A.CODE,''),
        (9,A.ACCT,'','','','','','','','',A.CODE)) As B (COLNUM,ACCT,CODE1,CODE2,CODE3,CODE4,CODE5,CODE6,CODE7,CODE8,CODE9)
        On
        A.PUTCOL = B.COLNUM
        Group By
        A.ACCT
        Order By
        A.ACCT) As SRC
        On
        TGT.ACCT = SRC.ACCT
        When Matched Then
        Update Set (TGT.CODE1,TGT.CODE2,TGT.CODE3,TGT.CODE4,TGT.CODE5,TGT.CODE6,TGT.CODE7,TGT.CODE8,TGT.CODE9)
        = (SRC.CODE1,SRC.CODE2,SRC.CODE3,SRC.CODE4,SRC.CODE5,SRC.CODE6,SRC.CODE7,SRC.CODE8,SRC.CODE9)
        Atomic[/FONT]
        After the merge statement completes, the table now looks like this.
        12345 ABC DEF GHI JKL MNO PQR STU VWX YZA
        12350 GHI PQR ABC STU DEF JKL
        12355 DEF ABC MNO PQR STU VWX
        12360 YZA GHI JKL HI ABC
        Jim

        Comment


        • #5
          Just for fun ... another way to find them is to "unpivot" them (columns to rows), then group them and find the duplicates with HAVING:
          PHP Code:
          with foo                                                        
          acct,code1,code2,code3,code4,code5,code6,code7,code8,code9 )  
          as ( 
          values                                                    
           
          12345,'ABC','DEF','GHI','JKL','MNO','PQR','STU','VWX','YZA'),
           ( 
          12350,'GHI','PQR','ABC','STU','ABC','DEF','JKL','   ','   '),
           ( 
          12355,'DEF','DEF','ABC','DEF','MNO','PQR','STU','VWX','   '),
           ( 
          12360,'YZA','GHI','JKL','GHI','ABC','JKL','   ','   ','   ')
              )                                                          
          select                                                          
           acct
          valcount(*)                                            
          from                                                            
           foo 
          as f  ,                                                    
           
          lateral (values f.code1 ) ,                                  
                           ( 
          f.code2 ) ,                                  
                           ( 
          f.code3 ) ,                                  
                           ( 
          f.code4 ) ,                                  
                           ( 
          f.code5 ) ,                                  
                           ( 
          f.code6 ) ,              
                           ( 
          f.code7 ) ,              
                           ( 
          f.code8 ) ,              
                           ( 
          f.code9 )) as val )  
          where VAL <> ' '                            
          group by ACCTVAL                          
          having count
          (*) > 1                          
          order by ACCT
          VAL 
          Which gives:
          Code:
          ACCT     VAL    COUNT ( * )
          12,350   ABC              2
          12,355   DEF              3
          12,360   GHI              2
          12,360   JKL              2
          Last edited by WilliamTasker; June 30, 2018, 03:48 AM.

          Comment


          • #6
            Jim_IT, that is amazing. It is clear to me that while you are diving deep down into the vast and murky waters of SQL knowledge, I am merely snorkeling along the surface.

            After editing to my actual field names, etc., I tested this on a copy of my production file and it worked perfectly. It removed the duplicates and scooted everything to the left-most code fields so there are no blank fields in between populated fields. This is exactly what I wanted. I will test this a bit more and then start running it daily over production. And of course I need to study it some more in order to understand it all. Thank you!

            Thanks to everyone who posted their ideas here... obviously SQL was the tool of choice and this is all good stuff that I'm still digesting... several ideas I will be able to use.
            Last edited by Viking; July 17, 2018, 06:00 PM.

            Comment


            • #7
              Thanks Viking.

              I work a lot with our development group showing these advanced techniques. I generally see somewhere between 100 and 1000 times improvement in processing time when converting RLA to data-centric SQL programming. I recently took an existing production process which runs 11 hours daily and using SQL, I was able to get it to run in 7 minutes, which is all I/O.

              Jim



              Comment


              • #8
                WilliamTasker, I just tried your SQL statement and it works great for finding the accounts with duplicates, and I will be able to use this unpivoting approach for lots of similar things going forward. Thanks!

                Comment

                Working...
                X