Compare between column in one file

01-09-2012

Registered User

5, 0

Join Date: Oct 2008

Last Activity: 10 January 2012, 10:22 AM EST

Posts: 5

Thanks Given: 1

Thanked 0 Times in 0 Posts

Compare between column in one file

Hi all, can anyone help me. I want to compare values within column in one file using awk.

Below are the sample input file(1st file):

Code:

8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||

below are the comparison files (2nd file):

Code:

20735704|20735505|20735604|700333|700666|

What I want to get is, if the 1st file column 2,4,6,8,10,..42 (even column under 43) have data from 2nd file so it will output all the data from 1st row + multiple on 2nd file data.

Below are the sample output

Code:

8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||Multiple on 20735704 20735505|
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333|
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||Multiple on 700333 700666 20735604|
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333 700666| 
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||No Multiple data|

Can you guys help me? really appreciate the help, and FYI the data that I'm handling is around 20,000,000 rows.

Moderator's Comments:

How to use code tags

Last edited by Franklin52; 01-09-2012 at 06:31 AM.. Reason: Please use code tags for code and data samples, thank you

lurak

View Public Profile for lurak

Find all posts by lurak

01-09-2012

Registered User

628, 174

Join Date: Oct 2010

Last Activity: 2 December 2017, 5:58 AM EST

Location: Madrid, Spain

Posts: 628

Thanks Given: 8

Thanked 174 Times in 171 Posts

Hi lurak,

This is my try with awk, but last line gives me one coincidence:

Code:

$ cat file1
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||
$ cat file2
20735704|20735505|20735604|700333|700666|
$ cat script.awk
BEGIN {
        FS = "|"
        if ( ARGC != 3 ) {
                print "Usage: awk -f script.awk <file1> <file2>"
                exit 0
        }

        dummy = ARGV[1]
        ARGV[1] = ARGV[2]
        ARGV[2] = dummy
}

## Process 'file2' of input parameters.
FNR == NR {
        for ( i = 1; i <= NF; i++ ) {
                if ( length( $i ) > 0 ) {
                        comparison[ $i ] = 1
                }

        }
}

## Process 'file1' of input parameters.
FNR < NR {
        for ( i = 2; i <= 42; i += 2 ) {
                if ( comparison[ $i ] ) {
                        multiples = multiples " " $i
                }
        }

        if ( length( multiples ) > 0 ) {
                multiples = "Multiple on" multiples "|"
        } else {
                multiples = "No multiple data|"
        }

        print $0 multiples
        multiples = ""
}
$ awk -f script.awk file1 file2 | cat -n
     1  8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||Multiple on 20735704 20735505|
     2  8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333|
     3  8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||Multiple on 700333 700666 20735604|
     4  83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333 700666|
     5  83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||Multiple on 20735704|

Regards,
Birei

birei

View Public Profile for birei

Find all posts by birei

01-09-2012

Registered User

286, 4

Join Date: Nov 2011

Last Activity: 13 February 2016, 2:18 AM EST

Location: Bangalore, INDIA

Posts: 286

Thanks Given: 126

Thanked 4 Times in 4 Posts

i am not good at awk... anyway here is the normal code... since you mentioned 20,00,000 rows... execution time might be more.. :-)

file1 has 1st file content file2 has 'multiples'.. file3 is the output file

Code:

rm -r file3
while read line
do
        counter=0
        finalNo=$( echo $line | awk '{ split($0,a,"|"); print length(a) - 1 }' )
        line="$line Multiple on: "
        for (( i = 2; i <= $finalNo; i=`expr $i + 2` )) ### Outer for loop ###
        do
                column1=$( echo $line | cut -d'|' -f$i )
                if [[ "$column1" != "" ]]; then
                        noPresent=$( cat file2 | grep -w "$column1" )
                        if [[ "$noPresent" != "" ]] ; then
                                line="$line $column1"
                                counter=`expr $counter + 1`
                        fi
                fi
        done
        if [ $counter -ne 0 ]; then
                 line="$line |"
                echo $line >> file3
        else
                line="$line |No Multiple found|"
                echo $line >> file3
        fi
done < file1

Code:

cat file3
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139||||||||||||||||||||||||||| Multiple on: 20735704 20735505 |
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248||||||||||||||||||||||||||||||| Multiple on: 20735505 20735604 700333 |
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101||||||||||||||||||||||||||||| Multiple on: 700333 700666 20735604 |
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704||||||||||||||||||||||||||||| Multiple on: 20735505 20735604 700333 700666 |
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|207357|20111011141108|400501|20111011141108||||||||||||||||||||||||||||||| Multiple on: |No Multiple found|

i have modifed last line in file1 since last line has 20735704 which is a multiple

Last edited by vivek d r; 01-09-2012 at 08:24 AM..

vivek d r

View Public Profile for vivek d r

Find all posts by vivek d r

01-09-2012

Registered User

686, 179

Join Date: Mar 2011

Last Activity: 17 March 2020, 9:58 PM EDT

Posts: 686

Thanks Given: 51

Thanked 179 Times in 171 Posts

Try this out:

Code:

awk -F\| '
NR==FNR{
   for(i=1;i<=NF;i++)
      a[$i]=$i;
   next}
{m=""; 
  for(i=2;i<=NF; i+=2) 
     if($i && $i in a) 
        m=m " "$i 
  app = m ? "Multiple on " m "|" : "No multiple data |" ; 
  print $0 app 
}' file2.txt file1.txt

But I agree with birei, that in your sample output, the last line should find 1 match.

mirni

View Public Profile for mirni

Find all posts by mirni

01-10-2012

Registered User

5, 0

Join Date: Oct 2008

Last Activity: 10 January 2012, 10:22 AM EST

Posts: 5

Thanks Given: 1

Thanked 0 Times in 0 Posts

Thanks for the big help and already tried the code myself and work like charm. The only problem is, actually the last line is correct. There is no multiple because there is only one data of 20735704 that is why it show no multiple.

As you can see below are the 2nd file.

Code:

20735704|20735505|20735604|700333|700666|

if there is only one data from the 2nd file there are no multiple. But if there are, let's say 20735704 and 700333 in one line, there are multiple.

Pardon my bad explanation and bad english. Can you guys help me again?

lurak

View Public Profile for lurak

Find all posts by lurak

01-10-2012

Registered User

628, 174

Join Date: Oct 2010

Last Activity: 2 December 2017, 5:58 AM EST

Location: Madrid, Spain

Posts: 628

Thanks Given: 8

Thanked 174 Times in 171 Posts

So there must exists at least two coincidences to consider a line has multiples, in case of zero or one coincidence it is assumed there are not coincidences. Is like this?

Regards,
Birei

birei

View Public Profile for birei

Find all posts by birei

01-10-2012

Registered User

686, 179

Join Date: Mar 2011

Last Activity: 17 March 2020, 9:58 PM EDT

Posts: 686

Thanks Given: 51

Thanked 179 Times in 171 Posts

Just put a counter in there:

Code:

awk -F\| ' 
NR==FNR{
    for(i=1;i<=NF;i++)
       a[$i]=$i;
    next
} 
{m=""; cnt=0;
   for(i=2;i<=NF; i+=2)
       if($i && $i in a) {
         m=m " "$i 
         cnt++
      }
  app = cnt > 1 ? "Multiple on " m "|" : "No multiple data |" ;
  print $0 app
}' file2.txt file1.txt

Or, you can try to count the spaces in variable 'm' and test for that:

Code:

awk -F\| ' 
NR==FNR{
    for(i=1;i<=NF;i++)
       a[$i]=$i;
    next
} 
{m=""; 
   for(i=2;i<=NF; i+=2)
       if($i && $i in a)
         m=m " "$i 
  app = gsub(" "," ",m) > 1 ? "Multiple on " m "|" : "No multiple data |" ;
  print $0 app
}' file2.txt file1.txt

Makes the code shorter, but a little bit more obscure. gsub returns the number of substitutions made.

Last edited by mirni; 01-10-2012 at 07:42 AM.. Reason: gsub trick

mirni

View Public Profile for mirni

Find all posts by mirni

Shell Programming and Scripting

Compare between column in one file

9 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Compare 1st column from 2 file and if match print line from 1st file and append column 7 from 2nd

Discussion started by: tententen

2. UNIX for Beginners Questions & Answers

Compare Values between column in the same file

Discussion started by: Nina2910

3. Shell Programming and Scripting

How to compare the values of a column in a same file using awk?

Discussion started by: utritala

4. Shell Programming and Scripting

Compare 2 text file with 1 column in each file and write mismatch data to 3rd file

Discussion started by: Divya Nochiyil

5. Shell Programming and Scripting

Compare value of a column in a file

Discussion started by: Vivekit82

6. Shell Programming and Scripting

Compare the second column of a file with the second column of another in awk

Discussion started by: amarn

7. UNIX for Dummies Questions & Answers

Compare two column in file and extract complate line

Discussion started by: mscott

8. Shell Programming and Scripting

awk compare column n replace with in one file

Discussion started by: arun1401

9. Shell Programming and Scripting

I need to extract last column of a file and compare the values

Discussion started by: vukkusila