Compare between column in one file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Compare between column in one file
# 1  
Old 01-09-2012
Compare between column in one file

Hi all, can anyone help me. I want to compare values within column in one file using awk.

Below are the sample input file(1st file):
Code:
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||

below are the comparison files (2nd file):
Code:
20735704|20735505|20735604|700333|700666|

What I want to get is, if the 1st file column 2,4,6,8,10,..42 (even column under 43) have data from 2nd file so it will output all the data from 1st row + multiple on 2nd file data.

Below are the sample output
Code:
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||Multiple on 20735704 20735505|
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333|
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||Multiple on 700333 700666 20735604|
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333 700666| 
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||No Multiple data|

Can you guys help me? really appreciate the help, and FYI the data that I'm handling is around 20,000,000 rows.
Moderator's Comments:
Mod Comment How to use code tags

Last edited by Franklin52; 01-09-2012 at 06:31 AM.. Reason: Please use code tags for code and data samples, thank you
# 2  
Old 01-09-2012
Hi lurak,

This is my try with awk, but last line gives me one coincidence:
Code:
$ cat file1
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||
$ cat file2
20735704|20735505|20735604|700333|700666|
$ cat script.awk
BEGIN {
        FS = "|"
        if ( ARGC != 3 ) {
                print "Usage: awk -f script.awk <file1> <file2>"
                exit 0
        }

        dummy = ARGV[1]
        ARGV[1] = ARGV[2]
        ARGV[2] = dummy
}

## Process 'file2' of input parameters.
FNR == NR {
        for ( i = 1; i <= NF; i++ ) {
                if ( length( $i ) > 0 ) {
                        comparison[ $i ] = 1
                }

        }
}

## Process 'file1' of input parameters.
FNR < NR {
        for ( i = 2; i <= 42; i += 2 ) {
                if ( comparison[ $i ] ) {
                        multiples = multiples " " $i
                }
        }

        if ( length( multiples ) > 0 ) {
                multiples = "Multiple on" multiples "|"
        } else {
                multiples = "No multiple data|"
        }

        print $0 multiples
        multiples = ""
}
$ awk -f script.awk file1 file2 | cat -n
     1  8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||Multiple on 20735704 20735505|
     2  8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333|
     3  8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||Multiple on 700333 700666 20735604|
     4  83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333 700666|
     5  83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||Multiple on 20735704|

Regards,
Birei
# 3  
Old 01-09-2012
i am not good at awk... anyway here is the normal code... since you mentioned 20,00,000 rows... execution time might be more.. :-)

file1 has 1st file content file2 has 'multiples'.. file3 is the output file

Code:
rm -r file3
while read line
do
        counter=0
        finalNo=$( echo $line | awk '{ split($0,a,"|"); print length(a) - 1 }' )
        line="$line Multiple on: "
        for (( i = 2; i <= $finalNo; i=`expr $i + 2` )) ### Outer for loop ###
        do
                column1=$( echo $line | cut -d'|' -f$i )
                if [[ "$column1" != "" ]]; then
                        noPresent=$( cat file2 | grep -w "$column1" )
                        if [[ "$noPresent" != "" ]] ; then
                                line="$line $column1"
                                counter=`expr $counter + 1`
                        fi
                fi
        done
        if [ $counter -ne 0 ]; then
                 line="$line |"
                echo $line >> file3
        else
                line="$line |No Multiple found|"
                echo $line >> file3
        fi
done < file1

Code:
cat file3
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139||||||||||||||||||||||||||| Multiple on: 20735704 20735505 |
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248||||||||||||||||||||||||||||||| Multiple on: 20735505 20735604 700333 |
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101||||||||||||||||||||||||||||| Multiple on: 700333 700666 20735604 |
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704||||||||||||||||||||||||||||| Multiple on: 20735505 20735604 700333 700666 |
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|207357|20111011141108|400501|20111011141108||||||||||||||||||||||||||||||| Multiple on: |No Multiple found|

i have modifed last line in file1 since last line has 20735704 which is a multiple

Last edited by vivek d r; 01-09-2012 at 08:24 AM..
# 4  
Old 01-09-2012
Try this out:
Code:
awk -F\| '
NR==FNR{
   for(i=1;i<=NF;i++)
      a[$i]=$i;
   next}
{m=""; 
  for(i=2;i<=NF; i+=2) 
     if($i && $i in a) 
        m=m " "$i 
  app = m ? "Multiple on " m "|" : "No multiple data |" ; 
  print $0 app 
}' file2.txt file1.txt

But I agree with birei, that in your sample output, the last line should find 1 match.
# 5  
Old 01-10-2012
Thanks for the big help and already tried the code myself and work like charm. The only problem is, actually the last line is correct. There is no multiple because there is only one data of 20735704 that is why it show no multiple.

As you can see below are the 2nd file.
Code:
20735704|20735505|20735604|700333|700666|

if there is only one data from the 2nd file there are no multiple. But if there are, let's say 20735704 and 700333 in one line, there are multiple.

Pardon my bad explanation and bad english. Can you guys help me again?
# 6  
Old 01-10-2012
So there must exists at least two coincidences to consider a line has multiples, in case of zero or one coincidence it is assumed there are not coincidences. Is like this?

Regards,
Birei
# 7  
Old 01-10-2012
Just put a counter in there:
Code:
awk -F\| ' 
NR==FNR{
    for(i=1;i<=NF;i++)
       a[$i]=$i;
    next
} 
{m=""; cnt=0;
   for(i=2;i<=NF; i+=2)
       if($i && $i in a) {
         m=m " "$i 
         cnt++
      }
  app = cnt > 1 ? "Multiple on " m "|" : "No multiple data |" ;
  print $0 app
}' file2.txt file1.txt

Or, you can try to count the spaces in variable 'm' and test for that:
Code:
awk -F\| ' 
NR==FNR{
    for(i=1;i<=NF;i++)
       a[$i]=$i;
    next
} 
{m=""; 
   for(i=2;i<=NF; i+=2)
       if($i && $i in a)
         m=m " "$i 
  app = gsub(" "," ",m) > 1 ? "Multiple on " m "|" : "No multiple data |" ;
  print $0 app
}' file2.txt file1.txt

Makes the code shorter, but a little bit more obscure. gsub returns the number of substitutions made.

Last edited by mirni; 01-10-2012 at 07:42 AM.. Reason: gsub trick
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Compare 1st column from 2 file and if match print line from 1st file and append column 7 from 2nd

hi I have 2 file with more than 10 columns for both 1st file apple,0,0,0...... orange,1,2,3..... mango,2,4,5..... 2nd file apple,2,3,4,5,6,7... orange,2,3,4,5,6,8... watermerlon,2,3,4,5,6,abc... mango,5,6,7,4,6,def.... (1 Reply)
Discussion started by: tententen
1 Replies

2. UNIX for Beginners Questions & Answers

Compare Values between column in the same file

Input File:- COLUMN1 COLUMN2 COLUMN3 COLUMN4 COLUMN5 COLUMN6 SMS Email AO Mail Post N Cell VEGE Potato E W 396 12 0 384 0 0 0 0 0 VEGE Onion S W 17 0 17 0 0 0 0 0 0 FRUIT APPLE N W 549 61 0 0 0 0 0 488 0 FRUIT APPLE SE W 291 14 239 38 0 10 0 0 0 FRUIT APPLE EAMS W 397 32 309 56 309 309 0... (27 Replies)
Discussion started by: Nina2910
27 Replies

3. Shell Programming and Scripting

How to compare the values of a column in a same file using awk?

Dear Unix experts, I have got a file where I would like to compare the values of second column if first column is same in such a way that the difference between the values is >50. If not, I would like to discard both values. For example, my input file looks like - comp275_c0_seq2 73... (7 Replies)
Discussion started by: utritala
7 Replies

4. Shell Programming and Scripting

Compare 2 text file with 1 column in each file and write mismatch data to 3rd file

Hi, I need to compare 2 text files with around 60000 rows and 1 column. I need to compare these and write the mismatch data to 3rd file. File1 - file2 = file3 wc -l file1.txt 58112 wc -l file2.txt 55260 head -5 file1.txt 101214200123 101214700300 101250030067 101214100500... (10 Replies)
Discussion started by: Divya Nochiyil
10 Replies

5. Shell Programming and Scripting

Compare value of a column in a file

i have a file which has few columns.From those columns i need to check the value of column 9 if it is 1 or not.If it is 1 then only i need to print column 4,5,7,9. Sample input. Column 1,Column 2,Column 3,Column 4,Column 5,Column 5,Column 6,Column 7,column 8,Column 9,Column 10. ... (5 Replies)
Discussion started by: Vivekit82
5 Replies

6. Shell Programming and Scripting

Compare the second column of a file with the second column of another in awk

Hi, I know that this topic has been discussed in the past and I've tried to follow all the guidelines. Anyhow, I following describe my problem. I have a file (file1 , no. records = 67) containing pairs of IP addresses as follows (with single space as delimiter between the fields): example... (5 Replies)
Discussion started by: amarn
5 Replies

7. UNIX for Dummies Questions & Answers

Compare two column in file and extract complate line

No Chr Pos Qual GT_1 GT_2 1. chr1 478493 595 A/G G/G 2. chr1 879243 700 A/T T/T 3. chr2 889922 1300 C/C C/C 4. chr2 1926372 300 T/A T/A 5. chr3 237474 500 G/C C/C 6. chr3 575757 700 ... (2 Replies)
Discussion started by: mscott
2 Replies

8. Shell Programming and Scripting

awk compare column n replace with in one file

hi Friends need to compare columns in one file where the data looks like below laptop,IBM phone,samsung car,rental user1,laptop user2,laptop user3,phone want to get output as laptop,IBM phone,samsung car,rental user1,IBM user2,IBM user3,samsung need to seach $2 in array of $1 and... (4 Replies)
Discussion started by: arun1401
4 Replies

9. Shell Programming and Scripting

I need to extract last column of a file and compare the values

Hi, I am new to unix and I need help in solving below mentioned issue, really appreciate ur help. I have a file sam, john, 2324, 07142007 tom, thomson, 2343, 07142007 john, scott, 2478, 07142007 its a comma delimited file, I need to extract the last column from each line and this... (4 Replies)
Discussion started by: vukkusila
4 Replies
Login or Register to Ask a Question