Help with merge data based on similarity

12-19-2011


Input_file

Code:

data1    USA    100    ASE
data3    UK    20    GWQR
data4    Brazil    40    QWE
data2    Scotland    60    THWE
data5    USA    40    QWERR

Reference_file

Code:

USA    12312    34532
1324    Brazil    23321
231    3421    Scotland
342    34235    UK
231    141    England

Desired_output_file:

Code:

data1    USA    100    ASE    USA    12312    34532
data5    USA    40    QWERR    USA    12312    34532
data4    Brazil    40    QWE    1324    Brazil    23321
data3    UK    20    GWQR    342    34235    UK
data2    Scotland    60    THWE    231    3421    Scotland

I would like to print every Input_file row whose column 2 value appears in column 1, 2, or 3 of Reference_file, followed by the matching Reference_file row.
This is how I currently do it:

Code:

Step 1: merge the rows where column 2 of Input_file matches column 1 of Reference_file:
perl -e ' $col1=1; $col2=0; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp1.txt
data1    USA    100    ASE    USA    12312    34532
data5    USA    40    QWERR    USA    12312    34532

Step 2: merge the rows where column 2 of Input_file matches column 2 of Reference_file:
perl -e ' $col1=1; $col2=1; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp2.txt 
data4    Brazil    40    QWE    1324    Brazil    23321

Step 3: merge the rows where column 2 of Input_file matches column 3 of Reference_file:
perl -e ' $col1=1; $col2=2; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp3.txt 
data3    UK    20    GWQR    342    34235    UK
data2    Scotland    60    THWE    231    3421    Scotland

Concatenate all tmp*.txt files:
cat tmp1.txt tmp2.txt tmp3.txt > Desired_output_file.txt

Could an awk "if ... else if ... else" condition do this in a single pass and save processing time?
Thanks for any advice.

patrick87


12-19-2011


As you're using perl already, why not read the reference file into a hash keyed on the non-numeric field in each record and then use that to create the new records as you read the input file?

Code:

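$ perl -Mstrict -e '
# reference.dat and input.dat stand for Reference_file and Input_file.
# Index each reference record under its non-numeric field.
my %index;
open(my $ref, "<", "reference.dat") or die "reference.dat: $!";
while (<$ref>) {
   chomp;
   my $record = $_;
   for my $field (split /\s+/, $record) {
      $index{$field} = $record if $field !~ /^\d+$/;
   }
}
close $ref;
# Append the matching reference record to each input line.
open(my $in, "<", "input.dat") or die "input.dat: $!";
while (<$in>) {
   chomp;
   my @fields = split /\s+/, $_;
   next unless exists $index{$fields[1]};   # skip rows with no match
   print "$_\t$index{$fields[1]}\n";        # tab-separated, as in the desired output
}
close $in;'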
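
Since the question asks about awk, the same idea can be sketched there as well. This is only a minimal sketch, assuming tab-separated files and that each reference row has exactly one non-numeric field. The temporary directory and the variable names (tmp, merged, ref) are illustrative, and the sample data is copied from the question.

```shell
#!/bin/sh
# Sketch: index every non-numeric field of the reference file, then look up
# column 2 of each input row. Assumes tab-separated data.
tmp=$(mktemp -d)

# Recreate the sample Input_file from the question (4 columns per row).
printf '%s\t%s\t%s\t%s\n' \
    data1 USA 100 ASE \
    data3 UK 20 GWQR \
    data4 Brazil 40 QWE \
    data2 Scotland 60 THWE \
    data5 USA 40 QWERR > "$tmp/Input_file"

# Recreate the sample Reference_file from the question (3 columns per row).
printf '%s\t%s\t%s\n' \
    USA 12312 34532 \
    1324 Brazil 23321 \
    231 3421 Scotland \
    342 34235 UK \
    231 141 England > "$tmp/Reference_file"

merged=$(awk -F'\t' '
    # First file (Reference_file): store the whole line under each
    # field that is not purely digits.
    NR == FNR {
        for (i = 1; i <= NF; i++)
            if ($i !~ /^[0-9]+$/)
                ref[$i] = $0
        next
    }
    # Second file (Input_file): print only rows whose column 2 was indexed.
    $2 in ref { print $0 "\t" ref[$2] }
' "$tmp/Reference_file" "$tmp/Input_file")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

The rows come out in Input_file order rather than grouped by reference column as in the desired output above, so pipe the result through sort if the grouping matters.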


Skrynesaver
