Help with merge data based on similarity


 
# 1  
Old 12-19-2011

Input_file
Code:
data1    USA    100    ASE
data3    UK    20    GWQR
data4    Brazil    40    QWE
data2    Scotland    60    THWE
data5    USA    40    QWERR

Reference_file
Code:
USA    12312    34532
1324    Brazil    23321
231    3421    Scotland
342    34235    UK
231    141    England

Desired_output_file:
Code:
data1    USA    100    ASE    USA    12312    34532
data5    USA    40    QWERR    USA    12312    34532
data4    Brazil    40    QWE    1324    Brazil    23321
data3    UK    20    GWQR    342    34235    UK
data2    Scotland    60    THWE    231    3421    Scotland

I would like to print every line of Input_file whose column 2 value appears in column 1, 2, or 3 of Reference_file, with the matching Reference_file line appended.
This is how I currently handle it:
Code:
Step 1: merge the lines where column 2 of Input_file matches column 1 of Reference_file:
perl -e ' $col1=1; $col2=0; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp1.txt
data1    USA    100    ASE    USA    12312    34532
data5    USA    40    QWERR    USA    12312    34532

Step 2: merge the lines where column 2 of Input_file matches column 2 of Reference_file:
perl -e ' $col1=1; $col2=1; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp2.txt 
data4    Brazil    40    QWE    1324    Brazil    23321

Step 3: merge the lines where column 2 of Input_file matches column 3 of Reference_file:
perl -e ' $col1=1; $col2=2; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp3.txt 
data3    UK    20    GWQR    342    34235    UK
data2    Scotland    60    THWE    231    3421    Scotland

Concatenate all tmp*.txt together:
cat tmp1.txt tmp2.txt tmp3.txt > Desired_output_file.txt

Could an awk "if ... else if ... else" condition do this in a single pass and save processing time?
Thanks for any advice.
# 2  
Old 12-19-2011
As you're using perl already, why not read the reference file into a hash keyed on the non-numeric field in each record and then use that to create the new records as you read the input file?
Code:
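$ perl -Mstrict -Mwarnings -e '
# File names are examples; substitute your own reference and input files.
# Index each reference record by its non-numeric field (the country name).
my %index;
open(my $ref, "<", "reference.dat") or die "reference.dat: $!\n";
while (<$ref>) {
   chomp;
   my $record = $_;
   for my $field (split /\s+/, $record) {
      $index{$field} = $record if $field !~ /^\d+$/;
   }
}
close $ref;
# Append the matching reference record to each input line; skip unmatched lines.
open(my $in, "<", "input.dat") or die "input.dat: $!\n";
while (<$in>) {
   chomp;
   my @fields = split /\s+/, $_;
   next unless defined $fields[1] && exists $index{$fields[1]};
   print "$_\t$index{$fields[1]}\n";
}
close $in;'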
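
On the awk question from the first post: the same index-then-lookup idea can be sketched in awk without any if/else chain, since the key is looked up in an array regardless of which reference column it came from. This is only a minimal sketch. The function name merge_by_key is made up for illustration, and it assumes whitespace-separated fields and exactly one non-numeric field per reference record.
Code:
```shell
# merge_by_key REFERENCE INPUT   (hypothetical helper name)
# Prints each INPUT line whose column 2 matches the non-numeric
# field of some REFERENCE line, followed by that REFERENCE line.
merge_by_key() {
    awk '
    NR == FNR {                        # first file: build the index
        for (i = 1; i <= NF; i++)
            if ($i !~ /^[0-9]+$/)      # non-numeric field is taken as the key
                ref[$i] = $0
        next
    }
    $2 in ref { print $0 "\t" ref[$2] }   # second file: look up column 2
    ' "$1" "$2"
}
```

Called as merge_by_key Reference_file Input_file > Desired_output_file.txt, this reads each file once. Lines come out in Input_file order rather than grouped by the matched column, and if two reference records share a key, the later one wins.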

This User Gave Thanks to Skrynesaver For This Post: