How to match fields from different files in Perl


 
# 1  
Old 05-08-2011

Howdy!

I have multiple files with tab-separated data:

HTML Code:
File1_filtered.txt

gnl|Amel_4.0|Group3.29	1	G	R	42	42	60	15	,.AAA.aa,aa.A..	hh00/f//hD/h/hh
gnl|Amel_4.0|Group3.29	2	C	Y	36	36	60	5	T.,T,	LggJh
gnl|Amel_4.0|Group3.29	3	A	R	27	27	60	9	Gg,,.gg.,	B6hcc22_c
HTML Code:
File2_filtered.txt

gnl|Amel_4.0|Group3.29	1	C	K	12	56	60	3	TGT	L6L
gnl|Amel_4.0|Group3.29	2	C	Y	63	63	60	5	,$,$tt,	EEZZe
HTML Code:
File3_filtered.txt

gnl|Amel_4.0|Group3.29	2	C	Y	36	36	60	5	T.,T,	LggJh
gnl|Amel_4.0|Group3.29	4	A	R	27	27	60	9	Gg,,.gg.,	B6hcc22_c
I created a master list containing every distinct combination of the first two columns (no duplicates):

HTML Code:
masterList.txt

gnl|Amel_4.0|Group3.29	1
gnl|Amel_4.0|Group3.29	2
gnl|Amel_4.0|Group3.29	3
gnl|Amel_4.0|Group3.29	4	
I need to go through each file once, extract the value in column 4, and match it to its corresponding line in the master list based on columns 1 and 2 (they need to match exactly).
If a data file has no entry matching a particular line of the master list, add an asterisk.
HTML Code:
Like this:

pos1 pos2	pos3	File1	File2	File3
gnl|Amel_4.0|Group3.29	1	R	K	*
gnl|Amel_4.0|Group3.29	2	Y	Y	Y
gnl|Amel_4.0|Group3.29	3	R	*	*
gnl|Amel_4.0|Group3.29	4	*	*	R
In the code I have so far, I load the master list into a hash. Then each data file is loaded into an array of arrays (split by columns).
Everything works except the matching of the hash and the arrays for each file.
As usual, many thanks in advance for any help you may provide.

Cheers!

HTML Code:
#!/usr/bin/perl 

use strict;
use warnings;


##dump the results in this file
my $outfile =  ">> matrix.txt";
open (MATRIX,$outfile);

#open the master list
open(MASTER,"folder/MasterList.txt") || die "open MASTER failed";

#load MASTER list into hash of arrays
my %m_hash=();
while(<MASTER>){  
	chomp;
	my @fieldsM = split (/\s|\t/, $_);
	my $scaff = $fieldsM[0];
	my $pos = $fieldsM[1];
	my $key = $scaff.",".$pos;
	my $value= $fieldsM[2];
	$m_hash{$key} = $value;
	#print "$key\t$value\n";
}
close MASTER;

#Load files into an array
my @itemsToUse;
my $directory= "folder";
opendir (DIR, $directory) or die "cant OPEN directory with files!\n";
my @allitems = readdir(DIR);

foreach my $fs (@allitems) {
	if ($fs =~ /filtered.txt/) {
		my $files = $fs;
		push (@itemsToUse, $files);
	}
}

#open the data files
foreach my $fs (@itemsToUse){
	while(<>){ # sequentially read files and do the comparison on the fly
		chomp;
		my @fieldsSNP=split/\s|\t/;    #  split by space or tab
		#print "$fields[1]\n";
		foreach my $i ( 0 .. $#{ $m_hash{$fieldsSNP[0]} } ) { 
			if (($fieldsSNP[0] == $m_hash{$fieldsSNP[0]}) && ($fieldsSNP[1] == $m_hash{$fieldsSNP[1]})){
				print MATRIX "$m_hash{$fieldsSNP[0]}[$i][0] $m_hash{$fieldsSNP[0]}[$i][1]  $fieldsSNP[4]\n";
			}
		}#close if
	}#close foreach
}#close foreachs
close MASTER;
close MATRIX;
exit 0;
# 2  
Old 05-08-2011
For starters:
You're splitting the master list on whitespace and assigning $fieldsM[2], which is undefined because your masterList.txt has only 2 columns. Here:
Code:
my $value= $fieldsM[2];
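
A few other things in the posted script would also stop the matching from working: the hash is keyed on "$scaff,$pos" but looked up with $fieldsSNP[0] alone, `==` compares numbers rather than strings, `while(<>)` reads from the files named in @ARGV (or STDIN) rather than from $fs, and column 4 is index 3, not 4. A minimal sketch of the lookup, assuming tab-separated input; `make_key` is a hypothetical helper and %m_hash is filled by hand here instead of from masterList.txt:

```perl
use strict;
use warnings;

# Hypothetical helper: build the same composite key the master-list
# loop uses ("scaffold,position") from one tab-separated line.
sub make_key {
    my ($line) = @_;
    my @fields = split /\t/, $line;
    return join ',', @fields[0, 1];
}

# Stand-in for the hash loaded from masterList.txt (assumption: only
# the keys matter, so the value is just a flag).
my %m_hash;
$m_hash{ make_key($_) } = 1
    for "gnl|Amel_4.0|Group3.29\t1", "gnl|Amel_4.0|Group3.29\t2";

# One line from a data file.
my $line = "gnl|Amel_4.0|Group3.29\t2\tC\tY\t36\t36\t60\t5\tT.,T,\tLggJh";
my $key  = make_key($line);

# Look the composite key up directly; no numeric comparison, no inner loop.
if (exists $m_hash{$key}) {
    my $call = (split /\t/, $line)[3];    # column 4 is index 3
    print "$key\t$call\n";
}
```

Because the key is a string, `exists` on the hash does the exact match on columns 1 and 2 in one step.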

---------- Post updated at 02:44 PM ---------- Previous update was at 01:39 PM ----------

Please try this out:

Code:
#!/usr/bin/awk -f

NR==FNR{
    out[$1 $2]=pat[$1 $2]=$1" "$2;  #remember the pattern to match against
    oldind=ARGIND+1; #init helper variables
    colInd=2;
    next;
} 
 {  #for each record in _filtered.txt files
  if(ARGIND!=oldind) #new file taken; fill in '*'s before appending anything from it
  {
      colInd++;
      for(i in out) {
        if(split(out[i],a," ") < colInd) { #missing value, append '*'
          out[i]=out[i]" *"
        }
      }
      oldind=ARGIND
  }
  for(i in pat) { #loop through stored patterns
      if($1" "$2==pat[i]) { 
        out[$1 $2]=out[$1 $2]" "$4;  #match; append 4th column
      }
  } 
 }
 END{   #do the same thing one more time to fill asterisks for last input file
      colInd++;  
      for(i in out) {
        if(split(out[i],a," ") < colInd) {
          out[i]=out[i]" *"
        }
      }

     for(i in out) { #print it all 
       print out[i]
     }
 }

and invoke it like:
Code:
./run.awk folder/masterList.txt *_filtered.txt

This assumes your awk is GNU awk (ARGIND variable); if not, then store the filename (variable FILENAME) and watch when that changes instead.

Last edited by mirni; 05-08-2011 at 09:48 PM.. Reason: gawk comment
# 3  
Old 05-09-2011
Thanks for the AWK solution. Will test it.

Yes, I mistakenly deleted a third column from the master file, but the code is still not working properly....

Cheers
Santiago

---------- Post updated at 08:01 AM ---------- Previous update was at 07:58 AM ----------

I would like to find a solution for this problem using Perl.... Any takers?
Thanks!!
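
A minimal Perl sketch of one way to build the matrix, assuming every file is tab-separated, the master list is passed first on the command line, and the data files follow in the order their columns should appear. The sub names, the "scaffold" and "pos" header labels, and the script name in the usage line are all made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a file into an array of chomped lines.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;
    return \@lines;
}

# $master: array ref of master-list lines.
# $files:  array ref of [ $name, \@lines ] pairs, one per data file.
# Returns the output rows (header first), each tab-separated.
sub build_matrix {
    my ($master, $files) = @_;

    # Keep the master keys in their original order, skipping duplicates.
    my (@keys, %seen);
    for my $line (@$master) {
        my ($scaff, $pos) = split /\t/, $line;
        next unless defined $pos;            # skip blank or short lines
        my $key = "$scaff\t$pos";
        push @keys, $key unless $seen{$key}++;
    }

    # One hash per data file: "scaffold<TAB>pos" => column 4.
    # If a key repeats within a file, the last line wins (assumption).
    my @columns;
    for my $file (@$files) {
        my %call;
        for my $line (@{ $file->[1] }) {
            my @f = split /\t/, $line;
            next if @f < 4;
            $call{"$f[0]\t$f[1]"} = $f[3];
        }
        push @columns, \%call;
    }

    # Header row, then one row per master key with '*' for missing entries.
    my @rows = join "\t", 'scaffold', 'pos', map { $_->[0] } @$files;
    for my $key (@keys) {
        push @rows, join "\t", $key,
            map { exists $_->{$key} ? $_->{$key} : '*' } @columns;
    }
    return @rows;
}

# Runs only when file names are given on the command line.
if (@ARGV) {
    my ($master_path, @data_paths) = @ARGV;
    die "Usage: $0 masterList.txt data_files...\n" unless @data_paths;
    my @files = map { [ $_, read_lines($_) ] } @data_paths;
    print "$_\n" for build_matrix(read_lines($master_path), \@files);
}
```

It could be invoked like:

perl matrix.pl folder/masterList.txt folder/*_filtered.txt > matrix.txt

Each data file is read once into its own hash, so the asterisks fall out of a plain `exists` test and the rows come out in master-list order.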