perl: comparison of field line by line in two files


 
# 1  
Old 05-07-2012

Hi everybody,
First, I apologize if my question seems too silly, but I have spent 4 days struggling with this. I have looked at books and forums, and I also asked a friend who is a software developer for help; he told me that it is a bad idea to do this in Perl. But this is my problem.
I moved to another lab for a couple of months, where they use Perl as a tool to analyse DNA data (at my own lab I always use already-developed software, command lines to modify files so they can be used correctly, and some tools that people in my lab wrote previously). In the weeks I have been working here I have really seen the power of writing your own scripts to solve problems.
The problem is that I must compare two files and select the lines of one of them whose fields meet a few requirements, which are comparisons with the fields of the other file.

My files are (these are only a few lines, of course):
File 1
Code:
Start	End	Origin	HomeCluster	BAPSIndex	Strain
1	58292	5	5	1	TW20.dna
87840	87883	5	5	1	TW20.dna
247298	253176	5	5	1	TW20.dna
395979	400031	5	5	1	TW20.dna
404314	404824	5	5	1	TW20.dna

File 2
Code:
Coordinate	type	RefAllele	Strain	SNPAllele
358909	Int	<T>	5083_6_1	>A<
2074234	syn	<G>	5083_6_1	>A<
31160	non	<G>	5083_6_12	>A<

I must find the lines of file 2 whose Coordinate falls within the range defined by Start and End in file 1, and whose strain also matches. That is, I must compare each line of file 2 with each line of file 1.
I have started the script many times and the variables are defined, but I cannot get results. I have tried arrays and hashes, but I cannot make it work.
I include the script (the part that works) and the conditions that must be met.

Code:
#!/usr/bin/perl -w
# insideRecombinantSNP.pl
# Script to analyze the SNPs inside the recombinant regions
# If the file is not in your working directory, you have to write the complete path
use warnings;

print "Coordinate	Type	Reference Allele	Strain		Strain Allele\n";

open IN, "resultsnplinev2.out" or die;     # file 1 and file 2: the files to compare
open INN, "turkish_segments_tabularv2.txt" or die;

while(<IN>){
	if(m/^line\s+(\d+\s+\S+\s+\S+\s+\S+\s+\S+)/){
		$lineSNP=$1;
		$lineSNP =~m/^(\d+)\s+\S+\s+\S+\s+\S+\s+\S+/;
		$SNPcoor=$1;
		$lineSNP =~m/^\d+\s+\S+\s+\S+\s+(\S+)\s+\S+/;
		$SNPstrain=$1;
	}
	while(<INN>){
		if(m/^(\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+.*)/){
			$recline=$1;
			$recline =~m/^\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(.*)/;
			$recstrain=$1;
			$recline =~m/^(\d+)\s+\d+\s+\S+\s+\S+\s+\S+\s+.*/;
			$leftcoor=$1;
			$recline =~m/^\d+\s+(\d+)\s+\S+\s+\S+\s+\S+\s+.*/;
			$rightcoor=$1;
		}
	}
	if (($leftcoor<=$SNPcoor) && ($SNPcoor<=$rightcoor)){
		print "$lineSNP\n";
	}elsif ($recstrain eq $SNPstrain){
		print "$lineSNP\n";
	}
}


Any idea, hint or suggestion is welcome.



Last edited by Franklin52; 05-07-2012 at 08:27 AM.. Reason: Please use code tags
# 2  
Old 05-07-2012
Hello, check this:
Code:
#cat file1
Start   End     Origin  HomeCluster     BAPSIndex       Strain
1       58292   5       5       1       TW20.dna
87840   87883   5       5       1       TW20.dna
247298  253176  5       5       1       TW20.dna
395979  400031  5       5       1       TW20.dna
404314  404824  5       5       1       TW20.dna

Code:
#cat file2
Coordinate      type    RefAllele       Strain  SNPAllele
358909  Int     <T>     5083_6_1        >A<
2074234 syn     <G>     5083_6_1        >A<
31160   non     <G>     5083_6_12       >A<

Code:
#awk 'NR==1{next}NR==FNR{a[NR]=$0;next}{for(i in a) {split(a[i],b," ");if (b[1] >= $1 && b[1] <=$2) {print a[i]" match --->"$0;next}}}' file2 file1       
31160   non     <G>     5083_6_12       >A< match --->1 58292   5       5       1       TW20.dna
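
The one-liner above can be hard to read, so here is a rough multi-line sketch of the same range check. It makes a few assumptions that are not in the original command: the sample files are recreated in a temporary directory, the header line of both files is skipped, and the `next` after `print` is dropped so that every SNP inside a region is reported, not only the first one. The names `snp`, `f`, `tmp` and `result` are only illustrative.

```shell
#!/bin/sh
# Sketch only: recreate the sample files from this thread in a temp directory.
tmp=$(mktemp -d)

{
    printf 'Start\tEnd\tOrigin\tHomeCluster\tBAPSIndex\tStrain\n'
    # printf reuses the format for each Start/End pair, giving five rows
    printf '%s\t%s\t5\t5\t1\tTW20.dna\n' \
        1 58292 87840 87883 247298 253176 395979 400031 404314 404824
} > "$tmp/file1"

{
    printf 'Coordinate\ttype\tRefAllele\tStrain\tSNPAllele\n'
    printf '358909\tInt\t<T>\t5083_6_1\t>A<\n'
    printf '2074234\tsyn\t<G>\t5083_6_1\t>A<\n'
    printf '31160\tnon\t<G>\t5083_6_12\t>A<\n'
} > "$tmp/file2"

result=$(awk '
    FNR == 1  { next }                  # skip the header line of each file
    NR == FNR { snp[++n] = $0; next }   # first file (file2): remember every SNP line
    {                                   # second file (file1): one region per line
        for (i = 1; i <= n; i++) {
            split(snp[i], f, " ")       # f[1] is the SNP coordinate
            # "+ 0" forces a numeric comparison
            if (f[1] + 0 >= $1 + 0 && f[1] + 0 <= $2 + 0)
                print snp[i] " match --->" $0
        }
    }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With real data, the two `printf` groups would go away and the awk program would be run directly on the existing files, SNP file first and region file second.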

# 3  
Old 05-07-2012
Thank you Klashxx, but...

Thanks so much for giving me a hand, but...
I know, I know, I'm the worst, but I don't know how to use your help. Do I put it into my script, or make a new script from it?
Thanks again, and I'm sorry for having so little skill, but I have only been working with Perl for 10 days.
# 4  
Old 05-07-2012
No worries, give us an example of your expected output.
# 5  
Old 05-07-2012
Thank you again,
I have to extract the lines in file 2 whose coordinates are between the Start and End positions of file 1 and which also belong to the same strain. In other words, two conditions must be fulfilled: going line by line, check whether the strain matches and whether the coordinate is within the range formed by Start and End. If both hold, the expected output is the line of file 2.
For example:
Code:
Int <T> 5083_6_1 358909> A <3_6_1 358909> A <

I know that there are lines that fulfill these conditions; I checked by hand and found several.

---------- Post updated at 09:11 PM ---------- Previous update was at 07:39 PM ----------

I really don't know what happened, but the expected output is not that. It is:
Coordinate type RefAllele Strain SNPAllele
240450 non <G> 6949_5_23 >A<

Sorry

Last edited by Scrutinizer; 05-07-2012 at 03:36 PM.. Reason: code tags
# 6  
Old 05-07-2012
You mean something like this?
Code:
#cat file1
Start   End     Origin  HomeCluster     BAPSIndex       Strain
1       58292   5       5       1       TW20.dna
87840   87883   5       5       1       TW20.dna
247298  253176  5       5       1       TW20.dna
395979  400031  5       5       1       TW20.dna
404314  404824  5       5       1       TW20.dna    >A<

Code:
#cat file2
Coordinate      type    RefAllele       Strain  SNPAllele
358909  Int     <T>     5083_6_1        >A<
2074234 syn     <G>     5083_6_1        >A<
31160   non     <G>     5083_6_12       >A<
404820 non     <G>     5083_6_12       >A<

Code:
#awk 'NR==1{next}NR==FNR{a[NR]=$0;next}{for(i in a) {e=split(a[i],b," ");if (b[1] >= $1 && b[1] <=$2 && b[e] == $NF) {print a[i];next}}}' file2 file1                                  
404820 non     <G>     5083_6_12       >A<

# 7  
Old 05-07-2012
yes...

Yes, something like this, but the strain also has to match.

Something like:

Line x of file 2: 136 non <T> 5083_6_1 >A<
This line matches line y of file 1: 12 52000 1 1 1 5083_6_1.
12 <= 136 <= 52000 and 5083_6_1 = 5083_6_1

Then this will appear in my output file:
136 non <T> 5083_6_1 >A<

I know that it's a bit difficult (at least for me), but I'm really grateful for your help.
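
A minimal sketch of the check described in this post, assuming the strain name is the sixth column of file 1 and the fourth column of file 2, and that both files are whitespace-separated with one header line. Only the `12 52000 ... 5083_6_1` region and the `136` SNP come from the example above; the other rows, the temporary file names and the variable names (`lo`, `hi`, `st`, `result`) are made up for illustration.

```shell
#!/bin/sh
# Sketch only: sample rows are partly hypothetical (see the note above).
tmp=$(mktemp -d)

{
    printf 'Start\tEnd\tOrigin\tHomeCluster\tBAPSIndex\tStrain\n'
    printf '12\t52000\t1\t1\t1\t5083_6_1\n'       # region from the example
    printf '87840\t87883\t5\t5\t1\tTW20.dna\n'    # region of another strain
} > "$tmp/file1"

{
    printf 'Coordinate\ttype\tRefAllele\tStrain\tSNPAllele\n'
    printf '136\tnon\t<T>\t5083_6_1\t>A<\n'       # in range, same strain
    printf '87850\tsyn\t<G>\t5083_6_1\t>A<\n'     # in range of the other strain only
    printf '60000\tInt\t<T>\t5083_6_1\t>A<\n'     # same strain, outside every range
} > "$tmp/file2"

result=$(awk '
    FNR == 1  { next }                  # skip the header line of each file
    NR == FNR {                         # first file (file1): store each region
        n++
        lo[n] = $1; hi[n] = $2; st[n] = $6
        next
    }
    {                                   # second file (file2): test each SNP line
        for (i = 1; i <= n; i++)
            if ($4 == st[i] && $1 + 0 >= lo[i] + 0 && $1 + 0 <= hi[i] + 0) {
                print                   # both conditions hold: keep the SNP line
                break                   # print each SNP line at most once
            }
    }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Here the region file is read first and the SNP file second, so the output lines are the unchanged lines of file 2. Note that in the sample rows from the first post the strain names (`TW20.dna` in file 1, `5083_6_1` in file 2) never coincide, so an exact comparison like this one would print nothing for them.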