Getting non unique lines from concatenated files


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Getting non unique lines from concatenated files
# 64  
Old 03-27-2011
Using this sample file I would like to learn more about file manipulation using Perl from you and others, so I have my next question already ... sorry Smilie
Sample file:
Code:
chr01   levure5 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01   levure6 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure7 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure5 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=111;refAlleleCounts=82;refAlleleStarts=41;refAlleleMeanQV=18;novelAlleleCounts=28;novelAlleleStarts=9;novelAlleleMeanQV=18;diColor1=10;diColor2=32;het=1;flag=
chr01   levure6 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=96;refAlleleCounts=72;refAlleleStarts=38;refAlleleMeanQV=17;novelAlleleCounts=24;novelAlleleStarts=6;novelAlleleMeanQV=15;diColor1=10;diColor2=32;het=1;flag=
chr01   levure7 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=32;refAlleleCounts=20;refAlleleStarts=18;refAlleleMeanQV=19;novelAlleleCounts=12;novelAlleleStarts=5;novelAlleleMeanQV=17;diColor1=10;diColor2=32;het=1;flag=
chr01   levure5 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=42;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=35;novelAlleleStarts=16;novelAlleleMeanQV=19;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure6 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=32;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=23;novelAlleleStarts=11;novelAlleleMeanQV=17;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure7 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=17;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=14;novelAlleleStarts=11;novelAlleleMeanQV=20;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure5 SNP     189778  189778  0.000000        .       .       genotype=G;reference=A;coverage=16;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=16;novelAlleleStarts=12;novelAlleleMeanQV=24;diColor1=10;diColor2=10;het=0;flag=h4,h10,;transpo
sable_element;ID=COPY_YORCTy2-1;Name=COPY_YORCTy2-1
chrm    levure7 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=18;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=18;novelAlleleStarts=4;novelAlleleMeanQV=13;diColor1=21;diColor2=21;het=0;flag=h4,h10,;gene;ID=
Q0182;Name=Q0182;Alias=ORF11
chrm    levure8 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=20;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=19;novelAlleleStarts=4;novelAlleleMeanQV=14;diColor1=21;diColor2=21;het=0;flag=h4,h10,h9,;gene;
ID=Q0182;Name=Q0182;Alias=ORF11
chrm    levure5 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=61;refAlleleCounts=8;refAlleleStarts=4;refAlleleMeanQV=11;novelAlleleCounts=52;novelAlleleStarts=6;novelAlleleMeanQV=16;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure6 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=59;refAlleleCounts=3;refAlleleStarts=2;refAlleleMeanQV=16;novelAlleleCounts=55;novelAlleleStarts=5;novelAlleleMeanQV=18;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure7 SNP     98064   98064   0.019086        .       .       genotype=A;reference=G;coverage=23;refAlleleCounts=3;refAlleleStarts=3;refAlleleMeanQV=15;novelAlleleCounts=18;novelAlleleStarts=5;novelAlleleMeanQV=19;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure8 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=24;refAlleleCounts=3;refAlleleStarts=2;refAlleleMeanQV=8;novelAlleleCounts=20;novelAlleleStarts=6;novelAlleleMeanQV=19;diColor1=20;diColor2=20;het=0;flag=h4,h10,
scplasm1        levure5 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=5752;refAlleleCounts=7;refAlleleStarts=7;refAlleleMeanQV=3;novelAlleleCounts=5588;novelAlleleStarts=58;novelAlleleMeanQV=20;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure6 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=4046;refAlleleCounts=10;refAlleleStarts=9;refAlleleMeanQV=5;novelAlleleCounts=3862;novelAlleleStarts=58;novelAlleleMeanQV=18;diColor1=22;diColor2=22;het=0;flag=h4
,h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure7 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=2264;refAlleleCounts=2;refAlleleStarts=2;refAlleleMeanQV=3;novelAlleleCounts=2184;novelAlleleStarts=58;novelAlleleMeanQV=21;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure8 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=2606;refAlleleCounts=2;refAlleleStarts=2;refAlleleMeanQV=4;novelAlleleCounts=2537;novelAlleleStarts=57;novelAlleleMeanQV=21;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398

What if in the above I wanted to gather only some fields (which from $F[0] to $F[7] should be simple enough to do). You have provided a way to split the $F[8] field and extract subfeild "x" (in red below )information with:
Code:
((split "=",(split ";",(split "[\t ]+",$_)[8])[x])[1])


Now what I want to learn is how can I recover other fields from
$F[8] and then for each report its average value for each position reporting only once in $F[3] and I would like to do the same for $F[0]

So in other words my expected output if I want to gather coverage and novelAlleleStarts subfields after splitting $F[8] should be like
Code:
chr01   12745   Mean_coverage=59   Mean_novelAlleleStarts= 28.33
chr01   16254   Mean_coverage=79.66   Mean_novelAlleleStarts=21.33
chr01   16511   Mean_coverage=30.33   Mean_novelAlleleStarts=24
chr01   189778   Mean_coverage=16   Mean_novelAlleleStarts=16
chrm   86086   Mean_coverage=19   Mean_novelAlleleStarts=18.5   
chrm   98064   Mean_coverage=41.75   Mean_novelAlleleStarts=36.25  
scplasm1   6153   Mean_coverage=3667   Mean_novelAlleleStarts=3542.75

In this way I can learn how to populate hashes of hashes and use the keys as arguements for calculations.

Thanks a lot in advanceSmilie

# 65  
Old 03-27-2011
Code:
#!/usr/bin/perl
open I, "$ARGV[0]";
while (<I>){
  @F=split;
  $F[8]=~/coverage=([^;]+)/;
  $c_s{$F[0]}{$F[3]}+=$1;
  $F[8]=~/novelAlleleStarts=([^;]+)/;
  $nv_s{$F[0]}{$F[3]}+=$1;
  $n{$F[0]}{$F[3]}++;
}
END{
  for $i (keys %c_s){
    for $j (keys %{$c_s{$i}}){
      print "$i\t$j\tMean_coverage=";
      printf "%.2f\t", ($c_s{$i}{$j}/$n{$i}{$j});
      printf "Mean_novelAlleleStarts=%.2f\n", ($nv_s{$i}{$j}/$n{$i}{$j});
    }
  }
}

Run it as usual: script.pl file. Notice that to get coverage and novelAlleleStarts values I used regular expression matching on 8th field, instead of nested split. I think using nested split was way overcomplicated to achieve such a simple task Smilie. Let me know which part of the code needs explanation.

Last edited by bartus11; 03-27-2011 at 05:17 PM.. Reason: small code tweak
This User Gave Thanks to bartus11 For This Post:
# 66  
Old 03-27-2011
Thanks a lot Bartus Smilie ... its amazing ... does exactly what I want, but makes me laugh when you say "simple task" .... I hope it gets so simple for me too Smilie

I'm confused how $1 accumulates both $c_s and $nv_s !!? Please, could you also explain, ofcourse whenever you have time the last two "printf" lines of code please?

Cheers and thanks always ..

++
# 67  
Old 03-27-2011
$1 contains part of the last regex that is enclosed by brackets () -
Code:
$F[8]=~/coverage=([^;]+)/;

So it will contain everything that is after "coverage=", but before ";", so it will match coverage value. When second regex is executed ($F[8]=~/novelAlleleStarts=([^;]+)/;), old $1 contents will get replaced by new match (novelAlleleStarts value). Because of this, $1 should be used immediately after regex match. printf line is standard printf function, in this case formatting the number with two digits precision.
This User Gave Thanks to bartus11 For This Post:
# 68  
Old 03-28-2011
******Please ignore my question above, jumped the gun a little*********
Also can someone tell me how to delete post :P Smilie

Last edited by pkabali; 03-28-2011 at 03:43 AM..
# 69  
Old 03-28-2011
@Bartus :
Yeah thats exactly why I was confused because I was thinking two different regex matches must be stored in different variables ... but seems like it can be done with just one too ...
Thanks for your comments .... Hva nice daySmilie

@pkabali :
I didnt recieve any questions from you so dont worry ... Good day!

---------- Post updated at 12:53 PM ---------- Previous update was at 04:09 AM ----------

@Bartus :
Hi Bartus, Hope all well ... I have another question .... Concerning slightly more detailed text manipulation but its time for me to learn ...
Sample file
Code:
levure5/levure5.raw.vcf:SK1.scplasm1    6153    .    T    C    222    .    DP=6039;AF1=1;CI95=1,1;DP4=2,2,1706,3565;MQ=33;FQ=-282;PV4=0.6,1,1,1    GT:PL:GQ    1/1:255,255,0:99
levure6/levure6.raw.vcf:SK1.scplasm1    6153    .    T    C    222    .    DP=4272;AF1=1;CI95=1,1;DP4=9,1,1250,2356;MQ=33;FQ=-282;PV4=0.00051,1,1,1    GT:PL:GQ    1/1:255,255,0:99
levure7/levure7.raw.vcf:SK1.scplasm1    6153    .    T    C    222    .    DP=2389;AF1=1;CI95=1,1;DP4=8,3,732,1343;MQ=33;FQ=-282;PV4=0.021,0.48,1,1    GT:PL:GQ    1/1:255,255,0:99
levure8/levure8.raw.vcf:SK1.scplasm1    6153    .    T    C    222    .    DP=2768;AF1=1;CI95=1,1;DP4=5,3,861,1559;MQ=33;FQ=-282;PV4=0.14,1,1,1    GT:PL:GQ    1/1:255,255,0:99
levure5/levure5.raw.vcf:SK1.chr08    558643    .    G    A    225    .    DP=449;AF1=0.5;CI95=0.5,0.5;DP4=122,120,93,58;MQ=44;FQ=225;PV4=0.037,1,1.8e-40,1    GT:PL:GQ    0/1:255,0,255:99
levure6/levure6.raw.vcf:SK1.chr08    558643    .    G    A    211    .    DP=371;AF1=0.5;CI95=0.5,0.5;DP4=115,96,64,47;MQ=44;FQ=214;PV4=0.64,1,9e-33,1    GT:PL:GQ    0/1:241,0,255:99
levure7/levure7.raw.vcf:SK1.chr08    558643    .    G    A    210    .    DP=152;AF1=0.5;CI95=0.5,0.5;DP4=44,47,32,15;MQ=45;FQ=213;PV4=0.031,1,7e-21,1    GT:PL:GQ    0/1:240,0,255:99
levure8/levure8.raw.vcf:SK1.chr08    558643    .    G    A    225    .    DP=208;AF1=0.5;CI95=0.5,0.5;DP4=49,69,29,46;MQ=45;FQ=225;PV4=0.76,1,1.1e-29,1    GT:PL:GQ    0/1:255,0,255:99
levure5/levure5.raw.vcf:SK1.chr08    558978    .    C    T    117    .    DP=436;AF1=0.5;CI95=0.5,0.5;DP4=157,45,87,55;MQ=44;FQ=120;PV4=0.0011,3.2e-71,1.8e-39,1    GT:PL:GQ    0/1:147,0,255:99
levure6/levure6.raw.vcf:SK1.chr08    558978    .    C    T    127    .    DP=352;AF1=0.5;CI95=0.5,0.5;DP4=121,45,71,44;MQ=42;FQ=130;PV4=0.052,4.2e-49,1.2e-23,1    GT:PL:GQ    0/1:157,0,255:99

So in the file in the $F[7] there is the DP4 subfeild, so that would be [3] subfeild inside $F[7]. Now if you look in the sample file what I cant understand is how do I split this particular subfield which has four values separated by commas into separate sub-subfeilds separated by tabs. So basically DP4 reports 4 values for the follwing 4 attributes.
Code:
DP4=fwd_ref_allele, rev_ref_allele, fwd_non_ref_allele, rev_non_ref_allele

I would basically want to report the comma separated values in each line of my file at DP4= as four separate tab delimited entries such as
Code:
fwd_ref_allele=value1[TAB]rev_ref_allele=value2[TAB]fwd_non_ref_allele=value3[TAB]rev_non_ref_allele=value4[TAB]

Once they apear as individual feilds, I can then it will be eaier for me to do calculations using your previous codes.

Thank you and let me know if you can provide some insight on such a problem of text manipulation in files.

Cheers ... Have a nice evening Smilie
# 70  
Old 03-28-2011
Something like this?
Code:
perl -pe 's/DP4=(\d+),(\d+),(\d+),(\d+)/fwd_ref_allele=$1\trev_ref_allele=$2\tfwd_non_ref_allele=$3\trev_non_ref_allele=$4\t/' file

This User Gave Thanks to bartus11 For This Post:
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Print number of lines for files in directory, also print number of unique lines

I have a directory of files, I can show the number of lines in each file and order them from lowest to highest with: wc -l *|sort 15263 Image.txt 16401 reference.txt 40459 richtexteditor.txt How can I also print the number of unique lines in each file? 15263 1401 Image.txt 16401... (15 Replies)
Discussion started by: spacegoose
15 Replies

2. UNIX for Dummies Questions & Answers

Print unique lines without sort or unique

I would like to print unique lines without sort or unique. Unfortunately the server I am working on does not have sort or unique. I have not been able to contact the administrator of the server to ask him to add it for several weeks. (7 Replies)
Discussion started by: cokedude
7 Replies

3. Shell Programming and Scripting

Look up 2 files and print the concatenated output

file 1 Sun Mar 17 00:01:33 2013 submit , Name="1234" Sun Mar 17 00:01:33 2013 submit , Name="1344" Sun Mar 17 00:01:33 2013 submit , Name="1124" .. .. .. .. Sun Mar 17 00:01:33 2013 submit , Name="8901" file 2 Sun Mar 17 00:02:47 2013 1234 execute SUCCEEDED Sun Mar 17... (24 Replies)
Discussion started by: aravindj80
24 Replies

4. Shell Programming and Scripting

Print only lines where fields concatenated match strings

Hello everyone, Maybe somebody could help me with an awk script. I have this input (field separator is comma ","): 547894982,M|N|J,U|Q|P,98,101,0,1,1 234900027,M|N|J,U|Q|P,98,101,0,1,1 234900023,M|N|J,U|Q|P,98,54,3,1,1 234900028,M|H|J,S|Q|P,98,101,0,1,1 234900030,M|N|J,U|F|P,98,101,0,1,1... (2 Replies)
Discussion started by: Ophiuchus
2 Replies

5. Shell Programming and Scripting

compare 2 files and return unique lines in each file (based on condition)

hi my problem is little complicated one. i have 2 files which appear like this file 1 abbsss:aa:22:34:as akl abc 1234 mkilll:as:ss:23:qs asc abc 0987 mlopii:cd:wq:24:as asd abc 7866 file2 lkoaa:as:24:32:sa alk abc 3245 lkmo:as:34:43:qs qsa abc 0987 kloia:ds:45:56:sa acq abc 7805 i... (5 Replies)
Discussion started by: anurupa777
5 Replies

6. UNIX for Dummies Questions & Answers

getting unique lines from 2 files

hi i have used comm -13 <(sort 1.txt) <(sort 2.txt) option to get the unique lines that are present in file 2 but not in file 1. but some how i am getting the entire file 2. i would expect few but not all uncommon lines fro my dat. is there anything wrong with the way i used the command? my... (1 Reply)
Discussion started by: anurupa777
1 Replies

7. Shell Programming and Scripting

Compare multiple files and print unique lines

Hi friends, I have multiple files. For now, let's say I have two of the following style cat 1.txt cat 2.txt output.txt Please note that my files are not sorted and in the output file I need another extra column that says the file from which it is coming. I have more than 100... (19 Replies)
Discussion started by: jacobs.smith
19 Replies

8. UNIX for Advanced & Expert Users

In a huge file, Delete duplicate lines leaving unique lines

Hi All, I have a very huge file (4GB) which has duplicate lines. I want to delete duplicate lines leaving unique lines. Sort, uniq, awk '!x++' are not working as its running out of buffer space. I dont know if this works : I want to read each line of the File in a For Loop, and want to... (16 Replies)
Discussion started by: krishnix
16 Replies

9. Shell Programming and Scripting

Comparing 2 files and return the unique lines in first file

Hi, I have 2 files file1 ******** 01-05-09|java.xls| 02-05-08|c.txt| 08-01-09|perl.txt| 01-01-09|oracle.txt| ******** file2 ******** 01-02-09|windows.xls| 02-05-08|c.txt| 01-05-09|java.xls| 08-02-09|perl.txt| 01-01-09|oracle.txt| ******** (8 Replies)
Discussion started by: shekhar_v4
8 Replies

10. Shell Programming and Scripting

Lines Concatenated with awk

Hello, I have a bash shell script and I use awk to print certain columns of one file and direct the output to another file. If I do a less or cat on the file it looks correct, but if I email the file and open it with Outlook the lines outputted by awk are concatenated. Here is my awk line:... (6 Replies)
Discussion started by: xadamz23
6 Replies
Login or Register to Ask a Question