Getting non unique lines from concatenated files

03-27-2011

Registered User

164, 1

Join Date: Mar 2011

Last Activity: 6 August 2015, 12:14 AM EDT

Posts: 164

Thanks Given: 119

Thanked 1 Time in 1 Post

Using this sample file I would like to learn more about file manipulation using Perl from you and others, so I have my next question already ... sorry

Sample file:

Code:

chr01   levure5 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01   levure6 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure7 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure5 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=111;refAlleleCounts=82;refAlleleStarts=41;refAlleleMeanQV=18;novelAlleleCounts=28;novelAlleleStarts=9;novelAlleleMeanQV=18;diColor1=10;diColor2=32;het=1;flag=
chr01   levure6 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=96;refAlleleCounts=72;refAlleleStarts=38;refAlleleMeanQV=17;novelAlleleCounts=24;novelAlleleStarts=6;novelAlleleMeanQV=15;diColor1=10;diColor2=32;het=1;flag=
chr01   levure7 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=32;refAlleleCounts=20;refAlleleStarts=18;refAlleleMeanQV=19;novelAlleleCounts=12;novelAlleleStarts=5;novelAlleleMeanQV=17;diColor1=10;diColor2=32;het=1;flag=
chr01   levure5 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=42;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=35;novelAlleleStarts=16;novelAlleleMeanQV=19;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure6 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=32;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=23;novelAlleleStarts=11;novelAlleleMeanQV=17;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure7 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=17;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=14;novelAlleleStarts=11;novelAlleleMeanQV=20;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure5 SNP     189778  189778  0.000000        .       .       genotype=G;reference=A;coverage=16;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=16;novelAlleleStarts=12;novelAlleleMeanQV=24;diColor1=10;diColor2=10;het=0;flag=h4,h10,;transpo
sable_element;ID=COPY_YORCTy2-1;Name=COPY_YORCTy2-1
chrm    levure7 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=18;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=18;novelAlleleStarts=4;novelAlleleMeanQV=13;diColor1=21;diColor2=21;het=0;flag=h4,h10,;gene;ID=
Q0182;Name=Q0182;Alias=ORF11
chrm    levure8 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=20;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=19;novelAlleleStarts=4;novelAlleleMeanQV=14;diColor1=21;diColor2=21;het=0;flag=h4,h10,h9,;gene;
ID=Q0182;Name=Q0182;Alias=ORF11
chrm    levure5 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=61;refAlleleCounts=8;refAlleleStarts=4;refAlleleMeanQV=11;novelAlleleCounts=52;novelAlleleStarts=6;novelAlleleMeanQV=16;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure6 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=59;refAlleleCounts=3;refAlleleStarts=2;refAlleleMeanQV=16;novelAlleleCounts=55;novelAlleleStarts=5;novelAlleleMeanQV=18;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure7 SNP     98064   98064   0.019086        .       .       genotype=A;reference=G;coverage=23;refAlleleCounts=3;refAlleleStarts=3;refAlleleMeanQV=15;novelAlleleCounts=18;novelAlleleStarts=5;novelAlleleMeanQV=19;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure8 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=24;refAlleleCounts=3;refAlleleStarts=2;refAlleleMeanQV=8;novelAlleleCounts=20;novelAlleleStarts=6;novelAlleleMeanQV=19;diColor1=20;diColor2=20;het=0;flag=h4,h10,
scplasm1        levure5 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=5752;refAlleleCounts=7;refAlleleStarts=7;refAlleleMeanQV=3;novelAlleleCounts=5588;novelAlleleStarts=58;novelAlleleMeanQV=20;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure6 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=4046;refAlleleCounts=10;refAlleleStarts=9;refAlleleMeanQV=5;novelAlleleCounts=3862;novelAlleleStarts=58;novelAlleleMeanQV=18;diColor1=22;diColor2=22;het=0;flag=h4
,h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure7 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=2264;refAlleleCounts=2;refAlleleStarts=2;refAlleleMeanQV=3;novelAlleleCounts=2184;novelAlleleStarts=58;novelAlleleMeanQV=21;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure8 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=2606;refAlleleCounts=2;refAlleleStarts=2;refAlleleMeanQV=4;novelAlleleCounts=2537;novelAlleleStarts=57;novelAlleleMeanQV=21;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398

What if in the above I wanted to gather only some fields (which from $F[0] to $F[7] should be simple enough to do). You have provided a way to split the $F[8] field and extract subfeild "x" (in red below )information with:

Code:

((split "=",(split ";",(split "[\t ]+",$_)[8])[x])[1])

Now what I want to learn is how can I recover other fields from $F[8] and then for each report its average value for each position reporting only once in $F[3] and I would like to do the same for $F[0]

So in other words my expected output if I want to gather coverage and novelAlleleStarts subfields after splitting $F[8] should be like

Code:

chr01   12745   Mean_coverage=59   Mean_novelAlleleStarts= 28.33
chr01   16254   Mean_coverage=79.66   Mean_novelAlleleStarts=21.33
chr01   16511   Mean_coverage=30.33   Mean_novelAlleleStarts=24
chr01   189778   Mean_coverage=16   Mean_novelAlleleStarts=16
chrm   86086   Mean_coverage=19   Mean_novelAlleleStarts=18.5   
chrm   98064   Mean_coverage=41.75   Mean_novelAlleleStarts=36.25  
scplasm1   6153   Mean_coverage=3667   Mean_novelAlleleStarts=3542.75

In this way I can learn how to populate hashes of hashes and use the keys as arguements for calculations.

Thanks a lot in advance

pawannoel

View Public Profile for pawannoel

Find all posts by pawannoel

03-27-2011

Registered User

3,733, 1,154

Join Date: Apr 2009

Last Activity: 3 August 2016, 11:03 AM EDT

Posts: 3,733

Thanks Given: 7

Thanked 1,154 Times in 1,124 Posts

Code:

#!/usr/bin/perl
open I, "$ARGV[0]";
while (<I>){
  @F=split;
  $F[8]=~/coverage=([^;]+)/;
  $c_s{$F[0]}{$F[3]}+=$1;
  $F[8]=~/novelAlleleStarts=([^;]+)/;
  $nv_s{$F[0]}{$F[3]}+=$1;
  $n{$F[0]}{$F[3]}++;
}
END{
  for $i (keys %c_s){
    for $j (keys %{$c_s{$i}}){
      print "$i\t$j\tMean_coverage=";
      printf "%.2f\t", ($c_s{$i}{$j}/$n{$i}{$j});
      printf "Mean_novelAlleleStarts=%.2f\n", ($nv_s{$i}{$j}/$n{$i}{$j});
    }
  }
}

Run it as usual: script.pl file. Notice that to get coverage and novelAlleleStarts values I used regular expression matching on 8th field, instead of nested split. I think using nested split was way overcomplicated to achieve such a simple task

. Let me know which part of the code needs explanation.

Last edited by bartus11; 03-27-2011 at 05:17 PM.. Reason: small code tweak

This User Gave Thanks to bartus11 For This Post:

bartus11

View Public Profile for bartus11

Find all posts by bartus11

03-27-2011

Registered User

164, 1

Join Date: Mar 2011

Last Activity: 6 August 2015, 12:14 AM EDT

Posts: 164

Thanks Given: 119

Thanked 1 Time in 1 Post

Thanks a lot Bartus

... its amazing ... does exactly what I want, but makes me laugh when you say "simple task" .... I hope it gets so simple for me too

I'm confused how $1 accumulates both $c_s and $nv_s !!? Please, could you also explain, ofcourse whenever you have time the last two "printf" lines of code please?

Cheers and thanks always ..

++

pawannoel

View Public Profile for pawannoel

Find all posts by pawannoel

03-27-2011

Registered User

3,733, 1,154

Join Date: Apr 2009

Last Activity: 3 August 2016, 11:03 AM EDT

Posts: 3,733

Thanks Given: 7

Thanked 1,154 Times in 1,124 Posts

$1 contains part of the last regex that is enclosed by brackets () -

Code:

$F[8]=~/coverage=([^;]+)/;

So it will contain everything that is after "coverage=", but before ";", so it will match coverage value. When second regex is executed ($F[8]=~/novelAlleleStarts=([^;]+)/;), old $1 contents will get replaced by new match (novelAlleleStarts value). Because of this, $1 should be used immediately after regex match. printf line is standard printf function, in this case formatting the number with two digits precision.

This User Gave Thanks to bartus11 For This Post:

bartus11

View Public Profile for bartus11

Find all posts by bartus11

03-28-2011

Registered User

44, 3

Join Date: Mar 2011

Last Activity: 29 July 2012, 2:27 PM EDT

Posts: 44

Thanks Given: 4

Thanked 3 Times in 3 Posts

******Please ignore my question above, jumped the gun a little*********
Also can someone tell me how to delete post :P

Last edited by pkabali; 03-28-2011 at 03:43 AM..

pkabali

View Public Profile for pkabali

Find all posts by pkabali

03-28-2011

Registered User

164, 1

Join Date: Mar 2011

Last Activity: 6 August 2015, 12:14 AM EDT

Posts: 164

Thanks Given: 119

Thanked 1 Time in 1 Post

@Bartus :
Yeah thats exactly why I was confused because I was thinking two different regex matches must be stored in different variables ... but seems like it can be done with just one too ...
Thanks for your comments .... Hva nice day

@pkabali :
I didnt recieve any questions from you so dont worry ... Good day!

---------- Post updated at 12:53 PM ---------- Previous update was at 04:09 AM ----------

@Bartus :
Hi Bartus, Hope all well ... I have another question .... Concerning slightly more detailed text manipulation but its time for me to learn ...
Sample file

Code:

levure5/levure5.raw.vcf:SK1.scplasm1    6153    .    T    C    222    .    DP=6039;AF1=1;CI95=1,1;DP4=2,2,1706,3565;MQ=33;FQ=-282;PV4=0.6,1,1,1    GT:PL:GQ    1/1:255,255,0:99
levure6/levure6.raw.vcf:SK1.scplasm1    6153    .    T    C    222    .    DP=4272;AF1=1;CI95=1,1;DP4=9,1,1250,2356;MQ=33;FQ=-282;PV4=0.00051,1,1,1    GT:PL:GQ    1/1:255,255,0:99
levure7/levure7.raw.vcf:SK1.scplasm1    6153    .    T    C    222    .    DP=2389;AF1=1;CI95=1,1;DP4=8,3,732,1343;MQ=33;FQ=-282;PV4=0.021,0.48,1,1    GT:PL:GQ    1/1:255,255,0:99
levure8/levure8.raw.vcf:SK1.scplasm1    6153    .    T    C    222    .    DP=2768;AF1=1;CI95=1,1;DP4=5,3,861,1559;MQ=33;FQ=-282;PV4=0.14,1,1,1    GT:PL:GQ    1/1:255,255,0:99
levure5/levure5.raw.vcf:SK1.chr08    558643    .    G    A    225    .    DP=449;AF1=0.5;CI95=0.5,0.5;DP4=122,120,93,58;MQ=44;FQ=225;PV4=0.037,1,1.8e-40,1    GT:PL:GQ    0/1:255,0,255:99
levure6/levure6.raw.vcf:SK1.chr08    558643    .    G    A    211    .    DP=371;AF1=0.5;CI95=0.5,0.5;DP4=115,96,64,47;MQ=44;FQ=214;PV4=0.64,1,9e-33,1    GT:PL:GQ    0/1:241,0,255:99
levure7/levure7.raw.vcf:SK1.chr08    558643    .    G    A    210    .    DP=152;AF1=0.5;CI95=0.5,0.5;DP4=44,47,32,15;MQ=45;FQ=213;PV4=0.031,1,7e-21,1    GT:PL:GQ    0/1:240,0,255:99
levure8/levure8.raw.vcf:SK1.chr08    558643    .    G    A    225    .    DP=208;AF1=0.5;CI95=0.5,0.5;DP4=49,69,29,46;MQ=45;FQ=225;PV4=0.76,1,1.1e-29,1    GT:PL:GQ    0/1:255,0,255:99
levure5/levure5.raw.vcf:SK1.chr08    558978    .    C    T    117    .    DP=436;AF1=0.5;CI95=0.5,0.5;DP4=157,45,87,55;MQ=44;FQ=120;PV4=0.0011,3.2e-71,1.8e-39,1    GT:PL:GQ    0/1:147,0,255:99
levure6/levure6.raw.vcf:SK1.chr08    558978    .    C    T    127    .    DP=352;AF1=0.5;CI95=0.5,0.5;DP4=121,45,71,44;MQ=42;FQ=130;PV4=0.052,4.2e-49,1.2e-23,1    GT:PL:GQ    0/1:157,0,255:99

So in the file in the $F[7] there is the DP4 subfeild, so that would be [3] subfeild inside $F[7]. Now if you look in the sample file what I cant understand is how do I split this particular subfield which has four values separated by commas into separate sub-subfeilds separated by tabs. So basically DP4 reports 4 values for the follwing 4 attributes.

Code:

DP4=fwd_ref_allele, rev_ref_allele, fwd_non_ref_allele, rev_non_ref_allele

I would basically want to report the comma separated values in each line of my file at DP4= as four separate tab delimited entries such as

Code:

fwd_ref_allele=value1[TAB]rev_ref_allele=value2[TAB]fwd_non_ref_allele=value3[TAB]rev_non_ref_allele=value4[TAB]

Once they apear as individual feilds, I can then it will be eaier for me to do calculations using your previous codes.

Thank you and let me know if you can provide some insight on such a problem of text manipulation in files.

Cheers ... Have a nice evening

pawannoel

View Public Profile for pawannoel

Find all posts by pawannoel

03-28-2011

Registered User

3,733, 1,154

Join Date: Apr 2009

Last Activity: 3 August 2016, 11:03 AM EDT

Posts: 3,733

Thanks Given: 7

Thanked 1,154 Times in 1,124 Posts

Something like this?

Code:

perl -pe 's/DP4=(\d+),(\d+),(\d+),(\d+)/fwd_ref_allele=$1\trev_ref_allele=$2\tfwd_non_ref_allele=$3\trev_non_ref_allele=$4\t/' file

This User Gave Thanks to bartus11 For This Post:

bartus11

View Public Profile for bartus11

Find all posts by bartus11

UNIX for Dummies Questions & Answers

Getting non unique lines from concatenated files

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Print number of lines for files in directory, also print number of unique lines

Discussion started by: spacegoose

2. UNIX for Dummies Questions & Answers

Print unique lines without sort or unique

Discussion started by: cokedude

3. Shell Programming and Scripting

Look up 2 files and print the concatenated output

Discussion started by: aravindj80

4. Shell Programming and Scripting

Print only lines where fields concatenated match strings

Discussion started by: Ophiuchus

5. Shell Programming and Scripting

compare 2 files and return unique lines in each file (based on condition)

Discussion started by: anurupa777

6. UNIX for Dummies Questions & Answers

getting unique lines from 2 files

Discussion started by: anurupa777

7. Shell Programming and Scripting

Compare multiple files and print unique lines

Discussion started by: jacobs.smith

8. UNIX for Advanced & Expert Users

In a huge file, Delete duplicate lines leaving unique lines

Discussion started by: krishnix

9. Shell Programming and Scripting

Comparing 2 files and return the unique lines in first file

Discussion started by: shekhar_v4

10. Shell Programming and Scripting

Lines Concatenated with awk

Discussion started by: xadamz23