Getting non-unique lines from concatenated files


 
# 57  
Old 03-26-2011
Code:
#!/usr/bin/perl
open I, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";           # open file with filename passed to script as first argument ($ARGV[0])
while (<I>){                                                         # sequentially read lines of that file and put them into the $_ variable
  $pos=((split "[\t ]+",$_)[4]);                                     # extract position number from the current line stored in $_
  $cov=((split "=",(split ";",(split "[\t ]+",$_)[8])[2])[1]);       # extract coverage value in the same way
  $s{$pos}+=$cov;                                                    # add coverage value for this position
  $n{$pos}++;                                                        # count number of occurrences for this position
  $s_tot+=$cov;                                                      # add coverage value to the total sum
  $n_tot++;                                                          # count number of coverage values in total
}
for $i (keys %s){print "Mean_Coverage_for_position_$i = " . ($s{$i}/$n{$i}) . "\n"}  # calculate and print average coverage for each position
print "Mean_Coverage = " . ($s_tot/$n_tot) . "\n";                   # calculate and print the global average
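To run it, save the code to a file (say coverage.pl - the name is just an example), make it executable, and pass one of your SNP files as the first argument, e.g. ./coverage.pl yourfile.gff. It prints one Mean_Coverage_for_position_ line per distinct position, followed by the overall Mean_Coverage line.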

# 58  
Old 03-26-2011
Thanks Bartus,
I have another question. For each chromosome in field[0] (which can be chr1-16, chrm, or a scplasm entry), I would like to count how many different entries are present in field[3].
And of course I would like to do this on multiple files, please.

Sample file:

Code:
chr01   levure5 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01   levure6 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure7 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chrm    levure7 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=18;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=18;novelAlleleStarts=4;novelAlleleMeanQV=13;diColor1=21;diColor2=21;het=0;flag=h4,h10,;gene;ID=Q0182;Name=Q0182;Alias=ORF11
chrm    levure8 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=20;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=19;novelAlleleStarts=4;novelAlleleMeanQV=14;diColor1=21;diColor2=21;het=0;flag=h4,h10,h9,;gene;ID=Q0182;Name=Q0182;Alias=ORF11
chrm    levure5 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=61;refAlleleCounts=8;refAlleleStarts=4;refAlleleMeanQV=11;novelAlleleCounts=52;novelAlleleStarts=6;novelAlleleMeanQV=16;diColor1=20;diColor2=20;het=0;flag=h4,h10,
scplasm1        levure5 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=5752;refAlleleCounts=7;refAlleleStarts=7;refAlleleMeanQV=3;novelAlleleCounts=5588;novelAlleleStarts=58;novelAlleleMeanQV=20;diColor1=22;diColor2=22;het=0;flag=h4,h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398

Expected Output:
Code:
chr01[TAB]1
chrm[TAB]2
scplasm[TAB]1

Thank you very much for all your help, input and valuable comments on the code.

Have a nice weekend.
# 59  
Old 03-26-2011
Code:
#!/bin/bash
for i in "$@"; do                       # iterate over all filenames passed on the command line
  echo "$i:"
  perl -ane '$h{$F[0]}{$F[3]}++;END{for $i (keys %h){@x=keys %{$h{$i}};print "$i\t" . ($#x+1) . "\n"}}' "$i"
done

Run it as usual: ./script.sh file*
Explanation tomorrow.
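(In case it helps while waiting for the explanation: in perl -ane, -e supplies the code on the command line, -n wraps it in a while (<>) loop that reads the input line by line, and -a autosplits each line on whitespace into the @F array, so $F[0] is the first field and $F[3] the fourth.)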
# 60  
Old 03-26-2011
Whenever you have time, Bartus .... sorry for so many questions, and thanks a lot for your help.
Have a good weekend.
# 61  
Old 03-27-2011
I'll just explain the new pieces of code: $h{$F[0]}{$F[3]}++ - this creates a "hash of hashes", with the first key being the first field of your file and the second key being the fourth field. This technique is described in the "Intermediate Perl" book. So after running this code over your sample file, the %h hash structure looks like this:
Code:
 %h = {
          'scplasm1' => {
                          '6153' => 1
                        },
          'chr01' => {
                       '12745' => 3
                     },
          'chrm' => {
                      '86086' => 2,
                      '98064' => 1
                    }
        };

As you can see, for each $F[0] field, all the $F[3] values are present as keys of the underlying hash. The values stored in those inner hashes are the numbers of occurrences of each $F[3] value. The code would also work without storing that information, using $h{$F[0]}{$F[3]}=1 instead; the hash would then look like this:
Code:
%h = {
          'scplasm1' => {
                          '6153' => 1
                        },
          'chr01' => {
                       '12745' => 1
                     },
          'chrm' => {
                      '86086' => 1,
                      '98064' => 1
                    }
        };

Now all we have to do is print the number of keys present in each inner hash for each main hash key - the $F[0] field. To do this we iterate over the $F[0] values using for $i (keys %h){. Then we assign the inner hash keys to the @x array with @x=keys %{$h{$i}};. So for each $i value, @x will look like this:
Code:
$i = scplasm1
@x = [
          '6153'
        ];

$i = chr01
@x = [
          '12745'
        ];

$i = chrm
@x = [
          '86086',
          '98064'
        ];

Now all we have to do is print $i and the respective number of elements in the @x array: print "$i\t" . ($#x+1) . "\n"
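As a side note (not part of bartus11's one-liner), the same count can be obtained without the temporary @x array by calling scalar keys directly on each inner hash. A minimal standalone sketch, assuming whitespace-separated files passed on the command line like the sample above:
Code:
#!/usr/bin/perl
use strict;
use warnings;

my %h;
while (<>) {
    my @F = split;                     # split on whitespace, like perl -a does
    $h{ $F[0] }{ $F[3] }++;            # hash of hashes: outer key = chromosome, inner key = position
}
for my $chr (sort keys %h) {
    my $n = scalar keys %{ $h{$chr} }; # number of distinct positions for this chromosome
    print "$chr\t$n\n";
}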
# 62  
Old 03-27-2011
Thanks Bartus ... that was really nice of you ... Thank you ever so much for the in-depth explanation .... It's really clear ... the only part I didn't understand was
Code:
. ($#x+1) .



Have a nice Sunday.

---------- Post updated at 12:14 PM ---------- Previous update was at 11:32 AM ----------

I reckon it's like that because in Perl indexing starts from zero, so zero means one, one means two, etc. ... that's the need for the +1, and $#x means the number of elements in the @x array, as you said .... am I right about the +1?
# 63  
Old 03-27-2011
Yes. To be precise, $#x is the last index of the @x array, so to get the number of elements in that array it needs to be incremented by 1.
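For anyone reading later, a tiny self-contained illustration of that point (the array contents are just an example):
Code:
#!/usr/bin/perl
use strict;
use warnings;

my @x = ('86086', '98064');   # two elements, as in the chrm case above
print $#x, "\n";              # prints 1 -> last index of @x
print $#x + 1, "\n";          # prints 2 -> number of elements
print scalar(@x), "\n";       # prints 2 -> the same count, written more explicitly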
 