Sponsored Content
Top Forums UNIX for Dummies Questions & Answers Getting non unique lines from concatenated files Post 302508286 by pawannoel on Sunday 27th of March 2011 02:04:54 PM
Old 03-27-2011
Using this sample file I would like to learn more about file manipulation using Perl from you and others, so I have my next question already ... sorry Smilie
Sample file:
Code:
chr01   levure5 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01   levure6 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure7 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure5 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=111;refAlleleCounts=82;refAlleleStarts=41;refAlleleMeanQV=18;novelAlleleCounts=28;novelAlleleStarts=9;novelAlleleMeanQV=18;diColor1=10;diColor2=32;het=1;flag=
chr01   levure6 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=96;refAlleleCounts=72;refAlleleStarts=38;refAlleleMeanQV=17;novelAlleleCounts=24;novelAlleleStarts=6;novelAlleleMeanQV=15;diColor1=10;diColor2=32;het=1;flag=
chr01   levure7 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=32;refAlleleCounts=20;refAlleleStarts=18;refAlleleMeanQV=19;novelAlleleCounts=12;novelAlleleStarts=5;novelAlleleMeanQV=17;diColor1=10;diColor2=32;het=1;flag=
chr01   levure5 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=42;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=35;novelAlleleStarts=16;novelAlleleMeanQV=19;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure6 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=32;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=23;novelAlleleStarts=11;novelAlleleMeanQV=17;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure7 SNP     16511   16511   0.000000        .       .       genotype=A;reference=G;coverage=17;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=14;novelAlleleStarts=11;novelAlleleMeanQV=20;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure5 SNP     189778  189778  0.000000        .       .       genotype=G;reference=A;coverage=16;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=16;novelAlleleStarts=12;novelAlleleMeanQV=24;diColor1=10;diColor2=10;het=0;flag=h4,h10,;transpo
sable_element;ID=COPY_YORCTy2-1;Name=COPY_YORCTy2-1
chrm    levure7 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=18;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=18;novelAlleleStarts=4;novelAlleleMeanQV=13;diColor1=21;diColor2=21;het=0;flag=h4,h10,;gene;ID=
Q0182;Name=Q0182;Alias=ORF11
chrm    levure8 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=20;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=19;novelAlleleStarts=4;novelAlleleMeanQV=14;diColor1=21;diColor2=21;het=0;flag=h4,h10,h9,;gene;
ID=Q0182;Name=Q0182;Alias=ORF11
chrm    levure5 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=61;refAlleleCounts=8;refAlleleStarts=4;refAlleleMeanQV=11;novelAlleleCounts=52;novelAlleleStarts=6;novelAlleleMeanQV=16;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure6 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=59;refAlleleCounts=3;refAlleleStarts=2;refAlleleMeanQV=16;novelAlleleCounts=55;novelAlleleStarts=5;novelAlleleMeanQV=18;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure7 SNP     98064   98064   0.019086        .       .       genotype=A;reference=G;coverage=23;refAlleleCounts=3;refAlleleStarts=3;refAlleleMeanQV=15;novelAlleleCounts=18;novelAlleleStarts=5;novelAlleleMeanQV=19;diColor1=20;diColor2=20;het=0;flag=h4,h10,
chrm    levure8 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=24;refAlleleCounts=3;refAlleleStarts=2;refAlleleMeanQV=8;novelAlleleCounts=20;novelAlleleStarts=6;novelAlleleMeanQV=19;diColor1=20;diColor2=20;het=0;flag=h4,h10,
scplasm1        levure5 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=5752;refAlleleCounts=7;refAlleleStarts=7;refAlleleMeanQV=3;novelAlleleCounts=5588;novelAlleleStarts=58;novelAlleleMeanQV=20;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure6 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=4046;refAlleleCounts=10;refAlleleStarts=9;refAlleleMeanQV=5;novelAlleleCounts=3862;novelAlleleStarts=58;novelAlleleMeanQV=18;diColor1=22;diColor2=22;het=0;flag=h4
,h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure7 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=2264;refAlleleCounts=2;refAlleleStarts=2;refAlleleMeanQV=3;novelAlleleCounts=2184;novelAlleleStarts=58;novelAlleleMeanQV=21;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398
scplasm1        levure8 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=2606;refAlleleCounts=2;refAlleleStarts=2;refAlleleMeanQV=4;novelAlleleCounts=2537;novelAlleleStarts=57;novelAlleleMeanQV=21;diColor1=22;diColor2=22;het=0;flag=h4,
h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398

What if in the above I wanted to gather only some fields (which from $F[0] to $F[7] should be simple enough to do). You have provided a way to split the $F[8] field and extract subfeild "x" (in red below )information with:
Code:
((split "=",(split ";",(split "[\t ]+",$_)[8])[x])[1])


Now what I want to learn is how can I recover other fields from
$F[8] and then for each report its average value for each position reporting only once in $F[3] and I would like to do the same for $F[0]

So in other words my expected output if I want to gather coverage and novelAlleleStarts subfields after splitting $F[8] should be like
Code:
chr01   12745   Mean_coverage=59   Mean_novelAlleleStarts= 28.33
chr01   16254   Mean_coverage=79.66   Mean_novelAlleleStarts=21.33
chr01   16511   Mean_coverage=30.33   Mean_novelAlleleStarts=24
chr01   189778   Mean_coverage=16   Mean_novelAlleleStarts=16
chrm   86086   Mean_coverage=19   Mean_novelAlleleStarts=18.5   
chrm   98064   Mean_coverage=41.75   Mean_novelAlleleStarts=36.25  
scplasm1   6153   Mean_coverage=3667   Mean_novelAlleleStarts=3542.75

In this way I can learn how to populate hashes of hashes and use the keys as arguements for calculations.

Thanks a lot in advanceSmilie

 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Lines Concatenated with awk

Hello, I have a bash shell script and I use awk to print certain columns of one file and direct the output to another file. If I do a less or cat on the file it looks correct, but if I email the file and open it with Outlook the lines outputted by awk are concatenated. Here is my awk line:... (6 Replies)
Discussion started by: xadamz23
6 Replies

2. Shell Programming and Scripting

Comparing 2 files and return the unique lines in first file

Hi, I have 2 files file1 ******** 01-05-09|java.xls| 02-05-08|c.txt| 08-01-09|perl.txt| 01-01-09|oracle.txt| ******** file2 ******** 01-02-09|windows.xls| 02-05-08|c.txt| 01-05-09|java.xls| 08-02-09|perl.txt| 01-01-09|oracle.txt| ******** (8 Replies)
Discussion started by: shekhar_v4
8 Replies

3. UNIX for Advanced & Expert Users

In a huge file, Delete duplicate lines leaving unique lines

Hi All, I have a very huge file (4GB) which has duplicate lines. I want to delete duplicate lines leaving unique lines. Sort, uniq, awk '!x++' are not working as its running out of buffer space. I dont know if this works : I want to read each line of the File in a For Loop, and want to... (16 Replies)
Discussion started by: krishnix
16 Replies

4. Shell Programming and Scripting

Compare multiple files and print unique lines

Hi friends, I have multiple files. For now, let's say I have two of the following style cat 1.txt cat 2.txt output.txt Please note that my files are not sorted and in the output file I need another extra column that says the file from which it is coming. I have more than 100... (19 Replies)
Discussion started by: jacobs.smith
19 Replies

5. UNIX for Dummies Questions & Answers

getting unique lines from 2 files

hi i have used comm -13 <(sort 1.txt) <(sort 2.txt) option to get the unique lines that are present in file 2 but not in file 1. but some how i am getting the entire file 2. i would expect few but not all uncommon lines fro my dat. is there anything wrong with the way i used the command? my... (1 Reply)
Discussion started by: anurupa777
1 Replies

6. Shell Programming and Scripting

compare 2 files and return unique lines in each file (based on condition)

hi my problem is little complicated one. i have 2 files which appear like this file 1 abbsss:aa:22:34:as akl abc 1234 mkilll:as:ss:23:qs asc abc 0987 mlopii:cd:wq:24:as asd abc 7866 file2 lkoaa:as:24:32:sa alk abc 3245 lkmo:as:34:43:qs qsa abc 0987 kloia:ds:45:56:sa acq abc 7805 i... (5 Replies)
Discussion started by: anurupa777
5 Replies

7. Shell Programming and Scripting

Print only lines where fields concatenated match strings

Hello everyone, Maybe somebody could help me with an awk script. I have this input (field separator is comma ","): 547894982,M|N|J,U|Q|P,98,101,0,1,1 234900027,M|N|J,U|Q|P,98,101,0,1,1 234900023,M|N|J,U|Q|P,98,54,3,1,1 234900028,M|H|J,S|Q|P,98,101,0,1,1 234900030,M|N|J,U|F|P,98,101,0,1,1... (2 Replies)
Discussion started by: Ophiuchus
2 Replies

8. Shell Programming and Scripting

Look up 2 files and print the concatenated output

file 1 Sun Mar 17 00:01:33 2013 submit , Name="1234" Sun Mar 17 00:01:33 2013 submit , Name="1344" Sun Mar 17 00:01:33 2013 submit , Name="1124" .. .. .. .. Sun Mar 17 00:01:33 2013 submit , Name="8901" file 2 Sun Mar 17 00:02:47 2013 1234 execute SUCCEEDED Sun Mar 17... (24 Replies)
Discussion started by: aravindj80
24 Replies

9. UNIX for Dummies Questions & Answers

Print unique lines without sort or unique

I would like to print unique lines without sort or unique. Unfortunately the server I am working on does not have sort or unique. I have not been able to contact the administrator of the server to ask him to add it for several weeks. (7 Replies)
Discussion started by: cokedude
7 Replies

10. UNIX for Beginners Questions & Answers

Print number of lines for files in directory, also print number of unique lines

I have a directory of files, I can show the number of lines in each file and order them from lowest to highest with: wc -l *|sort 15263 Image.txt 16401 reference.txt 40459 richtexteditor.txt How can I also print the number of unique lines in each file? 15263 1401 Image.txt 16401... (15 Replies)
Discussion started by: spacegoose
15 Replies
All times are GMT -4. The time now is 03:00 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy