Counting the lines matching a pattern between two patterns, and generating a table
Hi all,
I'm looking for some help. I have a (very long) file that is organized like below:

>Cluster 0
0 283nt, >01_FRYJ6ZM12HMXZS... at +/99%
1 279nt, >01_FRYJ6ZM12HN12A... at +/99%
2 281nt, >01_FRYJ6ZM12HM4TS... at +/99%
3 283nt, >01_FRYJ6ZM12HM946... at +/99%
4 279nt, >01_FRYJ6ZM12HJD9N... at +/99%
5 283nt, >01_FRYJ6ZM12HMM35... at +/99%
6 280nt, >01_FRYJ6ZM12HK26A... at +/99%
7 280nt, >01_FRYJ6ZM12HJ4UN... at +/99%
8 280nt, >01_FRYJ6ZM12HOKP6... at +/99%
9 283nt, >01_FRYJ6ZM12HCH1I... at +/99%
10 280nt, >01_FRYJ6ZM12HKTVV... at +/99%
11 280nt, >01_FRYJ6ZM12HL7IW... at +/98%
12 290nt, >01_FRYJ6ZM12HFI8R... *
13 281nt, >01_FRYJ6ZM12HLLN4... at +/98%
14 280nt, >01_FRYJ6ZM12HI82W... at +/99%
15 267nt, >01_FRYJ6ZM12HISC6... at +/98%
16 270nt, >01_FRYJ6ZM12HMKQG... at +/98%
17 290nt, >01_FRYJ6ZM12HJUQE... at +/98%
18 283nt, >01_FRYJ6ZM12HFSMR... at +/99%
19 280nt, >01_FRYJ6ZM12HK595... at +/99%
20 283nt, >01_FRYJ6ZM12HL768... at +/99%
21 266nt, >01_FRYJ6ZM12HMTF3... at +/100%
22 280nt, >02_FRYJ6ZM12HLE98... at +/99%
23 290nt, >04_FRYJ6ZM12HL1JH... at +/97%
24 275nt, >05_FRYJ6ZM12HE7XC... at +/99%
25 276nt, >05_FRYJ6ZM12HNA0I... at +/98%
26 271nt, >05_FRYJ6ZM12HL9ET... at +/99%
27 275nt, >05_FRYJ6ZM12HH0U0... at +/99%
28 271nt, >05_FRYJ6ZM12HL1AP... at +/99%
29 279nt, >06_FRYJ6ZM12HNECQ... at +/99%
30 278nt, >06_FRYJ6ZM12HMUTE... at +/99%
31 279nt, >06_FRYJ6ZM12HKY06... at +/99%
32 281nt, >08_FRYJ6ZM12HHVLF... at +/99%
33 290nt, >08_FRYJ6ZM12HL1JH... at +/100%
34 276nt, >08_FRYJ6ZM12HLIA7... at +/100%
35 286nt, >08_FRYJ6ZM12HNF98... at +/98%
36 290nt, >08_FRYJ6ZM12HIMCK... at +/100%
37 290nt, >08_FRYJ6ZM12HKJII... at +/100%
38 270nt, >08_FRYJ6ZM12HDIK1... at +/100%
39 279nt, >10_FRYJ6ZM12HEE9R... at +/99%
40 280nt, >10_FRYJ6ZM12HKXEK... at +/98%
41 279nt, >10_FRYJ6ZM12HLZN6... at +/99%
42 275nt, >14_FRYJ6ZM12HGC5C... at +/98%
43 276nt, >15_FRYJ6ZM12HI550... at +/98%
44 271nt, >19_FRYJ6ZM12HMU2M... at +/98%
>Cluster 1
0 290nt, >01_FRYJ6ZM12HKQWR... *
1 281nt, >02_FRYJ6ZM12HNJ2B... at +/100%
2 266nt, >03_FRYJ6ZM12HMQY1... at +/100%
3 266nt, >05_FRYJ6ZM12HMPA8... at +/100%
4 280nt, >05_FRYJ6ZM12HE9N5... at +/99%
5 280nt, >05_FRYJ6ZM12HKTHG... at +/100%
6 280nt, >05_FRYJ6ZM12HKP1Z... at +/99%
7 280nt, >05_FRYJ6ZM12HIF2F... at +/99%
8 279nt, >05_FRYJ6ZM12HJ9MO... at +/97%
9 280nt, >05_FRYJ6ZM12HIQQH... at +/100%
10 281nt, >06_FRYJ6ZM12HLHZL... at +/99%
11 280nt, >06_FRYJ6ZM12HH9O0... at +/99%
12 281nt, >06_FRYJ6ZM12HK2SZ... at +/99%
13 281nt, >06_FRYJ6ZM12HJNW4... at +/100%
14 279nt, >06_FRYJ6ZM12HJUIE... at +/97%
15 280nt, >06_FRYJ6ZM12HFHXR... at +/97%
16 281nt, >06_FRYJ6ZM12HND03... at +/99%
17 282nt, >06_FRYJ6ZM12HHC7G... at +/98%
18 280nt, >06_FRYJ6ZM12HF5CY... at +/100%
19 280nt, >06_FRYJ6ZM12HEVGT... at +/99%
20 281nt, >06_FRYJ6ZM12HLILE... at +/99%
21 278nt, >06_FRYJ6ZM12HLWHQ... at +/99%
22 280nt, >06_FRYJ6ZM12HIU71... at +/100%
23 279nt, >06_FRYJ6ZM12HM3GZ... at +/99%
24 281nt, >06_FRYJ6ZM12HF238... at +/99%
25 273nt, >06_FRYJ6ZM12HDO08... at +/98%
26 276nt, >06_FRYJ6ZM12HE3OI... at +/98%
27 280nt, >06_FRYJ6ZM12HHQ56... at +/100%
28 280nt, >06_FRYJ6ZM12HFYQT... at +/100%
29 271nt, >06_FRYJ6ZM12HLGT2... at +/100%
30 281nt, >06_FRYJ6ZM12HM69N... at +/99%
31 281nt, >06_FRYJ6ZM12HG1WU... at +/99%
32 276nt, >06_FRYJ6ZM12HMHA6... at +/98%
33 245nt, >06_FRYJ6ZM12HHDL4... at +/99%
34 281nt, >06_FRYJ6ZM12HMQZI... at +/98%
35 281nt, >06_FRYJ6ZM12HNAR8... at +/100%
36 279nt, >06_FRYJ6ZM12HN5DI... at +/100%
37 280nt, >06_FRYJ6ZM12HGLSU... at +/98%
38 286nt, >11_FRYJ6ZM12HPCXJ... at +/98%
39 290nt, >11_FRYJ6ZM12HGPWI... at +/99%
40 285nt, >11_FRYJ6ZM12HM9YT... at +/98%
41 286nt, >11_FRYJ6ZM12HI2GG... at +/97%
42 290nt, >11_FRYJ6ZM12HMG2Y... at +/99%
43 281nt, >15_FRYJ6ZM12HKZNJ... at +/100%
44 280nt, >15_FRYJ6ZM12HE9QN... at +/99%
45 265nt, >17_FRYJ6ZM12HJRPI... at +/100%
46 275nt, >17_FRYJ6ZM12HLDLG... at +/98%
47 279nt, >17_FRYJ6ZM12HG1RZ... at +/99%
48 279nt, >17_FRYJ6ZM12HI1H8... at +/98%
49 280nt, >17_FRYJ6ZM12HNISU... at +/99%
50 280nt, >17_FRYJ6ZM12HMIHP... at +/99%
51 280nt, >17_FRYJ6ZM12HI58U... at +/99%
52 280nt, >17_FRYJ6ZM12HILMN... at +/100%
53 242nt, >17_FRYJ6ZM12HKVKQ... at +/98%
54 279nt, >17_FRYJ6ZM12HL1B9... at +/99%
55 280nt, >17_FRYJ6ZM12HEW7F... at +/98%
56 271nt, >17_FRYJ6ZM12HGGML... at +/99%
57 280nt, >17_FRYJ6ZM12HPMJM... at +/98%
58 277nt, >17_FRYJ6ZM12HH5V2... at +/99%
59 267nt, >17_FRYJ6ZM12HIDX1... at +/100%
60 271nt, >17_FRYJ6ZM12HHBYP... at +/98%
61 281nt, >17_FRYJ6ZM12HMHMF... at +/99%
62 282nt, >17_FRYJ6ZM12HLC9P... at +/99%
63 282nt, >17_FRYJ6ZM12HDDJ5... at +/99%
64 276nt, >17_FRYJ6ZM12HKV2F... at +/100%
65 276nt, >17_FRYJ6ZM12HK5OD... at +/99%
66 280nt, >17_FRYJ6ZM12HG1JG... at +/99%
67 281nt, >17_FRYJ6ZM12HMHDW... at +/99%
68 264nt, >17_FRYJ6ZM12HCHVO... at +/100%
69 280nt, >17_FRYJ6ZM12HHT9Y... at +/100%
70 280nt, >17_FRYJ6ZM12HGIYR... at +/100%
71 280nt, >17_FRYJ6ZM12HGR8Y... at +/100%
72 278nt, >17_FRYJ6ZM12HE3PW... at +/98%
73 197nt, >17_FRYJ6ZM12HIYK4... at +/100%
>Cluster 2
0 286nt, >04_FRYJ6ZM12HPCXJ... at +/99%
1 290nt, >04_FRYJ6ZM12HGPWI... *
2 285nt, >04_FRYJ6ZM12HM9YT... at +/98%
3 266nt, >04_FRYJ6ZM12HJK88... at +/100%
4 281nt, >04_FRYJ6ZM12HKZNJ... at +/97%
5 286nt, >04_FRYJ6ZM12HI2GG... at +/98%
>Cluster 3
0 286nt, >04_FRYJ6ZM12HD3BT... *
1 286nt, >06_FRYJ6ZM12HD3BT... at +/97%
>Cluster 4
0 286nt, >23_FRYJ6ZM12HI2GG... *
>Cluster 5
0 280nt, >04_FRYJ6ZM12HO3WD... at +/97%
1 285nt, >04_FRYJ6ZM12HGI5Z... *
2 285nt, >15_FRYJ6ZM12HGI5Z... at +/97%
.......

This is only part of the file. So basically, we have 6 clusters here (numbered 0 to 5; my real file has 1200 in total). Within each cluster, the first column ($1) is a count, the second column ($2) is the length of the sequence, and the third column ($3) is the name of the sequence. Each cluster has what I call the representative sequence (the sequence whose name is followed by a *). A sequence name starts with the symbol >, followed by 2 digits (01-24), and then letters.

Here is what I need: create a table (probably using awk) containing:
- in col 1: the name of the representative sequence (one per cluster, the name of the sequence followed by *). I can use the command: awk '/\*/ {print $3}' file
- in col 2 to 25: the count of sequences that belong to group 01 to group 24. Basically, in col 2 I want to know how many times a sequence beginning with >01_ appears in cluster 0, in col 3 how many times a sequence beginning with >02_ appears in cluster 0, and so on.

Here is the output file that I want for the file given above:

>01_FRYJ6ZM12HFI8R 22 1 0 1 5 3 0 7 0 3 0 0 0 1 1 0 0 0 1 0 0 0 0 0
>01_FRYJ6ZM12HKQWR 1 1 1 0 7 28 0 0 0 0 5 0 0 0 2 0 29 0 0 0 0 0 0 0
>04_FRYJ6ZM12HGPWI 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>04_FRYJ6ZM12HD3BT 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>23_FRYJ6ZM12HI2GG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
>04_FRYJ6ZM12HGI5Z 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

I need all the fields to be separated by a tab, of course. Thank you for helping me if anybody knows what I should do or whom to ask. Feel free to ask me questions...
Diane
Code:
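#!/usr/bin/perl
# Usage (the script name is arbitrary): perl clusters.pl file.fas.clstr.clstr > file.txt
use strict;
use warnings;

my (%hash, $cluster);
my $file = @ARGV ? $ARGV[0] : 'a.txt';    # input file; falls back to a.txt
open my $fh, '<', $file or die "Cannot open $file: $!";
while (<$fh>) {
    # Cluster header, e.g. ">Cluster 0": start a new cluster
    if (/^>\D*(\d+)\D*$/) {
        $cluster = $1;
        $hash{$cluster} = {};
        next;
    }
    next unless defined $cluster;
    # Member line: count the two-digit group prefix that follows ">"
    if (/^.+>(\d+)/) {
        $hash{$cluster}{$1}++;
    }
    # Representative sequence: the line ends with "*"; keep the name without the dots
    if (/^[^>]*(>[^.]*).*\*\s*$/) {
        $hash{$cluster}{NAME} = $1;
    }
}
close $fh;

# One tab-separated row per cluster: the name, then the counts for groups 01..24
foreach my $key (sort { $a <=> $b } keys %hash) {
    my $name = defined $hash{$key}{NAME} ? $hash{$key}{NAME} : '';
    print join("\t", $name, map { $hash{$key}{$_} || 0 } ('01' .. '24')), "\n";
}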
Thank you. I see this script is a Perl script. I don't know anything about Perl; could you tell me the command line I'm supposed to use in my terminal? (The input file to be processed is file.fas.clstr.clstr, and the output file should be named file.txt.)
Do I have to save the script you gave me in the same directory as my input file? Thank you for as much information as you can give me, Diane
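
Since the original question mentions awk, here is a rough awk sketch of the same counting, wrapped in a shell function. It is not taken from the thread: the function name cluster_table is made up for illustration, and it assumes the layout described in the question (a ">Cluster N" header line, then whitespace-separated member lines whose third field is the sequence name, with "*" as the last field on the representative line).

```shell
#!/bin/sh
# cluster_table FILE: print one tab-separated row per cluster.
# Column 1 is the representative name, columns 2-25 are counts for groups 01-24.
# The function name is hypothetical; the input layout is assumed from the question.
cluster_table() {
    awk '
    # Print the finished cluster as one row, then reset the counters.
    function emit(    i, row) {
        if (!seen) return
        row = name
        for (i = 1; i <= 24; i++) row = row "\t" (count[i] + 0)
        print row
        split("", count)
        name = ""
    }
    /^>Cluster/ { emit(); seen = 1; next }
    seen && NF >= 3 {
        id = $3                           # e.g. ">01_FRYJ6ZM12HFI8R..."
        count[substr(id, 2, 2) + 0]++     # two-digit group prefix as a number
        if ($NF == "*") {                 # representative sequence
            sub(/\.+$/, "", id)           # drop the trailing dots
            name = id
        }
    }
    END { emit() }
    ' "$1"
}

# Hypothetical usage, with the file names from the question:
# cluster_table file.fas.clstr.clstr > file.txt
```

The function can be pasted into a terminal or saved in a script file. The input path is passed as an argument, so the script does not need to live in the same directory as the data; a relative or absolute path to the input file should work.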