Counting the lines matching a pattern, in between two patterns, and generating a table


 
# 1  
Old 03-16-2009

Hi all,

I'm looking for some help. I have a very long file that is organized like this:

>Cluster 0
0 283nt, >01_FRYJ6ZM12HMXZS... at +/99%
1 279nt, >01_FRYJ6ZM12HN12A... at +/99%
2 281nt, >01_FRYJ6ZM12HM4TS... at +/99%
3 283nt, >01_FRYJ6ZM12HM946... at +/99%
4 279nt, >01_FRYJ6ZM12HJD9N... at +/99%
5 283nt, >01_FRYJ6ZM12HMM35... at +/99%
6 280nt, >01_FRYJ6ZM12HK26A... at +/99%
7 280nt, >01_FRYJ6ZM12HJ4UN... at +/99%
8 280nt, >01_FRYJ6ZM12HOKP6... at +/99%
9 283nt, >01_FRYJ6ZM12HCH1I... at +/99%
10 280nt, >01_FRYJ6ZM12HKTVV... at +/99%
11 280nt, >01_FRYJ6ZM12HL7IW... at +/98%
12 290nt, >01_FRYJ6ZM12HFI8R... *
13 281nt, >01_FRYJ6ZM12HLLN4... at +/98%
14 280nt, >01_FRYJ6ZM12HI82W... at +/99%
15 267nt, >01_FRYJ6ZM12HISC6... at +/98%
16 270nt, >01_FRYJ6ZM12HMKQG... at +/98%
17 290nt, >01_FRYJ6ZM12HJUQE... at +/98%
18 283nt, >01_FRYJ6ZM12HFSMR... at +/99%
19 280nt, >01_FRYJ6ZM12HK595... at +/99%
20 283nt, >01_FRYJ6ZM12HL768... at +/99%
21 266nt, >01_FRYJ6ZM12HMTF3... at +/100%
22 280nt, >02_FRYJ6ZM12HLE98... at +/99%
23 290nt, >04_FRYJ6ZM12HL1JH... at +/97%
24 275nt, >05_FRYJ6ZM12HE7XC... at +/99%
25 276nt, >05_FRYJ6ZM12HNA0I... at +/98%
26 271nt, >05_FRYJ6ZM12HL9ET... at +/99%
27 275nt, >05_FRYJ6ZM12HH0U0... at +/99%
28 271nt, >05_FRYJ6ZM12HL1AP... at +/99%
29 279nt, >06_FRYJ6ZM12HNECQ... at +/99%
30 278nt, >06_FRYJ6ZM12HMUTE... at +/99%
31 279nt, >06_FRYJ6ZM12HKY06... at +/99%
32 281nt, >08_FRYJ6ZM12HHVLF... at +/99%
33 290nt, >08_FRYJ6ZM12HL1JH... at +/100%
34 276nt, >08_FRYJ6ZM12HLIA7... at +/100%
35 286nt, >08_FRYJ6ZM12HNF98... at +/98%
36 290nt, >08_FRYJ6ZM12HIMCK... at +/100%
37 290nt, >08_FRYJ6ZM12HKJII... at +/100%
38 270nt, >08_FRYJ6ZM12HDIK1... at +/100%
39 279nt, >10_FRYJ6ZM12HEE9R... at +/99%
40 280nt, >10_FRYJ6ZM12HKXEK... at +/98%
41 279nt, >10_FRYJ6ZM12HLZN6... at +/99%
42 275nt, >14_FRYJ6ZM12HGC5C... at +/98%
43 276nt, >15_FRYJ6ZM12HI550... at +/98%
44 271nt, >19_FRYJ6ZM12HMU2M... at +/98%
>Cluster 1
0 290nt, >01_FRYJ6ZM12HKQWR... *
1 281nt, >02_FRYJ6ZM12HNJ2B... at +/100%
2 266nt, >03_FRYJ6ZM12HMQY1... at +/100%
3 266nt, >05_FRYJ6ZM12HMPA8... at +/100%
4 280nt, >05_FRYJ6ZM12HE9N5... at +/99%
5 280nt, >05_FRYJ6ZM12HKTHG... at +/100%
6 280nt, >05_FRYJ6ZM12HKP1Z... at +/99%
7 280nt, >05_FRYJ6ZM12HIF2F... at +/99%
8 279nt, >05_FRYJ6ZM12HJ9MO... at +/97%
9 280nt, >05_FRYJ6ZM12HIQQH... at +/100%
10 281nt, >06_FRYJ6ZM12HLHZL... at +/99%
11 280nt, >06_FRYJ6ZM12HH9O0... at +/99%
12 281nt, >06_FRYJ6ZM12HK2SZ... at +/99%
13 281nt, >06_FRYJ6ZM12HJNW4... at +/100%
14 279nt, >06_FRYJ6ZM12HJUIE... at +/97%
15 280nt, >06_FRYJ6ZM12HFHXR... at +/97%
16 281nt, >06_FRYJ6ZM12HND03... at +/99%
17 282nt, >06_FRYJ6ZM12HHC7G... at +/98%
18 280nt, >06_FRYJ6ZM12HF5CY... at +/100%
19 280nt, >06_FRYJ6ZM12HEVGT... at +/99%
20 281nt, >06_FRYJ6ZM12HLILE... at +/99%
21 278nt, >06_FRYJ6ZM12HLWHQ... at +/99%
22 280nt, >06_FRYJ6ZM12HIU71... at +/100%
23 279nt, >06_FRYJ6ZM12HM3GZ... at +/99%
24 281nt, >06_FRYJ6ZM12HF238... at +/99%
25 273nt, >06_FRYJ6ZM12HDO08... at +/98%
26 276nt, >06_FRYJ6ZM12HE3OI... at +/98%
27 280nt, >06_FRYJ6ZM12HHQ56... at +/100%
28 280nt, >06_FRYJ6ZM12HFYQT... at +/100%
29 271nt, >06_FRYJ6ZM12HLGT2... at +/100%
30 281nt, >06_FRYJ6ZM12HM69N... at +/99%
31 281nt, >06_FRYJ6ZM12HG1WU... at +/99%
32 276nt, >06_FRYJ6ZM12HMHA6... at +/98%
33 245nt, >06_FRYJ6ZM12HHDL4... at +/99%
34 281nt, >06_FRYJ6ZM12HMQZI... at +/98%
35 281nt, >06_FRYJ6ZM12HNAR8... at +/100%
36 279nt, >06_FRYJ6ZM12HN5DI... at +/100%
37 280nt, >06_FRYJ6ZM12HGLSU... at +/98%
38 286nt, >11_FRYJ6ZM12HPCXJ... at +/98%
39 290nt, >11_FRYJ6ZM12HGPWI... at +/99%
40 285nt, >11_FRYJ6ZM12HM9YT... at +/98%
41 286nt, >11_FRYJ6ZM12HI2GG... at +/97%
42 290nt, >11_FRYJ6ZM12HMG2Y... at +/99%
43 281nt, >15_FRYJ6ZM12HKZNJ... at +/100%
44 280nt, >15_FRYJ6ZM12HE9QN... at +/99%
45 265nt, >17_FRYJ6ZM12HJRPI... at +/100%
46 275nt, >17_FRYJ6ZM12HLDLG... at +/98%
47 279nt, >17_FRYJ6ZM12HG1RZ... at +/99%
48 279nt, >17_FRYJ6ZM12HI1H8... at +/98%
49 280nt, >17_FRYJ6ZM12HNISU... at +/99%
50 280nt, >17_FRYJ6ZM12HMIHP... at +/99%
51 280nt, >17_FRYJ6ZM12HI58U... at +/99%
52 280nt, >17_FRYJ6ZM12HILMN... at +/100%
53 242nt, >17_FRYJ6ZM12HKVKQ... at +/98%
54 279nt, >17_FRYJ6ZM12HL1B9... at +/99%
55 280nt, >17_FRYJ6ZM12HEW7F... at +/98%
56 271nt, >17_FRYJ6ZM12HGGML... at +/99%
57 280nt, >17_FRYJ6ZM12HPMJM... at +/98%
58 277nt, >17_FRYJ6ZM12HH5V2... at +/99%
59 267nt, >17_FRYJ6ZM12HIDX1... at +/100%
60 271nt, >17_FRYJ6ZM12HHBYP... at +/98%
61 281nt, >17_FRYJ6ZM12HMHMF... at +/99%
62 282nt, >17_FRYJ6ZM12HLC9P... at +/99%
63 282nt, >17_FRYJ6ZM12HDDJ5... at +/99%
64 276nt, >17_FRYJ6ZM12HKV2F... at +/100%
65 276nt, >17_FRYJ6ZM12HK5OD... at +/99%
66 280nt, >17_FRYJ6ZM12HG1JG... at +/99%
67 281nt, >17_FRYJ6ZM12HMHDW... at +/99%
68 264nt, >17_FRYJ6ZM12HCHVO... at +/100%
69 280nt, >17_FRYJ6ZM12HHT9Y... at +/100%
70 280nt, >17_FRYJ6ZM12HGIYR... at +/100%
71 280nt, >17_FRYJ6ZM12HGR8Y... at +/100%
72 278nt, >17_FRYJ6ZM12HE3PW... at +/98%
73 197nt, >17_FRYJ6ZM12HIYK4... at +/100%
>Cluster 2
0 286nt, >04_FRYJ6ZM12HPCXJ... at +/99%
1 290nt, >04_FRYJ6ZM12HGPWI... *
2 285nt, >04_FRYJ6ZM12HM9YT... at +/98%
3 266nt, >04_FRYJ6ZM12HJK88... at +/100%
4 281nt, >04_FRYJ6ZM12HKZNJ... at +/97%
5 286nt, >04_FRYJ6ZM12HI2GG... at +/98%
>Cluster 3
0 286nt, >04_FRYJ6ZM12HD3BT... *
1 286nt, >06_FRYJ6ZM12HD3BT... at +/97%
>Cluster 4
0 286nt, >23_FRYJ6ZM12HI2GG... *
>Cluster 5
0 280nt, >04_FRYJ6ZM12HO3WD... at +/97%
1 285nt, >04_FRYJ6ZM12HGI5Z... *
2 285nt, >15_FRYJ6ZM12HGI5Z... at +/97%
.......

This is only part of the file. It shows 6 clusters (numbered 0 to 5); my full file has 1200 in total. Within each cluster, the first column ($1) is a running counter starting at 0, the second column ($2) is the length of the sequence, and the third column ($3) is the name of the sequence.

Each cluster has what I call the representative sequence: the one whose name is followed by a *.

Each sequence name starts with the symbol >, followed by 2 digits (01-24), an underscore, and then letters and digits.
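
To make the naming scheme concrete, here is a small sketch of how the two-digit group could be read off a line with awk. It assumes the name always sits in field 3, as in the sample above; the variable names line and group are only for illustration.

```shell
# One member line copied from the sample above.
line='0 283nt, >01_FRYJ6ZM12HMXZS... at +/99%'

# Field 3 is ">01_FRYJ6ZM12HMXZS..."; characters 2 and 3 are the group digits.
group=$(printf '%s\n' "$line" | awk '{ print substr($3, 2, 2) }')
echo "$group"
```

Counting per group then comes down to using those two characters as an array key.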

Here is what I need:

Create a table (probably using awk) containing:

- in col 1: the name of the representative sequence (one per cluster, the name followed by *). I can get it with the command:
awk '/\*/ {print $3}' file

- in col 2 to 25: the count of sequences that belong to group 01 to group 24. So col 2 should hold the number of sequences in cluster 0 whose name begins with >01_, col 3 the number of sequences in cluster 0 whose name begins with >02_, and so on, with one row per cluster.
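
One detail about the first column: $3 still carries the trailing dots from the file, while the output wanted below has them removed. A minimal sketch of stripping them, assuming the * is always the last field of the representative line (the variable name rep is only illustrative):

```shell
# Two lines copied from cluster 5 of the sample; only the second is a representative.
rep=$(printf '%s\n' \
    '0 280nt, >04_FRYJ6ZM12HO3WD... at +/97%' \
    '1 285nt, >04_FRYJ6ZM12HGI5Z... *' |
  awk '$NF == "*" { name = $3; sub(/\.+$/, "", name); print name }')
echo "$rep"
```

Testing the last field rather than matching * anywhere on the line avoids picking up a line that merely contains an asterisk somewhere else.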

Here is the output file that I want for the sample given above:
>01_FRYJ6ZM12HFI8R 22 1 0 1 5 3 0 7 0 3 0 0 0 1 1 0 0 0 1 0 0 0 0 0
>01_FRYJ6ZM12HKQWR 1 1 1 0 7 28 0 0 0 0 5 0 0 0 2 0 29 0 0 0 0 0 0 0
>04_FRYJ6ZM12HGPWI 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>04_FRYJ6ZM12HD3BT 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>23_FRYJ6ZM12HI2GG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
>04_FRYJ6ZM12HGI5Z 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

All the fields need to be separated by a tab, of course.


Thank you for any help, if anybody knows what I should do or whom to ask.
Feel free to ask me questions...

Diane
# 2  
Old 03-16-2009
Please post in a subforum that has something in common with your problem! I've moved the thread to shell scripting...

Welcome to unix.com
DN2
# 3  
Old 03-17-2009
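Code:
#!/usr/bin/perl
use strict;
use warnings;

# Input file: first command-line argument, falling back to a.txt.
my $file = @ARGV ? $ARGV[0] : 'a.txt';
my (%hash, $cluster);
open my $fh, '<', $file or die "Cannot open $file: $!";
while (<$fh>) {
	# Cluster header such as ">Cluster 0": start a new record for it.
	if (/^>\D*(\d+)\D*$/) {
		$cluster = $1;
		$hash{$cluster} = {};
		next;
	}
	next unless defined $cluster;
	# Member line: count the two-digit group that follows ">".
	if (/^.+>(\d+)/) {
		$hash{$cluster}{$1}++;
	}
	# Representative line ends in "*": keep its name without the dots.
	if (/^[^>]*(>[^.]*).*\*\s*$/) {
		$hash{$cluster}{NAME} = $1;
	}
}
close $fh;
# One tab-separated row per cluster: name, then counts for groups 01..24.
foreach my $key (sort { $a <=> $b } keys %hash) {
	print join("\t", $hash{$key}{NAME} // '', map { $hash{$key}{$_} // 0 } '01' .. '24'), "\n";
}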
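
For anyone who would rather stay in awk, as the original question suggested, here is a minimal sketch of the same idea. It assumes the layout shown in post #1: cluster headers start with >Cluster, the sequence name is in field 3, and * is the last field of the representative line. The function name cluster_table, the working directory and the file names sample.clstr and table.tsv are made up for the illustration; the two clusters in the sample are copied from the thread.

```shell
# Hypothetical helper: print one tab-separated row per cluster of the file in $1.
cluster_table() {
    awk '
    # Emit the row for the cluster collected so far (no-op before the first header).
    function flush_row(   g, key, row) {
        if (!seen) return
        row = name
        for (g = 1; g <= 24; g++) {
            key = sprintf("%02d", g)
            row = row "\t" (count[key] + 0)
        }
        print row
    }
    /^>Cluster/ {                  # new cluster: emit the previous one, then reset
        flush_row()
        split("", count)           # portable way to empty an array
        name = ""
        seen = 1
        next
    }
    $3 ~ /^>[0-9][0-9]_/ {         # member line: field 3 is the sequence name
        count[substr($3, 2, 2)]++  # characters 2-3 are the group, e.g. "04"
        if ($NF == "*") {          # representative sequence of this cluster
            name = $3
            sub(/\.+$/, "", name)  # drop the trailing dots
        }
    }
    END { flush_row() }
    ' "$1"
}

# Small demonstration on clusters 3 and 4 from the sample in post #1.
workdir=$(mktemp -d)
cat > "$workdir/sample.clstr" <<'EOF'
>Cluster 3
0 286nt, >04_FRYJ6ZM12HD3BT... *
1 286nt, >06_FRYJ6ZM12HD3BT... at +/97%
>Cluster 4
0 286nt, >23_FRYJ6ZM12HI2GG... *
EOF

cluster_table "$workdir/sample.clstr" > "$workdir/table.tsv"
cat "$workdir/table.tsv"
```

Rows come out in the order the clusters appear in the file, so no sorting step is needed. On real data the call would look like cluster_table input_file > output_file.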

# 4  
Old 03-17-2009
Thank you. I see this is a Perl script. I don't know anything about Perl; could you tell me the command line I'm supposed to use in my terminal? (The input file to be processed is file.fas.clstr.clstr, and the output file should be named file.txt.)

Do I have to save the script you gave me in the same directory as my input file?

Thank you for any information you can give me,

Diane
# 5  
Old 03-19-2009
I made it work. Thank you very much; it does exactly what I want.

D.