Count number of pattern matches per line for all files in directory


 
# 1  
Old 04-23-2014

I have a directory of files, each with a variable (though small) number of lines. I would like to go through each line in each file and print the:
- file name
- line number
- number of matches to the pattern /comp[0-9]/ on that line.

Two example files:
Code:
cat ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 
m.174408g.174408ORFg.174408m.174408type:internallen:82(+)comp664012_c0_seq1:2(-)250(+)	 Phy00425YH_ACYPI	 

m.28514g.28514ORFg.28514m.28514type:completelen:172(+)comp42344_c0_seq1:416(-)931(+)	 m.28517g.28517ORFg.28517m.28517type:3prime_partiallen:112(+)comp42344_c0_seq2:416(-)754(+)	 Phy00422JU_ACYPI	 Phy0042C6U_ACYPI	 Phy00423KN_ACYPI	 m.14126g.14126ORFg.14126m.14126type:internallen:133(-)comp32693_c0_seq1:3(-)401(-)	 m.167269g.167269ORFg.167269m.167269type:3prime_partiallen:54(-)comp457687_c0_seq1:1(-)162(-)	 

Phy00423KN_ACYPI	 m.14126g.14126ORFg.14126m.14126type:internallen:133(-)comp32693_c0_seq1:3(-)401(-)	 m.167269g.167269ORFg.167269m.167269type:3prime_partiallen:54(-)comp457687_c0_seq1:1(-)162(-)	 

cat ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 
m.30099g.30099ORFg.30099m.30099type:internallen:216(-)comp42976_c0_seq1:1(-)648(-)	 Phy0041ZCK_ACYPI	 m.42296g.42296ORFg.42296m.42296type:3prime_partiallen:81(+)comp46573_c0_seq1:157(-)402(+)	 

Phy0041ZCK_ACYPI	 m.42296g.42296ORFg.42296m.42296type:3prime_partiallen:81(+)comp46573_c0_seq1:157(-)402(+)

Desired output (tab-separated) is:
Code:
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   4
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   5   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   1

I've tried using awk so far. This code prints the file name and number of matches in the file, but I'm not sure how to go about breaking it down by line.
Code:
cat ../IDs
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt

while read file
do 
awk '{while (sub(/comp[0-9]/,":")) t++}END{print FILENAME,t}' ${file}
done < ../IDs

Any ideas out there?

P.S. A bonus answer would include a fourth output column: the largest number of consecutive fields with pattern matches. For example, line 3 in the first file (line 2 is blank) has four matches, but at most two of these matches are in consecutive fields. Output in this case would be:
Code:
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   4   2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   5   2   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   2   1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1

# 2  
Old 04-23-2014
Is this a homework assignment?

If not, please explain why you need this data.
# 3  
Old 04-23-2014
I'd say this is some kind of bioinformatics data. Anyway, you can try this in a directory containing your files:
Code:
perl -lne '$,=" ";@x=/comp[0-9]+/g;/([^\t]*comp[0-9]+[^\t]*\t?)+/;$tmp=$&;@y=$tmp=~/comp[0-9]+/g;print $ARGV,$.,($#x+1),($#y+1) if ($#x+1);$.=0 if eof' *
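
For readers less familiar with Perl, the first three columns (file name, line number, match count) can also be sketched in awk. This is only an illustration, not the one-liner above translated literally: `count_per_line` is a made-up helper name, and the `*.txt` glob in the usage comment assumes the data files all end in `.txt`.

```shell
# count_per_line is a hypothetical helper name, not something from the thread.
# Prints: file name, line number, number of /comp[0-9]+/ matches (tab-separated).
count_per_line() {
    awk '
    {
        line = $0                         # work on a copy so $0 stays untouched
        n = gsub(/comp[0-9]+/, "", line)  # gsub returns the number of substitutions made
        if (n)                            # skip lines without any match (e.g. blank lines)
            printf("%s\t%d\t%d\n", FILENAME, FNR, n)
    }' "$@"
}

# Usage (assumption: run from the directory holding the data files):
# count_per_line *.txt
```

Because awk keeps FNR per input file, the line numbers restart for each file without any extra bookkeeping.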

# 4  
Old 04-23-2014
bartus11, this works, thank you. It's become clear that I need to spend some time learning perl.

Don Cragun, I am a biologist. This request is to help me parse the results of an analysis I did of data that I generated. I hope to soon be able to do everything from field work to wet lab work to all of the analysis...but I'm not quite there.
# 5  
Old 04-23-2014
Assuming that I am correct in believing that the desired bonus output you provided:
Code:
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   4   2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   5   2   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   2   1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1

should have been:
Code:
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   4   2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   5   2   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   2   1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   1   1

and with the sets of three spaces changed to tabs, the following script (using awk instead of perl) seems to also do what you want:
Code:
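#!/bin/ksh
awk '
{	# nm: fields containing a match; nc: current run of consecutive
	# matching fields; ncM: longest such run on this line
	nm = nc = ncM = 0
	for(i = 1; i <= NF; i++)
		if(match($i, /comp[0-9]/)) {
			nm++
			if(++nc > ncM)
				ncM = nc
		} else	nc = 0
	# Print only lines with at least one match (this also skips blank lines).
	if(nm)	printf("%s\t%d\t%d\t%d\n", FILENAME, FNR, nm, ncM)
}' $(cat IDs)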

producing the output:
Code:
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	1	1	1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	3	4	2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	5	2	2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	1	2	1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	3	1	1

If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
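
One thing to keep in mind with the script above: match() tests each field once, so it counts fields that contain the pattern rather than individual matches. In the sample data every field holds at most one comp ID, so the two are the same. If a field could hold several, a variant along the following lines might be closer to what is wanted. It is only a sketch: `count_runs` is a made-up name, and it assumes the fields are tab-separated as in the sample files.

```shell
# count_runs is a hypothetical helper name, not something from the thread.
# Prints: file name, line number, total matches, longest run of consecutive
# tab-separated fields that contain at least one match.
count_runs() {
    awk -F '\t' '
    {
        total = run = longest = 0
        for (i = 1; i <= NF; i++) {
            field = $i                          # copy, so the record is not modified
            n = gsub(/comp[0-9]+/, "", field)   # number of matches inside this field
            total += n
            if (n) {
                if (++run > longest)
                    longest = run
            } else
                run = 0                         # a field without a match breaks the run
        }
        if (total)
            printf("%s\t%d\t%d\t%d\n", FILENAME, FNR, total, longest)
    }' "$@"
}

# Usage (assumption: IDs lists one file name per line, with no spaces in the names):
# count_runs $(cat IDs)
```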