Recursive find / grep within a file / count of a string


 
# 1  
Old 12-02-2012

Hi All,

This is the first time I have posted to this forum, so please bear with me. Thanks also in advance for any help or guidance.

For a project I need to do the following.

1. There are multiple files in multiple locations so I need to find them and the location. So I had planned to use
cd LOCATION;
find . -name "FILENAME.TXT" -type f -print > $HOME/list_of_locations.txt

This gives me paths in this format: ./dir1/dir2/dir3/FILENAME.TXT

2. Each one of these files is in a different format, and the only way to work out the format is to count the number of occurrences of the "|" character in each file.

I can either use head -1 to take the first row and count the number of occurrences of the "|" character, or else grep the "|" in all rows and divide by the wc -l (number of lines). My preference is for whichever is most efficient.

3. I want to produce a new file listing the full path and the number of occurrences of the "|" character so that I can process the .txt file later. The number of occurrences could either be appended to each line of the list_of_locations.txt from step 1, or a new file could be created with this information.
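For reference, a minimal sketch of the three steps above using the head-based count. The function names and the output file argument are placeholders rather than anything from the original question, and the loop assumes no path contains a newline.

```shell
# Count the "|" characters on the first line of one file.
count_pipes_first_line() {
    head -n 1 "$1" | tr -cd '|' | wc -c | tr -d '[:space:]'
}

# Write one "path count" line per matching file.
# $1 = start directory, $2 = file name, $3 = output file (all placeholders).
list_with_counts() {
    find "$1" -name "$2" -type f -print |
    while IFS= read -r path; do
        printf '%s %s\n' "$path" "$(count_pipes_first_line "$path")"
    done > "$3"
}
```

This starts head, tr and wc once per file, so it is likely to be slower than a single awk run over all the files when there are many matches.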

So what I am asking:

Is there a quick way of doing this?
Using find . -name is very slow - but looks like there is no other way as I am doing a recursive search across subdirectories.
Is there a better way to interrogate my .txt file to find out how many "|" characters there are?
Is there a better way to put all of this into a UNIX script?

Thanks in advance for any help you can give, either a code snippet or advice.

Regards,
Charlie.
# 2  
Old 12-02-2012
You can do all of that in one line:
Code:
find /pathA /pathB ... /pathN -name "filename" -print0 |xargs -0 awk -F\| 'FNR==1 {print FILENAME, NF}'

This will search the listed locations for the filename(s) you specified and print them out, each terminated by a NUL ("\0") character. xargs will collect them all and run awk on this list. awk will open each file and print the full path and the field count from the first line. Redirect as desired.
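To see why the NUL terminator matters, here is a small sketch that counts how many arguments the command started by xargs actually receives. The two function names are made up for illustration, and -print0 and -0 are extensions (GNU and BSD have them) rather than standard options.

```shell
# Names separated by newlines: xargs also splits on blanks inside a name.
args_plain() {
    find "$1" -type f -print | xargs sh -c 'echo "$#"' sh
}

# Names terminated by NUL: each pathname arrives as exactly one argument.
args_nul() {
    find "$1" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh
}
```

For a directory holding a single file called "x y.log" (and no blanks elsewhere in the path), args_plain reports 2 arguments while args_nul reports 1.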
As I am not aware of how to skip the remainder of the file and go on to the next one, there is some optimization potential. Trials with close("-") right after the print statement showed a little improvement in execution time, but I'm not sure if it does the right thing. EDIT: It does not; returns -1 error code.
Does anybody out there know how to skip to the next file in awk's argument list?

Last edited by RudiC; 12-02-2012 at 08:39 AM.. Reason: Tried closing stdin to skip remainder / revoked close ("-")
# 3  
Old 12-02-2012
RudiC's suggestion is close, but misses on a couple of points. Since no pathname operands are given to awk, all of the filenames printed by awk will be an empty string. And, if there are x field separators on a line, there are x+1 fields.

The -print0 find primary and the -0 option to xargs are not defined by the standards, so they might not be available on your implementation.

A portable way to do what I believe was requested is:
Code:
find . -name 'FILENAME.TXT' -exec awk -F'|' 'FNR==1{printf("%s %d\n", FILENAME, NF-1)}' {} +

Some implementations of awk have a nextfile statement (like next, but while next restarts processing on the next line, nextfile restarts processing on the first line of the next file). If your awk has this non-standard extension, the following will be much more efficient for long input files:
Code:
find . -name 'FILENAME.TXT' -exec awk -F'|' '{printf("%s %d\n", FILENAME, NF-1);nextfile}' {} +
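If it is not known in advance whether the local awk has nextfile, one option is to probe for it and fall back to the FNR==1 form. This is only a sketch: awk_has_nextfile and count_separators are hypothetical helper names, and mktemp is assumed to be available (it is common but not required by the standards). The probe tests behaviour rather than syntax, because an awk without nextfile may quietly treat the word as an ordinary variable.

```shell
# Hypothetical helper: succeeds if the given awk stops after line 1 on nextfile.
awk_has_nextfile() {
    probe=$(mktemp) || return 1
    printf '1\n2\n' > "$probe"
    lines=$("${1:-awk}" '{ print; nextfile }' "$probe" 2>/dev/null | wc -l | tr -d '[:space:]')
    rm -f "$probe"
    [ "$lines" = "1" ]
}

# $1 = start directory, $2 = file name (placeholders).
count_separators() {
    if awk_has_nextfile awk; then
        prog='{ printf("%s %d\n", FILENAME, NF-1); nextfile }'
    else
        prog='FNR==1 { printf("%s %d\n", FILENAME, NF-1) }'
    fi
    find "$1" -name "$2" -type f -exec awk -F'|' "$prog" {} +
}
```

Both branches print the same thing for non-empty files; the nextfile branch just avoids reading the rest of each file.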

-------------------------------
Note that the comment I made about Rudi's proposal not printing pathnames is totally bogus. The xargs utility will add the pathname operand to awk as it invokes awk.

Last edited by Don Cragun; 12-02-2012 at 08:42 AM..
# 4  
Old 12-02-2012
Thank you, Don, for commenting on my proposal.
Quote:
Originally Posted by Don Cragun
. . . Since no pathname operands are given to awk, all of the filenames printed by awk will be an empty string.
At least with the combination of find and awk implemented on my Linux system, there's a full path listing available, including filenames containing spaces:
Code:
find /var/log -iname \*.log -print0 |xargs -0 awk  -F\| 'FNR==1 {print FILENAME, NF}'
/var/log/auth.log 1
/var/log/dist-upgrade/history.log 0
. . .
/var/log/x y.log 3
/var/log/kern.log 1

Quote:
And, if there are x field separators on a line, there are x+1 fields.
Yes. Still I thought the number of fields to be more relevant than the number of separators. Might have been premature.

Quote:
Code:
find . -name 'FILENAME.TXT' -exec awk -F'|' 'FNR==1{printf("%s %d\n", FILENAME, NF-1)}' {} +

Works, and satisfies the standards, but:
Code:
time find . . . -print0 |xargs -0 awk  -F\| '. . .'
real    0m0.034s
time find . . . -exec awk -F\| '. . .' {} \;
real    0m0.208s

Quote:
Some implementations of awk have a nextfile statement
Special thanks for this; I was looking for that or an equivalent; unfortunately not available on my system.
# 5  
Old 12-02-2012
Quote:
Originally Posted by RudiC
Thank you, Don, for commenting on my proposal.


At least with the combination of find and awk implemented on my Linux system, there's a full path listing available, including filenames containing spaces:
Code:
find /var/log -iname \*.log -print0 |xargs -0 awk  -F\| 'FNR==1 {print FILENAME, NF}'
/var/log/auth.log 1
/var/log/dist-upgrade/history.log 0
. . .
/var/log/x y.log 3
/var/log/kern.log 1

Hi Rudi,
Yes, but note that by skipping the -print (or -print0) and the invocation of xargs, awk is still given the full pathname as an operand (even if there are spaces, tabs, or newlines included in the pathname).
Quote:
Yes. Still I thought the number of fields to be more relevant than the number of separators. Might have been premature.
Agreed. But it wasn't what Charlie6742 asked for.
Quote:

Works, and satisfies the standards, but:
Code:
time find . . . -print0 |xargs -0 awk  -F\| '. . .'
real    0m0.034s
time find . . . -exec awk -F\| '. . .' {} \;
real    0m0.208s

Not surprising since what you timed runs awk once for each input file.
But note that I specified:
Code:
find . . . -exec awk -F\| '. . .' {} +

not:
Code:
find . . . -exec awk -F\| '. . .' {} \;

With the + instead of the \; find shouldn't execute awk any more times than xargs would and we avoid needing to start xargs at all.
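The difference between the two terminators can be checked directly by counting how often find starts the command. In this sketch the function names are invented for illustration, and a trivial sh -c command stands in for awk.

```shell
# One command per file: the \; terminator.
runs_per_file() {
    find "$1" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d '[:space:]'
}

# As many pathnames per command as fit: the + terminator.
runs_batched() {
    find "$1" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d '[:space:]'
}
```

With three small files in the directory, runs_per_file prints 3 and runs_batched prints 1; for very long file lists the + form may still start the command more than once, just as xargs would.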
Quote:
Special thanks for this; I was looking for that or an equivalent; unfortunately not available on my system.
# 6  
Old 12-02-2012
Quote:
Originally Posted by Don Cragun
. . .
But note that I specified:
Code:
find . . . -exec awk -F\| '. . .' {} +

Rats ... missed that. Absolutely right, plays in the same league:
Code:
time find . . .  -exec awk -F\| ' . . . ' {} +
. . .
real    0m0.034s

# 7  
Old 12-02-2012
Thanks guys. I have played with all the methods you suggested, but they do not seem to give me the output I expect. I should have said I am using the bash shell. Could some of these commands not be working properly on my setup? Is there a way I can set it up so it works as you have it?

If it helps - this is the message it gives me for one of the options that doesn't work.

Code:
find . -name "a.txt" -exec /usr/bin/awk -F'|' '{printf("%s %d\n", FILENAME, NF-1);nextfile}' {} +
./dir1/a.txt 40
awk: illegal statement 603430
record number 1

Once again, thanks in advance for looking at this, and so quickly. It's really appreciated.

Charlie

Last edited by Scott; 12-04-2012 at 09:00 AM.. Reason: Added code tags; removed formatting