Attach filename to wc results for a massive number of files
Hello,
I have a massive number of big files, and each one needs its total line count (more than 100 million lines each). I want the file name attached to each count so that names and counts line up nicely.
I could do one file at a time, but that would take hours to finish, so I sent the jobs to the background, since I have multiple cores available to get the job done quickly. The problem with my script is that the echo -n $f" "; part always completes first, while the zcat ${f}_R1.fq.gz | wc -l part lags far behind, so the results are not aligned as expected.
Here is my code:
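(The original code block did not survive in this copy of the post; based on the description above, it was presumably along the following lines. The loop, the file glob, and the counts.txt output name are a reconstruction, not the original.)

Code:
#!/bin/bash
# count the lines of every compressed FASTQ, putting each count job in the
# background so several files are processed at once
for fq in *_R1.fq.gz
do
    f=${fq%_R1.fq.gz}              # sample prefix
    echo -n $f" "
    zcat ${f}_R1.fq.gz | wc -l &
done > counts.txt                  # counts.txt is a hypothetical output name
wait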
How should I improve my script to get what is expected? Thanks a lot!
Quote:
The problem with my script is that the echo -n $f" "; part always completes first, while the zcat ${f}_R1.fq.gz | wc -l part lags far behind, so the results are not aligned as expected.
Actually this is a very interesting problem. It is hard to simulate without actually creating some terabytes of files similar in size to what you have to process, so before I go and do that, I'd like to offer a few theories first, which you may verify:
My suspicion is that the problem is the buffered nature of stdout. From time to time this buffer is flushed, and because the output of echo is already available it gets written into the file, but since the zcat is still running at that moment, its count gets written much later. Maybe something like the sketch below might help. I used printf instead of echo, but that is not the point: to execute the output statement the subshell has to be finished, so the line should get printed completely or not at all.

Because the whole process gets put in the background, the original order of the filenames will no longer be retained; maybe that is of no concern to you, but you should be aware of it.
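The snippet referred to above was not preserved here; a minimal sketch of the idea, with counts.txt as a hypothetical output name, could look like this:

Code:
for fq in *_R1.fq.gz
do
    # the command substitution has to finish before printf runs, so each
    # backgrounded subshell emits its whole "name count" line in one write
    ( printf "%s %s\n" "$fq" "$(zcat "$fq" | wc -l)" ) &
done > counts.txt
wait                               # wait for all background counts to finish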
Another point is the number of processes you start: starting an (in principle unlimited) number of background processes at the same time is always a bit of a hazard. The script might work well with 10 or 20 files generating 10 or 20 background processes, but a directory may just as well hold millions of files, and no system would survive an attempt to start a million background processes, no matter how small they are and how many processors you have. You may want to implement some logic to only ever have some maximum number of background processes running concurrently, along the lines of the sketch below.
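One simple way to do that (a sketch only; maxjobs is an arbitrary number to tune to your machine, and counts.txt is again a placeholder) is to wait after every batch of background jobs:

Code:
maxjobs=8                          # hypothetical cap, roughly your core count
i=0
for fq in *_R1.fq.gz
do
    ( printf "%s %s\n" "$fq" "$(zcat "$fq" | wc -l)" ) &
    i=$((i + 1))
    if [ $((i % maxjobs)) -eq 0 ]; then
        wait                       # let the current batch drain before starting more
    fi
done > counts.txt
wait                               # catch the last, possibly partial, batch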
@bakunin @all
Your comments are exactly what I wanted to catch. I have now reworked my script with GNU parallel to limit the number of concurrent processes, but I hit another wall:
The problem seems to be with the parallel placeholder expansion. Is it because of too many layers of parentheses ()? I need to get myself more familiar with quoting in bash.
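The exact command that failed isn't shown here, but a common pitfall with GNU parallel is putting a command substitution inside double quotes: the calling shell expands the $( ... ) once, before parallel ever sees it, so the {} placeholder inside it is still a literal {}. Single-quoting the whole command string avoids that; the invocation below is a guess at what was intended, not the original.

Code:
# broken: $( ) runs in the current shell first, with {} taken literally
# parallel "echo -n {}' '; echo $(zcat {} | wc -l)" ::: *_R1.fq.gz

# working: the command string reaches parallel intact, and {} is replaced per file
parallel 'printf "%s " {}; zcat {} | wc -l' ::: *_R1.fq.gz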
Thanks for any help!
======================================================================================
It seems to me this is the final solution:
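(The code block that followed is missing from this copy; judging by the rest of the thread, it was presumably along these lines, with the job limit and output file name as placeholders:)

Code:
# -j caps the number of concurrent jobs, -k keeps output in input order
parallel -j 8 -k 'printf "%s\t%s\n" {} "$(zcat {} | wc -l)"' ::: *_R1.fq.gz > counts.txt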
Parallel is not a go-faster button for files. Unless your CPU is maxing out, there's no benefit.
GNU parallel is just doing individual files like you were doing anyway. It has to, lacking magic mechanisms to predict future filesize and move things where they belong.
If your CPU is maxing out, pigz may work faster doing one file at a time than what you were trying to do in parallel.
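If that's the case, something along these lines (with pigz as a drop-in replacement for zcat here; counts.txt is just a placeholder) keeps the counting sequential while letting pigz use its extra helper threads for decompression:

Code:
for fq in *_R1.fq.gz
do
    printf "%s %s\n" "$fq" "$(pigz -dc "$fq" | wc -l)"   # -d decompress, -c to stdout
done > counts.txt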