Attach filename to wc results on massive number of files


# 1  
Old 1 Week Ago
Attach filename to wc results on massive number of files

Hello,
I have a massive number of big files, and each needs its total number of lines counted (more than 100 million lines per file). I want the file name attached to each count so that names and counts line up nicely.
I could process one file at a time, but that would take hours, so I sent the jobs to the background since I have multiple cores available to get the job done more quickly. The problem with my script is that the echo -n $f" " part always finishes first, while the zcat ${f}_R1.fq.gz | wc -l part lags far behind, so the results are not aligned as expected.

Here is my code:
Code:
for f in $(cat ${LIST1}); do
    echo -n $f" " >> raw_reads_count.table1
    zcat ${f}_R1.fq.gz | wc -l >> raw_reads_count.table1 &    # this is the part that falls behind
done
------------------------------------------------------------------------------------------------------
messed-up output:
a      
bb    
ccc   
xyz 
267234214
777234211
937214233
1027254258
------------------------------------------------------------------------------------------------------
 Expected output:
a    267234214
bb   937214233
ccc  777234211
xyz 1027254258

How should I improve my script to get what is expected? Thanks a lot!
# 2  
Old 1 Week Ago
How about
Code:
{ echo -n $f" "; zcat ${f}_R1.fq.gz | wc -l; } >> raw_reads_count.table1 &

Should that fail, write to single files in sequence, then, after the loop, concatenate the files.
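One way to read that suggestion (untested sketch; the ${f}_count.tmp naming is just an example):
Code:
for f in $(cat ${LIST1}); do
    # each file gets its own small output file, so name and count stay together
    { echo -n "$f "; zcat "${f}_R1.fq.gz" | wc -l; } > "${f}_count.tmp" &
done
wait
cat *_count.tmp >> raw_reads_count.table1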
# 3  
Old 1 Week Ago
@RudiC
No, it still shows the same problem as before.
I'll write the single files and then concatenate them. Thanks!
# 4  
Old 1 Week Ago
How about something more like:
Code:
LIST1=/what/ever/you/want
OUTPUT1=raw_reads_count.table1

while read -r f
do	(	linecount=$(zcat ${f}_R1.fq.gz | wc -l)
		printf '%s\t%s\n' "$f" "$linecount" >> "$OUTPUT1"
	)&
done < "$LIST1"
wait
printf '%s: %s is ready.\n' "${0##*/}" "$OUTPUT1"

# 5  
Old 1 Week Ago
Quote:
Originally Posted by yifangt
The problem with my script is that the echo -n $f" " part always finishes first, while the zcat ${f}_R1.fq.gz | wc -l part lags far behind, so the results are not aligned as expected.
Actually this is a very interesting problem. It is hard to simulate without actually creating some terabytes of files similar in size to what you have to process, so before I start to actually do that, I'd like to offer a few theories first which you may verify:

My suspicion is that the problem is the buffered nature of <stdout>. From time to time this buffer is flushed, and because the output of echo is already available it gets written into the file, while the output of zcat, which is still running at that point, gets written much later. Maybe the following might help. I used printf instead of echo, but that is not the point: to execute the output statement the subshell has to be finished first, so the line should get printed completely or not at all. Because the whole process is put in the background, the original order of the filenames will no longer be retained; that may be of no concern to you, but you should be aware of it.

Another point is the number of processes you start: starting an (in principle unlimited) number of background processes at the same time is always a bit of a hazard. The script might work well with 10 or 20 files generating 10 or 20 background processes, but a directory may just as well hold millions of files. No system would survive an attempt to start a million background processes, no matter how small they are and how many processors you have. You may want to implement some logic so that only some maximum number of background processes run concurrently (see the sketch after the code below).

Code:
printf "%s\t%s\n" "$f" "$(zcat ${f}_R1.fq.gz | wc -l)" >> raw_reads_count.table1 &    # the command substitution finishes before printf writes, so the line stays intact

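For the process-limit point, a rough, untested sketch of one way to cap the number of concurrent jobs (assuming bash; LIST1 as in your script, the cap of 16 is arbitrary):
Code:
maxjobs=16                                   # arbitrary cap, tune to your machine
while read -r f; do
    while [ "$(jobs -rp | wc -l)" -ge "$maxjobs" ]; do
        sleep 1                              # wait until a background slot frees up
    done
    printf '%s\t%s\n' "$f" "$(zcat "${f}_R1.fq.gz" | wc -l)" >> raw_reads_count.table1 &
done < "$LIST1"
wait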
I hope this helps.

bakunin
# 6  
Old 1 Week Ago
Using parallel to restrict the number of processes

@bakunin @all
Your comments are exactly what I was looking for. I rewrote my script with GNU parallel to limit the number of processes, but I hit another wall:
Code:
parallel -a $LIST1 -j 48 "(printf "%s\t%s\n" {} $(zcat {}_R1.fq.gz | wc -l)) >> raw_reads_count.table1"
------------------------------------------------------
a 0 >> raw_reads_count.table1
bb 0 >> raw_reads_count.table1
ccc 0 >> raw_reads_count.table1
xyz 0 >> raw_reads_count.table1

The problem seems to be with the expansion of the parallel placeholder. Is it because of too many layers of parentheses? I need to get more familiar with quoting in bash.
Thanks for any help!
======================================================================================
It seems to me this is the final solution:

Code:
parallel -a $LIST1 -j 48 "(echo -n {}' '; (zcat ${RAW_DIR1}/{}_R1.fq.gz | wc -l)) > {}_counts.tmp"
cat *_counts.tmp >> raw_reads_count.table1
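An alternative that might avoid the temp files would be to single-quote the whole command, so that the outer shell does not expand $( ... ) before parallel runs it per file (untested sketch):
Code:
parallel -a "$LIST1" -j 48 'printf "%s\t%s\n" {} "$(zcat {}_R1.fq.gz | wc -l)"' >> raw_reads_count.table1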

Thank you all for the help!

# 7  
Old 1 Week Ago
Parallel is not a go-faster button for files. Unless your CPU is maxing out, there's no benefit.

GNU parallel is just doing individual files like you were doing anyway. It has to, since it has no magic mechanism to predict future file sizes and move things where they belong.

If your CPU is maxing out, pigz may work faster one file at a time than what you were trying to do in parallel.
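For example (assuming pigz is installed), the zcat step in your loop could be swapped for:
Code:
pigz -dc "${f}_R1.fq.gz" | wc -l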

