Attach filename to wc results for a massive number of files
Hello,
I have a massive number of big files, and each one needs its total line count (more than 100 million lines each). I want the file name attached to each count so that names and counts line up nicely.
I could do one file at a time, but that would take hours to finish, so I sent the jobs to the background, since I have multiple cores available to get the job done quickly. The problem with my script is that the echo -n $f" "; part always completes first, while the zcat ${f}_R1.fq.gz | wc -l part lags far behind, so the results are not aligned as expected.
Here is my code:
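(The original code block did not survive in this copy of the post; based on the description above, it was presumably along the following lines. The loop, the file glob, and the counts.txt output name are a reconstruction, not the original.)

Code:
#!/bin/bash
# count the lines of every compressed FASTQ, putting each count job in the
# background so several files are processed at once
for fq in *_R1.fq.gz
do
    f=${fq%_R1.fq.gz}              # sample prefix
    echo -n $f" "
    zcat ${f}_R1.fq.gz | wc -l &
done > counts.txt                  # counts.txt is a hypothetical output name
wait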
How should I improve my script to get what is expected? Thanks a lot!
Quote:
The problem with my script is that the echo -n $f" "; part always completes first, while the zcat ${f}_R1.fq.gz | wc -l part lags far behind, so the results are not aligned as expected.
Actually this is a very interesting problem. It is hard to simulate without actually creating some terabytes of files similar in size to what you have to process, so before I go and do that, I'd like to offer a few theories first, which you may verify:
My suspicion is that the problem is the buffered nature of stdout. From time to time this buffer is flushed, and because the output of echo is already available it gets written into the file, but since the zcat is still running at that moment, its count gets written much later. Maybe something like the sketch below might help. I used printf instead of echo, but that is not the point: to execute the output statement the subshell has to be finished, so the line should get printed completely or not at all.

Because the whole process gets put in the background, the original order of the filenames will no longer be retained; maybe that is of no concern to you, but you should be aware of it.
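The snippet referred to above was not preserved here; a minimal sketch of the idea, with counts.txt as a hypothetical output name, could look like this:

Code:
for fq in *_R1.fq.gz
do
    # the command substitution has to finish before printf runs, so each
    # backgrounded subshell emits its whole "name count" line in one write
    ( printf "%s %s\n" "$fq" "$(zcat "$fq" | wc -l)" ) &
done > counts.txt
wait                               # wait for all background counts to finish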
Another point is the number of processes you start: starting an (in principle unlimited) number of background processes at the same time is always a bit of a hazard. The script might work well with 10 or 20 files generating 10 or 20 background processes, but a directory may just as well hold millions of files, and no system would survive an attempt to start a million background processes, no matter how small they are and how many processors you have. You may want to implement some logic to only ever have some maximum number of background processes running concurrently, along the lines of the sketch below.
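One simple way to do that (a sketch only; maxjobs is an arbitrary number to tune to your machine, and counts.txt is again a placeholder) is to wait after every batch of background jobs:

Code:
maxjobs=8                          # hypothetical cap, roughly your core count
i=0
for fq in *_R1.fq.gz
do
    ( printf "%s %s\n" "$fq" "$(zcat "$fq" | wc -l)" ) &
    i=$((i + 1))
    if [ $((i % maxjobs)) -eq 0 ]; then
        wait                       # let the current batch drain before starting more
    fi
done > counts.txt
wait                               # catch the last, possibly partial, batch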
@bakunin @all
Your comments are exactly what I wanted to catch. I have now reworked my script with GNU parallel to limit the number of concurrent processes, but I hit another wall:
The problem seems to be with the parallel placeholder expansion. Is it because of too many layers of parentheses ()? I need to get myself more familiar with quoting in bash.
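The exact command that failed isn't shown here, but a common pitfall with GNU parallel is putting a command substitution inside double quotes: the calling shell expands the $( ... ) once, before parallel ever sees it, so the {} placeholder inside it is still a literal {}. Single-quoting the whole command string avoids that; the invocation below is a guess at what was intended, not the original.

Code:
# broken: $( ) runs in the current shell first, with {} taken literally
# parallel "echo -n {}' '; echo $(zcat {} | wc -l)" ::: *_R1.fq.gz

# working: the command string reaches parallel intact, and {} is replaced per file
parallel 'printf "%s " {}; zcat {} | wc -l' ::: *_R1.fq.gz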
Thanks for any help!
======================================================================================
It seems to me this is the final solution:
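(The code block that followed is missing from this copy; judging by the rest of the thread, it was presumably along these lines, with the job limit and output file name as placeholders:)

Code:
# -j caps the number of concurrent jobs, -k keeps output in input order
parallel -j 8 -k 'printf "%s\t%s\n" {} "$(zcat {} | wc -l)"' ::: *_R1.fq.gz > counts.txt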
Parallel is not a go-faster button for files. Unless your CPU is maxing out, there's no benefit.
GNU parallel is just doing individual files like you were doing anyway. It has to, lacking magic mechanisms to predict future filesize and move things where they belong.
If your CPU is maxing out, pigz may work faster doing one file at a time than what you were trying to do in parallel.
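If that's the case, something along these lines (with pigz as a drop-in replacement for zcat here; counts.txt is just a placeholder) keeps the counting sequential while letting pigz use its extra helper threads for decompression:

Code:
for fq in *_R1.fq.gz
do
    printf "%s %s\n" "$fq" "$(pigz -dc "$fq" | wc -l)"   # -d decompress, -c to stdout
done > counts.txt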