07-11-2010
help to parallelize work on thousands of files
I need to find a smarter way to process about 60,000 files in a single directory.
Every night a script runs on each file, generating an output file in another directory; this used to take 5 hours, but as the data grows it is taking 7 hours.
The files are of different sizes, but there are 16 cores on the box, so I want to run at least 10 parallel processes. (The report-generating script is not very CPU-intensive.)
I can manually split the output of "ls -1" into 10 lists, then run a foreach over every file in the background. This brings the run down to 2 hours, but it isn't the smartest way, because the list with the largest files (some over a gig) always takes the longest while the list with small files finishes first.
One way of solving the problem is to list the files in order of size and put every 10th file into a different list, so each list gets a similar mix of large and small files.
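That first idea can be sketched like this. The paths and file sizes below are made up for illustration, and gzip stands in for the nightly report script:

```shell
# Minimal sketch with made-up paths: sort the files by size (largest first)
# and deal them out round-robin, so each of the 10 lists gets a similar mix
# of big and small files. "gzip" stands in for the real report script.
set -e
rm -rf /tmp/demo_in /tmp/demo_out /tmp/list_*
mkdir -p /tmp/demo_in /tmp/demo_out
for i in $(seq 1 25); do head -c $((i * 100)) /dev/zero > "/tmp/demo_in/f$i"; done

cd /tmp/demo_in
# ls -1S lists one file per line, sorted by size descending;
# awk deals line NR into list_(NR mod 10).
ls -1S | awk '{ print > ("/tmp/list_" NR % 10) }'

# One background worker per list, then wait for all of them to finish.
for n in 0 1 2 3 4 5 6 7 8 9; do
    while read -r f; do
        gzip -c "$f" > "/tmp/demo_out/$f.gz"
    done < "/tmp/list_$n" &
done
wait
```

With the real data you would swap the /tmp paths for the actual input and output directories; because every list mixes large and small files, the 10 workers should finish at roughly the same time.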
Another way could be to start processing the files one after another while maintaining no more than 10 concurrent processes.
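The second idea, a rolling pool that never exceeds 10 jobs, is what xargs -P does on systems whose xargs supports it (GNU findutils does; that availability is an assumption here, and gzip again stands in for the report script):

```shell
# Minimal sketch assuming GNU xargs: -P 10 keeps 10 jobs running at once,
# starting a new one the moment any worker finishes, so no list can lag.
# Assumes filenames without spaces or quotes; "gzip" is a stand-in job.
set -e
rm -rf /tmp/pool_in /tmp/pool_out
mkdir -p /tmp/pool_in /tmp/pool_out
for i in $(seq 1 30); do echo "data $i" > "/tmp/pool_in/f$i"; done

# -I {} substitutes one filename per job; -P 10 caps the parallelism.
ls /tmp/pool_in | xargs -P 10 -I {} sh -c 'gzip -c /tmp/pool_in/{} > /tmp/pool_out/{}.gz'
```

Unlike fixed lists, this pool self-balances: a worker that draws a gigabyte file simply occupies one of the 10 slots while the others churn through the small files.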
Finally, I was also thinking of keeping a zipped-up tarball. gtar, or tar piped through gzip, takes over 12 hours to run! It would be good to be able to create 10 smaller tarballs in a shorter time.
thanks!
-VH
trace-cmd-restore
TRACE-CMD-RESTORE(1)
NAME
trace-cmd-restore - restore a failed trace record
SYNOPSIS
trace-cmd restore [OPTIONS] [command] cpu-file [cpu-file ...]
DESCRIPTION
The trace-cmd(1) restore command will restore a crashed trace-cmd-record(1) file. If for some reason a trace-cmd record fails, it will
leave the per-CPU data files behind and not create the final trace.dat file. trace-cmd restore will append those files to create a working
trace.dat file that can be read with trace-cmd-report(1).
When trace-cmd record runs, it spawns off a process per CPU and writes to a per-CPU file, usually called trace.dat.cpuX, where X represents
the CPU number that it is tracing. If the -o option was used with trace-cmd record, then the CPU data files will have that name instead
of the trace.dat name. If an unexpected crash occurs before the tracing is finished, the per-CPU files will still exist but there will
not be any trace.dat file to read from. trace-cmd restore allows you to create a trace.dat file from the existing data files.
OPTIONS
-c
Create a partial trace.dat file from the machine, to be used with a full trace-cmd restore at another time. This option is useful for
embedded devices. If a server contains the cpu files of a crashed trace-cmd record (or trace-cmd listen), trace-cmd restore can be
executed on the embedded device with the -c option to get all the stored information of that embedded device. Then the file created
could be copied to the server to run the trace-cmd restore there with the cpu files.
If -o is not specified, then the file created will be called trace-partial.dat. This is because the file is not a full version
of something that trace-cmd-report(1) could use.
-t tracing_dir
Used with -c, it overrides the location to read the events from. By default, tracing information is read from the debugfs/tracing
directory. -t will use that location instead. This can be useful if the trace.dat file to create is from another machine. Just tar
-cvf events.tar debugfs/tracing and copy and untar that file locally, and use that directory instead.
-k kallsyms
Used with -c, it overrides where to read the kallsyms file from. By default, /proc/kallsyms is used. -k will override the file to read
the kallsyms from. This can be useful if the trace.dat file to create is from another machine. Just copy the /proc/kallsyms file
locally, and use -k to point to that file.
-o output
By default, trace-cmd restore will create a trace.dat file (or trace-partial.dat if -c is specified). You can specify a different file
to write to with the -o option.
-i input
By default, trace-cmd restore will read the information of the current system to create the initial data stored in the trace.dat file.
If the crash was on another machine, then that machine should have the trace-cmd restore run with the -c option to create the trace.dat
partial file. Then that file can be copied to the current machine where trace-cmd restore will use -i to load that file instead of
reading from the current system.
EXAMPLES
If a crash happened on another box, you could run:
$ trace-cmd restore -c -o box-partial.dat
Then on the server that has the cpu files:
$ trace-cmd restore -i box-partial.dat trace.dat.cpu0 trace.dat.cpu1
This would create a trace.dat file for the embedded box.
SEE ALSO
trace-cmd(1), trace-cmd-record(1), trace-cmd-report(1), trace-cmd-start(1), trace-cmd-stop(1), trace-cmd-extract(1), trace-cmd-reset(1),
trace-cmd-split(1), trace-cmd-list(1), trace-cmd-listen(1)
AUTHOR
Written by Steven Rostedt, <rostedt@goodmis.org[1]>
RESOURCES
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git
COPYING
Copyright (C) 2010 Red Hat, Inc. Free use of this software is granted under the terms of the GNU Public License (GPL).
NOTES
1. rostedt@goodmis.org
mailto:rostedt@goodmis.org
06/11/2014 TRACE-CMD-RESTORE(1)