Faster command to remove headers for files in a directory
Good evening
I'm new to Unix shell scripting and I'm planning to write a script that removes the headers from about 120 files in a directory. Each file contains about 200,000 lines on average.
I know I will loop over the files to process each one, and I've found different solutions in this great forum using grep, sed, awk, head, etc.
But given the scenario above and your experience and knowledge, which command performs best and gets the job done fastest?
Assuming header is first line, I compared execution speed of sed and awk on a 450,251 line file. Here are the results:-
In this case sed won, but it depends on what you are trying to do.
Did sed win? Or did file caching speed up sed? Modern controllers and RAM caches (on HP-UX, for example) can cache 100 MB of a single file without really using up system resources.
I vote for caching. The only fair test is two separate files.
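A minimal sketch of such a test, assuming the header is the first line. Each command reads its own copy of the input, so none of them benefits from reads the previous one left in the cache. The file names (sample.txt, bench.*) and the timed helper are made up for illustration, and the one-second resolution of date +%s is coarse, so this only shows large differences.

```shell
#!/bin/sh
# Hypothetical input file; generated here only if it does not exist yet.
SRC=sample.txt
[ -f "$SRC" ] || awk 'BEGIN { print "HEADER"; for (i = 1; i <= 450250; i++) print i }' > "$SRC"

# One private copy per tool, so all three start from the same cache state.
for tool in sed awk tail; do
    cp "$SRC" "bench.$tool.in"
done

# timed LABEL OUTFILE COMMAND...: run COMMAND, send its output to OUTFILE,
# and report the elapsed wall-clock seconds on stderr.
timed() {
    label=$1
    out=$2
    shift 2
    start=$(date +%s)
    "$@" > "$out"
    end=$(date +%s)
    echo "$label: $((end - start))s" >&2
}

timed sed  bench.sed.out  sed '1d' bench.sed.in
timed awk  bench.awk.out  awk 'NR > 1' bench.awk.in
timed tail bench.tail.out tail -n +2 bench.tail.in
```

The copies have just been written, so all three are probably equally warm in the cache rather than cold. That removes the ordering advantage, but it does not measure real disk reads.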
BTW: programs like sed, awk, head, tail, and grep are all highly optimized for their respective jobs. There are several external factors, such as caching, I/O load (I/O request queue length), and SAN vs. local disk, that distort these kinds of tests. So, by the time you have run some tests, any time difference between the commands will likely have been eaten up by the testing itself.
Your best bet is to parallelize, use the cpu and disk I/O to the max. With a quad core maybe you want to consider 4 simultaneous child processes, for example:
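A minimal sketch of that idea in plain sh, assuming the header is the first line of each file. The function name strip_headers_parallel, the *.dat pattern, and the temporary-file naming are all hypothetical. It starts up to four background children, waits for the batch to finish, then starts the next batch.

```shell
#!/bin/sh
# strip_headers_parallel DIR [MAX_JOBS]
# Remove the first line from every *.dat file in DIR, running up to
# MAX_JOBS (default 4) sed processes at the same time.
strip_headers_parallel() {
    dir=$1
    max_jobs=${2:-4}
    running=0
    for f in "$dir"/*.dat; do
        [ -f "$f" ] || continue          # skip if the pattern matched nothing
        (
            # Write to a temporary file and replace the original only
            # if sed succeeded.
            tmp="$f.tmp.$$"
            sed '1d' "$f" > "$tmp" && mv "$tmp" "$f"
        ) &
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            wait                         # let the current batch finish
            running=0
        fi
    done
    wait                                 # wait for the final partial batch
}
```

Calling it would look like strip_headers_parallel /path/to/files 4. Waiting per batch is simple but leaves a core idle while the slowest file of each batch finishes; with files of similar size that should matter little.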
Jim, do you mean file caching helped sed because of the sequence of execution I chose? If yes, I tried the other way and still sed took less time to complete this particular task.
Last edited by Yoda; 11-12-2012 at 09:38 PM..
Reason: Code Added
Yes, that was what I meant. And yes, it is very likely that grep, egrep, and sed are better at massive I/O than awk, which runs interpreted. The point, I think, is that tests like this are a lot of fun, but they may not be informative unless you understand why the results can be skewed.
On my large m4000 Solaris boxes sed always outperforms awk on simple stream editing of massive files. On cygwin they come out really close.
However, by the time I've set up a fair test and run several candidates through, I could have coded and already processed 24 files in parallel, using any reasonable method.
I have a file called "dsout" with empty rows and duplicate headers.
DATE TIME TOTAL_GB USED_GB %USED
--------- -------- ---------- ---------- ----------
03/05/013 12:34 PM 3151.24316 2331.56653 73.988785 ... (3 Replies)
Good evening
I need your help please. I'm new to Unix and I wanted to remove the first 5 headers from 100,000-record files and then create a control file (.ctl) that contains the number of records. It all seemed to work, but when I tested it in production it didn't work.
Here is the code:
#!... (6 Replies)
Hi,
I have catenated multiple output files (from a Monte Carlo run) into one big output file. Each individual file has its own two-line header. So when I catenate, there are multiple two-line headers (of the same wording) within the big file. How do I use the sed command to search for the... (1 Reply)
Hi All,
I have some 80,000 files in a directory which I need to rename. Below is the command I am currently running, and it seems to take forever. Is there any way to speed it up? I have GNU Parallel installed on my... (6 Replies)
Hi ,
I have a typical situation. I have 4 files with different headers (the number of headers is variable).
I need to make a merged file which will have the headers combined from all files (common columns should appear only once).
For example -
File 1
H1|H2|H3|H4
11|12|13|14
21|22|23|23... (1 Reply)
Hi,
I'm trying to strip all lines between two headers in a file:
### BEGIN ###
Text to remove, contains all kinds of characters
...
Antispyware-Downloadserver.com (Germany)=http://www.antispyware-downloadserver.c
om/updates/
Antispyware-Downloadserver.com #2... (3 Replies)
Hello,
So I want to send mail in any way possible from a Solaris 5.8 system, perhaps using mailx or sendmail. My purpose is to keep the system's name out of the header data. So I want to strip at least the "Message-Id" and the "Recieved" headers of the mail. Yet this seems to be a bit of a problem.
Now i... (2 Replies)
I have a data file with over 500,000 records/lines that has the header throughout the file.
SEQ_ID Name Start_Date Ins_date Add1 Add2
1 Harris 04/02/08 03/02/08 333 Main Suite 101
2 Smith 02/03/08 01/23/08 287 Jenkins
SEQ_ID Name ... (3 Replies)
I have a file with millions of records...Before I experiment, I would like to know which one is faster.
Both the commands work absolutely fine on a smaller set of records.
Please advise.
sed 's/^M//g' ${INPUT_FILE} > tmp.txt
mv tmp.txt ${INPUT_FILE}
tr -d "\15" < ${INPUT_FILE} > ... (11 Replies)
Hi
I am running a script (which compares two directories' contents) that produces 70 pages of output. A few of the pages were blank, and I was able to delete those blank lines.
But I also want to delete the headers present on each page. Can anyone help me by providing the code... (1 Reply)