Faster command to remove headers for files in a directory

11-12-2012

Good evening

I'm new to Unix shell scripting, and I'm planning to write a script that removes the headers from about 120 files in a directory. Each file contains about 200,000 lines on average.

I know I will loop over the files to process each one, and I've found different solutions in this great forum using grep, sed, awk, head, etc.

But given the above scenario and your experience and knowledge, which command performs best and gets the job done fastest?

Thanks in advance

alexcol

11-12-2012

Assuming the header is the first line, I compared the execution speed of sed and awk on a 450,251-line file. Here are the results:

Code:

HP-UX B.11.31 U ia64

Code:

wc -l infile
450251 infile

Code:

time awk 'FNR>1' infile > out

real    0m5.45s
user    0m2.21s
sys     0m2.86s

Code:

time sed '1d' infile > out

real    0m2.90s
user    0m1.22s
sys     0m1.43s

In this case sed won.

But it depends on what you are trying to do.
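For reference, here is a minimal sketch of equivalent ways to drop a one-line header, checked against each other on a tiny sample. The sample data and file names are made up for illustration, and `tail -n +2` is an extra candidate that was not timed above.

```shell
# Build a small sample file with a one-line header (hypothetical data).
workdir=$(mktemp -d)
printf 'id,name\n1,alpha\n2,beta\n' > "$workdir/infile"

# Three common ways to drop the first line of a file.
sed '1d'    "$workdir/infile" > "$workdir/out.sed"
awk 'FNR>1' "$workdir/infile" > "$workdir/out.awk"
tail -n +2  "$workdir/infile" > "$workdir/out.tail"

# All three outputs should be byte-for-byte identical.
cmp "$workdir/out.sed" "$workdir/out.awk" &&
    cmp "$workdir/out.sed" "$workdir/out.tail" &&
    echo "outputs match"
```

Since the results are the same, the choice between them comes down to speed on your own system and data.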

Yoda

11-12-2012

Did sed win? Or did file caching speed up sed? Modern controllers and the RAM cache on HP-UX can cache 100 MB of a single file without really using up system resources.

I vote for caching. The only fair test is two separate files.
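A rough sketch of what such a test could look like, assuming bash (for the `time` keyword) and a generated sample file. The file names and the 200,000-line size (borrowed from the original question) are only illustrative. Both copies are freshly written, so they probably both sit in cache; this puts the two commands on equal footing rather than removing caching from the picture.

```shell
workdir=$(mktemp -d)

# Generate a sample file: one header line plus 200000 data lines.
awk 'BEGIN { print "header"; for (i = 1; i <= 200000; i++) print "row " i }' \
    > "$workdir/infile.sed"

# Give awk its own physical copy so neither command reads the other's file.
cp "$workdir/infile.sed" "$workdir/infile.awk"

# Time each command against its own input file.
time sed '1d'    "$workdir/infile.sed" > "$workdir/out.sed"
time awk 'FNR>1' "$workdir/infile.awk" > "$workdir/out.awk"
```

Running each command several times and comparing the spread is more telling than any single measurement.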

BTW: programs like sed, awk, head, tail, and grep are all highly optimized for their respective jobs. There are several external factors, such as caching, I/O load (I/O request queue length), and SAN vs. local disk, that distort these kinds of tests. So, by the time you have run some tests, any time saved by picking the faster command will likely have been eaten up by the testing itself.

Your best bet is to parallelize and use the CPU and disk I/O to the max. With a quad core, you may want to consider 4 simultaneous child processes, for example:

Code:

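cd /directory || exit 1
cnt=0
# Note: word splitting on the find output assumes file names without whitespace
for fname in $(find . -type f)
do
   # Write to a temp file and replace the original only if awk succeeded
   (awk 'FNR>1' "$fname" > "tmp.${cnt}" && mv "tmp.${cnt}" "$fname") &
   cnt=$(( cnt + 1 ))
   # After every 4th job is started, wait for the whole batch to finish
   [ $(( cnt % 4 )) -eq 0 ] && wait
done
wait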
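If your find and xargs support `-print0`, `-0`, and `-P` (GNU and BSD versions do; these are not POSIX, and the stock HP-UX tools may lack them), the same idea can be sketched without manual batching. The function name `strip_headers` and the `.hdrtmp` suffix are made up for this example, and sed is used only because it was faster in the timings above.

```shell
# strip_headers DIR [JOBS]
# Remove the first line of every regular file under DIR, running up to
# JOBS (default 4) sed processes at once.
# Assumes find/xargs with -print0, -0 and -P (GNU or BSD; not POSIX).
strip_headers() {
    dir=$1
    njobs=${2:-4}
    # Skip our own temp files in case find is still walking the tree
    # while workers are already creating them.
    find "$dir" -type f ! -name '*.hdrtmp' -print0 |
        xargs -0 -n 1 -P "$njobs" sh -c '
            tmp="$1.hdrtmp"
            # Replace the original only if sed succeeded.
            sed "1d" "$1" > "$tmp" && mv "$tmp" "$1"
        ' sh
}
```

Calling `strip_headers /directory 4` keeps 4 workers busy continuously, whereas the loop above waits for the slowest job of each batch before starting the next one.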

jim mcnamara

11-12-2012

Jim, do you mean file caching helped sed because of the sequence of execution I chose? If yes, I tried the other way and still sed took less time to complete this particular task.

Code:

# time sed '1d' infile > out

real    0m3.41s
user    0m1.22s
sys     0m1.62s

Code:

# time awk 'FNR>1' infile > out

real    0m5.60s
user    0m2.20s
sys     0m3.21s

Last edited by Yoda; 11-12-2012 at 09:38 PM.. Reason: Code Added

Yoda

11-12-2012

Yes, that was what I meant. And yes, it is very likely that grep, egrep, and sed are better at massive I/O than awk, which runs interpreted. The point, I think, is that tests like this are a lot of fun, but they may not be informative unless you understand why the results can be skewed.

On my large m4000 Solaris boxes sed always outperforms awk on simple stream editing of massive files. On cygwin they come out really close.

However, by the time I've set up a fair test and run several candidates through, I could have coded and already processed 24 files in parallel, using any reasonable method.

Which is a lot less fun, I admit.

jim mcnamara

11-13-2012

OK, great. Thank you all very much for your time and knowledge. I'll start working on the script and then test it.

alexcol
