How to make awk command faster?


 
# 8  
Old 09-08-2017
Sure, I will post the time for each of the commands.

Just wanted to check: would the sort below be faster if I give it the temp folder path, or should I change the path to some other folder?

Code:
sort -T ${NLAP_TEMP} -u ${NLAP_TEMP}/hist1.out > ${NLAP_TEMP}/hist2.final; VerifyExit

# 9  
Old 09-08-2017
Quote:
Originally Posted by Peu Mukherjee
Just wanted to check: would the sort below be faster if I give it the temp folder path, or should I change the path to some other folder?
See, every single disk can do only one thing at a time: reading a byte somewhere means it can't read (or write) a byte somewhere else at that time.

Temporary files are (at least) written once and (at least) read once, your input file is (at least) read once and your output file is written once. For all these tasks you want to involve different disks, so that, while one file is being read or written, another might also be read or written at the same time.

This should answer your question: ideally you want a separate disk for each of the three files involved. Perhaps the fastest disk should be assigned to the temporary files, because they are probably read and written most often.
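For example, if /data, /scratch and /work were mount points on three separate physical disks (the names here are only placeholders), the sort could be invoked like this:

Code:
sort -T /scratch -u /data/hist1.out > /work/hist2.final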

I hope this helps.

bakunin
# 10  
Old 09-08-2017
I could see that NLAP_TEMP is the fastest directory, so I added it to the sort command, but it does not seem to help. The awk command takes only 7 minutes; the issue is just with sort, which is taking a long time.
Code:
sort -T ${NLAP_TEMP} -u ${NLAP_TEMP}/aplymeas5d.dyn.out.tmp1 > ${NLAP_HOME}/backup/aplymeas5d.dyn.final1

---------- Post updated at 05:30 AM ---------- Previous update was at 03:29 AM ----------

Please let me know how I can make sort faster. The file size is 4 GB and the sorting is taking 3 hours. We have only one disk for the TEMP folder, with 50 GB of space.
# 11  
Old 09-08-2017
You received several hints in this thread on how to accelerate the sort process. What are the results of each? Did you consult man sort for additional options?
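For instance, GNU coreutils sort (if that is what is installed) can be given a larger memory buffer and, on multi-CPU machines, extra threads; the values below are only assumptions to be tuned for your system:

Code:
sort -S 2G --parallel=4 -T ${NLAP_TEMP} -u ${NLAP_TEMP}/hist1.out > ${NLAP_TEMP}/hist2.final

-S raises the in-memory buffer so fewer temporary merge files are needed; --parallel only helps if more than one CPU is available.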
# 12  
Old 09-08-2017
Did you check that the directories are on different physical disks? By that, I mean you need to check that they are separate filesystems and what those filesystems are built on, not just that the directory names are different. What you may think of as a single update to a file will cause multiple updates on the disk. There is at least:-
  • the actual disk block for the data
  • the file's inode update with the last modified time
  • the directory (for a new file or rename) and its inode
  • the filesystem superblock (usually plural) when you get a new disk block from the free list by creating or extending the file
You also have to consider contention from other processing and if this is using NFS mounted filesystems, then you have the overhead of network traffic to bring into it.

I don't know how you have your disks provisioned. Can you explain it? If it is SAN, then that might be more difficult to speed up and depends on the disk at the back-end, the fibre capacity etc. At the other extreme, a PC with a single disk is just going to have contention even if you have a large disk cache.

Overall, if you have lots of data it is just going to take a while. I doubt I will be able to better the suggestions from my fellow learned members. How big is your input file anyway (in bytes and records)? If you try to do too much processing in one chunk, then you may also exhaust memory and cause your server to page/swap. Keeping this to discrete steps may alleviate that bottleneck, but may cost more in disk I/O. It is difficult to tell.
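To see whether the disk or memory is the bottleneck while the sort is running, you could watch the system with something like the commands below (flags and column names vary between Linux and Solaris, so treat this as a sketch):

Code:
iostat -x 5    # per-disk utilisation and service times
vmstat 5       # look for paging/swapping activity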

If you don't mind the 13th field still being there (given that they will all be 9999) you might be able to save a little by stripping the pipeline right back and doing this:-
Code:
grep -E ",9999$" hist1.out | sort -uT ${NLAP_TEMP} > hist2.final

Using the -u flag on sort saves running a separate uniq process, and therefore the memory it would use (with its risk of paging/swapping) and the cost of passing the data between the two, so that might help.
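To compare the suggestions objectively (as asked earlier in the thread), each candidate pipeline can simply be wrapped in time; in ksh or bash the time keyword measures a whole pipeline, for example:

Code:
time grep -E ",9999$" hist1.out | sort -uT ${NLAP_TEMP} > hist2.final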


I hope that this is useful, but there will always be a limit we will hit.



Robin
# 13  
Old 09-12-2017
I tried all the options, but sort is not running any faster.

Since we have only one CPU, should we split the file, sort the individual pieces, and then merge them into a single sorted file?
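For reference, that split/sort/merge approach would look roughly like the sketch below (chunk size and file names are assumptions); note that sort -T already splits and merges internally, so on a single CPU and a single disk this is unlikely to beat one big sort:

Code:
split -l 10000000 ${NLAP_TEMP}/hist1.out ${NLAP_TEMP}/chunk.
for f in ${NLAP_TEMP}/chunk.*
do
    sort -u "$f" > "$f.sorted"      # sort each chunk separately
done
sort -m -u ${NLAP_TEMP}/chunk.*.sorted > ${NLAP_TEMP}/hist2.final   # merge the pre-sorted chunks
rm ${NLAP_TEMP}/chunk.*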
# 14  
Old 09-12-2017
What about all the questions people asked you?

What physical disks are your various folders on?

If you don't know, trying random folders is unlikely to help.
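Something like this would show which filesystem (and, via the mount table, which underlying device) each directory actually sits on; the paths are the ones used earlier in the thread:

Code:
df -P ${NLAP_TEMP} ${NLAP_HOME}/backup    # the Filesystem column shows the device behind each directory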