Post a few lines of your input file so it is apparent what you are talking about... also, try doing it all in a single command, be it sort, awk, or perl, in order to minimize the inefficiency due to process forking.
And for grins, how about...
After reading your post again, I realise it won't work, as the hrs. field isn't zero-padded, but this should...
Process forking isn't an issue here; only a few processes are created by the pipeline. Perhaps you meant the back-and-forth context switching between the few processes which constitute the pipeline.
Neither of your suggestions is appropriate, though. A numeric sort only considers the leading numeric string, which means sort will never look beyond the first colon in the time string.
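For instance, a small sketch with made-up times (this assumes GNU sort's usual byte-wise last-resort tie-break when the numeric keys compare equal):

```shell
# -n compares only the leading numeric string, so both lines here
# tie on "1" and fall through to a plain byte-wise comparison of the
# whole line, which misorders the unpadded minutes
printf '1:9\n1:10\n' | LC_ALL=C sort -n
# prints 1:10 before 1:9
```

So neither the hours nor the minutes are reliably ordered unless every field is zero-padded.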
Regards,
Alister
Quote:
Originally Posted by Ryan.
This is what happened when I tried to use split:
I'm going to try using numeric suffixes and try again.
Make sure you adjust the suffix length using -a so that the number of suffix permutations can accommodate the number of expected files.
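A quick sketch of what I mean (the file and chunk names are just examples; the tiny sample file stands in for your real data):

```shell
# sample stand-in for the real input file
seq 1 10 > bigfile.txt
# -l sets lines per chunk; -a 4 widens the suffix so up to 26^4
# (456976) alphabetic chunks are possible instead of the default 676
split -l 3 -a 4 bigfile.txt chunk_
ls chunk_*   # chunk_aaaa chunk_aaab chunk_aaac chunk_aaad
```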
If I split the file up, every time two "sorted" files are combined I still need to sort the merged file, and therefore I run into the same problem.
No you don't. During the sorting step, the entire file's contents are in use. During the merging step, only one line per file being merged needs to be in memory.
Think about it. If you know that two files are already sorted, you only need to compare two lines at a time, make a decision which comes first, print the correct line, read the line that follows that which was printed, rinse and repeat.
Whereas when a file is not sorted, you do not know where a line goes until you've read the entire file at least once.
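That merge step is exactly what sort's -m option does for you; a toy sketch (file names made up):

```shell
printf '1\n4\n7\n' > a.txt      # already sorted
printf '2\n3\n9\n' > b.txt      # already sorted
# -m merges without re-sorting, holding only one line per input file
sort -m -n a.txt b.txt          # 1 2 3 4 7 9, one per line
```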
Regards,
Alister
I totally misunderstood you the first time.
So, basically try to split the file by each ticker, and then write some simple code to do my own sorting, correct?
Edit: I guess that all could have been summed up in two words: "Insertion sort"
No. I was not describing a specific sorting algorithm (insertion sort, quicksort, etc.), but an approach that allows one to deal with more data than memory alone can hold: External sorting - Wikipedia.
As I said earlier, GNU sort should do this external sort for you (you have yet to make it clear which platform you're working with). It checks the size of the file, checks how much memory the system has available, sees that the file is much too big, and decides to use temp files to store sorted chunks for subsequent merging.
Whatever sort utility you're using, I'm assuming it's doing this since your error message mentions a temp file.
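If it is GNU sort, you can steer that behaviour yourself with -S (memory buffer) and -T (temp directory), both GNU extensions; all the names and sizes below are only examples:

```shell
# example names throughout; bigfile.txt stands in for the real data
seq 5 -1 1 > bigfile.txt           # sample unsorted input
mkdir -p ./sort_tmp                # pick somewhere with enough free space
# -S caps the in-memory buffer; -T says where sorted chunks are spilled
sort -n -S 64M -T ./sort_tmp -o sorted.txt bigfile.txt
cat sorted.txt                     # 1 2 3 4 5, one per line
```

Pointing -T at a filesystem with plenty of free space is the usual fix when a big sort dies partway through.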
Perhaps someone familiar with your operating system can give more specific advice about that read I/O error.
Is it possible that your /tmp ran out of space during the sort? That something cleared /tmp while the sort was running? That the hardware is having issues?
It would also be helpful to know the specs of your hardware (ram, available space on relevant filesystems, and such).