Sorting problem: Multiple delimiters, multiple keys

07-11-2011


Hello

If you wanted to sort a .csv file that was filled with lines like this:

<Ticker>,<Date as YYYYMMDD>,<Time as H:M:S>,<Volume>,<Corr>

(H: [1, 23]; M, S: [0, 59])

by date, does anybody know of a better solution than turning the two colons in every line into commas, sorting on four keys, and then turning the 3rd and 4th commas in every line back into colons? It seems very inefficient to me. (I would just do it and not bother asking if these files weren't 50+ GB.)

---------- Post updated at 09:43 PM ---------- Previous update was at 09:27 PM ----------

Meh, I'll let it run overnight.

Code:

sed 's/:/,/g' big_file.csv | sort -k 2,2 -k 3,3 -k 4,4 -k 5,5 -t',' | sed 's/,/:/3' | sed 's/,/:/3' > big_file.sorted.csv

Ryan.


07-11-2011


How about:

Code:

sort -t, -k2,2n

rdcwayx


07-11-2011


Try this one:

Code:

sed 's/:/,/g' test_sort | sort -t ',' -k 2,7 | sed 's/,/:/3' | sed 's/,/:/3' > test_org_final && mv test_org_final test_sort

Abhishek_1984

07-11-2011


Sort broke @ 2am:

Code:

read failed: /tmp/sortOgLpWg: Input/output error

This is a real problem now. Does anybody know of a way to sort a humongous file that won't (likely) break?

Ryan.

07-11-2011


Are hours, minutes and seconds all zero padded? For example, 01:02:03 instead of 1:2:3 or 1:02:03? If so, you do not need to modify anything. You can use the default lexicographical sort with the date and time fields as the keys.

Also, you mentioned that hours range between 1 and 23. In case it's relevant, that's only a 23-hour day.

If the source file is 50+ GB, you are going to need a lot of RAM. You'll probably need to split the file into smaller chunks, sort them individually, and then merge them with sort -m.
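
The split, sort, merge approach described above could look roughly like the following. This is only a sketch under assumptions: the sample rows, the chunk size of 2 lines, and the `chunk.` prefix are made up for illustration, and the times are assumed to be zero padded so that plain lexicographic keys are enough.

```shell
# Hypothetical sample data with zero-padded times (an assumption for this sketch).
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
AAA,20110711,09:30:00,100,0.5
BBB,20110708,15:30:00,300,0.1
CCC,20110711,09:05:10,50,0.9
DDD,20110711,10:05:03,200,0.4
EEE,20110708,01:02:03,10,0.2
EOF

# 1. Split on line boundaries. 2 lines per chunk is only for the demo;
#    for a real file choose a line count whose chunk fits comfortably in RAM.
split -l 2 "$workdir/sample.csv" "$workdir/chunk."

# 2. Sort every chunk in place by date (field 2), then time (field 3).
for chunk in "$workdir"/chunk.*; do
    LC_ALL=C sort -t, -k2,2 -k3,3 -o "$chunk" "$chunk"
done

# 3. Merge the sorted chunks. -m assumes its inputs are already sorted
#    and must be given the same keys as step 2.
merged=$(LC_ALL=C sort -m -t, -k2,2 -k3,3 "$workdir"/chunk.*)
printf '%s\n' "$merged"

# Reference result from sorting the whole sample at once, for comparison.
direct=$(LC_ALL=C sort -t, -k2,2 -k3,3 "$workdir/sample.csv")
rm -r "$workdir"
```

The merge step only interleaves lines from inputs that are already sorted, so it needs far less memory than a full sort of the original file.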

Regards,
Alister

alister


07-11-2011


Oddly, the hours aren't zero padded but the minutes and seconds are. (I think it's like [12]?[0-9]:[0-5][0-9]:[0-5][0-9] in regex-speak.)

I'm going to try to figure out how to split it up and then attempt sorting again -- thanks.
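
With only the hours unpadded, it may be possible to skip the sed round trip entirely. The sketch below is one way to do that, not something tested on the real file: it sorts the time field twice, once numerically (numeric comparison stops reading at the first colon, so it compares the hour alone) and once as plain text (which orders the zero-padded MM:SS within the same hour). The sample rows are hypothetical.

```shell
# Hypothetical sample in the thread's layout: Ticker,YYYYMMDD,H:MM:SS,Volume,Corr
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
IBM,20110711,10:05:03,100,0.5
IBM,20110711,9:59:59,200,0.4
MSFT,20110708,15:30:00,300,0.1
MSFT,20110711,9:05:10,50,0.9
EOF

# Key 1: date (field 2), compared as text; YYYYMMDD sorts correctly that way.
# Key 2: field 3 compared numerically, which reads only the leading hour.
# Key 3: field 3 compared as text, which orders MM:SS when the hours are equal.
sorted=$(LC_ALL=C sort -t, -k2,2 -k3,3n -k3,3 "$workdir/sample.csv")
printf '%s\n' "$sorted"
# Prints the rows in this order:
#   MSFT,20110708,15:30:00,300,0.1
#   MSFT,20110711,9:05:10,50,0.9
#   IBM,20110711,9:59:59,200,0.4
#   IBM,20110711,10:05:03,100,0.5
rm -r "$workdir"
```

`LC_ALL=C` is there so that the comparison is plain byte order and does not depend on the locale.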

Ryan.

07-11-2011


By the way, the consecutive seds in the pipeline can be simplified: sed 's/,/:/3 ; s/,/:/3'.

That'll save some time in context switches and copying data in and out of kernel/userland buffers.
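
A quick way to check that the single sed gives the same result as the two chained ones, using a made-up line as it would look after the sort step:

```shell
# Hypothetical line with the time still split by commas (H,MM,SS).
line='IBM,20110711,9,05,03,100,0.5'

# Two sed processes, as in the original pipeline.
two_seds=$(printf '%s\n' "$line" | sed 's/,/:/3' | sed 's/,/:/3')

# One sed process running both substitutions in sequence on each line.
one_sed=$(printf '%s\n' "$line" | sed 's/,/:/3; s/,/:/3')

printf '%s\n%s\n' "$two_seds" "$one_sed"
# Both lines print: IBM,20110711,9:05:03,100,0.5
```

The second substitution also targets the 3rd comma because, once the first one has run, the comma that used to be 4th has become the 3rd.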

Regards,
Alister

---------- Post updated at 02:55 PM ---------- Previous update was at 02:53 PM ----------

Also, it seems GNU sort can handle this situation by automatically creating temporary files during the sorting process. I'm assuming you're not on Linux. If you are, and you are using GNU sort, you should paste the exact error message.
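
Since the failing read was on a file under /tmp, one thing that might be worth trying is pointing sort's temporary files at a filesystem with more free space. A sketch, assuming GNU sort: `-T` sets the directory for temporary files and `-S` sets the size of the in-memory buffer. The directory name, the `64M` value, the sample rows, and the date-only key are placeholders for the demo, not recommendations.

```shell
# Hypothetical sample data and a stand-in for a roomy scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/sort_tmp"
cat > "$workdir/sample.csv" <<'EOF'
AAA,20110711,9:30:00,100,0.5
BBB,20110708,15:30:00,300,0.1
CCC,20110709,9:05:10,50,0.9
EOF

# -T: where sort writes its temporary files instead of the default /tmp.
# -S: how much memory sort may use before spilling to those files (GNU option).
sorted=$(LC_ALL=C sort -T "$workdir/sort_tmp" -S 64M -t, -k2,2 "$workdir/sample.csv")
printf '%s\n' "$sorted"
rm -r "$workdir"
```

On a real run the `-T` directory would need enough free space for roughly the size of the input.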

alister
