Sorting problem: Multiple delimiters, multiple keys


 
# 15  
Old 07-11-2011
I'm using Ubuntu Jaunty.

But unless somebody can come up with a solution within the next hour I'm just going to write something to manually split and sort it.
# 16  
Old 07-11-2011
I was just looking at a few sort implementations; they all already write sorted chunks to a temporary directory (TMPDIR, or the directory given with -T) when dealing with very large input. You can try doing it manually, but your sort tool is probably already doing that for you.

Good luck.
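
For reference, a minimal sketch of how the scratch location can be chosen by hand. It assumes a sort that honours the TMPDIR environment variable and accepts -T (GNU sort does; -T is not required by POSIX). The directory and file names are made up for illustration, and the input is a toy stand-in for a file large enough to spill to disk.

```shell
# Hypothetical scratch directory on a filesystem with enough free space.
scratch=./sort_scratch
mkdir -p "$scratch"

# Toy input standing in for the real (much larger) file.
printf '%s\n' 'c,3' 'a,1' 'b,2' > big_input.csv

# Option 1: point sort at the scratch directory through the environment.
TMPDIR="$scratch" sort -t, -k1,1 big_input.csv > sorted_env.csv

# Option 2: pass the scratch directory explicitly with -T.
sort -t, -k1,1 -T "$scratch" big_input.csv > sorted_flag.csv
```

Either form only changes where the intermediate chunks are written; the sorted output is the same.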
# 17  
Old 07-12-2011
Quote:
Originally Posted by alister
Process forking isn't an issue. Only a few processes are created by the pipeline. Perhaps you meant the back and forth context switching between the few processes which constitute the pipeline.
Context switching isn't limited to just the pipeline processes. The fewer processes there are on the system, the less context switching there is. Hope you catch my drift.
Quote:
Originally Posted by alister
Neither of your suggestions is appropriate, though. The numeric sort will only look at a leading numeric string. This means that sort will never look beyond the first colon in the time string.
The sort on my AIX and HP-UX boxes is able to sort the entire h:m:s field, not just up to the first colon. Perhaps your sort has a limitation mine doesn't.
# 18  
Old 07-12-2011
Quote:
Originally Posted by shamrock
The sort on my AIX and HP-UX boxes is able to sort the entire h:m:s field, not just up to the first colon. Perhaps your sort has a limitation mine doesn't.
You misunderstand how sort works.

The moment that sort sees that colon in the numeric key, it stops evaluating the key. It will not go any further. All times with the same hour will compare equal because the minutes and seconds are not looked at. At this point, to decide how to order all of those equal records, sort will then look at the rest of the line ... starting with the first character. Also, it will not do so numerically, but lexicographically.

The following two excerpts are from the POSIX sort standard (the man page for sort, POSIX Section 1):

Quote:
-n
Restrict the sort key to an initial numeric string, consisting of optional <blank> characters, optional minus-sign, and zero or more digits with an optional radix character and thousands separators (as defined in the current locale), which shall be sorted by arithmetic value. An empty digit string shall be treated as zero. Leading zeros and signs on zeros shall not affect ordering.
Unless the colon is a radix or thousands separator character in the current locale (which is not the case in any locale that I'm aware of), it's not a valid part of the numeric string. The numeric key comparison will end right there.

Quote:
Except when the -u option is specified, lines that otherwise compare equal shall be ordered as if none of the options -d, -f, -i, -n, or -k were present (but with -r still in effect, if it was specified) and with all bytes in the lines significant to the comparison. The order in which lines that still compare equal are written is unspecified.


Looking again at what you proposed:
Quote:
Originally Posted by shamrock
After reading your post again I realise it won't work, as the hours field isn't zero padded, but this should...
Code:
sort -t, -k2,2n -k3,3n file

We can see that the third field numeric key will not be inspected beyond the hours portion. All times with the same hour will compare equal regardless of the value of minutes and seconds. To then decide how to order the lines which have compared equal, the first field will be considered and then everything from the first colon in the third field inclusive till the end of the line. This is obviously wrong.

If your sort(1) does not behave as described above, it's broken (with regard to POSIX compliance).


Here's a sample to illustrate the point:
Code:
$ cat data
b,11:00:59
a,11:45:00
c,11:15:00
d,11:01:00
e,11:00:00


The following does not give the desired result because the numerical comparison ends at the first colon of the second field. Therefore, every single line compares as equal (key evaluates to 11). At that point, sort then looks at the entire line. Note how in this example the output order is essentially a sort on the first field.
Code:
$ sort -t, -k2,2n data
a,11:45:00
b,11:00:59
c,11:15:00
d,11:01:00
e,11:00:00
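
A quick way to check this, sketched with the same sample file: with -u, sort keeps only one line from each set of lines whose keys compare equal, and the last-resort comparison of the whole line is not applied. If the numeric key really stops at the colon, all five lines share the key 11 and only one line should survive. Which of the five is kept is not something to rely on.

```shell
# Recreate the sample file shown above.
printf '%s\n' 'b,11:00:59' 'a,11:45:00' 'c,11:15:00' 'd,11:01:00' 'e,11:00:00' > data

# Numeric key: every line evaluates to 11, so -u should leave a single line.
sort -t, -k2,2n -u data | wc -l

# Plain (non-numeric) key: the five times are all distinct, so five lines remain.
sort -t, -k2,2 -u data | wc -l
```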


To get a sort on the time using numeric comparison (which is not really necessary in this case since each time component is two digits), you need to work around the colon with finer-grained keys. The following almost gets us there, but note that in this case the seconds are not considered, so records with the same hour and minute are ordered by the last-resort comparison of the whole line, which here means by the first field (the first two lines are in the wrong order):
Code:
$ sort -t, -k2.1,2.2n -k2.4,2.5n data
b,11:00:59
e,11:00:00
d,11:01:00
c,11:15:00
a,11:45:00


The correct solutions for this example:
Code:
$ sort -t, -k2.1,2.2n -k2.4,2.5n -k2.7,2.8n data
e,11:00:00
b,11:00:59
d,11:01:00
c,11:15:00
a,11:45:00
$ sort -t, -k2,2 data
e,11:00:00
b,11:00:59
d,11:01:00
c,11:15:00
a,11:45:00

Neither of these solutions is valid for this thread's problem, though, since they both depend on the time having a fixed format.
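
To make that limitation concrete, here is a small sketch with hours that are not zero padded (the file name and rows are invented for illustration). A plain key places 10:05:00 ahead of the 9 o'clock rows because '1' sorts before '9', and the character offsets used above no longer line up. One possible workaround, assuming the time is the only place where colons occur, is to turn the colons into field separators, sort numerically on each component, and then put the colons back:

```shell
# Hypothetical sample with non-padded hours.
printf '%s\n' 'a,10:05:00' 'b,9:30:00' 'c,9:05:10' > data_unpadded

# Lexical key: 10:05:00 is placed before the 9 o'clock rows.
sort -t, -k2,2 data_unpadded

# Split h:m:s into fields 2-4, sort each numerically, then restore the colons.
# The first substitution turns the second comma into a colon; after that,
# the "second comma" is the one between minutes and seconds.
sed 's/:/,/g' data_unpadded |
  sort -t, -k2,2n -k3,3n -k4,4n |
  sed 's/,/:/2; s/,/:/2'
```

The second pipeline should list both 9 o'clock rows (05 minutes before 30) ahead of the 10 o'clock row.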

Regards,
Alister
# 19  
Old 07-12-2011
Thanks for all your help and insight, alister.

I just thought I would come on here and explain what I ended up doing.

Problem: Server has too little hard-drive space to perform sort, but plenty of RAM, whereas local machine has plenty of hard-drive space to perform sort, but not enough RAM.

Solution:
Code:
-- Remote Server --

$ cat data/run.sh

# Location with enough space to hold one copy of file
tmp_dir="/devlinux_work3/<name>/tmp" 

sed 's/:/,/g' "$1" | sort -k 2,2 -k 3,3 -k 4,4 -k 5,5 -t',' -T "${tmp_dir}" | sed 's/,/:/3 ; s/,/:/3'

-- Local Machine --

plink <name>@<machine>.<company> "bash /devlinux_work4/<name>/data/run.sh /devlinux_work4/<name>/data/trades.csv" > trades_sorted.csv

# 20  
Old 07-12-2011
Quote:
Originally Posted by Ryan.
Thanks for all your help and insight, alister.

I just thought I would come on here and explain what I ended up doing.
You're quite welcome. And thank you for reporting back.

Regards,
Alister
# 21  
Old 07-13-2011
Your sort option may not be correct (consider -k 3,3n), since you said the hour part of the date is not zero padded.
Your server machine can hold one copy of the data, and that's enough to do the sort there. You can compress the input file and/or pipe the output of sort into a compressor to do the same with the output file. Don't worry about the extra pipeline stages; they are very efficient (especially on multi-core servers). Dealing with compressed files is sometimes more efficient because of the much lower I/O, although it is very CPU intensive. Keep in mind that in most cases you can avoid uncompressing to physical files before processing by using pipes or named pipes.
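
As a rough sketch of the compressed pipeline described here, assuming gzip is available and using placeholder file names and rows: the input is decompressed on the fly, sorted with numeric keys for the time components, and recompressed, so no uncompressed copy of the data is written to disk apart from sort's own temporary files.

```shell
# Toy compressed input standing in for the real file (hypothetical names).
printf '%s\n' 'x,B,9:30:00' 'y,A,10:05:00' 'z,A,9:05:10' | gzip -c > trades.csv.gz

# Decompress to a pipe, split the time into fields 3-5, sort with
# numeric hour/minute/second keys, restore the colons, recompress.
gzip -dc trades.csv.gz |
  sed 's/:/,/g' |
  sort -t, -k2,2 -k3,3n -k4,4n -k5,5n |
  sed 's/,/:/3; s/,/:/3' |
  gzip -c > trades_sorted.csv.gz

# Inspect the result without creating an uncompressed file.
gzip -dc trades_sorted.csv.gz
```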