How do I sort lines lexicographically in bash?


 
# 1  
Old 12-19-2016
I am currently having some problems with my script not sorting my files lexicographically.

The error seems to be localized here, where I sort the utt2spk file, which is done like this:

Code:
    for x in test train; do
            for f in text utt2spk; do
                sort data/$x/$f -o data/$x/$f
            done
    done

The problem with this for loop is that it doesn't sort the numbers correctly. In this case I get the text

Code:
    fkdo-b-cen6 fkdo
    fkdo-b-cen7 fkdo
    fkdo-b-cen8 fkdo
    flrp-b-an2121 flrp
    flrp-b-an21 flrp
    flrp-b-an22 flrp
    flrp-b-an23 flrp
    flrp-b-an24 flrp
    flrp-b-an25 flrp
    flrp-b-cen1 flrp

which should have been

Code:
    fkdo-b-cen6 fkdo
    fkdo-b-cen7 fkdo
    fkdo-b-cen8 fkdo
    flrp-b-an21 flrp
    flrp-b-an22 flrp
    flrp-b-an23 flrp
    flrp-b-an24 flrp
    flrp-b-an25 flrp
    flrp-b-an2121 flrp
    flrp-b-cen1 flrp

So why isn't it sorting correctly, and how do I make it sort correctly?
# 2  
Old 12-19-2016
Quote:
Originally Posted by kidi
The error seems to be localized here, where I sort the utt2spk file, which is done like this:

Code:
    for x in test train; do
            for f in text utt2spk; do
                sort data/$x/$f -o data/$x/$f
            done
    done

There are several problems here and they are not necessarily related. Let me address them one by one:

1) input file as output file
With shell redirection you cannot use the file you read from as the output file at the same time: in sort file > file the shell truncates the file before sort reads it. The -o option of sort is the exception, because POSIX explicitly allows the output file to be one of the input files. Writing to an intermediate file and then moving it over the original is still a reasonable habit, since it makes the whole process a little bit safer in case something goes wrong. Take the following as a sketch and modify the error handling according to your needs:

Code:
for x in test train; do
    for f in text utt2spk; do
        # sort data/$x/$f -o data/$x/$f

        # write to a temporary file first; replace the original only on success
        if sort -o "data/${x}/${f}.tmp" "data/${x}/${f}" ; then
            mv "data/${x}/${f}.tmp" "data/${x}/${f}"
        else
            echo "something went wrong with data/${x}/${f}" >&2
            exit 1
        fi
    done
done
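
As a side note on point 1, here is a minimal sketch of sorting in place with -o. The scratch directory and the three sample lines are made up for the demonstration and are not part of the original data/ tree.

```shell
#!/usr/bin/env bash
# Sketch: sort a file in place with -o, using a scratch file (hypothetical data).
tmpdir=$(mktemp -d)
f="$tmpdir/utt2spk"
printf '%s\n' 'b spk2' 'c spk3' 'a spk1' > "$f"

# sort has to read all of its input before it can write anything,
# and POSIX allows the -o file to be one of the input files.
LC_ALL=C sort -o "$f" "$f"
inplace_result=$(cat "$f")
printf '%s\n' "$inplace_result"

# By contrast, `sort "$f" > "$f"` would let the shell truncate "$f"
# before sort ever reads it, leaving an empty file.
rm -rf "$tmpdir"
```

The temporary-file variant above remains the more defensive choice if the sort can be interrupted halfway.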

2) Note the difference between numerical and alphabetical sorting
In your request you imply that you expect the file to be (partially) sorted numerically. The difference is that alphabetically "a12bc" comes after "a123bc", because "3" (the 4th character of the second string) sorts before "b" in ASCII. Numerically, though, you will want to have "12" before "123". You need to define a numeric sort order by using the "-n" switch of sort. Bear in mind that "-n" only interprets a number at the start of the key it is given, so it has to be applied to the part of the line that holds the number. I suggest reading the man page of sort for the details.
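
To make the "a12bc" versus "a123bc" example concrete, here is a small sketch. It uses the -V (version sort) option, which is a GNU coreutils extension and not the "-n" switch mentioned above, so treat it as an assumption that your sort supports it.

```shell
#!/usr/bin/env bash
# Byte-wise (alphabetical) order: "3" sorts before "b", so a123bc comes first.
bytewise=$(printf '%s\n' a12bc a123bc | LC_ALL=C sort)
printf 'bytewise:\n%s\n' "$bytewise"

# Version sort (GNU extension): runs of digits are compared as numbers,
# so 12 comes before 123 and a12bc comes first.
natural=$(printf '%s\n' a12bc a123bc | LC_ALL=C sort -V)
printf 'version sort:\n%s\n' "$natural"
```

Plain "-n" on these two lines would not help by itself, because neither line starts with a number.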

3) Internationalisation
According to the POSIX documentation, sort uses the internationalisation variables (LANG, LC_*, ...) at startup to determine the collation sequence that applies to the sort. This mainly matters for special characters (like the umlauts in German ["ä", "ö", ...], the Spanish eñe ["ñ"], etc.), but some non-C locales also give blanks and punctuation a lower weight than letters and digits, which can change the order of lines like yours. Setting LC_ALL=C for the sort forces a plain byte-by-byte comparison.
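
A quick way to see whether the locale plays a role is to sort the sample lines from the first post with LC_ALL=C and compare that with the result under the current locale. This is only a sketch: the variable names are made up, and whether the second command prints a different order depends on the locale and C library in use.

```shell
#!/usr/bin/env bash
# Sample lines from the first post, in their unsorted order.
sample='fkdo-b-cen6 fkdo
fkdo-b-cen7 fkdo
fkdo-b-cen8 fkdo
flrp-b-an2121 flrp
flrp-b-an21 flrp
flrp-b-an22 flrp
flrp-b-an23 flrp
flrp-b-an24 flrp
flrp-b-an25 flrp
flrp-b-cen1 flrp'

# Byte-wise comparison: a space (0x20) sorts before any digit,
# so "an21 flrp" comes before "an2121 flrp".
c_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf 'LC_ALL=C:\n%s\n' "$c_sorted"

# Same input under whatever locale the shell currently has; the result
# may differ from the one above, depending on that locale's collation rules.
printf 'current locale:\n'
printf '%s\n' "$sample" | sort
```

If the two listings differ, the locale is at least part of the story, and putting LC_ALL=C in front of every sort in the script keeps the results consistent.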

I hope this helps.

bakunin

Last edited by bakunin; 12-19-2016 at 03:33 PM.. Reason: typos
# 3  
Old 12-19-2016
So.. To sum it up, you are saying that sort should do this automatically, and my error should be somewhere else?
# 4  
Old 12-19-2016
Quote:
Originally Posted by kidi
So.. To sum it up, you are saying that sort should do this automatically, and my error should be somewhere else?
Actually: no.

To sum it up, I said:

Quote:
Originally Posted by bakunin
You need to define a numeric sort order by using the "-n" switch of sort. I suggest reading the man page of sort for the details.
This (the lack of the "-n" option) is, in fact, what is causing the wrong sort. As I do not know the detailed layout of your input file, I cannot explain exactly what you need to specify as sort options, but reading the man page of sort (try the command man sort) should explain what you need, given the pointers I gave you.

I hope this helps.

bakunin
# 5  
Old 12-19-2016
I don't think that sort will automatically do what you want; you need to give it the information on how to sort. The problem you might run into is that, if you want to sort numerically, the number starts at a variable offset from the start of the string.

If the field were numeric from (say) character 11 on every line, we could work with that using the -k flag, even if it was a bit complex. Because we can't be sure where the digits start, this will be more complex.

One way might be to process the file and insert a placeholder character (choose something that will never appear naturally in the file) so that we can use it to get the numerics into a fixed position. Then we can sort numerically as a secondary key (with the primary sort key stopping before the numeric) and finally strip out the placeholder character.

Put in a more structured form:-
  1. Convert lines that start something like flrp-b-an2 to start like this: flrp-b-an@2 (using @ as the placeholder).
  2. Sort the file with the primary key starting at field 1, character 1 and ending at field 1, character 10 (inclusive)
  3. ... and the secondary key being numeric, starting at field 1, character 11 and ending at the end of field 1
  4. Strip out the placeholder characters
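
The steps above can be sketched roughly as follows. Two assumptions that are not in the original data description: the character @ never occurs in the file, and the only digits in the first field are the trailing number. Instead of fixed character positions, the sketch lets sort split on the placeholder with -t, which is a slightly different technique from the column-based keys described in steps 2 and 3.

```shell
#!/usr/bin/env bash
# Sample lines from the first post, written to a scratch file.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'fkdo-b-cen6 fkdo' 'fkdo-b-cen7 fkdo' 'fkdo-b-cen8 fkdo' \
  'flrp-b-an2121 flrp' 'flrp-b-an21 flrp' 'flrp-b-an22 flrp' \
  'flrp-b-an23 flrp' 'flrp-b-an24 flrp' 'flrp-b-an25 flrp' \
  'flrp-b-cen1 flrp' > "$tmpdir/data1"

placeholder_sorted=$(
  # 1. Put @ in front of the first run of digits in field 1.
  sed -E 's/^([^ 0-9]*)([0-9]+)/\1@\2/' "$tmpdir/data1" |
  # 2./3. Primary key: text before @; secondary key: the number after it.
  LC_ALL=C sort -t '@' -k1,1 -k2,2n |
  # 4. Strip the placeholder again.
  sed 's/@//'
)
printf '%s\n' "$placeholder_sorted"
rm -rf "$tmpdir"
```

Lines without any digit in the first field get no placeholder and simply sort with an empty (zero) numeric key.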


Would that help?

Robin
# 6  
Old 12-19-2016
Hi.

In situations like this, I use msort. It is in many ways a work-alike for the standard sort, but it has a number of extra features, including the one we use here: a hybrid key, which is composed of alphabetic and numeric. Note that there is just the single command msort that does the work, after the setup and the verification of correctness:
Code:
#!/usr/bin/env bash

# @(#) s1       Demonstrate "hybrid" key ordering, msort.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C msort pass-fail

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE:"
cat $FILE

pl " Results:"
msort -j -q --line --position 1,1 --comparison-type hybrid $FILE |
tee f1

pl " Verify results if possible:"
paste f1 expected-output.txt
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

exit 0

producing:
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
msort 8.53
pass-fail (local) 1.9

-----
 Input data file data1:
fkdo-b-cen6 fkdo
fkdo-b-cen7 fkdo
fkdo-b-cen8 fkdo
flrp-b-an2121 flrp
flrp-b-an21 flrp
flrp-b-an22 flrp
flrp-b-an23 flrp
flrp-b-an24 flrp
flrp-b-an25 flrp
flrp-b-cen1 flrp

-----
 Results:
fkdo-b-cen6 fkdo
fkdo-b-cen7 fkdo
fkdo-b-cen8 fkdo
flrp-b-an21 flrp
flrp-b-an22 flrp
flrp-b-an23 flrp
flrp-b-an24 flrp
flrp-b-an25 flrp
flrp-b-an2121 flrp
flrp-b-cen1 flrp

-----
 Verify results if possible:
fkdo-b-cen6 fkdo        fkdo-b-cen6 fkdo
fkdo-b-cen7 fkdo        fkdo-b-cen7 fkdo
fkdo-b-cen8 fkdo        fkdo-b-cen8 fkdo
flrp-b-an21 flrp        flrp-b-an21 flrp
flrp-b-an22 flrp        flrp-b-an22 flrp
flrp-b-an23 flrp        flrp-b-an23 flrp
flrp-b-an24 flrp        flrp-b-an24 flrp
flrp-b-an25 flrp        flrp-b-an25 flrp
flrp-b-an2121 flrp      flrp-b-an2121 flrp
flrp-b-cen1 flrp        flrp-b-cen1 flrp

-----
 Comparison of 10 created lines with 10 lines of desired results:
 Succeeded -- files (computed) f1 and (standard) expected-output.txt have same content.

There is a price to pay for the features -- msort is slower than the standard sort.

The command msort can be found in many repositories, or, as noted in the details below, at the msort home site:
Code:
msort   sort records in complex ways (man)
Path    : /usr/bin/msort
Version : 8.53
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with -h,--help
Home    : http://billposer.org/Software/msort.html

Best wishes ... cheers, drl
# 7  
Old 12-19-2016
Thanks for the response. I guess I should have added a detail: how I check whether the sorting is done correctly.

I have a function which goes through the file, and if that function deems it OK, the file counts as sorted correctly. From what I can tell, it checks lexicographically.

The function is:

Code:
function check_sorted_and_uniq {
  ! awk '{print $1}' $1 | sort | uniq | cmp -s - <(awk '{print $1}' $1) && \
    echo "$0: file $1 is not in sorted order or has duplicates" && exit 1;
}
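
Since this check compares the first column against the output of a plain sort, it appears to expect lexicographic rather than numeric order, so flrp-b-an2121 would have to sit between flrp-b-an21 and flrp-b-an22. The sketch below is a hypothetical variant of the function, not the original one: it pins the locale with LC_ALL=C, returns a status instead of exiting, and uses sort -c -u, which checks for sorted order and duplicate keys in one go.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the check: succeeds only if column 1 is in
# strictly increasing byte order (sorted, no duplicates).
check_sorted_and_uniq() {
  awk '{print $1}' "$1" | LC_ALL=C sort -c -u 2>/dev/null
}

tmpdir=$(mktemp -d)
printf '%s\n' 'flrp-b-an2121 flrp' 'flrp-b-an21 flrp' 'flrp-b-an22 flrp' \
  > "$tmpdir/utt2spk"

# Unsorted input: the check is expected to fail.
if check_sorted_and_uniq "$tmpdir/utt2spk"; then before=0; else before=1; fi

# Sort with the same locale the check uses, then check again.
LC_ALL=C sort -o "$tmpdir/utt2spk" "$tmpdir/utt2spk"
if check_sorted_and_uniq "$tmpdir/utt2spk"; then after=0; else after=1; fi

echo "before=$before after=$after"
rm -rf "$tmpdir"
```

The point of the design is that the sort and the check use the same collation; mixing a locale-aware sort with a byte-wise check (or the other way round) is one way to end up with a file that looks sorted but fails the test.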

---------- Post updated at 02:41 PM ---------- Previous update was at 10:33 AM ----------

Quote:
Originally Posted by drl
Hi.

In situations like this, I use msort. It is in many ways a work-alike for the standard sort, but it has a number of extra features, including the one we use here: a hybrid key, which is composed of alphabetic and numeric. [...]

I am not sure I understand how I should make it sort my input text. How slow is it?