Sorting by length


 
# 1  
Old 11-01-2012

Hello,
I have a very large file, a dictionary of around 40,000 headwords, and would like to sort it by headword length, i.e. the longest string first and the shortest last.
I have hunted on the forum for a Perl or awk script that can do the job, but could not find one.
I am a newbie to Perl and more accustomed to C programming. The C program I wrote takes ages, and I believe Perl or awk are blazing fast.
Could anybody provide me with a script and, if possible, add useful comments so that I can start writing scripts on my own?
Many thanks for your kind help.
# 2  
Old 11-01-2012
This thread might help.

Note that for reverse order sorting you can use sort -nr
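
The linked thread is not reproduced here. As a rough sketch of the idea, and only an assumption about what it suggests, each line can be prefixed with its length, sorted numerically in reverse on that prefix, and then stripped of the prefix again. The function name sort_by_length_desc is made up for illustration; lines of equal length are ordered by the current locale's collation, and whether length counts characters or bytes depends on the awk implementation and locale.

```shell
# Hypothetical helper: print lines longest first.
# Reads the files given as arguments, or standard input if there are none.
sort_by_length_desc() {
    # 1. Decorate: prefix every line with its length and a tab.
    # 2. Sort: field 1 numerically in reverse (n, r), ties by the rest of the line.
    # 3. Undecorate: drop the length field again.
    awk '{ print length($0) "\t" $0 }' "$@" |
        sort -t "$(printf '\t')" -k1,1nr -k2 |
        cut -f2-
}
```

It could be used as sort_by_length_desc dictionary.txt > sorted.txt, where both file names are placeholders.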
# 3  
Old 11-01-2012
Hi.

The available utility msort allows a number of different comparison types, and among them is string length. Here's an example on a sample of dictionary data:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate sort lines by length, msort.
# See: http://freecode.com/projects/msort

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f "$C" ] && "$C" msort

FILE=${1-data1}

pl " Input data file $FILE:"
cat "$FILE"

pl " Results from msort:"
msort -q --line -n 1,1 --comparison-type size "$FILE"

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
msort 8.44

-----
 Input data file data1:
camera
excitatory
incense
liken
offertory
peregrine
prairie
proportion
redwood
Riemannian

-----
 Results from msort:
liken
camera
redwood
incense
prairie
offertory
peregrine
excitatory
proportion
Riemannian

If msort is not in your system repository, see URL in the script comments.

Best wishes ... cheers, drl
# 4  
Old 11-01-2012
awk oneliner alternative

Code:
awk '{w[length] = w[length] ? w[length]"\n"$0 : $0} END {for(l in w) print w[l]}' file

# 5  
Old 11-01-2012
Hello,
I would really appreciate it if you could comment the first part of the code:
Code:
{w[length] = w[length] ? w[length]"\n"$0 : $0} END

Suppose I wanted to tweak it to sort from smallest to largest, how would I do it?
I am still a newbie in awk and Perl, and this is a great learning experience.
# 6  
Old 11-01-2012
For the reverse order, just pipe the output into tac:

Code:
awk '{w[length] = w[length] ? w[length]"\n"$0 : $0} END {for(l in w) print w[l]}' file | tac

length() is an awk function that returns the, ahem, length of a string. With no argument, it returns the length of $0, the current line.

w is an array indexed by word length. Every time a word of a given length is seen, it is appended, after a newline, to the string stored in w at that index. The exception is a length seen for the first time, in which case w[length] is initialized with $0. The ternary expression w[length] = w[length] ? w[length]"\n"$0 : $0 could be written as:

Code:
if (w[length]) {
    w[length]=w[length]"\n"$0
} else {
    w[length]=$0
}
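
One caveat about the one-liner: POSIX leaves the traversal order of for (l in w) unspecified, so the order in which the length groups are printed can differ between awk implementations. A minimal sketch that walks the lengths explicitly from shortest to longest, using the same grouping idea, might look like this. The function name sort_by_length_asc and the variable names are made up for illustration.

```shell
# Hypothetical helper: print lines shortest first, keeping input order within each length.
# Reads the files given as arguments, or standard input if there are none.
sort_by_length_asc() {
    awk '
    {
        n = length($0)                    # length of the current line
        if (NR == 1 || n < min) min = n   # track the shortest length seen
        if (n > max) max = n              # track the longest length seen
        group[n] = group[n] $0 RS         # append the line plus a newline to its group
    }
    END {
        # Visit every length from min to max and print the groups that exist.
        for (n = min; n <= max; n++)
            if (n in group) printf "%s", group[n]
    }' "$@"
}
```

Testing with n in group rather than the value of group[n] avoids treating a stored "0" or empty string as missing. Swapping the loop to run from max down to min would give the longest-first order.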

# 7  
Old 11-01-2012
Or, instead of tac, expanding on ripat's suggestion:
Code:
awk '{l=length; if(l>m)m=l; w[l]=w[l] $0 RS} END{for(l=m;l>=1;l--) if(w[l])printf "%s",w[l]}' infile
