Sorting on length with identification of number of characters | Unix Linux Forums | Shell Programming and Scripting

    #1  
Old 01-19-2013
gimley

Hello,
I am writing an open-source stemmer in Java for Indic languages, which admit a large number of suffixes.
The Java stemmer requires the suffix strings to be sorted by length, with all strings of the same length arranged in a single group and sorted alphabetically within it. Moreover, each group needs a header giving the length of its strings, say

Code:
5
6
7
8
etc.

Since the languages in question have well over 300 suffixes, sorting them by length and working out the length of each string by hand is difficult.
An example will make this clear.
Input:

Code:
आधी
इतक
इतपत
ईचना
ईचनात
ई
ईना
ईन

Expected output:

Code:
1
ई
2
ईन
3
आधी
इतक
ईना
4
इतपत
ईचना
5
ईचनात

Since handling such a large database by hand is laborious, is it possible to write a script in AWK or Perl that produces the above output?
Your help would go a long way towards making Java-based stemmers for different languages available to the open-source community.
Many thanks in advance for your kind help.

Last edited by Scrutinizer; 01-23-2013 at 12:16 AM.. Reason: quote tags -> code tags
    #2  
Old 01-19-2013
Scrutinizer
Try:

Code:
gawk '{print length, $1}' infile | sort -n | gawk '$1!=p{print $1}{print $2; p=$1}'

You need a version of awk that counts multi-byte characters correctly, for example gawk running in a UTF-8 locale.
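One quick way to check a given awk is to feed it a single multi-byte character and look at the length it reports. This is only a sketch: the function name check_awk_length is made up for illustration, and it assumes the script and the terminal both use UTF-8, where "ई" is one character stored as three bytes.

```shell
# Hypothetical helper: report whether an awk binary counts characters or bytes.
# Assumes UTF-8 input: "ई" is one code point but three bytes.
check_awk_length() {
    # $1: name of the awk command to test (defaults to "awk")
    n=$(printf 'ई\n' | "${1:-awk}" '{ print length($0) }')
    case "$n" in
        1) echo "characters" ;;      # safe for the pipeline above
        3) echo "bytes" ;;           # lengths of Devanagari strings will be wrong
        *) echo "unexpected: $n" ;;
    esac
}

check_awk_length awk
```

If the answer is "bytes", the same awk may still count characters once a UTF-8 locale is set in the environment; that depends on the implementation.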
    #3  
Old 01-19-2013
gimley
Many thanks, it worked. However, since gawk under Windows does not allow pipes, I had to create three scripts: one to compute the lengths, a second to sort, and a third to print the length headers.
Desperate situations call for desperate measures. Is there any way I can handle pipes under Windows?
    #4  
Old 01-20-2013
drl
Hi.
Quote:
Originally Posted by gimley
Any way in which I can handle pipes under windows.
I thought cmd and command in MS Windows could handle simple pipes, like dir | more. Even if they do, you may not find sort and the other utilities available.

So ... see Cygwin for a very complete solution.
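If pipes and an external sort are the obstacle, the grouping can probably also be done inside a single awk process. What follows is a rough sketch, not the pipeline posted earlier: it assumes one suffix per line, uses the whole line rather than the first field, and keeps each length group ordered with a small insertion sort. The function name group_by_length is made up for illustration. The order within a group follows awk's own string comparison, which may differ from what sort produces in a given locale, and the lengths are still only right if the awk in use counts characters rather than bytes.

```shell
# Hypothetical helper: group lines by length in one awk process,
# so that neither a pipe nor an external sort command is needed.
group_by_length() {
    awk '
    {
        line = $0 ""                 # force string (not numeric) comparison
        len = length(line)
        if (len > max) max = len
        n = ++count[len]
        # Insertion sort: shift larger entries up, then drop the new line in.
        while (n > 1 && group[len, n - 1] > line) {
            group[len, n] = group[len, n - 1]
            n--
        }
        group[len, n] = line
    }
    END {
        for (len = 1; len <= max; len++) {
            if (!(len in count)) continue
            print len                # header: length of this group
            for (i = 1; i <= count[len]; i++) print group[len, i]
        }
    }' "$@"
}
```

It reads the files named as arguments, or standard input if there are none, for example group_by_length infile. Under Windows the awk program text between the quotes could be saved to a file and run with gawk -f, which sidesteps the quoting rules of cmd.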

Best wishes ... cheers, drl

Last edited by drl; 01-20-2013 at 12:31 PM..