Sorting on length with identification of number of characters | Unix Linux Forums | Shell Programming and Scripting

#1 | 01-19-2013 | gimley

Hello,
I am writing an open-source stemmer in Java for Indic languages, which admit a large number of suffixes.
The stemmer requires the suffix strings to be sorted by length, with all strings of the same length placed in a single group and sorted alphabetically within it. In addition, each group needs a header line giving the length of its strings, say

Code:
5
6
7
8
etc.

Since the languages in question have more than 300 suffixes, sorting them by length and counting the characters in each string by hand is difficult.
An example will make this clear.
Input:

Code:
आधी
इतक
इतपत
ईचना
ईचनात
ई
ईना
ईन

Expected output:

Code:
1
ई
2
ईन
3
आधी
इतक
ईना
4
इतपत
ईचना
5
ईचनात

Since handling such a large database by hand is laborious, is it possible to write a script in AWK or Perl that produces the above output?
Your help would go a long way towards making Java-based stemmers for different languages available to the open-source community.
Many thanks in advance for your kind help.

#2 | 01-19-2013 | Scrutinizer (Forum Staff, Moderator)
 
Try:

Code:
gawk '{print length, $1}' infile | sort -n | gawk '$1!=p{print $1}{print $2; p=$1}'

Note that this needs a version of awk that counts multi-byte characters correctly, so that length returns the number of characters rather than the number of bytes.
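A quick way to see what a given awk counts is to feed it a single Devanagari letter. The sketch below is only a diagnostic; `awk_length` is a made-up helper name, and `en_US.UTF-8` is an assumed locale name that may not be installed on your system.

```shell
#!/bin/sh
# awk_length is a hypothetical helper: $1 = locale name, $2 = string to measure.
awk_length() {
    printf '%s\n' "$2" | env LC_ALL="$1" awk '{ print length($0) }'
}

# The letter below is U+0908, which takes 3 bytes in UTF-8.
# In the C locale awk counts bytes, so this prints 3.
awk_length C 'ई'

# Assumed locale name. Prints 1 if the locale exists and awk counts
# characters; prints 3 if awk is still counting bytes.
awk_length en_US.UTF-8 'ई'
```

If the second call prints 3, the lengths printed by the pipeline above will be byte counts and the groups will come out wrong.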
#3 | 01-20-2013 | gimley
Many thanks, it worked. However, since gawk under Windows does not allow pipes, I had to create three scripts:
one for computing the count, one for sorting, and a third for printing out the count.
Desperate situations call for desperate measures. Is there any way to handle pipes under Windows?
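One way around the pipe problem is to do the counting, sorting and grouping inside a single awk process. The following is a minimal sketch, not the method used earlier in the thread: `group_suffixes` is a made-up name, the sort is a plain insertion sort (adequate for a few hundred suffixes), and characters are counted by dropping UTF-8 continuation bytes, which assumes valid UTF-8 input and a byte-oriented locale (`LC_ALL=C`). Within one length, strings are ordered by byte value, which for UTF-8 is the same as code point order.

```shell
#!/bin/sh
# group_suffixes is a hypothetical wrapper; it reads the named files or stdin.
group_suffixes() {
    LC_ALL=C awk '
    # Count UTF-8 code points by dropping continuation bytes (0x80-0xBF).
    function charlen(s) { gsub(/[\200-\277]/, "", s); return length(s) }

    NF {
        word[++n] = $1
        len[n] = charlen($1)
        # Insertion sort: by length first, then by byte value.
        for (i = n; i > 1; i--) {
            if (len[i-1] < len[i]) break
            if (len[i-1] == len[i] && (word[i-1] "") <= (word[i] "")) break
            t = word[i]; word[i] = word[i-1]; word[i-1] = t
            t = len[i];  len[i]  = len[i-1];  len[i-1]  = t
        }
    }

    END {
        for (i = 1; i <= n; i++) {
            if (len[i] != prev) print len[i]   # header for a new length group
            print word[i]
            prev = len[i]
        }
    }' "$@"
}
```

Usage would be `group_suffixes infile > outfile`. Where no POSIX shell is available, the awk program between the quotes could be saved to a file and run with `-f`, provided awk is working in a byte-oriented locale there as well.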
#4 | 01-20-2013 | drl (Forum Advisor)
Hi.
Quote:
Originally Posted by gimley
Any way in which I can handle pipes under windows.
I thought cmd and command in MS Windows could handle simple pipes, like dir | more. Even if they do, you may not find sort, et al., to be available.

So ... see Cygwin for a very complete solution.

Best wishes ... cheers, drl
