Sorting on length with identification of number of characters

#1  01-19-2013  gimley

Hello,
I am writing an open-source stemmer in Java for Indic languages, which admit a large number of suffixes.
The Java stemmer requires the suffix strings to be sorted by length, with all strings of the same length placed in a single group and sorted alphabetically within it. In addition, each group needs a header line giving the length of its strings, say

Code:
5
6
7
8
etc.

Since the languages in question have more than 300 suffixes each, sorting them by length and counting the characters of every string by hand is tedious and error-prone.
An example will make this clear.
Input:

Code:
आधी
इतक
इतपत
ईचना
ईचनात
ई
ईना
ईन

Expected output:

Code:
1
ई
2
ईन
3
आधी
इतक
ईना
4
इतपत
ईचना
5
ईचनात

Since handling such a large list by hand is laborious, is it possible to write a script in AWK or Perl that produces the above output?
Your help would go a long way toward making Java-based stemmers for different languages available to the open-source community.
Many thanks in advance for your kind help.

Last edited by Scrutinizer; 01-23-2013 at 12:16 AM.. Reason: quote tags -> code tags
#2  01-19-2013  Scrutinizer (Moderator)
Try:

Code:
gawk '{print length, $1}' infile | sort -n | gawk '$1!=p{print $1}{print $2; p=$1}'

You need a version of awk that counts multi-byte characters correctly, such as gawk running in a UTF-8 locale.
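
A quick way to check how a given awk counts, together with a variant of the pipeline above that spells out the sort keys, might look like the sketch below. The function names awk_char_len and group_by_length are made up for illustration, and the sketch assumes a POSIX shell with awk and sort on the PATH.

```shell
# Print the length awk reports for a single string.
# For the UTF-8 encoded character ई this is 1 if awk counts characters
# and 3 if it counts bytes.
awk_char_len() {
    printf '%s\n' "$1" | awk '{ print length($0) }'
}

# Hypothetical wrapper around the pipeline: prefix each word with its
# length, sort numerically on the length and then on the word itself,
# and print the length once as a header whenever it changes.
# Blank lines are skipped (NF is 0 for them).
group_by_length() {
    awk 'NF { print length($1), $1 }' "$@" |
        sort -k1,1n -k2,2 |
        awk '$1 != prev { print $1 } { print $2; prev = $1 }'
}
```

It could be called as group_by_length infile, or fed from standard input. The order of words within one length group follows the collation rules of the current locale, so it is worth comparing the result against the expected output from the first post.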
#3  01-19-2013  gimley
Many thanks, it worked. However, since gawk under Windows would not let me use pipes, I had to create three scripts:
one to compute the lengths, one to sort, and one to print the length headers.
Desperate situations call for desperate measures. Is there any way to handle pipes under Windows?
#4  01-20-2013  drl (Forum Advisor)
Hi.
Quote:
Originally Posted by gimley
Any way in which I can handle pipes under windows.
I thought cmd and command in MS Windows could handle simple pipes, like dir | more. Even if they can, you may not find sort and the other Unix utilities available.

So ... see Cygwin for a very complete solution.
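
If pipes or a Unix-style sort are the obstacle, one alternative is to do the counting, sorting and printing inside a single awk program. The following is only a sketch: the function name group_by_length_awk is hypothetical, the insertion sort is a stand-in chosen because POSIX awk has no built-in sort, and the character counts are only right with an awk that is multi-byte aware (for example gawk in a UTF-8 locale).

```shell
# Hypothetical single-process variant: no pipes and no external sort.
group_by_length_awk() {
    awk '
    NF {
        # Remember each word under its length.
        n = length($1)
        if (n > max) max = n
        count[n]++
        word[n, count[n]] = $1
    }
    END {
        for (n = 1; n <= max; n++) {
            if (!(n in count)) continue
            # Insertion sort of the words of length n; appending ""
            # forces a string comparison rather than a numeric one.
            for (i = 2; i <= count[n]; i++) {
                key = word[n, i]
                for (j = i - 1; j >= 1 && (word[n, j] "") > (key ""); j--)
                    word[n, j + 1] = word[n, j]
                word[n, j + 1] = key
            }
            # Header line with the length, then the sorted group.
            print n
            for (i = 1; i <= count[n]; i++) print word[n, i]
        }
    }' "$@"
}
```

The awk program between the quotes can also be saved to a file and run with awk's -f option, so no pipe is involved at all. Words are ordered with awk's own string comparison, which may differ from what sort produces in a given locale.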

Best wishes ... cheers, drl

Last edited by drl; 01-20-2013 at 12:31 PM..