Sorting on length with identification of number of characters

Shell Programming and Scripting


    #1  
Old Unix and Linux 01-19-2013
gimley

Hello,
I am writing an open-source stemmer in Java for Indic languages, which admit a large number of suffixes.
The stemmer requires the suffix strings to be sorted by length, with all strings of the same length arranged in a single group and sorted alphabetically within it. In addition, each group needs a header line giving the length of its strings in characters, say

Code:
5
6
7
8
etc.

Since the languages in question have more than 300 suffixes, sorting them by length and counting the characters in each string by hand is impractical.
An example will make this clear.
Input:

Code:
आधी
इतक
इतपत
ईचना
ईचनात
ई
ईना
ईन

Expected output:

Code:
1
ई
2
ईन
3
आधी
इतक
ईना
4
इतपत
ईचना
5
ईचनात

Since handling such a large database by hand is laborious, is it possible to write a script in AWK or Perl that would produce the above output?
Your help would go a long way towards making Java-based stemmers for different languages available to the open-source community.
Many thanks in advance for your kind help.

Last edited by Scrutinizer; 01-23-2013 at 12:16 AM.. Reason: quote tags -> code tags
    #2  
Old Unix and Linux 01-19-2013
Scrutinizer
Try:

Code:
gawk '{print length, $1}' infile | sort -n | gawk '$1!=p{print $1}{print $2; p=$1}'

You would need a version of awk that counts multi-byte characters rather than bytes, such as gawk running in a UTF-8 locale.
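A quick way to find out whether a particular awk counts characters or bytes is sketched below. The function name awk_counts_chars is made up for illustration, and the check assumes the script itself is saved as UTF-8. The test string "ई" is one character but three bytes in UTF-8, so an awk that reports 3 is counting bytes. Note that gawk counts characters only in a multi-byte locale (for example a UTF-8 one); in the C locale it counts bytes.

```shell
# Hypothetical helper: succeeds (exit status 0) if the awk named in $1
# (default: gawk) reports a length of 1 for the single character "ई".
# A byte-counting awk would report 3, because "ई" occupies 3 bytes in UTF-8.
awk_counts_chars() {
  n=$(printf 'ई\n' | "${1:-gawk}" '{ print length($0) }')
  [ "$n" -eq 1 ]
}
```

For instance, `awk_counts_chars gawk && echo "gawk counts characters here"` could be run before trusting the output of the one-liner above.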
    #3  
Old Unix and Linux 01-19-2013
gimley
Many thanks, it worked. However, since gawk under Windows does not allow pipes, I had to create three scripts:
one to compute the counts, a second to sort, and a third to print the count headers.
Desperate situations call for desperate measures. Is there any way to handle pipes under Windows?
    #4  
Old Unix and Linux 01-20-2013
drl
Hi.
Quote:
Originally Posted by gimley
Any way in which I can handle pipes under windows.
I thought cmd and command in MS Windows could handle simple pipes, like dir | more. Even if they do, you may not find sort and the other utilities used here to be available.

So ... see Cygwin for a very complete solution.

Best wishes ... cheers, drl

Last edited by drl; 01-20-2013 at 12:31 PM..