Post 302758423 by gimley on Saturday 19th of January 2013 07:59:43 AM
Sorting on length with identification of number of characters

Hello,
I am writing an open-source stemmer in Java for Indic languages which admit a large number of suffixes.
The Java stemmer requires the suffix strings to be sorted by length, with all strings of the same length arranged in a single group and sorted alphabetically. Moreover, each group needs a header giving the length of its strings (the number of characters), for example:
Code:
5
6
7
8
etc.

Since the languages in question have more than 300 suffixes, sorting them by length and counting the characters of each string by hand becomes difficult.
An example will make this clear.
Input:
Code:
आधी
इतक
इतपत
ईचना
ईचनात
ई
ईना
ईन

Expected output:
Code:
1
ई
2
ईन
3
आधी
इतक
ईना
4
इतपत
ईचना
5
ईचनात
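
The lengths in the expected output appear to count Unicode code points, so a dependent vowel sign such as ी counts as a character of its own (आधी is listed under 3). A small check of that assumption, sketched in Python; the helper name `code_points` is made up for illustration and is not part of the stemmer:

```python
import unicodedata


def code_points(word):
    """List each code point of word as (hex value, Unicode name)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]


# "आधी" is made of two letters plus one dependent vowel sign, so a
# code-point count gives 3, matching the header in the expected output.
sample_word = "आधी"
breakdown = code_points(sample_word)
```

If lengths were instead meant to count user-perceived characters (a consonant together with its vowel sign as one unit), the groups would differ, so this is worth confirming against what the stemmer expects.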

Since handling such a large database by hand is laborious, is it possible to write a script in AWK or Perl that would produce the above output?
Your help would go a long way towards putting Java-based stemmers for different languages into the open-source community.
Many thanks in advance for your kind help.
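
For illustration, the grouping described above could be sketched as follows. This is in Python rather than the AWK or Perl asked for, and it rests on two assumptions: length means the number of Unicode code points, and "alphabetically" means plain code point order, which happens to match the sample. The function name `group_by_length` is hypothetical.

```python
from itertools import groupby


def group_by_length(suffixes):
    """Return output lines: a length header, then the sorted suffixes of that length.

    Assumptions (not confirmed by the stemmer): length is the number of
    Unicode code points, and ordering inside a group is code point order.
    """
    # Strip surrounding whitespace and skip blank lines.
    words = [s.strip() for s in suffixes if s.strip()]
    # Sort by length first, then by code point order within equal lengths.
    words.sort(key=lambda w: (len(w), w))

    lines = []
    # groupby needs its input sorted by the grouping key, which it now is.
    for length, group in groupby(words, key=len):
        lines.append(str(length))  # header line for this group
        lines.extend(group)        # the suffixes of that length
    return lines
```

Passing the eight sample suffixes to `group_by_length` returns the thirteen lines of the expected output in order. For a real suffix file, the lines could be read from a UTF-8 file and the result joined with newlines; a locale-aware collation might order some strings differently from code point order.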

Last edited by Scrutinizer; 01-23-2013 at 01:16 AM.. Reason: quote tags -> code tags
 
