Identifying suffixes in a file and printing them out

04-04-2012

Hello,
I am interested in finding and identifying suffixes in Indian names with an awk script or a Perl program. Suffixes are normally found at the end of a word, as shown in the sample given below.
What I need is a Perl script that identifies suffixes of a defined length, given on the command line, and writes them out to a separate file (if possible with their frequency).
My Perl and awk scripting skills do not go that far, hence this request for help with an interesting problem that could be useful in other cases as well.
The script should identify suffixes more than two characters in length.
A sample is given below:

Code:

chandrashekhar
hansa
hansaben
hemant
hemantbhai
hemaprasad
mohanchandra
raj
rajchandra
rajprasad
rajshekhar
sharadbhai
shardaben

The expected output in a separate file would be

Code:

ben	2
bhai	2
chandra	2
prasad	2
shekhar	2

Many thanks for any help given. The database is large: around 80,000 words.

Last edited by Scrutinizer; 04-04-2012 at 01:27 PM.. Reason: Almost right: use code tags instead of quote tags

gimley

04-04-2012

How would one know which suffixes to look for? For example, say you're looking for the suffix 'ben' in a file containing a list of names. How would one know whether it is 'ben' or 'bhai' that is to be looked for? Are these suffixes defined in a separate file?

balajesuri

04-04-2012

Many thanks for your query.
The answer is: unfortunately, no. I agree that this is a major issue. I have written a reverse sort in Perl which sorts the words by their endings, and I tried to extract the suffixes from the resulting list by hand: a laborious and tedious job. This is why I thought of trying to get the results programmatically. I know that I will not always get the right answers, but the false positives can always be weeded out.
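
For anyone who wants to try the reverse-sort step, here is a minimal sketch in shell and awk. It is not the Perl script mentioned above; the function name revsort is made up for illustration, and it assumes one name per line with no tab characters. Each line is prefixed with its own reversal as a sort key, sorted, and the key is then cut away again, so names with the same ending land next to each other.

```shell
# revsort FILE: list the names in FILE ordered by their endings.
# (hypothetical helper, assumes one name per line and no tabs in the names)
revsort() {
  # Decorate each line with its reversed text as a key, sort on it, then drop the key.
  awk '{ r = ""; for (i = length($0); i > 0; i--) r = r substr($0, i, 1); print r "\t" $0 }' "$1" |
    LC_ALL=C sort |
    cut -f2-
}
```

Run on the sample list, this puts rajchandra next to mohanchandra and shardaben next to hansaben, which makes shared endings easier to spot by eye.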

gimley

04-04-2012

What logic did you use to extract suffixes from names?
1. They're going to be of variable length.
2. There could be names without a suffix, e.g., 'raj'. How would one make the computer understand "raj doesn't contain a suffix, so leave it" ?

It would be easier if you can get the list of suffixes you're looking for. You don't suppose it would be a huge list, do you?

Code:

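#!/usr/bin/perl
use strict;
use warnings;

# Place all the required suffixes in this list.
my @suffixes = qw(
    ben
    bhai
    chandra
    prasad
    shekhar
);

my %count;

# This file contains the names in which suffixes are to be looked for.
open my $in, '<', 'inputfile.txt' or die "Cannot open inputfile.txt: $!";
while (my $name = <$in>) {
    chomp $name;
    for my $suffix (@suffixes) {
        # \Q...\E makes the suffix match literally at the end of the name.
        $count{$suffix}++ if $name =~ /\Q$suffix\E$/;
    }
}
close $in;

# Print each suffix that was found, with its frequency.
for my $suffix (sort keys %count) {
    print "$suffix $count{$suffix}\n";
}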

Code:

$ ./test.pl
ben 2
bhai 2
chandra 2
prasad 2
shekhar 2
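
If no suffix list is available at all, a rough starting point might be to count every ending of a fixed length, which is close to what the original request describes. The sketch below is only that: the function name count_endings is hypothetical, it assumes one name per line, and it keeps only endings that occur more than once. It cannot tell a real suffix from a chance ending, so the output would still need weeding by hand.

```shell
# count_endings LEN FILE: count endings of exactly LEN characters in FILE.
# (hypothetical helper, assumes one name per line)
count_endings() {
  awk -v n="$1" '
    length($1) > n {                       # skip names too short to have a stem plus an ending
      e = substr($1, length($1) - n + 1)   # last n characters of the name
      c[e]++
    }
    END {
      for (e in c) if (c[e] > 1) print e "\t" c[e]   # keep endings seen more than once
    }
  ' "$2" | LC_ALL=C sort
}
```

It could be called as count_endings 4 inputfile.txt > suffixes.txt. On the sample list with a length of 4 it reports bhai, but also fragments such as aben and ndra, which shows why a fixed length alone is not enough.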

balajesuri

04-05-2012

Or perhaps the other way around: we can find suffixes if the list also contains names without those suffixes. That way ben, bhai and chandra can easily be found, but prasad and shekhar are more difficult to find, since there is no name without those suffixes. Another complication would be a name like hemaprasad, which I presume is short for hemantprasad (I am just guessing, I am not Indian), but how would the algorithm know?

Anyway, as long as there are "names without suffixes" for every "name with suffix" present ("easy ones"), this algorithm might find the right result:

Code:

sort -u infile |                       # first do a unique sort of the input file and pipe that into awk as the first file (at the point of - )
awk '
NR==FNR{                               # if we are processing the first file
  if(p && $1~"^"p){                    # if a previous name exists and there is a match at the beginning 
    sub(p,x,$1)                        # then delete the match from the word
    S[$1]                              # and store the result as a suffix in array S
  }
  else
    p=$1                               # else set the previous name to $1
  next                                 # process the next line
}
{
  for(i in S) if($1~i"$"){             # This is the second file: for every name, if a suffix from S matches at the end
    S[i]++                             # then increase the incidence of that suffix
    next                               # and process the next line
  }
}
END{
  for(i in S)print i,S[i]              # Print out all the suffixes and their incidences
}
' - infile                             # use the unique sort as the first file (-) and the file itself as the second.

output:

Code:

prasad 2
ben 2
bhai 2
chandra 2
shekhar 2

This algorithm could be further improved by sorting the suffix array so that the longest match in the second part of the script is always found first.
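
One way to get the longest-match behaviour without actually sorting the array is to scan all candidates for each name and keep the longest one that fits. The following is a sketch under assumptions, not a drop-in change to the script above: the function name count_longest is made up, and it expects the candidate suffixes in a separate, non-empty file with one suffix per line.

```shell
# count_longest SUFFIXES NAMES: credit each name to its longest matching suffix.
# (hypothetical helper, SUFFIXES must be non-empty and hold one candidate per line)
count_longest() {
  awk '
    NR == FNR { if ($1 != "") S[$1] = 0; next }   # first file: load the candidate suffixes
    {
      best = ""
      n = length($1)
      for (s in S) {
        m = length(s)
        # literal comparison of the last m characters, keeping the longest hit
        if (m <= n && m > length(best) && substr($1, n - m + 1) == s) best = s
      }
      if (best != "") S[best]++
    }
    END { for (s in S) if (S[s]) print s, S[s] }  # print only suffixes that were seen
  ' "$1" "$2" | LC_ALL=C sort
}
```

With candidates ra and chandra, a name like rajchandra is then counted once under chandra and not under ra, whatever order awk happens to walk the array in.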

Last edited by Scrutinizer; 04-05-2012 at 03:25 AM..

Scrutinizer
