Identifying suffixes in a file and printing them out


 
# 1  
Old 04-04-2012

Hello,
I am interested in finding and identifying suffixes for Indian names through an awk script or a Perl program. Suffixes are normally found at the end of a word, as shown in the sample given below.
What I need is a Perl script which will identify suffixes of a defined length, given on the command line, and write them out to a separate file (if possible with their frequency).
My Perl and awk scripting skills do not go that far, hence this request for help with an interesting problem which could be useful in other cases as well.
The script should identify suffixes more than two characters in length.
A sample is given below:
Code:
chandrashekhar
hansa
hansaben
hemant
hemantbhai
hemaprasad
mohanchandra
raj
rajchandra
rajprasad
rajshekhar
sharadbhai
shardaben

The expected output in a separate file would be
Code:
ben	2
bhai	2
chandra	2
prasad	2
shekhar	2

Many thanks for any help given. The database is large, around 80,000 words.
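
For the request as literally stated (endings of a fixed length given on the command line, with their frequency), a minimal sketch could look like the one below. The function name `suffix_freq` and the rule of printing only endings seen at least twice are assumptions made for illustration; they are not part of the original request.

```shell
# suffix_freq LEN [FILE...]: count the last LEN characters of each name.
# Names that are not longer than LEN are skipped, and only endings seen
# at least twice are printed, as "ending<TAB>count".
suffix_freq() {
  len=$1
  shift
  awk -v n="$len" '
    length($1) > n { count[substr($1, length($1) - n + 1)]++ }
    END { for (s in count) if (count[s] > 1) printf "%s\t%d\n", s, count[s] }
  ' "$@" | LC_ALL=C sort
}

# Example: three-character endings of a few names from the sample
printf '%s\n' hansaben shardaben hemantbhai sharadbhai raj | suffix_freq 3
```

On these five names the example prints `ben 2` and `hai 2` (tab-separated). The second line is a truncated `bhai`, which shows why a single fixed length cannot find variable-length suffixes by itself. Redirecting the output (for example `suffix_freq 3 names.txt > suffixes.txt`, with hypothetical file names) writes the result to a separate file.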

# 2  
Old 04-04-2012
How would one know which suffixes to look for? For example, you're looking for the suffix 'ben' in a file containing a list of names. How would one know that it's 'ben' or 'bhai' that is to be looked for? Are these suffixes defined in a separate file?
# 3  
Old 04-04-2012
Many thanks for your query.
The answer is: unfortunately, no. I agree that this is a major issue. I have written a reverse sort in Perl which sorts the words in reverse order, and I tried to extract the suffixes from that list: a laborious and tedious task. This is why I thought of trying to get results programmatically. I know that I will not always get the right answers, but the false positives can always be weeded out.
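
The Perl reverse sort mentioned above was not posted, so the following is only a rough sketch of the idea: order the names by their reversed spelling, so that names sharing an ending end up next to each other. The function name `rev_sort` is hypothetical, and the sketch assumes plain ASCII transliterations like those in the sample.

```shell
# rev_sort [FILE...]: print names ordered by their reversed spelling,
# so that names sharing an ending land next to each other.
rev_sort() {
  awk '{
    r = ""
    for (i = length($0); i > 0; i--) r = r substr($0, i, 1)
    print r "\t" $0        # reversed key, a tab, then the original name
  }' "$@" | LC_ALL=C sort | cut -f2
}

# Example with a few names from the sample
printf '%s\n' hansaben raj rajprasad shardaben hemaprasad | rev_sort
```

The example prints hemaprasad, rajprasad, raj, shardaben, hansaben (one per line): the two names ending in prasad and the two ending in ben are grouped together, which is what makes reading off suffixes by eye possible at all.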
# 4  
Old 04-04-2012
What logic did you use to extract suffixes from names?
1. They're going to be of variable length.
2. There could be names without a suffix, e.g., 'raj'. How would one make the computer understand "raj doesn't contain a suffix, so leave it"?

It would be easier if you could get the list of suffixes you're looking for. You don't suppose it would be a huge list, do you?

Code:
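#! /usr/bin/perl
use strict;
use warnings;

# Place all the required suffixes in this list. The comment has to stay
# outside qw//, where its words would be treated as list elements.
my @suffixes = qw/
ben
bhai
chandra
prasad
shekhar
/;

my %x;

# This file contains the names in which suffixes are to be looked for.
open my $in, '<', 'inputfile.txt' or die "Cannot open inputfile.txt: $!";
while (my $nm = <$in>) {
    chomp $nm;
    for my $s (@suffixes) {
        # \Q...\E matches the suffix literally, anchored at the end of the name
        $x{$s}++ if $nm =~ /\Q$s\E$/;
    }
}
close $in;

for (sort keys %x) {
    print "$_ $x{$_}\n";
}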

Code:
$ ./test.pl
ben 2
bhai 2
chandra 2
prasad 2
shekhar 2

# 5  
Old 04-05-2012
Or perhaps the other way around: we can find suffixes if there are names without those suffixes in the list. So ben, bhai and chandra can easily be found, but finding prasad and shekhar is more difficult, since there is no name without those suffixes. Another complication would be a name like hemaprasad, which I presume is short for hemantprasad (I am just guessing, I am not Indian), but how does the algorithm know?

Anyway, as long as there are "names without suffixes" for every "name with suffix" present ("easy ones"), this algorithm might find the right result:

Code:
sort -u infile |                       # first do a unique sort of the input file and pipe that into awk as the first file (at the point of - )
awk '
NR==FNR{                               # if we are processing the first file
  if(p && $1~"^"p){                    # if a previous name exists and there is a match at the beginning
    sub("^"p,x,$1)                     # then delete that leading match from the word (x is unset, i.e. empty)
    S[$1]                              # and store the result as a suffix in array S
  }
  else
    p=$1                               # else set the previous name to $1
  next                                 # process the next line
}
{
  for(i in S) if($1~i"$"){             # This is the second file: for every name, if it ends with one of
    S[i]++                             # the stored suffixes, then increase its incidence
    next                               # process the next line
  }
}
END{
  for(i in S)print i,S[i]              # Print out all the suffixes and the incidences..

}
' - infile                             # use the unique sort as the first file (-) and the file itself as the second.

output:
Code:
prasad 2
ben 2
bhai 2
chandra 2
shekhar 2

This algorithm could be further optimized by sorting the suffix array so that the longest match in the second part of the script is always found first.
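
As a rough illustration of that longest-match idea, here is a sketch that takes a different route from sorting the array: it scans every suffix and keeps the longest one that fits. The function name `count_longest` and the way the suffix list is passed (as one space-separated argument) are assumptions for the example. It also assumes that a name must be longer than the suffix to count.

```shell
# count_longest "SUFFIX LIST" [FILE...]: credit each name to the longest
# listed suffix it ends with; suffixes that never win are not printed.
count_longest() {
  list=$1
  shift
  awk -v list="$list" '
    BEGIN { m = split(list, a, " "); for (k = 1; k <= m; k++) S[a[k]] = 0 }
    {
      best = ""
      for (s in S) {
        n = length(s)
        # literal comparison of the tail of the name, no regex involved
        if (length($1) > n && n > length(best) && substr($1, length($1) - n + 1) == s)
          best = s
      }
      if (best != "") S[best]++
    }
    END { for (s in S) if (S[s] > 0) print s, S[s] }
  ' "$@" | LC_ALL=C sort
}

# "ra" and "har" are made-up short endings, added only to show that they
# do not take counts away from chandra and shekhar.
printf '%s\n' rajchandra mohanchandra rajshekhar chandrashekhar |
  count_longest "ra chandra har shekhar"
```

The example prints `chandra 2` and `shekhar 2`. With the unordered `for(i in S)` loop and an early `next`, the shorter endings could have been counted instead, depending on the traversal order of the awk implementation.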
