Conditional identification of suffixes moving from right to left: revisited


 
# 1  
Old 01-03-2016

Dear all,
I have a large database of names which I have reverse-sorted with a Perl script. A sample is provided below:
Code:
agarsingh
aghansingh
akalsingh
akamsingh
akbareesingh
akhamisingh
akramysingh
akuvsingh
anchalusingh
andaroosingh
angadsingh
anjawsingh
angibai
angobai
angurbai
angureebai
anjabai
anjiqbai
anjsbai
anjanybai
anjatibai
anjopbai
anjusbai
ankikbai
akhileshkumar
akleskumar
akshaykumar
anchalkumar
anjanikumar
ankitkumar
antimkumar

My problem is that I wish to identify the suffixes (i.e. the longest identical strings, reading from right to left) that are attached to the names, and to store them along with their frequency in a separate file, subject to the following conditions:
the suffix string should be between 3 and 5 characters in length
the suffix string should occur at least 10 times in the database.

Thus in the sample given above, the script would identify only the following suffixes along with their frequency
Code:
singh	12
bai	12

The suffix
Code:
kumar

will not be identified, since it occurs only 7 times, fewer than the required 10.
I had posted the query earlier, but I have now refined it with these constraints so that, hopefully, only the most pertinent suffixes are identified. There may be a few false positives, but I can weed those out.
I work in a Windows environment, so a Perl or AWK script would be helpful.
Many thanks, and all good wishes for the New Year to all the folks who take their valuable time to help people solve their problems.
# 2  
Old 01-03-2016
For starters, this:

Code:
$ awk '{while($1=substr($1,2)) if (length($1)>=3) A[$1]++} END{for(i in A) if(A[i]>=10) print i, A[i]}' file 
ngh 12
ingh 12
bai 12
singh 12

would list the suffixes, together with some false positives: ngh and ingh are just tails of singh.

Under Windows, where the command prompt does not handle single-quoted arguments the way a Unix shell does, you will probably need to put the script in a script file:

Code:
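# For every name, strip one leading character at a time and count each
# remaining tail that is at least 3 characters long.
{
  while($1=substr($1,2)) if (length($1)>=3) A[$1]++
} 
# Print only the tails that occur at least 10 times.
END {
  for(i in A) if(A[i]>=10) print i, A[i]
}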

and run it as
Code:
awk -f script_file inputfile

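One way to prune those false positives, offered as a sketch rather than a tested solution: a shorter tail such as ingh occurs in every name that ends in singh, so if both have exactly the same count, the shorter one never occurs on its own and can be dropped. The function name longest_suffixes and the variable thresh are made up for this illustration; the counting step is the same as in the one-liner above.

```shell
# Sketch only: longest_suffixes and thresh are illustrative names.
longest_suffixes() {
  awk -v thresh=10 '
    {
      # count every proper tail of the name that has at least 3 letters
      w = $1
      while (length(w) > 3) { w = substr(w, 2); A[w]++ }
    }
    END {
      # keep only the tails that are frequent enough
      for (s in A) if (A[s] >= thresh) Q[s] = A[s]
      for (s in Q) {
        keep = 1
        # drop s if a longer frequent tail ends in s and has the same count
        for (t in Q)
          if (length(t) > length(s) && Q[t] == Q[s] &&
              substr(t, length(t) - length(s) + 1) == s) { keep = 0; break }
        if (keep) print s, Q[s]
      }
    }
  ' "$@"
}
```

On the sample from post #1 this should leave only singh and bai (the order of the output lines is not guaranteed). The awk program between the quotes can also be saved to a script file and run with awk -f, as described above.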
# 3  
Old 01-03-2016
Many thanks. It worked very well. When I posted the request, I knew there was a chance of false positives, but a list of suffixes is easier to handle than wading through thousands of lines.
I can also tweak the awk script if I wish to change the length range.
Happy New Year, and thanks once more.
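For anyone who wants to set the range without editing the script each time, here is a hedged sketch that passes the limits in with awk's standard -v option. The function name suffix_range and the variable names min, max and thresh are invented for this example, and unlike the script above it only looks at tails whose length lies within the range.

```shell
# Sketch only: suffix_range, min, max and thresh are illustrative names.
# usage: suffix_range MIN MAX THRESH [file...]
suffix_range() {
  min=$1; max=$2; thresh=$3
  shift 3
  awk -v min="$min" -v max="$max" -v thresh="$thresh" '
    {
      n = length($1)
      # count each tail of length min..max, never the whole name itself
      for (len = min; len <= max && len < n; len++)
        A[substr($1, n - len + 1)]++
    }
    END { for (s in A) if (A[s] >= thresh) print s, A[s] }
  ' "$@"
}
```

Called as suffix_range 3 5 10 inputfile, it should report the same four strings as the original script on the sample data. Under Windows the awk program can go into a script file and be run as awk -v min=3 -v max=5 -v thresh=10 -f script_file inputfile.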
# 4  
Old 01-04-2016
While the earlier method worked, I still had to fix a few suffixes manually, so I have been rethinking how to identify suffixed names. After going through some 40 to 50 thousand names, I have identified a pattern: in nearly 95% of cases, the string that is suffixed is also a name in its own right, as in the example below. In my reverse sort it comes first, followed by the names to which it is suffixed.
Code:
singh
agarsingh
aghansingh
akalsingh
akamsingh
akbareesingh
akhamisingh
akramysingh
akuvsingh
anchalusingh
andaroosingh
angadsingh
anjawsingh
bai
angibai
angobai
angurbai
angureebai
anjabai
anjiqbai
anjsbai
anjanybai
anjatibai
anjopbai
anjusbai
ankikbai
kumar
akhileshkumar
akleskumar
akshaykumar
anchalkumar
anjanikumar
ankitkumar
antimkumar

Would it be possible to extract such suffixes, given that the suffix is also a stand-alone name, as in the case of
Code:
kumar
bai
singh

with the proviso that the stand-alone name is suffixed to at least three other names? This would obviate the need for a blind search and would also avoid false positives. I know this could miss a few suffixes, but from my analysis it should give a more accurate result.
Would it be possible to devise a Perl or AWK script to identify such cases?
Many thanks once again for all the kind help.
# 5  
Old 01-05-2016
How about
Code:
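awk '
        # A line that does not end in the current candidate IX (or the very
        # first line) becomes the new candidate; any other line counts for IX.
        {if ($0 !~ IX "$" || NR == 1) IX = $0
         else CNT[IX]++
        }
END     {for (c in CNT) print c, CNT[c]
        }
' file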
kumar 7
bai 12
singh 12

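The request in post #4 also asked that a stand-alone name be counted only if it is suffixed to at least three other names. A possible variant, given as a sketch under assumptions rather than a drop-in replacement: it adds that threshold and compares the end of each line literally with substr instead of using the name as a regular expression, so characters that are special in a regex cannot interfere. The names standalone_suffixes and thresh are invented here, and the input is assumed to be ordered as in the sample, with each stand-alone name directly before the names that end in it.

```shell
# Sketch only: standalone_suffixes and thresh are illustrative names.
standalone_suffixes() {
  awk -v thresh=3 '
    {
      n = length($0); k = length(IX)
      # a longer line that ends in the current candidate counts for it
      if (k > 0 && n > k && substr($0, n - k + 1) == IX) CNT[IX]++
      # anything else (including the first line) becomes the new candidate
      else IX = $0
    }
    END { for (c in CNT) if (CNT[c] >= thresh) print c, CNT[c] }
  ' "$@"
}
```

One difference from the script above is that a line identical to the current candidate is not counted, since the line has to be longer than the candidate.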
# 6  
Old 01-05-2016
Thanks a lot. It works well. All I had to do was trim off the short words in the list that were in no way suffixes, and I ended up with a pretty comprehensive list of suffixes.
I have been studying the syntax of the script, and one part perplexes me; the rest I could grasp:
Code:
NR == 1

Could you please explain what this really does?
Thanks once again and a Happy New Year.
# 7  
Old 01-05-2016
NR is the record counter, so this condition is true on the first line of the input stream.
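As to why the script needs it: IX has not been assigned yet when the first line is read, so IX "$" is just "$", a pattern that matches the end of any line. Without the NR == 1 test the first name would therefore never be stored as a candidate. A small demonstration, with the name nr_demo and the two input names chosen only for illustration:

```shell
# Sketch only: nr_demo is an illustrative name.
nr_demo() {
  printf '%s\n' singh agarsingh | awk '
    # IX is still empty on line 1, so IX "$" is just "$"
    NR == 1 { if ($0 ~ IX "$") print "line 1 matches the empty pattern" }
    # same logic as the script in post #5, minus the NR == 1 test
    {
      if ($0 !~ IX "$") IX = $0
      else CNT[IX]++
    }
    END { for (c in CNT) printf "[%s] %d\n", c, CNT[c] }
  '
}
```

Running nr_demo prints the message and then [] 2: both lines were counted against an empty suffix, and singh was never picked up.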