Sorry, my broadband has been playing truant. I changed the position as you suggested and the sort has become more accurate. Many thanks for explaining why these should be placed first.
I used to put these two at the beginning without knowing that their placement makes a difference. I learned awk by trial and error, and I remember reading somewhere that the placement of awk commands does not matter. Now I know that it does, and I see the value of such placement.
Many thanks once again
---------- Post updated 09-19-12 at 11:37 AM ---------- Previous update was 09-18-12 at 11:10 PM ----------
Hello,
I modified the script as per your suggestion and it worked just fine on a small sample. However, when the sample size increased, the sort started to go wrong.
Instead of sorting the look-alikes in reverse order (i.e. last to first), the script sometimes produces what looks like a random order. In other cases it seems to work just fine.
I am attaching a larger sample for testing. Here is the output after applying the awk script. As can be seen, the similar Hindi words are not clustered together but are scattered throughout each group.
Quote:
#awsekar
! aousekar=औसेकर
! aushekar=औशेकर
! aausekar=औसेकर
! ausekar=अवसेकर
! awasekar=अवसेकर
! avasekar=अवसेकर
! auosekar=औसेकर
! avsekar=अवसेकर
! aaosekar=औसेकर
! awsekar=अवसेकर
#ayaaj
! ayaza=अयाज़
! ayaaj=
! ayaaz=अय्याज़
! ayaz=अय्याज़
! ayaja=अयाज़
! aiyaz=अयाज़
! ayaj=अयाज़
! aayaj=आयाज
! ayyaz=अय्याज
! ayyaj=आय्याज़
! aayaz=आयाज
#ayeza
! aayisha=आयिशा
! aayeesha=आयिशा
! aesha=आयेशा
! aayasha=आयशा
! aayasa=आयशा
! ayeshah=आयेशा
! aayesha=आयशा
! aaisa=आयशा
! ayasha=आयशा
! aiyesha=आयशा
! aaysa=आयसा
! ayesha=आयशा
! aiysha=आयशा
! aisha=आयशा
! aaysha=आयशा
! aaeesha=आयशा
! ayeesha=आयिशा
! aiesha=आयेशा
! aeysha=आयशा
! aeesha=आयशा
! aeesa=आयशा
! ayeshaha=आयशा
! ayasa=आयशा
! aaesha=आयेशा
! aysa=आयशा
! ayisha=आयिशा
! ayaesha=आयेशा
! aaisha=आयशा
! ayeza=आयशा
! aiyasha=आयशा
! aysha=आयशा
! aisa=आयशा
! aayeshaa=आयशा
! aaeesa=आयशा
The expected output should have all the look-alikes clustered together and sorted from last to first:
Quote:
#awsekar
! aousekar=औसेकर
! aausekar=औसेकर
! auosekar=औसेकर
! aaosekar=औसेकर
! aushekar=औशेकर
! ausekar=अवसेकर
! awasekar=अवसेकर
! avasekar=अवसेकर
! avsekar=अवसेकर
! awsekar=अवसेकर
#ayaaj
! ayaaj=
! ayaza=अयाज़
! ayaja=अयाज़
! aiyaz=अयाज़
! ayaj=अयाज़
! ayaaz=अय्याज़
! ayaz=अय्याज़
! ayyaz=अय्याज
! ayyaj=आय्याज़
! aayaz=आयाज
! aayaj=आयाज
#ayeza
! ayeesha=आयिशा
! aayisha=आयिशा
! ayisha=आयिशा
! aayeesha=आयिशा
! aaesha=आयेशा
! ayaesha=आयेशा
! aiesha=आयेशा
! aesha=आयेशा
! ayeshah=आयेशा
! aayasha=आयशा
! aayasa=आयशा
! aayesha=आयशा
! aaisa=आयशा
! ayasha=आयशा
! aiyesha=आयशा
! ayesha=आयशा
! aiysha=आयशा
! aisha=आयशा
! aaysha=आयशा
! aaeesha=आयशा
! aeysha=आयशा
! aeesha=आयशा
! aeesa=आयशा
! ayeshaha=आयशा
! ayasa=आयशा
! aysa=आयशा
! aaisha=आयशा
! ayeza=आयशा
! aiyasha=आयशा
! aysha=आयशा
! aisa=आयशा
! aayeshaa=आयशा
! aaeesa=आयशा
! aaysa=आयसा
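To make the clustering I am after concrete, here is a minimal sketch in Python. It is not the awk script under discussion, only an illustration of the grouping step. It assumes every header line starts with `#` and every entry has the form `! roman=hindi`; the names `cluster_lookalikes` and `sample` are made up for this example. Clusters are emitted in order of first appearance within each header, which is an assumption on my part and not necessarily the "last to first" order I want.

```python
def cluster_lookalikes(lines):
    """Group '! roman=hindi' lines by their Hindi value within each '#' block.

    Clusters keep the order in which each value first appears, and lines
    inside a cluster keep their original relative order.
    """
    out = []
    groups = {}  # Hindi value -> list of lines (dicts preserve insertion order)

    def flush():
        # Emit the clusters collected for the current header, then reset.
        for members in groups.values():
            out.extend(members)
        groups.clear()

    for line in lines:
        if line.startswith("#"):
            flush()           # finish the previous block
            out.append(line)  # headers pass through unchanged
        else:
            # Everything after the first '=' is treated as the Hindi value.
            _, _, value = line.partition("=")
            groups.setdefault(value.strip(), []).append(line)
    flush()  # emit the last block
    return out


# The '#awsekar' block exactly as it appears in the scattered output above.
sample = [
    "#awsekar",
    "! aousekar=औसेकर",
    "! aushekar=औशेकर",
    "! aausekar=औसेकर",
    "! ausekar=अवसेकर",
    "! awasekar=अवसेकर",
    "! avasekar=अवसेकर",
    "! auosekar=औसेकर",
    "! avsekar=अवसेकर",
    "! aaosekar=औसेकर",
    "! awsekar=अवसेकर",
]
result = cluster_lookalikes(sample)
print("\n".join(result))
```

For the `#awsekar` block this gives the same order as my expected output. For the other blocks the clusters come out together, but their order can differ from my hand-sorted version, because the sketch only follows first appearance.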
I have gone through the script carefully, line by line, and tested each condition it lays down, and I am perplexed as to why the data behaves this way. There are no trailing spaces and the data is absolutely "clean".
Please help. Sorting by hand is a time-consuming and error-prone process.
Many thanks in advance.