Reverse sort on delimited chunks within a file

09-17-2012

Registered User

Join Date: Feb 2011

Last Activity: 3 March 2020, 10:05 PM EST

Posts: 303

Thanks Given: 145

Thanked 4 Times in 4 Posts

Hello,
I have a large file of names grouped by homograph. The database has the following structure: each set of homographs, with their corresponding equivalents in Devanagari, is separated from the next set by a blank line (a hard return). An example will make this clear:

Quote:

#akhal
!akhal=अखाल
! akkal=अक्कल
! akal=अकाल

#akhande
!akhande=अखंडे
! aakhande=आखंडे
! akhnde=अखंडे

#aklash
!aklash=
! akhlas=अख्लास

#akshan
!akshan=अक्षन

#alag
!alag=अलग
! alagh=अलघ
! allagh=अलघ

#alakama
!alakama=अलकमा
! alkama=अलकमा

The revsort routines I have in Gawk/Perl sort in reverse order but, in doing so, do not respect the structure of the file, which gets jumbled up.
I have tried to write a sort in which each set is sorted in reverse order separately, maintaining the integrity of the data structure, but I am quite frustrated with the results. I know the logic; I just cannot handle delimiting the sets and then sorting in reverse within each set.
As an example of the desired output, the first two sets would look something like this (sorted manually, and correctly I hope):

Quote:

#akhal
! akkal=अक्कल
! akal=अकाल
!akhal=अखाल

#akhande
! akhnde=अखंडे
!akhande=अखंडे
! aakhande=आखंडे

Many thanks in advance for your help. I work under Windows, so an awk or Perl script would be of great use.

gimley

09-17-2012

Moderator

Join Date: Oct 2010

Last Activity: 1 August 2020, 1:38 AM EDT

Posts: 3,791

Thanks Given: 183

Thanked 1,452 Times in 1,302 Posts

Try this gawk solution:

Code:

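gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_desc" }  # walk arrays by index, descending (gawk 4.0 or later)
  { delete L                               # start each set with an empty array
    printf "%s\n", $1                      # first line of the set is the # header
    for (i = 2; i <= NF; i++)
       L[gensub(/ /, "", "g", $i)] = $i    # key: the line with its spaces removed
    for (l in L) printf "%s\n", L[l]       # print the lines in descending key order
    printf "\n" }' FS='\n' RS='' infile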
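To see what RS='' and FS='\n' do on their own, here is a minimal sketch in plain awk. The sample data is made up (the =x, =y and =z values are placeholders, not lines from the real file). With RS empty, awk reads in paragraph mode, so each blank-line-separated set arrives as one record; with FS set to newline, every line of the set is one field and $1 is the # header.

```shell
# Hypothetical sample: two sets separated by a blank line.
sample='#akhal
!akhal=x
! akkal=y

#akshan
!akshan=z'

# RS="" turns on paragraph mode; FS="\n" makes every line one field.
out=$(printf '%s\n' "$sample" |
  awk 'BEGIN { RS = ""; FS = "\n" }
       { printf "record %d: header=%s, members=%d\n", NR, $1, NF - 1 }')

printf '%s\n' "$out"
# record 1: header=#akhal, members=2
# record 2: header=#akshan, members=1
```

Once the sets are split this way, any per-set processing only has to loop over fields 2 to NF.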

Chubler_XL

09-18-2012

Many thanks. I am not in at present, but I will definitely get back to you with feedback. Your solutions always work.
Thanks once again.

---------- Post updated at 10:59 PM ---------- Previous update was at 09:55 PM ----------

Hello,
Sorry to hassle you, but I am getting a consistent error on line 7 of the code. I tried to correct it in every way I could think of, but the error persists.
Could you please help? I am reproducing the awk message below:

Code:

gawk: sortonsets.gk:7:     printf "\n" }' FS='\n' RS='
gawk: sortonsets.gk:7:                  ^ Invalid char ''' in expression

Many thanks.

---------- Post updated at 11:19 PM ---------- Previous update was at 10:59 PM ----------

Sorry for the goof-up. Guess I was too tired. Here's the working code; I added comments for clarity.

Code:

BEGIN { PROCINFO["sorted_in"] = "@ind_str_desc" }
  { delete L
    printf "%s\n",$1
    for(i=2;i<=NF;i++)
       L[gensub(/ /,"","g",$i)]=$i
    for(l in L) printf "%s\n", L[l]
    printf "\n" }
# change the record separator from newline to nothing
RS=""
# change the field separator from whitespace to newline
FS="\n"

gimley

09-18-2012

No problem. It is probably better to put the RS and FS assignments in the BEGIN block:

Code:

BEGIN {
    # change the record separator from newline to empty line
    RS="";

    # change the field separator from whitespace to newline
    FS="\n"; 

    # change array sort order from unsorted to by index descending
    PROCINFO["sorted_in"] = "@ind_str_desc";
}

Chubler_XL

09-18-2012

Many thanks. I will try it out and see the output.
For the nonce, and out of curiosity: I did want to put FS and RS at the top, but you had placed them at the end. Does it make any difference to execution?
Many thanks once more for taking the trouble to make this useful suggestion.

gimley

09-18-2012

Yes, it does make a difference. Originally I had the assignments on the command line (outside the single quotes, after the program text). Assignments given there are applied after the BEGIN block has run but before the first record of the input file is read, so they are in place in time.

Assignments in the program outside the BEGIN block are carried out each time a record is processed. This is bad, because the first record will already have been read and parsed with the default RS and FS before they take effect.

I don't tend to change RS in the middle of the code, so I'm not sure what happens to $0 afterwards, but I assume nothing happens to the current line, as it has already been read from the file at this stage.
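The timing can be checked with a few small experiments. This is a minimal sketch in plain awk; the variable x and the a1/a2/b1/b2 input are made up for illustration. The first command shows that an assignment given as a command-line operand is not yet visible in BEGIN but is in place for the first record. The second shows that RS and FS assigned in the main rule arrive too late for the first record, which has already been read as a single line. The third repeats the test with the assignments in BEGIN.

```shell
# 1. Operand assignment: processed after BEGIN, before the input is read.
operand=$(echo line | awk 'BEGIN { print "BEGIN: [" x "]" }
                                 { print "main: [" x "]" }' x=set)

# Two blank-line-separated sets of two lines each.
input='a1
a2

b1
b2'

# 2. RS/FS assigned in the main rule: record 1 was already read as one line.
in_body=$(printf '%s\n' "$input" |
  awk '{ RS = ""; FS = "\n" } NR == 1 { print NF ": " $0 }')

# 3. RS/FS assigned in BEGIN: record 1 is the whole first set.
in_begin=$(printf '%s\n' "$input" |
  awk 'BEGIN { RS = ""; FS = "\n" } NR == 1 { print NF ": " $1 "," $2 }')

printf '%s\n' "$operand" "$in_body" "$in_begin"
# BEGIN: []
# main: [set]
# 1: a1
# 2: a1,a2
```

An assignment made with the -v option, by contrast, is done before BEGIN runs. Note also that an assignment written on its own outside any braces, as in the script file above, is treated as a pattern: it is evaluated for every record, and when its value is a non-empty string (as with FS="\n") the default action prints the record.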

Chubler_XL

09-19-2012

Sorry, my broadband has been playing truant. I changed the position as you suggested and the sort has become more accurate. Many thanks for educating me on why these should be placed first.
I used to put these two at the beginning without knowing that their placement makes a difference. I learned awk by trial and error, and I remember reading somewhere that the placement of awk commands does not matter. Now I know that it does, and I see the value of such placement.
Many thanks once again.

---------- Post updated 09-19-12 at 11:37 AM ---------- Previous update was 09-18-12 at 11:10 PM ----------

Hello,
I modified the script as per your suggestion, and it worked just fine when I ran it on a small sample. However, when the sample size increased, the sort seemed to go wrong.
Instead of sorting the look-alikes in reverse order (i.e. last to first), the script sometimes produces what looks like a random order. In other cases it seems to work just fine.
I am attaching a larger sample for testing. Here is the output of the file after applying the awk script. As can be seen, similar words in Hindi are not clustered together but are scattered throughout each set.

Quote:

#awsekar
! aousekar=औसेकर
! aushekar=औशेकर
! aausekar=औसेकर
! ausekar=अवसेकर
! awasekar=अवसेकर
! avasekar=अवसेकर
! auosekar=औसेकर
! avsekar=अवसेकर
! aaosekar=औसेकर
! awsekar=अवसेकर

#ayaaj
! ayaza=अयाज़
! ayaaj=
! ayaaz=अय्याज़
! ayaz=अय्याज़
! ayaja=अयाज़
! aiyaz=अयाज़
! ayaj=अयाज़
! aayaj=आयाज
! ayyaz=अय्याज
! ayyaj=आय्याज़
! aayaz=आयाज

#ayeza
! aayisha=आयिशा
! aayeesha=आयिशा
! aesha=आयेशा
! aayasha=आयशा
! aayasa=आयशा
! ayeshah=आयेशा
! aayesha=आयशा
! aaisa=आयशा
! ayasha=आयशा
! aiyesha=आयशा
! aaysa=आयसा
! ayesha=आयशा
! aiysha=आयशा
! aisha=आयशा
! aaysha=आयशा
! aaeesha=आयशा
! ayeesha=आयिशा
! aiesha=आयेशा
! aeysha=आयशा
! aeesha=आयशा
! aeesa=आयशा
! ayeshaha=आयशा
! ayasa=आयशा
! aaesha=आयेशा
! aysa=आयशा
! ayisha=आयिशा
! ayaesha=आयेशा
! aaisha=आयशा
! ayeza=आयशा
! aiyasha=आयशा
! aysha=आयशा
! aisa=आयशा
! aayeshaa=आयशा
! aaeesa=आयशा

In the expected output, all the look-alikes should be clustered together and sorted from last to first:

Quote:

#awsekar
! aousekar=औसेकर
! aausekar=औसेकर
! auosekar=औसेकर
! aaosekar=औसेकर
! aushekar=औशेकर
! ausekar=अवसेकर
! awasekar=अवसेकर
! avasekar=अवसेकर
! avsekar=अवसेकर
! awsekar=अवसेकर

#ayaaj
! ayaaj=
! ayaza=अयाज़
! ayaja=अयाज़
! aiyaz=अयाज़
! ayaj=अयाज़
! ayaaz=अय्याज़
! ayaz=अय्याज़
! ayyaz=अय्याज
! ayyaj=आय्याज़
! aayaz=आयाज
! aayaj=आयाज

#ayeza
! ayeesha=आयिशा
! aayisha=आयिशा
! ayisha=आयिशा
! aayeesha=आयिशा
! aaesha=आयेशा
! ayaesha=आयेशा
! aiesha=आयेशा
! aesha=आयेशा
! ayeshah=आयेशा
! aayasha=आयशा
! aayasa=आयशा
! aayesha=आयशा
! aaisa=आयशा
! ayasha=आयशा
! aiyesha=आयशा
! ayesha=आयशा
! aiysha=आयशा
! aisha=आयशा
! aaysha=आयशा
! aaeesha=आयशा
! aeysha=आयशा
! aeesha=आयशा
! aeesa=आयशा
! ayeshaha=आयशा
! ayasa=आयशा
! aysa=आयशा
! aaisha=आयशा
! ayeza=आयशा
! aiyasha=आयशा
! aysha=आयशा
! aisa=आयशा
! aayeshaa=आयशा
! aaeesa=आयशा
! aaysa=आयसा

I have gone through the script carefully, line by line, and tested each condition laid down, and I am perplexed as to why the data is behaving in this fashion. There are no trailing spaces and the data is absolutely "clean".
Please help. Sorting by hand is time-consuming and error-prone.
Many thanks in advance.

gimley
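
One reading of the expected output above is that lines sharing the same Devanagari equivalent (the text after the = sign) should sit next to each other within a set. Here is a minimal sketch under that assumption, in plain awk. The function name cluster_sets is made up, and since the post does not pin down the order between clusters, the sketch keeps clusters in order of first appearance rather than guessing a sort rule.

```shell
# Hypothetical helper: group the lines of each set by the text after "=".
cluster_sets() {
  awk '
  BEGIN { RS = ""; FS = "\n" }              # one set per record, one line per field
  {
    print $1                                # header line, e.g. #awsekar
    n = 0
    for (i = 2; i <= NF; i++) {
      val = substr($i, index($i, "=") + 1)  # equivalent: text after the first "="
      if (!(val in group)) order[++n] = val # remember first appearance of each value
      group[val] = group[val] $i "\n"       # append the line to its cluster
    }
    for (j = 1; j <= n; j++) {
      printf "%s", group[order[j]]          # print the cluster as a block
      delete group[order[j]]                # reset for the next set
    }
    print ""                                # blank line between sets
  }'
}

# Usage: cluster_sets < infile > outfile
```

On Windows, the program between the single quotes can be saved to a file and run as awk -f file infile. A real ordering rule, for example a descending sort on the equivalent, would still have to be layered on top once the intended order is settled.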
