Reverse sort on delimited chunks within a file


 
# 1  
Old 09-17-2012

Hello,
I have a large file of names in which the data is grouped by homograph. The database has the following structure: each set of homographs, with the corresponding equivalents in Devanagari, is separated from the next set by a blank line. An example will make this clear:
Quote:
#akhal
!akhal=अखाल
! akkal=अक्कल
! akal=अकाल

#akhande
!akhande=अखंडे
! aakhande=आखंडे
! akhnde=अखंडे

#aklash
!aklash=
! akhlas=अख्लास

#akshan
!akshan=अक्षन

#alag
!alag=अलग
! alagh=अलघ
! allagh=अलघ

#alakama
!alakama=अलकमा
! alkama=अलकमा
The reverse-sort routines I have in gawk/Perl do sort in reverse order, but they ignore the structure of the file, so the sets get jumbled up.
I have tried to write a sort in which each set is sorted in reverse order separately, preserving the data structure, but I am frustrated with the results: I know the logic, but I cannot work out how to delimit the sets and then reverse-sort within each one.
As an example of the desired output, the first two sets would look something like this (sorted by hand, correctly I hope):
Quote:
#akhal
! akkal=अक्कल
! akal=अकाल
!akhal=अखाल

#akhande
! akhnde=अखंडे
!akhande=अखंडे
! aakhande=आखंडे
Many thanks in advance for any help. I work under Windows, so an awk or Perl script would be of great use.
# 2  
Old 09-17-2012
Try this gawk solution:

Code:
gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_desc" }
  { delete L
    printf "%s\n",$1
    for(i=2;i<=NF;i++)
       L[gensub(/ /,"","g",$i)]=$i
    for(l in L) printf "%s\n", L[l]
    printf "\n" }' FS='\n' RS='' infile

This User Gave Thanks to Chubler_XL For This Post:
# 3  
Old 09-18-2012
Many thanks. I am away at present, but I will definitely get back to you with feedback. Your solutions always work.
Thanks once again.

---------- Post updated at 10:59 PM ---------- Previous update was at 09:55 PM ----------

Hello,
Sorry to hassle you, but I am getting a consistent error on line 7 of the code. I have tried to correct it in every way I could think of, but the error persists.
Could you please help? I am reproducing the gawk message below:
Code:
gawk: sortonsets.gk:7:     printf "\n" }' FS='\n' RS='
gawk: sortonsets.gk:7:                  ^ Invalid char ''' in expression

Many thanks

---------- Post updated at 11:19 PM ---------- Previous update was at 10:59 PM ----------

Sorry for the goof-up. I guess I was too tired. Here is the working code; I added comments for clarity.
Code:
BEGIN { PROCINFO["sorted_in"] = "@ind_str_desc" }
{
    delete L
    printf "%s\n", $1
    for (i = 2; i <= NF; i++)
        L[gensub(/ /, "", "g", $i)] = $i
    for (l in L) printf "%s\n", L[l]
    printf "\n"
}
# change the record separator from newline to nothing
RS=""
# change the field separator from whitespace to newline
FS="\n"

# 4  
Old 09-18-2012
No problem. It is probably better to put the RS and FS assignments in the BEGIN block:

Code:
BEGIN {
    # change the record separator from newline to empty line
    RS="";

    # change the field separator from whitespace to newline
    FS="\n"; 

    # change array sort order from unsorted to by index descending
    PROCINFO["sorted_in"] = "@ind_str_desc";
}
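One thing to keep in mind: PROCINFO["sorted_in"] is only honoured by gawk 4.0 and later. With an older gawk, or with another awk, the for (l in L) loop visits the elements in an unspecified order. Whether that applies to any particular Windows setup is an assumption, not something stated in this thread. If it is a concern, the same idea can be sketched with an explicit sort that uses only POSIX awk features. This is a sketch, not the script posted above: the file names revsort.awk and sample.txt are made up for the example, and unlike the array-index approach it keeps duplicate lines instead of collapsing them.

```shell
# Hypothetical sample data: the first two sets from post #1.
cat > sample.txt <<'EOF'
#akhal
!akhal=अखाल
! akkal=अक्कल
! akal=अकाल

#akhande
!akhande=अखंडे
! aakhande=आखंडे
! akhnde=अखंडे
EOF

# Hypothetical script name; uses no gawk extensions.
cat > revsort.awk <<'EOF'
BEGIN { RS = ""; FS = "\n" }      # one record per blank-line-delimited set
{
    print $1                      # header line, e.g. #akhal
    n = NF - 1
    for (i = 2; i <= NF; i++) {
        line[i - 1] = $i
        key[i - 1]  = $i
        gsub(/ /, "", key[i - 1]) # compare with spaces removed, like gensub() above
    }
    # insertion sort, descending by key (forced string comparison)
    for (i = 2; i <= n; i++) {
        k = key[i]; v = line[i]
        for (j = i - 1; j >= 1 && (key[j] "") < (k ""); j--) {
            key[j + 1] = key[j]; line[j + 1] = line[j]
        }
        key[j + 1] = k; line[j + 1] = v
    }
    for (i = 1; i <= n; i++) print line[i]
    print ""
}
EOF

awk -f revsort.awk sample.txt
```

On Windows the two files would simply be created in an editor and the last line run as gawk -f revsort.awk sample.txt. Keeping the program in a file also avoids the shell-quoting problem that produced the "Invalid char" error in post #3.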

# 5  
Old 09-18-2012
Many thanks. Will try it out and see the output.
For the nonce, and out of curiosity: I did want to put the FS and RS at the top, but you had placed them at the end. Does it make any difference to execution?
Many thanks once more for taking pains to make this useful suggestion.
# 6  
Old 09-18-2012
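Yes, it does make a difference. Originally I had the assignments on the command line (outside of the single quotes). Assignments given there are applied after the BEGIN block has run but before the first record of the input file is read, so the whole file is parsed with the new RS and FS.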

Assignments in the program outside of the BEGIN block are evaluated for every record, after that record has been read. This is bad because the first record will already have been read and split with the default RS and FS before the assignments take effect.

I don't tend to change RS in the middle of the code, so I'm not sure what happens to $0 afterwards, but I assume the current record is unaffected because it has already been read from the file at that stage.
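A quick way to see the effect of placement is to print the number of fields per record under each variant. The sketch below is only an illustration: demo.txt and its contents are made up, and plain awk is used because nothing here is gawk-specific. Note that assignments written as command-line operands are applied after BEGIN, just before the file that follows them is opened.

```shell
# Hypothetical input: two blank-line-delimited sets of two lines each.
printf 'a\nb\n\nc\nd\n' > demo.txt

# 1. Assigned in BEGIN: in effect before the first record is read.
awk 'BEGIN { RS = ""; FS = "\n" } { print NF }' demo.txt

# 2. Assigned as command-line operands: applied after BEGIN,
#    but still before demo.txt is read.
awk '{ print NF }' RS= FS='\n' demo.txt

# 3. Assigned in the main rule: the first record has already been read
#    and split with the default RS and FS, so it is the single line "a".
awk '{ RS = ""; FS = "\n"; print NF }' demo.txt
```

Variants 1 and 2 report two fields for each set. Variant 3 reports one field for its first record, because that record was read before the assignments ran.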
# 7  
Old 09-19-2012
Sorry, my broadband has been playing truant. I changed the position as you suggested, and the sort has become more accurate. Many thanks for educating me on why these should be placed first.
I used to put these two at the beginning without knowing that their placement makes a difference. I learned awk by trial and error, and I remember reading somewhere that the placement of awk commands does not matter. Now I know that it does, and why.
Many thanks once again

---------- Post updated 09-19-12 at 11:37 AM ---------- Previous update was 09-18-12 at 11:10 PM ----------

Hello,
I modified the script as you suggested, and it worked fine on a small sample. However, when the sample size increased, the sort seemed to go wrong.
Instead of sorting the look-alikes in reverse order (i.e. last to first), the script sometimes produces what looks like a random order. In other cases it works fine.
I am attaching a larger sample for testing. Here is the output after applying the awk script. As can be seen, similar words in Hindi are not clustered together but are scattered.
Quote:
#awsekar
! aousekar=औसेकर
! aushekar=औशेकर
! aausekar=औसेकर
! ausekar=अवसेकर
! awasekar=अवसेकर
! avasekar=अवसेकर
! auosekar=औसेकर
! avsekar=अवसेकर
! aaosekar=औसेकर
! awsekar=अवसेकर

#ayaaj
! ayaza=अयाज़
! ayaaj=
! ayaaz=अय्याज़
! ayaz=अय्याज़
! ayaja=अयाज़
! aiyaz=अयाज़
! ayaj=अयाज़
! aayaj=आयाज
! ayyaz=अय्याज
! ayyaj=आय्याज़
! aayaz=आयाज

#ayeza
! aayisha=आयिशा
! aayeesha=आयिशा
! aesha=आयेशा
! aayasha=आयशा
! aayasa=आयशा
! ayeshah=आयेशा
! aayesha=आयशा
! aaisa=आयशा
! ayasha=आयशा
! aiyesha=आयशा
! aaysa=आयसा
! ayesha=आयशा
! aiysha=आयशा
! aisha=आयशा
! aaysha=आयशा
! aaeesha=आयशा
! ayeesha=आयिशा
! aiesha=आयेशा
! aeysha=आयशा
! aeesha=आयशा
! aeesa=आयशा
! ayeshaha=आयशा
! ayasa=आयशा
! aaesha=आयेशा
! aysa=आयशा
! ayisha=आयिशा
! ayaesha=आयेशा
! aaisha=आयशा
! ayeza=आयशा
! aiyasha=आयशा
! aysha=आयशा
! aisa=आयशा
! aayeshaa=आयशा
! aaeesa=आयशा
The expected output would have all the look-alikes clustered together and sorted from last to first:
Quote:
#awsekar
! aousekar=औसेकर
! aausekar=औसेकर
! auosekar=औसेकर
! aaosekar=औसेकर
! aushekar=औशेकर
! ausekar=अवसेकर
! awasekar=अवसेकर
! avasekar=अवसेकर
! avsekar=अवसेकर
! awsekar=अवसेकर


#ayaaj
! ayaaj=
! ayaza=अयाज़
! ayaja=अयाज़
! aiyaz=अयाज़
! ayaj=अयाज़
! ayaaz=अय्याज़
! ayaz=अय्याज़
! ayyaz=अय्याज
! ayyaj=आय्याज़
! aayaz=आयाज
! aayaj=आयाज

#ayeza
! ayeesha=आयिशा
! aayisha=आयिशा
! ayisha=आयिशा
! aayeesha=आयिशा
! aaesha=आयेशा
! ayaesha=आयेशा
! aiesha=आयेशा
! aesha=आयेशा
! ayeshah=आयेशा
! aayasha=आयशा
! aayasa=आयशा
! aayesha=आयशा
! aaisa=आयशा
! ayasha=आयशा
! aiyesha=आयशा
! ayesha=आयशा
! aiysha=आयशा
! aisha=आयशा
! aaysha=आयशा
! aaeesha=आयशा
! aeysha=आयशा
! aeesha=आयशा
! aeesa=आयशा
! ayeshaha=आयशा
! ayasa=आयशा
! aysa=आयशा
! aaisha=आयशा
! ayeza=आयशा
! aiyasha=आयशा
! aysha=आयशा
! aisa=आयशा
! aayeshaa=आयशा
! aaeesa=आयशा
! aaysa=आयसा
I have gone through the script carefully, line by line, and tested each condition, and I am perplexed as to why the data behaves this way. There are no trailing spaces and the data is absolutely "clean".
Please help. Sorting by hand is a time-consuming and error-prone process.
Many thanks in advance.
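One possible reading of the expected output is that lines sharing the same Devanagari string (the text after the "=") should sit next to each other. The script discussed above orders lines by the Roman spelling, so it would not cluster them that way. The sketch below groups lines by the text after the first "=" and keeps the groups in order of first appearance. It is an assumption about what is wanted, not a confirmed answer: the order of the clusters in the hand-sorted sample does not follow an obvious rule, so no attempt is made to reproduce it. The names cluster.awk and names.txt are made up, and the data is a shortened version of the #awsekar set.

```shell
# Hypothetical sample: four lines from the #awsekar set, interleaved.
cat > names.txt <<'EOF'
#awsekar
! aousekar=औसेकर
! ausekar=अवसेकर
! aausekar=औसेकर
! awsekar=अवसेकर
EOF

# Hypothetical script name; POSIX awk only.
cat > cluster.awk <<'EOF'
BEGIN { RS = ""; FS = "\n" }          # one record per blank-line-delimited set
{
    n = 0
    split("", order); split("", group) # portable way to empty both arrays
    print $1                           # header line, e.g. #awsekar
    for (i = 2; i <= NF; i++) {
        val = substr($i, index($i, "=") + 1)   # text after the first "="
        if (!(val in group)) order[++n] = val  # remember first appearance
        group[val] = group[val] $i "\n"        # append line to its cluster
    }
    for (k = 1; k <= n; k++) printf "%s", group[order[k]]
    print ""
}
EOF

awk -f cluster.awk names.txt
```

With this input the two औसेकर lines come out together, followed by the two अवसेकर lines. Ordering the clusters themselves, or the lines inside each cluster, would need a rule that the thread does not spell out.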