gawk asort to sort record groups based on one subfield | Unix Linux Forums | UNIX for Dummies Questions & Answers

#1  10-07-2012
lucasvs

Input ("/"-delimited fields):

Code:
style1/book1 (author_C)/editor1/2000
style1/book2 (author_A)/editor2/2004
style1/book3 (author_B)/editor3/2001
style2/book8 (author_B)/editor4/2010
style2/book5 (author_A)/editor2/1998

Records with the same field 1 belong to the same group.
Using asort (not sort), I need to sort the records within each group in ascending order of the string between parentheses in field 2, to obtain:

Code:
style1/book2 (author_A)/editor2/2004
style1/book3 (author_B)/editor3/2001
style1/book1 (author_C)/editor1/2000
style2/book5 (author_A)/editor2/1998
style2/book8 (author_B)/editor4/2010

I tried to sort the records by field1 and then by subfield2 in field2, but it didn't work:

Code:
BEGIN{FS=OFS="/"}

{
    array[$1] = $0

    split ($2, aut, " ")

    asort(array)

    o = asort(aut)

    for (o in aut)
        print array[aut[o]]

}

#2  10-07-2012
Don Cragun (Moderator)

The versions of awk that I use (on OS X) don't have the asort() and asorti() functions, but I have read the gawk man page. Unlike the sort utility, these functions have no option for naming a sort key; by default they compare the entire contents of each string. If you want to use asort() in gawk to sort with field 1 as your primary sort key and the second part of field 2 as your secondary key, you need to prepend the primary and secondary sort fields to each line in your array, use asort() or asorti() to sort the modified records, and then strip off the added sort fields when you print (or otherwise process) the results.
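
A minimal sketch of the prepend, sort, and strip steps described above, assuming gawk is available (asort() is a gawk extension). The function name sort_groups, the array names, and the choice of SUBSEP to separate the added key from the original line are illustrative assumptions, not something taken from the gawk man page or from this thread.

```shell
# Sketch only: sort_groups and all variable names are hypothetical.
# Reads "/"-delimited records on stdin and prints them grouped by field 1,
# ordered by the "(author)" part of field 2 within each group.
sort_groups() {
    gawk '
    BEGIN { FS = "/" }
    {
        # parts[2] is the "(author_X)" part of field 2
        split($2, parts, " ")
        # Prepend the primary key (field 1) and secondary key (author),
        # separated from the original line by SUBSEP
        lines[NR] = $1 "/" parts[2] SUBSEP $0
    }
    END {
        # asort() sorts the values and reindexes them 1..n
        n = asort(lines)
        for (i = 1; i <= n; i++)
            # Strip everything up to and including SUBSEP
            print substr(lines[i], index(lines[i], SUBSEP) + 1)
    }'
}
```

With the sample data saved in a file, the function would be called as sort_groups < file. Collecting the lines in the main rule and sorting once in END keeps the sort from running for every input record.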

#3  10-07-2012
lucasvs

Quote:
you need to prepend each line in your array with primary and secondary sort fields, use asort() or asorti() to sort the modified records, and then strip off the added sort fields when you print (or otherwise process) the results.
So you mean I should
1st) sort by field1 and generate a first output
2nd) use this output to sort by subfield 2 and generate the final output.

I tried things like the code below, but it still doesn't work.

Code:
BEGIN{FS=OFS="/"}

{
# sort by field1
    array[$1] = $0

    asort(array)

# first output
    for (i in array)
        $0 = array[i]

# redefine fields in first output    
        split($0, rec, FS)
        rec[$2] = $0

        split($0, sub, " ")
        aut[++a] = sub[2]

# sort by subfield2 
        n = asort(aut)

# print final output
        for (j=1; j<=n; j++)
            print array[aut[j]]
    
}

#4  10-08-2012
Don Cragun (Moderator)

Quote:
Originally Posted by lucasvs View Post
So you mean I should
1st) sort by field1 and generate a first output
2nd) use this output to sort by subfield 2 and generate the final output.

I tried things like the code below, but it still doesn't work.

Code:
BEGIN{FS=OFS="/"}

{
# sort by field1
    array[$1] = $0

    asort(array)

# first output
    for (i in array)
        $0 = array[i]

# redefine fields in first output    
        split($0, rec, FS)
        rec[$2] = $0

        split($0, sub, " ")
        aut[++a] = sub[2]

# sort by subfield2 
        n = asort(aut)

# print final output
        for (j=1; j<=n; j++)
            print array[aut[j]]
    
}

What you have above sorts the accumulated input twice each time you read a line from your input file. Clearly that isn't what you want. Even if you were doing the sorts in an END clause instead of in a clause that is executed every time you read an input line, performing a sort on the entire line and then performing a sort on the second sort key is not the same as performing a single sort with a primary and secondary sort key.
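
To illustrate the last point, here is a small sketch that uses only standard awk, sort, and cut (no gawk). It prepends either the secondary key alone or both keys to each record, sorts, and strips the key again. The names data, decorate, and sorted_by are made up for this example.

```shell
# Sketch only: data, decorate and sorted_by are hypothetical names.
data='style1/book1 (author_C)/editor1/2000
style1/book2 (author_A)/editor2/2004
style1/book3 (author_B)/editor3/2001
style2/book8 (author_B)/editor4/2010
style2/book5 (author_A)/editor2/1998'

# Print "KEY<TAB>line" for each record; $1 selects which key is prepended.
decorate() {
    awk -F/ -v mode="$1" '{
        split($2, parts, " ")          # parts[2] is "(author_X)"
        if (mode == "both")
            key = $1 "/" parts[2]      # primary and secondary key together
        else
            key = parts[2]             # secondary key alone
        print key "\t" $0
    }'
}

# Sort on the prepended key, then cut the key off again.
sorted_by() {
    printf '%s\n' "$data" | decorate "$1" | LC_ALL=C sort | cut -f2-
}

sorted_by author   # secondary key alone: style1 and style2 records interleave
sorted_by both     # one combined key: each group stays together
```

Whatever order an earlier sort produced, a later sort on the author alone decides the final order, so the author_A records from both groups end up next to each other. Only the combined key keeps the groups intact.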

I don't have access to a system running gawk, but just using standard interfaces, I get the output:
Quote:
style1/book2 (author_A)/editor2/2004
style1/book3 (author_B)/editor3/2001
style1/book1 (author_C)/editor1/2000
style2/book5 (author_A)/editor2/1998
style2/book8 (author_B)/editor4/2010
which I think is what you're trying to get, when I use the script:

Code:
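#!/bin/ksh
awk 'BEGIN{FS=OFS="/"
    tmpfile = "asorti.out"
    sortcommand = "sort -t/ -o " tmpfile
    cleanup = "rm " tmpfile
}
{   # parts[2] is the "(author_X)" part of field 2; the array is not
    # named sub because sub is the name of a built-in awk function
    split ($2, parts, " ")
    # index = field 1, author, and the whole line; value = the original line
    array[$1 "/" parts[2] "/" $0] = $0
}
END{for (i in array) print i | sortcommand
    close(sortcommand)
    # "> 0" stops the loop on end-of-file and on a read error
    while((getline i < tmpfile) > 0) print array[i]
    close(tmpfile)
    system(cleanup)
}' in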

where the file named in contains the data listed in your first post in this thread. If I read the gawk man page correctly, this should be roughly equivalent to:

Code:
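#!/bin/ksh
gawk 'BEGIN{FS = OFS = "/" }
{   # sub is a built-in function name, so the array is called parts
    split ($2, parts, " ")
    array[$1 "/" parts[2] "/" $0] = $0
}
END{# asorti() stores the sorted indices of array in sorted[1..n];
    # with one argument it would overwrite array with its own indices
    n = asorti(array, sorted)
    for(i = 1; i <= n; i++) print array[sorted[i]]
}' in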
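
As a possible alternative, gawk 4.0 and later accept an optional third argument to asort() that names a user-defined comparison function, so no combined key has to be built. The sketch below assumes such a gawk version; sort_groups_cmp, group_of, author_of, and by_group_then_author are hypothetical names chosen for this example.

```shell
# Sketch only; assumes gawk 4.0 or later. All function names are hypothetical.
sort_groups_cmp() {
    gawk '
    # Return field 1 (the group) of a whole record;
    # appending "" makes the result a string, so comparisons are textual
    function group_of(line,    f) {
        split(line, f, "/")
        return f[1] ""
    }
    # Return the "(author_X)" part of field 2 of a whole record
    function author_of(line,    f, parts) {
        split(line, f, "/")
        split(f[2], parts, " ")
        return parts[2] ""
    }
    # Comparison function in the form asort() expects: it gets the index
    # and value of two elements and returns a negative number, zero, or a
    # positive number when the first sorts before, equal to, or after the second
    function by_group_then_author(i1, v1, i2, v2,    g1, g2, a1, a2) {
        g1 = group_of(v1); g2 = group_of(v2)
        if (g1 != g2) return (g1 < g2) ? -1 : 1
        a1 = author_of(v1); a2 = author_of(v2)
        if (a1 != a2) return (a1 < a2) ? -1 : 1
        return 0
    }
    { lines[NR] = $0 }
    END {
        # Sort the values of lines into sorted[1..n] using the function above
        n = asort(lines, sorted, "by_group_then_author")
        for (i = 1; i <= n; i++) print sorted[i]
    }'
}
```

It would be called as sort_groups_cmp < in. The trade-off is that the comparison function re-splits each record on every comparison, whereas the combined-key approach splits each record once.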
