Sort mixed data file | Unix Linux Forums | UNIX for Advanced & Expert Users

  Go Back    


UNIX for Advanced & Expert Users Expert-to-Expert. Learn advanced UNIX, UNIX commands, Linux, Operating Systems, System Administration, Programming, Shell, Shell Scripts, Solaris, Linux, HP-UX, AIX, OS X, BSD.

Sort mixed data file

UNIX for Advanced & Expert Users


Closed Thread    
 
Thread Tools Search this Thread Display Modes
    #1  
Old 07-20-2013
jnrohit2k jnrohit2k is offline
Registered User
 
Join Date: May 2013
Last Activity: 16 March 2014, 7:24 PM EDT
Posts: 17
Thanks: 1
Thanked 0 Times in 0 Posts
Sort mixed data file

I have a text file and each field is separated by semicolon ( ; ). Field number 7 is internally separated by comma ( , ) and pipe ( | ) symbol. I want to sort file based on three different fields which are marked in BOLD.

Here first BOLD field will have numbers upto the length of 9 characters, second BOLD field will have start date in 'YYYYMMDD' format and third field will have end date in 'YYYYMMDD' format.


Input File:


Code:
 
REQSTS;00002315000000000011;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013
REQSTS;00002315000000000001;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20130101;20130331;GC;2013
REQSTS;00002315000000000003;1720192893;EIN;320806261;;BBBB_SAK,1,18663,Y,0|BBBB_NPI,1,1720172893,N,1|BBBB_UUU_ID,1,100002999,N,2;;;1699720086;;;;EP;20130101;20130430;GC;2013
REQSTS;00002315000000000013;1366449767;EIN;390806961;;GGGG_SAK,1,34082,Y,0|GGGG_NPI,1,1366489767,N,1|GGGG_ZZZ_ID,1,32562000,N,2;;;1699720086;;;;EP;20120203;20120301;GC;2013
REQSTS;00002315000000000005;1003888704;EIN;390836261;;CCCC_SAK,1,18663,Y,0|CCCC_NPI,1,1003868704,N,1|CCCC_VVV_ID,1,34394500,N,2;;;1699720086;;;;EP;20121201;20130131;GC;2013
REQSTS;00002315000000000009;1174668474;EIN;272042615;;EEEE_SAK,1,100009394,Y,0|EEEE_NPI,1,1174618474,N,1|EEEE_XXX_ID,1,100009394,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013
REQSTS;00002315000000000007;1992777660;EIN;394806261;;DDDD_SAK,1,18663,Y,0|DDDD_NPI,1,1992757660,N,1|DDDD_WWW_ID,1,31598400,N,2;;;1699720086;;;;EP;20121201;20130531;GC;2013


Required output: First and second field should be in ascending order, and third field should be in descending order.

Output file:


Code:
REQSTS;00002315000000000007;1992777660;EIN;394806261;;DDDD_SAK,1,18663,Y,0|DDDD_NPI,1,1992757660,N,1|DDDD_WWW_ID,1,31598400,N,2;;;1699720086;;;;EP;20121201;20130531;GC;2013
REQSTS;00002315000000000005;1003888704;EIN;390836261;;CCCC_SAK,1,18663,Y,0|CCCC_NPI,1,1003868704,N,1|CCCC_VVV_ID,1,34394500,N,2;;;1699720086;;;;EP;20121201;20130131;GC;2013
REQSTS;00002315000000000003;1720192893;EIN;320806261;;BBBB_SAK,1,18663,Y,0|BBBB_NPI,1,1720172893,N,1|BBBB_UUU_ID,1,100002999,N,2;;;1699720086;;;;EP;20130101;20130430;GC;2013
REQSTS;00002315000000000001;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20130101;20130331;GC;2013
REQSTS;00002315000000000013;1366449767;EIN;390806961;;GGGG_SAK,1,34082,Y,0|GGGG_NPI,1,1366489767,N,1|GGGG_ZZZ_ID,1,32562000,N,2;;;1699720086;;;;EP;20120203;20120301;GC;2013
REQSTS;00002315000000000009;1174668474;EIN;272042615;;EEEE_SAK,1,100009394,Y,0|EEEE_NPI,1,1174618474,N,1|EEEE_XXX_ID,1,100009394,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013
REQSTS;00002315000000000011;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013

Thanks!

Last edited by jnrohit2k; 07-20-2013 at 01:37 PM..
Sponsored Links
    #2  
Old 07-20-2013
mjf mjf is offline
Registered User
 
Join Date: Nov 2011
Last Activity: 26 July 2014, 6:48 AM EDT
Location: Newtown, PA
Posts: 121
Thanks: 7
Thanked 28 Times in 27 Posts
Here is one solution assuming your file will not contain '%':


Code:
 awk -F';' '{split($7,a,","); print a[3] "%" $15 "%" $16 "%" $0}' input.txt | sort -t '%' -k 1n -k 2n -k 3nr | awk -F'%' '{print $4}'


Code:
REQSTS;00002315000000000007;1992777660;EIN;394806261;;DDDD_SAK,1,18663,Y,0|DDDD_NPI,1,1992757660,N,1|DDDD_WWW_ID,1,31598400,N,2;;;1699720086;;;;EP;20121201;20130531;GC;2013
REQSTS;00002315000000000005;1003888704;EIN;390836261;;CCCC_SAK,1,18663,Y,0|CCCC_NPI,1,1003868704,N,1|CCCC_VVV_ID,1,34394500,N,2;;;1699720086;;;;EP;20121201;20130131;GC;2013
REQSTS;00002315000000000003;1720192893;EIN;320806261;;BBBB_SAK,1,18663,Y,0|BBBB_NPI,1,1720172893,N,1|BBBB_UUU_ID,1,100002999,N,2;;;1699720086;;;;EP;20130101;20130430;GC;2013
REQSTS;00002315000000000001;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20130101;20130331;GC;2013
REQSTS;00002315000000000013;1366449767;EIN;390806961;;GGGG_SAK,1,34082,Y,0|GGGG_NPI,1,1366489767,N,1|GGGG_ZZZ_ID,1,32562000,N,2;;;1699720086;;;;EP;20120203;20120301;GC;2013
REQSTS;00002315000000000009;1174668474;EIN;272042615;;EEEE_SAK,1,100009394,Y,0|EEEE_NPI,1,1174618474,N,1|EEEE_XXX_ID,1,100009394,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013
REQSTS;00002315000000000011;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013

Sponsored Links
    #3  
Old 07-20-2013
jnrohit2k jnrohit2k is offline
Registered User
 
Join Date: May 2013
Last Activity: 16 March 2014, 7:24 PM EDT
Posts: 17
Thanks: 1
Thanked 0 Times in 0 Posts
It worked. Thanks!
    #4  
Old 08-22-2013
jnrohit2k jnrohit2k is offline
Registered User
 
Join Date: May 2013
Last Activity: 16 March 2014, 7:24 PM EDT
Posts: 17
Thanks: 1
Thanked 0 Times in 0 Posts
I have a text file and each field is separated by semicolon ( ; ). Field number 7 is internally separated by comma ( , ) and pipe ( | ) symbol. I want to sort file based on four different fields which are marked in BOLD.

Here first BOLD field will have numbers upto the length of 9 characters, second BOLD field will have start date in 'YYYYMMDD' format, third field will have end date in 'YYYYMMDD' format and fourth BOLD field will have string of 2 characters (Like "IP", "GC" ) or two WHITE SPACE.


Input File:


Code:
 
REQSTS;00002315000000000011;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013
REQSTS;00002315000000000001;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20130101;20130331;IP;2013
REQSTS;00002315000000000002;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20121201;20130229;IP;2013
REQSTS;00002315000000000022;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20121201;20130101;IP;2013
REQSTS;00002315000000000003;1720192893;EIN;320806261;;BBBB_SAK,1,18663,Y,0|BBBB_NPI,1,1720172893,N,1|BBBB_UUU_ID,1,100002999,N,2;;;1699720086;;;;EP;20130101;20130430;GC;2013
REQSTS;00002315000000000013;1366449767;EIN;390806961;;GGGG_SAK,1,34082,Y,0|GGGG_NPI,1,1366489767,N,1|GGGG_ZZZ_ID,1,32562000,N,2;;;1699720086;;;;EP;20120203;20120301;IP;2013
REQSTS;00002315000000000005;1003888704;EIN;390836261;;CCCC_SAK,1,18663,Y,0|CCCC_NPI,1,1003868704,N,1|CCCC_VVV_ID,1,34394500,N,2;;;1699720086;;;;EP;20121201;20130131;  ;2013
REQSTS;00002315000000000009;1174668474;EIN;272042615;;EEEE_SAK,1,100009394,Y,0|EEEE_NPI,1,1174618474,N,1|EEEE_XXX_ID,1,100009394,N,2;;;1699720086;;;;EP;20120103;20120401;IP;2013
REQSTS;00002315000000000007;1992777660;EIN;394806261;;DDDD_SAK,1,18663,Y,0|DDDD_NPI,1,1992757660,N,1|DDDD_WWW_ID,1,31598400,N,2;;;1699720086;;;;EP;20121101;20130531;  ;2013
REQSTS;00002315000000000016;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20110203;20110501;GC;2013


Required output: Sorting should happen based on following four fields in same sequence:-

(1) Fourth BOLD field should be in decending order (CHARACTER)
(2) First BOLD field should be in ascending order (NUMBER)
(3) Second BOLD field should be in ascending order (NUMBER)
(4) Third BOLD field should be in descending order (NUMBER)

Output file:


Code:
REQSTS;00002315000000000002;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20121201;20130229;IP;2013
REQSTS;00002315000000000022;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20121201;20130101;IP;2013
REQSTS;00002315000000000001;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20130101;20130331;IP;2013
REQSTS;00002315000000000013;1366449767;EIN;390806961;;GGGG_SAK,1,34082,Y,0|GGGG_NPI,1,1366489767,N,1|GGGG_ZZZ_ID,1,32562000,N,2;;;1699720086;;;;EP;20120203;20120301;IP;2013
REQSTS;00002315000000000009;1174668474;EIN;272042615;;EEEE_SAK,1,100009394,Y,0|EEEE_NPI,1,1174618474,N,1|EEEE_XXX_ID,1,100009394,N,2;;;1699720086;;;;EP;20120103;20120401;IP;2013
REQSTS;00002315000000000003;1720192893;EIN;320806261;;BBBB_SAK,1,18663,Y,0|BBBB_NPI,1,1720172893,N,1|BBBB_UUU_ID,1,100002999,N,2;;;1699720086;;;;EP;20130101;20130430;GC;2013
REQSTS;00002315000000000016;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20110203;20110501;GC;2013
REQSTS;00002315000000000011;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013
REQSTS;00002315000000000007;1992777660;EIN;394806261;;DDDD_SAK,1,18663,Y,0|DDDD_NPI,1,1992757660,N,1|DDDD_WWW_ID,1,31598400,N,2;;;1699720086;;;;EP;20121101;20130531;  ;2013
REQSTS;00002315000000000005;1003888704;EIN;390836261;;CCCC_SAK,1,18663,Y,0|CCCC_NPI,1,1003868704,N,1|CCCC_VVV_ID,1,34394500,N,2;;;1699720086;;;;EP;20121201;20130131;  ;2013

I tried below command but it is not showing field 17th (i.e., fourth BOLD field) in the output.


Code:
 
awk -F';' '{$17 "%" split($7,a,","); print a[3] "%" $15 "%" $16 "%" $0}' temp.dat
 
100010463%20120103%20120401%CLMREQ;00002315000000000011;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013
18663%20130101%20130331%CLMREQ;00002315000000000001;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20130101;20130331;IP;2013
18663%20121201%20130229%CLMREQ;00002315000000000002;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20121201;20130229;IP;2013
18663%20121201%20130101%CLMREQ;00002315000000000022;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20121201;20130101;IP;2013
18663%20130101%20130430%CLMREQ;00002315000000000003;1720192893;EIN;320806261;;BBBB_SAK,1,18663,Y,0|BBBB_NPI,1,1720172893,N,1|BBBB_UUU_ID,1,100002999,N,2;;;1699720086;;;;EP;20130101;20130430;GC;2013
34082%20120203%20120301%CLMREQ;00002315000000000013;1366449767;EIN;390806961;;GGGG_SAK,1,34082,Y,0|GGGG_NPI,1,1366489767,N,1|GGGG_ZZZ_ID,1,32562000,N,2;;;1699720086;;;;EP;20120203;20120301;IP;2013
18663%20121201%20130131%CLMREQ;00002315000000000005;1003888704;EIN;390836261;;CCCC_SAK,1,18663,Y,0|CCCC_NPI,1,1003868704,N,1|CCCC_VVV_ID,1,34394500,N,2;;;1699720086;;;;EP;20121201;20130131;  ;2013
100009394%20120103%20120401%CLMREQ;00002315000000000009;1174668474;EIN;272042615;;EEEE_SAK,1,100009394,Y,0|EEEE_NPI,1,1174618474,N,1|EEEE_XXX_ID,1,100009394,N,2;;;1699720086;;;;EP;20120103;20120401;IP;2013
18663%20121101%20130531%CLMREQ;00002315000000000007;1992777660;EIN;394806261;;DDDD_SAK,1,18663,Y,0|DDDD_NPI,1,1992757660,N,1|DDDD_WWW_ID,1,31598400,N,2;;;1699720086;;;;EP;20121101;20130531;  ;2013
100010463%20110203%20110501%CLMREQ;00002315000000000016;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20110203;20110501;GC;2013

Thanks!
Sponsored Links
    #5  
Old 08-22-2013
RudiC RudiC is offline Forum Advisor  
Registered User
 
Join Date: Jul 2012
Last Activity: 26 July 2014, 12:57 PM EDT
Location: Aachen, Germany
Posts: 3,890
Thanks: 62
Thanked 922 Times in 875 Posts
What about analysing and trying to understand mjf's suggestion, and then adapting it YOURSELF to your new specification?
Sponsored Links
    #6  
Old 08-22-2013
jnrohit2k jnrohit2k is offline
Registered User
 
Join Date: May 2013
Last Activity: 16 March 2014, 7:24 PM EDT
Posts: 17
Thanks: 1
Thanked 0 Times in 0 Posts
Sorry, I did a mistake here. The awk command should be;

awk -F';' '{split($7,a,","); print $17 "%" a[3] "%" $15 "%" $16 "%" $0}' temp1.dat | sort -t '%' -k 1r -k 2n -k 3n -k 4nr

It gave me following output but it did not sort first bold field correctly. Still trying....


Code:
 
IP%34082%20120203%20120301%CLMREQ;00002315000000000013;1366449767;EIN;390806961;;GGGG_SAK,1,34082,Y,0|GGGG_NPI,1,1366489767,N,1|GGGG_ZZZ_ID,1,32562000,N,2;;;1699720086;;;;EP;20120203;20120301;IP;2013
IP%18663%20130101%20130331%CLMREQ;00002315000000000001;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20130101;20130331;IP;2013
IP%18663%20121201%20130229%CLMREQ;00002315000000000002;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20121201;20130229;IP;2013
IP%18663%20121201%20130101%CLMREQ;00002315000000000022;1316908164;EIN;310806261;;AAAA_SAK,1,18663,Y,0|AAAA_NPI,1,1316998164,N,1|AAAA_TTT_ID,1,30370300,N,2;;;1699720086;;;;EP;20121201;20130101;IP;2013
IP%100009394%20120103%20120401%CLMREQ;00002315000000000009;1174668474;EIN;272042615;;EEEE_SAK,1,100009394,Y,0|EEEE_NPI,1,1174618474,N,1|EEEE_XXX_ID,1,100009394,N,2;;;1699720086;;;;EP;20120103;20120401;IP;2013
GC%18663%20130101%20130430%CLMREQ;00002315000000000003;1720192893;EIN;320806261;;BBBB_SAK,1,18663,Y,0|BBBB_NPI,1,1720172893,N,1|BBBB_UUU_ID,1,100002999,N,2;;;1699720086;;;;EP;20130101;20130430;GC;2013
GC%100010463%20120103%20120401%CLMREQ;00002315000000000011;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20120103;20120401;GC;2013
GC%100010463%20110203%20110501%CLMREQ;00002315000000000016;1548356967;EIN;390606261;;FFFF_SAK,1,100010463,Y,0|FFFF_NPI,1,1548356967,N,1|FFFF_YYY_ID,1,100010463,N,2;;;1699720086;;;;EP;20110203;20110501;GC;2013
  %18663%20121201%20130131%CLMREQ;00002315000000000005;1003888704;EIN;390836261;;CCCC_SAK,1,18663,Y,0|CCCC_NPI,1,1003868704,N,1|CCCC_VVV_ID,1,34394500,N,2;;;1699720086;;;;EP;20121201;20130131;  ;2013
  %18663%20121101%20130531%CLMREQ;00002315000000000007;1992777660;EIN;394806261;;DDDD_SAK,1,18663,Y,0|DDDD_NPI,1,1992757660,N,1|DDDD_WWW_ID,1,31598400,N,2;;;1699720086;;;;EP;20121101;20130531;  ;2013

Thanks!
Sponsored Links
    #7  
Old 08-22-2013
RudiC RudiC is offline Forum Advisor  
Registered User
 
Join Date: Jul 2012
Last Activity: 26 July 2014, 12:57 PM EDT
Location: Aachen, Germany
Posts: 3,890
Thanks: 62
Thanked 922 Times in 875 Posts
You should modify the sort parameters:
Code:
sort -t '%' -k1,1r -k2,2n -k3,3n -k4,4nr

-k2n starts a key (field) 2 and runs to the end of line.
-k2,2 starts at 2 and stops at 2.
Sponsored Links
Closed Thread

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
awk - mixed for and if to select particular lines in a data file naska Shell Programming and Scripting 4 07-05-2013 02:58 PM
How to use FS for mixed file? Akshay Hegde Shell Programming and Scripting 5 03-08-2013 01:26 PM
Advanced: Sort, count data in column, append file name JamesT Shell Programming and Scripting 14 08-14-2012 07:36 AM
Ignore Header and Footer and Sort the data in fixed width file sasikari Shell Programming and Scripting 5 07-14-2011 11:42 AM
Sort a big data file rubber08 Shell Programming and Scripting 2 10-18-2010 07:05 AM



All times are GMT -4. The time now is 03:14 PM.