Sorting blocks by a section of the identifier

 
Thread Tools Search this Thread
Top Forums UNIX for Beginners Questions & Answers Sorting blocks by a section of the identifier
# 1  
Old 04-12-2017
Sorting blocks by a section of the identifier

I have a file that should be sorted by a string (shown in red in my example below) in the identifier. The RS is ^@M0, something like this:
Code:
 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:10817:7690 1:N:0:86
 ACGAGCATCATCTTGATTAAGCTCATTAGGGTTAGCCTCGGTACGGTCAGGCATCCACGGCGCTTTAAAATAGTTGTTATAGATATTCAAATAACCCTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEFFFGGGGGG
 @M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
 @M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
 @M04961:22:000000000-B5VGJ:1:1101:10091:7763 1:N:0:86
 GAGCACATTGTAGCATTGTGCCAATTCATCCATTAACTTCTCAGTAACAGATACAAACTCATCACGAACGTCAGAAGCAGCCTTATGGCCGTCAACATAC
 +
 :=@FGEFFFGGGGGGGFBB@BEFGG?F,EFCCF@FGGGGGGECFGFG9,><3>FC@DFFGG9:383@FC9,>;,>78FC=FCDECFFDGFFCFFGGC?FF
 @M04961:22:000000000-B5VGJ:1:1101:14783:7784 1:N:0:86
 TCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGCGGGGFGG
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Then desire output is the following:
Code:
 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
 @M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
 @M04961:22:000000000-B5VGJ:1:1101:10817:7690 1:N:0:86
 ACGAGCATCATCTTGATTAAGCTCATTAGGGTTAGCCTCGGTACGGTCAGGCATCCACGGCGCTTTAAAATAGTTGTTATAGATATTCAAATAACCCTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEFFFGGGGGG
 @M04961:22:000000000-B5VGJ:1:1101:10091:7763 1:N:0:86
 GAGCACATTGTAGCATTGTGCCAATTCATCCATTAACTTCTCAGTAACAGATACAAACTCATCACGAACGTCAGAAGCAGCCTTATGGCCGTCAACATAC
 +
 :=@FGEFFFGGGGGGGFBB@BEFGG?F,EFCCF@FGGGGGGECFGFG9,><3>FC@DFFGG9:383@FC9,>;,>78FC=FCDECFFDGFFCFFGGC?FF
 @M04961:22:000000000-B5VGJ:1:1101:14783:7784 1:N:0:86
 TCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGCGGGGFGG
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I have tried several combinations of awk | sort, but haven't been able to get the desire output
Any help will be greatly appreciated!

Last edited by Xterra; 04-12-2017 at 06:10 PM..
# 2  
Old 04-12-2017
Does
Code:
sort -t: -n -7

work? Does your red number always appear in the 7th field when using a colon as a delimiter? Is the field always four digits? Finally, does the "b1" after your sort code ever change? ("b" represents a space.)
# 3  
Old 04-12-2017
wbport
I am not quite sure how your code will output the desired result. About your questions, everything is constant but the length of the string/field
# 4  
Old 04-12-2017
Quote:
Originally Posted by Xterra
I have a file that should be sorted by a string
Which typical size has such a file? Background of the question is: would it be feasible to sort it in memory or will we need to work with temporary files to conserve memory. Furthermore, a rough estimation of the typical number of records would help - orders of magnitude would be sufficient.

I suppose the easiest way would be to first transform the "records" which now span multiple lines into single lines, then use a simple sort like the suggested one to sort these lines, finally transform the the lines back into the original structure. This will work if: the resulting line length is sufficiently short and the available memory is big enough to do the sorting.

I hope this helps.

bakunin
This User Gave Thanks to bakunin For This Post:
# 5  
Old 04-12-2017
The last field on the sort command, unless it is sorting data piped in from something else (like sed), is the name of the file to be sorted followed by a redirect (>) or a pipe (|) to another program. "-t:" says to use a colon as a field delimiter. "-n" says this is a numeric field but if this field is always four digits, that parameter can be taken out.

If you need a numeric sort but need to ignore the space and following digit, you can use a sed command to replace the space with a leading colon and a space. After running sort, another sed can take out the colon before the space.
This User Gave Thanks to wbport For This Post:
# 6  
Old 04-12-2017
Thanks guys! Sort of got it:
Code:
 awk -vRS="@M0" 'BEGIN { FS="\n"; OFS="\t" }{ print RS$1, $2, $3, $4 }' test.txt | sed '1d' | sort -t: -k 7 | tr "\t" "\n"

These sort of files contain between 15-25 million entries. So, I would like to have a "lighter" script handling this thingy
Thanks in advance!
# 7  
Old 04-13-2017
You can get rid of the sed in your pipeline with a small change to the awk script:
Code:
 awk -vRS="@M0" 'BEGIN{FS="\n"; OFS="\t"}NR>1{print RS$1, $2, $3, $4}' test.txt | sort -t: -k 7 | tr "\t" "\n"

I would also be tempted to change the:
Code:
sort -t: -k 7

to:
Code:
sort -t: -k 7,7

but as long as the field 7 values in your millions of input lines are unique (including the numbers after the space), it won't make any difference.
This User Gave Thanks to Don Cragun For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Row blocks to column blocks

Hello, Searched for a while and found some "line-to-column" script. My case is similar but with multiple fields each row: S02 Length Per S02 7043 3.864 S02 54477 29.89 S02 104841 57.52 S03 Length Per S03 1150 0.835 S03 1321 0.96 S03 ... (9 Replies)
Discussion started by: yifangt
9 Replies

2. UNIX for Dummies Questions & Answers

Sorting arrays horizontally without END section, awk

input: ref001, Europe, Belgium, 1001 ref001, Europe, Spain, 203 ref001, Europe, Germany, 457 ref002, America, Canada, 234 ref002, America, US, 87 ref002, America, Alaska, 652 Without using an END section, I need to write all the info related to the same ref number ($1)and continent ($2) on... (9 Replies)
Discussion started by: lucasvs
9 Replies

3. Shell Programming and Scripting

Prepend first line of section to each line until the next section header

I have searched in a variety of ways in a variety of places but have come up empty. I would like to prepend a portion of a section header to each following line until the next section header. I have been using sed for most things up until now but I'd go for a solution in just about anything--... (7 Replies)
Discussion started by: pagrus
7 Replies

4. Shell Programming and Scripting

how to split this file into blocks and then send these blocks as input to the tool called Yices?

Hello, I have a file like this: FILE.TXT: (define argc :: int) (assert ( > argc 1)) (assert ( = argc 1)) <check> # (define c :: float) (assert ( > c 0)) (assert ( = c 0)) <check> # now, i want to separate each block('#' is the delimeter), make them separate files, and then send them as... (5 Replies)
Discussion started by: paramad
5 Replies

5. Shell Programming and Scripting

is not an identifier

Hi Guys... I am using the following codes in my script: SID_L=`cat /var/opt/oracle/oratab|grep -v "^#"|cut -f1 -d: -s` SID_VAR=$SID_L for SID_RUN in $SID_VAR do ORACLE_HOME=`grep ^$SID_RUN /var/opt/oracle/oratab | \ awk -F: '{print $2}'` ;export ORACLE_HOME export... (2 Replies)
Discussion started by: Phuti
2 Replies

6. Shell Programming and Scripting

Extract section of file based on word in section

I have a list of Servers in no particular order as follows: virtualMachines="IIBSBS IIBVICDMS01 IIBVICMA01"And I am generating some output from a pre-existing script that gives me the following (this is a sample output selection). 9/17/2010 8:00:05 PM: Normal backup using VDRBACKUPS... (2 Replies)
Discussion started by: jelloir
2 Replies

7. UNIX for Dummies Questions & Answers

Convert 512-blocks to 4k blocks

I'm Unix. I'm looking at "df" on Unix now and below is an example. It's lists the filesystems out in 512-blocks, I need this in 4k blocks. Is there a way to do this in Unix or do I manually convert and how? So for container 1 there is 7,340,032 in size in 512-blocks. What would the 4k block be... (2 Replies)
Discussion started by: rockycj
2 Replies

8. Shell Programming and Scripting

not an identifier

Hi I have already gone through this topic on this forum, but still i am getting same problem. I am using solaris 10. my login shell is /usr/bash i have got a script as below /home/gyan> cat 3.cm #!/usr/bin/ksh export PROG_NAME=rpaa001 if i run this script as below , it works fine... (3 Replies)
Discussion started by: gyanibaba
3 Replies

9. Shell Programming and Scripting

Sorting blocks of data

Hello all, Below is what I am trying to accomplish: I have a file that looks like this /* ----------------- xxxx.y_abcd_00000050 ----------------- */ jdghjghkla sadgsdags asdgsdgasd asdgsagasdg /* ----------------- xxxx.y_abcd_00000055 ----------------- */ sdgsdg sdgxcvzxcbv... (8 Replies)
Discussion started by: alfredo123
8 Replies

10. Shell Programming and Scripting

Sorting rules on a text section

Hi all My text file looks like this: start doc ... (certain number of records) REC3|Emma|info| REC3|Lukas|info| REC3|Arthur|info| ... (certain number of records) end doc start doc ... (certain number of records)... (4 Replies)
Discussion started by: Indalecio
4 Replies
Login or Register to Ask a Question