Sorting blocks by a section of the identifier

04-14-2017

Quote:

Originally Posted by RudiC

Wouldn't 8.000.000 kB (8 * 10^6 * 10^3) be 8 GB? And thus (sort of) manageable? How come we're talking terabytes?

Now that you mention it, I think you are right. I just read Don's "8TB" and didn't recalculate it myself. My bad.

I just counted one record of the posted sample and it has 260 characters. Since 15-25 million records were mentioned: 15 * 10^6 * 260 ~ 4GB, 25 * 10^6 * 260 ~ 6.5GB. This should indeed be feasible to sort in memory.
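For anyone who wants to redo the estimate with their own numbers, here is a minimal sketch. The helper name estimate_gb is made up for illustration, and it assumes decimal gigabytes (10^9 bytes) and one byte per character:

```shell
# Rough input size in decimal gigabytes: records * bytes_per_record / 10^9.
# estimate_gb is a hypothetical helper name, not something from this thread.
estimate_gb() {
    awk -v n="$1" -v len="$2" 'BEGIN { printf "%.1f\n", n * len / 1e9 }'
}

estimate_gb 15000000 260    # 15 million records of 260 characters
estimate_gb 25000000 260    # 25 million records of 260 characters
```

This only sizes the raw data. Whatever tool does the sorting will need some working space on top of that.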

bakunin

04-14-2017

I'm sorry for all of the confusion. I had originally intended to type 8GB, but hit the T instead of the G key.

Then, while I was reviewing it, I decided to spell it out and converted the 8TB to 8 terabytes, compounding the error instead of correcting it.

With the BSD-based awk on macOS, I don't have the asorti() function, and only the 1st character of values assigned to RS matters. So the following is completely untested, but if I understand the GNU awk page correctly, I think the pipeline:

Code:

awk -vRS="@M0" 'BEGIN{FS="\n"; OFS="\t"}NR>1{print RS$1, $2, $3, $4}' test.txt | sort -t: -k 7 | tr "\t" "\n"

should be replaceable by the following single invocation of awk:

Code:

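awk '
BEGIN {	FS = OFS = "\n"		# each line of a record is one field
	RS = "@M0"		# multi-character RS is not supported by every awk
}
# NR == 1 is the empty text before the first "@M0", so skip it.
NR > 1 {split($1, f, /:/)	# f[7] is the 7th colon-separated field of the ID line
	out[f[7]] = RS $0	# put back the "@M0" that RS consumed
	order[f[7]]		# create the key so asorti() has it to sort
}
END {	n = asorti(order)	# GNU awk only: sorted keys end up in order[1..n]
	for(i = 1; i <= n; i++)
		printf("%s", out[order[i]])
}' test.txt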

as long as there are no duplicates in the 7th colon-separated field in any of the records in your input file. (If there are duplicates, I think all but the last record in each set of duplicates will be missing from the output produced by the above script.)

I would appreciate it if someone with access to GNU awk could try this out with the sample data in post #1 in this thread and let me know if I came close to getting it right.
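If duplicate keys do turn up, or if you are stuck with an awk that has neither asorti() nor a multi-character RS, something along these lines might be worth a try. It is only a sketch: the function name sort_blocks is made up, and it assumes that every record is exactly four lines long (as the print of $1 through $4 in the pipeline above suggests), that the identifier is the first of those lines, and that no line contains a tab character.

```shell
# sort_blocks FILE: hypothetical helper. Assumes four lines per record,
# the identifier on the first of them, and no tab characters in the data.
sort_blocks() {
    awk '
    # Join each group of four lines into one tab-separated line.
    { printf "%s%s", $0, (NR % 4 ? "\t" : "\n") }
    ' "$1" |
    sort -t: -k7 |      # same sort key as the pipeline above; duplicates are kept
    tr "\t" "\n"        # restore the four-line layout
}
```

It would be called as sort_blocks test.txt. Since sort works on whole lines and never indexes an array by the key, records sharing the same 7th field should all survive, and sort can spill to temporary files if the data does not fit in memory.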

Don Cragun

04-18-2017

Sorry for the long delay! I will give it a try.

Xterra
