07-02-2014
Since you didn't say whether your input file was sorted, the earlier suggestions had to assume that the lines in your 20,000,000-line file were in random order. Therefore, every key value and the running sum of its corresponding 2nd fields had to be kept in memory until the entire file had been read. Only then could the totals be printed for each of the different keys present in the file. The error messages you got say that awk ran out of memory trying to accumulate all of that data.
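For reference, the in-memory approach described above looks roughly like the sketch below. It is not the exact script from earlier in the thread: the function name sum_unsorted is made up for illustration, and it assumes whitespace-separated lines with the key in field 1 and the number to be summed in field 2.

```shell
# Hypothetical helper: sum field 2 for each key in field 1 (assumed layout).
# Every distinct key stays in the total[] array until end of input,
# so memory use grows with the number of distinct keys.
sum_unsorted() {
    awk '{ total[$1] += $2 }
         END { for (k in total) print k, total[k] }' "$@"
}
```

The order in which awk walks the array in the END block is unspecified, so pipe the output through sort if you need the keys in order.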
Now that we know that all of the lines with a given key are adjacent in your input file, the later scripts can print the sum for each key as soon as a new key is found. Very little memory is required to do that, and your script runs much faster because it needs fewer system resources to get the job done.
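A minimal sketch of that streaming approach follows. Again, this is an illustration rather than the exact script posted earlier: sum_grouped is a made-up name, and the layout (key in field 1, value in field 2, whitespace-separated, all lines for a key adjacent) is assumed.

```shell
# Hypothetical helper: sum field 2 per key when equal keys are adjacent.
# Only the current key and its running total are held in memory.
sum_grouped() {
    awk 'NR > 1 && $1 != key { print key, total; total = 0 }  # key changed: flush
         { key = $1; total += $2 }                            # accumulate current key
         END { if (NR > 0) print key, total }                 # flush the last key
        ' "$@"
}
```

If the same key shows up again later in the file, this prints a second, separate total for it, which is why it only works on grouped input.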
These 2 Users Gave Thanks to Don Cragun For This Post:
sort(1) General Commands Manual sort(1)
Name
sort - sort file data
Syntax
sort [options] [-k keydef] [+pos1[-pos2]] [file...]
Description
The sort command sorts lines of all the named files together and writes the result on the standard output. The name `-' means the standard
input. If no input files are named, the standard input is sorted.
Options
The default sort key is an entire line. Default ordering is lexicographic by bytes in machine collating sequence. The ordering is
affected globally by the following options, one or more of which may appear.
-b Ignores leading blanks (spaces and tabs) in field comparisons.
-d Sorts data according to dictionary ordering: letters, digits, and blanks only.
-f Folds uppercase to lowercase while sorting.
-i Ignores characters outside the ASCII range 040-0176 in nonnumeric comparisons.
-k keydef The keydef argument is a key field definition. The format is field_start, [field_end] [type], where field_start and field_end
are the definition of the restricted search key, and type is a modifier from the option list [bdfinr]. These modifiers have the
functionality, for this key only, that their command line counterparts have for the entire record.
-n Sorts fields with numbers numerically. An initial numeric string, consisting of optional blanks, optional minus sign, and zero
or more digits with optional decimal point, is sorted by arithmetic value. (Note that -0 is taken to be equal to 0.) Option n
implies option b.
-r Reverses the sense of comparisons.
-tx Uses specified character as field separator.
The notation +pos1 -pos2 restricts a sort key to a field beginning at pos1 and ending just before pos2. Pos1 and pos2 each have the form
m.n, optionally followed by one or more of the options bdfinr, where m tells a number of fields to skip from the beginning of the line and
n tells a number of characters to skip further. If any options are present they override all the global ordering options for this key. If
the b option is in effect n is counted from the first nonblank in the field; b is attached independently to pos2. A missing .n means .0; a
missing -pos2 means the end of the line. Under the -tx option, fields are strings separated by x; otherwise fields are nonempty nonblank
strings separated by blanks.
When there are multiple sort keys, later keys are compared only after all earlier keys compare equal. Lines that otherwise compare equal
are ordered with all bytes significant.
These are additional options:
-c Checks sorting order and displays output only if out of order.
-m Merges previously sorted data.
-o name Uses specified file as output file. This file may be the same as one of the inputs.
-T dir Uses specified directory to build temporary files.
-u Suppresses all duplicate entries. Ignored bytes and bytes outside keys do not participate in this comparison.
Examples
Print in alphabetical order all the unique spellings in a list of words. Capitalized words differ from uncapitalized.
sort -u +0f +0 list
Print the password file, sorted by user id number (the 3rd colon-separated field).
sort -t: +2n /etc/passwd
Print the first instance of each month in an already sorted file of (month day) entries. The options -um with just one input file make the
choice of a unique representative from a set of equal lines predictable.
sort -um +0 -1 dates
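Many current sort implementations no longer accept the +pos1 notation by default. As a hedged sketch, the password-file example above can be written with the -k option instead; the wrapper name by_uid is hypothetical and used only for illustration, and POSIX numbers fields from 1, so +2n becomes -k3n.

```shell
# Hypothetical wrapper: numeric sort on the 3rd colon-separated field onward,
# the -k form of the historical "sort -t: +2n".
by_uid() {
    sort -t: -k3n "$@"
}
```

Typical use would be by_uid /etc/passwd; with no file arguments it reads standard input.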
Restrictions
Very long lines are silently truncated.
Diagnostics
Comments and exits with nonzero status for various trouble conditions and for disorder discovered under option c.
Files
/usr/tmp/stm*, /tmp/* first and second tries for temporary files
See Also
comm(1), join(1), rev(1), uniq(1)