Post 302781085 by Michael Stora on Friday 15th of March 2013 01:29:07 PM
Complex text parsing with speed/performance problem (awk solution?)

I have 1.6 GB (and growing) of comma-delimited files. The data I need is in the second column, on the 11th through 34th lines (inclusive). The files also contain a lot of stray whitespace that needs to be trimmed, and they have DOS-style (CRLF) line endings.

I need to transpose lines 11 through 34 of col2 from each data file and append them as a new row to an existing file. I also need to add several variables to the front and back of each output line; these will be parsed or calculated from the data file names and file metadata.

Input:
...,...
xxx, 9
xxx, 10
xxx, 11 <--need 11th through 34th row in col2.
...,...
xxx, 34
xxx, 35
xxx, 36
...,...

Output:
var1,var2,var3,var4,var5,var6,11,12,13,...,32,33,34,/original/directory/path/of/data/file/,original_data_file_name

Then the entire file, including the rows previously in it, needs to be sorted by several of the columns, with duplicate lines removed (some columns are excluded from the duplicate determination).
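
The sort-and-dedupe step described above could be sketched roughly as follows. The post does not say which columns are sort keys or which are excluded from the duplicate check, so this assumes (purely for illustration) that columns 1 and 2 are the sort keys and that the last two columns (directory and file name) are ignored when comparing rows. The function name `sort_dedupe` is hypothetical.

```shell
#!/bin/sh
# sort_dedupe FILE: sort comma-separated rows, then drop duplicate rows.
# ASSUMPTIONS (not from the post): sort keys are columns 1 and 2, and the
# last two columns (directory, file name) are ignored when comparing rows.
sort_dedupe() {
    LC_ALL=C sort -t, -k1,1 -k2,2 "$1" |
    awk -F, '
        {
            key = ""
            # Build the comparison key from every column except the last two.
            for (i = 1; i <= NF - 2; i++) key = key FS $i
        }
        !seen[key]++    # print a row only the first time its key appears
    '
}
```

Called as `sort_dedupe combined.csv > combined.sorted.csv` (file names are placeholders). Numeric sort keys would need the `n` modifier, for example `-k1,1n`.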

My dos2unix | head | tail | cut | tr (remove whitespace) | tr (change EOL to comma) | echo (vars, stdin, vars) pipeline works but is way too slow!

I'm thinking there is a way to do the selection, the whitespace removal, and the transposition, with the variables padded onto both ends of the output line, in a single awk command. That should speed things up a whole lot, but I am not that good at awk.
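
One possible shape for such a single awk pass is sketched below. It is only an illustration under stated assumptions: the helper name `extract_row` is hypothetical, and the six leading values stand in for var1 through var6, which the caller would compute from the file name and metadata. The awk program selects lines 11 through 34, strips spaces, tabs, and DOS carriage returns from column 2, and joins the values into one comma-separated row.

```shell
#!/bin/sh
# extract_row FILE VAR1 ... VAR6: print one output row for one data file.
# Hypothetical helper; VAR1..VAR6 are placeholders supplied by the caller.
extract_row() {
    file=$1; shift
    prefix=$(IFS=,; printf '%s' "$*")    # join the remaining args with commas
    dir=$(dirname "$file")/
    base=$(basename "$file")
    awk -F, -v prefix="$prefix" -v dir="$dir" -v base="$base" '
        NR < 11 { next }                 # skip everything before line 11
        NR > 34 { exit }                 # stop reading after line 34 (END still runs)
        {
            val = $2
            gsub(/[ \t\r]/, "", val)     # remove spaces, tabs, and DOS carriage returns
            row = row "," val            # transpose: append the value to the row
        }
        END { print prefix row "," dir "," base }
    ' "$file"
}
```

Called as `extract_row /path/to/data.csv v1 v2 v3 v4 v5 v6 >> existing_file` (all names are placeholders). This replaces six processes per file with one; running a single awk over many files and keying on `FNR` and `FILENAME` would cut the process count further. Values containing backslashes would be better passed through `ENVIRON` than through `-v`, since awk interprets escape sequences in `-v` assignments.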

Mike
 

RS(1)                     BSD General Commands Manual                    RS(1)

NAME
     rs -- reshape a data array

SYNOPSIS
     rs [-[csCS][x] [kKgGw][N] tTeEnyjhHmz] [rows [cols]]

DESCRIPTION
     The rs utility reads the standard input, interpreting each line as a row of blank-separated entries in an array, transforms the array according to the options, and writes it on the standard output. With no arguments it transforms stream input into a columnar format convenient for terminal viewing.

     The shape of the input array is deduced from the number of lines and the number of columns on the first line. If that shape is inconvenient, a more useful one might be obtained by skipping some of the input with the -k option. Other options control interpretation of the input columns.

     The shape of the output array is influenced by the rows and cols specifications, which should be positive integers. If only one of them is a positive integer, rs computes a value for the other which will accommodate all of the data. When necessary, missing data are supplied in a manner specified by the options and surplus data are deleted. There are options to control presentation of the output columns, including transposition of the rows and columns.

     The following options are available:

     -cx     Input columns are delimited by the single character x. A missing x is taken to be `^I'.
     -sx     Like -c, but maximal strings of x are delimiters.
     -Cx     Output columns are delimited by the single character x. A missing x is taken to be `^I'.
     -Sx     Like -C, but padded strings of x are delimiters.
     -t      Fill in the rows of the output array using the columns of the input array, that is, transpose the input while honoring any rows and cols specifications.
     -T      Print the pure transpose of the input, ignoring any rows or cols specification.
     -kN     Ignore the first N lines of input.
     -KN     Like -k, but print the ignored lines.
     -gN     The gutter width (inter-column space), normally 2, is taken to be N.
     -GN     The gutter width has N percent of the maximum column width added to it.
     -e      Consider each line of input as an array entry.
     -n      On lines having fewer entries than the first line, use null entries to pad out the line. Normally, missing entries are taken from the next line of input.
     -y      If there are too few entries to make up the output dimensions, pad the output by recycling the input from the beginning. Normally, the output is padded with blanks.
     -h      Print the shape of the input array and do nothing else. The shape is just the number of lines and the number of entries on the first line.
     -H      Like -h, but also print the length of each line.
     -j      Right adjust entries within columns.
     -wN     The width of the display, normally 80, is taken to be the positive integer N.
     -m      Do not trim excess delimiters from the ends of the output array.
     -z      Adapt column widths to fit the largest entries appearing in them.

     With no arguments, rs transposes its input, and assumes one array entry per input line unless the first non-ignored line is longer than the display width. Option letters which take numerical arguments interpret a missing number as zero unless otherwise indicated.

EXAMPLES
     The rs utility can be used as a filter to convert the stream output of certain programs (e.g., spell, du, file, look, nm, who, and wc(1)) into a convenient ``window'' format, as in

           % who | rs

     This function has been incorporated into the ls(1) program, though for most programs with similar output rs suffices.

     To convert stream input into vector output and back again, use

           % rs 1 0 | rs 0 1

     A 10 by 10 array of random numbers from 1 to 100 and its transpose can be generated with

           % jot -r 100 | rs 10 10 | tee array | rs -T > tarray

     In the editor vi(1), a file consisting of a multi-line vector with 9 elements per line can undergo insertions and deletions, and then be neatly reshaped into 9 columns with

           :1,$!rs 0 9

     Finally, to sort a database by the first line of each 4-line field, try

           % rs -eC 0 4 | sort | rs -c 0 1

SEE ALSO
     jot(1), pr(1), sort(1), vi(1)

BUGS
     Handles only two dimensional arrays.

     The algorithm currently reads the whole file into memory, so files that do not fit in memory will not be reshaped.

     Fields cannot be defined yet on character positions.

     Re-ordering of columns is not yet possible.

     There are too many options.

BSD                            December 30, 1993                           BSD