Complex text parsing with speed/performance problem (awk solution?)


 
# 1  
Old 03-15-2013

I have 1.6 GB (and growing) of comma-delimited files, with the needed data between the 11th and 34th lines (inclusive) of the second column. There is also a lot of stray white space in the files that needs to be trimmed. They have DOS-style line endings.

I need to transpose the 11th through 34th lines of col2 from each data file and append them as new rows to an existing file. I also need to add several variables to the front and back of each output line which will be parsed/calculated from the data file names and file metadata.

Input:
...,...
xxx, 9
xxx, 10
xxx, 11 <--need 11th through 34th row in col2.
...,...
xxx, 34
xxx, 35
xxx, 36
...,...

Output:
var1,var2,var3,var4,var5,var6,11,12,13,...,32,33,34,/original/directory/path/of/data/file/,original_data_file_name

Then the entire file, including the rows previously in it, needs to be sorted by several of the columns and duplicate lines removed (excluding some columns from the duplicate determination).

My dos2unix|head|foot|cut|tr(remove whitespace)|tr(change eol to comma)|echo(vars,std_in,vars) works but is way too slow!
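For concreteness, that pipeline can be sketched roughly as follows (a hypothetical reconstruction, reading "foot" as tail; the temp file, 40-line length, and the v1-v6 variable string are made up so the sketch is self-contained):

```shell
#!/bin/sh
# Build a small CRLF sample so the sketch runs as-is.
f=$(mktemp)
for i in $(seq 1 40); do printf 'xxx, %d,,,\r\n' "$i"; done > "$f"

vars="v1,v2,v3,v4,v5,v6"          # placeholder for the six parsed variables
mid=$(tr -d '\r' < "$f" |         # dos2unix: strip carriage returns
      head -n 34 | tail -n 24 |   # keep lines 11-34
      cut -d, -f2 |               # second comma-delimited column
      tr -d ' ' |                 # trim stray whitespace
      tr '\n' ',')                # newlines -> commas (leaves a trailing comma)
echo "${vars},${mid}$(dirname "$f")/,$(basename "$f")"
```

Spawning six or seven processes per data file is the likely cost here; with that many files, fork/exec overhead can dominate the actual text processing.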

I'm thinking there is a way to do the selecting, whitespace removal, and transpose, with the variables padded onto both ends of the output line, in one awk command, which should speed things up a whole lot, but I am not that good at awk.

Mike
# 2  
Old 03-15-2013
As much as I would like to help, you lost me. I can't possibly imagine what you want to achieve, coming from where.
Please post the directory structure (as you want that looong path included), the algorithms to calculate the vars, an input file sample, and, from that concoction, the desired output.


EDIT: This may fulfill part of your requirements:
Code:
$ awk 'NR > 10 {printf "%s,", $2} NR == 34 {printf "%s", FILENAME; exit}' /some/path/file

EDIT 2: put sub (/[^\/]*$/, ",&",FILENAME); in front of the print FILENAME.

Last edited by RudiC; 03-15-2013 at 03:08 PM..
# 3  
Old 03-15-2013
RudiC, not to over complicate things, assume there is a loop through all the data files and the variables FilePathName, FilePath, FileName, Var1, ..., Var6 are defined within the loop before the code is called.

The transpose with padding code will be called for one file at a time within the loop. If awk is used, the variables can be passed using the -v option as needed. Var1-6 are not needed in the awk so they can be passed as one concatenated string.

After the loop, the sort is done and duplicate lines will be deleted. If you don't want to keep it general, assume it is the last two columns, FilePath and FileName, that are not part of the duplicate line determination.
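One possible sketch of that post-loop step, assuming the last two comma-separated fields are FilePath and FileName (the inline sample lines and the sort keys are illustrative only):

```shell
#!/bin/sh
# Sort, then drop lines whose key (all fields except the last two,
# FilePath and FileName) has already been seen.
printf '%s\n' 'a,1,2,/p1/,f1' 'a,1,2,/p2/,f2' 'b,3,4,/p1/,f3' |
sort -t, -k1,1 -k2,2n |
awk -F, '{ key = $1
           for (i = 2; i <= NF - 2; i++) key = key FS $i
           if (!seen[key]++) print }'
```

The awk keeps the first line seen for each key, so whichever line the sort puts first for a key is the survivor.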

Mike
# 4  
Old 03-15-2013
Certainly putting the tr to delete white space first may speed things up. Doing it all in C/C++/Java/Perl/Python/awk might help. Move dos2unix after the various reducers. What is a 'foot'? tail?
# 5  
Old 03-15-2013
I'm not sure if dos2unix is needed at all, as awk doesn't encounter the <CR> char when printing only $2 (the <CR> stays attached to the last field). Unless there are more non-ASCII chars to translate, that is.
If you supply var1-6 as one long string to awk, I think the attempt above is pretty close. Printf var1-6 in a BEGIN section, or using NR==1.
# 6  
Old 03-15-2013
Thanks for giving me some code to work with. Here is an actual data file (with IP obfuscation)

Code:
XXXXXXX,,,,XXXXXXX
XXXXXXX,XXXXXXX,,,
XXXXXXX,XXXXXXX,,,
XXXXXXX,,,,
,,,,
XXXXXXX,,,,
XXXXXXX,XXXXXXX,,,
,,,,
XXXXXXX,,,,
XXXXXXX,XXXXXXX,,,
XXXXXXX,411.7075872,,,
XXXXXXX,388.756628,,,
XXXXXXX,384.2531634,,,
XXXXXXX,188.317418,,,
XXXXXXX,495.1306749,,,
XXXXXXX,495.7313397,,,
XXXXXXX,364.8057139,,,
XXXXXXX,128.1652694,,,
XXXXXXX,78.1880777,,,
XXXXXXX,47.85832595,,,
XXXXXXX,397.106979,,,
XXXXXXX,171.5723148,,,
XXXXXXX,452.5367818,,,
XXXXXXX,334.4613963,,,
XXXXXXX,245.0863368,,,
XXXXXXX,182.0549603,,,
XXXXXXX,495.5126526,,,
XXXXXXX,30.64512099,,,
XXXXXXX,291.9205658,,,
XXXXXXX,221.6485369,,,
XXXXXXX,24.33776897,,,
XXXXXXX,270.5466812,,,
XXXXXXX,32.99794073,,,
XXXXXXX,183.2580134,,,

Mike
# 7  
Old 03-15-2013
Try this:
Code:
awk -F, 'NR == 1        {printf "%s,", var16}
         NR > 10        {gsub (/ /,"",$2); printf "%s,", $2}
         NR == 34       {sub (/[^\/]*$/, ",&",FILENAME); printf "%s\n", FILENAME; exit}
        ' var16="v1,v2,v3,v4,v5,v6" /some/path/to/file
v1,v2,v3,v4,v5,v6,411.7075872,388.756628,384.2531634,188.317418,495.1306749,495.7313397,364.8057139,128.1652694,78.1880777,47.85832595,397.106979,171.5723148,452.5367818,334.4613963,245.0863368,182.0549603,495.5126526,30.64512099,291.9205658,221.6485369,24.33776897,270.5466812,32.99794073,183.2580134,/some/path/to/,file
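If it helps, here is one hypothetical way to drive that awk from the outer loop. The two sample files generated in a temp dir and the fixed var16 string are stand-ins; in the real script Var1-6 come from the file names and metadata:

```shell
#!/bin/sh
# Generate two sample data files (CRLF, five comma fields) so the
# sketch runs as-is; real data files would already exist.
d=$(mktemp -d); out="$d/transposed.csv"
for n in 1 2; do
    for i in $(seq 1 36); do printf 'xxx, %d,,,\r\n' "$i"; done > "$d/run$n.csv"
done

# One awk per file: print the variables, then col2 of lines 11-34,
# then the path with a comma spliced in before the basename.
for f in "$d"/run*.csv; do
    var16="v1,v2,v3,v4,v5,v6"        # really parsed from $f name/metadata
    awk -F, 'NR == 1  {printf "%s,", var16}
             NR > 10  {gsub (/ /, "", $2); printf "%s,", $2}
             NR == 34 {sub (/[^\/]*$/, ",&", FILENAME)
                       printf "%s\n", FILENAME; exit}
            ' var16="$var16" "$f" >> "$out"
done
```

Each pass through the loop appends one output row, so "$out" ends up with one line per data file, ready for the sort/dedup step.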
