Complex text parsing with speed/performance problem (awk solution?)

03-15-2013

I gotta take a sed swipe at this:

Code:

echo "var1,var2,var3,var4,var5,var6,$(
sed '
   1,10d
   s/^[^,]*, *\([^,]*[^ ,]\).*/\1/
   :l
   N
   s/\n[^,]*, *\([^,]*[^ ,]\).*/,\1/
   34q
   b l
  ' original_data_file_name | dos2unix
 ),/original/directory/path/of/data/file/,original_data_file_name"

Narrative for sed: delete the first 10 lines. Reduce line 11 to CSV column two, dropping leading spaces and stopping before any trailing spaces, and discard the rest of that line. (This assumes column two is never blank; otherwise we can either trim trailing spaces in one more 's' line, or use 't' to detect that no substitution happened and substitute in a bare comma.) Set a loop branch target 'l', get the next line with 'N', and turn the linefeed plus that line into a comma and column two, trimmed as before. Quit if that was line 34, otherwise branch back to 'l'. We can work on the raw file, since space and comma have the same codes in a DOS file, so maybe we do not need dos2unix unless there are other codes that need fixing; any carriage returns got tossed along with the rest of each line.
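To make the narrative concrete, here is a scaled-down sketch of the same script run against a made-up five-line DOS file (two header lines instead of ten, quitting at line 5 instead of 34). The sample data and the temp file are assumptions for illustration, not the real input.

```shell
# Hypothetical sample: 2 header lines, 3 data lines, CRLF line endings.
sample=$(mktemp)
printf 'hdr1\r\nhdr2\r\nt1,  411.70  ,x\r\nt2,388.75,y\r\nt3, 384.25 ,z\r\n' > "$sample"

# Same script as above, with 1,10d -> 1,2d and 34q -> 5q.
row=$(sed '
   1,2d
   s/^[^,]*, *\([^,]*[^ ,]\).*/\1/
   :l
   N
   s/\n[^,]*, *\([^,]*[^ ,]\).*/,\1/
   5q
   b l
  ' "$sample")

echo "$row"    # prints 411.70,388.75,384.25
rm -f "$sample"
```

No dos2unix is used here: the trailing `.*` swallows the carriage return together with the third column, so nothing is left to convert in this sample.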

If the files are long, make your first command head?

DGPickett

03-15-2013

RudiC, Thanks a bunch.

To test, I threw in some of the random whitespace that sometimes shows up.

Code:

awk -F, 'NR == 1        {printf "%s,", HeaderRows}
         NR > 10        {gsub (/ /,"",$2); printf "%s,", $2}
         NR == 34       {sub (/[^\/]*$/, ",&",FILENAME); printf "%s\n", FILENAME; exit}
        ' HeaderRows="v1,v2,v3,v4,v5,v6" ~/example.csv

Output:

Code:

v1,v2,v3,v4,v5,v6,411.7075872,388.756628,384.2531634,188.317418,495.1306749,495.7313397,364.8057139,128.1652694,78.1880777,47.85832595,397.106979,171.5723148,452.5367818,334.4613963,245.0863368,182.0549603,495.5126526,30.64512099,291.9205658,221.6485369,24.33776897,270.5466812,32.99794073,183.2580134,/home/mestora/,example.csv

DGPickett, yes, I meant "tail". Wrong animal.

I will performance test your solution vs. the awk one. Unfortunately, this will be exported to a Cygwin over Win7 environment in the final implementation (part of my performance issues).

Mike

Last edited by Michael Stora; 03-15-2013 at 04:02 PM..

Michael Stora

03-15-2013

So you've got 4 spaces in the output? When I seed several spaces in my test file, they get gsubbed. Pls post the input file...

RudiC

03-15-2013

Are embedded spaces within field values a problem? Usually, they are called 'data'.
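The distinction matters because `gsub (/ /,"",$2)` removes every space, not only the padding. A minimal sketch of the difference, using a made-up field value (the numeric data in this thread has no embedded spaces, so this is purely illustrative):

```shell
# Hypothetical field value with padding AND an embedded double space.
field="  New  York "

# Remove all spaces, as the awk solution in this thread does.
stripped=$(awk -v f="$field" 'BEGIN { gsub(/ /, "", f); print f }')

# Trim only leading and trailing spaces, keeping embedded ones.
trimmed=$(awk -v f="$field" 'BEGIN { gsub(/^ +| +$/, "", f); print f }')

echo "$stripped"   # prints NewYork
echo "$trimmed"    # prints New  York
```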

DGPickett

03-15-2013

The QUOTE block created the spaces in my reply; I switched to CODE and they are gone.

I am time testing both solutions right now.

Mike

---------- Post updated at 12:23 PM ---------- Previous update was at 12:10 PM ----------

I gave you both thanks.

awk solution 10000 times:

Code:

date
for i in {1..10000}
do
awk -F, 'NR == 1        {printf "%s,", HeaderRows}
         NR > 10        {gsub (/ /,"",$2); printf "%s,", $2}
         NR == 34       {sub (/[^\/]*$/, ",&",FILENAME); printf "%s\n", FILENAME; exit}
        ' HeaderRows="v1,v2,v3,v4,v5,v6" ~/example.csv > /dev/null
done
date

Running time: 4:26 (~27 ms per file)

sed solution 10000 times:

Code:

date
for i in {1..10000}
do
echo "var1,var2,var3,var4,var5,var6,$(
sed '
   1,10d
   s/^[^,]*, *\([^,]*[^ ,]\).*/\1/
   :l
   N
   s/\n[^,]*, *\([^,]*[^ ,]\).*/,\1/
   34q
   b l
  ' example.csv | dos2unix
 ),/home/mestora/,example.csv" > /dev/null
done
date

Running time: 9:56 (~60 ms per file)
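For anyone checking the per-file figures, they follow from dividing the wall-clock time by the 10000 runs. A small sketch of that arithmetic (the helper name `per_file_ms` is made up for this example):

```shell
# per_file_ms MIN SEC RUNS -> average milliseconds per run, one decimal place.
per_file_ms() {
    awk -v m="$1" -v s="$2" -v n="$3" 'BEGIN { printf "%.1f\n", (m * 60 + s) * 1000 / n }'
}

awk_ms=$(per_file_ms 4 26 10000)   # 4:26 over 10000 runs
sed_ms=$(per_file_ms 9 56 10000)   # 9:56 over 10000 runs

echo "awk: $awk_ms ms, sed: $sed_ms ms"   # prints awk: 26.6 ms, sed: 59.6 ms
```

Both numbers include the cost of starting the processes each time through the loop, not only the parsing itself.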

Mike

Michael Stora

03-15-2013

Pull off the dos2unix; it might be the exec time! As awk is more field-oriented, it has an advantage on this.
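If process start-up really is a large share of the time (plausible on Cygwin, but not measured in this thread), one way to act on that would be to hand many files to a single awk and key the rules on FNR instead of NR. This is only a sketch under that assumption, scaled down to two made-up files with 2 header lines and 2 data lines each; the file names, data, and `HeaderRows` value are hypothetical.

```shell
# Hypothetical sample files with CRLF endings and one extra trailing line each.
dir=$(mktemp -d)
printf 'h1\r\nh2\r\nt1, 1.5 ,x\r\nt2,2.5,y\r\nt3,9.9,z\r\n' > "$dir/a.csv"
printf 'h1\r\nh2\r\nt1,3.5,x\r\nt2, 4.5,y\r\nt3,9.9,z\r\n' > "$dir/b.csv"

# One awk process for all files; FNR restarts at 1 for each input file.
rows=$(awk -F, -v HeaderRows="v1,v2" '
    FNR == 1             { printf "%s,", HeaderRows }
    FNR > 2 && FNR <= 4  { gsub(/ /, "", $2); printf "%s,", $2 }
    FNR == 4             { name = FILENAME            # work on a copy of FILENAME
                           sub(/[^\/]*$/, ",&", name) # comma before the base name
                           print name }
' "$dir/a.csv" "$dir/b.csv")

echo "$rows"    # one output line per input file
rm -rf "$dir"
```

Unlike the `exit` in the original, this keeps reading each file to its end, so it only pays off when the files are short or the per-process overhead dominates.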

DGPickett

03-15-2013

Never mind, figured it out. I thought ^ here was a front anchor, but it is part of a complemented character set in the RE instead.
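For anyone else tripped up by the same thing, here is a small sketch of how `[^\/]*$` behaves in the `sub` call from the awk solution, next to a variant where a real front anchor is added. The path is the one from the output above; the variable names are made up.

```shell
path="/home/mestora/example.csv"

# [^\/] means "any character except a slash", so [^\/]*$ matches the
# trailing run of non-slash characters, i.e. the base name.
split_path=$(awk -v p="$path" 'BEGIN { sub(/[^\/]*$/, ",&", p); print p }')

# With a front anchor added, the whole string would have to be slash-free,
# so nothing matches and the path comes back unchanged.
anchored=$(awk -v p="$path" 'BEGIN { sub(/^[^\/]*$/, ",&", p); print p }')

echo "$split_path"   # prints /home/mestora/,example.csv
echo "$anchored"     # prints /home/mestora/example.csv
```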

Last edited by Michael Stora; 03-15-2013 at 10:02 PM..

Michael Stora
