Complex text parsing with speed/performance problem (awk solution?)


 
# 8  
Old 03-15-2013
I gotta take a sed swipe at this:
Code:
echo "var1,var2,var3,var4,var5,var6,$(
sed '
   1,10d
   s/^[^,]*, *\([^,]*[^ ,]\).*/\1/
   :l
   N
   s/\n[^,]*, *\([^,]*[^ ,]\).*/,\1/
   34q
   b l
  ' original_data_file_name | dos2unix
 ),/original/directory/path/of/data/file/,original_data_file_name"

Narrative for sed: delete the first 10 lines. Turn line 11 into CSV column two, dropping leading spaces and stopping before any trailing spaces, and discard the rest of that line. (This assumes the column is never blank; otherwise we can either trim trailing spaces in one more 's' line, or use 't' to detect a failed substitution and substitute in a bare comma.) Set a loop branch target 'l' and get the next line. Turn the linefeed and the second line into a comma plus column two, stripped of white space as before. Quit if this is line 34; otherwise branch back to 'l'. We can work on the raw file because space and comma have the same codes in a DOS file, so maybe we do not need dos2unix unless there are other codes that need fixing; any carriage returns got tossed along with the rest of each line.
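To see the loop in action, here is a scaled-down sketch of the same sed script. The sample data is made up for illustration (2 header lines instead of 10, quit at line 5 instead of 34), so the line numbers in the script are changed to match; the two substitutions are copied unchanged.

```shell
# Hypothetical sample file: 2 header lines, then 3 CRLF-terminated data lines.
sample=$(mktemp)
printf 'hdr\r\nhdr\r\nA, 1.5 ,x\r\nB,  2.25,y\r\nC,3 ,z\r\n' > "$sample"

# Same script as above, with 1,10d -> 1,2d and 34q -> 5q for the small sample.
row=$(sed '
   1,2d
   s/^[^,]*, *\([^,]*[^ ,]\).*/\1/
   :l
   N
   s/\n[^,]*, *\([^,]*[^ ,]\).*/,\1/
   5q
   b l
  ' "$sample")
rm -f "$sample"

# Column two of each data line, joined by commas, spaces trimmed.
printf '%s\n' "$row"
```

Each pass through the loop appends one line with N and immediately collapses it to ",value", so the pattern space never holds more than the growing output row plus one raw input line.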

If the files are long, make your first command head?
# 9  
Old 03-15-2013
RudiC, Thanks a bunch.

I threw in some random whitespace that sometimes happens to test.


Code:
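awk -F, '# Start the output row with the header names.
         NR == 1        {printf "%s,", HeaderRows}
         # From line 11 on: strip spaces from column 2 and append it.
         NR > 10        {gsub (/ /,"",$2); printf "%s,", $2}
         # Line 34: split FILENAME into "dir/,name", finish the row, stop reading.
         NR == 34       {sub (/[^\/]*$/, ",&",FILENAME); printf "%s\n", FILENAME; exit}
        ' HeaderRows="v1,v2,v3,v4,v5,v6" ~/example.csv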

Output:
Code:
v1,v2,v3,v4,v5,v6,411.7075872,388.756628,384.2531634,188.317418,495.1306749,495.7313397,364.8057139,128.1652694,78.1880777,47.85832595,397.106979,171.5723148,452.5367818,334.4613963,245.0863368,182.0549603,495.5126526,30.64512099,291.9205658,221.6485369,24.33776897,270.5466812,32.99794073,183.2580134,/home/mestora/,example.csv

DGPickett, yes I meant "tail". Wrong animal.
I will performance test your solution vs. the awk one. Unfortunately, the final implementation will be exported to a Cygwin on Win7 environment (part of my performance issues).

Mike

Last edited by Michael Stora; 03-15-2013 at 04:02 PM..
# 10  
Old 03-15-2013
So you've got 4 spaces in the output? When I seed several spaces in my test file, they get gsubbed. Pls post the input file...
# 11  
Old 03-15-2013
Are embedded spaces within field values a problem? Usually, they are called 'data'.
# 12  
Old 03-15-2013
The QUOTE block created the spaces in my reply, I switched to CODE and they are gone.

I am time testing both solutions right now.

Mike

---------- Post updated at 12:23 PM ---------- Previous update was at 12:10 PM ----------

I gave you both thanks.

awk solution 10000 times:
Code:
date
for i in {1..10000}
do
awk -F, 'NR == 1        {printf "%s,", HeaderRows}
         NR > 10        {gsub (/ /,"",$2); printf "%s,", $2}
         NR == 34       {sub (/[^\/]*$/, ",&",FILENAME); printf "%s\n", FILENAME; exit}
        ' HeaderRows="v1,v2,v3,v4,v5,v6" ~/example.csv > /dev/null
done
date

Running time: 4:26 (~27 ms per file)

sed solution 10000 times:
Code:
 date
 for i in {1..10000}
 do
echo "var1,var2,var3,var4,var5,var6,$(
sed '
   1,10d
   s/^[^,]*, *\([^,]*[^ ,]\).*/\1/
   :l
   N
   s/\n[^,]*, *\([^,]*[^ ,]\).*/,\1/
   34q
   b l
  ' example.csv | dos2unix
 ),/home/mestora/,example.csv" > /dev/null
done
date

Running time: 9:56 (~60 ms per file)

Mike
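The date-before, date-after timing above leaves the subtraction and the per-file division to be done by hand. A minimal sketch of a wrapper that does both follows. The name time_runs is a hypothetical helper, not something from this thread, and it assumes a date that supports +%s (GNU, BSD and Cygwin all do). Resolution is whole seconds, so it only means something for large run counts such as the 10000 used here.

```shell
# time_runs N CMD [ARGS...]: run CMD N times, discard its output,
# then report total wall time and the integer average per run.
# (Hypothetical helper; whole-second resolution only.)
time_runs() {
    n=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" > /dev/null
        i=$((i + 1))
    done
    elapsed=$(( $(date +%s) - start ))
    echo "$n runs in ${elapsed}s (~$(( elapsed * 1000 / n )) ms each)"
}

# Example: time a trivial awk start-up 20 times.
time_runs 20 awk 'BEGIN { exit }'
```

Timing a do-nothing command like the one in the example gives a rough baseline for process start-up cost, which is probably a large share of the per-file figures on Cygwin.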
# 13  
Old 03-15-2013
Pull off the dos2unix; it might be the exec time! As awk is more field-oriented, it has an advantage on this.
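A quick way to check that dropping dos2unix is safe is to feed the first substitution a CRLF-terminated line and count the carriage returns that survive. This is a sketch on a made-up input line, not on the real data file.

```shell
# One CRLF-terminated line, as a DOS-format data file would have.
out=$(printf 'name, 411.7 ,rest\r\n' |
      sed 's/^[^,]*, *\([^,]*[^ ,]\).*/\1/')

# Count carriage returns left in the result; 0 means dos2unix has nothing to do.
cr_count=$(printf '%s' "$out" | tr -cd '\r' | wc -c)
echo "value=$out CRs=$((cr_count))"
```

The carriage return disappears because the trailing .* swallows everything after column two. That only holds while column two is not the last field on the line; if it were, the CR would be captured as part of the value and would still need stripping.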
# 14  
Old 03-15-2013
Never mind, figured it out. I thought ^ here was a front anchor, but inside the brackets it is part of a complemented (negated) character set instead.
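For anyone else tripped up by the two meanings of ^, a small illustration on a made-up string:

```shell
line='ab,cd'

# Outside brackets, ^ anchors the match to the start of the line:
# only the first letter is replaced.
anchored=$(printf '%s\n' "$line" | sed 's/^[a-z]/X/')

# As the first character inside brackets, ^ negates the set:
# [^,]* is "a run of characters that are not commas".
negated=$(printf '%s\n' "$line" | sed 's/[^,]*/X/')

echo "$anchored $negated"
```

So in the sed script above, ^[^,]* reads as "from the start of the line, everything up to the first comma": the first ^ is the anchor and the second one is the negation.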

Last edited by Michael Stora; 03-15-2013 at 10:02 PM..