awk, string as record separator, transposing rows into columns

04-13-2011


I'm working on a different stage of a project that someone helped me with elsewhere in these threads.
The .doc files I'm cycling through look roughly like this:

Code:

1 of 26 DOCUMENTS


Copyright 2010 The Age Company Limited
All Rights Reserved 
The Age (Melbourne, Australia)

November 27, 2010 Saturday  
First Edition

SECTION: NEWS; In Brief Overseas; Pg. 16

LENGTH: 114 words

There are about four .doc files I'm cycling through, and I need each of them to end up with one line per document, like this:

Code:

1 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010
2 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010
3 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010

I'm working with the following script:

Code:

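#!/bin/sh
# Convert each .doc to plain text (textutil is a macOS tool)
for i in /Users/simon/canadian/*
do
  textutil -convert txt "$i"
done

# Delete empty lines in place
for i in /Users/simon/canadian/*
do
  sed -i "" '/^$/d' "$i"
done

rm /Users/simon/canadian/*.doc

# Keep each DOCUMENTS header line plus the 5 lines after it
for i in /Users/simon/canadian/*
do
  grep -A5 'DOCUMENTS$' "$i" > "$i.tmp"
  mv "$i.tmp" "$i"
done

# Join the lines of each record with commas
for i in /Users/simon/canadian/*
do
  gawk '$1=$1' RS="DOCUMENTS$" FS="\n" OFS=, "$i" > "$i.tmp"
  mv "$i.tmp" "$i"
done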

It's doing what I want, EXCEPT that the output is all being put on one line:
1 of 26 DOCUMENTS,Copyright 2010 The Age Company Limited,All Rights Reserved ,--,2 of 26 DOCUMENTS,Copyright 2010 Nationwide News Pty Limited,All Rights Reserved ,--,3 of 26 DOCUMENTS,Copyright 2010 John Fairfax Publications Pty Ltd,All Rights Reserved ,--

I assume the double-dash is inserted by awk when it encounters a new record separator, so it seems to be encountering the "DOCUMENTS$" record separator properly, but it's not outputting the text on separate lines.
Can someone provide any insight?
Simon
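
A side note on the `--` entries in that output: they are probably not produced by awk. `grep -A` prints a `--` line between groups of context lines that are not adjacent, which would explain why one appears before each new DOCUMENTS header. The following is a minimal sketch with made-up sample text and a hypothetical helper named `drop_separators`; it shows the separator appearing and one way to filter it out.

```shell
# Made-up sample: two headers, with one line between the groups that grep -A1 skips.
sample='1 of 2 DOCUMENTS
Copyright A
All Rights Reserved
2 of 2 DOCUMENTS
Copyright B'

# grep -A1 prints each match plus one following line, with "--" between the groups.
printf '%s\n' "$sample" | grep -A1 'DOCUMENTS$'

# drop_separators (hypothetical helper): remove the "--" lines from stdin.
drop_separators() { grep -v '^--$'; }

printf '%s\n' "$sample" | grep -A1 'DOCUMENTS$' | drop_separators
```

GNU grep also has a `--no-group-separator` option, but filtering with `grep -v` avoids depending on which grep is installed.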


spindoctor

04-13-2011

Could you display the expected result?
Thank you.

Shell_Life

04-13-2011

Hi spindoctor,

Based on the sample, one option would be:

Code:

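for each in *.doc
do
  awk '
    # the header line starts a new record
    /^[0-9].*DOCUMENTS/ {a=$0}
    # append the copyright, rights and newspaper lines
    /^Copyright/ || /^All/ || /^The Age/ {a=a","$0}
    # on the date line, blank the last field (the weekday), append and print
    /[A-Z].* [0-9]+,/ {$NF=""; a=a","$0; print a}
  ' "$each"
done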

Hope it helps.

Regards


cgkmal

04-13-2011

@Shell_Life:
The expected result is this (lines separated only at 2 of 26, 3 of 26, etc.):
1 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010
2 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010
3 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010

@cgkmal, I don't think that will work. There are many different newspaper names.
If you look at the script I'm using, I first grep each line that matches "[0-9] Documents$" together with the 5 lines after it, so that grabs all the text from the document number down to the date. But it produces a file where all the information is in rows...

1 of 26 Documents
Copyright
All Rights Reserved
November 27, 2010

So I'm trying to transpose. I don't think your solution will work because I've got other newspaper titles.

spindoctor

04-13-2011

How about an awk script for all that...

Code:

awk '{
   if ($0~"DOCUMENTS") {
      f=0
      x=""
   }
   if ($0 ~ "Edition$") {
      f=1
      if (x) printf("%s\n", x)
      x=""
   }
   if ($0 && !f) {
      if (x) x=sprintf("%s,%s", x, $0)
      else   x=sprintf("%s", $0);
   }
}' 1.doc 2.doc 3.doc ...

shamrock

04-13-2011

Code:

grep -A5 'DOCUMENTS$' tst | awk '/Copyright/{NF=3;$3="etc"}$0!~/^$/' | paste -s -d"," -

---------- Post updated at 10:11 PM ---------- Previous update was at 10:09 PM ----------

Instead of tst, put your filename (*.doc, $i, or whatever).

---------- Post updated at 10:16 PM ---------- Previous update was at 10:11 PM ----------

You may want to choose another delimiter (a colon, for example) to avoid parsing problems later: the sequence <comma><space> also occurs inside (Melbourne, Australia), where it is not meant to be a delimiter but may be interpreted as one, or may at least make further parsing tedious.
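
As a rough illustration of that point, here is a sketch that joins lines with a pipe character instead of a comma. The helper name `join_with` and the choice of `|` are assumptions made for the example; any character that never occurs in the data would do.

```shell
# join_with DELIM (hypothetical helper): join all lines on stdin into one line,
# separated by the single character DELIM.
join_with() { paste -s -d "$1" -; }

# The embedded ", " stays inside its field because the delimiter is now "|".
printf 'The Age (Melbourne, Australia)\nNovember 27, 2010\n' | join_with '|'

# Later parsing can then split on "|" without tripping over the commas.
printf 'The Age (Melbourne, Australia)\nNovember 27, 2010\n' |
  join_with '|' |
  awk -F'|' '{ print $2 }'
```

A spreadsheet import can usually be told to split on the same character.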

ctsgnb

04-13-2011

@ctsgnb
That might work, but again, several of the files have different text contained in the first part of each document.
How do I write the output from the paste command to a file?
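
On writing the output to a file: shell redirection should cover it, since `>` sends a command's standard output to a file and `>>` appends to it. A minimal sketch follows; the function name `collect_lines` and the file names in the usage line are hypothetical.

```shell
# collect_lines OUT FILE... (hypothetical helper): join the lines of each FILE
# with commas and write one joined line per FILE into OUT.
collect_lines() {
  out=$1
  shift
  : > "$out"                      # create or truncate the output file
  for f in "$@"; do
    paste -s -d, "$f" >> "$out"   # >> appends; a plain > would overwrite each time
  done
}
```

For example, `collect_lines all.csv one.txt two.txt` would leave two comma-joined lines in `all.csv`. For a single pipeline, ending it with `> somefile` is enough.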

---------- Post updated at 05:11 PM ---------- Previous update was at 04:50 PM ----------

@shamrock, I tried your script and the output is marvellous, but it is not matching all the documents in each file I'm working on.
The problem is that each file has information entered in different ways. In one file, the word edition comes before the date, so your script stops printing before the crucial information, the date.

The only thing that unifies all the files is that each section is separated by the phrase
[0-9] of [0-9] DOCUMENTS, followed by some information, followed by a date. The only way I can see to match that text is to grep a few lines after the DOCUMENTS phrase, transpose the rows, and then play around with the resulting columns (probably in Excel) to get all the dates lined up. I was really very close with my original script, but it would just output everything on one line, rather than one line per record (a record being the text between one DOCUMENTS string and the next).
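
That description (a header, then a few lines of information) can also be handled in a single awk pass over the converted text, without the intermediate grep. This is only a sketch under assumptions: the helper name `first_fields` is made up, and it assumes every header line ends in "N of M DOCUMENTS" with no trailing spaces.

```shell
# first_fields N (hypothetical helper): for every "x of y DOCUMENTS" header on
# stdin, print the header and the next N non-empty lines as one comma-joined line.
first_fields() {
  awk -v n="$1" '
    /[0-9]+ of [0-9]+ DOCUMENTS$/ {   # a header starts a new record
      if (rec != "") print rec        # emit the previous record first
      rec = $0
      left = n
      next
    }
    left > 0 && NF > 0 {              # still collecting, and the line is not blank
      rec = rec "," $0
      left--
    }
    END { if (rec != "") print rec }  # flush the final record
  '
}
```

Something like `first_fields 5 < file.txt > file.csv` (file names hypothetical) would then give one line per document, with blank lines skipped along the way.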

---------- Post updated at 10:40 PM ---------- Previous update was at 05:11 PM ----------

The problem is really with the newline. I can get all the information I need out of every file, but awk is just outputting everything onto one line.

Quote:

And where it encounters the new record string that I define at the beginning of the script, it inserts a double dash (see right before 2 of 28) in the output, but it won't print them on two lines.
If I can just break up that output into one record per line, which I thought awk was supposed to do, I'm off to the races...
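
A possible explanation, offered as a guess rather than a confirmed diagnosis: in awk regular expressions `$` anchors to the end of the string being matched, not to the end of each line, so `RS="DOCUMENTS$"` may never match at the headers. That would fit the output, where the word DOCUMENTS survives in every record even though record separator text is normally discarded, and the whole file is treated as one record. One way around it, sketched below, is to turn the `--` lines into blank lines and use awk's paragraph mode (`RS=""`). The helper name `transpose_groups` is made up, and the sketch assumes every group in the input is preceded by a `--` line, as in the output shown earlier.

```shell
# transpose_groups (hypothetical helper): turn each "--"-separated group of
# lines on stdin into one comma-joined output line.
transpose_groups() {
  sed 's/^--$//' |                    # blank out the separator lines
  awk 'BEGIN { RS = ""; FS = "\n"; OFS = "," }  # RS="" means blank lines end a record
       { $1 = $1; print }'            # assigning a field rebuilds $0 using OFS
}
```

It could replace the last loop body, for example `transpose_groups < "$i" > "$i.tmp"`. Groups that grep printed back to back, with no `--` between them, would still be merged into one line.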

spindoctor
