awk, string as record separator, transposing rows into columns


 
# 1  
Old 04-13-2011

I'm working on a different stage of a project that someone helped me address elsewhere in these threads.
The .doc files I'm cycling through look roughly like this:

Code:
1 of 26 DOCUMENTS


Copyright 2010 The Age Company Limited
All Rights Reserved 
The Age (Melbourne, Australia)

November 27, 2010 Saturday  
First Edition

SECTION: NEWS; In Brief Overseas; Pg. 16

LENGTH: 114 words

There are about four .doc files I'm cycling through and I need to get them to look like this:
Code:
1 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010
2 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010
3 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010

I'm working with the following script:
Code:
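#!/bin/sh

# Convert each Word document to plain text (textutil writes a .txt next to it)
for i in /Users/simon/canadian/*.doc
do
  textutil -convert txt "$i"
done

# Delete empty lines in place (BSD/macOS sed syntax)
for i in /Users/simon/canadian/*.txt
do
  sed -i "" '/^$/d' "$i"
done

rm /Users/simon/canadian/*.doc

# Keep each DOCUMENTS line plus the 5 lines that follow it
for i in /Users/simon/canadian/*.txt
do
  grep -A5 'DOCUMENTS$' "$i" > "$i.tmp"
  mv "$i.tmp" "$i"
done

# Intended: one comma-joined line per record
for i in /Users/simon/canadian/*.txt
do
  gawk '$1=$1' RS="DOCUMENTS$" FS="\n" OFS=, "$i" > "$i.tmp"
  mv "$i.tmp" "$i"
done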

It does what I want, except that all the output ends up on one line:
1 of 26 DOCUMENTS,Copyright 2010 The Age Company Limited,All Rights Reserved ,--,2 of 26 DOCUMENTS,Copyright 2010 Nationwide News Pty Limited,All Rights Reserved ,--,3 of 26 DOCUMENTS,Copyright 2010 John Fairfax Publications Pty Ltd,All Rights Reserved ,--

As far as I can tell, the double dash shows up wherever a new record begins, so awk seems to be seeing the "DOCUMENTS$" record separator, but it is not putting each record on its own line.
Can someone provide any insight?
Simon
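
A possible explanation, judging only from the output shown above: the `--` entries look like the group separators that `grep -A` prints between non-adjacent matches, not something awk adds, and the word DOCUMENTS surviving in the output suggests the record separator never matched at all. As far as I understand it, `$` in an awk regular expression anchors to the end of the string being examined, not to the end of each line. The sketch below avoids `RS` entirely: it reads line by line, starts a new output row at every line ending in DOCUMENTS, and drops blank lines and `--`. The function name `join_records` and the sample input are made up for illustration.

```shell
# join_records: read grep -A style output on stdin and print one
# comma-joined line per record. A record starts at a line ending in DOCUMENTS.
join_records() {
  awk '
    /^--$/ || /^[ \t]*$/ { next }     # skip grep group separators and blank lines
    /DOCUMENTS$/ {                    # header line: flush the previous row
      if (row != "") print row
      row = $0
      next
    }
    row != "" {                       # append to the current row
      sub(/[ \t]+$/, "")              # trim trailing spaces first
      row = row "," $0
    }
    END { if (row != "") print row }  # flush the last row
  '
}

# Made-up sample shaped like the grep output described in the post
sample='1 of 26 DOCUMENTS
Copyright 2010 The Age Company Limited
All Rights Reserved
--
2 of 26 DOCUMENTS
Copyright 2010 Nationwide News Pty Limited
All Rights Reserved'

result=$(printf '%s\n' "$sample" | join_records)
printf '%s\n' "$result"
```

Inside the original loop this would replace the gawk step, for example `join_records < "$i" > "$i.tmp"`.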

Last edited by Franklin52; 04-13-2011 at 03:11 PM.. Reason: Please indent your code and use code tags
# 2  
Old 04-13-2011
Could you display the expected result?
Thank you.
# 3  
Old 04-13-2011
Hi spindoctor,

Based on your sample, one option would be:

Code:
for each in *.doc
do
  awk '
    /^[0-9].*DOCUMENTS/ {a=$0}
    /^Copyright/ || /^All/ || /^The Age/ {a=a","$0}
    /[A-Z].* [0-9]+,/ {$NF=""; a=a","$0; print a}
  ' "$each"
done


Hope it helps.

Regards

Last edited by cgkmal; 04-13-2011 at 03:57 PM..
# 4  
Old 04-13-2011
@Shell_Life:
The expected result is this (lines separated only at 2 of 26, 3 of 26, etc.):
1 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010
2 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010
3 of 26 Documents, Copyright 2010 etc, All Rights Reserved, The Age (Melbourne, Australia), November 27, 2010

@cgkmal, I don't think that will work. There are many different newspaper names.
If you look at the script I'm using, I first grep each line that matches "DOCUMENTS$" along with the 5 lines after it, which grabs all the text from the document number down to the date. But that produces a file where all the information is in rows:

1 of 26 DOCUMENTS
Copyright
All Rights Reserved
November 27, 2010

So I'm trying to transpose. I don't think your solution will work because I've got other newspaper titles.
# 5  
Old 04-13-2011
How about an awk script for all that...
Code:
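awk '{
   # A new document header resets the state
   if ($0 ~ "DOCUMENTS") {
      f = 0
      x = ""
   }
   # The "Edition" line ends the header block: print what was collected
   if ($0 ~ "Edition$") {
      f = 1
      if (x != "") printf("%s\n", x)
      x = ""
   }
   # Append every non-empty line seen before "Edition", comma-separated
   if ($0 != "" && !f) {
      if (x != "") x = sprintf("%s,%s", x, $0)
      else         x = $0
   }
}' 1.doc 2.doc 3.doc ...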

# 6  
Old 04-13-2011
Code:
grep -A5 'DOCUMENTS$' tst | awk '/Copyright/{NF=3;$3="etc"}$0!~/^$/' | paste -s -d"," -

---------- Post updated at 10:11 PM ---------- Previous update was at 10:09 PM ----------

Instead of tst, put your filename (*.doc, $i, or whatever).

---------- Post updated at 10:16 PM ---------- Previous update was at 10:11 PM ----------

You may want to choose another delimiter (a colon, for example) to avoid parsing problems later: the sequence <comma><space> also occurs in (Melbourne, Australia), where it is not meant to be a delimiter but may be interpreted as one, or at least make further parsing tedious.
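
To make the delimiter point concrete, here is a small sketch that joins the same three lines once with `|` and once with `,` and then counts the resulting fields. The sample lines come from the document shown in the first post; the variable names are made up for illustration.

```shell
# Three lines of one record, two of which contain a comma of their own
sample='1 of 26 DOCUMENTS
The Age (Melbourne, Australia)
November 27, 2010 Saturday'

# Join the lines with a pipe character, then with a comma
piped=$(printf '%s\n' "$sample" | paste -s -d '|' -)
commas=$(printf '%s\n' "$sample" | paste -s -d ',' -)

printf '%s\n' "$piped"

# Count fields when splitting each version on its own delimiter
piped_fields=$(printf '%s\n' "$piped" | awk -F'|' '{ print NF }')
comma_fields=$(printf '%s\n' "$commas" | awk -F',' '{ print NF }')

printf 'pipe: %s fields, comma: %s fields\n' "$piped_fields" "$comma_fields"
```

With the pipe the record splits back into the 3 original lines, while the comma version splits into 5 pieces because of the commas inside the newspaper name and the date.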
# 7  
Old 04-13-2011
@ctsgnb
That might work, but again, several of the files have different text contained in the first part of each document.
How do I write the output from the paste command to a file?
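
On saving the result: the standard output of any pipeline, including one that ends in `paste`, can be sent to a file with the shell's `>` redirection (or `>>` to append, which helps when looping over several files). A minimal sketch, with made-up input lines and a temporary file standing in for the real output path:

```shell
# Hypothetical output location; any writable path works here
outfile=$(mktemp)

# Join three lines with commas and write the result to the file
printf '%s\n' '1 of 26 DOCUMENTS' 'Copyright 2010 etc' 'All Rights Reserved' |
  paste -s -d ',' - > "$outfile"

cat "$outfile"
```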

---------- Post updated at 05:11 PM ---------- Previous update was at 04:50 PM ----------

@shamrock, I tried your script and the output is marvellous, but it is not matching all the documents in each file I'm working on.
The problem is that each file has information entered in different ways. In one file, the word edition comes before the date, so your script stops printing before the crucial information, the date.

The only thing that unifies all the files is that each section is separated by the phrase
"[0-9] of [0-9] DOCUMENTS", followed by some information, followed by a date. The only way I can see to match that text is to grep a few lines after the DOCUMENTS phrase, transpose the rows, and then play around with the resulting columns (probably in Excel) to get all the dates lined up. I was really very close with my original script, but it outputs everything on one line rather than one line per record (a record being the text between two DOCUMENTS strings).
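
The "grab a few lines after DOCUMENTS, then transpose" idea can also be sketched in a single awk pass over the converted text, without grep. This is only an illustration under assumptions: the function name `first_lines` is made up, the second sample record is invented to show a file where Edition comes before the date, and the right number of lines to keep would have to be chosen per file.

```shell
# first_lines: for each record header (a line ending in DOCUMENTS), print the
# header plus the next n non-blank lines as one comma-joined row.
first_lines() {
  awk -v n="$1" '
    /DOCUMENTS$/ {                    # new record: flush the previous row
      if (row != "") print row
      row = $0
      left = n
      next
    }
    /^[ \t]*$/ { next }               # blank lines do not count towards n
    left > 0 {
      sub(/[ \t]+$/, "")              # trim trailing spaces
      row = row "," $0
      left--
    }
    END { if (row != "") print row }  # flush the last row
  '
}

# Made-up sample: the second record has Edition before the date
sample='1 of 2 DOCUMENTS

Copyright 2010 The Age Company Limited
All Rights Reserved
The Age (Melbourne, Australia)

November 27, 2010 Saturday
First Edition

SECTION: NEWS

2 of 2 DOCUMENTS

Copyright 2010 Nationwide News Pty Limited
All Rights Reserved
First Edition
November 28, 2010 Sunday
'

rows=$(printf '%s' "$sample" | first_lines 4)
printf '%s\n' "$rows"
```

Because the rows are cut by line count instead of by a keyword such as Edition, the date is kept in both records, although it may land in a different column from file to file.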

---------- Post updated at 10:40 PM ---------- Previous update was at 05:11 PM ----------

The problem is really with the newline. I can get all the information I need out of every file, but awk just outputs everything on one line:
Quote:
1 of 28 DOCUMENTS Copyright 2010 Dagbladet Politiken All Rights Reserved ......ds -- 2 of 28 DOCUMENTS Copyright 2010 Dagbladet Politiken
And where it encounters the new record string that I define at the beginning of the script, it inserts a double dash (see right before 2 of 28) in the output, but it won't print them on two lines.
If I can just break up that output into one record per line, which I thought awk was supposed to do, I'm off to the races...