Copying Text between two unique text patterns

05-28-2007


Dear Colleagues:
I have .rtf files containing a collection of newspaper articles. Each article starts with a variation of the phrase "Document * of 20" and is separated from the next article by the character string "===================".

I would like to take the text of each article from between these two patterns and write it to a separate, uniquely named file. I've been experimenting with sed, grep, cut and csplit, but nothing has worked so far. I have regular expressions that match the two lines "Document * of 20" and "===================" independently, but I can't figure out how to capture and work with the text between them. I hope you can help.
Yours,
Simon J. Kiss
Queen's University

spindoctor

05-28-2007

Hi Simon,
There may be a smarter solution, but I used the following approach to solve this problem.

Assume the file /tmp/MyNewArticleFile.rtf has these contents:

cat /tmp/MyNewArticleFile.rtf

HTML Code:

Times of India
Edition-1
Date:27 th May

Document 1 of 20

All blah blah goes here
Ad Page
Blah

================================

Document 2 of 20

All blah blah goes here
Ad Page
Blah

================================

Document 3 of 20

All blah blah goes here
Ad Page
Blah

================================
Document 4 of 20

All blah blah goes here
Ad Page
Blah

================================
End of the Edition
Thanks
Editor

I have written the following script, which processes the above file to generate the output.
The assumption here is that the file contains 20 documents.

Code:

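#!/bin/ksh
# Write each article (header line through its "=====" separator) to its own file.
let page=1
while [[ page -le 20 ]] ; do
# The " of " after the number keeps "Document 1" from also matching "Document 10".
sed -n "/^Document $page of /,/^=====/p" /tmp/MyNewArticleFile.rtf > "/tmp/ArticleSplitPage-$page"
((page=page+1))
done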

Running the script gives me 20 files, split by document number.

cat /tmp/ArticleSplitPage-1

HTML Code:

Document 1 of 20

All blah blah goes here
Ad Page
Blah

================================
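
If the number of documents is not known in advance, the same split can probably be done in a single pass with awk. This is only a minimal sketch: the split_demo directory, the sample input and the article-N.txt names are assumptions made up for illustration, and the separator lines are dropped rather than copied.

```shell
#!/bin/sh
# Sketch: split articles in one pass, without hard-coding the document count.
# The directory, sample file and output names are illustrative assumptions.
dir=split_demo
mkdir -p "$dir"

cat > "$dir/sample.txt" <<'EOF'
Times of India
Document 1 of 20

First article body

================================
Document 2 of 20

Second article body

================================
End of the Edition
EOF

awk -v dir="$dir" '
    # A header line starts a new article; $2 is the document number.
    /^Document [0-9]+ of / { if (out) close(out); out = dir "/article-" $2 ".txt" }
    # A separator line ends the current article and is not copied.
    /^=====/               { if (out) close(out); out = ""; next }
    # Copy every line while we are inside an article.
    out                    { print > out }
' "$dir/sample.txt"

ls "$dir"
```

Lines before the first header and after the last separator are skipped, because no output file is open at that point.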

Thanks,
Nagarajan Ganesan.

ennstate

05-28-2007

Hi.

For the sample data file "data1":

Code:

Document * of 20
Hello

=====
Document one of 20

World

=====
Document 44 of 20

Now is

=====
Document "Climatology Review" in of 20

with no Documents at the beginning of the time.

=====

I ran this script:

Code:

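#!/bin/sh

# @(#) s1       Demonstrate csplit.

# Input file: first argument, or "data1" by default.
F=${1-data1}

# Start a new piece at every "Document ... of" line, as often as it occurs.
csplit -k -s -z "$F" '/^Document.*of/' '{*}'

echo
for file in xx*
do
        echo
        echo "File: $file"
        # Show the first three lines of each piece, numbered.
        head -n 3 "$file" |
        cat -n
done

exit 0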

To produce this:

Code:

% ./s1


File: xx00
     1  Document * of 20
     2  Hello
     3

File: xx01
     1  Document one of 20
     2
     3  World

File: xx02
     1  Document 44 of 20
     2
     3  Now is

File: xx03
     1  Document "Climatology Review" in of 20
     2
     3  with no Documents at the beginning of the time.

This assumes that the "=====" lines are only visual sugar; they stay at the end of each output file ... cheers, drl
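
If the separator lines should not end up in the output, and the pieces need more descriptive names than xx00, something along these lines might work. It is a sketch that assumes GNU csplit (-z, -b and '{*}' are GNU extensions); the csplit_demo directory, the sample data and the article-NN names are invented for illustration.

```shell
#!/bin/sh
# Sketch, assuming GNU csplit. Directory and file names are illustrative.
dir=csplit_demo
mkdir -p "$dir"

cat > "$dir/data1" <<'EOF'
Document 1 of 20
Hello

=====
Document 2 of 20

World

=====
EOF

# -f sets the output prefix, -b the format of the numeric suffix.
csplit -k -s -z -f "$dir/article-" -b '%02d.raw' "$dir/data1" '/^Document.*of/' '{*}'

# Copy each piece to a .txt file, leaving out the "=====" separator lines.
for piece in "$dir"/article-*.raw
do
    grep -v '^=====*$' "$piece" > "${piece%.raw}.txt"
done

ls "$dir"
```

The numbering starts at 00, so article-00.txt holds the first document.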

drl

01-16-2009

How to grep the text between patterns

Hi,
I have a very small problem, but I just couldn't find the solution.

I have a file with multiple entries like these:

<id>QIIC.QA</id>
<id>.AEX</id>
<id>QIIC</id>
..
I want the output to be:
QIIC.QA
.AEX
QIIC

I then want to check which values are repeated, and how many times.
Please help.
Thanks.
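
One possible approach, as a hedged sketch: sed can print just the text between the tags, and sort with uniq -c can count the repeats. The id_demo directory and the sample file (with one value repeated on purpose) are assumptions for illustration, and the pattern assumes one <id> element per line.

```shell
#!/bin/sh
# Sketch: extract the text inside <id>...</id> and count repeated values.
# The directory and sample file are illustrative assumptions.
dir=id_demo
mkdir -p "$dir"

cat > "$dir/ids.xml" <<'EOF'
<id>QIIC.QA</id>
<id>.AEX</id>
<id>QIIC</id>
<id>.AEX</id>
EOF

# Print only the text between the tags, one value per line.
sed -n 's|.*<id>\([^<]*\)</id>.*|\1|p' "$dir/ids.xml" > "$dir/ids.txt"
cat "$dir/ids.txt"

# Count how often each value occurs, most frequent first.
sort "$dir/ids.txt" | uniq -c | sort -rn > "$dir/counts.txt"
cat "$dir/counts.txt"
```

Any value with a count above 1 in counts.txt is a repeat.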

praveen21
