Question about SED cutting and renaming

06-19-2007

Hi.
I've posted a couple of questions on my little project before, and it's been helpful, but things just keep changing on my end.
Allow me to explain.
I'm getting hundreds of .txt files, each containing the results of a database search from a newspaper. Each file contains the news stories from a particular day. They look as follows:

Quote:

==============================================================================
Documents

MLA renews calls to review mill's licence; Wapiti River threatened, he says:
[Final Edition]
RICK MCCONNELL Journal Staff Writer. Edmonton Journal. Edmonton, Alta.:Jan
8, 1992. p. A7

Liberals predict provincial deficit will top $900M for current fiscal year:
[Final Edition]
RICHARD HELM Journal Staff Writer. Edmonton Journal. Edmonton, Alta.:Jan
8, 1992. p. A8

Calgary wants McCoy to order end to strike by paramedics:[Final Edition]
Edmonton Journal. Edmonton, Alta.:Jan 8, 1992. p. A7

Cancer patient wants smoking ban:[Final Edition]
SHERRI AIKENHEAD Journal Staff Writer. Edmonton Journal. Edmonton, Alta.:
Jan 8, 1992. p. B3

First impressions often misleading; The West:[Final Edition]
GILLIAN STEWARD. Edmonton Journal. Edmonton, Alta.:Jan 8, 1992. p. A10

! All documents are reproduced with the permission of the copyright owner.
Further reproduction or distribution is prohibited without permission.

==============================================================================
Citation style: ProQuest Standard

Document 1 of 5

MLA renews calls to review mill's licence; Wapiti River threatened, he says:
[Final Edition]
RICK MCCONNELL Journal Staff Writer. Edmonton Journal. Edmonton, Alta.:Jan
8, 1992. p. A7

Author(s): RICK MCCONNELL Journal Staff Writer

Document types: NEWS

Dateline: Grande Prairie

Publication title: Edmonton Journal. Edmonton, Alta.: Jan 8, 1992. pg. A.7

Source type: Newspaper

ProQuest document 193222101
ID:

Text Word Count 586

Document URL: http://proquest.umi.com/
pqdweb?did=193222101&Fmt=3&clientId=14119&RQT=309&VName=PQD

Abstract (Document Summary)

Untreated effluent from Procter and Gamble's mill is leaking through a clay
liner into groundwater and threatening the nearby Wapiti River, says provincial
New Democrat MLA John McInnis.

Samples taken in May from 16 groundwater test wells near the mill show trace
levels of chlorinated organic compounds, created by bleaching pulp with
chlorine to turn it white. Results from five of the test wells, which McInnis
said were given him by "a concerned citizen," show no detectable toxins.

"Mr. McInnis would like a public review on everything. The only mill that Mr.
McInnis does not want to shut down is the mill at Weldwood, because that's the
mill that's represented by another socialist."

Full Text (586 words)

(Copyright The Edmonton Journal)

Grande Prairie

Fresh concerns about the safety of pollutants at a Grande Prairie pulp mill
have renewed calls for a public review of the operator's licence.

Untreated effluent from Procter and Gamble's mill is leaking through a clay
liner into groundwater and threatening the nearby Wapiti River, says pro

Now, each file contains several news stories. I would like to cut out each news story and output it to its own file, with its own unique file name.
So at the end, I would like the file 01.08.1992.txt to be broken up into 01.08.1992-1.txt, 01.08.1992-2.txt, 01.08.1992-3.txt and so on. Then I would like to do the same thing to the file 01.05.1994.txt and all the other files in the directory (dozens, if not hundreds).

I've used the following sed command to get the text of each news story: sed -n '/Document.[0-9]/,/==========*/p' 01.08.1992.txt > somefilename.txt (this is where I start to have trouble). Note: a row of equal signs separates each news story in each day's file.

I also use egrep to get the number of each story on standard output. That is, each story on each day is labelled Document 1 of 10, then the next is Document 2 of 10. I use egrep '^Document.[0-9]' *.txt and that gives me the digits 1, 2, 3, 4, 5, 6 until exhausted.

Surely there must be some way of having unix apply those digits to the end of the filename after the content has been split from it?

I hope you can help. I'm grateful for all your suggestions.
Simon

spindoctor

06-19-2007

And I'm working with bash. I'm not really familiar with the other shells. I'm new to this, so simplicity is a virtue.

spindoctor

06-19-2007

Quote:

Originally Posted by spindoctor

Hi.
I'm getting hundreds of .txt files, each containing the results of a database search from a newspaper. Each file contains the news stories from a particular day. They look as follows:

[SNIP]

Now, each file contains several news stories. I would like to cut out each news story and output it to its own file, with its own unique file name.
So at the end, I would like the file 01.08.1992.txt to be broken up into 01.08.1992-1.txt, 01.08.1992-2.txt, 01.08.1992-3.txt and so on. Then I would like to do the same thing to the file 01.05.1994.txt and all the other files in the directory (dozens, if not hundreds).

This might do what you want:

Code:

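awk -v RS="\n=============*\n" '
FNR == 1 { n = 0; next }        # new input file: restart numbering, skip the text before the first separator
{
   file = FILENAME
   sub( /.*\//, "", file )      # drop any directory part
   sub( /\.txt$/, "", file )    # drop the extension
   ++n
   file = file "-" n ".txt"     # e.g. 01.08.1992-1.txt
   print > file
   close( file )
}' *.txt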
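
Note that this depends on awk treating a multi-character RS as a regular expression. gawk and mawk do, but POSIX leaves the behaviour unspecified, so it may be worth checking the local awk first. A minimal check, where the three-line test input is invented for illustration:

```shell
# Feed three lines to awk with a multi-character RS and count the records.
# An awk with regex RS sees 2 records ("a" and "b");
# an awk that keeps only the first character of RS (a newline) sees 3.
count=$(printf 'a\n==\nb\n' | awk -v RS='\n==\n' 'END { print NR }')

if [ "$count" -eq 2 ]; then
    echo "regex RS supported"
else
    echo "RS truncated to one character (records: $count)"
fi
```

If the second message appears, a line-by-line approach that matches the separator with an ordinary pattern is probably the safer choice.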

cfajohnson

06-20-2007

Some versions of awk don't support a regular expression as the input record separator.
My AIX version of awk uses only the first character of RS.

Code:

awk '
   /^==========*$/ {
      if (outfile) close(outfile);
      outfile = FILENAME;
      sub(/\.[^.]*$/, "-" seq++ "&", outfile);
      next;
   }
   { 
      print $0 > outfile
   }
' *.txt

With your input sample in the file 01.08.1992.txt, this script generates two files: 01.08.1992-0.txt and 01.08.1992-1.txt
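
The file name is built by sub(): the pattern /\.[^.]*$/ matches the last extension, and & in the replacement stands for the matched text, so the sequence number lands just before ".txt". A small sketch of that one step on its own, using an invented sequence number 3:

```shell
# "&" in the replacement is the text matched by the regex (here ".txt"),
# so "-" seq "&" turns "01.08.1992.txt" into "01.08.1992-3.txt".
name=$(echo '01.08.1992.txt' | awk -v seq=3 '{
    sub(/\.[^.]*$/, "-" seq "&")
    print
}')
echo "$name"    # prints 01.08.1992-3.txt
```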

aigles

06-20-2007

my_split.sh:

Code:

csplit -k -f "$1." "$1" "/======================/" "{99}" > /dev/null 2>&1

Usage:
my_split.sh <input_file>
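
To run the same kind of split over every file in a directory, a loop along these lines may do. This is only a sketch and assumes GNU csplit: `{*}` (repeat for as long as the pattern matches) is a GNU extension used here in place of `{99}`, `-s` replaces the redirection, and the two sample files are invented. Each piece is named after its input file plus a two-digit suffix, for example 01.08.1992.txt.00.

```shell
# Scratch directory with two tiny invented input files.
dir=$(mktemp -d)
printf '=====================\nlist\n=====================\nstory 1\n' \
    > "$dir/01.08.1992.txt"
printf '=====================\nlist\n=====================\nstory A\n=====================\nstory B\n' \
    > "$dir/01.05.1994.txt"

# Split each file at every separator line. The pieces do not end in .txt,
# so they are not picked up as input if the loop is run again.
for f in "$dir"/*.txt; do
    csplit -s -k -f "$f." "$f" '/^==========*$/' '{*}'
done

ls "$dir"
```

Because the pieces are cut purely at the separator lines, the document list at the top of each file ends up in a piece of its own and would still have to be discarded.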

Shell_Life

06-20-2007

This is very useful. Part of the problem, however, is that the string "=============" also appears elsewhere, at the top of each file. Thus, using it as the only record separator won't really work. I need to somehow capture the text between the patterns "Document.[0-9]" at the beginning and "===============" at the end.
I tried cfajohnson's script and it generated thousands of empty .txt files in my folder, presumably because it relied solely on the "============" as the record separator.

I tried aigles' script and it returned the following error message:

awk: null file name in print or getline
input record number 1, file 01.08.1992.txt
source line number 9

Simon

spindoctor

06-20-2007

Quote:

Originally Posted by spindoctor

I tried aigles' script and it returned the following error message:

awk: null file name in print or getline
input record number 1, file 01.08.1992.txt
source line number 9

Simon

A possible cause of the problem is that your input file doesn't start with a '===========' line. In that case outfile is still empty when the first line is printed.
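
If that is the cause, one possible workaround is to print only once an output file name has been set, so that any lines before the first separator are skipped instead of being written to an empty file name. A sketch of that variant follows. Resetting seq for each input file is an added assumption about the desired numbering, and the sample file is invented:

```shell
# Invented sample that does NOT start with a separator line.
dir=$(mktemp -d)
printf 'preamble\n==========\nstory one\n==========\nstory two\n' \
    > "$dir/01.08.1992.txt"

awk '
   FNR == 1 {                      # new input file: start over
      if (outfile) close(outfile)
      outfile = ""
      seq = 0
   }
   /^==========*$/ {               # separator: switch to the next output file
      if (outfile) close(outfile)
      outfile = FILENAME
      sub(/\.[^.]*$/, "-" seq++ "&", outfile)
      next
   }
   outfile != "" {                 # skip anything before the first separator
      print > outfile
   }
' "$dir"/*.txt

ls "$dir"
```

With this sample the "preamble" line is dropped and the two stories land in 01.08.1992-0.txt and 01.08.1992-1.txt.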

Quote:

Originally Posted by spindoctor

This is very useful. Part of the problem, however, is that the string "=============" also appears elsewhere, at the top of each file. Thus, using it as the only record separator won't really work. I need to somehow capture the text between the patterns "Document.[0-9]" at the beginning and "===============" at the end.

The following script captures the text between the patterns "Document.[0-9]" at the beginning and "===============" at the end.
The sequence number for the output files is taken from the "Document.[0-9]" line.

Code:

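awk '
   # Lines from "Document N of M" through the next row of equal signs
   /^Document [0-9]/,/^==========*$/ {
      if (/^Document [0-9]/) {
         # First line of a story: switch to a new output file
         if (outfile) close(outfile);
         seq = $2;                                 # N from "Document N of M"
         outfile = FILENAME;
         sub(/\.[^.]*$/, "-" seq "&", outfile);    # 01.08.1992.txt -> 01.08.1992-N.txt
      }
      print $0 > outfile
   }
' *.txt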
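
Since the output files also end in .txt and land next to the inputs, running the script a second time in the same directory would pick them up as input. A hedged variant that writes into a separate directory instead is sketched below. The out/ directory, the outdir variable and the tiny sample file are assumptions made for illustration:

```shell
# Scratch directory with a small invented input file and an output directory.
dir=$(mktemp -d)
mkdir "$dir/out"
cat > "$dir/01.08.1992.txt" <<'EOF'
==========
Documents
list of headlines
==========
Citation style: ProQuest Standard

Document 1 of 2
first story
==========
Document 2 of 2
second story
==========
EOF

awk -v outdir="$dir/out" '
   /^Document [0-9]/,/^==========*$/ {
      if (/^Document [0-9]/) {
         if (outfile) close(outfile)
         seq = $2                       # story number from "Document N of M"
         outfile = FILENAME
         sub(/.*\//, "", outfile)       # drop the directory part
         sub(/\.[^.]*$/, "-" seq "&", outfile)
         outfile = outdir "/" outfile   # write into the output directory
      }
      print > outfile
   }
' "$dir"/*.txt

ls "$dir/out"
```

With this sample the out/ directory ends up holding 01.08.1992-1.txt and 01.08.1992-2.txt, each starting at its "Document N of 2" line and ending with the row of equal signs.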

aigles
