Question about SED cutting and renaming


 
# 1  
Old 06-19-2007

Hi.
I've posted a couple of questions on my little project before, and it's been helpful, but things just keep changing on my end.
Allow me to explain.
I'm getting hundreds of .txt files, each containing the results of a database search from a newspaper. Each file contains the news stories from a particular day. They look as follows:

Quote:
==============================================================================
Documents


MLA renews calls to review mill's licence; Wapiti River threatened, he says:
[Final Edition]
RICK MCCONNELL Journal Staff Writer. Edmonton Journal. Edmonton, Alta.:Jan
8, 1992. p. A7


Liberals predict provincial deficit will top $900M for current fiscal year:
[Final Edition]
RICHARD HELM Journal Staff Writer. Edmonton Journal. Edmonton, Alta.:Jan
8, 1992. p. A8


Calgary wants McCoy to order end to strike by paramedics:[Final Edition]
Edmonton Journal. Edmonton, Alta.:Jan 8, 1992. p. A7


Cancer patient wants smoking ban:[Final Edition]
SHERRI AIKENHEAD Journal Staff Writer. Edmonton Journal. Edmonton, Alta.:
Jan 8, 1992. p. B3


First impressions often misleading; The West:[Final Edition]
GILLIAN STEWARD. Edmonton Journal. Edmonton, Alta.:Jan 8, 1992. p. A10

! All documents are reproduced with the permission of the copyright owner.
Further reproduction or distribution is prohibited without permission.

==============================================================================
Citation style: ProQuest Standard

Document 1 of 5












MLA renews calls to review mill's licence; Wapiti River threatened, he says:
[Final Edition]
RICK MCCONNELL Journal Staff Writer. Edmonton Journal. Edmonton, Alta.:Jan
8, 1992. p. A7

Author(s): RICK MCCONNELL Journal Staff Writer

Document types: NEWS

Dateline: Grande Prairie

Publication title: Edmonton Journal. Edmonton, Alta.: Jan 8, 1992. pg. A.7

Source type: Newspaper

ProQuest document 193222101
ID:

Text Word Count 586

Document URL: http://proquest.umi.com/
pqdweb?did=193222101&Fmt=3&clientId=14119&RQT=309&VName=PQD

Abstract (Document Summary)

Untreated effluent from Procter and Gamble's mill is leaking through a clay
liner into groundwater and threatening the nearby Wapiti River, says provincial
New Democrat MLA John McInnis.

Samples taken in May from 16 groundwater test wells near the mill show trace
levels of chlorinated organic compounds, created by bleaching pulp with
chlorine to turn it white. Results from five of the test wells, which McInnis
said were given him by "a concerned citizen," show no detectable toxins.

"Mr. McInnis would like a public review on everything. The only mill that Mr.
McInnis does not want to shut down is the mill at Weldwood, because that's the
mill that's represented by another socialist."



Full Text (586 words)

(Copyright The Edmonton Journal)

Grande Prairie

Fresh concerns about the safety of pollutants at a Grande Prairie pulp mill
have renewed calls for a public review of the operator's licence.

Untreated effluent from Procter and Gamble's mill is leaking through a clay
liner into groundwater and threatening the nearby Wapiti River, says pro

Now, each file contains several news stories. I would like to cut out each news story and output it to its own file, with its own unique file name.
So at the end, I would like the file 01.08.1992.txt to be broken up into 01.08.1992-1.txt, 01.08.1992-2.txt, 01.08.1992-3.txt and so on. Then I would like it to do the same thing to the file 01.05.1994.txt and all the other files in the directory (dozens, if not hundreds).

I've used the following sed command to get the text of each news story: sed -n '/Document.[0-9]/,/==========*/p' 01.08.1992.txt > somefilename.txt (this is where I start to have trouble). Note: a row of equal signs separates each news story in each day's file.

I also use egrep to get the number of each story on standard output. That is, each story on each day is labelled as Document 1 of 10, then the next is Document 2 of 10. I use egrep '^Document.[0-9]' *.txt and that gives me the digits 1, 2, 3, 4, 5, 6 until exhausted.

Surely there must be some way of having unix apply those digits to the end of the filename after the content has been split from it?

I hope you can help. I'm grateful for all your suggestions.
Simon
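
One possible way to combine the two commands above, as a minimal bash sketch: pull the story numbers out of the "Document N of M" lines with sed, then run the range command once per number and put that number into the output name. The scratch directory and the two-story sample file are made up for illustration, and the sketch assumes every story starts with a "Document N of M" line and ends at a row of equal signs.

```shell
#!/bin/bash
# Illustration only: work in a scratch directory with a made-up two-story sample.
cd "$(mktemp -d)" || exit 1
sep='=============================='
printf '%s\n' "$sep" 'Documents' 'Headline list' "$sep" \
    'Document 1 of 2' 'First story text' "$sep" \
    'Document 2 of 2' 'Second story text' > 01.08.1992.txt

for f in *.txt; do
    base=${f%.txt}
    # Print N for every "Document N of M" line in this file.
    sed -n 's/^Document \([0-9][0-9]*\) of .*/\1/p' "$f" |
    while read -r n; do
        # Copy story N: from its "Document N of" line to the next row of '='.
        sed -n "/^Document $n of /,/^==========*\$/p" "$f" > "$base-$n.txt"
    done
done
ls
```

On the sample this creates 01.08.1992-1.txt and 01.08.1992-2.txt next to the original. It should be run once, in a directory that holds only the original files, because a second run would also pick up the pieces it created (they match *.txt too).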
# 2  
Old 06-19-2007
And I'm working with bash. I'm not really familiar with the other shells. I'm new to this, so simplicity is a virtue.
# 3  
Old 06-19-2007
Quote:
Originally Posted by spindoctor
Hi.
I'm getting hundreds of .txt files, each containing the results of a database search from a newspaper. Each file contains the news stories from a particular day. They look as follows:

[SNIP]

Now, each file contains several news stories. I would like to cut out each news story and output it to its own file, with its own unique file name.
So at the end, I would like the file 01.08.1992.txt to be broken up into 01.08.1992-1.txt, 01.08.1992-2.txt, 01.08.1992-3.txt and so on. Then I would like it to do the same thing to the file 01.05.1994.txt and all the other files in the directory (dozens, if not hundreds).

This might do what you want:

Code:
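awk -v RS="\n=============*\n" '
FNR == 1 { n = 0; next }         # new input file: restart numbering, skip text before the first separator
{
   file = FILENAME
   sub( /.*\//,"",file )         # strip any directory part
   sub( /\.txt$/,"",file )       # strip the .txt extension
   file = file "-" ++n ".txt"    # e.g. 01.08.1992-1.txt
   print > file
   close( file )
}' *.txt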

# 4  
Old 06-20-2007
Quote:
Originally Posted by cfajohnson

This might do what you want:

Code:
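awk -v RS="\n=============*\n" '
FNR == 1 { n = 0; next }         # new input file: restart numbering, skip text before the first separator
{
   file = FILENAME
   sub( /.*\//,"",file )         # strip any directory part
   sub( /\.txt$/,"",file )       # strip the .txt extension
   file = file "-" ++n ".txt"    # e.g. 01.08.1992-1.txt
   print > file
   close( file )
}' *.txt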

Some awk versions don't support this kind of input record separator.
My AIX version of awk uses only the first character of RS.
Code:
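awk '
   /^==========*$/ {
      if (outfile) close(outfile);
      outfile = FILENAME;
      sub(/\.[^.]*$/, "-" seq++ "&", outfile);   # 01.08.1992.txt -> 01.08.1992-0.txt, -1.txt, ...
      next;
   }
   { 
      print $0 > outfile                         # every other line goes to the current output file
   }
' *.txt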

With your input sample in the file 01.08.1992.txt, this script generates two files: 01.08.1992-0.txt and 01.08.1992-1.txt
# 5  
Old 06-20-2007
my_split.sh:
Code:
csplit -k -f "$1." "$1" "/======================/" "{99}" > /dev/null 2>&1

Usage:
my_split.sh <input_file>
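
A variation on the same idea, as a sketch that assumes GNU csplit (the -b and -s options and the {*} repeat count are GNU extensions, so this will not work with every csplit): split on the "Document N" lines instead of the separator rows. Piece 0 then holds everything before the first story, and piece N holds story N. The scratch directory and the sample file are made up for illustration.

```shell
#!/bin/bash
# Illustration only: scratch directory and made-up two-story sample.
cd "$(mktemp -d)" || exit 1
sep='=============================='
printf '%s\n' "$sep" 'Documents' 'Headline list' "$sep" \
    'Document 1 of 2' 'First story text' "$sep" \
    'Document 2 of 2' 'Second story text' > 01.08.1992.txt

f=01.08.1992.txt
# -f sets the output prefix, -b the suffix format, -s silences the byte counts;
# '{*}' repeats the pattern until the input runs out (GNU csplit only).
csplit -s -f "${f%.txt}-" -b '%d.txt' "$f" '/^Document [0-9]/' '{*}'
ls
```

On the sample this leaves 01.08.1992-0.txt (the headline list, which can simply be deleted), 01.08.1992-1.txt and 01.08.1992-2.txt. A story file keeps the row of equal signs that follows it.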
# 6  
Old 06-20-2007
This is very useful. Part of the problem, however, is that the string "=============" also appears, for no good reason, at the top of each file. Thus, using it as the only record separator won't really work. I need to somehow capture the text between the pattern "Document.[0-9]" at the beginning and "===============" at the end.
I tried CFAJohnson's script and it generated thousands of empty .txt files in my folder, presumably because it relied solely on the "============" as the record separator.

I tried aigle's script and it returned the following error message:

awk: null file name in print or getline
input record number 1, file 01.08.1992.txt
source line number 9

Simon
# 7  
Old 06-20-2007
Quote:
Originally Posted by spindoctor
I tried aigle's script and it returned the following error message:

awk: null file name in print or getline
input record number 1, file 01.08.1992.txt
source line number 9

Simon
A possible cause of the problem may be that your input file doesn't start with a '===========' line. In that case outfile is still empty when the first print is executed.

Quote:
Originally Posted by spindoctor
This is very useful. Part of the problem, however, is that the string "=============" also appears, for no good reason, at the top of each file. Thus, using it as the only record separator won't really work. I need to somehow capture the text between the pattern "Document.[0-9]" at the beginning and "===============" at the end.
The following script captures the text between the pattern "Document.[0-9]" at the beginning and "===============" at the end.
The sequence number for the output files is taken from the "Document.[0-9]" line.
Code:
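awk '
   # Range: from a "Document N of M" line through the next row of equal signs
   /^Document [0-9]/,/^==========*$/ {
      if (/^Document [0-9]/) {
         # First line of a story: switch to a new output file
         if (outfile) close(outfile);
         seq = $2;                                  # N from "Document N of M"
         outfile = FILENAME;
         sub(/\.[^.]*$/, "-" seq "&", outfile);     # 01.08.1992.txt -> 01.08.1992-N.txt
      }
      print $0 > outfile
   }
' *.txt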
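
A possible refinement, offered as a sketch rather than a replacement: write the pieces into a separate directory (here called split, a name chosen only for illustration) so that a second run does not pick them up through *.txt, leave the separator row out of each story, and reset the state at the start of every input file. The scratch directory and the two sample files are made up.

```shell
#!/bin/bash
# Illustration only: scratch directory with two made-up input files.
cd "$(mktemp -d)" || exit 1
sep='=============================='
for day in 01.08.1992 01.05.1994; do
    printf '%s\n' "$sep" 'Documents' 'Headline list' "$sep" \
        'Document 1 of 2' "First story of $day" "$sep" \
        'Document 2 of 2' "Second story of $day" > "$day.txt"
done

mkdir -p split
awk '
   FNR == 1 { if (outfile) close(outfile); outfile = "" }  # new input file: forget the previous story
   /^Document [0-9]/ {                                     # first line of a story
      if (outfile) close(outfile)
      base = FILENAME
      sub(/\.[^.]*$/, "", base)                            # drop the extension
      outfile = "split/" base "-" $2 ".txt"                # $2 is the story number
   }
   /^==========*$/ { if (outfile) close(outfile); outfile = ""; next }  # separator row ends the story
   outfile != "" { print > outfile }                       # ignore lines outside a story
' *.txt
ls split
```

On the samples this creates four files in split, two per input file, each starting with its "Document N of M" line. Lines outside a story, such as the headline list at the top, are dropped instead of triggering the "null file name" error.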
