Non trivial file splitting, saving with variable filename


 
# 1  
Old 06-15-2013

Hello,

Although I have found similar questions, I could not find advice that could help with our problem.

The issue:

We have a few thousand text files (books).

Each book has many chapters. Each chapter is identified by a cite-key. We need
to split each of those book files by chapters, having each chapter's cite-key as
file name.

Example of book file:

Code:
* Chapter 1 -- Branchial or Visceral Arches

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:1
  :END:


The Branchial or Visceral Arches and Pharyngeal Pouches. -- In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).



* Chapter 2 -- Dorsal and Ventral Diverticulum

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:2
  :END:


Each of the upper four pouches is prolonged into a dorsal and a ventral
diverticulum.

Over these pouches corresponding indentations of the ectoderm occur, forming 
what are known as the branchial or outer pharyngeal grooves.


[etc.]

After splitting, we would have a series of files in the same directory as the source:
dw-1.txt, dw-2.txt, etc., each containing only its own chapter.

As an example, file dw-2.txt would contain:

Code:
* Chapter 2 -- Dorsal and Ventral Diverticulum

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:2
  :END:


Each of the upper four pouches is prolonged into a dorsal and
a ventral diverticulum.

Over these pouches corresponding indentations of the ectoderm occur,
forming what are known as the branchial or outer pharyngeal grooves.

One may notice that those files use Org syntax. We are able to split them by
mapping a function with Emacs' (org-map-entries), but that process is far too
slow: the text files change often, and we need to re-split all the books
frequently.


Could anybody give me a hint on how to do that with awk or some other fast
shell tool?


Thank you very much.
# 2  
Old 06-15-2013
Hi, try:

Code:
awk '/\* Chapter/{close(f); p=x; f=x} /CITE-KEY/{f=tolower($2) ".txt"; $0=p$0 } !f{p=p $0 ORS} f{print >f}' file
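
For anyone reading along, here is the same one-liner spread over several lines with comments. It is only a sketch for study, not a different method. The file name book.org and the scratch directory are assumptions made for the demo, and the sample text is shortened from post #1.

```shell
# Work in a scratch directory; "book.org" is a made-up name for this demo.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > book.org <<'EOF'
* Chapter 1 -- Branchial or Visceral Arches

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:1
  :END:

The Branchial or Visceral Arches and Pharyngeal Pouches.

* Chapter 2 -- Dorsal and Ventral Diverticulum

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:2
  :END:

Each of the upper four pouches is prolonged into a dorsal and a ventral
diverticulum.
EOF

awk '
  # Chapter heading: close the previous output file, then empty the buffer (p)
  # and the current file name (f). x is never assigned, so it is "".
  /\* Chapter/ { close(f); p = x; f = x }

  # CITE-KEY line: $2 is the key (e.g. DW:1). Build the file name and put the
  # buffered heading lines back in front of the current line.
  /CITE-KEY/   { f = tolower($2) ".txt"; $0 = p $0 }

  # File name not known yet: keep buffering lines.
  !f           { p = p $0 ORS }

  # File name known: write the line (the first time, buffer included).
  f            { print > f }
' book.org

ls
```

Run against the sample, this should leave dw:1.txt and dw:2.txt next to book.org, each starting with its own heading line.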

# 3  
Old 06-15-2013
Hi,

It works beautifully, and it is amazingly fast!

The file names are written with a colon (dw:1.txt), which is not allowed on OS X.
Is there a way to have a dash instead, like dw-1.txt?

I must add that I had to take the word Chapter out of the pattern,
because many chapter headings do not include that word in their text.

So, I have been using:

Code:
awk '/\* /{close(f); p=x; f=x} /CITE-KEY/{f=tolower($2) ".txt"; $0=p$0 } !f{p=p $0 ORS} f{print >f}' file

I tried:

Code:
awk '/\* {close(f); p=x; f=x} /CITE-KEY/{f=tolower($2) ".txt"; $0=p$0 } !f{p=p $0 ORS} f{print >f}' file

But it throws:
Code:
awk: syntax error at source line 1
 context is
    /\* {close(f); p=x; f=x} >>>  /CITE-KEY/{ <<<
awk: bailing out at source line 1

Thank you so much. So much elegance in Awk. Truly inspiring.

Last edited by samask; 06-15-2013 at 02:43 PM.. Reason: Correct text
# 4  
Old 06-15-2013
Nice to hear you can appreciate awk's elegance. I am not aware of a restriction whereby colons would not be allowed in file names on OS X, but if you would like to use a dash, try:
Code:
awk '$1=="*"{close(f); p=f=x} /CITE-KEY/{f=tolower($2) ".txt"; sub(":","-",f); $0=p $0} !f{p=p $0 ORS} f{print >f}' file
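
With a few thousand books to split, one option is a single awk run over many files that writes each chapter next to its source. The following is only a sketch under assumptions: the books/ directory, the .org extension and the sample files are made up for the demo, and cite-keys are assumed to be unique within a directory. The heading test `$1=="*"` matches any top-level heading. The syntax error in the earlier attempt appears to come from `/\* {close(f)...` never closing the regex, so awk reads on up to the next slash.

```shell
# Scratch directory with two made-up books (names and text are assumptions).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
mkdir -p books/darwin books/gray

cat > books/darwin/book.org <<'EOF'
* Chapter 1 -- Branchial or Visceral Arches

  :PROPERTIES:
  :CITE-KEY: DW:1
  :END:

Text of the first chapter.

* Dorsal and Ventral Diverticulum

  :PROPERTIES:
  :CITE-KEY: DW:2
  :END:

Text of the second chapter.
EOF

cat > books/gray/book.org <<'EOF'
* Introduction

  :PROPERTIES:
  :CITE-KEY: GA:1
  :END:

Text from another book.
EOF

find books -name '*.org' -exec awk '
  # First line of each input file: drop state left over from the previous
  # book and remember the directory of the current one (with trailing slash).
  FNR == 1 {
    close(f); p = f = x
    dir = FILENAME
    if (!sub("/[^/]*$", "/", dir)) dir = ""
  }

  # Any top-level heading starts a new chapter.
  $1 == "*"  { close(f); p = f = x }

  # Build "dir/dw-1.txt" from the key, replacing the colon with a dash.
  /CITE-KEY/ { f = tolower($2) ".txt"; sub(":", "-", f); f = dir f; $0 = p $0 }

  !f         { p = p $0 ORS }   # file name unknown: buffer the line
  f          { print > f }      # file name known: write the line
' {} +

find books -name '*.txt'
```

Passing many files to one awk process with `-exec ... {} +` avoids starting awk once per book, which should matter when all the books are re-split frequently.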


Last edited by Scrutinizer; 06-15-2013 at 04:02 PM..
# 5  
Old 06-15-2013
Thank you, it works perfectly.

I can see it uses a different approach. Now I can learn more. :)

Such brevity and, at the same time, such expressiveness: that is why I feel AWK is so elegant.

Thank you so much, once again.

Last edited by samask; 06-15-2013 at 04:17 PM.. Reason: Edit mispell