Howdy folks, I've got a very large plain text file that I need to split into many smaller files. My script-fu is not powerful enough for this, so any assistance is much appreciated.
The file is a database dump from Cyrus IMAP server. It's basically a bunch of emails (thousands) all concatenated into one huge file. There is a delimiter line between each email. It looks something like this
So as you can see, each email is preceded by a line that begins with "--dump".
What I'm looking for is:
1. To split this monolithic file into many smaller files, where each smaller file contains a single email.
2. Where each smaller file should contain all of the lines of text after a "--dump" delimiter, up until the next "--dump" delimiter (or end of file).
3. And the "--dump" delimiter line itself should not be included in each smaller file.
I feel like some awk/grep/sed magic could do this, but I'm not enough of a wizard to write this script.
It is hard to get csplit (and split) to drop the delimiter lines.
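To illustrate the point about csplit, here is a small sketch assuming GNU csplit (the -z option and the '{*}' repeat count are GNU extensions, and the file names and sample lines are made up for the demonstration). Every piece still starts with its delimiter line, so a second pass is needed to strip it:

```shell
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Hypothetical two-message input; a real dump will look different.
printf '%s\n' '--dump 1' 'first message' '--dump 2' 'second message' > dump

# GNU csplit: -s quiet, -z drop empty pieces, -f output prefix,
# '{*}' repeat the pattern until the input is exhausted.
csplit -s -z -f piece. dump '/^--dump/' '{*}'

# Each piece begins with the delimiter line, so delete line 1 of every
# piece in a second pass and keep the result under a new name.
for f in piece.*; do
    sed '1d' "$f" > "$f.mail" && rm "$f"
done

ls
```

The extra loop is the awkward part: csplit decides where to cut, but has no option to discard the line it cut on.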
The awk script jethro provided is giving me some files just containing an empty line and some files just containing "dump". And, on many systems, this code will run out of file descriptors when you're processing a file containing a lot of mail messages.
Assuming that your file containing the dump of the mail messages is named dump, you might try something like:
If you're running this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
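A minimal sketch of this kind of approach follows. It is not necessarily the script that was originally posted; the sample input, the output names msg.1, msg.2, ... and the variable names are assumptions for illustration. It matches delimiter lines with an ordinary pattern (so RS is left alone), skips the delimiter itself, and closes each output file before opening the next:

```shell
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Hypothetical sample input; a real Cyrus dump will look different.
cat > dump <<'EOF'
--dump 0001
From: alice@example.com
Subject: first

Body of message one.
--dump 0002
From: bob@example.com
Subject: second

Body of message two.
EOF

awk '
/^--dump/ {                      # delimiter line: start a new output file
    if (out != "") close(out)    # release the previous file first
    out = "msg." (++n)           # hypothetical names: msg.1, msg.2, ...
    next                         # do not copy the delimiter line itself
}
out != "" { print > out }        # copy message lines; ignore text before the first delimiter
' dump

ls msg.*
```

Because only one output file is open at any moment, the number of messages in the dump does not matter.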
Hi jethro,
Using the script you provided:
on Mac OS X I get:
The standards explicitly say that it is unspecified whether:
is interpreted as:
(as it is in awk on OS X) or as:
(as it is on your system).
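The snippets being compared are not shown above. As an assumption, one well-known case of this kind is an unparenthesized concatenation used as the target of an output redirection: something like print > "file." n may be read either as print > ("file." n) or as (print > "file.") n. Parenthesizing the target avoids the question. A small sketch with hypothetical file names:

```shell
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# The parentheses force "concatenate first, then redirect",
# so each input line lands in its own file: part.1, part.2.
printf 'one\ntwo\n' | awk '{
    out = "part." NR
    print $0 > (out)
    close(out)
}'

cat part.1 part.2
```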
Changing your code to:
with a file named file containing:
I get 6 files as shown here:
where file.1 and file.4 contain only a <newline> character; file.2 and file.5 contain "dump" and the line terminating <newline> character; and file.3 and file.6 contain the requested mail messages plus the tail end of the headers: file.3: file.6:
Note that there are two empty lines at the end of both of the above files. I believe that only one empty line was expected.
The number of file descriptors used by awk for open input and output streams is implementation-defined. Some versions of awk used to limit you to 9 open files. Most systems today allow around 1024 or 2048 file descriptors per process and, unless you have privileges to raise that limit before you invoke awk, awk won't be able to have more files open than the number of file descriptors available to it. You may have noticed that my script closed the previous output file before opening the next output file. This is usually a much better practice unless you know that your script will open fewer than ten files in its lifetime.
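As a rough illustration of the close-before-open habit, the sketch below writes 3000 small files, more than a typical 1024-descriptor limit, while keeping only one open at a time. The count and the file names are arbitrary choices for the demonstration, and ulimit -n is assumed to be supported by your shell (it is in bash, ksh, and dash):

```shell
# Soft limit on open file descriptors for this shell and its children.
ulimit -n

workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Write 3000 one-line files, closing each before the next is opened.
awk 'BEGIN {
    for (i = 1; i <= 3000; i++) {
        out = "f." i
        print "message " i > out
        close(out)
    }
}'

ls | wc -l
```

What happens if the close(out) call is dropped depends on the awk implementation and on the descriptor limit in effect, which is exactly why closing explicitly is the safer habit.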
---------------
OOPS. I originally said that file.2 and file.5 contained "code". That has been corrected above. They contain "dump"; not "code".
Last edited by Don Cragun; 04-26-2014 at 01:24 AM..
Reason: Fix typo.
I forgot to mention that the standards specify that the first character in the awk variable RS is used as the record separator. If RS contains more than one character, the standards explicitly state that the behavior is unspecified. (It appears that the awk on jethro's system treats RS as an ERE while the awk on OS X only uses the first character of RS.)
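A quick way to see which behavior a given awk has is to count records on a tiny input. This is only a probe sketch (the input string is made up): a result of 2 suggests RS is being treated as a whole string or ERE, while 3 suggests only the first character, "-", is used as the separator.

```shell
# Count the records awk sees when RS holds more than one character.
# The answer depends on the awk implementation, so no particular
# number is expected here.
records=$(printf 'a--dumpb\n' | awk 'BEGIN { RS = "--dump" } END { print NR }')
echo "records seen: $records"
```

Matching the delimiter with an ordinary line pattern such as /^--dump/ and leaving RS at its default avoids depending on either behavior.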
Last edited by Don Cragun; 04-26-2014 at 02:37 AM..
Reason: Fix typo.