Splitting a delimited text file
# 8  
Old 04-28-2014
Quote:
Originally Posted by jethrow
Code:
awk 'NR>1 {print > (OFN=FILENAME"."(NR-1)); close(OFN)}' RS="--dump[^\n]*" file

EDIT:
... implemented this above ...
On my system (HP-UX 11.31) I get:
Code:
awk: Input line Disposition: attachm cannot be longer than 3,000 bytes.
The input line number is 53. The file is qsubmit.processed.dump.
The source line number is 1.

FYI, the input file contains emails as large as several megabytes (because of MIME-encoded attachments).

Thanks!

---------- Post updated at 11:48 AM ---------- Previous update was at 11:47 AM ----------

Quote:
Originally Posted by Don Cragun
It is hard to get csplit (and split) to drop the delimiter lines.
Dropping them is ideal, but not necessarily a problem for me, as I can "grep -v" to remove them in a second pass.
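
(A minimal sketch of that second pass, assuming every delimiter line begins with --dump and the pieces carry csplit's default xx prefix:)

Code:
for i in xx*; do
  grep -v '^--dump' $i > $i.clean   # drop the delimiter lines from each piece
done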

---------- Post updated at 03:50 PM ---------- Previous update was at 11:48 AM ----------

Ok, I got what I needed using this. Thank you all for the helpful ideas, it got me pointed down the right path.

Code:
# split the input at every line containing -dump- ; -n 5 gives five-digit suffixes (xx00000, xx00001, ...)
csplit -n 5 $1 /-dump-/ {*}

# drop the first two lines of each piece (the -dump- delimiter and the line after it),
# then remove the intermediate xx* file
for i in $(ls xx*); do
  awk 'NR > 2' $i > ./output/$i.eml
  rm $i
done
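
Note that the loop writes into ./output, so that directory has to exist before the script runs:

Code:
mkdir -p ./output    # the > ./output/$i.eml redirection fails if the directory is missing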

# 9  
Old 04-28-2014
Quote:
Originally Posted by lupin..the..3rd
Ok, I got what I needed using this. Thank you all for the helpful ideas, it got me pointed down the right path.

Just out of curiosity, why did you decide not to use the awk script I suggested?
Code:
awk '
/^--dump/ {
	if(ofn != "") close(ofn)		# close the previous message file, if any
	ofn = sprintf("message:%07d", ++f)	# switch to the next sequentially numbered output file
	next					# do not copy the delimiter line itself
}
{	print > ofn				# every other line goes to the current message file
}' dump

It only invokes awk once (instead of once per extracted message) and only reads and writes the data found in your input file once (instead of twice), so it should be considerably faster.
# 10  
Old 04-28-2014
Quote:
Originally Posted by Don Cragun
Just out of curiosity, why did you decide not to use the awk script I suggested?
Thank you Don, I agree that it is more elegant to invoke awk only once rather than twice. Even more so when you consider that the dump files I need to split up are ~4 GB in size and contain ~12,000 emails each.

But at least on my HP-UX 11.31 servers, I get the following awk error:

Code:
itl1 # ./script.sh ./admin.inbox
awk: A print or getline function must have a file name.
 The input line number is 1. The file is ./admin.inbox.
 The source line number is 7.
itl1 # cat script.sh
awk '
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1
itl1 #
itl1 # uname -a
HP-UX itl1 B.11.31 U ia64 3456089508 unlimited-user license
itl1 #

---------- Post updated at 04:31 PM ---------- Previous update was at 04:19 PM ----------

Update: Trying the same thing on RHEL6, I get the following error:
Code:
[root@email root]# ./script.sh ./sub.proc
awk: cmd. line:6: (FILENAME=./sub.proc FNR=1) fatal: expression for `>' redirection has null string value
[root@email root]# cat script.sh
#!/bin/sh
awk '
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1
[root@email root]# uname -a
Linux email.dev 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Dec 13 06:58:20 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@email root]#

# 11  
Old 04-28-2014
If your input files contain 12,000 messages, your script is invoking awk 12,000 times, not 2 times!

In your sample input in the first message in this thread, you showed that the 1st line in your input file started with --dump. From those error messages, I have to assume that the 1st line of your real input file does not start with that string, so awk hits print > ofn while ofn is still an empty string.

If the data before the 1st line in your file starting with --dump is a mail message you want to keep, change:
Code:
awk '

to:
Code:
awk '
BEGIN {	ofn = sprintf("message:%07d", ++f)
}

otherwise, change it to:
Code:
awk '
BEGIN {	ofn = "/dev/null"
}

Note that on many filesystem types, putting 12,000 files in a single directory may make processing files in that directory slow. You might want to consider creating intermediate directories to reduce the number of files per directory.
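
One way to do that, as an untested sketch (it assumes your awk supports system() and the in operator, picks an arbitrary batchNNN/ naming scheme, and uses the /dev/null form of the BEGIN clause), is to fold the bucketing into ofn:
Code:
awk '
BEGIN {	ofn = "/dev/null"
}
/^--dump/ {
	if(ofn != "") close(ofn)
	f++
	dir = sprintf("batch%03d", int(f / 1000))	# roughly 1,000 messages per subdirectory
	if(!(dir in made)) {
		system("mkdir -p " dir)			# create each subdirectory the first time it is needed
		made[dir] = 1
	}
	ofn = sprintf("%s/message:%07d", dir, f)
	next
}
{	print > ofn
}' dump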
# 12  
Old 04-28-2014
Quote:
Originally Posted by Don Cragun
If your input files contain 12,000 messages, your script is invoking awk 12,000 times; not 2 times!

In your sample input in the first message in this thread you showed that the 1st line in your input file started with --dump. From those error messages, I have to assume that the 1st line of your input file does not start with that string.
Yep, you caught me there. The first line of the input file does not begin with "--dump". There are a few lines of metadata at the head of the file; this metadata can be discarded. After those few lines of metadata, it's all "--dump"-delimited emails.

Thanks again for the suggestions, I'll try them when I'm back in the office tomorrow morning.
# 13  
Old 04-30-2014
Ok, just like you said, this worked perfectly for me, so I'll be using this on my server. THANK YOU!


Code:
[root@email root]# cat script.sh
#!/bin/sh
awk '
BEGIN { ofn = "/dev/null"
}
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1

# 14  
Old 04-30-2014
Quote:
Originally Posted by lupin..the..3rd
Ok, just like you said, this worked perfectly for me, so I'll be using this on my server. THANK YOU!
I'm glad it worked for you. In the future, please be sure that you fully describe your input file format so we can avoid providing solutions that do what you asked for, but not what you needed.