Splitting a delimited text file
# 8  
Old 04-28-2014
Quote:
Originally Posted by jethrow
Code:
awk 'NR>1 {print > (OFN=FILENAME"."(NR-1)); close(OFN)}' RS="--dump[^\n]*" file

EDIT:
... implemented this above ...
On my system (HP-UX 11.31) I get:
Code:
awk: Input line Disposition: attachm cannot be longer than 3,000 bytes.
The input line number is 53. The file is qsubmit.processed.dump.
The source line number is 1.

FYI, the input file contains emails as large as several megabytes (because of MIME-encoded attachments).

Thanks!

---------- Post updated at 11:48 AM ---------- Previous update was at 11:47 AM ----------

Quote:
Originally Posted by Don Cragun
It is hard to get csplit (and split) to drop the delimiter lines.
Dropping them is ideal, but not necessarily a problem for me, as I can "grep -v" to remove them in a second pass.
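
(A minimal sketch of that second pass, assuming every delimiter line begins with --dump and the pieces carry csplit's default xx prefix:)

Code:
for i in xx*; do
  grep -v '^--dump' $i > $i.clean   # drop the delimiter lines from each piece
done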

---------- Post updated at 03:50 PM ---------- Previous update was at 11:48 AM ----------

Ok, I got what I needed using this. Thank you all for the helpful ideas, it got me pointed down the right path.

Code:
# split the input at every line containing -dump- ; -n 5 gives five-digit suffixes (xx00000, xx00001, ...)
csplit -n 5 $1 /-dump-/ {*}

# drop the first two lines of each piece (the -dump- delimiter and the line after it),
# then remove the intermediate xx* file
for i in $(ls xx*); do
  awk 'NR > 2' $i > ./output/$i.eml
  rm $i
done
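
Note that the loop writes into ./output, so that directory has to exist before the script runs:

Code:
mkdir -p ./output    # the > ./output/$i.eml redirection fails if the directory is missing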

# 9  
Old 04-28-2014
Quote:
Originally Posted by lupin..the..3rd
Ok, I got what I needed using this. Thank you all for the helpful ideas, it got me pointed down the right path.

Just out of curiosity, why did you decide not to use the awk script I suggested?
Code:
awk '
/^--dump/ {
	if(ofn != "") close(ofn)		# close the previous message file, if any
	ofn = sprintf("message:%07d", ++f)	# switch to the next sequentially numbered output file
	next					# do not copy the delimiter line itself
}
{	print > ofn				# every other line goes to the current message file
}' dump

It only invokes awk once (instead of once per extracted message) and only reads and writes the data found in your input file once (instead of twice), so it should be considerably faster.
# 10  
Old 04-28-2014
Quote:
Originally Posted by Don Cragun
Just out of curiosity, why did you decide not to use the awk script I suggested?
Thank you Don, I agree that it is more elegant to invoke awk only once rather than twice. Even more so when you consider that the dump files I need to split up are ~4 GB in size and contain ~12,000 emails each.

But at least on my HP-UX 11.31 servers, I get the following awk error:

Code:
itl1 # ./script.sh ./admin.inbox
awk: A print or getline function must have a file name.
 The input line number is 1. The file is ./admin.inbox.
 The source line number is 7.
itl1 # cat script.sh
awk '
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1
itl1 #
itl1 # uname -a
HP-UX itl1 B.11.31 U ia64 3456089508 unlimited-user license
itl1 #

---------- Post updated at 04:31 PM ---------- Previous update was at 04:19 PM ----------

Update: Trying the same thing on RHEL6, I get the following error:
Code:
[root@email root]# ./script.sh ./sub.proc
awk: cmd. line:6: (FILENAME=./sub.proc FNR=1) fatal: expression for `>' redirection has null string value
[root@email root]# cat script.sh
#!/bin/sh
awk '
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1
[root@email root]# uname -a
Linux email.dev 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Dec 13 06:58:20 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@email root]#

# 11  
Old 04-28-2014
If your input files contain 12,000 messages, your script is invoking awk 12,000 times, not 2 times!

In your sample input in the first message in this thread, you showed that the 1st line in your input file started with --dump. From those error messages, I have to assume that the 1st line of your real input file does not start with that string, so awk hits print > ofn while ofn is still an empty string.

If the data before the 1st line in your file starting with --dump is a mail message you want to keep, change:
Code:
awk '

to:
Code:
awk '
BEGIN {	ofn = sprintf("message:%07d", ++f)
}

otherwise, change it to:
Code:
awk '
BEGIN {	ofn = "/dev/null"
}

Note that on many filesystem types, putting 12,000 files in a single directory may make processing files in that directory slow. You might want to consider creating intermediate directories to reduce the number of files per directory.
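
One way to do that, as an untested sketch (it assumes your awk supports system() and the in operator, picks an arbitrary batchNNN/ naming scheme, and uses the /dev/null form of the BEGIN clause), is to fold the bucketing into ofn:
Code:
awk '
BEGIN {	ofn = "/dev/null"
}
/^--dump/ {
	if(ofn != "") close(ofn)
	f++
	dir = sprintf("batch%03d", int(f / 1000))	# roughly 1,000 messages per subdirectory
	if(!(dir in made)) {
		system("mkdir -p " dir)			# create each subdirectory the first time it is needed
		made[dir] = 1
	}
	ofn = sprintf("%s/message:%07d", dir, f)
	next
}
{	print > ofn
}' dump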
# 12  
Old 04-28-2014
Quote:
Originally Posted by Don Cragun
If your input files contain 12,000 messages, your script is invoking awk 12,000 times; not 2 times!

In your sample input in the first message in this thread you showed that the 1st line in your input file started with --dump. From those error messages, I have to assume that the 1st line of your input file does not start with that string.
Yep, you caught me there. The first line of the input file does not begin with "--dump". There are a few lines of metadata at the head of the file; this metadata can be discarded. After those few lines of metadata, it's all "--dump"-delimited emails.

Thanks again for the suggestions, I'll try them when I'm back in the office tomorrow morning.
# 13  
Old 04-30-2014
Ok, just like you said, this worked perfectly for me, so I'll be using this on my server. THANK YOU!


Code:
[root@email root]# cat script.sh
#!/bin/sh
awk '
BEGIN { ofn = "/dev/null"
}
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1

# 14  
Old 04-30-2014
Quote:
Originally Posted by lupin..the..3rd
Ok, just like you said, this worked perfectly for me, so I'll be using this on my server. THANK YOU!
I'm glad it worked for you. In the future, please be sure that you fully describe your input file format so we can avoid providing solutions that do what you asked for, but not what you needed.