Splitting XML file on basis of line number into multiple file Post: 302902249


Post 302902249 by Don Cragun on Monday 19th of May 2014 03:05:49 PM

05-19-2014


It looks like ShriniShoo has given you code that will work fine as long as:

you always want to store the output in files named ABCD_part?.xml (no matter what the input file name is),
your input file always has a number of records that is a positive integral multiple of the number of output files you want to create,
you only have one input file to process,
and you want to read your input files twice.

If you want code that:

produces output files based on the input file name,
handles input files with zero or more records,
can process multiple input files,
only reads your input files once,
verifies that each input file has the number of input lines indicated by the TotalRecord tag,
prints status information for each input file processed, and
returns a non-zero exit status if one or more of the input files is malformed,

you could try something like:

Code:

#!/bin/ksh
awk '
function eofcheck(	e, i) {
	# Close output files for previous input file.
	for(i = 1; i <= nf; i++)
		close(of[i])
	# Perform end-of-file error checks...
	if(tlp == ntl) return
	e = 0
	for(i = 1; i <= nf; i++)
		if(c[i] > 0) {
			printf("\t*** Missing %d+%d records for part %d.\n",
				int(c[i] / lpr), (c[i] % lpr) > 0, i)
			e = 1
		}
	if(e) ec = 2
	else {	printf("\t*** Expected %d trailer line%s; found %d.\n", ntl,
			ntl == 1 ? "" : "s", tlp)
		ec = 3
	}
}
BEGIN {	if(lpr == 0) lpr = 15	# lines per record (default 15)
	if(nf == 0) nf = 4	# # of output files (default 4)
	if(nhl == 0) nhl = 7	# # of header lines (default 7)
	if(ntl == 0) ntl = 1	# # of trailer lines (default 1)
	ec = 0			# final exit code
}
FNR == 1 {
	# If this is not the first input file, perform EOF checks on the
	# previous file.
	if(NR > 1) eofcheck()
	# Generate output filenames...
	for(i = 1; i <= nf; i++)
		of[i] = substr(FILENAME, 1, length(FILENAME) - 4) "_part" i \
			substr(FILENAME, length(FILENAME) - 3)
	# Set temporary value for ftl (it will be recalculated when we process
	# the TotalRecord tag).
	ftl = 1
	# Clear number of trailer lines printed for current file.
	tlp = 0
}
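# ftl == 1 means no TotalRecord tag has been seen yet; only compare FNR
# against ftl once it holds a real trailer line number.
FNR <= nhl || (ftl > 1 && FNR >= ftl) {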
	# Look for input record count.
	if(split($0, rc, /<\/*TotalRecord>/) != 3 || rc[2] !~ /^[0-9]+$/) {
		# Copy other header lines and the trailer to all output files...
		for(i = 1; i <= nf; i++)
			print > of[i]
		# Count number of trailer lines printed.
		if(FNR >= ftl) tlp++
		next
	}
	# We have the header line that defines the number of records present.
	irc = rc[2]		# input record count
	rpf = int(irc / nf)	# base output records / file
	rem = irc % nf		# records left over after even split among files
	printf("Found TotalRecord header in %s, %d input records.\n", FILENAME,
		irc)
	for(i = 1; i <= nf; i++) {
		# Calculate # of records for each output file.
		c[i] = rpf + (rem >= i)
		# Print TotalRecord tag header lines.
		printf("%s<TotalRecord>%d</TotalRecord>%s\n", rc[1], c[i],
			rc[3]) > of[i]
		printf("\tPreparing to write %d records to %s\n", c[i], of[i])
		# Convert count for each file from records to lines.
		c[i] *= lpr
	}
	# Calculate First Trailer Line number and initialize output file number.
	ftl = nhl + 1 + lpr * irc	# line # of 1st trailer line
	ofn = 1			# output file number
	tlp = 0			# # of trailer lines printed
	next
}
ftl == 1 {
	# TotalRecord tag not found.
	printf("TotalRecord tag not found in %s headers; aborting.\n", FILENAME)
	ec = 99
	exit
}
{	# Copy data lines to appropriate output file.
	while(c[ofn]-- <= 0)
		if(++ofn > nf) {
			printf("Internal error: FNR=%d, ftl=%d, ofn=%d\n",
				FNR, ftl, ofn)
			ec = 98
			exit
		}
	print > of[ofn]
}
END {	# An exit in a rule above still runs END; keep that exit status
	# instead of letting eofcheck() replace it.
	if(ec < 98) eofcheck()
	exit ec
}' "$@"

As always, if you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
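
If the same script has to run on Solaris and on other systems, the choice of awk can be made at run time. This is only a minimal sketch: it assumes uname -s reports SunOS on the systems that need the substitution, and pick_awk and AWK are names made up for this example.

```shell
# Print the awk to use: an XPG awk (or nawk) on Solaris/SunOS, plain awk elsewhere.
pick_awk() {
	if [ "$(uname -s)" = SunOS ]
	then
		for candidate in /usr/xpg4/bin/awk /usr/xpg6/bin/awk
		do
			if [ -x "$candidate" ]
			then
				printf '%s\n' "$candidate"
				return 0
			fi
		done
		# Neither XPG awk is installed; fall back to nawk.
		printf '%s\n' nawk
	else
		printf '%s\n' awk
	fi
}
AWK=$(pick_awk)
```

The script would then start with "$AWK" in place of the literal awk.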

But, of course, this doesn't meet the conflicting requirements you have posted in this thread: You said that the TotalRecord tags are on line 7 in your headers, but your sample header has it on line 6. This code looks for a TotalRecord tag on any line in the headers. It will give you an error if no tag is found. It will produce multiple TotalRecord tags if more than one appears and use data from the last one found. If more than one set of TotalRecord tags appears on a single line, all of the tags on that line will be silently ignored. (Producing an error in these cases is left as an exercise for the reader.)
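
To see how the TotalRecord test behaves, the split() call can be tried on its own. The following is just an illustration, not part of the script above: classify_line is a made-up helper, and reporting "malformed" is one possible way to flag the cases that the script silently ignores.

```shell
# Read lines on standard input and report how the TotalRecord test sees each.
classify_line() {
	awk '{
		# The tags act as field separators: exactly one open/close pair
		# yields three pieces (text before, the count, text after).
		n = split($0, rc, /<\/*TotalRecord>/)
		if(n < 2)
			print "no tag"
		else if(n == 3 && rc[2] ~ /^[0-9]+$/)
			print "count " rc[2]
		else	print "malformed"
	}'
}
printf '%s\n' '  <TotalRecord>12</TotalRecord>' | classify_line
```

A line holding two sets of tags splits into five pieces, and a non-numeric count fails the second test, so both are reported as malformed here.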

You said you wanted the same number of records in the first three files and any additional records added to the last file. This code spreads any extra records out such that if there is one extra record, it will go into the first output file; if there are two extra records, one will go into each of the first two output files; and if there are three extra records, one will go into each of the first three output files. (This made error checking simpler in cases where there are fewer records in the input file than there are output files. And, I think it makes more sense to do it this way. If you disagree, feel free to modify the code to partition output records the way you want.)
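
The arithmetic behind that distribution can be checked in isolation. This is a small sketch using the same expression as the script, c[i] = rpf + (rem >= i); show_split is a made-up name, and its two arguments are the input record count and the number of output files.

```shell
# Print the number of records each output file would receive.
show_split() {
	awk -v irc="$1" -v nf="$2" 'BEGIN {
		rpf = int(irc / nf)	# base output records / file
		rem = irc % nf		# records left over after even split
		for(i = 1; i <= nf; i++)
			printf("%s%d", (i > 1 ? " " : ""), rpf + (rem >= i))
		printf("\n")
	}'
}
show_split 10 4		# prints: 3 3 2 2
show_split 3 4		# prints: 1 1 1 0
```

The second call shows the case with fewer records than output files: the last file still gets its headers and trailer, with a record count of zero.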

The awk script is fully parameterized to accept any positive number of header lines, any positive number of trailer lines, any positive number of lines per record, and any positive number of output files per input file (up to the limit your system's awk places on the number of open files), but adding getopts code to parse options to this script to override the defaults is left as an exercise for the reader.
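
One possible starting point for that exercise is sketched below. The option letters (-f, -h, -l, -t) and the names parse_splitter_opts and nopts are invented for this example; only the variable names nf, nhl, lpr and ntl and their defaults come from the script above.

```shell
# Hypothetical options: -f output files, -h header lines,
# -l lines per record, -t trailer lines.
parse_splitter_opts() {
	nf=4 nhl=7 lpr=15 ntl=1		# same defaults as the awk BEGIN block
	OPTIND=1
	while getopts f:h:l:t: opt "$@"
	do
		case $opt in
		f|h|l|t)	;;
		*)	echo "usage: splitter [-f files] [-h hdrlines] [-l lines] [-t trlines] file..." >&2
			return 2;;
		esac
		case $OPTARG in
		''|*[!0-9]*)
			echo "splitter: -$opt needs a non-negative integer" >&2
			return 2;;
		esac
		case $opt in
		f)	nf=$OPTARG;;
		h)	nhl=$OPTARG;;
		l)	lpr=$OPTARG;;
		t)	ntl=$OPTARG;;
		esac
	done
	nopts=$((OPTIND - 1))		# number of arguments consumed by options
}
# Intended use at the top of the script (shown as comments only):
#	parse_splitter_opts "$@" || exit 2
#	shift "$nopts"
#	awk -v nf="$nf" -v nhl="$nhl" -v lpr="$lpr" -v ntl="$ntl" '...' "$@"
```

Since the BEGIN block treats a value of 0 as "not set", passing 0 for any of these options would simply select the default.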

If you save the above code in a file named splitter and make it executable (chmod +x splitter), you can invoke it as:

Code:

./splitter ABCD.xml

to split ABCD.xml into four files named ABCD_part1.xml through ABCD_part4.xml. If you give it additional file operands, it will split all of the given files.

This code assumes that it is working on XML files, but doesn't enforce any naming convention. Note, however, that this code assumes that the input file pathnames end with a period followed by a three-character filename extension (such as .xml or .XML). If an input pathname contains fewer than four characters, the results are unspecified. Adding checks for this situation is left as an exercise for the reader.
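
As a hint for that exercise, the output name could be derived from the position of the last period in the final pathname component instead of from a fixed length. The sketch below is one way to do it under that assumption; outname is a made-up helper, and pathnames containing backslashes would need extra care because awk processes escape sequences in -v assignments.

```shell
# Print the output pathname for part number $2 of input pathname $1.
outname() {
	awk -v path="$1" -v part="$2" 'BEGIN {
		base = path
		ext = ""
		# Look for an extension only in the last pathname component,
		# and do not treat a leading dot (hidden file) as one.
		n = split(path, seg, "/")
		last = seg[n]
		if(match(last, /\.[^.]*$/) && RSTART > 1) {
			ext = substr(last, RSTART)
			base = substr(path, 1, length(path) - length(ext))
		}
		print base "_part" part ext
	}'
}
outname ABCD.xml 2		# prints: ABCD_part2.xml
outname dir.d/data 1		# prints: dir.d/data_part1
```

With this approach a pathname without any extension simply gets _partN appended.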

Last edited by Don Cragun; 05-19-2014 at 04:13 PM.. Reason: Change 8 to nhl + 1.


