Split large XML into multiple files, with header and footer in each file


# 1  
Old 12-12-2018

I tried the command below. It splits the file unevenly, and I also need help adding the header and footer to each output file.
Command:
Code:
csplit -s -k -f my_XML_split.xml extrfile.xml "/<Document>/" {1}


Sample XML:
Code:
<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
	  ----
	  ---
  </Header>
  
  <Document>
  ---
  ---
  ---
  </Document>
   <Document>
  ---
  ---
  ---
  </Document>
   <Document>
  ---
  ---
  ---
  </Document>
  
 <Footer>
---
-- 
</Footer>

Moderator's Comments:
Please always wrap your code and samples of input and expected output in [CODE]your code[/CODE] tags to make your question clearer.

Last edited by RavinderSingh13; 2 Weeks Ago at 01:37 AM..
# 2  
Old 12-13-2018
Parsing XML isn't trivial, but we get asked for it all the time, so:

Code:
# yanx.awk v0.0.8, Tyler Montbriand, 2017.  Yet another noncompliant XML parser
###############################################################################
# XML is a pain to process in the shell, but people need it all the time.
# I've been using and improving this kludge since 2014 or so.  It parses and
# stacks tags and digests parameters, allowing simple XML processing and
# extraction to be managed with a handful of lines addendum.
#
# I've restricted my use of GNU features enough that this script will run on
# busybox's awk.  I think it works with mawk except -e is unsupported.
# You can work around that by running multiple files, i.e.
# mawk -f yanx.awk -f mystuff.awk inputfile
###############################################################################
# Basic use:
#
# Fed this XML, <body><html a="b">Your Web Browser Hates This</html></body>
# yanx will read it token-by-token as so:
#     Line 1:  Empty, skipped
#     Line 2:  $1="body"
#     Line 3:  $1="html a="b"", $2="Your Web Browser Hates This"
#     Line 4:  $1="/html"
#     Line 5:  $1="/body", $2="\n"
#
# The script sets a few new "special" variables along the way.
# TAG           The name of the current tag, uppercased.
# CTAG          If close-tag, name in uppercase.
# TAGS          List of nested tags, like HTML%BODY%, including current tag
# LTAGS         List of nested tags, not including current tag
# ARGS          Array of tag parameters, uppercased.  i.e. ARGS["HREF"]
# DEP           How many tags deep it's nested, including current tag.
#
###############################################################################
# Examples:
# # Rewrite cdata of all divs
# awk -f yanx.awk -e 'TAGS ~ /^DIV%/ { $2="quux froob" } 1' input
# # Extract href's from every link
# awk -f yanx.awk -e 'TAGS~/^A%/ && ("HREF" in ARGS) {
#       print ARGS["HREF"] }' ORS="\n" input
###############################################################################
# Known Bugs:
# A short XML script can't possibly handle DTD, etc.  Entities a la &lt;
# are not translated either.
#
# I've done my best to make it swallow <!--, <? ?> and other such fancy
# XML syntax without choking, but that doesn't mean it handles them
# properly either.
#
# It's an XML parser, not an HTML parser.  It probably won't swallow a
# wild-from-the internet HTML web page without some cleanup first:
# javascript, tags inside comments, etc will be mangled instead of ignored.
#
# Last: Because of its design, when printing raw HTML, yanx adds an extra <
# to the end of the file.  This is because < belongs at the beginning of
# a token but awk is told it's printed at the end.  There is no equivalent
# "line prefix" variable that I know of, if you want it to print smarter
# you'll have to print the <'s yourself, by setting ORS="" and
# printing lines like print "<" $0
###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuiop")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"

# substr() positions start at 1.  A start of 0 can drop the last character
# before the match in some awk implementations, so start at 1.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
(!SPEC) && match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
        TAG=""
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        # Update TAG with tag on top of stack, if any
#       if(DEP < 0) {   DEP=0;  TAG=""  }
#       else { TAG=TA[DEP]; }
}
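
For anyone wondering how the token-by-token reading described in the header comments comes about, here is a minimal standalone sketch. It is not part of yanx.awk; it only sets RS and FS the way the BEGIN block does and prints the first two fields of each record. The sample string is the one from the comments (without a trailing newline), and the "record N:" output format is my own choice for illustration.

```shell
# Standalone sketch: with RS="<" every record starts right after a "<",
# and with FS=">" field 1 is the tag text and field 2 is the data after it.
tokens=$(printf '%s' '<body><html a="b">Your Web Browser Hates This</html></body>' |
awk 'BEGIN { RS = "<"; FS = ">" }
     NR == 1 { next }    # the first record is the empty text before the first "<"
     { printf "record %d: $1=[%s] $2=[%s]\n", NR, $1, $2 }')
printf '%s\n' "$tokens"
```

Record 3 should come out with the whole open tag, attributes included, in $1 and the text in $2. That is why yanx.awk still has to pick the tag name and the parameters apart itself.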

You can use yanx.awk with this:

Code:
# xmlsplit.awk
BEGIN {
        ORS=""
        X="x."
        ROWS=5
}

# First pass, remember headers and footers
NR==FNR {
        if(F || TAG == "FOOTER")
        {
                if(!F) {
                        FTRSTART=FNR
                        F=1
                }
                FTR=FTR "<" $1 OFS $2
        }
        else if((!H) && (TAG == "DOCUMENT"))
        {
                HDREND=FNR
                H=1
        }
        else if(!H)     HDR=HDR "<" $1 OFS $2
        next
}

# Skip header and footer
(FNR < HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
        if(FILE) {
                print FTR > FILE
                close(FILE);
        }
        FILE=sprintf("%s%04d", X,++FILENUM);
        print HDR > FILE
        XNR++
}

{       print "<" $0 > FILE     }

CTAG == "DOCUMENT" { XNR++ }

END {   if(FILE) print FTR > FILE }

Like this:

Code:
# Yes, it's fed inputfile twice
awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" input input

With this input:

Code:
<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>001</Document>
<Document>002</Document>
<Document>003</Document>
<Document>004</Document>
<Document>005</Document>
<Document>006</Document>
<Document>007</Document>
<Document>008</Document>
<Document>009</Document>
<Document>010</Document>
<Document>011</Document>
<Document>012</Document>
<Document>013</Document>
<Document>014</Document>
<Document>015</Document>
<Document>016</Document>
<Document>017</Document>
<Document>018</Document>
<Document>019</Document>
<Document>020</Document>
<Document>021</Document>
<Document>022</Document>
<Document>023</Document>
<Document>024</Document>
<Document>025</Document>
<Document>026</Document>
<Document>027</Document>
<Document>028</Document>
<Document>029</Document>
<Document>030</Document>
<Document>031</Document>
<Document>032</Document>
<Document>033</Document>
<Document>034</Document>
<Document>035</Document>
<Document>036</Document>
<Document>037</Document>
<Document>038</Document>
<Document>039</Document>
<Document>040</Document>
<Document>041</Document>
<Document>042</Document>
<Document>043</Document>
<Document>044</Document>
<Document>045</Document>
<Document>046</Document>
<Document>047</Document>
<Document>048</Document>
<Document>049</Document>
<Document>050</Document>
<Document>051</Document>
<Document>052</Document>
<Document>053</Document>
<Document>054</Document>
<Document>055</Document>
<Document>056</Document>
<Document>057</Document>
<Document>058</Document>
<Document>059</Document>
<Document>060</Document>
<Document>061</Document>
<Document>062</Document>
<Document>063</Document>
<Document>064</Document>
<Document>065</Document>
<Document>066</Document>
<Document>067</Document>
<Document>068</Document>
<Document>069</Document>
<Document>070</Document>
<Document>071</Document>
<Document>072</Document>
<Document>073</Document>
<Document>074</Document>
<Document>075</Document>
<Document>076</Document>
<Document>077</Document>
<Document>078</Document>
<Document>079</Document>
<Document>080</Document>
<Document>081</Document>
<Document>082</Document>
<Document>083</Document>
<Document>084</Document>
<Document>085</Document>
<Document>086</Document>
<Document>087</Document>
<Document>088</Document>
<Document>089</Document>
<Document>090</Document>
<Document>091</Document>
<Document>092</Document>
<Document>093</Document>
<Document>094</Document>
<Document>095</Document>
<Document>096</Document>
<Document>097</Document>
<Document>098</Document>
<Document>099</Document>
<Document>100</Document>

 <Footer>
---
--
</Footer>

To produce output like this:

Code:
$ cat x.0001

<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>001</Document>
<Document>002</Document>
<Document>003</Document>
<Document>004</Document>
<Document>005</Document>
<Document>006</Document>
<Document>007</Document>
<Document>008</Document>
<Document>009</Document>
<Document>010</Document>
<Footer>
---
--
</Footer>

$ cat x.0010

<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>091</Document>
<Document>092</Document>
<Document>093</Document>
<Document>094</Document>
<Document>095</Document>
<Document>096</Document>
<Document>097</Document>
<Document>098</Document>
<Document>099</Document>
<Document>100</Document>

 <Footer>
---
--
</Footer>

$
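
To see the two-pass idea (and why the input is fed twice) without the XML parser getting in the way, here is a much-simplified sketch. It is not a replacement for the scripts above. It assumes every Document element sits on one line of its own, as in the test input, and that the footer starts at a line containing <Footer>. The names hdr, ftr, infooter, the part. prefix and the temporary directory are my own choices for illustration.

```shell
# Build a small sample: header, 5 one-line documents, footer.
tmp=$(mktemp -d)
{
  printf '<Recipient>\n<Header>h</Header>\n'
  for i in 1 2 3 4 5; do printf '<Document>%03d</Document>\n' "$i"; done
  printf '<Footer>f</Footer>\n'
} > "$tmp/input.xml"

# Pass 1 (NR == FNR) remembers header and footer; pass 2 writes the chunks.
awk -v OUT="$tmp/part." -v ROWS=2 '
NR == FNR {
    if (/<Footer>/ || infooter) { infooter = 1; ftr = ftr $0 "\n" }
    else if (/<Document>/)      { seen = 1 }          # header ends here
    else if (!seen)             { hdr = hdr $0 "\n" }
    next
}
/<Document>/ {
    if (n % ROWS == 0) {                 # time to start a new chunk
        if (file) { printf "%s", ftr > file; close(file) }
        file = sprintf("%s%04d", OUT, ++filenum)
        printf "%s", hdr > file
    }
    print > file
    n++
}
END { if (file) printf "%s", ftr > file }
' "$tmp/input.xml" "$tmp/input.xml"

ls "$tmp"
```

The header and footer are only known once the whole file has been seen, which is why the first pass has to finish before any output file is written.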

This User Gave Thanks to Corona688 For This Post:
Neo (12-14-2018)
# 3  
Old 12-14-2018
Hi Corona,

Thanks for your quick response with code.
Do I need to install any xml_splitter libraries on Unix? You have provided two big scripts; which one do I need to use?

I am new to scripting, so kindly assist with the above.

--- Post updated at 11:27 PM ---

I have created two files, yanx.awk and xml_split.awk, and ran the command below. It exited without creating any files. Please guide me on where the output file path is specified.
Code:
awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" input input

--- Post updated 12-14-18 at 06:14 AM ---

Hi Corona,

Please assist with the error below. I am able to split a few files, but for a few others I get this error:

Code:
awk: xmlsplit.awk:37: (FILENAME=sampletest11.xml FNR=4) fatal: can't redirect to `/0001' (Permission denied)

# 4  
Old 12-14-2018
It creates them in the current directory. If you want it to put them somewhere else, set the value of X.

Code:
awk -f yanx.awk -f xmlsplit.awk ROWS="10" X="/path/to/folder/outputname" input input

Please show exactly what you're doing, word for word, letter for letter, keystroke for keystroke. What you have posted is obviously not what you're doing; the filenames differ.

Use nawk on Solaris.

Last edited by Corona688; 12-14-2018 at 11:40 AM..
# 5  
Old 12-16-2018
Hi Corona,

Below are the steps I am following. sampletest11.xml is my sample file. Its XML nodes differ slightly: the body node is "Recipient" and party_ID is my footer, so I changed xml_split.awk accordingly.

The script worked fine with 200 records, but when the XML file has 18k records, which is the expected file size, it throws the exception below:

Code:
awk: xmlsplit.awk:37: (FILENAME=sampletest11.xml FNR=4) fatal: can't redirect to `/0001' (Permission denied)

Sample xml skeleton:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<DocumentSet>
    <Recipient>
        <Context>
            <TESTER>08</TESTER>
            <name>TEST</name>
            <Locale>en_AU</Locale>
            <Channel>kjsdhfuis</Channel>
            <UserId>8</UserId>
            <HLX>000000</HLX>
            <Key1>TEST1</Key1>
            <Key2>TEST2</Key2>
            <Key3>TEST3</Key3>
            <KeyID>hotdirectorytest</KeyID>
            <dummy2222>TEST7</dummy2222>
            <EffectiveFrom>20170612000000</EffectiveFrom>
            <Currency>AUD</Currency>
        </Context>
        <Document>
            <Form>
                <Name>TESTER2</Name>
                <Data>
                    <DocumentSetC>
                        <HeaderData>
                            <TESTER>08</TESTER>
                            <Channel>kjsdhfuis</Channel>
                            <UserId>X009189</UserId>
                            <HLX>000000</HLX>
                            <dummy>08VIC000000</dummy>
                            <Key1>TEST2</Key1>
                            <Key2>TEST3</Key2>
                            <Key3/>
                            <KeyID>TEST70</KeyID>
                            <dummy2222>Approval Letter</dummy2222>
                            <TEST7>APPA08120617206891</TEST7>
                            <EffectiveFrom>20170612000000</EffectiveFrom>
                            <HLX44>12345</HLX44>
                            <SystemDate>20170612</SystemDate>
                        </HeaderData>
                        <FormData>
                            <Name>TESTER2</Name>
                            <Context>
                                <UniqueDocID>1240525</UniqueDocID>
                                <dummy11112233>LEN_APP_0010_OUT</dummy11112233>
                                <TEST2ApprovedAmount>8989</TEST2ApprovedAmount>
                            </Context>
                            <ReceivingParty>
                                <Applicant>
                                    <TEST45456>sfdsfnsdfnff  </TEST45456>
                                </Applicant>
                                <IndividualDemographics>
                                
                                </IndividualDemographics>
                                <DeliveryChannel>POST</DeliveryChannel>
                                <NoOfCopies>1</NoOfCopies>
                            </ReceivingParty>
                            <Application>
                                <ProductGroups>
                            <TEST454567>sfdsfnsdfnff  </TEST454567>

                                </ProductGroups>
                            </Application>
                        </FormData>
                    </DocumentSetC>
                </Data>
            </Form>
            <TYP1>5</TYP1>
        </Document>
    </Recipient>
       <Recipient2> ---</Recipient2>
           ---------------
           -------------- 
           -----------------
            -----------------
          <Recipient18000> ---</Recipient18000>
    <PartyID>12345</PartyID>
 </DocumentSet>

Command :
Code:
$ awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" sampletest11.xml sampletest11.xml

--- Post updated at 11:44 PM ---

Kindly assist with the above. Is the issue caused by the file size or by the number of records in the file?




Moderator's Comments:
Please use CODE tags as required by forum rules!

Last edited by RudiC; 12-17-2018 at 04:07 AM.. Reason: Added CODE tags.
# 6  
Old 12-17-2018
X might not be the wisest variable name for the output files' path: yanx.awk uses X (conditionally) as the index in a for loop, so it may be overwritten IF the input file contains XML specification info. That might be the reason it works on a test file, if that file is missing the XML specs.

Try again but replace the X variable name with another, e.g. FP (for "file path") in xmlsplit.awk and on the command line, NOT in yanx.awk.
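
A minimal standalone sketch of that kind of clobbering, for illustration only. awk variables are global unless they are function parameters, so a for-in loop that uses X as its index overwrites an X set on the command line. The array key "/" is a hypothetical stand-in, picked only because it reproduces the `/0001' name from the error message; the thread does not show which key yanx.awk actually had in ARGS at that point.

```shell
# Standalone sketch: the loop index X is the same global X that holds
# the output prefix, so the prefix is lost after the loop runs.
result=$(awk -v X="x." '
BEGIN {
    ARGS["/"] = ""                    # hypothetical leftover array entry
    for (X in ARGS) delete ARGS[X]    # same loop shape as in yanx.awk
    printf "%s%04d\n", X, 1           # same filename format as xmlsplit.awk
}')
printf '%s\n' "$result"
```

Without the loop the name would be x.0001; with it, the prefix becomes whatever key the loop visited last.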
# 7  
Old 12-17-2018
Thanks for showing what input you actually have. What output do you actually want?

Code modified per RudiC's suggestion:

Code:
BEGIN {
	ORS=""
	OUT="x."
	ROWS=5
	ROWTAG="DOCUMENT"
	FTRTAG="FOOTER"
}

# First pass, remember headers and footers
NR==FNR {
	if(F || TAG == FTRTAG)
	{
		if(!F) {
			FTRSTART=FNR
			F=1
		}
		FTR=FTR RS $1 OFS $2
	}
	else if((!H) && (TAG == ROWTAG))
	{
		HDREND=FNR
		H=1
	}
	else if(!H)	HDR=HDR RS $1 OFS $2
	next
}

# Skip header and footer
(FNR < HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough ROWTAG records
((XNR%(ROWS+1)) == 0) {
	if(FILE) {
		print FTR > FILE
		close(FILE);
	}
	FILE=sprintf("%s%04d", OUT,++FILENUM);
	print HDR > FILE
	XNR++
}

{	print RS $0 > FILE	}

CTAG == ROWTAG { XNR++ }

END {	if(FILE) print FTR > FILE }

...but it won't work until I know what tags you're actually using for header and footer. Modify ROWTAG and FTRTAG accordingly.
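
The values set in BEGIN are only defaults, so they can be changed from the command line instead of editing the script. Below is a minimal standalone sketch of that mechanism; it does not run the splitter itself. RECIPIENT and PARTYID are guesses taken from the skeleton in post #5 (uppercased, since yanx.awk uppercases tag names), and /path/to/folder/outputname is the placeholder path from post #4.

```shell
# Standalone sketch: BEGIN sets defaults, and name=value operands on the
# command line override them before the input file that follows is read.
sample=$(mktemp)
printf 'one line of input\n' > "$sample"
settings=$(awk 'BEGIN { OUT = "x."; ROWS = 5; ROWTAG = "DOCUMENT"; FTRTAG = "FOOTER" }
     { print OUT, ROWS, ROWTAG, FTRTAG }' \
     OUT="/path/to/folder/outputname" ROWS=10 ROWTAG="RECIPIENT" FTRTAG="PARTYID" "$sample")
rm -f "$sample"
printf '%s\n' "$settings"
```

Note that with the modified script the output prefix is passed as OUT=, no longer as X=.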