Split large XML into multiple files, each with header and footer


 
# 1  

I tried the command below.
It splits the file unevenly, and I also need help adding the header and footer to each output file.
Command:
Code:
csplit -s -k -f my_XML_split.xml extrfile.xml "/<Document>/" {1}
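On why the split comes out uneven: `{1}` repeats the pattern only one more time, so csplit stops cutting after the second match and everything after that lands in the last piece. A minimal sketch, assuming GNU csplit (for `'{*}'`), a cut-down stand-in for extrfile.xml, and a `my_XML_split.` prefix chosen for illustration. It still does not add the header and footer to each piece:

```shell
# Work in a scratch directory so the pieces do not clutter the current one.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical, cut-down stand-in for extrfile.xml.
cat > extrfile.xml <<'EOF'
<Recipient>
  <Header>
  </Header>
  <Document>
  1
  </Document>
  <Document>
  2
  </Document>
  <Document>
  3
  </Document>
  <Footer>
  </Footer>
EOF

# '{*}' (a GNU extension) repeats the pattern until the input runs out,
# so every <Document> line starts a new piece.
csplit -s -k -f my_XML_split. extrfile.xml '/<Document>/' '{*}'

pieces=$(ls my_XML_split.* | wc -l)
echo "pieces: $pieces"
```

The first piece holds everything before the first `<Document>`, and the last piece holds the final document plus the footer.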


Sample XML:
Code:
<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
	  ----
	  ---
  </Header>
  
  <Document>
  ---
  ---
  ---
  </Document>
   <Document>
  ---
  ---
  ---
  </Document>
   <Document>
  ---
  ---
  ---
  </Document>
  
 <Footer>
---
-- 
</Footer>

Moderator's Comments:
Please always wrap your code and samples of input and expected output in [CODE]your code[/CODE] tags to make your question clearer.

Last edited by RavinderSingh13; 02-06-2019 at 01:37 AM..
# 2  
Parsing XML isn't trivial, but we get asked for it all the time, so:

Code:
# yanx.awk v0.0.8, Tyler Montbriand, 2017.  Yet another noncompliant XML parser
###############################################################################
# XML is a pain to process in the shell, but people need it all the time.
# I've been using and improving this kludge since 2014 or so.  It parses and
# stacks tags and digests parameters, allowing simple XML processing and
# extraction to be managed with a handful of lines addendum.
#
# I've restricted my use of GNU features enough that this script will run on
# busybox's awk.  I think it works with mawk except -e is unsupported.
# You can work around that by running multiple files, i.e.
# mawk -f yanx.awk -f mystuff.awk inputfile
###############################################################################
# Basic use:
#
# Fed this XML, <body><html a="b">Your Web Browser Hates This</html></body>
# yanx will read it token-by-token as so:
#     Line 1:  Empty, skipped
#     Line 2:  $1="body"
#     Line 3:  $1="html a="b"", $2="Your Web Browser Hates This"
#     Line 4:  $1="/html"
#     Line 5:  $1="/body", $2="\n"
#
# The script sets a few new "special" variables along the way.
# TAG           The name of the current tag, uppercased.
# CTAG          If close-tag, name in uppercase.
# TAGS          List of nested tags, like HTML%BODY%, including current tag
# LTAGS         List of nested tags, not including current tag
# ARGS          Array of tag parameters, uppercased.  i.e. ARGS["HREF"]
# DEP           How many tags deep it's nested, including current tag.
#
###############################################################################
# Examples:
# # Rewrite cdata of all divs
# awk -f yanx.awk -e 'TAGS ~ /^DIV%/ { $2="quux froob" } 1' input
# # Extract href's from every link
# awk -f yanx.awk -e 'TAGS~/^A%/ && ("HREF" in ARGS) {
#       print ARGS["HREF"] }' ORS="\n" input
###############################################################################
# Known Bugs:
# A short XML script can't possibly handle DTD, etc.  Entities a la &lt;
# are not translated either.
#
# I've done my best to make it swallow <!--, <? ?> and other such fancy
# XML syntax without choking, but that doesn't mean it handles them
# properly either.
#
# It's an XML parser, not an HTML parser.  It probably won't swallow a
# wild-from-the internet HTML web page without some cleanup first:
# javascript, tags inside comments, etc will be mangled instead of ignored.
#
# Last: Because of its design, when printing raw HTML, yanx adds an extra <
# to the end of the file.  This is because < belongs at the beginning of
# a token but awk is told it's printed at the end.  There is no equivalent
# "line prefix" variable that I know of, if you want it to print smarter
# you'll have to print the <'s yourself, by setting ORS="" and
# printing lines like print "<" $0
###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuiop")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"

# substr() positions start at 1.  A start of 0 is not treated the same way
# by every awk, so use 1 to get the text before the match everywhere.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
(!SPEC) && match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
        TAG=""
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        # Update TAG with tag on top of stack, if any
#       if(DEP < 0) {   DEP=0;  TAG=""  }
#       else { TAG=TA[DEP]; }
}
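To see the tokenizing on its own, here is a minimal sketch that uses only the RS and FS settings from the BEGIN block above, fed the example XML from the header comments. It is not yanx.awk itself, and the `tokens` variable name is just for illustration:

```shell
# RS="<" makes every tag start a new record; FS=">" separates the tag ($1)
# from the text that follows it ($2).  Record 1 is the empty text before
# the first "<", so it is skipped.
tokens=$(printf '%s' '<body><html a="b">Your Web Browser Hates This</html></body>' |
awk 'BEGIN { RS="<"; FS=">" }
     NR > 1 { printf "record %d: $1=[%s] $2=[%s]\n", NR, $1, $2 }')

echo "$tokens"
```

Each output line corresponds to one "Line N" entry in the comment block at the top of yanx.awk.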

You can use it with this:

Code:
# xmlsplit.awk
BEGIN {
        ORS=""
        X="x."
        ROWS=5
}

# First pass, remember headers and footers
NR==FNR {
        if(F || TAG == "FOOTER")
        {
                if(!F) {
                        FTRSTART=FNR
                        F=1
                }
                FTR=FTR "<" $1 OFS $2
        }
        else if((!H) && (TAG == "DOCUMENT"))
        {
                HDREND=FNR
                H=1
        }
        else if(!H)     HDR=HDR "<" $1 OFS $2
        next
}

# Skip header and footer
(FNR < HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
        if(FILE) {
                print FTR > FILE
                close(FILE);
        }
        FILE=sprintf("%s%04d", X,++FILENUM);
        print HDR > FILE
        XNR++
}

{       print "<" $0 > FILE     }

CTAG == "DOCUMENT" { XNR++ }

END {   if(FILE) print FTR > FILE }
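The `XNR%(ROWS+1)` test can look odd at first: XNR is bumped once when a file is opened and once per closed DOCUMENT, so it reaches a multiple of ROWS+1 exactly when a file is full. A minimal simulation of just that counter, assuming one loop step per document and ROWS=3 (no XML or output files involved, and `plan` is a name made up for this sketch):

```shell
plan=$(awk 'BEGIN {
    ROWS = 3
    for (doc = 1; doc <= 7; doc++) {
        # Same rollover test as in xmlsplit.awk: start a new file and
        # spend one count on it.
        if ((XNR % (ROWS + 1)) == 0) { FILENUM++; XNR++ }
        printf "doc %d -> file %04d\n", doc, FILENUM
        XNR++    # stands in for the CTAG == "DOCUMENT" rule
    }
}')

echo "$plan"
```

With these values, documents 1 to 3 go to file 0001, 4 to 6 to file 0002, and 7 to file 0003.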

Like this:

Code:
# Yes, it's fed inputfile twice
awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" input input
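The reason the input is named twice is the `NR==FNR` idiom: NR counts records across all files while FNR restarts for each file, so the two are equal only during the first pass. A minimal sketch of the same idea on a throwaway three-line file (the file content and the `twopass` variable are made up for illustration):

```shell
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"

# Pass 1 (NR == FNR): only collect information, print nothing.
# Pass 2: use what pass 1 learned while printing.
twopass=$(awk 'NR == FNR { total++; next }
               { print FNR "/" total ": " $0 }' "$tmp" "$tmp")

echo "$twopass"
rm -f "$tmp"
```

xmlsplit.awk uses the first pass the same way, to remember the header and footer before any output file is written.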

With this input:

Code:
<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>001</Document>
<Document>002</Document>
<Document>003</Document>
<Document>004</Document>
<Document>005</Document>
<Document>006</Document>
<Document>007</Document>
<Document>008</Document>
<Document>009</Document>
<Document>010</Document>
<Document>011</Document>
<Document>012</Document>
<Document>013</Document>
<Document>014</Document>
<Document>015</Document>
<Document>016</Document>
<Document>017</Document>
<Document>018</Document>
<Document>019</Document>
<Document>020</Document>
<Document>021</Document>
<Document>022</Document>
<Document>023</Document>
<Document>024</Document>
<Document>025</Document>
<Document>026</Document>
<Document>027</Document>
<Document>028</Document>
<Document>029</Document>
<Document>030</Document>
<Document>031</Document>
<Document>032</Document>
<Document>033</Document>
<Document>034</Document>
<Document>035</Document>
<Document>036</Document>
<Document>037</Document>
<Document>038</Document>
<Document>039</Document>
<Document>040</Document>
<Document>041</Document>
<Document>042</Document>
<Document>043</Document>
<Document>044</Document>
<Document>045</Document>
<Document>046</Document>
<Document>047</Document>
<Document>048</Document>
<Document>049</Document>
<Document>050</Document>
<Document>051</Document>
<Document>052</Document>
<Document>053</Document>
<Document>054</Document>
<Document>055</Document>
<Document>056</Document>
<Document>057</Document>
<Document>058</Document>
<Document>059</Document>
<Document>060</Document>
<Document>061</Document>
<Document>062</Document>
<Document>063</Document>
<Document>064</Document>
<Document>065</Document>
<Document>066</Document>
<Document>067</Document>
<Document>068</Document>
<Document>069</Document>
<Document>070</Document>
<Document>071</Document>
<Document>072</Document>
<Document>073</Document>
<Document>074</Document>
<Document>075</Document>
<Document>076</Document>
<Document>077</Document>
<Document>078</Document>
<Document>079</Document>
<Document>080</Document>
<Document>081</Document>
<Document>082</Document>
<Document>083</Document>
<Document>084</Document>
<Document>085</Document>
<Document>086</Document>
<Document>087</Document>
<Document>088</Document>
<Document>089</Document>
<Document>090</Document>
<Document>091</Document>
<Document>092</Document>
<Document>093</Document>
<Document>094</Document>
<Document>095</Document>
<Document>096</Document>
<Document>097</Document>
<Document>098</Document>
<Document>099</Document>
<Document>100</Document>

 <Footer>
---
--
</Footer>

To produce output like this:

Code:
$ cat x.0001

<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>001</Document>
<Document>002</Document>
<Document>003</Document>
<Document>004</Document>
<Document>005</Document>
<Document>006</Document>
<Document>007</Document>
<Document>008</Document>
<Document>009</Document>
<Document>010</Document>
<Footer>
---
--
</Footer>

$ cat x.0010

<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>091</Document>
<Document>092</Document>
<Document>093</Document>
<Document>094</Document>
<Document>095</Document>
<Document>096</Document>
<Document>097</Document>
<Document>098</Document>
<Document>099</Document>
<Document>100</Document>

 <Footer>
---
--
</Footer>

$

# 3  
Hi Corona,

Thanks for your quick response with code.
Do I need to install any XML splitter libraries on the Unix system? You have provided two big scripts; which one do I need to use?

I am new to scripting; kindly assist with the above.

--- Post updated at 11:27 PM ---

I have created two files, yanx.awk and xml_split.awk, and ran the command below. It exited without creating any files. Please tell me where the output file path is set.
Code:
awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" input input

--- Post updated 12-14-18 at 06:14 AM ---

Hi Corona,

Please assist with the error below. I am able to split some files, but for others I get this error:

Code:
awk: xmlsplit.awk:37: (FILENAME=sampletest11.xml FNR=4) fatal: can't redirect to `/0001' (Permission denied)

# 4  
It creates them in the current directory. If you want it to put them somewhere else, set the value of X.

Code:
awk -f yanx.awk -f xmlsplit.awk ROWS="10" X="/path/to/folder/outputname" input input
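For reference, the output name is just the prefix followed by a zero-padded counter, built with the same `sprintf("%s%04d", ...)` format as in xmlsplit.awk. A minimal sketch showing what three different prefixes produce (the `names` variable is made up for illustration):

```shell
names=$(awk 'BEGIN {
    printf "%s%04d\n", "x.", 1                          # default prefix
    printf "%s%04d\n", "/path/to/folder/outputname", 1  # prefix with a path
    printf "%s%04d\n", "/", 1                           # prefix of just "/"
}')

echo "$names"
```

The last line comes out as `/0001`, the same name as in the reported error, which suggests the prefix was no longer "x." by the time the file was opened.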

Please show exactly what you're doing, word for word, letter for letter, keystroke for keystroke. What you have posted is obviously not what you're doing: the filenames differ.

Use nawk on Solaris.

Last edited by Corona688; 12-14-2018 at 11:40 AM..
# 5  
Hi Corona,

Below are the steps I am following. sampletest11.xml is my sample file. Its XML nodes differ slightly: the body node is "Recipient" and party_ID is my footer, so I changed xml_split.awk accordingly.

The script worked fine with 200 records, but when the XML file has 18k records, which is the expected size, it throws the error below:

Code:
awk: xmlsplit.awk:37: (FILENAME=sampletest11.xml FNR=4) fatal: can't redirect to `/0001' (Permission denied)

Sample xml skeleton:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<DocumentSet>
    <Recipient>
        <Context>
            <TESTER>08</TESTER>
            <name>TEST</name>
            <Locale>en_AU</Locale>
            <Channel>kjsdhfuis</Channel>
            <UserId>8</UserId>
            <HLX>000000</HLX>
            <Key1>TEST1</Key1>
            <Key2>TEST2</Key2>
            <Key3>TEST3</Key3>
            <KeyID>hotdirectorytest</KeyID>
            <dummy2222>TEST7</dummy2222>
            <EffectiveFrom>20170612000000</EffectiveFrom>
            <Currency>AUD</Currency>
        </Context>
        <Document>
            <Form>
                <Name>TESTER2</Name>
                <Data>
                    <DocumentSetC>
                        <HeaderData>
                            <TESTER>08</TESTER>
                            <Channel>kjsdhfuis</Channel>
                            <UserId>X009189</UserId>
                            <HLX>000000</HLX>
                            <dummy>08VIC000000</dummy>
                            <Key1>TEST2</Key1>
                            <Key2>TEST3</Key2>
                            <Key3/>
                            <KeyID>TEST70</KeyID>
                            <dummy2222>Approval Letter</dummy2222>
                            <TEST7>APPA08120617206891</TEST7>
                            <EffectiveFrom>20170612000000</EffectiveFrom>
                            <HLX44>12345</HLX44>
                            <SystemDate>20170612</SystemDate>
                        </HeaderData>
                        <FormData>
                            <Name>TESTER2</Name>
                            <Context>
                                <UniqueDocID>1240525</UniqueDocID>
                                <dummy11112233>LEN_APP_0010_OUT</dummy11112233>
                                <TEST2ApprovedAmount>8989</TEST2ApprovedAmount>
                            </Context>
                            <ReceivingParty>
                                <Applicant>
                                    <TEST45456>sfdsfnsdfnff  </TEST45456>
                                </Applicant>
                                <IndividualDemographics>
                                
                                </IndividualDemographics>
                                <DeliveryChannel>POST</DeliveryChannel>
                                <NoOfCopies>1</NoOfCopies>
                            </ReceivingParty>
                            <Application>
                                <ProductGroups>
                            <TEST454567>sfdsfnsdfnff  </TEST454567>

                                </ProductGroups>
                            </Application>
                        </FormData>
                    </DocumentSetC>
                </Data>
            </Form>
            <TYP1>5</TYP1>
        </Document>
    </Recipient>
       <Recipient2> ---</Recipient2>
           ---------------
           -------------- 
           -----------------
            -----------------
          <Recipient18000> ---</Recipient18000>
    <PartyID>12345</PartyID>
 </DocumentSet>

Command :
Code:
$ awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" sampletest11.xml sampletest11.xml

--- Post updated at 11:44 PM ---

Kindly assist with the above. Is the issue caused by the file size or by the number of records in the file?




Moderator's Comments:
Please use CODE tags as required by forum rules!

Last edited by RudiC; 12-17-2018 at 04:07 AM.. Reason: Added CODE tags.
# 6  
X might not be the wisest variable name for the output files' path: yanx.awk uses it (conditionally) as the index in a for loop IF the input file contains xml specification info, so it may be overwritten. That might also be why it works on a test file that lacks the xml specs.

Try again, but replace the X variable name with another, e.g. FP (for "file path"), in xmlsplit.awk and on the command line, NOT in yanx.awk.
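A minimal sketch of the clobbering itself, independent of yanx.awk: a for-in loop at main level assigns to the global variable, while the same loop inside a function whose parameter list names X leaves the global alone. The key "/" is chosen for illustration because it matches the `/0001` name in the error; whether that is exactly what happened in the poster's run is an assumption. The `safe` function and `clobber` variable are made up for this example:

```shell
clobber=$(awk '
function safe(ARR, X) {            # X is a parameter here, so it is local
    for (X in ARR) delete ARR[X]
}
BEGIN {
    X = "x."
    ARGS["/"] = ""
    safe(ARGS)
    print "after function: " X

    ARGS["/"] = ""
    for (X in ARGS) delete ARGS[X]  # main-level loop writes to the global X
    print "after main loop: " X
}')

echo "$clobber"
```

Renaming the prefix variable, as suggested, avoids the collision without touching yanx.awk.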
# 7  
Thanks for showing what input you actually have. What output do you actually want?

Code modified per RudiC's suggestion:

Code:
BEGIN {
	ORS=""
	OUT="x."
	ROWS=5
	ROWTAG="DOCUMENT"
	FTRTAG="FOOTER"
}

# First pass, remember headers and footers
NR==FNR {
	if(F || TAG == FTRTAG)
	{
		if(!F) {
			FTRSTART=FNR
			F=1
		}
		FTR=FTR RS $1 OFS $2
	}
	else if((!H) && (TAG == ROWTAG))
	{
		HDREND=FNR
		H=1
	}
	else if(!H)	HDR=HDR RS $1 OFS $2
	next
}

# Skip header and footer
(FNR < HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
	if(FILE) {
		print FTR > FILE
		close(FILE);
	}
	FILE=sprintf("%s%04d", OUT,++FILENUM);
	print HDR > FILE
	XNR++
}

{	print RS $0 > FILE	}

CTAG == ROWTAG { XNR++ }

END {	if(FILE) print FTR > FILE }

...but it won't work until I know what tags you're actually using for rows and footer. Modify ROWTAG and FTRTAG accordingly.
