Split large XML into multiple files with header and footer in each file


 
# 29  
Old 02-18-2019
Hi Don,

My apologies for confusing you again. The awk commands are working fine and split the file correctly, as expected.

I hope I am not confusing you further:

1) If my input file name is sampletest_111.xml, the awk command produces file names like sampletest_111.xml.0001.
2) sampletest_111.xml.0001 is then renamed to Extrfile111.xml.
3) When there are multiple input files, awk splits them and creates unique files, but
the piece of code below does not rename the files in sequence; everything ends up in one file.
Expected output: Extrfile111.xml, Extrfile1112.xml, etc., i.e. unique names.

Code:
for f in ../Inbound/sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
  done


Total code :
Code:
#!/bin/sh

#pass all Input files to array
FileList=($(ls | grep "sampletest*\\_[0-9]"))
  
echo  "$FileList"  

#loop array for Input files

for x in "${FileList[@]}"
do
 #for each element in array
 

#File Split Begin
awk -f xml_tag_handler.awk -f File_split.awk OUT=$x"" ROWS="500" $x $x
mv $x ../
done

rm Response.xml Extr*.xml

for f in sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
         echo "$f"
  done

# add all files to array
arr=($(ls | grep "Extrfile[0-9]*.xml"))


 #loop array
for i in "${arr[@]}"
do
 #for each element in array
  echo "$i"

   sed -i '/<com1:URI>/c\<com1:URI>file:///tmp/karthik/'$i'</com1:URI>' soaprequest.xml
  
#WebService Call Begin
sleep 5
curl --header "Content-Type: text/xml;charset=UTF-8" --data @soaprequest.xml {WSDLURL} --insecure >> Response.xml
echo ":Webservice call Begin"
done

  sed -i '/<com1:URI>/c\<com1:URI>file:///tmp/karthik/'$i'</com1:URI>' soaprequest.xml
  
echo ":Webservice call End"

NEW_VAR=$(awk -v sq="'" -F'<ns11:Job_Id>' '
		{	for(i = 2; i <= NF; i++) {
				sub(/<.*/, "", $i)
				printf("%s%s", cnt++ ? "," : sq, $i)
			}
		}
		END {	print sq
		}' Response.xml	
	)

printf 'NEW_VAR has been assigned the value: %s\n' "$NEW_VAR"

#End Web Service Call


xml_tag_handler.awk:
Code:
###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuiop")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"

# !?!?!
# function rbefore(STR)   { return(substr(STR, N, RSTART-1)); }# before match
# substr() positions start at 1; a start of 0 loses a character in some awk implementations
function rbefore(STR)     { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
(!SPEC) && match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
        TAG=""
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        # Update TAG with tag on top of stack, if any
#       if(DEP < 0) {   DEP=0;  TAG=""  }
#       else { TAG=TA[DEP]; }
}

File_split.awk
Code:
BEGIN {
        ORS=""
        #OUT="x."
        ROWS=5
        ROWTAG="^RECIPIENT[0-9]*$"
        HDRTAG="^DOCUMENTSET$"
        FTRTAG="^DOCUMENTSET$"
}

# First pass, remember headers and footers
NR==FNR {
        if(!HDREND)
        {
                HDR=HDR RS $1 OFS $2
                if(TAG ~ HDRTAG) HDREND=FNR
                next
        }

        if(FTRSTART || (CTAG ~ FTRTAG))
        {
                FTR=FTR RS $1 OFS $2
                if(CTAG ~ FTRTAG) FTRSTART=FNR
        }

        next
}

# Skip header and footer
(FNR <= HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
#       printf("FNR==%d XNR==%d FILE=%s\n", FNR, XNR, FILE)>"/dev/stderr"
        if(!length(OUT)) FBASE=FILENAME "."
                else FBASE = OUT "."
				
        if(FILE) {
                print FTR > FILE
                close(FILE);
        }

        FILE=sprintf("%s%04d", FBASE,++FILENUM);
        print HDR > FILE
        XNR++
}

{       print RS $0 > FILE      }

CTAG ~ ROWTAG { XNR++ }

END {   if(FILE) print FTR > FILE       }

Post #8 in this thread has the sample XML structure for your reference.

Last edited by karthik; 02-18-2019 at 01:40 AM..
# 30  
Old 02-18-2019
Hi karthik,
I am not confused at all this time. Please go back and read closely what I said in post #28!

I don't care how well your awk script works when you invoke it with the name of a file to be processed. The script you showed us in post #27 NEVER EVER under any circumstances invokes awk; not even once! And, if your script doesn't run awk, it just plain is not possible that awk is splitting anything.

Showing us a few hundred more lines of awk code doesn't alter the fact that you are never running that code.

Until you change the code so that it initializes the FileList array correctly, there is nothing else to talk about. If you change the sixth line in your script from:
Code:
echo "$FileList"

to:
Code:
echo "Files to be processed: ${FileList[@]}"

and look at the output that line produces when you run your script, maybe you'll believe me. And, yes, I noticed that you changed the way you initialized that array from:
Code:
FileList=($(ls | grep "../Inbound/sampletest*\\_[0-9]"))

to:
Code:
FileList=($(ls | grep "sampletest*\\_[0-9]"))

but it doesn't alter the fact that the FileList array will still be an empty array and your awk script will never be executed. The empty line produced by the echo in your script should have been a strong indication to you that something was wrong, but you seem to be ignoring that fact. With the above change, hopefully it will be crystal clear.

The grep utility takes a basic regular expression (BRE) as its first operand, not a filename matching pattern. BREs and filename matching patterns have some similarities, but they are not the same. Since none of your filenames contain a literal backslash character (i.e. \), the grep can't match any lines in the output produced by ls.
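
For illustration only: the list of input files could also be built with the shell's own filename matching, which sidesteps the BRE question entirely. This is a rough sketch, not the script from this thread; list_inputs is a made-up helper name, and the temporary directory and file names are sample data.

```shell
# Hypothetical helper: print the input files in a directory using a
# filename pattern (glob) instead of ls | grep.
list_inputs() {
    dir=$1
    for f in "$dir"/sampletest_[0-9]*.xml; do
        [ -e "$f" ] || continue    # the pattern matched nothing
        basename "$f"
    done
}

# Sample data in a scratch directory.
demo=$(mktemp -d)
touch "$demo/sampletest_111.xml" "$demo/sampletest_112.xml" \
      "$demo/sampletest_111.xml.0001" "$demo/other.xml"
list_inputs "$demo"
rm -r "$demo"
```

A filename pattern has to match the whole name, so split output such as sampletest_111.xml.0001 is not picked up here.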

You would do well to change the second line in your script from an empty line to:
Code:
set -xv

to enable tracing so you can actually see what your script is doing.
# 31  
Old 02-18-2019
Hi Don,

See below: it is able to find the input files. I have pasted my output in debug mode. It renames only one file, to Extrfile112.xml, whereas it ignores (or is not able to read) sampletest_111.xml. That is the issue.
Code:
+ FileList=($(ls | grep "sampletest*\\_[0-9]"))
++ ls
++ grep 'sampletest*\_[0-9]'
+ echo sampletest_111.xml
sampletest_111.xml
+ echo 'Files to be processed: sampletest_111.xml' sampletest_112.xml
Files to be processed: sampletest_111.xml sampletest_112.xml
+ for x in '"${FileList[@]}"'
+ awk -f xml_tag_handler.awk -f File_split.awk OUT=sampletest_111.xml ROWS=500 sampletest_111.xml sampletest_111.xml
+ mv sampletest_111.xml ../
+ for x in '"${FileList[@]}"'
+ awk -f xml_tag_handler.awk -f File_split.awk OUT=sampletest_112.xml ROWS=500 sampletest_112.xml sampletest_112.xml
+ mv sampletest_112.xml ../
+ rm Response.xml 'Extr*.xml'
rm: cannot remove `Extr*.xml': No such file or directory
+ for f in 'sampletest_*'
+ TMP=Extrfile112.xml.0001
+ mv sampletest_112.xml.0001 Extrfile112.xml
+ echo sampletest_112.xml.0001
sampletest_112.xml.0001
+ arr=($(ls | grep "Extrfile[0-9]*.xml"))
++ ls
++ grep 'Extrfile[0-9]*.xml'
+ for i in '"${arr[@]}"'
+ echo Extrfile112.xml
Extrfile112.xml
+ sed -i '/<com1:URI>/c\<com1:URI>file:///tmp/karthik/Extrfile112.xml</com1:URI>' soaprequest.xml
+ sleep 5
+ curl --header 'Content-Type: text/xml;charset=UTF-8' --data @soaprequest.xml https://cobodmsoa-vip.dev4.cbd.extnp.national.com.au:8002/DWSAL1/PublishingService --insecure
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3196    0  1631  100  1565    793    761  0:00:02  0:00:02 --:--:--  1603
+ echo ':Webservice call Begin'

# 32  
Old 02-18-2019
OK. You lucked out... The BRE sampletest*\\_[0-9] tells grep to match and print lines that contain the string sampletes, followed by zero or more occurrences of t, followed by whatever unspecified characters are matched by the character sequence \_ on the regular expression matching engine used on your operating system, followed by a decimal digit. It looks like your operating system's RE matching engine chooses to use that sequence to match an underscore character.

To match the filenames you want to process, the following BRE would work more reliably:
Code:
grep 'sampletest_[0-9][0-9]*.xml'

If you want to exclude matching filenames like sampletest_112.xml.0001, you could force the xml to only be matched at the end of a filename with:
Code:
grep 'sampletest_[0-9][0-9]*.xml$'
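
One further tightening, offered as a sketch rather than a required change: in a BRE an unescaped . matches any single character, so escaping it and anchoring the pattern at both ends matches only names of the exact form sampletest_<digits>.xml. The names variable and the match_inputs helper are made up for this example and stand in for the output of ls.

```shell
# Sample names, one per line, standing in for the output of ls.
names='sampletest_111.xml
sampletest_112.xml
sampletest_112.xml.0001
sampletest_113_xml'

# Anchored at both ends; \. matches only a literal dot.
match_inputs() {
    grep '^sampletest_[0-9][0-9]*\.xml$'
}

printf '%s\n' "$names" | match_inputs
```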

Now that we have gotten past that... What statement in your script is failing to do what you want it to do? What are the arguments being passed to that command according to the trace output you're seeing? What arguments did you hope would be passed to that command instead of the arguments that are actually being passed to that command?
# 33  
Old 02-18-2019
Hi Don,

I have corrected the grep command as suggested. Now the issue is that, after the split, multiple files are created with endings like the ones below, but the mv command moves all of them to one single file, Extrfile110.xml. It should create a new, unique file each time. Kindly suggest where I am going wrong.

ex:
sampletest_111.xml.0001
sampletest_112.xml.000i
Code:
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0001
+ mv sampletest_110.xml.0001 Extrfile110.xml
+ echo sampletest_110.xml.0001
sampletest_110.xml.0001
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0002
+ mv sampletest_110.xml.0002 Extrfile110.xml
+ echo sampletest_110.xml.0002
sampletest_110.xml.0002
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0003
+ mv sampletest_110.xml.0003 Extrfile110.xml
+ echo sampletest_110.xml.0003
sampletest_110.xml.0003
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0004
+ mv sampletest_110.xml.0004 Extrfile110.xml
+ echo sampletest_110.xml.0004
sampletest_110.xml.0004
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0005
+ mv sampletest_110.xml.0005 Extrfile110.xml
+ echo sampletest_110.xml.0005
sampletest_110.xml.0005
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0006
+ mv sampletest_110.xml.0006 Extrfile110.xml
+ echo sampletest_110.xml.0006
sampletest_110.xml.0006
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0007
+ mv sampletest_110.xml.0007 Extrfile110.xml
+ echo sampletest_110.xml.0007
sampletest_110.xml.0007
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0008
+ mv sampletest_110.xml.0008 Extrfile110.xml

The mv command below is the issue: it creates the same file name every time.

Code:
for f in sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
echo "$f"
 done

# 34  
Old 02-18-2019
You have an awk script that is creating uniquely named files. You then add another loop following your awk script that removes the final part of those filenames, taking away the part that makes them unique. That guarantees that only the last file created by each invocation of your awk script will be kept, because the renaming loop overwrites each of the output files with the next output file in sequence.

If you want unique names, why do you have the renaming loop that intentionally strips off the part of their names that makes them unique?
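
The collision can be seen without touching any files. The sketch below only prints the name each split file would be moved to; target_name is a made-up helper, and it uses ${1#sampletest_} in place of the loop's ${f/sampletest_/Extrfile}, which gives the same result for names that start with sampletest_.

```shell
# Mirrors the two expansions in the renaming loop, printing instead of moving.
target_name() {
    tmp="Extrfile${1#sampletest_}"   # sampletest_110.xml.0001 -> Extrfile110.xml.0001
    printf '%s\n' "${tmp%.*}"        # %.* removes the shortest trailing ".something"
}

for f in sampletest_110.xml.0001 sampletest_110.xml.0002 sampletest_110.xml.0003; do
    printf '%s -> %s\n' "$f" "$(target_name "$f")"
done
```

Every input maps to Extrfile110.xml, so each mv replaces the file written by the previous one.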
# 35  
Old 02-18-2019
Hi Don,

The reason I am renaming the split files is to give them proper XML names ending in .xml; after that, the script invokes a web service (WSDL).
My expected output should look like the examples below. Any sequence will do, but .xml should be at the end.

Code:
sampletest_110.xml.0004  to  Extrfile110_4.xml  or  Extrfile1104.xml
sampletest_110.xml.0005  to  Extrfile110_5.xml  or  Extrfile1105.xml
sampletest_111.xml.0001  to  Extrfile111_1.xml  or  Extrfile1111.xml
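
One possible way to get names of the first form, as a hedged sketch: keep the sequence number from the split suffix and put it in front of .xml. new_name is a made-up helper, and the loop is a dry run that prints the mv commands for the three sample names instead of executing them.

```shell
# Hypothetical helper: sampletest_110.xml.0004 -> Extrfile110_4.xml
new_name() {
    num=${1##*.}                 # text after the last dot: 0004
    base=${1%.xml.*}             # drop the .xml.NNNN tail: sampletest_110
    base=${base#sampletest_}     # 110
    # Strip leading zeros as text, so 0008 is never treated as an octal number.
    while [ "${num#0}" != "$num" ] && [ "${#num}" -gt 1 ]; do
        num=${num#0}
    done
    printf 'Extrfile%s_%s.xml\n' "$base" "$num"
}

# Dry run: print the commands instead of renaming anything.
for f in sampletest_110.xml.0004 sampletest_110.xml.0005 sampletest_111.xml.0001; do
    echo mv "$f" "$(new_name "$f")"
done
```

The underscore form is used here because the form without a separator can be ambiguous: Extrfile1111.xml could be part 1 of input 111 or part 11 of input 11. Removing the echo in front of mv would perform the renames.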
