XML to TXT or CSV Post: 302641107

Post 302641107 by Corona688 in UNIX for Dummies Questions & Answers, Tuesday 15th of May 2012 03:50:12 PM

Handling arbitrary XML isn't trivial. This script should be flexible enough to mold itself to your input data, since it discovers columns as it goes and tries to preserve their order. It decides where a 'row' ends by looking for two consecutive close-tags.
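The row detection can be sketched in a few lines. This is a stripped-down illustration rather than the full script: the sample XML in `xml` is made up for the example, and `closes` is a hypothetical counter name.

```shell
# Made-up sample input: two <file> rows inside one <archive>.
xml='<archive id="a1"><file line="1"><author>953b</author><time>18:03</time></file><file line="2"><author>04bfa</author><time>18:04</time></file></archive>'

rows=$(printf '%s' "$xml" | awk '
BEGIN  { RS="<"; FS=">" }     # one record per tag; $1 = tag, $2 = text after it
/^$/   { next }               # empty record before the first "<"
/^\//  { if (++closes == 2) print "row ends at " $1; next }
       { closes = 0 }         # any open tag resets the counter
')

echo "$rows"
```

Each `</time>` followed directly by `</file>` makes two close tags in a row, so one line is reported per `<file>` element. The final `</archive>` is a third consecutive close tag and is ignored.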

If it doesn't work, try nawk. If it still doesn't work, post some of your actual, unmodified input data.

Code:

$ cat xmlg.awk

BEGIN { RS="<";         FS=">"; ORS="\r\n"  }

# Skip weird XML specification lines or blank records
/^\?/ || /^$/   {       next    }

# Handle close tags
/^[/]/  {
        N=D;    while((N>0) && ("/"STACK[N] != $1))     N--;

        if("/"STACK[N] == $1)   D=(N-1);
        POP++;

        if(POP == 2)
        {
                if(!HEADER++)
                {
                        split(ARG[1], Z, SUBSEP);
                        printf("%s %s", Z[2], Z[3]);
                        for(N=2; N<=ARG_; N++)
                        {
                                split(ARG[N], Z, SUBSEP);
                                printf("|%s %s", Z[2], Z[3]);
                        }

                        printf("%s", ORS);      # printf ignores ORS, so write it explicitly
                }

                printf("%s", DATA[ARG[1]]);
                for(N=2; N<=ARG_; N++)
                        printf("|%s", DATA[ARG[N]]);
                printf("%s", ORS);
        }
        next
}

# Handle open tags
{
        gsub(/^[ \r\n\t]*/, "", $2);    # Whitespace isn't data
        gsub(/[ \r\n\t]*$/, "", $2);

        # Reset parameters
        POP=0;

        M=split($1, A, " ");
        STACK[++D]=A[1];

        # Handle parameters
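        # Walk every attribute token, not only the first one;
        # attribute values containing spaces are still not supported
        for(N=2; N<=M; N++)
        {
                if(split(A[N], C, "=") < 2) continue;   # not a name=value pair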
                gsub(/['"]/,"", C[2]);
#               PARAM[C[1]]=C[2];
#               print C[1], "=", PARAM[C[1]];

                I=D SUBSEP STACK[D] SUBSEP C[1];
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=C[2];
        }

        if($2 != "")    # explicit test, so a value such as 0 still counts as data
        {
                I=D SUBSEP STACK[D] SUBSEP "CDATA";
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=$2;
        }
}

$ awk -f xmlg.awk file.xml

archive id|file line|author CDATA|time CDATA|text CDATA
ffghgsddes|1|953b|18:03|this is an evidence regarding ...
ffghgsddes|2|04bfa|18:03|we have seen those documents before
jhljkhlasdf|1|953b|18:03|this is an evidence regarding ...
jhljkhlasdf|2|04bfa|18:03|we have seen those documents before

$ awk -f xmlg.awk file.xml > output.txt

ORS="\r\n" should make the output easier to import into Excel or what have you.
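One thing worth checking on your awk: ORS is appended by print only, while printf writes exactly what its format string says, so any line written with printf needs its line ending spelled out. A small sketch of the difference (the names `cr` and `crlf_count` are just for illustration):

```shell
cr=$(printf '\r')       # a literal carriage return for grep to look for

# Emit one line with print and one with printf,
# then count the lines that end in CR LF.
crlf_count=$(awk 'BEGIN {
        ORS = "\r\n"
        print "via print"           # ORS is appended: ends in \r\n
        printf("via printf\n")      # ORS is not used: ends in \n only
}' | grep -c "$cr\$")

echo "$crlf_count"
```

Only the line written with print should end in CR LF, so the count comes out as 1.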
