How to extract info from text file between the tags Post: 302693047

Post 302693047 by Corona688 on Tuesday 28th of August 2012 12:52:24 PM

I have a generic data-extraction script for XML which often works nicely for repeated XML/HTML structures, as long as no attribute values contain spaces. It prints everything in tabular form so you can filter and rearrange it as you please afterwards.
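
The caveat about spaces comes from the script splitting each tag on whitespace (the split($1, A, " ") call in the open-tag handler). Here is a small sketch of that split on its own, assuming any POSIX awk; split_tag is a made-up helper name for illustration, not part of xmlg.awk:

```shell
# Hypothetical demo helper: split a tag on whitespace the same way
# the open-tag handler in xmlg.awk does, and show the pieces.
split_tag() {
    awk -v tag="$1" 'BEGIN {
        n = split(tag, A, " ")
        for (i = 1; i <= n; i++) printf("A[%d]=%s\n", i, A[i])
    }'
}

split_tag 'user name="Al"'          # two pieces: tag name and one attribute
split_tag 'user name="Al Smith"'    # three pieces: the value is cut at the space
```

In the second call the attribute value ends up spread over A[2] and A[3], which is why quoted values containing spaces come out truncated.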

The DEP variable controls how many close-tags in a row it looks for before printing a row of data. Set it as high as you can while it still prints what you want. In this case, that seems to be 4.
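
If you are not sure where to start with DEP, one rough approach is to count the longest run of consecutive close tags in the file. A row is only printed when the run reaches exactly DEP, so any value above that count prints nothing and it works as an upper bound. This is only a sketch using the same RS/FS settings as the script; max_close_run is a hypothetical helper, not part of xmlg.awk, and the sample XML is made up:

```shell
# Hypothetical helper: report the longest run of consecutive close tags.
max_close_run() {
    awk 'BEGIN { RS="<"; FS=">" }
         /^\?/ || /^$/ { next }                       # skip <?xml ...?> and empty records
         /^\//  { if (++run > max) max = run; next }  # a close tag extends the run
                { run = 0 }                           # any open tag ends the run
         END    { print max + 0 }' "$@"
}

# Made-up sample: three close tags in a row at the end, so this prints 3
printf '<users><user><name>Al</name></user></users>\n' | max_close_run
```

The useful value is usually at or below the number printed, depending on which level of nesting you want one row per.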

Code:

$ cat xmlg.awk 

BEGIN { RS="<";         FS=">"; ORS="\r\n";

        # Change this to alter how many close-tags in a row are needed
        # before a row of data is printed.
        if(!DEP) DEP=1
        SEP="\t"
        }

# Skip weird XML specification lines or blank records
/^\?/ || /^$/   {       next    }

# Handle close tags
/^[/]/  {
        N=D;    while((N>0) && ("/"STACK[N] != $1))     N--;

        if("/"STACK[N] == $1)   D=(N-1);
        POP++;

        if(POP == DEP)
        {
                if(!HEADER++)
                {
                        split(ARG[1], Z, SUBSEP);
                        printf("%s %s", Z[2], Z[3]);
                        for(N=2; N<=ARG_; N++)
                        {
                                split(ARG[N], Z, SUBSEP);
                                printf("%s%s %s", SEP, Z[2], Z[3]);
                        }

                        printf("\n");
                }

                printf("%s", DATA[ARG[1]]);
                for(N=2; N<=ARG_; N++)
                        printf("%s%s", SEP, DATA[ARG[N]]);
                printf("\n");
        }
        next
}

# Handle open tags
{
        gsub(/^[ \r\n\t]*/, "", $2);    # Whitespace isn't data
        gsub(/[ \r\n\t]*$/, "", $2);
        sub(/\/$/, "", $(NF-1));

        # Reset parameters
        POP=0;

        M=split($1, A, " ");
        STACK[++D]=A[1];

        if((!MAX) || (D>MAX)) MAX=D;    # Save max depth

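        # Handle parameters.  Note: A[2] is only the first
        # whitespace-separated attribute, so Q is at most 1 and
        # any further attributes on the same tag are ignored.
        Q=split(A[2], B, " ");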
        for(N=1; N<=Q; N++)
        {
                split(B[N], C, "=");
                gsub(/['"]/,"", C[2]);

                I=D SUBSEP STACK[D] SUBSEP C[1];
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=C[2];
        }

        if($2)
        {
                I=D SUBSEP STACK[D] SUBSEP "CDATA";
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=$2;
        }
}

$ awk -v DEP=4 -f xmlg.awk < data2.xml |
        awk -F"\t" 'NR>1 { print $4"-"$5"-"$11"-"$12 }'

First Name Last Name-Power user-someone@company.com-123 memory LANE

$
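
Counting column numbers by hand gets tedious on wide tables. Since the first line of output is a header of "tag attribute" labels, one alternative is to pick columns by header text instead. This is a sketch only; pick_cols is a hypothetical helper, and the sample table is invented to have the same shape as the script's output:

```shell
# Hypothetical helper: pick tab-separated columns by header text.
# Usage: pick_cols 'header one,header two' < table.tsv
pick_cols() {
    awk -F'\t' -v want="$1" '
        NR == 1 {                   # header row: remember each column position
            n = split(want, w, ",")
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {                           # data rows: join the wanted columns with "-"
            out = ""
            for (k = 1; k <= n; k++)
                out = out (k > 1 ? "-" : "") ((w[k] in col) ? $(col[w[k]]) : "?")
            print out
        }'
}

# Made-up table: a header row, then one data row
printf 'user id\tname CDATA\temail CDATA\n7\tAl\tal@example.com\n' |
    pick_cols 'name CDATA,email CDATA'
```

On the sample this prints Al-al@example.com. A header that is not found prints as "?", and if two columns share the same label the later one wins, so check the header row first.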

Last edited by Corona688; 08-28-2012 at 03:00 PM.
