How to extract info from text file between the tags
Post 302693047 by Corona688 on Tuesday 28th of August 2012 12:52:24 PM
I have a generic data-extraction script for XML which often works nicely for repeated XML/HTML structures, as long as there are no spaces inside tag attribute values. It prints everything as a table, so you can filter and rearrange the columns as you please afterwards.
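
To get a feel for how the script sees its input: it sets RS="<" and FS=">", so every tag starts a new record, $1 is the tag and $2 is whatever text follows it. Here is a minimal sketch of just that step. The function name split_tags and the sample input are made up for illustration; they are not part of xmlg.awk.

```shell
# Hypothetical helper: show how RS="<" and FS=">" cut markup into records.
split_tags() {
    awk 'BEGIN { RS="<"; FS=">" }
         /^$/ { next }    # the empty record before the first "<"
         { printf "tag=[%s] text=[%s]\n", $1, $2 }'
}

# Made-up sample input, no trailing newline so the last text field stays empty.
printf '%s' '<user id="7"><name>Ann</name></user>' | split_tags
# tag=[user id="7"] text=[]
# tag=[name] text=[Ann]
# tag=[/name] text=[]
# tag=[/user] text=[]
```

Close tags show up as records starting with "/", which is what the script keys on.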

The DEP variable controls how many close-tags in a row it looks for before printing a row of data. Set it as high as you can while it still prints what you want. In this case, that seems to be 4.
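
If you don't know where to start with DEP, one option is to count the longest run of consecutive close-tags in the file, since a DEP higher than that can never trigger. This is only a rough sketch: longest_close_run is a made-up helper, not part of xmlg.awk, and like the script it treats any open tag (including self-closing ones) as ending a run.

```shell
# Hypothetical helper: print the longest run of consecutive close tags.
longest_close_run() {
    awk 'BEGIN { RS="<"; FS=">" }
         /^\?/ || /^$/ { next }                            # skip <?xml ...?> and empty records
         /^\// { run++; if (run > max) max = run; next }   # a close tag extends the run
         { run = 0 }                                       # an open tag resets it
         END { print max + 0 }'
}

# Made-up sample: </c></b> is a run of 2, </c></b></a> is a run of 3.
printf '%s' '<a><b><c>x</c></b><b><c>y</c></b></a>' | longest_close_run
# 3
```

For that sample, DEP=2 would fire after each </b>, DEP=3 only once at the very end, and DEP=4 never.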

Code:
$ cat xmlg.awk 

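# Every "<" starts a new record and ">" splits it into fields, so $1 is
# the tag (name plus attributes) and $2 is the text following the tag.
BEGIN { RS="<";         FS=">"; ORS="\r\n";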

        # Change this to alter how many close-tags in a row are needed
        # before a row of data is printed.
        if(!DEP) DEP=1
        SEP="\t"
        }

# Skip weird XML specification lines or blank records
/^\?/ || /^$/   {       next    }

# Handle close tags
/^[/]/  {
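        # Search down the stack for the open tag matching this close tag.
        N=D;    while((N>0) && ("/"STACK[N] != $1))     N--;

        # If it was found, unwind the stack to just below it.
        if("/"STACK[N] == $1)   D=(N-1);
        POP++;          # One more close tag in a row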

        if(POP == DEP)
        {
                # Print the header row once, before the first data row
                if(!HEADER++)
                {
                        split(ARG[1], Z, SUBSEP);
                        printf("%s %s", Z[2], Z[3]);
                        for(N=2; N<=ARG_; N++)
                        {
                                split(ARG[N], Z, SUBSEP);
                                printf("%s%s %s", SEP, Z[2], Z[3]);
                        }

                        printf("\n");
                }

                printf("%s", DATA[ARG[1]]);
                for(N=2; N<=ARG_; N++)
                        printf("%s%s", SEP, DATA[ARG[N]]);
                printf("\n");
        }
        next
}

# Handle open tags
{
        gsub(/^[ \r\n\t]*/, "", $2);    # Whitespace isn't data
        gsub(/[ \r\n\t]*$/, "", $2);
        sub(/\/$/, "", $(NF-1));

        # Reset parameters
        POP=0;

        M=split($1, A, " ");
        STACK[++D]=A[1];

        if((!MAX) || (D>MAX)) MAX=D;    # Save max depth

        # Handle parameters.  Only A[2], the first attribute after the
        # tag name, is examined here.
        Q=split(A[2], B, " ");
        for(N=1; N<=Q; N++)
        {
                split(B[N], C, "=");
                gsub(/['"]/,"", C[2]);

                I=D SUBSEP STACK[D] SUBSEP C[1];
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=C[2];
        }

        if($2)
        {
                I=D SUBSEP STACK[D] SUBSEP "CDATA";
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=$2;
        }
}

$ awk -v DEP=4 -f xmlg.awk < data2.xml |
        awk -F"\t" 'NR>1 { print $4"-"$5"-"$11"-"$12 }'

First Name Last Name-Power user-someone@company.com-123 memory LANE

$
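
To find which column numbers to use in that second awk, it can help to number the header row first. A small sketch, assuming tab-separated output with a header on line 1 as xmlg.awk prints it; number_columns and the sample header are made up for illustration.

```shell
# Hypothetical helper: number the tab-separated fields of the header row.
number_columns() {
    awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }'
}

# Made-up two-column table in the same "tag attribute" header style.
printf 'user id\tname CDATA\n7\tAnn\n' | number_columns
# 1: user id
# 2: name CDATA
```

With the real data you would pipe the output of awk -v DEP=4 -f xmlg.awk < data2.xml into it instead of the printf.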


Last edited by Corona688; 08-28-2012 at 03:00 PM..
 
