How to extract info from text file between the tags


 
# 1  
Old 08-28-2012
How to extract info from text file between the tags

Hi,
I have a text file with member information:
Code:
Name is in H1 tag
Title is in H2 tag
Email is in <a id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lbnEmailMe" href="javascript:__doPostBack('ctl00$ContentPlaceHolder3$repeaterItems$ctl01$lbnEmailMe','')">someone@company.com</a>
Location: <span id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lblOfficeCity">City, State, Zip</span>

My mission is to extract a list of names, titles, locations and emails from this text file.

How should I go about doing that? I am using AWK to extract the email matches, but how do I get all the pieces into a tab-separated text file?

Thank you in advance for your time.

iG
# 2  
Old 08-28-2012
It would be better if you provided an input sample and the corresponding desired output.
# 3  
Old 08-28-2012
Input - Output

Repeated Code on the page:
Code:
<div class="userInfo">


                        <div class="contactDetails">
                            <a id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_nameLink" href="#"><h1>First Name Last Name</h1></a>
                            <h2>Power user</h2>
                            <div class="contactNumbers">
                                <span id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lblMobile" class="SRtext"><span class="SRtextLabel">Cell: </span>999.999.9999 <br /></span>
                                <span id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lblOfficeNum" class="SRtext"><span class="SRtextLabel">Office Telephone: </span>999.999.9999<br /></span>
                                <span id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lblHomePhone" class="SRtext"><span class="SRtextLabel">Home Telephone: </span>999.999.9999<br /></span>
                                
                                
                                
                                
                                <span id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lblEmail"><span class="SRtextLabel">Email: </span><a id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lbnEmailMe" href="javascript:__doPostBack('ctl00$ContentPlaceHolder3$repeaterItems$ctl01$lbnEmailMe','')">someone@company.com</a><br /></span>
                            </div>

                
                    <div class="SRtextNoIndent">
                <span id="ctl00_ContentPlaceHolder3_repeaterItems_lblOffices">123 memory LANE</span><br />
                <span id="ctl00_ContentPlaceHolder3_repeaterItems_lblCity">Detroit, MI 48204</span>
            </div>
            
        </div>
    </div>

OUTPUT:
Code:
Name-Title-Email-Location
First Name Last Name-Power user-someone@company.com-Detroit, MI 48204
First Name Last Name-Power user-someone@company.com-Detroit, MI 48204
First Name Last Name-Power user-someone@company.com-Detroit, MI 48204
First Name Last Name-Power user-someone@company.com-Detroit, MI 48204

Moderator's Comments:
Please use code tags when posting data and code samples!

Last edited by vgersh99; 08-28-2012 at 12:31 PM.. Reason: code tags, please!
# 4  
Old 08-28-2012
I have a generic data-extraction script for XML that often works nicely for repeated XML/HTML structures, as long as no attribute values inside the tags contain spaces. It prints everything in tabular form so you can filter and rearrange it as you please afterwards.

The DEP variable controls how many close-tags in a row it looks for before printing a row of data. Set it as high as you can while still getting the output you want. In this case, that seems to be 4.

Code:
$ cat xmlg.awk 

BEGIN { RS="<";         FS=">"; ORS="\r\n";

        # Change this to alter how many close-tags in a row are needed
        # before a row of data is printed.
        if(!DEP) DEP=1
        SEP="\t"
        }

# Skip weird XML specification lines or blank records
/^\?/ || /^$/   {       next    }

# Handle close tags
/^[/]/  {
        N=D;    while((N>0) && ("/"STACK[N] != $1))     N--;

        if("/"STACK[N] == $1)   D=(N-1);
        POP++;

        if(POP == DEP)
        {
                if(!HEADER++)
                {
                        split(ARG[1], Z, SUBSEP);
                        printf("%s %s", Z[2], Z[3]);
                        for(N=2; N<=ARG_; N++)
                        {
                                split(ARG[N], Z, SUBSEP);
                                printf("%s%s %s", SEP, Z[2], Z[3]);
                        }

                        printf("\n");
                }

                printf("%s", DATA[ARG[1]]);
                for(N=2; N<=ARG_; N++)
                        printf("%s%s", SEP, DATA[ARG[N]]);
                printf("\n");
        }
        next
}

# Handle open tags
{
        gsub(/^[ \r\n\t]*/, "", $2);    # Whitespace isn't data
        gsub(/[ \r\n\t]*$/, "", $2);
        sub(/\/$/, "", $(NF-1));

        # Reset parameters
        POP=0;

        M=split($1, A, " ");
        STACK[++D]=A[1];

        if((!MAX) || (D>MAX)) MAX=D;    # Save max depth

        # Handle parameters
        Q=split(A[2], B, " ");
        for(N=1; N<=Q; N++)
        {
                split(B[N], C, "=");
                gsub(/['"]/,"", C[2]);

                I=D SUBSEP STACK[D] SUBSEP C[1];
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=C[2];
        }

        if($2)
        {
                I=D SUBSEP STACK[D] SUBSEP "CDATA";
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=$2;
        }
}

$ awk -v DEP=4 -f xmlg.awk < data2.xml |
        awk -F"\t" 'NR>1 { print $4"-"$5"-"$11"-"$12 }'

First Name Last Name-Power user-someone@company.com-123 memory LANE

$
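
For comparison, here is a narrower sketch that targets only the four requested fields and writes them tab-separated, using the same RS="<" / FS=">" trick as the script above. It is not part of xmlg.awk. The file names sample.html and members.tsv are made up for illustration, and it assumes every record contains an h1, an h2, an lbnEmailMe link and an lblCity span, with the city coming last.

```shell
# Sample record, trimmed from the input posted earlier in the thread.
# sample.html and members.tsv are hypothetical file names.
cat > sample.html <<'EOF'
<div class="contactDetails">
  <a id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_nameLink" href="#"><h1>First Name Last Name</h1></a>
  <h2>Power user</h2>
  <span id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lblEmail"><span class="SRtextLabel">Email: </span><a id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lbnEmailMe" href="javascript:__doPostBack('ctl00$ContentPlaceHolder3$repeaterItems$ctl01$lbnEmailMe','')">someone@company.com</a><br /></span>
</div>
<div class="SRtextNoIndent">
  <span id="ctl00_ContentPlaceHolder3_repeaterItems_lblOffices">123 memory LANE</span><br />
  <span id="ctl00_ContentPlaceHolder3_repeaterItems_lblCity">Detroit, MI 48204</span>
</div>
EOF

# Each awk record starts just after a "<"; $1 is the tag with its
# attributes and $2 is the text that follows the closing ">".
awk '
BEGIN { RS = "<"; FS = ">"; OFS = "\t"
        print "Name", "Title", "Email", "Location" }
$1 == "h1"         { name  = $2 }
$1 == "h2"         { title = $2 }
$1 ~ /lbnEmailMe"/ { email = $2 }
# Assumption: the city span is the last field of each member record,
# so seeing it means the row is complete.
$1 ~ /lblCity"/    { print name, title, email, $2
                     name = title = email = "" }
' sample.html > members.tsv

cat members.tsv
```

For the sample record this should give a header line followed by one row holding the name, title, email and city. If real records are missing a field or order the fields differently, the pattern lines would need adjusting.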


Last edited by Corona688; 08-28-2012 at 03:00 PM..
# 5  
Old 08-30-2012
Mountain Lion SED AWK..

Thanks..
I updated to Mountain Lion and SED and AWK are not working for me...

Any fix for this?

Thanks
G
# 6  
Old 08-31-2012
Details, please! In what way are they "not working"?

What, exactly, did you do? Show us word for word, letter for letter, keystroke for keystroke.

Does my script to transform the data into XML work at least?

Post more complete output; I suspect your data doesn't always look like what you said.
# 7  
Old 09-05-2012
a clumsy solution but works

A clumsy solution, but it worked well:
Code:
cat filename | grep -E "<h1>|<h2>|Email: |lblCity" | sed 's/</\n/g' | grep -E "^h1|^h2|^a|^span id" | grep -v -E "ctl01_nameLink|ctl01_lblEmail" | awk -F ">" '{ print $2 }' | sed 's/ /___/g' | xargs -n 4 | sed 's/___/ /g'

I tried it with the input you provided, and below is the output I got.
Code:
[gerb@ashik temp]$ cat temp | grep -E "<h1>|<h2>|Email: |lblCity" | sed 's/</\n/g' | grep -E "^h1|^h2|^a|^span id" | grep -v -E "ctl01_nameLink|ctl01_lblEmail" | awk -F ">" '{ print $2 }' | sed 's/ /___/g' | xargs -n 4 | sed 's/___/ /g'
First Name Last Name Power user someone@company.com Detroit, MI 48204
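
A note on portability, offered as a guess rather than a diagnosis of the Mountain Lion problem above: the pipeline relies on sed 's/</\n/g' producing a newline, which GNU sed does, but some other sed implementations (the BSD sed shipped with OS X among them) insert a literal "n" instead. The sketch below keeps the same filtering steps but uses tr for the newline and paste to join every four lines with tabs, so the ___ placeholder trick is not needed. The file names sample.html and members_portable.tsv are made up for illustration.

```shell
# Sample record, trimmed from the input posted earlier in the thread.
# sample.html and members_portable.tsv are hypothetical file names.
cat > sample.html <<'EOF'
<div class="contactDetails">
  <a id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_nameLink" href="#"><h1>First Name Last Name</h1></a>
  <h2>Power user</h2>
  <span id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lblEmail"><span class="SRtextLabel">Email: </span><a id="ctl00_ContentPlaceHolder3_repeaterItems_ctl01_lbnEmailMe" href="javascript:__doPostBack('ctl00$ContentPlaceHolder3$repeaterItems$ctl01$lbnEmailMe','')">someone@company.com</a><br /></span>
</div>
<div class="SRtextNoIndent">
  <span id="ctl00_ContentPlaceHolder3_repeaterItems_lblOffices">123 memory LANE</span><br />
  <span id="ctl00_ContentPlaceHolder3_repeaterItems_lblCity">Detroit, MI 48204</span>
</div>
EOF

# Same stages as the pipeline above, with two substitutions:
#   tr '<' '\n'   instead of  sed 's/</\n/g'   (tr understands \n everywhere)
#   paste - - - - instead of  the ___ / xargs -n 4 round trip
grep -E '<h1>|<h2>|Email: |lblCity' sample.html |
  tr '<' '\n' |
  grep -E '^h1|^h2|^a|^span id' |
  grep -v -E 'ctl01_nameLink|ctl01_lblEmail' |
  awk -F '>' '{ print $2 }' |
  paste - - - - > members_portable.tsv

cat members_portable.tsv
```

Like the original, this assumes exactly four surviving lines per member, in the order name, title, email, city; a record with a missing field would shift every row after it.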
