Perl code to retrieve text from website

03-18-2014

Look at the HTML of the document: use 'View Source' in your browser.

Corona688

03-18-2014

perl code to retrieve data from website

I see what you mean, it's a mess. Thanks.

cmccabe

03-18-2014

That website has your data buried 14 div tags deep, but helpfully puts it next to a div with the id "popper_LaboratoryTestName".

This is what it's like, trying to deal with XML with regular expressions:

Code:

$ cat xmls.awk 

BEGIN {
        DEP=2;  # How many close tags in a row before data dump
        POS=0
        RS="<";
        FS="[ \r\n\t>/]";
}

# Always this finicky case when RS isn't \n
(NR==1) && (length($0) == 0) { next }

# Skip XML comments
/^!--/ {
        while(!(I=index($0, "-->"))) if(getline <= 0) exit;
        # Strip out comment
        $0="--XMLCOMMENT-- />"substr($0,I+3);
}

# Ignore XML specification junk
/^\?/ || /^\!/ { next }

# Close tags
/^\// {
        # Walk down the tag stack until we find the tag being closed
        for(TPOS=POS; (TPOS>0) && (toupper($2) != TS[TPOS]); TPOS--);

        if(TPOS <= 0) print "Went under for "$2
        else
        {
                TPOS--;
                while(TPOS < POS)
                {
                        sub(/\/[^\/]*$/, "", TSS); POS--;
                }

#               printf("%s-%s\n", TSS, toupper($2));
        }

#       POP++;
#       if(POP == DEP)
        {
#               printf("%d pops in a row\n", POP);
#               for(X in A) delete A[X];
        }

        next
}

# These should be special variables for match() but aren't.
# String before match
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }
# First char of match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }
# Entire match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }
# String after match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }

# Turns Q=R into A[Q]=R
function aquote(OUT, A, TA) {
        if(OUT)
        {
                split(OUT, TA, SUBSEP);
                A[tolower(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[a]=b, A[c]=d, A[e]=f, etc.
# Q (current open quote) and RMID are extra parameters so they stay local to each call.
function qsplit(STR, A, X, OUT, Q, RMID) {
        while(STR && match(STR, /([ \r\n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);

                RMID=rmid(STR);
                if((RMID == "'") || (RMID == "\""))
                {
                        if(!Q)          Q=RMID;
                        else if(Q == RMID)      Q="";
                        else                    OUT = OUT RMID;
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR);
                        else    OUT = aquote(OUT, A);
                }
                STR=rafter(STR);
        }

        aquote(OUT STR, A);
}

# Non-close tag
!/^\// {
        POP=0;
        # Strip only the tag name, stopping at whitespace, > or /
        TAG=$1;                         sub(/^[^ \r\n\t>\/]*/, "");
        match($0, /\/?>/);
        TDATA=rbefore($0);              CDATA=rafter($0);
        # Flatten and strip whitespace
        gsub(/[ \r\n\t]+/, " ", CDATA);
        gsub(/^[ \r\n\t]+/, "", CDATA);
        gsub(/[ \r\n\t]+$/, "", CDATA);

        if(RLENGTH != 2) # Found > instead of self-closing />
        {
                TS[++POS]=toupper(TAG);
#               printf("%s+%s\n", TSS, toupper(TAG));
                TSS=TSS"/"toupper(TAG);
        }

        for(X in TA) delete TA[X];
        qsplit(TDATA, TA);
        for(X in TA) A[X]=TA[X];

        if(length(CDATA)) A["CDATA:"toupper(TAG)]=CDATA

#       for(X in A) printf("%s[%s]=%s\n", TAG, X, A[X]);
}

(A["id"] == "popper_LaboratoryTestName") && (TS[POS]=="P") { print A["CDATA:P"] }

$ wget -q http://www.ncbi.nlm.nih.gov/gtr/tests/508680/ -O - | awk -f xmls.awk
Exome Sequencing (Exome)

$

...the short version, anyway. XML is not trivial.
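
For comparison, here is a rough sketch of the same lookup done with a real HTML parser instead of regular expressions. It is not the awk approach above: it swaps in Python's standard `html.parser` module, purely as an illustration. The class name `TestNameParser` and the `SAMPLE` markup are made up for the example; only the id "popper_LaboratoryTestName" comes from the page, and the real layout may well differ.

```python
from html.parser import HTMLParser


class TestNameParser(HTMLParser):
    """Collect the text of the first <p> that follows the marker element."""

    MARKER_ID = "popper_LaboratoryTestName"  # id mentioned in the post above

    def __init__(self):
        super().__init__()
        self.seen_marker = False   # True once the marker id has been seen
        self.in_target = False     # True while inside the <p> we want
        self.done = False          # True once that <p> has been closed
        self.chunks = []           # text fragments found inside the target <p>

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if dict(attrs).get("id") == self.MARKER_ID:
            self.seen_marker = True
        elif self.seen_marker and tag == "p":
            self.in_target = True

    def handle_endtag(self, tag):
        if self.in_target and tag == "p":
            self.in_target = False
            self.done = True

    def handle_data(self, data):
        if self.in_target:
            self.chunks.append(data)

    @property
    def test_name(self):
        # Flatten runs of whitespace, much like the awk script does for CDATA
        return " ".join("".join(self.chunks).split())


# Hypothetical sample markup (an assumption, not copied from the real page).
SAMPLE = """
<div><div id="popper_LaboratoryTestName" class="hint">help text</div>
<p>
   Exome Sequencing (Exome)
</p></div>
<p>unrelated</p>
"""

parser = TestNameParser()
parser.feed(SAMPLE)
print(parser.test_name)  # prints: Exome Sequencing (Exome)
```

To try it on the live page you would feed the downloaded HTML to `parser.feed()` in place of `SAMPLE`; whether the first `<p>` after the marker is really the right element there is something to check against the page source.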

Hope this helps.

Corona688
