Perl code to retrieve text from website


 
# 8  
Old 03-18-2014
Look at the HTML of the document. 'view source' in your browser.
# 9  
Old 03-18-2014
perl code to retrieve data from website

I see what you mean, it's a mess. Thanks.
# 10  
Old 03-18-2014
That website has your data buried 14 div tags deep, but it helpfully puts it right next to a div with the id "popper_LaboratoryTestName".
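
To get a feel for what "N tags deep" means, here is a minimal sketch that splits the input on "<" and counts open and close tags until it reaches the id. The sample markup and the tag_depth function name are made up for illustration; only the id comes from the real page. It assumes well-formed markup with no comments, self-closing tags, or void elements such as br, all of which would throw the count off.

```shell
# Hypothetical sample standing in for the real page; the markup is made up.
sample='<div><div><div id="popper_LaboratoryTestName"></div><p>Exome Sequencing (Exome)</p></div></div>'

# tag_depth ID: print the nesting depth of the tag carrying id="ID".
tag_depth() {
    awk -v want="$1" '
    BEGIN { RS = "<" }                        # every record now starts with a tag name
    NR == 1 && length($0) == 0 { next }       # empty record before the first "<"
    /^\// { depth--; next }                   # close tag: one level up
    {
        depth++                               # open tag: one level down
        if (index($0, "id=\"" want "\"")) print depth
    }'
}

printf '%s\n' "$sample" | tag_depth popper_LaboratoryTestName
# prints 3
```

That only counts nesting. Pulling the text out reliably takes a good deal more.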

This is what it's like, trying to deal with XML with regular expressions:

Code:
$ cat xmls.awk 

BEGIN {
        DEP=2;  # How many close tags in a row before data dump
        POS=0
        RS="<";
        FS="[ \r\n\t>/]";
}

# Always this finicky case when RS isn't \n
(NR==1) && (length($0) == 0) { next }

# Skip XML comments
/^!--/ {
        while(!(I=index($0, "-->"))) if(getline <= 0) exit;
        # Strip out comment
        $0="--XMLCOMMENT-- />"substr($0,I+3);
}

# Ignore XML specification junk
/^\?/ || /^\!/ { next }

# Close tags
/^\// {
        for(TPOS=POS; (TPOS>0) && (toupper($2) != TS[TPOS]); TPOS--);

        if(TPOS <= 0) print "Went under for "$2
        else
        {
                TPOS--;
                while(TPOS < POS)
                {
                        sub(/\/[^\/]*$/, "", TSS); POS--;
                }

#               printf("%s-%s\n", TSS, toupper($2));
        }

#       POP++;
#       if(POP == DEP)
        {
#               printf("%d pops in a row\n", POP);
#               for(X in A) delete A[X];
        }

        next
}

# These should be special variables for match() but aren't.
# String before match
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }
# First char of match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }
# Entire match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }
# String after match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }

# Turns Q=R into A[Q]=R
function aquote(OUT, A, TA) {
        if(OUT)
        {
                split(OUT, TA, SUBSEP);
                A[tolower(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[a]=b, A[c]=d, A[e]=f, etc.
function qsplit(STR, A, X, OUT, Q, RMID) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);

                RMID=rmid(STR);
                if((RMID == "'") || (RMID == "\""))
                {
                        if(!Q)          Q=RMID;
                        else if(Q == RMID)      Q="";
                        else                    OUT = OUT RMID;
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR);
                        else    OUT = aquote(OUT, A);
                }
                STR=rafter(STR);
        }

        aquote(OUT STR, A);
}

# Non-close tag
!/^\// {
        POP=0;
        TAG=$1;                         sub(/^[^ \r\n\t]*/, "");
        match($0, /\/?>/);
        TDATA=rbefore($0);              CDATA=rafter($0);
        # Flatten and strip whitespace
        gsub(/[ \r\n\t]+/, " ", CDATA);
        gsub(/^[ \r\n\t]+/, "", CDATA);
        gsub(/[ \r\n\t]+$/, "", CDATA);

        if(RLENGTH != 2) # Found > instead of self-closing />
        {
                TS[++POS]=toupper(TAG);
#               printf("%s+%s\n", TSS, toupper(TAG));
                TSS=TSS"/"toupper(TAG);
        }

        for(X in TA) delete TA[X];
        qsplit(TDATA, TA);
        for(X in TA) A[X]=TA[X];

        if(length(CDATA)) A["CDATA:"toupper(TAG)]=CDATA

#       for(X in A) printf("%s[%s]=%s\n", TAG, X, A[X]);
}

(A["id"] == "popper_LaboratoryTestName") && (TS[POS]=="P") { print A["CDATA:P"] }

$ wget -q http://www.ncbi.nlm.nih.gov/gtr/tests/508680/ -O - | awk -f xmls.awk
Exome Sequencing (Exome)

$

...the short version, anyway. XML is not trivial.
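
If all you need is the one value, the core idea boils down to something much smaller: split on "<", remember when the wanted id has gone past, and print the text of the next p tag. A rough sketch under stated assumptions: the sample markup and the text_after_id function name are invented for illustration, the id is written with double quotes, and the text itself contains no ">" characters. It makes no attempt at comments, quoting rules, or nesting, so treat it as a quick hack and not a parser.

```shell
# Hypothetical sample standing in for the real page; the markup is made up.
sample='<div id="popper_LaboratoryTestName"></div>
<p>Exome Sequencing (Exome)</p>
<p>unrelated text</p>'

# text_after_id ID: print the text of the first <p> following the tag with id="ID".
text_after_id() {
    awk -v want="$1" '
    BEGIN { RS = "<" }                        # one record per tag plus its trailing text
    /^\// { next }                            # ignore close tags
    {
        split($0, part, ">")                  # part[1] = tag and attributes, part[2] = text
        if (index(part[1], "id=\"" want "\"")) { armed = 1; next }
        if (armed && part[1] ~ /^[Pp]([ \t\r\n]|$)/) {
            gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", part[2])   # trim surrounding whitespace
            print part[2]
            exit
        }
    }'
}

printf '%s\n' "$sample" | text_after_id popper_LaboratoryTestName
# prints: Exome Sequencing (Exome)
```

It will break the moment the page layout changes, which is the usual price of scraping with pattern matching.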

Hope this helps.