awk to parse html file


 
# 1  
Old 09-29-2014

Is it possible to parse a webpage in awk (EDAR Gene Sequencing - Genetic Testing Company | The DNA Diagnostic Experts | GeneDx)? The source code is attached.

HTML Code:
<title> EDAR Gene Sequencing
<dt>Test Code:</dt>
    <dd>156 </dd>
 
    <dt>Turnaround Time:</dt>
    <dd>6-8 weeks </dd>
 
    <dt>Preferred Specimen:</dt>
    <dd>2-5 mL Blood - Lavender Top Tube </dd>
 
<dt>CPT Codes:</dt>
    <dd>81479x1</dd>
 
<ul id="clinical-utility">
    <li>Confirmation of a clinical diagnosis </li>
    <li>Differentiation between X-linked and autosomal forms of the disease </li>
    <li>Prenatal diagnosis in at-risk pregnancies</li>
 
<ol id="references">
    <li>Bal, E et al. Hum Mutat. 28:703-709, 2007.</li>
    <li>Headon et al. Nature. 414:913-916, 2001.</li>
    <li>Monreal et al. Nat Genet 22:366-369, 1999.</li>
    <li>Chassaing et al. Hum Mutat. 27(3):255-259, 2006</li>
The <…..> tags are not needed, only the text, if that is possible. Thanks.

Last edited by cmccabe; 09-29-2014 at 03:16 PM.. Reason: CODE tags (in this case html tags)
# 2  
Old 09-29-2014
Parsing XML (or HTML) is not trivial. This awk parser does not handle it perfectly, but it may be good enough here.

Code:
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match (awk strings start at index 1)
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT, Q, RMID) { # X, OUT, Q, RMID are locals
        while(STR && match(STR, /([ \r\n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        sub("^.*" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}

Use it like this (the -e option is a GNU awk extension):

Code:
$ awk -f xml.awk -e 'TAGS ~ /^TITLE/ { print $2 }
        TAGS ~ /^H4/ { P=/ORDERING|BILLING|REFERENCES/ ; next }
        {       gsub(/[\r\n\t ]+/, " ", $2);
                sub(/^ $/, "", $2);
                if(P && $2) print $2 }' ORS="\n" index.html

  EDAR Gene Sequencing - Genetic Testing Company | The DNA Diagnostic Experts | GeneDx
Test Code:
156
Turnaround Time:
6-8 weeks
Preferred Specimen:
2-5 mL Blood - Lavender Top Tube
CPT Codes:
81479x1
New York Approved:
Yes
ABN Required:
Yes
Billing Information:
View Billing Policy
ICD Codes:
757.31
: Congenital ectodermal dysplasia
*
 For price inquiries please email
zebras@genedx.com
Bal, E et al. Hum Mutat. 28:703-709, 2007.
Headon et al. Nature. 414:913-916, 2001.
Monreal et al. Nat Genet 22:366-369, 1999.
Chassaing et al. Hum Mutat. 27(3):255-259, 2006
Back To Top
Contact Us
Site Map
Terms of Service
Privacy Statement
&copy; GeneDx
207 Perry Parkway Gaithersburg, MD 20877
Phone: +1 301 519 2100, Fax: +1 301 519 2892
Email:
genedx@genedx.com
Stay Connected:

$
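
For readers who only want to see the core trick, here is a minimal sketch of the record-splitting idea the parser is built on: with RS="<" every record starts at a tag, and with FS=">" the tag lands in $1 and the text after it in $2. It is not a replacement for xml.awk (no tag stack, no attribute parsing), and text_nodes is a hypothetical helper name used only for illustration.

```shell
# Sketch only: print the text that follows each tag, one piece per line.
# text_nodes is a made-up name, not part of xml.awk.
text_nodes() {
    awk 'BEGIN { RS = "<"; FS = ">" }
    NR > 1 {                            # record 1 is whatever precedes the first "<"
        text = $2                       # $1 is the tag, $2 is the text after it
        gsub(/[ \t\r\n]+/, " ", text)   # collapse runs of whitespace
        sub(/^ /, "", text)             # trim a leading blank
        sub(/ $/, "", text)             # trim a trailing blank
        if (text != "") print text
    }' "$@"
}

# Example on two lines of the sample from the first post:
printf '<dt>Test Code:</dt>\n    <dd>156 </dd>\n' | text_nodes
```

Like the full parser, this sketch cuts the text short if it contains a literal > character, because only $2 is used.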

This User Gave Thanks to Corona688 For This Post:
# 3  
Old 09-29-2014
Can you post what the desired output should look like?
This User Gave Thanks to shamrock For This Post:
# 4  
Old 09-29-2014
In cases where you don't have quoted > characters in tags (and I didn't see any of them in your samples, but didn't do an exhaustive search in your attachment), the following much simpler script might work:
Code:
awk -F '<[^>]*>' '{$1=$1}1' OFS='' file

With the sample data you posted in the first message in this thread, it produces the output:
Code:
 EDAR Gene Sequencing
Test Code:
    156 
 
    Turnaround Time:
    6-8 weeks 
 
    Preferred Specimen:
    2-5 mL Blood - Lavender Top Tube 
 
CPT Codes:
    81479x1
 

    Confirmation of a clinical diagnosis 
    Differentiation between X-linked and autosomal forms of the disease 
    Prenatal diagnosis in at-risk pregnancies
 

    Bal, E et al. Hum Mutat. 28:703-709, 2007.
    Headon et al. Nature. 414:913-916, 2001.
    Monreal et al. Nat Genet 22:366-369, 1999.
    Chassaing et al. Hum Mutat. 27(3):255-259, 2006

I didn't see any problems processing your attached sample either, but due to the length (since this preserves all input lines and just removes tags), I won't post the results here. It would also be easy to get rid of empty lines after removing tags if that is what you want.
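
As a sketch of that last point, the same field-separator approach can be extended to trim the leftover indentation and drop lines that end up empty. This is one possible variant, not something tested against the full attachment; strip_tags is a hypothetical helper name, and OFS is set with -v so it also applies when reading standard input.

```shell
# Sketch only: remove tags, trim blanks, and skip lines that become empty.
# strip_tags is a made-up name for illustration.
strip_tags() {
    awk -v OFS= -F '<[^>]*>' '{
        $1 = $1                          # rebuild $0 without the tags
        gsub(/^[ \t]+|[ \t]+$/, "")      # trim leading and trailing blanks
    }
    length($0)                           # print only non-empty lines
    ' "$@"
}

# Example on part of the sample from the first post:
printf '<dt>Test Code:</dt>\n    <dd>156 </dd>\n \n<ul id="clinical-utility">\n    <li>Confirmation of a clinical diagnosis </li>\n' | strip_tags
```

The same caveat applies as for the one-liner: a quoted > inside a tag, for example <a title="x > y">link</a>, ends the match early and leaves part of the attribute in the output.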
This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 09-29-2014
Thank you all.