awk to parse html file

09-29-2014


Is it possible to parse a webpage with awk? The page is "EDAR Gene Sequencing - Genetic Testing Company | The DNA Diagnostic Experts | GeneDx", and its source code is attached.

HTML Code:

<title> EDAR Gene Sequencing
<dt>Test Code:</dt>
    <dd>156 </dd>
 
    <dt>Turnaround Time:</dt>
    <dd>6-8 weeks </dd>
 
    <dt>Preferred Specimen:</dt>
    <dd>2-5 mL Blood - Lavender Top Tube </dd>
 
<dt>CPT Codes:</dt>
    <dd>81479x1</dd>
 
<ul id="clinical-utility">
    <li>Confirmation of a clinical diagnosis </li>
    <li>Differentiation between X-linked and autosomal forms of the disease </li>
    <li>Prenatal diagnosis in at-risk pregnancies</li>
 
<ol id="references">
    <li>Bal, E et al. Hum Mutat. 28:703-709, 2007.</li>
    <li>Headon et al. Nature. 414:913-916, 2001.</li>
    <li>Monreal et al. Nat Genet 22:366-369, 1999.</li>
    <li>Chassaing et al. Hum Mutat. 27(3):255-259, 2006</li>

The <...> tags are not needed, only the text, if that is possible. Thanks.

GeneDx.txt (11.8 KB)

Last edited by cmccabe; 09-29-2014 at 03:16 PM. Reason: CODE tags (in this case, HTML tags)

cmccabe


09-29-2014


Parsing XML is not trivial. This awk parser is not perfect at it, but it may do.

Code:

BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match (start at 1; the unset N acted as 0 and dropped a character)
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT, Q, RMID) { # Q and RMID are locals so quote state cannot leak between calls
        while(STR && match(STR, /([ \r\n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        sub("^.*" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}

Use it like this (the -e option requires GNU awk):

Code:

$ awk -f xml.awk -e 'TAGS ~ /^TITLE/ { print $2 }
        TAGS ~ /^H4/ { P=/ORDERING|BILLING|REFERENCES/ ; next }
        {       gsub(/[\r\n\t ]+/, " ", $2);
                sub(/^ $/, "", $2);
                if(P && $2) print $2 }' ORS="\n" index.html

  EDAR Gene Sequencing - Genetic Testing Company | The DNA Diagnostic Experts | GeneDx
Test Code:
156
Turnaround Time:
6-8 weeks
Preferred Specimen:
2-5 mL Blood - Lavender Top Tube
CPT Codes:
81479x1
New York Approved:
Yes
ABN Required:
Yes
Billing Information:
View Billing Policy
ICD Codes:
757.31
: Congenital ectodermal dysplasia
*
 For price inquiries please email
zebras@genedx.com
Bal, E et al. Hum Mutat. 28:703-709, 2007.
Headon et al. Nature. 414:913-916, 2001.
Monreal et al. Nat Genet 22:366-369, 1999.
Chassaing et al. Hum Mutat. 27(3):255-259, 2006
Back To Top
Contact Us
Site Map
Terms of Service
Privacy Statement
&copy; GeneDx
207 Perry Parkway Gaithersburg, MD 20877
Phone: +1 301 519 2100, Fax: +1 301 519 2892
Email:
genedx@genedx.com
Stay Connected:

$
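To see why the parser reads the tag from $1 and the text from $2, here is a minimal sketch of the record splitting on its own. It is not part of xml.awk; the function name show_records is made up for illustration, and the input is a fragment of the sample from the first post.

```shell
# Sketch only: show_records is a hypothetical helper, not part of xml.awk.
# With RS="<" every record starts just after a "<"; with FS=">" the tag
# (plus any attributes) lands in $1 and the text after the tag in $2.
show_records() {
    awk 'BEGIN { RS = "<"; FS = ">" }
         NR > 1 { printf "tag=[%s] text=[%s]\n", $1, $2 }'  # record 1 is the empty text before the first "<"
}

printf '%s' '<dt>Test Code:</dt><dd>156 </dd>' | show_records
```

Each open tag and each close tag becomes its own record, which is why the usage example above can test TAGS and print $2 without ever seeing a "<" or ">".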


Corona688


09-29-2014



Can you post what the desired output should look like?


shamrock


09-29-2014


If none of your tags contain quoted > characters (I didn't see any in your samples, but I didn't search your attachment exhaustively), the following much simpler script might work:

Code:

awk -F '<[^>]*>' '{$1=$1}1' OFS='' file

With the sample data you posted in the first message in this thread, it produces the output:

Code:

 EDAR Gene Sequencing
Test Code:
    156 
 
    Turnaround Time:
    6-8 weeks 
 
    Preferred Specimen:
    2-5 mL Blood - Lavender Top Tube 
 
CPT Codes:
    81479x1
 

    Confirmation of a clinical diagnosis 
    Differentiation between X-linked and autosomal forms of the disease 
    Prenatal diagnosis in at-risk pregnancies
 

    Bal, E et al. Hum Mutat. 28:703-709, 2007.
    Headon et al. Nature. 414:913-916, 2001.
    Monreal et al. Nat Genet 22:366-369, 1999.
    Chassaing et al. Hum Mutat. 27(3):255-259, 2006

I didn't see any problems processing your attached sample either, but due to the length (since this preserves all input lines and just removes tags), I won't post the results here. It would also be easy to get rid of empty lines after removing tags if that is what you want.
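As a rough sketch of that last point, the same field-separator trick can be extended to trim whitespace and skip lines that end up empty. The function name strip_tags is made up for illustration, and the same caveat about quoted > characters inside tags still applies.

```shell
# Sketch only: strip_tags is a hypothetical helper name.
# Reads the named files, or standard input when no file is given.
strip_tags() {
    awk -F '<[^>]*>' -v OFS='' '
        {
            $1 = $1                            # rebuild $0 with the tags removed
            gsub(/^[ \t\r]+|[ \t\r]+$/, "")    # trim leading and trailing whitespace
        }
        length($0) > 0                         # print only lines that still have text
    ' "$@"
}

printf '<dt>Test Code:</dt>\n    <dd>156 </dd>\n \n<ul id="clinical-utility">\n' | strip_tags
```

For the four input lines above, this should print just "Test Code:" and "156"; the whitespace-only line and the line holding only a tag are dropped.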


Don Cragun


09-29-2014


Thank you all

cmccabe
