Parsing - export html table data as .csv file?

Tags

awk, html parse, linux, solved

05-20-2016

Hi all,

Does anyone out there have a brilliant idea for how to export HTML table data as a .csv file, or write it to a text file with comma-separated values? I also need to get the filename of each link in every table, with one output line per table row.

Please see the attached html and PNG of what it looks like.

sample.html

[Attached image: sample.png]

I already googled for a solution, and here is what I found: https://www.mylinuxplace.com/convert-html-to-csv/ but it does not include the links.

[Attached image: before and after]

Thank you so much!

Last edited by bakunin; 05-20-2016 at 01:25 PM..

lxdorney

05-20-2016

Quote:

Originally Posted by lxdorney

Does anyone out there have a brilliant idea for how to export HTML table data as a .csv file, or write it to a text file with comma-separated values? I also need to get the filename of each link in every table, with one output line per table row.

Alas, there is indeed a "brilliant idea", but you probably are not going to like it: write a parser!

The solution you found (and which is similar to many others, including a few of my own) will work the way it is supposed to as long as the HTML source you feed it is "well-behaved". Well-behaved in this context means: it shall not contain constructs the creator of said solution did not think about in advance. If it does, the "solution" will perhaps break in one way or another.

The reason is that "parsing" cannot be done with regular expressions, however cleverly arranged. "Parsing" is a recursive process, and with anything short of a recursive parser you might get somewhere near a solution, but not a solution in the full meaning of the word. If you are interested in why: here it is at length.

So, if you can live with some shortcomings, like the chance that the "solution" you end up with will not always work, you can use what you found. If you need a real solution, I suggest the "Dragon Book" ("Compilers: Principles, Techniques, and Tools"; Aho, Sethi, Ullman) as the best reference for building parsers, lexical analysers and similar programs.
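To illustrate what "well-behaved" means here, a minimal sketch follows. It is not the script from the linked page; the helper name naive_cells and the sample inputs are made up for this example. It extracts table cells with a single sed regular expression, which only holds up while every cell sits alone on its own line:

```shell
# naive_cells is a hypothetical helper: regex-only extraction of <td> cells.
naive_cells() {
    # Print whatever lies between <td> and </td> on each input line.
    sed -n 's/.*<td>\(.*\)<\/td>.*/\1/p'
}

# Well-behaved input: one cell per line.
printf '<td>one</td>\n<td>two</td>\n' | naive_cells

# Two cells on one line: the greedy leading .* swallows the first cell.
printf '<td>one</td><td>two</td>\n' | naive_cells

# One cell split across two lines: neither line matches on its own.
printf '<td>split\nacross lines</td>\n' | naive_cells
```

The first call prints both cells, the second prints only "two", and the third prints nothing at all. All three inputs are perfectly valid HTML fragments.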

I hope this helps.

bakunin

05-20-2016

Thanks for the response. I'm hoping for a workaround on this ☺

lxdorney

05-20-2016

Quote:

Originally Posted by lxdorney

Hi all,

Does anyone out there have a brilliant idea for how to export HTML table data as a .csv file, or write it to a text file with comma-separated values?

I'm curious what you understand the difference between these two to be. CSV is nothing but a text file with comma-separated values (which becomes a real pain to manage whenever the data itself has commas in it).

If the idea is to open this data in Excel, I find it much easier to make a tab-delimited file than to struggle with CSV.
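A minimal sketch of the difference, assuming three made-up sample fields (demo_row is a hypothetical helper, not part of any library). For the CSV line it uses the quoting convention described in RFC 4180: wrap each field in double quotes and double any embedded double quote. The tab-delimited line needs no quoting at all, provided the data contains no tabs or newlines:

```shell
# demo_row is a hypothetical helper for this sketch.
demo_row() {
    awk 'BEGIN {
        # Three sample fields: one holds a comma, one holds double quotes.
        n = split("AA Number. 3-456;May 18, 2016;say \"hi\"", F, ";")

        # CSV: quote every field and double any embedded double quote.
        for (i = 1; i <= n; i++) {
            s = F[i]
            gsub(/"/, "\"\"", s)
            printf("%s\"%s\"", (i > 1 ? "," : ""), s)
        }
        printf("\n")

        # Tab-delimited: no quoting, as long as no field holds a tab or newline.
        for (i = 1; i <= n; i++)
            printf("%s%s", (i > 1 ? "\t" : ""), F[i])
        printf("\n")
    }'
}

demo_row
```

The comma inside the date is harmless in both lines, but only the CSV line had to be rewritten to make it so.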

I might have a solution. Working on it.

Corona688

05-20-2016

Quote:

Originally Posted by lxdorney

Thanks for the response. I'm hoping for a workaround on this ☺

There is no "work around". Please read the link above to understand why. I just happen to have a parser already, since we get asked XML questions 37 times a day here.

Corona688

05-20-2016

This is a "work around" for handling arbitrary XML in shell, my yanx.awk library:

Code:

BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuiop")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT, Q, RMID) { # X, OUT, Q, RMID are locals
        while(STR && match(STR, /([ \r\n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}
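A minimal sketch of the record-splitting trick in the BEGIN block above, assuming a made-up one-line HTML fragment (show_records is a hypothetical helper, not part of yanx.awk). With RS="<" every tag starts a new record, and with FS=">" the tag itself lands in $1 while the text that follows it lands in $2:

```shell
# show_records is a hypothetical helper for this sketch.
show_records() {
    awk 'BEGIN { RS = "<"; FS = ">" }
         NR == 1 { next }    # text before the first "<" (empty here)
         { printf("tag=[%s] text=[%s]\n", $1, $2) }'
}

printf '%s' '<tr><td class="x">Hello</td></tr>' | show_records
```

This prints one line per tag, with "Hello" attached to the td record. Everything else in yanx.awk is bookkeeping on top of these two fields: TAG, CTAG, TAGS and ARGS are all derived from $1.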

And here is how you use it, html.awk:

Code:

BEGIN { ORS="" ; OFS="" ; LINK=1; COL=0 }

# Print a column of data in CSV format
function csv(S) {
        gsub(/[,"]/, "\\\\&", S); printf("%s\"%s\"", OFS, S); OFS=","
}

# When a table row starts, or the entire table ends, print row
(TAG=="TR" || CTAG=="TABLE") && COL {
        OFS=""
        # Print current row if any
        for(C=1; C<=COL; C++) { csv(DATA[C]);   delete DATA[C]; }
        for(C=1; C in LINKS; C++) { csv(LINKS[C]); delete LINKS[C]; }
        printf("\n");
        COL=0; LINK=1 # Reset indexes for arrays
}
# Count columns in the table. Count <em> as a column too, to keep the date and comment separate
TAG=="TD" || TAG=="EM" { COL++ }

# Clean up HTML garbage
{ gsub(/([|])|([ \r\n\t]+)|(&nbsp;)/, " ", $2); }

# Collect attachments when found
TAGS ~ /TABLE/ && ARGS["HREF"] {
        sub(/.*\//, "", ARGS["HREF"]); # Strip the path, keep only the filename
        LINKS[LINK++]=ARGS["HREF"];
        delete ARGS["HREF"];
        next # Skip to the next tag, we don't want the link title
}

# Append text to the current row and col
TAGS ~ /(^|%)TD%/ && !($2 ~ /^[ \r\n\t]+$/) {DATA[COL] = DATA[COL] $2 }
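A minimal sketch of how the TAGS string behaves, replayed by hand rather than through yanx.awk (tag_stack_demo and the tag sequence are made up for this example). TAGS works as a stack of open tags, newest first, each followed by "%", so the pattern /(^|%)TD%/ is true only while a td is open:

```shell
# tag_stack_demo is a hypothetical helper for this sketch.
tag_stack_demo() {
    awk 'BEGIN {
        TAGS = ""
        n = split("TABLE TR TD", OPEN, " ")
        for (i = 1; i <= n; i++) {
            TAGS = OPEN[i] "%" TAGS          # push, as the open-tag rule does
            print "open " OPEN[i] ": " TAGS
        }
        state = (TAGS ~ /(^|%)TD%/) ? "inside TD" : "outside TD"
        print state

        sub("^TD%", "", TAGS)                # pop, as the close-tag rule does
        print "close TD: " TAGS
        state = (TAGS ~ /(^|%)TD%/) ? "inside TD" : "outside TD"
        print state
    }'
}

tag_stack_demo
```

The (^|%) part of the pattern keeps it from matching a longer tag name that merely ends in "TD".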

And here is how you run it:

Code:

$ awk -f yanx.awk -f html.awk input.html

"AA Number. 3-456","The quick brown fox jumps over the lazy dog near the bank of the river. The quick brown fox jumps over the lazy dog near the bank of the river.","(Hello World May 20\, 2016)","May 18\, 2016","abcd.pdf","abcfull.pdf"
"BB Number. 7-890","The quick brown fox jumps over the lazy dog near the bank of the river1.The quick brown fox jumps over the lazy dog near the bank of the river2.","(Lord of the rings May 30\, 2016)","May 28\, 2016","efghi.pdf","efghifull.pdf","efghisum.pdf"

$

Use nawk on Solaris.

Corona688

05-20-2016

Brilliant! There is no other way to say it!

What an inspired effort, Corona688!

bakunin
