Adding incomplete HTML code to a file


 
# 1  
Old 08-19-2012
Hi folks,

I am scraping data from the Internet in a format similar to what's on this page: Trigger Notice Report

The code I've written for scraping and storing results works fine when the HTML is well formed, but not when there are mistakes. In particular, it breaks when a table row is missing one of its tags.

If you view the source for the link above, you will see that there are '</TR>' tags marking the end of each HTML row, but sometimes no '<TR>' to mark the beginning of the next row.

I need to use something like awk or sed to do the following: insert a line with '<TR>' whenever the previous line was '</TR>' and the next line starts with '<TD [some text]'. For example, in the code below, I need a line with '<TR>' just before the first Arkansas '<TD' line (highlighted in the original post). The rest of the HTML file follows pretty much the same pattern. Any suggestions?

Code:
<TR bgcolor="#C0C0C0">
<TD headers='Arizona noinfo' align=center width=25 ><FONT size='-2'>@</FONT></TD>
<TD headers='Arizona noinfo' align=center  width=25><FONT size='-2'>&nbsp;</FONT></TD>
<TD headers='Arizona noinfo' align=center  width=25><FONT size='-2'>&nbsp;</FONT></TD>
<TH id='Arizona' align=left  width=100><FONT size='-2'>Arizona</FONT></TH>
<TD headers='Arizona noinfo' align=center  width=50><FONT size='-2'>2</FONT></TD>
<TD headers='Arizona noinfo' align=center  width=50><FONT size='-2'>2</FONT></TD>
<TD headers='Arizona 13_week_IUR indicators' align=center  width=50><FONT size='-2'>2.74</FONT></TD>
<TD headers='Arizona pct_of_prior_2years indicators' align=center  width=50><FONT size='-2'>86</FONT></TD>
<TD headers='Arizona 3_mo_satur indicators' align=center  width=50><FONT size='-2'>9.4</FONT></TD>
<TD headers='Arizona year pct_of_prior indicators' align=center  width=50><FONT size='-2'>102</FONT></TD>
<TD headers='Arizona 2nd_year pct_of_prior indicators' align=center  width=50><FONT size='-2'>128</FONT></TD>
<TD headers='Arizona 2nd_year pct_of_prior indicators' align=center  width=50><FONT size='-2'>223</FONT></TD>
<TD headers='Arizona avail_wks pct_of_prior indicators noinfo' align=center width=50><FONT size='-2'>20</FONT></TD>
<TD headers='Arizona dates periods status' align=center width=100><FONT size='-2'>B 02-22-2009</FONT></TD>
</TR>


<TD headers='Arkansas noinfo' align=center width=25 ><FONT size='-2'>&nbsp;</FONT></TD>
<TD headers='Arkansas noinfo' align=center  width=25><FONT size='-2'>&nbsp;</FONT></TD>
<TD headers='Arkansas noinfo' align=center  width=25><FONT size='-2'>&</FONT></TD>
<TH id='Arkansas' align=left  width=100><FONT size='-2'>Arkansas</FONT></TH>
<TD headers='Arkansas noinfo' align=center  width=50><FONT size='-2'>2</FONT></TD>
<TD headers='Arkansas noinfo' align=center  width=50><FONT size='-2'>2</FONT></TD>
<TD headers='Arkansas 13_week_IUR indicators' align=center  width=50><FONT size='-2'>4.02</FONT></TD>
<TD headers='Arkansas pct_of_prior_2years indicators' align=center  width=50><FONT size='-2'>87</FONT></TD>
<TD headers='Arkansas 3_mo_satur indicators' align=center  width=50><FONT size='-2'>7.9</FONT></TD>
<TD headers='Arkansas year pct_of_prior indicators' align=center  width=50><FONT size='-2'>103</FONT></TD>
<TD headers='Arkansas 2nd_year pct_of_prior indicators' align=center  width=50><FONT size='-2'>131</FONT></TD>
<TD headers='Arkansas 2nd_year pct_of_prior indicators' align=center  width=50><FONT size='-2'>154</FONT></TD>
<TD headers='Arkansas avail_wks pct_of_prior indicators noinfo' align=center width=50><FONT size='-2'>&nbsp;</FONT></TD>
<TD headers='Arkansas dates periods status' align=center width=100><FONT size='-2'>E 09-26-2009</FONT></TD>
</TR>

# 2  
Old 08-19-2012
Try this:
Code:
awk 'BEGIN{FS=">";OFS=FS}
   /<TR/ {r++}
   /<\/TR/ {r--}
   /<TD/&&!r{print "<TR>";r--} 1' infile
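
A note on the one-liner above: after it prints `<TR>` it decrements `r`, so `r` drops below zero and, with rows like the sample, `!r` is not true again. That would fix only the first missing row. A minimal sketch of one alternative follows. The names `fix_rows` and `in_row` are made up for illustration, it assumes each tag starts on its own line as in the sample, and it also treats `<TH` as a cell, which the original does not.

```shell
# Sketch: keep a 0/1 flag saying whether we are inside a row,
# and open a row when a cell shows up outside one.
fix_rows() {
    awk '
        /<TR/                { in_row = 1 }                  # explicit row start
        /<T[DH]/ && !in_row  { print "<TR>"; in_row = 1 }    # cell outside any row: open one
                             { print }                       # always emit the original line
        /<\/TR/              { in_row = 0 }                  # row end
    ' "$@"
}
```

Usage would be along the lines of `fix_rows infile > outfile`. Like the original, it matches upper-case tags only.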

# 3  
Old 08-19-2012
I think this should take care of neatly arranged HTML files like the one you presented, as well as more realistic ones where </TR> might not be on a line by itself.

Code:
awk '
    /<\/tr>$/ {
        print;
        getline;
        if( substr( $1, 1, 3 ) == "<td" )
            printf( "<tr>%s\n", $0 );
        next;
    }
    {
        gsub( "<\/tr>[[:space:]]*<td", "</tr><tr><td " );
        print;
    }
' IGNORECASE=1 input-file >output-file

EDIT: @Chubler_XL -- nice.
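
Two caveats on the script in this post. `IGNORECASE` is a gawk extension; other awk implementations treat `IGNORECASE=1` as an ordinary variable assignment, so the lower-case patterns would never match the upper-case tags in the sample. Also, the sample has blank lines between `</TR>` and the next `<TD`, so the single `getline` reads a blank line instead of the cell. The sketch below is one possible portable variant, not the poster's method: `fix_rows_ci` and `after_close` are hypothetical names, it lower-cases a copy of each line with `tolower()`, and it assumes `</tr>` ends its line. It does not handle `</tr>` and `<td` on the same line.

```shell
# Sketch: after a line ending in </tr>, skip blank lines; if the next
# non-blank line starts with <td, emit <tr> first. Case-insensitive in any POSIX awk.
fix_rows_ci() {
    awk '
        { line = tolower($0) }                                   # compare on a lower-cased copy
        after_close && line ~ /^[ \t\r]*$/ { print; next }       # blank line: keep waiting
        after_close {
            if (line ~ /^[ \t]*<td/) print "<tr>"                # row was never opened
            after_close = 0
        }
        line ~ /<\/tr>[ \t\r]*$/ { after_close = 1 }             # row just closed
        { print }
    ' "$@"
}
```

It can be run as `fix_rows_ci input-file > output-file`.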
# 4  
Old 09-04-2012
Hey Chubler_XL and agama,

Thanks for your help! Maybe I'm doing something wrong, but when I ran agama's code it didn't seem to change the file at all (weird). Chubler's fixed only the first instance of the issue, not all of them.

Any suggestions?
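
One way to tell whether a fix reached every row is to count opening and closing row tags before and after running it. The sketch below is a rough check and not part of the replies: `count_row_tags` is a hypothetical helper name, and equal counts only suggest, they do not prove, that every row is now opened.

```shell
# Sketch: print "<opening count> <closing count>" of row tags, ignoring case.
count_row_tags() {
    awk '
        { line = toupper($0) }
        { opens  += gsub(/<TR[ >]/, "&", line) }   # gsub returns the number of matches
        { closes += gsub(/<\/TR>/, "&", line) }
        END { print opens + 0, closes + 0 }
    ' "$@"
}
```

Running `count_row_tags infile` and then `count_row_tags outfile` shows whether the gap between the two numbers has closed.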
# 5  
Old 09-05-2012
Try this:
Code:
awk '/TR/{if($1 ~/^<TR/){f=1};if($1 ~ /^<\/TR/){f=0};print;next}(f==0&&NF>0){print "<TR sometext>";f=1}{print}' filename


Last edited by raj_saini20; 09-05-2012 at 06:00 AM..