Parse html


 
# 1  
02-07-2015

I downloaded the page source using:
Code:
  wget -qO- http://fulgentdiagnostics.com/test/clinical-exome/ > flugentsource.txt

Now I am trying to use sed to parse it to confirm a gene count. Basically, I want to output (to flugent.txt) all the gene names, with a total count after them.

I'm not all that familiar with sed, but this seemed to work before:

Code:
 sed 'N; s/\n/\t/; s/href="/>/; s/<[^>]*>//g; s/">/\t/g; s/[ -]*&#[0-9]*;[ -]*//g; /^[\t]*$/d' flugentsource.txt > flugent.txt

The text I need comes after <h4>Genes: </h4> and ends with </div>.

Thank you.

# 2  
02-07-2015
Hello cmccabe,

Could you please try the following code and let me know if it helps? It is also advisable to always show the expected output, as that helps us understand the requirement more clearly.
Code:
awk '/<h4>Genes: <\/h4>/ {A=1} A && /<\/div>$/ {print $0;A=0}'  Input_file

Thanks,
R. Singh
# 3  
02-07-2015
The output is attached and is very close. At the end of the file, where it says </div>, can that be the total count of all the gene names? Also, is the whitespace at the beginning due to the source file being indented? Thanks.
# 4  
02-07-2015
Hello cmccabe,

Could you please try the following and let me know if it helps?
Code:
awk '/<h4>Genes: <\/h4>/ {A=1} A && /<\/div>$/ {print $0;B=$0;A=0} END{S=gsub(/,/,X,B);print "Total Count: " S+1}'  Input_file
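
If one gene per line is preferred, with the leading indentation removed, something along these lines might work. It is only a sketch: it assumes the whole list sits on a single line that ends with </div>, and the sample file, the gene names in it, and the helper name list_genes are made up for the demonstration.

```shell
# Hypothetical sample standing in for flugentsource.txt; the gene names are made up.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div>
    <h4>Genes: </h4>
    AAAS, ABCA1, ABCB4</div>
<div>
    <h4>Number of Genes: </h4>
    3</div>
EOF

# Print one gene per line, then the total.
list_genes() {
    awk '
        /<h4>Genes: <\/h4>/ { in_list = 1; next }     # the list starts on the next line
        in_list && /<\/div>[ \t]*$/ {
            sub(/<\/div>[ \t]*$/, "")                 # drop the closing tag
            n = split($0, gene, ",")                  # one array element per gene
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", gene[i])  # trim indentation and padding
                print gene[i]
            }
            print "Total Count: " n
            in_list = 0
        }
    ' "$1"
}

list_genes "$sample"
```

On the real file this would be called as list_genes flugentsource.txt > flugent.txt.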

Thanks,
R. Singh
# 5  
02-07-2015
That seemed to work and confirms that there are 41 additional genes (Total Count = 4682) in that list even though it says 4641 in the source. Thanks.
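
Nothing in the thread explains the difference of 41. One thing that could be checked is whether any name appears in the list more than once. The sketch below only shows how such a check might look; it does not claim duplicates are the cause. It assumes a single-line list ending with </div>, and the sample data and the helper name count_genes are invented.

```shell
# Hypothetical sample with one repeated name; the gene names are made up.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div>
    <h4>Genes: </h4>
    AAAS, ABCA1, AAAS, ABCB4</div>
EOF

# Report repeated names, then the listed and unique totals.
count_genes() {
    awk '
        /<h4>Genes: <\/h4>/ { in_list = 1; next }
        in_list && /<\/div>[ \t]*$/ {
            sub(/<\/div>[ \t]*$/, "")                 # drop the closing tag
            n = split($0, gene, ",")
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", gene[i])  # trim padding around each name
                if (gene[i] in seen) print "Duplicate: " gene[i]
                else { seen[gene[i]] = 1; unique++ }
            }
            print "Listed: " n
            print "Unique: " unique
            in_list = 0
        }
    ' "$1"
}

count_genes "$sample"
```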
# 6  
02-08-2015
Not sure I understood what you were after. Try this to compare the actual number in the list with the given Gene count:
Code:
awk     '/<h4>Genes/            {getline; print NF}
         /<h4>Number of Genes/  {getline
                                 gsub (/[ \t]*|<\/div>/, "")
                                 print}
        ' FS=, /tmp/flugentsource.txt
4682
4641

---------- Post updated at 20:24 ---------- Previous update was at 20:01 ----------

If the genes are split over several lines, try
Code:
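awk     '/<h4>Genes/            {# add up the commas on every line up to the closing </div>;
                                 # stop at end of file so a missing </div> cannot loop forever
                                 while ((getline) > 0) {
                                        CNT += NF - 1
                                        if ($0 ~ /<\/div>/) break
                                 }
                                 print CNT + 1
                                }
         /<h4>Number of Genes/  {getline
                                 gsub (/[ \t]*|<\/div>/, "")
                                 print}
        ' FS=, file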
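
Counting commas line by line relies on every separating comma actually being present in the file, also where the list wraps. An alternative, sketched here under the assumption that a line break never falls inside a gene name, is to join the lines first and split once. The sample data and the helper name count_wrapped are made up.

```shell
# Hypothetical sample with the list wrapped over several lines; names are made up.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div>
    <h4>Genes: </h4>
    AAAS, ABCA1,
    ABCB4
    , ABCC2</div>
EOF

# Join all lines of the list, cut at </div>, and count the comma-separated items.
count_wrapped() {
    awk '
        /<h4>Genes/ { in_list = 1; next }   # the list starts on the next line
        in_list {
            buf = buf $0                    # append the current line to the buffer
            if ($0 ~ /<\/div>/) {
                sub(/<\/div>.*/, "", buf)   # discard the closing tag and anything after it
                print split(buf, gene, ",") # split returns the number of items
                in_list = 0; buf = ""
            }
        }
    ' "$1"
}

count_wrapped "$sample"
```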
