Python BeautifulSoup Re Finding Digits Within Tags


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Python BeautifulSoup Re Finding Digits Within Tags
# 1  
Old 07-20-2015
Python BeautifulSoup Re Finding Digits Within Tags

I am writing a little python script that needs to grab version numbers between "<td>4.2.2</td>" within the tbody of the page:
Code:
[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> 
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> 
(<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> | <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> 
(<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)
</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> 
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)
</td><td align="center">[/tbody]
[tbody]blah blah blah blah blah
[/tbody]

Is it possible to use a one-liner to scrap only the digits between the tags:

"<td>4.2.2</td>"

so it spits out:
4.2.2
4.2.1
etc..

This is what I have done so far but dont understand why it creates the variable rpart as a ResultSet and a regular string that I can scrape the data.
Code:
wphtml = BeautifulSoup('http://blah.blah/release)
rpart = wphtml.find_all('tbody', limit=1)
rpart[0]
[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> (<a href="https://blah/blah-4.2.2.zip.md5">md5</a> 
| <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> (<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> 
| <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> 
(<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)</td></tr><tr><td>4.2.1</td> 
<td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> (<a href="https://blah/blah-4.2.1.zip.md5">md5</a> 
| <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)</td><td align="center">[/tbody]
[tbody]blah blah blah blah blah
[/tbody]
whos
rpart           ResultSet        [<tbody>\n<tr style="back<...>="1"></td></tr> </tbody>]
wphtml          BeautifulSoup    <!DOCTYPE html>\n<html di<...>"></iframe></body></html>

Is this their a way to do this as a one-liner?
Code:
rpart = wphtml.find_all('tbody', limit=1, td=re.compile('\<td\>\d*.\d*.\d*.\<\/td\>'))
4.2.2
4.2.1
etc..

or

for tag in wphtml.find_all('tbody', limit=1, string=re.compile("\b\<td\>\d*.\d*.\d*.\<\/td\>\b")):
    print(tag.content)

So what I am trying to do is:

1 - Search through the html page and capture on the first [tbody]....[/tbody], hence limit=1
2 - Regex through the results and only print out the digits that are inside the <td>\d*.\d*.\d*.\<td> tags
3 - Resulting in:

4.2.2
4.2.1
etc..

Last edited by metallica1973; 07-20-2015 at 04:58 PM..
# 2  
Old 07-21-2015
The best thing about using a language like python is that you've ready-made parsers to make your life simpler.. and not resort to (cheaper?) techniques like regex (leave those things to perl :-D).

What you're trying to parse looks like a HTML file. Take a look at the HTMLParser module and see if you can cook something using that.
# 3  
Old 07-23-2015
Many thanks for the reply,

after putting a little elbow grease into this, I was able to accomplish what I needed to do with Beautiful and re:
Code:
wphtml = BeautifulSoup('http://blah.blah/release)
rpart=wphtml.soup.find('tbody')
tds=rpart.find_all('td')
blah=[]
for r in rpart:
    re.compile(r'<td>(.*?)</td>', flags=re.DOTALL)
    blah.append(r.string)
blah
u'4.2.2',
 None,
 None,
 None,
 u'4.2.1',

my next question is how do I get rid of the NoneSmilie

---------- Post updated at 05:46 PM ---------- Previous update was at 02:39 PM ----------

blah=filter(None, blah)

Last edited by metallica1973; 07-23-2015 at 04:34 PM..
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

sed / awk script to delete the two digits from first 3 digits

Hi All , I am having an input file as stated below 5728 U_TOP_LOGIC/U_CM0P/core/u_cortexm0plus/u_top/u_sys/u_core/r03_q_reg_20_/Q 011 611 U_TOP_LOGIC/U_CM0P/core/u_cortexm0plus/u_top/u_sys/u_core/r04_q_reg_20_/Q 011 3486... (4 Replies)
Discussion started by: kshitij
4 Replies

2. Programming

[Python] BeautifulSoup tags > </a>

using BeautifulSoup how can i get the txt between all the > </a> example >The Student .mp4</a> thanks (10 Replies)
Discussion started by: bob123
10 Replies

3. Programming

Create a C source and compile inside Python 1.4.0 to 3.7.0 in Python for ALL? platforms...

Hi all... As you know I like making code backwards compatible for as many platforms as possible. This Python script was in fact dedicated for the AMIGA A1200 using Pythons 1.4.0, 1.5.2, 1.6.0, 2.0.1, and 2.4.6 as that is all we have for varying levels of upgrades from a HDD and 4MB FastRam... (1 Reply)
Discussion started by: wisecracker
1 Replies

4. Shell Programming and Scripting

**python** unable to read the background color in python

I am working on requirement on spreadsheet in python scripting. I have a spreadsheet containing cell values and with background color. I am able to read the value value but unable to get the background color of that particular cell. Actually my requirement is to read the cell value along... (1 Reply)
Discussion started by: giridhar276
1 Replies

5. Shell Programming and Scripting

Find filenames with three digits and add zeros to make five digits

Hello all! I've looked all over the internet and this site and have come up a loss with an easy way to make a bash script to do what I want to do. I have a file with a naming convention as follows: 2012-01-18 string of words here 123.jpg 2012-01-18 string of words here 1234.jpg 2012-01-18... (2 Replies)
Discussion started by: Buzzman25
2 Replies

6. Shell Programming and Scripting

Finding missing tags

I have a list containing strings. All strings should have either "smp" or "drw" else it is considered an error. I have written this code below. Any better ideas to tackle this? set fdrw = 0 set fsmp = 0 foreach f ($Lst) set fdrwtag = `echo $f | awk '/drw/'` set fsmptag = `echo $f | awk... (1 Reply)
Discussion started by: kristinu
1 Replies

7. UNIX for Dummies Questions & Answers

how to use grep: finding a string with double quotes and multiple digits

I have a file with a lot of lines (a lot!) that contain 10 digits between double quotes. ie "1726937489". The digits are random throughout, but always contain ten digits. I can not for the life of me, (via scouring the internet and grep how-to manuals) figure out how to find this when I search.... (3 Replies)
Discussion started by: titusbass
3 Replies

8. Shell Programming and Scripting

Finding tags in file names using csh

I have the following script and want to check if in each $f there exists either a "drw" or "smp" tag in the file name. How can I do it? For example npt06-32x24drw has the "drw" tag npt06-32x24smp has the "smp" tag npt06-32x24 no "drw" or "smp" tag found #!/bin/csh set iarg = 0... (0 Replies)
Discussion started by: kristinu
0 Replies

9. Shell Programming and Scripting

help: single digits inflated to 2 digits

Hi Folks Probably an easy one here but how do I get a sequence to get used as mentioned. For example in the following I want to automatically create files that have a 2 digit number at the end of their names: m@pyhead:~$ for x in $(seq 00 10); do touch file_$x; done m@pyhead:~$ ls file*... (2 Replies)
Discussion started by: amadain
2 Replies
Login or Register to Ask a Question