Python BeautifulSoup Re Finding Digits Within Tags

07-20-2015

I am writing a little Python script that needs to grab the version numbers that appear as "<td>4.2.2</td>" within the first tbody of the page:

Code:

[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> 
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> 
(<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> | <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> 
(<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)
</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> 
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)
</td><td align="center"></tbody>
<tbody>blah blah blah blah blah
</tbody>

Is it possible to use a one-liner to scrape only the version numbers between the tags:

"<td>4.2.2</td>"

so it spits out:
4.2.2
4.2.1
etc..

This is what I have done so far, but I don't understand why it creates the variable rpart as a ResultSet and not as a regular string that I can scrape the data from.

Code:

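# the URL stands in for the fetched page source: BeautifulSoup parses markup, it does not download pages
wphtml = BeautifulSoup('http://blah.blah/release')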
rpart = wphtml.find_all('tbody', limit=1)
rpart[0]
[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> (<a href="https://blah/blah-4.2.2.zip.md5">md5</a> 
| <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> (<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> 
| <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> 
(<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)</td></tr><tr><td>4.2.1</td> 
<td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> (<a href="https://blah/blah-4.2.1.zip.md5">md5</a> 
| <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)</td><td align="center"></tbody>
<tbody>blah blah blah blah blah
</tbody>
whos
rpart           ResultSet        [<tbody>\n<tr style="back<...>="1"></td></tr> </tbody>]
wphtml          BeautifulSoup    <!DOCTYPE html>\n<html di<...>"></iframe></body></html>
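
On the ResultSet question: as far as I know, find_all() always returns a ResultSet (a list subclass holding Tag objects), even with limit=1, while find() returns a single Tag or None. A minimal sketch, assuming bs4 is installed and using a made-up, cut-down version of the table above:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page.
html = (
    "<table><tbody><tr><td>4.2.2</td></tr></tbody>"
    "<tbody><tr><td>blah</td></tr></tbody></table>"
)
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("tbody", limit=1)  # list-like ResultSet with one Tag in it
first = soup.find("tbody")                 # the Tag itself, or None if nothing matches

print(type(matches).__name__)
print(type(first).__name__)

# Neither is a plain string; the text has to be pulled out of the Tag.
first_text = str(first.td.string)
print(first_text)
```

So rpart[0] (or find()) gives a Tag that can be searched further, rather than a string to run a regex over.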

Is there a way to do this as a one-liner? Something like:

Code:

rpart = wphtml.find_all('tbody', limit=1, td=re.compile('\<td\>\d*.\d*.\d*.\<\/td\>'))
4.2.2
4.2.1
etc..

or

for tag in wphtml.find_all('tbody', limit=1, string=re.compile("\b\<td\>\d*.\d*.\d*.\<\/td\>\b")):
    print(tag.content)
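
A note on why the two attempts above are unlikely to work as written. To my understanding, extra keyword arguments to find_all() such as td=... are treated as attribute filters, and string=... is compared against a tag's text, not its markup. The patterns themselves also have two problems that can be checked with plain re; the sketch below uses only the standard library, and the sample strings are made up:

```python
import re

cell = "<td>4.2.2</td>"

# In a non-raw string, "\b" is a backspace character, not a word boundary.
is_backspace = ("\b" == "\x08")

# An unescaped "." matches any character, so this pattern accepts non-version text.
loose = re.compile(r"<td>\d*.\d*.\d*.</td>")
# Escaped dots plus \d+ require digits separated by literal dots.
strict = re.compile(r"<td>(\d+(?:\.\d+)+)</td>")

loose_hit = loose.search("<td>abc</td>")    # matches, although "abc" is no version
strict_hit = strict.search("<td>abc</td>")  # no match
version = strict.search(cell).group(1)      # captured text between the tags

print(is_backspace, loose_hit is not None, strict_hit, version)
```

Using raw strings (r"...") and escaping the dots avoids both issues.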

So what I am trying to do is:

1 - Search through the HTML page and capture only the first <tbody>...</tbody>, hence limit=1
2 - Regex through the result and print only the version numbers that sit inside <td>...</td> tags, e.g. <td>4.2.2</td>
3 - Resulting in:

4.2.2
4.2.1
etc..
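
The three steps above can be sketched roughly as follows. This is only one possible approach, not a definitive answer: it assumes a reasonably recent bs4 (older releases spell the string= argument as text=), and the HTML and the VERSION pattern are illustrative stand-ins for the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page.
HTML = """
<table>
<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a>
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)
</td></tr>
<tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a>
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)
</td></tr>
</tbody>
<tbody><tr><td>blah blah blah</td></tr></tbody>
</table>
"""

# Digits separated by literal dots, e.g. 4.2.2
VERSION = re.compile(r"^\d+(?:\.\d+)+$")

soup = BeautifulSoup(HTML, "html.parser")

# Step 1: only the first <tbody>.
first_tbody = soup.find("tbody")

# Step 2: keep <td> cells whose only content is text that looks like a version.
versions = [str(td.string) for td in first_tbody.find_all("td", string=VERSION)]

# Step 3: one version per line.
print("\n".join(versions))
```

Steps 1 and 2 can be chained into a single expression if a one-liner is really wanted; the regex is matched against the cell text, so the tags never need to appear in the pattern.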

Last edited by metallica1973; 07-20-2015 at 04:58 PM..

metallica1973


07-21-2015

The best thing about a language like Python is that it ships with ready-made parsers that make your life simpler, so you need not resort to (cheaper?) techniques like regex (leave those to Perl :-D).

What you're trying to parse looks like an HTML file. Take a look at the HTMLParser module and see if you can cook up something with it.
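
As a rough illustration of the HTMLParser route: the sketch below uses the Python 3 module html.parser (the module is named HTMLParser on Python 2). The class name VersionCells, the VERSION pattern and the sample HTML are all made up for this example.

```python
import re
from html.parser import HTMLParser

VERSION = re.compile(r"\d+(?:\.\d+)+")

class VersionCells(HTMLParser):
    """Collect version-like text found inside <td> cells of the first <tbody>."""

    def __init__(self):
        super().__init__()
        self.tbody_seen = 0
        self.in_first_tbody = False
        self.in_td = False
        self.versions = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.tbody_seen += 1
            self.in_first_tbody = (self.tbody_seen == 1)
        elif tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_first_tbody = False
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        text = data.strip()
        # Link labels such as "zip" or "md5" fail the version check and are skipped.
        if self.in_first_tbody and self.in_td and VERSION.fullmatch(text):
            self.versions.append(text)

# Hypothetical, shortened stand-in for the real page.
HTML = """
<table><tbody>
<tr><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a>
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a>)</td></tr>
<tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a></td></tr>
</tbody>
<tbody><tr><td>blah blah blah</td></tr></tbody></table>
"""

parser = VersionCells()
parser.feed(HTML)
print("\n".join(parser.versions))
```

The parser tracks state by hand, which is more code than the BeautifulSoup version but needs nothing outside the standard library.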

balajesuri


07-23-2015

Many thanks for the reply,

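After putting a little elbow grease into this, I was able to accomplish what I needed with BeautifulSoup:

Code:

# the URL stands in for the fetched page source: BeautifulSoup parses markup, it does not download pages
wphtml = BeautifulSoup('http://blah.blah/release')
rpart = wphtml.find('tbody')    # first <tbody> only; find() returns a Tag, not a ResultSet
tds = rpart.find_all('td')
blah = []
# the re.compile(r'<td>(.*?)</td>', flags=re.DOTALL) call that sat in this loop was never used, so it is omitted
for r in tds:
    # .string is None for cells with more than one child (the download-link cells)
    blah.append(r.string)
blah
u'4.2.2',
 None,
 None,
 None,
 u'4.2.1',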

My next question is: how do I get rid of the None entries?

---------- Post updated at 05:46 PM ---------- Previous update was at 02:39 PM ----------

blah = list(filter(None, blah))  # list() keeps the result a list on Python 3 as well
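
For anyone reading later, a short sketch of where the None values come from and one way to drop them. It assumes bs4 is installed; the HTML is a made-up, shortened stand-in for the real page. Note that filter(None, ...) removes every falsy value (empty strings too), whereas the explicit "is not None" test below removes only the missing ones.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page.
HTML = """
<table><tbody>
<tr><td>4.2.2</td><td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a>
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a>)</td></tr>
<tr><td>4.2.1</td><td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a>
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a>)</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")
cells = soup.find("tbody").find_all("td")

# .string is None whenever a tag has more than one child, as the link cells do.
raw = [td.string for td in cells]

# Keep only the cells that actually carried a single text value.
versions = [str(s) for s in raw if s is not None]
print(versions)
```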

Last edited by metallica1973; 07-23-2015 at 04:34 PM..

metallica1973
