Programming: Post questions about C, C++, Java, SQL, and other programming languages here.

Python Web Page Scraping Urls Creating A Dictionary

Tags: python, solved
#1  06-06-2017, metallica1973 (Registered User; Join Date: Dec 2007; Location: Washington D.C.; Posts: 219)

I have thrown in the towel and can't figure out how to do this. I have a directory of html files that contain urls that I need to scrape (loop through) and add into a dictionary. An example of the output I would like is:

Code:
bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com

My code that I have so far is:

Code:
import os
from bs4 import BeautifulSoup

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            soup = BeautifulSoup(content, 'lxml')
            for link in soup.find_all('a'):
                urls = link.get('href')
                print "HTML Files: {}\nUrls: {}\n".format(tut, urls)

produces the correct output for the most part:

Code:
HTML Files: bigbadwolf.html
Urls: https://www.blah.com

HTML Files: bigbadwolf.html
Urls: https://www.blahblah.com

HTML Files: bigbadwolf.html
Urls: https://www.blahblahblah.com

HTML Files: maryhadalittlelamb.html
Urls: http://www.red.com

HTML Files: maryhadalittlelamb.html
Urls: https://www.redyellow.com

HTML Files: maryhadalittlelamb.html
Urls: http://www.zigzag.com

but I want it in a dictionary with this format:

Code:
bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com

As you can see, there will be several urls inside of an html doc, so there will be keys that can contain many values (urls). I tried many variations of the code below but can't get a single key to have many urls associated with it.

Code:
tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                tut_links[tut] = urls

produces:

Code:
bigbadwolf.html: https://www.blah.com
maryhadalittlelamb.html: http://www.red.com
time.html: https://www.est.com
...
...
...

Can someone please shine some light on what I am trying to do?

Last edited by metallica1973; 06-06-2017 at 01:42 PM..
#2  06-06-2017, Neo (Administrator, Forum Staff; Join Date: Sep 2000; Location: Asia pacific region; Posts: 14,086)
We don't get many Python questions here, sorry.

PHP, PERL, and all the standard UNIX and Linux shell programming languages, as well as C and C++ questions; but not many Python questions.

I'm not a Python programmer, so perhaps someone else here is and can help you?
#3  06-06-2017, durden_tyler (Registered User, Forum Advisor; Join Date: Apr 2009; Posts: 2,083)
Quote:
Originally Posted by metallica1973 View Post
...
...
but I want it in a dictionary with this format:

Code:
bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com

As you can see, there will be several urls inside of an html doc, so there will be keys that can contain many values (urls).
...
...
Is each value of the dictionary:
(a) a list (or array) of URLs? or
(b) a comma-delimited string of URLs?

If you want (a), then try something like the following:


Code:
import os
from bs4 import BeautifulSoup

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []                  # one list of URLs per file
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            soup = BeautifulSoup(content, 'lxml')
            for link in soup.find_all('a'):
                urls = link.get('href')
                tut_links[tut].append(urls)      # accumulate instead of overwrite

Disclaimer: Completely untested; I don't have the module at the moment.
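A variant of the same idea, sketched here with made-up data standing in for the os.walk/BeautifulSoup loop: `collections.defaultdict(list)` creates the empty list on first access, so the `tut_links[tut] = []` initialisation line isn't needed.

```python
from collections import defaultdict

# hypothetical (file, href) pairs, standing in for the parsed html files
parsed = [
    ("bigbadwolf.html", "https://www.blah.com"),
    ("bigbadwolf.html", "http://www.blahblah.com"),
    ("maryhadalittlelamb.html", "http://www.red.com"),
]

tut_links = defaultdict(list)
for tut, url in parsed:
    tut_links[tut].append(url)  # the list is created automatically on first use
```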
#4  06-06-2017, metallica1973
Thanks for all the replies

Quote:
Is each value of the dictionary:
(a) a list (or array) of URLs? or
(b) a comma-delimited string of URLs?
it is a comma-delimited string of URLs
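If the end goal really is one comma-delimited string per file, the list values built in post #3 can be joined afterwards; a minimal sketch with made-up data:

```python
# hypothetical dictionary of lists, as produced by the loop in post #3
tut_links = {"time.html": ["https://www.est.com", "http://www.pst.com"]}

# join each URL list into a single comma-delimited string
tut_strings = {tut: ", ".join(urls) for tut, urls in tut_links.items()}
```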

---------- Post updated at 02:30 PM ---------- Previous update was at 02:05 PM ----------

Thanks durden_tyler,

I tested your list approach, which I had looked at before but hadn't pursued, and it worked. Awesome. Many thanks.

---------- Post updated at 02:51 PM ---------- Previous update was at 02:30 PM ----------

Would you happen to know how to delete duplicate entries inside of this embedded list?

Code:
bigbadwolf.html:

'https://www.blah.com',
'https://www.blah.com',
'https://www.blah.com',
'http://www.blahblah.com',
'http://www.blahblah.com'

#5  06-06-2017, durden_tyler
Quote:
Originally Posted by metallica1973 View Post
...
...
Would you happen to know how to delete duplicate entries inside of this embedded list?
...
...
Do not add a duplicate entry in the first place:


Code:
import os
from bs4 import BeautifulSoup

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            soup = BeautifulSoup(content, 'lxml')
            for link in soup.find_all('a'):
                urls = link.get('href')
                if urls not in tut_links[tut]:   # skip URLs already collected
                    tut_links[tut].append(urls)
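One note on this pattern: the `not in` membership test on a list is linear, so for pages with very many links a common variant keeps a companion set for O(1) lookups while the list preserves order. A sketch with a hypothetical file name and made-up hrefs:

```python
tut = "bigbadwolf.html"          # hypothetical file name
hrefs = ["https://www.blah.com", "https://www.blah.com", "http://www.blahblah.com"]

tut_links = {tut: []}
seen = set()                     # URLs already appended for this file

for url in hrefs:
    if url not in seen:          # O(1) membership test
        seen.add(url)
        tut_links[tut].append(url)   # order of first appearance preserved
```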

#6  06-07-2017, metallica1973
I cross-referenced the html file with the output urls and it's correct. Most of the html files do contain multiple duplicate urls, as in:

http://www.blah.org
http://www.blah.org
http://www.blah.org

So I would need to remove the duplicates. I have done this before using:

Code:
tut_links=list(set(tut_links))

Let me give this a shot and see what happens. Thanks for all the help. I will let you know how it goes.
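One caveat with `list(set(...))`: a set has no defined order, so the deduplicated URLs may come back shuffled. If the original order matters, `collections.OrderedDict.fromkeys` deduplicates while keeping each URL's first-seen position; a sketch with made-up data:

```python
from collections import OrderedDict

urls = ['https://www.blah.com', 'https://www.blah.com',
        'http://www.blahblah.com', 'https://www.blah.com']

# fromkeys keeps one entry per URL, in first-seen order
deduped = list(OrderedDict.fromkeys(urls))
```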

---------- Post updated 06-07-17 at 02:54 PM ---------- Previous update was 06-06-17 at 04:03 PM ----------

Thanks for all the help. Here is the finished code:

Code:
import os
from bs4 import BeautifulSoup

tut_links = {}

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            soup = BeautifulSoup(content, 'lxml')
            for link in soup.find_all('a', href=True):
                urls = link.get('href')
                # note: startswith('http' or 'https') only tests 'http',
                # because 'http' or 'https' evaluates to 'http';
                # a tuple checks both schemes
                if urls.startswith(('http://', 'https://')):
                    tut_links[tut].append(urls)

# remove duplicate urls from each dictionary value list
for dup in tut_links.values():
    dup[:] = list(set(dup))

Worked like a champ

Code:
'bigbadwolf.html' : ['https://www.blah.com', 'http://www.blahblah.com', 'http://www.blahblahblah.com']


Last edited by metallica1973; 06-07-2017 at 04:37 PM..