Find specific link w/ beautifulsoup - python

Hi, I cannot figure out for the life of me how to find links that begin with certain text.
findAll('a') works fine, but it returns far too much. I just want to make a list of all links that begin with
http://www.nhl.com/ice/boxscore.htm?id=
Can anyone help me?
Thank you very much

First set up a test document and open up the parser with BeautifulSoup:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
>>> soup = BeautifulSoup(doc)
>>> print soup.prettify()
<html>
<body>
<div>
<a href="something">
yep
</a>
</div>
<div>
<a href="http://www.nhl.com/ice/boxscore.htm?id=3">
somelink
</a>
</div>
<a href="http://www.nhl.com/ice/boxscore.htm?id=7">
another
</a>
</body>
</html>
Next, we can search for all <a> tags with an href attribute starting with http://www.nhl.com/ice/boxscore.htm?id=. You can use a regular expression for it:
>>> import re
>>> soup.findAll('a', href=re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id='))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]
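If you would rather avoid regular expressions, BeautifulSoup also accepts a callable as an attribute filter, so a plain startswith check works too. A minimal sketch of the same search:
>>> prefix = 'http://www.nhl.com/ice/boxscore.htm?id='
>>> soup.findAll('a', href=lambda h: h and h.startswith(prefix))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]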

You might not need BeautifulSoup at all, since your search is so specific:
>>> import re
>>> links = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"\']+', str(doc))

You can find all links and then filter that list to get only the links you need. This will still be very fast, even though you filter afterwards.
listOfAllLinks = soup.findAll('a')
listOfLinksINeed = []
for link in listOfAllLinks:
    if "www.nhl.com" in link.get('href', ''):
        listOfLinksINeed.append(link['href'])
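If what you ultimately want is the game id from each of those links, the standard library can pull it out of the query string for you; a small sketch (the module is urllib.parse on Python 3):
from urlparse import urlparse, parse_qs
for href in listOfLinksINeed:
    query = parse_qs(urlparse(href).query)
    print query['id'][0]  # e.g. '3'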

Related

How can I extract the following links from html source code in python?

Here is some of my HTML source code:
<div class="s">
<div class="th N3nEGc" style="height:48px;width:61px">
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140&tbnid=10DzCgmImE0jM&tbnh=201&tbnw=251&usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&docid=0vImrzSjsr5zQM"
data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ"
ping="/urlsa=t&source=web&rct=j&url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">
</a>
</div>
</div>
What I want to extract is the link:
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&
so the output will be:
https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg
What I tried using Python is:
sourceCode = opener.open(googlePath).read().decode('utf-8')
links = re.findall('href="/imgres?imgurl=(.*?)jpg&imgrefurl="', sourceCode)
for i in links:
    print(i)
A better way than parsing the query string with a regex is to use the parse_qs function (it is safer, and you get exactly what you want without regex fiddling) (doc):
data = '''<div class="s"><div class="th N3nEGc" style="height:48px;width:61px"><a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140&tbnid=10DzCgmImE0jM&tbnh=201&tbnw=251&usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&docid=0vImrzSjsr5zQM" data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ" ping="/urlsa=t&source=web&rct=j&url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">'''
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
soup = BeautifulSoup(data, 'lxml')
d = urlparse(soup.select_one('a[href*="imgurl"]')['href'])
q = parse_qs(d.query)
print(q['imgurl'])
Prints:
['https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg']
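Note that parse_qs returns a list for every key, so take the first element if you want the bare URL string:
print(q['imgurl'][0])
# https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg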
If the problem is your regex, then I think you can try this one:
link = re.search(r'^https?://.*[\r\n]*[^.\\,:;]', sourceCode)
link = link.group()
print(link)
Perhaps you should add an escape character for '?'. Try this:
links = re.findall(r'href="/imgres\?imgurl=(.*?)jpg&imgrefurl="', sourceCode)
for i in links:
    print(i)

How to extract link to image "a href" & "class" in an HTML page using BeautifulSoup

So I have several images using the same line of code to refer to html image links on a page: <a href="#" class="sh-mo__image" data-image="http://somejpgimage.jpeg">
I would like to retrieve the link only but just can't seem to navigate beyond the class to the link.
Can anyone help?
Also I have "n" number of links which I would like to retrieve separately.
You can do what @D.Chel suggested using a list comprehension.
>>> links = [x['data-image'] for x in soup.find_all('a', {'class': 'sh-mo__image'})]
>>> links
['http://somejpgimage1.jpeg', 'http://somejpgimage2.jpeg']
I believe that you are looking for something like this:
from bs4 import BeautifulSoup
html = ''' <a href="#" class="sh-mo__image" data-image="http://somejpgimage1.jpeg">
<a href="#" class="sh-mo__image" data-image="http://somejpgimage2.jpeg"> '''
soup = BeautifulSoup(html,'lxml')
mylinks = []
for link in soup.find_all('a', {'class': 'sh-mo__image'}):
    mylinks.append(link['data-image'])
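If you prefer CSS selectors, the same search can be written in one line with bs4's select (equivalent to the find_all call above):
mylinks = [a['data-image'] for a in soup.select('a.sh-mo__image')]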

Parse href attribute value from element with Beautifulsoup and Mechanize

Can anyone help me traverse an html tree with beautiful soup?
I'm trying to parse through HTML output and, after gathering each value, insert it into a table named Tld with Python/Django.
<div class="rc" data-hveid="53">
<h3 class="r">
Billing: Portal Home
</h3>
And I only want to parse the value of the href attribute of <a>, so only this part:
https://billing.anapp.com/
of:
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
I currently have:
for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3', attrs={'class': 'r'})
The problem is that find_all above doesn't make it far enough down to reach the <a> element.
Any help is much appreciated.
Thank you.
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
Billing: Portal Home
</h3>
"""
bs = BeautifulSoup(html)
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])
prints:
https://billing.anapp.com/
h3.r a is a CSS selector.
You can use CSS selectors (I prefer them), XPath, or find on the elements. The selector h3.r a looks for all h3 elements with class r and gets the a elements inside them. It could be a more complicated example, like #an_id table tr.the_tr_class td.the_td_class: that finds the element with the given id, then the td elements with the given class that belong to tr elements with the given class, inside a table of course.
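To make that concrete, here is a minimal self-contained sketch of that longer selector; the id, classes, and cell text are made up for the example:
from bs4 import BeautifulSoup
html = '''
<div id="an_id">
<table>
<tr class="the_tr_class"><td class="the_td_class">cell one</td><td>skipped</td></tr>
</table>
</div>
'''
bs = BeautifulSoup(html, 'lxml')
for td in bs.select("#an_id table tr.the_tr_class td.the_td_class"):
    print(td.text)  # prints: cell one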
The following will also give you the same result. find_all returns a list of bs4.element.Tag; find_all also has a recursive parameter. I'm not sure you can do it in one line, but I personally prefer CSS selectors because they are easy and clean.
for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
I think it's worth mentioning what happens when there are similarly named classes that contain spaces.
Taking the piece of code that @Foo Bar User provided and changing it a little:
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r s">
Billing: Portal Home
</h3>
<h3 class='r s sth s'>
Don't grab this
</h3>
"""
bs = BeautifulSoup(html)
when we try to get just the link whose class equals 'r s' by CSS selectors:
elms = bs.select("h3.r.s a")
for i in elms:
    print(i.attrs["href"])
it prints
https://billing.anapp.com/
https://link_you_dont_want.com/
however using
for elm in bs.find_all('h3', attrs={'class': 'r s'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
gives the desired result
https://billing.anapp.com/
That's just something I've encountered in my own work. If there is a way to overcome this using CSS selectors, please let me know!
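One workaround I know of is a CSS attribute selector, which matches the class attribute as one literal string rather than as a set of classes (note it is order- and whitespace-sensitive, so class="s r" would not match):
elms = bs.select('h3[class="r s"] a')
for i in elms:
    print(i.attrs["href"])
which prints only
https://billing.anapp.com/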

Parsing out data using BeautifulSoup in Python

I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape.
<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="...">Dacheng Lin</a>,
<a href="...">Ronald A. Remillard</a>,
<a href="...">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="...">A.G. Kosovichev</a>
</div>
<!--There are many other div tags with this structure-->
</body>
</html>
My point of confusion is that when I do soup.find, it finds the first occurrence of the div tag that I'm searching for. After that, I search for all 'a' link tags. At this stage, how do I extract the author names from each of the link tags and print them out? Is there a way to do it using BeautifulSoup, or do I need to use regex? And how do I continue iterating over every other div tag and extract the author names?
import re
import urllib2, sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
try:
    authordiv = soup.find('div', attrs={'class': 'list-authors'})
    links = tds.findAll('a')
    for link in links:
        print ''.join(link[0].contents)
    # Iterate through entire page and print authors
except IOError:
    print 'IO error'
Just use findAll for the divs, like you do for the links:
for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
Since link is already taken from an iterable, you don't need to subindex it -- you can just do link.contents[0].
print link.contents[0] with your new example, with two separate <div class="list-authors">, yields:
Dacheng Lin
Ronald A. Remillard
Jeroen Homan
A.G. Kosovichev
So I'm not sure I understand the comment about searching other divs. If they are different classes, you will either need to do a separate soup.find and soup.findAll, or just modify your first soup.find.
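Putting both fixes together, a minimal corrected version of the original loop (BeautifulSoup 3 / Python 2, as in the question) might look like:
for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
    for link in authordiv.findAll('a'):
        print link.contents[0]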

How do I fix wrongly nested / unclosed HTML tags?

I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.
For example, something like
<p>
<ul>
<li>Foo
becomes
<p>
<ul>
<li>Foo</li>
</ul>
</p>
Any help would be appreciated :)
using BeautifulSoup:
from BeautifulSoup import BeautifulSoup
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html)
print soup.prettify()
gets you
<p>
<ul>
<li>
Foo
</li>
</ul>
</p>
As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.
using Tidy:
import tidy
html = "<p><ul><li>Foo"
print tidy.parseString(html, show_body_only=True)
gets you
<ul>
<li>Foo</li>
</ul>
Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing
print tidy.parseString(html, show_body_only=True, drop_empty_paras=False)
comes out as
<p></p>
<ul>
<li>Foo</li>
</ul>
Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.
Finally, Tidy can also do indenting:
print tidy.parseString(html, show_body_only=True, indent=True)
becomes
<ul>
<li>Foo
</li>
</ul>
All of these have their ups and downs, but hopefully one of them is close enough.
Run it through Tidy or one of its ported libraries.
Try to code it by hand and you will want to gouge your eyes out.
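For example, with the pytidylib binding (assuming pytidylib and the underlying HTML Tidy library are installed), a minimal sketch:
from tidylib import tidy_document
html = "<p><ul><li>Foo"
fixed, errors = tidy_document(html, options={'show-body-only': 1})
print(fixed)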
Use html5lib, it works great!
Like this:
soup = BeautifulSoup(data, 'html5lib')
Just now, I got some HTML that lxml and pyquery didn't work well on; it seems there are some errors in the HTML.
Since Tidy is not easy to install on Windows, I chose BeautifulSoup.
But I found that:
from BeautifulSoup import BeautifulSoup
import lxml.html
soup = BeautifulSoup(page)
h = lxml.html.fromstring(soup.prettify())
acts the same as h = lxml.html.fromstring(page).
What really solved my problem was soup = BeautifulSoup(page, 'html5lib').
You should install html5lib first; then you can use it as a parser in BeautifulSoup.
The html5lib parser seems to work much better than the others.
Hope this can help someone.
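For instance, feeding the unclosed snippet from the question through bs4 with html5lib (assuming both beautifulsoup4 and html5lib are installed) closes the tags the way a browser would:
from bs4 import BeautifulSoup
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())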
I tried the method below but it failed on Python 3:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(page, 'html5lib')
I tried the following instead and it worked:
import bs4
soup = bs4.BeautifulSoup(html, 'html5lib')
f_html = soup.prettify()
print(f'Formatted html::: {f_html}')
