BeautifulSoup locating iframe and its attribute - python

I need to get the src of an iframe with Beautiful Soup:
<div class="divclass">
<div id="simpleid">
<iframe width="300" height="300" src="http://google.com">
I could use selenium with code:
iframe1 = driver.find_element_by_class_name("divclass")
iframe = iframe1.find_element_by_tag_name("iframe").get_attribute("src")
but selenium is too slow for this task.
I've been looking for a solution here on Stack Overflow and tried several snippets, but I always get either a 403 error when using urllib (changing the user agent does not help) or "None".

Use soup.find_all('tag you want to search')
>>> from bs4 import BeautifulSoup
>>> html = '''
... <div class="divclass">
... <div id="simpleid">
... <iframe width="300" height="300" src="http://google.com">
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find_all('iframe')
[<iframe height="300" src="http://google.com" width="300">
</iframe>]
>>> soup.find_all('iframe')[0]['src']
u'http://google.com'
>>>
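If the page holds more than one iframe, the lookup can be scoped to the enclosing div, mirroring the Selenium code in the question. A minimal sketch, assuming the markup from the question with its tags closed:

```python
from bs4 import BeautifulSoup

# Markup from the question, with the tags closed for clarity.
html = """
<div class="divclass">
  <div id="simpleid">
    <iframe width="300" height="300" src="http://google.com"></iframe>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Restrict the search to iframes inside <div class="divclass">.
iframe = soup.select_one("div.divclass iframe")

# .get() returns None instead of raising KeyError when the attribute is missing.
src = iframe.get("src") if iframe is not None else None
print(src)
```

Checking for None first avoids a TypeError when the iframe is absent from the downloaded HTML.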

Very good question.
Looking at the site you're trying to get that iframe from, you have to take the contents of the tag inside that div and base64-decode them; then you should be done.
Seeing how you do things, don't stop! You're going to be a great programmer.
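The answer above does not say which tag holds the encoded data, so the following is only a sketch of the general idea. It assumes, purely for illustration, that the iframe markup is base64-encoded inside a <script> tag within the div; the page is built by hand here and is not the real site:

```python
import base64

from bs4 import BeautifulSoup

# Hypothetical page: the iframe markup is stored base64-encoded in a <script> tag.
iframe_html = '<iframe width="300" height="300" src="http://google.com"></iframe>'
encoded = base64.b64encode(iframe_html.encode("utf-8")).decode("ascii")
page = f'<div class="divclass"><div id="simpleid"><script>{encoded}</script></div></div>'

soup = BeautifulSoup(page, "html.parser")

# Step 1: read the raw contents of the (assumed) tag inside the div.
payload = soup.select_one("div.divclass script").string

# Step 2: base64-decode the contents back into HTML.
decoded = base64.b64decode(payload).decode("utf-8")

# Step 3: parse the decoded HTML and read the iframe's src.
src = BeautifulSoup(decoded, "html.parser").find("iframe")["src"]
print(src)
```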

Related

Navigating through html with BeautifulSoup from a specific point

I'm using the following piece of code to find an attribute in a piece of HTML code:
results = soup.findAll("svg", {"data-icon" : "times"})
This works, and it returns me a list with the tag and attributes. However, I would also like to move from that part of the HTML code, to the sibling (if that's the right term) below it, and retrieve the contents of that paragraph. See the example below.
<div class="382"><svg aria-hidden="true" data-icon="times".......</svg></div>
<div class="405"><p>Example</p></div>
I can't seem to figure out how to do this properly. Searching for the div class names does not work, because the class name is randomised.
You can use a CSS selector with the adjacent sibling combinator +:
from bs4 import BeautifulSoup
html_doc = """
<div class="382"><svg aria-hidden="true" data-icon="times"> ... </svg></div>
<div class="405"><p>Example</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.select_one('div:has(svg[data-icon="times"]) + div')
print(div.text)
Prints:
Example
Or without CSS selector:
div = soup.find("svg", attrs={"data-icon": "times"}).find_next("div")
print(div.text)
Prints:
Example
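Note that find_next() walks forward in document order, so it would also match a div that follows the svg inside the same wrapper. When only the following sibling of the wrapper is wanted, one option is to step up to the parent div and call find_next_sibling(). A sketch on the same html_doc:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="382"><svg aria-hidden="true" data-icon="times"> ... </svg></div>
<div class="405"><p>Example</p></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

svg = soup.find("svg", attrs={"data-icon": "times"})

# Go up to the <div> that wraps the <svg>, then across to the next <div> sibling.
sibling = svg.find_parent("div").find_next_sibling("div")
text = sibling.get_text(strip=True)
print(text)
```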

Can't get src from iframe with beautifulSoup python

I'm trying to extract a video from a web page with BeautifulSoup in Python, but I ran into a problem.
When I go to the web page and inspect the HTML elements, I see this tag:
<iframe id="iframe-embed2" src="https://player.voxzer.org/view/1167612b04f6855ecc4bb5e0" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true" width="100%" height="auto" frameborder="0"></iframe>
When I copy the src and open it, it shows me the video.
But when I use BeautifulSoup to find the iframe in the web page, I get src as an empty string.
import requests
from bs4 import BeautifulSoup
site = requests.get("the url ...")
soup = BeautifulSoup(site.text, "html.parser")
print(soup.find_all("iframe"))
>>> [<iframe allowfullscreen="true" frameborder="0" height="auto" id="iframe-embed2" mozallowfullscreen="true" scrolling="no" src="" webkitallowfullscreen="true" width="100%"></iframe>]
What is the problem here?
this question doesn't have any working solutions
Parse iframe with blank src using bs4
What is the problem here?
I looked at site.text and found that https://player.voxzer.org/view/1167612b04f6855ecc4bb5e0 appears in this line:
mainvideos.push('https://player.voxzer.org/view/1167612b04f6855ecc4bb5e0')
Since .push is a JavaScript method, the src of this iframe is apparently set by JavaScript code, so you will need a way to execute the site's JavaScript (for example, Selenium).
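If running a browser is too heavy, and assuming the URL always appears literally inside a mainvideos.push('...') call as observed above, a regular expression over the raw page text may be enough. This is only a sketch: page_text stands in for site.text, and everything in it except the push line is made up for illustration.

```python
import re

# Stand-in for site.text; only the mainvideos.push(...) line comes from the real page.
page_text = """
<script>
var mainvideos = [];
mainvideos.push('https://player.voxzer.org/view/1167612b04f6855ecc4bb5e0')
</script>
<iframe id="iframe-embed2" src=""></iframe>
"""

# Capture whatever sits between the quotes of each mainvideos.push('...') call.
pattern = re.compile(r"""mainvideos\.push\(\s*['"]([^'"]+)['"]\s*\)""")
video_urls = pattern.findall(page_text)
print(video_urls)
```

This breaks as soon as the site changes how it builds the URL, so executing the JavaScript remains the more general approach.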

Isolate SRC attribute from soup return in python

I am using Python3 with BeautifulSoup to get a certain div from a webpage. My end goal is to get the img src's url from within this div so I can pass it to pytesseract to get the text off the image.
The img doesn't have any classes or unique identifiers so I am not sure how to use BeautifulSoup to get just this image every time. There are several other images and their order changes from day to day. So instead, I just got the entire div that surrounds the image. The div information doesn't change and is unique, so my code looks like this:
weather_today = soup.find("div", {"id": "weather_today_content"})
thus my script currently returns the following:
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
Now I just need to figure out how to pull just the src into a string so I can then pass it to pytesseract to download and use ocr to pull further information.
I am unfamiliar with regex but have been told this is the best method. Any assistance would be greatly appreciated. Thank you.
Find the 'img' element, in the 'div' element you found, then read the attribute 'src' from it.
from bs4 import BeautifulSoup
html ="""
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
weather_today = soup.find("div", {"id": "weather_today_content"})
print(weather_today.find('img')['src'])
Outputs:
/database/img/weather_today.jpg?ver=2018-08-01
You can use the CSS selector support built into BeautifulSoup (the select() and select_one() methods):
data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])
Prints:
/database/img/weather_today.jpg?ver=2018-08-01
The selector div#weather_today_content img means: select the <div> with id=weather_today_content, and within this <div> select an <img>.
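The src here is a root-relative path, so it has to be resolved against the page URL before the image can be downloaded. A sketch using urllib.parse.urljoin from the standard library; page_url is a placeholder, not the real site:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Placeholder for the URL the page was fetched from (an assumption).
page_url = "https://example.com/weather/today.html"

data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""

soup = BeautifulSoup(data, "html.parser")
src = soup.select_one("div#weather_today_content img")["src"]

# Resolve the root-relative path against the page URL.
image_url = urljoin(page_url, src)
print(image_url)
```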

Python how to parsing HTML with BS4

<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
(In case it isn't clear, the two divs below the top div are its children.)
I'm currently trying to parse for 8.66 and I have made many attempts to parse for it using lxml and beautifulsoup. I tried running a loop to search for that value but it seems like nothing works!
If you can help please do I am absolutely lost on how to do this. Thank you in advance!!
You can specify the class value:
from bs4 import BeautifulSoup as soup
d = """
<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
"""
s = soup(d, 'html.parser')
print(s.find('div', {'class':'that'}).text)
Output:
8.66
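If the class names turn out to be unreliable, the value can also be located through its label. A sketch, assuming the label text is exactly K/D and the value sits in the next sibling div:

```python
from bs4 import BeautifulSoup

d = """
<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
</div>
"""

soup = BeautifulSoup(d, "html.parser")

# Find the <div> whose only text is "K/D", then take the <div> that follows it.
label = soup.find("div", string="K/D")
value = float(label.find_next_sibling("div").get_text(strip=True))
print(value)
```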

Find specific link w/ beautifulsoup

Hi I cannot figure out how to find links which begin with certain text for the life of me.
findall('a') works fine, but it's way too much. I just want to make a list of all links that begin with
http://www.nhl.com/ice/boxscore.htm?id=
Can anyone help me?
Thank you very much
First set up a test document and open up the parser with BeautifulSoup:
>>> from BeautifulSoup import BeautifulSoup
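>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'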
>>> soup = BeautifulSoup(doc)
>>> print soup.prettify()
<html>
<body>
<div>
<a href="something">
yep
</a>
</div>
<div>
<a href="http://www.nhl.com/ice/boxscore.htm?id=3">
somelink
</a>
</div>
<a href="http://www.nhl.com/ice/boxscore.htm?id=7">
another
</a>
</body>
</html>
Next, we can search for all <a> tags with an href attribute starting with http://www.nhl.com/ice/boxscore.htm?id=. You can use a regular expression for it:
>>> import re
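>>> soup.findAll('a', href=re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id='))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]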
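You might not need BeautifulSoup, since your search is so specific. Stop the match at the closing quote of the attribute, otherwise a greedy .+ swallows the rest of the line:
>>> import re
>>> links = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"\s<>]+', str(doc))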
You can find all links and then filter that list to keep only the ones you need. Filtering afterwards is still fast, since it is a single pass over the list.
listOfAllLinks = soup.findAll('a')
listOfLinksINeed = []
prefix = 'http://www.nhl.com/ice/boxscore.htm?id='
for link in listOfAllLinks:
    # Test the href attribute; "in link" would search the tag's children instead.
    if link.get('href', '').startswith(prefix):
        listOfLinksINeed.append(link['href'])
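With the current bs4 package on Python 3, the same prefix match can be written as a CSS attribute selector. A sketch on the test document shown in the prettify() output above:

```python
from bs4 import BeautifulSoup

doc = (
    '<html><body>'
    '<div><a href="something">yep</a></div>'
    '<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>'
    '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>'
    '</body></html>'
)

soup = BeautifulSoup(doc, "html.parser")

prefix = "http://www.nhl.com/ice/boxscore.htm?id="

# a[href^="..."] selects <a> tags whose href starts with the given prefix.
links = [a["href"] for a in soup.select(f'a[href^="{prefix}"]')]
print(links)
```

Anchors without an href attribute are skipped by the selector, so no extra check is needed before reading a["href"].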
