I'm trying to extract a video from a web page with BeautifulSoup in Python, but I ran into a problem.
When I open the page and inspect the HTML elements, I see this tag:
<iframe id="iframe-embed2" src="https://player.voxzer.org/view/1167612b04f6855ecc4bb5e0" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true" width="100%" height="auto" frameborder="0"></iframe>
When I copy the src and open it, it shows me the video.
But when I use BeautifulSoup to find the iframe on the page, I get src as an empty string.
import requests
from bs4 import BeautifulSoup
site = requests.get("the url ...")
soup = BeautifulSoup(site.text, "html.parser")
print(soup.find_all("iframe"))
>>> [<iframe allowfullscreen="true" frameborder="0" height="auto" id="iframe-embed2" mozallowfullscreen="true" scrolling="no" src="" webkitallowfullscreen="true" width="100%"></iframe>]
What is the problem here?
This related question doesn't have any working solutions:
Parse iframe with blank src using bs4
What is the problem here?
I looked at site.text and found that https://player.voxzer.org/view/1167612b04f6855ecc4bb5e0 appears in this line:
mainvideos.push('https://player.voxzer.org/view/1167612b04f6855ecc4bb5e0')
Since .push is a JavaScript array method, the src of this iframe is apparently set by JavaScript after the page loads. requests does not execute scripts, so you will need a way to run the site's JavaScript (for example, Selenium).
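If launching a browser is too heavy, one alternative is to read the URL straight out of the inline script. This is a minimal sketch, assuming the page keeps passing the player URL to mainvideos.push(...) as seen in site.text; the extract_player_urls helper and the sample markup around the push call are illustrative, not taken from the real page.

```python
import re

# Matches mainvideos.push('...') or mainvideos.push("...") and captures the URL.
PUSH_PATTERN = re.compile(r"""mainvideos\.push\(\s*(['"])(.*?)\1\s*\)""")


def extract_player_urls(html):
    """Return every string passed to mainvideos.push(...) in the raw page source."""
    return [match.group(2) for match in PUSH_PATTERN.finditer(html)]


# Hypothetical stand-in for site.text; only the push line comes from the question.
sample_html = """
<iframe id="iframe-embed2" src=""></iframe>
<script>
var mainvideos = [];
mainvideos.push('https://player.voxzer.org/view/1167612b04f6855ecc4bb5e0')
</script>
"""

print(extract_player_urls(sample_html))
```

This only works while the site keeps that exact script pattern, so it is more fragile than rendering the page in a browser.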
I am using BeautifulSoup, and the findAll method is missing <p> tags. When I run the code it returns an empty list, but if I inspect the page I can clearly see them, as shown in the picture below.
I chose some random site.
import requests
from bs4 import BeautifulSoup
#An example web site
url = 'https://www.kite.com/python/answers/how-to-extract-text-from-an-html-file-in-python'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.findAll("p"))
The output:
(env) pinux#main:~/dev$ python trial.py
[]
I inspect the page using the browser:
The text is clearly there. Why doesn't BeautifulSoup catch it? Can someone shed some light on what is going on?
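It appears that parts of this webpage are rendered by JavaScript, which requests does not execute. You can try Selenium, which drives a real browser. Note that by default browser.get() returns once the page has finished loading, so content that scripts insert later may still need an explicit wait.
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
browser = webdriver.Firefox()
browser.get("https://url-to-webpage.com")
# Wait up to 10 seconds for at least one <p> to be present in the DOM.
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.TAG_NAME, "p")))
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")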
I am learning BeautifulSoup and tried to extract all the "a" tags from a website. I get a lot of "a" tags, but a few of them are ignored, and I am confused about why that happens. Any help will be highly appreciated.
The link I used is: https://www.w3schools.com/python/
Image: https://ibb.co/mmEKTK
The red box in the image is a section that bs4 ignores entirely, even though it contains "a" tags.
Code:
import requests
import bs4
import re
import html5lib
res = requests.get('https://www.w3schools.com/python/')
soup = bs4.BeautifulSoup(res.text,'html5lib')
try:
    links_with_text = []
    for a in soup.find_all('a', href=True):
        print(a['href'])
except:
    print ('none')
sorry for the code indentation i am new here.
The links that bs4 ignores are rendered dynamically: advertisements and similar blocks are not present in the HTML source but are inserted by scripts after the page loads, possibly based on your browsing habits. The requests package only fetches the static HTML, so you need to simulate a browser to get the dynamic content.
Selenium can be used with browsers such as Chrome and Firefox. If you want the same results on a server (without a UI), use a headless browser. PhantomJS used to be a common choice, but it is no longer maintained; Chrome and Firefox both offer headless modes.
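Before reaching for a browser, it can help to confirm that the missing elements really are absent from the static HTML. The sketch below uses only the standard library; the count_tags helper and the sample markup are hypothetical and stand in for res.text. Markup that exists only inside a script string is not counted, because a parser sees it as script text and not as elements.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags with a given name in an HTML document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return how many <tag> elements appear in the static markup."""
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical static source: one real link, one link a script would add later.
static_html = (
    '<div id="nav"><a href="/python/">Python</a></div>'
    '<script>el.innerHTML = "<a href=\'/ad\'>Ad</a>";</script>'
)

print(count_tags(static_html, "a"))
```

If the count from the raw response is lower than what the browser's inspector shows, the difference was added by scripts.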
I am trying to download a few eBooks that I purchased through Humble Bundle. I am using BeautifulSoup and requests to parse the HTML and get the href links for the PDFs.
Python
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.humblebundle.com/downloads?key=fkuzzq6R8MA8ydEw")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("div", {"class": "js-all-downloads-holder"})
print(links)
I am going to include an Imgur link showing the site and its HTML layout, because I don't believe you can access the page without being prompted to log in (which might be one of the reasons I am having this issue to start with): https://imgur.com/24x2X0m
HTML
<div class="flexbtn active noicon js-start-download">
<div class="right"></div>
<span class="label">PDF</span>
<a class="a" download="" href="https://dl.humble.com/makea2drpginaweekend.pdf?gamekey=fkuzzq6R8MA8ydEw&ttl=1521117317&t=b714bb732413a1f0532ec6aa72b282f9">
PDF
</a>
</div>
So the print statement should output the contents of the div, but that is not the case.
Output
python3 pdf_downloader.py
[]
Sorry for the long post, I have just been up all night working on this and at this point it would have just been easier to hit the download button 20+ times but that is not how you learn.
I have to get iframe src with beautiful soup
<div class="divclass">
<div id="simpleid">
<iframe width="300" height="300" src="http://google.com">
I could use selenium with code:
iframe1 = driver.find_element_by_class_name("divclass")
iframe = iframe1.find_element_by_tag_name("iframe").get_attribute("src")
but selenium is too slow for this task.
I've been looking for a solution here on Stack Overflow and tried several snippets, but I always get a 403 error when using urllib (changing the user agent does not help, it is still 403) or I get "None".
Use soup.find_all('tag you want to search')
>>> from bs4 import BeautifulSoup
>>> html = '''
... <div class="divclass">
... <div id="simpleid">
... <iframe width="300" height="300" src="http://google.com">
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find_all('iframe')
[<iframe height="300" src="http://google.com" width="300">
</iframe>]
>>> soup.find_all('iframe')[0]['src']
u'http://google.com'
>>>
Very good question.
Looking at the site you're trying to get that iframe from using that lib, you have to get the contents of tag in that div, and then base64 decode it and you should be done.
Seeing how you do things, don't stop! You're going to be a great programmer.
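The answer above does not say which tag holds the encoded data, so the following is only a sketch of the decoding step under stated assumptions: the decode_embedded_html helper is hypothetical, and the payload is built here from the question's own iframe markup instead of being taken from the real site.

```python
import base64


def decode_embedded_html(encoded):
    """Decode a base64 string into HTML text, tolerating missing '=' padding."""
    padded = encoded + "=" * (-len(encoded) % 4)
    return base64.b64decode(padded).decode("utf-8")


# Stand-in payload: the iframe from the question, encoded for illustration only.
original = '<iframe width="300" height="300" src="http://google.com"></iframe>'
payload = base64.b64encode(original.encode("utf-8")).decode("ascii")

# Strip the padding to mimic payloads that are embedded without it.
decoded = decode_embedded_html(payload.rstrip("="))
print(decoded)
```

The decoded string can then be passed to BeautifulSoup and searched with find_all('iframe'), as in the other answer.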
I tried to obtain the text inside a <div> tag but was not able to. I am trying to obtain this text:
MsiExec.exe /X{42435041-332D-5350-00A7-A758B70C0F00}
The text is not inside any div that has a class:
<div style="margin-top: 10px;"><span class="colorlt">Uninstaller:</span> MsiExec.exe /X{42435041-332D-5350-00A7-A758B70C0F00}</div>
Can somebody please tell me how this can be done using Python?
I am using BeautifulSoup to scrape the page.
Is this the entire contents of the scraped page? If so, try this:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div style="margin-top: 10px;"><span class="colorlt">Uninstaller:</span> MsiExec.exe /X{42435041-332D-5350-00A7-A758B70C0F00}</div>', 'html.parser')
print(soup.div.text)
Uninstaller: MsiExec.exe /X{42435041-332D-5350-00A7-A758B70C0F00}
If your scraped page contains other divs, this may not work.
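When the page does contain other divs, one option is to locate the label span first and read the text node that follows it. This is a minimal sketch, assuming the label always sits in a span with class "colorlt" directly before the value; the value_after_label helper and the extra "Publisher" and "other" divs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Only the Uninstaller div comes from the question; the others are hypothetical.
html = '''
<div class="other">Unrelated text</div>
<div style="margin-top: 10px;"><span class="colorlt">Uninstaller:</span> MsiExec.exe /X{42435041-332D-5350-00A7-A758B70C0F00}</div>
<div style="margin-top: 10px;"><span class="colorlt">Publisher:</span> Example Corp</div>
'''


def value_after_label(soup, label):
    """Return the text that directly follows the span whose text equals label."""
    for span in soup.find_all("span", class_="colorlt"):
        if span.get_text(strip=True) == label:
            sibling = span.next_sibling
            # The value is a bare text node; anything else means no match.
            return sibling.strip() if isinstance(sibling, str) else None
    return None


soup = BeautifulSoup(html, "html.parser")
print(value_after_label(soup, "Uninstaller:"))
```

Matching on the label text avoids depending on the div's position in the page, which soup.div does.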