Mimicking HTML5 Video support on PhantomJS used through Selenium in Python - python

I am trying to extract the source link of an HTML5 video found in the video tag. Using the Firefox webdriver, I am able to get the desired result, i.e.:
[<video class="video-stream html5-main-video" src="myvideoURL.."></video>]
but if I use PhantomJS -
<video class="video-stream html5-main-video" style="width: 854px; height: 480px; left: 0px; top: 0px; -webkit-transform: none;" tabindex="-1"></video>
I suspect this is because of PhantomJS' lack of HTML5 video support. Is there any way I can trick the webpage into thinking that HTML5 video is supported so that it generates the URL? Or can I do something else?
I tried this:
try:
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, "//video")))
finally:
    k = browser.page_source
    browser.quit()
soup = BeautifulSoup(k, 'html.parser')
print(soup.find_all('video'))

The way the Firefox and PhantomJS webdrivers communicate with Selenium is quite different.
When using Firefox, it signals back that the page has finished loading only after it has run some of the JavaScript.
PhantomJS, by contrast, signals Selenium that the page has finished loading as soon as it is able to get the page source, meaning it may not have run any JavaScript yet.
What you need to do is wait for the element to be present before extracting it; in this case that would be:
video = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//video")))
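Putting the wait together with extracting the source, a minimal sketch (the URL variable, timeout, and attribute handling are illustrative; src will only be populated if the browser actually negotiates a video source):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get(page_url)  # page_url: the page containing the <video> element
video = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//video")))
print(video.get_attribute("src"))  # empty if no source was ever attached
driver.quit()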
EDIT:
YouTube first checks whether the browser supports the video content before deciding whether to provide the source; there's a workaround, though, described here.

Related

Python element click intercepted error, and advice on how to click a "View more" button

I have been writing a simple script for 5 days now to improve my knowledge of web scraping using different packages. I have already written one that downloads all the URLs and gives the choice of downloading all images or videos, but when I visited 'https://www.pixwox.com/' it had a different HTML design where the URLs are hidden, so I googled and started using Selenium. It was all going well until I hit a wall and the limits of my Python knowledge.
The error below is what I have been getting for about 4 of those days; sometimes the code works fine, and other times it shows the error below:
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <a class="downbtn" href="https://scontent-frt3-2.cdninstagram.com/v/t51.2885-15/313202955_1363105014229232_3490221399588911023_n.jpg?stp=dst-jpg_e35_p828x828&_nc_ht=scontent-frt3-2.cdninstagram.com&_nc_cat=108&_nc_ohc=cB6igIdJd0UAX-sHSgL&edm=ACWDqb8BAAAA&ccb=7-5&ig_cache_key=Mjk2MDU4OTQ5NjY4NTI4NzQwMQ%3D%3D.2-ccb7-5&oh=00_AfAL1tPs2in8qcStQLZMdlDGZxdNRp5H5nnV4LpHWR07gg&oe=6363A260&_nc_sid=1527a3&dl=1">...</a> is not clickable at point (156, 814). Other element would receive the click: <iframe id="h12_frm_bl9ib7ijd9k" scrolling="no" frameborder="0" width="100%" height="auto" marginheight="0" marginwidth="0" style="margin: 0px; padding: 0px; width: 100%; height: 84.6186px; overflow: hidden;"></iframe>
(Session info: chrome=106.0.5249.119)
I know this means that the element I'm trying to click on has another element over the top of it, but even when stepping through the code I have not been able to see what is covering the element.
My code is below. Please excuse it for being messy; I am still learning.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
#option.add_argument("start-maximized")
#option.add_argument("--window-size=1400,600")

user = input("User: ")
full_url = 'https://www.pixwox.com/profile/' + user + '/'
driver = webdriver.Chrome()
driver.get(full_url)
print(driver.title)

# scroll down 12 times so more posts load
i = 0
while i < 12:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    i += 1
    time.sleep(2)

download_button = "downbtn"
WebDriverWait(driver, 10).until(ec.presence_of_element_located((By.CLASS_NAME, download_button)))  # .click()
elements = driver.find_elements(By.CLASS_NAME, download_button)
time.sleep(5)

for element in elements:
    if element.text == 'Download':
        element.click()  # also tried adding time.sleep(2) right after this click
        print(f"Downloading: {element}")
        time.sleep(5)
    else:
        pass
Also, I know that compared to other people's code this is quite basic, but I am still learning, so any help would be great to further my learning. Please excuse the while loop; I didn't know how else to keep scrolling down without using it.
Fixes tried:
Tried extending the sleep time to see whether it was a loading issue.
Tried scrolling down to see if that was causing the issue, but that raises another issue, i.e. the "View more" button, which is another obstacle.
Stepped through the code using Thonny but was not able to find what is intercepting the element.
Tried https://stackoverflow.com/questions/44724185/element-myelement-is-not-clickable-at-point-x-y-other-element-would-receiv
Tried https://stackoverflow.com/questions/62260511/org-openqa-selenium-elementclickinterceptedexception-element-click-intercepted (both point toward scrolling the element into view or clicking it via JavaScript; a sketch of that is below).
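A minimal sketch of that workaround, reusing the downbtn class from the code above (untested against pixwox, so treat it as an assumption rather than a verified fix):
from selenium.webdriver.common.by import By

for element in driver.find_elements(By.CLASS_NAME, "downbtn"):
    # scroll the element out from under the overlapping iframe, then click it via JavaScript
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    driver.execute_script("arguments[0].click();", element)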
I was expecting my code to be able to take a user input and download all images and videos on the whole page. Currently it is very intermittent: 99% of the time it downloads one image and then raises an ElementClickInterceptedException, and the other 1% of the time it downloads roughly 50 images and videos before raising the same error. I was also expecting it to scroll to the very bottom of the page so all images/videos load, but the "View more" button stops the code from continuing.
Any help would be greatly appreciated.
Thank you

How to get the download link of an HTML a tag which has no explicit link?

I encountered a web page that has many download signs (small icons).
If I click on each of these download signs, the browser will start downloading a zip file.
However, it seems that these download signs are just images with no explicit download links that can be copied.
I looked into the HTML source and figured out that each download sign belongs to a tr tag block as below.
<tr>
<td title="aaa">
<span class="centerFile">
<img src="/images/downloadCenter/pic.png" />
</span>aaa
</td>
<td>2021-09-10 13:42</td>
<td>bbb</td>
<td><a id="4099823" data='{"clazzId":37675171,"libraryId":"98689831","relationId":1280730}' recordid="4099823" target="_blank" class="download_ic checkSafe" style="line-height:54px;"><img src="/images/down.png" /></a></td>
</tr>
Clicking this link will download a zip file.
So my problem is how to get the download links of these download signs without actually clicking them in the browser. In particular, I want to know how to do this using Python by analyzing the source HTML, so I can do batch downloading.
If you want to batch-download those files and are not able to work out the links by analysing the HTML and JavaScript (it's probably a JavaScript function that creates the link, or a JavaScript call to the backend), then you can use Selenium to simulate a user clicking them.
You will need to do something like the code below, where I'm using the class names from the HTML you posted, which I think is where the JavaScript download function is called:
from selenium import webdriver
driver = webdriver.Chrome()
# URL of website
url = "https://www.yourwebsitelinkhere.com/"
driver.get(url)
# use class name to find anchor link
download_links = driver.find_elements_by_css_selector(".download_ic.checkSafe")
for link in download_links:
    link.click()
Example of how it works for Stack Overflow (on the day of writing this answer):
driver = webdriver.Chrome()
driver.get("https://stackoverflow.com")
elements = driver.find_elements_by_css_selector('.-marketing-link.js-gps-track')
elements[0].click()
And this should lead you to the Stack Overflow "About" page.
[EDIT] Answer edited, as it seems compound classes are not supported by Selenium; the Stack Overflow example was added.
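A short illustration of the compound-class limitation mentioned in the edit (a sketch only, reusing the class names from the question):
# find_elements_by_class_name accepts a single class name, so the two-class value is rejected:
# driver.find_elements_by_class_name("download_ic checkSafe")   # compound class names are not permitted
# combining both classes in a CSS selector works instead:
download_links = driver.find_elements_by_css_selector(".download_ic.checkSafe")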

How to make click using Selenium?

I got stuck with extracting href="/ttt/play" from the following HTML code.
<div class="collection-list-wrapper-2 w-dyn-list">
<div class="w-dyn-items">
<div typeof="ListItem" class="collection-item-2 w-clearfix w-dyn-item">
<div class="div-block-38 w-hidden-medium w-hidden-small w-hidden-tiny"><img src="https://global-uploads.webflow.com/59cf_home.svg" width="16" height="16" alt="Official Link" class="image-28">
<a property="url" href="/ttt/play" class="link-block-4 w-inline-block">
<div class="row-7 w-row"><div class="column-10 w-col w-col-2"><img height="25" property="image" src="https://global-fb0edc0001b4b11d/5a77ba9773fd490001ddaaaa_play.png" alt="Play" class="image-23"><h2 property="name" class="heading-34">Play</h2><div style="background-color:#d4af37;color:white" class="text-block-28">GOLD LEVEL</div><div class="text-block-30">HOT</div><div style="background-color:#d4af37;color:white" class="text-block-28 w-condition-invisible">SILVER LEVEL</div></div></div></a>
</div>
<div typeof="ListItem" class="collection-item-2 w-clearfix w-dyn-item">
This is my code in Python:
driver = webdriver.PhantomJS()
driver.implicitly_wait(20)
driver.set_window_size(1120, 550)
driver.get(website_url)
tag = driver.find_elements_by_class_name("w-dyn-item")[0]
tag.find_element_by_tag_name("a").click()
url = driver.current_url
print(url)
driver.quit()
When I print url using print(url), I want to see url equal to website_url/ttt/play, but instead I get website_url.
It looks like the click event does not work and the new link is not really opened.
When using .click(), the element must be "visible" (you are using PhantomJS) and not hidden, in a drop-down for example.
Also make sure the page is completely loaded.
As I see it, you have two options:
Either use Selenium to reveal it, and then click.
Use JavaScript to do the actual click.
I strongly suggest clicking with JavaScript; it's much faster and more reliable.
Here is a little wrapper to make things easier:
def execute_script(driver, xpath):
    """ wrapper for selenium driver execute_script
    :param driver: selenium driver
    :param xpath: (str) xpath to the element
    :return: execute_script result
    """
    execute_string = "window.document.evaluate('{}', document, null, 9, null).singleNodeValue.click();".format(xpath)
    return driver.execute_script(execute_string)
The wrapper basically implements this technique to click on elements with JavaScript.
Then, in your Selenium script, use the wrapper like so:
execute_script(driver, element_xpath)
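For the HTML in the question, the call might look something like this (the XPath is only a guess based on the posted markup; double quotes are used inside it so it doesn't clash with the wrapper's single-quoted JavaScript string):
execute_script(driver, '//a[contains(@class, "link-block-4")]')
url = driver.current_url  # should now reflect the /ttt/play page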
You can also make it more general, to not only do clicks but also scrolls and other magic.
P.S. In my example I use an XPath, but you can also use a CSS path; basically, whatever runs in JavaScript.

scraping dynamic updates of temperature sensor data from a website

I wrote the following Python code:
from bs4 import BeautifulSoup
import urllib2
url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(),"html.parser")
freq=soup.find('div', attrs={'id':'frequenz'})
print freq
The result is:
<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>
When I look at this site with a web browser, the web page shows dynamic content, not the string 'tempsensor'. The temperature value is automatically refreshed every second, so something in the web page is replacing the string 'tempsensor' with a numerical value automatically.
My problem is now: How can I get Python to show the updated numerical value? How can I obtain the value of the automatic update to tempsensor in BeautifulSoup?
Sorry, no. This is not possible with BeautifulSoup alone.
The problem is that BS4 is not a complete web browser. It is only an HTML parser. It doesn't parse CSS or JavaScript.
A complete web browser does at least four things:
Connects to web servers, fetches data
Parses HTML content and CSS formatting and presents a web page
Parses Javascript content, runs it.
Provides for user interaction for things like Browser Navigation, HTML Forms and an events API for the Javascript program
Still not sure? Now look at your code. BS4 does not even do the first step, fetching the web page; for that you had to use urllib2.
Dynamic sites usually include JavaScript that runs in the browser and periodically updates the contents. BS4 doesn't provide that, so you won't see those updates, and never will by using only BS4. Why? Because item (3) above, downloading and executing the JavaScript program, is not happening. It would be happening in IE, Firefox, or Chrome, and that's why those browsers show the dynamic content while BS4-only scraping does not.
PhantomJS and CasperJS provide a more mechanized browser that can often run the JavaScript code that dynamic websites rely on. But CasperJS and PhantomJS are programmed in server-side JavaScript, not Python.
Apparently, some people are using the browser built into PyQt4 for these kinds of dynamic screen-scraping tasks, isolating part of the DOM and sending that to BS4 for parsing. That might allow for a Python solution.
In comments, @Cyphase suggests that the exact data you want might be available at a different URL, in which case it might be fetched and parsed with urllib2/BS4. This can be determined by careful examination of the JavaScript running on the site; in particular, look for setTimeout and setInterval, which schedule updates, or for ajax calls, or jQuery's .load function fetching data from the back end. JavaScript that updates dynamic content will usually only fetch data from back-end URLs of the same web site. If the page uses jQuery, $('#frequenz') refers to the div, and by searching for this in the JS you may find the code that updates the div. Without jQuery, the JS update would probably use document.getElementById('frequenz').
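If you do find such a back-end URL, a minimal sketch of fetching it directly with urllib2 (the URL and the shape of the response here are purely hypothetical):
import json
import urllib2

backend_url = 'http://www.example.com/frequenz.json'  # hypothetical endpoint found in the JS
data = json.loads(urllib2.urlopen(backend_url).read())  # assuming the endpoint returns JSON
print data['temperature']  # the field name is an assumption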
You're missing a tiny bit of code:
from bs4 import BeautifulSoup
import urllib2
url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')
freq = soup.find('div', attrs={'id':'frequenz'})
print freq.string # Added .string
This should do it:
freq.text.strip()
As in
>>> html = '<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>'
>>> soup = BeautifulSoup(html)
>>> soup.text.strip()
u'tempsensor'

Clicking on a Javascript Link on Firefox with Selenium

I am trying to get some comments off the car blog Jalopnik. They don't come with the web page initially; instead, the comments get retrieved with some JavaScript, and you only get the featured comments. I need all the comments, so I would click "All" (between "Featured" and "Start a New Discussion") to get them.
To automate this, I tried learning Selenium. I modified their script from PyPI, guessing the code for clicking a link was link.click() and link = browser.find_element_by_xpath(...). It doesn't look like the "All" button (displaying all comments) was pressed.
Ultimately I'd like to download the HTML of that version to parse.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers/") # Load page
time.sleep(0.2)
link = browser.find_element_by_xpath("//a[@class='tc cn_showall']")
link.click()
browser.save_screenshot('screenie.png')
browser.close()
Using Firefox with the Firebug plugin, I browsed to http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers.
I then opened the Firebug console and clicked on ALL; it obligingly showed a single AJAX call to http://jalopnik.com/index.php?op=threadlist&post_id=5912009&mode=all&page=0&repliesmode=hide&nouser=true&selected_thread=null
Opening that url in a new window gets me the comment feed you are seeking.
More generally, if you substitute the appropriate article-ID into that url, you should be able to automate the process without Selenium.
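For instance, a minimal sketch of that Selenium-free approach (assuming the endpoint still responds the same way and returns HTML that BeautifulSoup can parse):
import urllib2
from bs4 import BeautifulSoup

post_id = '5912009'  # substitute the article ID you want
feed_url = ('http://jalopnik.com/index.php?op=threadlist&post_id=' + post_id +
            '&mode=all&page=0&repliesmode=hide&nouser=true&selected_thread=null')
html = urllib2.urlopen(feed_url).read()
soup = BeautifulSoup(html, 'html.parser')
print soup.get_text()[:500]  # inspect the start of the comment feed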
