I am attempting to use Selenium/BeautifulSoup to unit test a web page. I am getting an error though that I haven't been able to Google.
selenium.common.exceptions.WebDriverException: Message: ''
I am using a Portable version of Firefox and a proxy.
import urllib2
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
import time
import sys
def getItemDivs(url):
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/21.0")
profile.set_preference("network.proxy.http", "proxy.example.com")
ffbin = webdriver.firefox.firefox_binary.FirefoxBinary('C:\\FirefoxPortable\\App\\Firefox\\firefox.exe')
# IT FAILS ON THE NEXT LINE
driver=webdriver.Firefox(profile, firefox_binary=ffbin)
driver.implicitly_wait(30)
# THIS LINE CONTAINS A VALID COOKIE, BUT IT HAS BEEN REMOVED FOR THIS QUESTION.
driver.add_cookie(<<mycookie>>)
base_url = url
verificationErrors = []
accept_next_alert = True
driver.get(base_url)
scrap1 = driver.page_source
soup = BeautifulSoup(scrap1)
This question is similar to this one, however, in that question they had a successful first request. I haven't had a success.
What can cause this type of exception but leave the message empty?
The problem was that I didn't set the network.proxy.port. Adding this line solved the problem:
profile.set_preference("network.proxy.port", "80")
Related
I am trying to write a code that scrapes all reviews from a single hotel on tripadvisor. The code runs through all pages except the last one, where it has a problem. It says that the problem is the next.click() in the loop. I am assuming this is because "next" is still present in the element, but just disabled. Anyone know how to fix this problem? I basically want it to not try to click next when it reaches the last page/when it is disabled, but still technically present. Any help would be much appreciated!
#maybe3.1
from argparse import Action
from calendar import month
from distutils.command.clean import clean
from lib2to3.pgen2 import driver
from os import link
import unittest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import ElementNotInteractableException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from dateutil import relativedelta
from selenium.webdriver.common.action_chains import ActionChains
import time
import datetime
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import requests
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Extract the HTML and create a BeautifulSoup object.
url = ('https://www.tripadvisor.com/Hotel_Review-g46833-d256905-Reviews-Knights_Inn_South_Hackensack-South_Hackensack_New_Jersey.html#REVIEWS')
user_agent = ({'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/90.0.4430.212 Safari/537.36',
'Accept-Language': 'en-US, en;q=0.5'})
driver = webdriver.Chrome()
driver.get(url)
# Find and extract the data elements.
wait = WebDriverWait(driver,30)
wait.until(EC.element_to_be_clickable((By.XPATH,'//*[#id="component_15"]/div/div[3]/div[13]/div')))
#explicit wait here
next = driver.find_element(By.XPATH,'.//a[#class="ui_button nav next primary "]')
here = next.is_displayed()
while here == True:
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(2)
Titles = []
for title in soup.findAll('a',{'Qwuub'}):
Titles.append(title.text.strip())
reviews = []
for review in soup.findAll('q',{'class':'QewHA H4 _a'}):
reviews.append(review.text.strip())
next.click()
if here != True:
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(8)
break
# Create the dictionary.
dict = {'Review Title':Titles,'Reviews/Feedback':reviews}
# Create the dataframe.
datafr = pd.DataFrame.from_dict(dict)
datafr.head(10)
# Convert dataframe to CSV file.
datafr.to_csv('hotels1.855.csv', index=False, header=True)
This question might be in the same vein like:
python selenium to check if this text field is disabled or not
You can check if an element is enabled with:
driver.find_element_by_id("id").is_enabled
You can also wrap the code in a try/except block.
page=2
while True:
try:
#your code
driver.find_element(By.XPATH,f"//a[#class='pageNum ' and text()='{page}']").click()
page+=1
time.sleep(1)
except:
break
Should be a simple loop to go through all pages wait till the a tag in question is not valid anymore.
import time
I have written a script in python using selenium to fetch the business summary (which is within p tag) located at the bottom right corner of a webpage under the header Company profile. The webpage is heavily dynamic, so I thought to use a browser simulator. I have created a css selector, which is able to parse the summary if I copy the html elements directly from that webpage and try on it locally. For some reason, when I tried the same selector within my below script, it doesn't do the trick. It throws timeout exception error instead. How can I fetch it?
This is my try:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
link = "https://in.finance.yahoo.com/quote/AAPL?p=AAPL"
def get_information(driver, url):
driver.get(url)
item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
driver.execute_script("arguments[0].scrollIntoView();", item)
print(item.text)
if __name__ == "__main__":
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 20)
try:
get_information(driver,link)
finally:
driver.quit()
It seem that there is no Business Summary block initially, but it is generated after you scroll page down. Try below solution:
from selenium.webdriver.common.keys import Keys
def get_information(driver, url):
driver.get(url)
driver.find_element_by_tag_name("body").send_keys(Keys.END)
item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
print(item.text)
You have to scroll the page down twice until the element will be present:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time
link = "https://in.finance.yahoo.com/quote/AAPL?p=AAPL"
def get_information(driver, url):
driver.get(url)
driver.find_element_by_tag_name("body").send_keys(Keys.END) # scroll page
time.sleep(1) # small pause between
driver.find_element_by_tag_name("body").send_keys(Keys.END) # one more time
item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
driver.execute_script("arguments[0].scrollIntoView();", item)
print(item.text)
if __name__ == "__main__":
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 20)
try:
get_information(driver,link)
finally:
driver.quit()
If you will scroll only one time it won't work properly at some reason(at least for me). I think it depends on window dimensions, on the smaller window you have to scroll more than on a bigger one.
Here is a much simpler approach using requests and working with the JSON data that is already in the page. I would also recommend to always use request if possible. It may take some extra work but the end result is a lot more reliable / cleaner. You could also take my example a lot further and parse the JSON to work directly with it (you need to clean up the text to be valid JSON). In my example I just use split which was just faster to do but it could lead to problems down the road when doing something more complex.
import requests
from lxml import html
url = 'https://in.finance.yahoo.com/quote/AAPL?p=AAPL'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(url, headers=headers)
tree = html.fromstring(r.text)
data= [e.text_content() for e in tree.iter('script') if 'root.App.main = ' in e.text_content()][0]
data = data.split('longBusinessSummary":"')[1]
data = data.split('","city')[0]
print (data)
I was trying to get the embedded video URL from https://www.fmovies.is . I'm using selenium.PhantomJS(). The exact same code works perfectly if I use selenium.Firefox() driver . It seems as though I'm doing something wrong during the waiting phase.
If someone could point out what I was doing wrong , I would really appreciate it.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import DesiredCapabilities
desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
desired_capabilities['phantomjs.page.customHeaders.User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome"
desired_capabilities['phantomjs.page.customHeaders.Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
url = "https://fmovies.is/film/kung-fu-panda-2.9kx/q8kkyj"
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any'],desired_capabilities=desired_capabilities)
driver.get(url)
try:
element = WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.ID, "jw")))
finally:
driver.find_element_by_id("player").click()
pageSource = driver.page_source
soup = BeautifulSoup(pageSource,'lxml')
url = soup.find("video",{"class":"jw-video"})
print url
videoURL = ''
if url:
videoURL = url['src']
print videoURL
I need to scrape google suggestions from search input. Now I use selenium+phantomjs webdriver.
search_input = selenium.find_element_by_xpath(".//input[#id='lst-ib']")
search_input.send_keys('phantomjs har python')
time.sleep(1.5)
from lxml.html import fromstring
etree = fromstring(selenium.page_source)
output = []
for suggestion in etree.xpath(".//ul[#role='listbox']/li//div[#class='sbqs_c']"):
output.append(" ".join([s.strip() for s in suggestion.xpath(".//text()") if s.strip()]))
but I see in firebug XHR request like this. And response - simple text file with data that I need. Then I look at log:
selenium.get_log("har")
I can't see this request. How can I catch it? I need this url for using as template for requests lib to use it with other search words. Or may be possible run js that initiate this request with other(not from input field) arguments, is it possible?
You can solve it with Python+Selenium+PhantomJS only.
Here is the list of things I've done to make it work:
pretend to be a browser with a head by changing the PhantomJS's User-Agent through Desired Capabilities
use Explicit Waits
ask for the direct https://www.google.com/?gws_rd=ssl#q=phantomjs+har+python url
Working solution:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS
desired_capabilities["phantomjs.page.customHeaders.User-Agent"] = "Mozilla/5.0 (Linux; U; Android 2.3.3; en-us; LG-LU3000 Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
driver.get("https://www.google.com/?gws_rd=ssl#q=phantomjs+har+python")
wait = WebDriverWait(driver, 10)
# focus the input and trigger the suggestion list to be shown
search_input = wait.until(EC.visibility_of_element_located((By.NAME, "q")))
search_input.send_keys(Keys.ARROW_DOWN)
search_input.click()
# wait for the suggestion box to appear
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "ul[role=listbox]")))
# parse suggestions
print "List of suggestions: "
for suggestion in driver.find_elements_by_css_selector("ul[role=listbox] li[dir]"):
print suggestion.text
Prints:
List of suggestions:
python phantomjs screenshot
python phantomjs ghostdriver
python phantomjs proxy
unable to start phantomjs with ghostdriver
Trying to screen scrape a web site without having to launch an actual browser instance in a python script (using Selenium). I can do this with Chrome or Firefox - I've tried it and it works - but I want to use PhantomJS so it's headless.
The code looks like this:
import sys
import traceback
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
"(KHTML, like Gecko) Chrome/15.0.87"
)
try:
# Choose our browser
browser = webdriver.PhantomJS(desired_capabilities=dcap)
#browser = webdriver.PhantomJS()
#browser = webdriver.Firefox()
#browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
# Go to the login page
browser.get("https://www.whatever.com")
# For debug, see what we got back
html_source = browser.page_source
with open('out.html', 'w') as f:
f.write(html_source)
# PROCESS THE PAGE (code removed)
except Exception, e:
browser.save_screenshot('screenshot.png')
traceback.print_exc(file=sys.stdout)
finally:
browser.close()
The output is merely:
<html><head></head><body></body></html>
But when I use the Chrome or Firefox options, it works fine. I thought maybe the web site was returning junk based on the user agent, so I tried faking that out. No difference.
What am I missing?
UPDATED: I will try to keep the below snippet updated with until it works. What's below is what I'm currently trying.
import sys
import traceback
import time
import re
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")
try:
# Set up our browser
browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
#browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
# Go to the login page
print "getting web page..."
browser.get("https://www.website.com")
# Need to wait for the page to load
timeout = 10
print "waiting %s seconds..." % timeout
wait = WebDriverWait(browser, timeout)
element = wait.until(EC.element_to_be_clickable((By.ID,'the_id')))
print "done waiting. Response:"
# Rest of code snipped. Fails as "wait" above.
I was facing the same problem and no amount of code to make the driver wait was helping.
The problem is the SSL encryption on the https websites, ignoring them will do the trick.
Call the PhantomJS driver as:
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])
This solved the problem for me.
You need to wait for the page to load. Usually, it is done by using an Explicit Wait to wait for a key element to be present or visible on a page. For instance:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# ...
browser.get("https://www.whatever.com")
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.content")))
html_source = browser.page_source
# ...
Here, we'll wait up to 10 seconds for a div element with class="content" to become visible before getting the page source.
Additionally, you may need to ignore SSL errors:
browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
Though, I'm pretty sure this is related to the redirecting issues in PhantomJS. There is an open ticket in phantomjs bugtracker:
PhantomJS does not follow some redirects
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])
This worked for me