I want to create a web-based scraper using Python, Selenium and PhantomJS, where you enter a URL into a form and the results of the scrape are returned to the web page. I can run it on my PC and I can also get it to work through the terminal.
It is located in a virtual environment on Dreamhost shared hosting with Python 3.5 installed. I have tested that the parameters are being passed in correctly, and it does work using just lxml and requests. However, when I run the script from the form on the web page using PhantomJS, it fails. The following error is returned:
Traceback (most recent call last):
File "testscrape.py", line 140, in <module>
driver = init_driver()
File "testscrape.py", line 69, in init_driver
driver = webdriver.PhantomJS(executable_path=phantomPATH,desired_capabilities=dcap)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 56, in __init__
desired_capabilities=desired_capabilities)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 91, in __init__
self.start_session(desired_capabilities, browser_profile)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 173, in start_session
'desiredCapabilities': desired_capabilities,
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 231, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
return self._request(command_info[0], url, body=data)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 463, in _request
resp = opener.open(request, timeout=self._timeout)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 465, in open
response = self._open(req, data)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 483, in _open
'_open', req)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 443, in _call_chain
result = func(*args)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 1268, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 1243, in do_open
r = h.getresponse()
File "/home/paul/.python35/lib/python3.5/http/client.py", line 1174, in getresponse
response.begin()
File "/home/paul/.python35/lib/python3.5/http/client.py", line 282, in begin
version, status, reason = self._read_status()
File "/home/paul/.python35/lib/python3.5/http/client.py", line 243, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/home/paul/.python35/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer
I have tried a few different variations of desired_capabilities and even changing file permissions of everything in the virtual environment but to no avail. I must be missing something, or is it just not possible? Any suggestions gratefully received.
I'm trying to get started with Selenium with Chromedriver.
I've downloaded the latest driver for Linux_64 to
/usr/local/share/chromedriver
And links to /usr/local/bin/chromdriver and /usr/bin/chromedriver
Python code
#!/usr/bin/env python
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://www.google.com/')
browser.save_screenshot('screenie.png')
browser.quit()
The file is executable
When I run it, the terminal hangs. The only way to get any output is to press CTRL+C, which returns this:
Traceback (most recent call last):
File "./test3.py", line 5, in <module>
browser = webdriver.Chrome()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/chrome/webdriver.py", line 75, in __init__
desired_capabilities=desired_capabilities)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 154, in __init__
self.start_session(desired_capabilities, browser_profile)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 243, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 310, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 466, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 490, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1136, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 453, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 409, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 480, in readline
data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
Any ideas?
While installing the Google Cloud SDK for Python, an httplib2.SSLHandshakeError keeps occurring. I have configured the unfilled_client_secrets.json (shown below the traceback), but this has not solved the handshake error.
Similar questions have been asked here, but none has been explicitly answered. Thank you in advance for any help you can provide.
~ $ ./google-cloud-sdk/install.sh
Welcome to the Google Cloud SDK!
Traceback (most recent call last):
File "/Users/rptrainor/./google-cloud-sdk/bin/bootstrapping/install.py", line 206, in <module>
main()
File "/Users/rptrainor/./google-cloud-sdk/bin/bootstrapping/install.py", line 184, in main
Install(pargs.override_components, pargs.additional_components)
File "/Users/rptrainor/./google-cloud-sdk/bin/bootstrapping/install.py", line 130, in Install
_CLI.Execute(['--quiet', 'components', 'list'])
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 759, in Execute
self._HandleAllErrors(exc, command_path_string, specified_arg_names)
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 737, in Execute
resources = args.calliope_command.Run(cli=self, args=args)
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 741, in Run
display_info=self.ai.display_info).Display()
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/calliope/display.py", line 427, in Display
self._printer.Print(self._resources)
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/resource/resource_printer_base.py", line 251, in Print
for resource in resources:
File "/Users/rptrainor/google-cloud-sdk/lib/surface/components/list.py", line 86, in Run
result = update_manager.List()
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/updater/update_manager.py", line 516, in List
_, diff = self._GetStateAndDiff(command_path='components.list')
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/updater/update_manager.py", line 446, in _GetStateAndDiff
command_path=command_path)
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/updater/update_manager.py", line 429, in _GetLatestSnapshot
*effective_url.split(','), command_path=command_path)
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/updater/snapshots.py", line 165, in FromURLs
for url in urls]
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/updater/snapshots.py", line 186, in _DictFromURL
response = installers.ComponentInstaller.MakeRequest(url, command_path)
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/updater/installers.py", line 283, in MakeRequest
return url_opener.urlopen(req, timeout=timeout)
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/url_opener.py", line 69, in urlopen
return opener.open(req, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/Users/rptrainor/google-cloud-sdk/lib/googlecloudsdk/core/url_opener.py", line 54, in https_open
return self.do_open(build, req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1181, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 995, in request
self._send_request(method, url, body, headers)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1029, in _send_request
self.endheaders(body)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 991, in endheaders
self._send_output(message_body)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 844, in _send_output
self.send(msg)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 806, in send
self.connect()
File "/Users/rptrainor/google-cloud-sdk/lib/third_party/httplib2/__init__.py", line 1081, in connect
raise SSLHandshakeError(e)
httplib2.SSLHandshakeError: [Errno 1] _ssl.c:510: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
{
"web":{
"client_id":"[[CLIENT_ID_IS_HERE]]",
"project_id":"[[PROJECT_ID_IS_HERE]]",
"auth_uri":"https://accounts.google.com/o/oauth2/auth",
"token_uri":"https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs",
"client_secret":"[[CLIENT_SECRET_IS_HERE]]"
}
}
Try updating Python to the latest 2.7.x release. I resolved the very same issue by updating Python to 2.7.13.
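To compare the interpreter before and after an upgrade, it can help to print which Python release and which OpenSSL build are actually in use. This is only a small sketch; the function name tls_runtime_summary is made up for illustration, and it does not by itself prove that the version is the cause.

```python
import ssl
import sys


def tls_runtime_summary():
    """Return a one-line description of the Python and OpenSSL versions in use."""
    # sys.version_info[:3] is (major, minor, micro), e.g. (2, 7, 13)
    version = ".".join(str(part) for part in sys.version_info[:3])
    # ssl.OPENSSL_VERSION is the version string of the OpenSSL library
    # that this interpreter was built against
    return "Python %s / %s" % (version, ssl.OPENSSL_VERSION)


if __name__ == "__main__":
    print(tls_runtime_summary())
```

Run it with the same interpreter that install.sh picks up, since a machine can have several Python installations.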
One silly yet effective solution could be to open these URLs in a browser once and accept their certificates.
Also check your computer's clock. If the date and time are not current, certificate verification can fail because the certificate appears expired or not yet valid.
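As a rough way to test the clock theory, the sketch below checks whether a given time falls inside a certificate's validity window. The function name clock_within_validity and the sample dates are made up for illustration; the date strings use the format that Python's ssl module reports for a certificate's notBefore and notAfter fields.

```python
import ssl
import time


def clock_within_validity(not_before, not_after, now=None):
    """Return True if `now` (seconds since the epoch) is inside the validity window.

    `not_before` and `not_after` are strings such as "Jan  1 00:00:00 2020 GMT".
    When `now` is omitted, the local system clock is used.
    """
    if now is None:
        now = time.time()
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    return start <= now <= end


# Hypothetical validity window, for illustration only
NOT_BEFORE = "Jan  1 00:00:00 2020 GMT"
NOT_AFTER = "Jan  1 00:00:00 2021 GMT"

if __name__ == "__main__":
    print(clock_within_validity(NOT_BEFORE, NOT_AFTER))
```

If the system clock is years off, a check like this fails for certificates that are in fact valid, which matches the "certificate verify failed" symptom.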
I am running a script with Selenium and headless Firefox on my Linux server, where it runs well. But I cannot get the same setup installed and configured on another server.
I am getting this error from my Python script:
Traceback (most recent call last):
File "cde.py", line 290, in <module>
acde.Run()
File "cde.py", line 76, in Run
self.driver.get(self.link_to_explore)
File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 213, in get
self.execute(Command.GET, {'url': url})
File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 199, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
return self._request(command_info[0], url, body=data)
File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 426, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1127, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 453, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
Maybe I am missing a dependency. Is it possible to clone the configuration of an app and use it on another machine?
I'm using Selenium with Python bindings to scrape AJAX content from a web page with headless Firefox. It works perfectly when run on my local machine. When I run the exact same script on my VPS, errors get thrown on seemingly random (yet consistent) lines. My local and remote systems have the same exact OS/architecture, so I'm guessing the difference is VPS-related.
For each of these tracebacks, the line is run 4 times before an error is thrown.
I most often get this URLError when executing JavaScript to scroll an element into view.
File "google_scrape.py", line 18, in _get_data
driver.execute_script("arguments[0].scrollIntoView(true);", e)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 396, in execute_script
{'script': script, 'args':converted_args})['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
Occasionally I'll get this BadStatusLine when reading text from an element.
File "google_scrape.py", line 19, in _get_data
if e.text.strip():
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
return self._parent.execute(command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
A couple times I've gotten a socket error:
File "google_scrape.py", line 19, in _get_data
if e.text.strip():
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
return self._parent.execute(command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib64/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 104] Connection reset by peer
I'm scraping from Google without a proxy, so my first thought was that my IP address is recognized as a VPS and put under a 5-time page-manipulation limitation or something. But my initial research indicates that these errors would not arise from being blocked.
Any insight into what these errors mean collectively, or on the necessary considerations when making HTTP requests from a VPS would be much appreciated.
Update
After a little thinking and looking into what a webdriver really is -- automated browser input -- I should have been confused about why remote_connection.py is making urllib2 requests at all. It would seem that the text method of the WebElement class is an "extra" feature of the python bindings that isn't part of the Selenium core. That doesn't explain the above errors, but it may indicate that the text method shouldn't be used for scraping.
Update 2
I realized that, for my purposes, Selenium's only function is getting the ajax content to load. So after the page loads, I'm parsing the source with lxml rather than getting elements with Selenium, i.e.:
html = lxml.html.fromstring(driver.page_source)
However, page_source is yet another method that results in a call to urllib2, and I consistently get the BadStatusLine error the second time I use it. Minimizing urllib2 requests is definitely a step in the right direction.
Update 3
Eliminating urllib2 requests by grabbing the source with javascript is better yet:
html = lxml.html.fromstring(driver.execute_script("return window.document.documentElement.outerHTML"))
Conclusion
These errors can be avoided by doing a time.sleep(10) between every few requests. The best explanation I've come up with is that Google's firewall recognizes my IP as a VPS and therefore puts it under a stricter set of blocking rules.
This was my initial thought, but I still find it hard to believe because my web searches return no indication that the above errors could be caused by a firewall.
If this is the case though, I would think the stricter rules could be circumvented with a proxy, though that proxy might have to be a local system or tor to avoid the same restrictions.
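The "sleep between every few requests" workaround could be sketched as a small generator that pauses after each batch of items. The name pause_every and its defaults are assumptions for illustration (the 10-second delay comes from the conclusion above, the batch size of 3 is arbitrary); the sleep function is injectable so the behaviour can be checked without actually waiting.

```python
import time


def pause_every(items, every=3, delay=10.0, sleep=time.sleep):
    """Yield each item, pausing for `delay` seconds after every `every` items.

    The pause happens when the consumer asks for the next item, so the
    caller's work on the current item is finished before the sleep starts.
    """
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            sleep(delay)


if __name__ == "__main__":
    # Hypothetical usage: `elements` would be the WebElements being read.
    elements = ["a", "b", "c", "d"]
    for e in pause_every(elements, every=3, delay=0.1):
        print(e)
```

Wrapping the loop over elements this way keeps the delay logic out of the scraping code itself.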
As per our conversation, you discovered that even for a small number of daily scrapes, Google has anti-scraping blocking in place. The solution is to put a delay of a few seconds between each fetch.
In the general case, since you are technically transferring non-recoverable costs to a third party, it is always good practice to try to reduce the extra resource load you are placing upon the remote server. Without pauses between HTTP fetches, a fast server and connection can cause a remote denial of service, especially to scrape targets that do not have Google's server resources.
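One way to enforce a delay of a few seconds between fetches is a limiter that remembers when the last fetch happened and sleeps only for the remaining time. This is a minimal sketch, not a full rate limiter; the class name MinIntervalLimiter is hypothetical, and the clock and sleep functions are parameters so the logic can be tested without real waiting.

```python
import time


class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait(), None before the first call

    def wait(self):
        """Block until the minimum interval since the previous call has elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    limiter = MinIntervalLimiter(0.2)
    for n in range(3):
        limiter.wait()  # call this right before each fetch
        print("fetch", n)
```

Calling wait() before each fetch means slow pages add no extra delay, while fast ones are held back to the chosen pace.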
I've written a Django (version 1.3, sadly) management command to connect to BrowserStack with Selenium and am going to be using it to run integration tests. (I've had to write a custom management command to get around the fact that we use AskBot within this site and it messes up the Django testing framework in some funny ways; otherwise I would simply use the testing framework.)
Gist of the script is here https://gist.github.com/cellofellow/7491221. This is a port of an earlier script that just ran unittest directly without any Django context.
What happens is that when run, I get a traceback like so:
./manage.py browserstack signup
Browser: IE
Browser Version: 10.0
Operating System: Windows
OS Version: 7
E
======================================================================
ERROR: runTest (apps.common.management.commands.browserstack.SignUpBasic)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jgardner/izeni/doterra_pro/apps/common/management/commands/browserstack.py", line 46, in setUp
desired_capabilities=self.caps)
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 71, in __init__
self.start_session(desired_capabilities, browser_profile)
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 113, in start_session
'desiredCapabilities': desired_capabilities,
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 442, in error
result = self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 897, in http_error_401
url, req, headers)
File "/usr/lib/python2.7/urllib2.py", line 872, in http_error_auth_reqed
response = self.retry_http_basic_auth(host, req, realm)
File "/usr/lib/python2.7/urllib2.py", line 885, in retry_http_basic_auth
return self.parent.open(req, timeout=req.timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
timeout: timed out
----------------------------------------------------------------------
Ran 1 test in 5.201s
FAILED (errors=1)
In BrowserStack an instance is started but because whatever happens next can't connect, it simply runs for a minute or so and then exits.
The script it was ported from didn't have this problem. What may be causing it?
Turns out I simply had to call socket.setdefaulttimeout(60). There are dozens of calls to socket.setdefaulttimeout in this codebase, both in dependencies and our own code, so who knows what it was actually set to.
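Since other code may change the global default as well, one option is to set the timeout only around the block that needs it and restore the previous value afterwards. The sketch below assumes nothing about the BrowserStack script itself; default_socket_timeout is a hypothetical helper name.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the process-wide default socket timeout."""
    previous = socket.getdefaulttimeout()  # None means "no timeout"
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore whatever value was in effect before entering the block
        socket.setdefaulttimeout(previous)


if __name__ == "__main__":
    print("before:", socket.getdefaulttimeout())
    with default_socket_timeout(60):
        # Sockets created here without an explicit timeout get 60 seconds
        print("inside:", socket.getdefaulttimeout())
    print("after:", socket.getdefaulttimeout())
```

Printing socket.getdefaulttimeout() just before the failing call is also a quick way to find out what the timeout was actually set to.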