Trying to scrape a webpage for data, I check the current URL to make sure I'm on the expected page. However, the script eventually raises an error, seemingly when checking the URL. I can't figure out why, and when it happens isn't consistent: sometimes it's several pages into the script, sometimes only a few pages in.
Traceback (most recent call last):
File "scrape.py", line 5, in <module>
scraper.start_search("ebook")
File "/home/ubuntu/workspace/scraper/school/scraper.py", line 56, in start_search
self.scrape_item(product_el)
File "/home/ubuntu/workspace/scraper/school/scraper.py", line 97, in scrape_item
if self.driver.current_url.split("/")[3] != "search":
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 493, in current_url
return self.execute(Command.GET_CURRENT_URL)['value']
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 415, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 489, in _request
resp = opener.open(request, timeout=self._timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
The seemingly relevant code is just:
if self.driver.current_url.split("/")[3] != "search":
    time.sleep(random.randint(1, 3))
    self.driver.back()
I'm using Python 2.7, Selenium, and PhantomJS.
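For context, the [3] index picks out the first path segment of the URL, e.g. with an illustrative URL:
>>> "https://example.com/search/ebook".split("/")
['https:', '', 'example.com', 'search', 'ebook']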
I don't know why this is happening, though I've also seen current_url be flaky. Have you tried mitigating it with some exception handling?
import random
import time

from retry import retry
from urllib2 import URLError

@retry(URLError, tries=3)
def get_url(driver):
    return driver.current_url

def main():
    # Whatever setup you have goes here
    # <...>
    if get_url(driver).split("/")[3] != "search":
        time.sleep(random.randint(1, 3))
        driver.back()

if __name__ == "__main__":
    main()
The retry package is available on PyPI.
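If you'd rather not pull in a dependency, the same idea works as a plain try/except loop; a minimal sketch (the helper name and the three attempts are just illustrative choices):

from urllib2 import URLError

def get_url_with_retries(driver, tries=3):
    # Ask the driver for current_url, retrying a few times, since the
    # underlying HTTP call to PhantomJS can fail transiently.
    for attempt in range(tries):
        try:
            return driver.current_url
        except URLError:
            if attempt == tries - 1:
                raise  # out of attempts; surface the error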
Related
I'm trying to use Telenium to automate testing of a Kivy app. Per the README.md, I've run my application with the Telenium module:
python -m telenium.execute main.py
Next I've tried to connect to this application using a client:
>>> id = cli.pick()
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pyjsonrpc/http.py", line 168, in __call__
return self.http_client_instance.call(self.method, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pyjsonrpc/http.py", line 259, in call
debug = self.debug
File "/usr/local/lib/python2.7/dist-packages/pyjsonrpc/http.py", line 132, in http_request
response = urllib2.urlopen(request, timeout = timeout)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
To diagnose a bit further, I did the following:
>>> cli.url
'http://localhost:9901/jsonrpc'
Any help in resolving this issue would be appreciated.
Running Telenium with your application's main.py will solve this issue.
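Independent of that, [Errno 111] means nothing accepted the connection on localhost:9901. A quick stdlib check (host and port taken from cli.url above) can confirm whether the Telenium JSON-RPC server is actually listening before you call cli.pick():

import socket

# Probe the Telenium JSON-RPC port (localhost:9901, per cli.url above)
# to confirm the server side is actually up.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(2)
try:
    s.connect(("localhost", 9901))
    print("telenium server is listening")
except socket.error as e:
    print("connection failed: %s" % e)
finally:
    s.close()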
I am trying to make a function that will download a file from the internet. If I go to the direct web address, I do get the images or download the files. But when I run my code, it just hangs and then I get a timeout error. Is there any particular reason why that might be happening?
#fpp = "http://www.blog.pythonlibrary.org/wp-content/uploads/2012/06/wxDbViewer.zip"
fpp = "http://www.gunnerkrigg.com//comics/00000001.jpg"
download_file(fpp)
This is my function:
import urllib2

def download_file(url_path):
    response = urllib2.urlopen(url_path)
    data = response.read()
    return data
Is there any particular reason why it might work from the browser but not in the code?
This is the error I get:
Traceback (most recent call last):
File "/Users/dk/testing/myfile.py", line 42, in <module>
download_file(fpp)
File "/Users/dk/Documents/testing/code_project.py", line 154, in download_file
response = urllib2.urlopen(url_path)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1197, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 60] Operation timed out>
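Two things commonly differ between a browser and a bare urllib2 call: the request headers and the effective timeout. A sketch that sets a browser-like User-Agent and an explicit timeout (both values are arbitrary choices, and neither is confirmed by the traceback as the actual cause):

import urllib2

def download_file(url_path):
    # Identify as a browser and cap the wait with an explicit timeout,
    # instead of hanging until the OS-level timeout fires.
    request = urllib2.Request(url_path, headers={"User-Agent": "Mozilla/5.0"})
    response = urllib2.urlopen(request, timeout=30)
    return response.read()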
I've written a Django (version 1.3, sadly) management command to connect to BrowserStack with Selenium, and I'm going to use it to run integration tests. (I've had to write a custom management command to get around the fact that we use AskBot within this site and it messes up the Django testing framework in some funny ways; otherwise I would simply use the testing framework.)
Gist of the script is here https://gist.github.com/cellofellow/7491221. This is a port of an earlier script that just ran unittest directly without any Django context.
What happens is that when ran, I get a traceback like so:
./manage.py browserstack signup
Browser: IE
Browser Version: 10.0
Operating System: Windows
OS Version: 7
E
======================================================================
ERROR: runTest (apps.common.management.commands.browserstack.SignUpBasic)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jgardner/izeni/doterra_pro/apps/common/management/commands/browserstack.py", line 46, in setUp
desired_capabilities=self.caps)
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 71, in __init__
self.start_session(desired_capabilities, browser_profile)
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 113, in start_session
'desiredCapabilities': desired_capabilities,
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/jgardner/.virtualenvs/doterra_pro/local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 442, in error
result = self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 897, in http_error_401
url, req, headers)
File "/usr/lib/python2.7/urllib2.py", line 872, in http_error_auth_reqed
response = self.retry_http_basic_auth(host, req, realm)
File "/usr/lib/python2.7/urllib2.py", line 885, in retry_http_basic_auth
return self.parent.open(req, timeout=req.timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
timeout: timed out
----------------------------------------------------------------------
Ran 1 test in 5.201s
FAILED (errors=1)
In BrowserStack an instance is started, but because whatever happens next can't connect, it simply runs for a minute or so and then exits.
The script it was ported from didn't have this problem. What might be causing it?
Turns out I simply had to call socket.setdefaulttimeout(60). There are dozens of calls to socket.setdefaulttimeout in this codebase, both in dependencies and in our own code, so who knows what it was actually set to.
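For anyone with the same symptom, a minimal sketch of the fix; the hub URL and capability names follow BrowserStack's usual conventions, and the credentials are placeholders:

import socket

# Set the process-wide default before any Selenium HTTP traffic;
# some other import may have lowered it to a tiny value.
socket.setdefaulttimeout(60)

from selenium import webdriver

driver = webdriver.Remote(
    command_executor="http://USERNAME:ACCESS_KEY@hub.browserstack.com:80/wd/hub",
    desired_capabilities={"browser": "IE", "browser_version": "10.0",
                          "os": "Windows", "os_version": "7"},
)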
When I try to crawl Twitter using this code:
import urllib2
s = "https://mobile.twitter.com/bing/"
html = urllib2.urlopen(s).read()
print html
... I get the following error:
Traceback (most recent call last):
File "C:\Users\arpit\Downloads\Desktop\Wiki Code\final Crawler_wiki.py", line 14, in <module>
html = urllib2.urlopen(s).read()
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 418, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "C:\Python27\lib\urllib2.py", line 1177, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 10061] No connection could be made because the target machine actively refused it>
If I replace mobile.twitter.com with twitter.com then it works, but I want it to work with mobile.twitter.com.
The Twitter site is probably looking for a User-Agent header, which you don't have set when you make the request through the urllib2 API.
You will likely need to use something like mechanize to fake your user-agent.
But I highly suggest you use the Twitter API, which provides a lot of easy and awesome ways to play with the data.
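If you'd rather stay with the standard library instead of mechanize, the header can be set on the request directly; a minimal sketch (the User-Agent string is an arbitrary browser-like value):

import urllib2

s = "https://mobile.twitter.com/bing/"
# Attach a browser-like User-Agent so the server doesn't see
# urllib2's default identification.
request = urllib2.Request(s, headers={"User-Agent": "Mozilla/5.0"})
html = urllib2.urlopen(request).read()
print html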
My code:
conn = __get_s3_connection(s3_values.get('accessKeyId'), s3_values.get('secretAccessKey'))
key = s3_values.get('proposal_key') + proposal_unique_id + s3_values.get('proposal_append_path')
request = urllib2.Request(conn.generate_url(s3_values.get('expires_in'), 'GET', bucket=s3_values.get('bucket'), key=key))
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
The url looks like https://production.myorg.s3.amazonaws.com/key/document.xml.gz?Signature=signature%3D&Expires=1349462207&AWSAccessKeyId=accessId
This method was working fine until an hour ago, but when I run the same program now, it throws:
Traceback (most recent call last):
File "/Users/hhimanshu/IdeaProjects/analytics/src/utilities/documentReader.py", line 145, in <module>
main()
File "/Users/hhimanshu/IdeaProjects/analytics/src/utilities/documentReader.py", line 141, in main
x = get_proposal_data_from_s3('documentId')
File "/Users/hhimanshu/IdeaProjects/analytics/src/utilities/documentReader.py", line 54, in get_proposal_data_from_s3
response = urllib2.urlopen(request)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 392, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 370, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1194, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1161, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 6] _ssl.c:503: TLS/SSL connection has been closed>
What could be the reason? How can I avoid this situation?
This was because of an intermittent internet connection. It resolved on its own.
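If it recurs, a small retry loop around the urlopen call can ride out short drops; a minimal sketch (three tries and a five-second pause are arbitrary choices):

import time
import urllib2

def urlopen_with_retries(request, tries=3, pause=5):
    # Retry transient network failures such as a dropped TLS
    # connection, waiting briefly between attempts.
    for attempt in range(tries):
        try:
            return urllib2.urlopen(request)
        except urllib2.URLError:
            if attempt == tries - 1:
                raise
            time.sleep(pause)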