I'm using Selenium WebDriver (in Python) to automate downloading thousands of files from a certain website (one that can't be scraped by conventional means like urllib or httplib). My script works perfectly with Firefox, but since I don't need to watch the magic happen, I'm trying to use PhantomJS. It works almost all the way through, except when it tries to click a certain button in order to close a window. Here's the command at which the script gets stuck:
browser.find_element_by_css_selector("img[alt=\"Close Window\"]").click()
It just hangs there; nothing happens.
PhantomJS is faster than Firefox (since there are no visuals), so I thought the problem might be related to the 'Close Window' button not being clickable soon enough. Hence I tried using an explicit wait:
element = WebDriverWait(browser, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "img[alt=\"Close Window\"]")))
print "done with waiting"
browser.find_element_by_css_selector("img[alt=\"Close Window\"]").click()
Doesn't work: the wait ends pretty quickly (the "done with waiting" message appears after a second or so), but then the code hangs again. I've also tried using an implicit wait, but that didn't work either.
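For completeness, the implicit-wait attempt looked roughly like this (a sketch; the 30-second value mirrors the explicit wait above):
browser.implicitly_wait(30)  # applies to every subsequent find_element call
browser.find_element_by_css_selector("img[alt=\"Close Window\"]").click()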
So, I'm at a loss. The same script runs like a charm when I use Firefox, so why doesn't it work with PhantomJS?
I don't know if this helps, but here is the page source:
http://www.flickr.com/photos/88729961@N00/9512669916/sizes/l/in/photostream/
I don't know if this helps either, but when I break the execution with Ctrl-C, I get this:
Traceback (most recent call last):
File "myscript.py", line 361, in <module>
myfunction(some_argument, some_other_argument)
File "myscript.py", line 277, in myfunction
browser.find_element_by_css_selector("img[alt=\"Close Window\"]").click()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webelement.py", line 54, in click
self._execute(Command.CLICK_ELEMENT)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webelement.py", line 228, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webdriver.py", line 163, in execute
response = self.command_executor.execute(driver_command, params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(url, method=command_info[0], data=data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/remote_connection.py", line 396, in _request
response = opener.open(request)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
I'm new to programming and I can't make sense of this output (I don't even know what a "socket" is). But maybe some of you can point me in the right direction? A quick fix might be too much to ask, but maybe a hint as to what could be going on?
(Mac OS X 10.6.8, Python 2.7.5, Selenium 2.33, PhantomJS 1.9.1)
Running the following line of code in your script solves the problem. Instead of clicking the button, it invokes the page's own closeWindow JavaScript function directly:
browser.execute_script("closeWindow(false, '/lnacui2api/cart/displayCart.do', 'false');");
Related
I need to scrape news from several JavaScript-heavy sites, and I use Selenium + PhantomJS for that. But these sites also carry videos, which are useless to me. (I was advised to use Selenium + Chrome or Selenium + Firefox instead, but I don't want any browser windows opening during parsing.)
These videos start playing automatically according to the sites' logic, and eventually the exception http.client.RemoteDisconnected: Remote end closed connection without response is thrown.
I think it's thrown because my internet connection is very slow and the videos can't be fully loaded over it.
How can I avoid this problem?
Do Selenium or PhantomJS perhaps offer any way to constrain which content gets loaded?
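For context, the only resource switch I know of in PhantomJS covers images; it can be passed through Selenium's service_args (a sketch; I'm not aware of an equivalent switch for video):
from selenium import webdriver

# --load-images=false is a PhantomJS command-line option, passed
# through Selenium's service_args; it suppresses images only.
driver = webdriver.PhantomJS(
    "/Users/user/.phantomjsdriver/phantomjs",
    service_args=['--load-images=false'])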
Full traceback:
File "viralnova/viralnova.py", line 101, in parse_viralnova
_parse_post_link(postlinktest, driver)
File "viralnova/viralnova.py", line 9, in _parse_post_link
driver.get(post_link)
File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 309, in get
self.execute(Command.GET, {'url': url})
File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 295, in execute
response = self.command_executor.execute(driver_command, params)
File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 464, in execute
return self._request(command_info[0], url, body=data)
File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 526, in _request
resp = opener.open(request, timeout=self._timeout)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 1346, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 1321, in do_open
r = h.getresponse()
File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
The code is here:
from bs4 import BeautifulSoup as Soup
from selenium import webdriver


def _parse_post_link(post_link, driver):
    try:
        driver.get(post_link)
    except Exception:
        return None
    post_page_soup = Soup(driver.page_source, "lxml")
    title = post_page_soup.find('div', attrs={'class': 'post-box-detail article'}).h2.text
    print(title)


def parse_viralnova(to_csv=True):
    driver = webdriver.PhantomJS("/Users/user/.phantomjsdriver/phantomjs")
    postlinktest = 'http://www.viralnova.com/restroom-design-fails/'
    _parse_post_link(postlinktest, driver)
If it's just the text content you're after, consider using plain Python with requests and BeautifulSoup instead. No browser is involved at all (you mentioned you don't want windows opening), so nothing on the page can start playing, and the solution is faster without the browser overhead.
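For example (a sketch; it assumes the title is present in the static HTML rather than injected by JavaScript, which you'd need to verify per site):
import requests
from bs4 import BeautifulSoup

# No browser involved, so nothing on the page can start playing.
resp = requests.get('http://www.viralnova.com/restroom-design-fails/')
soup = BeautifulSoup(resp.text, 'lxml')
# Same selector as in the question's code.
title = soup.find('div', attrs={'class': 'post-box-detail article'}).h2.text
print(title)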
If you do need some JavaScript executed, you can try using dryscrape as well.
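A minimal dryscrape sketch (untested here; it assumes dryscrape and its webkit_server dependency are installed):
import dryscrape

session = dryscrape.Session()
session.set_attribute('auto_load_images', False)  # skip images at least
session.visit('http://www.viralnova.com/restroom-design-fails/')
html = session.body()  # the rendered HTML, after JavaScript has run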
I want to create a web-based scraper using Python, Selenium and PhantomJS where you can input a URL into a form and the results of the scrape are returned to the webpage. I can run it on my PC, and I can also get it to work through the terminal.
It is located in a virtual environment on Dreamhost shared hosting with Python 3.5 installed. I have tested that the parameters are being passed in fine, and it does work using just lxml and requests. However, when I try to run the script from the form on the webpage using PhantomJS, it doesn't work properly. The following error is returned:
Traceback (most recent call last):
File "testscrape.py", line 140, in <module>
driver = init_driver()
File "testscrape.py", line 69, in init_driver
driver = webdriver.PhantomJS(executable_path=phantomPATH,desired_capabilities=dcap)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 56, in __init__
desired_capabilities=desired_capabilities)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 91, in __init__
self.start_session(desired_capabilities, browser_profile)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 173, in start_session
'desiredCapabilities': desired_capabilities,
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 231, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
return self._request(command_info[0], url, body=data)
File "/home/paul/.python35/bin/magenv/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 463, in _request
resp = opener.open(request, timeout=self._timeout)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 465, in open
response = self._open(req, data)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 483, in _open
'_open', req)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 443, in _call_chain
result = func(*args)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 1268, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/home/paul/.python35/lib/python3.5/urllib/request.py", line 1243, in do_open
r = h.getresponse()
File "/home/paul/.python35/lib/python3.5/http/client.py", line 1174, in getresponse
response.begin()
File "/home/paul/.python35/lib/python3.5/http/client.py", line 282, in begin
version, status, reason = self._read_status()
File "/home/paul/.python35/lib/python3.5/http/client.py", line 243, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/home/paul/.python35/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer
I have tried a few different variations of desired_capabilities (along the lines of the sketch below) and even changing the file permissions of everything in the virtual environment, but to no avail. I must be missing something, or is it just not possible? Any suggestions gratefully received.
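For reference, a typical variation looked like this (a sketch; the user-agent value is purely illustrative, and phantomPATH is the path variable from the traceback above):
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
# Spoofing a desktop user agent is one common variation; the value
# here is illustrative only.
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
driver = webdriver.PhantomJS(executable_path=phantomPATH,
                             desired_capabilities=dcap)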
While using mechanize to open and process a lot of pages (1000+) on a website, I have hit a strange problem. Every now and then I get stuck trying to load a page, without timing out. The problem doesn't seem to be page-specific: if I run the script again and try to open the same page, it works like a charm. It seems to happen at random.
I'm using this function to open pages:
import time

def openMechanize(br, url):
    while True:
        try:
            print time.localtime()
            print "opening: " + url
            resp = br.open(url, timeout = 2.5)
            print "done\n"
            return resp
        except Exception, errormsg:
            print repr(errormsg)
            print "failed to load page, retrying"
            time.sleep(0.5)
When it gets stuck, it produces the first prints (the current time and the "opening:" line) but never prints "done". I have tried letting it run for hours, but nothing happens.
When interrupting the script with ctrl+c while it is stuck I get the following output:
File "test.py", line 143, in openMechanize
resp = br.open(url, timeout = 2.5)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 230, in _mech_open
response = UserAgentBase.open(self, request, data)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_opener.py", line 193, in open
response = urlopen(self, req, data)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_urllib2_fork.py", line 344, in _open
'_open', req)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_urllib2_fork.py", line 332, in _call_chain
result = func(*args)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_urllib2_fork.py", line 1142, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_urllib2_fork.py", line 1116, in do_open
r = h.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
Upon inspecting socket.py, where it gets stuck, I see the following:
self._rbuf = StringIO()  # reset _rbuf.  we consume it via buf.
while True:
    try:
        data = self._sock.recv(self._rbufsize)
    except error, e:
        if e.args[0] == EINTR:
            continue
        raise
It looks like it gets stuck because recv blocks indefinitely, waiting for data that never arrives; the loop only repeats when recv is interrupted with EINTR.
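One idea I haven't verified: forcing blocked socket reads to give up via a global default timeout, so the recv raises instead of hanging (a sketch):
import socket

# Untested idea: any socket operation that blocks longer than this
# raises socket.timeout instead of hanging forever.
socket.setdefaulttimeout(10)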
Has anyone experienced this error and found some sort of fix?
I'm using Selenium with Python bindings to scrape AJAX content from a web page with headless Firefox. It works perfectly when run on my local machine. When I run the exact same script on my VPS, errors get thrown on seemingly random (yet consistent) lines. My local and remote systems have the same exact OS/architecture, so I'm guessing the difference is VPS-related.
For each of these tracebacks, the line is run 4 times before an error is thrown.
I most often get this URLError when executing JavaScript to scroll an element into view.
File "google_scrape.py", line 18, in _get_data
driver.execute_script("arguments[0].scrollIntoView(true);", e)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 396, in execute_script
{'script': script, 'args':converted_args})['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
Occasionally I'll get this BadStatusLine when reading text from an element.
File "google_scrape.py", line 19, in _get_data
if e.text.strip():
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
return self._parent.execute(command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
A couple of times I've gotten a socket error:
File "google_scrape.py", line 19, in _get_data
if e.text.strip():
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
return self._parent.execute(command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib64/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 104] Connection reset by peer
I'm scraping from Google without a proxy, so my first thought was that my IP address is recognized as a VPS and put under a five-request page-manipulation limit or something. But my initial research indicates that these errors would not arise from being blocked.
Any insight into what these errors mean collectively, or on the necessary considerations when making HTTP requests from a VPS would be much appreciated.
Update
After a little thinking, and looking into what a webdriver really is -- automated browser input -- I should have wondered sooner why remote_connection.py is making urllib2 requests at all: the Python bindings drive the browser through a local HTTP-based wire protocol, so every WebDriver command is an HTTP request. It would seem that the text method of the WebElement class is an "extra" feature of the Python bindings rather than part of the Selenium core. That doesn't explain the above errors, but it may indicate that the text method shouldn't be used for scraping.
Update 2
I realized that, for my purposes, Selenium's only function is getting the AJAX content to load. So after the page loads, I'm parsing the source with lxml rather than getting elements through Selenium, i.e.:
html = lxml.html.fromstring(driver.page_source)
However, page_source is yet another method that results in a call to urllib2, and I consistently get the BadStatusLine error the second time I use it. Minimizing urllib2 requests is definitely a step in the right direction.
Update 3
Eliminating urllib2 requests by grabbing the source with JavaScript is better yet:
html = lxml.html.fromstring(driver.execute_script("return window.document.documentElement.outerHTML"))
Conclusion
These errors can be avoided by doing a time.sleep(10) between every few requests. The best explanation I've come up with is that Google's firewall recognizes my IP as a VPS and therefore puts it under a stricter set of blocking rules.
This was my initial thought, but I still find it hard to believe because my web searches return no indication that the above errors could be caused by a firewall.
If this is the case, though, I would think the stricter rules could be circumvented with a proxy, though that proxy might have to be a local system or Tor to avoid the same restrictions.
As per our conversation, you discovered that even for a small number of daily scrapes, Google has anti-scraping blocking in place. The solution is to put a delay of a few seconds between each fetch.
In the general case, since you are effectively transferring non-recoverable costs to a third party, it is always good practice to minimize the extra resource load you place on the remote server. Without pauses between HTTP fetches, a fast server and connection can amount to a denial of service against the remote host, especially against scrape targets that lack Google's server resources.
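A minimal throttling sketch (the 10-second delay comes from the question's conclusion; the URL list is a placeholder, and the source-grabbing trick is the one from Update 3):
import time
import lxml.html
from selenium import webdriver

driver = webdriver.Firefox()  # the question uses headless Firefox
urls = ['http://example.com/a', 'http://example.com/b']  # placeholder list
for url in urls:
    driver.get(url)
    # Grab the source via JavaScript rather than page_source, keeping
    # wire-protocol requests to a minimum.
    html = lxml.html.fromstring(
        driver.execute_script("return window.document.documentElement.outerHTML"))
    # ... extract data from html here ...
    time.sleep(10)  # pause between fetches to stay under rate limits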
I have been using SUDS and RPCLib to develop a SOAP interface to a software solution that takes a PDF document and returns a PNG, and have found a very interesting problem.
I have written the testing client (using SUDS) and server (using RPCLib), and they work successfully when the documents to be uploaded and returned are smaller than about 3.5 MB. However, when uploading larger documents I get this SUDS error:
Traceback (most recent call last):
File "MyFunc.py", line 90, in <module>
callMyFuncSOAPService(fName, test_id, fNameOut)
File "MyFuncClient.py", line 77, in callMyFuncSOAPService
temp_list = client.service.createInstance(encoded_data, 19, test_id, 20)
File "/usr/local/lib/python2.7/dist-packages/suds/client.py", line 542, in __call__
return client.invoke(args, kwargs)
File "/usr/local/lib/python2.7/dist-packages/suds/client.py", line 602, in invoke
result = self.send(soapenv)
File "/usr/local/lib/python2.7/dist-packages/suds/client.py", line 637, in send
reply = transport.send(request)
File "/usr/local/lib/python2.7/dist-packages/suds/transport/https.py", line 64, in send
return HttpTransport.send(self, request)
File "/usr/local/lib/python2.7/dist-packages/suds/transport/http.py", line 77, in send
fp = self.u2open(u2request)
File "/usr/local/lib/python2.7/dist-packages/suds/transport/http.py", line 118, in u2open
return url.open(u2request, timeout=tm)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 32] Broken pipe>
And when returning a large document, the server finishes processing and sends the document back, but the client hangs.
I have a feeling that this is due to a limit in the HTTP transport layer, but have no idea how to address this. Thanks!
You could just increase your allowed request length.
app = WsgiApplication(...)
app.max_content_length = 10 * 0x100000 # 10 MB
Spyne 2.10 and 2.9.4 will have these parameters in the constructor, so you'll be able to just do this:
WsgiApplication(..., max_content_length=10 * 0x100000)
I found a possible solution. It involved updating to spyne (the successor to rpclib) and editing that library.
In WSGI.py, in the function __wsgi_input_to_iterable(), I commented out the two lines that throw the error:
raise RequestTooLongError()
The problem seems to be that the limit is pulled from length = str(http_env.get('CONTENT_LENGTH', self.max_content_length)), which appears to be incorrect.
I will check with the Spyne developers to see where this bug could be coming from.