Weird problem here. I have a Python 3 script that runs 24/7 and uses Selenium and Firefox to go to a web page and download a file from a download link every 5 minutes. (I can't just download it with urllib or the like, because even though the link address stays constant, the data in the file changes every time the page is reloaded and also depends on the criteria specified.) The script runs fine almost all the time, but I can't get rid of one error that pops up every once in a while and terminates the script. Here's the error:
Traceback (most recent call last):
File "/Users/Shared/ROTH_1/Folio/get_F_notes.py", line 248, in <module>
driver.get(search_url)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 187, in get
self.execute(Command.GET, {'url': url})
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 173, in execute
response = self.command_executor.execute(driver_command, params)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(command_info[0], url, body=data)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 379, in _request
self._conn.request(method, parsed_url.path, body, headers)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1090, in request
self._send_request(method, url, body, headers)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1118, in _send_request
self.putrequest(method, url, **skips)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 966, in putrequest
raise CannotSendRequest(self.__state)
http.client.CannotSendRequest: Request-sent
And here is the part of my script where the problem comes in. Specifically, the script hits the "except ConnectionRefusedError:" branch and, as intended, prints "WARNING 1 : ConnectionRefusedError: search page did not load; now trying again". But then, I think when the loop begins again and tries driver.get(search_url) a second time, the script chokes and gives me the traceback above.
I have researched this quite a bit, and it seems possible that the script is trying to reuse the same connection from the first attempt. The fix seems to be to create a new connection, but that is all I have been able to gather, and I have no idea how to create a new connection with Selenium. Do you? Or is there some other issue here?
search_url = 'https://www.example.com/download_page'

loop_get_search_page = 1
while loop_get_search_page < 7:
    if loop_get_search_page == 6:
        print('WARNING: tried to load search page 5 times; exiting script to try again later')
        ##### log out
        try:
            driver.find_element_by_link_text('Sign Out')
        except NoSuchElementException:
            print('WARNING: NoSuchElementException: Unable to find the link text for the "Sign Out" button')
        driver.quit()
        raise SystemExit
    try:
        driver.get(search_url)
    except TimeoutException:
        print('WARNING ', loop_get_search_page, ': TimeoutException: search page did not load; now trying again', sep='')
        loop_get_search_page += 1
        continue
    except ConnectionRefusedError:
        print('WARNING ', loop_get_search_page, ': ConnectionRefusedError: search page did not load; now trying again', sep='')
        loop_get_search_page += 1
        continue
    else:
        break
Just ran into this problem myself. In my case, I had another thread running on the side that was also making requests via WebDriver. Turns out WebDriver is not threadsafe.
Check out the discussion at Can Selenium use multi threading in one browser? and the links there for more context.
When I removed the other thread, everything started working as expected.
Is it possible that you're running each 5-minute iteration in a new thread?
The only way I know of to "create a new connection" is to launch a new instance of the WebDriver. That can get slow if you're doing a lot of requests, but since you're only doing things every 5 minutes, it shouldn't really affect your throughput. As long as you always clean up your WebDriver instance when your download is done, this might be a good option for you.
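A rough sketch of that pattern, assuming Firefox and a placeholder download routine (the download_once name, the example URL, and the 5-minute sleep are illustrative, not taken from your script):

import time
from selenium import webdriver

def download_once(search_url):
    # start a fresh browser (and a fresh connection to the driver) for each run
    driver = webdriver.Firefox()
    try:
        driver.get(search_url)
        # ... find the download link and fetch the file here ...
    finally:
        # always tear the instance down so a stale connection can't be reused
        driver.quit()

while True:
    download_once('https://www.example.com/download_page')
    time.sleep(300)  # wait 5 minutes before the next download

Putting driver.quit() in a finally block guarantees the old instance is torn down even when driver.get() raises, so the next cycle always starts with a clean connection.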
Related
I am writing a web scraping program that goes through a list of URLs, opens each page, and extracts some information from the soup. Most of the time it works fine, but occasionally the program stops advancing through the list without terminating, raising a warning or exception, or otherwise showing any sign of error. My code, stripped down to the relevant parts, looks like this:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs
# some code...
for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    page = urlopen(req)
    soup = bs(page, features="html.parser")
    # do some stuff with the soup...
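For reference, urlopen() accepts a timeout argument; a sketch of the same loop with an explicit timeout (the 30-second value is arbitrary, and I have not tried this yet) would be:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # a socket-level timeout makes a stalled read raise an exception
    # instead of hanging forever
    page = urlopen(req, timeout=30)
    soup = bs(page, features="html.parser")
    # do some stuff with the soup...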
When the program stalls, if I terminate it manually (using PyCharm), I get this traceback:
File "/Path/to/my/file.py", line 48, in <module>
soup = bs(page, features="html.parser")
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 266, in __init__
markup = markup.read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 454, in read
return self._readall_chunked()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 564, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 610, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1052, in recv_into
return self.read(nbytes, buffer)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 911, in read
return self._sslobj.read(len, buffer)
KeyboardInterrupt
Here's what I have tried and learned:
Added a check to make sure that the page status is always 200 when making the soup. The fail condition never occurs.
Added a print statement after the soup is created. This print statement does not trigger after stalling.
The URLs are always valid. This is confirmed by the fact that the program does not stall on the same URL every time, and further confirmed by a similar program of mine with nearly identical code that shows the same behavior on a different set of URLs.
I have tried running through this step-by-step with a debugger. The problem has not occurred in the 30 or so iterations I've checked manually, which may just be coincidence.
The page returns the correct headers when bs4 stalls. The problem seems to be isolated to the creation of the soup.
What could cause this behavior?
Ubuntu 18.x + Selenium WebDriver (Firefox)
I'm facing a weird problem: the following block works if I run all of it together:
from selenium import webdriver
url = 'https://indiamart.com'
driver = webdriver.Firefox()
driver.get(url)
driver.find_element_by_xpath(xpath).click()
This is happening with every URL I have tried.
However, if I execute it one line at a time, it gives:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/media/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 326, in get
self.execute(Command.GET, {'url': url})
File "/media/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
response = self.command_executor.execute(driver_command, params)
File "/media/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 472, in execute
return self._request(command_info[0], url, body=data)
File "/media/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 495, in _request
self._conn.request(method, parsed_url.path, body, headers)
File "/usr/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1065, in _send_output
self.send(chunk)
File "/usr/lib/python3.6/http/client.py", line 986, in send
self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
The error occurs on this line:
driver.get(url)
However, if I execute the same line again after the Broken Pipe error, it works and gets the URL.
I am very, very confused. Can someone help me out?
Thanks
This is a known bug in the latest build of geckodriver (v0.21.0) when paired with the latest version of Selenium (v3.11).
To work around this bug, either:
a) downgrade geckodriver to v0.20.1 or earlier, or
b) wait for the bugfix/mitigations to be rolled out in upcoming versions of Selenium and/or geckodriver.
The bug originates from the Keep-Alive support newly added in geckodriver v0.21. The default Keep-Alive timeout in 0.21 is set to 5 s, which is exceptionally short.
This bug is tracked here for geckodriver and here for selenium.
This error message...
BrokenPipeError: [Errno 32] Broken pipe
...implies that the GeckoDriver server process received a SIGPIPE while writing to a socket. A BrokenPipeError usually happens when a process tries to write to a socket that has already been fully closed on the client side. This may happen when the GeckoDriver server process doesn't wait until all of the data has been received and simply closes the socket (using the close function) it had opened with the client.
Here you can find a detailed discussion on How to prevent errno 32 broken pipe?
Solution
Moving forward, as you are invoking click() on your desired element, you need to induce a WebDriverWait for the element to be clickable, as follows:
driver.get(url)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "xpath"))).click()
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
A BrokenPipeError can also occur if your request is blocked or takes too long and the server side times out. The server may close the connection, and then, when the client side tries to write to the socket, a BrokenPipeError is thrown. In such cases you can set page_load_timeout as follows:
driver.set_page_load_timeout(3)
Here you can find a detailed discussion on How to set the timeout of 'driver.get' for python selenium 3.8.0?
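For example, a minimal sketch combining the page load timeout with explicit exception handling (the 30-second value and the bare print are only illustrative):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
driver.set_page_load_timeout(30)  # abort any page load that takes longer than 30 s

try:
    driver.get('https://indiamart.com')
except TimeoutException:
    # the page load exceeded the timeout; retry or bail out here
    print('page load timed out')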
The recent release has this issue; upgrade Selenium to the latest release:
pip install -U selenium
I'm posting a file to a webserver using urllib2 in my script and it keeps timing out. My code:
def postdata(nodemac, filename, timestamp):
    try:
        wakeup()
        socket.setdefaulttimeout(TIMEOUT)
        opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler)
        host = HOST
        func = "post_data"
        url = "http://{0}{1}?f={2}&nodemac={3}&time={4}".format(host, URI, func, nodemac, timestamp)
        if os.path.isfile(filename):
            data = {"data": open(filename, "rb")}
            response = opener.open(url, data, timeout=TIMEOUT)
            retval = response.read()
            if "SUCCESS" in retval:
                return 0
            else:
                print "RETVAL " + retval
                return 99
        else:
            print filename + " is not a file"
            return 99
    except Exception as e:
        print e
        return 99
I set TIMEOUT to 20, 60 and 120, but the same thing keeps happening. Something is foul! The 20-second timeout used to work just fine, and then all of a sudden, today, it started timing out on me... does anyone have any clues? I couldn't find anything on the web that got me any further, so I thought I'd try here.
Thanks,
Traceback:
File "gateway.py", line 686, in CloudRun
read_img()
File "gateway.py", line 668, in read_img
retval = database.postimg(mac,fh,timestmp)
File "/root/database.py", line 100, in postimg
response = opener.open(url, data, timeout=TIMEOUT)
File "/usr/lib/python2.7/urllib2.py", line 394, in open
File "/usr/lib/python2.7/urllib2.py", line 412, in _open
File "/usr/lib/python2.7/urllib2.py", line 372, in _call_chain
File "/usr/lib/python2.7/urllib2.py", line 1199, in http_open
File "/usr/lib/python2.7/urllib2.py", line 1174, in do_open
URLError: <urlopen error timed out>
The 20-second timeout used to work just fine, and then all of a sudden, today, it started timing out on me... does anyone have any clues?
Well, the first clue is that the behavior changed even though your code didn't. So, it's got to be something with the environment.
The most likely possibility is that the timeouts are a perfectly accurate sign that the server is overloaded, broken, etc. But, since you gave us no information at all about the server you're trying to talk to, it's hard to do much more than guess.
Here are some ideas for tracking down the problem.
First, take that exact same URL, paste it into your browser's address bar, and see what happens. Does it time out? Or take longer than 20 seconds to respond?
Try adding &data=data to the end of the URL.
Try using a simple tool to send the request as a POST. For example, from the command line, try curl -d data http://whatever.
Try logging exactly what headers and post data urllib2 sent, and make curl send exactly the same thing. (If this fails while the previous one worked, you may want to edit your question, or ask a new one, asking "Why did this work and that fail?", with the details.) See the sketch after this list for one way to turn on that logging.
Try running the same tests from a different computer with a different internet connection.
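As a rough sketch of the logging step above, urllib2 can dump what it sends through the debug level of its HTTP handler (the URL and payload below are placeholders):

import urllib2

# HTTPHandler(debuglevel=1) makes urllib2 print the request line, the headers,
# and the body as they are sent, so you can mirror them with curl.
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
response = opener.open("http://whatever/post_data", "data")
print response.read()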
I have a web service deployed on my box. I want to check the result of this service with various inputs. Here is the code I am using:
import sys
import httplib
import urllib

apUrl = "someUrl:somePort"
fileName = sys.argv[1]
conn = httplib.HTTPConnection(apUrl)
titlesFile = open(fileName, 'r')

try:
    for title in titlesFile:
        title = title.strip()
        params = urllib.urlencode({'search': 'abcd', 'text': title})
        conn.request("POST", "/somePath/", params)
        response = conn.getresponse()
        data = response.read().strip()
        print data + "\t" + title
        conn.close()
finally:
    titlesFile.close()
This code gives an error after the same number of lines (28233) has been printed each time. Error message:
Traceback (most recent call last):
File "testService.py", line 19, in ?
conn.request("POST", "/somePath/", params)
File "/usr/lib/python2.4/httplib.py", line 810, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.4/httplib.py", line 833, in _send_request
self.endheaders()
File "/usr/lib/python2.4/httplib.py", line 804, in endheaders
self._send_output()
File "/usr/lib/python2.4/httplib.py", line 685, in _send_output
self.send(msg)
File "/usr/lib/python2.4/httplib.py", line 652, in send
self.connect()
File "/usr/lib/python2.4/httplib.py", line 636, in connect
raise socket.error, msg
socket.error: (99, 'Cannot assign requested address')
I am using Python 2.4.3, and I am calling conn.close() as well. So why is this error being raised?
This is not a Python problem.
In Linux kernel 2.4, the ephemeral port range is 32768 through 61000, so the number of available ports is 61000 - 32768 + 1 = 28233. From what I understood, because the web service in question is quite fast (< 5 ms actually), all of the ports get used up. The program then has to wait for about a minute or two for the ports to be released.
What I did was count the number of conn.close() calls. When the count reached 28000, I waited for 90 seconds and reset the counter.
BIGYaN identified the problem correctly and you can verify that by calling "netstat -tn" right after the exception occurs. You will see very many connections with state "TIME_WAIT".
The alternative to waiting for port numbers to become available again is to simply use one connection for all requests. You are not required to call conn.close() after each call of conn.request(). You can simply leave the connection open until you are done with your requests.
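A minimal sketch of that approach, reusing the placeholders from the question (the titles.txt filename is also a placeholder):

import httplib
import urllib

conn = httplib.HTTPConnection("someUrl:somePort")
titlesFile = open("titles.txt", 'r')

try:
    for title in titlesFile:
        title = title.strip()
        params = urllib.urlencode({'search': 'abcd', 'text': title})
        conn.request("POST", "/somePath/", params)
        response = conn.getresponse()
        data = response.read().strip()  # read the full response before reusing the connection
        print data + "\t" + title
        # no conn.close() here: the same TCP connection is reused for every request
finally:
    conn.close()  # close once, after all requests are done
    titlesFile.close()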
I too faced a similar issue while executing multiple POST requests using Python's requests library in Spark. To make it worse, I used multiprocessing on each executor to post to the server, so thousands of connections were created in seconds, and each took a few seconds to leave the TIME_WAIT state and release its port for the next set of connections.
Out of all the solutions available on the internet that speak of disabling keep-alive, using requests.Session(), et al., the one I found to work was setting 'Connection': 'close' as a header parameter. You may need to put the header content on a separate line outside the post command, though:
import requests

headers = {
    'Connection': 'close'
}

with requests.Session() as session:
    response = session.post('https://xx.xxx.xxx.x/xxxxxx/x', headers=headers, files=files, verify=False)
    results = response.json()
    print(results)
This is my answer to a similar issue, using the above solution.
I've written my first Python application with the App Engine APIs. It is intended to monitor a list of servers and notify me when one of them goes down, by sending a message to my iPhone using Prowl, sending me an email, or both.
The problem is that a few times a week it notifies me that a server is down even when it clearly isn't. I've tested it with servers I know should be up virtually all the time, like google.com or amazon.com, but I get notifications for them too.
I've got a copy of the code running at http://aeservmon.appspot.com, you can see that google.com was added Jan 3rd but is only listed as being up for 6 days.
Below is the relevant section of the code from checkservers.py that does the checking with urlfetch. I assumed that the DownloadError exception would only be raised when the server couldn't be contacted, but perhaps I'm wrong.
What am I missing?
Full source is on GitHub under mrsteveman1/aeservmon (I can only post one link as a new user, sorry!).
def testserver(self, server):
    if server.ssl:
        prefix = "https://"
    else:
        prefix = "http://"
    try:
        url = prefix + "%s" % server.serverdomain
        result = urlfetch.fetch(url, headers={'Cache-Control': 'max-age=30'})
    except DownloadError:
        logging.info('%s could not be reached' % server.serverdomain)
        self.serverisdown(server, 000)
        return
    if result.status_code == 500:
        logging.info('%s returned 500' % server.serverdomain)
        self.serverisdown(server, result.status_code)
    else:
        logging.info('%s is up, status code %s' % (server.serverdomain, result.status_code))
        self.serverisup(server, result.status_code)
UPDATE Jan 21:
Today I found one of the exceptions in the logs:
ApplicationError: 5
Traceback (most recent call last):
File "/base/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 507, in __call__
handler.get(*groups)
File "/base/data/home/apps/aeservmon/1.339312180538855414/checkservers.py", line 149, in get
self.testserver(server)
File "/base/data/home/apps/aeservmon/1.339312180538855414/checkservers.py", line 106, in testserver
result = urlfetch.fetch(url, headers = {'Cache-Control' : 'max-age=30'} )
File "/base/python_lib/versions/1/google/appengine/api/urlfetch.py", line 241, in fetch
return rpc.get_result()
File "/base/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 501, in get_result
return self.__get_result_hook(self)
File "/base/python_lib/versions/1/google/appengine/api/urlfetch.py", line 331, in _get_fetch_result
raise DownloadError(str(err))
DownloadError: ApplicationError: 5
Other folks have been reporting issues with the fetch service (e.g. http://code.google.com/p/googleappengine/issues/detail?id=1902&q=urlfetch&colspec=ID%20Type%20Status%20Priority%20Stars%20Owner%20Summary%20Log%20Component).
Can you print the exception? It may have more detail, e.g.:
"DownloadError: ApplicationError: 2 something bad"