I have a piece of code that downloads files from a WebDAV server.
_download(self) is a thread function; it is managed by a multi_download(self) controller that keeps the thread count under 24, and that part works fine. It should ensure that no more than 24 sockets are used. It is very straightforward, so I am not going to post the method here. Possibly relevant: I am using Threads, not ThreadPoolExecutor - I am not much of a fan of Pool - and handling the maximum thread count manually.
The problem appears when e.g. the VPN drops and I cannot connect, or on some other unhandled network problem. I could handle that, of course, but that's not the point here.
The unexpected behaviour is HERE:
After a while of retrials and logged exceptions, the file descriptor count seems to exceed the limit, because the code starts throwing the error below. This never happened when the whole process ran without errors/retrials:
NOTE: the webdav.download() library method uses with open(file, 'wb') to download data, so there should be no hanging FDs from that either.
2022-02-09 10:36:53,898 - DEBUG - 2294-1644212940.tdms thrd - Retried download successfull on 25 attempt [webdav.py:_download:183]
2022-02-09 10:36:53,904 - DEBUG - 2294-1644212940.tdms thrd - downloaded 900 files [webdav.py:add_download_counter:67]#just a log
2022-02-09 10:36:59,801 - DEBUG - 2294-1644219643.tdms thrd - Retried download successfull on 25 attempt [webdav.py:_download:183]
2022-02-09 10:36:59,856 - DEBUG - 2294-1644213248.tdms thrd - Retried download successfull on 25 attempt [webdav.py:_download:183]
2022-02-09 10:36:59,905 - WARNING - 2294-1643646904.tdms thrd - WebDav cannot connect: HTTPConnectionPool(host='123.16.456.123', port=987):
Max retries exceeded with url:/path/to/webdav/file (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7b3377d898>:
Failed to establish a new connection: [Errno 24] Too many open files'))
# Marked in the code below where this is thrown!
I assume this means I am opening too many sockets, but I have tried to clean up after myself - in the code you can see me closing the session and even deleting the reference to the client to keep things neat. BUT after a while of debugging I still cannot get hold of WHERE I am forgetting something and where the hanging sockets are. I am asking for help before I start counting FDs and subclassing easywebdav2 classes :) Thanks, Q.
# Python 3.7.3
import requests

from easywebdav2 import Client as WebDavClient
# WebDav source:
# https://github.com/zabuldon/easywebdav/blob/master/easywebdav2/client.py

# logger, kw (the client kwargs) and MAX_RETRY are defined elsewhere.

def clean_webdav(self, webdav):
    """Closing sockets after use."""
    try:
        webdav.session.close()
    except Exception as err:
        logger.error(f'Err closing session: {err}')
    finally:
        del webdav

def _download(self, local, remote, *, retry=0):
    """This is a thread function, therefore raising SystemExit."""
    try:
        webdav = WebDavClient(**kw)
    except Exception as err:
        logger.error(f'There is an err creating client: {err}')
        raise SystemExit
    try:
        webdav.download(remote, local)  # <--------------- HERE THROWS
        if retry != 0:
            logger.info(f'Retry number {retry} was successful')
    except (ConnectionError, requests.exceptions.ConnectionError) as err:
        if retry >= MAX_RETRY:
            logger.exception(f'There was err: {err}')
            return
        retry += 1
        self.clean_webdav(webdav)
        self._download(local, remote, retry=retry)
    except Exception as err:
        logger.error(f'Unhandled Exception: {err}')
    finally:
        self.clean_webdav(webdav)
        raise SystemExit
EDIT: Since one answer referenced WebDAV being an HTTP protocol extension (which it is) - HTTP keep-alive should not play a role here if I am specifically closing the requests session via webdav.session.close(), which is indeed THE requests session the webdav client made. There should be no keep-alive after specifically closing, right?
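For reference, counting the FDs themselves is cheap. A minimal sketch of what I mean, assuming psutil is installed (num_fds() is Unix-only); this is hypothetical and not part of the code above:

import os
import psutil  # third-party, assumed installed

proc = psutil.Process(os.getpid())
# Log this periodically, e.g. from the controller loop, to see
# whether descriptors grow with every retry round.
print(f'open fds: {proc.num_fds()}, sockets: {len(proc.connections())}')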
I'm not specifically familiar with the WebDAV client package you're using, but any WebDAV client would usually support HTTP Keep-Alive, which means that after you release a connection, it will keep it alive in a pool for a while in case you need it again.
The way you use a client like that would be to construct one WebDavClient for your application, and use that one client for all requests. Of course, you need to know that it's thread safe if you're going to call it from multiple download threads.
Since you are creating a WebDavClient in each thread, there's a good chance that total number of connections being kept alive across all of their pools exceeds your file handle limit.
--
EDIT: A quick look around the Web indicates that each WebDavClient creates a Session object from requests, which does indeed have a connection pool, but unfortunately isn't thread-safe. You should create one WebDavClient per thread and use it for all of the downloads that that thread does. That will probably require a little refactoring.
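To illustrate the refactor, here is a minimal sketch (not tested against easywebdav2; worker, job_queue and client_kwargs are made-up names, with client_kwargs standing in for whatever you pass today): each thread builds one client up front, reuses it for every file it handles, and closes the session once at the end.

import queue
import threading

from easywebdav2 import Client as WebDavClient

def worker(job_queue, client_kwargs):
    webdav = WebDavClient(**client_kwargs)  # one session/pool per thread
    try:
        while True:
            try:
                remote, local = job_queue.get_nowait()
            except queue.Empty:
                break  # no work left for this thread
            webdav.download(remote, local)
            job_queue.task_done()
    finally:
        webdav.session.close()  # release the pooled keep-alive connections once

This caps the live connection pools at one per thread (24 in your case), instead of one per in-flight file plus whatever the retries leave behind.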
Related
I am using pub/sub to publish logs from an IoT device to the cloud, where they are stored in Cloud Logging by a cloud function. This had been working fine, but now I am running into issues where the messages are not delivered and eventually the application gets killed. This is the error message:
google.api_core.exceptions.RetryError: Deadline of 60.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x7487bd20>, topic: "projects/projectid/topics/iot_device_logs"
messages {
data: "20210612T04:09:22.116Z - ERROR - Failed to create objects on main "
attributes {
key: "device_num_id"
value: "devincenumid"
}
attributes {
key: "logger_name"
value: "iotXX"
}
}
, metadata=[('x-goog-request-params', 'topic=projects/projectid/topics/iot_device_logs'), ('x-goog-api-client', 'gl-python/3.7.3 grpc/1.33.2 gax/1.23.0 gccl/2.5.0')]), last exception: 503 Transport closed
20210612T04:21:08.211Z - INFO - QueryDevice object created
20210612T04:38:30.880Z - DEBUG - Analyzer failure counts
20210612T04:42:40.760Z - INFO - Attempting to query device
20210612T04:48:05.126Z - DEBUG - Attempting to publish 'info' log on iotXX
bash: line 1: 609 Killed python3.7 path/to/file.py
The code in question is something like this:
def get_callback(self, f, log):
    def callback(f):
        try:
            self.debug(f"Successfully published log: {f.result()}")
        except Exception as e:
            self.debug(f"Failed to publish log: {e}")
    return callback

def publish_log(self, log, severity):
    # data must be in bytestring
    data = log.encode("utf-8")
    try:
        self.debug(f"Attempting to publish '{severity}' log on {self.name}")
        # add two attributes to distinguish the log once in the cloud
        future = PUBLISHER.publish(
            TOPIC_PATH, data=data, logger_name=self.name, device_num_id=self.deviceid)
        futures[log] = future
        # publish failures shall be handled in the callback function
        future.add_done_callback(self.get_callback(future, log))
    except Exception as e:
        self.debug(f"Error on publish_log: {e}")
I believe this is happening during a connection outage, and I can understand that it might not be able to send the messages then. However, I don't understand why the application is being killed.
So far I am trying to change the retry settings to see if that improves things, but I am concerned that the application will keep getting killed.
Any idea how to determine why it is being killed, instead of it simply failing to send and continuing on?
I seem to have found the problem, and it is not what I was thinking. I am posting an answer in case someone else is confused by a similar problem, so that hopefully they are not misguided.
In my case the connection problem coincided with my application being killed, but as far as I can tell it was not the reason: pub/sub and its retry settings had nothing to do with my application getting killed.
I found a more descriptive message in the kernel logs saying that the application had been killed by the out-of-memory reaper because it was consuming too much RAM.
It turns out I had a memory leak in my program: I was not handling the futures generated by the pub/sub publisher properly, so they kept adding up and consuming memory.
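A sketch of the kind of cleanup that stops the leak, based on my snippet above (same names; the exact change in my code may differ): remove each future from the futures dict once its callback has run, so completed futures can be garbage-collected instead of accumulating.

def get_callback(self, f, log):
    def callback(f):
        try:
            self.debug(f"Successfully published log: {f.result()}")
        except Exception as e:
            self.debug(f"Failed to publish log: {e}")
        finally:
            futures.pop(log, None)  # drop the reference so the future can be freed
    return callback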
I have created a Python program that uses Autobahn to make WebSocket connections to a remote host and receive a data flow over these connections.
From time to time, different exceptions occur during these connections. Most often it is an exception immediately on connecting, stating that the initial WebSocket handshake failed (most likely due to an overloaded server), like this:
2017-05-03T20:31:10 dropping connection to peer tcp:1.2.3.4:443 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
Or a later exception during a successful and ongoing connection, saying that the connection timed out due to lack of pong response to a ping, as follows:
2017-05-04T13:33:40 dropping connection to peer tcp:1.2.3.4:443 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
2017-05-04T13:33:40 session closed with reason wamp.close.transport_lost [WAMP transport was lost without closing the session before]
2017-05-04T13:33:40 While firing onDisconnect: Traceback (most recent call last):
File "c:\Python36\lib\site-packages\txaio\aio.py", line 450, in done
f._result = x
AttributeError: attribute '_result' of '_asyncio.Future' objects is not writable
As can be seen above, this also triggers some other strange exception in the txaio module in this particular case.
No matter what kind of exception occurs, I would like to catch it and handle it gracefully, but for some reason none of the exceptions seem to bubble up to the code that initiated these connections (i.e. get caught by my try ... except clause there), which looks like this:
from autobahn.asyncio.wamp import ApplicationSession
from autobahn.asyncio.wamp import ApplicationRunner
...

class MyComponent(ApplicationSession):
    ...

try:
    runner = ApplicationRunner("wss://my.websocket.server.com:443", "realm1")
    runner.run(MyComponent)
except Exception as e:
    print('Unexpected connection error')
...
Instead, all these exceptions just hang my program completely after the error messages have been dumped to the terminal as above. Why is this?
So, the question is: how and where in the code can I catch these exceptions that occur during the WebSocket connections in Autobahn, and react to/handle them gracefully?
I am using Tornado to asynchronously scrape data from many thousands of URLs. Each of them is 5-50 MB, so they take a while to download. I keep getting "Exception: HTTP 599: Connection closed http:…" errors, despite the fact that I am setting both connect_timeout and request_timeout to a very large number.
Why, despite the large timeout settings, am I still timing out on some requests after only a few minutes of running the script? Is there a way to instruct httpclient.AsyncHTTPClient to NEVER time out? Or is there a better solution to prevent timeouts?
The following is how I call the fetch (each worker calls this request_and_save_url() sub-coroutine in the Worker() coroutine):
from functools import partial

from tornado import gen, httpclient

@gen.coroutine
def request_and_save_url(url, q_all):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(
            url, partial(handle_request, q_all=q_all),
            connect_timeout=60*24*3*999999999,
            request_timeout=60*24*3*999999999)
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])
As you note, HTTPError 599 is raised on connect or request timeout, but that is not the only case. The other one is when the connection is closed by the server before the request ends (including the fetch of the entire response), e.g. due to the server's own timeout while handling the request, or whatever else.
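If you want to ride those out, here is a hedged sketch of one approach (my own illustration, not from the question; fetch_with_retries and the timeout values are made up): catch the 599 and retry a bounded number of times with a small backoff.

from tornado import gen, httpclient

@gen.coroutine
def fetch_with_retries(url, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            response = yield httpclient.AsyncHTTPClient().fetch(
                url, connect_timeout=60, request_timeout=600)
            raise gen.Return(response)  # success, hand the response back
        except httpclient.HTTPError as e:
            if e.code != 599 or attempt == attempts:
                raise  # a real HTTP error, or out of attempts
            yield gen.sleep(attempt)  # brief backoff before retrying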
I have the simple minimal 'working' example below that opens a connection to Google every two seconds. When I run the script with a working internet connection, I get the Success message; when I then disconnect, I get the Fail message; and when I reconnect again, I get the Success again. So far, so good.
However, when I start the script while the internet is disconnected, I get the Fail messages, and when I connect later, I never get the Success message. I keep getting the error:
urlopen error [Errno -2] Name or service not known
What is going on?
import urllib2, time

while True:
    try:
        print('Trying')
        response = urllib2.urlopen('http://www.google.com')
        print('Success')
        time.sleep(2)
    except Exception, e:
        print('Fail ' + str(e))
        time.sleep(2)
This happens because the DNS name "www.google.com" cannot be resolved. If there is no internet connection the DNS server is probably not reachable to resolve this entry.
It seems I misread your question the first time. The behaviour you describe is, on Linux, a peculiarity of glibc: it only reads /etc/resolv.conf once, when loading. glibc can be forced to re-read /etc/resolv.conf via the res_init() function.
One solution would be to wrap the res_init() function and call it before calling getaddrinfo() (which is used indirectly by urllib2.urlopen()).
You might try the following (still assuming you're using Linux):
import ctypes
libc = ctypes.cdll.LoadLibrary('libc.so.6')
res_init = libc.__res_init
# ...
res_init()
response = urllib2.urlopen('http://www.google.com')
This might of course be optimized by waiting until "/etc/resolv.conf" is modified before calling res_init().
Another solution would be to install e.g. nscd (name service cache daemon).
For me, it was a proxy problem.
Running the following before importing urllib.request helped:
import os
os.environ['http_proxy']=''
response = urllib.request.urlopen('http://www.google.com')
I'm writing code that will run on Linux, OS X, and Windows. It downloads a list of approximately 55,000 files from the server, then steps through the list, checking whether each file is present locally (with SHA hash verification and a few other goodies). If a file isn't present locally or the hash doesn't match, it downloads it.
The server-side is plain-vanilla Apache 2 on Ubuntu over port 80.
The client side works perfectly on Mac and Linux, but gives me this error on Windows (XP and Vista) after downloading a number of files:
urllib2.URLError: <urlopen error <10048, 'Address already in use'>>
This link: http://bytes.com/topic/python/answers/530949-client-side-tcp-socket-receiving-address-already-use-upon-connect points me to TCP port exhaustion, but "netstat -n" never showed me more than six connections in "TIME_WAIT" status, even just before it errored out.
The code (called once for each of the 55,000 files it downloads) is this:
request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()
datastream = opener.open(request)
outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
datastream.close()
UPDATE: By grepping the log I find that it enters the download routine exactly 3998 times. I've run this multiple times and it fails at 3998 each time. Given that the linked article states that there are 5000-1025=3975 available ports (and some are probably expiring and being reused), it's starting to look a lot more like the linked article describes the real issue. However, I'm still not sure how to fix this. Making registry edits is not an option.
If it really is a resource problem (freeing OS socket resources), try this:
request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()

datastream = None
retry = 3  # 3 tries
while retry:
    try:
        datastream = opener.open(request)
    except urllib2.URLError, ue:
        if ue.reason.find('10048') > -1:
            retry -= 1
            if not retry:
                raise urllib2.URLError("Address already in use / retries exhausted")
        else:
            raise
    if datastream:
        retry = 0

outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
datastream.close()
If you want, you can insert a sleep between retries, or make it OS-dependent.
On my Win XP the problem doesn't show up (I reached 5000 downloads).
I watch my processes and network with Process Hacker.
Thinking outside the box, the problem you seem to be trying to solve has already been solved by a program called rsync. You might look for a Windows implementation and see if it meets your needs.
You should seriously consider copying and modifying this pyCurl example for efficient downloading of a large collection of files.
Instead of opening a new TCP connection for each request you should really use persistent HTTP connections - have a look at urlgrabber (or alternatively, just at keepalive.py for how to add keep-alive connection support to urllib2).
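For example, a hedged sketch (assuming the standalone keepalive.py module from urlgrabber is importable; untested against your setup):

import urllib2
from keepalive import HTTPHandler  # keepalive.py shipped with urlgrabber

# install once at startup; subsequent urlopen() calls reuse TCP connections
opener = urllib2.build_opener(HTTPHandler())
urllib2.install_opener(opener)

datastream = urllib2.urlopen(file_remote_path)

With connections reused, you stop burning one ephemeral port per file, which is exactly what exhausts the 5000-1025 range described in your update.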
All indications point to a lack of available sockets. Are you sure that only 6 are in TIME_WAIT status? If you're running so many download operations, it's very likely that netstat overruns your terminal buffer. I find that netstat overruns my terminal even during normal usage periods.
The solution is to either modify the code to reuse sockets or introduce a timeout. It also wouldn't hurt to keep track of how many open sockets you have, to optimize waiting. The default timeout on Windows XP is 120 seconds, so you want to sleep for at least that long if you run out of sockets. Unfortunately, it doesn't look like there's an easy way to check from Python when a socket has closed and left the TIME_WAIT status.
Given the asynchronous nature of the requests and timeouts, the best way to do this might be in a thread. Make each thread sleep for 2 minutes before it finishes. You can either use a Semaphore or limit the number of active threads to ensure that you don't run out of sockets.
Here's how I'd handle it. You might want to add an exception clause to the inner try block of the fetch section, to warn you about failed fetches.
import time
import threading
import Queue
import urllib2

# assumes url_queue is a Queue object populated with tuples in the form of (url_to_fetch, temp_file)
# also assumes that TotalUrls is the size of the queue before any threads are started.

class urlfetcher(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        try:  # needed to handle the Empty exception raised by an empty queue.
            file_remote_path, temp_file_path = self.queue.get_nowait()
        except Queue.Empty:
            return
        request = urllib2.Request(file_remote_path)
        opener = urllib2.build_opener()
        datastream = opener.open(request)
        outfileobj = open(temp_file_path, 'wb')
        try:
            while True:
                chunk = datastream.read(CHUNK_SIZE)
                if chunk == '':
                    break
                else:
                    outfileobj.write(chunk)
        finally:
            outfileobj = outfileobj.close()
            datastream.close()
        time.sleep(120)  # hold the thread for the TIME_WAIT period before it finishes
        self.queue.task_done()

# elsewhere:
while not url_queue.empty():
    if threading.active_count() < 3975:  # hard limit of available ports
        t = urlfetcher(url_queue)
        t.start()
    else:
        time.sleep(2)
url_queue.join()
Sorry, my python is a little rusty, so I wouldn't be surprised if I missed something.