PyCurl request hangs infinitely on perform - python

I have written a script to fetch scan results from Qualys to be run each week for the purpose of metrics gathering.
The first part of this script involves fetching a list of references for each of the scans that were run in the past week for further processing.
The problem is that, while this will work perfectly sometimes, other times the script will hang on the c.perform() line. This is manageable when running the script manually as it can just be re-run until it works. However, I am looking to run this as a scheduled task each week without any manual interaction.
Is there a foolproof way that I can detect if a hang has occurred and resend the PyCurl request until it works?
I have tried setting the c.TIMEOUT and c.CONNECTTIMEOUT options, but these don't seem to be effective. Also, since no exception is thrown, simply wrapping the call in a try-except block won't work either.
The function in question is below:
# Retrieve a list of all scans conducted in the past week
# Save this to refs_raw.txt
def getScanRefs(usr, pwd):
    print("getting scan references...")
    with open('refs_raw.txt', 'wb') as refsraw:
        today = DT.date.today()
        week_ago = today - DT.timedelta(days=7)
        strtoday = str(today)
        strweek_ago = str(week_ago)

        c = pycurl.Curl()
        c.setopt(c.URL, 'https://qualysapi.qualys.eu/api/2.0/fo/scan/?action=list&launched_after_datetime=' + strweek_ago + '&launched_before_datetime=' + strtoday)
        c.setopt(c.HTTPHEADER, ['X-Requested-With: pycurl', 'Content-Type: text/xml'])
        c.setopt(c.USERPWD, usr + ':' + pwd)
        c.setopt(c.POST, 1)
        c.setopt(c.PROXY, 'companyproxy.net:8080')
        c.setopt(c.CAINFO, certifi.where())
        c.setopt(c.SSL_VERIFYPEER, 0)
        c.setopt(c.SSL_VERIFYHOST, 0)
        c.setopt(c.CONNECTTIMEOUT, 3)
        c.setopt(c.TIMEOUT, 3)

        refsbuffer = BytesIO()
        c.setopt(c.WRITEDATA, refsbuffer)
        c.perform()

        body = refsbuffer.getvalue()
        refsraw.write(body)
        c.close()
    print("Got em!")

I fixed the issue myself by using the multiprocessing module to launch the API call in a separate process, killing it and relaunching if it runs for longer than 5 seconds. It's not very pretty, but it is cross-platform. For those looking for a solution that is more elegant but only works on *nix, look into the signal library, specifically SIGALRM.
Code below:
import datetime as DT
import multiprocessing
import time
from io import BytesIO

import certifi
import pycurl

# As this request for scan references sometimes hangs, it is run in a separate process here.
# The process is terminated and relaunched if no response is received within 5 seconds.
def performRequest(usr, pwd):
    today = DT.date.today()
    week_ago = today - DT.timedelta(days=7)
    strtoday = str(today)
    strweek_ago = str(week_ago)

    c = pycurl.Curl()
    c.setopt(c.URL, 'https://qualysapi.qualys.eu/api/2.0/fo/scan/?action=list&launched_after_datetime=' + strweek_ago + '&launched_before_datetime=' + strtoday)
    c.setopt(c.HTTPHEADER, ['X-Requested-With: pycurl', 'Content-Type: text/xml'])
    c.setopt(c.USERPWD, usr + ':' + pwd)
    c.setopt(c.POST, 1)
    c.setopt(c.PROXY, 'companyproxy.net:8080')
    c.setopt(c.CAINFO, certifi.where())
    c.setopt(c.SSL_VERIFYPEER, 0)
    c.setopt(c.SSL_VERIFYHOST, 0)

    refsBuffer = BytesIO()
    c.setopt(c.WRITEDATA, refsBuffer)
    c.perform()
    c.close()

    body = refsBuffer.getvalue()
    with open('refs_raw.txt', 'wb') as refsraw:
        refsraw.write(body)

# Retrieve a list of all scans conducted in the past week
# Save this to refs_raw.txt
def getScanRefs(usr, pwd):
    print("Getting scan references...")
    # Occasionally the request will hang indefinitely. Launch it in a separate
    # process and retry if no response is received within 5 seconds.
    success = False
    while not success:
        sendRequest = multiprocessing.Process(target=performRequest, args=(usr, pwd))
        sendRequest.start()
        for seconds in range(5):
            print("...")
            time.sleep(1)
        if sendRequest.is_alive():
            print("Maximum allocated time reached... Resending request")
            sendRequest.terminate()
            del sendRequest
        else:
            success = True
    print("Got em!")

The question is old, but I will add this answer; it might help someone.
The only way to terminate a running curl transfer after calling perform() is by using callbacks:
1- Using CURLOPT_WRITEFUNCTION:
As stated in the docs:
Your callback should return the number of bytes actually taken care of. If that amount differs from the amount passed to your callback function, it'll signal an error condition to the library. This will cause the transfer to get aborted and the libcurl function used will return CURLE_WRITE_ERROR.
The drawback of this method is that curl only calls the write function when it receives new data from the server, so if the server stops sending data, curl will just keep waiting on the server side and will never receive your kill signal.
2- The alternative, and the best option so far, is using a progress callback:
The beauty of the progress callback is that curl calls it at least once per second even if no data is coming from the server, which gives you the opportunity to return a non-zero value as a kill switch.
Use the option CURLOPT_XFERINFOFUNCTION.
Note that it is preferred over CURLOPT_PROGRESSFUNCTION, as quoted in the docs:
We encourage users to use the newer CURLOPT_XFERINFOFUNCTION instead, if you can.
You also need to set the option CURLOPT_NOPROGRESS:
CURLOPT_NOPROGRESS must be set to 0 to make this function actually get called.
This is an example to show you both write and progress functions implementations in python:
# example of using write and progress function to terminate curl
import pycurl
open('mynewfile', 'w') as f # used to save downloaded data
counter = 0
# define callback functions which will be used by curl
def my_write_func(data):
"""write to file"""
f.write(data)
counter += len(data)
# an example to terminate curl: tell curl to abort if the downloaded data exceeded 1024 byte by returning -1 or any number
# not equal to len(data)
if counter >= 1024:
return -1
def progress(*data):
"""it receives progress from curl and can be used as a kill switch
Returning a non-zero value from this callback will cause curl to abort the transfer
"""
d_size, downloaded, u_size, uploade = data
# an example to terminate curl: tell curl to abort if the downloaded data exceeded 1024 byte by returning non zero value
if downloaded >= 1024:
return -1
# initialize curl object and options
c = pycurl.Curl()
# callback options
c.setopt(pycurl.WRITEFUNCTION, my_write_func)
self.c.setopt(pycurl.NOPROGRESS, 0) # required to use a progress function
self.c.setopt(pycurl.XFERINFOFUNCTION, self.progress)
# self.c.setopt(pycurl.PROGRESSFUNCTION, self.progress) # you can use this option but pycurl.XFERINFOFUNCTION is recommended
# put other curl options as required
# executing curl
c.perform()
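In both cases, once the callback aborts the transfer, perform() raises a pycurl.error (with CURLE_WRITE_ERROR or CURLE_ABORTED_BY_CALLBACK), so the abort can be caught with an ordinary try/except around the perform() call.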

Related

Flask hangs when run as a subprocess alongside pytest

I've spent the last hour and a half trying and failing to debug this test and I am utterly stumped. To simplify the process of testing the Flask server I am building, I have made a relatively simple script which starts the server, then runs pytest, kills the server, writes the outputs to files, and exits with Pytest's exit code. This code was working perfectly until today, and I haven't modified it since (aside from debugging this issue).
Here's the problem: when it gets to a certain point in the tests, it hangs. The weird thing is that this does not happen if I run my tests in any other way.
Debugging my server in VS Code, and running tests in the terminal: works
Running my server using the same code used in the test script and running pytest manually: works
Running pytest using the test script and running the server through the start server script (which uses the same code for running the server as the test script does) in a second terminal: works
Here's the other interesting thing: the tests always hang in the same place, part way through the setup fixture. It sends the clear command, and an echo request to the server (which prints the name of the current test). The database clears successfully, and the server echoes the correct information, but the echo route never exits - my tests never get a response. This echo route behaves perfectly for the 50 or so tests that happen before this point. If I comment out the test that is causing it to fail, it fails on the next test. If I comment out the call to the echo then it hangs on a later test on a completely different request to a different route. When it hangs, the server cannot be killed using a SIGTERM, but instead requires a SIGKILL.
Here is my echo route:
@debug.get('/echo')
def echo() -> IEcho:
    """
    Echo an input. This returns the given value, but also prints it to stdout
    on the server. Useful for debugging tests.

    ## Params:
    * `value` (`str`): value to echo
    """
    try:
        value = request.args['value']
    except KeyError:
        raise http_errors.BadRequest('echo route requires a `value` argument')

    to_print = f'{Fore.MAGENTA}[ECHO]\t\t{value}{Fore.RESET}'
    # Print it to both stdout and stderr to ensure it is seen across all logs
    # Otherwise it could be more difficult to figure out what's up with server
    # output
    print(to_print)
    print(to_print, file=sys.stderr)
    return {'value': value}
And here is my code that sends the requests:
def get(token: JWT | None, url: str, params: dict) -> dict:
    """
    Returns the response to a GET web request

    This also parses the response to help with error checking

    ### Args:
    * `url` (`str`): URL to request to
    * `params` (`dict`): parameters to send

    ### Returns:
    * `dict`: response data
    """
    return handle_response(requests.get(
        url,
        params=params,
        headers=encode_headers(token),
        timeout=3
    ))

def echo(value: str) -> IEcho:
    """
    Echo an input. This returns the given value, but also prints it to stdout
    on the server. Useful for debugging tests.

    ## Params:
    * `value` (`str`): value to echo
    """
    return cast(IEcho, get(None, f"{URL}/echo", {"value": value}))

@pytest.fixture(autouse=True)
def before_each(request: pytest.FixtureRequest):
    """Clear the database between tests"""
    clear()
    echo(f"{request.module.__name__}.{request.function.__name__}")
    print("After echo")  # This never prints
Here is my code for running Pytest in my test script
def pytest():
    pytest = subprocess.Popen(
        [sys.executable, '-u', '-m', 'pytest', '-v', '-s'],
    )

    # Wait for tests to finish
    print("🔨 Running tests...")
    try:
        ret = pytest.wait()
    except KeyboardInterrupt:
        print("❗ Testing cancelled")
        pytest.terminate()
        # write_outputs(pytest, None)
        # write_outputs(pytest, "pytest")
        raise
    # write_outputs(pytest, "pytest")
    if ret == 0:
        print("✅ It works!")
    else:
        print("❌ Tests failed")
    return bool(ret)
And here is my code for running my server in my test script:
def backend(debug=False, live_output=False):
    env = os.environ.copy()
    if debug:
        env.update({"ENSEMBLE_DEBUG": "TRUE"})
        debug_flag = ["--debug"]
    else:
        debug_flag = []

    if live_output is False:
        outputs = subprocess.PIPE
    else:
        outputs = None

    flask = subprocess.Popen(
        [sys.executable, '-u', '-m', 'flask'] + debug_flag + ['run'],
        env=env,
        stderr=outputs,
        stdout=outputs,
    )

    if outputs is not None and (flask.stderr is None or flask.stdout is None):
        print("❗ Can't read flask output", file=sys.stderr)
        flask.kill()
        sys.exit(1)

    # Request until we get a success, but crash if we failed to start in 10
    # seconds
    start_time = time.time()
    started = False
    while time.time() - start_time < 10:
        try:
            requests.get(
                f'http://localhost:{os.getenv("FLASK_RUN_PORT")}/debug/echo',
                params={'value': 'Test script startup...'},
            )
        except requests.ConnectionError:
            continue
        started = True
        break

    if not started:
        print("❗ Server failed to start in time")
        flask.kill()
        if outputs is not None:
            write_outputs(flask, None)
        sys.exit(1)
    else:
        if flask.poll() is not None:
            print("❗ Server crashed during startup")
            if outputs is not None:
                write_outputs(flask, None)
            sys.exit(1)
        print("✅ Server started")

    return flask
So in summary, does anyone have any idea what on earth is happening? It freezes on such a simple route that this makes me very concerned. I think I may have found some crazy bug in Flask or in the requests library or something.
Even if you don't know what's happening with this, it'd be really helpful to have any ideas as to how I can debug this further, as I have absolutely no idea what is going on.
It turns out that my server output was filling up all the buffer space in the pipe, so the server blocked waiting for the buffer to be drained. Meanwhile, my test script was waiting for the tests to exit, and the tests could not progress unless the server was responsive, so the code reached a three-way deadlock. I fixed it by redirecting my output through a file (where the limited buffer size isn't a problem).
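For reference, a minimal sketch of that fix (the helper name and log path are illustrative, not the asker's exact code): hand the subprocess a real file instead of subprocess.PIPE, so the OS pipe buffer can never fill up and block the server.

import subprocess
import sys

def backend_with_file_output(debug_flag, env):
    # File-backed output: writes never block on a full pipe buffer.
    log = open("flask_output.log", "w")
    flask = subprocess.Popen(
        [sys.executable, '-u', '-m', 'flask'] + debug_flag + ['run'],
        env=env,
        stdout=log,
        stderr=log,
    )
    return flask, log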

Storing a file in binary transfer mode with ftplib in Python does not finish [duplicate]

I am trying to upload a file to an FTP site using FTPS, but when I attempt to store the file, it just hangs after the file is fully transferred.
import os
from ftplib import FTP_TLS

global f_blocksize
global total_size
global size_written

f_blocksize = 1024
total_size = os.path.getsize(file_path)
size_written = 0

file = open(file_path, "rb")

# progress callback (defined before it is passed to storbinary)
def handle(block):
    global size_written
    global total_size
    global f_blocksize
    size_written = size_written + f_blocksize if size_written + f_blocksize < total_size else total_size
    percent_complete = size_written / total_size * 100
    print("%s percent complete" % str(percent_complete))

try:
    ftps = FTP_TLS("ftp.example.com")
    ftps.auth()
    ftps.sendcmd("USER username")
    ftps.sendcmd("PASS password")
    ftps.prot_p()
    print(ftps.getwelcome())
    try:
        print("File transfer started...")
        ftps.storbinary("STOR myfile.txt", file, callback=handle, blocksize=f_blocksize)
        print("File transfer complete!")
    except OSError as ex:
        print(ftps.getresp())
except Exception as ex:
    print("FTP transfer failed.")
    print("%s: %s" % (type(ex), str(ex)))
I get the following output:
220 Microsoft FTP Service
File transfer started...
3.5648389904264577 percent complete
7.129677980852915 percent complete
10.694516971279374 percent complete
14.25935596170583 percent complete
17.824194952132288 percent complete
21.389033942558747 percent complete
24.953872932985206 percent complete
28.51871192341166 percent complete
32.083550913838124 percent complete
35.648389904264576 percent complete
39.213228894691035 percent complete
42.778067885117494 percent complete
46.342906875543946 percent complete
49.90774586597041 percent complete
53.472584856396864 percent complete
57.03742384682332 percent complete
60.60226283724979 percent complete
64.16710182767625 percent complete
67.7319408181027 percent complete
71.29677980852915 percent complete
74.8616187989556 percent complete
78.42645778938207 percent complete
81.99129677980854 percent complete
85.55613577023499 percent complete
89.12097476066144 percent complete
92.68581375108789 percent complete
96.25065274151436 percent complete
99.81549173194082 percent complete
100.0 percent complete
After which there is no further progress until the connection times out...
FTP transfer failed.
<class 'ftplib.error_temp'>: 425 Data channel timed out due to not meeting the minimum bandwidth requirement.
While the program is running I can see an empty myfile.txt in the FTP site if I connect and look manually, but when I either cancel it or the connection times out, this empty file disappears.
Is there something I'm missing that I need to invoke to close the file after it has been completely transferred?
This appears to be an issue with Python's SSLSocket class, which is waiting for data from the server when running unwrap. Since it never receives this data from the server, it is unable to unwrap SSL from the socket and therefore times out.
I have identified this particular server by its welcome message as a Microsoft FTP server, which fits in well with the issue written about in this blog.
The "fix" (if you can call it that) was to stop the SSLSocket from attempting to unwrap the connection altogether by editing ftplib.py and amending the FTP_TLS.storbinary() method.
def storbinary(self, cmd, fp, blocksize=8192, callback=None, rest=None):
    self.voidcmd('TYPE I')
    with self.transfercmd(cmd, rest) as conn:
        while 1:
            buf = fp.read(blocksize)
            if not buf: break
            conn.sendall(buf)
            if callback: callback(buf)
        # shutdown ssl layer
        if isinstance(conn, ssl.SSLSocket):
            # HACK: Instead of attempting to unwrap the connection, pass here
            pass
    return self.voidresp()
I faced this issue with the storbinary function when using Python's ftplib.FTP_TLS, prot_p, and a Microsoft FTP server.
Example:
ftps = FTP_TLS(host, username, password)
ftps.prot_p()
ftps.storbinary(...)
The error indicated a timeout in the unwrap function.
It is related to the following issues:
https://www.sami-lehtinen.net/blog/python-32-ms-ftps-ssl-tls-lockup-fix
https://bugs.python.org/issue10808
https://bugs.python.org/issue34557
Resolution:
Open the python page for ftplib: https://docs.python.org/3/library/ftplib.html
Click on the source code which will take you to something like this: https://github.com/python/cpython/blob/3.10/Lib/ftplib.py
Create a copy of this code into your project (example: my_lib\my_ftplib.py)
For the method that is failing, in your case storbinary, the error looks to be on the line that calls conn.unwrap() in that method. Comment out this line and put the keyword pass in its place, otherwise the empty if block will give a syntax error.
Import the above library in your file where you are instantiating the FTP_TLS. Now you will no longer face this error.
Reasoning:
The code in the function ntransfercmd (under the FTP_TLS class) wraps the conn object in an SSL session. The line you have commented out is responsible for tearing down the SSL session after the transfer. For some reason, when using Microsoft's FTP server, the code blocks on that line and results in a timeout. This could be because the server drops the connection after the transfer, or because the server unwraps the SSL from its side; I am not sure. Commenting out that line is harmless because the connection will eventually be closed anyway - see below for details:
In ftplib's python code, you will notice that the conn object in STORBINARY function is enclosed in a with block, and that it is created using socket.create_connection. This means that .close() is automatically called when the code exits the with block (you can confirm this by looking at the __exit__ method on source code of python's socket class).
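Putting the two answers together, a less invasive option is to subclass FTP_TLS and override storbinary, so you don't have to edit or copy ftplib itself. This is a hedged sketch based on the patched method shown above, not the answerer's exact code:

from ftplib import FTP_TLS

class PatchedFTP_TLS(FTP_TLS):
    """FTP_TLS variant that skips the SSL unwrap step after a transfer,
    working around servers that never send the TLS close_notify."""
    def storbinary(self, cmd, fp, blocksize=8192, callback=None, rest=None):
        self.voidcmd('TYPE I')
        with self.transfercmd(cmd, rest) as conn:
            while 1:
                buf = fp.read(blocksize)
                if not buf:
                    break
                conn.sendall(buf)
                if callback:
                    callback(buf)
            # deliberately no conn.unwrap() here
        return self.voidresp()

# usage (host and credentials are placeholders):
# ftps = PatchedFTP_TLS('ftp.example.com', 'username', 'password')
# ftps.prot_p()
# with open(file_path, 'rb') as f:
#     ftps.storbinary('STOR myfile.txt', f)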

Slow upload of many small files with SFTP

When uploading 100 files of 100 bytes each with SFTP, it takes 17 seconds here (after the connection is established, I don't even count the initial connection time). This means it's 17 seconds to transfer 10 KB only, i.e. 0.59 KB/sec!
I know that sending SSH commands to open, write, close, etc. probably creates a big overhead, but still, is there a way to speed up the process when sending many small files with SFTP?
Or is there a special mode in paramiko / pysftp to keep all the write operations in a memory buffer (let's say all operations from the last 2 seconds), and then do everything in one grouped SSH/SFTP pass? This would avoid waiting for the ping time between each operation.
Note:
I have a ~ 100 KB/s connection upload speed (tested 0.8 Mbit upload speed), 40 ms ping time to the server
Of course, if instead of sending 100 files of 100 bytes, I send 1 file of 10 KB, it takes < 1 second
I don't want to have to run a binary program on remote, only SFTP commands are accepted
import pysftp, time, os

with pysftp.Connection('1.2.3.4', username='root', password='') as sftp:
    with sftp.cd('/tmp/'):
        t0 = time.time()
        for i in range(100):
            print(i)
            with sftp.open('test%i.txt' % i, 'wb') as f:  # even worse in a+ append mode: it takes 25 seconds
                f.write(os.urandom(100))
        print(time.time() - t0)
With the following method (100 asynchronous tasks), it's done in ~ 0.5 seconds, which is a massive improvement.
import asyncio, asyncssh  # pip install asyncssh

async def main():
    async with asyncssh.connect('1.2.3.4', username='root', password='') as conn:
        async with conn.start_sftp_client() as sftp:
            print('connected')
            await asyncio.wait([sftp.put('files/test%i.txt' % i) for i in range(100)])

asyncio.run(main())
I'll explore the source, but I still don't know if it groups many operations in few SSH transactions, or if it just runs commands in parallel.
I'd suggest you parallelize the upload using multiple connections from multiple threads. That's an easy and reliable solution; a rough sketch of that approach follows.
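A minimal sketch of that multi-connection, multi-thread approach (host, credentials, and paths are placeholders; each worker opens its own connection, since a single paramiko session shouldn't be shared for concurrent transfers):

import concurrent.futures
import paramiko

HOST, USER, PASSWORD = '1.2.3.4', 'root', ''
FILES = ['files/test%i.txt' % i for i in range(100)]

def upload(local_path):
    # one SSH/SFTP connection per worker thread
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(HOST, username=USER, password=PASSWORD)
    try:
        sftp = ssh.open_sftp()
        sftp.put(local_path, '/tmp/' + local_path.split('/')[-1])
    finally:
        ssh.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(upload, FILES))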
If you want to do it the hard way, by buffering the requests, you can base your solution on the following naive example.
The example:
Queues 100 file open requests;
As it reads the responses to the open requests, it queues write requests;
As it reads the responses to the write requests, it queues close requests
If I do plain SFTPClient.put for 100 files, it takes about 10-12 seconds. Using the code below, I achieve the same about 50-100 times faster.
But! The code is really naive:
It expects that the server responds to the requests in the same order. Indeed, the majority of SFTP servers (including the de facto standard OpenSSH) respond in the same order. But according to the SFTP specification, an SFTP server is free to respond in any order.
The code expects that all file reads happen in one go – upload.localhandle.read(32*1024). That's true for small files only.
The code expects that the SFTP server can handle 100 parallel requests and 100 opened files. That's not a problem for most servers, as they process the requests in order. And 100 opened files should not be a problem for a regular server.
You cannot do that for an unlimited number of files though. You have to queue the files somehow to keep the number of outstanding requests in check. Actually, even these 100 requests are too many.
The code uses non-public methods of SFTPClient class.
I do not do Python. There are definitely ways to code this more elegantly.
import paramiko
import paramiko.sftp
from paramiko.py3compat import long

ssh = paramiko.SSHClient()
ssh.connect(...)

sftp = ssh.open_sftp()

class Upload:
    def __init__(self):
        pass

uploads = []

for i in range(0, 100):
    print(f"sending open request {i}")
    upload = Upload()
    upload.i = i
    upload.localhandle = open(f"{i}.dat")
    upload.remotepath = f"/remote/path/{i}.dat"
    imode = \
        paramiko.sftp.SFTP_FLAG_CREATE | paramiko.sftp.SFTP_FLAG_TRUNC | \
        paramiko.sftp.SFTP_FLAG_WRITE
    attrblock = paramiko.SFTPAttributes()
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_OPEN, upload.remotepath,
                            imode, attrblock)
    uploads.append(upload)

for upload in uploads:
    print(f"reading open response {upload.i}")
    t, msg = sftp._read_response(upload.request)
    if t != paramiko.sftp.CMD_HANDLE:
        raise SFTPError("Expected handle")
    upload.handle = msg.get_binary()
    print(f"sending write request {upload.i} to handle {upload.handle}")
    data = upload.localhandle.read(32*1024)
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_WRITE,
                            upload.handle, long(0), data)

for upload in uploads:
    print(f"reading write response {upload.i} {upload.request}")
    t, msg = sftp._read_response(upload.request)
    if t != paramiko.sftp.CMD_STATUS:
        raise SFTPError("Expected status")
    print(f"closing {upload.i} {upload.handle}")
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_CLOSE, upload.handle)

for upload in uploads:
    print(f"reading close response {upload.i} {upload.request}")
    sftp._read_response(upload.request)

Checking FTP connection is valid using NOOP command

I'm having trouble with one of my scripts seemingly disconnecting from my FTP during long batches of jobs. To counter this, I've attempted to make a module as shown below:
def connect_ftp(ftp):
    print "ftp1"
    starttime = time.time()
    retry = False
    try:
        ftp.voidcmd("NOOP")
        print "ftp2"
    except:
        retry = True
        print "ftp3"
    print "ftp4"
    while (retry):
        try:
            print "ftp5"
            ftp.connect()
            ftp.login('LOGIN', 'CENSORED')
            print "ftp6"
            retry = False
            print "ftp7"
        except IOError as e:
            print "ftp8"
            retry = True
            sys.stdout.write("\rTime disconnected - " + str(time.time() - starttime))
            sys.stdout.flush()
            print "ftp9"
I call the function using only:
ftp = ftplib.FTP('CENSORED')
connect_ftp(ftp)
However, I've traced how the code runs using print lines, and on the first use of the module (before the FTP is even connected to, as far as I can tell) my script runs ftp.voidcmd("NOOP") without raising an exception, so no attempt is made to connect to the FTP initially.
The output is:
ftp1
ftp2
ftp4
ftp success #this is run after the module is called
I admit my code isn't the best or prettiest, and I haven't yet added anything to stop it reconnecting constantly if it keeps failing, but I can't work out why this isn't working for the life of me, so I don't see a point in expanding the module yet. Is this even the best approach for connecting/reconnecting to an FTP server?
Thank you in advance
This connects to the server:
ftp = ftplib.FTP('CENSORED')
So, naturally the NOOP command succeeds, as it does not need an authenticated connection.
Your connect_ftp is correct, except that you need to specify a hostname in your connect call.
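For reference, a minimal sketch of the suggested fix (the hostname value is a placeholder, mirroring the question's 'CENSORED'): pass the host to connect() when re-establishing the session.

import sys, time, ftplib

def connect_ftp(ftp, host='CENSORED'):
    try:
        ftp.voidcmd("NOOP")          # connection still alive, nothing to do
        return
    except ftplib.all_errors:
        pass
    starttime = time.time()
    while True:
        try:
            ftp.connect(host)        # hostname is required here
            ftp.login('LOGIN', 'CENSORED')
            return
        except IOError:
            sys.stdout.write("\rTime disconnected - %s" % (time.time() - starttime))
            sys.stdout.flush()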

urlopen error 10048, 'Address already in use' while downloading in Python 2.5 on Windows

I'm writing code that will run on Linux, OS X, and Windows. It downloads a list of approximately 55,000 files from the server, then steps through the list of files, checking if the files are present locally. (With SHA hash verification and a few other goodies.) If the files aren't present locally or the hash doesn't match, it downloads them.
The server-side is plain-vanilla Apache 2 on Ubuntu over port 80.
The client side works perfectly on Mac and Linux, but gives me this error on Windows (XP and Vista) after downloading a number of files:
urllib2.URLError: <urlopen error <10048, 'Address already in use'>>
This link: http://bytes.com/topic/python/answers/530949-client-side-tcp-socket-receiving-address-already-use-upon-connect points me to TCP port exhaustion, but "netstat -n" never showed me more than six connections in "TIME_WAIT" status, even just before it errored out.
The code (called once for each of the 55,000 files it downloads) is this:
request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()
datastream = opener.open(request)
outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
    datastream.close()
UPDATE: I find by grepping the log that it enters the download routine exactly 3998 times. I've run this multiple times and it fails at 3998 each time. Given that the linked article states that the available ports are 5000-1025=3975 (and some are probably expiring and being reused), it's starting to look a lot more like that article describes the real issue. However, I'm still not sure how to fix this. Making registry edits is not an option.
If it really is a resource problem (the OS not freeing socket resources in time), try this:
request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()
datastream = None
retry = 3  # 3 tries
while retry:
    try:
        datastream = opener.open(request)
    except urllib2.URLError, ue:
        if ue.reason.find('10048') > -1:
            if retry:
                retry -= 1
            else:
                raise urllib2.URLError("Address already in use / retries exhausted")
        else:
            retry = 0
    if datastream:
        retry = 0

outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
    datastream.close()
If you want, you can insert a sleep, or make it OS-dependent.
On my Win XP machine the problem doesn't show up (I reached 5000 downloads).
I watch my processes and network with Process Hacker.
Thinking outside the box, the problem you seem to be trying to solve has already been solved by a program called rsync. You might look for a Windows implementation and see if it meets your needs.
You should seriously consider copying and modifying this pyCurl example for efficient downloading of a large collection of files.
Instead of opening a new TCP connection for each request you should really use persistent HTTP connections - have a look at urlgrabber (or alternatively, just at keepalive.py for how to add keep-alive connection support to urllib2).
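A rough Python 2 sketch of that persistent-connection idea using the standard library's httplib directly (urlgrabber and keepalive.py provide the equivalent for urllib2); the host name and file_list variable are placeholders, not part of the original code:

import httplib

conn = httplib.HTTPConnection("server.example.com")
for file_remote_path, temp_file_path in file_list:   # assumed list of (url path, local path)
    conn.request("GET", file_remote_path)
    response = conn.getresponse()
    data = response.read()    # read the full body before reusing the connection
    outfile = open(temp_file_path, 'wb')
    try:
        outfile.write(data)
    finally:
        outfile.close()
conn.close()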
All indications point to a lack of available sockets. Are you sure that only 6 are in TIME_WAIT status? If you're running so many download operations it's very likely that netstat overruns your terminal buffer. I find that netstat overruns my terminal even during normal usage periods.
The solution is to either modify the code to reuse sockets, or to introduce a timeout. It also wouldn't hurt to keep track of how many open sockets you have, to optimize the waiting. The default timeout on Windows XP is 120 seconds, so you want to sleep for at least that long if you run out of sockets. Unfortunately it doesn't look like there's an easy way to check from Python when a socket has closed and left the TIME_WAIT status.
Given the asynchronous nature of the requests and timeouts, the best way to do this might be in a thread. Make each thread sleep for 2 minutes before it finishes. You can either use a Semaphore or limit the number of active threads to ensure that you don't run out of sockets.
Here's how I'd handle it. You might want to add an exception clause to the inner try block of the fetch section, to warn you about failed fetches.
import time
import threading
import Queue
import urllib2

# assumes url_queue is a Queue object populated with tuples in the form of (url_to_fetch, temp_file)

class urlfetcher(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        try:  # needed to handle the Empty exception raised by an empty queue.
            file_remote_path, temp_file_path = self.queue.get(False)
        except Queue.Empty:
            return
        request = urllib2.Request(file_remote_path)
        opener = urllib2.build_opener()
        datastream = opener.open(request)
        outfileobj = open(temp_file_path, 'wb')
        try:
            while True:
                chunk = datastream.read(CHUNK_SIZE)
                if chunk == '':
                    break
                else:
                    outfileobj.write(chunk)
        finally:
            outfileobj = outfileobj.close()
            datastream.close()
            time.sleep(120)  # wait out the TIME_WAIT period before finishing
            self.queue.task_done()

# elsewhere:
while not url_queue.empty():
    if threading.activeCount() < 3975:  # hard limit of available ports
        t = urlfetcher(url_queue)
        t.start()
    else:
        time.sleep(2)
url_queue.join()
Sorry, my python is a little rusty, so I wouldn't be surprised if I missed something.
