How to time out gracefully while downloading with Python

I'm downloading a huge set of files with the following code in a loop:
try:
    urllib.urlretrieve(url2download, destination_on_local_filesystem)
except KeyboardInterrupt:
    break
except:
    print "Timed-out or got some other exception: " + url2download
If the server times out on URL url2download just as the connection is being initiated, the last exception is handled properly. But sometimes the server responds and the download starts, yet the server is so slow that it takes hours for even one file, and eventually it returns something like:
Enter username for Clients Only at albrightandomalley.com:
Enter password for in Clients Only at albrightandomalley.com:
and just hangs there (although no username/password is asked for if the same link is downloaded through a browser).
My intention in this situation would be to skip this file and go on to the next one. The question is: how do I do that? Is there a way in Python to specify how long it is acceptable to spend downloading one file, and if more time has already been spent, to interrupt the download and move on?

Try:
import socket
socket.setdefaulttimeout(30)
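
A minimal sketch of how this could plug into the loop from the question (the 30-second value, the downloads list, and the caught exception types are assumptions; once a default socket timeout is set, a stalled transfer surfaces as an exception from urlretrieve instead of hanging):
import socket
import urllib

socket.setdefaulttimeout(30)  # applies to every socket urllib opens

for url2download, destination_on_local_filesystem in downloads:  # hypothetical list of (url, path) pairs
    try:
        urllib.urlretrieve(url2download, destination_on_local_filesystem)
    except KeyboardInterrupt:
        break
    except (socket.timeout, IOError):
        # a stalled connection now raises here instead of blocking forever
        print("Timed out or got some other exception: " + url2download)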

If you're not limited to what's shipped with python out of the box, then the urlgrabber module might come in handy:
import urlgrabber
urlgrabber.urlgrab(url2download, destination_on_local_filesystem,
                   timeout=30.0)

There's a discussion of this here. Caveats (in addition to the ones they mention): I haven't tried it, and they're using urllib2, not urllib (would that be a problem for you?). Actually, now that I think about it, this technique would probably work for urllib too.
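
For reference, a rough sketch of the urllib2 variant being discussed (the timeout parameter requires Python 2.6+, and it bounds each blocking socket operation rather than the whole transfer; the chunk size and file handling here are assumptions):
import urllib2

response = urllib2.urlopen(url2download, timeout=30)  # timeout in seconds
with open(destination_on_local_filesystem, 'wb') as out:
    while True:
        chunk = response.read(64 * 1024)  # read in chunks so a slow server can't stall one huge read
        if not chunk:
            break
        out.write(chunk)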

This is a more general question about timing out a function:
How to limit execution time of a function call in Python
I've used the method described in my answer there to write a wait-for-text function that times out while attempting an auto-login. If you'd like similar functionality, you can reference the code here:
http://code.google.com/p/psftplib/source/browse/trunk/psftplib.py
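
The method referenced there is signal-based; a minimal sketch of the idea applied to the download problem follows (SIGALRM is Unix-only, and the function and exception names here are assumptions, not the exact code from the linked answer):
import signal
import urllib

class DownloadTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise DownloadTimeout()

def retrieve_with_timeout(url, destination, seconds=600):
    # interrupt urlretrieve if it runs longer than `seconds` (Unix only)
    old_handler = signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(seconds)
    try:
        urllib.urlretrieve(url, destination)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)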

Related

Python SimpleHTTPServer keeps going down and I don't know why

This is my first time working with SimpleHTTPServer, and honestly my first time working with web servers in general, and I'm having a frustrating problem. I'll start up my server (via SSH) and then I'll go try to access it and everything will be fine. But I'll come back a few hours later and the server won't be running anymore. And by that point the SSH session has disconnected, so I can't see if there were any error messages. (Yes, I know I should use something like screen to save the shell messages -- trying that right now, but I need to wait for it to go down again.)
I thought it might just be that my code was throwing an exception, since I had no error handling, but I added what should be a pretty catch-all try/catch block, and I'm still experiencing the issue. (I feel like this is probably not the best method of error handling, but I'm new at this... so let me know if there's a better way to do this)
class MyRequestHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
    # (this is the only function my request handler has)
    def do_GET(self):
        if 'search=' in self.path:
            try:
                # (my code that does stuff)
            except Exception as e:
                # (log the error to a file)
                return
        else:
            SimpleHTTPServer.SimpleHTTPRequestHandler.do_GET(self)
Does anyone have any advice for things to check, or ways to diagnose the issue? Most likely, I guess, is that my code is just crashing somewhere else... but if there's anything in particular I should know about the way SimpleHTTPServer operates, let me know.
I've never had SimpleHTTPServer running for an extended period of time; usually I just use it to transfer a couple of files in an ad-hoc manner. But I guess it wouldn't be so bad, as long as your security constraints are handled elsewhere (i.e. a firewall) and you don't need much scale.
The SSH session is ending, which is killing your tasks (both foreground and background tasks). There are two solutions to this:
Like you've already mentioned, use a utility such as screen to prevent your session from ending.
If you really want this to run for an extended period of time, look into your operating system's documentation on how to start/stop/enable services (nowadays most of the cool kids are using systemd, but you might also find yourself using SysVinit or some other init system).
EDIT:
This link is in the comments, but I thought I should put it here as it answers this question pretty well
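
If it turns out the process really is crashing rather than being killed along with the SSH session, one way to capture the reason is to log any exception that escapes the serving loop; a minimal sketch (the port number and log file name are assumptions):
import logging
import SimpleHTTPServer
import SocketServer

logging.basicConfig(filename='server.log', level=logging.INFO)

httpd = SocketServer.TCPServer(('', 8000), SimpleHTTPServer.SimpleHTTPRequestHandler)
try:
    httpd.serve_forever()
except Exception:
    # anything that escapes the handler and kills the server ends up in the log
    logging.exception('server crashed')
    raise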

Download webpage source up to a keyword

I'm looking to download the source code of a website only up to a particular keyword (the websites are all from a forum, so I'm only interested in the source code for the first post's user details), so I only need to download the source code until I find "<!-- message, attachments, sig -->" for the first time.
How to get webpage title without downloading all the page source
This question, although in a different language, is quite similar to what I'm looking to do, but I'm not that experienced with Python, so I can't figure out how to recode that answer into Python.
First, be aware that you may have already gotten all or most of each page into your OS buffers, NIC, router, or ISP before you cancel, so there may be no benefit at all to doing this. And there will be a cost—you can't reuse connections if you close them early; you have to recv smaller pieces at a time if you want to be able to cancel early; etc.
If you have a rough idea of how many bytes you probably need to read (better to often go a little bit over than to sometimes go a little bit under), and the server handles HTTP range requests, you may want to try that instead of requesting the entire file and then closing the socket early.
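
As a rough sketch of the range-request alternative with urllib2 (the URL and byte count are assumptions; a server that honors the header answers with status 206, Partial Content):
import urllib2

request = urllib2.Request('http://example.com/forum/thread123')  # hypothetical URL
request.add_header('Range', 'bytes=0-16383')  # ask for roughly the first 16 KB
response = urllib2.urlopen(request)
partial_html = response.read()
print(response.getcode())  # 206 if the server honored the range, 200 if it ignored it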
But, if you want to know how to close the socket early:
urllib2.urlopen, requests, and most other high-level libraries are designed around the idea that you're going to want to read the whole file. They buffer the data as it comes in, to give you a high-level file-like interface. On top of that, their API is blocking. Neither of those is what you want. You want to get the bytes as they come in, as fast as possible, and when you close the socket, you want that to happen as soon after the recv as possible.
So, you may want to consider using one of the Python wrappers around libcurl, which gives you a pretty good balance between power/flexibility and ease-of-use. For example, with pycurl:
import pycurl

buf = ''

def callback(newbuf):
    global buf
    buf += newbuf
    if '<div style="float: right; margin-left: 8px;">' in buf:
        return 0
    return len(newbuf)

c = pycurl.Curl()
c.setopt(c.URL, 'http://curl.haxx.se/dev/')
c.setopt(c.WRITEFUNCTION, callback)
try:
    c.perform()
except Exception as e:
    print(e)
c.close()
print len(buf)
As it turns out, this ends up reading 12259/12259 bytes on that test. But if I change it to a string that comes in the first 2650 bytes, I only read 2650/12259 bytes. And if I fire up Wireshark and instrument recv, I can see that, although the next packet did arrive at my NIC, I never actually read it; I closed the socket immediately after receiving 2650 bytes. So, that might save some time… although probably not too much. More importantly, though, if I throw it at a 13MB image file and try to stop after 1MB, I only receive a few KB extra, and most of the image hasn't even made it to my router yet (although it may have all left the server, if you care at all about being nice to the server), so that definitely will save some time.
Of course a typical forum page is a lot closer to 12KB than to 13MB. (This page, for example, was well under 48KB even after all my rambling.) But maybe you're dealing with atypical forums.
If the pages are really big, you may want to change the code to only check buf[-len(needle):] + newbuf instead of the whole buffer each time. Even with a 13MB image, searching the whole thing over and over again didn't add much to the total runtime, but it did raise my CPU usage from 1% to 9%…
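
A rough sketch of that optimization applied to the callback above, where needle stands for the marker string being searched for:
needle = '<!-- message, attachments, sig -->'
buf = ''

def callback(newbuf):
    global buf
    # only search the seam between the old tail and the new chunk,
    # instead of rescanning the whole buffer on every write
    window = buf[-len(needle):] + newbuf
    buf += newbuf
    if needle in window:
        return 0  # returning 0 from WRITEFUNCTION tells libcurl to abort the transfer
    return len(newbuf)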
One last thing: If you're reading from, say, 500 pages, doing them concurrently—say, 8 at a time—is probably going to save you a lot more time than just canceling each one early. Both together might well be better than either on its own, so that's not an argument against doing this—it's just a suggestion to do that as well. (See the receiver-multi.py sample if you want to let curl handle the concurrency for you… or just use multiprocessing or concurrent.futures to use a pool of child processes.)
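
A rough sketch of the pooled approach with multiprocessing (fetch_until_marker is a hypothetical wrapper around the pycurl download shown above, and the URL list is a placeholder):
import multiprocessing

def fetch_until_marker(url):
    # hypothetical wrapper around the pycurl download shown above,
    # returning the page source up to the marker
    pass

if __name__ == '__main__':
    urls = ['http://example.com/forum/thread%d' % i for i in range(500)]  # placeholder URLs
    pool = multiprocessing.Pool(processes=8)  # download 8 pages at a time
    results = pool.map(fetch_until_marker, urls)
    pool.close()
    pool.join()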

Various timeouts for python httplib

I'm implementing a little service that fetches web pages from various servers. I need to be able to configure different types of timeouts. I've tried mucking around with the settimeout method of sockets but it's not exactly as I'd like it. Here are the problems.
I need to specify a timeout for the initial DNS lookup. I understand this is done when I instantiate the HTTPConnection at the beginning.
My code is written in such a way that I first .read a chunk of data (around 10 MB), and if the entire payload fits in this, I move on to other parts of the code. If it doesn't fit, I stream the payload directly out to a file rather than into memory. When this happens, I do an unbounded .read() to get the data, and if the remote side sends me, say, a byte of data every second, the connection just keeps waiting, receiving one byte every second. I want to be able to disconnect with a "you're taking too long". A thread-based solution would be the last resort.
httplib is too straightforward for what you are looking for.
I would recommend taking a look at http://pycurl.sourceforge.net/ and the http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTTIMEOUT option.
The http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPT_NOSIGNAL option also sounds interesting:
Consider building libcurl with c-ares support to enable asynchronous DNS lookups, which enables nice timeouts for name resolves without signals.
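
A rough sketch of combining those options with pycurl (the URL, file name, and specific timeout values are assumptions; CONNECTTIMEOUT roughly covers the DNS/connect phase, while TIMEOUT bounds the entire transfer):
import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/big-file')  # hypothetical URL
c.setopt(pycurl.CONNECTTIMEOUT, 10)  # seconds for DNS lookup + TCP connect
c.setopt(pycurl.TIMEOUT, 300)        # seconds allowed for the entire transfer
c.setopt(pycurl.NOSIGNAL, 1)         # avoid signal-based timeouts (safer with threads)
with open('payload.out', 'wb') as f:
    c.setopt(pycurl.WRITEDATA, f)
    c.perform()
c.close()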
Have you tried requests?
You can set timeouts conveniently: http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
>>> requests.get('http://github.com', timeout=0.001)
EDIT:
I missed part 2 of the question. For that you could use this:
import sys
import signal
import requests

class TimeoutException(Exception):
    pass

def get_timeout(url, dns_timeout=10, load_timeout=60):
    def timeout_handler(signum, frame):
        raise TimeoutException()
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(load_timeout)  # trigger the alarm after load_timeout seconds
    try:
        response = requests.get(url, timeout=dns_timeout)
    except TimeoutException:
        return "you're taking too long"
    return response
and in your code use the get_timeout function.
If you need the timeout to be available for other functions, you could create a decorator (see the sketch below).
Above code from http://pguides.net/python-tutorial/python-timeout-a-function/.
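
A minimal sketch of such a decorator, reusing the SIGALRM idea above (Unix-only; the names and default values are assumptions):
import functools
import signal
import requests

class TimeoutException(Exception):
    pass

def timeout(seconds):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutException()
            old_handler = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel any pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator

@timeout(60)  # bound the whole call, while requests' own timeout covers connect/DNS
def fetch(url):
    return requests.get(url, timeout=10)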

Program does not exit. How to find out what python is doing?

I have a Python script which is working fine so far. However, my program does not exit properly. I can step through it with the debugger to the end and return, but the program keeps running.
main.main() does a lot of stuff: it downloads (HTTP, FTP, SFTP, ...) some CSV files from a data provider, converts the data into a standardized file format, and loads everything into the database.
This works fine. However, the program does not exit. How can I find out where the program is "waiting"?
There is more than one provider - the script terminates correctly for all providers except for one (an SFTP download; I'm using paramiko).
if __name__ == "__main__":
    main.log = main.log2both
    filestoconvert = []
    #filestoconvert = glob.glob(r'C:\Data\Feed\ProviderName\download\*.csv')
    main.main(['ProviderName'], ['download', 'convert', 'load'], filestoconvert)
I'm happy for any thoughts and ideas!
If your program does not terminate, it most likely means you still have a thread working.
To list all the running threads you can use:
threading.enumerate()
This function lists all threads that are currently running (see documentation).
If this is not enough, you might need a bit of scripting along with that function (see documentation):
sys._current_frames()
So to print the stack trace of all alive threads, you would do something like:
import sys, traceback, threading

thread_names = {t.ident: t.name for t in threading.enumerate()}
for thread_id, frame in sys._current_frames().iteritems():
    print("Thread %s:" % thread_names.get(thread_id, thread_id))
    traceback.print_stack(frame)
    print()
Good luck!
You can invoke the Python debugger for a script.py with
python -m pdb script.py
You can find the pdb commands at http://docs.python.org/library/pdb.html#debugger-commands
You'd better use GDB, which lets you pinpoint hung processes, like jstack in Java.
This question is 10 years old, but I'm posting my solution for anyone with a similar issue of a non-finishing Python script like mine.
In my case, the debugging process didn't help. All debugging output showed only one thread. But the suggestion by @JC Plessis that some work must still be going on helped me find the cause.
I was using Selenium with the Chrome driver, and I was finishing the Selenium process after closing the only tab that was open with
driver.close()
But later, I changed the code to use a headless browser, and the Selenium driver wasn't closed after driver.close(); the Python script was stuck indefinitely. It turns out that the right way to shut down the Selenium driver was actually:
driver.quit()
That solved the problem, and the script finally finished again.
You can use sys.settrace to pinpoint which function blocks. Then you can use pdb to step through it.
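
A minimal sketch of the sys.settrace idea, printing every function call so the last line of output before the hang shows roughly where execution is stuck:
import sys

def tracer(frame, event, arg):
    if event == 'call':
        code = frame.f_code
        print("calling %s (%s:%d)" % (code.co_name, code.co_filename, frame.f_lineno))
    return tracer

sys.settrace(tracer)
# then run the code that hangs, e.g.:
# main.main(['ProviderName'], ['download', 'convert', 'load'], [])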

Using httplib2 in python 3 properly? (Timeout problems)

Hey, first-time post; I'm really stuck on httplib2. I've been reading up on it from diveintopython3.org, but it mentions nothing about a timeout function. I looked up the documentation, but the only thing I see is the ability to pass a timeout int, with no units specified (seconds? milliseconds? What's the default if None?). This is what I have (I also have code to check what the response is and try again, but it's never tried more than once):
h = httplib2.Http('.cache', timeout=None)
for url in list:
    response, content = h.request(url)
    # more stuff...
So the Http object stays around until some arbitrary time, but I'm downloading a ton of pages from the same server, and after a while it hangs on getting a page. No errors are thrown; the thing just hangs on a page. So then I try:
h = httplib2.Http('.cache', timeout=None)
for url in list:
    try:
        response, content = h.request(url)
    except:
        h = httplib2.Http('.cache', timeout=None)
    # more stuff...
But then it recreates another Http object every time (it goes down the 'except' path)... I don't understand how to keep fetching with the same object until it expires, and only then make another one. Also, is there a way to set a timeout on an individual request?
Thanks for the help!
Due to a bug, httplib2 measured the timeout in seconds multiplied by 2 until version 0.7.5 (2012-08-28).
Set the timeout to 1, and you'll know pretty quickly whether it means one millisecond or one second.
I don't know what your try/except is supposed to solve; if it hangs on h.request(url) in one case, it should hang in the other.
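
For what it's worth, a rough sketch of giving the Http object a finite timeout (assuming it is interpreted in seconds) and catching the resulting error so the same object keeps being reused; socket.timeout is what typically surfaces when a read times out, though the exact exception can vary (urls stands in for the question's list variable):
import socket
import httplib2

h = httplib2.Http('.cache', timeout=30)  # one timeout per Http object, not per request
for url in urls:
    try:
        response, content = h.request(url)
    except socket.timeout:
        print('timed out on %s, skipping' % url)
        continue
    # more stuff...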
If you run out of memory in that code, then httplib2 isn't being garbage collected properly. It may be that you have circular references (although it doesn't look like it above), or it may be a bug in httplib2.
