Scrapy freeze on connection timeout - python

I wrote a Scrapy crawler that runs over a pretty unreliable Internet connection. This is a given that I cannot easily or cheaply change: once in a while the connection is lost, and it is restored after a few seconds or so.
In this setup, a Scrapy 18.4 crawler freezes indefinitely without printing any error messages. It also stops reacting to Ctrl+C, which makes me think this happens somewhere pretty deep in the reactor stack, though I cannot be sure.
The complete absence of error messages makes this rather hopeless to debug.
Question: Would anyone have any clues as to how to debug this problem? I don't really have any meaningful logs to attach for the reasons laid out above.

You can set up a timeout class (used as a context manager) and then run your code inside a try/except block. Something like this:
import signal
class timeout:
    def __init__(self, seconds=1, error_message='Timeout'):
        self.seconds = seconds
        self.error_message = error_message
    def handle_timeout(self, signum, frame):
        raise TimeoutError(self.error_message)
    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.alarm(self.seconds)
    def __exit__(self, type, value, traceback):
        signal.alarm(0)
def run_crawl():
    while True:
        print("This runs")
try:
    with timeout(seconds=3):
        run_crawl()
except Exception as e:
    print(e)
Note: Since this uses signal it will only work on UNIX-based systems.
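Where SIGALRM is unavailable, or where the hang also blocks signal handling inside the process (as the unresponsive Ctrl+C in the question suggests it might), one possible alternative is to enforce the limit from a separate process. The following is a minimal sketch, not part of the original answer: run_with_timeout is a hypothetical helper, and the spider name in the usage comment is an assumption.

```python
import subprocess

def run_with_timeout(cmd, seconds):
    """Run cmd as a child process.

    Returns the child's exit code, or None if it was killed for
    running longer than `seconds`.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return None

# Hypothetical usage: "myspider" is an assumed spider name.
# exit_code = run_with_timeout(["scrapy", "crawl", "myspider"], 30 * 60)
```

Because the timer lives in the parent process, it does not depend on the crawler process still being able to run Python-level signal handlers.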
If you want it to go back to running the crawler (auto-restart), then you can just put it all in an infinite loop.
while True:
    print("Restarting spider")
    try:
        with timeout(seconds=3):
            run_crawl()
    except Exception as e:
        print(e)
This all assumes that you can keep restarting the bot after x seconds (the timeout) without major negative results. For example, if you are constantly scraping the same page(s) over and over again, this works pretty seamlessly.
However, if you are scraping a very long list once and just want it to finish without an error, it works less well. It can still be used by setting x to a value greater than the duration of the entire process when it executes successfully. This holds no matter how long the program runs: don't set x to 3 seconds if you are scraping one site that takes 7 seconds to complete, and likewise, if you are scraping 5 sites that take 30 seconds or 500 that take 5 minutes, x must be greater than that duration.
The reason I distinguish between a bot with a quick completion time and one without is the cost of a failure. If a run with a 30-second timeout fails, you lose 15 seconds on average, but with a 30-minute execution time you lose 15 minutes on average. If your Internet connection goes out every 15 minutes on average, the vast majority of runs will fail, and you will need to debug and actually solve the problem instead of working around it. This "solution" should definitely be considered a workaround.
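To move from the workaround toward actually debugging the freeze, one option is the standard library's faulthandler module, which can write every thread's stack to a file on a timer from a watchdog thread, so it should keep working even when the process no longer reacts to Ctrl+C. A minimal sketch, assuming Python 3; the helper names and the hang.log file name are illustrative and not part of Scrapy:

```python
import faulthandler

def enable_hang_dumps(log_file, interval=30):
    """Every `interval` seconds, write all thread stacks to log_file.

    log_file must be a real open file (it needs a file descriptor) and
    must stay open for as long as the dumps are enabled.
    """
    faulthandler.dump_traceback_later(interval, repeat=True, file=log_file)

def disable_hang_dumps():
    """Stop the periodic dumps started by enable_hang_dumps()."""
    faulthandler.cancel_dump_traceback_later()

# Hypothetical usage at the top of the script that starts the crawl:
# log = open("hang.log", "w")
# enable_hang_dumps(log, interval=30)
```

Once the crawler freezes, the last dumps in the file show which call each thread is stuck in, which gives a starting point where the question had no logs at all.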

Related

Sleep in case of error, python

I have a situation where I need to use an Internet connection for 12 hours straight and make calls to an API, but the power goes out about every 10 minutes. Is it possible to write a try/except that waits 10 minutes whenever a timeout error is raised? Hopefully the electricity will be back within 10 minutes.
This is what I am currently using:
try:
    a=translator.translate(str(x1),dest='hi')
    b=translator.translate(str(x2),dest='hi')
except:
    sleep(60*10)
You can use the retry module for this kind of retrying on exception. It makes the code look much cleaner. pip install retry should install the module.
from retry import retry
@retry(Exception, delay=10*60, tries=-1)
def my_code_that_needs_to_be_retried_for_ever():
    a=translator.translate(str(x1),dest='hi')
    b=translator.translate(str(x2),dest='hi')
# Call the function
my_code_that_needs_to_be_retried_for_ever()
With the above code, when my_code_that_needs_to_be_retried_for_ever is invoked, it is retried every 10*60 seconds (10 minutes) forever (as tries is set to -1), every time the code inside the function block raises an Exception.
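For readers who prefer not to add a dependency, the behaviour described above can be approximated in plain Python. This is a simplified sketch, not the retry package's actual implementation; retry_forever and its sleep parameter are hypothetical names, the latter added only so the delay can be replaced in tests:

```python
import functools
import time

def retry_forever(exceptions=Exception, delay=10 * 60, sleep=time.sleep):
    """Decorator: call the function again after `delay` seconds
    whenever it raises one of `exceptions`, until it succeeds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            while True:
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Wait, then loop around and call func again.
                    sleep(delay)
        return wrapper
    return decorator
```

Passing a narrower exception type than Exception keeps programming errors (a NameError, for instance) from being retried forever.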
Use try and except to catch the exception and then time.sleep to make your Python script sleep for the desired amount of time. You can then put everything inside an endless while loop and break out of it once everything has finished.
import time
while True:
    try:
        # put everything here which might produce exception
        pass
        # if this point is reached everything worked fine, so exit loop
        break
    except:
        time.sleep(10*60)
You can run the following example to see the general idea:
import random
import time
print("Before loop")
while True:
    try:
        print("Try to execute commands")
        # your commands here
        if random.random() > 0.3:
            print("Randomly simulate timeout")
            raise Exception("Timeout")
        print("Everything done")
        break
    except:
        print("Timeout: sleep for 2 seconds and try again")
        time.sleep(2)
print("After loop")
Instead of real commands, we randomly decide to raise an exception to simulate the timeout. The result might look something like this:
Before loop
Try to execute commands
Randomly simulate timeout
Timeout: sleep for 2 seconds and try again
Try to execute commands
Randomly simulate timeout
Timeout: sleep for 2 seconds and try again
Try to execute commands
Randomly simulate timeout
Timeout: sleep for 2 seconds and try again
Try to execute commands
Everything done
After loop
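One caveat with the loops above: a bare except also catches KeyboardInterrupt, so Ctrl+C merely triggers another sleep instead of stopping the script. A possible variant, sketched under the assumption that the failures surface as ConnectionError or TimeoutError (the exceptions your API client actually raises may differ), catches only those and gives up after a fixed number of attempts. call_with_backoff and its parameters are hypothetical:

```python
import time

def call_with_backoff(func, retry_on=(ConnectionError, TimeoutError),
                      first_delay=60, max_delay=10 * 60, max_tries=5,
                      sleep=time.sleep):
    """Call func(); on a retryable error wait and try again.

    The wait doubles after each failure, capped at max_delay. After
    max_tries failed attempts the last exception is re-raised.
    """
    delay = first_delay
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_tries:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Starting with a short wait and growing it means a brief outage costs about a minute, while a long one still settles at the 10-minute interval the question asks for.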

How do I loop through keyboard interrupt whenever the user presses ctrl+c multiple times in a row

I am using python3 on Android 6 (Termux Linux emulator) and Python works great. I want to require a login every time Termux starts up by adding this program to ~/.bashrc. The problem is what happens when I ask the user for the password and they press Ctrl-C: the program should detect that they tried to bypass security and force them to wait 5 minutes before trying again. So my question is: how can I keep catching KeyboardInterrupt so that, if they press Ctrl-C again while waiting out the 5 minutes, the wait becomes 10 minutes, then 15 minutes (up by 5 each time), and so on, so they can't bypass the waiting time?
I agree with the user 9000, but if you really want to do this, my solution would be the following:
import time
password_check_passed = False
while not password_check_passed:
    try:
        password_check_passed = check_password()
    except KeyboardInterrupt:
        print("You are required to authenticate yourself. You've received a cooldown of 5 minutes.")
        end_of_punishment = time.time() + 5*60 # now + 5 minutes
        while time.time() < end_of_punishment:
            try:
                # attempt to sleep until the end of the punitive timeout
                # (max() guards against a negative value if the deadline just passed)
                time.sleep(max(0, end_of_punishment - time.time()))
            except KeyboardInterrupt:
                print("Still not getting it? 5 more minutes!")
                end_of_punishment += 5*60
The check_password function would return a boolean indicating that it succeeded and print out any instructions for authentication.
While there is technically a possibility to send two interrupts fast enough for the second one to occur before the inner try clause, it is very small and I wouldn't worry about it.
What you should be more worried about is that this makes termination of your program much more difficult.
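The small race mentioned above can be sidestepped by installing a SIGINT handler that only counts interrupts instead of letting Python raise KeyboardInterrupt. The following is a rough sketch of that idea, not the answer's method; InterruptCounter, wait_out_penalty and the injectable sleep and clock parameters are hypothetical names:

```python
import signal
import time

class InterruptCounter:
    """SIGINT handler that counts Ctrl+C presses instead of raising."""
    def __init__(self):
        self.count = 0

    def __call__(self, signum, frame):
        self.count += 1

def wait_out_penalty(counter, step=5 * 60, sleep=time.sleep,
                     clock=time.monotonic):
    """Block until step * counter.count seconds have passed.

    counter.count is re-read on every pass, so each Ctrl+C during the
    wait adds another `step` seconds. Returns the total seconds waited.
    """
    start = clock()
    while True:
        remaining = start + step * counter.count - clock()
        if remaining <= 0:
            return clock() - start
        sleep(min(remaining, 1.0))  # short naps so new presses are noticed

# Hypothetical usage around the password prompt:
# counter = InterruptCounter()
# signal.signal(signal.SIGINT, counter)  # Ctrl+C now only increments count
```

With this handler installed, the program would check counter.count after each password attempt and call wait_out_penalty when it is non-zero. How the prompt itself behaves when interrupted may vary by platform, so treat that part as something to verify on Termux.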

Python Multithreaded Messenger Simulation. Stuck on timerThread update. What do?

I have a piece of code that simulates a system of messengers (think post office or courier service) delivering letters in a multithreaded way. I want to add a way to manage my messengers "in the field" to increase the efficiency of my system.
tl;dr: How do I update my tens-to-hundreds of timerthreads so they wait longer before calling their function?
Here's what the code I've written so far is supposed to do, in steps:
1. Someone asks for a letter.
2. We check to see if there are any available messengers. If none, we say "oops, sorry. can't help you with that".
3. If at least one is available, we send the messenger to deliver the letter (new timer thread with its wait param as the time it takes to get there and back).
4. When the messenger gets back, we put him in the back of the line of available messengers to wait for the next delivery.
I do this by removing Messenger objects from a double ended queue, and then adding them back in after a timerthread is done waiting. This is because my Messengers are all unique and eventually I want to track how many deliveries each has had, how far they have traveled, and other stuff.
Here's a pseudoish-codesnippet of the larger program I wrote for this
import threading
from collections import deque
import numpy
numMessengers=5
messengerDeque=deque()
pOrder=0.0001
class Messenger:
    def __init__(self):
        pass
for i in range(numMessengers):
    messenger=Messenger()
    messengerDeque.append(messenger)
def popDeque():
    messenger=messengerDeque.popleft()
    print('messenger #?, sent')
    return messenger
def appendDeque(messenger):
    print('messenger #?, returned')
    messengerDeque.append(messenger)
def randomDelivery():
    if numpy.random.randint(0,10000)<=(pOrder*10000):
        if len(messengerDeque)!=0:
            messenger=popDeque()
            # distance and speed are defined elsewhere in the larger program
            tripTime=distance/speed*120
            t=threading.Timer(tripTime,appendDeque,args=[messenger])
            t.start()
        else:
            print("oops, sorry. can't help you with that")
The above works in my program.
What I would like to add is some way to 'reroute' my messengers with new orders.
Let's say you have to deliver a letter within an hour of when you get it. You have five messengers and five orders, so they're all busy. You then get a sixth order.
Messenger 2 will be back in 20 minutes, and order six will take 30 minutes to get to the delivery destination. So instead of saying "oops, we can't help you", we would say: OK, Messenger 2, when you get back, immediately go deliver letter six.
With the code I've written, I think this could be done by checking the active threads to see how long until they call their functions, pick the first one you see where that time + how long your new delivery takes is < 1 hr, cancel it, and start a new thread with the time left plus the new time to wait.
I just don't know how to do that.
How do you check how long is left in a timerthread and update it without making a huge mess of your threads?
I'm also open to other, smarter ways of doing what I described.
YAY PYTHON MULTITHREADING!!!!!
Thanks for the help
Using the class threading.Timer won't fulfill your needs. Although Timer instances have an "interval" member, once the Timer (thread) has started running, any changes to interval (the time-out) are not considered.
Furthermore, you need to know how much time is left before the timer is triggered, for which there isn't a method as far as I know.
You probably also need a way to identify which Timer instance to update with the new timeout value, but this is up to you.
You should implement your own Timer class, perhaps something along the lines of:
import threading
import time
class MyTimer(threading.Thread):
    def __init__(self, timeout, event):
        super(MyTimer, self).__init__()
        self.to = timeout
        self.evt = event
    def setTimeout(self, v):
        # move the deadline to v seconds from now
        self.end = time.time() + v
    def run(self):
        # named "began" so it does not shadow Thread.start()
        self.began = time.time()
        self.end = time.time() + self.to
        while self.end > time.time():
            time.sleep(0) # instead of thread.yield
        self.evt()
    def getRemaining(self):
        return self.end - time.time()
def hi():
    print("hi")
T=MyTimer(20,hi)
T.start()
for i in range(10):
    time.sleep(1)
    # is_alive() returns True while the thread is running
    print(T.getRemaining(), T.is_alive())
T.setTimeout(1)
for i in range(3):
    time.sleep(1)
    print(T.getRemaining(), T.is_alive())
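The loop in run above spins with time.sleep(0), which keeps the CPU busy for every pending timer. With tens to hundreds of messengers, one possible refinement is to block on a threading.Condition and wake only when the deadline changes. A sketch in Python 3 with hypothetical names (ReschedulableTimer, extend, remaining); note that the deadline here is counted from construction rather than from start():

```python
import threading
import time

class ReschedulableTimer(threading.Thread):
    """Calls callback once the deadline passes; the deadline can be moved."""
    def __init__(self, timeout, callback):
        super().__init__(daemon=True)
        self._callback = callback
        self._cond = threading.Condition()
        self._deadline = time.monotonic() + timeout
        self._cancelled = False

    def remaining(self):
        """Seconds left until the callback fires (never negative)."""
        with self._cond:
            return max(0.0, self._deadline - time.monotonic())

    def extend(self, extra):
        """Push the deadline back by `extra` seconds."""
        with self._cond:
            self._deadline += extra
            self._cond.notify()  # wake run() so it re-reads the deadline

    def cancel(self):
        with self._cond:
            self._cancelled = True
            self._cond.notify()

    def run(self):
        with self._cond:
            while not self._cancelled:
                left = self._deadline - time.monotonic()
                if left <= 0:
                    break
                self._cond.wait(left)  # sleeps without spinning
            if self._cancelled:
                return
        self._callback()
```

For the rerouting case in the question, the dispatcher could call remaining() on each busy messenger's timer, pick one whose remaining time plus the new trip fits within the hour, and call extend() with the new trip time.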

Python Threaded Timer Returning Random Errors

I have a python thread that runs every 20 seconds. The code is:
import threading
def work():
    try:
        #code here
        pass
    except (SystemExit, KeyboardInterrupt):
        raise
    except Exception as e:
        logger.error('error somewhere',exc_info=True)
    threading.Timer(20, work).start()
It usually runs completely fine. Once in a while, it returns an error that doesn't make much sense. It is always the same two errors. The first one might be legitimate, but the errors after that definitely aren't. After that, it returns the same error every time the thread runs. If I kill the process and start over, it runs cleanly. I have absolutely no idea what's going on here. Help please.
As currently defined in your question, you are most likely exceeding your maximum recursion depth. I can't be certain because you have omitted any flow control that may be present in your try block. Furthermore, every time your code fails to execute, the general catch for exceptions logs the exception and then bumps you into a new timer with a new logger (assuming you are declaring that in the try block). I think you probably meant to do the following:
import threading
import time
def work():
    try:
        #code here
        pass
    except (SystemExit, KeyboardInterrupt):
        raise
    except Exception as e:
        logger.error('error somewhere',exc_info=True)
t = threading.Timer(20, work)
t.start()
i = 0
while True:
    time.sleep(1)
    i+=1
    if i >1000:
        break
t.cancel()
If this is in fact the case, the reason your code was not working is that when you call your work function the first time, it processes and then, right at the end, starts another work function in a new timer. This happens ad infinitum until the stack fills up, Python coughs, and gets angry that you have recursed (called a function from within itself) too many times.
My code fix pulls the timer outside of the function so we create a single timer, which calls the work function once, 20 seconds after it is started (a threading.Timer fires only once).
Because threading.Timer objects run in separate threads, we also need to wait around in the main thread. To do this, I added a simple while loop that runs for 1000 seconds and then cancels the timer and exits. If we didn't wait around in the main thread, it would call your timer and then close out immediately, causing Python to clean up the timer before it executed even once.
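If the goal is for work to keep running every 20 seconds, one common pattern is a single long-lived thread that loops and sleeps on a threading.Event. A minimal sketch under that assumption, in Python 3; run_periodically and stop_event are hypothetical names, and logging.exception stands in for the logger used above:

```python
import logging
import threading

def run_periodically(work, interval, stop_event):
    """Call work() every `interval` seconds until stop_event is set.

    Exceptions from work() are logged and do not end the loop.
    """
    while not stop_event.is_set():
        try:
            work()
        except Exception:
            logging.exception("error somewhere")
        # Sleeps for `interval`, but returns early once stop_event is set.
        stop_event.wait(interval)

# Hypothetical usage:
# stop = threading.Event()
# worker = threading.Thread(target=run_periodically, args=(work, 20, stop))
# worker.start()
# ...
# stop.set(); worker.join()
```

This uses one thread for the lifetime of the job instead of creating a new timer thread per run, and it gives the main thread a clean way to stop the loop.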

Python error code on exit, and internet connectivity

I'm running a small Python script that checks every 10 seconds whether my Internet access is available (I've been having problems with my ISP). I've been running it for probably 2 months and it's worked perfectly, but now it randomly exits. Sometimes it exits within 20 seconds of me starting it, and sometimes it waits 5 minutes. The code is:
import time
import datetime
import urllib2
waitTime = 300
outfile = r"C:\Users\simmons\Desktop\internetConnectivity.txt"
def internetOffline():
    with open (outfile, "a") as file:
        file.write("Internet Offline: %s\n" % time.ctime())
    print("Internet Went Down!")
def internetCheck():
    try:
        urllib2.urlopen('https://www.google.com', timeout = 2)
    except urllib2.URLError:
        internetOffline()
while (1):
    internetCheck()
    time.sleep( 10 )
My question is not only, how would I print out what is happening when it exits, but also, does anyone know of a more efficient way of doing this, so it possibly causes less network traffic. It's not a problem now, but I was just wondering about more efficient methods.
This could be caused by requesting google.com too many times, but I'm not sure.
Run the program in your IDE and read the error it throws on exit. This should tell you what is failing and where the program is exiting.
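To capture what happens when the script dies without relying on an IDE, one option is to log the full traceback of anything the check raises. The question's code catches only urllib2.URLError, so any other exception would end the loop. A sketch in Python 3 with hypothetical names (guarded_check, the "connectivity" logger, and the log file name in the comment):

```python
import logging

logger = logging.getLogger("connectivity")

def guarded_check(check):
    """Run check(); log the traceback and return False if it raises."""
    try:
        check()
        return True
    except Exception:
        # logger.exception records the message plus the full traceback.
        logger.exception("connectivity check raised an unexpected error")
        return False

# Hypothetical usage: send the log to a file, then loop as before.
# logging.basicConfig(filename="internetConnectivity.log", level=logging.INFO)
# while True:
#     guarded_check(internetCheck)
#     time.sleep(10)
```

The log file then contains the exception type and the line it came from, even when the script runs unattended.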
Here is a good way to do this:
import urllib2
def internet_on():
    try:
        response=urllib2.urlopen('http://74.125.228.100',timeout=1)
        return True
    except urllib2.URLError as err: pass
    return False
74.125.228.100 is one of the IP-addresses for google.com. Change http://74.125.228.100 to whatever site can be expected to respond quickly
I got this solution from this question, take a look at it, it should help
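On the question of causing less network traffic: a full HTTP request is not strictly needed to detect an outage. A possible lighter check only opens and closes a TCP connection. This is a sketch in Python 3; is_reachable is a hypothetical helper, and the host and port in the usage comment are assumptions to be replaced with a server you expect to be up:

```python
import socket

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False

# Hypothetical usage: is_reachable("www.google.com", 443)
```

No page is downloaded, so each check costs only the connection handshake.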
