I have a large scraping job that I am trying to run using Selenium and PhantomJS in Python. It throws a couple of different errors after running correctly for about 24 hours; I have tested this a couple of times. Any new code is hard to test, because I have to wait 24 hours to see if anything was solved. So I was wondering if anyone with more experience could take a look at this piece of code and see if it seems OK. What I am trying to do is keep the while loop going in spite of errors from the browser.
while something:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    except httplib.HTTPException:
        print 'HTTPException'
        time.sleep(20)
        pass
    except IOError:
        print 'IOError'
        time.sleep(20)
        pass
Your code looks fine. You can combine the handlers with except (httplib.HTTPException, IOError) as e: and print type(e).__name__, and you can drop the pass:
while something:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    except (httplib.HTTPException, IOError) as e:
        print type(e).__name__
        time.sleep(20)
I'd use the logging module to provide logging information here; the logger.exception() function would include the exception and traceback in the output:
logger = logging.getLogger(__name__)
# ...
    except (httplib.HTTPException, IOError) as e:
        logger.exception('Ignoring exception, sleeping for 20 seconds')
        time.sleep(20)
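As a rough, self-contained sketch of the same idea in Python 3 (where httplib is named http.client): the helper name call_with_recovery and its max_failures and pause parameters are hypothetical, and the cap on failures is an addition that the original loop does not have.

```python
import http.client
import logging
import time

logger = logging.getLogger(__name__)


def call_with_recovery(action, retriable=(http.client.HTTPException, IOError),
                       max_failures=3, pause=0.0):
    """Call action() until it succeeds, logging each retriable failure.

    Re-raises the last error once max_failures has been reached.
    """
    failures = 0
    while True:
        try:
            return action()
        except retriable:
            failures += 1
            # logger.exception() logs at ERROR level and appends the traceback.
            logger.exception("Ignoring exception %d/%d", failures, max_failures)
            if failures >= max_failures:
                raise
            time.sleep(pause)
```

In the scraping loop, action would be something like lambda: driver.execute_script(...), with pause set to the 20 seconds used above.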
I am trying to catch an exception thrown by Selenium. When the exception is thrown I would like to restart the scraping process.
try:
    startScraping(rootPage)
except (InvalidSessionIdException):
    startScraping(rootPage)
finally:
    driver.close()
Above is the code I am using.
The problem I am facing is that when an InvalidSessionIdException occurs the script still stops execution and shows a stacktrace.
The second startScraping(rootPage) call (in the except block) is not protected by any try/except block, so an exception raised there still propagates.
If an exception occurs, retrying immediately is probably bound to fail again. I would:
- catch the exception
- print a meaningful warning
- wait a while
- repeat until it works, or a given number of times with a for loop
like this:
import time
nb_retries = 10
for nb_attempts in range(nb_retries):
    try:
        startScraping(rootPage)
        break  # success, exit the loop
    except InvalidSessionIdException as e:
        print("Warning: {}, attempt {}/{}".format(e, nb_attempts + 1, nb_retries))
        time.sleep(1)
else:
    # loop went through after 10 attempts and no success
    print("unable to scrape after {} retries".format(nb_retries))
driver.close()
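The for/else behaviour is easy to misread, so here is a minimal sketch that runs without Selenium. The InvalidSessionIdException class below is a local stand-in (with Selenium installed it would be imported from selenium.common.exceptions), and scrape_with_retries and its parameters are hypothetical names.

```python
import time


class InvalidSessionIdException(Exception):
    """Local stand-in so the sketch runs without Selenium installed."""


def scrape_with_retries(scrape, nb_retries=3, delay=0.0):
    """Return True if scrape() succeeded within nb_retries attempts."""
    for attempt in range(1, nb_retries + 1):
        try:
            scrape()
            break  # success: skips the else clause below
        except InvalidSessionIdException as e:
            print("Warning: {}, attempt {}/{}".format(e, attempt, nb_retries))
            time.sleep(delay)
    else:
        # Runs only when the loop finished without hitting break.
        return False
    return True
```

The else clause belongs to the for loop, not to the try statement, which is why it only runs when every attempt failed.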
Try this if you want to restart the process and ignore the exception:
while True:
    try:
        startScraping(rootPage)
        break  # after finishing the scraping process
    except (InvalidSessionIdException):
        pass  # or print the exception
driver.close()
As mentioned in the code, you can print the exception or do any other exception handling you may want.
I am downloading images using multiprocessing, and one odd thing I noticed is that you cannot get the error message unless you catch BaseException. For simple cases, you can run a loop and do the following:
for i in range(start, start+end):
    url = df.iloc[i,ind]
    try:
        load_save_image_from_url(url,DIR,str(i),resize=resize,resize_shape=resize_shape)
        count+=1
    except (KeyboardInterrupt, SystemExit):
        sys.exit("Forced exit prompted by User: Quitting....")
    except Exception as e:
        print(f"Error at index {i}: {e}\n")
        pass
This works completely fine. But with multiprocessing you cannot rely on logging or print either, because only one log or print message appears: the last one.
try:
    pool = Pool(workers)
    pool.map(partial(load_save_image_from_url,OUT_DIR=DIR,resize=resize,resize_shape=resize_shape),
             lis_tups)
except (KeyboardInterrupt, SystemExit):
    sys.exit("Forced exit prompted by User: Quitting....")
except ConnectionError:
    logging.error(f"Connection Error for some URL")
    pass
except Exception as e:
    logging.error(f'Some Other Error most probably Image related')
    pass
pool.close()
pool.join()
Using pool.get() can work, but it has to be in a loop, and only at the end of the program.
How can I print an error or log an error when there is an exception while being in multiprocessing?
You can catch the exception inside the load_image function itself and return it to the main process.
results = pool.map(load_image, ...)  # map returns one result per input
for result in results:
    if isinstance(result, Exception):
        print(result)  # handle it
    else:
        pass  # image loading succeeded

def load_image():
    try:
        # make a get request
        return image
    except Exception as e:
        return e
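A slightly fuller sketch of this approach, under stated assumptions: load_item is a hypothetical stand-in for the real download function, and safe_load, split_results and load_all are made-up helper names. Each worker returns its input together with either the value or the exception, so the main process can tell which item failed.

```python
from multiprocessing import Pool


def load_item(item):
    """Hypothetical stand-in for the real download; fails on negative input."""
    if item < 0:
        raise ValueError("cannot load item {}".format(item))
    return item * 2


def safe_load(item):
    """Run load_item in the worker and hand any exception back as a value."""
    try:
        return item, load_item(item)
    except Exception as e:
        return item, e


def split_results(results):
    """Separate (item, value) successes from (item, exception) failures."""
    loaded, failed = [], []
    for item, outcome in results:
        (failed if isinstance(outcome, Exception) else loaded).append((item, outcome))
    return loaded, failed


def load_all(items, workers=2):
    """Map safe_load over items in a process pool and split the outcomes."""
    with Pool(workers) as pool:
        results = pool.map(safe_load, items)
    return split_results(results)
```

Call load_all(...) from under an if __name__ == "__main__": guard, then log each entry of the failed list in the main process. Returned exceptions have to be picklable, which built-in exception types are.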
I am trying to execute a list of queries in Spark, but if the query does not run correctly, Spark throws me the following error:
AnalysisException: "ALTER TABLE CHANGE COLUMN is not supported for changing ...
This is part of my code (I'm using Python and Spark SQL on Databricks):
for index, row in df_tables.iterrows():
    query = row["query"]
    print("Executing query: ")
    try:
        spark.sql(query)
        print("Query executed")
    except (ValueError, RuntimeError, TypeError, NameError):
        print("Unable to process your query dude!!")
    else:
        pass  # do another thing
Is there any way to catch that exception? ValueError, RuntimeError, TypeError and NameError do not seem to work.
There is not much information about this on the Spark website.
I found AnalysisException defined in pyspark.sql.utils.
https://spark.apache.org/docs/3.0.1/api/python/_modules/pyspark/sql/utils.html
import pyspark.sql.utils
try:
    spark.sql(query)
    print("Query executed")
except pyspark.sql.utils.AnalysisException:
    print("Unable to process your query dude!!")
You can modify the try/except statement as below:
try:
    spark.sql(query)
    print("Query executed")
except Exception as x:
    print("Unable to process your query dude!!" + \
          "\n" + "ERROR : " + str(x))
I think it depends on your requirements. If you are running a full workflow on these queries and you want failed queries to pass through, then your code will work fine. But if you want your workflow or data pipeline to fail, then you should exit from that except block, for example by re-raising.
You may not get the exact exception type, but you can definitely get an overview using
except Exception as x:
    print(str(x))
You can use the logging module to put more information in the logs for further investigation.
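The two behaviours (let failed queries pass through, or fail the pipeline) can be sketched as one helper. This is only an illustration: run_queries, execute, expected_errors and fail_fast are hypothetical names. With Spark you would pass spark.sql as execute and (pyspark.sql.utils.AnalysisException,) as expected_errors.

```python
def run_queries(execute, queries, expected_errors=(Exception,), fail_fast=False):
    """Run each query through execute(); return a list of (query, error) failures.

    With fail_fast=True the first expected error is re-raised instead of recorded.
    """
    failures = []
    for query in queries:
        try:
            execute(query)
        except expected_errors as e:
            if fail_fast:
                raise  # let the workflow or pipeline fail
            failures.append((query, e))
    return failures
```

Passing the executor in as a callable keeps the helper independent of a live Spark session, which also makes it easy to try out with a fake executor.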
I want to propose a way to pick out specific exceptions.
My problem was finding out whether some table already existed. The simplest way I have found is shown below. Of course, this can break if the Spark maintainers change the exception message, but I do not think they have a reason to do so in this case.
import pyspark.sql.utils
try:
    spark.read.parquet(SOMEPATH)
except pyspark.sql.utils.AnalysisException as e:
    if "Path does not exist:" in str(e):
        # Found the specific message of the exception.
        pass  # run some code to address this specific case.
    else:
        # If this is not the AnalysisException I was waiting for,
        # re-raise it.
        raise
except Exception as e:
    # Any other exception can be caught like this.
    print(e)
    raise
I have some simple code:
import eventlet
def execute():
    print("Start")
    timeout = Timeout(3)
    try:
        print("First")
        sleep(4)
        print("Second")
    except:
        raise TimeoutException("Error")
    finally:
        timeout.cancel()
    print("Third")
This code should throw TimeoutException, because the code in the 'try' block runs for more than 3 seconds.
But this exception is swallowed in the loop. I can't see it in the output.
This is the output:
Start
First
Process finished with exit code 0
How can I raise this exception to the output?
Change sleep(4) to
eventlet.sleep(4)
This code will not output Start... because nobody calls execute(); also, sleep is not defined. Show the real code and I will edit the answer.
For now, several speculations:
- maybe you have from time import sleep; then it's a duplicate of "Eventlet timeout not exiting", and the problem is that you don't give Eventlet a chance to run and realize there was a timeout. Solutions: eventlet.sleep() everywhere or eventlet.monkey_patch() once.
- maybe you don't import sleep at all; then it's a NameError: sleep, and all exceptions from execute are hidden by the caller.
- maybe you run this code with stderr redirected to a file or /dev/null.
Let's also fix other issues.
try:
    # ...
    sleeep()  # with 3 'e', invalid name
    open('file', 'rb')
    raise Http404
except:
    # here you catch *all* exceptions,
    # including SystemExit, KeyboardInterrupt, GeneratorExit
    # - things you normally don't want to catch -
    # and also NameError, IOError, OSError,
    # your application errors, etc, all irrelevant to timeout
    raise TimeoutException("Error")
You should never write a bare except:, only except Exception:.
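To make the difference concrete, here is a small sketch in plain Python 3 (the function name caught_by_except_exception is made up for illustration). SystemExit and KeyboardInterrupt derive from BaseException rather than Exception, so except Exception: lets them pass while a bare except: would swallow them.

```python
def caught_by_except_exception(exc):
    """Raise exc and report whether an `except Exception:` clause catches it."""
    try:
        raise exc
    except Exception:
        return True
    except BaseException:
        # Only reached for SystemExit, KeyboardInterrupt, GeneratorExit and
        # other classes that do not derive from Exception.
        return False
```

For example, caught_by_except_exception(NameError("sleeep")) is caught by the first clause, while caught_by_except_exception(SystemExit()) falls through to the second.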
So let's catch only proper exceptions.
try:
    # ...
    execute_other()  # also has Timeout, but shorter, it will fire first
except eventlet.Timeout:
    # Now there may be other timeout inside `try` block and
    # you will accidentally handle someone else's timeout.
    raise TimeoutException("Error")
So let's verify that it was yours.
timeout = eventlet.Timeout(3)
try:
    # ...
except eventlet.Timeout as t:
    if t is timeout:
        raise TimeoutException("Error")
    # else, reraise and give owner of another timeout a chance to handle it
    raise
Here's the same code with shorter syntax:
with eventlet.Timeout(3, TimeoutException("Error")):
    print("First")
    eventlet.sleep(4)
    print("Second")
print("Third")
I hope you really need to substitute one timeout exception for another.
I have code:
try:
    print test[qwerq]
    try:
        print test[sdqwe]
    except:
        pass
except:
    pass
How can I print debug info for all errors in nested try blocks?
Re-raise the exceptions.
try:
    print test[qwerq]
    try:
        print test[qwe]
    except:
        # Do something with the exception.
        raise
except:
    # Do something here too, just for fun.
    raise
It should be noted that in general you don't want to do this. You're better off not catching the exception if you're not going to do anything about it.
If you want to just print the call stack and not crash, look into the traceback module.
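A minimal Python 3 sketch of the traceback approach, assuming a dictionary lookup as the failing operation (lookup_both and its arguments are hypothetical names). It collects the formatted tracebacks instead of crashing.

```python
import traceback


def lookup_both(data, outer_key, inner_key):
    """Try two lookups in nested try blocks; return the tracebacks collected."""
    reports = []
    try:
        data[outer_key]
        try:
            data[inner_key]
        except Exception:
            # format_exc() returns the current traceback as a string.
            reports.append(traceback.format_exc())
    except Exception:
        reports.append(traceback.format_exc())
    return reports
```

traceback.print_exc() does the same thing but writes the traceback straight to stderr instead of returning it.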