Python: Make HTTP Errors (500) Not Stop Script

This is the basic part of my code that I need help with. Note that I only learned Python about a week ago. I don't understand try/except yet, but I know that's what I need for this, so if anyone could help that would be great.
import urllib.request

url = 'http://google.com/{0}/{1}'.format(variable, variable1)
site = urllib.request.urlopen(url)
That's not the real website, but you get the idea. Now I'm running a loop 5 times per item, over around 20 different items. So, roughly:
google.com/spiders/ (runs 5 times with different types of spiders)
google.com/dogs/ (runs 5 times with different types of dogs), etc.
Now the 2nd variable is the same on about 90% of the items I'm looping over, but 1 or 2 of them have some of the "types" but not others. So I get an HTTP error 500 because that page doesn't exist. How do I make it basically skip that? I know error 500 probably isn't the right error for this, but I do know the pages for those items don't exist. So how do I set this up so that it just skips an item if it gets any error?

You can use a try/except block in your loop, like:
try:
    url = 'http://google.com/{0}/{1}'.format(variable, variable1)
    site = urllib.request.urlopen(url)
except Exception as ex:
    print("ERROR - " + str(ex))
You can also catch specific exceptions instead - the code above catches any exception at all (including one raised by a bug in your code rather than a network error).
See here for more: https://wiki.python.org/moin/HandlingExceptions
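For the case in the question, a tighter version is to catch only urllib.error.HTTPError inside the loop and move on, so a real bug still surfaces. A minimal sketch; `types` and `variable` are placeholders standing in for the question's loop variables:
import urllib.request
import urllib.error

for variable1 in types:   # the ~5 "types" per item
    url = 'http://google.com/{0}/{1}'.format(variable, variable1)
    try:
        site = urllib.request.urlopen(url)
    except urllib.error.HTTPError as ex:
        # Page does not exist for this item/type combination - skip it.
        print("Skipping {0}: HTTP {1}".format(url, ex.code))
        continue
    # ... process `site` here ...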

Related

Try/except not working as expected: "Except" error message is appended to passing result

I have code that is meant to find a graph on a webpage and create a link for web-crawling from it. If a graph is not found, I've put in a try/except to print a message with the corresponding (player) link, so it goes on to the next one.
It's from a football valuation website, and I've reduced the list to two players for debugging: one is Kylian Mbappé (who has a graph on his page and should pass) and the other Ansu Fati (who doesn't). Attempting to grab Ansu Fati's graph tag from his profile using BeautifulSoup results in a NoneType error.
The issue here is that Mbappé's graph link does get picked up for processing downstream in the code, but the error/link message in the except clause is also printed to the console. That should only happen for Ansu Fati.
Here's the code:
import sys
import requests
from bs4 import BeautifulSoup

final_url_list = ['https://www.transfermarkt.us/kylian-mbappe/profil/spieler/342229',
                  'https://www.transfermarkt.com/ansu-fati/profil/spieler/466810']
for i in final_url_list:
    try:
        int_page = requests.get(i, headers={'User-Agent': 'Mozilla/5.0'}).text
    except requests.exceptions.Timeout:
        sys.exit(1)
    parsed_int_page = BeautifulSoup(int_page, 'lxml')
    try:
        graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
        graph_a = graph_container.find('a')
        graph_link = graph_a.get('href')
        final_url_list.append('https://www.transfermarkt.us' + graph_link)
    except None:
        pass
        print("Graph error:" + i)
I tried using PyCharm's debugger to see how the interpreter steps through the code, and it looks like the whole except clause is skipped, but when I run it in the console, the "Graph error: link" message is printed for both. I'm not sure what is wrong with the code for the try/except to behave this way.
The line
except None:
is looking for an exception with type None, which is impossible.
Try changing that line to
except AttributeError:
Doing so will result in the following output:
Graph error:https://www.transfermarkt.com/ansu-fati/profil/spieler/466810
Graph error:https://www.transfermarkt.us/kylian-mbappe/marktwertverlauf/spieler/342229
There's an additional issue here where you're modifying the list that you're iterating over, which is not only bad practice, but is resulting in the unexpected behavior you're seeing.
Because you're appending to the list you're iterating over, you're going to add an iteration for a url that you don't actually want to be scraping. To fix this, change the first couple of lines in your script to this:
url_list = ['https://www.transfermarkt.us/kylian-mbappe/profil/spieler/342229',
            'https://www.transfermarkt.com/ansu-fati/profil/spieler/466810']
final_url_list = []
for i in url_list:
This way, you're appending the graph links to a different list, and you won't try to scrape links that you shouldn't be scraping. This will put all of the "graph links" into final_url_list.
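Putting both fixes together (catching AttributeError and collecting the graph links in a separate list), the loop could look like the sketch below; it is based on the code above, not on the full original script:
for i in url_list:
    try:
        int_page = requests.get(i, headers={'User-Agent': 'Mozilla/5.0'}).text
    except requests.exceptions.Timeout:
        sys.exit(1)
    parsed_int_page = BeautifulSoup(int_page, 'lxml')
    try:
        graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
        graph_link = graph_container.find('a').get('href')
        final_url_list.append('https://www.transfermarkt.us' + graph_link)
    except AttributeError:
        # find() returned None, i.e. the page has no graph - report it and move on.
        print("Graph error:" + i)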

How to find the correct website link depending on a composed string with python

I have a list of first and last names that is supposed to be used to compose website links. But some users don't follow the naming rule, so their website name doesn't match the expected one.
Here is an example: let's say the first name is John and the last name is Paul. In this case, the website URL should be johnpaul.com. But sometimes users put pauljohn.com, or john-paul.com, or some other variant.
I would like to automate some processes on these websites. The vast majority of them are correct, but some are not. When one is not correct, I just google the expected URL and it is generally the first or second result I get on Google.
I was wondering if it is possible to make a Google request and check the first 2 or 3 links with Python to get the actual URL. Any idea how to do something like this?
My code now looks like this:
for value in arr:
    try:
        # url is composed from the first/last name in `value`
        print(requests.get(url).status_code, url)
    except Exception as e:
        print(url, "is not available")
I'd go with endswith()
string = "bla.com"
strfilter = ('.com', '.de') # Tuple
if string.endswith(strfilter):
raise "400 Bad Request"
This way you filter out the .com, .net, etc. errors.
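If querying Google isn't essential, another option that stays close to the status-code check already in the question is to generate the plausible URL variants from the name and probe each one. A minimal sketch; the candidate list and helper name are illustrative assumptions, not from the original post:
import requests

def find_working_url(first, last):
    """Try the likely URL spellings and return the first one that responds OK."""
    candidates = [
        'http://{0}{1}.com'.format(first, last),
        'http://{0}-{1}.com'.format(first, last),
        'http://{1}{0}.com'.format(first, last),
    ]
    for url in candidates:
        try:
            if requests.head(url, timeout=5, allow_redirects=True).status_code < 400:
                return url
        except requests.RequestException:
            continue  # DNS failure, timeout, etc. - try the next spelling
    return None

print(find_working_url('john', 'paul'))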

Python - Socket Error 10054 - How to prevent terminal from printing error?

Since it is not an error that stops execution, I am not sure what my options are for keeping it from popping up. I don't think it really matters which part of my code is causing the error if there is some universal way to suppress this error line from being printed.
The script simply uses whois to determine whether a domain is registered or not. I was doing a basic test of the top 1,000 English words to see if their .com domains were taken.
Here is my code:
for url in wordlist:
    try:
        domain = whois.whois(url)
        boom.write(("%s,%s,%s\r\n" %
                    (str(number), url, "TAKEN")).encode('UTF-8'))
    except:
        boom.write(("%s,%s,%s\r\n" %
                    (str(number), url, "NOT TAKEN")).encode('UTF-8'))
A bit hard to know for sure without your code, but wrap the section that's generating the error like this:
try:
    ...  # your error-generating code
except:
    pass
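Since the question's loop already wraps the whois call in a bare except, the 10054 line may well be printed by the library itself (to stderr) rather than raised. If that is the case - an assumption, since the full output isn't shown - one option is to silence stderr just around the lookup:
import contextlib
import io
import whois

def quiet_whois(url):
    # Anything the library prints to stderr during the lookup is discarded;
    # real exceptions still propagate and can be caught by the caller.
    with contextlib.redirect_stderr(io.StringIO()):
        return whois.whois(url)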

How can I re-start code from the point that CSV write is completed?

I made a web crawler for this page (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I) to collect the stock list of each page and write information (e.g. photo URL, title, description, date, price, etc.) to a CSV.
Sometimes exceptions randomly pop up while collecting the lists. When I re-start the entire script, sometimes the exception does not appear. I used a try/except inside a while loop to handle the exception, like below, but when the exception appears, the run keeps going around the while loop and can't get out of it.
while True:
    try:
        self.driver.execute_script(option2[1])
    except (StaleElementReferenceException, NoSuchElementException):
        sleep(1)
        print("Exception Found")
        continue
    break
What I would like to do is re-start the entire script from the last listing written to the CSV when the exception appears. My code is pretty long, so it is hard to describe exactly which part should be restarted. But what I am wondering is whether there is any specific command or logic to get the last listing written to the CSV and re-start the code from that point when the exception appears. I know my description is poor, but can you give me any advice?
Well, your question is not clear to me yet.
import csv

with open(filename) as f:
    last_record = list(csv.reader(f))[-1]
You can use the above code to get the last record written to the CSV file and use it accordingly. Please let me know if this is not the answer you wanted.
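To actually resume where the CSV left off, one common pattern is to read the keys already written at startup and skip them in the crawl loop. A minimal sketch; the file name, key column, and loop names are assumptions for illustration:
import csv
import os

def keys_already_written(csv_path, key_column=1):
    """Return the set of identifiers already saved, so a re-run can skip them."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline='', encoding='utf-8') as f:
        return {row[key_column] for row in csv.reader(f) if row}

# In the crawl loop (sketch):
# done = keys_already_written('stock_list.csv')
# for listing in listings:
#     if listing_id(listing) in done:
#         continue   # already collected on a previous run
#     ...scrape the listing and append its row to the CSV...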

100,000 HTTP Response Code Checks

I've got a list of ~100,000 links that I'd like to check the HTTP Response Code for. What might be the best method to use for doing this check programmatically?
I'm considering using the below Python code:
import requests

try:
    for x in range(0, 100000):
        r = requests.head(''.join(["http://stackoverflow.com/", str(x)]))
        # They'll actually be read from a file, and aren't sequential
        print(r.status_code)
except requests.ConnectionError:
    print("failed to connect")
... but I am not aware of the potential side effects of checking such a large number of URLs in a single pass. Thoughts?
The only side effect I can think of is time, which you can mitigate by making the requests in parallel. (use http://gevent.org/ or https://docs.python.org/2/library/thread.html).
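For a concrete starting point, the standard library's thread pool works well for this kind of I/O-bound check. A minimal sketch; the input file name and worker count are placeholders, not from the original post:
import concurrent.futures
import requests

def check(url):
    try:
        return url, requests.head(url, timeout=10, allow_redirects=True).status_code
    except requests.RequestException as exc:
        return url, str(exc)

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for url, result in pool.map(check, urls):
        print(url, result)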
