The urllib2.URLError and its reason in Python

The title of the question may be a bit confusing but I don't really know how best to word it...
I've found the following chunk of code which downloads a web page from the web by making use of the urllib2 library.
import urllib2

def download(url):
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
Now if it happens that e.code is 404, then e.reason is simply an empty string, which means it carries no information at all about what triggered the error, so I don't really understand the point of using e.reason here.
It seems like it would be more reasonable to print e instead, but even if I change it to simply print e, it still yields something awkward: HTTP Error 404: with the colon apparently followed by an empty string...
So it appears to me that the above code is a little clumsy in terms of exception handling. Is that so?

It would seem that you could either use the error itself (print e) or the code and the reason (print "Download Error: ", e.code, e.reason) if you wanted to see the 404 code.
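For example, a minimal sketch of the download function along those lines. Catching urllib2.HTTPError (a subclass of URLError that carries the status code) separately is an addition of this sketch, not part of the original snippet:
import urllib2

def download(url):
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.HTTPError as e:
        # HTTP-level failures expose both the status code and the reason
        print 'Download error:', e.code, e.reason
        html = None
    except urllib2.URLError as e:
        # non-HTTP failures (DNS errors, refused connections, ...) only have a reason
        print 'Download error:', e.reason
        html = None
    return html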

Related

Checking URL Status without Throwing Error

I'm looking to check whether 500+ strings in a given dataframe are URLs. I've seen that this can be done using the requests package, but I've found that if I provide a URL that doesn't exist, instead of receiving the error code 404, my program crashes.
Because I'm looking to apply this function to a dataframe where many of the strings are not active URLs, the current function won't work for what I'm trying to accomplish.
I'm wondering if there is a way to adapt the code below to return 'No' (or anything else) in the case that the URL isn't real. For example, providing the URL 'http://www.example.commmm' results in an error:
import requests

response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Yes')
else:
    print('No')
thanks in advance!
I would try and add a try/except to prevent your code from breaking
try:
    print(x)
except:
    print("An exception occurred")

HTTPError when appending DataFrame

I am reading Python code from another programmer, particularly the following code block:
try:
    df.append(df_extension)
except HTTPError as e:
    if ("No data could be loaded!" in str(e)):
        print("No data could be loaded. Error was caught.")
    else:
        raise
In this, df and df_extension are pandas.DataFrames.
I wonder how an HTTPError could occur with pandas.DataFrame.append. At least from the documentation I cannot see how append would raise an HTTPError.
Any ideas will be welcome.
According to the comments on the question by @JCaesar and @Neither, you don't have to worry about an HTTPError arising from the use of df.append. The try/except block does not seem to have any justification. The one-liner
df.append(df_extension)
suffices.
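If the worry was a failed download of whatever populates df_extension, the try/except arguably belongs around the step that performs the HTTP request, not around append. A minimal sketch, assuming df_extension comes from a URL read with pandas.read_csv and that HTTPError refers to urllib.error.HTTPError; both are assumptions, since the question doesn't show where df_extension originates:
import pandas as pd
from urllib.error import HTTPError

df = pd.DataFrame({"a": [1, 2]})

try:
    # the network read is where an HTTPError could realistically arise
    df_extension = pd.read_csv("https://example.com/extension.csv")  # hypothetical URL
except HTTPError:
    print("No data could be loaded. Error was caught.")
    df_extension = pd.DataFrame()

# DataFrame.append is used here only to mirror the question; newer pandas versions use pd.concat
df = df.append(df_extension)  # append returns a new DataFrame rather than modifying df in place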

Try/except not working as expected: "Except" error message is appended to passing result

I have code that is meant to find a graph on a webpage and create a link for web-crawling from it. If a graph is not found, I've put in a try/except to print a message with the corresponding (player) link, so the loop moves on to the next one.
It's from a football valuation website, and I've reduced the list to two players for debugging: one is Kylian Mbappé (who has a graph on his page and should pass) and the other is Ansu Fati (who doesn't). Attempting to grab Ansu Fati's graph tag from his profile using BeautifulSoup results in a NoneType error.
The issue here is that Mbappé's graph link does get picked up for processing downstream in the code, but the error/link message in the except clause is also printed to the console for him. This should only be the case for Ansu Fati.
Here's the code
final_url_list = ['https://www.transfermarkt.us/kylian-mbappe/profil/spieler/342229','https://www.transfermarkt.com/ansu-fati/profil/spieler/466810']

for i in final_url_list:
    try:
        int_page = requests.get(i, headers = {'User-Agent':'Mozilla/5.0'}).text
    except requests.exceptions.Timeout:
        sys.exit(1)
    parsed_int_page = BeautifulSoup(int_page,'lxml')
    try:
        graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
        graph_a = graph_container.find('a')
        graph_link = graph_a.get('href')
        final_url_list.append('https://www.transfermarkt.us' + graph_link)
    except None:
        pass
        print("Graph error:" + i)
I tried using PyCharm's debugger to step through the code, and it looks as though the whole except clause is skipped, yet when I run it in the console the "Graph error: link" message is printed for both. I'm not sure what is wrong with the try/except for it to behave this way.
The line
except None:
is looking for an exception with type None, which is impossible.
Try changing that line to
except AttributeError:
Doing so will result in the following output:
Graph error:https://www.transfermarkt.com/ansu-fati/profil/spieler/466810
Graph error:https://www.transfermarkt.us/kylian-mbappe/marktwertverlauf/spieler/342229
There's an additional issue here where you're modifying the list that you're iterating over, which is not only bad practice, but is resulting in the unexpected behavior you're seeing.
Because you're appending to the list you're iterating over, you're going to add an iteration for a url that you don't actually want to be scraping. To fix this, change the first couple of lines in your script to this:
url_list = ['https://www.transfermarkt.us/kylian-mbappe/profil/spieler/342229','https://www.transfermarkt.com/ansu-fati/profil/spieler/466810']
final_url_list = []
for i in url_list:
This way, you're appending the graph links to a different list, and you won't try to scrape links that you shouldn't be scraping. This will put all of the "graph links" into final_url_list
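Putting both fixes together, a minimal sketch of the corrected loop could look like this (the imports are spelled out here; everything else just combines the two changes described above):
import sys
import requests
from bs4 import BeautifulSoup

url_list = ['https://www.transfermarkt.us/kylian-mbappe/profil/spieler/342229',
            'https://www.transfermarkt.com/ansu-fati/profil/spieler/466810']
final_url_list = []  # collect graph links here instead of appending to the list being iterated

for i in url_list:
    try:
        int_page = requests.get(i, headers={'User-Agent': 'Mozilla/5.0'}).text
    except requests.exceptions.Timeout:
        sys.exit(1)
    parsed_int_page = BeautifulSoup(int_page, 'lxml')
    try:
        graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
        graph_link = graph_container.find('a').get('href')
        final_url_list.append('https://www.transfermarkt.us' + graph_link)
    except AttributeError:
        # graph_container is None when the page has no graph, so .find() raises AttributeError
        print("Graph error:" + i)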

How to ignore an IndexError in Python

I'm trying to write a script that will go through a list of urls and scrape the web page connected to that url and save the contents to a text file. Unfortunately, a few random urls lead to a page that isn't formatted in the same way and that gets me an IndexError. How do I write a script that will just skip the IndexError and move onto the next URL? I tried the code below but just get syntax errors. Thank you so much in advance for your help.
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import io
import os
import re

urlfile = open("dailynewsurls.txt",'r') # read one line at a time until end of file

for url in urlfile:
    try:
        page = urllib2.urlopen(url)
        pagecontent = page.read() # get a file-like object at this url
        soup = BeautifulSoup(pagecontent)
        title = soup.find_all('title')
        article = soup.find_all('article')
        title = str(title[0].get_text().encode('utf-8'))
    except IndexError:
        return None
        article = str(article[0].get_text().encode('utf-8'))
    except IndexError:
        return None
    outfile = open(output_files_pathname + new_filename,'w')
    outfile.write(title)
    outfile.write("\n")
    outfile.write(article)
    outfile.close()
    print "%r added as a text file" % title

print "All done."
The error I get is:
File "dailynews.py", line 39
except IndexError:
^
SyntaxError: invalid syntax
you would do something like:
try:
    # the code that can cause the error
except IndexError:  # catch the error
    pass  # pass will basically ignore it
          # and execution will continue on to whatever comes
          # after the try/except block
If you're in a loop, you could use continue instead of pass. continue will immediately jump to the next iteration of the loop, regardless of whether there was more code to execute in the iteration it jumps from. sys.exit(0) would end the program.
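For instance, a minimal sketch of continue inside a loop (the URL list and the indexing are made up purely for illustration):
urls = ["http://good.example/article-1", "http://bad.example"]

for url in urls:
    parts = url.split("/")
    try:
        slug = parts[3]  # raises IndexError when the URL has no path segment
    except IndexError:
        continue         # skip this URL and move on to the next one
    print(slug)          # only reached when the indexing succeeded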
Do the following:
except IndexError:
    pass
And, as suggested by another user, remove the other except IndexError.
When I run your actual program, either the original version or the edited one, in either Python 2.5 or 2.7, the syntax error I get is:
SyntaxError: 'return' outside function
And the meaning of that should be pretty obvious: You can't return from a function if you aren't in a function. If you want to "return" from the entire program, you can do that with exit:
import sys
# ...
except IndexError:
sys.exit()
(Note that you can give a value to exit, but it has to be a small integer, not an arbitrary Python value. Most shells have some way to use that return value, normally expecting 0 to mean success, a positive number to mean an error.)
In your updated version, if you fix that (whether by moving this whole thing into a function and then calling it, or by using exit instead of return) you will get an IndentationError. The lines starting with outfile = … have to be either indented to the same level as the return None above (in which case they're part of the except clause, and will never get run), or dedented back to the same level as the try and except lines (in which case they will always run, unless you've done a continue, return, break, exit, unhandled raise, etc.).
If you fix that, there are no more syntax errors in the code you showed us.
I suspect that your edited code still isn't your real code, and you may have other syntax errors in your real code. One common hard-to-diagnose error is a missing ) (or, less often, ] or }) at the end of a line, which usually causes the next line to report a SyntaxError, often at some odd location like a colon that looks (and would be, without the previous line) perfectly valid. But without seeing your real code (or, better, a real verifiable example), it's impossible to diagnose any further.
That being said, I don't think you want to return (or exit) here at all. You're trying to continue on to the next iteration of the loop. You do that with the continue statement. The return statement breaks out of the loop, and the entire function, which means none of the remaining URLs will ever get processed.
Finally, while it's not illegal, it's pointless to have extra statements after a return, continue, etc., because those statements can never get run. And similarly, while it's not illegal to have two except clauses with the same exception, it's pointless; the second one can only run in the case where the exception isn't an IndexError but is an IndexError, which means never.
I suspect you may have wanted a separate try/except around each of the two indexing statements, instead of one around the entire loop. While that isn't at all necessary here, it can sometimes make things clearer. If that's what you're going for, you want to write it like this:
page = urllib2.urlopen(url)
pagecontent = page.read() # get a file-like object at this url
soup = BeautifulSoup(pagecontent)
title = soup.find_all('title')
article = soup.find_all('article')

try:
    title = str(title[0].get_text().encode('utf-8'))
except IndexError:
    continue

try:
    article = str(article[0].get_text().encode('utf-8'))
except IndexError:
    continue

outfile = open(output_files_pathname + new_filename,'w')
outfile.write(title)
outfile.write("\n")
outfile.write(article)
outfile.close()
print "%r added as a text file" % title
You can't "return" in
except IndexError:
    return None
article = str(article[0].get_text().encode('utf-8'))
This code is not inside a function, so use "pass", "break" or "continue" instead.
EDIT
try this
try:
    page = urllib2.urlopen(url)
    pagecontent = page.read() # get a file-like object at this url
    soup = BeautifulSoup(pagecontent)
    title = soup.find_all('title')
    article = soup.find_all('article')
    title = str(title[0].get_text().encode('utf-8'))
except IndexError:
    try:
        article = str(article[0].get_text().encode('utf-8'))
    except IndexError:
        continue

Python Mechanize IncompleteRead Error

I am experimenting with mechanize and re to find the websites which correspond to a list of retail stores.
I have been parsing Bing search results to grab the top result's url. Unfortunately, seemingly independent of the query, at random times I've been getting an httplib.IncompleteRead error. Even though I've got a workaround which follows, I'd like to understand what's happening.
def bingSearch(query): #query is the store's name, i.e. "Bob's Pet Shop"
    while True:
        try:
            bingBrowser.open('http://www.bing.com/search?q="' + query.replace(' ','+') + '"' )
            htmlCode = bingBrowser.response().read()
            break
        except httplib.IncompleteRead:
            #Sleep for a little while and try again.
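For reference, a minimal sketch of how that retry loop could be fleshed out with an explicit sleep and a bounded number of attempts; the attempt cap, the delay, and the time.sleep call are assumptions added here, not part of the original workaround:
import time
import httplib

def bing_search_with_retry(browser, query, max_attempts=5, delay=2):
    # Hypothetical variant of bingSearch: give up after max_attempts
    # IncompleteRead errors instead of retrying forever.
    url = 'http://www.bing.com/search?q="' + query.replace(' ', '+') + '"'
    for attempt in range(max_attempts):
        try:
            browser.open(url)
            return browser.response().read()
        except httplib.IncompleteRead:
            time.sleep(delay)  # back off briefly before retrying
    return None  # every attempt ended in an IncompleteRead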
Other relevant info:
Sometimes, for a single Bing URL, the program will attempt to open and read that URL multiple times before a read succeeds without an IncompleteRead error.
bingBrowser's headers attribute is set up to look nice.
bingBrowser's robots attribute is set to false.
The question httplib: incomplete read looks related, but I don't know anything about Apache, so I wasn't able to understand the answer there; it may still be helpful to you. That said, I doubt that I'm having a similar problem (why would bing.com be suffering from an Apache error?!).
Edit:
Replaced query.replace(' ','+') + '"' ) with urllib.urlencode(dict(q=query)) per JF Sebastian's suggestion - no change (I know this wasn't proposed as a solution).
Suffered from an inexplicable urllib2.URLError on bingBrowser.open('http://www.bing.com/search?q="' + query.replace(' ','+') + '"' )
Got an xlwt related "String longer than 65535 characters" error - probably unrelated.
Thanks in advance.
