I'm trying to write a script that goes through a list of URLs, scrapes the web page at each URL, and saves the contents to a text file. Unfortunately, a few of the URLs lead to pages that aren't formatted the same way, and those give me an IndexError. How do I write a script that will just skip the IndexError and move on to the next URL? I tried the code below but just get syntax errors. Thank you so much in advance for your help.
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import io
import os
import re
urlfile = open("dailynewsurls.txt",'r') # read one line at a time until end of file
for url in urlfile:
try:
page = urllib2.urlopen(url)
pagecontent = page.read() # get a file-like object at this url
soup = BeautifulSoup(pagecontent)
title = soup.find_all('title')
article = soup.find_all('article')
title = str(title[0].get_text().encode('utf-8'))
except IndexError:
return None
article = str(article[0].get_text().encode('utf-8'))
except IndexError:
return None
outfile = open(output_files_pathname + new_filename,'w')
outfile.write(title)
outfile.write("\n")
outfile.write(article)
outfile.close()
print "%r added as a text file" % title
print "All done."
The error I get is:
File "dailynews.py", line 39
except IndexError:
^
SyntaxError: invalid syntax
You would do something like:
try:
    # the code that can cause the error
except IndexError:  # catch the error
    pass  # pass will basically ignore it
          # and execution will continue on to whatever comes
          # after the try/except block
If you're in a loop, you could use continue instead of pass. continue will immediately jump to the next iteration of the loop, regardless of whether there was more code to execute in the iteration it jumps from. sys.exit(0), by contrast, would end the program.
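To make the difference between pass and continue concrete, here is a small sketch. The function names and the sample rows are made up for illustration; an empty list stands in for a page that is missing the expected element.

```python
def collect_first_items(rows):
    """Return the first element of each non-empty row, skipping empty rows."""
    firsts = []
    for row in rows:
        try:
            first = row[0]  # raises IndexError when the row is empty
        except IndexError:
            continue  # abandon this iteration and move on to the next row
        firsts.append(first)  # only reached when the lookup succeeded
    return firsts


def count_rows_with_pass(rows):
    """Count every row; with pass, execution falls through after the except."""
    visited = 0
    for row in rows:
        try:
            row[0]
        except IndexError:
            pass  # ignore the error and carry on within the same iteration
        visited += 1  # still runs for the empty row
    return visited


rows = [["a", "b"], [], ["c"]]
print(collect_first_items(rows))
print(count_rows_with_pass(rows))
```

With continue, the empty row never reaches the code after the try/except; with pass, it does.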
Do the following:
except IndexError:
    pass
And, as another user suggested, remove the other except IndexError.
When I run your actual program, either the original version or the edited one, in either Python 2.5 or 2.7, the syntax error I get is:
SyntaxError: 'return' outside function
And the meaning of that should be pretty obvious: You can't return from a function if you aren't in a function. If you want to "return" from the entire program, you can do that with exit:
import sys
# ...
except IndexError:
    sys.exit()
(Note that you can give a value to exit, but it has to be a small integer, not an arbitrary Python value. Most shells have some way to use that return value, normally expecting 0 to mean success, a positive number to mean an error.)
In your updated version, if you fix that (whether by moving this whole thing into a function and then calling it, or by using exit instead of return) you will get an IndentationError. The lines starting with outfile = … have to be either indented to the same level as the return None above (in which case they're part of the except clause, and will never get run), or dedented back to the same level as the try and except lines (in which case they will always run, unless you've done a continue, return, break, exit, unhandled raise, etc.).
If you fix that, there are no more syntax errors in the code you showed us.
I suspect that your edited code still isn't your real code, and you may have other syntax errors in your real code. One common hard-to-diagnose error is a missing ) (or, less often, ] or }) at the end of a line, which usually causes the next line to report a SyntaxError, often at some odd location like a colon that looks (and would be, without the previous line) perfectly valid. But without seeing your real code (or, better, a real verifiable example), it's impossible to diagnose any further.
That being said, I don't think you want to return (or exit) here at all. You're trying to continue on to the next iteration of the loop. You do that with the continue statement. The return statement breaks out of the loop, and the entire function, which means none of the remaining URLs will ever get processed.
Finally, while it's not illegal, it's pointless to have extra statements after a return, continue, etc., because those statements can never get run. And similarly, while it's not illegal to have two except clauses with the same exception, it's pointless; the second one can only run in the case where the exception isn't an IndexError but is an IndexError, which means never.
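If you want to convince yourself of the point about duplicate except clauses, a tiny sketch like this shows it. The function name and return labels are invented for the demonstration.

```python
def first_or_label(items):
    """Return items[0], or a label naming the handler that caught the error."""
    try:
        return items[0]
    except IndexError:
        return "first handler"
    except IndexError:  # legal, but unreachable: the clause above always wins
        return "second handler"


print(first_or_label(["x"]))
print(first_or_label([]))
```

Python accepts the second clause without complaint, but it can never be the one that handles the error.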
I suspect you may have wanted a separate try/except around each of the two indexing statements, instead of one around the entire loop. While that isn't at all necessary here, it can sometimes make things clearer. If that's what you're going for, you want to write it like this:
for url in urlfile:
    page = urllib2.urlopen(url)
    pagecontent = page.read() # get a file-like object at this url
    soup = BeautifulSoup(pagecontent)
    title = soup.find_all('title')
    article = soup.find_all('article')
    try:
        title = str(title[0].get_text().encode('utf-8'))
    except IndexError:
        continue
    try:
        article = str(article[0].get_text().encode('utf-8'))
    except IndexError:
        continue
    outfile = open(output_files_pathname + new_filename,'w')
    outfile.write(title)
    outfile.write("\n")
    outfile.write(article)
    outfile.close()
    print "%r added as a text file" % title
You can't "return" here:
except IndexError:
    return None
    article = str(article[0].get_text().encode('utf-8'))
This code is not inside a function.
Use "pass", "break" or "continue" instead.
EDIT
Try this:
for url in urlfile:
    try:
        page = urllib2.urlopen(url)
        pagecontent = page.read() # get a file-like object at this url
        soup = BeautifulSoup(pagecontent)
        title = soup.find_all('title')
        article = soup.find_all('article')
        title = str(title[0].get_text().encode('utf-8'))
        article = str(article[0].get_text().encode('utf-8'))
    except IndexError:
        continue
Related
I'm running a Python script to post a Tweet if the length is short enough with an exception for errors and an else statement for messages that are too long.
When I run this, it posts the Tweet and still gives the Tweet too long message. Any idea why that is happening and how to make it work as intended?
if len(tweet_text) <= (280-6):
    try:
        twitter = Twython(CONSUMER_KEY,CONSUMER_SECRET,ACCESS_KEY,ACCESS_SECRET)
        twitter.update_status(status=tweet_text)
    except TwythonError as error:
        print(error)
    else:
        print("Tweet too Long. Please try again.")
The first line checks the length of the tweet. Move the else four spaces back. As written, it is attached to the try/except rather than to the if, because a try/except construction can also be a try/except/else construction.
From the docs:
The try … except statement has an optional else clause, which, when present, must follow all except clauses. It is useful for code that must be executed if the try clause does not raise an exception. (emphasis added)
Spaces/tabs matter in Python.
What your snippet says in plain English is: "Try to post the tweet; if there's an error, print the error. If there's no error, print 'Tweet too Long. Please try again.'"
What you want is:
if len(tweet_text) <= (280-6):
    try:
        twitter = Twython(CONSUMER_KEY,CONSUMER_SECRET,ACCESS_KEY,ACCESS_SECRET)
        twitter.update_status(status=tweet_text)
    except TwythonError as error:
        print(error)
else:
    print("Tweet too Long. Please try again.")
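Here is a self-contained sketch of the two layouts that runs without Twython. The function names are hypothetical, appending "posted" to a list stands in for the real update_status call, and RuntimeError stands in for TwythonError.

```python
def post_misplaced_else(text, limit=274):
    """else belongs to the try, so it runs whenever no exception is raised."""
    messages = []
    if len(text) <= limit:
        try:
            messages.append("posted")  # stand-in for the real API call
        except RuntimeError as error:
            messages.append(str(error))
        else:
            messages.append("too long")  # runs after every successful post
    return messages


def post_fixed_else(text, limit=274):
    """else belongs to the if, so it runs only when the text is too long."""
    messages = []
    if len(text) <= limit:
        try:
            messages.append("posted")  # stand-in for the real API call
        except RuntimeError as error:
            messages.append(str(error))
    else:
        messages.append("too long")
    return messages


print(post_misplaced_else("hello"))
print(post_fixed_else("hello"))
print(post_fixed_else("x" * 300))
```

The misplaced version reports "too long" right after a successful post and says nothing at all for text that really is too long, which matches the behaviour described in the question.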
Please bear with me: I'm new to Pandas/Python and I don't know what I'm doing.
I'm working with CSV files and I basically filter by currency.
From one day to the next, the exported CSV file may or may not contain certain currencies.
I have several cells of code like this:
AUDdf = df.loc[df['Currency'] == 'AUD']
AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
AUDdesc = AUDdf['Amount'].describe()
When the CSV doesn't contain AUD, I get ValueError: cannot set a frame with no defined columns.
What I'd like to produce is a function or an if statement or a loop that checks if the column contains AUD, and if it does, it runs the above code, and if it doesn't, it simply skips it and proceeds to the next line of code for the next currency.
Any idea how I can accomplish this?
Thanks in advance.
This can be done in two ways:
You can use a try/except statement. This will try to process the given currency and, if a ValueError occurs, skip it and move on:
try:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
except ValueError:
    pass
You can use an if statement that checks for the currency's presence first:
currency_set = set(list(df['Currency'].values))
if 'AUD' in currency_set:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
1. Worst way to skip over the error/exception:
try:
    <Your Code>
except:
    pass
The above is probably the worst way, because you want to know when an exception occurs. Using generic except statements is bad practice because you want to avoid "catch 'em all" code. You want to catch only the exceptions that you know how to handle, know which specific exception occurred, and handle them on an exception-by-exception basis. Writing generic except statements leads to missed bugs and gives misleading results when you run the code to test it.
2. Slightly better way to handle the exception:
try:
    <Your Code>
except Exception as e:
    <Some code to handle an exception>
This is still not optimal, as it is still generic handling.
3. Average way to handle it for your case:
try:
    <Your Code>
except ValueError:
    <Some code to handle this exception>
Other suggestions (much better ways to deal with this):
1. You can get a set of the available currencies at run time and aggregate only if 'AUD' is in it.
2. Clean your data set.
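To illustrate the "missed bugs" point, here is a sketch with a deliberate typo (flaot instead of float). Both function names are invented for this example. The bare except hides the resulting NameError, while the specific except lets it surface.

```python
def parse_amount_bare(text):
    """Bare except: the misspelled name below is silently swallowed."""
    try:
        return flaot(text)  # deliberate typo for float, raises NameError
    except:
        return None


def parse_amount_specific(text):
    """Only ValueError is handled, so the same typo surfaces immediately."""
    try:
        return flaot(text)  # same deliberate typo
    except ValueError:
        return None


print(parse_amount_bare("1.5"))
```

Calling parse_amount_specific("1.5") raises NameError straight away, which is what you want: the bug gets noticed instead of every amount quietly turning into None.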
You can use try and except, where
try:
    # your code here
except:
    # some print statement
    pass
The following code throws an out-of-range error in the barcode section:
for each in data['articles']:
    f.writerow([each['local']['name'],
                each['information'][0]['barcodes'][0]['barcode']])
I wrote a try and except to handle the case where a barcode is not present in the JSON I am parsing. This worked perfectly during testing with the print function, but I have been having trouble getting the try and except to work when using writerow to write to a CSV file.
Does anyone have any suggestions, or another method I could try, to get this to work?
My try and except, which worked when testing with print, was as follows:
for each in data['articles']:
    print(each['local']['name'])
    try:
        print(each['information'][0]['barcodes'][0]['barcode'])
    except:
        "none"
Any help is much appreciated!
As komatiraju032 points out, one way of doing this is via get(), although if there are different elements of the dictionary that might have empty/incorrect values, it might get unwieldy to provide a default for each one. To do this via a try/except you might do:
for each in data['articles']:
    row = [each['local']['name']]
    try:
        row.append(each['information'][0]['barcodes'][0]['barcode'])
    except (IndexError, KeyError):
        row.append("none")
    f.writerow(row)
This will give you that "none" replacement value regardless of which of those lists/dicts is missing the requested index/key, since any of those lookups might raise but they'll all end up at the same except.
Use the dict.get() method. It returns None if the key does not exist:
res = each['information'][0]['barcodes'][0].get('barcode')
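One thing to keep in mind is that get() only covers the final dictionary lookup; the [0] list indexes before it can still raise IndexError. A rough sketch that handles both cases and writes the rows out is below. The helper name first_barcode and the sample articles are made up, and an in-memory buffer stands in for the real CSV file.

```python
import csv
import io


def first_barcode(article, default="none"):
    """Return the first barcode of the first information entry, or default."""
    try:
        return article["information"][0]["barcodes"][0]["barcode"]
    except (IndexError, KeyError):
        return default


# Sample data: one complete article, one with no barcodes, one with no key.
articles = [
    {"local": {"name": "apple"}, "information": [{"barcodes": [{"barcode": "123"}]}]},
    {"local": {"name": "pear"}, "information": [{"barcodes": []}]},
    {"local": {"name": "plum"}, "information": [{"barcodes": [{}]}]},
]

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer)
for article in articles:
    writer.writerow([article["local"]["name"], first_barcode(article)])

print(buffer.getvalue())
```

For the "plum" entry a plain .get('barcode') would return None, but for the "pear" entry the empty barcodes list raises IndexError before get() is ever reached.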
I have code that is meant to find a graph on a webpage and create a link for web-crawling from it. If a graph is not found, then I've put in a try/except to print a message with a corresponding (player) link so it goes on to the next one if not found.
It's from a football valuation website and I've reduced the list to two players for debugging: one is Kylian Mbappé (who has a graph on his page and should pass) and the other is Ansu Fati (who doesn't). Attempting to grab Ansu Fati's graph tag from his profile using BeautifulSoup results in a NoneType error.
The issue here is that Mbappé's graph link does get picked up for processing downstream in the code, but the "except" error/link message in the except clause is also printed to the console. This should only be the case for Ansu Fati.
Here's the code
final_url_list = ['https://www.transfermarkt.us/kylian-mbappe/profil/spieler/342229','https://www.transfermarkt.com/ansu-fati/profil/spieler/466810']
for i in final_url_list:
    try:
        int_page = requests.get(i, headers = {'User-Agent':'Mozilla/5.0'}).text
    except requests.exceptions.Timeout:
        sys.exit(1)
    parsed_int_page = BeautifulSoup(int_page,'lxml')
    try:
        graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
        graph_a = graph_container.find('a')
        graph_link = graph_a.get('href')
        final_url_list.append('https://www.transfermarkt.us' + graph_link)
    except:
        pass
        print("Graph error:" + i)
I tried using PyCharm's debugging to see how the interpreter is going through the steps and it seems like the whole except clause is skipped, but when I run it in the console, the "Graph error: link" is posted for both. I'm not sure what is wrong with the code for the try/except issue to be behaving this way.
The line
except None:
is looking for an exception with type None, which is impossible.
Try changing that line to
except AttributeError:
Doing so will result in the following output:
Graph error:https://www.transfermarkt.com/ansu-fati/profil/spieler/466810
Graph error:https://www.transfermarkt.us/kylian-mbappe/marktwertverlauf/spieler/342229
There's an additional issue here where you're modifying the list that you're iterating over, which is not only bad practice, but is resulting in the unexpected behavior you're seeing.
Because you're appending to the list you're iterating over, you're going to add an iteration for a url that you don't actually want to be scraping. To fix this, change the first couple of lines in your script to this:
url_list = ['https://www.transfermarkt.us/kylian-mbappe/profil/spieler/342229','https://www.transfermarkt.com/ansu-fati/profil/spieler/466810']
final_url_list = []
for i in url_list:
This way, you're appending the graph links to a different list, and you won't try to scrape links that you shouldn't be scraping. All of the "graph links" will end up in final_url_list.
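You can see the effect of appending to the list you're looping over with a small offline sketch. The function names and the short fake URLs are invented for illustration, and no network access is involved.

```python
def crawl_in_place(urls):
    """Append derived links to the list being iterated (the problematic way)."""
    visited = []
    for url in urls:
        visited.append(url)
        if url.endswith("/profile"):
            # The loop will later reach this new entry as well.
            urls.append(url.replace("/profile", "/graph"))
    return visited


def crawl_separate(urls):
    """Collect derived links in a second list, leaving the input untouched."""
    visited, graph_links = [], []
    for url in urls:
        visited.append(url)
        if url.endswith("/profile"):
            graph_links.append(url.replace("/profile", "/graph"))
    return visited, graph_links


print(crawl_in_place(["a/profile", "b/profile"]))
print(crawl_separate(["a/profile", "b/profile"]))
```

In the first version the loop also visits the graph links it just added, which is why the graph link itself ends up in the except branch of the original script.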
I'm making a API call that occasionally does not have certain fields in the response. If I run across one of these responses, my script throws the KeyError as expected, but then breaks out of the for loop completely. Is there any way of getting it to simply skip over the errored output and continue with the loop?
I've considered trying to put all the fields I'm searching for into a list and iterate over that using a continue statement to keep the iteration going when it encounters a missing field, but 1) it seems cumbersome and 2) I've got multiple levels of iterations within the output.
try:
    for item in result["results"]:
        print(MAJOR_SEP) # Just a line of characters separating the output
        print("NPI:", item['number'])
        print("First Name:", item['basic']['first_name'])
        print("Middle Name:", item['basic']['middle_name'])
        print("Last Name:", item['basic']['last_name'])
        print("Credential:", item['basic']['credential'])
        print(MINOR_SEP)
        print("ADDRESSES")
        for row in item['addresses']:
            print(MINOR_SEP)
            print(row['address_purpose'])
            print("Address (Line 1):", row['address_1'])
            print("Address (Line 2):", row['address_2'])
            print("City:", row['city'])
            print("State:", row['state'])
            print("ZIP:", row['postal_code'])
            print("")
            print("Phone:", row['telephone_number'])
            print("Fax:", row['fax_number'])
        print(MINOR_SEP)
        print("LICENSES")
        for row in item['taxonomies']:
            print(MINOR_SEP)
            print("State License: {} - {}, {}".format(row['state'],row['license'],row['desc']))
        print(MINOR_SEP)
        print("OTHER IDENTIFIERS")
        for row in item['identifiers']:
            print(MINOR_SEP)
            print("Other Identifier: {} - {}, {}".format(row['state'],row['identifier'],row['desc']))
    print(MAJOR_SEP)
except KeyError as e:
    print("{} is not defined.".format(e))
Those try...except blocks, especially ones for very specific errors such as KeyError, should be wrapped around only the lines where they matter.
If you want to be able to continue processing, at least put the block inside the for loop, so that on an error it skips to the next item in the iteration. Even better would be to work out which values are actually necessary and substitute a dummy value for the others when they are missing.
For example: for row in item['addresses']:
Could be: for row in item.get('addresses', []):
That way, items without an address are still accepted.
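A minimal sketch of that idea, using get() with defaults at each level, is below. The describe function, the "N/A" placeholder and the sample items are assumptions for illustration, not part of the real API response.

```python
def describe(item):
    """Build output lines for one result, tolerating missing fields."""
    lines = ["NPI: {}".format(item.get("number", "N/A"))]
    basic = item.get("basic", {})  # empty dict when the whole section is absent
    lines.append("First Name: {}".format(basic.get("first_name", "N/A")))
    for row in item.get("addresses", []):  # empty list means the loop is skipped
        lines.append("City: {}".format(row.get("city", "N/A")))
    return lines


print(describe({"number": 1}))
print(describe({"number": 2, "basic": {"first_name": "Ann"},
                "addresses": [{"city": "Rome"}, {}]}))
```

No KeyError is ever raised here, so there is nothing to catch and the loop over results never gets interrupted.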
Try putting the try/except clause inside the for loop.
For example:
for item in result["results"]:
    try:
        # Code here.
    except KeyError as e:
        print("{} is not defined.".format(e))
Python documentation for exceptions: https://docs.python.org/3/tutorial/errors.html
You could also use contextlib.suppress (https://docs.python.org/3/library/contextlib.html#contextlib.suppress)
Example:
from contextlib import suppress
for item in result["results"]:
    with suppress(KeyError):
        # Code here