Downloading rss content in python - python

I am creating a Python application that downloads a list of RSS feeds from the internet. I have a list of 10 URLs to download.
I am using the urllib2 library that ships with Python. The code I am using is:
for url in urls:
    rssObject = urllib2.urlopen(url)
    data = rssObject.read()
    with open(self.SERVER_PATH+"\\feeds\\"+str(feedID)+str(extension), "w") as requiredData:
        requiredData.write(str(data))
        requiredData.close()
Here the first url is downloaded but while downloading the next url I get an error:
<urlopen error [Errno 66] unknown>
Is there any event which can notify me for the completion of the downloading of the first URL? Or is there any other way with the help of which I can avoid this issue?
Thanks in advance.

Is there any event which can notify me for the completion of the downloading of the first URL?
The raising of the Exception is notification that the URL cannot be downloaded.
Or is there any other way with the help of which I can avoid this issue?
Yes, you can catch the exception.
try:
    rssObject = urllib2.urlopen(url)
    data = rssObject.read()
except urllib2.URLError:
    # do stuff which handles the error
    pass
I'm not perfectly sure that's the error you need to catch, but hopefully you have the skills to work out exactly what to catch (if it's not URLError).

As a follow-up to John Mee's answer, and after reading your comments, you can try something like the following:
def complete_urlopen(url):
    complete = False
    while not complete:
        try:
            obj = urllib2.urlopen(url)
            complete = True
        except urllib2.URLError:
            # ignore the failure and try again
            pass
    return obj.read()
And then use it by replacing:
rssObject = urllib2.urlopen(url)
data = rssObject.read()
with:
data = complete_urlopen(url)
Be aware that this code assumes that the urlopen() will eventually succeed. If it never does, your process will hang in that while loop forever. A more sophisticated version of this could contemplate a max number of iterations, such that when they're reached, the process exits.
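A minimal sketch of the bounded version described above, written for Python 3 (where `urllib.request` replaces `urllib2`). The name `fetch_with_retries` and the parameters `max_attempts`, `delay`, and `opener` are illustrative assumptions, not part of the original answer.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, max_attempts=3, delay=1.0,
                       opener=urllib.request.urlopen):
    """Return the body of `url`, retrying on URLError at most `max_attempts` times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = opener(url)
            try:
                return response.read()
            finally:
                response.close()
        except urllib.error.URLError as err:
            last_error = err
            # Pause before the next attempt, but not after the last one.
            if attempt < max_attempts:
                time.sleep(delay)
    # Every attempt failed: surface the last error instead of looping forever.
    raise last_error
```

The `opener` argument exists only so the function can be exercised without network access; in normal use it can be left at its default.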

Related

Python requests-html session GET correct usage

I'm working on a web scraper that needs to open several thousand pages and extract some data.
Since one of the data fields I need most only appears after the site's JavaScript has run, I'm using requests-html to render the page and then get the data I need.
I want to know, what's the best way to do this?
1- Open a session at the start of the script, do all of my scraping, and close the session when the script finishes, after thousands of "clicks" and several hours?
2- Or should I open a session every time I open a link, render the page, get the data, close the session, and repeat n times in a loop?
Currently I'm using the second option, but I'm running into a problem. This is the code I'm using:
def getSellerName(listingItems):
    for item in listingItems:
        builtURL = item['href']
        try:
            session = HTMLSession()
            r = session.get(builtURL,timeout=5)
            r.html.render()
            sleep(1)
            sellerInfo = r.html.search("<ul class=\"seller_name\"></ul></div><a href=\"{user}\" target=")["user"]
            ##
            ##Do some stuff with sellerinfo
            ##
            session.close()
        except requests.exceptions.Timeout:
            log.exception("TimeOut Ex: ")
            continue
        except:
            log.exception("Gen Ex")
            continue
        finally:
            session.close()
        break
This works pretty well and is quite fast. However, after about 1.5 or 2 hours, I start getting OS exception like this one:
OSError: [Errno 24] Too many open files
And then that's it, I just get this exception over and over again, until I kill the script.
I'm guessing I need to close something else after every get and render, but I'm not sure what or if I'm doing it correctly.
Any help and/or suggestions, please?
Thanks!
You should create the session object once, outside the loop:
def getSellerName(listingItems):
    session = HTMLSession()
    for item in listingItems:
        # ... code ...
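A rough sketch of that single-session pattern, assuming only that the session object exposes `get()` and `close()`. The names `scrape_all`, `session_factory`, and `handle_page` are hypothetical and exist purely for illustration.

```python
from contextlib import closing


def scrape_all(urls, session_factory, handle_page, timeout=5):
    """Fetch every URL through one shared session and close it exactly once."""
    results = []
    # closing() calls session.close() when the block exits, even on error.
    with closing(session_factory()) as session:
        for url in urls:
            try:
                response = session.get(url, timeout=timeout)
                results.append((url, handle_page(response)))
            except Exception as err:
                # Record the failure and move on to the next URL.
                results.append((url, None))
                print("failed on %s: %s" % (url, err))
    return results
```

With requests-html, `session_factory` would be `HTMLSession`, and `handle_page` would call `response.html.render()` before searching the page.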

How to filter out crawler traps in a static corpus

I am doing a homework assignment where we are asked to write a program that crawls a given static corpus. My code prints all the URLs it crawled, and I know some of them are traps, but I can't think of a Pythonic way to filter those out.
I used a regex to filter out the trap-like URLs, but this is not allowed in the homework because it is considered hard-coding.
https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=4d26fc0839d47d4ec13c5461c1ed6d96
http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094
https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094
http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d504a3676483838e82f07064ca3e12ee
and more with similar structure. There are also calendar urls with similar structure, only changing days:
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=22&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=25&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=26&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=27&month=01&year=2017
I want to filter those out of my results but I can't think of any way.
I think this will solve your problem
import requests
for url in urls:
    try:
        response = requests.get(url)
        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except Exception as err:
        print(f'Other error occurred: {err}')
    else:
        print('Url is valid!')
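The snippet above checks whether each URL responds, which is a separate question from whether it is a trap. One possible heuristic, offered here as an assumption and not as the assignment's expected solution, is to group URLs by host, path, and query parameter names, then keep only a few URLs per group. The names `url_signature`, `filter_traps`, and `max_per_signature` are made up for this sketch.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def url_signature(url):
    """Reduce a URL to (host, path, sorted query parameter names)."""
    parts = urlsplit(url)
    names = tuple(sorted(
        name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
    ))
    return (parts.netloc.lower(), parts.path, names)


def filter_traps(urls, max_per_signature=2):
    """Keep at most `max_per_signature` URLs that share the same signature."""
    seen = Counter()
    kept = []
    for url in urls:
        signature = url_signature(url)
        seen[signature] += 1
        if seen[signature] <= max_per_signature:
            kept.append(url)
    return kept
```

Because the signature ignores the scheme and the parameter values, the `sectok` login links and the per-day calendar links from the question each collapse into a single group. The threshold is a trade-off: set too low, it also drops legitimate pages that differ only in a query value.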

How to download books automatically from Gutenberg

I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib
def get_response(url):
    response = requests.get(url).text
    return response
def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)',re.S)
    return re.findall(reg,html)
def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg,response)
def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg,response)
def download_book(book_url,path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path) #my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url,path)
        print('ok!!!')
    else:
        print('no!!!')
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0],book_name[0])
            except:
                continue
def main():
    get_url_name(start_url)
if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing,no tracebacks.
Well, there's no chance of getting a traceback when download_book() raises an exception, since you explicitly silence them:
try:
    download_book(book_url[0],book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0],book_name[0])
except Exception as e:
    print("while downloading book {} : got error {}".format(book_url[0], e))
    continue
or just don't catch exception at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse html. DON'T - regexps cannot reliably work on html, you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also some of your regexps look quite wrong (greedy match-all etc).
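To illustrate the second point, here is a small sketch that pulls links out of HTML with a real parser. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages; the class `LinkCollector` and the function `extract_links` are names invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only gather text while inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Unlike the regexes in the question, this does not care about newlines or nested tags between the elements it is looking for.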
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error, start_url is not defined
def main(start_url):
    get_url_name(start_url)
if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Never mind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For more on why you shouldn't parse html with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
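The effect of that newline can be reproduced on a short string. The `sample` markup below is a made-up fragment shaped like the page described above, not the actual response from the site.

```python
import re

# Hypothetical fragment with a newline between </h2> and <ul>.
sample = (
    '<h2><span class="mw-headline" id="A">Authors</span></h2>\n'
    '<ul><li><a href="/ebooks/1">Book</a></li></ul>'
)

# Pattern from the question: requires </h2> and <ul> to be adjacent.
strict = re.compile(
    r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S
)
# Pattern from this answer: allows exactly one newline in between.
fixed = re.compile(
    r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)', re.S
)

strict_matches = strict.findall(sample)
fixed_matches = fixed.findall(sample)
print(len(strict_matches), len(fixed_matches))
```

Replacing `\n` with `\s*` would tolerate any amount of whitespace, though the pattern stays fragile for the reasons given in the other answers.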
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0],book_name[0])
            except:
                continue

How to avoid program termination for urllib2.httperror 404 error and display appropriate message

I'm scraping the content out of 100k systematic URLS (example.com/entry/1 > example.com/entry/100000).
However, around 10% of the URLs have been deleted, meaning when the script gets to them it gives me the error "urllib2.httperror http error 404" and stops running.
I'm relatively new to python and was wondering if there's a way to do something like this:
if result == error:
    div_text = "missing"
So that the loop can continue to the next URL, but make a note that it failed.
urllib2.HTTPError is an exception raised by Python. You can wrap your URL call with a try/except block:
try:
    # ... put your URL open call here ...
except urllib2.HTTPError:
    div_text = 'missing'
This way, if this exception is encountered again, the Python interpreter will run the code inside that except block.
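Put into a loop, the idea might look like the following Python 3 sketch, where `urllib.error.HTTPError` plays the role of `urllib2.HTTPError`. The function name `scrape_entries` and the `opener` parameter are assumptions added for illustration.

```python
import urllib.error
import urllib.request


def scrape_entries(urls, opener=urllib.request.urlopen):
    """Map each URL to its body, or to a short marker when the request fails."""
    results = {}
    for url in urls:
        try:
            with opener(url) as response:
                results[url] = response.read()
        except urllib.error.HTTPError as err:
            # A 404 means the entry was deleted; note it and keep looping.
            if err.code == 404:
                results[url] = "missing"
            else:
                results[url] = "error %d" % err.code
    return results
```

The loop never stops on a deleted entry, and the returned mapping records which URLs failed and why.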

Python: handling exceptions when downloading non-existing files using urllib

I know how to download a file from the web using python, however I wish to handle cases where the file being requested does not exist. In which case, I want to print an error message ("404: File not found") and not write anything to disk. However, I still want to be able to continue executing the program (i.e. downloading other files in a list that may exist).
How do I do this? Below is some template code to download a file given its url (feel free to modify it if you believe there is a better way, but please keep it concise and simple).
import urllib
urllib.urlretrieve ("http://www.example.com/myfile.mp3", "myfile.mp3")
from urllib2 import HTTPError
try:
    # your file request code here
    pass
except HTTPError as e:
    # only HTTPError carries a status code; a plain URLError does not
    if e.code == 404:
        # your appropriate code here
        pass
    else:
        # raise maybe?
        raise
I followed this guide, which has a specific section about handling exceptions, and found it really helpful.
import urllib, urllib2
try:
    urllib.urlretrieve("http://www.example.com/", "myfile.mp3")
except urllib2.URLError as e:
    if e.code == 404:
        print "4 0 4"
    else:
        print "%s" % e
This is what your code does. It retrieves the web page at www.example.com and writes it to myfile.mp3. It does not raise an exception because it is not looking for myfile.mp3; it simply writes whatever HTML it receives to myfile.mp3.
If you are looking for code to download files at a certain location on the web, try this
How do I download a zip file in python using urllib2?
Your code should look like this:
try:
    urllib.urlretrieve("http://www.example.com/myfile.mp3", "myfile.mp3")
except urllib2.URLError as e:
    if e.code == 404:
        print 'file not found. moving on...'
        pass
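For Python 3, a hedged sketch of the behaviour asked for in the question (print a message on 404, write nothing to disk, and carry on) could look like this. The function name `download_if_exists` and the `opener` parameter are illustrative assumptions.

```python
import shutil
import urllib.error
import urllib.request


def download_if_exists(url, dest_path, opener=urllib.request.urlopen):
    """Save `url` to `dest_path`; return False without writing on a 404."""
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            print("404: File not found")
            return False
        raise  # other HTTP errors are left for the caller to handle
    # The destination file is opened only after the request succeeded.
    with response, open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return True
```

Calling it once per URL in a list lets the program continue past missing files, since a 404 returns `False` and does not propagate.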
