Python 3 variable in request causing 404?

I've got two lists, one of dates and one of hours, and I'm scanning a website looking for content.
date = ['1','2','3']
hour = ['1','2','3']
I've written the following while/for loops to go through the dates and hours and open all the combinations:
datetotry = 0
while datetotry < len(date):
    for i in range(len(hour)):
        print('https://www.website.org/backups/backup_2004-09-' + date[datetotry] + '_' + hour[i] + '.sql')
        html = opener.open('https://www.website.org/backups/backup_2004-09-' + date[datetotry] + '_' + hour[i] + '.sql').read()
    datetotry += 1
When the console prints the URL it looks okay, with the variables being replaced by the numbers from the list.
But it's possibly not replacing the variables in the actual URL request.
The code was stopping due to a 404 error but I think I handled that with the info I found here:
https://docs.python.org/3/howto/urllib2.html#wrapping-it-up
The first part of the 404 error was showing the date[datetotry]+'_'+hour[i]+ section, instead of the items from the list like when it's printing to the console.
Does this mean I have to do something like urllib.parse.urlencode to actually replace the variables?
I imported the libraries mentioned in the article and changed the code to:
from urllib.error import URLError, HTTPError
from urllib.request import Request, urlopen

while datetotry < len(date):
    for i in range(len(hour)):
        html = Request('https://www.website.org/backups/backup_2004-09-' + date[datetotry] + '_' + hour[i] + '.sql')
        try:
            response = urlopen(html)
        except HTTPError as e:
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        except URLError as e:
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        else:
So the code actually runs now, not stopping when a 404 is returned. What is the best way to see what it's actually requesting, and do I have to do some kind of encoding? I'm new to programming, especially Python 3.
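You don't need urllib.parse.urlencode here: that is for building query strings and form data, and plain concatenation (or str.format) is enough as long as the values are already URL-safe digits. To see exactly what is being requested, you can print Request.full_url before opening it. A minimal sketch, reusing the placeholder URL and the date/hour lists from the question:

from urllib.error import URLError, HTTPError
from urllib.request import Request, urlopen

date = ['1', '2', '3']
hour = ['1', '2', '3']

for d in date:
    for h in hour:
        url = 'https://www.website.org/backups/backup_2004-09-{}_{}.sql'.format(d, h)
        req = Request(url)
        print('Requesting:', req.full_url)   # exactly what urlopen() will fetch
        try:
            response = urlopen(req)
        except HTTPError as e:
            print('Server error:', e.code)             # e.g. 404 for a missing backup
        except URLError as e:
            print('Failed to reach server:', e.reason)
        else:
            data = response.read()                      # handle the response here

Note that HTTPError is a subclass of URLError, so the HTTPError clause has to come first or it will never be reached.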

Related

How to filter out crawler traps in a static corpus

I am doing a homework assignment where we are asked to write a program to crawl a given static corpus. In the output, my code prints all the URLs crawled, but I know some of them are traps, and I can't think of a way to filter those out in a Pythonic way.
I used a regex to filter the trap-like URLs out, but this is not allowed in the assignment, as it is considered hard-coding.
https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=4d26fc0839d47d4ec13c5461c1ed6d96
http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094
https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094
http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d504a3676483838e82f07064ca3e12ee
and more with a similar structure. There are also calendar URLs with a similar structure, where only the day changes:
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=22&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=25&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=26&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=27&month=01&year=2017
I want to filter those out of my results but I can't think of any way.
I think this will solve your problem:
import requests

for url in urls:
    try:
        response = requests.get(url)
        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except Exception as err:
        print(f'Other error occurred: {err}')
    else:
        print('Url is valid!')
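Checking each URL with requests only tells you whether it still responds; it does not actually drop the traps from your results. One regex-free way to filter them (a sketch, assuming the traps are exactly the login and calendar patterns shown in the question, so the markers used here are assumptions drawn from those examples) is to inspect the query parameters with urllib.parse:

from urllib.parse import urlparse, parse_qs

def is_trap(url):
    # Heuristic trap check based on assumed query-string markers.
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    # Login pages whose session token changes on every visit
    if 'sectok' in params or params.get('do') == ['login']:
        return True
    # Calendar pages that only differ by day/month/year
    if parsed.path.endswith('calendar.php') and {'day', 'month', 'year'} & params.keys():
        return True
    return False

filtered = [u for u in urls if not is_trap(u)]

A more general (and less hard-coded) idea is to strip the query string from every URL and count how many variants share the same path; a path with hundreds of near-identical variants is very likely a trap.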

How to download books automatically from Gutenberg

I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {} : got error {}".format(book_url[0], e))
    continue
or just don't catch exceptions at all (at least until you know what to expect and how to handle them).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive Python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the built-in step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse HTML. Don't - regexps cannot reliably work on HTML; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping, as it's very tolerant). Also, some of your regexps look quite wrong (greedy match-all etc.).
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. Either way, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup: from bs4 import BeautifulSoup. For information on why you shouldn't parse HTML with a regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
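Following up on the BeautifulSoup suggestion, a rough sketch of the same scraping step without regexes might look like this (the link-filtering here is an assumption for illustration; the exact markup of the bookshelf page is not verified):

import requests
from bs4 import BeautifulSoup

start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'

response = requests.get(start_url)
response.raise_for_status()  # fail loudly instead of silently getting nothing

soup = BeautifulSoup(response.text, 'html.parser')

# Collect every link on the page; narrowing this down to actual book pages
# is left to you, since the bookshelf page's markup is assumed here.
for a in soup.select('a[href]'):
    name = a.get_text(strip=True)
    href = a['href']
    if name and href.startswith('http'):
        print(name, '->', href)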
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial URL has a newline between </h2> and <ul>; try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error; I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue

How to avoid program termination for urllib2.HTTPError 404 error and display an appropriate message

I'm scraping the content out of 100k systematic URLs (example.com/entry/1 to example.com/entry/100000).
However, around 10% of the URLs have been deleted, meaning that when the script gets to them it raises "urllib2.HTTPError: HTTP Error 404" and stops running.
I'm relatively new to Python and was wondering if there's a way to do something like this:
if result == error:
    div_text = "missing"
So that the loop can continue to the next URL, but make a note that it failed.
urllib2.HTTPError is an exception raised by Python. You can wrap your URL call with a try/except block:
try:
    # ... put your URL open call here ...
except urllib2.HTTPError:
    div_text = 'missing'
This way, if this exception is encountered again, the Python interpreter will run the code inside that except block.
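Putting that together for the loop over all 100k entries, a sketch might look like this (Python 2 / urllib2 as in the question; extract_div() is a hypothetical stand-in for whatever parsing you already do):

import urllib2

results = {}
for entry_id in range(1, 100001):
    url = 'http://example.com/entry/%d' % entry_id
    try:
        html = urllib2.urlopen(url).read()
        div_text = extract_div(html)  # hypothetical: your existing parsing code
    except urllib2.HTTPError:
        div_text = 'missing'          # note the failure and keep going
    results[entry_id] = div_text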

Handling bad URLs with requests

Sorry in advance for the beginner question. I'm just learning how to access web data in Python, and I'm having trouble understanding exception handling in the requests package.
So far, when accessing web data using the urllib package, I wrap the urlopen call in a try/except structure to catch bad URLs, like this:
import urllib, sys

url = 'https://httpbinTYPO.org/' # Note the typo in my URL

try:
    uh = urllib.urlopen(url)
except:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text
This is obviously kind of a crude way to do it, as it can mask all kinds of problems other than bad URLs.
From the documentation, I had sort of gathered that you could avoid the try/except structure when using the requests package, like this:
import requests, sys

url = 'https://httpbinTYPO.org/' # Note the typo in my URL

r = requests.get(url)
if r.raise_for_status() is not None:
    print 'Failed to open url.'
    sys.exit()

text = r.text
print text
However, this clearly doesn't work (throws an error and a traceback). What's the "right" (i.e., simple, elegant, Pythonic) way to do this?
Try catching the connection error:
from requests.exceptions import ConnectionError

try:
    requests.get('https://httpbinTYPO.org/')
except ConnectionError:
    print 'Failed to open url.'
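If you also want HTTP error statuses (4xx/5xx) treated as failures, not just unreachable hosts, one common pattern (a sketch, not the only "right" way; Python 2 prints to match the question) is to call raise_for_status() inside the try block and catch the base RequestException, which covers connection errors and HTTP errors alike:

import sys
import requests
from requests.exceptions import RequestException

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL
try:
    r = requests.get(url)
    r.raise_for_status()  # turns 4xx/5xx responses into exceptions too
except RequestException as e:
    print 'Failed to open url:', e
    sys.exit()

print r.text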
You can specify a kind of exception after the keyword except. So to catch just errors that come from bad connections, you can do:
import urllib, sys

url = 'https://httpbinTYPO.org/' # Note the typo in my URL

try:
    uh = urllib.urlopen(url)
except IOError:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text

Python: handling exceptions when downloading non-existent files using urllib

I know how to download a file from the web using Python, but I wish to handle cases where the requested file does not exist. In that case, I want to print an error message ("404: File not found") and not write anything to disk. However, I still want to be able to continue executing the program (i.e. downloading other files in a list that may exist).
How do I do this? Below is some template code to download a file given its URL (feel free to modify it if you believe there is a better way, but please keep it concise and simple).
import urllib

urllib.urlretrieve("http://www.example.com/myfile.mp3", "myfile.mp3")
from urllib2 import URLError

try:
    # your file request code here
except URLError, e:
    if e.code == 404:
        # your appropriate code here
    else:
        # raise maybe?
I followed this guide, which has a specific section about handling exceptions, and found it really helpful.
import urllib, urllib2

try:
    urllib.urlretrieve("http://www.example.com/", "myfile.mp3")
except urllib2.URLError, e:
    if e.code == 404:
        print "4 0 4"
    else:
        print "%s" % e
This is what your code does. It basically tries to retrieve the web page at www.example.com and writes it to myfile.mp3. It does not end in an exception because it is not looking for myfile.mp3; it simply writes whatever HTML it gets back to myfile.mp3.
If you are looking for code to download files at a certain location on the web, try this:
How do I download a zip file in python using urllib2?
Your code should look like this:
try:
    urllib.urlretrieve("http://www.example.com/myfile.mp3", "myfile.mp3")
except URLError, e:
    if e.code == 404:
        print 'file not found. moving on...'
        pass
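Note that in Python 2 urllib.urlretrieve() does not raise on a 404 at all, which is why the checks above never fire; it simply saves the error page. One way around that (a sketch using urllib2.urlopen(), which does raise HTTPError) is to open the URL first and only write to disk when the request succeeds; download_if_exists() is a hypothetical helper name:

import urllib2

def download_if_exists(url, path):
    # Download url to path; print a message instead of saving when the server returns 404.
    try:
        response = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        if e.code == 404:
            print '404: File not found -', url
        else:
            raise  # some other HTTP problem: re-raise it
    else:
        with open(path, 'wb') as f:
            f.write(response.read())  # only written when the request succeeded

download_if_exists("http://www.example.com/myfile.mp3", "myfile.mp3")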
