I have been having a persistent problem getting an RSS feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher-level interfaces handle this problem properly. This problem isn't really a showstopper, since I don't need to retrieve the feed very often.
I have read about a solution that traps the exception and returns the partial content, but since the incomplete reads differ in the number of bytes actually retrieved, I have no certainty that such a solution will actually work.
#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead
url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)
print "If you see this, please tell me what happened."
# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e
# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e
# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e
# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to
# learn what's happening. Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()
content, info = get_rss_feed(url)
feed = feedparser.parse(content)
As already stated, this isn't a mission-critical problem, just a curiosity: even though I can expect urllib2 to have this problem, I am surprised that the error is encountered in mechanize and requests as well. The feedparser module doesn't even throw an error, so checking for errors there depends on the presence of a 'bozo_exception' key.
Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. I have yet to find a pure Python method that works, apart from my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.
P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, yet mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without explanation or understanding.
At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib which is where the exception is being thrown.
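To see that for yourself, here is a minimal sketch of my own (not from the question, assuming Python 2's httplib and urlparse) that hits the URL with httplib directly, bypassing the higher-level libraries, and catches the exception where it is actually raised:
import httplib
import urlparse

# Shortened placeholder URL; substitute the full feed URL from the question.
url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar'
parts = urlparse.urlparse(url)

conn = httplib.HTTPConnection(parts.netloc)
conn.request('GET', parts.path + ('?' + parts.query if parts.query else ''))
resp = conn.getresponse()
try:
    body = resp.read()
except httplib.IncompleteRead as e:
    # e.partial holds whatever bytes arrived before the read was cut short
    body = e.partial
conn.close()
print(len(body))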
Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:
>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
'Content-Type: text/xml; charset=utf-8\r\n',
'Server: Microsoft-IIS/7.5\r\n',
'X-AspNet-Version: 4.0.30319\r\n',
'X-Powered-By: ASP.NET\r\n',
'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
'Via: 1.1 BC1-ACLD\r\n',
'Transfer-Encoding: chunked\r\n',
'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)
So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:
>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact the partial read is returned as an attribute on the exception to capture the entire contents:
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> print contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:
import httplib
def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
I applied the patch and then feedparser worked:
>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
'encoding': 'utf-8',
'entries': ...
'status': 200,
'version': 'rss20'}
This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocols to say for sure whether the server is doing things wrong or whether httplib is mishandling an edge case; given the Transfer-Encoding: chunked and Connection: close headers above, my guess is that the server closes the connection without sending the terminating zero-length chunk, which httplib reports as an IncompleteRead.
I found that in my case, sending an HTTP/1.0 request fixed the problem; I just added this to the code:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
After that, I make the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
I fixed the issue by using HTTPS instead of HTTP and it works fine. No code change was required.
Related
I'm trying to access an authenticated site using a cookies.txt file (generated with a Chrome extension) with Python Requests:
import requests, cookielib
cj = cookielib.MozillaCookieJar('cookies.txt')
cj.load()
r = requests.get(url, cookies=cj)
It doesn't throw any error or exception, but yields the login screen, incorrectly. However, I know that my cookie file is valid, because I can successfully retrieve my content using it with wget. Any idea what I'm doing wrong?
Edit:
I'm tracing cookielib.MozillaCookieJar._really_load and can verify that the cookies are correctly parsed (i.e. they have the correct values for the domain, path, secure, etc. tokens). But as the transaction is still resulting in the login form, it seems that wget must be doing something additional (as the exact same cookies.txt file works for it).
MozillaCookieJar inherits from FileCookieJar which has the following docstring in its constructor:
Cookies are NOT loaded from the named file until either the .load() or
.revert() method is called.
You need to call .load() method then.
Also, as Jermaine Xu noted, the first line of the file needs to contain either the # Netscape HTTP Cookie File or # HTTP Cookie File string. Files generated by the plugin you use do not contain such a string, so you have to insert it yourself. I raised an appropriate bug at http://code.google.com/p/cookie-txt-export/issues/detail?id=5
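If you don't want to edit the file by hand every time you export it, a small helper of my own (just a sketch, not part of cookielib) can prepend the magic header when it is missing:
def ensure_netscape_header(path):
    """Prepend the '# Netscape HTTP Cookie File' header if the export lacks it."""
    with open(path, 'r') as fp:
        text = fp.read()
    if not (text.startswith('# Netscape HTTP Cookie File') or
            text.startswith('# HTTP Cookie File')):
        with open(path, 'w') as fp:
            fp.write('# Netscape HTTP Cookie File\n' + text)

ensure_netscape_header('cookies.txt')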
EDIT
Session cookies are saved with 0 in the 5th column. If you don't pass ignore_expires=True to load() method all such cookies are discarded when loading from a file.
File session_cookie.txt:
# Netscape HTTP Cookie File
.domain.com TRUE / FALSE 0 name value
Python script:
import cookielib
cj = cookielib.MozillaCookieJar('session_cookie.txt')
cj.load()
print len(cj)
Output:
0
EDIT 2
Although we managed to get cookies into the jar above they are subsequently discarded by cookielib because they still have 0 value in the expires attribute. To prevent this we have to set the expire time to some future time like so:
for cookie in cj:
    # set cookie expire date to 14 days from now
    cookie.expires = time.time() + 14 * 24 * 3600
EDIT 3
I checked both wget and curl and both use 0 expiry time to denote session cookies which means it's the de facto standard. However Python's implementation uses empty string for the same purpose hence the problem raised in the question. I think Python's behavior in this regard should be in line with what wget and curl do and that's why I raised the bug at http://bugs.python.org/issue17164
I'll note that replacing 0s with empty strings in the 5th column of the input file and passing ignore_discard=True to load() is the alternate way of solving the problem (no need to change expiry time in this case).
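Putting the pieces together, here is a sketch of how the whole load could look (my own combination of the fixes above, assuming the header line is already present in cookies.txt):
import time
import cookielib
import requests

cj = cookielib.MozillaCookieJar('cookies.txt')
# keep session cookies (0 in the expiry column) instead of discarding them
cj.load(ignore_expires=True, ignore_discard=True)
for cookie in cj:
    if not cookie.expires:
        # give session cookies a future expiry so cookielib does not drop them
        cookie.expires = int(time.time()) + 14 * 24 * 3600

# example.com is a placeholder for the authenticated site
r = requests.get('https://example.com/', cookies=cj)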
I tried taking into account everything that Piotr Dobrogost had valiantly figured out about MozillaCookieJar but to no avail. I got fed up and just parsed the damn cookies.txt myself and now all is well:
import re
import requests
def parseCookieFile(cookiefile):
    """Parse a cookies.txt file and return a dictionary of key value pairs
    compatible with requests."""
    cookies = {}
    with open(cookiefile, 'r') as fp:
        for line in fp:
            if not re.match(r'^\#', line):
                lineFields = line.strip().split('\t')
                cookies[lineFields[5]] = lineFields[6]
    return cookies
cookies = parseCookieFile('cookies.txt')
import pprint
pprint.pprint(cookies)
r = requests.get('https://example.com', cookies=cookies)
This worked for me:
from http.cookiejar import MozillaCookieJar
from pathlib import Path
import requests
cookies = Path('/Users/name/cookies.txt')
jar = MozillaCookieJar(cookies)
jar.load()
requests.get('https://path.to.site.com', cookies=jar)
<Response [200]>
I tried editing Tristan's answer to add some info, but it seems the SO edit queue is full, so I am writing this answer instead, since I have struggled quite a bit with using existing cookies with Python Requests.
First, get the cookies from Chrome. The easiest way is to use an extension called 'cookies.txt':
https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid/related
After downloading those cookies, use the code below to make sure that you are able to parse the file without any issues.
import re, requests, pprint
def parseCookieFile(cookiefile):
    """Parse a cookies.txt file and return a dictionary of key value pairs
    compatible with requests."""
    cookies = {}
    with open(cookiefile, 'r') as fp:
        for line in fp:
            if not re.match(r'^\#', line):
                lineFields = re.findall(r'[^\s]+', line)  # capture anything but whitespace
                try:
                    cookies[lineFields[5]] = lineFields[6]
                except Exception as e:
                    print(e)
    return cookies
cookies = parseCookieFile('cookies.txt') #replace the filename
pprint.pprint(cookies)
Next, use those cookies with Python Requests:
x = requests.get('your__url', verify=False, cookies=cookies)
print (x.content)
This should save you from going through different SO posts and trying cookielib and the other methods that never worked for me.
I finally found a way to make it work (I got the idea by looking at curl's verbose output): instead of loading my cookies from a file, I simply created a dict with the required name/value pairs:
cd = {'v1': 'n1', 'v2': 'n2'}
r = requests.get(url, cookies=cd)
and it worked (although it doesn't explain why the previous method didn't). Thanks for all the help, it's really appreciated.
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib
def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book() since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue
or just don't catch the exception at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger (a minimal example follows this list)
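For 3/, the quickest entry point is a breakpoint dropped straight into the code (a generic example of my own, not tied to your script):
# Option A: run the whole script under the debugger from the shell:
#     python -m pdb your_script.py
# Option B: drop a breakpoint where you want execution to pause:
import pdb

def buggy_function(x):
    pdb.set_trace()  # pauses here; 'p x' prints x, 'n' steps to the next line
    return x * 2

buggy_function(21)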
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse HTML. DON'T - regexps cannot reliably work on HTML; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant), as in the sketch below. Also some of your regexps look quite wrong (greedy match-all etc.).
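As a rough illustration of 2/ (my own sketch, assuming BeautifulSoup 4 is installed; the selection logic is a guess and would need adapting to the actual page structure):
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://www.gutenberg.org/wiki/Category:Classics_Bookshelf')
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of silently getting nothing
soup = BeautifulSoup(resp.text, 'html.parser')

# list every link on the page; real code would narrow this down to book entries
for a in soup.find_all('a', href=True):
    print(a['href'], a.get_text(strip=True))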
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable you are using. In any case, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup, from bs4 import BeautifulSoup. For any information regarding why you shouldn't parse html with regex, see this answer RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
What is the most appropriate method to determine the encoding of the text from a webpage? I process webpages in various languages and use Python and the requests library. The aim is eventually to be able to get clean text using some text extraction library for text mining projects.
resp = requests.get(url)
Now I know that we have the following options:
1)
from requests.utils import get_encoding_from_headers
encoding = get_encoding_from_headers(resp.headers)
html = (resp.content).decode(encoding)
2)
from requests_toolbelt.utils.deprecated import get_encodings_from_content
encoding = get_encodings_from_content(resp.content)
html = (resp.content).decode(encoding)
3)
from requests_toolbelt.utils.deprecated import get_unicode_from_response
html = get_unicode_from_response(resp)
I processed around 1,000 URLs and was expecting 1) and 2) to be the same, but 20% of the time that wasn't the case. In those 20% of cases, (1) would give "ISO-8859-1", which from looking at the code means it didn't find the charset in the header, and (2) mostly gave "utf8".
Does someone have experience with this: which of these is the most appropriate technique, or does a better, cleaner way exist?
Don't trust the declared encoding, just test it directly on the HTML code you got with the chardet or cchardet (faster) libraries:
import requests
try:
    # this module is much faster
    import cchardet as chardet
except ImportError:
    import chardet

response = requests.get(url)
guessed_encoding = chardet.detect(response.content)['encoding']
if guessed_encoding is not None:
    try:
        htmltext = response.content.decode(guessed_encoding)
    except UnicodeDecodeError:
        htmltext = response.text
else:
    htmltext = response.text

# do something with htmltext...
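For what it's worth, requests itself exposes a chardet-based guess via Response.apparent_encoding (assuming a reasonably recent requests version), which gives you roughly the same detection without the extra imports:
import requests

resp = requests.get('http://example.com/')  # placeholder URL
resp.encoding = resp.apparent_encoding  # override the header-derived encoding guess
htmltext = resp.text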
Sorry in advance for the beginner question. I'm just learning how to access web data in Python, and I'm having trouble understanding exception handling in the requests package.
So far, when accessing web data using the urllib package, I wrap the urlopen call in a try/except structure to catch bad URLs, like this:
import urllib, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
try: uh = urllib.urlopen(url)
except:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text
This is obviously kind of a crude way to do it, as it can mask all kinds of problems other than bad URLs.
From the documentation, I had sort of gathered that you could avoid the try/except structure when using the requests package, like this:
import requests, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
r = requests.get(url)
if r.raise_for_status() is not None:
    print 'Failed to open url.'
    sys.exit()
text = r.text
print text
However, this clearly doesn't work (throws an error and a traceback). What's the "right" (i.e., simple, elegant, Pythonic) way to do this?
Try to catch connection error:
from requests.exceptions import ConnectionError
try:
    requests.get('https://httpbinTYPO.org/')
except ConnectionError:
    print 'Failed to open url.'
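More broadly, every exception requests raises derives from requests.exceptions.RequestException, and raise_for_status() turns 4xx/5xx responses into exceptions too, so a catch-all along these lines (a sketch using the same example URL) covers both the bad-hostname case and error responses:
import sys
import requests
from requests.exceptions import RequestException

url = 'https://httpbinTYPO.org/'  # Note the typo in the URL
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except RequestException as e:
    print('Failed to open url: {}'.format(e))
    sys.exit(1)

print(r.text)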
You can specify a kind of exception after the keyword except. So to catch just errors that come from bad connections, you can do:
import urllib, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
try: uh = urllib.urlopen(url)
except IOError:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text