Getting and trapping HTTP response using Mechanize in Python

I am trying to get the response codes from Mechanize in Python. While I can get a 200 status code, anything else isn't surfaced (a 404 throws an exception and a 30x redirect is silently followed). Is there a way to get the original status code?
Thanks

Errors will throw an exception, so just use try:...except:... to handle them.
Your Mechanize browser object has a method set_handle_redirect() that you can use to turn 30x redirection on or off. Turn it off and you get an error for redirects, which you can handle just like any other error:
>>> from mechanize import Browser
>>> browser = Browser()
>>> resp = browser.open('http://www.oxfam.com') # this generates a redirect
>>> resp.geturl()
'http://www.oxfam.org/'
>>> browser.set_handle_redirect(False)
>>> resp = browser.open('http://www.oxfam.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build\bdist.win32\egg\mechanize\_mechanize.py", line 209, in open
File "build\bdist.win32\egg\mechanize\_mechanize.py", line 261, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 301: Moved Permanently
>>>
>>> from urllib2 import HTTPError
>>> try:
...     resp = browser.open('http://www.oxfam.com')
... except HTTPError, e:
...     print "Got error code", e.code
...
Got error code 301
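Putting the two pieces together, a helper along these lines (a sketch: get_status is a made-up name, and it assumes the response exposes its status via .code, as urllib2 responses do) returns the original status code in every case:
from mechanize import Browser
from urllib2 import HTTPError

def get_status(url):
    # With redirects disabled, 30x raises HTTPError just like 4xx/5xx,
    # so every status ends up either on the response or on the exception.
    browser = Browser()
    browser.set_handle_redirect(False)
    try:
        return browser.open(url).code
    except HTTPError as e:
        return e.code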

In twill, do get_browser().get_code().
twill is an outstanding automation and test layer built on top of mechanize that makes it easier to use. It is seriously handy.
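For reference, an untested sketch of the twill equivalent (go() and get_browser() are from twill's command API):
from twill.commands import go
from twill import get_browser

go('http://www.oxfam.com')      # follows the redirect
print get_browser().get_code()  # status code of the final response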

Related

urllib IncompleteRead() error: can I solve it by just re-requesting?

I am running a script that is scraping several hundred pages on a site, but recently I have been running into IncompleteRead() errors. My understanding from looking on Stack Overflow is that they can happen for any number of unknown reasons.
From searching around, I believe the error is caused randomly by the Request() call:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                  headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
3.5.2.3
2.1.3.15
2.5.1.72
1.5.1.2
6.1.1.9
3.2.2.27
Traceback (most recent call last):
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 554, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 521, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 571, in _readall_chunked
chunk_left = self._get_chunk_left()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 556, in _get_chunk_left
raise IncompleteRead(b'')
IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-20-82f1876d3006>", line 5, in <module>
html = urlopen(url).read()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 464, in read
return self._readall_chunked()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 578, in _readall_chunked
raise IncompleteRead(b''.join(value))
IncompleteRead: IncompleteRead(1772944 bytes read)
The error happens at random, as in it is not always the same URL that causes it; https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27 caused this specific one.
Some solutions seem to introduce a try clause, but within the except they store the partial data (I think). Why is that the case? Why not just resubmit the request?
How would I just re-run the request, given that doing so normally seems to solve the issue? Beyond this I have no idea how to fix the problem.
As per Serge's answer, a try block seems to be the way:
The stack trace suggests that you are reading a chunked transfer-encoded response and that, for some reason, you lost the connection between two chunks.
As you have said, this can happen for numerous causes, and the occurrence is random. So:
you cannot predict when or for which file it will happen
you cannot prevent it from happening
The best you can do is to catch the error and retry, after an optional delay.
For example:
import time
import http.client
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                  headers={'User-Agent': 'Mozilla/5.0'})
    sleep = 0
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise  # give up after 4 attempts
            time.sleep(sleep)  # optionally wait before retrying
            sleep += 5
    soup = BeautifulSoup(html, 'html.parser')
I have faced the same issue and found this solution.
After some small changes, the code looks like this:
import json
from http.client import IncompleteRead, HTTPResponse
from urllib.request import urlopen
from urllib.error import URLError, HTTPError
...

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except IncompleteRead as e:
            return e.partial  # return whatever was read before the drop
    return inner

HTTPResponse.read = patch_http_response_read(HTTPResponse.read)

try:
    response = urlopen(my_url)
    result = json.loads(response.read().decode('UTF-8'))
except HTTPError as e:  # HTTPError first: it is a subclass of URLError
    print('HTTP Error code: ', e.code)
except URLError as e:
    print('URL Error Reason: ', e.reason)
I'm not sure that this is the best way, but it works in my case. I'll be happy if this advice is useful to you or helps you find a different, better solution. Happy coding!

Python Requests unable to get redirected URL

I have defined the following function to get the redirected URL using the Requests library. However, I get the error KeyError: 'location'.
import requests

def get_redirected_url(r_url):
    r = requests.get(r_url, allow_redirects=False)
    url = r.headers['Location']
    return url
Calling the function
get_redirected_url('http://omgili.com/ri/.wHSUbtEfZQujfav8g98PjRMi_ogV.5EwBTfg476RyS2Gqya3tDAwNIv8Yi8wQ9AK4.U2mxeyq2_xbUjqsOx8NYY8r0qgxD.4Bm2SrouZKnrg1jqRxEfVmGbtTaKTaaDJtOjtS46fYr6A5UJoh9BYxVtDGJIsbSfgshRXR3FVr4-')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in get_redirected_url
File "/home/user/PycharmProjects/untitled/venv/lib/python3.6/site-packages/requests/structures.py", line 54, in __getitem__
return self._store[key.lower()][1]
KeyError: 'location'
Is it failing because the redirection waits for 5 seconds? If so, how do we incorporate that as well?
I have tried other answers, like this one and this one, but I am unable to crack it.
It is simple: r.headers doesn't have a 'Location' key. You may have used the wrong key.
Edit: the site you want to browse with requests is protected.
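If you only need the final destination, a more defensive variant (a sketch) lets requests follow the redirect chain itself, so a missing Location header can never raise:
import requests

def get_redirected_url(r_url):
    # r.url is the final URL after all hops and equals r_url when
    # nothing redirected; r.history lists the intermediate responses.
    r = requests.get(r_url)
    return r.url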

Error when trying to open a webpage with mechanize

I'm trying to learn mechanize to create a chat logging bot later, so I tested out some basic code
import mechanize as mek
import re
br = mek.Browser()
br.open("google.com")
However, whenever I run it, I get this error.
Traceback (most recent call last):
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/mechanize/_mechanize.py", line 262, in _mech_open
url.get_full_url
AttributeError: 'str' object has no attribute 'get_full_url'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 5, in <module>
br.open("google.com")
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/mechanize/_mechanize.py", line 253, in open
return self._mech_open(url_or_request, data, timeout=timeout)
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/mechanize/_mechanize.py", line 269, in _mech_open
raise BrowserStateError("can't fetch relative reference: "
mechanize._mechanize.BrowserStateError: can't fetch relative reference: not viewing any document
I double checked with the documentation on the mechanize page and it seems consistent. What am I doing wrong?
You have to use a scheme, otherwise mechanize thinks you are trying to open a local/relative path (as the error suggests).
br.open("google.com") should be br.open("http://google.com").
Then you will see an error mechanize._response.httperror_seek_wrapper: HTTP Error 403: b'request disallowed by robots.txt', because google.com does not allow crawlers. This can be remedied with br.set_handle_robots(False) before open.
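Putting both fixes together, the corrected version of the original snippet would be:
import mechanize as mek

br = mek.Browser()
br.set_handle_robots(False)   # don't refuse pages disallowed by robots.txt
br.open("http://google.com")  # absolute URL, scheme included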

ValueError: unknown url type

The title pretty much says it all. Here's my code:
from urllib2 import urlopen as getpage
print = getpage("www.radioreference.com/apps/audio/?ctid=5586")
and here's the traceback error I get:
Traceback (most recent call last):
File "C:/Users/**/Dropbox/Dev/ComServ/citetest.py", line 2, in <module>
contents = getpage("www.radioreference.com/apps/audio/?ctid=5586")
File "C:\Python25\lib\urllib2.py", line 121, in urlopen
return _opener.open(url, data)
File "C:\Python25\lib\urllib2.py", line 366, in open
protocol = req.get_type()
File "C:\Python25\lib\urllib2.py", line 241, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: www.radioreference.com/apps/audio/?ctid=5586
My best guess is that urllib can't retrieve data from untidy PHP URLs. If this is the case, is there a workaround? If not, what am I doing wrong?
You should first try to add 'http://' in front of the URL. Also, do not store the result in print, as that rebinds the name print to another (non-callable) object.
So this line should be:
page_contents = getpage("http://www.radioreference.com/apps/audio/?ctid=5586")
This returns a file-like object. To read its contents you need to use file-manipulation methods, like this:
for line in page_contents.readlines():
    print line
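If your inputs may or may not carry a scheme, a small normalizing wrapper (a sketch; fetch is a made-up name) sidesteps the ValueError entirely:
from urllib2 import urlopen

def fetch(url):
    # urllib2 rejects scheme-less URLs with "unknown url type",
    # so default to http:// when none is given.
    if '://' not in url:
        url = 'http://' + url
    return urlopen(url).read()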
You need to pass a full URL: i.e., it must begin with http://.
Simply use http://www.radioreference.com/apps/audio/?ctid=5586 and it'll work fine.
In [24]: from urllib2 import urlopen as getpage
In [26]: print getpage("http://www.radioreference.com/apps/audio/?ctid=5586")
<addinfourl at 173987116 whose fp = <socket._fileobject object at 0xa5eb6ac>>

Getting InvalidURLError: ApplicationError: 1 in URLFetch

I am getting the following error:
InvalidURLError: ApplicationError: 1
Checked my code and logged various things, and the URLs causing this error look pretty normal. They are quoted through urllib.quote, and visiting them in a browser gives a normal result.
The error is happening with many URLs, not just one. The URLs point to an API service and are constructed within the app.
Btw,here's a link to the google.appengine.api.urlfetch source code: http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/api/urlfetch.py?r=56.
The docstrings say that the error should happen when: "InvalidURLError if the url was invalid." and "If the URL is an empty string or obviously invalid, we throw an urlfetch.InvalidURLError"
Just to make it simple for those who would like to test this:
url = 'http://api.embed.ly/1/oembed?key=REMOVEDKEY&maxwidth=400&urls=http%3A//V.interesting.As,http%3A//abcn.ws/z26G9a,http%3A//apne.ws/z37VyP,http%3A//bambuser.com/channel/baba-omer/broadcast/2348417,http%3A//bambuser.com/channel/baba-omer/broadcast/2348417,http%3A//bambuser.com/channel/baba-omer/broadcast/2348417,http%3A//bbc.in/xFx3rc,http%3A//bbc.in/zkkLJq,http%3A//billingsgazette.com/news/local/former-president-bush-to-speak-at-billings-fundraiser-in-may/article_f7ef425a-349c-56a9-a399-606b48033f35.html,http%3A//billingsgazette.com/news/local/former-president-bush-to-speak-at-billings-fundraiser-in-may/article_f7ef425a-349c-56a9-a399-606b48033f35.html,http%3A//billingsgazette.com/news/local/friday-forecast-calls-for-cloudy-windy-day-nighttime-snow-possible/article_d3eb3159-68b0-5559-8255-03fce56eaedd.html,http%3A//billingsgazette.com/news/local/gallery-toy-run/collection_f5042a31-bfd4-5f63-a901-2a8c3e8fb26a.html%230,http%3A//billingsgazette.com/news/local/gas-prices-continue-to-drop-in-billings/article_4e8fd07e-0e1e-5c0e-b551-4162b60c4b60.html,http%3A//billingsgazette.com/news/local/gas-prices-continue-to-drop-in-billings/article_713a0c32-32c9-59f1-9aeb-67b8462bbe88.html,http%3A//billingsgazette.com/news/local/gas-prices-continue-to-fall-in-billings-area/article_2bdebf4b-242c-569e-b414-f388a48f4a14.html,http%3A//billingsgazette.com/news/local/gas-prices-dip-below-a-gallon-at-some-billings-stations/article_c7f4d373-dc2b-55c0-b457-10346c0274a6.html,http%3A//billingsgazette.com/news/local/gas-prices-keep-dropping-in-billings-area/article_3666cf9c-4552-5108-9d5c-de2bba12fa3f.html,http%3A//billingsgazette.com/news/local/government-and-politics/city-picks-st-vincent-as-care-provider-for-health-insurance/article_a899f885-15e1-5b98-b899-75acc01e8feb.html,http%3A//billingsgazette.com/news/local/government-and-politics/linder-settles-in-after-first-year-as-sheriff/article_55a9836e-2196-546d-80f0-48bdef717fa3.html,http%3A//billingsgazette.com/news/local/government-and-politics/new-council-members-city-judge-sworn-in/article_bb7ac948-1d45-579c-a057-1323fb2e643d.html'
from google.appengine.api import urlfetch
result = urlfetch.fetch(url=url)
Here's the traceback:
Traceback (most recent call last):
File "", line 1, in
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/urlfetch.py", line 263, in fetch return rpc.get_result()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/apiproxy_stub_map.py", line 592, in get_result
return self.__get_result_hook(self)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/urlfetch.py", line 359, in _get_fetch_result
raise InvalidURLError(str(err))
InvalidURLError: ApplicationError: 1
I wonder if it's something very simple that I'm missing from all of this. Would appreciate your comments and ideas. Thanks!
Your URL is too long; urlfetch enforces a limit on URL length (2048 characters, if I recall the docs correctly), and the long urls parameter pushes this request far past it.
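Since it is the urls parameter that pushes the request URL over the limit, one workaround is to split the call into batches. A sketch (fetch_in_batches is a made-up helper; the endpoint and the REMOVEDKEY placeholder come from the question, and 2048 is an assumed limit):
import urllib
from google.appengine.api import urlfetch

MAX_URL_LEN = 2048  # assumed urlfetch limit on total URL length

def fetch_in_batches(base, targets):
    # Pack as many quoted target URLs as fit under the limit into each
    # request, flushing a batch whenever the next URL would overflow it.
    results, batch = [], []
    for t in targets:
        quoted = urllib.quote(t, safe='')
        if batch and len(base + ','.join(batch + [quoted])) > MAX_URL_LEN:
            results.append(urlfetch.fetch(url=base + ','.join(batch)))
            batch = []
        batch.append(quoted)
    if batch:
        results.append(urlfetch.fetch(url=base + ','.join(batch)))
    return results

base = 'http://api.embed.ly/1/oembed?key=REMOVEDKEY&maxwidth=400&urls='
results = fetch_in_batches(base, target_urls)  # target_urls: the unquoted URLs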
