Unexpected behaviour with Python urllib - python

I am trying to consume a JSON response but I am seeing some very weird behaviour. The endpoint is a Java app running on Tomcat.
I want to load the following url
http://opendata.diavgeia.gov.gr/api/decisions?count=50&output=json_full&from=1
Using Ruby's open-uri I can load the JSON, and if I hit it in the browser I also get the response. But once I try to use Python's urllib or urllib2 I get an error:
javax.servlet.ServletException: Could not resolve view with name 'jsonView' in servlet with name 'diavgeia-api'
It's quite strange, and I guess the error lies in the API server. Any hints?

The server appears to need an 'Accept' header:
>>> print urllib2.urlopen(
...     urllib2.Request(
...         "http://opendata.diavgeia.gov.gr/api/decisions?count=50&output=json_full&from=1",
...         headers={"accept": "*/*"})).read()[:200]
{"model":{"queryInfo":{"total":117458,"count":50,"order":"desc","from":1},"expandedDecisions":[{"metadata":{"date":1291932000000,"tags":{"tag":[]},"decisionType":{"uid":27,"label":"ΔΑΠΑΝΗ","extr

Two possibilities I considered, neither of which holds water:
The server is only prepared to use HTTP 1.1 (which urllib apparently doesn't support, but urllib2 does)
It's doing user agent sniffing, and rejecting Python (I tried using Firefox's UA string instead, but it still gave me an error)

Related

Why don't I get a response from my request?

I'm trying to make one simple request:
from fake_useragent import UserAgent
import requests

ua = UserAgent()
req = requests.get('https://www.casasbahia.com.br/', headers={'User-Agent': ua.random})
I would understand if I received <Response [403]> or something like that, but instead I receive nothing; the code keeps running with no response.
Using logging I see:
I know I could use a timeout to avoid keeping the code running forever, but I just want to understand why I don't get a response.
Thanks in advance.
I have never used this API before, but from what I researched here just now, there are sites that block requests from fake user agents.
So, to reproduce this example on my PC, I installed the fake_useragent and requests modules on my Python 3.10 and tried to execute your script. It turns out that with my authentic User-Agent string the request succeeds: when printed to the console, req.text shows the entire HTML file received from the request.
But if I try again with a fake user agent, using ua.random, it fails. The site was probably built to detect and reject requests from fake agents (or bots).
Though again, this is just a theory; I have no way to access this site's server files to prove it.
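As a rough illustration of the difference (the User-Agent string below is just an example of a genuine browser string, not anything specific to this site, and the timeout is only there so the call can't hang forever):
import requests

# An example of a real browser User-Agent string (any authentic one should behave similarly)
real_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
           'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36')

req = requests.get('https://www.casasbahia.com.br/',
                   headers={'User-Agent': real_ua},
                   timeout=30)
print(req.status_code)
print(req.text[:200])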

Python requests call fails with HTTPS

I am running a Flask restful API behind an NGINX web server on AWS. I am hitting that with a python module from my Pi.
Everything worked fine when I was using HTTP to make calls to the API. But I just locked down my API so only HTTPS is possible. I changed the URL used by my python module but it now fails. The code is quite simple; here is an extract:
jsonpkg = {'subscriberID': self.api_login, 'token': self.api_token,
           'content': speech_content}
headers = {'Content-Type': 'application/json'}
r = requests.post(self.api_apiurl, data=json.dumps(jsonpkg), headers=headers)
The values are being correctly set by the class init section, and I am importing the requests module at the top. Error messages indicate it is using Python 2.7. However, when I monitor the API I can see it's not even hitting the server. I can point a browser at the API and it works fine.
Am I to understand the requests module in python 2.7 does not support https?
Are there additional parameters I need to send for https?
Aha! With a little more digging into the requests module docs I found the answer. If I use the following:
r = requests.post(self.api_apiurl, data=json.dumps(jsonpkg), headers=headers, verify=False)
then it works. So the issue is with verifying the cert. I am not quite sure why the browser gets by without this, but perhaps it does the extra work automatically. So I either need to NOT verify the cert, or have a local copy(?) that can be verified.
Final Update:
I finally worked out how to concatenate my site certificate with the chain certificate (and understand why). This site here was a great help. Also, once they are concatenated you will probably get a second error, which, if you google it, you will find is caused by the need for a carriage return after the first certificate and before the second (edit the resulting concatenated file with Notepad). I was then able to return the post to using "verify=True", which made the warnings about skipped verification go away.
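For reference, a minimal sketch of the two approaches (the URL, payload and bundle path below are placeholders, not the actual API details):
import json
import requests

headers = {'Content-Type': 'application/json'}
payload = json.dumps({'subscriberID': 'me', 'token': 'secret', 'content': 'hello'})

# Option 1: skip certificate verification entirely (fine for testing, not for production)
r = requests.post('https://api.example.com/speech', data=payload,
                  headers=headers, verify=False)

# Option 2: verify against the concatenated site + chain certificates
# ('/home/pi/certs/api-chain.pem' is a hypothetical path to that PEM file)
r = requests.post('https://api.example.com/speech', data=payload,
                  headers=headers, verify='/home/pi/certs/api-chain.pem')
print r.status_code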

Odd redirect location causes proxy error with urllib2

I am using urllib2 to do an http post request using Python 2.7.3. My request is returning an HTTPError exception (HTTP Error 502: Proxy Error).
Looking at the messages traffic with Charles, I see the following is happening:
I send the HTTP request (POST /index.asp?action=login HTTP/1.1) using urllib2
The remote server replies with status 303 and a location header of ../index.asp?action=news
urllib2 retries sending a get request: (GET /../index.asp?action=news HTTP/1.1)
The remote server replies with status 502 (Proxy error)
The 502 reply includes this in the response body: "DNS lookup failure for: 10.0.0.30:80index.asp" (Notice the malformed URL)
So I take this to mean that a proxy server on the remote server's network sees the "/../index.asp" URL in the request and misinterprets it, sending my request on with a bad URL.
When I make the same request with my browser (Chrome), the retry is sent to GET /index.asp?action=news. So Chrome takes off the leading "/.." from the URL, and the remote server replies with a valid response.
Is this a urllib2 bug? Is there something I can do so the retry ignores the "/.." in the URL? Or is there some other way to solve this problem? Thinking it might be a urllib2 bug, I swapped urllib2 out for requests, but requests produced the same result. Of course, that may be because requests handles these redirects the same way.
Thanks for any help.
The Location being sent with that 303 is wrong in multiple ways.
First, if you read RFC2616 (HTTP/1.1 Header Field Definitions) 14.30 Location, the Location must be an absoluteURI, not a relative one. And section 10.3.3 makes it clear that this is the relevant definition.
Second, even if a relative URI were allowed, RFC 1808 (Relative Uniform Resource Locators), 4. Resolving Relative URLs, step 6, only specifies special handling for .. in the pattern <segment>/../. A .. with no preceding segment to cancel is simply left in place, which means a relative URL shouldn't start with ... So resolving ../index.asp?action=news against http://example.com/index.asp?action=login gives http://example.com/../index.asp?action=news, not http://example.com/index.asp?action=news. (Of course most servers will treat these the same way, but that's up to each server.)
Finally, even if you did combine the relative and base URLs before resolving .., an absolute URI with a path starting with .. is invalid.
So, the bug is in the server's configuration.
Now, it just so happens that many user-agents will work around this bug. In particular, they turn /../foo into /foo to block users (or arbitrary JS running on their behalf without their knowledge) from trying to do "escape from webroot" attacks.
But that doesn't mean that urllib2 should do so, or that it's buggy for not doing so. Of course urllib2 should detect the error earlier so it can tell you "invalid path" or something, instead of running together an illegal absolute URI that's going to confuse the server into sending you back nonsense errors. But it is right to fail.
It's all well and good to say that the server configuration is wrong, but unless you're the one in charge of the server, you'll probably face an uphill battle trying to convince them that their site is broken and needs to be fixed when it works with every web browser they care about. Which means you may need to write your own workaround to deal with their site.
The way to do that with urllib2 is to supply your own HTTPRedirectHandler with an implementation of redirect_request method that recognizes this case and returns a different Request than the default code would (in particular, http://example.com/index.asp?action=news instead of http://example.com/../index.asp?action=news).
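A rough sketch of what that could look like (untested; the login form data and exact URLs are placeholders, since they aren't shown in the question):
import urllib2
import urlparse

class FixBadRedirects(urllib2.HTTPRedirectHandler):
    """Strip a bogus leading "/.." from redirect URLs before following them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # urllib2 has already joined the Location header with the request URL,
        # so the broken "../index.asp" shows up here as a path starting with "/../".
        parts = urlparse.urlsplit(newurl)
        path = parts.path
        while path.startswith('/../'):
            path = path[3:]  # "/../index.asp" -> "/index.asp"
        fixed = urlparse.urlunsplit((parts.scheme, parts.netloc, path,
                                     parts.query, parts.fragment))
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, fixed)

opener = urllib2.build_opener(FixBadRedirects())
# Placeholder URL and form data; substitute the real login request here.
response = opener.open('http://example.com/index.asp?action=login',
                       data='user=me&password=secret')
print response.read()[:200]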

Python urllib.urlopen() call doesn't work with a URL that a browser accepts

If I point Firefox at http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes, I get a page of HTML. But if I try this in Python:
import urllib
site = 'http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes'
req = urllib.urlopen(site)
text = req.read()
I get the following:
500 Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
What am I doing wrong?
You are not doing anything wrong; bitbucket does some user agent detection (to detect mercurial clients, for example). Just changing the user agent fixes it (as long as it doesn't have urllib as a substring).
You should file an issue regarding this: http://bitbucket.org/jespern/bitbucket/issues/new/
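For instance, a small sketch of that change using urllib2 (the User-Agent value here is arbitrary; anything without "urllib" in it should do):
import urllib2

site = 'http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes'
# Any User-Agent string that doesn't contain "urllib" as a substring
req = urllib2.Request(site, headers={'User-Agent': 'Mozilla/5.0'})
text = urllib2.urlopen(req).read()
print text[:200]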
You're doing nothing wrong, on the surface, and as the error page says you should contact the site's administrators because they're the ones with the server logs which may explain what's happening. Fortunately, bitbucket's site admins are a friendly bunch!
No doubt there is some header or combination of headers that browsers set one way and urllib sets another, and a bug on the server gets tickled in the latter case. You may want to see exactly what headers are being sent, e.g. with Firebug in Firefox, and reproduce those until you isolate exactly which one triggers the server bug; most likely it's going to be the user agent or some "accept"-ish header.
I don't think you're doing anything wrong -- it looks like this server was just down? Your script worked fine for me ('text' contained the same data as that displayed in the browser).

Is it possible to fetch a https page via an authenticating proxy with urllib2 in Python 2.5?

I'm trying to add authenticating proxy support to an existing script. As it is, the script connects to an https url (with urllib2.Request and urllib2.urlopen), scrapes the page and performs some actions based on what it has found. Initially I had hoped this would be as easy as simply adding a urllib2.ProxyHandler({"http": MY_PROXY}) as an arg to urllib2.build_opener, which in turn is passed to urllib2.install_opener. Unfortunately this doesn't seem to work when attempting to do a urllib2.Request(ANY_HTTPS_PAGE). Googling around leads me to believe that the proxy support in urllib2 in Python 2.5 does not support https urls. This surprised me to say the least.
There appear to be solutions floating around the web. For example, http://bugs.python.org/issue1424152 contains a patch for urllib2 and httplib which purports to solve the issue (when I tried it I began to get the following error instead: urllib2.URLError: <urlopen error (1, 'error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol')>). There is a cookbook recipe at http://code.activestate.com/recipes/456195 which I am planning to try next. All in all, though, I'm surprised this isn't supported "out of the box", which makes me wonder if I'm simply missing an obvious solution. So, in short: has anyone got a simple method for fetching https pages using an authenticating proxy with urllib2 in Python 2.5? Ideally this would work:
import urllib2

# perhaps the dictionary below needs a corresponding "https" entry?
# That doesn't seem to work out of the box.
proxy_handler = urllib2.ProxyHandler({"http": "http://user:pass@myproxy:port"})
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPHandler,
                                            urllib2.HTTPSHandler,
                                            proxy_handler))
request = urllib2.Request(A_HTTPS_URL)
response = urllib2.urlopen(request)
print response.read()
Many Thanks
You may want to look into httplib2. One of the examples claims support for SOCKS proxies if the socks module is installed.
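For what it's worth, here is a rough sketch of what that might look like with httplib2 (untested; the proxy host, port, credentials and target URL are placeholders, and the socks import may differ depending on the httplib2 version):
import httplib2
import socks  # from the SocksiPy package; some httplib2 versions bundle their own copy

# Placeholder details for an authenticating HTTP proxy
proxy_info = httplib2.ProxyInfo(proxy_type=socks.PROXY_TYPE_HTTP,
                                proxy_host='myproxy',
                                proxy_port=8080,
                                proxy_user='user',
                                proxy_pass='pass')

h = httplib2.Http(proxy_info=proxy_info)
resp, content = h.request('https://example.com/secure/page', 'GET')
print resp.status
print content[:200]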
