I'm using Python and urllib2 to make POST requests, and it works successfully. However, when I make several POSTs one after the other, I sometimes get a 502 proxy error. Our company does use a proxy, but I'm not set up to go through it since I'm working internally. Is there a way to see how the POST request is being routed using urllib2 and Python?
Thanks
I'm not sure what you mean by "a trace route". traceroute is an IP thing, two levels below HTTP. And I doubt you want anything like that. You can find out whether there were any redirects, whether a proxy was used, etc., either by using a general-purpose sniffer or, much more simply, by just asking urllib2.
For example, let's say your code looks like this:
import urllib
import urllib2

url = 'http://example.com'
data = urllib.urlencode({'spam': 'eggs'})
req = urllib2.Request(url, data)
resp = urllib2.urlopen(req)
respdata = resp.read()
Then req.has_proxy() will tell you whether it's going to use a proxy, resp.geturl() == url will tell you whether there was a redirect, etc. Read the docs for all the info available.
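For instance, continuing the snippet above (a minimal sketch; both checks only mean something after urlopen() has run):

print(req.has_proxy())        # True if a ProxyHandler routed this request through a proxy
print(resp.geturl() == url)   # False if the server redirected us to a different URL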
Meanwhile, if you don't want a proxy, you can either disable whatever settings urllib2 picked up that made it auto-configure the proxy (e.g., unset http_proxy before running your script), override the default handler chain to make sure there's no ProxyHandler, build an explicit OpenerDirector instead of using the default one, etc.
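For example, one way to make sure no proxy is used, whatever http_proxy happens to be set to, is to install an opener whose ProxyHandler has an empty proxy map (a minimal sketch):

import urllib2

# ProxyHandler({}) overrides the proxies urllib2 would otherwise pick up from
# the environment, so every request goes out directly.
no_proxy_opener = urllib2.build_opener(urllib2.ProxyHandler({}))
urllib2.install_opener(no_proxy_opener)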
I am trying to send a requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but I still get the 404.
All other websites work fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
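For example, a minimal sketch of that experiment (httpbin simply echoes back the request it received):

import requests

# httpbin.org/headers returns the headers it saw as JSON, which makes it easy
# to compare what requests sends with what your browser sends.
resp = requests.get('http://httpbin.org/headers', headers={'User-Agent': 'Custom'})
print(resp.json())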
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python. Setting the User-Agent header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
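As a rough sketch of the requests-html route (the selector and the render step here are illustrative assumptions; rendering downloads and drives a headless Chromium the first time it runs):

from requests_html import HTMLSession   # pip install requests-html

session = HTMLSession()
resp = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
resp.html.render()                                 # execute the page's scripts in headless Chromium
print(resp.html.find('title', first=True).text)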
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
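If that listing endpoint is what you are actually after, a hedged sketch is to call it directly with the same session (the headers and the preliminary GET are assumptions; the site may require more than this):

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Custom'
session.get('https://rent.591.com.tw')    # pick up any cookies the site sets first
resp = session.get('https://rent.591.com.tw/home/search/rsList',
                   params={'is_new_list': 1, 'type': 1, 'kind': 0,
                           'searchtype': 1, 'region': 1})
print(resp.json())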
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
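A minimal sketch of that pattern, with made-up URL and form field names, looks like this:

import requests
from bs4 import BeautifulSoup   # any HTML parser will do; this one is an assumption

session = requests.Session()                              # keeps cookies across requests
form_page = session.get('https://example.com/login')
token = BeautifulSoup(form_page.text, 'html.parser').find(
    'input', {'name': 'csrf_token'})['value']             # hypothetical token field name
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret', 'csrf_token': token})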
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
import requests

with open("filepath", 'r') as file:      # open for reading; 'w' would truncate the file
    links = file.read().splitlines()     # splitlines() drops the trailing newline characters

for link in links:
    response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I am using urllib2 to do an HTTP POST request using Python 2.7.3. My request is returning an HTTPError exception (HTTP Error 502: Proxy Error).
Looking at the message traffic with Charles, I see the following happening:
I send the HTTP request (POST /index.asp?action=login HTTP/1.1) using urllib2
The remote server replies with status 303 and a location header of ../index.asp?action=news
urllib2 follows the redirect by sending a GET request: (GET /../index.asp?action=news HTTP/1.1)
The remote server replies with status 502 (Proxy error)
The 502 reply includes this in the response body: "DNS lookup failure for: 10.0.0.30:80index.asp" (Notice the malformed URL)
So I take this to mean that a proxy server on the remote server's network sees the "/../index.asp" URL in the request and misinterprets it, sending my request on with a bad URL.
When I make the same request with my browser (Chrome), the retry is sent to GET /index.asp?action=news. So Chrome takes off the leading "/.." from the URL, and the remote server replies with a valid response.
Is this a urllib2 bug? Is there something I can do so the retry ignores the "/.." in the URL? Or is there some other way to solve this problem? Thinking it might be a urllib2 bug, I swapped out urllib2 with requests but requests produced the same result. Of course, that may be because requests is built on urllib2.
Thanks for any help.
The Location being sent with that 303 is wrong in multiple ways.
First, if you read RFC 2616 (HTTP/1.1 Header Field Definitions) 14.30 Location, the Location must be an absoluteURI, not a relative one. And section 10.3.4 (303 See Other) makes it clear that this is the relevant definition.
Second, even if a relative URI were allowed, RFC 1808, Relative Uniform Resource Locators, 4. Resolving Relative URLs, step 6, only specifies special handling for .. in the pattern <segment>/../. That means a relative URL shouldn't try to climb above the root of the base URL. So, if the base URL is http://example.com/index.asp?action=login and the relative URL is ../index.asp?action=news, the leading .. has no preceding path segment to consume, and the resolved URL is the invalid http://example.com/../index.asp?action=news, not http://example.com/index.asp?action=news. (Of course many clients will clean this up anyway, but that's up to each client.)
Finally, even if you did combine the relative and base URLs before resolving .., an absolute URI with a path starting with .. is invalid.
So, the bug is in the server's configuration.
Now, it just so happens that many user-agents will work around this bug. In particular, they turn /../foo into /foo to block users (or arbitrary JS running on their behalf without their knowledge) from trying to do "escape from webroot" attacks.
But that doesn't mean that urllib2 should do so, or that it's buggy for not doing so. Of course urllib2 should detect the error earlier so it can tell you "invalid path" or something, instead of running together an illegal absolute URI that's going to confuse the server into sending you back nonsense errors. But it is right to fail.
It's all well and good to say that the server configuration is wrong, but unless you're the one in charge of the server, you'll probably face an uphill battle trying to convince them that their site is broken and needs to be fixed when it works with every web browser they care about. Which means you may need to write your own workaround to deal with their site.
The way to do that with urllib2 is to supply your own HTTPRedirectHandler with an implementation of redirect_request method that recognizes this case and returns a different Request than the default code would (in particular, http://example.com/index.asp?action=news instead of http://example.com/../index.asp?action=news).
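A hedged sketch of such a handler (the URL cleanup is deliberately naive; adjust it to whatever this particular server actually sends):

import urllib2

class CleanupRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Rewrite e.g. http://example.com/../index.asp?action=news
        # into http://example.com/index.asp?action=news before retrying.
        newurl = newurl.replace('/../', '/', 1)
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)

opener = urllib2.build_opener(CleanupRedirectHandler)
response = opener.open('http://example.com/index.asp?action=login',
                       data='user=me&pass=secret')   # placeholder POST body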
Because of a search feature, I need to make GET, POST, PUT, etc. requests to another URI, but I cannot find a way to do that internally with Pyramid. Is there any way to do it at the moment?
Simply use the existing Python libraries for calling other web servers.
On Python 2.x, use urllib2; on Python 3.x, use urllib.request instead. Alternatively, you could install requests.
Do note that calling external sites from your server while serving a request yourself could mean your visitors end up waiting for a third-party web server that has stopped responding. Make sure you set decent timeouts.
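For instance, a minimal sketch of calling the external service from inside your view with an explicit timeout (the URL is a placeholder):

import urllib2

try:
    resp = urllib2.urlopen('http://search.example.com/?q=spam', timeout=5)
    body = resp.read()
except urllib2.URLError:
    body = None   # decide how your own response should degrade when the backend is slow or down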
Pyramid uses WebOb, which has a client API as of version 1.2:
from webob import Request
r = Request.blank("http://google.com")
response = r.send()
Generally, anything you want to override for the request you just pass in as a parameter:
from webob import Request
r = Request.blank("http://facebook.com",method="DELETE")
Another handy feature is that you can see the request as the HTTP that is passed over the wire:
print r
DELETE HTTP/1.0
Host: facebook.com:80
See the WebOb docs for more details.
Also check the response status code: response.status_int
I use it, for example, to introspect my internal URIs and see whether or not a given relative URI is really served by the framework (e.g. to generate breadcrumbs and render intermediate paths as links only if there are actual pages behind them).
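A minimal sketch of that breadcrumb check, assuming app is your Pyramid WSGI application:

from webob import Request

def is_served(path, app):
    # Run a relative URI through the WSGI app in-process and see what comes back.
    response = Request.blank(path).get_response(app)
    return response.status_int != 404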
I want to make a GET request to retrieve the contents of a web-page or a web service.
I want to send specific headers for this request AND
I want to set the IP address FROM WHICH this request will be sent.
(The server on which this code is running has multiple IP addresses available).
How can I achieve this with Python and its libraries?
I checked urllib2 and it won't set the source address (at least not on Python 2.7). The underlying library is httplib, which does have that feature, so you may have some luck using that directly.
From the httplib documentation:
class httplib.HTTPConnection(host[, port[, strict[, timeout[, source_address]]]])
The optional source_address parameter may be a tuple of a (host, port) to use as the source address the HTTP connection is made from.
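So a minimal sketch of going straight through httplib, binding the outgoing socket to one specific local address (the addresses below are placeholders; port 0 lets the OS pick an ephemeral port):

import httplib

conn = httplib.HTTPConnection('example.com', 80, timeout=10,
                              source_address=('10.0.0.5', 0))
conn.request('GET', '/', headers={'X-Example': 'value'})   # your custom headers go here
response = conn.getresponse()
print(response.status)
print(response.read())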
You may even be able to convince urllib2 to use this feature by creating a custom HTTPHandler class. You will need to duplicate some code from urllib2.py, because AbstractHTTPHandler is using a simpler version of this call:
class AbstractHTTPHandler(BaseHandler):
    # ...
    def do_open(self, http_class, req):
        # ...
        h = http_class(host, timeout=req.timeout)  # will parse host:port
Where http_class is httplib.HTTPConnection for HTTP connections.
Probably this would work instead, if patching urllib2.py (or duplicating and renaming it) is an acceptable workaround:
h = http_class(host, timeout=req.timeout, source_address=(req.origin_req_host,0))
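If you'd rather not touch urllib2.py at all, another hedged sketch is a small HTTPHandler subclass that swaps in a connection class bound to a specific local address (the IP below is a placeholder for one of your server's addresses):

import httplib
import urllib2

SOURCE_IP = '10.0.0.5'   # placeholder: the local address you want to send from

class BoundHTTPConnection(httplib.HTTPConnection):
    def __init__(self, host, **kwargs):
        httplib.HTTPConnection.__init__(self, host,
                                        source_address=(SOURCE_IP, 0), **kwargs)

class BoundHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        # Same as the default handler, but using our bound connection class.
        return self.do_open(BoundHTTPConnection, req)

opener = urllib2.build_opener(BoundHTTPHandler)
req = urllib2.Request('http://example.com/',
                      headers={'X-Example': 'value'})   # the custom headers you need
print(opener.open(req, timeout=10).read())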
There are many options available to you for making http requests. I don't even think there is really a commonly agreed upon "best". You could use any of these:
urllib2: http://docs.python.org/library/urllib2.html
requests: http://docs.python-requests.org/en/v0.10.4/index.html
mechanize: http://wwwsearch.sourceforge.net/mechanize/
This list is not exhaustive. Read the docs and take your pick. Some are lower level and some offer rich, browser-like features. All of them let you set headers before making a request.
My code:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8888'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
f = urllib2.urlopen('http://www.google.com')
print f.read()
This request does not show up in Fiddler's capture. Does anyone know how to configure Fiddler so that the request is captured?
EDIT: the request works, and I can see the contents. Also, if I close Fiddler, the request fails, as expected, because there is no proxy. It is just that I do not see anything in Fiddler.
EDIT2: I see traffic from a .NET test console application that I wrote. But I do not see traffic from my python script.
I got exactly the same issue: while Fiddler2 is open, even if I change the handler to
proxy = urllib2.ProxyHandler({'http': 'http://asdfl.com:13212/'}) (a non-existent proxy server), it can still get the page content. My guess was that once the proxy server has been set up by Fiddler2, urllib2 totally ignores the ProxyHandler for some reason; I couldn't figure it out at first.
I got it. Check this thread on Stack Overflow:
urllib2 doesn't use proxy (Fiddler2), set using ProxyHandler
In Fiddler2, go to the page Tools->Fiddler Options ...->Connections, remove the trailing semicolon from the value in the "IE should bypass Fiddler for ..." field and restart Fiddler2.
This solution solved my problem; I hope it can help someone else who is struggling with it.