Python problems with FancyURLopener, 401, and "Connection: close"

I'm new to Python, so forgive me if I am missing something obvious.
I am using urllib.FancyURLopener to retrieve a web document. It works fine when authentication is disabled on the web server, but fails when authentication is enabled.
My guess is that I need to subclass urllib.FancyURLopener to override the get_user_passwd() and/or prompt_user_passwd() methods. So I did:
import urllib

class my_opener(urllib.FancyURLopener):
    # Redefine
    def get_user_passwd(self, host, realm, clear_cache=0):
        print "get_user_passwd() called; host %s, realm %s" % (host, realm)
        return ('name', 'password')
Then I attempt to open the page:
try:
    opener = my_opener()
    f = opener.open('http://1.2.3.4/whatever.html')
    content = f.read()
    print "Got it: ", content
except IOError:
    print "Failed!"
I expect FancyURLopener to handle the 401, call my get_user_passwd(), and retry the request.
It does not; I get the IOError exception when I call "f = opener.open()".
Wireshark tells me that the request is sent, and that the server is sending a "401 Unauthorized" response with two headers of interest:
WWW-Authenticate: BASIC
Connection: close
The connection is then closed, I catch my exception, and it's all over.
It fails the same way even if I retry the "f = opener.open()" after IOError.
I have verified that my my_opener class is being used by overriding the http_error_401() method with a simple print 'Got 401 error'. I have also tried overriding the prompt_user_passwd() method, but it is never called either.
I see no way to proactively specify the user name and password.
So how do I get urllib to retry the request?
Thanks.

I just tried your code on my webserver (nginx) and it works as expected:
1. Client (urllib) sends a GET request.
2. Server responds with HTTP/1.1 401 Unauthorized and the headers:
   Connection: close
   WWW-Authenticate: Basic realm="Restricted"
3. Client tries again, this time with an Authorization header:
   Authorization: Basic <Base64-encoded credentials>
4. Server responds with 200 OK plus the content.
So I guess your code is right (I tried it with Python 2.7.1), and maybe the web server you are trying to access is not working as expected. Here is the code, tested against the free HTTP basic auth test site browserspy.dk (they seem to be using Apache; the code works as expected):
import urllib

class my_opener(urllib.FancyURLopener):
    # Redefine
    def get_user_passwd(self, host, realm, clear_cache=0):
        print "get_user_passwd() called; host %s, realm %s" % (host, realm)
        return ('test', 'test')

try:
    opener = my_opener()
    f = opener.open('http://browserspy.dk/password-ok.php')
    content = f.read()
    print "Got it: ", content
except IOError:
    print "Failed!"

Related

IP of Server using Tor and pycurl

I am working on a project using Tor and Python, for which I have to get the server IP of a given URL using pycurl.
Currently I am using the following code for a simple query and response.
import StringIO
import pycurl

CONNECTION_TIMEOUT = 30  # seconds; value assumed, not defined in the original snippet

def query(url):
    """
    Uses pycurl to fetch a site using the proxy on the SOCKS_PORT.
    """
    output = StringIO.StringIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEFUNCTION, output.write)  # collect the response body
    curl.setopt(pycurl.PROXY, '188.120.228.106')
    curl.setopt(pycurl.PROXYPORT, 1080)
    curl.setopt(pycurl.PROXYTYPE, pycurl.PROXYTYPE_SOCKS5_HOSTNAME)
    curl.setopt(pycurl.CONNECTTIMEOUT, CONNECTION_TIMEOUT)
    try:
        curl.perform()
        return output.getvalue()
    except pycurl.error as exc:
        raise ValueError("Unable to reach %s (%s)" % (url, exc))
Any suggestions on how to change the code so that I can also get the server IP of the given URL?
Seems like the correct flag you're looking for is CURLINFO_PRIMARY_IP.
Try using curl.getinfo(pycurl.PRIMARY_IP) on your pycurl.Curl() object.
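For example, a minimal sketch (assuming pycurl built against libcurl 7.19 or newer, where PRIMARY_IP is available; the URL is a placeholder):

import StringIO
import pycurl

output = StringIO.StringIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, 'http://example.com/')
curl.setopt(pycurl.WRITEFUNCTION, output.write)
curl.perform()
# Ask libcurl which IP address the last connection actually used.
print curl.getinfo(pycurl.PRIMARY_IP)
curl.close()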

How to send response headers and status from CGI scripts

I am using CGIHTTPServer.py to create a simple CGI server. I want my CGI script to take care of the response code if some operation goes wrong. How can I do that?
Code snippet from my CGI script.
if authmxn.authenticate():
    stats = Stats()
    print "Content-Type: application/json"
    print 'Status: 200 OK'
    print
    print json.dumps(stats.getStats())
else:
    print 'Content-Type: application/json'
    print 'Status: 403 Forbidden'
    print
    print json.dumps({'msg': 'request is not authenticated'})
A snippet from the request handler:
def run_cgi(self):
    '''
    rest of code
    '''
    if not os.path.exists(scriptfile):
        self.send_error(404, "No such CGI script (%s)" % `scriptname`)
        return
    if not os.path.isfile(scriptfile):
        self.send_error(403, "CGI script is not a plain file (%s)" %
                        `scriptname`)
        return
    ispy = self.is_python(scriptname)
    if not ispy:
        if not (self.have_fork or self.have_popen2):
            self.send_error(403, "CGI script is not a Python script (%s)" %
                            `scriptname`)
            return
        if not self.is_executable(scriptfile):
            self.send_error(403, "CGI script is not executable (%s)" %
                            `scriptname`)
            return
    if not self.have_fork:
        # Since we're setting the env in the parent, provide empty
        # values to override previously set values
        for k in ('QUERY_STRING', 'REMOTE_HOST', 'CONTENT_LENGTH',
                  'HTTP_USER_AGENT', 'HTTP_COOKIE'):
            env.setdefault(k, "")
    self.send_response(200, "Script output follows")  # overrides the headers
    decoded_query = query.replace('+', ' ')
It is possible to implement support for a "Status: code message" header that overrides the HTTP status line (the first line of the HTTP response, e.g. HTTP/1.0 200 OK). This requires:
1. subclassing CGIHTTPRequestHandler in order to trick it into writing the CGI script's output into a StringIO object instead of directly to a socket, and
2. once the CGI script is complete, updating the HTTP status line with the value provided in the Status: header.
It's a hack, but it's not too bad and no standard library code needs to be patched.
import BaseHTTPServer
import SimpleHTTPServer
from CGIHTTPServer import CGIHTTPRequestHandler
from cStringIO import StringIO

class BufferedCGIHTTPRequestHandler(CGIHTTPRequestHandler):
    def setup(self):
        """
        Arrange for the CGI response to be buffered in a StringIO rather
        than sent directly to the client.
        """
        CGIHTTPRequestHandler.setup(self)
        self.original_wfile = self.wfile
        self.wfile = StringIO()
        # Prevent use of os.dup(self.wfile...); forces the subprocess path instead.
        self.have_fork = False

    def run_cgi(self):
        """
        Post-process the CGI script's response before sending it to the
        client. Override the HTTP status line with the value of the
        "Status:" header, if set.
        """
        CGIHTTPRequestHandler.run_cgi(self)
        self.wfile.seek(0)
        headers = []
        body = ''  # in case the script emits no blank line after the headers
        for line in self.wfile:
            headers.append(line)  # includes the newline character
            if line.strip() == '':  # blank line signals end of headers
                body = self.wfile.read()
                break
            elif line.startswith('Status:'):
                # Use the Status: header to override the premature HTTP
                # status line. Header format is: "Status: code message"
                status = line.split(':', 1)[1].strip()
                headers[0] = '%s %s\r\n' % (self.protocol_version, status)
        self.original_wfile.write(''.join(headers))
        self.original_wfile.write(body)

def test(HandlerClass=BufferedCGIHTTPRequestHandler,
         ServerClass=BaseHTTPServer.HTTPServer):
    SimpleHTTPServer.test(HandlerClass, ServerClass)

if __name__ == '__main__':
    test()
Needless to say, this is probably not the best way to go, and you should look at a non-CGIHTTPServer solution: a micro-framework such as bottle, a proper web server (from memory, CGIHTTPServer is not recommended for production use), FastCGI, or WSGI, just to name a few options.
With the standard library HTTP server you cannot do this. From the library documentation:
Note CGI scripts run by the CGIHTTPRequestHandler class cannot execute redirects (HTTP code 302), because code 200 (script output follows) is sent prior to execution of the CGI script. This pre-empts the status code.
This means that the server does not support the Status: <status-code> <reason> header from the script. You correctly identified the portion in the code that shows this support does not exist: The server sends status code 200 before even running the script. There is no way you can change this from within the script.
There are several tickets related to this in the Python bugtracker, some with patches, see e.g., issue13893. So one option for you would be to patch the standard library to add this feature.
However, I would strongly suggest you switch to another technology instead of CGI (or run a real web server).
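For instance, the standard library's wsgiref already gives the application full control over the status line. A minimal sketch (mine, not from the original answers) of the 403 case from the question:

from wsgiref.simple_server import make_server
import json

def app(environ, start_response):
    # Unlike CGIHTTPServer, WSGI lets the application set any status line.
    start_response('403 Forbidden', [('Content-Type', 'application/json')])
    return [json.dumps({'msg': 'request is not authenticated'})]

if __name__ == '__main__':
    make_server('', 8000, app).serve_forever()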

Python: Access ftp like browsers do, with proxy

I want to access an FTP server, anonymously, just for download. My company has a proxy, and the FTP port (21) is blocked, so I can't access the FTP server directly.
What I want to do is write code that behaves exactly the same way browsers do. The idea is that if I can download the files with my browser, there must be a way to do it with code.
My code works when I try to access a web site outside the company, but it still does not work for FTP servers.
import urllib2

proxy = urllib2.ProxyHandler({'https': 'proxy.mycompanhy.com:8080',
                              'http': 'proxy.mycompanhy.com:80',
                              'ftp': 'proxy.mycompanhy.com:21'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)

urlAddress = 'https://python.org'
# urlAddress = 'ftp://ftp1.cptec.inpe.br'
conn = urllib2.urlopen(urlAddress)
return_str = conn.read()
print return_str
When I try to access python.org, it works fine. If I remove the install_opener part, it does not work anymore, proving that the proxy is required.
When I use the ftp URL, it blocks (or times out, if I choose to use those parameters).
I understand that ftp and http are two very different protocols.
What I don't understand is the mechanism that browsers use to access these ftp servers.
I mean, I don't know whether there is a layer on the server side that interfaces between http and ftp and returns HTML, or whether the browser somehow accesses the ftp server itself and builds the page.
There might also be some confusion between the ftp domain (or the URL) and the connection mode. It seems to me that when urllib2 reads ftp://... it automatically uses port 21.
I found a solution using wget. This package handles proxies, but the documentation was very obscure. You need to set an environment variable with the proxy name.
import wget
import os
import errno

# setup proxy
os.environ["ftp_proxy"] = "proxy.mycompanhy.com"
os.environ["http_proxy"] = "proxy.mycompanhy.com"
os.environ["https_proxy"] = "proxy.mycompanhy.com"

src = "http://domain.gov/data/fileToDownload.txt"
out = "C:\\outFolder\\outFileName.txt"  # out is optional

# create the output folder if it doesn't exist
outFolder, _ = os.path.split(out)
try:
    os.makedirs(outFolder)
except OSError as exc:  # Python >2.5
    if exc.errno == errno.EEXIST and os.path.isdir(outFolder):
        pass
    else:
        raise

# download
filename = wget.download(src, out)
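As to how browsers manage it: they typically do not speak FTP at all in this situation; they send the full ftp:// URL to the HTTP proxy as an ordinary GET, and the proxy fetches the file over FTP and returns an HTTP response. urllib2 can follow the same pattern if the ftp scheme is mapped to the HTTP proxy port instead of port 21. A sketch (untested, reusing the proxy name from the question; note the explicit http:// scheme in the proxy URL):

import urllib2

# Send ftp:// URLs to the HTTP proxy (browser-style) instead of trying
# to open a direct FTP connection on the blocked port 21.
proxy = urllib2.ProxyHandler({'ftp': 'http://proxy.mycompanhy.com:8080'})
opener = urllib2.build_opener(proxy)
print opener.open('ftp://ftp1.cptec.inpe.br/').read()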

Python urllib2 force IPv4

I am running a script in Python that uses urllib2 to grab data from a weather API and display it on screen. I have had the problem that when I query the server, I get a "no address associated with hostname" error. I can view the output of the API with a web browser, and I can download the file with wget, but I have to force IPv4 to get it to work. Is it possible to force IPv4 in urllib2 when using urllib2.urlopen?
Not directly, no.
So, what can you do?
One possibility is to explicitly resolve the hostname to IPv4 yourself, and then use the IPv4 address instead of the name as the host. For example:
import socket
import urllib2

host = socket.gethostbyname('example.com')
page = urllib2.urlopen('http://{}/path'.format(host))
However, some virtual-server sites may require a Host: example.com header, and they will instead get a Host: 93.184.216.119. You can work around that by overriding the header:
host = socket.gethostbyname('example.com')
request = urllib2.Request('http://{}/path'.format(host),
                          headers={'Host': 'example.com'})
page = urllib2.urlopen(request)
Alternatively, you can provide your own handlers in place of the standard ones. But the standard handler is mostly just a wrapper around httplib.HTTPConnection, and the real problem is in HTTPConnection.connect.
So, the clean way to do this is to create your own subclass of httplib.HTTPConnection, which overrides connect like this:
def connect(self):
    # Resolve the hostname to an IPv4 address before connecting.
    host = socket.gethostbyname(self.host)
    self.sock = socket.create_connection((host, self.port),
                                         self.timeout, self.source_address)
    if self._tunnel_host:
        self._tunnel()
Then create your own subclass of urllib2.HTTPHandler that overrides http_open to use your subclass:
def http_open(self, req):
    return self.do_open(MyHTTPConnection, req)
… and similarly for HTTPSHandler, and then hook up all the stuff properly as shown in the urllib2 docs.
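Put together, a minimal sketch of the clean approach might look like this (class names are illustrative; HTTPS is omitted for brevity):

import httplib
import socket
import urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def connect(self):
        # Force an IPv4 lookup; gethostbyname() only returns IPv4 addresses.
        host = socket.gethostbyname(self.host)
        self.sock = socket.create_connection((host, self.port),
                                             self.timeout, self.source_address)
        if self._tunnel_host:
            self._tunnel()

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

opener = urllib2.build_opener(MyHTTPHandler)
page = opener.open('http://example.com/path')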
The quick & dirty way to do the same thing is to just monkeypatch httplib.HTTPConnection.connect to the above function.
Finally, you could use a different library instead of urllib2. From what I remember, requests doesn't make this any easier (ultimately, you have to override or monkeypatch slightly different methods, but it's effectively the same). However, any libcurl wrapper will allow you to do the equivalent of curl_easy_setopt(h, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4).
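With pycurl, for instance, that looks roughly like this (a sketch; the URL is a placeholder):

import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/')
# Restrict name resolution to IPv4, the pycurl spelling of
# curl_easy_setopt(h, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4).
c.setopt(pycurl.IPRESOLVE, pycurl.IPRESOLVE_V4)
c.perform()
c.close()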
Not a proper answer but an alternative: call curl?
import subprocess
import sys

def log_error(msg):
    sys.stderr.write(msg + '\n')

def curl(url):
    process = subprocess.Popen(
        ["curl", "-fsSkL4", url],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = process.communicate()
    if process.returncode == 0:
        return stdout
    else:
        log_error("Failed to fetch: %s" % url)
        log_error(stderr)
        exit(3)

Python - try statement breaking urllib2.urlopen

I'm writing a program in Python that has to make an HTTP request while being forced onto a direct connection in order to avoid a proxy. Here is the code I use, which successfully manages this:
print "INFO: Testing API..."
proxy = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
req = urllib2.urlopen('http://maps.googleapis.com/maps/api/geocode/json?address=blahblah&sensor=true')
returneddata = json.loads(req.read())
I then want to add a try statement around 'req', in order to handle a situation where the user is not connected to the internet, which I have tried like so:
try:
    req = urllib2.urlopen('http://maps.googleapis.com/maps/api/geocode/json?address=blahblah&sensor=true')
except urllib2.URLError:
    print "Unable to connect etc etc"
The trouble is that by doing that, it always throws the exception, even though the address is perfectly accessible and the code works without the try block.
Any ideas? Cheers.
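The thread records no accepted fix, but a quick way to see why the except branch fires (a debugging sketch, not from the original post) is to print the caught exception, since urllib2.URLError carries the underlying cause:

import urllib2

try:
    req = urllib2.urlopen('http://maps.googleapis.com/maps/api/geocode/json'
                          '?address=blahblah&sensor=true')
except urllib2.URLError as e:
    # e.reason holds the underlying cause (DNS failure, connection
    # refused, ...); HTTPError subclasses URLError and adds e.code.
    print "Unable to connect:", getattr(e, 'reason', e)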
