Tell urllib2 to use custom DNS - python

I'd like to tell urllib2.urlopen (or a custom opener) to use 127.0.0.1 (or ::1) to resolve addresses. I don't want to change my /etc/resolv.conf, however.
One possible solution is to use a tool like dnspython to query addresses and httplib to build a custom URL opener. I'd prefer telling urlopen to use a custom nameserver, though. Any suggestions?

Looks like name resolution is ultimately handled by socket.create_connection.
-> urllib2.urlopen
-> httplib.HTTPConnection
-> socket.create_connection
However, once the "Host:" header has been set, you can resolve the host yourself and pass the IP address down to the connection.
I'd suggest that you subclass httplib.HTTPConnection, and wrap the connect method to modify self.host before passing it to socket.create_connection.
Then subclass HTTPHandler (and HTTPSHandler) to replace the http_open method with one that passes your HTTPConnection instead of httplib's own to do_open.
Like this:
import urllib2
import httplib
import socket
import ssl  # needed by MyHTTPSConnection below

def MyResolver(host):
    if host == 'news.bbc.co.uk':
        return '66.102.9.104'  # Google IP
    else:
        return host

class MyHTTPConnection(httplib.HTTPConnection):
    def connect(self):
        self.sock = socket.create_connection((MyResolver(self.host), self.port), self.timeout)

class MyHTTPSConnection(httplib.HTTPSConnection):
    def connect(self):
        sock = socket.create_connection((MyResolver(self.host), self.port), self.timeout)
        self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

class MyHTTPSHandler(urllib2.HTTPSHandler):
    def https_open(self, req):
        return self.do_open(MyHTTPSConnection, req)

opener = urllib2.build_opener(MyHTTPHandler, MyHTTPSHandler)
urllib2.install_opener(opener)

f = urllib2.urlopen('http://news.bbc.co.uk')
data = f.read()

from lxml import etree
doc = etree.HTML(data)
print doc.xpath('//title/text()')
# ['Google']
Obviously there will be certificate issues if you use HTTPS, and you'll need to fill out MyResolver...

Another (dirty) way is to monkey-patch socket.getaddrinfo.
For example, this code adds an (unbounded) cache for DNS lookups.
import socket

prv_getaddrinfo = socket.getaddrinfo
dns_cache = {}  # or a weakref.WeakValueDictionary()

def new_getaddrinfo(*args):
    try:
        return dns_cache[args]
    except KeyError:
        res = prv_getaddrinfo(*args)
        dns_cache[args] = res
        return res

socket.getaddrinfo = new_getaddrinfo

You will need to implement your own DNS lookup client (or use dnspython, as you said). The name lookup procedure in glibc is quite complex in order to stay compatible with non-DNS name systems; there is, for example, no way to specify a particular DNS server in glibc at all.
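Combining the two ideas, here is a rough sketch (an assumption on my part, using dnspython's Resolver API; query is the 1.x name, resolve in 2.x) that monkey-patches socket.getaddrinfo to resolve names through 127.0.0.1 and fall back to the system resolver on failure:

import socket
import dns.resolver  # dnspython

resolver = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
resolver.nameservers = ['127.0.0.1']

prv_getaddrinfo = socket.getaddrinfo

def new_getaddrinfo(host, port, *args, **kwargs):
    try:
        answer = resolver.query(host, 'A')
        host = answer[0].address  # replace the name with the resolved IP
    except Exception:
        pass  # literal IPs or lookup failures: fall back to the system resolver
    return prv_getaddrinfo(host, port, *args, **kwargs)

socket.getaddrinfo = new_getaddrinfo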

Related

Twisted - forwarding proxy requests to another proxy (proxy chain)

I am setting up an HTTP proxy in Python to filter web content. I found a good example on StackOverflow which does exactly this using Twisted. However, I need another proxy to access the web. So, my proxy needs to forward requests to another proxy. What is the best way to do this using twisted.web.proxy?
I found a related question which needs something similar, but from a reverse proxy.
My best guess is that it should be possible to build a chained proxy by modifying or subclassing twisted.web.proxy.ProxyClient to connect to the next proxy instead of connecting to the web directly. Unfortunately I didn't find any clues in the documentation on how to do this.
The code I have so far (cited from that example):
from twisted.python import log
from twisted.web import http, proxy

class ProxyClient(proxy.ProxyClient):
    def handleResponsePart(self, buffer):
        proxy.ProxyClient.handleResponsePart(self, buffer)

class ProxyClientFactory(proxy.ProxyClientFactory):
    protocol = ProxyClient

class ProxyRequest(proxy.ProxyRequest):
    protocols = dict(http=ProxyClientFactory)

class Proxy(proxy.Proxy):
    requestFactory = ProxyRequest

class ProxyFactory(http.HTTPFactory):
    protocol = Proxy

portstr = "tcp:8080:interface=localhost"  # serve on localhost:8080

if __name__ == '__main__':
    import sys
    from twisted.internet import endpoints, reactor

    log.startLogging(sys.stdout)
    endpoint = endpoints.serverFromString(reactor, portstr)
    d = endpoint.listen(ProxyFactory())
    reactor.run()
This is actually not hard to implement using Twisted. Let me give you a simple example.
Suppose the first proxy is proxy1.py, like the code you pasted in your question;
the second proxy is proxy2.py.
For proxy1.py, you just need to override the process method of the ProxyRequest class, like this:
class ProxyRequest(proxy.ProxyRequest):
    def process(self):
        # urllib_parse comes from twisted.python.compat;
        # NeedGoToSecondProxy, your_second_proxy_server_ip and
        # your_second_proxy_port are placeholders you must define yourself.
        parsed = urllib_parse.urlparse(self.uri)
        protocol = parsed[0]
        host = parsed[1].decode('ascii')
        port = self.ports[protocol]
        if ':' in host:
            host, port = host.split(':')
            port = int(port)
        rest = urllib_parse.urlunparse((b'', b'') + parsed[2:])
        if not rest:
            rest = rest + b'/'
        class_ = self.protocols[protocol]
        headers = self.getAllHeaders().copy()
        if b'host' not in headers:
            headers[b'host'] = host.encode('ascii')
        self.content.seek(0, 0)
        s = self.content.read()
        clientFactory = class_(self.method, rest, self.clientproto, headers, s, self)
        if NeedGoToSecondProxy:
            self.reactor.connectTCP(your_second_proxy_server_ip, your_second_proxy_port, clientFactory)
        else:
            self.reactor.connectTCP(host, port, clientFactory)
For proxy2.py, you just need to set up another simple proxy. One problem to note, though: you may need to override the process method in proxy2.py as well, because self.uri may not be valid after the first proxy forwards the request along the chain.
For example, the original self.uri would be http://www.google.com/something?para1=xxx, but at the second proxy you may find it to be just /something?para1=xxx. So you need to extract the host info from self.headers and rebuild self.uri so that your second proxy can deliver the request to the correct destination, as sketched below.
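A minimal sketch of that fix for proxy2.py (the class name and header handling are my assumptions, based on the twisted.web.proxy API; getHeader returns bytes when given a bytes key):

class SecondProxyRequest(proxy.ProxyRequest):
    def process(self):
        # After the first hop the request line may carry only a path,
        # e.g. b'/something?para1=xxx'; rebuild an absolute URI from the
        # Host header so the parsing in the parent process() still works.
        if not self.uri.startswith(b'http'):
            host = self.getHeader(b'host') or b'localhost'
            self.uri = b'http://' + host + self.uri
        proxy.ProxyRequest.process(self)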

Rotate through Shadowsocks proxy via Python

I have Python code that needs to make use of various Shadowsocks proxy servers that I have set up, in order to use the IPs of those servers.
Say for example I would like to use:
1.1.1.1:5678
2.2.2.2:5678
3.3.3.3:5678
i.e., all these servers have the same remote port and the local ports are all 1080.
My preference is to have the 3 proxies rotate randomly, so that each time I send a urlopen() request (in urllib2), my code randomly connects to one of the proxies, sends the request via that proxy, and disconnects when the request is complete.
The IPs could be hard-coded or stored in some config file.
The problem is that, at the moment, all the samples I have found online seem to require the connection to be pre-established, with the Python code simply using whatever is on localhost:1080 instead of actively making connections.
I am just wondering if anyone could lend me a helping hand to accomplish this in the code.
Thanks!
If you have a look at the source of urllib2, you can see that when a default opener is installed, it really just takes an object with an open method. So you really just need to create an object whose open method builds and uses a random opener. Something like the following (untested) should work:
import urllib2
import random

class RandomOpener(object):
    def __init__(self, ip_list):
        self.ip_list = ip_list

    def open(self, *args, **kwargs):
        proxy = random.choice(self.ip_list)
        handler = urllib2.ProxyHandler({'http': 'http://' + proxy})
        opener = urllib2.build_opener(handler)
        return opener.open(*args, **kwargs)

my_opener = RandomOpener(['1.1.1.1:5678',
                          '2.2.2.2:5678',
                          '3.3.3.3:5678'])
urllib2.install_opener(my_opener)
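Once installed, every plain urlopen call then goes out through a randomly chosen proxy (URL hypothetical):

data = urllib2.urlopen('http://example.com/').read()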

Python urllib2 force IPv4

I am running a Python script that uses urllib2 to grab data from a weather API and display it on screen. I have had the problem that when I query the server, I get a "no address associated with hostname" error. I can view the output of the API with a web browser, and I can download the file with wget, but I have to force IPv4 to get it to work. Is it possible to force IPv4 in urllib2 when using urllib2.urlopen?
Not directly, no.
So, what can you do?
One possibility is to explicitly resolve the hostname to IPv4 yourself, and then use the IPv4 address instead of the name as the host. For example:
host = socket.gethostbyname('example.com')
page = urllib2.urlopen('http://{}/path'.format(host))
However, some virtual-server sites may require a Host: example.com header, and they will instead get a Host: 93.184.216.119. You can work around that by overriding the header:
host = socket.gethostbyname('example.com')
request = urllib2.Request('http://{}/path'.format(host),
                          headers={'Host': 'example.com'})
page = urllib2.urlopen(request)
Alternatively, you can provide your own handlers in place of the standard ones. But the standard handler is mostly just a wrapper around httplib.HTTPConnection, and the real problem is in HTTPConnection.connect.
So, the clean way to do this is to create your own subclass of httplib.HTTPConnection, which overrides connect like this:
def connect(self):
    host = socket.gethostbyname(self.host)
    self.sock = socket.create_connection((host, self.port),
                                         self.timeout, self.source_address)
    if self._tunnel_host:
        self._tunnel()
Then create your own subclass of urllib2.HTTPHandler that overrides http_open to use your subclass:
def http_open(self, req):
    return self.do_open(my_wrapper.MyHTTPConnection, req)
… and similarly for HTTPSHandler, and then hook up all the stuff properly as shown in the urllib2 docs.
The quick & dirty way to do the same thing is to just monkeypatch httplib.HTTPConnection.connect to the above function.
Finally, you could use a different library instead of urllib2. From what I remember, requests doesn't make this any easier (ultimately, you have to override or monkeypatch slightly different methods, but it's effectively the same). However, any libcurl wrapper will allow you to do the equivalent of curl_easy_setopt(h, CURLOPT_IPRESOLVE, CURLOPT_IPRESOLVE_V4).
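For example, with pycurl (one such libcurl wrapper), forcing IPv4 resolution might look like this sketch (URL hypothetical):

import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/path')
c.setopt(pycurl.IPRESOLVE, pycurl.IPRESOLVE_V4)  # resolve IPv4 addresses only
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.perform()
c.close()
print buf.getvalue()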
Not a proper answer but an alternative: call curl?
import subprocess
import sys

def log_error(msg):
    sys.stderr.write(msg + '\n')

def curl(url):
    process = subprocess.Popen(
        ["curl", "-fsSkL4", url],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = process.communicate()
    if process.returncode == 0:
        return stdout
    else:
        log_error("Failed to fetch: %s" % url)
        log_error(stderr)
        sys.exit(3)
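Usage is then a plain call; curl's -4 flag does the actual IPv4 forcing (URL hypothetical):

data = curl("http://api.example.com/weather")
print data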

Python SimpleXMLRPCServer: get user IP and simple authentication

I am trying to make a very simple XML RPC Server with Python that provides basic authentication + ability to obtain the connected user's IP. Let's take the example provided in http://docs.python.org/library/xmlrpclib.html :
import xmlrpclib
from SimpleXMLRPCServer import SimpleXMLRPCServer
def is_even(n):
    return n % 2 == 0

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(is_even, "is_even")
server.serve_forever()
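For reference, a client can then call the server like this (the standard xmlrpclib pattern):

import xmlrpclib

proxy = xmlrpclib.ServerProxy("http://localhost:8000/")
print proxy.is_even(4)  # True
print proxy.is_even(7)  # False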
So now, the first idea behind this is to make the user supply credentials and process them before allowing him to use the functions. I need very simple authentication, for example just a passcode. Right now what I'm doing is forcing the user to supply this code in the function call and testing it with an if-statement.
The second idea is to be able to get the user's IP when he calls a function, or to store it once he connects to the server.
Moreover, I already have an Apache Server running and it might be simpler to integrate this into it.
What do you think?
This is a related question that I found helpful:
IP address of client in Python SimpleXMLRPCServer?
What worked for me was to grab the client_address in an overridden finish_request method of the server, stash it in the server itself, and then access it in an overridden _dispatch routine. You might be able to access the server itself from within the method, too, but I was just trying to add the IP address as an automatic first argument to all my method calls. The reason I used a dict is that I'm also going to add a session token and perhaps other metadata as well.
from xmlrpc.server import DocXMLRPCServer
from socketserver import BaseServer

class NewXMLRPCServer(DocXMLRPCServer):
    def finish_request(self, request, client_address):
        self.client_address = client_address
        BaseServer.finish_request(self, request, client_address)

    def _dispatch(self, method, params):
        metadata = {'client_address': self.client_address[0]}
        newParams = (metadata,) + params
        return DocXMLRPCServer._dispatch(self, method, newParams)
Note this will BREAK introspection functions like system.listMethods() because that isn't expecting the extra argument. One idea would be to check the method name for "system." and just pass the regular params in that case.
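A sketch of that guard (my assumption, following the note above):

def _dispatch(self, method, params):
    if method.startswith('system.'):
        # Introspection methods don't expect the injected metadata argument.
        return DocXMLRPCServer._dispatch(self, method, params)
    metadata = {'client_address': self.client_address[0]}
    return DocXMLRPCServer._dispatch(self, method, (metadata,) + params)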

Validate SSL certificates with Python

I need to write a script that connects to a bunch of sites on our corporate intranet over HTTPS and verifies that their SSL certificates are valid; that they are not expired, that they are issued for the correct address, etc. We use our own internal corporate Certificate Authority for these sites, so we have the public key of the CA to verify the certificates against.
Python by default just accepts and uses SSL certificates when using HTTPS, so even if a certificate is invalid, Python libraries such as urllib2 and Twisted will just happily use the certificate.
How do I verify a certificate in Python?
I have added a distribution to the Python Package Index which makes the match_hostname() function from the Python 3.2 ssl package available on previous versions of Python.
http://pypi.python.org/pypi/backports.ssl_match_hostname/
You can install it with:
pip install backports.ssl_match_hostname
Or you can make it a dependency listed in your project's setup.py. Either way, it can be used like this:
from backports.ssl_match_hostname import match_hostname, CertificateError
...
sslsock = ssl.wrap_socket(sock, ssl_version=ssl.PROTOCOL_SSLv3,
                          cert_reqs=ssl.CERT_REQUIRED, ca_certs=...)
try:
    match_hostname(sslsock.getpeercert(), hostname)
except CertificateError, ce:
    ...
You can use Twisted to verify certificates. The main API is CertificateOptions, which can be provided as the contextFactory argument to various functions such as listenSSL and startTLS.
Unfortunately, neither Python nor Twisted comes with the pile of CA certificates required to actually do HTTPS validation, nor with the HTTPS validation logic. Due to a limitation in PyOpenSSL, you can't do it completely correctly just yet, but thanks to the fact that almost all certificates include a subject commonName, you can get close enough.
Here is a naive sample implementation of a verifying Twisted HTTPS client which ignores wildcards and subjectAltName extensions, and uses the certificate-authority certificates present in the 'ca-certificates' package in most Ubuntu distributions. Try it with your favorite valid and invalid certificate sites :).
import os
import glob
from OpenSSL.SSL import Context, TLSv1_METHOD, VERIFY_PEER, VERIFY_FAIL_IF_NO_PEER_CERT, OP_NO_SSLv2
from OpenSSL.crypto import load_certificate, FILETYPE_PEM
from twisted.python.urlpath import URLPath
from twisted.internet.ssl import ContextFactory
from twisted.internet import reactor
from twisted.web.client import getPage

certificateAuthorityMap = {}
for certFileName in glob.glob("/etc/ssl/certs/*.pem"):
    # There might be some dead symlinks in there, so let's make sure it's real.
    if os.path.exists(certFileName):
        data = open(certFileName).read()
        x509 = load_certificate(FILETYPE_PEM, data)
        digest = x509.digest('sha1')
        # Now, de-duplicate in case the same cert has multiple names.
        certificateAuthorityMap[digest] = x509

class HTTPSVerifyingContextFactory(ContextFactory):
    isClient = True

    def __init__(self, hostname):
        self.hostname = hostname

    def getContext(self):
        ctx = Context(TLSv1_METHOD)
        store = ctx.get_cert_store()
        for value in certificateAuthorityMap.values():
            store.add_cert(value)
        ctx.set_verify(VERIFY_PEER | VERIFY_FAIL_IF_NO_PEER_CERT, self.verifyHostname)
        ctx.set_options(OP_NO_SSLv2)
        return ctx

    def verifyHostname(self, connection, x509, errno, depth, preverifyOK):
        if preverifyOK:
            if self.hostname != x509.get_subject().commonName:
                return False
        return preverifyOK

def secureGet(url):
    return getPage(url, HTTPSVerifyingContextFactory(URLPath.fromString(url).netloc))

def done(result):
    print 'Done!', len(result)

secureGet("https://google.com/").addCallback(done)
reactor.run()
PycURL does this beautifully.
Below is a short example. It will throw a pycurl.error if something is fishy, in which you get a tuple with the error code and a human-readable message.
import pycurl
curl = pycurl.Curl()
curl.setopt(pycurl.CAINFO, "myFineCA.crt")
curl.setopt(pycurl.SSL_VERIFYPEER, 1)
curl.setopt(pycurl.SSL_VERIFYHOST, 2)
curl.setopt(pycurl.URL, "https://internal.stuff/")
curl.perform()
You will probably want to configure more options, like where to store the results, etc. But no need to clutter the example with non-essentials.
Example of what exceptions might be raised:
(60, 'Peer certificate cannot be authenticated with known CA certificates')
(51, "common name 'CN=something.else.stuff,O=Example Corp,C=SE' does not match 'internal.stuff'")
Some links that I found useful are the libcurl-docs for setopt and getinfo.
http://curl.haxx.se/libcurl/c/curl_easy_setopt.html
http://curl.haxx.se/libcurl/c/curl_easy_getinfo.html
From release version 2.7.9/3.4.3 on, Python by default attempts to perform certificate validation.
This was proposed in PEP 476, which is worth a read: https://www.python.org/dev/peps/pep-0476/
The changes affect all relevant stdlib modules (urllib/urllib2, http, httplib).
Relevant documentation:
https://docs.python.org/2/library/httplib.html#httplib.HTTPSConnection
This class now performs all the necessary certificate and hostname checks by default. To revert to the previous, unverified, behavior ssl._create_unverified_context() can be passed to the context parameter.
https://docs.python.org/3/library/http.client.html#http.client.HTTPSConnection
Changed in version 3.4.3: This class now performs all the necessary certificate and hostname checks by default. To revert to the previous, unverified, behavior ssl._create_unverified_context() can be passed to the context parameter.
Note that the new built-in verification is based on the system-provided certificate database. Opposed to that, the requests package ships its own certificate bundle. Pros and cons of both approaches are discussed in the Trust database section of PEP 476.
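As a quick sketch of the new default behavior (hostnames are placeholders), along with the documented opt-out:

import ssl
import httplib  # http.client on Python 3

# Verified by default on 2.7.9+/3.4.3+; raises an SSL error if the
# certificate or hostname check fails.
conn = httplib.HTTPSConnection('example.com')
conn.request('GET', '/')
print conn.getresponse().status

# Reverting to the old, unverified behavior:
insecure = httplib.HTTPSConnection('self-signed.badssl.com',
                                   context=ssl._create_unverified_context())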
Or simply make your life easier by using the requests library:
import requests
requests.get('https://somesite.com', verify='/path/to/ca-bundle.crt')
(Note: verify takes either a boolean or a path to a CA bundle; the separate cert parameter is for client-side certificates.) The requests documentation has a few more words about its usage.
Here's an example script which demonstrates certificate validation:
import httplib
import re
import socket
import sys
import urllib2
import ssl

class InvalidCertificateException(httplib.HTTPException, urllib2.URLError):
    def __init__(self, host, cert, reason):
        httplib.HTTPException.__init__(self)
        self.host = host
        self.cert = cert
        self.reason = reason

    def __str__(self):
        return ('Host %s returned an invalid certificate (%s) %s\n' %
                (self.host, self.reason, self.cert))

class CertValidatingHTTPSConnection(httplib.HTTPConnection):
    default_port = httplib.HTTPS_PORT

    def __init__(self, host, port=None, key_file=None, cert_file=None,
                 ca_certs=None, strict=None, **kwargs):
        httplib.HTTPConnection.__init__(self, host, port, strict, **kwargs)
        self.key_file = key_file
        self.cert_file = cert_file
        self.ca_certs = ca_certs
        if self.ca_certs:
            self.cert_reqs = ssl.CERT_REQUIRED
        else:
            self.cert_reqs = ssl.CERT_NONE

    def _GetValidHostsForCert(self, cert):
        if 'subjectAltName' in cert:
            return [x[1] for x in cert['subjectAltName']
                    if x[0].lower() == 'dns']
        else:
            return [x[0][1] for x in cert['subject']
                    if x[0][0].lower() == 'commonname']

    def _ValidateCertificateHostname(self, cert, hostname):
        hosts = self._GetValidHostsForCert(cert)
        for host in hosts:
            host_re = host.replace('.', r'\.').replace('*', '[^.]*')
            if re.search('^%s$' % (host_re,), hostname, re.I):
                return True
        return False

    def connect(self):
        sock = socket.create_connection((self.host, self.port))
        self.sock = ssl.wrap_socket(sock, keyfile=self.key_file,
                                    certfile=self.cert_file,
                                    cert_reqs=self.cert_reqs,
                                    ca_certs=self.ca_certs)
        if self.cert_reqs & ssl.CERT_REQUIRED:
            cert = self.sock.getpeercert()
            hostname = self.host.split(':')[0]
            if not self._ValidateCertificateHostname(cert, hostname):
                raise InvalidCertificateException(hostname, cert,
                                                  'hostname mismatch')

class VerifiedHTTPSHandler(urllib2.HTTPSHandler):
    def __init__(self, **kwargs):
        urllib2.AbstractHTTPHandler.__init__(self)
        self._connection_args = kwargs

    def https_open(self, req):
        def http_class_wrapper(host, **kwargs):
            full_kwargs = dict(self._connection_args)
            full_kwargs.update(kwargs)
            return CertValidatingHTTPSConnection(host, **full_kwargs)

        try:
            return self.do_open(http_class_wrapper, req)
        except urllib2.URLError, e:
            if type(e.reason) == ssl.SSLError and e.reason.args[0] == 1:
                raise InvalidCertificateException(req.host, '',
                                                  e.reason.args[1])
            raise

    https_request = urllib2.HTTPSHandler.do_request_

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "usage: python %s CA_CERT URL" % sys.argv[0]
        exit(2)
    handler = VerifiedHTTPSHandler(ca_certs=sys.argv[1])
    opener = urllib2.build_opener(handler)
    print opener.open(sys.argv[2]).read()
M2Crypto can do the validation. You can also use M2Crypto with Twisted if you like. The Chandler desktop client uses Twisted for networking and M2Crypto for SSL, including certificate validation.
Based on Glyph's comment, it seems like M2Crypto does better certificate verification by default than what you can currently do with pyOpenSSL, because M2Crypto checks the subjectAltName field too.
I've also blogged on how to get the certificates Mozilla Firefox ships with in Python and usable with Python SSL solutions.
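A hedged sketch of M2Crypto-based verification (API per the M2Crypto docs; the CA bundle path and hostname are placeholders). Connection.connect() also runs a hostname check via M2Crypto's SSL.Checker by default:

from M2Crypto import SSL

ctx = SSL.Context('tlsv1')
ctx.load_verify_locations('ca-bundle.pem')  # your CA's public cert(s)
ctx.set_verify(SSL.verify_peer | SSL.verify_fail_if_no_peer_cert, depth=9)

conn = SSL.Connection(ctx)
conn.connect(('internal.example.com', 443))  # raises on verification failure
conn.close()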
Jython DOES carry out certificate verification by default, so using standard library modules, e.g. httplib.HTTPSConnection, etc., with Jython will verify certificates and raise exceptions for failures, e.g. mismatched identities, expired certs, etc.
In fact, you have to do some extra work to get jython to behave like cpython, i.e. to get jython to NOT verify certs.
I have written a blog post on how to disable certificate checking on jython, because it can be useful in testing phases, etc.
Installing an all-trusting security provider on java and jython.
http://jython.xhaus.com/installing-an-all-trusting-security-provider-on-java-and-jython/
The following code lets you benefit from all the standard SSL validation checks (e.g. date validity, CA certificate chain, ...) except hostname checking, which is replaced by a pluggable verification step where you can, e.g., verify the hostname yourself or perform additional certificate checks.
from httplib import HTTPSConnection
import ssl

def create_custom_HTTPSConnection(host):
    def verify_cert(cert, host):
        # Write your code here
        # You can certainly base yourself on ssl.match_hostname
        # Raise ssl.CertificateError if verification fails
        print 'Host:', host
        print 'Peer cert:', cert

    class CustomHTTPSConnection(HTTPSConnection, object):
        def connect(self):
            super(CustomHTTPSConnection, self).connect()
            cert = self.sock.getpeercert()
            verify_cert(cert, host)

    context = ssl.create_default_context()
    context.check_hostname = False
    return CustomHTTPSConnection(host=host, context=context)

if __name__ == '__main__':
    # try expired.badssl.com or self-signed.badssl.com !
    conn = create_custom_HTTPSConnection('badssl.com')
    conn.request('GET', '/')
    conn.getresponse().read()
pyOpenSSL is an interface to the OpenSSL library. It should provide everything you need.
I was having the same problem but wanted to minimize 3rd-party dependencies (because this one-off script was to be executed by many users). My solution was to wrap a curl call and make sure that the exit code was 0, as sketched below. Worked like a charm.
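A sketch of that wrapper (the flags and paths are my assumptions): curl exits non-zero when verification against the given CA fails, so checking the return code is enough.

import subprocess

def cert_ok(url, ca_bundle='corp-ca.pem'):
    # --cacert makes curl verify the server cert against our corporate CA;
    # a non-zero exit code means the check (or the fetch) failed.
    return subprocess.call(['curl', '-sS', '-o', '/dev/null',
                            '--cacert', ca_bundle, url]) == 0

print cert_ok('https://intranet.example.com/')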
