I am running a Python script that uses urllib2 to grab data from a weather API and display it on screen. The problem is that when I query the server, I get a "no address associated with hostname" error. I can view the API's output in a web browser, and I can download it with wget, but only if I force IPv4. Is it possible to force IPv4 in urllib2 when using urllib2.urlopen?
Not directly, no.
So, what can you do?
One possibility is to explicitly resolve the hostname to IPv4 yourself, and then use the IPv4 address instead of the name as the host. For example:
host = socket.gethostbyname('example.com')
page = urllib2.urlopen('http://{}/path'.format(host))
However, some virtual-hosting sites require a Host: example.com header, and with this approach they will instead receive Host: 93.184.216.119. You can work around that by overriding the header:
host = socket.gethostbyname('example.com')
request = urllib2.Request('http://{}/path'.format(host),
                          headers={'Host': 'example.com'})
page = urllib2.urlopen(request)
Alternatively, you can provide your own handlers in place of the standard ones. But the standard handler is mostly just a wrapper around httplib.HTTPConnection, and the real problem is in HTTPConnection.connect.
So, the clean way to do this is to create your own subclass of httplib.HTTPConnection, which overrides connect like this:
def connect(self):
    # Resolve the hostname to an IPv4 address before connecting.
    host = socket.gethostbyname(self.host)
    self.sock = socket.create_connection((host, self.port),
                                         self.timeout, self.source_address)
    if self._tunnel_host:
        self._tunnel()
Then create your own subclass of urllib2.HTTPHandler that overrides http_open to use your subclass:
def http_open(self, req):
    return self.do_open(MyHTTPConnection, req)
… and similarly for HTTPSHandler, and then hook up all the stuff properly as shown in the urllib2 docs.
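For reference, the final wiring might look like this (a minimal sketch; MyHTTPHandler and MyHTTPSHandler are the subclasses described above, not a standard API):
opener = urllib2.build_opener(MyHTTPHandler, MyHTTPSHandler)
urllib2.install_opener(opener)
page = urllib2.urlopen('http://example.com/path')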
The quick & dirty way to do the same thing is to just monkeypatch httplib.HTTPConnection.connect to the above function.
Finally, you could use a different library instead of urllib2. From what I remember, requests doesn't make this any easier (ultimately you have to override or monkeypatch slightly different methods, but it's effectively the same). However, any libcurl wrapper will let you do the equivalent of curl_easy_setopt(h, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4).
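For example, with pycurl (one such libcurl wrapper), a minimal sketch might look like this; the URL is a placeholder:
import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/path')   # placeholder URL
c.setopt(pycurl.IPRESOLVE, pycurl.IPRESOLVE_V4)   # force IPv4 resolution
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.perform()
c.close()
data = buf.getvalue()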
Not a proper answer but an alternative: call curl? The -4 flag is what forces IPv4; -f fails on HTTP errors, -s silences the progress bar, -S still shows errors, -k skips certificate checks, and -L follows redirects.
import subprocess
import sys

def log_error(msg):
    sys.stderr.write(msg + '\n')

def curl(url):
    process = subprocess.Popen(
        ["curl", "-fsSkL4", url],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = process.communicate()
    if process.returncode == 0:
        return stdout
    else:
        log_error("Failed to fetch: %s" % url)
        log_error(stderr)
        sys.exit(3)
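Usage is then a one-liner (the URL is a placeholder):
data = curl("http://api.example.com/weather")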
I am setting up an HTTP proxy in Python to filter web content. I found a good example on Stack Overflow that does exactly this using Twisted. However, I need another proxy to access the web, so my proxy needs to forward its requests to another proxy. What is the best way to do this using twisted.web.proxy?
I found a related question that needs something similar, but starting from a reverse proxy.
My best guess is that it should be possible to build a chained proxy by modifying or subclassing twisted.web.proxy.ProxyClient to connect to the next proxy instead of connecting to the web directly. Unfortunately I didn't find any clues in the documentation on how to do this.
The code I have so far:
from twisted.python import log
from twisted.web import http, proxy

class ProxyClient(proxy.ProxyClient):
    def handleResponsePart(self, buffer):
        proxy.ProxyClient.handleResponsePart(self, buffer)

class ProxyClientFactory(proxy.ProxyClientFactory):
    protocol = ProxyClient

class ProxyRequest(proxy.ProxyRequest):
    protocols = dict(http=ProxyClientFactory)

class Proxy(proxy.Proxy):
    requestFactory = ProxyRequest

class ProxyFactory(http.HTTPFactory):
    protocol = Proxy

portstr = "tcp:8080:interface=localhost"  # serve on localhost:8080

if __name__ == '__main__':
    import sys
    from twisted.internet import endpoints, reactor

    log.startLogging(sys.stdout)
    endpoint = endpoints.serverFromString(reactor, portstr)
    d = endpoint.listen(ProxyFactory())
    reactor.run()
This is actually not hard to implement using Twisted. Let me give you a simple example.
Suppose the first proxy is proxy1.py, like the code you pasted in your question, and the second proxy is proxy2.py.
For proxy1.py, you just need to override the process method of the ProxyRequest class, like this:
from urllib import parse as urllib_parse  # Python 3; on Python 2 use the urlparse module

class ProxyRequest(proxy.ProxyRequest):
    def process(self):
        parsed = urllib_parse.urlparse(self.uri)
        protocol = parsed[0]
        host = parsed[1].decode('ascii')
        port = self.ports[protocol]
        if ':' in host:
            host, port = host.split(':')
            port = int(port)
        rest = urllib_parse.urlunparse((b'', b'') + parsed[2:])
        if not rest:
            rest = rest + b'/'
        class_ = self.protocols[protocol]
        headers = self.getAllHeaders().copy()
        if b'host' not in headers:
            headers[b'host'] = host.encode('ascii')
        self.content.seek(0, 0)
        s = self.content.read()
        clientFactory = class_(self.method, rest, self.clientproto, headers, s, self)
        if NeedGoToSecondProxy:  # your own flag: should this request be chained?
            self.reactor.connectTCP(your_second_proxy_server_ip, your_second_proxy_port, clientFactory)
        else:
            self.reactor.connectTCP(host, port, clientFactory)
For proxy2.py, you just need to set up another simple proxy. One thing to watch out for: you may need to override process in proxy2.py as well, because self.uri may no longer be valid after the first proxy forwards the request (the chain).
For example, the original self.uri might be http://www.google.com/something?para1=xxx, but at the second proxy you may see it as /something?para1=xxx only. So you need to extract the host info from self.headers and rebuild self.uri so that your second proxy can deliver the request to the correct destination.
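A minimal sketch of that fix-up for proxy2.py (assuming bytes URIs and headers, as in Twisted on Python 3; the name SecondProxyRequest is invented here):
class SecondProxyRequest(proxy.ProxyRequest):
    def process(self):
        # If the chained request arrived with a path-only URI such as
        # b'/something?para1=xxx', rebuild the absolute form from the
        # Host header before the normal proxy logic runs.
        if self.uri.startswith(b'/'):
            host = self.getAllHeaders().get(b'host', b'')
            self.uri = b'http://' + host + self.uri
        proxy.ProxyRequest.process(self)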
I have Python code that needs to use various ShadowSocks proxy servers that I have set up, in order to use those servers' IPs.
Say for example I would like to use:
1.1.1.1:5678
2.2.2.2:5678
3.3.3.3:5678
i.e., all these servers have the same remote port and the local ports are all 1080.
My preference is to have the 3 proxies rotate randomly, so that each time I send a urlopen() request (in urllib2), my code randomly connects to one of the proxies, sends the request via that proxy, and disconnects when the request is complete.
The IPs could be hard-coded or stored in a config file.
The problem is that all the samples I have found online seem to require the connection to be pre-established, with the Python code simply using whatever is on localhost:1080 instead of actively making connections.
I am just wondering if anyone could lend me a helping hand to accomplish this in the code.
Thanks!
If you have a look at the source of urllib2, you can see that when a default opener is installed, it really just takes an object with an open method. So you just need to create an object whose open method builds and uses a random opener. Something like the following (untested) should work:
import urllib2
import random

class RandomOpener(object):
    def __init__(self, ip_list):
        self.ip_list = ip_list

    def open(self, *args, **kwargs):
        # Pick a proxy at random and build a one-off opener that uses it.
        proxy = random.choice(self.ip_list)
        handler = urllib2.ProxyHandler({'http': 'http://' + proxy})
        opener = urllib2.build_opener(handler)
        return opener.open(*args, **kwargs)

my_opener = RandomOpener(['1.1.1.1:5678',
                          '2.2.2.2:5678',
                          '3.3.3.3:5678'])
urllib2.install_opener(my_opener)
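After that, each call goes through a randomly chosen proxy (the URL is a placeholder):
response = urllib2.urlopen('http://example.com/')
print response.read()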
I want to access an FTP server anonymously, just for downloads. My company has a proxy, and FTP port 21 is blocked, so I can't access the FTP server directly.
What I want to do is write code that behaves exactly the way browsers do. The idea is that if I can download the files with my browser, there must be a way to do it with code.
My code works when I access a website outside the company, but it still does not work for FTP servers.
proxy = urllib2.ProxyHandler({'https': 'proxy.mycompanhy.com:8080',
                              'http': 'proxy.mycompanhy.com:80',
                              'ftp': 'proxy.mycompanhy.com:21'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
urlAddress = 'https://python.org'
# urlAddress = 'ftp://ftp1.cptec.inpe.br'
conn = urllib2.urlopen(urlAddress)
return_str = conn.read()
print return_str
When I try to access python.org, it works fine. If I remove the install_opener part, it stops working, which proves that the proxy is required.
When I use the FTP URL, it blocks (or times out if I set those parameters).
I understand that FTP and HTTP are two very different protocols.
What I don't understand is the mechanism browsers use to access these FTP servers.
I mean, I don't know whether there is a layer on the server side that interfaces between HTTP and FTP and returns HTML, or whether the browser accesses the FTP server in some other manner and builds the page itself.
There might also be confusion between the FTP domain (or URL) and the connection mode. It seems that when urllib2 reads ftp://... it automatically uses port 21.
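For what it's worth, browsers typically hand the whole ftp:// URL to the HTTP proxy as an ordinary proxied request, and the proxy performs the FTP transfer itself, often returning an HTML listing. A sketch of that idea with urllib2, assuming the proxy gateways FTP on its HTTP port (8080 here is an assumption):
import urllib2

# Point the ftp scheme at the HTTP proxy so urllib2 sends
# "GET ftp://..." to the proxy instead of opening port 21 itself.
proxy = urllib2.ProxyHandler({'ftp': 'http://proxy.mycompanhy.com:8080'})
opener = urllib2.build_opener(proxy)
print opener.open('ftp://ftp1.cptec.inpe.br').read()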
I found a solution using the wget package. It handles proxies, but the documentation was very obscure: you need to set environment variables with the proxy name.
import wget
import os
import errno

# set up proxy
os.environ["ftp_proxy"] = "proxy.mycompanhy.com"
os.environ["http_proxy"] = "proxy.mycompanhy.com"
os.environ["https_proxy"] = "proxy.mycompanhy.com"

src = "http://domain.gov/data/fileToDownload.txt"
out = "C:\\outFolder\\outFileName.txt"  # out is optional

# create the output folder if it doesn't exist
outFolder, _ = os.path.split(out)
try:
    os.makedirs(outFolder)
except OSError as exc:  # Python >2.5
    if exc.errno == errno.EEXIST and os.path.isdir(outFolder):
        pass
    else:
        raise

# download
filename = wget.download(src, out)
I am trying to make a very simple XML-RPC server in Python that provides basic authentication plus the ability to obtain the connected user's IP. Let's take the example provided in http://docs.python.org/library/xmlrpclib.html:
import xmlrpclib
from SimpleXMLRPCServer import SimpleXMLRPCServer

def is_even(n):
    return n % 2 == 0

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(is_even, "is_even")
server.serve_forever()
So now, the first idea is to make the user supply credentials and process them before allowing them to use the functions. I need very simple authentication, for example just a code. Right now I force the user to supply this code in the function call and test it with an if statement, as in the sketch below.
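A minimal sketch of that if-statement check (SECRET is a placeholder):
SECRET = "some-shared-code"  # placeholder shared secret

def is_even(code, n):
    # Reject the call unless the client supplied the right code.
    if code != SECRET:
        raise ValueError("invalid auth code")
    return n % 2 == 0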
The second idea is to be able to get the user's IP when they call a function, or to store it when they connect to the server.
Moreover, I already have an Apache server running, and it might be simpler to integrate this into it.
What do you think?
This is a related question that I found helpful:
IP address of client in Python SimpleXMLRPCServer?
What worked for me was to grab the client_address in an overridden finish_request method of the server, stash it on the server itself, and then access it in an overridden _dispatch routine. You might be able to access the server itself from within the method, too, but I was just trying to add the IP address as an automatic first argument to all my method calls. I used a dict because I'm also going to add a session token, and perhaps other metadata, as well.
from xmlrpc.server import DocXMLRPCServer
from socketserver import BaseServer

class NewXMLRPCServer(DocXMLRPCServer):
    def finish_request(self, request, client_address):
        # Stash the client address on the server before dispatch runs.
        self.client_address = client_address
        BaseServer.finish_request(self, request, client_address)

    def _dispatch(self, method, params):
        metadata = {'client_address': self.client_address[0]}
        newParams = (metadata,) + params
        return DocXMLRPCServer._dispatch(self, method, newParams)
Note that this will BREAK introspection functions like system.listMethods(), because they aren't expecting the extra argument. One idea would be to check the method name for a "system." prefix and pass the regular params in that case.
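A usage sketch under that convention (every registered function now takes the metadata dict as its first argument; names are placeholders):
def is_even(meta, n):
    # meta is injected by _dispatch above, not sent by the client.
    print('call from %s' % meta['client_address'])
    return n % 2 == 0

server = NewXMLRPCServer(('localhost', 8000))
server.register_function(is_even, 'is_even')
server.serve_forever()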
I'd like to tell urllib2.urlopen (or a custom opener) to use 127.0.0.1 (or ::1) as the nameserver when resolving addresses, without changing my /etc/resolv.conf.
One possible solution is to use a tool like dnspython to query addresses and httplib to build a custom URL opener. I'd prefer telling urlopen to use a custom nameserver, though. Any suggestions?
Looks like name resolution is ultimately handled by socket.create_connection.
-> urllib2.urlopen
-> httplib.HTTPConnection
-> socket.create_connection
Though once the "Host:" header has been set, you can resolve the host yourself and pass the IP address down to the opener.
I'd suggest that you subclass httplib.HTTPConnection and wrap the connect method to modify self.host before passing it to socket.create_connection.
Then subclass HTTPHandler (and HTTPSHandler) to replace the http_open method with one that passes your HTTPConnection, instead of httplib's own, to do_open.
Like this:
import urllib2
import httplib
import socket
import ssl

def MyResolver(host):
    # Toy resolver: map one hostname to a fixed IP, pass others through.
    if host == 'news.bbc.co.uk':
        return '66.102.9.104'  # a Google IP
    else:
        return host

class MyHTTPConnection(httplib.HTTPConnection):
    def connect(self):
        self.sock = socket.create_connection((MyResolver(self.host), self.port), self.timeout)

class MyHTTPSConnection(httplib.HTTPSConnection):
    def connect(self):
        sock = socket.create_connection((MyResolver(self.host), self.port), self.timeout)
        self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

class MyHTTPSHandler(urllib2.HTTPSHandler):
    def https_open(self, req):
        return self.do_open(MyHTTPSConnection, req)

opener = urllib2.build_opener(MyHTTPHandler, MyHTTPSHandler)
urllib2.install_opener(opener)

f = urllib2.urlopen('http://news.bbc.co.uk')
data = f.read()

from lxml import etree
doc = etree.HTML(data)
print doc.xpath('//title/text()')
# ['Google']
Obviously there are certificate issues if you use HTTPS, and you'll need to fill out MyResolver...
Another (dirty) way is to monkey-patch socket.getaddrinfo.
For example, this code adds an (unbounded) cache of DNS lookups:
import socket

prv_getaddrinfo = socket.getaddrinfo
dns_cache = {}  # or a weakref.WeakValueDictionary()

def new_getaddrinfo(*args):
    try:
        return dns_cache[args]
    except KeyError:
        res = prv_getaddrinfo(*args)
        dns_cache[args] = res
        return res

socket.getaddrinfo = new_getaddrinfo
You will need to implement your own DNS lookup client (or use dnspython, as you said). The name-lookup procedure in glibc is pretty complex, to ensure compatibility with other, non-DNS name systems; for example, there is no way to specify a particular DNS server in the glibc library at all.
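A minimal sketch of the dnspython side, querying a specific nameserver (127.0.0.1 here) instead of the ones in /etc/resolv.conf; the resolved address could then be fed into a resolver function like MyResolver above:
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
resolver.nameservers = ['127.0.0.1']

answer = resolver.query('news.bbc.co.uk', 'A')  # dnspython 1.x; 2.x uses resolve()
print(answer[0].address)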