I'm trying to make a web client application using Twisted, but I'm having some trouble with cookies. Does anyone have an example I can look at?
While it's true that getPage doesn't easily allow direct access to the request or response headers (just one example of how getPage isn't a super awesome API), cookies are actually supported.
from twisted.web.client import getPage

cookies = {'cookie_name': 'cookie_value'}  # cookies to send
d = getPage(url, cookies=cookies)

def cbPage(result):
    print 'Look at my cookies:', cookies

d.addCallback(cbPage)
Any cookies in the dictionary when it is passed to getPage will be sent. Any new cookies the server sets in response to the request will be added to the dictionary.
You might have missed this feature when looking at getPage because the getPage signature doesn't have a cookies parameter anywhere in it! However, it does take **kwargs, and this is how cookies is supported: any extra arguments passed to getPage that it doesn't know about itself, it passes on to HTTPClientFactory.__init__. Take a look at that method's signature to see all of the things you can pass to getPage.
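For example, a hedged sketch (the URL and cookie values are illustrative):

from twisted.web.client import getPage

# None of these keyword arguments appear in getPage's own signature;
# they are forwarded via **kwargs to HTTPClientFactory.__init__
# (agent, cookies, followRedirect, timeout, ...).
d = getPage('http://example.com',
            agent='my-client/1.0',
            cookies={'session': 'abc'},
            followRedirect=1)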
It turns out there is no easy way, as far as I can tell.
The headers are stored in twisted.web.client.HTTPClientFactory but are not available from twisted.web.client.getPage(), which is the function designed for pulling back a web page. I ended up rewriting the function:
from twisted.web import client

def getPage(url, contextFactory=None, *args, **kwargs):
    fact = client._makeGetterFactory(
        url,
        client.HTTPClientFactory,
        contextFactory=contextFactory,
        *args, **kwargs)
    return fact.deferred.addCallback(
        lambda data: (data, fact.response_headers))
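A hedged usage sketch of that rewritten getPage (illustrative URL; assumes a running reactor):

def printResult((data, headers)):
    # Python 2 tuple unpacking in the callback signature
    print 'body length:', len(data)
    print 'headers:', headers

getPage('http://example.com').addCallback(printResult)

Alternatively, a custom factory can collect the page, headers, and cookies into a single result: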
from twisted.internet import reactor
from twisted.web import client

def getPage(url, contextFactory=None, *args, **kwargs):
    return client._makeGetterFactory(
        url,
        CustomHTTPClientFactory,
        contextFactory=contextFactory,
        *args, **kwargs).deferred

class CustomHTTPClientFactory(client.HTTPClientFactory):
    def __init__(self, url, method='GET', postdata=None, headers=None,
                 agent="Twisted PageGetter", timeout=0, cookies=None,
                 followRedirect=1, redirectLimit=20):
        client.HTTPClientFactory.__init__(self, url, method, postdata,
                                          headers, agent, timeout, cookies,
                                          followRedirect, redirectLimit)

    def page(self, page):
        if self.waiting:
            self.waiting = 0
            res = {}
            res['page'] = page
            res['headers'] = self.response_headers
            res['cookies'] = self.cookies
            self.deferred.callback(res)

if __name__ == '__main__':
    def cback(result):
        for k in result:
            print k, '==>', result[k]
        reactor.stop()

    def eback(error):
        print error.getTraceback()
        reactor.stop()

    d = getPage('http://example.com', agent='example web client',
                cookies={'some': 'cookie'})
    d.addCallback(cback)
    d.addErrback(eback)
    reactor.run()
I'm not sure if I have used the correct terminology in the question.
Currently, I am trying to make a wrapper/interface around Google's Blogger API (Blog service).
[I know it has been done already, but I am using this as a project to learn OOP/python.]
I have made a method that gets a set of 25 posts from a blog:
def get_posts(self, **kwargs):
    """ Makes an API request. Returns list of posts. """
    api_url = '/blogs/{id}/posts'.format(id=self.id)
    return self._send_request(api_url, kwargs)
def _send_request(self, url, parameters={}):
    """ Sends an HTTP GET request to the Blogger API.
    Returns JSON decoded response as a dict. """
    url = '{base}{url}?'.format(base=self.base, url=url)
    # Requests formats the parameters into the URL for me
    try:
        r = requests.get(url, params=parameters)
    except:
        print "** Could not reach url:\n", url
        return
    api_response = r.text
    return self._jload(api_response)
The problem is, I have to specify the API key every time I call the get_posts function:
someblog = BloggerClient(url='http://someblog.blogger.com', key='0123')
someblog.get_posts(key=someblog.key)
Every API call requires that the key be sent as a parameter on the URL.
Then, what is the best way to do that?
I'm thinking a possible way (but probably not the best way?) is to add the key to the parameters dictionary in _send_request():
def _send_request(self, url, parameters={}):
    """ Sends an HTTP GET request to the Blogger API.
    Returns JSON decoded response. """
    # Format the full API URL:
    url = '{base}{url}?'.format(base=self.base, url=url)
    # The api key will always be added:
    parameters['key'] = self.key
    try:
        r = requests.get(url, params=parameters)
    except:
        print "** Could not reach url:\n", url
        return
    api_response = r.text
    return self._jload(api_response)
I can't really get my head around what is the best way (or most pythonic way) to do it.
You could store it in a named constant.
If this code doesn't need to be secure, simply
API_KEY = '1ih3f2ihf2f'
If it's going to be hosted on a server somewhere or needs to be more secure, you could store the value in an environment variable
In your terminal:
export API_KEY='1ih3f2ihf2f'
then in your python script:
import os
API_KEY = os.environ.get('API_KEY')
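Either way, the value can then be handed to the client from the question:

someblog = BloggerClient(url='http://someblog.blogger.com', key=API_KEY)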
The problem is, I have to specify the API key every time I call the get_posts function:
If it really is just this one method, the obvious idea is to write a wrapper:
def get_posts(blog, *args, **kwargs):
    return blog.get_posts(*args, key=blog.key, **kwargs)
Or, better, wrap up the class to do it for you:
class KeyRememberingBloggerClient(BloggerClient):
    def __init__(self, *args, **kwargs):
        self.key = kwargs.pop('key')
        super(KeyRememberingBloggerClient, self).__init__(*args, **kwargs)

    def get_posts(self, *args, **kwargs):
        return super(KeyRememberingBloggerClient, self).get_posts(
            *args, key=self.key, **kwargs)
So now:
someblog = KeyRememberingBloggerClient(url='http://someblog.blogger.com', key='0123')
someblog.get_posts()
Yes, you can override or monkeypatch the _send_request method that all of the other methods use, but if there are only one or two methods that need to be fixed, why delve into the undocumented internals of the class and fork the body of one of those methods just so you can change it in a way you clearly weren't expected to, instead of doing it cleanly?
Of course if there are 90 different methods scattered across four different classes, you might want to consider building these wrappers programmatically (and/or monkeypatching the classes), as sketched below… or just patching the one private method, as you're doing. That seems reasonable.
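For completeness, a hedged sketch of that programmatic approach (the helper names here are made up; it assumes each wrapped method accepts a key keyword argument, as in the question):

import functools

def _make_key_wrapper(original):
    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        kwargs.setdefault('key', self.key)  # inject the stored key
        return original(self, *args, **kwargs)
    return wrapper

def inject_key(cls, method_names):
    # Replace each named method with a key-injecting wrapper.
    for name in method_names:
        setattr(cls, name, _make_key_wrapper(getattr(cls, name)))

inject_key(BloggerClient, ['get_posts'])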
I have the following web app:
import bottle

app = bottle.Bottle()

@app.route('/ping')
def ping():
    print 'pong'
    return 'pong'

@app.hook('after_request')
def after():
    print 'foo'
    print bottle.response.body

if __name__ == "__main__":
    app.run(host='0.0.0.0', port='9999', server='cherrypy')
Is there a way to access the response body before sending the response back?
If I start the app and query /ping, I can see in the console that the ping() and after() functions run in the right sequence:
$ python bottle_after_request.py
Bottle v0.11.6 server starting up (using CherryPyServer())...
Listening on http://0.0.0.0:9999/
Hit Ctrl-C to quit.
pong
foo
but when I try to access response.body in after(), I get nothing.
In Flask, functions decorated with after_request receive the response object as an argument, so it's easy to access. How can I do the same in Bottle?
Is there something I'm missing?
Is there a way to access the response body before sending the response back?
You could write a simple plugin, which (depending on what you're actually trying to do with the response) might be all you need.
Here's an example from the Bottle plugin docs, which sets a response header. It could just as easily manipulate the response body.
from bottle import response, install
import time

def stopwatch(callback):
    def wrapper(*args, **kwargs):
        start = time.time()
        body = callback(*args, **kwargs)
        end = time.time()
        response.headers['X-Exec-Time'] = str(end - start)
        return body
    return wrapper

install(stopwatch)
Hope that works for your purposes.
You can use the plugin approach; this is what I did:
from bottle import response

class BottlePlugin(object):
    name = 'my_custom_plugin'
    api = 2

    def __init__(self, debug=False):
        self.debug = debug
        self.app = None

    def setup(self, app):
        """Handle plugin install"""
        self.app = app

    def apply(self, callback):
        """Handle route callbacks"""
        def wrapper(*a, **ka):
            """Encapsulate the result in the expected api structure"""
            # What's in the output depends on what your views return;
            # in my case it's a dict with a "data" key.
            output = callback(*a, **ka)
            data = output["data"]
            paging = output.get("paging", {})
            response_data = {
                "data": data,
                "paging": paging,
            }
            # In case you want to update the response,
            # e.g. the response code:
            response.status = 200
            return response_data
        return wrapper
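Then install it on your app (a sketch, reusing the app from the question):

app = bottle.Bottle()
app.install(BottlePlugin(debug=True))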
As it says in the title, I am trying to access a URL through several different proxies sequentially (using a for loop). Right now this is my code:
import requests
import json

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        with open('urls.txt') as urls:
            for line in urls:
                url = line.rstrip()
                data = requests.get(url, proxies={'http': line})
                data1 = data.text
                print data1
and my urls.txt file:
http://api.exip.org/?call=ip
and my proxies.txt file:
{"https": "84.22.41.1:3128"}
{"http":"194.126.181.47:81"}
{"http":"218.108.170.170:82"}
that I got at www.hidemyass.com
for some reason, the output is
68.6.34.253
68.6.34.253
68.6.34.253
as if it is accessing that website through my own router ip address. In other words, it is not trying to access through the proxies I give it, it is just looping through and using my own over and over again. What am I doing wrong?
According to this thread, you need to specify the proxies dictionary as {"protocol" : "ip:port"}, so your proxies file should look like
{"https": "84.22.41.1.3128"}
{"http": "194.126.181.47:81"}
{"http": "218.108.170.170:82"}
EDIT:
You're reusing line for both URLs and proxies. It's fine to reuse line in the inner loop, but you should be using proxies=proxy; you've already parsed the JSON and don't need to build another dictionary. Also, as abarnert says, you should be doing a check to ensure that the protocol you're requesting matches that of the proxy. The reason the proxies are specified as a dictionary is to allow lookup for the matching protocol.
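Putting both fixes together, the loops from the question would look roughly like this (a sketch reusing the question's file names):

import json
import requests

with open('proxies.txt') as proxies_file:
    for proxy_line in proxies_file:
        proxy = json.loads(proxy_line)
        with open('urls.txt') as urls:
            for line in urls:
                url = line.rstrip()
                # Pass the parsed dict, not {'http': line}
                data = requests.get(url, proxies=proxy)
                print data.text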
There are two obvious problems right here:
data=requests.get(url, proxies={'http':line})
First, because you have a for line in urls: inside the for line in proxies:, line is going to be the current URL here, not the current proxy. And besides, even if you weren't reusing line, it would be the JSON string representation, not the dict you decoded from JSON.
Then, if you fix that to use proxy, instead of something like {'https': '83.22.41.1:3128'}, you're passing {'http': {'https': '83.22.41.1:3128'}}. And that obviously isn't a valid value.
To fix both of those problems, just do this:
data=requests.get(url, proxies=proxy)
Meanwhile, what happens when you have an HTTPS URL, but the current proxy is an HTTP proxy? You're not going to use the proxy. So you probably want to add something to skip over them, like this:
import urlparse

if urlparse.urlparse(url).scheme not in proxy:
    continue
Directly copied from another answer of mine.
Well, actually you can; I've done this with a few lines of code and it works pretty well.
import requests

class Client:
    def __init__(self):
        self._session = requests.Session()
        self.proxies = None

    def set_proxy_pool(self, proxies, auth=None, https=True):
        """Randomly choose a proxy for every GET/POST request

        :param proxies: list of proxies, like ["ip1:port1", "ip2:port2"]
        :param auth: if proxy needs auth
        :param https: default is True, pass False if you don't need https proxy
        """
        from random import choice

        if https:
            self.proxies = [{'http': p, 'https': p} for p in proxies]
        else:
            self.proxies = [{'http': p} for p in proxies]

        def get_with_random_proxy(url, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_get(url, **kwargs)

        def post_with_random_proxy(url, *args, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_post(url, *args, **kwargs)

        self._session.original_get = self._session.get
        self._session.get = get_with_random_proxy
        self._session.original_post = self._session.post
        self._session.post = post_with_random_proxy

    def remove_proxy_pool(self):
        self.proxies = None
        self._session.get = self._session.original_get
        self._session.post = self._session.original_post
        del self._session.original_get
        del self._session.original_post

    # You can define whatever operations using self._session
I use it like this:
client = Client()
client.set_proxy_pool(['112.25.41.136', '180.97.29.57'])
It's simple, but actually works for me.
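After that, every call through the wrapped session goes out via a randomly chosen proxy, e.g. (using the URL from the question):

resp = client._session.get('http://api.exip.org/?call=ip')
print resp.text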
I have to implement a function to get headers only (without doing a GET or POST) using urllib2. Here is my function:
def getheadersonly(url, redirections=True):
    if not redirections:
        class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
            def http_error_302(self, req, fp, code, msg, headers):
                return urllib2.HTTPRedirectHandler.http_error_302(
                    self, req, fp, code, msg, headers)
            http_error_301 = http_error_303 = http_error_307 = http_error_302

        cookieprocessor = urllib2.HTTPCookieProcessor()
        opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
        urllib2.install_opener(opener)

    class HeadRequest(urllib2.Request):
        def get_method(self):
            return "HEAD"

    info = {}
    info['headers'] = dict(urllib2.urlopen(HeadRequest(url)).info())
    info['finalurl'] = urllib2.urlopen(HeadRequest(url)).geturl()
    return info
It uses code from this answer and this one. However, it does the redirection even when the flag is False. I tried the code with:
print getheadersonly("http://ms.com", redirections=False)['finalurl']
print getheadersonly("http://ms.com")['finalurl']
It's giving morganstanley.com in both cases. What is wrong here?
Firstly, your code contains several bugs:
On each call of getheadersonly you install a new global opener, which is then used by subsequent calls of urllib2.urlopen.
You make two HTTP requests to get two different attributes of the response.
The implementation of urllib2.HTTPRedirectHandler.http_error_302 is not so trivial, and I do not understand how it could prevent redirections in the first place.
Basically, you should understand that each handler is installed in an opener to handle a certain kind of response. urllib2.HTTPRedirectHandler is there to convert certain HTTP codes into redirections. If you do not want redirections, do not add a redirect handler to the opener. If you do not want to open ftp links, do not add an FTPHandler, etc.
All you need is to create a new opener, add a urllib2.HTTPHandler() to it, customize the request to be a 'HEAD' request, pass an instance of the request to the opener, read the attributes, and close the response.
import urllib2

class HeadRequest(urllib2.Request):
    def get_method(self):
        return 'HEAD'

def getheadersonly(url, redirections=True):
    opener = urllib2.OpenerDirector()
    opener.add_handler(urllib2.HTTPHandler())
    opener.add_handler(urllib2.HTTPDefaultErrorHandler())
    if redirections:
        # HTTPErrorProcessor makes HTTPRedirectHandler work
        opener.add_handler(urllib2.HTTPErrorProcessor())
        opener.add_handler(urllib2.HTTPRedirectHandler())
    try:
        res = opener.open(HeadRequest(url))
    except urllib2.HTTPError, res:
        pass
    res.close()
    return dict(code=res.code, headers=res.info(), finalurl=res.geturl())
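A quick usage sketch (using the URL from the question):

info = getheadersonly('http://ms.com', redirections=False)
print info['code'], info['finalurl']
print info['headers']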
You can send a HEAD request using httplib. A HEAD request is the same as a GET request, but the server doesn't send the message body.
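A minimal sketch with httplib (Python 2; the host and path are illustrative):

import httplib

conn = httplib.HTTPConnection('www.example.com')
conn.request('HEAD', '/')
res = conn.getresponse()
print res.status, res.reason
print res.getheaders()  # headers only; a HEAD response carries no body
conn.close()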
I'm developing a twisted.web server. It consists of some resources that, apart from rendering stuff, use adbapi to fetch some data and write some data to a postgresql database. I'm trying to figure out how to write a trial unittest that would test resource rendering without using the network (in other words: one that would initialize a resource, produce a dummy request for it, etc.).
Let's assume the View resource is a simple leaf that returns NOT_DONE_YET from render_GET and tinkers with adbapi to produce simple text as a result. Now, I've written this useless code and I can't figure out how to make it actually initialize the resource and produce some sensible response:
from twisted.trial import unittest
from twisted.web.test.test_web import DummyRequest
from myserv.views import View

class ExistingView(unittest.TestCase):
    def test_rendering(self):
        slug = "hello_world"
        view = View(slug)
        request = DummyRequest([''])
        output = view.render_GET(request)
        self.assertEqual(request.responseCode, 200)
The output is... 1. I've also tried this approach: output = request.render(view), but the output is the same: 1. Why? I'd be very grateful for an example of how to write such a unittest!
Here's a function that will render a request and convert the result into a Deferred that fires when rendering is complete:
from twisted.internet.defer import succeed
from twisted.web import server

def _render(resource, request):
    result = resource.render(request)
    if isinstance(result, str):
        request.write(result)
        request.finish()
        return succeed(None)
    elif result is server.NOT_DONE_YET:
        if request.finished:
            return succeed(None)
        else:
            return request.notifyFinish()
    else:
        raise ValueError("Unexpected return value: %r" % (result,))
It's actually used in Twisted Web's test suite, but it's private because it has no unit tests itself. ;)
You can use it to write a test like this:
def test_rendering(self):
    slug = "hello_world"
    view = View(slug)
    request = DummyRequest([''])
    d = _render(view, request)

    def rendered(ignored):
        self.assertEquals(request.responseCode, 200)
        self.assertEquals("".join(request.written), "...")
        ...
    d.addCallback(rendered)
    return d
Here is a DummierRequest class that fixes almost all my problems. The only thing left is that it does not set any response code! Why?
from twisted.web.test.test_web import DummyRequest
from twisted.internet import address
from twisted.web.http_headers import Headers

class DummierRequest(DummyRequest):
    def __init__(self, postpath, session=None):
        DummyRequest.__init__(self, postpath, session)
        self.notifications = []
        self.received_cookies = {}
        self.requestHeaders = Headers()
        self.responseHeaders = Headers()
        self.cookies = []  # outgoing cookies

    def setHost(self, host, port, ssl=0):
        self._forceSSL = ssl
        self.requestHeaders.setRawHeaders("host", [host])
        self.host = address.IPv4Address("TCP", host, port)

    def addCookie(self, k, v, expires=None, domain=None, path=None,
                  max_age=None, comment=None, secure=None):
        """
        Set an outgoing HTTP cookie.

        In general, you should consider using sessions instead of cookies, see
        L{twisted.web.server.Request.getSession} and the
        L{twisted.web.server.Session} class for details.
        """
        cookie = '%s=%s' % (k, v)
        if expires is not None:
            cookie = cookie + "; Expires=%s" % expires
        if domain is not None:
            cookie = cookie + "; Domain=%s" % domain
        if path is not None:
            cookie = cookie + "; Path=%s" % path
        if max_age is not None:
            cookie = cookie + "; Max-Age=%s" % max_age
        if comment is not None:
            cookie = cookie + "; Comment=%s" % comment
        if secure:
            cookie = cookie + "; Secure"
        self.cookies.append(cookie)

    def getCookie(self, key):
        """
        Get a cookie that was sent from the network.
        """
        return self.received_cookies.get(key)

    def getClientIP(self):
        """
        Return the IPv4 address of the client which made this request, if there
        is one, otherwise C{None}.
        """
        return "192.168.1.199"