Python raw HTTP request retrieval issues via urllib2

I am using Python 2.6.5 and I am trying to capture the raw HTTP request sent via urllib2. This works fine except when I add a proxy handler into the mix, so the situation is as follows:
HTTP and HTTPS requests work fine without the proxy handler: raw HTTP request captured
HTTP requests work fine with proxy handler: proxy ok, raw HTTP request captured
HTTPS requests fail with proxy handler: proxy ok but the raw HTTP request is not captured!
The following questions are close but do not solve my problem:
How do you get default headers in a urllib2 Request? <- My solution is heavily based on this
Python urllib2 > HTTP Proxy > HTTPS request
This sets the proxy for each request <- Did not work, and doing it once at the start via an opener is more elegant and efficient than setting the proxy for each request
This is what I am doing:
import httplib
import urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def send(self, s):
        global RawRequest
        RawRequest = s  # Saving to global variable for Requester class to see
        httplib.HTTPConnection.send(self, s)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

class MyHTTPSConnection(httplib.HTTPSConnection):
    def send(self, s):
        global RawRequest
        RawRequest = s  # Saving to global variable for Requester class to see
        httplib.HTTPSConnection.send(self, s)

class MyHTTPSHandler(urllib2.HTTPSHandler):
    def https_open(self, req):
        return self.do_open(MyHTTPSConnection, req)
Requester class:
global RawRequest
ProxyConf = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}
# If ProxyConf = {'http': 'http://127.0.0.1:8080'}, the raw HTTPS request is captured BUT the proxy does not see the HTTPS request!
# Also tried with similar results: ProxyConf = {'http': 'http://127.0.0.1:8080', 'https': 'https://127.0.0.1:8080'}
ProxyHandler = urllib2.ProxyHandler(ProxyConf)
urllib2.install_opener(urllib2.build_opener(ProxyHandler, MyHTTPHandler, MyHTTPSHandler))
urllib2.urlopen(urllib2.Request('http://www.google.com', None))   # global RawRequest updated
# This is the problem: global RawRequest is NOT updated!?
urllib2.urlopen(urllib2.Request('https://accounts.google.com', None))
BUT, if I remove the ProxyHandler it works!:
global RawRequest
urllib2.install_opener(urllib2.build_opener(MyHTTPHandler, MyHTTPSHandler))
urllib2.urlopen(urllib2.Request('http://www.google.com', None))        # global RawRequest updated
urllib2.urlopen(urllib2.Request('https://accounts.google.com', None))  # global RawRequest updated
How can I add the ProxyHandler into the mix while keeping access to the RawRequest?
Thank you in advance.

Answering my own question: it seems to be a bug in the underlying libraries; making RawRequest a list solves the problem. The raw HTTP request is the first item of the list. The custom HTTPS class's send() is called several times, the last of which with an empty string. The fact that the custom HTTP class is only called once suggests this is a bug in Python, but the list solution works around it:
RawRequest = s
just needs to be changed to:
RawRequest.append(s)
with RawRequest initialised to [] beforehand; the raw request is then retrieved via RawRequest[0] (the first element of the list).
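For reference, a minimal sketch of the list-based workaround (same MyHTTPSConnection as above, only the send override changes):
RawRequest = []  # initialise as a list instead of a string

class MyHTTPSConnection(httplib.HTTPSConnection):
    def send(self, s):
        global RawRequest
        RawRequest.append(s)  # append each chunk instead of overwriting
        httplib.HTTPSConnection.send(self, s)

# ... build and install the opener as before, perform the request ...
# RawRequest[0] then holds the raw HTTPS request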

Related

Modify json body with mitmproxy

I am trying to intercept and modify a graphql response's body. Here is my addon code:
from mitmproxy import ctx
from mitmproxy import http
import json
def response(flow: http.HTTPFlow) -> None:
    if flow.request.pretty_url == "https://my.graphql/endpoint":
        request_data = json.loads(flow.request.get_text())
        if request_data["operationName"] == "MyOperationName":
            data = json.loads(flow.response.get_text())
            data["data"]["product"]["name"] = "New Name"
            flow.response.text = json.dumps(data)
I can see the modified response in the mitmproxy console, but the iOS simulator I am using is still getting the original response. Does anyone know how I can pass the modified response to the device?
From the documentation
def response(self, flow: mitmproxy.http.HTTPFlow):
    """
    The full HTTP response has been read.

    Note: If response streaming is active, this event fires after the entire body has been streamed.
    HTTP trailers, if present, have not been transmitted to the client yet and can still be modified.
    """
    ctx.log(f"response: {flow=}")
It appears that you might be streaming the response body, which would mean that your modifications are ignored.
Consider using the request event hook instead.
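If streaming turns out to be the cause, another option (a minimal sketch, assuming streaming was enabled, e.g. via the stream_large_bodies option) is to switch it off for that flow in the responseheaders hook, so the later response hook sees and can modify the full body:
from mitmproxy import http

def responseheaders(flow: http.HTTPFlow) -> None:
    # Disable body streaming for this endpoint so the response hook
    # receives the complete body and modifications are applied.
    if flow.request.pretty_url == "https://my.graphql/endpoint":
        flow.response.stream = False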

How to extend scrapy feed export to send crawl results in POST requests - open and store not called?

I have written a spider and it works fine. I now wish to send the results of that scrape to a rest API via POST request.
I'm under the impression that extending the feed export functionality to have it make POST requests, instead of writing to a file/sending it to S3/etc., is the way to go.
I am sure that my settings.py and configuration are correct, but for some reason the open and store functions are never called, even when I literally copy, paste, and rename preexisting feed storage classes from feedexport.py.
Please see code below:
settings.py
EXTENSIONS = {'project.extensions.ExportToApi.ExportScrape':400}
FEED_STORAGES = {'http': 'project.extensions.ExportToApi.ExportScrape'}
FEED_API_URL = 'http://localhost:8000'
(there's a simpleserver at localhost:8000 there printing out all the GET and POST requests it receives)
/project/extensions/ExportToApi.py
from scrapy.extensions.feedexport import BlockingFeedStorage

class ExportScrape(BlockingFeedStorage):
    def __init__(self, crawler, uri):
        self.crawler = crawler
        self.url = uri

    @classmethod
    def from_crawler(cls, crawler):
        uri_from_settings = crawler.settings['FEED_API_URL']
        return cls(crawler, uri_from_settings)

    def _store_in_thread(self, file):
        file.seek(0)
        import requests
        r = requests.post(self.url, data=file)
        file.close()
When I add print statements, those in __init__ and from_crawler run, but not the ones in _store_in_thread, nor in open or store when they are implemented.
A POST request to a REST API is the same type of HTTP request as the regular requests Scrapy sends.
One option is to yield, inside the parse method, a Request object pointed at the REST API endpoint with an empty callback, instead of an item object or dict (and without any custom feed storage):
def parse(self, response):
    ...
    # yield item
    ...
    yield Request(
        method='POST',
        url=server_url,
        body=converted_item_data,
        callback=self.not_parse,
    )

def not_parse(self, response):
    pass
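A slightly fuller sketch of that idea (the endpoint URL and item fields here are hypothetical, and the item is serialised as JSON before being posted):
import json
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    server_url = "http://localhost:8000"  # hypothetical REST endpoint

    def parse(self, response):
        item = {"title": response.css("title::text").get()}
        # Instead of yielding the item, POST it to the API.
        yield scrapy.Request(
            url=self.server_url,
            method="POST",
            body=json.dumps(item),
            headers={"Content-Type": "application/json"},
            callback=self.not_parse,
        )

    def not_parse(self, response):
        pass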

Google API client (Python): is it possible to use BatchHttpRequest with ETag caching

I'm using YouTube data API v3.
Is it possible to make a big BatchHttpRequest (e.g., see here) and also to use ETags for local caching at the httplib2 level (e.g., see here)?
ETags work fine for single queries; I don't understand whether they are also useful for batch requests.
TL;DR:
BatchHttpRequest cannot be used with caching
HERE IT IS:
First, let's see how to initialize a BatchHttpRequest:
from apiclient.discovery import build
from apiclient.http import BatchHttpRequest

def list_animals(request_id, response, exception):
    if exception is not None:
        # Do something with the exception
        pass
    else:
        # Do something with the response
        pass

def list_farmers(request_id, response, exception):
    """Do something with the farmers list response."""
    pass

service = build('farm', 'v2')
batch = service.new_batch_http_request()
batch.add(service.animals().list(), callback=list_animals)
batch.add(service.farmers().list(), callback=list_farmers)
batch.execute(http=http)
Second, let's see how ETags are used:
from google.appengine.api import memcache
http = httplib2.Http(cache=memcache)
Now let's analyze:
Observe the last line of the BatchHttpRequest example, batch.execute(http=http). Checking the source code for execute, it calls _refresh_and_apply_credentials, which applies the http object we pass in:
def _refresh_and_apply_credentials(self, request, http):
    """Refresh the credentials and apply to the request.

    Args:
      request: HttpRequest, the request.
      http: httplib2.Http, the global http object for the batch.
    """
    # For the credentials to refresh, but only once per refresh_token
    # If there is no http per the request then refresh the http passed in
    # via execute()
This means that the execute call, which takes in http, can be passed the ETag-caching http you would have created as:
http = httplib2.Http(cache=memcache)
# This would mean we would get the ETags cached http
batch.execute(http=http)
Update 1:
You could try with a custom cache object as well:
from googleapiclient.discovery_cache import DISCOVERY_DOC_MAX_AGE
from googleapiclient.discovery_cache.base import Cache
from googleapiclient.discovery_cache.file_cache import Cache as FileCache
custCache = FileCache(max_age=DISCOVERY_DOC_MAX_AGE)
http = httplib2.Http(cache=custCache)
# This would mean we would get the ETags cached http
batch.execute(http=http)
Because this is just a hunch, based on this comment in the httplib2 lib:
"""If 'cache' is a string then it is used as a directory name for
a disk cache. Otherwise it must be an object that supports the
same interface as FileCache."""
Conclusion (Update 2):
After verifying the google-api-python-client source code again, I see that BatchHttpRequest is fixed to a 'POST' request and has a content type of multipart/mixed;.. (see the source code).
This gives a clue that BatchHttpRequest is meant to POST data which is then processed further down the line.
Now, keeping that in mind, observe when httplib2's request method calls _updateCache; it does so only when the following criteria are met:
the request method is in ["GET", "HEAD"], or response.status == 303, or the request is a redirect;
or response.status in [200, 203] and method in ["GET", "HEAD"];
or response.status == 304 and method == "GET".
This means BatchHttpRequest cannot be used with caching.
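For contrast, a minimal sketch (the cache directory, video id, and API key are placeholders) showing that single, non-batched GET calls through httplib2 do use the ETag cache:
import httplib2

http = httplib2.Http(cache=".etag_cache")  # a string makes httplib2 use a directory-backed FileCache

url = ("https://www.googleapis.com/youtube/v3/videos"
       "?part=snippet&id=VIDEO_ID&key=API_KEY")  # placeholder id/key

resp1, content1 = http.request(url)  # populates the cache and stores the ETag
resp2, content2 = http.request(url)  # revalidates with If-None-Match
print(resp2.fromcache)  # expected True if the server answered 304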

How not to let python requests calculate content-length and use the provided one?

We have a custom module in which we have redefined the open, seek, read, and tell functions to read only a part of a file according to the arguments.
However, this logic overrides the default tell, and python requests tries to calculate the content-length, which involves using tell(); that call is redirected to our custom tell function, whose logic is buggy somewhere and returns a wrong value. When I tried some changes, it threw an error.
Found the following from models.py of requests:
def prepare_content_length(self, body):
    if hasattr(body, 'seek') and hasattr(body, 'tell'):
        body.seek(0, 2)
        self.headers['Content-Length'] = builtin_str(body.tell())
        body.seek(0, 0)
    elif body is not None:
        l = super_len(body)
        if l:
            self.headers['Content-Length'] = builtin_str(l)
    elif (self.method not in ('GET', 'HEAD')) and (self.headers.get('Content-Length') is None):
        self.headers['Content-Length'] = '0'
For now, I am not able to figure out where the bug is, and I am too stressed to investigate further and fix it. Everything else works except the content-length calculation by python requests.
So I have written my own routine for finding the content length and included the value in the request headers, but requests still prepares the content-length itself and throws an error.
How can I stop requests from preparing the content-length and make it use the one I specify?
Requests lets you modify a request before sending. See Prepared Requests.
For example:
from requests import Request, Session
s = Session()
req = Request('POST', url, data=data, headers=headers)
prepped = req.prepare()
# do something with prepped.headers
prepped.headers['Content-Length'] = your_custom_content_length_calculation()
resp = s.send(prepped, ...)
If your session has its own configuration (like cookie persistence or connection-pooling), then you should use s.prepare_request(req) instead of req.prepare().
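A minimal sketch of that session-based variant (the URL, payload, and your_custom_content_length_calculation helper are placeholders standing in for the question's own logic):
from requests import Request, Session

def your_custom_content_length_calculation():
    # placeholder for the custom length logic described in the question
    return 1024

s = Session()
req = Request('POST', 'http://localhost:8000/upload', data=b'example payload')

# Let the session merge its own headers/cookies, then pin Content-Length.
prepped = s.prepare_request(req)
prepped.headers['Content-Length'] = str(your_custom_content_length_calculation())

resp = s.send(prepped)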

Python - How to handle HTTPS request with (Urllib2 + SSL) though a HTTP proxy

I am trying to test a proxy connection by using urllib2.ProxyHandler. However, there will probably be situations where I need to request an HTTPS website (e.g. https://www.whatismyip.com/).
urllib2.urlopen() throws an ERROR when requesting an HTTPS site, so I tried to use a helper function to rewrite the urlopen method.
Here is the helper function:
import ssl
import urllib2

def urlopen(url, timeout):
    if hasattr(ssl, 'SSLContext'):
        SslContext = ssl.create_default_context()
        SslContext.check_hostname = False
        SslContext.verify_mode = ssl.CERT_NONE
        return urllib2.urlopen(url, timeout=timeout, context=SslContext)
    else:
        return urllib2.urlopen(url, timeout=timeout)
This helper function is based on this answer.
Then I use:
urllib2.install_opener(
    urllib2.build_opener(
        urllib2.ProxyHandler({'http': '127.0.0.1:8080'})
    )
)
to set up an HTTP proxy for the urllib2 opener.
Ideally, it should work when I request a website using urlopen('http://whatismyip.com', 30), and all traffic should pass through the HTTP proxy.
However, urlopen() falls into the if hasattr(ssl, 'SSLContext') branch all the time, even for an HTTP site. In addition, HTTPS sites do not use the HTTP proxy either. This makes the HTTP proxy ineffective, and all traffic goes over the unproxied network.
I also tried this answer and changed HTTP into HTTPS, urllib2.ProxyHandler({'https': '127.0.0.1:8080'}), but it still does not work.
My proxy is working: if I use urllib2.urlopen() instead of the rewritten urlopen(), it works for HTTP sites.
But I do need to consider the situation where urlopen will be used on an HTTPS-only site.
How to do that?
Thanks
UPDATE 1: I cannot get this to work with Python 2.7.11, while some servers work properly with Python 2.7.5. I assume it is a Python version issue.
urllib2 will not go through the HTTPS proxy, so all HTTPS web addresses fail to use the proxy.
The problem is that when you pass the context argument to urllib2.urlopen(), urllib2 creates an opener itself instead of using the global one, which is the one that gets set when you call urllib2.install_opener(). As a result, the instance of ProxyHandler you meant to be used is not being used.
The solution is not to install the opener but to use it directly. When building your opener, pass both an instance of your ProxyHandler class (to set proxies for the http and https protocols) and an instance of HTTPSHandler (to set the https context).
I created https://bugs.python.org/issue29379 for this issue.
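A minimal sketch of that approach (the proxy address is the 127.0.0.1:8080 one from the question; certificate checks are disabled here only for testing through an intercepting proxy):
import ssl
import urllib2

proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080',
                                      'https': 'http://127.0.0.1:8080'})

ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE  # testing only
https_handler = urllib2.HTTPSHandler(context=ssl_context)

# Use the opener directly instead of install_opener() + urlopen(context=...)
opener = urllib2.build_opener(proxy_handler, https_handler)
response = opener.open('https://www.whatismyip.com/', timeout=30)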
I personally would suggest using something such as python-requests, as it alleviates a lot of the issues with setting up the proxy via urllib2 directly. When using requests with a proxy, you will have to do the following (from their documentation):
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
Disabling SSL certificate verification is as simple as passing verify=False to the requests.get call above. However, this should be used sparingly, and the actual issue with the SSL certificate verification should be resolved.
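For example (same placeholder proxy addresses as above; verify=False only while testing through an intercepting proxy):
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

# verify=False skips certificate verification; do not use in production.
requests.get('https://example.org', proxies=proxies, verify=False)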
One more solution is to pass context into HTTPSHandler and pass this handler into build_opener together with ProxyHandler:
proxies = {'https': 'http://localhost:8080'}
proxy = urllib2.ProxyHandler(proxies)
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
handler = urllib2.HTTPSHandler(context=context)
opener = urllib2.build_opener(proxy, handler)
urllib2.install_opener(opener)
Now you can view all your HTTPS requests/responses in your proxy.
