My current twisted server code. It is a simple experiment to take url encoded requests and convert them into a JSON like string to then return.
from twisted.web.server import Site
from twisted.web.resource import Resource
from twisted.internet import reactor
import urllib.parse
class FormPage(Resource):
isLeaf = True
def render_GET(self, request):
print(request.uri)
x = (request.uri).decode('ascii')
x = x[1:]
x = todi(x)
return x.encode('ascii')
def todi(st):
if len(st) == 0:
return '{}'
if st[len(st)-1] == '/':
st = st[:-1]
if len(st) == 0:
return '()'
if st[0] == '?':
st = st[1:]
st = urllib.parse.parse_qsl(st)
return str(dict(st))
factory = Site(FormPage())
reactor.listenTCP(80, factory)
reactor.run()
I've paid attention to the font my browser displays when I am receiving simple text. For example this site: http://icanhazip.com/ when you visit, the font looks like consola font (default font for MS notepad). However, when I visit my site, my browser displays a font that looks like Times New Roman.
I have done some debugging since, such as forcing the site to return a simple string of characters, but nothing can stop twisted from giving me ugly looking fonts.
Here, have an example.
Also note that I did the thing in Chrome where you right click and use the "View page source" button. Trust me, both my examples are simply raw text according to that.
Looking at the headers returned by your Twisted server and comparing them to those returned by the other web site, the latter specifies Content-Type: text/plain; charset=UTF-8, whereas the Twisted server does not specify the Content-Type at all.
Your browser (and I've found it to be the same with Firefox) uses a different font when the content type is specified as text/plain vs. an unspecified content type.
In Twisted you can set the Content-Type header with request.setHeader() like this:
def render_GET(self, request):
print(request.uri)
x = (request.uri).decode('ascii')
x = x[1:]
x = todi(x)
request.setHeader('Content-Type', 'text/plain; charset=UTF-8')
return x.encode('UTF-8')
As this sets the Content-Type it might as well specify the charset too. UTF-8 is (probably) preferred, and the response text is similarly encoded.
Related
I have a DataFrame common_ips containing IPs as shown below.
I need to achieve two basic tasks:
Identify private and public IPs.
Check organisation for public IPs.
Here is what I am doing:
import json
import urllib
import re
baseurl = 'http://ipinfo.io/' # no HTTPS supported (at least: not without a plan)
def isIPpublic(ipaddress):
return not isIPprivate(ipaddress)
def isIPprivate(ipaddress):
if ipaddress.startswith("::ffff:"):
ipaddress=ipaddress.replace("::ffff:", "")
# IPv4 Regexp from https://stackoverflow.com/questions/30674845/
if re.search(r"^(?:10|127|172\.(?:1[6-9]|2[0-9]|3[01])|192\.168)\..*", ipaddress):
# Yes, so match, so a local or RFC1918 IPv4 address
return True
if ipaddress == "::1":
# Yes, IPv6 localhost
return True
return False
def getipInfo(ipaddress):
url = '%s%s/json' % (baseurl, ipaddress)
try:
urlresult = urllib.request.urlopen(url)
jsonresult = urlresult.read() # get the JSON
parsedjson = json.loads(jsonresult) # put parsed JSON into dictionary
return parsedjson
except:
return None
def checkIP(ipaddress):
if (isIPpublic(ipaddress)):
if bool(getipInfo(ipaddress)):
if 'bogon' in getipInfo(ipaddress).keys():
return 'Private IP'
elif bool(getipInfo(ipaddress).get('org')):
return getipInfo(ipaddress)['org']
else:
return 'No organization data'
else:
return 'No data available'
else:
return 'Private IP'
And applying it to my common_ips DataFrame with
common_ips['Info'] = common_ips.IP.apply(checkIP)
But it's taking longer than I expected. And for some IPs, it's giving incorrect Info.
For instance:
where it should have been AS19902 Department of Administrative Services as I cross-checked it by
and
What am I missing here ? And how can I achieve these tasks in a more Pythonic way ?
A blanket except: is basically always a bug. You are returning None instead of handling any anomalous or error response from the server, and of course the rest of your code has no way to recover.
As a first debugging step, simply take out the try/except handling. Maybe then you can find a way to put back a somewhat more detailed error handler for some cases which you know how to recover from.
def getipInfo(ipaddress):
url = '%s%s/json' % (baseurl, ipaddress)
urlresult = urllib.request.urlopen(url)
jsonresult = urlresult.read() # get the JSON
parsedjson = json.loads(jsonresult) # put parsed JSON into dictionary
return parsedjson
Perhaps the calling code in checkIP should have a try/except instead, and e.g. retry after sleeping for a bit if the server indicates that you are going too fast.
(In the absence of an authorization token, it looks like you are using the free version of this service, which is probably not in any way guaranteed anyway. Also maybe look at using their recommended library -- I haven't looked at it in more detail, but I would imagine it at the very least knows better how to behave in the case of a server-side error. It's almost certainly also more Pythonic, at least in the sense that you should not reinvent things which already exist.)
We use a custom scraper that have to take a separate website for a language (this is an architecture limitation). Like site1.co.uk, site1.es, site1.de etc.
But we need to parse a website with many languages, separated by url - like site2.com/en, site2.com/de, site2.com/es and so on.
I thought about MITMProxy: I could redirect all requests this way:
en.site2.com/* --> site2.com/en
de.site2.com/* --> site2.com/de
...
I have written a small script which simply takes URLs and rewrites them:
class MyMaster(flow.FlowMaster):
def handle_request(self, r):
url = r.get_url()
# replace URLs
if 'blabla' in url:
r.set_url(url.replace('something', 'another'))
But the target host generates 301 redirect with the response from the webserver - 'the page has been moved here' and the link to the site2.com/en
It worked when I played with URL rewriting, i.e. site2.com/en --> site2.com/de.
But for different hosts (subdomain and the root domain, to be precise), it does not work.
I tried to replace the Host header in the handle_request method from above:
for key in r.headers.keys():
if key.lower() == 'host':
r.headers[key] = ['site2.com']
also I tried to replace the Referrer - all of that didn't help.
How can I finally spoof that request from the subdomain to the main domain? If it generates a HTTP(s) client warning it's ok since we need that for the scraper (and the warnings there can be turned off), not the real browser.
Thanks!
You need to replace the content of the response and craft the header with just a few fields.
Open a new connection to the redirected url and craft your response :
def handle_request(self, flow):
newUrl = <new-url>
retryCount = 3
newResponse = None
while True:
try:
newResponse = requests.get(newUrl) # import requests
except:
if retryCount == 0:
print 'Cannot reach new url ' + newUrl
traceback.print_exc() # import traceback
return
retryCount -= 1
continue
break
responseHeaders = Headers() # from netlib.http import Headers
if 'Date' in newResponse.headers:
responseHeaders['Date'] = str(newResponse.headers['Date'])
if 'Connection' in newResponse.headers:
responseHeaders['Connection'] = str(newResponse.headers['Connection'])
if 'Content-Type' in newResponse.headers:
responseHeaders['Content-Type'] = str(newResponse.headers['Content-Type'])
if 'Content-Length' in newResponse.headers:
responseHeaders['Content-Length'] = str(newResponse.headers['Content-Length'])
if 'Content-Encoding' in newResponse.headers:
responseHeaders['Content-Encoding'] = str(inetResponse.headers['Content-Encoding'])
response = HTTPResponse( # from libmproxy.models import HTTPResponse
http_version='HTTP/1.1',
status_code=200,
reason='OK',
headers=responseHeaders,
content=newResponse.content)
flow.reply(response)
so I have been writing a simple web server in Python, and right now I'm trying to handle multipart/form-data POST requests. I can already handle application/x-www-form-urlencoded POST requests, but the same code won't work for the multipart. If it looks like I am misunderstanding anything, please call me out, even if it's something minor. Also if you guys have any advice on making my code better please let me know as well :) Thanks!
When the request comes in, I first parse it, and split it into a dictionary of headers and a string for the body of the request. I use those to then construct a FieldStorage form, which I can then treat like a dictionary to pull the data out:
requestInfo = ''
while requestInfo[-4:] != '\r\n\r\n':
requestInfo += conn.recv(1)
requestSplit = requestInfo.split('\r\n')[0].split(' ')
requestType = requestSplit[0]
url = urlparse.urlparse(requestSplit[1])
path = url[2] # Grab Path
if requestType == "POST":
headers, body = parse_post(conn, requestInfo)
print "!!!Request!!! " + requestInfo
print "!!!Body!!! " + body
form = cgi.FieldStorage(headers = headers, fp = StringIO(body), environ = {'REQUEST_METHOD':'POST'}, keep_blank_values=1)
Here's my parse_post method:
def parse_post(conn, headers_string):
headers = {}
headers_list = headers_string.split('\r\n')
for i in range(1,len(headers_list)-2):
header = headers_list[i].split(': ', 1)
headers[header[0]] = header[1]
content_length = int(headers['Content-Length'])
content = conn.recv(content_length)
# Parse Content differently if it's a multipart request??
return headers, content
So for an x-www-form-urlencoded POST request, I can treat FieldStorage form like a dictionary, and if I call, for example:
firstname = args['firstname'].value
print firstname
It will work. However, if I instead send a multipart POST request, it ends up printing nothing.
This is the body of the x-www-form-urlencoded request:
firstname=TEST&lastname=rwar
This is the body of the multipart request:
--070f6a3146974d399d97c85dcf93ed44
Content-Disposition: form-data; name="lastname"; filename="lastname"
rwar
--070f6a3146974d399d97c85dcf93ed44
Content-Disposition: form-data; name="firstname"; filename="firstname"
TEST
--070f6a3146974d399d97c85dcf93ed44--
So here's the question, should I manually parse the body for the data in parse_post if it's a multipart request?
Or is there a method that I need/can use to parse the multipart body?
Or am I doing this wrong completely?
Thanks again, I know it's a long read but I wanted to make sure my question was comprehensive
So I solved my problem, but in a totally hacky way.
Ended up manually parsing the body of the request, here's the code I wrote:
if("multipart/form-data" in headers["Content-Type"]):
data_list = []
content_list = content.split("\r\n\r\n")
for i in range(len(content_list) - 1):
data_list.append("")
data_list[0] += content_list[0].split("name=")[1].split(";")[0].replace('"','') + "="
for i,c in enumerate(content_list[1:-1]):
key = c.split("name=")[1].split(";")[0].replace('"','')
data_list[i+1] += key + "="
value = c.split("\r\n")
data_list[i] += value[0]
data_list[-1] += content_list[-1].split("\r\n")[0]
content = "&".join(data_list)
If anybody can still solve my problem without having to manually parse the body, please let me know!
There's the streaming-form-data project that provides a Python parser to parse data that's multipart/form-data encoded. It's intended to allow parsing data in chunks, but since there's no chunk size enforced, you could just pass your entire input at once and it should do the job. It should be installable via pip install streaming_form_data.
Here's the source code - https://github.com/siddhantgoel/streaming-form-data
Documentation - https://streaming-form-data.readthedocs.io/en/latest/
Disclaimer: I'm the author. Of course, please create an issue in case you run into a bug. :)
I have seen this thread already - How can I unshorten a URL?
My issue with the resolved answer (that is using the unshort.me API) is that I am focusing on unshortening youtube links. Since unshort.me is used readily, this returns almost 90% of the results with captchas which I am unable to resolve.
So far I am stuck with using:
def unshorten_url(url):
resolvedURL = urllib2.urlopen(url)
print resolvedURL.url
#t = Test()
#c = pycurl.Curl()
#c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
#c.setopt(c.WRITEFUNCTION, t.body_callback)
#c.perform()
#c.close()
#dom = xml.dom.minidom.parseString(t.contents)
#resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue
return resolvedURL.url
Note: everything in the comments is what I tried to do when using the unshort.me service which was returning captcha links.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
one line functions, using requests library and yes, it supports recursion.
def unshorten_url(url):
return requests.head(url, allow_redirects=True).url
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse
def unshorten_url(url):
parsed = urlparse.urlparse(url)
h = httplib.HTTPConnection(parsed.netloc)
resource = parsed.path
if parsed.query != "":
resource += "?" + parsed.query
h.request('HEAD', resource )
response = h.getresponse()
if response.status/100 == 3 and response.getheader('Location'):
return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
else:
return url
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation
without using open (since it is a waste of bandwidth)?
The more efficient way is to not close the connection, keep it open in the background, by using HTTP's Connection: keep-alive.
After a small test, unshorten.me seems to take the HEAD method into account and doing a redirect to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get. Beside tweaking your kernel's TCP/IP stack.
Here a src code that takes into account almost of the useful corner cases:
set a custom Timeout.
set a custom User Agent.
check whether we have to use an http or https connection.
resolve recursively the input url and prevent ending within a loop.
The src code is on github # https://github.com/amirkrifa/UnShortenUrl
comments are welcome ...
import logging
logging.basicConfig(level=logging.DEBUG)
TIMEOUT = 10
class UnShortenUrl:
def process(self, url, previous_url=None):
logging.info('Init url: %s'%url)
import urlparse
import httplib
try:
parsed = urlparse.urlparse(url)
if parsed.scheme == 'https':
h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
else:
h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
resource = parsed.path
if parsed.query != "":
resource += "?" + parsed.query
try:
h.request('HEAD',
resource,
headers={'User-Agent': 'curl/7.38.0'}
)
response = h.getresponse()
except:
import traceback
traceback.print_exec()
return url
logging.info('Response status: %d'%response.status)
if response.status/100 == 3 and response.getheader('Location'):
red_url = response.getheader('Location')
logging.info('Red, previous: %s, %s'%(red_url, previous_url))
if red_url == previous_url:
return red_url
return self.process(red_url, previous_url=url)
else:
return url
except:
import traceback
traceback.print_exc()
return None
import requests
short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)
I am attempting to make use of the Twisted.Web framework.
Notice the three line comments (#line1, #line2, #line3). I want to create a proxy (gateway?) that will forward a request to one of two servers depending on the url. If I uncomment either comment 1 or 2 (and comment the rest), the request is proxied to the correct server. However, of course, it does not pick the server based on the URL.
from twisted.internet import reactor
from twisted.web import proxy, server
from twisted.web.resource import Resource
class Simple(Resource):
isLeaf = True
allowedMethods = ("GET","POST")
def getChild(self, name, request):
if name == "/" or name == "":
return proxy.ReverseProxyResource('localhost', 8086, '')
else:
return proxy.ReverseProxyResource('localhost', 8085, '')
simple = Simple()
# site = server.Site(proxy.ReverseProxyResource('localhost', 8085, '')) #line1
# site = server.Site(proxy.ReverseProxyResource('localhost', 8085, '')) #line2
site = server.Site(simple) #line3
reactor.listenTCP(8080, site)
reactor.run()
As the code above currently stands, when I run this script and navigate to server "localhost:8080/ANYTHING_AT_ALL" I get the following response.
Method Not Allowed
Your browser approached me (at /ANYTHING_AT_ALL) with the method "GET". I
only allow the methods GET, POST here.
I don't know what I am doing wrong? Any help would be appreciated.
Since your Simple class implements the getChild() method, it is implied that this is not a leaf node, however, you are stating that it is a leaf node by setting isLeaf = True. (How can a leaf node have a child?).
Try changing isLeaf = True to isLeaf = False and you'll find that it redirects to the proxy as you'd expect.
From the Resource.getChild docstring:
... This will not be called if the class-level variable 'isLeaf' is set in
your subclass; instead, the 'postpath' attribute of the request will be
left as a list of the remaining path elements....
Here is the final working solution. Basically two resource request go to the GAE server, and all remaining request go to the GWT server.
Other than implementing mhawke's change, there is only one other change, and that was adding '"/" + name' to the proxy servers path. I assume this had to be done because that portion of the path was consumed and placed in the 'name' variable.
from twisted.internet import reactor
from twisted.web import proxy, server
from twisted.web.resource import Resource
class Simple(Resource):
isLeaf = False
allowedMethods = ("GET","POST")
def getChild(self, name, request):
print "getChild called with name:'%s'" % name
if name == "get.json" or name == "post.json":
print "proxy on GAE"
return proxy.ReverseProxyResource('localhost', 8085, "/"+name)
else:
print "proxy on GWT"
return proxy.ReverseProxyResource('localhost', 8086, "/"+name)
simple = Simple()
site = server.Site(simple)
reactor.listenTCP(8080, site)
reactor.run()
Thank you.