The script snippet below is part of a script that implements a SimpleHTTPServer instance which triggers a third-party module upon a GET request. I am able to capture the third-party module's stdout messages and send them out to the web browser.
Currently, the script collects all the stdout messages and dumps them to the client only once the invoked module has finished.
Since I want each message to appear in the browser as soon as it is sent to stdout, output buffering needs to be disabled.
How do I do that in Python's SimpleHTTPServer?
def do_GET(self):
    global key
    stdout_ = sys.stdout  # Keep track of the previous value.
    stream = cStringIO.StringIO()
    sys.stdout = stream
    ''' Present frontpage with user authentication. '''
    if self.headers.getheader('Authorization') == None:
        self.do_AUTHHEAD()
        self.wfile.write('no auth header received')
        pass
    elif self.headers.getheader('Authorization') == 'Basic ' + key:
        if None != re.search('/api/v1/check/*', self.path):
            recordID = self.path.split('/')[-1]
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.send_header('Access-Control-Allow-Origin', '*')
            self.send_header('Access-Control-Allow-Methods', 'GET,POST,PUT,OPTIONS')
            self.send_header('Access-Control-Allow-Headers', 'X-Requested-With, Content-Type, Authorization')
            self.end_headers()
            notStarted = True
            while True:
                if notStarted is True:
                    self.moduleXYZ.start()
                    notStarted = False  # was "notStarted is False", a comparison with no effect
                if "finished" in stream.getvalue():
                    sys.stdout = stdout_  # Restore the previous stdout.
                    self.wfile.write(stream.getvalue())
                    break
Update
I modified the approach to fetch the status messages from the class instead of using stdout, and included Martijn's nice idea of how to keep track of changes.
When I run the server now, I realize that I really need threading: the script waits until the module has finished before it ever reaches the while loop.
Is it better to implement threading in the server or in the module class?
def do_GET(self):
    global key
    ''' Present frontpage with user authentication. '''
    if self.headers.getheader('Authorization') == None:
        self.do_AUTHHEAD()
        self.wfile.write('no auth header received')
        pass
    elif self.headers.getheader('Authorization') == 'Basic ' + key:
        if None != re.search('/api/v1/check/*', self.path):
            recordID = self.path.split('/')[-1]
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.send_header('Access-Control-Allow-Origin', '*')
            self.send_header('Access-Control-Allow-Methods', 'GET,POST,PUT,OPTIONS')
            self.send_header('Access-Control-Allow-Headers', 'X-Requested-With, Content-Type, Authorization')
            self.end_headers()
            self.moduleABC.startCrawl()
            sent = 0  # must be initialised before the first slice below
            while True:
                if self.moduleABC.done:
                    print "done"
                    break
                output = self.moduleABC.statusMessages
                self.wfile.write(output[sent:])
                sent = len(output)
        else:
            self.send_response(403)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
Update 2 (working)
This is my updated GET method. The class object of the third-party module is instantiated in the GET method, and the module's main method is run in a thread. I use Martijn's ideas to monitor progress.
It took me a while to figure out that it is necessary to append some extra bytes to the status text that is sent to the browser to force a buffer flush!
Thanks for your help with this.
def do_GET(self):
    global key
    abcd = abcdModule(u"abcd")
    ''' Present frontpage with user authentication. '''
    if self.headers.getheader('Authorization') == None:
        self.do_AUTHHEAD()
        self.wfile.write('no auth header received')
        pass
    elif self.headers.getheader('Authorization') == 'Basic ' + key:
        if None != re.search('/api/v1/check/*', self.path):
            recordID = self.path.split('/')[-1]
            abcd.setMasterlist([urllib.unquote(recordID)])
            abcd.useCaching = False
            abcd.maxRecursion = 1
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.send_header('Access-Control-Allow-Origin', '*')
            self.send_header('Access-Control-Allow-Methods', 'GET,POST,PUT,OPTIONS')
            self.send_header('Access-Control-Allow-Headers', 'X-Requested-With, Content-Type, Authorization')
            self.end_headers()
            thread.start_new_thread(abcd.start, ())
            sent = 0
            while True:
                if abcd.done:
                    print "done"
                    break
                output = abcd.statusMessages
                if len(output) == sent + 1:
                    print abcd.statusMessages[-1]
                    self.wfile.write(json.dumps(abcd.statusMessages))
                    self.wfile.write(" " * 999)  # padding to force the browser to flush its buffer
                sent = len(output)
        else:
            self.send_response(403)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
    else:
        self.do_AUTHHEAD()
        self.wfile.write(self.headers.getheader('Authorization'))
        self.wfile.write('not authenticated')
        pass
    return
You really want to fix moduleXYZ to not use stdout as the only means of output. This makes the module unsuitable for use in a multithreaded server, for example; two separate threads calling moduleXYZ will lead to output being woven together in unpredictable ways.
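For instance, the module could record its progress on the instance instead of printing. A minimal sketch, assuming you can edit moduleXYZ (statusMessages and done are illustrative names here, not an existing API):

import threading

class ModuleXYZ(object):
    def __init__(self):
        self.statusMessages = []  # each handler polls its own instance,
        self.done = False         # so parallel requests cannot interleave output

    def start(self):
        # Run the work in a background thread so the caller can keep polling.
        threading.Thread(target=self._run).start()

    def _run(self):
        for step in ('fetching', 'parsing', 'finished'):
            self.statusMessages.append(step)
        self.done = True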
However, there is no stream buffering going on here. You are instead capturing all stdout in a cStringIO object, and you only output the result once you see the string "finished" in the captured value. What you should do instead is continuously write out that value, tracking how much of it you have already sent:
self.moduleXYZ.start()
sent = 0
while True:
    output = stream.getvalue()
    self.wfile.write(output[sent:])
    sent = len(output)
    if "finished" in output:
        sys.stdout = stdout_
        break
Better still, just connect stdout to self.wfile and have the module write directly to the response; you'll need a different method to detect if the module thread is done in this case:
old_stdout = sys.stdout
sys.stdout = self.wfile
self.moduleXYZ.start()
while True:
    if self.moduleXYZ.done():
        sys.stdout = old_stdout
        break
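One hedged way to provide that done check, assuming the module's work happens in a single method you can hand to a thread (run is a stand-in name, not a documented moduleXYZ method; this fragment belongs inside do_GET like the block above):

import sys
import threading

worker = threading.Thread(target=self.moduleXYZ.run)
worker.start()
while worker.is_alive():
    worker.join(0.5)  # poll twice a second instead of busy-spinning
sys.stdout = old_stdout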
Related
Using Tornado, I have a POST request that takes a long time to complete, as it makes many requests to another API service and processes the data. This can take minutes, and I don't want it to block the entire web server from responding to other requests, which it currently does.
I looked at multiple threads here on SO, but they are often eight years old and the code no longer works, as Tornado removed the "engine" component from tornado.gen.
Is there an easy way to kick off this long-running call without blocking the entire web server in the process? Is there anything I can put in the code to say, "send the POST response and keep working on this one function without blocking any concurrent server requests from getting an immediate response"?
Example:
main.py
def make_app():
    return tornado.web.Application([
        (r"/v1", MainHandler),
        (r"/v1/addfile", AddHandler, dict(folderpaths=folderpaths)),
        (r"/v1/getfiles", GetHandler, dict(folderpaths=folderpaths)),
        (r"/v1/getfile", GetFileHandler, dict(folderpaths=folderpaths)),
    ])

if __name__ == "__main__":
    app = make_app()
    sockets = tornado.netutil.bind_sockets(8888)
    tornado.process.fork_processes(0)
    tornado.process.task_id()
    server = tornado.httpserver.HTTPServer(app)
    server.add_sockets(sockets)
    tornado.ioloop.IOLoop.current().start()
addHandler.py
class AddHandler(tornado.web.RequestHandler):
    def initialize(self, folderpaths):
        self.folderpaths = folderpaths

    def blockingFunction(self):
        time.sleep(320)
        post("AWAKE")

    def post(self):
        user = self.get_argument('user')
        folderpath = self.get_argument('inpath')
        outpath = self.get_argument('outpath')
        workflow_value = self.get_argument('workflow')
        status_code, status_text = validateInFolder(folderpath)
        if (status_code == 200):
            logging.info("Status Code 200")
            result = self.folderpaths.add_file(user, folderpath, outpath, workflow_value)
            self.write(result)
            self.finish()
            # At this point the path is validated.
            # The POST response should be sent out; the internal process should
            # continue, and new requests should not be blocked.
            self.blockingFunction()
The idea is that, once the input parameters are validated, the POST response should be sent out, and then the internal process (blockingFunction()) should start without blocking the Tornado server from processing another API POST request.
I tried defining blockingFunction() as async, which allows me to process multiple concurrent user requests, but there was a warning about a missing "await" for the async method.
Any help welcome. Thank you!
class AddHandler(tornado.web.RequestHandler):
    def initialize(self, folderpaths):
        self.folderpaths = folderpaths

    def blockingFunction(self):
        time.sleep(320)
        post("AWAKE")

    async def post(self):
        user = self.get_argument('user')
        folderpath = self.get_argument('inpath')
        outpath = self.get_argument('outpath')
        workflow_value = self.get_argument('workflow')
        status_code, status_text = validateInFolder(folderpath)
        if (status_code == 200):
            logging.info("Status Code 200")
            result = self.folderpaths.add_file(user, folderpath, outpath, workflow_value)
            self.write(result)
            self.finish()
            # At this point the path is validated and the POST response has been
            # sent; the internal process continues without blocking new requests.
            loop = asyncio.get_event_loop()  # requires "import asyncio" at the top
            await loop.run_in_executor(None, self.blockingFunction)
            # If this had multiple parameters it would be:
            # await loop.run_in_executor(None, self.blockingFunction, param1, param2)
Thank you @xyres
Further reading: https://www.tornadoweb.org/en/stable/faq.html
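If you'd rather stay inside Tornado's own API, the IOLoop exposes the same hook; a minimal sketch under the same assumptions as the handler above:

import time
import tornado.ioloop
import tornado.web

class AddHandler(tornado.web.RequestHandler):
    def blockingFunction(self):
        time.sleep(320)

    async def post(self):
        self.write("accepted")
        self.finish()  # the response goes out here
        # Run the blocking work on the IOLoop's default executor
        # (a ThreadPoolExecutor), keeping the event loop free.
        await tornado.ioloop.IOLoop.current().run_in_executor(None, self.blockingFunction)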
My REST API, written in Python, spawns processes that take about 3 minutes to complete. I store the PID in a global array and set up a secondary check method that should confirm whether the process is still running or has finished.
The only methods I can find are to poll the subprocess object (which I don't have access to in this route) or to try to kill the process to see if it's alive. Is there any clean way to get a yes/no answer as to whether a process is still running based on its PID, and whether it completed successfully if not?
from flask import Flask, jsonify, request, Response
from subprocess import Popen, PIPE
import os

app = Flask(__name__)

QUEUE_ID = 0
jobs = []

@app.route("/compile", methods=["POST"])
def compileFirmware():
    f = request.files['file']
    f.save(f.filename)
    os.chdir("/opt/src/2.0.x")
    process = Popen(['platformio', 'run', '-e', 'mega2560'], stdout=PIPE, stderr=PIPE, universal_newlines=True)
    global QUEUE_ID
    QUEUE_ID += 1
    data = {'id': QUEUE_ID, 'pid': process.pid}
    jobs.append(data)
    output, errors = process.communicate()
    print(output)
    print(errors)
    response = jsonify()
    response.status_code = 202  # accepted
    response.headers['location'] = '/queue/' + str(QUEUE_ID)
    response.headers.add('Access-Control-Allow-Origin', '*')
    return response

@app.route("/queue/<id>", methods=["GET"])
def getStatus(id):
    # CHECK PID STATUS HERE
    content = {'download_url': 'download.com'}
    response = jsonify(content)
    response.headers.add('Access-Control-Allow-Origin', '*')
    return response

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Here is a little simulation that works:
from flask import Flask, jsonify, request, Response, abort
from subprocess import Popen, PIPE
import os

app = Flask(__name__)

QUEUE = {}

@app.route("/compile", methods=["POST"])
def compileFirmware():
    # No extra quotes around the -c argument: Popen passes list entries
    # verbatim, so quoting would make Python evaluate a bare string literal
    # and exit immediately instead of sleeping.
    process = Popen(['python', '-c', 'import time; time.sleep(300)'], stdout=PIPE, stderr=PIPE, universal_newlines=True)
    QUEUE[str(process.pid)] = process  # String, because in GET the url param will be interpreted as str
    response = jsonify()
    response.status_code = 202  # accepted
    response.headers['location'] = '/queue/' + str(process.pid)
    response.headers.add('Access-Control-Allow-Origin', '*')
    return response

@app.route("/queue/<id>", methods=["GET"])
def getStatus(id):
    process = QUEUE.get(id, None)
    if process is None:
        abort(404, description="Process not found")
    retcode = process.poll()
    if retcode is None:
        content = {'download_url': None, 'message': 'Process is still running.'}
    else:
        # QUEUE.pop(id)  # Remove reference from QUEUE?
        content = {'download_url': 'download.com', 'message': f'process has completed with retcode: {retcode}'}
    response = jsonify(content)
    response.headers.add('Access-Control-Allow-Origin', '*')
    return response

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
There are further considerations to think about if this application is going to be used as more than an individual project.
We use the QUEUE global variable to store the states of processes. But in a real project, deployment via WSGI / gunicorn can have multiple workers, each with its own copy of that global. So for scaling up, consider using a redis / message-queue data store instead, as sketched below.
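A minimal sketch of that idea with redis (key names are illustrative; the Popen handle itself cannot be shared between workers, so the spawning worker records the outcome and any worker can answer the GET):

import threading
import redis

r = redis.Redis()  # assumes a redis server on localhost

def track(process):
    # Runs in the worker that spawned the child process.
    r.set('job:%d' % process.pid, 'running')
    retcode = process.wait()  # blocks this helper thread only
    r.set('job:%d' % process.pid, 'finished:%d' % retcode)

# after Popen(...) in the POST route:
# threading.Thread(target=track, args=(process,)).start()
# and in the GET route, any worker can answer:
# status = r.get('job:' + id)  # b'running', b'finished:0', or None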
Does the QUEUE ever need to be cleaned up? Should it be cleaned up? The disadvantage of removing an entry after its first GET is that the next GET fetches a 404. It is a design decision whether the GET API must be idempotent (most likely yes).
I have a custom HTTP request handler that can be simplified to something like this:
# Python 3:
from http import server

class MyHandler(server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        # Here's where all the complicated logic is done to generate HTML.
        # For clarity here, replace with a simple stand-in:
        html = "<html><p>hello world</p></html>"
        self.wfile.write(html.encode())
I'd like to unit-test this handler (i.e. make sure that my do_GET executes without an exception) without actually starting a web server. Is there any lightweight way to mock the SimpleHTTPServer so that I can test this code?
Expanding on the answer from jakevdp, I managed to check the output, too:
try:
    import unittest2 as unittest
except ImportError:
    import unittest

try:
    from io import BytesIO as IO
except ImportError:
    from StringIO import StringIO as IO

from server import MyHandlerSSL  # My BaseHTTPRequestHandler child

class TestableHandler(MyHandlerSSL):
    # On Python3, in socketserver.StreamRequestHandler, if this is
    # set it will use makefile() to produce the output stream. Otherwise,
    # it will use socketserver._SocketWriter, and we won't be able to get
    # to the data.
    wbufsize = 1

    def finish(self):
        # Do not close self.wfile, so we can read its value.
        self.wfile.flush()
        self.rfile.close()

    def date_time_string(self, timestamp=None):
        """ Mocked date time string """
        return 'DATETIME'

    def version_string(self):
        """ Mock the server id """
        return 'BaseHTTP/x.x Python/x.x.x'

class MockSocket(object):
    def getsockname(self):
        return ('sockname',)

class MockRequest(object):
    _sock = MockSocket()

    def __init__(self, path):
        self._path = path

    def makefile(self, *args, **kwargs):
        if args[0] == 'rb':
            return IO(b"GET %s HTTP/1.0" % self._path)
        elif args[0] == 'wb':
            return IO(b'')
        else:
            raise ValueError("Unknown file type to make", args, kwargs)

class HTTPRequestHandlerTestCase(unittest.TestCase):
    maxDiff = None

    def _test(self, request):
        handler = TestableHandler(request, (0, 0), None)
        return handler.wfile.getvalue()

    def test_unauthenticated(self):
        self.assertEqual(
            self._test(MockRequest(b'/')),
            b"""HTTP/1.0 401 Unauthorized\r
Server: BaseHTTP/x.x Python/x.x.x\r
Date: DATETIME\r
WWW-Authenticate: Basic realm="MyRealm", charset="UTF-8"\r
Content-type: text/html\r
\r
<html><head><title>Authentication Failed</title></html><body><h1>Authentication Failed</h1><p>Authentication Failed. Authorised Personnel Only.</p></body></html>"""
        )

def main():
    unittest.main()

if __name__ == "__main__":
    main()
The code I am testing returns a 401 Unauthorised for "/". Change the response as appropriate for your test case.
Here's one approach I came up with to mock the server. Note that this should be compatible with both Python 2 and Python 3. The only issue is that I can't find a way to access the result of the GET request, but at least the test will catch any exceptions it comes across!
try:
    # Python 2.x
    import BaseHTTPServer as server
    from StringIO import StringIO as IO
except ImportError:
    # Python 3.x
    from http import server
    from io import BytesIO as IO

class MyHandler(server.BaseHTTPRequestHandler):
    """Custom handler to be tested"""
    def do_GET(self):
        # print just to confirm that this method is being called
        print("executing do_GET")  # just to confirm...
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        # Here's where all the complicated logic is done to generate HTML.
        # For clarity here, replace with a simple stand-in:
        html = "<html><p>hello world</p></html>"
        self.wfile.write(html.encode())

def test_handler():
    """Test the custom HTTP request handler by mocking a server"""
    class MockRequest(object):
        def makefile(self, *args, **kwargs):
            return IO(b"GET /")

    class MockServer(object):
        def __init__(self, ip_port, Handler):
            handler = Handler(MockRequest(), ip_port, self)

    # The GET request will be sent here
    # and any exceptions will be propagated through.
    server = MockServer(('0.0.0.0', 8888), MyHandler)

test_handler()
So this is a little tricky depending on how "deep" you want to go into the BaseHTTPRequestHandler behavior to define your unit test. At the most basic level I think you can use this example from the mock library:
>>> from mock import MagicMock
>>> thing = ProductionClass()
>>> thing.method = MagicMock(return_value=3)
>>> thing.method(3, 4, 5, key='value')
3
>>> thing.method.assert_called_with(3, 4, 5, key='value')
So if you know which methods in the BaseHTTPRequestHandler your class is going to call you could mock the results of those methods to be something acceptable. This can of course get pretty complex depending on how many different types of server responses you want to test.
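Applied to the handler in the question, a minimal sketch of that idea could stub out the base-class plumbing so do_GET runs in isolation (only the names do_GET actually touches are mocked; test_do_get_writes_html is an illustrative name):

from io import BytesIO
from unittest import mock  # "pip install mock" on Python 2

def test_do_get_writes_html():
    handler = mock.Mock(spec=MyHandler)  # no __init__, no real socket
    handler.wfile = BytesIO()            # capture what do_GET writes
    MyHandler.do_GET(handler)            # run the real method against the stub
    handler.send_response.assert_called_once_with(200)
    assert b"hello world" in handler.wfile.getvalue()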
I have lately been improving security on my web server, which I wrote myself using http.server and BaseHTTPRequestHandler. I have blocked (403'd) most essential server files which I do not want users to be able to access, including the Python server script and all databases, plus some HTML templates.
However, in this post on Stack Overflow I read that using open(curdir + sep + self.path) in a do_GET handler can potentially make every file on your computer readable.
Can someone explain this to me? If self.path is ip:port/index.html every time, how can someone access files that are above the root directory?
I understand that the user (obviously) can change index.html to anything else, but I don't see how they can access directories above root.
Also, if you're wondering why I'm not using nginx or Apache: I wanted to create my own web server and website for learning purposes. I have no intention of running an actual website myself, and if I ever do, I will probably rent a server or use existing server software.
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            if "SOME BLOCKED FILE OR DIRECTORY" in self.path:
                self.send_error(403, "FORBIDDEN")
                return
            # I have about 6 more of these 403 parts, but I left them out for readability.
            if self.path.endswith(".html"):
                if self.path.endswith("index.html"):
                    # template is the Template Engine that I created to create dynamic HTML content
                    parser = template.TemplateEngine()
                    content = parser.get_content("index", False, "None", False)
                    self.send_response(200)
                    self.send_header("Content-type", "text/html")
                    self.end_headers()
                    self.wfile.write(content.encode("utf-8"))
                    return
                elif self.path.endswith("auth.html"):
                    parser = template.TemplateEngine()
                    content = parser.get_content("auth", False, "None", False)
                    self.send_response(200)
                    self.send_header("Content-type", "text/html")
                    self.end_headers()
                    self.wfile.write(content.encode("utf-8"))
                    return
                elif self.path.endswith("about.html"):
                    parser = template.TemplateEngine()
                    content = parser.get_content("about", False, "None", False)
                    self.send_response(200)
                    self.send_header("Content-type", "text/html")
                    self.end_headers()
                    self.wfile.write(content.encode("utf-8"))
                    return
                else:
                    try:
                        f = open(curdir + sep + self.path, "rb")
                        self.send_response(200)
                        self.send_header("Content-type", "text/html")
                        self.end_headers()
                        self.wfile.write(f.read())
                        f.close()
                        return
                    except IOError as e:
                        self.send_response(404)
                        self.send_header("Content-type", "text/html")
                        self.end_headers()
                        return
            else:
                if self.path.endswith(".css"):
                    h1 = "Content-type"
                    h2 = "text/css"
                elif self.path.endswith(".gif"):
                    h1 = "Content-type"
                    h2 = "gif"
                elif self.path.endswith(".jpg"):
                    h1 = "Content-type"
                    h2 = "jpg"
                elif self.path.endswith(".png"):
                    h1 = "Content-type"
                    h2 = "png"
                elif self.path.endswith(".ico"):
                    h1 = "Content-type"
                    h2 = "ico"
                elif self.path.endswith(".py"):
                    h1 = "Content-type"
                    h2 = "text/py"
                elif self.path.endswith(".js"):
                    h1 = "Content-type"
                    h2 = "application/javascript"
                else:
                    h1 = "Content-type"
                    h2 = "text"
                f = open(curdir + sep + self.path, "rb")
                self.send_response(200)
                self.send_header(h1, h2)
                self.end_headers()
                self.wfile.write(f.read())
                f.close()
                return
        except IOError:
            if "html_form_action.asp" in self.path:
                pass
            else:
                self.send_error(404, "File not found: %s" % self.path)
        except Exception as e:
            self.send_error(500)
            print("Unknown exception in do_GET: %s" % e)
You're making an invalid assumption:
If the self.path is ip:port/index.html every time, how can someone access files that are above the root / directory?
But self.path is never ip:port/index.html. Try logging it and see what you get.
For example, if I request http://example.com:8080/foo/bar/index.html, the self.path is not example.com:8080/foo/bar/index.html, but just /foo/bar/index.html. In fact, your code couldn't possibly work otherwise, because curdir + sep + self.path would give you a path starting with ./example.com:8080/, which won't exist.
And then ask yourself what happens if it's /../../../../../../../etc/passwd.
This is one of many reasons to use os.path instead of string manipulation for paths. For example, instead of this:
f = open(curdir + sep + self.path, "rb")
Do this:
path = os.path.abspath(os.path.join(curdir, self.path.lstrip('/')))  # lstrip, because join discards curdir when the right side is absolute
if os.path.commonprefix((path, curdir)) != curdir:
    # illegal!
I'm assuming that curdir here is an absolute path, not just from os import curdir or some other thing that's more likely to give you . than anything else. If it's the latter, make sure to abspath it as well.
This can catch other ways of escaping the jail as well as passing in .. strings… but it's not going to catch everything. For example, if there's a symlink pointing out of the jail, there's no way abspath can tell that someone's gone through the symlink.
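If symlinks are a concern, os.path.realpath resolves them along with the .. segments; a sketch, assuming jail_dir is an absolute path (is_inside_jail is an illustrative helper name):

import os

def is_inside_jail(request_path, jail_dir):
    # realpath resolves both ".." segments and symlinks, so a link
    # pointing out of jail_dir is caught too.
    real = os.path.realpath(os.path.join(jail_dir, request_path.lstrip('/')))
    return real == jail_dir or real.startswith(jail_dir + os.sep)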
self.path contains the request path. If I were to send a GET request and ask for the resource located at /../../../../../../../etc/passwd, I would break out of your application's current folder and be able to access any file on your filesystem (that you have permission to read).
I am having problems with pycurl in conjunction with Twitter's Streaming API filter stream.
When I run the code below, it seems to barf on the perform call; I know this because I placed print statements before and after it. I am using Python 2.6.1 on a Mac, if that matters.
#!/usr/bin/python
print "Content-type: text/html"
print

import pycurl, json, urllib, traceback  # traceback was missing but is used below

STREAM_URL = "http://stream.twitter.com/1/statuses/filter.json?follow=1&count=100"

USER = "user"
PASS = "password"

print "<html><head></head><body>"

class Client:
    def __init__(self):
        self.buffer = ""
        self.conn = pycurl.Curl()
        self.conn.setopt(pycurl.POST, 1)
        self.conn.setopt(pycurl.USERPWD, "%s:%s" % (USER, PASS))
        self.conn.setopt(pycurl.URL, STREAM_URL)
        self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
        try:
            self.conn.perform()
            self.conn.close()
        except BaseException:
            traceback.print_exc()

    def on_receive(self, data):
        self.buffer += data
        if data.endswith("\r\n") and self.buffer.strip():
            content = json.loads(self.buffer)
            self.buffer = ""
            print content
            if "text" in content:
                print u"{0[user][name]}: {0[text]}".format(content)

client = Client()

print "</body></html>"
First, try turning on verbosity to help debug:
self.conn.setopt(pycurl.VERBOSE, 1)
It looks like you aren't setting the basic auth mode:
self.conn.setopt(pycurl.HTTPAUTH, pycurl.HTTPAUTH_BASIC)
Also, according to the documentation, you need to POST the parameters to the API rather than passing them in the URL as GET parameters:
data = dict(track='stack overflow')
self.conn.setopt(pycurl.POSTFIELDS, urlencode(data))  # urlencode comes from urllib on Python 2
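Putting those pieces together, the connection setup might look like this (a sketch; the credentials and track value are placeholders, and note that Twitter has long since retired both this v1 streaming endpoint and Basic Auth):

from urllib import urlencode  # urllib.parse.urlencode on Python 3

self.conn = pycurl.Curl()
self.conn.setopt(pycurl.VERBOSE, 1)  # debug output on stderr
self.conn.setopt(pycurl.URL, STREAM_URL)
self.conn.setopt(pycurl.HTTPAUTH, pycurl.HTTPAUTH_BASIC)  # explicit basic auth
self.conn.setopt(pycurl.USERPWD, "%s:%s" % (USER, PASS))
self.conn.setopt(pycurl.POSTFIELDS, urlencode(dict(track='stack overflow')))  # implies POST
self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
self.conn.perform()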
You are trying to use basic authentication.
Basic Authentication sends user credentials in the header of the HTTP request. This makes it easy to use, but insecure. OAuth is the Twitter preferred method of authentication moving forward - come August 2010, we'll be turning off Basic Auth from the API. -- Authentication, Twitter