I am trying to create an intelligent Python proxy server that can stream large request body content from a client to internal storage (Amazon S3, Swift, FTP, or something similar). Before streaming, the server should query an internal API server that determines the upload parameters for the internal storage. The main restriction is that it must be done in a single HTTP PUT operation. It also has to work asynchronously, because there will be many concurrent file uploads.
What solution lets me read chunks of the upload content and start streaming those chunks to internal storage before the user has finished uploading the whole file? Every Python web framework I know of waits for the entire body to be received before handing control to the WSGI application/Python web server.
One of the solutions I found is the tornado fork https://github.com/nephics/tornado, but it is unofficial and the Tornado developers are in no hurry to merge it into the main branch.
So maybe you know of an existing solution for this problem? Tornado? Twisted? gevent?
Here's an example of a server that does streaming upload handling written with Twisted:
from twisted.internet import reactor
from twisted.internet.endpoints import serverFromString
from twisted.web.server import Request, Site
from twisted.web.resource import Resource
from twisted.application.service import Application
from twisted.application.internet import StreamServerEndpointService

# Define a Resource class that doesn't really care what requests are made of it.
# This simplifies things since it lets us mostly ignore Twisted Web's resource
# traversal features.
class StubResource(Resource):
    isLeaf = True

    def render(self, request):
        return b""

class StreamingRequestHandler(Request):
    def handleContentChunk(self, chunk):
        # `chunk` is part of the request body.
        # This method is called as the chunks are received.
        Request.handleContentChunk(self, chunk)

        # Unfortunately you have to use a private attribute to learn where
        # the content is being sent.
        path = self.channel._path
        print "Server received %d more bytes for %s" % (len(chunk), path)

class StreamingSite(Site):
    requestFactory = StreamingRequestHandler

application = Application("Streaming Upload Server")
factory = StreamingSite(StubResource())
endpoint = serverFromString(reactor, b"tcp:8080")
StreamServerEndpointService(endpoint, factory).setServiceParent(application)
This is a tac file (put it in streamingserver.tac and run twistd -ny streamingserver.tac).
Because of the need to use self.channel._path, this isn't a completely supported approach. The API overall is pretty clunky as well, so this is more an example that it's possible than that it's good. There has long been an intent to make this sort of thing easier (http://tm.tl/288), but it will probably be a long while yet before that is accomplished.
It seems I have a solution using the gevent library and monkey patching:
from gevent.monkey import patch_all
patch_all()

from gevent.pywsgi import WSGIServer

def stream_to_internal_storage(data):
    pass

def simple_app(environ, start_response):
    bytes_to_read = 1024
    while True:
        readbuffer = environ["wsgi.input"].read(bytes_to_read)
        if not len(readbuffer) > 0:
            break
        stream_to_internal_storage(readbuffer)
    start_response("200 OK", [("Content-type", "text/html")])
    return ["hello world"]

def run():
    config = {'host': '127.0.0.1', 'port': 45000}
    server = WSGIServer((config['host'], config['port']), application=simple_app)
    server.serve_forever()

if __name__ == '__main__':
    run()
It works well when I try to upload a huge file:
curl -i -X PUT --progress-bar --verbose --data-binary @/path/to/huge/file "http://127.0.0.1:45000"
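If you later want to fill in the stream_to_internal_storage stub for S3, one possible direction (a hedged sketch, not part of the original answer; the bucket and key are placeholders) is a function that consumes wsgi.input directly and pushes parts with boto3 multipart uploads, since S3 requires every part except the last to be at least 5 MB:

import boto3

s3 = boto3.client('s3')
BUCKET, KEY = 'my-bucket', 'uploaded-file'   # placeholder names
MIN_PART = 5 * 1024 * 1024                   # S3 minimum part size (except the last part)

def stream_wsgi_input_to_s3(stream):
    # Assumes a non-empty request body; error handling is left out of the sketch.
    upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)['UploadId']
    parts, buf, part_no = [], b'', 1
    while True:
        chunk = stream.read(64 * 1024)
        buf += chunk
        # Flush a part once enough data has accumulated, or at end of input.
        if (not chunk and buf) or len(buf) >= MIN_PART:
            resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_no,
                                  UploadId=upload_id, Body=buf)
            parts.append({'ETag': resp['ETag'], 'PartNumber': part_no})
            part_no += 1
            buf = b''
        if not chunk:
            break
    s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                                 MultipartUpload={'Parts': parts})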
The API should allow arbitrary HTTP GET requests containing URLs the user wants scraped, and then Flask should return the results of the scrape.
The following code works for the first HTTP request, but after the Twisted reactor stops, it won't restart. I may not even be going about this the right way, but I just want to put a RESTful Scrapy API up on Heroku, and what I have so far is all I can think of.
Is there a better way to architect this solution? Or how can I allow scrape_it to return without stopping the Twisted reactor (which can't be started again)?
from flask import Flask
import os
import sys
import json

from n_grams.spiders.n_gram_spider import NGramsSpider

# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

app = Flask(__name__)

def scrape_it(url):
    items = []
    def add_item(item):
        items.append(item)

    runner = CrawlerRunner()

    d = runner.crawl(NGramsSpider, [url])
    d.addBoth(lambda _: reactor.stop())  # <<< TROUBLES HERE ???

    dispatcher.connect(add_item, signal=signals.item_passed)

    reactor.run(installSignalHandlers=0)  # the script will block here until the crawling is finished

    return items

@app.route('/scrape/<path:url>')
def scrape(url):
    ret = scrape_it(url)
    return json.dumps(ret, ensure_ascii=False, encoding='utf8')

if __name__ == '__main__':
    PORT = os.environ['PORT'] if 'PORT' in os.environ else 8080
    app.run(debug=True, host='0.0.0.0', port=int(PORT))
I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started/stopped more than once in a single thread.
Let's assume there is no problem with the Twisted reactor and that you can start and stop it. It still won't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads/processes.
I think the way to go is to create the API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.
Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is also not hard, because you can make it use the Twisted event loop.
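For instance, here is a minimal Klein sketch of that approach (this is my illustration, not from ScrapyRT or Arachnado; MySpider is a stand-in for your own spider). Crawls run as Deferreds on the shared Twisted reactor, which is started once and never stopped, so the API keeps serving requests while spiders run:

import json

import scrapy
from klein import Klein
from scrapy import signals
from scrapy.crawler import CrawlerRunner

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

app = Klein()
runner = CrawlerRunner()

@app.route('/scrape/<path:url>')
def scrape(request, url):
    items = []
    crawler = runner.create_crawler(MySpider)
    crawler.signals.connect(lambda item, **kw: items.append(dict(item)),
                            signal=signals.item_scraped)
    # runner.crawl returns a Deferred; the reactor keeps serving other requests
    # while the spider runs, and the response is written when the crawl finishes.
    d = runner.crawl(crawler, start_urls=[url])
    d.addCallback(lambda _: json.dumps(items))
    return d

if __name__ == '__main__':
    app.run('0.0.0.0', 8080)  # starts the Twisted reactor once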
There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
As an example of Scrapy-Tornado integration, check Arachnado - it shows how to integrate Scrapy's CrawlerProcess with Tornado's Application.
If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go this route, you can use requests + BeautifulSoup as well.
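For completeness, a hedged sketch of that requests + BeautifulSoup variant (the title/link extraction is just an example of what "the results of the scrape" might be; each request blocks a Flask worker while the page is fetched):

import json

import requests
from bs4 import BeautifulSoup
from flask import Flask

app = Flask(__name__)

@app.route('/scrape/<path:url>')
def scrape(url):
    # Fetch and parse synchronously; no Scrapy, no Twisted reactor involved.
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    return json.dumps({'title': soup.title.string if soup.title else None,
                       'links': links})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)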
I was working on a similar project last week - an SEO service API - and my workflow was like this:
The client sends a request to the Flask-based server with a URL to scrape, and a callback URL to notify the client when scraping is done (the client here is another web app)
Run Scrapy in the background using Celery. The spider will save the data to the database (see the sketch after this list)
The background service will notify the client by calling the callback URL when the spider is done
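A hedged sketch of step 2 (the spider name seo_spider, the Redis broker URL, and the -a start_url=... spider argument are placeholders, not from this answer): the crawl runs in its own process via the Scrapy CLI, so the Twisted reactor restriction doesn't bite inside a long-lived Celery worker, and the step 3 callback fires when the spider exits.

import subprocess

import requests
from celery import Celery

celery_app = Celery('tasks', broker='redis://localhost:6379/0')

@celery_app.task
def crawl_and_notify(url, callback_url):
    # Step 2: run the spider in its own process (the Twisted reactor cannot be
    # restarted inside a reused worker process, so the Scrapy CLI is safer here).
    subprocess.check_call(['scrapy', 'crawl', 'seo_spider', '-a', 'start_url=' + url])
    # Step 3: the spider has saved its data to the database; notify the client.
    requests.post(callback_url, json={'url': url, 'status': 'done'})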
I just started using Crossbar.io to implement a live stats page. I've looked at a lot of code examples, but I can't figure out how to do this:
I have a Django service (to avoid confusion, you can assume I'm talking about a function in views.py) and I'd like it to publish messages to a specific topic whenever it gets called. I've seen these approaches: (1) extending ApplicationSession and (2) using an Application instance that is run.
Neither works for me, because the Django service doesn't live inside a class and isn't executed as a standalone Python file either, so I can't find a way to call the "publish" method (which is the only thing I want to do on the server side).
I tried to get an instance of StatsBackend, which extends ApplicationSession, and publish something... but StatsBackend._instance is always None (even when I run 'crossbar start' and StatsBackend.__init__() is called).
StatsBackend.py:
from twisted.internet.defer import inlineCallbacks

from autobahn import wamp
from autobahn.twisted.wamp import ApplicationSession

class StatsBackend(ApplicationSession):
    _instance = None

    def __init__(self, config):
        ApplicationSession.__init__(self, config)
        StatsBackend._instance = self

    @classmethod
    def update_stats(cls, amount):
        if cls._instance:
            cls._instance.publish('com.xxx.statsupdate', {'amount': amount})

    @inlineCallbacks
    def onJoin(self, details):
        res = yield self.register(self)
        print("CampaignStatsBackend: {} procedures registered!".format(len(res)))
test.py:
from StatsBackend import StatsBackend

StatsBackend.update_stats(100)  # Doesn't do anything, StatsBackend._instance is None
Django is a blocking WSGI application, and that does not blend well with AutobahnPython, which is non-blocking (runs on top of Twisted or asyncio).
However, Crossbar.io has a built-in REST bridge, which includes an HTTP Pusher to which you can submit events via any HTTP/POST-capable client. Crossbar.io will forward those events to regular WAMP subscribers (e.g. via WebSocket in real time).
Crossbar.io also comes with a complete application template that demonstrates the above functionality. To try it:
cd ~/test1
crossbar init --template pusher
crossbar start
Open your browser at http://localhost:8080 (open the JS console), and in a second terminal:
curl -H "Content-Type: application/json" \
-d '{"topic": "com.myapp.topic1", "args": ["Hello, world"]}' \
http://127.0.0.1:8080/push
You can then do the publish from within a blocking application like Django.
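For example, a hedged sketch of such a publish from a Django view using the requests library (the /push endpoint and the topic mirror the curl example above and depend on your Crossbar configuration):

import requests
from django.http import HttpResponse

def update_stats(request):
    # Submit the event to Crossbar's HTTP Pusher; Crossbar republishes it to
    # WAMP subscribers of the topic in real time.
    requests.post(
        'http://127.0.0.1:8080/push',
        json={'topic': 'com.xxx.statsupdate', 'args': [{'amount': 100}]},
    )
    return HttpResponse('event pushed')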
I found what I needed: it is possible to do an HTTP POST request to publish on a topic.
You can read the doc for more information: https://github.com/crossbario/crossbar/wiki/Using-the-REST-to-WebSocket-Pusher
GitHub offers to send post-receive hooks to a URL of your choice when there's activity on your repo.
I want to write a small Python command-line/background application (i.e. no GUI or webapp) running on my computer (later on a NAS), which continually listens for those incoming POST requests, and once a POST is received from GitHub, processes the JSON information contained within. Processing the JSON as soon as I have it is no problem.
The POST can come from a small number of IPs given by GitHub; I plan/hope to specify a port on my computer where it should get sent.
The problem is, I don't know enough about web technologies to deal with the vast number of options you find when searching. Do I use Django, Requests, sockets, Flask, microframeworks...? I don't know what most of the terms involved mean, and most sound like they offer too much/are too big to solve my problem - I'm simply overwhelmed and don't know where to start.
Most tutorials about POST/GET that I could find seem to be concerned with either sending data to or directly requesting data from a website, but not with continually listening for it.
I feel the problem is not really a difficult one, and will boil down to a couple of lines, once I know where to go/how to do it. Can anybody offer pointers/tutorials/examples/sample code?
First things first: the web is request-response based. Something will request your URL, and you will respond accordingly. Your server application will be continuously listening on a port; you don't have to worry about that.
Here is a version in Flask (my microframework of choice):
from flask import Flask, request
import json

app = Flask(__name__)

@app.route('/', methods=['POST'])
def foo():
    data = json.loads(request.data)
    print "New commit by: {}".format(data['commits'][0]['author']['name'])
    return "OK"

if __name__ == '__main__':
    app.run()
Here is a sample run, using the example from github:
Running the server (the above code is saved in sample.py):
burhan@lenux:~$ python sample.py
* Running on http://127.0.0.1:5000/
Here is a request to the server, basically what github will do:
burhan@lenux:~$ http POST http://127.0.0.1:5000 < sample.json
HTTP/1.0 200 OK
Content-Length: 2
Content-Type: text/html; charset=utf-8
Date: Sun, 27 Jan 2013 19:07:56 GMT
Server: Werkzeug/0.8.3 Python/2.7.3
OK # <-- this is the response the client gets
Here is the output at the server:
New commit by: Chris Wanstrath
127.0.0.1 - - [27/Jan/2013 22:07:56] "POST / HTTP/1.1" 200 -
Here's a basic web.py example for receiving data via POST and doing something with it (in this case, just printing it to stdout):
import web

urls = ('/.*', 'hooks')

app = web.application(urls, globals())

class hooks:
    def POST(self):
        data = web.data()
        print
        print 'DATA RECEIVED:'
        print data
        print
        return 'OK'

if __name__ == '__main__':
    app.run()
I POSTed some data to it using hurl.it (after forwarding 8080 on my router), and saw the following output:
$ python hooks.py
http://0.0.0.0:8080/
DATA RECEIVED:
test=thisisatest&test2=25
50.19.170.198:33407 - - [27/Jan/2013 10:18:37] "HTTP/1.1 POST /hooks" - 200 OK
You should be able to swap out the print statements for your JSON processing.
To specify the port number, call the script with an extra argument:
$ python hooks.py 1234
I would use:
https://github.com/carlos-jenkins/python-github-webhooks
You can configure a web server to use it, or, if you just need a process running without a web server, you can launch the integrated server:
python webhooks.py
This will allow you to do everything you said you need. It nevertheless requires a bit of setup in your repository and in your hooks.
Late to the party and shameless self-promotion, sorry.
If you are using Flask, here's a very minimal code to listen for webhooks:
from flask import Flask, request, Response

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def respond():
    print(request.json)  # Handle webhook request here
    return Response(status=200)
And the same example using Django:
import json

from django.http import HttpResponse
from django.views.decorators.http import require_POST

@require_POST
def example(request):
    print(json.loads(request.body))  # Handle webhook request here (Django requests have no .json attribute)
    return HttpResponse('Hello, world. This is the webhook response.')
If you need more information, here's a great tutorial on how to listen for webhooks with Python.
If you're looking to watch for changes in any repo...
1. If you own the repo that you want to watch
In your repo page, go to Settings
click Webhooks, then add a new webhook (top right)
give it your IP/endpoint and set everything up to your liking
use any server to get notified
2. Not your repo
take the URL you want, e.g. https://github.com/fire17/gd-xo/
add /commits/master.atom to the end, such as:
https://github.com/fire17/gd-xo/commits/master.atom
Use any library you want to get that page's content, for example requests:

import requests
response = requests.get("https://github.com/fire17/gd-xo/commits/master.atom").text

Filter out the keys you want, for example the <updated> element:

response.split("<updated>")[1].split("</updated>")[0]
'2021-08-06T19:01:53Z'

Make a loop that checks this every so often; if this string has changed, you can initiate a clone/pull or do whatever you like (see the sketch below).
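A minimal sketch of that polling loop (the 60-second interval and the print are just illustrative):

import time

import requests

FEED = "https://github.com/fire17/gd-xo/commits/master.atom"

def latest_update(feed_url):
    # Return the timestamp of the most recent commit in the Atom feed.
    text = requests.get(feed_url).text
    return text.split("<updated>")[1].split("</updated>")[0]

last_seen = latest_update(FEED)
while True:
    time.sleep(60)
    current = latest_update(FEED)
    if current != last_seen:
        last_seen = current
        print("New activity at", current)  # e.g. trigger your clone/pull here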
I need to read some values from the WSGI request before my Flask app is loaded. If I read the URL from the WSGI request, I can access the file without any issues once the Flask app is loaded (after the middleware runs).
But if I attempt to access the params, the POST data seems to be gone by the time the Flask app is loaded. I even went to the extreme of wrapping the WSGI request in a special Webob Request to prevent this "read once" problem.
Does anyone know how to access values from the WSGI request in middleware without doing any side-effect harm to the request, so that you can still get the POST data / file data in the Flask app?
from webob import Request

class SomeMiddleware(object):
    def __init__(self, environ):
        self.request = Request(environ)
        self.orig_environ = environ

    def apply_middleware(self):
        print self.request.url     # will not do any harm
        print self.request.params  # will cause me to lose data
Here is my Flask view:
@app.route('/')
def hello_world():
    from flask import request
    the_file = request.files['file']
    print "and the file is", the_file
From what I can tell, this is a limitation of the way WSGI works. The stream only needs to be consumable once (PEP 333 and 3333 only require that the stream support read* calls; tell does not need to be supported). Once the stream is exhausted, it cannot be re-streamed to other WSGI applications further "inward". Take a look at these two sections of Werkzeug's documentation for more information:
http://werkzeug.pocoo.org/docs/request_data/
http://werkzeug.pocoo.org/docs/http/#module-werkzeug.formparser
The way to avoid this issue is to wrap the input stream (wsgi.input) in an object that implements the read and readline methods. Then, only when the final application in the chain actually attempts to exhaust the stream will your methods be run. See Flask's documentation on generating a request checksum for an example of this pattern.
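A hedged sketch of that wrapping idea (the class names and the observer callback are illustrative, not from Flask or Werkzeug): the middleware replaces wsgi.input with a thin proxy, so the downstream app still gets every byte while you observe the chunks as they are read.

class ObservedInput(object):
    """Proxy for wsgi.input that reports each chunk to an observer callback."""

    def __init__(self, stream, observer):
        self._stream = stream
        self._observer = observer

    def read(self, *args):
        data = self._stream.read(*args)
        self._observer(data)
        return data

    def readline(self, *args):
        data = self._stream.readline(*args)
        self._observer(data)
        return data

class ObservingMiddleware(object):
    def __init__(self, app, observer):
        self.app = app
        self.observer = observer

    def __call__(self, environ, start_response):
        # Wrap the stream instead of reading it: the Flask app downstream still
        # exhausts it itself, and the observer sees the chunks as they go past.
        environ['wsgi.input'] = ObservedInput(environ['wsgi.input'], self.observer)
        return self.app(environ, start_response)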
That being said, are you sure a middleware is the best solution to your problems? If you need to perform some action (dispatch, logging, authentication) based on the content of the body of the request you may be better off making it a part of your application, rather than a stand-alone application of its own.
Given that when a user requests /foo on my server, I send the following HTTP response (without closing the connection):
Content-Type: multipart/x-mixed-replace; boundary=-----------------------
-----------------------
Content-Type: text/html
foo
When the user goes to /bar (which will send 204 No Content so the view doesn't change), I want to send the following data in the initial response.
-----------------------
Content-Type: text/html
bar
How would I get the second request to trigger this from the initial response? I'm planning on possibly creating a fancy [engines that support multipart/x-mixed-replace (currently only Gecko)]-only email webapp that does server-push and Ajax effects without any JavaScript, just for fun.
No complete answer, but:
In your question, you're describing a Comet-style architecture. Regarding support of Comet-style techniques in Python/WSGI, there is a StackOverflow question, which talks about various Python servers with support for long-running requests a la Comet.
Also interesting is this mail thread in the Python Web-SIG: "Could WSGI handle Asynchronous response?". In May 2008, there was a broad discussion in the Web-SIG about the topic of asynchronous requests in WSGI.
A recent development is evserver, a lightweight WSGI server, which implements the Asynchronous WSGI extension proposed by Christopher Stawarz in the Web-SIG in May 2008.
Finally, the Tornado web server supports non-blocking asynchronous requests. It has a chat example application using long polling, which has similarities with your requirements.
If the problem is to pass some command from the /bar application to the /foo application and you are using a servlet-like approach (the Python code is loaded once, not per request as in CGI), you can just change some class property of the /foo application and react to the change in the /foo instance (by checking the property's state).
Obviously the /foo application should not return right after the first request; it should keep yielding content line by line.
Though this is just theory - I have not tried it myself.
I have created a small example (just for fun, you know :))
import threading

num = 0
cond = threading.Condition()

def app(environ, start_response):
    global num
    cond.acquire()
    num += 1
    cond.notifyAll()
    cond.release()
    start_response("200 OK", [("Content-Type", "multipart/x-mixed-replace; boundary=xxx")])
    while True:
        n = num
        s = "--xxx\r\nContent-Type: text/html\r\n\r\n%s\n" % n
        yield s
        # wait for num change:
        cond.acquire()
        while num == n:
            cond.wait()
        cond.release()

from cherrypy.wsgiserver import CherryPyWSGIServer
server = CherryPyWSGIServer(("0.0.0.0", 3000), app)
try:
    server.start()
except KeyboardInterrupt:
    server.stop()

# Now whenever you visit http://127.0.0.1:3000/, the number increases.
# It also automatically increases in all previously opened windows/tabs.
The idea of a shared variable and thread synchronization (using a condition variable object) relies on the fact that the WSGI server provided by CherryPyWSGIServer is threaded.
Not sure if this is quite what you're looking for, but there is a fairly old way of doing server push using a MIME content type of multipart/x-mixed-replace.
Basically, you compose the response as a MIME object with content type multipart/x-mixed-replace and send the first "version" of a document down. The browser will keep the socket open.
Then, as the server decides to push more data, a new "version" of the document gets sent from the server, and the browser intelligently replaces the content (within whatever frame/iframe contains it).
This was an early way of doing webcams: the server would send down (push) image after image, and the browser would just keep replacing the image in the document over and over. It is also a way of doing a "Loading..." message over a single HTTP request.