The API should accept arbitrary HTTP GET requests containing URLs the user wants scraped, and Flask should then return the results of the scrape.
The following code works for the first HTTP request, but after the Twisted reactor stops, it won't restart. I may not even be going about this the right way, but I just want to put a RESTful Scrapy API up on Heroku, and what I have so far is all I can think of.
Is there a better way to architect this solution? Or how can I allow scrape_it to return without stopping the Twisted reactor (which can't be started again)?
from flask import Flask
import os
import sys
import json
from n_grams.spiders.n_gram_spider import NGramsSpider
# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
app = Flask(__name__)
def scrape_it(url):
    items = []

    def add_item(item):
        items.append(item)

    runner = CrawlerRunner()

    d = runner.crawl(NGramsSpider, [url])
    d.addBoth(lambda _: reactor.stop())  # <<< TROUBLES HERE ???

    dispatcher.connect(add_item, signal=signals.item_passed)

    reactor.run(installSignalHandlers=0)  # the script will block here until the crawling is finished

    return items

@app.route('/scrape/<path:url>')
def scrape(url):
    ret = scrape_it(url)
    return json.dumps(ret, ensure_ascii=False, encoding='utf8')

if __name__ == '__main__':
    PORT = os.environ['PORT'] if 'PORT' in os.environ else 8080
    app.run(debug=True, host='0.0.0.0', port=int(PORT))
I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started/stopped more than once in a single thread.
Let's assume there is no problem with the Twisted reactor and you can start and stop it. It still won't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads/processes.
I think the way to go is to create the API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.
Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is also not hard because you can make it use the Twisted event loop.
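For instance, a minimal klein-based sketch (untested, and only an illustration: it reuses NGramsSpider from the question, and the route and handler names are mine, not taken from any of the projects mentioned below):

import json

from klein import Klein
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from n_grams.spiders.n_gram_spider import NGramsSpider  # spider from the question

app = Klein()
runner = CrawlerRunner()

@app.route('/scrape/<path:url>')
def scrape(request, url):
    items = []

    def add_item(item, **kwargs):
        items.append(dict(item))

    # create the crawler explicitly so the signal handler is scoped to this crawl
    crawler = runner.create_crawler(NGramsSpider)
    crawler.signals.connect(add_item, signal=signals.item_scraped)

    d = runner.crawl(crawler, [url])
    # klein accepts a Deferred as a response; the reactor keeps running,
    # so the API can serve other requests while the spider works
    d.addCallback(lambda _: json.dumps(items, ensure_ascii=False))
    return d

if __name__ == '__main__':
    app.run('0.0.0.0', 8080)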
There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
As an example of Scrapy-Tornado integration, check Arachnado - here is an example of how to integrate Scrapy's CrawlerProcess with Tornado's Application.
If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go this way, you might as well use requests + BeautifulSoup.
I was working on a similar project last week - an SEO service API - and my workflow was like this:
The client sends a request to the Flask-based server with a URL to scrape, and a callback URL to notify the client when scraping is done (the client here is another web app).
Run Scrapy in the background using Celery. The spider will save the data to the database (a sketch of this piece follows below).
The background service will notify the client by calling the callback URL when the spider is done.
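Roughly, the Celery piece might look like this (the spider name, broker URL, and callback payload are placeholders; the spider's pipeline is assumed to write the scraped data to the database):

# tasks.py
import subprocess

import requests
from celery import Celery

celery_app = Celery('tasks', broker='redis://localhost:6379/0')

@celery_app.task
def crawl_and_notify(url, callback_url):
    # run the spider in its own process so the Twisted reactor never has to
    # be restarted inside the long-lived Celery worker
    subprocess.check_call(['scrapy', 'crawl', 'myspider', '-a', 'start_url=' + url])
    # by now the pipeline has written the results to the database;
    # tell the client the scrape is finished
    requests.post(callback_url, json={'url': url, 'status': 'done'})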
Setup:
The language I am using is Python.
I am running a Flask app with threaded=True, and inside the Flask app, when an endpoint is hit, it starts a thread and returns "thread started successfully" with status code 200.
Inside the thread a re.search happens, which takes 30-40 seconds (worst case), and once the thread is completed it hits a callback URL, completing one request.
Ideally, the Flask app should be able to handle concurrent requests.
Issue:
When the re.search is happening inside the thread, the Flask app is not accepting concurrent requests. I am assuming some kind of thread locking is happening and am unable to figure it out.
Question:
Is it OK to do threading (using multiprocessing) inside a Flask app which has threaded=True?
When the regex is running, does it do any thread locking?
Code snippet: hit_callback does a POST request to another API, which is not needed for this issue.
import threading
from flask import *
import re

app = Flask(__name__)

@app.route("/text", methods=["POST"])
def temp():
    text = request.json["text"]

    def extract_company(text):
        attrib_list = ["llc", "ltd"]  # real one is 700 in length
        entity_attrib = r"((\s|^)" + r"(.)?(\s|$)|(\s|^)".join(attrib_list) + r"(\s|$))"
        raw_client = re.search("(.*(?:" + entity_attrib + "))", text, re.I)
        hit_callback(raw_client)  # POSTs to another API, defined elsewhere

    extract_thread = threading.Thread(target=extract_company, args=(text,))
    extract_thread.start()
    return jsonify({"Response": True}), 200

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=4557, threaded=True)
Please read up on the GIL - basically, Python can only execute ONE piece of code at a time, even in threads. So your re.search runs and blocks all other threads until it runs to completion.
Solutions: use Gunicorn or similar to run multiple Flask processes - do not attempt to do everything in one Flask process.
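For example, assuming the snippet above is saved as app.py (the worker count is arbitrary):

pip install gunicorn
gunicorn --workers 4 --bind 0.0.0.0:4557 app:app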
To add something: your design is also problematic - 40 seconds or so for an HTTP answer is way too long. A better design would have at least two services: a web service and a search service. The web service would register a search and give back an id; the search service would communicate asynchronously with the web service and return the result + id when it is ready. Your clients could poll the web service until they get a result. A rough sketch of that split is below.
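As a single-file approximation of that idea, where a process pool stands in for the separate search service (the endpoints, the in-memory store, and slow_search are all assumptions for illustration):

import re
import uuid
from concurrent.futures import ProcessPoolExecutor

from flask import Flask, jsonify, request

app = Flask(__name__)
executor = ProcessPoolExecutor(max_workers=4)  # separate processes sidestep the GIL
results = {}  # toy in-memory store; a real service would use a database or queue

def slow_search(text):
    # stand-in for the long-running re.search from the question
    return bool(re.search(r"llc|ltd", text, re.I))

@app.route("/search", methods=["POST"])
def register_search():
    search_id = str(uuid.uuid4())
    future = executor.submit(slow_search, request.json["text"])
    future.add_done_callback(lambda f: results.update({search_id: f.result()}))
    return jsonify({"id": search_id}), 202

@app.route("/result/<search_id>")
def poll_result(search_id):
    if search_id not in results:
        return jsonify({"status": "pending"}), 202
    return jsonify({"status": "done", "match": results[search_id]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=4557)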
I want to create a third-party chatbot API which is asynchronous and replies "ok" after a 10-second pause.
import time

def wait():
    time.sleep(10)
    return "ok"

# views.py
def api(request):
    return wait()
I have tried Celery for the same, as follows, where I wait for the Celery result in the view itself:
import time
from celery import shared_task
from celery.result import AsyncResult

@shared_task
def wait():
    time.sleep(10)
    return "ok"

# views.py
def api(request):
    a = wait.delay()
    work = AsyncResult(a.id)
    while True:
        if work.ready():
            return work.get(timeout=1)
But this solution works synchronously and makes no difference. How can we make it asynchronous without asking our user to keep on requesting until the result is received?
As mentioned in @Blusky's answer:
The asynchronous API will exist in Django 3.x, not before.
If this is not an option, then the answer is just no.
Please note as well that, even with Django 3.x, any Django code that accesses the database will not be asynchronous; it has to be executed in a thread (thread pool).
Celery is intended for background or deferred tasks, but Celery will never return an HTTP response, as it didn't receive the HTTP request to which it should respond. Celery is also not asyncio friendly.
You might have to think of changing your architecture / implementation.
Look at your overall problem and ask yourself whether you really need an asynchronous API with Django.
Is this API intended for browser applications or for machine to machine applications?
Could your clients use web sockets and wait for the answer?
Could you separate the blocking and non-blocking parts on your server side?
Use Django for everything non-blocking and everything periodic / deferred (Django + Celery), and implement the asynchronous part with web server plugins, Python ASGI code, or web sockets.
Some ideas
Use Django + nginx nchan (if your web server is nginx)
Link to nchan: https://nchan.io/
Your API call would create a task id, start a Celery task, and immediately return the task id or a polling URL.
The polling URL would be handled, for example, via an nchan long-polling channel.
Your client connects to the URL corresponding to an nchan long-polling channel, and Celery unblocks it whenever your task is finished (the 10 s are over).
Use Django + an ASGI server + one hand-coded view and use a strategy similar to nginx nchan
Same logic as above, but you don't use nginx nchan; you implement it yourself.
Use an ASGI server + a non-blocking framework (or just some hand-coded ASGI views) for all blocking URLs and Django for the rest.
They might exchange data via the database, local files, or local HTTP requests.
Just stay blocking and throw enough worker processes / threads at your server
This is probably the worst suggestion, but if it is just for personal use and you know how many requests you will have in parallel, then just make sure you have enough Django workers so that you can afford to be blocking.
In this case you would block an entire Django worker for each slow request.
Use websockets, e.g. with the channels module for Django.
Websockets can be implemented with earlier versions of Django (>= 2.2) with the django channels module (pip install channels) ( https://github.com/django/channels ).
You need an ASGI server to serve the asynchronous part. You could use, for example, Daphne or uvicorn (the channels doc explains this rather well).
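To make the first idea above (create a task id, return a polling URL) concrete, here is a rough sketch using the questioner's wait task; the view names and URL pattern are assumptions, and a plain Django polling view stands in for the nchan channel:

# views.py
from celery.result import AsyncResult
from django.http import JsonResponse

from .tasks import wait  # the shared_task from the question

def start(request):
    task = wait.delay()
    return JsonResponse({
        "task_id": task.id,
        "poll_url": "/poll/%s/" % task.id,
    })

def poll(request, task_id):
    work = AsyncResult(task_id)
    if work.ready():
        return JsonResponse({"reply": work.get(timeout=1)})
    return JsonResponse({"status": "pending"}, status=202)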
Addendum 2020-06-01: simple async example calling synchronous django code
The following code uses the starlette module, as it seems quite simple and small.
miniasyncio.py
import asyncio
import concurrent.futures
import os

import django
from starlette.applications import Starlette
from starlette.responses import Response
from starlette.routing import Route

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'pjt.settings')
django.setup()

from django_app.xxx import synchronous_func1
from django_app.xxx import synchronous_func2

executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)

async def simple_slow(request):
    """simple function that sleeps in an async manner"""
    await asyncio.sleep(5)
    return Response('hello world')

async def call_slow_dj_funcs(request):
    """slow django code is run in a thread pool so it doesn't block the event loop"""
    loop = asyncio.get_running_loop()
    func1_result = await loop.run_in_executor(executor, synchronous_func1)
    func2_result = await loop.run_in_executor(executor, synchronous_func2)
    response_txt = "OK"
    return Response(response_txt, media_type="text/plain")

routes = [
    Route("/simple", endpoint=simple_slow),
    Route("/slow_dj_funcs", endpoint=call_slow_dj_funcs),
]

app = Starlette(debug=True, routes=routes)
You could, for example, run this code with:
pip install uvicorn
uvicorn --port 8002 miniasyncio:app
Then, on your web server, route these specific URLs to uvicorn and not to your Django application server.
The best option is to use the future async API, which will arrive in the Django 3.1 release (already available in alpha):
https://docs.djangoproject.com/en/dev/releases/3.1/#asynchronous-views-and-middleware-support
(However, you will need to use an ASGI web worker to make it work properly.)
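A minimal sketch of such an async view (per the release notes linked above; the view name is mine, and it must be served by an ASGI server such as Daphne or uvicorn to stay non-blocking):

# views.py
import asyncio

from django.http import JsonResponse

async def api(request):
    # the 10-second pause from the question, without blocking a worker thread
    await asyncio.sleep(10)
    return JsonResponse({"reply": "ok"})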
Check out Django 3's ASGI (Asynchronous Server Gateway Interface) support:
https://docs.djangoproject.com/en/3.0/howto/deployment/asgi/
I tried to build a Tornado application that provides RESTful APIs to crawl web pages, and I found that CurlAsyncHTTPClient cannot fetch a fully loaded page or a JS-generated page.
Are there any solutions to this problem? Is there a library that can fetch fully loaded or JS-generated pages and work with Tornado's non-blocking mechanism?
I would appreciate any suggestions or solutions. :)
The HTTP client (whether it is from Tornado, asyncio, curl, simple, requests, ...) makes a request to the given resource and fetches the response - nothing more: no parsing of the response, no executing of scripts.
Instead, you want to fetch the response, parse it, download all of its dependencies (JS), and execute/render everything in the right order. Basically you will need to write a browser, or use one.
There are many implementations of headless browsers (mostly based on WebKit). Worth noting (in Python) are Ghost.py and Selenium (with PhantomJS) - Is there a way to use PhantomJS in Python?.
The main drawback, in the context of your question, is that they do not support Tornado in any way (its HTTP client as a mechanism for making requests, and so on). So I would, personally, drop Tornado for this task.
With PySide (Qt) - dirty example without Tornado:
import sys

from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import QWebPage

class DownloadPage(QWebPage):

    html = ''

    def __init__(self, url, app, parent=None):
        QWebPage.__init__(self, parent)
        DownloadPage.html = ''
        self.loadFinished.connect(app.quit)
        self.mainFrame().load(QUrl(url))

    def save(self):
        DownloadPage.html = self.mainFrame().toHtml()

def get_page(url):
    app = QApplication.instance()
    if not app:  # create a QApplication if it doesn't exist
        app = QApplication(sys.argv)
    dp = DownloadPage(url, app)
    app.aboutToQuit.connect(dp.save)
    app.exec_()
    return DownloadPage.html
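Example usage (this blocks until the page has been rendered, so if you kept Tornado you would have to run it in a thread or a separate process):

if __name__ == '__main__':
    html = get_page('http://example.com')
    print(html)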
I am currently using Tornado as a wrapper for several WSGI apps (mostly Flask apps). Recently I have been noticing a lot of hacking attempts, and I have been wondering whether it's possible to automatically check a list of IPs defined in some file and redirect all of those IPs to a page saying something like "Someone using this IP tried hacking our site, prove you're not a bot and we'll re-allow your IP".
The Tornado code that runs the server is here:
from tornado.wsgi import WSGIContainer
from tornado.httpserver import HTTPServer
from tornado.ioloop import IOLoop
from Wrapper.app import application
http_server = HTTPServer(WSGIContainer(application))
http_server.listen(80)
IOLoop.instance().start()
And Wrapper.app is below:
from werkzeug.wsgi import DispatcherMiddleware
from Splash import splash_app
from SentimentDemo import sentiment_app
from FERDemo import FER_app
application = DispatcherMiddleware(splash_app, {
    '/api/sentiment': sentiment_app,
    '/api/fer': FER_app
})
I haven't been able to find any documentation on this sort of thing, so I'm sorry in advance if this question seems uninformed, but even just a place to start looking would be spectacular.
You want to subclass WSGIContainer and override its __call__ method. Something like
class MyWSGIContainer(WSGIContainer):

    def __call__(self, request):
        if request.remote_ip in blacklist:
            self.write_redirect()
        else:
            super(MyWSGIContainer, self).__call__(request)
For some tips on writing self.write_redirect(), look at the code for the WSGIContainer here; you can see how it formats the HTTP headers. You should use an HTTP 302 (temporary) redirect.
Then pass your MyWSGIContainer instance into HTTPServer, instead of the default WSGIContainer.
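Adapting the server snippet from the question, the wiring might look like this (loading the blacklist from your file is left out):

http_server = HTTPServer(MyWSGIContainer(application))
http_server.listen(80)
IOLoop.instance().start()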
I have an application written in Twisted and I want to add a web interface to control and monitor it. I'll need plenty of dynamic pages that show the current status and configuration, so I hoped for a framework that offers at least a templating language with inheritance and some basic routing.
Since I am using Twisted anyway, I wanted to use twisted.web - but its templating language is too basic, and it seems that the only framework, Nevow, is quite dead (it's on Launchpad, but the homepage and wiki are down and I can't find any documentation).
So what are my options?
Is there any other twisted.web based framework?
Are there other frameworks that work with twisted's reactor?
Should I just get a web framework (I'm thinking web.py or flask) and run it in a thread?
Thanks for your answers.
Since Nevow is still down and I didn't want to write routing and support for a templating lib myself, I ended up using Flask. It turned out to be quite easy:
# make a Flask app
from flask import Flask, render_template, g

app = Flask(__name__)

@app.route("/")
def index():
    return render_template("index.html")

# run it under twisted through wsgi
from twisted.internet import reactor
from twisted.web.wsgi import WSGIResource
from twisted.web.server import Site

resource = WSGIResource(reactor, reactor.getThreadPool(), app)
site = Site(resource)
# bind it etc
# ...
It works flawlessly so far.
You can bind it directly into the reactor like the example below:
reactor.listenTCP(5050, site)
reactor.run()
If you need to add children to a WSGI root visit this link for more details.
Here is an example showing how to combine WSGI Resource with a static child.
from twisted.internet import reactor
from twisted.web import static as Static, server, twcgi, script, vhost
from twisted.web.resource import Resource
from twisted.web.wsgi import WSGIResource
from flask import Flask, g, request

app = Flask(__name__)  # the Flask app being wrapped, as in the answer above

class Root(Resource):
    """Root resource that combines the two sites/entry points"""

    WSGI = WSGIResource(reactor, reactor.getThreadPool(), app)

    def getChild(self, child, request):
        # request.isLeaf = True
        request.prepath.pop()
        request.postpath.insert(0, child)
        return self.WSGI

    def render(self, request):
        """Delegate to the WSGI resource"""
        return self.WSGI.render(request)

def main():
    static = Static.File("/path/folder")
    static.processors = {'.py': script.PythonScript,
                         '.rpy': script.ResourceScript}
    static.indexNames = ['index.rpy', 'index.html', 'index.htm']

    root = Root()
    root.putChild('static', static)

    reactor.listenTCP(5050, server.Site(root))
    reactor.run()

if __name__ == '__main__':
    main()
Nevow is the obvious choice. Unfortunately, the Divmod web server hardware and the backup server hardware failed at the same time. They are attempting to recover the data and publish it on Launchpad, but it may take a while.
You could also use basically any existing template module with twisted.web; Jinja2 comes to mind.
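For instance, a minimal sketch of rendering a Jinja2 template from a twisted.web Resource (the template directory, template name, and resource class are assumptions for illustration):

from jinja2 import Environment, FileSystemLoader
from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site

env = Environment(loader=FileSystemLoader('templates'))

class StatusPage(Resource):
    isLeaf = True

    def render_GET(self, request):
        # render templates/status.html with the current state of the app
        template = env.get_template('status.html')
        request.setHeader(b'content-type', b'text/html; charset=utf-8')
        return template.render(status='running').encode('utf-8')

reactor.listenTCP(8080, Site(StatusPage()))
reactor.run()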