I'm new to Python and just wrote some spiders using Scrapy. Now I want to trigger my spiders with an HTTP request like this:
http://xxxxx.com/myspidername/args
I used nginx + uwsgi + django to call my scrapy spider.
Steps:
create and configure a Django project
create a Scrapy project in the Django project root and write my spider
start uWSGI: uwsgi -x django_socket.xml
call my spider in the Django app's views.py:
from django.http import HttpResponse
from scrapy import cmdline

def index(request, mid):
    cmd = "scrapy crawl myitem -a mid=" + mid
    cmdline.execute(cmd.split())
    return HttpResponse("Hello, it works")
When I visit http://myhost/myapp/index, which is routed to the index view, nginx returns an error page and its error log shows "upstream prematurely closed connection while reading response header from upstream". I can see that the uWSGI process has disappeared, but in uWSGI's log I can see that my spider ran correctly.
How can I fix this error?
Is this approach right? Is there another way to do what I want?
I don't think it's a good idea to launch a spider from inside a Django view. A Django web app is meant to provide a quick request/response cycle to end users so that they can retrieve information quickly. Even though I'm not entirely sure what caused the error, I would imagine that your view function is stuck there for as long as the spider hasn't finished.
There are two options you could try to improve the user experience and minimize the errors that could happen:
crontab
It runs your script regularly. It's reliable and easier for you to log and debug. But it's not flexible for scheduling and gives you little control.
celery
This is a fairly Python/Django-specific tool that can schedule your tasks dynamically. You can define either crontab-like tasks that run regularly, or apply a task at runtime. It won't block your view function, because everything is executed in a separate worker process, so it's most likely what you want. It needs some setup, so it might not be straightforward at first, but many people have used it and it works great once everything is in place.
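For illustration only, here is a minimal sketch of what that could look like, assuming Celery is already configured for the project (the task name, spider name, and file layout are assumptions, not code from the question):

# tasks.py -- sketch: run the crawl inside a Celery worker, not inside the view
from celery import shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

@shared_task
def crawl_item(mid):
    # Scrapy's Twisted reactor cannot be restarted in a long-lived process, so
    # running the worker with --max-tasks-per-child=1 (or launching the crawl
    # in a subprocess) is a common workaround.
    process = CrawlerProcess(get_project_settings())
    process.crawl("myitem", mid=mid)
    process.start()

# views.py -- the view only enqueues the task and returns immediately
from django.http import HttpResponse

def index(request, mid):
    crawl_item.delay(mid)
    return HttpResponse("Crawl scheduled")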
Nginx does asynchronous, non-blocking IO.
The call to scrapy.cmdline is synchronous. Most likely this causes problems in the context of nginx.
Try opening a new process upon receiving the request.
There are many (well maybe not THAT many) ways to do this.
Try this question and its answers first:
Starting a separate process
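As a rough illustration of that idea, the view could hand the crawl off to a separate process and return right away (a sketch only; it assumes the scrapy command is on the PATH and is run from the project directory):

import subprocess

from django.http import HttpResponse

def index(request, mid):
    # Launch the crawl in a detached child process so the request/response
    # cycle is not tied to the spider's lifetime.
    subprocess.Popen(["scrapy", "crawl", "myitem", "-a", "mid=" + mid])
    return HttpResponse("Spider started")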
We can run any Python script by doing:
python main.py
Is it possible to do the same if the script is a FastAPI application?
Something like:
python main.py GET /login.html
To call a GET method that returns a login.html page.
If not, how could I start a FastAPI application without using Uvicorn or another web server?
I would like to run the script only when necessary.
Thanks
FastAPI is designed to let you BUILD APIs that are queried by an HTTP client, not to query those APIs directly yourself; however, technically I believe you could.
When you start the script, you could start the FastAPI app in another process running in the background, then send a request to it.
import subprocess
import threading
import time

import requests

url = "http://localhost:8000/some_path"

# launch uvicorn in a background thread; check_output captures its console output
thread = threading.Thread(target=lambda: subprocess.check_output(["uvicorn", "main:app"]))
thread.start()

time.sleep(2)  # crude wait for the server to come up before querying it

response = requests.get(url)
# do something with the response...

thread.join()
Obviously this snippet has MUCH room for improvement; for example, the thread will never actually end unless something goes wrong. This is just a minimal example.
This method has the clear drawback of starting the API each time you want to run the command. A better approach is to emulate applications such as Docker: start a local server daemon, then ping it from the command line app.
This would mean the API runs in the background for much longer, but these APIs are typically fairly light and you shouldn't notice any hit to your computer's performance. This also provides the benefit of multiple users being able to run the command at the same time.
With the first method you may run into situations where user A sends a GET request, starting up the server and taking hold of the configured host/port combination. When user B tries to run the same command just after, they will find themselves unable to start the server and perform the request.
This will also allow you to eventually move the API to an external server with minimal effort down the line. All you would need to do is change the base url of the requests.
TL;DR: Run the FastAPI application as a daemon, and query the local server from the command line program instead.
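A rough sketch of what the command line side could look like once such a daemon is running (the base URL is an assumption):

import sys

import requests

BASE_URL = "http://localhost:8000"  # assumed address of the long-running daemon

def main():
    # e.g. `python client.py /login.html` performs GET /login.html against the daemon
    path = sys.argv[1] if len(sys.argv) > 1 else "/"
    response = requests.get(BASE_URL + path)
    print(response.status_code)
    print(response.text)

if __name__ == "__main__":
    main()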
In one of the views in my Django application, I need to perform a relatively lengthy network IO operation. The problem is that other requests must wait for this request to be completed, even though they have nothing to do with it.
I did some research and stumbled upon Celery, but as I understand it, it is used to perform background tasks independently of the request (so I cannot use the result of the task in the response to the request).
Is there a way to process views asynchronously in Django, so that while the network request is pending other requests can be processed?
Edit: What I forgot to mention is that my application is a web service using Django REST framework, so the result of a view is a JSON response, not a page that I can later modify using AJAX.
The usual solution here is to offload the task to celery, and return a "please wait" response in your view. If you want, you can then use an Ajax call to periodically hit a view that will report whether the response is ready, and redirect when it is.
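For illustration, a minimal sketch of that enqueue-and-poll pattern (the task name and URL wiring are assumptions, not code from the question):

# views.py -- sketch: enqueue the slow work, then let the client poll for the result
from celery.result import AsyncResult
from django.http import JsonResponse

from .tasks import fetch_remote_data  # hypothetical Celery task doing the slow network IO

def start(request):
    result = fetch_remote_data.delay(request.GET.get("resource"))
    # Return immediately with a task id the client can poll.
    return JsonResponse({"task_id": result.id, "status": "pending"})

def status(request, task_id):
    result = AsyncResult(task_id)
    if result.ready():
        return JsonResponse({"status": "done", "data": result.get()})
    return JsonResponse({"status": "pending"})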
You want to maintain that HTTP connection for an extended period of time but still allow other requests to be managed, right? There's no simple solution to this problem. Also, any solution will be a level away from Django as it depends on how you process requests.
I don't know what you're currently using, so I can only tell you how I handled this in the past... I was using uWSGI to provide the WSGI interface between my Python application and nginx. In uWSGI I used the asynchronous functions to suspend my long-running connection while it was waiting on IO. These methods let you suspend a request until there is something to read or write, and allow other connections to be serviced in the meantime.
The above-mentioned async calls use "green threads". They are much lighter weight than regular threads, and you have control over when you switch from thread to thread.
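Purely as a hedged illustration, uWSGI's async mode with green threads is switched on when uWSGI starts, roughly like this (socket path and module name are placeholders, and the application code still has to use uWSGI's suspend/resume API):

uwsgi --socket /tmp/app.sock --module mysite.wsgi --async 50 --ugreen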
I am not saying that it is a good solution for your scenario[1], but the simple answer is to use the following pattern:
# assumes some_task is a Celery task defined elsewhere
async_result = some_task.delay(arg1)
result = async_result.get()  # blocks until the task finishes and returns its result
Check the documentation for the get method. Instead of the delay method you can use anything that returns an AsyncResult (like the apply_async method).
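For example, a hedged sketch of the same pattern with apply_async and a timeout (the timeout value is an assumption):

# same pattern via apply_async, with a timeout so the view doesn't wait forever
async_result = some_task.apply_async(args=[arg1])
result = async_result.get(timeout=30)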
[1] Why might it be a bad idea? Keeping a connection open and waiting for a long time is bad for Django (it is not built for long-lived connections), may conflict with the proxy configuration (if there is a reverse proxy somewhere), and may be treated as a timeout by the browser. So... it seems a Bad Idea[TM] to use this pattern for a Django REST Framework view.
I've been trying to set up my first Django app with nginx + uWSGI + Django. I tested a simple Hello World view, but when I check the URL I get a different result on each refresh:
AttributeError at /control/
Hello world
The standard Django message:
It worked!
Congratulations on your first Django-powered page.
In the urls.py I have this simple pattern:
urlpatterns = patterns('',
    url(r'^control/?$', hello),
)
On views.py I have:
from django.http import HttpResponse

def hello(request):
    return HttpResponse("Hello world")
Why do the results vary between refreshes?
How you start and stop your uWSGI server might be the issue here.
By default, uWSGI does not reload itself when code changes, so unless it is stopped or reloaded, it will keep a cached version of the old Django code and operate with it. But caching won't occur immediately, only on the first request.
But... uWSGI can spawn multiple workers, so it can process multiple requests at once (each worker can handle one request at a time), and each worker will have its own cached version of the code.
So one scenario could be: you started your uWSGI server, then made some request (before writing any views, so the standard Django "It worked!" page showed up), and at that moment one of the workers cached the code responsible for that response. You then changed something, but introduced an error, and on the next request the code producing that error was cached in another worker. Then you fixed the error, and the next worker cached the fixed code, giving you your "Hello world" response.
And now you're in a situation where each worker has some cached version of your code, and depending on which one processes your request, you get different results.
The solution is to restart your uWSGI instance, and maybe add py-autoreload to your uWSGI config so that uWSGI automatically reloads when code changes (use that option only in development, never in a production environment).
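As a rough illustration, assuming an ini-style uWSGI config (module, socket, and worker values are placeholders):

; uwsgi.ini -- development-only sketch
[uwsgi]
module = mysite.wsgi
socket = /tmp/mysite.sock
workers = 4
; rescan Python files every 2 seconds and reload on changes -- never use in production
py-autoreload = 2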
Another scenario: you don't have multiple workers, but every time you change something in the code you start a new uWSGI instance without stopping the old one. That causes multiple instances to run at the same time, and when you are using a unix socket they will co-exist, taking turns processing requests. If that's the case, stop all uWSGI instances and, next time, stop the old instance before starting a new one. Or simply reload the old one.
I'm writing a Django web app that makes use of Scrapy, and locally everything works great, but I wonder how to set up a production environment where my spiders are launched periodically and automatically (I mean that once a spider completes its job, it gets relaunched after a certain time... for example after 24h).
Currently I launch my spiders using a custom Django command, which has the main goal of allowing the use of Django's ORM to store scraped items, so I run:
python manage.py scrapy crawl myspider
and results are stored in my Postgres database.
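The asker's actual command isn't shown; purely as a hedged illustration, such a management command might look roughly like this (the command name, spider name, and use of CrawlerProcess are assumptions):

# myapp/management/commands/crawl.py -- sketch of a command wrapping Scrapy
from django.core.management.base import BaseCommand
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class Command(BaseCommand):
    help = "Run the myspider spider with access to Django's ORM"

    def handle(self, *args, **options):
        process = CrawlerProcess(get_project_settings())
        process.crawl("myspider")
        process.start()  # blocks until the crawl finishes; items are stored via pipelines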
I installed scrapyd, since it seems to be the preferred way to run Scrapy in production,
but unfortunately I can't use it without writing a monkey patch (which I would like to avoid), since it uses JSON for its web-service API and I get a "modelX is not json serializable" exception.
I looked at django-dynamic-scraper, but it doesn't seem to be designed to be as flexible and customizable as Scrapy is, and in fact the docs say:
Since it simplifies things DDS is not usable for all kinds of
scrapers, but it is well suited for the relatively common case of
regularly scraping a website with a list of updated items
I also thought of using crontab to schedule my spiders... but at what interval should I run them? And if my EC2 instance (I'm going to use Amazon Web Services to host my code) needs a reboot, I have to re-run all my spiders manually... mmmh... things get complicated...
So... what could be an effective setup for a production environment? How do you handle it? What's your advice?
I had the same question which led me to yours here. Here is what I think and what I did with my project.
Currently I launch my spiders using a custom Django command, which has
the main goal of allowing the use of Django's ORM to store scraped
items
This sounds very interesting. I also wanted to use Django's ORM inside Scrapy spiders, so I imported django and set it up before any scraping took place. I guess that is unnecessary if you call Scrapy from an already-instantiated Django context.
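Purely for reference, here is a hedged sketch of that kind of setup in a recent Django version, done before any model imports inside the Scrapy process (the settings module and model names are placeholders, not code from either project):

import os

import django

# Point Django at the project settings and initialise the app registry
# before any models are imported inside the Scrapy process.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
django.setup()

from myapp.models import ScrapedItem  # noqa: E402  -- safe to import only after setup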
I installed scrapyd, since it seems that is the preferred way to run
scrapy in production but unfortunately I can't use it without writing
a monkey patch (which I would like to avoid)
I had the idea of using subprocess.Popen with stdout and stderr redirected to PIPE, then taking both stdout and stderr results and processing them. I didn't need to gather items from the output, since the spiders already write results to the database via pipelines. It gets recursive if you call the Scrapy process from Django this way, since the Scrapy process sets up the Django context so it can use the ORM.
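A minimal sketch of that Popen approach, assuming the spiders store their items through pipelines and the output is only needed for diagnostics (paths and names are placeholders):

import subprocess

# Launch the crawl and capture its console output.
proc = subprocess.Popen(
    ["python", "manage.py", "scrapy", "crawl", "myspider"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
stdout, stderr = proc.communicate()

if proc.returncode != 0:
    print("Crawl failed:", stderr.decode())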
Then I tried scrapyd, and yes, you have to fire HTTP requests at scrapyd to enqueue a job, but it doesn't signal you when the job is finished or whether it is still pending. You have to check that part yourself, and I guess that is the place for a monkey patch.
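For illustration, scheduling a job and then checking on it through scrapyd's HTTP API (schedule.json and listjobs.json) might look roughly like this; the host, port, and project name are assumptions:

import requests

SCRAPYD = "http://localhost:6800"

# Enqueue a crawl; scrapyd responds with a job id.
resp = requests.post(SCRAPYD + "/schedule.json",
                     data={"project": "myproject", "spider": "myspider"})
job_id = resp.json()["jobid"]

# Poll listjobs.json to see whether the job is pending, running, or finished.
jobs = requests.get(SCRAPYD + "/listjobs.json",
                    params={"project": "myproject"}).json()
finished_ids = [job["id"] for job in jobs.get("finished", [])]
print("finished" if job_id in finished_ids else "still pending or running")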
I also thought to use crontab to schedule my spiders... but at what
interval should I run my spiders? and if my EC2 instance (I'm gonna
use amazon webservices to host my code) needs a reboot I have to
re-run all my spiders manually... mmmh... things get complicated...
So... what could be an effective setup for a production environment?
How do you handle it? What's your advice?
I'm currently using cron for scheduling scraping. It's not something that users can change, even if they want to, but that has its pros too: that way I'm sure users won't shrink the interval and make multiple scrapers run at the same time.
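For example, a crontab entry along these lines relaunches the crawl every 24 hours (all paths are placeholders):

# run the spider daily at 03:00 using the project's virtualenv
0 3 * * * cd /srv/myproject && /srv/myproject/venv/bin/python manage.py scrapy crawl myspider >> /var/log/myspider.log 2>&1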
I have concerns about introducing unnecessary links in the chain. Scrapyd would be the middle link, and it seems like it's doing its job for now, but it can also become a weak link if it can't hold the production load.
Bearing in mind that you posted this a while ago, I'd be grateful to hear what your solution was regarding the whole Django-Scrapy-scrapyd integration.
Cheers
I have two sites running essentially the same codebase, with only slight differences in settings. Each site is built in Django, with a WordPress blog integrated.
Each site needs to import blog posts from WordPress and store them in the Django database. When a user publishes a post, WordPress hits a webhook URL on the Django side, which kicks off a Celery task that grabs the JSON version of the post and imports it.
My initial thought was that each site could run its own instance of manage.py celeryd, each in its own virtualenv, and the two sites would stay out of each other's way. Each is daemonized with a separate upstart script.
But it looks like they're colliding somehow. I can run one at a time successfully, but if both are running, one instance won't receive tasks, or tasks will run with the wrong settings (in this case, each has a WORDPRESS_BLOG_URL setting).
I'm using a Redis queue, if that makes a difference. What am I doing wrong here?
Have you specified the name of the default queue that celery should use? If you haven't set CELERY_DEFAULT_QUEUE, both sites will be using the same queue and getting each other's messages. You need to set this to a different value for each site to keep the messages separate.
Edit
You're right, CELERY_DEFAULT_QUEUE is only for backends like RabbitMQ. I think you need to set a different database number for each site, using a different number at the end of your broker url.
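For example, the broker URLs might be separated like this in each site's settings (the database numbers and host are placeholders, and BROKER_URL is the django-celery-era setting name):

# settings.py for site 1
BROKER_URL = "redis://localhost:6379/0"

# settings.py for site 2
BROKER_URL = "redis://localhost:6379/1"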
If you are using django-celery, make sure you don't have an instance of celery running outside of your virtualenvs. Then start the celery instance within each virtualenv using manage.py celeryd, as you have done. I recommend setting up supervisord to keep track of your instances.
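A rough supervisord sketch for one of the sites (paths and the program name are placeholders); the second site would get a parallel [program:...] block pointing at its own virtualenv:

; /etc/supervisor/conf.d/site1_celery.conf -- sketch only
[program:site1_celery]
command=/srv/site1/venv/bin/python /srv/site1/manage.py celeryd --loglevel=INFO
directory=/srv/site1
autostart=true
autorestart=true
stdout_logfile=/var/log/site1_celery.log
redirect_stderr=true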