I have two Scrapy projects with the following configurations.
Project1's scrapy.cfg:
[settings]
default = Project1.settings
[deploy]
url = http://localhost:6800/
project = Project1
[scrapyd]
eggs_dir = eggs
logs_dir = logs
logs_to_keep = 500
dbs_dir = dbs
max_proc = 5
max_proc_per_cpu = 10
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
and Project2's scrapy.cfg:
[settings]
default = Project2.settings
[deploy]
url = http://localhost:6800/
project = Project2
[scrapyd]
eggs_dir = eggs
logs_dir = logs
logs_to_keep = 500
dbs_dir = dbs
max_proc = 5
max_proc_per_cpu = 10
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
But when I look at http://localhost:6800/jobs I always see just 8 jobs running, which means the max_proc_per_cpu setting is not being applied. I deleted the projects with the following commands
curl http://localhost:6800/delproject.json -d project=Project1
curl http://localhost:6800/delproject.json -d project=Project2
and deployed them again to make sure the new settings were picked up, but the number of running spiders is still 8.
My VPS has two CPU cores, which I confirmed with
python -c 'import multiprocessing; print(multiprocessing.cpu_count())'
How can I see which configuration the deployed Scrapyd is actually using?
How can I set the max process per CPU?
According to the documentation, on Unix-like systems the configuration file is first looked for at /etc/scrapyd/scrapyd.conf. I placed the configuration file there, but it did not work. It finally worked when I kept the configuration as a hidden .scrapyd.conf file in the directory from which the Scrapyd server was started, which in my case was the home directory.
You can read about the details here: https://scrapyd.readthedocs.io/en/stable/config.html
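Note that values in a project's scrapy.cfg are not transmitted to the daemon when you deploy an egg; Scrapyd reads its own configuration file at startup. The 8 running jobs in the question also match Scrapyd's defaults (max_proc_per_cpu defaults to 4, times 2 cores), which suggests the [scrapyd] section in the projects' scrapy.cfg was never being read. A minimal sketch of what the daemon-side file could contain, reusing the values from the question (adjust to taste):

# /etc/scrapyd/scrapyd.conf or ~/.scrapyd.conf
[scrapyd]
# 0 means no global cap; the per-CPU limit below then applies
max_proc = 0
# with 2 cores this allows up to 20 concurrent spider processes
max_proc_per_cpu = 10
http_port = 6800

After restarting Scrapyd, curl http://localhost:6800/daemonstatus.json (available in recent Scrapyd versions) reports the running/pending/finished counts, which is a quick way to check whether the new limits are in effect.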
Related
python==2.7.5, django==1.11.10, gunicorn==19.7.1, RHEL 7.4
I have a Django project at my job that was not written by me.
It lived in the eventcat user's home directory, and over time we ran out of disk space, so I had to move the project to /data/.
After I moved the project directory and set up a new environment, static files stopped loading and started returning a 403 Forbidden error.
I know that gunicorn is not supposed to serve static files in production, but this is an internal project with low load, and I have to deal with it as is.
The server is started with a self-written script (I changed the environment line to the new path):
#!/bin/sh
. ~/.bash_profile
. /data/eventcat/env/bin/activate
exec gunicorn -c gunicorn.conf.py eventcat.wsgi:application
The gunicorn.conf.py consists of:
bind = '127.0.0.1:8000'
backlog = 2048
workers = 1
worker_class = 'sync'
worker_connections = 1000
timeout = 120
keepalive = 2
spew = False
daemon = True
pidfile = 'eventcat.pid'
umask = 0
user = None
group = None
tmp_upload_dir = None
errorlog = 'er.log'
loglevel = 'debug'
accesslog = 'ac.log'
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"'
proc_name = None
def post_fork(server, worker):
    server.log.info("Worker spawned (pid: %s)", worker.pid)

def pre_fork(server, worker):
    pass

def pre_exec(server):
    server.log.info("Forked child, re-executing.")

def when_ready(server):
    server.log.info("Server is ready. Spawning workers")

def worker_int(worker):
    worker.log.info("worker received INT or QUIT signal")
    # dump a traceback of every thread for debugging
    import threading, sys, traceback
    id2name = dict([(th.ident, th.name) for th in threading.enumerate()])
    code = []
    for threadId, stack in sys._current_frames().items():
        code.append("\n# Thread: %s(%d)" % (id2name.get(threadId, ""), threadId))
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % (line.strip()))
    worker.log.debug("\n".join(code))

def worker_abort(worker):
    worker.log.info("worker received SIGABRT signal")
All the files in the static directory are owned by the eventcat user, just like the directory itself.
I couldn't find any useful information in er.log or ac.log.
The server is running over HTTPS and there is an ssl.conf in the project directory. It has aliases for static and media pointing to the previous project location, and I changed all of these entries to the new ones, though I couldn't find where this config file is actually used.
Please advise how I can find the cause of the issue. Which config files (or anything else) should I look into?
UPDATE:
Thanks to @ruddra: gunicorn wasn't serving static files at all; it was httpd that was. After updating the httpd config, everything is working.
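For anyone hitting the same thing, the fix boiled down to pointing the static and media aliases at the new location. A hedged sketch of what that part of an httpd ssl.conf might look like (the paths are illustrative, not taken from the actual config):

# ssl.conf excerpt (illustrative paths)
Alias /static/ /data/eventcat/static/
<Directory /data/eventcat/static>
    Require all granted
</Directory>

Alias /media/ /data/eventcat/media/
<Directory /data/eventcat/media>
    Require all granted
</Directory>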
As far as I know, gunicorn does not serve static content. To serve static content, it's best to use either WhiteNoise or a reverse proxy server such as NGINX or Apache. You can check Gunicorn's documentation on deploying behind NGINX.
If you want to use whitenoise, then please install it using:
pip install whitenoise
Then add WhiteNoise to MIDDLEWARE like this (inside settings.py):
MIDDLEWARE = [
# 'django.middleware.security.SecurityMiddleware',
'whitenoise.middleware.WhiteNoiseMiddleware',
# ...
]
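WhiteNoise serves whatever has been collected into STATIC_ROOT, so (assuming a standard Django static files setup, which this answer doesn't show) you would also need something like this in settings.py, plus a collectstatic run on deploy:

# settings.py (assumes BASE_DIR is defined as in the default Django settings template)
import os
STATIC_URL = '/static/'
STATIC_ROOT = os.path.join(BASE_DIR, 'staticfiles')

followed by python manage.py collectstatic --noinput.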
I wrote a very simple Flask web service that simply returns the text hey23 when hit, hosted on an AWS EC2 t2.micro machine (1 GB RAM, 1 CPU).
To run this Python application I used uWSGI as my app server, and I put the complete setup behind Nginx.
So my stack is Flask + uWSGI + Nginx.
Everything is working fine, but I have a complaint about the execution time. The average latency measured using wrk is ~370 ms, which is too much considering the amount of work this service is doing.
Running 30s test @ http://XX.XXX.XX.XX/printtest
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 370.98ms 178.90ms 1.96s 91.78%
Req/Sec 93.72 36.72 270.00 69.55%
33124 requests in 30.11s, 5.41MB read
Socket errors: connect 0, read 0, write 0, timeout 15
Non-2xx or 3xx responses: 1173
Requests/sec: 1100.26
Transfer/sec: 184.14KB
hello-test.py
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/print23")
def helloprint():
    return "hey23"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080, threaded=True)
uwsgi.ini
[uwsgi]
#application's base folder
base = /var/www/demoapp
master = true
#python module to import
app = hello-test
module = %(app)
home = %(base)/venv
pythonpath = %(base)
#socket file's location
socket = /var/www/demoapp/%n.sock
#permissions for the socket file
chmod-socket = 644
#the variable that holds a flask application inside the module imported at line #6
callable = app
disable-logging = True
#location of log files
logto = /var/log/uwsgi/%n.log
max-worker-lifetime = 30
processes = 10
threads = 2
enable-threads = True
cheaper = 2
cheaper-initial = 5
cheaper-step = 1
cheaper-algo = spare
cheaper-overload = 5
Even setting the wrk benchmark aside, I get similar latency when sending GET requests from my Postman client.
What is wrong here? In any case, some takeaways:
The code cannot be optimised; it just has to return the hey23 string.
Nothing can be wrong with Nginx.
I certainly assume ~370 ms is not a good response time for an API doing such a simple task.
Changing the region in which my EC2 machine is hosted may bring some change, but come on, that should not be the sole reason.
So what am I missing?
I am having some weird issues when I run my application on a dev server using uWSGI + nginx. It works fine when my request completes within 5-6 minutes. For long deployments and requests taking longer than that, the uWSGI log repeats itself after around 5 minutes, as if another process were spawned, and I get two sets of logs (one for the current process and one for the repeated process). I am not sure why this is happening and did not find anything online. I am sure this is not related to my code, because the same thing works perfectly fine in the lab environment, where I use the Django runserver. Any insight would be appreciated.
The uwsgi.ini:
# netadc_uwsgi.ini file
#uid = nginx
#gid = nginx
# Django-related settings
env = HTTPS=on
# the base directory (full path)
chdir = /home/netadc/apps/netadc
# Django's wsgi file
module = netadc.wsgi
# the virtualenv (full path)
home = /home/netadc/.venvs/netadc
# process-related settings
# master
master = true
# maximum number of worker processes
processes = 10
buffer-size = 65536
# the socket (use the full path to be safe)
socket = /home/netadc/apps/netadc/netadc/uwsgi/tmp/netadc.sock
# ... with appropriate permissions - may be needed
#chmod-socket = 666
# daemonize
daemonize = true
# logging
logger = file:/home/netadc/apps/netadc/netadc/uwsgi/tmp/netadc_uwsgi.log
# clear environment on exit
vacuum = true
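Nothing in this file sets a request timeout, so one thing worth checking is how long a single request is allowed to run on each side of the stack: uWSGI's harakiri option and nginx's uwsgi_read_timeout directive (for a uwsgi_pass location) both apply to long-running requests. A hedged sketch of the uWSGI side, with an illustrative value that is not from this setup:

# possible addition to netadc_uwsgi.ini (illustrative value)
# abort and respawn any worker whose single request runs longer than this many seconds
harakiri = 600

Comparing these timeouts against the ~5 minute mark where the duplicate logs appear would at least confirm or rule out a timeout-driven respawn.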
I have a Flask application that is running using gunicorn and nginx. But if I change a value in the db, the application fails to show the update in the browser under some conditions.
I have a Flask script with the following commands:
import os

from flask_script import Manager  # Manager/@manager.command come from Flask-Script

from msldata import app, db, models

path = os.path.dirname(os.path.abspath(__file__))

manager = Manager(app)

@manager.command
def run_dev():
    app.debug = True
    if os.environ.get('PROFILE'):
        from werkzeug.contrib.profiler import ProfilerMiddleware
        app.config['PROFILE'] = True
        app.wsgi_app = ProfilerMiddleware(app.wsgi_app, restrictions=[30])
    if 'LISTEN_PORT' in app.config:
        port = app.config['LISTEN_PORT']
    else:
        port = 5000
    print app.config
    app.run('0.0.0.0', port=port)
    print app.config

@manager.command
def run_server():
    from gunicorn.app.base import Application
    from gunicorn.six import iteritems

    # workers = multiprocessing.cpu_count() * 2 + 1
    workers = 1
    options = {
        'bind': '0.0.0.0:5000',
    }

    class GunicornRunner(Application):
        def __init__(self, app, options=None):
            self.options = options or {}
            self.application = app
            super(GunicornRunner, self).__init__()

        def load_config(self):
            config = dict([(key, value) for key, value in iteritems(self.options)
                           if key in self.cfg.settings and value is not None])
            for key, value in iteritems(config):
                self.cfg.set(key.lower(), value)

        def load(self):
            return self.application

    GunicornRunner(app, options).run()
Now, if I run the server with run_dev in debug mode, db modifications show up.
If run_server is used, the modifications are not seen unless the app is restarted.
However, if I run it like gunicorn -c a.py app:app, the db updates are visible.
a.py contents
import multiprocessing
bind = "0.0.0.0:5000"
workers = multiprocessing.cpu_count() * 2 + 1
Any suggestions on what I am missing?
I also ran into this situation: running Flask in Gunicorn with several workers, and Flask-Cache won't work anymore.
Since you are already using
app.config.from_object('default_config') (or similar filename)
just add this to your config:
CACHE_TYPE = "filesystem"
CACHE_THRESHOLD = 1000000  # some number your hard drive can manage
CACHE_DIR = "/full/path/to/dedicated/cache/directory/"
I bet you used "simplecache" before...
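A minimal sketch of how this wires up, assuming the Flask-Caching extension (flask_cache in older setups) and a config module named default_config as mentioned above:

from flask import Flask
from flask_caching import Cache  # older projects: from flask_cache import Cache

app = Flask(__name__)
app.config.from_object('default_config')  # module defining the CACHE_* values above
cache = Cache(app)  # every gunicorn worker now shares the same on-disk cache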
I was/am seeing the same thing, only when running gunicorn with Flask. One workaround is to set Gunicorn's max-requests to 1. However, that's not a real solution if you have any kind of load, due to the resource overhead of restarting the workers after each request. I got around this by having nginx serve the static content, changing my Flask app to render the template and write it to static, and then returning a redirect to the static file.
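A hedged sketch of that workaround; the route, template and helper names are placeholders, not from the original app:

import os
from flask import Flask, render_template, redirect, url_for

app = Flask(__name__)

def load_rows_from_db():
    # placeholder for whatever query feeds the template
    return []

@app.route('/report')
def report():
    # render with fresh db values, write the result under static/,
    # and let nginx serve the generated file from there
    html = render_template('report.html', rows=load_rows_from_db())
    with open(os.path.join(app.static_folder, 'report.html'), 'w') as f:
        f.write(html)
    return redirect(url_for('static', filename='report.html'))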
Flask-Caching SimpleCache doesn't work with workers > 1 in Gunicorn
I had a similar issue using Flask 2.0.2 and Flask-Caching 1.10.1.
Everything works fine in development mode until you put it on gunicorn with more than 1 worker. One probable reason is that in development there is only one process/worker, so under those restricted circumstances SimpleCache happens to work.
My code was:
app.config['CACHE_TYPE'] = 'SimpleCache' # a simple Python dictionary
cache = Cache(app)
The solution that works with Flask-Caching is to use FileSystemCache; my code is now:
app.config['CACHE_TYPE'] = 'FileSystemCache'
app.config['CACHE_DIR'] = 'cache' # path to your server cache folder
app.config['CACHE_THRESHOLD'] = 100000 # number of 'files' before start auto-delete
cache = Cache(app)
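For completeness, a small usage sketch on top of that configuration (the route and timeout are illustrative):

@app.route('/items')
@cache.cached(timeout=60)  # the cached entry lives on disk in CACHE_DIR, so every gunicorn worker sees it
def items():
    # whatever expensive db work the view normally does goes here
    return "fresh value recomputed at most once per 60 seconds"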
I've been struggling to get the demo application in django-celery working on dotcloud. I have looked at the tutorial at http://docs.dotcloud.com/0.9/tutorials/python/django-celery/ but it isn't a great deal of help.
The example application is a Django 1.4 app. I'm not sure why, but when I navigate to the deployed application, instead of the index page it presents me with a username/password popup. The message in the popup is:
The server at TheDomain requires a username and password. The server says: RabbitMQ Management.
Does anyone know why this behaviour has been added?
The differences from the django-celery example app are:
# Django settings for project in settings.py
import os
import json
import djcelery
# Load the dotCloud environment
with open('/home/dotcloud/environment.json') as f:
    dotcloud_env = json.load(f)
# Configure Celery using the RabbitMQ credentials found in the dotCloud
# environment.
djcelery.setup_loader()
BROKER_HOST = dotcloud_env['DOTCLOUD_BROKER_AMQP_HOST']
BROKER_PORT = int(dotcloud_env['DOTCLOUD_BROKER_AMQP_PORT'])
BROKER_USER = dotcloud_env['DOTCLOUD_BROKER_AMQP_LOGIN']
BROKER_PASSWORD = dotcloud_env['DOTCLOUD_BROKER_AMQP_PASSWORD']
BROKER_VHOST = '/'
I've replaced the database settings in the app with:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'template1',
        'USER': dotcloud_env['DOTCLOUD_DB_SQL_LOGIN'],
        'PASSWORD': dotcloud_env['DOTCLOUD_DB_SQL_PASSWORD'],
        'HOST': dotcloud_env['DOTCLOUD_DB_SQL_HOST'],
        'PORT': int(dotcloud_env['DOTCLOUD_DB_SQL_PORT']),
    }
}
I've also added a requirements.txt file with
Django==1.4
django-celery
setproctitle
and the dotcloud.yml file
www:
    type: python
broker:
    type: rabbitmq
workers:
    type: python-worker
db:
    type: postgresql
and the supervisor.conf
[program:djcelery]
directory = $HOME/current/
command = python manage.py celeryd -E -l info -c 2
stderr_logfile = /var/log/supervisor/%(program_name)s_error.log
stdout_logfile = /var/log/supervisor/%(program_name)s.log
[program:celerycam]
directory = $HOME/current/
command = python manage.py celerycam
stderr_logfile = /var/log/supervisor/%(program_name)s_error.log
stdout_logfile = /var/log/supervisor/%(program_name)s.log
and to the postinstall file I added
dotcloud_get_env() {
    sed -n "/$1/ s/.*: \"\(.*\)\".*/\1/p" < "$HOME/environment.json"
}

setup_django_celery() {
    cat > $HOME/current/supervisord.conf << EOF
[program:djcelery]
directory = $HOME/current/
command = python manage.py celeryd -E -l info -c 2
stderr_logfile = /var/log/supervisor/%(program_name)s_error.log
stdout_logfile = /var/log/supervisor/%(program_name)s.log
[program:celerycam]
directory = $HOME/current/
command = python manage.py celerycam
stderr_logfile = /var/log/supervisor/%(program_name)s_error.log
stdout_logfile = /var/log/supervisor/%(program_name)s.log
EOF
}

if [ `dotcloud_get_env SERVICE_NAME` = workers ] ; then
    setup_django_celery
fi
The last fi was added by me; it is not in the dotCloud tutorial.
Edit
I've whipped together a repo with this example, as when it works it should be quite useful for others. It's available at: https://github.com/asunwatcher/django-celery-dotcloud
This looks like an error in our CLI.
Try dotcloud url and you will see that your application has two URLs: one for your www service and one for your RabbitMQ service, which is the RabbitMQ management interface. You can log in there with the RabbitMQ username and password given in the dotCloud environment.
For some reason we're picking the wrong one to show you at the end of the push.
The URL for your www service is the one you want.