I'd like to incorporate a custom tagger into a web application (running on Pyramid) I'm developing. I have the tagger working fine on my local machine using NLTK, but I've read that NLTK is relatively slow for production.
It seems that the standard way of storing the tagger is to Pickle it. On my machine, it takes a few seconds to load the 11.7MB pickle file.
Is NLTK even practical for production? Should I be looking at scikit-learn or even something like Mahout?
If NLTK is good enough, what is the best way to ensure that it properly uses memory, etc.?
I run text-processing and its associated NLP APIs, and it uses about two dozen different pickled models, which are loaded by a Django app (gunicorn behind nginx). The models are loaded as soon as they are needed, and once loaded, they stay in memory. That means whenever I restart the gunicorn server, the first requests that need a model have to wait a few seconds for it to load, but every subsequent request gets to use the model that's already cached in RAM. Restarts only happen when I deploy new features, which usually involves updating the models, so I'd need to reload them anyway. So if you don't expect to make code changes very often, and don't have strong requirements on consistent request times, then you probably don't need a separate daemon.
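A minimal sketch of that load-on-first-use pattern (the module layout and pickle path are hypothetical, not my actual setup):
```python
# taggers.py -- sketch of the "load once, keep in RAM" pattern described above.
import pickle

_TAGGER = None  # module-level cache; lives as long as the worker process


def get_tagger():
    """Load the pickled tagger on first use; reuse it on every later request."""
    global _TAGGER
    if _TAGGER is None:
        with open("models/pos_tagger.pickle", "rb") as f:  # hypothetical path
            _TAGGER = pickle.load(f)
    return _TAGGER


def tag(text):
    # A real app would use nltk.word_tokenize here; split() keeps the sketch
    # dependency-free.
    return get_tagger().tag(text.split())
```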
Other than the initial load time, the main limiting factor is memory. I currently only have 1 worker process, because when all the models are loaded into memory, a single process can take up to 1GB (YMMV, and for a single 11MB pickle file, your memory requirements will be much lower). Processing an individual request with an already loaded model is fast enough (usually <50ms) that I currently don't need more than 1 worker, and if I did, the simplest solution is to add enough RAM to run more worker processes.
If you are worried about memory, then do look into scikit-learn, since equivalent models can use significantly less memory than NLTK. But, they are not necessarily faster or more accurate.
The best way to reduce start-up latency is to run the tagger as a daemon (persistent service) that your web app sends snippets of text to tag. That way your tagger loads only when the system boots up and if/when the daemon needs to be restarted.
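A minimal sketch of such a daemon, using only the standard library; the port, pickle path, and whitespace tokenization are assumptions for illustration:
```python
# tagger_daemon.py -- sketch of a persistent tagging daemon. The web app POSTs
# raw text and gets the tagged tokens back as JSON; the model is loaded once,
# when the daemon starts.
import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer

with open("models/pos_tagger.pickle", "rb") as f:  # hypothetical path
    TAGGER = pickle.load(f)


class TagHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        tags = TAGGER.tag(text.split())  # a real app might tokenize properly
        body = json.dumps(tags).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8765), TagHandler).serve_forever()
```
The web app then just POSTs snippets of text to http://127.0.0.1:8765/ and parses the JSON response, so the pickle load only happens at daemon start-up.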
Only you can decide if the NLTK is fast enough for your needs. Once the tagger is loaded, you've probably noticed that the NLTK can tag several pages of text without perceivable delay. But resource consumption and the number of concurrent users could complicate things.
Related
I am about to deploy a Django app, and it struck me that I couldn't find a way to anticipate how many requests per second my application can handle.
Is there a way of calculating how many requests per second a Django application can handle, without resorting to things like doing a test deployment and using an external tool such as locust?
I know there are several factors involved (such as the number of database queries, etc.), but perhaps there is a convenient way of calculating, or at least estimating, how many visitors a single Django app instance can handle.
EDIT: Removed the mention of Gunicorn, since it only adds confusion to what I truly wanted to know.
Is there a way of calculating how many requests per second a Django application can handle, without resorting to things like doing a test deployment and using an external tool such as locust?
No and yes. As mackarone pointed out, I don't think there's any way to avoid measuring it. Consider the case where you did a local benchmark on your local dev server talking to a local DB instance, in order to generate a baseline for estimation. The issue with this is that the hardware and the network (distance between services) make a huge difference, so any numbers you generated locally would be relatively worthless for capacity planning.
In my experience, local testing is great for relative changes. Consider the case where you wanted to see the performance impact of a SQL query plan change. Establishing a local baseline, making the change, then observing the effect locally is useful for gauging the relative speedup.
How to generate these numbers?
I would recommend deploying the app to the hardware and network you plan on testing on. This deployment should use your production configuration and component topology (i.e. if you're going to run gunicorn, make sure gunicorn is running, and if you're going to have a proxy in front of gunicorn, make sure that is set up). I would run a single instance of your application using your production config.
Once this is running, I would launch a load test against the single instance using any of the popular load testing tools:
Apache Benchmark
Siege
Vegeta
K6
etc
You can launch these load tests from a single machine and ramp up traffic until response times are no longer acceptable, in order to get a feel for the number of concurrent connections and the throughput your application can accommodate.
Now you have some idea of what a single instance of your service is able to handle. Until your DB (or other shared resources) is saturated, these numbers can be used to project how many instances of your service are necessary to handle a given amount of traffic.
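For example (numbers purely illustrative): if a single instance tops out at around 120 requests/second before latency becomes unacceptable, and you expect peaks of 500 requests/second, you would plan for roughly 500 / 120 ≈ 5 instances plus some headroom, for as long as the database is not the bottleneck.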
According to the Gunicorn documentation
How Many Workers?
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
Obviously, your particular hardware and application are going to affect the optimal number of workers. Our recommendation is to start with the above guess and tune using TTIN and TTOU signals while the application is under load.
Always remember, there is such a thing as too many workers. After a point your worker processes will start thrashing system resources decreasing the throughput of the entire system.
The best thing is to tune it using a load testing tool such as Locust, as you mentioned; see the config sketch below for a starting point.
Emphasis mine
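As a concrete starting point, a minimal gunicorn.conf.py applying that rule of thumb might look like this (the file name and bind address are just conventions; adjust to your deployment):
```python
# gunicorn.conf.py -- sketch applying the (2 x $num_cores) + 1 rule of thumb
# from the Gunicorn docs quoted above; tune up or down under real load
# (e.g. via the TTIN/TTOU signals).
import multiprocessing

bind = "0.0.0.0:8000"
workers = multiprocessing.cpu_count() * 2 + 1
```
You would then launch with something like gunicorn -c gunicorn.conf.py myproject.wsgi:application and adjust the worker count while load testing.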
You have to install loadtest first; it is an npm package.
I was learning Redis at the time when I found it; you can use it, and it worked for me.
For more, check this tutorial: https://realpython.com/caching-in-django-with-redis/#start-by-measuring-performance
npm install -g loadtest
loadtest -n 100 -k http://localhost:8000/myUrl/
The challenge is to run a set of data processing and data science scripts that consume more memory than expected.
Here are my requirements:
Running 10-15 Python 3.5 scripts via Cron Scheduler
Each of these 10-15 scripts takes somewhere between 10 seconds and 20 minutes to complete
They run at different hours of the day; some of them run every 10 minutes while some run once a day
Each script logs what it has done so that I can take a look at it later if something goes wrong
Some of the scripts send e-mails to me and to my team mates
None of the scripts have an HTTP/web server component; they all run on Cron schedules and are not user-facing
All the scripts' code is fed from my GitHub repository; when scripts wake up, they first do a git pull origin master and then start executing. That means pushing to master causes all scripts to be on the latest version.
Here is what I currently have:
Currently I am using 3 Digital Ocean servers (droplets) for these scripts
Some of the scripts require a huge amount of memory (I get segmentation fault in droplets with less than 4GB of memory)
I am willing to introduce a new script that might require even larger memory (the new script currently faults in a 4GB droplet)
The setup of the droplets is relatively easy (thanks to Python venv) but not to the point of executing a single command to spin up a new droplet and set it up
Having a full dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive.
What would be a more efficient way to handle this?
What would be a more efficient way to handle this?
I'll answer in three parts:
Options to reduce memory consumption
Minimalistic architecture for serverless computing
How to get there
(I) Reducing Memory Consumption
Some handle large loads of data
If you find the scripts use more memory than you expect, the only way to reduce the memory requirements is to
understand which parts of the scripts drive memory consumption
refactor the scripts to use less memory
Typical issues that drive memory consumption are:
using the wrong data structure - e.g. if you have numerical data it is usually better to load the data into a numpy array as opposed to a Python list. If you create a lot of objects of custom classes, it can help to use __slots__
loading too much data into memory at once - e.g. if the processing can be split into several parts independent of each other, it may be more efficient to only load as much data as one part needs, then use a loop to process all the parts.
holding object references that are no longer needed - e.g. in the course of processing you create objects to represent or process some part of the data. If the script keeps a reference to such an object, it won't get released until the end of the program. One way around this is to use weak references, another is to use del to dereference objects explicitly. Sometimes it also helps to call the garbage collector.
using an offline algorithm when there is an online version (for machine learning) - e.g. some of scikit-learn's algorithms provide a version for incremental learning, such as LinearRegression => SGDRegressor or LogisticRegression => SGDClassifier (see the sketch after this list)
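For example, a rough sketch of out-of-core learning with scikit-learn's SGDRegressor; the toy data generator here stands in for reading successive chunks from disk or a database:
```python
# Incremental (out-of-core) learning: the full dataset never has to sit in
# memory at once, only one chunk at a time.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = rng.normal(size=20)          # toy "ground truth" weights
model = SGDRegressor()

for _ in range(100):                  # 100 chunks of 1,000 rows each
    X = rng.normal(size=(1000, 20))   # stand-in for a chunk read from disk
    y = X @ true_w
    model.partial_fit(X, y)           # update the model with this chunk only
```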
some do minor data science tasks
Some algorithms do require large amounts of memory. If using an online algorithm for incremental learning is not an option, the next best strategy is to use a service that only charges for the actual computation time/memory usage. That's what is typically referred to as serverless computing - you don't need to manage the servers (droplets) yourself.
The good news is that in principle the provider you use, Digital Ocean, provides a model that only charges for resources actually used. However this is not really serverless: it is still your task to create, start, stop and delete the droplets to actually benefit. Unless this process is fully automated, the fun factor is a bit low ;-)
(II) Minimalistic Architecture for Serverless Computing
a full dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive
Since your scripts run only occasionally / on a schedule, your droplet does not need to run or even exist all the time. So you could set this up the following way:
Create a scheduling droplet. This can be of a small size. Its only purpose is to run a scheduler and to create a new droplet when a script is due, then submit the task for execution on this new worker droplet. The worker droplet can be of the specific size to accommodate the script, i.e. every script can have a droplet of whatever size it requires.
Create a generic worker. This is the program that runs upon creation of a new droplet by the scheduler. It receives as input the URL to the git repository where the actual script to be run is stored, and a location to store results. It then checks out the code from the repository, runs the scripts and stores the results.
Once the script has finished, the scheduler deletes the worker droplet.
With this approach there are still fully dedicated droplets for each script, but they only cost money while the script runs.
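As a rough sketch of how the scheduler's create/destroy cycle could look against DigitalOcean's v2 API (the size, region, image slug, droplet name, run.py entry point, and cloud-init bootstrap are all assumptions; error handling and completion polling are omitted):
```python
# Scheduler sketch: create a worker droplet, hand it a bootstrap script via
# cloud-init user_data, and destroy it once the job is done.
import os
import requests

API = "https://api.digitalocean.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['DO_TOKEN']}"}


def create_worker(script_repo_url):
    # Runs on first boot via cloud-init; repo layout is hypothetical.
    user_data = f"""#!/bin/bash
git clone {script_repo_url} /opt/job
cd /opt/job && python3 run.py
"""
    resp = requests.post(f"{API}/droplets", headers=HEADERS, json={
        "name": "worker-ephemeral",
        "region": "nyc3",
        "size": "s-4vcpu-8gb",        # pick the size the script actually needs
        "image": "ubuntu-22-04-x64",
        "user_data": user_data,
    })
    resp.raise_for_status()
    return resp.json()["droplet"]["id"]


def destroy_worker(droplet_id):
    requests.delete(f"{API}/droplets/{droplet_id}", headers=HEADERS).raise_for_status()
```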
(III) How to get there
One option is to build an architecture as described above, which would essentially be an implementation of a minimalistic architecture for serverless computing. There are several Python libraries to interact with the Digital Ocean API. You could also use libcloud as a generic multi-provider cloud API to make it easy(ier) to switch providers later on.
Perhaps the better alternative, before building it yourself, is to evaluate one of the existing open source serverless options. An extensive curated list is provided by the good fellows at awesome-serverless. Note that at the time of writing, many of the open source projects are still in their early stages; the more mature options are commercial.
As always with engineering decisions, there is a trade-off between the time/cost required to build or host it yourself vs. the cost of using a readily available commercial service. Ultimately that's a decision only you can take.
Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I present some context and also my first thought on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app, on which users perform actions that should trigger specific tasks from the (also) already developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with node.js through HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit-learn classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified; each document will be uploaded at an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I am trying to use this uWSGI attach-daemon solution.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little time spent learning Celery for not having to write a bunch of coordination code.
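A minimal sketch of that setup; the broker URL, file name, model path, and the predict task are placeholders, not your actual application:
```python
# tasks.py -- Celery workers each hold one copy of the expensive model,
# loaded lazily on first use, and serve predictions as tasks.
import pickle

from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",    # assumed Redis broker
             backend="redis://localhost:6379/0")

_model = None  # one copy per worker process


def get_model():
    global _model
    if _model is None:
        with open("classifier.pickle", "rb") as f:  # hypothetical path
            _model = pickle.load(f)
    return _model


@app.task
def predict(documents):
    # Assumes a scikit-learn-style classifier with a predict() method.
    return get_model().predict(documents).tolist()
```
From the Flask endpoint you would then call something like task = predict.delay(docs) and result = task.get(timeout=30).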
I want to load test a web page. I want to do it in Python with multiple threads.
First POST request would login user (set cookies).
Then I need to know how many users doing the same POST request simultaneously the server can take.
So I'm thinking about spawning threads in which requests would be made in loop.
I have a couple of questions:
1. Is it possible to run 1000-1500 requests at the same time, CPU-wise? I mean, wouldn't it slow down the system so much that it's not reliable anymore?
2. What about bandwidth limitations? How good should the channel be for this test to be reliable?
The server the test site is hosted on is an Amazon EC2 instance; the script would be run from another server (also on Amazon).
Thanks!
CPython does not take advantage of multiple cores when running multiple threads. That means that, basically, you will only have one core doing the testing job.
There are dedicated tools that do what you want to do. Let me suggest two:
FunkLoad is a functional and load web tester, written in Python, whose main use cases are:
Functional testing of web projects, and thus regression testing as well.
Performance testing: by loading the web application and monitoring your servers, it helps you pinpoint bottlenecks, giving a detailed report of performance measurement.
Load testing tool to expose bugs that do not surface in cursory testing, like volume testing or longevity testing.
Stress testing tool to overwhelm the web application resources and test the application recoverability.
Writing web agents by scripting any repetitive web task, like checking if a site is alive.
Tsung is an open-source multi-protocol distributed load testing tool
The purpose of Tsung is to simulate users in order to test the scalability and performance of IP-based client/server applications. You can use it to do load and stress testing of your servers. Many protocols have been implemented and tested, and it can be easily extended. WebDAV, LDAP and MySQL support have been added recently (experimental).
It can be distributed on several client machines and is able to simulate hundreds of thousands of virtual users concurrently (or even millions if you have enough hardware ...).
If you decide to write your own tool, you will probably want to use Python's multiprocessing module, as it lets you use multiple cores. You should also take a look at Twisted, as it lets you easily handle many sockets with a limited number of threads. That would be much better than spawning a new thread for each socket.
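If you do go down that road, here is a rough sketch using multiprocessing plus requests; the URL, credentials, and the process/request counts are placeholders for your own test:
```python
# loadgen.py -- multi-process load generator sketch: each process acts like a
# group of logged-in users hammering the same POST endpoint.
import time
from multiprocessing import Pool

import requests

URL = "http://test-target.example.com/login"   # placeholder target
PAYLOAD = {"username": "test", "password": "secret"}


def worker(n_requests):
    session = requests.Session()       # keeps cookies, like a logged-in user
    latencies = []
    for _ in range(n_requests):
        start = time.time()
        session.post(URL, data=PAYLOAD)
        latencies.append(time.time() - start)
    return latencies


if __name__ == "__main__":
    with Pool(processes=8) as pool:              # one process per "user group"
        results = pool.map(worker, [50] * 8)     # 8 x 50 = 400 requests total
    flat = [t for sub in results for t in sub]
    print(f"{len(flat)} requests, avg {sum(flat) / len(flat):.3f}s")
```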
You work with Amazon EC2, so I would recommend using Tsung. You can rent a dozen multicore servers for a few hours and run some really heavy load tests with Tsung. It scales very well in this kind of configuration.
As for bandwidth, it's usually not a problem, but it depends on the application. You will have to monitor all your resources closely while performing a load test.
Too many variables. 1000 at the same time... no. In the same second... possibly. Bandwidth may well be the bottleneck. This is something best solved by experimentation.
I've recently uploaded an app that uses django appengine patch, and I currently have a cron job that runs every two minutes. On each invocation of the worker URL it consumes quite a bit of resources:
/worker_url 200 7633ms 34275cpu_ms 28116api_ms
That is because on each invocation it does a cold zipimport of all the libraries (Django, etc.).
How long do the imported modules stay in memory?
Is there a way to keep these modules in memory so that, even if subsequent calls arrive outside the window during which the modules would normally stay in memory, they still don't incur this overhead?
app engine keeps everything in memory according to normal Python semantics as long as it's serving one or more requests in the same process in the same node; if and when it needs those resources, the process goes away (so nothing stays in memory that it used to have), and new processes may be started (on the same node or different nodes) any time to serve requests (whether other processes serving other requests are still running, or not). This is much like the fast-CGI model: you're assured of normal semantics within a single request but apart from that anything between 0 and N (no upper limit) different nodes may be running your code, each serving sequentially anything between 0 and K (no upper limit) different requests.
There is nothing you can do to "stay in memory" (for zipimported modules or anything else).
For completeness let me mention memcache, which is an explicit hint/request to the app engine runtime to keep something in a special form of memory, a distributed hash table that's shared among all processes running your code -- that's hard though not impossible to use for imported modules (you'd need pretty sophisticated import hooks) and I recommend against the effort needed to develop such hooks because even in presence of such explicit hints the app engine runtime can still at any time choose to eject anything you've stashed away in the cache, anyway.
Rather -- I'm not sure why a cron job in particular would need all of Django, nor why you're zipimporting it rather than just using the 1.0.2 that now comes with App Engine, per the docs -- care to elaborate? This might be a useful thing for you to optimize.