I'm running a Python Flask app on an Ubuntu 14.04 server where I load some data, the two major things being:
Google News vectors, to be used with Word2vec (the GoogleNewsVec is about 4GB)
350MB json file with data
This all loads on my local Windows machine in ~5 minutes, and that machine has similar specs to the server (8GB RAM). The weird thing is that the two parts load fine separately. So when I do:
load_word2vec_model()
load_json_data()
It loads the model really quickly and then gets stuck at load_json_data():
[13:16:55] Loading model..
[13:17:13] Finished loading model.
[13:17:13] Loading scores..
[13:17:13] Loading scores dict from json file..
But when I do it in reverse order:
load_json_data()
load_word2vec_model()
It gets stuck at loading the word2vec model:
[13:20:29] Loading scores..
[13:20:29] Loading scores dict from json file..
[13:22:42] Finished loading from json file.
[13:22:42] Finished loading scores.
[13:22:42] Loading model..
I do not get any Python error message. This leads me to believe that the server has somehow reached its maximum memory usage and will not load the entire model.
On my local Windows machine it does use up a lot of memory, but eventually it loads (in about 5 minutes total). Why does this not happen on the server? I've waited for an hour but it never loads.
This is the htop output for the server:
The difference between your Windows machine and your Ubuntu server is most likely caused by the pagefile (Windows) / swappiness (Linux) configuration. To sum up, swapping means writing some parts of memory, preferably not-so-important stuff, to disk to make room for another process that asks for memory.
Now, end-user-targeted Windows machines come with a pagefile (the file that memory contents are written to) sized at around 75% of RAM. But Ubuntu servers, AWS comes to mind, usually come with no swap partition/file and with swappiness, i.e. the likelihood that memory will be swapped to disk, set to 0, i.e. not at all.
The solution is either setting up a swap file and a swappiness configuration, or throwing more memory at the problem. The former will make your application work just as it does on Windows; the latter will solve it for good.
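As a quick sanity check, a minimal Python sketch (Linux only) that reads /proc/meminfo to see how much RAM and swap the server actually has before the 4GB model is loaded:
def meminfo():
    # every line of /proc/meminfo looks like "MemTotal:  16308904 kB"
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.strip().split()[0])  # values are in kB
    return info

m = meminfo()
print('MemTotal:     %d MB' % (m['MemTotal'] // 1024))
print('MemAvailable: %d MB' % (m.get('MemAvailable', m['MemFree']) // 1024))  # older kernels lack MemAvailable
print('SwapTotal:    %d MB' % (m['SwapTotal'] // 1024))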
NVM: it seems you have swap enabled.
Related
I work for a digital marketing agency with multiple clients. In one of the projects I have a very resource-intensive Python script (which fetches data for Facebook ads) to be run for all of those clients (say 500+ of them) on an Ubuntu 16.04 server.
Originally the script took around 2 minutes to complete for one client, with 300 MB resident and 1000 MB virtual memory (as per htop). I therefore optimized it with ThreadPoolExecutor (max_workers=10) so that the script can run for 4 clients (almost) concurrently.
Then I found out that sometimes the script froze during a run (basically it was in a "comatose state"). I debugged and profiled it and found that it isn't the script causing the issue, but the system.
Then I batched the script, meaning that for 20 input clients I ran 5 instances of the script (4*5=20). Sometimes this went fine, but sometimes the last instance froze.
Then I found out that the RAM (2 GB) was being overused, so I increased the swap space from 0 to 1 GB. That did the trick. But if a few clients are heavy on memory, the same thing happens.
I have attached a screenshot of the latest run, where after running 8 instances the last 2 froze. They can be left for days and never finish.
I am thinking of increasing the server RAM from 2 GB to 4 GB, but I am not sure that is a permanent solution. Has anyone faced a similar issue?
You need to fix the RAM consumption of your script.
If your script allocates more memory than your system can provide, it gets memory errors; if that happens inside thread pools or similar constructs, the threads may never return under some circumstances.
You can fix this by using async functions with timeouts and implementing automatic restart handlers, in case a task does not yield the expected result (see the sketch below).
The best way to do that depends heavily on the script and will probably require altering already-written code.
The issue is definitely with your script and not with the OS.
The fastest workaround would be to increase the system memory or to reduce the number of threads.
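A minimal sketch of the timeout-and-restart idea, using one worker process per client rather than async functions (a terminated process, unlike a hung thread, gives its memory back to the OS); fetch_client_data and the timeout value are placeholders for your actual code:
from multiprocessing import Process

def fetch_client_data(client):
    ...  # placeholder for the real Facebook-ads fetch

def run_with_restart(client, timeout=300, retries=1):
    # run the fetch in a child process; kill and retry it if it overruns the timeout
    for attempt in range(retries + 1):
        p = Process(target=fetch_client_data, args=(client,))
        p.start()
        p.join(timeout)
        if p.is_alive():          # did not finish in time -> terminate and retry
            p.terminate()
            p.join()
            continue
        if p.exitcode == 0:       # finished normally
            return True
    return False                  # log / alert after exhausting the retries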
If just adding 1 GB of swap space "almost" did the trick, then increasing the physical memory is definitely a good way to go. Btw, remember that swapping means you're using disk storage, whose speed is measured in milliseconds, while RAM speed is measured in nanoseconds - so avoiding swap guarantees a performance boost.
And then, reboot your system every now and then. Although Linux is far better than Windows in this respect, memory leaks do occur in Linux too, and a reboot every few months will surely help.
As Gornoka stated, you need to reduce the memory consumption of the script. As an added detail, this can also be done by removing declared variables within the script once they have been used, with the keyword
del
This can also be done by ensuring that, if the script processes massive files, it does so line by line, saving the output as each line is finished.
I have had this happen, and it is usually an indicator of working with too much data at once in RAM. It is always better to work on it in parts whenever possible, and if that is not possible, get more RAM.
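A rough illustration of both points (the file names and do_heavy_work are made up for the sketch):
import json

def do_heavy_work(record):
    ...  # placeholder for the real per-record processing
    return record

def process_file(in_path='clients_input.jsonl', out_path='clients_output.jsonl'):
    # one record is held in memory at a time; results are written out immediately
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            record = json.loads(line)
            result = do_heavy_work(record)
            dst.write(json.dumps(result) + '\n')
            del record, result        # drop the references as soon as they are used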
I'm a beginner in Python. I used multiprocessing.Pool in my project to improve performance.
Here's a snippet of the code where I use multiprocessing.Pool.
I build the pool when my resident server starts, and use the Pool.apply_async method every time the server gets a request:
import multiprocessing as mp
from multiprocessing import Pool

# build pool when server started
mp.set_start_method('forkserver')
self._driver_pool = Pool(processes=10)
self._executor_pool = Pool(processes=30)
# use pool every time get a request
driver = driver_class(driver_context, init_table, self._manager, **kwargs_dict)
future = self._driver_pool.apply_async(driver.run)
I tested the code on my computer, whose operating system is macOS, and then I deployed the code on a Linux machine.
I found that when I run my code on macOS, the Pool.apply_async method takes about 10 ms, but the same code on Linux takes about 2 s.
I don't understand why there is such a big difference in performance. Is there something wrong with the way I use multiprocessing.Pool?
After some tests, I have a conjecture.
The current phenomenon is that when the size of the Pool is set to 30, the first 30 requests are slow, but after that the cost per task drops significantly.
On macOS I compared the performance with and without pyc files, and found that the cost rises after I delete the pyc files.
I suspect there are several possible reasons for the performance differences:
When using the 'forkserver' start method, the new process loads all resources, including imported modules, which means it will try to find the pyc files; if they are missing, it compiles the Python files into pyc files and then loads them.
The processes in a Pool are never released, which means that once a process has loaded the pyc files into its memory, it never has to load them again.
The Mac computer has an SSD, which means that when a process on the Mac loads pyc files, it gets better performance than a process on a machine without an SSD.
Now the question I'm running into is whether there is a way to pre-load resources for processes started with the 'forkserver' method to get better performance.
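For what it's worth, the standard library does expose multiprocessing.set_forkserver_preload() for exactly this; a minimal sketch, where 'my_drivers' stands in for whatever modules your drivers actually import:
import multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('forkserver')
    # ask the fork server to import the heavy modules once, up front,
    # so every worker it forks inherits the already-imported code
    mp.set_forkserver_preload(['my_drivers'])
    pool = mp.Pool(processes=10)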
The challenge is to run a set of data processing and data science scripts that consume more memory than expected.
Here are my requirements:
Running 10-15 Python 3.5 scripts via Cron Scheduler
These 10-15 different scripts each take somewhere between 10 seconds and 20 minutes to complete
They run at different hours of the day; some of them run every 10 minutes while some run once a day
Each script logs what it has done so that I can take a look at it later if something goes wrong
Some of the scripts send e-mails to me and to my team mates
None of the scripts has an HTTP/web server component; they all run on Cron schedules and are not user-facing
All the scripts' code comes from my GitHub repository; when the scripts wake up, they first do a git pull origin master and then start executing. That means pushing to master causes all scripts to run the latest version.
Here is what I currently have:
Currently I am using 3 Digital Ocean servers (droplets) for these scripts
Some of the scripts require a huge amount of memory (I get segmentation faults on droplets with less than 4GB of memory)
I want to introduce a new script that might require even more memory (the new script currently faults on a 4GB droplet)
The setup of the droplets is relatively easy (thanks to Python venv), but not to the point of executing a single command to spin up a new droplet and set it up
Having a fully dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive.
What would be a more efficient way to handle this?
What would be a more efficient way to handle this?
I'll answer in three parts:
Options to reduce memory consumption
Minimalistic architecture for serverless computing
How to get there
(I) Reducing Memory Consumption
Some handle large loads of data
If you find the scripts use more memory than you expect, the only way to reduce the memory requirements is to
understand which parts of the scripts drive memory consumption
refactor the scripts to use less memory
Typical issues that drive memory consumption are:
using the wrong data structure - e.g. if you have numerical data, it is usually better to load the data into a numpy array as opposed to a Python list. If you create a lot of objects of custom classes, it can help to use __slots__
loading too much data into memory at once - e.g. if the processing can be split into several parts independent of each other, it may be more efficient to load only as much data as one part needs, then use a loop to process all the parts
holding object references that are no longer needed - e.g. in the course of processing you create objects to represent or process some part of the data. If the script keeps a reference to such an object, it won't get released until the end of the program. One way around this is to use weak references; another is to use del to dereference objects explicitly. Sometimes it also helps to call the garbage collector.
using an offline algorithm when there is an online version (for machine learning) - e.g. some of scikit-learn's algorithms provide a version for incremental learning, such as LinearRegression => SGDRegressor or LogisticRegression => SGDClassifier (see the sketch after this list)
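As a minimal sketch combining the last two ideas (chunked loading plus incremental learning), assuming the data lives in a large CSV with a 'target' column - the file name and column are made up for illustration:
import pandas as pd
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
# read the CSV 100000 rows at a time instead of loading it all at once
for chunk in pd.read_csv('big_training_data.csv', chunksize=100000):
    X = chunk.drop('target', axis=1).values
    y = chunk['target'].values
    model.partial_fit(X, y)   # incremental fit: only one chunk is ever in memory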
some do minor data science tasks
Some algorithms do require large amounts of memory. If using an online algorithm for incremental learning is not an option, the next best strategy is to use a service that only charges for the actual computation time/memory usage. That's what is typically referred to as serverless computing - you don't need to manage the servers (droplets) yourself.
The good news is that in principle the provider you use, Digital Ocean, provides a model that only charges for resources actually used. However this is not really serverless: it is still your task to create, start, stop and delete the droplets to actually benefit. Unless this process is fully automated, the fun factor is a bit low ;-)
(II) Minimalistic Architecture for Serverless Computing
a fully dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive
Since your scripts run only occasionally / on a schedule, your droplet does not need to run, or even exist, all the time. So you could set this up in the following way:
Create a scheduling droplet. This can be of a small size. Its only purpose is to run a scheduler and to create a new droplet when a script is due, then submit the task for execution on this new worker droplet. The worker droplet can be of the specific size to accommodate the script, i.e. every script can have a droplet of whatever size it requires.
Create a generic worker. This is the program that runs upon creation of a new droplet by the scheduler. It receives as input the URL to the git repository where the actual script to be run is stored, and a location to store results. It then checks out the code from the repository, runs the scripts and stores the results.
Once the script has finished, the scheduler deletes the worker droplet.
With this approach there are still fully dedicated droplets for each script, but they only cost money while the script runs.
(III) How to get there
One option is to build an architecture as described above, which would essentially be an implementation of a minimalistic architecture for serverless computing. There are several Python libraries to interact with the Digital Ocean API. You could also use libcloud as a generic multi-provider cloud API to make it easy(ier) to switch providers later on.
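As a rough sketch of the scheduler's create/run/destroy cycle, assuming the python-digitalocean package (the token, the slugs and run_worker() are placeholders, and the exact API should be verified against the library's documentation):
import digitalocean

API_TOKEN = 'YOUR_DO_TOKEN'                       # placeholder

def run_script_on_fresh_droplet(repo_url, size_slug='8gb'):
    # create a worker droplet sized for this particular script
    droplet = digitalocean.Droplet(token=API_TOKEN,
                                   name='worker-temp',
                                   region='nyc3',
                                   image='ubuntu-16-04-x64',
                                   size_slug=size_slug)
    droplet.create()
    try:
        run_worker(droplet, repo_url)             # placeholder: e.g. SSH in, git clone, run, collect results
    finally:
        droplet.destroy()                         # stop paying as soon as the script is done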
Perhaps the better alternative before building it yourself is to evaluate one of the existing open source serverless options. An extensive curated list is provided by the good fellows at awesome-serverless. Note that at the time of writing, many of the open source projects are still in their early stages; the more mature options are commercial.
As always with engineering decisions, there is a trade-off between the time/cost required to build or host yourself vs. the cost of using a readily available commercial service. Ultimately that's a decision only you can take.
I am using a Postgres database with SQLAlchemy and Flask. I have a couple of jobs where I have to run through the entire database to update entries. When I do this on my local machine, I get very different behavior compared to the server.
E.g. there seems to be an upper limit on how many entries I can get from the database?
On my local machine I just query all elements, while on the server I have to query 2000 entries step by step.
If I query too many entries, the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (SQLAlchemy, Postgres)?
2. Since this behaves differently on my local machine, there must be a way to control this. Where would that be?
thanks
carl
Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries that connect to PostgreSQL will read the entire result set into memory by default. But some libraries have a way to tell them to process the results row by row, so they aren't all read into memory at once. I don't know if Flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
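For example, SQLAlchemy (which the question mentions) can fetch rows in batches with yield_per; a minimal sketch where the model, the connection URL and process() are placeholders:
from sqlalchemy import create_engine, Column, Integer, Float
from sqlalchemy.orm import sessionmaker, declarative_base

Base = declarative_base()

class Entry(Base):                                   # stand-in for your mapped model
    __tablename__ = 'entries'
    id = Column(Integer, primary_key=True)
    score = Column(Float)

engine = create_engine('postgresql://user:password@localhost/mydb')
session = sessionmaker(bind=engine)()

def process(entry):
    ...                                              # placeholder for the per-row work

# yield_per fetches rows in batches of 2000 instead of materialising the whole
# table; on older SQLAlchemy/psycopg2 combine it with
# .execution_options(stream_results=True) to get a server-side cursor too.
for entry in session.query(Entry).yield_per(2000):
    process(entry)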
Most likely the kernel is killing your Python script. Python can have horrible memory usage.
I have a feeling you are trying to do these 2000-entry batches in a loop in one Python process. Python does not release all used memory, so the memory usage grows until it gets killed. (You can watch this with the top command.)
You should try adapting your script to process 2000 records in a step and then quit. If you run it multiple times, it should continue where it left off. Or, a better option, try using multiprocessing and run each job in a separate worker. Run the jobs serially and let them die when they finish. This way they will release the memory back to the OS when they exit.
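A minimal sketch of the one-worker-per-batch idea (process_batch and the row counts are placeholders):
from multiprocessing import Process

def process_batch(offset, limit=2000):
    ...  # placeholder: query `limit` rows starting at `offset` and update them

def run_all(total_rows, batch=2000):
    # each batch runs in its own process, so its memory goes back to the OS on exit
    for offset in range(0, total_rows, batch):
        p = Process(target=process_batch, args=(offset, batch))
        p.start()
        p.join()          # run the batches serially, one worker at a time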
I'd like to incorporate a custom tagger into a web application (running on Pyramid) I'm developing. I have the tagger working fine on my local machine using NLTK, but I've read that NLTK is relatively slow for production.
It seems that the standard way of storing the tagger is to Pickle it. On my machine, it takes a few seconds to load the 11.7MB pickle file.
Is NLTK even practical for production? Should I be looking at scikit-learn or even something like Mahout?
If NLTK is good enough, what is the best way to ensure that it properly uses memory, etc.?
I run text-processing and its associated NLP APIs, and it uses about 2 dozen different pickled models, which are loaded by a Django app (gunicorn behind nginx). The models are loaded as soon as they are needed, and once loaded, they stay in memory. That means whenever I restart the gunicorn server, the first requests that need a model have to wait a few seconds for it to load, but every subsequent request gets to use the model that's already cached in RAM. Restarts only happen when I deploy new features, which usually involves updating the models, so I'd need to reload them anyway. So if you don't expect to make code changes very often, and don't have strong requirements on consistent request times, then you probably don't need a separate daemon.
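The load-once, keep-in-RAM pattern is simple to sketch; the pickle path is a placeholder for wherever your tagger lives:
import pickle

_TAGGER = None

def get_tagger(path='models/tagger.pickle'):
    # load the pickled tagger the first time it is needed, then reuse it
    global _TAGGER
    if _TAGGER is None:
        with open(path, 'rb') as f:
            _TAGGER = pickle.load(f)
    return _TAGGER

def tag_text(tokens):
    return get_tagger().tag(tokens)   # NLTK taggers expose .tag(tokens)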
Other than the initial load time, the main limiting factor is memory. I currently only have 1 worker process, because when all the models are loaded into memory, a single process can take up to 1GB (YMMV, and for a single 11MB pickle file, your memory requirements will be much lower). Processing an individual request with an already loaded model is fast enough (usually <50ms) that I currently don't need more than 1 worker, and if I did, the simplest solution is to add enough RAM to run more worker processes.
If you are worried about memory, then do look into scikit-learn, since equivalent models can use significantly less memory than NLTK. But, they are not necessarily faster or more accurate.
The best way to reduce start-up latency is to run the tagger as a daemon (persistent service) that your web app sends snippets of text to tag. That way your tagger loads only when the system boots up and if/when the daemon needs to be restarted.
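A minimal sketch of the daemon idea, using the standard library's XML-RPC server just for illustration (the pickle path is a placeholder; in practice you might prefer a small HTTP or message-queue service):
import pickle
from xmlrpc.server import SimpleXMLRPCServer

with open('models/tagger.pickle', 'rb') as f:
    tagger = pickle.load(f)                       # loaded once, when the daemon starts

def tag(tokens):
    return tagger.tag(tokens)

server = SimpleXMLRPCServer(('127.0.0.1', 8001), allow_none=True)
server.register_function(tag)
server.serve_forever()                            # the web app calls tag() over XML-RPC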
Only you can decide if the NLTK is fast enough for your needs. Once the tagger is loaded, you've probably noticed that the NLTK can tag several pages of text without perceivable delay. But resource consumption and the number of concurrent users could complicate things.