I am developing a Python module to manage the execution of long-baseline simulation processes. Whilst I have had some successes to date, I would like to improve it in order to make it more usable for its intended, er, users.
My code to date uses sys, time, subprocess and the package watchdog; it monitors a folder for new run files, places these in a FIFO queue and executes them simultaneously up to a pre-defined limit. This works, but is a bit clunky, and managing this dynamically is impractical.
I am not a computer scientist or a software engineer; this is why I am asking for help. Despite that, I am confident that I would be able to do the (bulk of the) coding work, but I would like some guidance on:
What are the computer science-y names for some of the concepts I'm describing
What packages are out there for doing some of these things in Python 3.x
What approaches you would use to do these things
What data structures you might use to do this
I appreciate any help you're able to offer, and of course please let me know if there is a better StackExchange site where I could submit this question. Thank you in advance!
Some detail about the simulations:
Entirely based in Windows 10
One of three or four different varieties
Executed from command line with reference to an input file
Simulations either fail instantly or almost instantly, or run for several hours, days or even weeks
Typical batches are up to tens of simulations
Simulations use lots of RAM and an entire CPU
I should also point out that, according to the terms of the simulation software licenses, I am expressly forbidden to use third-party cloud computing platforms.
Here's what I need the module to do:
Simulations dispatched automatically
Simultaneous simulations up to a limit
Queue runs indefinitely
Queue can be added to dynamically (reference in a new file)
Feedback on what's running and what's in the queue
First in, first out queuing
And some extra ideas, from users, about what would be nice to have:
Can dispatch simulations onto multiple computers
Simultaneous simulation limit could be changed dynamically
System of prioritisation
Simulations can be terminated early
If the module crashes, can be restarted without having to rebuild the queue
A browser-based interface where the queue could be viewed / managed
Related
The challenge is to run a set of data processing and data science scripts that consume more memory than expected.
Here are my requirements:
Running 10-15 Python 3.5 scripts via Cron Scheduler
These different 10-15 scripts each take somewhere between 10 seconds to 20 minutes to complete
They run on different hours of the day, some of them run every 10 minute while some run once a day
Each script logs what it has done so that I can take a look at it later if something goes wrong
Some of the scripts sends e-mails to me and to my team mates
None of the scripts have an HTTP/web server component; they all run on Cron schedules and not user-facing
All the scripts' code is fed from my Github repository; when scripts wake up, they first do a git pull origin master and then start executing. That means, pushing to master causes all scripts to be on the latest version.
Here is what I currently have:
Currently I am using 3 Digital Ocean servers (droplets) for these scripts
Some of the scripts require a huge amount of memory (I get segmentation fault in droplets with less than 4GB of memory)
I am willing to introduce a new script that might require even larger memory (the new script currently faults in a 4GB droplet)
The setup of the droplets are relatively easy (thanks to Python venv) but not to the point of executing a single command to spin off a new droplet and set it up
Having a full dedicated 8GB / 16B droplet for my new script sounds a bit inefficient and expensive.
What would be a more efficient way to handle this?
What would be a more efficient way to handle this?
I'll answer in three parts:
Options to reduce memory consumption
Minimalistic architecture for serverless computing
How to get there
(I) Reducing Memory Consumption
Some handle large loads of data
If you find the scripts use more memory than you expect, the only way to reduce the memory requirements is to
understand which parts of the scripts drive memory consumption
refactor the scripts to use less memory
Typical issues that drive memory consumption are:
using the wrong data structure - e.g. if you have numerical data it is usually better to load the data into a numpy array as opposed to a Python array. If you create a lot of objects of custom classes, it can help to use __slots__
loading too much data into memory at once - e.g. if the processing can be split into several parts independent of each other, it may be more efficient to only load as much data as one part needs, then use a loop to process all the parts.
hold object references that are no longer needed - e.g. in the course of processing you create objects to represent or process some part of the data. If the script keeps a reference to such an object, it won't get released until the end of the program. One way around this is to use weak references, another is to use del to dereference objects explicitely. Sometimes it also helps to call the garbage collector.
using an offline algorithm when there is an online version (for machine learning) - e.g. some of scikit's algorithms provide a version for incremental learning such as LinearRegression => SGDRegressior or LogisticRegression => SGDClassifier
some do minor data science tasks
Some algorithms do require large amounts of memory. If using an online algorithm for incremental learning is not an option, the next best strategy is to use a service that only charges for the actual computation time/memory usage. That's what is typically referred to as serverless computing - you don't need to manage the servers (droplets) yourself.
The good news is that in principle the provider you use, Digital Ocean, provides a model that only charges for resources actually used. However this is not really serverless: it is still your task to create, start, stop and delete the droplets to actually benefit. Unless this process is fully automated, the fun factor is a bit low ;-)
(II) Minimalstic Architecture for Serverless Computing
a full dedicated 8GB / 16B droplet for my new script sounds a bit inefficient and expensive
Since your scripts run only occassionally / on a schedule, your droplet does not need to run or even exist all the time. So you could set this is up the following way:
Create a schedulding droplet. This can be of a small size. It's only purpose is to run a scheduler and to create a new droplet when a script is due, then submit the task for execution on this new worker droplet. The worker droplet can be of the specific size to accommodate the script, i.e. every script can have a droplet of whatever size it requires.
Create a generic worker. This is the program that runs upon creation of a new droplet by the scheduler. It receives as input the URL to the git repository where the actual script to be run is stored, and a location to store results. It then checks out the code from the repository, runs the scripts and stores the results.
Once the script has finished, the scheduler deletes the worker droplet.
With this approach there are still fully dedicated droplets for each script, but they only cost money while the script runs.
(III) How to get there
One option is to build an architecture as described above, which would essentially be an implementation of a minimalistic architecture for serverless computing. There are several Python libraries to interact with the Digital Ocean API. You could also use libcloud as a generic multi-provider cloud API to make it easy(ier) to switch providers later on.
Perhaps the better alternative before building yourself is to evaluate one of the existing open source serverless options. An extensive curated list is provided by the good fellows at awesome-serverless. Note at the time of writing this, many of the open source projects are still in their early stages, the more mature options are commerical.
As always with engineering decisions, there is a trade-off between the time/cost required to build or host yourself v.s. the cost of using a readily-available commercial service. Ultimately that's a decision only you can take.
I have a project I am working on for developing a distributed computing model for one of my other projects. One of my scripts is multiprocessed, monitoring for changes in directories, decoding pickles, creating authentication keys for them based on the node to which they are headed, and repickling them. (Edit: These processes all operate in loops.)
My question is, when looking at the activity monitor in OSX, my %CPU column displays 100% for the primary processes that run the scripts. These three that are showing 100% are the manager script, and the two nodes (I am simulating the model on one machine, the intent is to move the model to a live cluster network in the future). Is this bad? My system usage shows 27.50%, user 12.50%, and 65% idle.
I've attempted research myself, first, and my only thought is that these numbers display that the process is utilizing the CPU for the entire time it's alive, and is never idle.
Can I please get some clarification?
Update based on comments:
My processes run in an endless loop, monitoring for changes to files in their respective directories, in order to receive new 'jobs' from the manager process/script (a separate computer in the cluster in the project's final implementation). Maybe there is a better way to be waiting for I/O in this form that doesn't require so much processor time?
Solution (If however Sub-Optimal):
Implemented a time.sleep(n) period at the end of each loop for 0.1 seconds. Brought the CPU time down to no more than 0.4%.
Still, though, I am looking for a more optimal way of reducing CPU time, without using modules not included in the Python Standard Library, I'd like to avoid using a time.sleep(n) period, as I want the system to be able to respond at any moment, and if the load of input files gets very high, I do not want it to be wasting time sleeping, when it could be processing the files.
My processes operate by busy waiting, so it was causing them to use excessive CPU time:
while True:
files = os.path.listdir('./sub_directory/')
if files != []:
do_something()
This is the basis for each script/process I am executing.
Python threading module documentation says something like this
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.
Can someone explain whether I can use threading module in my situation or not?
I'm going to detect the frameworks used by websites.
So here is how my app works
My MySQL database contains around 10 million domains ( id, domain, frameworks )
Fetch 1000 rows from the database
Scrape domain one by one using requests module
Detect the frameworks
Update the database row with the results.
Since I have 10 million domains, its going to take very long time. So I would like to speed up the process by using threads.
But i'm not sure whether my app is I/O bound or not. Can someone explain?
Thankyou
I guess, the most time expensive activity will be fetching all the urls.
So the answer to your question is: Yes, your app is very likely to be I/O bound.
You plan to scrape domains one by one, this would lead into really long processing time. You shall definitely do that concurrently. One solution is described in my answer to similar question related to scraping web sites.
Anyway, the number of your urls seems really large, you might need to take advantage from splitting the work to multiple workers - for this purpose you might use e.g. Celery framework. However, as your task is really I/O bound, you would earn some speed only, if your workers work on multiple computers, ideally with independent connectivity. I did similar task on DigitalOcean machines and it worked very well.
I'm working on a fairly simple CGI with Python. I'm about to put it into Django, etc. The overall setup is pretty standard server side (i.e. computation is done on the server):
User uploads data files and clicks "Run" button
Server forks jobs in parallel behind the scenes, using lots of RAM and processor power. ~5-10 minutes later (average use case), the program terminates, having created a file of its output and some .png figure files.
Server displays web page with figures and some summary text
I don't think there are going to be hundreds or thousands of people using this at once; however, because the computation going on takes a fair amount of RAM and processor power (each instance forks the most CPU-intensive task using Python's Pool).
I wondered if you know whether it would be worth the trouble to use a queueing system. I came across a Python module called beanstalkc, but on the page it said it was an "in-memory" queueing system.
What does "in-memory" mean in this context? I worry about memory, not just CPU time, and so I want to ensure that only one job runs (or is held in RAM, whether it receives CPU time or not) at a time.
Also, I was trying to decide whether
the result page (served by the CGI) should tell you it's position in the queue (until it runs and then displays the actual results page)
OR
the user should submit their email address to the CGI, which will email them the link to the results page when it is complete.
What do you think is the appropriate design methodology for a light traffic CGI for a problem of this sort? Advice is much appreciated.
Definitely use celery. You can run an amqp server or I think you can sue the database as a queue for the messages. It allows you to run tasks in the background and it can use multiple worker machines to do the processing if you want. It can also do cron jobs that are database based if you use django-celery
It's as simple as this to run a task in the background:
#task
def add(x, y):
return x + y
In a project I have it's distributing the work over 4 machines and it works great.
Python seems to have many different packages available to assist one in parallel processing on an SMP based system or across a cluster. I'm interested in building a client server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages listed above, which is recommended and why?
Edit: In particular, I have written a simulator which takes in a few inputs and processes things for awhile. I need to collect enough samples from the simulation to estimate a mean within a user specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which report back to the server at some interval with the samples that they have collected. The server then calculates the confidence interval and determines whether the client process needs to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the processes.
With this need for intercommunication between the client and server processes, I question whether batch-scheduling is a viable solution. Sorry I should have been more clear to begin with.
Have a go with ParallelPython. Seems easy to use, and should provide the jobs and queues interface that you want.
There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common patterns of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic data distribution of data across the cluster (i.e. HDFS), etc.
Given that you tagged your question "scientific-computing", and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text in your question suggests you want to develop a batch scheduler. So I don't really know which question you're asking.
The simplest way to do this would probably just to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll these output files to see if they're sufficient or if more jobs need to be submitted.