This is a more general question about how to run "embarrassingly parallel" problems with Python "schedulers" in a science environment.
I have a code that is a Python/Cython/C hybrid (for this example I'm using github.com/tardis-sn/tardis, but I have similar problems with other codes) that is internally OpenMP-parallelized. It provides a single function that takes a parameter dictionary and evaluates to an object within a few hundred seconds running on ~8 cores (result = fun(paramset, calibdata), where paramset is a dict, result is an object (essentially a collection of pandas and numpy arrays), and calibdata is a pre-loaded pandas dataframe/object). It logs using the standard Python logging module.
I would like a Python framework that can easily evaluate ~10-100k parameter sets using fun in a SLURM/TORQUE/... cluster environment.
Ideally, this framework would automatically spawn workers (given availability, with a few cores each) and distribute the parameter sets between the workers (different parameter sets take different amounts of time). It would be nice to see the state (in_queue, running, finished, failed) for each parameter set, as well as logs (whether it failed, finished, or is running).
It would be nice if it kept track of what is finished and what still needs to be done, so that I can restart if my scheduler task fails. It would also be nice if this integrated seamlessly into Jupyter notebooks and ran locally for testing.
I have tried dask, but it does not seem to queue the tasks; it runs them all at once with client.map(fun, [list of parameter sets]). Maybe there are better tools, or maybe this is a very niche problem. It's also unclear to me what the difference between dask, joblib, and ipyparallel is (having quickly tried all three of them at various stages).
Happy to give additional info if things are not clear.
UPDATE: So dask seems to provide some of the functionality I require, but dealing with an OpenMP-parallelized code in addition to dask is not straightforward; see issue https://github.com/dask/dask-jobqueue/issues/181
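For reference, a rough sketch of the dask-jobqueue route; the partition name, resource sizes, walltime, adaptive bounds, and the OMP_NUM_THREADS setting below are all assumptions to adapt to the actual site and code (and job_script_prologue is called env_extra on older dask-jobqueue versions):

from dask.distributed import Client, as_completed
from dask_jobqueue import SLURMCluster

# Each SLURM job hosts one worker process with 8 cores, so each call to
# fun gets a full OpenMP team to itself instead of oversubscribing.
cluster = SLURMCluster(
    queue='normal',                                      # assumed partition name
    cores=8,
    processes=1,                                         # one task at a time per job
    memory='16GB',
    walltime='01:00:00',
    job_script_prologue=['export OMP_NUM_THREADS=8'],    # env_extra on older versions
)
cluster.adapt(minimum=0, maximum=50)   # spawn/retire jobs with the workload

client = Client(cluster)
futures = client.map(fun, paramsets, calibdata=calibdata)  # fun/paramsets as above
for future in as_completed(futures):   # results arrive as tasks finish
    result = future.result()

The dask dashboard then shows per-task state; tracking what is already finished across scheduler failures still needs your own bookkeeping.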
Short context:
Our application has a backend written in Python. It contains a couple of REST API endpoints and message queue handling (RabbitMQ and Pika).
The reason why we use Python is that it is a Data Science/AI project, so a lot of the data processing requires some data science knowledge.
Problem:
Some parts of our application have CPU heavy calculations, and we are using multiprocessing to add parallelization.
However, we need to be careful, because each process starts a new interpreter and imports everything again.
The environment is Windows, so the process start method is "spawn".
Question:
What is the best way to maintain this? The team is big, so there is a chance that someone will put some big object creation or long-running processing function at module level, which then runs on application boot and is kept in memory while creating a pool of processes.
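One common guard, sketched here under the assumption that heavy state can be built per worker (BigModel is a hypothetical stand-in for an expensive object): keep expensive objects out of module scope and build them lazily in a Pool initializer, so that "spawn" re-imports stay cheap.

import time
from multiprocessing import Pool

class BigModel:
    def __init__(self):
        time.sleep(2)          # simulate slow construction
    def predict(self, x):
        return x * 2

_model = None                  # populated per worker process, never at import

def _init_worker():
    # Runs once in each worker; the parent process never builds the model.
    global _model
    _model = BigModel()

def work(x):
    return _model.predict(x)

if __name__ == '__main__':
    with Pool(processes=4, initializer=_init_worker) as pool:
        print(pool.map(work, range(10)))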
I am running a backend python-eve server with multiple functions being called to provide one service. I want to profile this Python backend server to find out which of the functionalities takes the most time to execute. I have heard of and used cProfile, but how do I profile a server that is continuously running? Moreover, I am using the PyCharm IDE to work with the Python code, so it would be beneficial if there were a way to do the profiling with PyCharm.
While I do not have direct experience with python-eve, I wrote pprofile (pypi) mainly to use on Zope (also a long-running process).
The basic idea (at least for processes using worker threads, like Zope) is to start pprofile in statistic mode, let it collect samples for a while (how long depends heavily on how busy the process is and on the level of detail you want in the profiling result), and finally to build a profiling result archive, which in my case contains both the profiling result and all executed source code (so I'm extra sure I'm annotating the correct version of the code).
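As a rough sketch, statistic-mode usage looks something like this (the sampling period is something to tune, and busy_function just stands in for the real workload):

import pprofile

def busy_function():
    # stand-in for the server work you want to observe
    total = 0
    for i in range(10 ** 6):
        total += i * i
    return total

prof = pprofile.StatisticalProfile()
with prof(period=0.001):    # sample roughly every millisecond
    busy_function()
prof.print_stats()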
I have so far not needed to do extra-long profiling sessions, like doing one query to start profiling and then a later query to stop and fetch the result, or even keeping the result server-side and browsing it somehow; how to do this will likely depend heavily on server details.
You can find the extra customisation for Zope in pprofile.zope (it allows fetching the Python source out of Zope's Python Scripts and TAL expressions, and beyond pure-Python profiling it also collects object-database loading durations and even the SQL queries run, to provide a broader picture), just to give you an idea of what can be done.
The challenge is to run a set of data processing and data science scripts that consume more memory than expected.
Here are my requirements:
Running 10-15 Python 3.5 scripts via Cron Scheduler
These 10-15 different scripts each take somewhere between 10 seconds and 20 minutes to complete
They run at different hours of the day; some of them run every 10 minutes while others run once a day
Each script logs what it has done so that I can take a look at it later if something goes wrong
Some of the scripts send e-mails to me and my teammates
None of the scripts have an HTTP/web server component; they all run on Cron schedules and not user-facing
All the scripts' code is fed from my GitHub repository; when scripts wake up, they first do a git pull origin master and then start executing. That means pushing to master causes all scripts to run the latest version.
Here is what I currently have:
Currently I am using 3 Digital Ocean servers (droplets) for these scripts
Some of the scripts require a huge amount of memory (I get segmentation fault in droplets with less than 4GB of memory)
I am willing to introduce a new script that might require even more memory (the new script currently faults in a 4GB droplet)
The setup of the droplets is relatively easy (thanks to Python venv), but not to the point of executing a single command to spin off a new droplet and set it up
Having a fully dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive.
What would be a more efficient way to handle this?
What would be a more efficient way to handle this?
I'll answer in three parts:
Options to reduce memory consumption
Minimalistic architecture for serverless computing
How to get there
(I) Reducing Memory Consumption
Some handle large loads of data
If you find the scripts use more memory than you expect, the only way to reduce the memory requirements is to
understand which parts of the scripts drive memory consumption
refactor the scripts to use less memory
Typical issues that drive memory consumption are:
using the wrong data structure - e.g. if you have numerical data, it is usually better to load the data into a numpy array as opposed to a plain Python list. If you create a lot of objects of custom classes, it can help to use __slots__
loading too much data into memory at once - e.g. if the processing can be split into several parts independent of each other, it may be more efficient to only load as much data as one part needs, then use a loop to process all the parts.
holding object references that are no longer needed - e.g. in the course of processing you create objects to represent or process some part of the data. If the script keeps a reference to such an object, it won't get released until the end of the program. One way around this is to use weak references; another is to use del to dereference objects explicitly. Sometimes it also helps to call the garbage collector.
using an offline algorithm when there is an online version (for machine learning) - e.g. some of scikit-learn's algorithms provide a version for incremental learning, such as LinearRegression => SGDRegressor or LogisticRegression => SGDClassifier (see the sketch just below)
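A minimal sketch of that incremental pattern (the file and column names are made up; the point is that only one chunk is in memory at a time):

import pandas as pd
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
for chunk in pd.read_csv('features.csv', chunksize=10_000):
    X = chunk.drop(columns=['target']).to_numpy()
    y = chunk['target'].to_numpy()
    model.partial_fit(X, y)   # updates the model without keeping old rows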
some do minor data science tasks
Some algorithms do require large amounts of memory. If using an online algorithm for incremental learning is not an option, the next best strategy is to use a service that only charges for the actual computation time/memory usage. That's what is typically referred to as serverless computing - you don't need to manage the servers (droplets) yourself.
The good news is that, in principle, the provider you use, Digital Ocean, offers a model that only charges for resources actually used. However, this is not really serverless: it is still your task to create, start, stop, and delete the droplets to actually benefit. Unless this process is fully automated, the fun factor is a bit low ;-)
(II) Minimalistic Architecture for Serverless Computing
a fully dedicated 8GB / 16GB droplet for my new script sounds a bit inefficient and expensive
Since your scripts run only occasionally / on a schedule, your droplet does not need to run or even exist all the time. So you could set this up in the following way:
Create a scheduling droplet. This can be of a small size. Its only purpose is to run a scheduler and to create a new droplet when a script is due, then submit the task for execution on this new worker droplet. The worker droplet can be of the specific size to accommodate the script, i.e. every script can have a droplet of whatever size it requires.
Create a generic worker. This is the program that runs upon creation of a new droplet by the scheduler. It receives as input the URL of the git repository where the actual script to be run is stored, and a location to store results. It then checks out the code from the repository, runs the script, and stores the results.
Once the script has finished, the scheduler deletes the worker droplet.
With this approach there are still fully dedicated droplets for each script, but they only cost money while the script runs.
(III) How to get there
One option is to build an architecture as described above, which would essentially be an implementation of a minimalistic architecture for serverless computing. There are several Python libraries to interact with the Digital Ocean API. You could also use libcloud as a generic multi-provider cloud API to make it easy(ier) to switch providers later on.
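To make the scheduler side concrete, a rough sketch with the python-digitalocean client; the token, droplet name, region, image, size, and bootstrap script below are all placeholders:

import digitalocean

worker = digitalocean.Droplet(
    token='YOUR_API_TOKEN',
    name='worker-big-script',
    region='fra1',
    image='ubuntu-20-04-x64',
    size_slug='s-4vcpu-8gb',                 # sized per script
    user_data=open('bootstrap.sh').read(),   # clones the repo and runs the script
)
worker.create()
# ... later, once the worker reports completion:
worker.destroy()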
Perhaps the better alternative before building it yourself is to evaluate one of the existing open-source serverless options. An extensive curated list is provided by the good fellows at awesome-serverless. Note that at the time of writing, many of the open-source projects are still in their early stages; the more mature options are commercial.
As always with engineering decisions, there is a trade-off between the time/cost required to build or host yourself vs. the cost of using a readily available commercial service. Ultimately that's a decision only you can take.
Newbie-ish Python/pandas user here. I've been playing with using the chunksize arg in read_fwf and iterating value_counts of variables. I wrote a function that takes args such as the file iterator and the variables to parse and count. I was hoping to parallelize this function and read 2 files at the same time into the same function.
It does appear to work... However, I'm getting unexpected slowdowns. The threads finish at the same time, but one seems to be slowing the other down (IO bottleneck?). I'm getting faster times by running the functions sequentially rather than in parallel (324 secs vs 172 secs). Ideas? Am I executing this wrong? I've tried multiprocessing, but starmap errors out because I can't pickle the file iterator (the output of read_fwf).
import queue
import threading

import pandas as pd

# wlist, nlist and flter1 are defined elsewhere: column specs, column
# names and a query string specific to the data files.
testdf1 = pd.read_fwf('200k.dat', header=None, colspecs=wlist, names=nlist,
                      dtype=object, na_values=[''], chunksize=1000)
testdf2 = pd.read_fwf('200k2.dat', header=None, colspecs=wlist, names=nlist,
                      dtype=object, na_values=[''], chunksize=1000)

def tfuncth(df, varn, q, *args):
    # Accumulate value counts per variable across all chunks.
    td = {key: pd.Series(dtype=object) for key in varn}
    for rdf in df:
        for arg in args:            # apply optional row filters, if any
            rdf = rdf.query(arg)
        for key in varn:
            # varn values are pandas expressions such as 'e1270.str.slice(0,2)',
            # so they are evaluated against the current chunk.
            counts = eval(f'rdf.{varn[key]}.value_counts()')
            td[key] = pd.concat([td[key], counts])
            td[key] = td[key].groupby(td[key].index).sum()
    for key in varn:
        td[key] = (pd.DataFrame(td[key].reset_index())
                   .rename(columns={'index': 'Value', 0: 'Counts'})
                   .assign(Var=key,
                           PCT=lambda x: round(x.Counts / x.Counts.sum() * 100, 2))
                   [['Var', 'Value', 'Counts', 'PCT']])
    q.put(td)

bands = {'1': 'A', '2': 'B', '3': 'C', '4': 'D', '5': 'E',
         '6': 'F', '7': 'G', '8': 'H', '9': 'I'}

vdict = {
    'var1': 'e1270.str.slice(0,2)',
    'var2': 'e1270.str.slice(2,3)',
    'band': 'e7641.str.slice(0,1).replace(bands)',
}

my_q1 = queue.Queue()
my_q2 = queue.Queue()

thread1 = threading.Thread(target=tfuncth, args=(testdf1, vdict, my_q1, flter1))
thread2 = threading.Thread(target=tfuncth, args=(testdf2, vdict, my_q2))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
res1, res2 = my_q1.get(), my_q2.get()   # collect the per-file results
UPDATE:
After much reading, this is the conclusion I've come to. I'm sure it is an extremely simplified conclusion, so if someone knows otherwise, please inform me.
Pandas is not a fully multi-thread friendly package
Apparently there's a package called 'dask' that is, and it replicates a lot of pandas functions. So I'll be looking into that.
Python is not truly a multi-threading-compatible language in many cases.
Python is bound by its interpreter: pure Python is interpreted by CPython and bound by the GIL, which allows execution of only one thread at a time.
Multiple threads can be spun off, but they will only parallelize non-CPU-bound work.
My code is a mix of IO and CPU work. The simple IO is probably running in parallel but getting held up waiting on the processor for execution.
I plan to test this out by writing IO only operations and attempting threading.
There are Python implementations other than CPython (e.g. Jython, IronPython) that don't have a global interpreter lock (GIL) on threads, and C extensions such as numpy can release the GIL while they compute.
That is how packages such as 'dask' can utilize multi-threading.
I did manage to get this to work and fix my problems by using the multiprocessing package. I ran into two issues.
1) the multiprocessing package is not compatible with Jupyter Notebook
and
2) you can't pickle a handle to a pandas reader (multiprocessing pickles objects passed to the processes).
I fixed 1 by coding outside the Notebook environment, and I fixed 2 by passing each process the arguments needed to open a chunked file, so each process starts its own chunked read.
After doing those two things I was able to get a 60% increase in speed over sequential runs.
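For illustration, a minimal sketch of that fix (read_csv, the file paths, and the column name stand in for the real read_fwf setup): only picklable paths cross the process boundary, and each process builds its own chunked reader.

from multiprocessing import Pool
import pandas as pd

def count_values(path):
    # Each process opens its own chunked reader; only the path is pickled.
    counts = pd.Series(dtype='float64')
    for chunk in pd.read_csv(path, chunksize=1000):
        counts = counts.add(chunk['col'].value_counts(), fill_value=0)
    return counts

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        results = pool.map(count_values, ['200k.dat', '200k2.dat'])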
I have a long-running twisted server.
In a large system test, at one particular point several minutes into the test, when some clients enter a particular state and a particular outside event happens, then this server takes several minutes of 100% CPU and does its work very slowly. I'd like to know what it is doing.
How do you get a profile for a particular span of time in a long-running server?
I could easily send the server start and stop messages via HTTP, if there were a way to enable or inject the profiler at runtime.
Given the choice, I'd like stack-based/call-graph profiling but even leaf sampling might give insight.
The yappi profiler can be started and stopped at runtime.
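For illustration, a minimal sketch of driving yappi at runtime (wiring these functions into HTTP handlers is server-specific, and the output file name is arbitrary):

import yappi

def start_profiling():
    yappi.start()

def stop_profiling():
    yappi.stop()
    stats = yappi.get_func_stats()
    stats.save('profile.callgrind', type='callgrind')  # inspect with kcachegrind
    yappi.clear_stats()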
There are two interesting tools that try to solve this specific problem, where you might not necessarily have instrumented your code for profiling in advance but want to profile production code in a pinch.
pyflame will attach to an existing process using the ptrace(2) syscall and create "flame graphs" of the process. It's written in C++.
py-spy works by reading the process memory instead and figuring out the Python call stack. It also provides a flame graph, plus a "top-like" interface to show which functions are taking the most time. It's written in Rust.
Not a very Pythonic answer, but maybe strace-ing the process gives some insight (assuming you are on Linux or similar).
Using strictly Python: for such things I trace all calls, store their results in a ring buffer, and use a signal (which you could perhaps trigger via your HTTP message) to dump that ring buffer. Of course, tracing slows everything down, but in your scenario you could switch the tracing on via an HTTP message as well, so it is only enabled while your trouble is active.
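A rough sketch of that ring-buffer idea (the buffer size and signal choice are arbitrary; worker threads would additionally need threading.settrace):

import signal
import sys
from collections import deque

ring = deque(maxlen=10000)  # keep only the most recent calls

def tracer(frame, event, arg):
    if event == 'call':
        code = frame.f_code
        ring.append(f'{code.co_filename}:{frame.f_lineno} {code.co_name}')
    return None  # no per-line tracing, to keep overhead lower

def dump(signum, frame):
    sys.stderr.write('\n'.join(ring) + '\n')

signal.signal(signal.SIGUSR1, dump)  # kill -USR1 <pid> dumps the buffer
sys.settrace(tracer)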
Pyliveupdate is a tool designed for exactly this purpose: profiling long-running programs without restarting them. It allows you to dynamically select specific functions to profile, or to stop profiling, without instrumenting your code ahead of time -- it instruments the code dynamically to do the profiling.
Pyliveupdate has three key features:
Profile specific Python functions' call time (by function names or module names).
Add/remove profiling without restarting the program.
Show profiling results with call summaries and flamegraphs.
Check out a demo here: https://asciinema.org/a/304465.