I know there are several tutorials out there for multiprocessing in Python.
Anyway, I just want to ask about my specific use case.
So here is my setup:
I have a machine with 8 physical cores and 32 GB RAM and
I have a pickled dict with 8 GB of text holding 3,000,000 entries.
Now the question is, how would I get the most juice out of my CPU?
data = pickle.load(handle)
dict_list = split_dict_equally(data, 16) # split data into 16 equal chunks
with Pool(16) as pool:
    data = pool.map(func, dict_list)
For each element in the dict, an external package is used to send requests, which is the time-intensive part I want to speed up.
If I use the setup defined above on a smaller sample (100 entries), I am able to reduce the execution time of the script from 72 seconds to 6 seconds. However, I doubt that the same setup reduces execution time for the large dataset.
The reason I believe this is that when I look in Activity Monitor, only just under 10% of the CPU is being used. The question is therefore: what is the optimal split size of the dict relative to the number of processes defined in the Pool instance, given the large pickle file?
I'd really appreciate it if you could help me speed it up, thanks.
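For what it's worth, here is a minimal sketch of the same idea that ties the pool size to the core count and creates more chunks than workers, so the pool can balance uneven chunks (split_dict_equally and func below are simple stand-ins for the question's own helpers, and the pickle path is made up):

import pickle
from multiprocessing import Pool, cpu_count

def split_dict_equally(d, n):
    # simple stand-in for the question's helper: split a dict into n roughly equal chunks
    items = list(d.items())
    step = -(-len(items) // n)  # ceiling division
    return [dict(items[i:i + step]) for i in range(0, len(items), step)]

def func(chunk):
    # placeholder for the per-chunk work (the external requests in the question)
    return {key: len(value) for key, value in chunk.items()}

if __name__ == '__main__':
    with open('data.pickle', 'rb') as handle:  # hypothetical path
        data = pickle.load(handle)

    n_workers = cpu_count()  # note: counts logical cores, not only the 8 physical ones
    # more chunks than workers lets the pool balance work if chunks are uneven
    chunks = split_dict_equally(data, n_workers * 4)

    with Pool(n_workers) as pool:
        results = pool.map(func, chunks)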
tl;dr
I want to
dd.read_parquet('*.parq')['column'].nunique().compute()
but I get
WARNING - Worker exceeded 95% memory budget. Restarting
a couple of times before the workers get killed altogether.
Long version
I have a dataset with
10 billion rows,
~20 columns,
and a single machine with around 200GB memory. I am trying to use dask's LocalCluster to process the data, but my workers quickly exceed their memory budget and get killed even if I use a reasonably small subset and try using basic operations.
I have recreated a toy problem demonstrating the issue below.
Synthetic data
To approximate the problem above on a smaller scale, I will create a single column with 32-character ids with
a million unique ids
total length of 200 million rows
split into 100 parquet files
The result will be
100 files, 66 MB each, taking 178 MB each when loaded as a pandas DataFrame (estimated by df.memory_usage(deep=True).sum())
If loaded as a single pandas DataFrame, all the data takes about 20 GB in memory
A single Series with all ids (which is what I assume the workers also have to keep in memory when computing nunique) takes about 90MB
import string
import os
import numpy as np
import pandas as pd
chars = string.ascii_letters + string.digits
n_total = int(2e8)
n_unique = int(1e6)
# Create random ids
ids = np.sum(np.random.choice(np.array(list(chars)).astype(object), size=[n_unique, 32]), axis=1)
outputdir = os.path.join('/tmp', 'testdata')
os.makedirs(outputdir, exist_ok=True)
# Sample from the ids to create 100 parquet files
for i in range(100):
    df = pd.DataFrame(np.random.choice(ids, n_total // 100), columns=['id'])
    df.to_parquet(os.path.join(outputdir, f'test-{str(i).zfill(3)}.snappy.parq'), compression='snappy')
Attempt at a solution
Let's assume that my machine only has 8 GB of memory. Since the partitions take about 178 MB and the result about 90 MB, according to Wes McKinney's rule of thumb I might need up to 2-3 GB of memory. Therefore, either
n_workers=2, memory_limit='4GB', or
n_workers=1, memory_limit='8GB'
seems like a good choice. Sadly, when I try it, I get
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
a couple of times, before the worker(s) get killed altogether.
import os
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
cluster = LocalCluster(n_workers=4, memory_limit='6GB')
client = Client(cluster)
dd.read_parquet(os.path.join('/tmp', 'testdata', '*.parq'))['id'].nunique().compute()
In fact, it seems that with 4 workers, for example, they each need 6 GB of memory before being able to perform the task.
Can this situation be improved?
That's a great example of a recurring problem. The only shocking thing is that delayed was not used during the synthetic data creation:
import dask
@dask.delayed
def create_sample(i):
    df = pd.DataFrame(np.random.choice(ids, n_total // 100), columns=['id'])
    df.to_parquet(os.path.join(outputdir, f'test-{str(i).zfill(3)}.snappy.parq'), compression='snappy')
    return
# Sample from the ids to create 100 parquet files
dels = [create_sample(i) for i in range(100)]
_ = dask.compute(dels)
For the following answer I will actually just use a small number of partitions (so change to range(5)), to have sane visualizations. Let's start with the loading:
df = dd.read_parquet(os.path.join('/tmp', 'testdata', '*.parq'), columns=['id'])
print(df.npartitions) # 5
This is a minor point, but passing columns=['id'] to .read_parquet() exploits the parquet advantage of columnar extraction (dask might do some optimization behind the scenes, but if you know the columns you want, there's no harm in being explicit).
Now, when you run df['id'].nunique(), here's the DAG that dask will compute:
With more partitions there would be more steps, but it's apparent that there's a potential bottleneck when each partition is trying to send data that is quite large. This data can be very large for high-cardinality columns, so if each partition's result is an object of around 100 MB, then the receiving worker has to have roughly 5 times that much memory to accept the data (which could potentially shrink after further value-counting).
An additional consideration is how many tasks a single worker can run at a given time. The easiest way to control how many tasks run at the same time on a given worker is resources. If you initiate the cluster with resources:
cluster = LocalCluster(n_workers=2, memory_limit='4GB', resources={'foo': 1})
Then every worker has the specified resources (in this case it's 1 unit of arbitrary foo), so if you think that processing a single partition should happen one at a time (due to high memory footprint), then you can do:
# note, no split_every is needed in this case since we're just
# passing a single number
df['id'].nunique().compute(resources={'foo': 1})
This will ensure that any single worker is busy with 1 task at a time, preventing excessive memory usage. (side note: there's also .nunique_approx(), which may be of interest)
To control the amount of data that any given worker receives for further processing, one approach is to use the split_every option. Here's what the DAG will look like with split_every=3:
You can see that now (for this number of partitions), the maximum memory that a worker will need is 3 times the maximum size of a single partition's result. So depending on your worker memory settings, you might want to set split_every to a low value (2, 3, 4 or so).
In general, the more unique values the variable has, the more memory is needed for each partition's object with unique counts, so a lower value of split_every is useful to put a cap on the maximum memory usage. If the variable is not very unique, then each individual partition's unique count will be a small object, and there's no need to have a split_every restriction.
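For completeness, here is a minimal sketch of the split_every suggestion (the value 3 matches the DAG example above; tune it to your worker memory budget):

import os
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, memory_limit='4GB')
client = Client(cluster)

df = dd.read_parquet(os.path.join('/tmp', 'testdata', '*.parq'), columns=['id'])

# reduce at most 3 partition-level results at a time, capping the size of the
# intermediate object any single worker has to hold
n_unique_ids = df['id'].nunique(split_every=3).compute()

# approximate alternative with a much smaller memory footprint
n_unique_ids_approx = df['id'].nunique_approx().compute()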
The dataset consists of two folders, each of which has 180 CSV files. My Python script reads one CSV file from each of the two folders per iteration, for 180 loops.
After reading the data and some data processing, the data from the two CSV files is merged into one list of tuples shaped like [(1,1), (1,2), (1,3), ...]. There would be about 3,000,000,000 tuples in the list.
I first revised some parameters of postgresql.conf (listed below) and tried to insert the whole list into the DB at once with executemany; that took around 30,000 seconds per loop, which would take me two months to write everything out.
Then I tried to split the list into 30 sub-lists and insert them one after another. Each sub-list took about 230 seconds, which is much faster, but it would still cost 15 days.
I will probably use multiprocessing since I have more RAM and CPUs to use (one process uses 35 GB of memory, so I guess I will try three processes). But I would like to ask: can I somehow optimize the DB parameters for better insert performance in this step? The PostgreSQL documentation does not say much about parameter tuning, except that setting shared_buffers to 25% of RAM should enhance performance.
Thanks in advance!
My current parameters that differ from the defaults:
shared_buffers = 10 GB
temp_buffers = 1GB
max_wal_size = 1GB
effective_cache_size = 20GB
work_mem = 2GB
maintenance_work_mem = 4GB
Environment: Red Hat, RAM: 150 GB, number of cores: 40
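For reference, the batched executemany insert described above looks roughly like this (a sketch only; psycopg2 is assumed as the driver, and the connection details, table, and column names are made up):

import psycopg2

# hypothetical connection and table layout
conn = psycopg2.connect(dbname='mydb', user='me', password='secret', host='localhost')
cur = conn.cursor()

def insert_in_batches(pairs, batch_size=100_000):
    # insert (a, b) tuples in fixed-size batches instead of one huge executemany call
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        cur.executemany('INSERT INTO my_table (a, b) VALUES (%s, %s)', batch)
        conn.commit()  # committing per batch keeps each transaction bounded

# pairs would be the merged list of tuples built from the two CSV files
# insert_in_batches(pairs)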
I have a question pertaining to the scheduling/execution order of tasks in dask.distributed for the case of strong data reduction of a large raw dataset.
We are using dask.distributed for a code which extracts information from movie frames. Its specific application is crystallography, but quite generally the steps are:
Read frames of a movie stored as a 3D array in an HDF5 file (or a few thereof, which are concatenated) into a dask array. This is obviously quite I/O-heavy.
Group these frames into consecutive sub-stacks of typically 10 movie stills, the frames of which are aggregated (summed or averaged), resulting in a single 2D image.
Run several, computationally heavy, analysis functions on the 2D image (such as positions of certain features), returning a dictionary of results, which is negligibly small compared to the movie itself.
We implement this by using the dask.array API for steps 1 and 2 (the latter using map_blocks with a block/chunk size of one or a few of the aggregation sub-stacks), then converting the array chunks to dask.delayed objects (using to_delayed), which are passed to a function doing the actual data reduction (step 3). We take care to properly align the chunks of the HDF5 arrays, the dask computation and the aggregation ranges in step 2, such that the task graph of each final delayed object (elements of tasks) is very clean. Here's the example code:
import h5py
import numpy as np
import dask.array as da
from dask import delayed
from dask.distributed import Client

def sum_sub_stacks(mov):
    # aggregation function
    sub_stk = []
    for k in range(mov.shape[0]//10):
        sub_stk.append(mov[k*10:k*10+10, ...].sum(axis=0, keepdims=True))
    return np.concatenate(sub_stk)

def get_info(mov):
    # reduction function
    results = []
    for frame in mov:
        results.append({
            'sum': frame.sum(),
            'variance': frame.var()
            # ...actually much more complex/expensive stuff
        })
    return results

# connect to dask.distributed scheduler
client = Client(address='127.0.0.1:8786')

# 1: get the movie
fh = h5py.File('movie_stack.h5', 'r')
movie = da.from_array(fh['/entry/data/raw_counts'], chunks=(100, -1, -1))

# 2: sum sub-stacks within movie
movie_aggregated = movie.map_blocks(sum_sub_stacks,
                                    chunks=(10,) + movie.chunks[1:],
                                    dtype=movie.dtype)

# 3: create and run reduction tasks
tasks = [delayed(get_info)(chk)
         for chk in movie_aggregated.to_delayed().ravel()]
info = client.compute(tasks, sync=True)
The ideal scheduling of operations would clearly be for each worker to perform the 1-2-3 sequence on a single chunk and then move on to the next, which would keep I/O load constant, CPUs maxed out and memory low.
What happens instead is that first all workers try to read as many chunks as possible from the files (step 1), which creates an I/O bottleneck and quickly exhausts the worker memory, causing thrashing to the local drives. Often, at some point, workers eventually move on to steps 2/3, which quickly frees up memory and properly uses all CPUs, but in other cases workers get killed in an uncoordinated way or the entire computation stalls. There are also intermediate cases where the surviving workers behave reasonably only for a while.
Is there any way to give hints to the scheduler to process the tasks in the preferred order as described above or are there other means to improve the scheduling behavior? Or is there something inherently stupid about this code/way of doing things?
First, there is nothing inherently stupid about what you are doing at all!
In general, Dask tries to reduce the number of temporaries it is holding onto, and it balances this against parallelizability (width of the graph and the number of workers). Scheduling is complex, and there is yet another optimization Dask uses, which fuses tasks together to make them more efficient. With lots of little chunks you may run into issues: https://docs.dask.org/en/latest/array-best-practices.html?highlight=chunk%20size#select-a-good-chunk-size
Dask does have a number of optimization configurations which I would recommend playing with after considering other chunksizes. I would also encourage you to read through the following issue as there is a healthy discussion around scheduling configurations.
Lastly, you might consider additional memory configuration for your workers, as you may want to control more tightly how much memory each worker should use.
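For example, the worker memory thresholds can be tightened when the workers are created; here is a sketch using a LocalCluster for brevity, with the same config keys applying to separately launched workers (the fractions below are illustrative, not recommendations):

import dask
from dask.distributed import Client, LocalCluster

# lower the fractions of the memory limit at which workers spill, pause, and are killed
dask.config.set({
    'distributed.worker.memory.target': 0.5,     # start spilling results to disk
    'distributed.worker.memory.spill': 0.6,      # spill more aggressively
    'distributed.worker.memory.pause': 0.7,      # stop accepting new tasks
    'distributed.worker.memory.terminate': 0.9,  # the nanny restarts the worker
})

cluster = LocalCluster(n_workers=4, memory_limit='4GB')
client = Client(cluster)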
I have a Python application that processes in-memory data. It provides a <1 second response by querying ~1 million records and then aggregating the result set.
What would be the best Python framework(s) to make this application more scalable?
Here are more details :
Data comes from a single table on disk, which is loaded into memory as numpy arrays plus custom indexes built with dictionaries.
This application starts breaching the 1 second time limit when the number of records grows beyond 5 million. The search part (locating the indexes) takes only about 100 ms; I see that most of the time (900 to 2,000 ms) is spent just summing up the result set.
I can also see that the CPU cores and RAM are not used to their full capacity: each core is used only up to about 20%, and plenty of memory is free.
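For context, here is a rough, illustrative sketch of the shape of this workload (all names and sizes below are made up, not the actual application code):

import numpy as np

n_records = 5_000_000
values = np.random.rand(n_records)  # stand-in for one numeric column of the table

# stand-in for a custom dictionary index mapping a key to matching row positions
index = {'some_key': np.random.randint(0, n_records, size=1_000_000)}

rows = index['some_key']    # locating the rows: fast (~100 ms in the question)
total = values[rows].sum()  # aggregating the result set: where most of the time goes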
I just read through a long list of Python frameworks for distributed computing. I am looking for specific solutions that deliver real-time responses, by:
making better use of the available CPU and RAM on a single machine through parallel processing, to stay within the <1 second response time;
later, extending beyond a single machine to support even ~100 million records. This data is in a single table/file that can be horizontally partitioned across many machines, and these machines can work independently on their own data.
Suggestions based on what you have seen working in your past experience are greatly appreciated.
I am using Gensim's LDAMulticore to perform LDA. I have around 28M small documents (around 100 characters each).
I have set the workers argument to 20, but top shows it using only 4 processes. There are some discussions suggesting that it might be slow at reading the corpus, such as:
gensim LdaMulticore not multiprocessing?
https://github.com/piskvorky/gensim/issues/288
But both of them use MmCorpus, whereas my corpus is completely in memory. I have a machine with very large RAM (250 GB), and loading the corpus into memory takes around 40 GB. But even after that, LdaMulticore is using just 4 processes. I created the corpus as:
corpus = [dictionary.doc2bow(text) for text in texts]
I am not able to understand what the limiting factor might be here.
I would check what batch size you use.
I found that in cases where batch size × n_workers is greater than the number of documents, I cannot utilize all the available workers.
This makes sense, as you give each worker a number of docs per pass; you might "starve" some of them if the batch value is not considered.
I am not sure this solves your specific problem, but it is indeed the reason many people have mentioned that the multicore version does not "work" as expected in terms of multiprocessing.
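A small sketch of that point: keep chunksize (gensim's batch size) times workers well below the number of documents, so every worker gets work on each pass (corpus and dictionary are the ones built in the question; the values are illustrative):

from gensim.models import LdaMulticore

n_docs = len(corpus)   # ~28M documents in the question
workers = 20

# choose a chunk size so that workers * chunksize stays well below n_docs,
# giving every worker several chunks per pass
chunksize = min(2000, max(1, n_docs // (workers * 10)))

lda = LdaMulticore(corpus=corpus,
                   id2word=dictionary,
                   num_topics=100,
                   workers=workers,
                   chunksize=chunksize)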