Synchronize ProcessPoolExecutor - python

I am trying to run some very large time-series data using concurrent.futures.ProcessPoolExecutor(). The dataset contains multiple time series (that are independent). The entire dataset is available in a list of tuples data that I pass through a helper function as follows:
import concurrent.futures

def help_func(daa):
    large_function(daa[0], daa[1], daa[2])

with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(help_func, data, chunksize=1)
Now, although the different time series contained in data are independent of one another, the nature of time-series data means that the values within a single series have to be handled one after the other. By ordering the data variable by time series, I make sure that map always issues the calls in time order.
With executor.map I cannot figure out a way to always map a particular time series to the same core, or to somehow share the state from previous calls with a process running on a new core.
With the current setup, whenever the processing for a particular timestamp is called on a new core, it starts from the initialization step.
Is there any elegant solution to this issue?
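One possible workaround, sketched below under the assumption that the first element of each tuple identifies its series (a hypothetical convention; adapt it to however your tuples are keyed): submit each whole series as a single task instead of one task per timestamp, so every value of a series is processed in order inside the same worker and per-series state never has to cross process boundaries.

import concurrent.futures
from collections import defaultdict

# Group the flat list of tuples by series; assumes (hypothetically) that the
# first element of each tuple identifies which time series it belongs to.
series = defaultdict(list)
for item in data:
    series[item[0]].append(item)

def process_series(items):
    # All timestamps of one series are handled in order inside a single worker
    # process, so any per-series state built up by large_function can be reused.
    for item in items:
        large_function(item[0], item[1], item[2])

with concurrent.futures.ProcessPoolExecutor() as executor:
    list(executor.map(process_series, series.values()))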

Related

Multiprocessing without copying data in Python

I have a VERY large data structure on which I need to run multiple functions, none of which mutate it (so there is no risk of a race condition). I simply want to get the results faster by running these functions in parallel, for example computing percentile values for a large data set.
How would I achieve this with multiprocessing, without having to create a copy of the data each time a process starts, which would end up making things slower than if I hadn't bothered in the first place?
(The absence of a code example is on purpose, as I don’t think the details of the data structure and functions are in any way relevant.)
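One common approach, sketched below under the assumption of a POSIX platform where the 'fork' start method is available: build the structure before the pool is created, and let child processes read it through copy-on-write instead of pickling it. The array here is only a stand-in for the real structure.

import multiprocessing as mp
import numpy as np

# Stand-in for the very large read-only structure; built before the pool starts.
big_data = np.random.rand(10_000_000)

def percentile(q):
    # With the 'fork' start method the child inherits big_data copy-on-write,
    # so nothing is pickled and (as long as it is only read) nothing is copied.
    return np.percentile(big_data, q)

if __name__ == '__main__':
    ctx = mp.get_context('fork')          # POSIX only; not available on Windows
    with ctx.Pool(processes=4) as pool:
        print(pool.map(percentile, [25, 50, 75, 99]))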

Read only Pandas dataset in Dask Distributed

TL;DR
I want to allow workers to use a scattered Pandas Dataframe, but not allow them to mutate any data. Look below for sample code. Is this possible? (or is this a pure Pandas question?)
Full question
I'm reading a Pandas Dataframe, and scattering it to the workers. I then use this future when I submit new tasks, and store it in a Variable for easy access.
Sample code:
import pyarrow.parquet as pq
from dask.distributed import Client, Variable

client = Client()  # connect to (or start) a Dask cluster
df = pq.read_table('/data/%s.parq' % dataset).to_pandas()
df = client.scatter(df, broadcast=True, direct=True)
v = Variable('dataset')
v.set(df)
When I submit a job I use:
def top_ten_x_based_on_y(dataset, column):
    return dataset.groupby(dataset['something'])[column].mean().sort_values(ascending=False)[0:10].to_dict()
a = client.submit(top_ten_x_based_on_y, df, 'a_column')
Now, I want to run 10-20 QPS on this dataset, which all workers have in memory (data < RAM), but I want to protect against accidental changes to the dataset, such as one worker "corrupting" its own copy, which could lead to inconsistencies. Preferably, trying to modify it would raise an exception.
The data set is roughly 2GB.
I understand this might be problematic since a Pandas DataFrame itself is not immutable (although a NumPy array can be made so).
Other ideas:
Copy the dataset on each query, but a 2GB copy takes time even in RAM (roughly 1.4 seconds)
Devise a way to hash a dataframe (problematic in itself, even though hash_pandas_object now exists), and check before and after (or every minute) whether the dataframe is still as expected. Running hash_pandas_object takes roughly 5 seconds.
Unfortunately Dask currently offers no additional features on top of Python to avoid mutation in this way. Dask just runs Python functions, and those Python functions can do anything they like.
Your suggestions of copying or checking before running operations seem sensible to me.
You might also consider raising this as a question or feature request to Pandas itself.
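If you want to experiment with the NumPy angle mentioned above, a minimal sketch follows; note that pandas may copy the array during DataFrame construction, which would silently drop the protection, so this is not a guarantee.

import numpy as np
import pandas as pd

arr = np.arange(6, dtype='float64').reshape(3, 2)
arr.flags.writeable = False            # in-place writes to arr now raise ValueError

df = pd.DataFrame(arr, columns=['a', 'b'])
# If pandas kept a view of arr, assignments like df.iloc[0, 0] = 1 fail with
# "assignment destination is read-only"; if it made a copy, they silently succeed.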

mapping a function of variable execution time over a large collection with Dask

I have a large collection of entries E and a function f: E --> pd.DataFrame. The execution time of function f can vary drastically for different inputs. Finally all DataFrames should be concatenated into a single DataFrame.
The situation I'd like to avoid is a partitioning (using 2 partitions for the sake of the example) where accidentally all fast function executions happen on partition 1 and all slow executions on partition 2, thus not optimally using the workers.
partition 1:
[==][==][==]
partition 2:
[============][=============][===============]
--------------------time--------------------->
My current solution is to iterate over the collection of entries and create a Dask graph using delayed, aggregating the delayed partial DataFrame results in a final result DataFrame with dd.from_delayed.
from dask import delayed
from dask.dataframe import from_delayed
from dask.dataframe.utils import make_meta

delayed_dfs = []
for e in collection:
    delayed_partial_df = delayed(f)(e, arg2, ...)
    delayed_dfs.append(delayed_partial_df)

result_df = from_delayed(delayed_dfs, meta=make_meta({..}))
I reasoned that the Dask scheduler would take care of optimally assigning work to the available workers.
Is this a correct assumption?
Would you consider the overall approach reasonable?
As mentioned in the comments above, yes, what you are doing is sensible.
The tasks will be assigned to workers initially, but if some workers finish their allotted tasks before others then they will dynamically steal tasks from those workers with excess work.
Also as mentioned in the comments, you might consider using the diagnostic dashboard to get a good sense of what the scheduler is doing. All of the information about worker load, work stealing, etc. is easily viewable.
http://distributed.readthedocs.io/en/latest/web.html
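For completeness, a minimal sketch of a local setup that serves the dashboard (assuming the optional bokeh dependency is installed):

from dask.distributed import Client

# Creating a Client with no arguments starts a LocalCluster; the diagnostic
# dashboard is then typically served at http://localhost:8787/status
client = Client()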

Can I .set_index() lazily ( or to be executed concurrently ), on Dask Dataframes?

tl;dr: Is it possible to call the .set_index() method on several Dask DataFrames in parallel, concurrently? Alternatively, is it possible to call .set_index() lazily on several Dask DataFrames, which would consequently lead to the indexes being set in parallel?
Here is the scenario:
I have several time series
Each time series is stored in several .csv files. Each file contains data related to a specific day. Also, the files are scattered amongst different folders (each folder contains data for one month)
Each time series has different sampling rates
All time series have the same columns. All have a column which contains DateTime, amongst others.
Data is too large to be processed in memory. That's why I am using Dask.
I want to merge all the time series into a single DataFrame, aligned by DateTime. For this, I first need to .resample() each time series to a common sampling rate, and then .join() them all.
.resample() can only be applied to an index. Hence, before resampling I need to .set_index() on the DateTime column of each time series.
When I call the .set_index() method on one time series, computation starts immediately, which leaves my code blocked and waiting. At that moment, if I check my machine's resource usage, I can see that many cores are being used but usage does not go above ~15%. This makes me think that, ideally, I could have the .set_index() method applied to more than one time series at the same time.
After reaching the above situation, I've tried some not elegant solutions to parallelize application of .set_index() method on several time series (e.g. create a multiprocessing.Pool ), which were not successful. Before giving more details on those, is there a clean way on how to solve the situation above? Was the above scenario thought at some point when implementing Dask?
Alternatively, is it possible to .set_index() lazily? If the .set_index() method could be applied lazily, I would create a full computation graph with the steps described above, and in the end everything would be computed in parallel (I think).
Dask.dataframe needs to know the min and max values of all of the partitions of the dataframe in order to sensibly do datetime operations in parallel. By default it will read the data once in order to find good partitions. If the data is not sorted it will then do a shuffle (perhaps very expensive) to sort it.
In your case it sounds like your data is already sorted and that you might be able to provide these explicitly. You should look at the last example in the dd.DataFrame.set_index docstring:
A common case is when we have a datetime column that we know to be
sorted and is cleanly divided by day. We can set this index for free
by specifying both that the column is pre-sorted and the particular
divisions along which it is separated
>>> import pandas as pd
>>> divisions = pd.date_range('2000', '2010', freq='1D')
>>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions) # doctest: +SKIP
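Applied to the scenario above, a rough sketch might look like the following; the file pattern, column name, and date range are hypothetical, and it assumes each series' files are already in time order and that divisions lines up with the partition boundaries (npartitions + 1 values):

import pandas as pd
import dask.dataframe as dd

# One Dask DataFrame per series; parse the timestamp column while reading.
ts = dd.read_csv('series_a/2017-*/*.csv', parse_dates=['DateTime'])

# Known, day-aligned partition boundaries: len(divisions) must be npartitions + 1.
divisions = list(pd.date_range('2017-01-01', '2018-01-01', freq='1D'))

# sorted=True tells Dask the column is already ordered, so no shuffle (and no
# eager pass over the data) is needed; the graph stays lazy until .compute().
ts = ts.set_index('DateTime', sorted=True, divisions=divisions)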

Large Pandas Dataframe parallel processing

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.
Eg.
from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())
Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).
The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow, and it also requires many times the memory of the original DataFrame.
One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to select subsets of data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.
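A hedged sketch of that route, reusing the id column and another_function from the question (the file and key names are hypothetical, and the table format requires the optional PyTables dependency):

import pandas as pd
from joblib import Parallel, delayed

# Write once, in the queryable 'table' format, and index the id column.
df.to_hdf('data.h5', key='data', format='table', data_columns=['id'])

def process(one_id):
    # Each worker pulls only its own rows from disk instead of receiving a
    # pickled copy of the whole DataFrame.
    temp_df = pd.read_hdf('data.h5', key='data', where='id == %r' % one_id)
    return temp_df.apply(another_function)

results = Parallel(n_jobs=8)(delayed(process)(i) for i in df['id'].unique())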
An alternative would be to explore numba.vectorize with target='parallel'. This would require the use of NumPy arrays not Pandas objects, so it also has some complexity costs.
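A hedged sketch of the numba route; the column name comes from the question's query, and the per-element arithmetic is only a stand-in for the real work:

import numba

@numba.vectorize(['float64(float64)'], target='parallel')
def transform(x):
    # Element-wise work compiled by numba and run across multiple threads.
    return x * 2.0 + 1.0

result = transform(df['a_lot_of_data'].values)   # operates on the NumPy array, not the DataFrame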
In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.
Python multiprocessing is typically done using separate processes, as you noted, meaning that the processes don't share memory. There's a potential workaround if you can get things to work with np.memmap as mentioned a little farther down the joblib docs, though dumping to disk will obviously add some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
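A hedged sketch of the memmap route, with hypothetical sizes and random numbers standing in for the numeric columns of the DataFrame:

import numpy as np
from joblib import Parallel, delayed

n_rows, n_cols = 1_000_000, 10            # hypothetical sizes

# Dump the numeric values to a disk-backed array once, up front.
values = np.memmap('values.dat', dtype='float64', mode='w+', shape=(n_rows, n_cols))
values[:] = np.random.rand(n_rows, n_cols)   # in practice: the DataFrame's numeric columns
values.flush()

def block_mean(mm, start, stop):
    # Workers receive the memmap; joblib passes it along without copying the data.
    return mm[start:stop].mean(axis=0)

mm = np.memmap('values.dat', dtype='float64', mode='r', shape=(n_rows, n_cols))
results = Parallel(n_jobs=8)(
    delayed(block_mean)(mm, i, i + 100_000) for i in range(0, n_rows, 100_000)
)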
