Multiprocessing without copying data in Python

I have a VERY large data structure on which I need to run multiple functions, none of which mutate it (so there is no risk of a race condition). I simply want to get the results faster by running these functions in parallel; for example, computing several percentile values over a large data set.
How would I achieve this with multiprocessing, without having to create a copy of the data each time a process starts? That copying would end up making things slower than if I hadn't bothered in the first place.
(The absence of a code example is on purpose, as I don't think the details of the data structure and functions are in any way relevant.)
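For reference, one common way to avoid the copy on POSIX systems is fork-based copy-on-write. A minimal sketch, assuming the data can live in a module-level NumPy array (the names and sizes below are made up, and the 'fork' start method is not available on Windows):

import multiprocessing as mp
import numpy as np

# hypothetical stand-in for the very large, read-only data structure
big_data = np.random.random(50_000_000)

def percentile(q):
    # reads the module-level array; with the 'fork' start method the child
    # inherits the parent's memory copy-on-write instead of unpickling a copy
    return np.percentile(big_data, q)

if __name__ == '__main__':
    ctx = mp.get_context('fork')  # 'spawn' would re-import and re-create the data per process
    with ctx.Pool(processes=4) as pool:
        print(pool.map(percentile, [25, 50, 75, 99]))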

Related

Synchronize ProcessPoolExecutor

I am trying to process some very large time-series data using concurrent.futures.ProcessPoolExecutor(). The dataset contains multiple independent time series. The entire dataset is available in a list of tuples, data, which I pass through a helper function as follows:
import concurrent.futures

def help_func(daa):
    # unpack one tuple and run the heavy computation on it
    large_function(daa[0], daa[1], daa[2])

with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(help_func, data, chunksize=1)
Now, although the different time series contained in data are independent of one another, due to the nature of time-series data the values within a single series need to be handled one after the other. By ordering the data variable by time series, I can make sure that map always makes its calls in time order.
With executor.map I cannot figure out a way to always map a particular time series to the same core, or to somehow share the state from previous steps with a process running on a new core.
With the current setup, whenever the processing for a particular timestamp is picked up by a new core, it starts again from the initialization step.
Is there any elegant solution to this issue?
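One approach that usually comes up (a sketch only; it assumes daa[0] identifies the time series, which may not match your actual tuples) is to group the tuples by series and submit one task per series, so each series is processed start-to-finish, in order, inside a single worker process:

from collections import defaultdict
import concurrent.futures

def process_series(series_chunk):
    # one whole series is handled inside one worker, in timestamp order, so
    # any state built during initialization stays available across timestamps
    results = []
    for daa in series_chunk:
        results.append(large_function(daa[0], daa[1], daa[2]))
    return results

groups = defaultdict(list)
for daa in data:
    groups[daa[0]].append(daa)  # assuming daa[0] is the series identifier

with concurrent.futures.ProcessPoolExecutor() as executor:
    all_results = list(executor.map(process_series, groups.values()))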

Read only Pandas dataset in Dask Distributed

TL;DR
I want to allow workers to use a scattered Pandas DataFrame, but not allow them to mutate any data. See the sample code below. Is this possible? (Or is this purely a Pandas question?)
Full question
I'm reading a Pandas Dataframe, and scattering it to the workers. I then use this future when I submit new tasks, and store it in a Variable for easy access.
Sample code:
import pyarrow.parquet as pq
from dask.distributed import Variable

# `client` is an existing distributed.Client connected to the cluster
df = pq.read_table('/data/%s.parq' % dataset).to_pandas()
df = client.scatter(df, broadcast=True, direct=True)  # replicate to every worker
v = Variable('dataset')
v.set(df)
When I submit a job I use:
def top_ten_x_based_on_y(dataset, column):
    return (dataset.groupby(dataset['something'])[column].mean()
            .sort_values(ascending=False)[0:10].to_dict())

a = client.submit(top_ten_x_based_on_y, df, 'a_column')
Now, I want to run 10-20 QPS against this dataset, which all workers have in memory (data < RAM), but I want to protect against accidental changes to the dataset, such as one worker "corrupting" its own copy, which could lead to inconsistencies. Preferably an exception would be raised on any attempt to modify it.
The data set is roughly 2GB.
I understand this might be problematic since a Pandas DataFrame itself is not immutable (although a NumPy array can be made immutable).
Other ideas:
Copy the dataset on each query, but a 2 GB copy takes time even in RAM (roughly 1.4 seconds).
Devise a way to hash a DataFrame (problematic in itself, even though hash_pandas_object now exists) and check before and after (or every minute) whether the DataFrame is still as expected. Running hash_pandas_object takes roughly 5 seconds.
Unfortunately Dask currently offers no additional features on top of Python to avoid mutation in this way. Dask just runs Python functions, and those Python functions can do anything they like.
Your suggestions of copying or checking before running operations seem sensible to me.
You might also consider raising this as a question or feature request to Pandas itself.
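As a partial workaround on the NumPy side (a sketch only: applying this to a DataFrame means flipping the flag on its underlying blocks, which touches private pandas internals, and pandas may still silently copy data for some operations, so it is not a hard guarantee):

import numpy as np

arr = np.arange(10)
arr.flags.writeable = False  # mark the buffer read-only

try:
    arr[0] = 99  # any in-place write now fails
except ValueError as exc:
    print(exc)  # "assignment destination is read-only"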

dask distributed and work stealing: how to tidy up tasks that have run twice?

Dask distributed supports work stealing, which can speed up the computation and make it more robust; however, each task may end up being run more than once.
Here I am asking for a way to "tidy up" the results of workers that did not contribute to the final result. To illustrate what I am asking for:
Let's assume each worker is doing Monte-Carlo-like simulations and saves its ~10 GB simulation result in a results folder. In the case of work stealing, the same simulation result may be stored several times, so it is desirable to keep only one copy. What would be the best way to achieve this? Can dask.distributed automatically call some "tidy up" procedure on tasks that did not end up contributing to the final result?
Edit:
I currently start the simulation using the following code:
import distributed

c = distributed.Client(myserver)
mytask.compute(get=c.get)  # mytask is a delayed object
So I guess that afterwards all data is deleted from the cluster, and if I "look at data that exists in multiple locations" after the computation, it is not guaranteed that I can still find the respective tasks? Also, I currently have no clear idea how to map the ID of a future object to the filename that the respective task saved its results to. I currently rely on tempfile to avoid name collisions, which, given the setting of a Monte Carlo simulation, is by far the easiest approach.
There is currently no difference in Dask between a task that was stolen and one that was not.
If you want, you can look at data that exists in multiple locations and then send commands directly to those workers using the following operations:
http://distributed.readthedocs.io/en/latest/api.html#distributed.client.Client.who_has
http://distributed.readthedocs.io/en/latest/api.html#distributed.client.Client.run
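A rough sketch of how those two calls could be combined (the mapping from a task key to the file it wrote is application-specific; remove_result_file_for is a hypothetical helper you would have to supply yourself):

from dask.distributed import Client

client = Client(myserver)

# who_has() maps each key to the workers currently holding its result
locations = client.who_has()
duplicated = [key for key, workers in locations.items() if len(workers) > 1]

def cleanup(keys_to_drop):
    # runs on a worker: delete the result files written for superfluous copies
    for key in keys_to_drop:
        remove_result_file_for(key)  # hypothetical, application-specific helper

# run() executes a function on every worker (or a subset via workers=...)
client.run(cleanup, keys_to_drop=duplicated)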

Fast saving and retrieving of python data structures for an autocorrect program?

So, I have written an autocomplete and autocorrect program in Python 2. I wrote the autocorrect part using the approach described in Peter Norvig's blog post on how to write a spell checker (link).
Now, I am using a trie data structure implemented with nested lists. I use a trie because it can give me all words starting with a particular prefix. At the leaf is a tuple with the word and a value denoting its frequency. For example, the words bad, bat, and cat would be saved as:
['b'['a'['d',('bad',4),'t',('bat',3)]],'c'['a'['t',('cat',4)]]]
where 4, 3, 4 are the number of times the words have been used (their frequency values). Similarly, I have built a trie of about 130,000 words of the English dictionary and stored it using cPickle.
Now it takes about 3-4 seconds for the entire trie to be read each time. The problem is that each time a word is encountered, its frequency value has to be incremented, and then the updated trie needs to be saved again. As you can imagine, it is a big problem to wait 3-4 seconds to read the trie and then roughly as long again to save it after every update. I need to perform a lot of update operations each time the program runs and save them all.
Is there a faster or more efficient way to store a large data structure that will be updated repeatedly? How are the data structures of the autocorrect programs in IDEs and mobile devices saved and retrieved so fast? I am open to different approaches as well.
A few things come to mind.
1) Split the data. Say, use 26 files, each storing the tries starting with a certain character. You can improve this by using a longer prefix. This way the amount of data you need to write is smaller.
2) Don't reflect everything to disk. If you need to perform a lot of operations, do them in RAM (memory) and write them out at the end. If you're afraid of data loss, you can checkpoint your computation after some time X or after a number of operations.
3) Multi-threading. Unless your program only does spell checking, it is likely there are other things it needs to do. Have a separate thread that handles loading and writing so that it doesn't block everything while it does disk IO. Multi-threading in Python is a bit tricky, but it can be done.
4) Custom structure. Part of the time spent in serialization is invoking serialization functions. Since you have a dictionary for everything, that's a lot of function calls. In the perfect case you would have a memory representation that matches the disk representation exactly. You would then simply read one large string and put it into your custom class (and write that string to disk when you need to). This is a bit more advanced, and the benefits likely won't be huge, especially since Python is not very efficient at playing with bits, but if you need to squeeze the last bit of speed out of it, this is the way to go.
I would suggest you move serialization to a separate thread and run it periodically. You don't need to re-read your data each time because you already have the latest version in memory. This way your program stays responsive to the user while the data is being saved to disk. The saved version on disk may lag behind, and the latest updates may get lost if the program crashes, but this shouldn't be a big issue for your use case, I think.
It depends on the particular use case and environment, but I think most programs that keep local data sets sync them to disk using multi-threading.
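A minimal sketch of that separate-thread idea (the trie object and file name are assumed; plain pickle stands in for the cPickle the question already uses):

import pickle
import threading
import time

def start_periodic_save(get_trie, path, interval=60):
    # snapshot the in-memory trie to disk every `interval` seconds so the
    # interactive thread never waits for serialization
    def loop():
        while True:
            time.sleep(interval)
            with open(path, 'wb') as f:
                pickle.dump(get_trie(), f, pickle.HIGHEST_PROTOCOL)
    t = threading.Thread(target=loop)
    t.daemon = True  # don't keep the program alive just for saving
    t.start()
    return t

# usage with hypothetical names: updates happen in memory, the saved copy lags behind
# start_periodic_save(lambda: trie, 'trie.pickle', interval=60)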

Large Pandas Dataframe parallel processing

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.
Eg.
from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())
Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).
The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow and also requires several times the memory of the DataFrame itself.
One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to select subsets of data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.
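A sketch of that HDF route (the store path, key, and id value below are made up; format='table' plus data_columns is what allows where= queries on read):

import pandas as pd

# one-time step in the parent process: write a table-format store
df.to_hdf('store.h5', 'data', format='table', data_columns=['id'])

# inside a worker: pull only the rows that worker needs
subset = pd.read_hdf('store.h5', 'data', where='id == 12345')
subset.apply(another_function)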
An alternative would be to explore numba.vectorize with target='parallel'. This would require the use of NumPy arrays not Pandas objects, so it also has some complexity costs.
In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.
Python multiprocessing is typically done using separate processes, as you noted, meaning that the processes don't share memory. There's a potential workaround if you can get things to work with np.memmap as mentioned a little farther down the joblib docs, though dumping to disk will obviously add some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
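A sketch of that memmap idea (numeric_columns and the temporary path are assumptions; the memory-mapped array is opened read-only, so the workers share the same on-disk pages instead of each unpickling its own copy):

from joblib import Parallel, delayed, dump, load

values = df[numeric_columns].to_numpy()  # numeric_columns is an assumed list of column names
dump(values, '/tmp/df_values.joblib')    # write once in the parent process
shared = load('/tmp/df_values.joblib', mmap_mode='r')  # read-only memory map

def process_row(i, arr):
    return arr[i].sum()  # placeholder for the real per-row work

results = Parallel(n_jobs=8)(
    delayed(process_row)(i, shared) for i in range(len(shared)))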
