I have a ddf with lots of partitions
ddf = dd.read_parquet("./input-*", engine='fastparquet')
ddf
Dask DataFrame Structure:
datetime ndvi str utm_x utm_y fpath scl_value
npartitions=71
Dask Name: read-parquet, 71 tasks
In each partition I want to run a custom function
my_df_list = list()
for arg_key, arg_value in my_dict_of_args.items():
    ddf_item = ddf_sliced.map_partitions(myfunc,
                                         my_arg1=arg_key,
                                         my_arg2=arg_value,
                                         meta=my_meta)
    my_df_list.append(ddf_item)
This is where things get tricky. The following command is too much for my PC: it takes forever before the first item even starts computing, and it eventually depletes all my RAM:
dask.compute(*my_df_list)
Example graph using 2 dfs instead of 71, from dask.visualize(*my_df_list):
But it can easily handle the computation of each partition, one by one:
my_df_list[0].compute()
...
my_df_list[71].compute()
Example graph using 2 dfs instead of 71, from my_df_list[0].visualize():
I'm struggling to understand the difference, since to me it's the same iteration scheme.
If it is indeed an overhead issue, I would be glad to hear of alternative approaches that avoid calling .compute() on each element manually.
EDIT 1
After posting the graph images I understand that dask.compute(*list) boosts parallelism and optimizes the df reads. See the documentation section "Avoid calling compute repeatedly".
Now I can see that the real problem is the initialization of the graph, and probably my code: even when loading 2 dfs instead of 71, my memory is depleted long before the real computation starts when using dask.compute(*list).
In dask distributed I get the following warning, which I would not expect:
/home/miniconda3/lib/python3.6/site-packages/distributed/worker.py:739: UserWarning: Large object of size 1.95 MB detected in task graph:
(['int-58e78e1b34eb49a68c65b54815d1b158', 'int-5cd ... 161071d7ae7'],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s))
The reason I'm surprised is that I'm doing exactly what the warning suggests:
import dask.dataframe as dd
import pandas
from dask.distributed import Client, LocalCluster
c = Client(LocalCluster())
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)
filter_list = c.scatter(list(range(2,100000,2)))
mask = c.submit(dask_df['A'].isin, filter_list)
dask_df[mask.result()].compute()
So my question is: Am I doing something wrong or is this a bug?
pandas='0.22.0'
dask='0.17.0'
The main reason why dask is complaining isn't the list, it's the pandas dataframe inside the dask dataframe.
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)
You are creating a biggish amount of data locally when you create a pandas dataframe in your local session. Then you operate with it on the cluster. This will require moving your pandas dataframe to the cluster.
You're welcome to ignore these warnings, but in general I would not be surprised if performance here is worse than with pandas alone.
There are a few other things going on here. Your scatter of a list produces a bunch of futures, which may not be what you want. You're calling submit on a dask object, which is usually unnecessary.
I'm changing my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could specify the num_threads argument to the tf.train.shuffle_batch queue. However, the only way to control the amount of threads in the Dataset API seems to be in the map function using the num_parallel_calls argument. However, I'm using the flat_map function instead, which doesn't have such an argument.
Question: Is there a way to control the number of threads/processes for the flat_map function? Or is there a way to use map in combination with flat_map and still specify the number of parallel calls?
Note that it is of crucial importance to run multiple threads in parallel, as I intend to run heavy pre-processing on the CPU before data enters the queue.
There are two (here and here) related posts on GitHub, but I don't think they answer this question.
Here is a minimal code example of my use-case for illustration:
import tensorflow as tf

with tf.Graph().as_default():
    data = tf.ones(shape=(10, 512), dtype=tf.float32, name="data")
    input_tensors = (data,)

    def pre_processing_func(data_):
        # normally I would do data-augmentation here
        results = (tf.expand_dims(data_, axis=0),)
        return tf.data.Dataset.from_tensor_slices(results)

    dataset_source = tf.data.Dataset.from_tensor_slices(input_tensors)
    dataset = dataset_source.flat_map(pre_processing_func)
    # do something with 'dataset'
To the best of my knowledge, at the moment flat_map does not offer parallelism options.
Given that the bulk of the computation is done in pre_processing_func, what you might use as a workaround is a parallel map call followed by some buffering, and then using a flat_map call with an identity lambda function that takes care of flattening the output.
In code:
NUM_THREADS = 5
BUFFER_SIZE = 1000

def pre_processing_func(data_):
    # data-augmentation here
    # generate new samples starting from the sample `data_`
    artificial_samples = generate_from_sample(data_)
    return artificial_samples

dataset_source = (tf.data.Dataset.from_tensor_slices(input_tensors).
                  map(pre_processing_func, num_parallel_calls=NUM_THREADS).
                  prefetch(BUFFER_SIZE).
                  flat_map(lambda *x: tf.data.Dataset.from_tensor_slices(x)).
                  shuffle(BUFFER_SIZE))  # my addition, probably necessary though
Note (to myself and whoever will try to understand the pipeline):
Since pre_processing_func generates an arbitrary number of new samples starting from the initial sample (organised in matrices of shape (?, 512)), the flat_map call is necessary to turn all the generated matrices into Datasets containing single samples (hence the tf.data.Dataset.from_tensor_slices(x) in the lambda) and then flatten all these datasets into one big Dataset containing individual samples.
It's probably a good idea to .shuffle() that dataset, or generated samples will be packed together.
New to python. Working with IPython.
I want to do some calculation on a pandas dataframe with a rolling window. The process looks like this:
import numpy as np
import pandas as pd

def calculate_avg_ret_t(return_matrix, rolling_window, t):
    # select rows t-rolling_window+1 .. t (inclusive)
    ret_t = return_matrix.iloc[np.arange(t - rolling_window + 1, t + 1, 1), ]
    avg_ret_t = ret_t.mean().mean()  # much more complicated in reality
    return avg_ret_t

return_matrix = pd.DataFrame(np.random.randn(10000, 10000))
rolling_window = 21
avg_ret_ts = []
# t must stay below the number of rows (10000), otherwise iloc raises IndexError
for t in np.arange(rolling_window - 1, 10000, 1):
    %time avg_ret_t = calculate_avg_ret_t(return_matrix, rolling_window, t)
    avg_ret_ts.append(avg_ret_t)
The actual function executed within each loop iteration is much more complicated and time-consuming, hence the need for parallelization. Can this process be parallelized, and if so, what's the most user-friendly module to do that?
I realized the potential problem is that the function has to access the gigantic input return_matrix in each iteration. Should I first transform that matrix into an R-list-like object, depending on rolling_window?
If the function is only dependent on the data in a given slice, then this would be easily parallelized. I would do the following:
1) Split the data set into N sets where N is the number of processors. The sets should overlap sufficiently.
2) Each processor computes the quantities on its own data subset.
You may want to look at using mpi4py in ipython. See for example https://ipython.org/ipython-doc/3/parallel/parallel_mpi.html. This would allow you to develop and debug parallel code quite easily.
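The two steps above can be sketched as follows. This is only an illustration under stated assumptions: it swaps mpi4py for the standard library's concurrent.futures with a thread pool, the helper names (window_stat, process_chunk, parallel_rolling) are made up for the example, and window_stat stands in for the real, more expensive calculation. Each chunk carries rolling_window - 1 extra rows, which is the overlap mentioned in step 1.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def window_stat(window):
    # Hypothetical stand-in for the real per-window calculation.
    return window.mean()


def process_chunk(chunk, rolling_window):
    # One result per full window contained in this chunk.
    n_windows = chunk.shape[0] - rolling_window + 1
    return [window_stat(chunk[i:i + rolling_window]) for i in range(n_windows)]


def parallel_rolling(matrix, rolling_window, n_workers=4):
    n_windows = matrix.shape[0] - rolling_window + 1
    # Split the window start positions into n_workers contiguous ranges.
    bounds = np.linspace(0, n_windows, n_workers + 1, dtype=int)
    # Extend every chunk by rolling_window - 1 rows so that windows which
    # straddle a chunk boundary are still computed by exactly one worker.
    chunks = [
        matrix[start:stop + rolling_window - 1]
        for start, stop in zip(bounds[:-1], bounds[1:])
        if stop > start
    ]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(process_chunk, chunks, [rolling_window] * len(chunks))
        # Flatten the per-chunk lists back into one list, in window order.
        return [value for part in parts for value in part]


rng = np.random.default_rng(0)
matrix = rng.normal(size=(200, 5))
parallel_result = parallel_rolling(matrix, rolling_window=21, n_workers=3)
```

Threads only help when the per-window work releases the GIL (as many NumPy operations do). For pure-Python work, ProcessPoolExecutor offers the same map interface, at the cost of sending each chunk to a separate process.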
I'm currently using Pandas for a project with csv source files of around 600mb. During the analysis I am reading in the csv to a dataframe, grouping on some column and applying a simple function to the grouped dataframe. I noticed that I was going into Swap Memory during this process and so carried out a basic test:
I first created a fairly large dataframe in the shell:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3000000, 3),index=range(3000000),columns=['a', 'b', 'c'])
I defined a pointless function called do_nothing():
def do_nothing(group):
    return group
And ran the following command:
df = df.groupby('a').apply(do_nothing)
My system has 16gb of RAM and is running Debian (Mint). After creating the dataframe I was using ~600mb of RAM. As soon as the apply method began to execute, that value started to soar. It steadily climbed up to around 7gb(!) before finishing the command and settling back down to 5.4gb (while the shell was still active). The problem is, my work requires doing more than the 'do_nothing' method and as such while executing the real program, I cap my 16gb of RAM and start swapping, making the program unusable. Is this intended? I can't see why Pandas should need 7gb of RAM to effectively 'do_nothing', even if it has to store the grouped object.
Any ideas on what's causing this/how to fix it?
Cheers,
.P
Using 0.14.1, I don't think there is a memory leak (this frame is 1/30 the size of yours).
In [79]: df = DataFrame(np.random.randn(100000,3))
In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop
In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop
Two general comments on how to approach a problem like this:
1) Use the cython-level functions if at all possible; they will be MUCH faster and use much less memory. In other words, it is almost always worth decoupling a groupby expression and avoiding a Python-level function (if possible; some things are just too complicated, but that's the point: you want to break things down). E.g.:
Instead of:
df.groupby(...).apply(lambda x: x.sum() / x.mean())
It is MUCH better to do:
g = df.groupby(...)
g.sum() / g.mean()
2) You can easily 'control' the groupby by doing your aggregation manually (additionally this will allow periodic output and garbage collection if needed).
import gc

results = []
for i, (g, grp) in enumerate(df.groupby(....)):
    if i % 500 == 0:
        print("checkpoint: %s" % i)
        gc.collect()
    results.append(func(g, grp))

# final result
pd.concat(results)
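For a concrete, runnable version of both points, here is a small sketch on made-up data. The column names key, x and y are assumptions for illustration, and the values are kept strictly positive so that the mean is never zero.

```python
import gc

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: about 100 groups with strictly positive values.
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=10_000),
    "x": rng.uniform(1.0, 2.0, size=10_000),
    "y": rng.uniform(1.0, 2.0, size=10_000),
})

g = df.groupby("key")[["x", "y"]]

# Point 1, slow form: a Python function is called once per group.
via_apply = g.apply(lambda grp: grp.sum() / grp.mean())

# Point 1, decoupled form: two built-in aggregations, then one division.
decoupled = g.sum() / g.mean()

# Point 2: iterate over the groups manually, collecting garbage periodically.
results = []
for i, (key, grp) in enumerate(df.groupby("key")):
    if i % 50 == 0:
        gc.collect()
    # One single-row frame per group, indexed by the group key.
    results.append(grp[["x", "y"]].sum().to_frame(key).T)
manual_sum = pd.concat(results)
```

Both forms in point 1 produce one row per group. The manual loop gives the same per-group sums as g.sum(), while leaving room for checkpoints between groups.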
I happily use pandas to store and manipulate experimental data. Usually, I choose the HDF format (which I haven't mastered) via pd.HDFStore to save stuff.
My dataframes got bigger and bigger and some economy in memory is needed.
I read some of the guides linked in related questions, but I still cannot achieve sustainable memory consumption, e.g. in the following typical task of mine:
. load some `df` in memory (scale size is 10GB)
. do business with some other preloaded `df`
. unload
. repeat
Apparently I keep on failing in the unloading stage.
Hence, I would like you to consider the following experiments.
(From fresh started kernel (in ipython notebook, if that matters))
import pandas as pd
for idx in range(6):
    print(idx)
    store = pd.HDFStore('detection_DB_N.h5')
    detection_DB = store['detection_DB']
    store.close()
    del detection_DB
stats (from top):
. memory used by first iteration ~8GB
. memory used at the end of execution ~10GB (6 cycles)
Then, in the same kernel, I run
for idx in range(6):
    print(idx)
    store = pd.HDFStore('detection_DB_N.h5')
    detection_DB = store['detection_DB']
    store.close()
    #del detection_DB #SAME AS BEFORE, BUT I DON'T del
stats:
. memory used at the end of execution ~15GB
Calling a del detection_DB doesn't make any difference in memory (CPU usage goes high for some 5sec).
Analogously, calling
import gc
gc.collect()
doesn't make any relevant difference.
I should add, for what it's worth, that after repeating the previous calls I ended up with ~20GB occupied (and no loaded object to play with).
Can anyone shed some light?
How can I achieve ~0GB (or so) occupied after del?