iterate over GroupBy object in dask - python

Is it possible to iterate over a dask GroupBy object to get access to the underlying dataframes? I tried:
import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({'A':[1,2,3,4,5], 'B':['1','1','a','a','a']})
ddf = dd.from_pandas(pdf, npartitions = 3)
groups = ddf.groupby('B')
for name, df in groups:
    print(name)
However, this results in an error: KeyError: 'Column not found: 0'
More generally speaking, what kinds of interactions does the dask GroupBy object allow, apart from the apply method?

You can iterate through the groups in Dask like this; there may be a better way, but this works for me.
import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({'A':[1, 2, 3, 4, 5], 'B':['1','1','a','a','a']})
ddf = dd.from_pandas(pdf, npartitions = 3)
groups = ddf.groupby('B')
for group in pdf['B'].unique():
    print(groups.get_group(group))
This would return:
dd.DataFrame<dataframe-groupby-get_group-e3ebb5d5a6a8001da9bb7653fface4c1, divisions=(0, 2, 4, 4)>
dd.DataFrame<dataframe-groupby-get_group-022502413b236592cf7d54b2dccf10a9, divisions=(0, 2, 4, 4)>
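Note that these are still lazy Dask objects. As a rough sketch (not part of the original answer), calling .compute() on each group materializes the underlying pandas DataFrame:

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['1', '1', 'a', 'a', 'a']})
ddf = dd.from_pandas(pdf, npartitions=3)
groups = ddf.groupby('B')

# Materialize each group as a concrete pandas DataFrame
for group in pdf['B'].unique():
    print(groups.get_group(group).compute())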

Generally, iterating over Dask DataFrame objects is not recommended; it is inefficient. Instead, you might want to construct a function and map it over the resulting groups using groupby.apply.
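A minimal sketch of that pattern, assuming each group should be reduced with some per-group pandas logic (the summarize function below is a hypothetical placeholder):

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['1', '1', 'a', 'a', 'a']})
ddf = dd.from_pandas(pdf, npartitions=3)

# Hypothetical per-group function: receives a pandas DataFrame for each group
def summarize(group_df):
    return group_df['A'].sum()

# Map the function over the groups instead of iterating over them
result = ddf.groupby('B').apply(summarize, meta=('A', 'int64')).compute()
print(result)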

Related

Using transform to filter dataframe based on groupby information

I want to filter out ids that do not appear 3 times in the dataset below.
I thought of using groupby and transform('size'), but that doesn't work.
Why?
data = pd.DataFrame({'id': [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
                     'info': [23, 22, 12, 12, 14, 23, 11, 2, 98, 76, 46, 341, 12]})
data[data.groupby(['id']).transform('size') == 3]
Specify the column after groupby. Without it, transform('size') returns a DataFrame, and indexing with a boolean DataFrame masks values element-wise instead of filtering rows; selecting the 'id' column first gives a boolean Series, which filters rows as intended:
df = data[data.groupby(['id'])['id'].transform('size')==3]
Alternative:
df = data[data['id'].map(data['id'].value_counts())==3]
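A quick check of what this produces (a sketch, not part of the original answer):

import pandas as pd

data = pd.DataFrame({'id': [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
                     'info': [23, 22, 12, 12, 14, 23, 11, 2, 98, 76, 46, 341, 12]})

# Keep only the rows whose id occurs exactly 3 times
df = data[data.groupby(['id'])['id'].transform('size') == 3]
print(df['id'].unique())  # [0 1 3] -- ids 0, 1 and 3 each appear three times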

HDFStore and querying by attributes

I am currently running a parameter study in which the results are returned as pandas DataFrames. I want to store these DFs in an HDF5 file together with the parameter values that were used to create them (parameter foo in the example below, with values 'bar' and 'foo', respectively).
I would like to be able to query the HDF5 file based on these attributes to arrive at the respective DFs - for example, I would like to be able to query for a DF with the attribute foo equal to 'bar'. Is it possible to do this in HDF5? Or would it be smarter in this case to create a multiindex DF instead of saving the parameter values as attributes?
import pandas as pd
df_1 = pd.DataFrame({'col_1': [1, 2],
                     'col_2': [3, 4]})
df_2 = pd.DataFrame({'col_1': [5, 6],
                     'col_2': [7, 8]})
store = pd.HDFStore('file.hdf5')
store.put('table_1', df_1)
store.put('table_2', df_2)
store.get_storer('table_1').attrs.foo = 'bar'
store.get_storer('table_2').attrs.foo = 'foo'
store.close()
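One possible way to look up tables by attribute (a rough sketch, not an answer from the original thread; it assumes the attribute name foo used above) is to walk the store's keys and inspect each storer's attrs:

import pandas as pd

# Reopen the store and scan every stored table for a matching attribute value
with pd.HDFStore('file.hdf5') as store:
    for key in store.keys():
        attrs = store.get_storer(key).attrs
        if getattr(attrs, 'foo', None) == 'bar':
            print(key)
            print(store[key])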

Unpack tuple inside function when using Dask map partitions

I'm trying to run a function over many partitions of a Dask dataframe. The code requires unpacking tuples and works well with Pandas but not with Dask map_partitions. The data corresponds to lists of tuples, where the length of the lists can vary, but the tuples are always of a known fixed length.
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
            # Do more stuff
# Create Pandas dataframe
# Each list may have a different length, tuples have fixed known length
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
# Pandas to Dask
ddf = dd.from_pandas(df, npartitions=2)
# Run function over Pandas dataframe
func(df)
# Run function over Dask dataframe
ddf.map_partitions(func).compute()
Here, the Pandas version runs with no issues. However, the Dask one raises the error:
ValueError: Metadata inference failed in `func`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
ValueError('not enough values to unpack (expected 2, got 1)')
In my original function, I'm using these tuples as auxiliary variables, and the data which is finally returned is completely different, so using meta doesn't fix the problem. How can I unpack the tuples?
When you use map_partitions without specifying meta, Dask will try to run the function to infer what the output is. This can cause problems if your function is not compatible with the sample dataframe used; you can inspect this sample dataframe with ddf._meta_nonempty (in this case it will contain a column of foo).
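For instance, a quick way to see that placeholder data (a sketch based on the setup in the question):

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)

# The sample frame Dask uses for metadata inference: object columns are filled with 'foo'
print(ddf._meta_nonempty)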
An easy fix in this case is to provide meta; it's okay for the returned data to be of a different format. For example, if each returned result is a list, you can provide meta=list:
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
    return [1, 2, 3]
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.map_partitions(func, meta=list).compute()
Another approach is to make your function compatible with the sample dataframe. The sample dataframe has an object column, but it contains the string foo rather than a list of tuples, so it cannot be unpacked as a two-element tuple. Modifying your function to tolerate such values (with x, *y = t) will make it work:
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, *y = t
    return [1, 2, 3]
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
# notice that no meta is specified here
ddf.map_partitions(func).compute()

Filter a pd.DataFrame by the type of its index elements

Is there a Pythonic way to filter a pd.DataFrame based on the type of its index elements? When reading an Excel file of time-series data, I often wish to discard rows whose indices are not datetime objects. My current solution is as follows.
import datetime
import pandas as pd
df = pd.DataFrame(index=[1, datetime.datetime(2020, 1, 1), '2019'], data=[1, 2, 3])
df[df.index.map(lambda i: isinstance(i, datetime.datetime))]
You could use a list comprehension instead of the map-lambda construction:
df[[isinstance(df.index[i], datetime.datetime) for i in range(len(df))]]
But I'm not sure that's more Pythonic.
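A quick check that both expressions select only the datetime-indexed row (a sketch, not part of the original answer):

import datetime
import pandas as pd

df = pd.DataFrame(index=[1, datetime.datetime(2020, 1, 1), '2019'], data=[1, 2, 3])

# Both masks keep only the row whose index is a datetime object
mask_map = df.index.map(lambda i: isinstance(i, datetime.datetime))
mask_listcomp = [isinstance(df.index[i], datetime.datetime) for i in range(len(df))]
print(df[mask_map].equals(df[mask_listcomp]))  # True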

Iterate through a dask series (getting unique values from dask series to list)

I need to iterate through the unique values from a dask dataframe. I used .unique() to get the unique values of the column, but now I'm given a dask object that I cannot iterate over. I need to know how to get these unique values out of this dask object into a list (or something similar) so I can use those values to iterate through the dask dataframe.
df = dd.read_csv('file.csv')
df.column1.unique()
for unique_value in column1_array:
    print(unique_value)
This is the error I get:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
You can use the .compute() method to convert your Dask Series into a Pandas Series object and then iterate over that.
for x in s.compute():
    ...
See https://docs.dask.org/en/latest/dataframe-best-practices.html#reduce-and-then-use-pandas
There are also the iteritems and iterrows methods.
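Applied to the question's setup, that would look roughly like this (a sketch; file.csv and column1 are taken from the question):

import dask.dataframe as dd

df = dd.read_csv('file.csv')

# Materialize the unique values as a pandas Series, then iterate over it
unique_values = df.column1.unique().compute()
for unique_value in unique_values:
    print(unique_value)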
This issue has been resolved in dask=2.3.
In [1]: import pandas as pd
   ...: import dask.dataframe as dd
   ...: import dask
In [2]: dask.__version__
Out[2]: '2.3.0'
In [3]: df = pd.DataFrame({"temp1": [1, 2, 2, 4], "temp2": [1, 2, 2, 4]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...: for unique_value in ddf.temp1.unique():
   ...:     print(unique_value)
   ...:
1
2
4
