I'm in a situation in Dask which I would like to get out of, without using a lot of expensive reset_index operations.
I have a task which does a groupby-apply, where the apply returns a dataframe with a different size to the input dataframe (in the example this is simulated by the .head() and .tail() calls with reset_index()).
Another operation is carried out on a different dataframe, and these two dataframes need to be joined. However, the behavior is not as I had expected. I had expected the join to use only the Dask index, and since Dask doesn't implement a multi-index, I was surprised to see that it joins on both the Dask index and the index returned from the apply:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame(
        {
            "group_col": ["A", "A", "A", "B", "B"],
            "val_col": [1, 2, 3, 4, 5],
            "val_col2": [5, 4, 3, 2, 1]
        }
    ), npartitions=1)
ddf = ddf.set_index("group_col")
out_ddf = ddf.groupby("group_col").apply(lambda _df: _df.head(2).reset_index(drop=True))
out_ddf2 = ddf.groupby("group_col").apply(lambda _df: _df.tail(1).reset_index(drop=True))
out_ddf.join(out_ddf2, rsuffix="_other").compute()
Below is the output of the above.
             val_col  val_col2  val_col_other  val_col2_other
group_col
A         0        1         5            3.0             3.0
          1        2         4            NaN             NaN
B         0        4         2            5.0             1.0
          1        5         1            NaN             NaN
The desired output (without expensive reshuffling) would be:
           val_col  val_col2  val_col_other  val_col2_other
group_col
A                1         5              3               3
                 2         4              3               3
B                4         2              5               1
                 5         1              5               1
I have tried various combinations of .join/.merge calls, and I have been able to achieve the result with:
out_ddf.reset_index().merge(out_ddf2.reset_index(), suffixes=(None, "_other"), on="group_col").compute()
but I want to do some more operations on the same index later on, so I'm concerned this will hurt the performance, having to jiggle around the index so much.
So I'm looking for solutions which will give the desired result without the overhead of changing the dask indices during the operation, since the data frames are pretty big.
Thanks!
The code below might not work in general, but for your example I would use the fact that the computations are done within a group and combine them into a single function that is applied per group. This avoids merges/data shuffles:
def myfunc(df):
    df1 = df.head(2).reset_index(drop=True)
    df2 = df.tail(1).add_suffix('_other').reset_index(drop=True)
    return df1.join(df2).fillna(method='ffill')
out_ddf = ddf.groupby('group_col').apply(myfunc)
print(out_ddf.compute())
For more complex workflows, a more nuanced solution will be needed to keep track of data dependencies in each computation.
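If several per-group results are needed, one possible generalization (a sketch only; combine_in_group and the transforms list are hypothetical names, not part of the answer above) is to pass the per-group computations in as a list and join them inside the single apply:

def combine_in_group(_df, transforms):
    # run each per-group computation, then join the pieces within the group
    parts = [t(_df).reset_index(drop=True) for t in transforms]
    # assumes exactly two parts here, so a single rsuffix avoids name clashes
    return parts[0].join(parts[1], rsuffix="_other").ffill()

transforms = [lambda d: d.head(2), lambda d: d.tail(1)]
out_ddf = ddf.groupby("group_col").apply(lambda d: combine_in_group(d, transforms))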
I have a DataFrame with many missing values in columns which I wish to groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}
You can see that Pandas has dropped the rows with NaN values in the grouping column. (I want to include these rows!)
Since I need many such operations (many columns have missing values), and I use more complicated functions than just medians (typically random forests), I want to avoid writing overly complicated pieces of code.
Any suggestions? Should I write a function for this or is there a simple solution?
pandas >= 1.1
From pandas 1.1 you have better control over this behavior: NA values are now allowed in the grouper using dropna=False:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
# Example from the docs
df
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
# without NA (the default)
df.groupby('b').sum()
a c
b
1.0 2 3
2.0 2 5
# with NA
df.groupby('b', dropna=False).sum()
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
This is mentioned in the Missing Data section of the docs:
NA groups in GroupBy are automatically excluded. This behavior is consistent with R
One workaround is to use a placeholder before doing the groupby (e.g. -1):
In [11]: df.fillna(-1)
Out[11]:
a b
0 1 4
1 2 -1
2 3 6
In [12]: df.fillna(-1).groupby('b').sum()
Out[12]:
a
b
-1 2
4 1
6 3
That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue, which uses the same placeholder hack).
However, as described in another answer, "from pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False"
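On the question's DataFrame, that looks roughly like this (a sketch, assuming pandas >= 1.1; the exact repr of .groups varies by version):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.nan, '6']})

# keep the NaN key as its own group
df.groupby('b', dropna=False).groups
# roughly: {'4': [0], '6': [2], nan: [1]}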
Ancient topic, but if someone still stumbles over this: another workaround is to convert to string via .astype(str) before grouping. That will preserve the NaNs.
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
a
b
4 1
6 3
nan 2
I am not able to add a comment to M. Kiewisch since I do not have enough reputation points (I only have 41 but need more than 50 to comment).
Anyway, I just want to point out that M. Kiewisch's solution does not work as is and may need more tweaking. Consider for example
>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.NaN, 6, 4]})
>>> df
a b
0 1 4.0
1 2 NaN
2 3 6.0
3 5 4.0
>>> df.groupby(['b']).sum()
a
b
4.0 6
6.0 3
>>> df.astype(str).groupby(['b']).sum()
a
b
4.0 15
6.0 3
nan 2
which shows that for group b=4.0, the corresponding value is 15 instead of 6. Here it is just concatenating 1 and 5 as strings instead of adding them as numbers.
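One way to keep the numeric columns intact while still preserving the NaN group (a hedged variant, not part of the answer above) is to stringify only the grouping key:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.nan, 6, 4]})

# group on a stringified copy of 'b' so 'a' stays numeric and sums correctly
df.groupby(df['b'].astype(str))['a'].sum()
# b
# 4.0    6
# 6.0    3
# nan    2
# Name: a, dtype: int64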
All answers provided thus far result in potentially dangerous behavior, as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.
A less hacky solution is to use DataFrame.drop_duplicates() to create a unique index of value combinations, each with its own ID, and then group on that ID. It is more verbose but does get the job done:
def safe_groupby(df, group_cols, agg_dict):
    # set name of group col to unique value
    group_id = 'group_id'
    while group_id in df.columns:
        group_id += 'x'
    # get final order of columns
    agg_col_order = (group_cols + list(agg_dict.keys()))
    # create unique index of grouped values
    group_idx = df[group_cols].drop_duplicates()
    group_idx[group_id] = np.arange(group_idx.shape[0])
    # merge unique index on dataframe
    df = df.merge(group_idx, on=group_cols)
    # group dataframe on group id and aggregate values
    df_agg = df.groupby(group_id, as_index=True)\
               .agg(agg_dict)
    # merge grouped value index to results of aggregation
    df_agg = group_idx.set_index(group_id).join(df_agg)
    # rename index
    df_agg.index.name = None
    # return reordered columns
    return df_agg[agg_col_order]
Note that you can now simply do the following:
from collections import OrderedDict

data_block = [np.tile([None, 'A'], 3),
              np.repeat(['B', 'C'], 3),
              [1] * (2 * 3)]
col_names = ['col_a', 'col_b', 'value']
test_df = pd.DataFrame(data_block, index=col_names).T
grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
                          OrderedDict([('value', 'sum')]))
This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.
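(For reference, if pandas >= 1.1 is available, the same grouping can be done directly by keeping the NA keys in the grouper; a minimal sketch:)

# value ends up as object dtype after the transposed construction above,
# so cast it before summing
test_df['value'] = test_df['value'].astype(int)
test_df.groupby(['col_a', 'col_b'], dropna=False)['value'].sum()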
One small point on Andy Hayden's solution: it doesn't work (anymore?) because np.nan == np.nan yields False, so the replace function doesn't actually do anything.
What worked for me was this:
df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)
(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)
I answered this already, but for some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:
Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size (includes NaNs) and the count (ignores NaNs) of a group will differ if there are NaNs.
dfgrouped = df.groupby(['b']).a.agg(['sum','size','count'])
dfgrouped['sum'][dfgrouped['size']!=dfgrouped['count']] = None
When these differ, you can set the value back to None for the result of the aggregation function for that group.
If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
    df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge on DataFrame and Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a DataFrame from the Series and then merge it with the DataFrame.
So you specify the data as the Series values repeated (via list multiplication), set the columns to the Series index, and set left_index and right_index to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT: For the situation where you want the index of the df constructed from the series to use the index of df, you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that df's index matches the length of the data.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame is well documented in the official document as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official document also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using shape[0] to return the number of rows of df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int, as in your example. If the column you specify isn't in the df, pandas will create a new column with the name you specify. So after your dataframe is constructed (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to do this for all the elements in a list of tuples, or for the keys and values in a dictionary, depending on how your real data is stored.
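A minimal sketch of such a loop, assuming the new columns are stored in a plain dict (the new_cols name is just for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, 6]}, index=[3, 5, 6])

new_cols = {'s1': 5, 's2': 6}          # or a list of (name, value) tuples
for name, value in new_cols.items():
    df[name] = value                   # each value is broadcast down the column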
If df is a pandas.DataFrame, then df['new_col'] = list_object, where list_object is a list or Series of length len(df), will add the list or Series as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar] * len(df).
So a two-line loop serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
    df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
I am fairly new to Python, especially pandas. I have a DataFrame called KeyRow which is from a bigger df:
KeyRow=df.loc[df['Order'] == UniqueOrderName[i]]
Then I make a nested loop
for i in range(0, len(PersonNum)):
    print(KeyRow.loc[KeyRow['Aisle'] == '6', 'FixedPill'])
So it appears to only work when a constant is placed, whereas if I use PersonNum[0] instead of '6', even though both values are equivalent, it appears not to work. When I use PersonNum[i], this is the output I am getting:
Series([], Name: FixedPill, dtype: object)
Whereas if I use the constant I get the desired result:
15 5
Name: FixedPill, dtype: object
It's a little unclear what you are trying to accomplish with this question. If you are looking to filter a DataFrame, then I would suggest never doing this in an iterative manner. You should take full advantage of the slicing capabilities of .loc. Consider the example:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6],
                   [1, 2, 3], [2, 5, 6],
                   [1, 2, 3], [4, 5, 6],
                   [1, 2, 3], [4, 5, 6]],
                  columns=["A", "B", "C"])
df.head()
A B C
0 1 2 3
1 4 5 6
2 1 2 3
3 2 5 6
4 1 2 3
Suppose you have a list PersonNum that you want to use to locate a particular field, where your list is PersonNum = [1, 2]. You can slice the DataFrame in one step by performing:
df.loc[df["A"].isin(PersonNum), "B"]
Which will return a pandas Series and
df.loc[df["A"].isin(PersonNum), "B"].to_frame()
which returns a new DataFrame. Utilizing .loc is significantly faster than an iterative approach.
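Applied to the names from the question, that might look like this (a sketch; the str() conversion is an assumption, based on 'Aisle' being compared against the string '6' in the question, which would also explain why an integer PersonNum[i] matched nothing):

# hypothetical: make the lookup values match the dtype of the 'Aisle' column
wanted = [str(p) for p in PersonNum]
result = KeyRow.loc[KeyRow['Aisle'].isin(wanted), 'FixedPill']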
I'm new to Python and have what is probably a basic question.
I have imported a number of Pandas Dataframes consisting of stock data for different sectors. So all columns are the same, just with different dataframe names.
I need to do a lot of different small operations on some of the columns, and I can figure out how to do it on one Dataframe at a time, but I need to figure out how to loop over the different frames and do the same operations on each.
For example, for one DF I do:
ConsumerDisc['IDX_EST_PRICE_BOOK']=1/ConsumerDisc['IDX_EST_PRICE_BOOK']
ConsumerDisc['IDX_EST_EV_EBITDA']=1/ConsumerDisc['IDX_EST_EV_EBITDA']
ConsumerDisc['INDX_GENERAL_EST_PE']=1/ConsumerDisc['INDX_GENERAL_EST_PE']
ConsumerDisc['EV_TO_T12M_SALES']=1/ConsumerDisc['EV_TO_T12M_SALES']
ConsumerDisc['CFtoEarnings']=ConsumerDisc['CASH_FLOW_PER_SH']/ConsumerDisc['TRAIL_12M_EPS']
And instead of just copying and pasting this code for the next 10 sectors, I want to do it in a loop somehow, but I can't figure out how to access the df via a variable, e.g.:
CS=['ConsumerDisc']
CS['IDX_EST_PRICE_BOOK']=1/CS['IDX_EST_PRICE_BOOK']
so I could just create a list of df names and loop through it.
Hope you can give a small example of how to do this.
You're probably looking for something like this
for df in (df1, df2, df3):
    df['IDX_EST_PRICE_BOOK'] = 1/df['IDX_EST_PRICE_BOOK']
    df['IDX_EST_EV_EBITDA'] = 1/df['IDX_EST_EV_EBITDA']
    df['INDX_GENERAL_EST_PE'] = 1/df['INDX_GENERAL_EST_PE']
    df['EV_TO_T12M_SALES'] = 1/df['EV_TO_T12M_SALES']
    df['CFtoEarnings'] = df['CASH_FLOW_PER_SH']/df['TRAIL_12M_EPS']
Here we're iterating over the dataframes that we've put in a tuple data structure. Does that make sense?
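If you also want to keep each sector's name handy, a dict of DataFrames works the same way (a sketch; Energy and Financials below are placeholders for your other sector frames):

sector_dfs = {'ConsumerDisc': ConsumerDisc,
              'Energy': Energy,            # placeholder
              'Financials': Financials}    # placeholder

for name, sector_df in sector_dfs.items():
    sector_df['IDX_EST_PRICE_BOOK'] = 1/sector_df['IDX_EST_PRICE_BOOK']
    sector_df['CFtoEarnings'] = sector_df['CASH_FLOW_PER_SH']/sector_df['TRAIL_12M_EPS']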
Do you mean something like this?
import pandas as pd
d = {'a' : pd.Series([1, 2, 3, 10]), 'b' : pd.Series([2, 2, 6, 8])}
z = {'d' : pd.Series([4, 2, 3, 1]), 'e' : pd.Series([21, 2, 60, 8])}
df = pd.DataFrame(d)
zf = pd.DataFrame(z)
df.head()
a b
0 1 2
1 2 2
2 3 6
3 10 8
df = df.apply(lambda x: 1/x)
df.head()
a b
0 1.0 0.500000
1 2.0 0.500000
2 3.0 0.166667
3 10.0 0.125000
You have more functions, so you can create a function and then just apply that to each DataFrame. Alternatively, you could also apply these lambda functions to only specific columns. So let's say you want to apply 1/column to every column but the last (going by your example, I am assuming it is at the end): you could do df.iloc[:, :-1].apply(lambda x: 1/x) (the older .ix indexer has since been removed from pandas).
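A sketch of that idea (the transform_sector name is hypothetical), assuming the column names from the question:

ratio_cols = ['IDX_EST_PRICE_BOOK', 'IDX_EST_EV_EBITDA',
              'INDX_GENERAL_EST_PE', 'EV_TO_T12M_SALES']

def transform_sector(sector_df):
    # invert the valuation ratios in place and add the cash-flow/earnings column
    sector_df[ratio_cols] = 1 / sector_df[ratio_cols]
    sector_df['CFtoEarnings'] = (sector_df['CASH_FLOW_PER_SH']
                                 / sector_df['TRAIL_12M_EPS'])
    return sector_df

# apply it to each sector frame in turn, e.g.
# for sector_df in (ConsumerDisc, ConsumerStaples, Energy):
#     transform_sector(sector_df)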
The problem is that I want to get the trimmed mean of all the columns in a pandas dataframe (i.e. the mean of the values in a given column, excluding the max and the min values). It's likely that some columns will have nan values. Basically, I want to get the exact same functionality as the pandas.DataFrame.mean function, except that it's the trimmed mean.
The obvious solution is to use the scipy tmean function, and iterate over the df columns. So I did:
from scipy import stats

trim_mean = []
for i in data_clean3.columns:
    trim_mean.append(stats.tmean(data_clean3[i]))
This worked great, until I encountered nan values, which caused tmean to choke. Worse, when I dropped the nan values in the dataframe, there were some datasets that were wiped out completely as they had an nan value in every column. This means that when I amalgamate all my datasets into a master set, there'll be holes on the master set where the trimmed mean should be.
Does anyone know of a way around this? As in, is there a way to get tmean to behave like the standard scipy stats functions and ignore nan values?
(Note that my code is calculating a big number of descriptive statistics on large datasets with limited hardware; highly involved or inefficient workarounds might not be optimal. Hopefully, though, I'm just missing something simple.)
(EDIT: Someone suggested in a comment (that has since vanished?) that I should use the trim_mean scipy function, which allows you to top and tail a specific proportion of the data. This is just to say that this solution won't work for me, as my datasets are of unequal sizes, so I cannot specify a fixed proportion of data that will be OK to remove in every case; it must always just be the max and the min values.)
Consider df
import numpy as np
import pandas as pd

np.random.seed()
data = np.random.choice((0, 25, 35, 100, np.nan),
                        (1000, 2),
                        p=(.01, .39, .39, .01, .2))
df = pd.DataFrame(data, columns=list('AB'))
Construct your mean using sums and divide by the relevant normalizer.
(df.sum() - df.min() - df.max()) / (df.notnull().sum() - 2)
A 29.707674
B 30.402228
dtype: float64
df.mean()
A 29.756987
B 30.450617
dtype: float64
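The same idea can be wrapped in a small helper (a sketch; it assumes every column has at least three non-NaN values, so that removing one min and one max still leaves something to average):

def trimmed_mean(frame):
    # drop a single min and a single max per column, counting only non-NaN values
    return (frame.sum() - frame.min() - frame.max()) / (frame.notnull().sum() - 2)

trimmed_mean(df)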
You could use df.mean(skipna=True); see DataFrame.mean.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[5, 1, 'a'], [6, 2, 'b'], [7, 3, 'd'],
                    [np.nan, 4, 'e'], [9, 5, 'f'], [5, 1, 'g']],
                   columns=["A", "B", "C"])
print(df1)

df1 = df1[df1.A != df1.A.max()]  # Remove max values
df1 = df1[df1.A != df1.A.min()]  # Remove min values
print("\nDataframe after removing max and min\n")
print(df1)
print("\nMean of A\n")
print(df1["A"].mean(skipna=True))
output
A B C
0 5.0 1 a
1 6.0 2 b
2 7.0 3 d
3 NaN 4 e
4 9.0 5 f
5 5.0 1 g
Dataframe after removing max and min
A B C
1 6.0 2 b
2 7.0 3 d
3 NaN 4 e
Mean of A
6.5
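A hedged generalization of the same idea to several columns at once (note that, like the answer above, it drops every row equal to a column's min or max, not just one occurrence):

import numpy as np
import pandas as pd

# rebuild the original df1 from the answer above
df1 = pd.DataFrame([[5, 1, 'a'], [6, 2, 'b'], [7, 3, 'd'],
                    [np.nan, 4, 'e'], [9, 5, 'f'], [5, 1, 'g']],
                   columns=["A", "B", "C"])

def trimmed_col_mean(col):
    # drop every value equal to the column's min or max, then average the rest
    # (NaNs pass through here and are ignored by mean)
    trimmed = col[(col != col.max()) & (col != col.min())]
    return trimmed.mean(skipna=True)

print(df1[["A", "B"]].apply(trimmed_col_mean))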