I am using pandas.DataFrame.resample to resample a grouped Pandas DataFrame with a timestamp index.
For one of the columns, I would like the resampled value to be the most frequent value in each bin. At the moment, I only have success with NumPy functions like np.max, np.sum, etc.
#generate test dataframe
data = np.random.randint(0,10,(366,2))
index = pd.date_range(start=pd.Timestamp('1-Dec-2012'), periods=366, freq='D')
test = pd.DataFrame(data, index=index)
#generate group array
group = np.random.randint(0,2,(366,))
#define how dictionary for resample
how_dict = {0: np.max, 1: np.min}
#perform grouping and resample
test.groupby(group).resample('48 h',how=how_dict)
The previous code works because I used NumPy functions. However, I am not sure how to resample by the most frequent value. I tried defining a custom function like
def frequent(x):
    (value, counts) = np.unique(x, return_counts=True)
    return value[counts.argmax()]
However, if I now do:
how_dict = {0: np.max, 1: frequent}
I get an empty dataframe...
df = test.groupby(group).resample('48 h',how=how_dict)
df.shape
Your resample period is too short, so when a group has no rows in a given period, your user-defined function raises a ValueError that is not kindly caught by pandas.
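A minimal illustration of the failure (using the frequent function from the question): on an empty bin, counts is an empty array and argmax on it raises.
import numpy as np
def frequent(x):
    (value, counts) = np.unique(x, return_counts=True)
    return value[counts.argmax()]
frequent(np.array([]))  # ValueError: attempt to get argmax of an empty sequence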
But it works without empty groups, for example with regular groups:
In [8]: test.groupby(np.arange(366) % 2).resample('48h', how=how_dict).head()
Out[8]:
0 1
0 2012-12-01 4 8
2012-12-03 0 3
2012-12-05 9 5
2012-12-07 3 4
2012-12-09 7 3
Or with bigger periods:
In [9]: test.groupby(group).resample('122D',how=how_dict)
Out[9]:
0 1
0 2012-12-02 9 0
2013-04-03 9 0
2013-08-03 9 6
1 2012-12-01 9 3
2013-04-02 9 7
2013-08-02 9 1
EDIT
A workaround is to handle the empty case explicitly:
def frequent(x):
    if len(x) == 0:
        return -1
    (value, counts) = np.unique(x, return_counts=True)
    return value[counts.argmax()]
which gives:
In [11]: test.groupby(group).resample('48h',how=how_dict).head()
Out[11]:
0 1
0 2012-12-01 5 3
2012-12-03 3 4
2012-12-05 NaN -1
2012-12-07 5 0
2012-12-09 1 4
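As a side note, on newer pandas versions the how= keyword of resample has been removed; a roughly equivalent sketch there (my assumption about the modern API path, not the code used above) is to pass a dict of per-column functions to .agg:
# Aggregate column 0 with max and column 1 with the custom frequent function.
test.groupby(group).resample('48h').agg({0: 'max', 1: frequent})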
Related
I have a question related to the earlier question: Identifying consecutive NaN's with pandas.
I am new to Stack Overflow so I cannot add a comment, but I would like to know how I can partly keep the original index of the dataframe when counting the number of consecutive NaNs.
So instead of:
df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
df
Out[38]:
a
0 1
1 2
2 NaN
3 NaN
4 NaN
5 6
6 7
7 8
8 9
9 10
10 NaN
11 NaN
12 13
13 14
I would like to obtain the following:
Out[41]:
a
0 0
1 0
2 3
5 0
6 0
7 0
8 0
9 0
10 2
12 0
13 0
I have found a workaround. It is quite ugly, but it does the trick. I hope you don't have massive data, because it might not perform very well:
df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
df1 = df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
# Determine the different groups of NaNs. We only want to keep the 1st. The 0's are non-NaN values, the 1's are the first in a group of NaNs.
b = df.isna()
df2 = b.cumsum() - b.cumsum().where(~b).ffill().fillna(0).astype(int)
df2 = df2.loc[df2['a'] <= 1]
# Align the non-zero NaN counts with the index of the first NaN in each group
df3 = df1.loc[df1 != 0]
df3.index = df2.loc[df2['a'] == 1].index
# Update the values from df3 (which has the right values, and the right index), to df2
df2.update(df3)
The NaN-group trick is inspired by the answer to the question linked above (Identifying consecutive NaN's with pandas).
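For reference, a possibly simpler sketch of the same run-length idea (my own variant, assuming the column is named 'a' as in the example; not the workaround above):
isna = df['a'].isna()
run_id = (~isna).cumsum()                        # label each NaN run by the count of preceding non-NaNs
run_len = isna.groupby(run_id).transform('sum').where(isna, 0)  # run length on NaN rows, 0 elsewhere
first_of_run = isna & ~isna.shift(fill_value=False)             # True only on the first NaN of each run
out = run_len[~isna | first_of_run].astype(int)  # keep non-NaN rows plus the first NaN of each run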
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
grouped = df.groupby('b')
Now I want to sample from each group, e.g. 30% from the group where b = 1 and 20% from the group where b = 0. How should I do that?
Also, if I want 150% for some group, can I do that?
You can dynamically return a random sample of the dataframe with a different percentage of rows per group. This works with percentages below 100% (see example 1) and above 100% (see example 2, by passing replace=True).
Using np.select, create a new column c that holds, per group, the fraction of rows (or, for example 2, the row count) to sample, e.g. 20%, 40%, and so on.
From there, you can sample x rows per group based on these conditions. From those rows, take the .index and filter the original dataframe with .loc, keeping columns 'a' and 'b'. The expression grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])) creates a MultiIndex series of the output you are looking for, but it requires some cleanup, so it is easier to grab the .index and filter the original dataframe with .loc than to clean up the messy MultiIndex series.
grouped = df.groupby('b', group_keys=False)
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [0.4, 0.2])
df.loc[grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])).index, ['a','b']]
Out[1]:
a b
6 7 0
8 9 0
3 4 1
If you would like to return a larger random sample using duplicates of the existing values, simply pass replace=True. Then do some cleanup to get the output.
grouped = df.groupby('b', group_keys=False)
v = df['b'].value_counts()
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)],
                    [int(v.loc[0] * 1.2), int(v.loc[1] * 2)])
# The frac parameter of sample doesn't work when frac > 1, so we have to
# calculate the integer number of rows to be sampled instead.
(grouped.apply(lambda x: x['b'].sample(x['c'].iloc[0], replace=True))
 .reset_index()
 .rename({'index': 'a'}, axis=1))
Out[2]:
a b
0 7 0
1 8 0
2 9 0
3 7 0
4 7 0
5 8 0
6 1 1
7 3 1
8 3 1
9 1 1
10 0 1
11 0 1
12 4 1
13 2 1
14 3 1
15 0 1
You can get a DataFrame from the GroupBy object with, e.g. grouped.get_group(0). If you want to sample from that you can use the .sample method. For instance grouped.get_group(0).sample(frac=0.2) gives:
a
5 6
For the example you give, both samples will contain only one element, because the groups have 4 and 3 elements and 0.2*4 = 0.8 and 0.3*3 = 0.9 both round to 1.
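To cover both groups with their own fractions, a small sketch building on the grouped object from the question:
import pandas as pd
# Sample 20% of the rows where b == 0 and 30% of the rows where b == 1, then recombine.
sampled = pd.concat([grouped.get_group(0).sample(frac=0.2),
                     grouped.get_group(1).sample(frac=0.3)])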
I have the following DataFrame (in reality I'm working with around 20 million rows):
shop month day sale
1 7 1 10
1 6 1 8
1 5 1 9
2 7 1 10
2 6 1 8
2 5 1 9
I want another column, "prev month sale", where the value is the sale of the previous month for the same shop and day, e.g.
shop month day sale prev month sale
1 7 1 10 8
1 6 1 8 9
1 5 1 9 9
2 7 1 10 8
2 6 1 8 9
2 5 1 9 9
One solution using .concat(), set_index(), and .loc[]:
# Get index of (shop, previous month, day).
# This will serve as a unique index to look up prev. month sale.
prev = pd.concat((df.shop, df.month - 1, df.day), axis=1)
# Build a MultiIndex from the (shop, month - 1, day) columns for the lookup.
prev = pd.MultiIndex.from_arrays(prev.values.T)
# (an earlier revision used a list of tuples: [tuple(i) for i in prev.values])
# Now call .loc on df to look up each prev. month sale.
sale_prev_month = df.set_index(['shop', 'month', 'day']).loc[prev]
# And finally just concat rather than merge/join operation
# because we want to ignore index & mimic a left join.
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
shop month day sale sale
0 1 7 1 10 8.0
1 1 6 1 8 9.0
2 1 5 1 9 NaN
3 2 7 1 10 8.0
4 2 6 1 8 9.0
5 2 5 1 9 NaN
Your new column will be float, not int, because of the presence of NaNs.
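If you would rather keep integers despite the missing matches, one option (my assumption, requiring a reasonably recent pandas with nullable integer dtypes; not part of the answer above) is to cast the looked-up values before the final concat:
# Missing previous-month sales become <NA> instead of forcing the column to float.
sale_prev_month = sale_prev_month.astype('Int64')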
Update - an attempt with dask
I don't use dask day to day, so this is probably woefully sub-par. It tries to work around the fact that dask does not implement pandas' MultiIndex by concatenating the three key columns into a single string column and looking up on that.
import dask.dataframe as dd
# Play around with npartitions or chunksize here!
df2 = dd.from_pandas(df, npartitions=10)
# Get a *single* index of unique (shop, month, day) IDs.
# Dask doesn't support MultiIndex.
on = ['shop', 'month', 'day']  # key columns used for the lookup
empty = pd.Series(np.empty(len(df), dtype='object'))  # Passed to `meta`
current = df2.loc[:, on].apply(lambda row: '_'.join(row.astype(str)), axis=1,
                               meta=empty)
prev = df2.loc[:, on].assign(month=df2['month'] - 1)\
          .apply(lambda row: '_'.join(row.astype(str)), axis=1, meta=empty)
df2 = df2.set_index(current)
# We now have two dask.Series, `current` and `prev`, in the
# concatenated format "shop_month_day".
# We also have a dask.DataFrame, df2, which is indexed by `current`
# I would think we could just call df2.loc[prev].compute(), but
# that's throwing a KeyError for me, so slightly more expensive:
sale_prev_month = df2.compute().loc[prev.compute()][['sale']]\
.reset_index(drop=True)
# Now just concat as before
# Could re-break into dask objects here if you really needed to
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
I have a pandas DataFrame that I group by, and then aggregate to get the mean:
grouped = df.groupby(['year_month', 'company'])
means = grouped.agg({'size':['mean']})
Which gives me a dataframe back, but I can't seem to filter it to the specific company and year_month that I want:
means[(means['year_month']=='201412')]
gives me a KeyError
The issue is that you are grouping by 'year_month' and 'company'. Hence, in the means DataFrame, year_month and company are part of the index (a MultiIndex); you cannot access them the way you access other columns.
One method is to get the values of the 'year_month' level of the index. Example -
means.loc[means.index.get_level_values('year_month') == '201412']
Demo -
In [38]: df
Out[38]:
A B C
0 1 2 10
1 3 4 11
2 5 6 12
3 1 7 13
4 2 8 14
5 1 9 15
In [39]: means = df.groupby(['A','B']).mean()
In [40]: means
Out[40]:
C
A B
1 2 10
7 13
9 15
2 8 14
3 4 11
5 6 12
In [41]: means.loc[means.index.get_level_values('A') == 1]
Out[41]:
C
A B
1 2 10
7 13
9 15
As already pointed out, you will end up with a 2 level index. You could try to unstack the aggregated dataframe:
means = df.groupby(['year_month', 'company']).agg({'size':['mean']}).unstack(level=1)
This should give you a single 'year_month' index, 'company' as columns and your aggregate size as values. You can then slice by the index:
means.loc['201412']
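Alternatively, if you keep the MultiIndexed result from the groupby/agg (i.e. without unstacking), a cross-section gives the same slice. A sketch built from the question's df:
# Select every company for a single year_month value from the MultiIndex.
df.groupby(['year_month', 'company']).agg({'size': ['mean']}).xs('201412', level='year_month')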
I have a Pandas Series and based on a random number I want to pick a row (5 in the code example below) and drop that row. When the row is dropped I want to create a new index for the remaining rows (0 to 8). The code below:
print 'Original series: ', sample_mean_series
print 'Length of original series', len(sample_mean_series)
sample_mean_series = sample_mean_series.drop([5],axis=0)
print 'Series with item 5 dropped: ', sample_mean_series
print 'Length of modified series:', len(sample_mean_series)
print sample_mean_series.reindex(range(len(sample_mean_series)))
And this is the output:
Original series:
0 0.000074
1 -0.000067
2 0.000076
3 -0.000017
4 -0.000038
5 -0.000051
6 0.000125
7 -0.000108
8 -0.000009
9 -0.000052
Length of original series 10
Series with item 5 dropped:
0 0.000074
1 -0.000067
2 0.000076
3 -0.000017
4 -0.000038
6 0.000125
7 -0.000108
8 -0.000009
9 -0.000052
Length of modified series: 9
0 0.000074
1 -0.000067
2 0.000076
3 -0.000017
4 -0.000038
5 NaN
6 0.000125
7 -0.000108
8 -0.000009
My problem is that the row number 8 is dropped. I want to drop row "5 NaN" and keep -0.000052 with an index 0 to 8. This is what I want it to look like:
0 0.000074
1 -0.000067
2 0.000076
3 -0.000017
4 -0.000038
5 0.000125
6 -0.000108
7 -0.000009
8 -0.000052
Somewhat confusingly, reindex does not mean "create a new index". To create a new index, just assign to the index attribute. So at your last step just do sample_mean_series.index = range(len(sample_mean_series)).
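Spelled out on the series from the question (just the suggestion above restated as code):
sample_mean_series = sample_mean_series.drop([5])
# Overwrite the index with a fresh 0..n-1 range.
sample_mean_series.index = range(len(sample_mean_series))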
Here's a one-liner:
In [1]: s
Out[1]:
0 -0.942184
1 0.397485
2 -0.656745
3 1.415797
4 1.123858
5 -1.890870
6 0.401715
7 -0.193306
8 -1.018140
9 0.262998
I use the Series.drop method to drop row 5 and then use reset_index to re-number the indices to be consecutive. Without using reset_index, the indices would jump from 4 to 6 with no 5.
By default, reset_index will move the original index into a DataFrame and return it alongside the series values. Passing drop=True prevents this from happening.
In [2]: s2 = s.drop([5]).reset_index(drop=True)
In [3]: s2
Out[3]:
0 -0.942184
1 0.397485
2 -0.656745
3 1.415797
4 1.123858
5 0.401715
6 -0.193306
7 -1.018140
8 0.262998
Name: 0
To drop rows in a dataframe and clean up the index:
b = df['amount'] > 10000
# Drop the rows that don't satisfy the condition, then rebuild a 0..n-1 index.
df_dropped = df.drop(df[~b].index).reset_index(drop=True)
df.reset_index(drop=True, inplace=True)
Will do exactly what you want.
When you reset the index, the old index is added as a column, and a new sequential index is used. You can use the drop parameter to avoid the old index being added as a column.
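A tiny illustration of the difference (my own toy series, not from the question):
import pandas as pd
s = pd.Series([10, 20, 30], index=[0, 2, 5])
s.reset_index()            # old index becomes a column; the result is a DataFrame
s.reset_index(drop=True)   # old index is discarded; still a Series, indexed 0..2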