Using slice on DataFrameGroupBy - python

I need to use a slice on a DataFrameGroupBy object.
For example, assume there is a DataFrame with columns A-Z. If I want to use columns A-C, I can write .loc[:, 'A':'C'], but when I'm using a DataFrameGroupBy I can't use slicing, so I have to write [['A', 'B', 'C']].
Take a look here:
from numpy import around
from numpy.random import uniform
from pandas import DataFrame
from string import ascii_lowercase
data = around(a=uniform(low=1.0, high=50.0, size=(6, len(ascii_lowercase) + 1)), decimals=3)
df = DataFrame(data=data, columns=['group'] + list(ascii_lowercase), dtype='float64')
rows, columns = df.shape
df.loc[:rows // 2, 'group'] = 1.0
df.loc[rows // 2:, 'group'] = 2.0
print(df)
abc = df.groupby(by='group')[['a', 'b', 'c']].shift(periods=1)
print(abc)
Output of df is:
group a b c ... w x y z
0 1.0 22.380 36.873 10.073 ... 26.052 38.625 48.122 33.841
1 1.0 16.702 32.160 35.018 ... 12.990 17.878 19.297 16.330
2 1.0 9.957 25.202 7.106 ... 46.500 12.932 37.401 43.134
3 2.0 42.395 40.616 24.611 ... 30.436 33.521 42.136 2.690
4 2.0 2.069 29.891 2.217 ... 20.734 12.365 9.302 47.019
5 2.0 4.208 23.955 33.966 ... 45.439 16.488 32.892 9.345
Output of abc is:
a b c
0 NaN NaN NaN
1 22.380 36.873 10.073
2 16.702 32.160 35.018
3 NaN NaN NaN
4 42.395 40.616 24.611
5 2.069 29.891 2.217
How can I avoid using [['a', 'b', 'c']]? I have 105 columns that I would need to write there; I want to use slicing like .loc[:, 'a':'c'].
Thank you all :)

You can group by the Series df['group'], so it is possible to filter the columns before the groupby and pass only the filtered columns:
abc = df.loc[:, 'a':'c'].groupby(by=df['group']).shift(periods=1)
print(abc)
a b c
0 NaN NaN NaN
1 37.999 21.197 39.527
2 35.560 27.214 23.211
3 NaN NaN NaN
4 49.053 11.319 37.279
5 27.881 38.529 46.550
Another idea is to use:
cols = df.loc[:, 'a':'c'].columns
abc = df.groupby(by='group')[cols].shift(periods=1)
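Both approaches should produce the same frame. A quick sanity check (a sketch, reusing the df from the question):
import pandas as pd

abc1 = df.loc[:, 'a':'c'].groupby(by=df['group']).shift(periods=1)

cols = df.loc[:, 'a':'c'].columns
abc2 = df.groupby(by='group')[cols].shift(periods=1)

pd.testing.assert_frame_equal(abc1, abc2)  # raises if they differ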

Pandas groupby mean() not ignoring NaNs

If I calculate the mean of a groupby object and one of the groups contains NaN(s), the NaNs are ignored. Even when applying np.mean, it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive the following result:
a
b
1 1.5
2 NaN
I am aware that I can replace NaNs beforehand and that I can probably write my own aggregation function to return NaN as soon as a NaN is within the group. This function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips the NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on these comments: @Serge Ballesta, @RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
faster than apply but slower than the default (optimized) mean:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
fastest, but needs more code:
method3 = c.groupby('b').mean()
nan_groups = c.loc[c['a'].isna(), 'b'].unique()  # group keys whose group contains a NaN
method3.loc[method3.index.isin(nan_groups)] = np.nan
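A quick consistency check (a sketch, assuming c and method3 as defined above):
m2 = c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
pd.testing.assert_frame_equal(m2, method3)  # both give a=1.5 for b=1, NaN for b=2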
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
gb.sum() * (cnt / cnt)
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
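Putting the pieces above together, a minimal helper might look like this (a sketch; the function name is mine, not a pandas API):
import numpy as np
import pandas as pd

def group_mean_strict(df, by):
    # group mean that is NaN for any group containing at least one NaN
    gb = df.groupby(by)
    cnt = gb.count()                                # non-NaN count per group/column
    mask = gb.size().values[:, None] == cnt.values  # True where the group has no NaN
    return (gb.sum() / cnt).where(mask)             # NaN out incomplete groups

For the example at the top, group_mean_strict(c, 'b') returns 1.5 for group 1 and NaN for group 2.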

How to overwrite values of the first row of a dataframe

Given a pandas.DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
I would like to know the best way to replace all values in the first row with a 0 (or some other specific value) and work with the new dataframe. I would like to do this in a general way, where there may be more or fewer columns than in this example.
Despite the simplicity of the question, I was not able to come across a solution. Most examples posted by others had to do with fillna() and related methods.
You can use iloc to do that pretty cleanly:
Code:
df.iloc[0] = 0
Test Code:
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(df)
df.iloc[0] = 0
print(df)
Results:
a b c d e
0 0.715524 -0.914676 0.241008 -1.353033 0.170578
1 -0.300348 1.118491 -0.520407 0.185877 -0.950839
2 1.942239 0.980477 0.110457 -0.558483 0.903775
3 0.400923 1.347769 -0.120445 0.036253 0.683571
4 -0.761881 -0.642469 2.030019 2.274070 -0.067672
5 0.566003 0.263949 -0.567247 0.689599 0.870442
6 1.904812 -0.689312 1.400950 1.942681 -1.268679
7 -0.253381 0.464208 1.362960 0.129433 0.527576
8 -1.404035 0.174586 1.006268 0.007333 1.172559
9 0.330404 0.735610 1.277451 -0.104888 0.528356
a b c d e
0 0.000000 0.000000 0.000000 0.000000 0.000000
1 -0.300348 1.118491 -0.520407 0.185877 -0.950839
2 1.942239 0.980477 0.110457 -0.558483 0.903775
3 0.400923 1.347769 -0.120445 0.036253 0.683571
4 -0.761881 -0.642469 2.030019 2.274070 -0.067672
5 0.566003 0.263949 -0.567247 0.689599 0.870442
6 1.904812 -0.689312 1.400950 1.942681 -1.268679
7 -0.253381 0.464208 1.362960 0.129433 0.527576
8 -1.404035 0.174586 1.006268 0.007333 1.172559
9 0.330404 0.735610 1.277451 -0.104888 0.528356
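As a side note (not part of the original answer), the assigned value need not be a scalar; a sequence with one entry per column also works:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
df.iloc[0] = 0                # scalar: broadcast across all columns
df.iloc[0] = [1, 2, 3, 4, 5]  # sequence: one value per column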

Update DataFrame using a slice of a MultiIndex (many rows) with Pandas

I have a MultiIndex of dimensions I, M and would like, for one i ∈ I, to update all M rows at the same time.
Here is my data frame:
>>> result.head(n=10)
Out[9]:
FINLWT21
i INCAGG
0 1 NaN
7 NaN
9 NaN
5 NaN
3 NaN
1 1 NaN
7 NaN
9 NaN
5 NaN
3 NaN
Here is what I would like to fill in:
sample.groupby(field).sum()
FINLWT21
INCAGG
1 8800809.719
3 9951002.611
5 9747721.721
7 7683066.990
9 11091861.692
I thought the right command would be result.loc[i] = sample.groupby(field).sum(). However, here are the contents of result afterwards:
>>> result.loc[i]
Out[11]:
FINLWT21
INCAGG
1 NaN
7 NaN
9 NaN
5 NaN
3 NaN
How can I update all the "inner index" at the same time?
You want to use pd.IndexSlice. It returns an object that can be used in slicing with loc.
Solution
result.sort_index(inplace=True)
slc = pd.IndexSlice[i, :]
result.loc[slc, :] = sample.groupby(field).sum()
Explanation
result.sort_index(inplace=True) -> pd.IndexSlice requires the index to be sorted.
slc = pd.IndexSlice[i, :] -> syntax to create a generic slicer that gets the ith group of the 1st level of a pd.MultiIndex with 2 levels.
result.loc[slc, :] = ... -> use the slice in the assignment.
Demonstration
import pandas as pd
import numpy as np
result = pd.DataFrame([], columns=['FINLWT21'],
                      index=pd.MultiIndex.from_product([[0, 1], [1, 7, 9, 5, 3]]))
result.sort_index(inplace=True)
slc = pd.IndexSlice[0, :]
result.loc[slc, :] = [1, 2, 3, 4, 5]
print(result)
FINLWT21
0 1 1
3 2
5 3
7 4
9 5
1 1 NaN
3 NaN
5 NaN
7 NaN
9 NaN
This is a function I have which might be what you are looking for:
from warnings import warn
import numpy as np
import pandas as pd

def _assign_multi_index(dest, k, v, inplace=True, bool_nan=False):
    """
    Assigns v to dest[k] in place, doing a "sensible" multi-index alignment,
    raising a ValueError if no alignment is achieved.
    I'm not sure if there's a better way to do this, or a reason not to do it
    the way it's currently written.
    """
    if not inplace:
        raise NotImplementedError()
    if k in dest:
        warn("key '{}' already exists, continue with caution!".format(k))
    v_names = v.index.names
    dest_names = dest.index.names
    if all(n in dest_names for n in v_names):
        if len(v_names) < len(dest_names):
            # if need be, drop some index levels temporarily in dest
            dropped_names = [n for n in dest_names if n not in v_names]
            dest.reset_index(dropped_names, inplace=True)
        v.index = v.index.reorder_levels([n for n in dest_names if n in v_names])  # just to be safe
    else:
        raise ValueError("index levels do not match dest.")
    dest[k] = v
    # restore the original index levels if need be
    if dest.index.names != dest_names:
        dest.reset_index(inplace=True)
        dest.set_index(dest_names, inplace=True)
    # note: bool_nan != np.nan is always True (NaN compares unequal to everything)
    if bool_nan != np.nan and v.dtype.name == 'bool' and dest[k].dtype.name != 'bool':
        # this happens when NaNs had to be inserted; convert the NaNs
        dest_k = dest[k].copy()
        dest_k[pd.isnull(dest_k)] = bool_nan
        dest[k] = dest_k.astype(bool)
It turns out that probably the best way is to add an index to the right-hand data set. The following works as expected:
data = sample.groupby(field).sum()
data['index'] = i
result.loc[i] = data.reset_index().set_index(['index', field])
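For context, the NaNs in the original attempt come from label alignment: a .loc assignment aligns the right-hand side on the index, and sample.groupby(field).sum() has a plain INCAGG index that doesn't match result's two-level index. A minimal, self-contained sketch of sidestepping the alignment (the values are made up):
import numpy as np
import pandas as pd

result = pd.DataFrame(
    np.nan, columns=['FINLWT21'],
    index=pd.MultiIndex.from_product([[0, 1], [1, 3, 5, 7, 9]],
                                     names=['i', 'INCAGG']))
sums = pd.Series([8.8, 9.9, 9.7, 7.7, 11.1],
                 index=pd.Index([1, 3, 5, 7, 9], name='INCAGG'))

# reindex to the inner level's order, then assign raw values so that
# pandas does not try (and fail) to align the two indexes
result.loc[pd.IndexSlice[0, :], 'FINLWT21'] = sums.reindex(
    result.loc[0].index).to_numpy()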

Applying a function to a pandas col

I would like to map the function GetPermittedFAR to my dataframe (df) so that I can test whether a value in the col zonedist1 equals a certain value and, based on that, build new cols such as df['FAR_Permitted'] etc.
I have tried various means of map() etc. but haven't gotten this to work. I feel this should be a pretty simple thing to do?
Ideally, I would use a simple list comprehension / lambda as I have many of these test conditional values resulting in col data to create.
import pandas as pd
import numpy as np
def GetPermittedFAR():
    if df['zonedist1'] == 'R7-3':
        df['FAR_Permitted'] = 0.5
        df['Building Height Max'] = 35
    if df['zonedist1'] == 'R3-2':
        df['FAR_Permitted'] = 0.5
        df['Building Height Max'] = 35
    if df['zonedist1'] == 'R1-1':
        df['FAR_Permitted'] = 0.7
        df['Building Height Max'] = 100
    # etc. - an if statement for each unique value in 'zonedist1'

df = pd.DataFrame({'zonedist1': ['R7-3', 'R3-2', 'R1-1',
                                 'R1-2', 'R2', 'R2A', 'R2X',
                                 'R1-1', 'R7-3', 'R3-2', 'R7-3',
                                 'R3-2', 'R1-1', 'R1-2']})

df = df.apply(lambda x: GetPermittedFAR(), axis=1)
How about using pd.merge()?
Let df be your dataframe
In [612]: df
Out[612]:
zonedist1
0 R7-3
1 R3-2
2 R1-1
3 R1-2
4 R2
5 R2A
6 R2X
and let merge be another dataframe with the conditions:
In [613]: merge
Out[613]:
zonedist1 FAR_Permitted Building Height Max
0 R7-3 0.5 35
1 R3-2 0.5 35
Then merge df with merge using how='left':
In [614]: df.merge(merge, how='left')
Out[614]:
zonedist1 FAR_Permitted Building Height Max
0 R7-3 0.5 35
1 R3-2 0.5 35
2 R1-1 NaN NaN
3 R1-2 NaN NaN
4 R2 NaN NaN
5 R2A NaN NaN
6 R2X NaN NaN
Later you can replace NaN values.
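An alternative sketch (not from the original answer) that avoids building a second DataFrame is Series.map with plain dicts; unmatched keys become NaN, just like with the left merge:
import pandas as pd

# hypothetical lookup tables; the values are illustrative only
far_map = {'R7-3': 0.5, 'R3-2': 0.5, 'R1-1': 0.7}
height_map = {'R7-3': 35, 'R3-2': 35, 'R1-1': 100}

df = pd.DataFrame({'zonedist1': ['R7-3', 'R3-2', 'R1-1', 'R2']})
df['FAR_Permitted'] = df['zonedist1'].map(far_map)          # unmatched -> NaN
df['Building Height Max'] = df['zonedist1'].map(height_map)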

How to create a lagged data structure using pandas dataframe

Example
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print(s)
1 5
2 4
3 3
4 2
5 1
Is there an efficient way to create a Series containing in each row the lagged values (in this example up to lag 2)?
3 [3, 4, 5]
4 [2, 3, 4]
5 [1, 2, 3]
This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])
How can this be done in an efficient way for dataframes with a lot of timeseries which are very long?
Thanks
Edited after seeing the answers
OK, in the end I implemented this function:
def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if type(s) is pd.DataFrame:
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)
    elif type(s) is pd.Series:
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res
It produces the desired outputs and manages the naming of the columns in the resulting DataFrame.
For a Series as input:
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res=buildLaggedFeatures(s,lag=2,dropna=False)
lag_0 lag_1 lag_2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
and for a DataFrame as input:
s2 = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]}, index=[1, 2, 3, 4, 5])
res2=buildLaggedFeatures(s2,lag=2,dropna=True)
a a_lag1 a_lag2 b b_lag1 b_lag2
3 3 4 5 30 40 50
4 2 3 4 20 30 40
5 1 2 3 10 20 30
As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
0 1 2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
0 1 2
3 3 4 5
4 2 3 4
5 1 2 3
Doing work on this will be more efficient than on lists...
Very simple solution using pandas DataFrame:
number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

# if you want numpy arrays with no null values:
df.dropna().values
For Python 3.x (change xrange to range):
number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)
print(df)
vals lag_1 lag_2 lag_3
0 5 NaN NaN NaN
1 4 5.0 NaN NaN
2 3 4.0 5.0 NaN
3 2 3.0 4.0 5.0
4 1 2.0 3.0 4.0
For a dataframe df with the lag to be applied on 'col name', you can use the shift function.
df['lag1']=df['col name'].shift(1)
df['lag2']=df['col name'].shift(2)
I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.
Here's an example of the result:
# Setup
indx = pd.Index([1, 2, 3, 4, 5], name='time')
s = pd.Series(
    [5, 4, 3, 2, 1],
    index=indx,
    name='population')
shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])
The result is a MultiIndex DataFrame with two column levels: the original one ("population") and a new one ("lag").
Solution: Like in the accepted solution, we use DataFrame.shift and then pandas.concat.
def shift_timeseries_by_lags(df, lags, lag_label='lag'):
    return pd.concat([
        shift_timeseries_and_create_multiindex_column(df, lag,
                                                      lag_label=lag_label)
        for lag in lags], axis=1)

def shift_timeseries_and_create_multiindex_column(
        dataframe, lag, lag_label='lag'):
    return (dataframe.shift(lag)
            .pipe(append_level_to_columns_of_dataframe,
                  lag, lag_label))
I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.
def append_level_to_columns_of_dataframe(
        dataframe, new_level, name_of_new_level, inplace=False):
    """Given a (possibly MultiIndex) DataFrame, append labels to the column
    labels and assign this new level a name.

    Parameters
    ----------
    dataframe : a pandas DataFrame with an Index or MultiIndex columns
    new_level : scalar, or arraylike of length equal to the number of columns
        in `dataframe`
        The labels to put on the columns. If scalar, it is broadcast into a
        list of length equal to the number of columns in `dataframe`.
    name_of_new_level : str
        The label to give the new level.
    inplace : bool, optional, default: False
        Whether to modify `dataframe` in place or to return a copy
        that is modified.

    Returns
    -------
    dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
        The original `dataframe` with new columns that have the given `level`
        appended to each column label.
    """
    old_columns = dataframe.columns
    if not hasattr(new_level, '__len__') or isinstance(new_level, str):
        new_level = [new_level] * dataframe.shape[1]
    if isinstance(dataframe.columns, pd.MultiIndex):
        new_columns = pd.MultiIndex.from_arrays(
            old_columns.levels + [new_level],
            names=(old_columns.names + [name_of_new_level]))
    elif isinstance(dataframe.columns, pd.Index):
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns] + [new_level],
            names=([old_columns.name] + [name_of_new_level]))
    if inplace:
        dataframe.columns = new_columns
        return dataframe
    else:
        copy_dataframe = dataframe.copy()
        copy_dataframe.columns = new_columns
        return copy_dataframe
Update: I learned from this solution another way to put a new level in a column, which makes it unnecessary to use append_level_to_columns_of_dataframe:
def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    return pd.concat({
        '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
            df.shift(lag)
        for lag in lags},
        axis=1)
The result of shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]) is equivalent, with the lag labels (lag_0, lag_1, lag_2) as the outer column level.
Here is a cool one-liner for lagged features with _lagN suffixes in the column names, using pd.concat:
lagged = pd.concat([s.shift(lag).rename('{}_lag{}'.format(s.name, lag+1)) for lag in range(3)], axis=1).dropna()
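For instance, with the Series from the question given the name 'test' (note that under this naming scheme the unshifted column, lag 0, gets the suffix _lag1):
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5], name='test')
lagged = pd.concat([s.shift(lag).rename('{}_lag{}'.format(s.name, lag + 1))
                    for lag in range(3)], axis=1).dropna()
# columns: test_lag1 (unshifted), test_lag2, test_lag3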
You can do the following:
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
res = pd.DataFrame(index=s.index)
for l in range(3):
    res[l] = s.shift(l)
print(res.loc[3:, :].to_numpy())  # .ix / .as_matrix() are long deprecated
It produces:
array([[ 3., 4., 5.],
[ 2., 3., 4.],
[ 1., 2., 3.]])
which I hope is very close to what you actually want.
For multiple (many) lags, this could be more compact:
df=pd.DataFrame({'year': range(2000, 2010), 'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})
df.join(pd.DataFrame({'gdp_' + str(lag): df['gdp'].shift(lag) for lag in range(1,4)}))
Assuming you are focusing on a single column in your data frame, saved into s, this short piece of code will generate lagged instances of the column (here up to lag 2):
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5], name='test')
shiftdf=pd.DataFrame()
for i in range(3):
    shiftdf = pd.concat([shiftdf, s.shift(i).rename(s.name + '_' + str(i))], axis=1)
shiftdf
>>
test_0 test_1 test_2
1 5 NaN NaN
2 4 5.0 NaN
3 3 4.0 5.0
4 2 3.0 4.0
5 1 2.0 3.0
Based on the proposal by @charlie-brummitt, here is a revision that keeps a set of columns fixed (unlagged):
def shift_timeseries_by_lags(df, fix_columns, lag_numbers, lag_label='lag'):
    df_fix = df[fix_columns]
    df_lag = df.drop(columns=fix_columns)

    df_lagged = pd.concat({f'{lag_label}_{lag}': df_lag.shift(lag)
                           for lag in lag_numbers},
                          axis=1)
    df_lagged.columns = ['__'.join(reversed(x)) for x in df_lagged.columns.to_flat_index()]

    return pd.concat([df_fix, df_lagged], axis=1)
Here is an example of usage:
df = shift_timeseries_by_lags(df_province_cases, fix_columns=['country', 'state'], lag_numbers=[1,2,3])
I personally prefer the lag name as a suffix, but that can be changed by removing reversed().
