Boolean selection with loc in pandas

s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
The output of s is:
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
s.loc[[False,True]]
It gives this output:
48 NaN
The docs say .loc accesses a group of rows and columns by label(s). Here I passed a list of booleans, and it is not even equal in length to the axis being sliced.
My question is: when a boolean list is given to .loc, does it slice the DataFrame/Series by position instead of by label?

I certainly get an error when I run this:
import pandas as pd
import numpy as np
s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
s.loc[[False,True]]
the error (as expected) is:
IndexError: Item wrong length 2 instead of 10.
Maybe your problem is specific to a certain version of pandas? Maybe an old one? I used pandas version 0.25.3
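For reference, here is what boolean selection with .loc looks like when the mask does have one entry per row (a small sketch based on the series above):
import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])

# A plain boolean list passed to .loc must have one entry per row and is applied
# in row order; this mask keeps only the second row, labelled 48.
mask = [False, True, False, False, False, False, False, False, False, False]
print(s.loc[mask])

# The usual idiom is to build the mask from the data itself, e.g. by label value:
print(s.loc[s.index > 40])   # rows labelled 49, 48, 47, 46, 45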

Related

Pandas pct_change with moving average

I would like to use pandas' pct_change to compute the rate of change between each value and the previous rolling average (before that value). Here is what I mean:
If I have:
import pandas as pd
df = pd.DataFrame({'data': [1, 2, 3, 7]})
I would expect to get, for window size of 2:
0 NaN
1 NaN
2 1
3 1.8
because roc(3, avg(1, 2)) = (3 - 1.5)/1.5 = 1, and the same calculation gives 1.8 for the last row. Using pct_change with the periods parameter just compares against the entry n rows back, so it doesn't do the job.
Any ideas on how to do this in an elegant pandas way for any window size?
Here is one way to do it, using rolling and shift:
df['avg'] = df.rolling(2).mean()
df['poc'] = (df['data'] - df['avg'].shift(1)) / df['avg'].shift(1)
df.drop(columns='avg')
data poc
0 1 NaN
1 2 NaN
2 3 1.0
3 7 1.8
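If you need this for an arbitrary window size, the same rolling/shift idea can be wrapped in a small helper (the function name is mine):
import pandas as pd

def pct_change_vs_rolling_mean(s, window):
    # Rate of change of each value versus the mean of the `window` values before it.
    prev_avg = s.rolling(window).mean().shift(1)
    return (s - prev_avg) / prev_avg

df = pd.DataFrame({'data': [1, 2, 3, 7]})
print(pct_change_vs_rolling_mean(df['data'], window=2))
# 0    NaN
# 1    NaN
# 2    1.0
# 3    1.8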

Vectorized way of finding the index of a previously occurring element

Let's say I have this Pandas series:
num = pd.Series([1,2,3,4,5,6,5,6,4,2,1,3])
What I want to do is take a number, say 5, and return the index where it previously occurred. So for the element 5 I should get 4, as the element appears at indices 4 and 6. Now I want to do this for all of the elements of the series, which can easily be done with a for loop:
prev_idx = []
for idx, x in enumerate(num):
    idx_prev = num[num == x].idxmax()   # index of the first occurrence of x
    prev_idx.append(idx_prev if idx_prev < idx else np.nan)
However, this process consumes too much time for longer series lengths due to the looping. Is there a way to implement the same thing but in a vectorized form? The output should be something like this:
[NaN,NaN,NaN,NaN,NaN,NaN,4,5,3,1,0,2]
You can use groupby to shift the index:
num.index.to_series().groupby(num).shift()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
7 5.0
8 3.0
9 1.0
10 0.0
11 2.0
dtype: float64
It's also possible to keep working in NumPy.
The equivalent of [num[num == x].idxmax() for idx, x in enumerate(num)] in NumPy is:
_, first_idx, inv = np.unique(num.values, return_index=True, return_inverse=True)
out = first_idx[inv]
which assigns
array([0, 1, 2, 3, 4, 5, 4, 5, 3, 1, 0, 2], dtype=int64)
to out. Now you can set the positions of out that have no earlier occurrence to NaN like this:
out_series = pd.Series(out, dtype=float)
out_series[out >= np.arange(len(out))] = np.nan
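Packaged into one function (the name is mine), the NumPy route looks like this; note that, like the list comprehension it mirrors, it returns the index of the first occurrence rather than the most recent one:
import numpy as np
import pandas as pd

def first_prior_occurrence(num):
    # Index of the first occurrence of each value, NaN when there is no earlier occurrence.
    _, first_idx, inv = np.unique(num.values, return_index=True, return_inverse=True)
    out = first_idx[inv].astype(float)
    out[out >= np.arange(len(out))] = np.nan
    return pd.Series(out, index=num.index)

num = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 4, 2, 1, 3])
print(first_prior_occurrence(num).tolist())
# [nan, nan, nan, nan, nan, nan, 4.0, 5.0, 3.0, 1.0, 0.0, 2.0]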

Pandas groupby mean() not ignoring NaNs

If I calculate the mean of a groupby object and one of the groups contains a NaN, the NaN is ignored. Even when applying np.mean it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive following result:
a
b
1 1.5
2 NaN
I am aware that I can replace the NaNs beforehand and that I could probably write my own aggregation function that returns NaN as soon as a NaN is within the group. That function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips NaN values. You can make it include them by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example, based on the comments by Serge Ballesta and RoelAdriaans:
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply, but slower than the default aggregation:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs more code:
method3 = c.groupby('b').mean()
nan_index = c.loc[c['a'].isna(), 'b'].to_list()   # group keys that contain a NaN in 'a'
method3.loc[method3.index.isin(nan_index)] = np.nan
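As a quick self-contained check of that third method on the question's example frame (a sketch; the variable names are mine):
import numpy as np
import pandas as pd

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

method3 = c.groupby('b').mean()                    # fast built-in mean, NaNs skipped for now
nan_index = c.loc[c['a'].isna(), 'b'].to_list()    # group keys whose 'a' contains a NaN -> [2]
method3.loc[method3.index.isin(nan_index)] = np.nan

print(method3)
#        a
# b
# 1    1.5
# 2    NaN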
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
gb.sum() * (cnt / cnt)
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
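If this comes up repeatedly, the building blocks above can be folded into one helper (the function name is my own): it returns the per-group mean and yields NaN for any group/column that contains at least one NaN, which also works for complex columns:
import numpy as np
import pandas as pd

def mean_propagating_nan(gb):
    # Per-group mean; NaN wherever the group has at least one NaN in that column.
    cnt = gb.count()                                  # non-NaN values per group and column
    mask = gb.size().values[:, None] == cnt.values    # True where the group is NaN-free
    return (gb.sum() / cnt).where(mask)

df = pd.DataFrame({'a': [1, 2, 1, 2],
                   'b': [1, np.nan, 2, 3],
                   'c': [1, np.nan, 2, np.nan],
                   'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j})
print(mean_propagating_nan(df.groupby('a')))
# b and c: 1.5 for group 1, NaN for group 2; d: NaN for both groups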

python string formatting variables' names from a list

Hi everyone, I'm trying to define a set of variables and I want to format their names.
The set up is:
features=['Gender','Age','Rank'] + other11columns #selected columns of my data
In [1]:data['Gender'].unique()
Out[1]: array([0, 1], dtype=int64)
In [2]:data['Age'].unique()
Out[2]: array([10, 20, 30, 40, 50], dtype=int64)
In [3]:data['Rank'].unique()
Out[3]: array([0, 1, 2, 3, 4, 5, 6], dtype=int64)
.....
First I want to set up some empty data frames, one per tag. I want something like this:
report_Gender
Out[3]:
Prediction Actual
0 NaN NaN
1 NaN NaN
report_Age
Out[5]:
Prediction Actual
10 NaN NaN
20 NaN NaN
30 NaN NaN
40 NaN NaN
50 NaN NaN
report_Rank
Out[6]:
Prediction Actual
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
.......
The following code doesn't work, but it indicates what I want to do:
for i in range(len(features)):
    report_features[i] = pd.DataFrame(index=data[features[i]].unique(), columns=['Prediction', 'Actual'])
I tried to play with string formatting using the %s operator, but I didn't figure out how to put in the variables' names... any help is appreciated :)
Dynamically creating global variables can get hairy. It is much easier if you put them in a smaller scope, i.e. inside some container object such as a dictionary. You can achieve what you want like this:
my_dictionary = dict()
for f in features:
    my_dictionary['report_{}'.format(f)] = pd.DataFrame(index=data[f].unique(), columns=['Prediction', 'Actual'])
You can access the df like my_dictionary['report_Gender'] for example.
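Since data and the other eleven columns aren't shown, here is a self-contained sketch of the dictionary approach with a toy data frame (the column values are made up):
import pandas as pd

# Toy stand-in for the question's data (values are made up)
data = pd.DataFrame({
    'Gender': [0, 1, 0, 1],
    'Age': [10, 20, 30, 20],
    'Rank': [0, 1, 2, 1],
})
features = ['Gender', 'Age', 'Rank']

reports = {}
for f in features:
    reports['report_{}'.format(f)] = pd.DataFrame(
        index=data[f].unique(), columns=['Prediction', 'Actual'])

print(reports['report_Age'])   # rows 10, 20, 30 with NaN in both columns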
Another way would be to create a class:
class Reports:
    pass

for f in features:
    setattr(Reports, 'report_{}'.format(f),
            pd.DataFrame(index=data[f].unique(), columns=['Prediction', 'Actual']))
Then access as Reports.report_Gender etc...
You can use the setattr method if you really want to do it, but I'd suggest following Ravi Patel's advice:
for i in range(len(features)):
    setattr(object_or_module_your_variable_belongs_to,
            name_for_your_variable,
            pd.DataFrame(index=data[features[i]].unique(), columns=['Prediction', 'Actual']))

Update DataFrame using a slice of a MultiIndex (many rows) with Pandas

I have a MultiIndex of dimensions I, M and would like, for one i ∈ I, to update all M rows at the same time.
Here is my data frame:
>>> result.head(n=10)
Out[9]:
FINLWT21
i INCAGG
0 1 NaN
7 NaN
9 NaN
5 NaN
3 NaN
1 1 NaN
7 NaN
9 NaN
5 NaN
3 NaN
Here is what I would like to fill in:
sample.groupby(field).sum()
FINLWT21
INCAGG
1 8800809.719
3 9951002.611
5 9747721.721
7 7683066.990
9 11091861.692
I thought the right command would be result.loc[i] = sample.groupby(field).sum(). However, here are the contents of result afterwards:
>>> result.loc[i]
Out[11]:
FINLWT21
INCAGG
1 NaN
7 NaN
9 NaN
5 NaN
3 NaN
How can I update all the "inner index" at the same time?
You want to use pd.IndexSlice. It returns an object that can be used for slicing with loc.
Solution
result.sort_index(inplace=True)
slc = pd.IndexSlice[i, :]
result.loc[slc, :] = sample.groupby(field).sum()
Explanation
result.sort_index(inplace=True) -> pd.IndexSlice requires the index to be sorted.
slc = pd.IndexSlice[i, :] -> syntax for a generic slicer that takes the ith group of the 1st level of a pd.MultiIndex with 2 levels.
result.loc[slc, :] = ... -> use the slice.
Demonstration
import pandas as pd
import numpy as np

result = pd.DataFrame([], columns=['FINLWT21'],
                      index=pd.MultiIndex.from_product([[0, 1], [1, 7, 9, 5, 3]]))

result.sort_index(inplace=True)
slc = pd.IndexSlice[0, :]
result.loc[slc, :] = [1, 2, 3, 4, 5]

print(result)
FINLWT21
0 1 1
3 2
5 3
7 4
9 5
1 1 NaN
3 NaN
5 NaN
7 NaN
9 NaN
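Applying the same pattern to the layout from the question (a self-contained sketch; the labels and values are stand-ins, and reindex/.values is one way of sidestepping the index alignment between the single-level groupby result and the two-level target):
import numpy as np
import pandas as pd

field = 'INCAGG'
result = pd.DataFrame(np.nan, columns=['FINLWT21'],
                      index=pd.MultiIndex.from_product([[0, 1], [1, 3, 5, 7, 9]],
                                                       names=['i', field]))
sample = pd.DataFrame({field: [1, 3, 5, 7, 9, 1],
                       'FINLWT21': [10., 20., 30., 40., 50., 5.]})

result.sort_index(inplace=True)
agg = sample.groupby(field).sum()                         # indexed by INCAGG
i = 0
result.loc[pd.IndexSlice[i, :], 'FINLWT21'] = (
    agg['FINLWT21'].reindex(result.loc[i].index).values)  # align labels, assign by position
print(result)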
This is a function I have which might be what you are looking for:
import numpy as np
import pandas as pd
from warnings import warn

def _assign_multi_index(dest, k, v, inplace=True, bool_nan=False):
    """
    Assigns v to dest[k] in place, doing a "sensible" multi-index alignment and raising
    a ValueError if no alignment is achieved.
    I'm not sure if there's a better way to do this, or a reason not to do it
    the way it's currently written.
    """
    if not inplace:
        raise NotImplementedError()
    if k in dest:
        warn("key '{}' already exists, continue with caution!".format(k))
    v_names = v.index.names
    dest_names = dest.index.names
    if all(n in dest_names for n in v_names):
        if len(v_names) < len(dest_names):
            # if need be, drop some index levels temporarily in dest
            dropped_names = [n for n in dest_names if n not in v_names]
            dest.reset_index(dropped_names, inplace=True)
        v.index = v.index.reorder_levels([n for n in dest_names if n in v_names])  # just to be safe
    else:
        raise ValueError("index levels do not match dest.")
    dest[k] = v
    # restore the original index levels if need be
    if dest.index.names != dest_names:
        dest.reset_index(inplace=True)
        dest.set_index(dest_names, inplace=True)
    if bool_nan is not np.nan and v.dtype.name == 'bool' and dest[k].dtype.name != 'bool':
        # this happens when NaNs had to be inserted; convert them to bool_nan
        dest_k = dest[k].copy()
        dest_k[pd.isnull(dest_k)] = bool_nan
        dest[k] = dest_k.astype(bool)
It turns out that probably the best way is to add an index to the right-hand data set. The following works as expected:
data = sample.groupby(field).sum()
data['index'] = i
result.loc[i] = data.reset_index().set_index(['index', field])
