pandas - checking a condition for each group in a dataframe

I have a dataframe:
import pandas as pd
df = pd.DataFrame({'index' : range(8),
                   'variable1' : ["A","A","B","B","A","B","B","A"],
                   'variable2' : ["a","b","a","b","a","b","a","b"],
                   'variable3' : ["x","x","x","y","y","y","x","y"],
                   'result': [1,0,0,1,1,0,0,1]})
# pivot_table takes index= and columns= (formerly rows= and cols=)
df2 = df.pivot_table(values='result', index='index', columns=['variable1','variable2','variable3'])
# Set two extra cells by label instead of chained indexing
df2.loc[4, ('A','a','x')] = 1
df2.loc[3, ('B','a','x')] = 1
variable1    A                   B
variable2    a         b         a    b
variable3    x    y    x    y    x    y
index
0            1  NaN  NaN  NaN  NaN  NaN
1          NaN  NaN    0  NaN  NaN  NaN
2          NaN  NaN  NaN  NaN    0  NaN
3          NaN  NaN  NaN  NaN    1    1
4            1    1  NaN  NaN  NaN  NaN
5          NaN  NaN  NaN  NaN  NaN    0
6          NaN  NaN  NaN  NaN    0  NaN
7          NaN  NaN  NaN    1  NaN  NaN
Now I want to check for simultaneous occurrences of x == 1 and y == 1, but only within each subgroup, defined by variable1 and variable2. So, for the dataframe shown above, the condition is met for index == 4 (group A-a), but not for index == 3 (groups B-a and B-b).
I suppose some groupby() magic would be needed, but I cannot find the right way. I have also tried experimenting with a stacked dataframe (using df.stack()), but this did not get me any closer...

You can use groupby on the first two column levels, variable1 and variable2, to get the sum of the x and y columns within each group:
r = df2.groupby(level=[0,1], axis=1).sum()
r
Out[50]:
variable1    A         B
variable2    a    b    a    b
index
0            1  NaN  NaN  NaN
1          NaN    0  NaN  NaN
2          NaN  NaN    0  NaN
3          NaN  NaN    1    1
4            2  NaN  NaN  NaN
5          NaN  NaN  NaN    0
6          NaN  NaN    0  NaN
7          NaN    1  NaN  NaN
Consequently, the rows you are looking for are the ones that contain the value 2 (both x and y equal 1 in the same group):
r[r==2].dropna(how='all')
Out[53]:
variable1    A         B
variable2    a    b    a    b
index
4            2  NaN  NaN  NaN
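On recent pandas versions, groupby(..., axis=1) is deprecated and sum() returns 0 rather than NaN for an all-NaN group, so the output above may not reproduce exactly. Here is a minimal sketch of the same idea, assuming a recent pandas and the sample data from the question (the names r and hits are only illustrative). It transposes, groups on the first two column levels, and passes min_count=1 so that empty groups stay NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'index': range(8),
    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
    'result': [1, 0, 0, 1, 1, 0, 0, 1],
})
df2 = df.pivot_table(values='result', index='index',
                     columns=['variable1', 'variable2', 'variable3'])
df2.loc[4, ('A', 'a', 'x')] = 1
df2.loc[3, ('B', 'a', 'x')] = 1

# Group the columns by (variable1, variable2): transpose so the column
# levels become row levels, sum within each group, then transpose back.
# min_count=1 keeps a group NaN when it has no non-NaN value in that row.
r = df2.T.groupby(level=['variable1', 'variable2']).sum(min_count=1).T

# With 0/1 data, a sum of 2 means x == 1 and y == 1 in the same group.
hits = r[r == 2].dropna(how='all')
print(hits)
```

Note that the sum-equals-2 test relies on the values being only 0 or 1; with other values an explicit comparison per column would be safer.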

Related

Set Column Value Based on Calculate Condition from Each Row

I have an empty dataframe:
columns_name = list(str(i) for i in range(10))
dfa = pd.DataFrame(columns=columns_name, index=['A', 'B', 'C', 'D'])
dfa['Count'] = [10, 6, 9, 4]
     0    1    2    3    4    5    6    7    8    9  Count
A  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN     10
B  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN      6
C  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN      9
D  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN      4
I want to replace the trailing NaN values in each row with a symbol, where the number of cells replaced is max(Count) minus that row's Count.
So the final result should look like this:
     0    1    2    3    4    5    6    7    8    9  Count
A  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN     10
B  NaN  NaN  NaN  NaN  NaN  NaN    -    -    -    -      6
C  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN    -      9
D  NaN  NaN  NaN  NaN    -    -    -    -    -    -      4
I am stuck at
dfa.at[dfa.index, [str(col) for col in list(range(dfa['Count'].max() - dfa['Count']))]] = '-'
and getting KeyError: 'Count'

The problem is in this part of your code: dfa.at[dfa.index, [str(col) for col in list(range(dfa['Count'].max() - dfa['Count']))]] = '-'
Try building just the list used inside the comprehension:
list(range(dfa['Count'].max() - dfa['Count']))
It raises a TypeError.
The expression (dfa['Count'].max() - dfa['Count']) evaluates to the following Series:
A 0
B 4
C 1
D 6
And since you're trying to pass a series to python's range function, it will throw the error.
One possible solution might be:
for index, cols in zip(dfa.index, [list(map(str, col)) for col in (dfa).apply(lambda x: list(range(x['Count'], dfa['Count'].max())), axis=1).values]):
    dfa.loc[index, cols] = '-'
OUTPUT:
Out[315]:
0 1 2 3 4 5 6 7 8 9 Count
A NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10
B NaN NaN NaN NaN NaN NaN - - - - 6
C NaN NaN NaN NaN NaN NaN NaN NaN NaN - 9
D NaN NaN NaN NaN - - - - - - 4

Broadcasting is also an option:
import pandas as pd
import numpy as np
columns_name = list(str(i) for i in range(10))
dfa = pd.DataFrame(columns=columns_name, index=['A', 'B', 'C', 'D'])
dfa['Count'] = [10, 6, 9, 4]
# Broadcast based on column index (Excluding Count)
m = (
    dfa['Count'].to_numpy()[:, None] == np.arange(0, dfa.shape[1] - 1)
).cumsum(axis=1).astype(bool)
# Grab Columns To Update
non_count_columns = dfa.columns[dfa.columns != 'Count']
# Update based on mask
dfa[non_count_columns] = dfa[non_count_columns].mask(m, '-')
print(dfa)
Output:
0 1 2 3 4 5 6 7 8 9 Count
A NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10
B NaN NaN NaN NaN NaN NaN - - - - 6
C NaN NaN NaN NaN NaN NaN NaN NaN NaN - 9
D NaN NaN NaN NaN - - - - - - 4
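The same mask can also be built with a single comparison, which may be easier to read. This is a sketch under the question's setup, not the answerer's code. The names cols, counts and m are illustrative, and it assumes the non-Count columns are the strings '0' to '9' in positional order.

```python
import numpy as np
import pandas as pd

cols = [str(i) for i in range(10)]
dfa = pd.DataFrame(columns=cols, index=['A', 'B', 'C', 'D'])
dfa['Count'] = [10, 6, 9, 4]

counts = dfa['Count'].to_numpy()

# Broadcasting: a (1, 10) row of column positions against a (4, 1) column
# of counts gives a (4, 10) boolean mask. Position j of a row is True
# when j >= that row's Count.
m = np.arange(len(cols)) >= counts[:, None]

# Replace the masked cells with '-' and leave the Count column untouched.
dfa[cols] = dfa[cols].mask(m, '-')
print(dfa)
```

The comparison replaces the equality test followed by cumsum; both produce True from the position equal to Count onward.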

Pandas set all values after first NaN to NaN

For each row I would like to set all values to NaN after the appearance of the first NaN. E.g.:
a b c
1 2 3 4
2 nan 2 nan
3 3 nan 23
Should become this:
a b c
1 2 3 4
2 nan nan nan
3 3 nan nan
So far I only know how to do this with an apply with a for loop over each column per row - it's very slow!

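Use cumprod on the not-NaN mask: the running product along each row stays 1 until the first NaN and is 0 from there on.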
df=df.where(df.notna().cumprod(axis=1).eq(1))
a b c
1 2.0 3.0 4.0
2 NaN NaN NaN
3 3.0 NaN NaN
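To make the intermediate mask visible, here is a small sketch using the three-row example from the question (the names keep and out are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [2, np.nan, 3], 'b': [3, 2, np.nan], 'c': [4, np.nan, 23]},
    index=[1, 2, 3],
)

# notna() is True where a value is present. The cumulative product along
# each row equals 1 only while every value so far has been present.
keep = df.notna().cumprod(axis=1).eq(1)

# where() keeps values where the mask is True and writes NaN elsewhere.
out = df.where(keep)
print(keep)
print(out)
```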

How do you shift each row in pandas data frame by a specific value?

If I have a pandas dataframe like this:
2 3 4 NaN NaN NaN
1 NaN NaN NaN NaN NaN
5 6 7 2 3 NaN
4 3 NaN NaN NaN NaN
and an array for the number I would like to shift:
array = [2, 4, 0, 3]
How do I iterate through each row to shift the columns by the number in my array to get something like this:
NaN NaN 2 3 4 NaN
NaN NaN NaN NaN 1 NaN
5 6 7 2 3 NaN
NaN NaN NaN 4 3 NaN
I was trying to do something like this but had no luck.
df = pd.DataFrame(values)
for rows in df.iterrows():
    df[rows] = df.shift[change_in_bins[rows]]

Use a for loop with loc and shift:
for index,value in enumerate([2, 4, 0, 3]):
    df.loc[index,:] = df.loc[index,:].shift(value)
print(df)
0 1 2 3 4 5
0 NaN NaN 2.0 3.0 4.0 NaN
1 NaN NaN NaN NaN 1.0 NaN
2 5.0 6.0 7.0 2.0 3.0 NaN
3 NaN NaN NaN 4.0 3.0 NaN
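A self-contained version of the loop, as a sketch that rebuilds the question's frame (the name shifts is illustrative). It pairs each row label with its shift amount via zip, so it does not depend on the index being 0..n-1:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    [2, 3, 4, nan, nan, nan],
    [1, nan, nan, nan, nan, nan],
    [5, 6, 7, 2, 3, nan],
    [4, 3, nan, nan, nan, nan],
])
shifts = [2, 4, 0, 3]

for label, n in zip(df.index, shifts):
    # Each row is a Series indexed by the column labels. shift(n) moves
    # its values n positions to the right and fills the gap with NaN.
    df.loc[label] = df.loc[label].shift(n)

print(df)
```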

Combine multi columns to one column Pandas

Hi I have the following dataframe
z a b c
a 1 NaN NaN
ss NaN 2 NaN
cc 3 NaN NaN
aa NaN 4 NaN
ww NaN 5 NaN
ss NaN NaN 6
aa NaN NaN 7
g NaN NaN 8
j 9 NaN NaN
I would like to create a new column d to do something like this
z a b c d
a 1 NaN NaN 1
ss NaN 2 NaN 2
cc 3 NaN NaN 3
aa NaN 4 NaN 4
ww NaN 5 NaN 5
ss NaN NaN 6 6
aa NaN NaN 7 7
g NaN NaN 8 8
j 9 NaN NaN 9
The numbers are not actually integers; they are np.float64. The integers are only there to keep the example clear. You may assume the numbers look like 32065431243556.62 or 763835218962767.8. Thank you for your help.

We can replace the NA by 0 and sum up the rows.
df['d'] = df[['a', 'b', 'c']].fillna(0).sum(axis=1)

In fact, fillna is not necessary: sum skips NaN values by default, so they contribute nothing to the total.
I'm a Python newcomer as well, and I suggest reading the pandas cookbook first.
The code is:
df['Total']=df[['a','b','c']].sum(axis=1).astype(int)
Drop .astype(int) if the values are floats, since the cast truncates the fractional part.

You can use pd.DataFrame.ffill along axis=1 and take the last column:
df['D'] = df.ffill(axis=1).iloc[:, -1].astype(int)
print(df)
a b c D
0 1.0 NaN NaN 1
1 NaN 2.0 NaN 2
2 3.0 NaN NaN 3
3 NaN 4.0 NaN 4
4 NaN 5.0 NaN 5
5 NaN NaN 6.0 6
6 NaN NaN 7.0 7
7 NaN NaN 8.0 8
8 9.0 NaN NaN 9
Of course, if you have float values, int conversion is not required.

If there is only one non-NaN value per row, as in the example, you can drop the NaN values in each row and assign the remaining value to column d:
df['d'] = df[['a', 'b', 'c']].apply(lambda row: row.dropna().iloc[0], axis=1)
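For comparison, here is a sketch that runs two of the vectorised approaches on the question's data, keeping the values as floats. The names value_cols, by_sum and by_fill are illustrative, and bfill(axis=1) is used here as a mirror image of the ffill answer rather than something taken from the answers.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'z': ['a', 'ss', 'cc', 'aa', 'ww', 'ss', 'aa', 'g', 'j'],
    'a': [1, nan, 3, nan, nan, nan, nan, nan, 9],
    'b': [nan, 2, nan, 4, 5, nan, nan, nan, nan],
    'c': [nan, nan, nan, nan, nan, 6, 7, 8, nan],
})
value_cols = ['a', 'b', 'c']

# Row-wise sum: NaN is skipped, so the single value in each row survives.
by_sum = df[value_cols].sum(axis=1)

# Back-fill along each row, then take the first column: this picks the
# first non-NaN value and also works if a row holds several values.
by_fill = df[value_cols].bfill(axis=1).iloc[:, 0]

df['d'] = by_fill
print(df)
```

Restricting to value_cols keeps the string column z out of the computation. The sum variant gives the intended result only when each row has exactly one non-NaN value.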

pandas DataFrame filter by rows and columns

I have a python pandas DataFrame that looks like this:
A B C ... ZZ
2008-01-01 00 NaN NaN NaN ... 1
2008-01-02 00 NaN NaN NaN ... NaN
2008-01-03 00 NaN NaN 1 ... NaN
... ... ... ... ... ...
2012-12-31 00 NaN 1 NaN ... NaN
and I can't figure out how to get a subset of the DataFrame where there is one or more '1' in it, so that the final df should be something like this:
B C ... ZZ
2008-01-01 00 NaN NaN ... 1
2008-01-03 00 NaN 1 ... NaN
... ... ... ... ...
2012-12-31 00 1 NaN ... NaN
That is, remove all rows and columns that do not contain a 1.
I tried this, which seems to remove the rows with no 1:
df_filtered = df[df.sum(1)>0]
And then tried to remove the columns with:
df_filtered = df_filtered[df.sum(0)>0]
but I get this error after the second line:
IndexingError('Unalignable boolean Series key provided')

Do it with loc:
In [90]: df
Out[90]:
0 1 2 3 4 5
0 1 NaN NaN 1 1 NaN
1 NaN NaN NaN NaN NaN NaN
2 1 1 NaN NaN 1 NaN
3 1 NaN 1 1 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [91]: df.loc[df.sum(1) > 0, df.sum(0) > 0]
Out[91]:
0 1 2 3 4
0 1 NaN NaN 1 1
2 1 1 NaN NaN 1
3 1 NaN 1 1 NaN
Here's why you get that error:
Let's say I have the following frame, df, (similar to yours):
In [112]: df
Out[112]:
a b c d e
0 0 1 1 NaN 1
1 NaN NaN NaN NaN NaN
2 0 0 0 NaN 0
3 0 0 1 NaN 1
4 1 1 1 NaN 1
5 0 0 0 NaN 0
6 1 0 1 NaN 0
When I sum down each column (the default, axis=0) and threshold at 0, I get:
In [113]: row_sum = df.sum()
In [114]: row_sum > 0
Out[114]:
a True
b True
c True
d False
e True
dtype: bool
Since the index of row_sum is the columns of df, it doesn't make sense in this case to try to use the values of row_sum > 0 to fancy-index into the rows of df, since their row indices are not aligned and they cannot be aligned.
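The alignment point can be checked directly. In this sketch, which uses the small frame from the first example, row_mask and col_mask are illustrative names: one mask is indexed by the row labels and the other by the column labels, and loc takes one of each.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    [1, nan, nan, 1, 1, nan],
    [nan] * 6,
    [1, 1, nan, nan, 1, nan],
    [1, nan, 1, 1, nan, nan],
    [nan] * 6,
])

row_mask = df.sum(axis=1) > 0   # one entry per row, indexed like df.index
col_mask = df.sum(axis=0) > 0   # one entry per column, indexed like df.columns

# loc accepts a row selector and a column selector in one call, so each
# mask is applied along the axis it is aligned with.
filtered = df.loc[row_mask, col_mask]
print(filtered)
```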

Alternatively, to remove rows or columns that are entirely NaN, you can use .any() too.
In [1680]: df
Out[1680]:
0 1 2 3 4 5
0 1.0 NaN NaN 1.0 1.0 NaN
1 NaN NaN NaN NaN NaN NaN
2 1.0 1.0 NaN NaN 1.0 NaN
3 1.0 NaN 1.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [1681]: df.loc[df.any(axis=1), df.any(axis=0)]
Out[1681]:
0 1 2 3 4
0 1.0 NaN NaN 1.0 1.0
2 1.0 1.0 NaN NaN 1.0
3 1.0 NaN 1.0 1.0 NaN
