For the following dataframe:
Name  P1  P2  P3
A      1   0   2
B      1   1   1
C      0   5   6
I want to get 'Yes' in the rows where P1, P2 and P3 are all greater than 0.
Currently I am using either of the following methods:
Method1:
df['Check']= np.where((df['P1'] > 0) & (df['P2'] > 0) & (df['P3'] > 0),'Yes','No')
Method2:
df.loc[(df['P1'] > 0) & (df['P2'] > 0) & (df['P3'] > 0), 'Check'] = "Yes"
I have a large dataset with a lot of columns where the conditions are to be applied.
Is there a shorter alternative to the multiple & conditions, wherein I won't have to write the conditions for each and every variable and instead use a combined index range for the multiple columns?
I think you need DataFrame.all to check whether all values are True per row:
cols = ['P1','P2','P3']
df['Check']= np.where((df[cols] > 0).all(axis=1),'Yes','No')
print (df)
Name P1 P2 P3 Check
0 A 1 0 2 No
1 B 1 1 1 Yes
2 C 0 5 6 No
print ((df[cols] > 0))
P1 P2 P3
0 True False True
1 True True True
2 False True True
print ((df[cols] > 0).all(axis=1))
0 False
1 True
2 False
dtype: bool
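Since the question asks about a column range rather than a list of names, here is a minimal self-contained sketch that selects the columns with a label slice. It assumes P1 to P3 sit next to each other in the frame; the sample data is the one from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "P1": [1, 1, 0],
    "P2": [0, 1, 5],
    "P3": [2, 1, 6],
})

# Label slices with .loc include both end points, so this picks P1, P2 and P3.
mask = (df.loc[:, "P1":"P3"] > 0).all(axis=1)
df["Check"] = np.where(mask, "Yes", "No")
print(df)
```

If the columns are not contiguous, selecting them by a name pattern, for example with df.filter(regex=r'^P\d+$'), may be a workable alternative.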
Say I have a dataframe:
Do  Re  Mi     Fa   So
1   0   Foo    100  50
1   1   Bar    75   20
0   0   True   59   59
1   1   False  0    12
How would I go about finding all rows where the value in columns "Do" and "Re" BOTH equal 1 AND "Fa" is higher than "So"?
I've tried a couple of ways, but the first returns an error complaining of ambiguity:
df['both_1']=((df['Do'] > df['Re']) & (df['Mi'] == df['Fa'] == 1))
I also tried breaking it down into steps, but I realised the last step will result in me bringing in both True and False statements. I only want True.
df['Do_1'] = df['Do'] == 1
df['Re_1'] = df['Re'] == 1
# This is where I realised I'm bringing in the False rows too
df['both_1'] = (df['Do_1'] == df['Re_1'])
Chain the masks with & for bitwise AND:
df['both_1']= (df['Fa'] > df['So']) & (df['Do'] == 1) & (df['Re'] == 1)
print (df)
Do Re Mi Fa So both_1
0 1 0 Foo 100 50 False
1 1 1 Bar 75 20 True
2 0 0 True 59 59 False
3 1 1 False 0 12 False
Or, if there may be multiple columns in a list, compare the subset df[['Do', 'Re']] and test whether all values are True per row with DataFrame.all:
df['both_1']= (df['Fa'] > df['So']) & (df[['Do', 'Re']] == 1).all(axis=1)
If you need to filter rows, use boolean indexing:
df1 = df[(df['Fa'] > df['So']) & (df['Do'] == 1) & (df['Re'] == 1)]
For the second solution:
df1 = df[(df['Fa'] > df['So']) & (df[['Do', 'Re']] == 1).all(axis=1)]
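As for the ambiguity error in the question: Python expands a chained comparison such as x == y == 1 into (x == y) and (y == 1), and `and` needs a single truth value for the whole Series, which pandas refuses to give. A small sketch reproducing this on the sample data (the variable names `raised` and `both_1` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Do": [1, 1, 0, 1],
    "Re": [0, 1, 0, 1],
    "Mi": ["Foo", "Bar", True, False],
    "Fa": [100, 75, 59, 0],
    "So": [50, 20, 59, 12],
})

try:
    # Python reads this as (df["Do"] == df["Re"]) and (df["Re"] == 1).
    df["Do"] == df["Re"] == 1
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Writing each comparison separately and joining them with & avoids the problem.
both_1 = (df["Do"] == 1) & (df["Re"] == 1) & (df["Fa"] > df["So"])
print(df[both_1])
```

The parentheses around each comparison matter, because & binds more tightly than == and >.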
You can combine the three conditions using & (bitwise and):
df[(df['Do'] == 1) & (df['Re'] == 1) & (df['Fa'] > df['So'])]
Output (for your sample data):
Do Re Mi Fa So
1 1 1 Bar 75 20
Try:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Do':[1,1,0,1],'Re':[0,1,0,1],'Mi':['Foo','Bar',True,False],'Fa':[100,75,59,0],'Sol':[50,20,59,12]})
df.loc[np.where((df.Do == 1) & (df.Re ==1) & (df.Fa > df.Sol))[0]]
Output:
Do Re Mi Fa Sol
1 1 1 Bar 75 20
If you do not want to import anything else, just do this:
df.assign(keep=[1 if (df.loc[i,'Do'] == 1 and df.loc[i,'Re'] == 1 and df.loc[i,'Fa'] > df.loc[i,'Sol']) else 0 for i in df.index]).query("keep == 1").drop('keep',axis=1)
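DataFrame.query is another option that needs no extra imports and avoids the row-by-row loop. A minimal sketch, assuming the column names from the question (So rather than Sol):

```python
import pandas as pd

df = pd.DataFrame({
    "Do": [1, 1, 0, 1],
    "Re": [0, 1, 0, 1],
    "Mi": ["Foo", "Bar", True, False],
    "Fa": [100, 75, 59, 0],
    "So": [50, 20, 59, 12],
})

# Column names can be used directly inside the query string.
result = df.query("Do == 1 and Re == 1 and Fa > So")
print(result)
```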
Let df be a dataframe of boolean values with a two-level index. I want to calculate, for every id, the running count of consecutive True values. For example, this is how it would look in this specific case.
value consecutive
id Week
1 1 True 1
1 2 True 2
1 3 False 0
1 4 True 1
1 5 True 2
2 1 False 0
2 2 False 0
2 3 True 1
This is my solution:
def func(id,week):
    M = df.loc[id]
    M= df.loc[id][:week+1]
    consecutive_list = list()
    S=0
    for index,row in M.iterrows():
        if row['value']:
            S+=1
        else:
            S=0
        consecutive_list.append(S)
    return consecutive_list[-1]
Then we generate the column "consecutive" from a list in the following way:
Consecutive_list = list()
for k in df.index:
    id = k[0]
    week=k[1]
    Consecutive_list.append(func(id,week))
df['consecutive'] = Consecutive_list
I would like to know if there is a more Pythonic way to do this.
EDIT: I wrote the "consecutive" column in order to show what I expect this to be.
If you are trying to add the consecutive column to the df, this should work:
df.assign(consecutive = df['value'].groupby(df['value'].diff().ne(0).cumsum()).cumsum())
Output:
value consecutive
1 a True 1
b True 2
2 a False 0
b True 1
3 a True 2
b False 0
4 a False 0
b True 1
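Note that the expression above groups only on runs of equal values, so a streak of True can carry over from one id to the next (as in the output above, where the count continues from the second id into the third). A sketch that restarts the count for every id, assuming the index levels are named id and Week as in the question:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3)],
    names=["id", "Week"],
)
df = pd.DataFrame(
    {"value": [True, True, False, True, True, False, False, True]},
    index=index,
)

value = df["value"]
ids = df.index.get_level_values("id").to_numpy()
# Every False row opens a new block, so a running count of False rows labels the blocks.
blocks = (~value).cumsum().to_numpy()

# Within each (id, block) pair, a running sum of the True values gives the streak length.
df["consecutive"] = value.astype(int).groupby([ids, blocks]).cumsum()
print(df)
```

Grouping on the id as well as the block label is what keeps a streak from crossing an id boundary.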
I have this data frame of many records with columns, let's say, from 'a' to 'z' in pandas.
I need to build a table of 4 cells with the numbers of records that have 'a' = 0 and 'z'= 0, 'a' = 0 and 'z' != 0, 'a' != 0 and 'z' = 0, 'a' != 0 and 'z' != 0.
What is the best Pythony-pandas way to do it? I can compute the four sums doing indexing and summing, but I'm sure there should be a built-in elegant way of doing it.
What is that way?
You can group by the values of whether each column is 0 and get the size of these groups:
>>> df.groupby([df['a'] == 0, df['z'] == 0]).size()
a z
False False 19
True 2
True False 4
dtype: int64
Alternately you can also create a 2-column dataframe with these series and use value_counts which yields the same result:
>>> pd.concat([df['a'] == 0, df['z'] == 0], axis='columns').value_counts()
a z
False False 19
True False 4
False True 2
dtype: int64
Calling this result counts we can then simply unstack if you want a 2 by 2 table:
>>> counts.unstack('z', fill_value=0)
z False True
a
False 19 2
True 4 0
In the first level of counts’ index, or the index in the 2x2 table, True means a == 0 and False means a != 0. In the second level or columns, True means z == 0. In this sample data no rows have a == 0 and z == 0 at the same time.
If you want to generically rename them you can do something like this:
>>> for level, name in enumerate(counts.index.names):
...     counts.rename({True: f'{name} == 0', False: f'{name} != 0'}, level=level, inplace=True)
>>> counts
a z
a != 0 z != 0 19
a == 0 z != 0 4
a != 0 z == 0 2
dtype: int64
>>> counts.unstack('z', fill_value=0)
z z != 0 z == 0
a
a != 0 19 2
a == 0 4 0
Alternatively, you can flatten the index to a single level, which also works generically with any number of variables (here a plain name such as a reads as a != 0 and ~a reads as a == 0):
>>> counts.index = counts.index.map(lambda tup: ' & '.join(f'~{var}' if truth else var for truth, var in zip(tup, counts.index.names)))
>>> counts
a & z 19
~a & z 4
a & ~z 2
a = df['a'] == 0
z = df['z'] == 0
count = pd.Series([sum(a & z), sum(a & ~z), sum(~a & z), sum(~a & ~z)],
index=['a & z', 'a & ~z', '~a & z', '~a & ~z'])
>>> count
a & z 3
a & ~z 2
~a & z 2
~a & ~z 13
dtype: int64
Use pd.crosstab(), as follows:
a = df['a'] == 0
z = df['z'] == 0
count_tab = pd.crosstab(a, z, rownames=["'a' = 0 ?"], colnames=["'z' = 0 ?"])
Result:
print(count_tab)
'z' = 0 ? False True
'a' = 0 ?
False 13 2
True 2 3
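Both value_counts and crosstab only report values that actually occur, so a row or column of the table can go missing. A sketch on made-up data (the values below are hypothetical, not the asker's) that forces a full 2 by 2 table:

```python
import pandas as pd

# Hypothetical data: no row has a == 0, so the "True" row would be absent.
df = pd.DataFrame({"a": [1, 2, 3, 4], "z": [0, 5, 0, 7]})

table = pd.crosstab(df["a"] == 0, df["z"] == 0)

# Reindex both axes so that all four cells are present, with 0 for the empty ones.
full = table.reindex(index=[False, True], columns=[False, True], fill_value=0)
print(full)
```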
I have created a new column by comparing two boolean columns. If both are positive, I assign a 1, otherwise a 0. This is my code below, but is there a way to be more pythonic? I tried list comprehension but failed.
lst = []
for i,k in zip(df['new_customer'],df['y']):
    if i == 1 and k == 1:
        lst.append(1)
    else:
        lst.append(0)
df['new_customer_subscription'] = lst
Use np.sign:
m = np.sign(df[['new_customer', 'y']]) >= 0
df['new_customer_subscription'] = m.all(axis=1).astype(int)
If you want to consider only positive non-zero values, change >= 0 to > 0 (since np.sign(0) is 0).
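For 0/1 flag columns like the ones in the question, the difference between the two comparisons matters. A small sketch with made-up flag values (the names `with_zero` and `strictly_pos` are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"new_customer": [1, 0, 1, 0], "y": [1, 1, 0, 0]})

signs = np.sign(df[["new_customer", "y"]])

# With >= 0, zeros also pass the test, so every row is flagged.
with_zero = (signs >= 0).all(axis=1).astype(int)

# With > 0, only rows where both flags are set are flagged.
strictly_pos = (signs > 0).all(axis=1).astype(int)

print(with_zero.tolist(), strictly_pos.tolist())
```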
# Sample DataFrame.
df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
df
A B
0 0.511684 -0.512633
1 -1.254813 -1.721734
2 0.751830 0.285449
3 -0.934877 1.407998
4 -1.686066 -0.947015
# Get the sign of the numbers.
m = np.sign(df[['A', 'B']]) >= 0
m
A B
0 True False
1 False False
2 True True
3 False True
4 False False
# Find all rows where both columns are `True`.
m.all(axis=1).astype(int)
0 0
1 0
2 1
3 0
4 0
dtype: int64
Another solution if you have to deal with only two columns would be:
df['new_customer_subscription'] = (
df['new_customer'].gt(0) & df['y'].gt(0)).astype(int)
To generalise to multiple columns, use logical_and.reduce:
df['new_customer_subscription'] = np.logical_and.reduce(
df[['new_customer', 'y']] > 0, axis=1).astype(int)
Or,
df['new_customer_subscription'] = (df[['new_customer', 'y']] > 0).all(axis=1).astype(int)
Another way to do this is to use np.where from the NumPy module:
df['Indicator'] = np.where((df.A > 0) & (df.B > 0), 1, 0)
Output
A B Indicator
0 -0.464992 0.418243 0
1 -0.902320 0.496530 0
2 0.219111 1.052536 1
3 -1.377076 0.207964 0
4 1.051078 2.041550 1
The np.where method works like this:
np.where(condition, true value, false value)
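A short runnable illustration of that call, using made-up numbers rather than the random sample above. The Label column is an extra added here only to show that the two values need not be numeric:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [-0.5, 0.2, 1.0], "B": [0.4, 1.1, -2.0]})

# Element-wise condition: True only where both columns are positive.
condition = (df["A"] > 0) & (df["B"] > 0)

# np.where picks the second argument where the condition holds, else the third.
df["Indicator"] = np.where(condition, 1, 0)
df["Label"] = np.where(condition, "both positive", "not both")
print(df)
```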
Given this DataFrame:
df = pandas.DataFrame({"a": [1,10,20,3,10], "b": [50,60,55,0,0], "c": [1,30,1,0,0]})
What is the best way to make a new column, "filter" that has value "pass" if the values at columns a and b are both greater than x and value "fail" otherwise?
It can be done by iterating through rows but it's inefficient and inelegant:
c = []
for x, v in df.iterrows():
    if v["a"] >= 20 and v["b"] >= 20:
        c.append("pass")
    else:
        c.append("fail")
df["filter"] = c
One way would be to create a column of boolean values like this:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
>>> df
a b c filter
0 1 50 1 False
1 10 60 30 False
2 20 55 1 True
3 3 0 0 False
4 10 0 0 False
You can then change the boolean values to 'pass' or 'fail' using replace:
>>> df['filter'].astype(object).replace({False: 'fail', True: 'pass'})
0 fail
1 fail
2 pass
3 fail
4 fail
You can extend this to more columns using all. For example, to find rows across the columns with entries greater than 0:
>>> cols = ['a', 'b', 'c'] # a list of columns to test
>>> df[cols] > 0
a b c
0 True True True
1 True True True
2 True True True
3 True False False
4 True False False
Using all across axis 1 of this DataFrame creates the new column:
>>> (df[cols] > 0).all(axis=1)
0 True
1 True
2 True
3 False
4 False
dtype: bool
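Putting the pieces together for the original "pass"/"fail" column, one possible sketch combines all with np.where. The threshold of 20 is taken from the loop in the question; the names `threshold` and `cols` are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 10, 20, 3, 10],
    "b": [50, 60, 55, 0, 0],
    "c": [1, 30, 1, 0, 0],
})

threshold = 20
cols = ["a", "b"]  # columns that must all reach the threshold

# True for rows where every listed column is at least the threshold.
passed = (df[cols] >= threshold).all(axis=1)
df["filter"] = np.where(passed, "pass", "fail")
print(df)
```

Adding more columns to the test then only means extending the list.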