How to check whether all columns in a row are positive numbers? - python

I have created a new column by comparing two boolean columns. If both are positive, I assign a 1, otherwise a 0. This is my code below, but is there a way to be more pythonic? I tried list comprehension but failed.
lst = []
for i, k in zip(df['new_customer'], df['y']):
    if i == 1 and k == 1:
        lst.append(1)
    else:
        lst.append(0)
df['new_customer_subscription'] = lst

Use np.sign:
m = np.sign(df[['new_customer', 'y']]) >= 0
df['new_customer_subscription'] = m.all(axis=1).astype(int)
If you want to consider only positive non-zero values, change >= 0 to > 0 (since np.sign(0) is 0).
# Sample DataFrame.
df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
df
A B
0 0.511684 -0.512633
1 -1.254813 -1.721734
2 0.751830 0.285449
3 -0.934877 1.407998
4 -1.686066 -0.947015
# Get the sign of the numbers.
m = np.sign(df[['A', 'B']]) >= 0
m
A B
0 True False
1 False False
2 True True
3 False True
4 False False
# Find all rows where both columns are `True`.
m.all(axis=1).astype(int)
0 0
1 0
2 1
3 0
4 0
dtype: int64
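As a check, here is a minimal self-contained sketch of the strictly positive variant (hand-built data, so the zero row is visible):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.5, -1.3, 0.0], 'B': [2.0, 0.7, 1.1]})

# > 0 excludes exact zeros, since np.sign(0) is 0.
strict = (np.sign(df[['A', 'B']]) > 0).all(axis=1).astype(int)
print(strict.tolist())  # [1, 0, 0] -- the last row fails because A is 0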
Another solution if you have to deal with only two columns would be:
df['new_customer_subscription'] = (
    df['new_customer'].gt(0) & df['y'].gt(0)).astype(int)
To generalise to multiple columns, use logical_and.reduce:
df['new_customer_subscription'] = np.logical_and.reduce(
    df[['new_customer', 'y']] > 0, axis=1).astype(int)
Or,
df['new_customer_subscription'] = (df[['new_customer', 'y']] > 0).all(1).astype(int)

Another way to do this is using np.where from the numpy module:
df['Indicator'] = np.where((df.A > 0) & (df.B > 0), 1, 0)
Output
A B Indicator
0 -0.464992 0.418243 0
1 -0.902320 0.496530 0
2 0.219111 1.052536 1
3 -1.377076 0.207964 0
4 1.051078 2.041550 1
np.where works like this:
np.where(condition, value_if_true, value_if_false)
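For example, a minimal runnable sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-0.5, 1.2], 'B': [0.4, 2.0]})

# Vectorised if/else: 1 where both columns are positive, 0 elsewhere.
df['Indicator'] = np.where((df.A > 0) & (df.B > 0), 1, 0)
print(df['Indicator'].tolist())  # [0, 1]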

Related

Assign a value to the first row of a sub-dataframe

I have a Pandas dataframe that looks like this:
df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]})
gp_id A
0 1 1
1 2 2
2 1 3
3 2 4
I want to assign the value -1 to the first row of the group with the id 2 (gp_id = 2), to get the following output:
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
To do this, I've tried the following code:
df[df.gp_id == 2].A.iloc[0] = -1
But this doesn't do anything as I'm assigning a value in the sub-dataframe df[df.gp_id == 2] and I'm not modifying the original dataframe df.
Is there an easy way to solve this problem?
You could do:
df.loc[(df.gp_id == 2).argmax(), 'A'] = -1
since pd.Series.argmax returns the position of the first occurrence of the maximum (for a boolean mask, the first True).
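A quick sketch to illustrate that behaviour on the sample frame:
import pandas as pd

df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]})

# On a boolean Series the maximum is True, so argmax returns the
# position of the first True value -- the first row with gp_id == 2.
print((df.gp_id == 2).argmax())  # 1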
If you are not sure that the value is present in the dataframe, you could do:
cond = (df.gp_id == 2)
if cond.sum():
    df.loc[cond.argmax(), 'A'] = -1
A general solution that also works when the mask matches no rows: chain a second mask built from the cumulative sum of the first (m.cumsum() == 1 is True only at the first match), combine the two with & (bitwise AND), and set the values with DataFrame.loc:
m = df.gp_id == 2
df.loc[m & (m.cumsum() == 1), 'A'] = -1
This works well when there is no match: nothing is assigned, with no error and no incorrect assignment:
m = df.gp_id == 7
df.loc[m & (m.cumsum() == 1), 'A'] = -1
If the mask is guaranteed to match at least one row, a simpler solution is:
idx = df[df.gp_id == 2].index[0]
df.loc[idx, 'A'] = -1
print(df)
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
If there is no match, this solution raises an IndexError instead of assigning incorrectly.
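A quick sketch contrasting the two behaviours when nothing matches:
import pandas as pd

df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]})

m = df.gp_id == 7
df.loc[m & (m.cumsum() == 1), 'A'] = -1   # selects no rows: df stays unchanged

try:
    idx = df[df.gp_id == 7].index[0]      # empty index -> IndexError
except IndexError:
    print('no row with gp_id == 7')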

Computing a number table for two types of records in pandas

I have this data frame of many records with columns, let's say, from 'a' to 'z' in pandas.
I need to build a table of 4 cells with the numbers of records that have 'a' = 0 and 'z'= 0, 'a' = 0 and 'z' != 0, 'a' != 0 and 'z' = 0, 'a' != 0 and 'z' != 0.
What is the best Pythonic pandas way to do it? I can compute the four sums doing indexing and summing, but I'm sure there should be a built-in elegant way of doing it.
What is that way?
You can group by the values of whether each column is 0 and get the size of these groups:
>>> df.groupby([df['a'] == 0, df['z'] == 0]).size()
a      z
False  False    19
       True      2
True   False     4
dtype: int64
Alternately you can also create a 2-column dataframe with these series and use value_counts which yields the same result:
>>> pd.concat([df['a'] == 0, df['z'] == 0], axis='columns').value_counts()
a      z
False  False    19
True   False     4
False  True      2
dtype: int64
Calling this result counts, we can then simply unstack it if you want a 2-by-2 table:
>>> counts.unstack('z', fill_value=0)
z      False  True
a
False     19      2
True       4      0
In the first level of counts’ index, or the index in the 2x2 table, True means a == 0 and False means a != 0. In the second level or columns, True means z == 0. In this sample data no rows have a == 0 and z == 0 at the same time.
If you want to generically rename them you can do something like this:
>>> for level, name in enumerate(counts.index.names):
...     counts.rename({True: f'{name} == 0', False: f'{name} != 0'}, level=level, inplace=True)
>>> counts
a       z
a != 0  z != 0    19
a == 0  z != 0     4
a != 0  z == 0     2
dtype: int64
>>> counts.unstack('z', fill_value=0)
z       z != 0  z == 0
a
a != 0      19       2
a == 0       4       0
Alternately, to flatten the index to a single level (this also works generically with any number of variables):
>>> counts.index = counts.index.map(lambda tup: ' & '.join(f'~{var}' if truth else var for truth, var in zip(tup, counts.index.names)))
>>> counts
a & z     19
~a & z     4
a & ~z     2
dtype: int64
Another option is to compute the four counts directly from boolean masks:
a = df['a'] == 0
z = df['z'] == 0
count = pd.Series([sum(a & z), sum(a & ~z), sum(~a & z), sum(~a & ~z)],
                  index=['a & z', 'a & ~z', '~a & z', '~a & ~z'])
>>> count
a & z 3
a & ~z 2
~a & z 2
~a & ~z 13
dtype: int64
Use pd.crosstab(), as follows:
a = df['a'] == 0
z = df['z'] == 0
count_tab = pd.crosstab(a, z, rownames=["'a' = 0 ?"], colnames=["'z' = 0 ?"])
Result:
print(count_tab)
'z' = 0 ?  False  True
'a' = 0 ?
False         13      2
True           2      3
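If totals are useful, pd.crosstab can also append them; a small sketch extending the call above (reusing a and z):
count_tab = pd.crosstab(a, z, rownames=["'a' = 0 ?"], colnames=["'z' = 0 ?"],
                        margins=True)  # margins=True adds an 'All' row and column of totals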

Pandas dataframe doesn't update column value under condition

I have a dataframe (there are other columns as well, but they are not relevant here):
   index  DNA
0      0    0
1      1    1
2      2   -1
3      3    0
I added an additional boolean column called consec_bs to my dataframe, defined as follows: if the absolute value of df['DNA'] - df['DNA'].shift() equals 2, consec_bs is True; otherwise it is False. df['DNA'] only takes the values -1, 0, or 1. My code is as follows:
def consec_bs(df):
    df['consec_bs'] = False
    temp = df.shift()
    df['diff'] = abs(df['DNA'] - temp['DNA'])
    df[df['diff'] == 2].loc['consec_bs'] = True
    return df
And the output df is
   index  DNA  consec_bs
0      0    0      False
1      1    1      False
2      2   -1      False
3      3    0      False
However, consec_bs should be True at index 2.
I've tried df[df['diff'] == 2]['consec_bs'].replace(False, True, inplace = True), but it doesn't update consec_bs.
This is a chained assignment problem. Do df.loc[df['diff'] == 2, 'consec_bs'] = True instead.
This problem is described in the pandas docs (Why does assignment fail when using chained indexing?)
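For reference, a corrected sketch of the original function using that single .loc call:
import pandas as pd

def consec_bs(df):
    df['consec_bs'] = False
    df['diff'] = (df['DNA'] - df['DNA'].shift()).abs()
    # One indexing operation selects rows and column together, so the
    # assignment reaches the original frame rather than a temporary copy.
    df.loc[df['diff'] == 2, 'consec_bs'] = True
    return df

print(consec_bs(pd.DataFrame({'DNA': [0, 1, -1, 0]})))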
Try:
df["consec_bs"] = (abs(df['DNA'] - df['DNA'].shift()).eq(2))
print(df)
index DNA consec_bs
0 0 0 False
1 1 1 False
2 2 -1 True
3 3 0 False

Pandas change values in a groupby

I've a df like
a flag
0 1 False
1 0 False
2 1 False
3 0 False
4 0 False
and let's say I want to randomly set some True values within every group of column a, in order to obtain
a flag
0 1 True
1 0 True
2 1 True
3 0 False
4 0 True
So far I'm able to do so with the following code
import pandas as pd
import numpy as np

def rndm_flag(ds, n):
    l = len(ds)
    n = min([l, n])
    vec = ds.sample(n).index
    ds["flag"] = np.where(ds.index.isin(vec), True, ds["flag"])
    return ds

N = 5
df = pd.DataFrame({"a": np.random.randint(0, 2, N),
                   "flag": [False] * N})
dfs = list(df.groupby("a"))
dfs = [x[1] for x in dfs]
df = pd.concat([rndm_flag(x, 2) for x in dfs])
df.sort_index(inplace=True)
But I'm wondering if there is an alternative (more elegant) way to do so.
This should give you some idea:
## create dataframe
df = pd.DataFrame({'a': [1, 0, 1, 0, 0], 'b': False})
## create flag
df['b'] = df.groupby('a')['b'].transform(
    lambda x: np.random.choice([True, False], len(x), p=[0.65, 0.35]))
print(df)
a b
0 1 False
1 0 True
2 1 False
3 0 True
4 0 True
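If you need exactly n flagged rows per group, as in the original rndm_flag, a possible sketch samples the row labels per group first and then flags them:
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1, 0, 0], 'flag': False})

n = 2
# Sample up to n row labels from each group, then flag exactly those rows.
picked = df.groupby('a', group_keys=False)['flag'].apply(
    lambda s: s.sample(min(len(s), n)))
df['flag'] = df.index.isin(picked.index)
print(df)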

Making new column in pandas DataFrame based on filter

Given this DataFrame:
df = pandas.DataFrame({"a": [1,10,20,3,10], "b": [50,60,55,0,0], "c": [1,30,1,0,0]})
What is the best way to make a new column, "filter" that has value "pass" if the values at columns a and b are both greater than x and value "fail" otherwise?
It can be done by iterating through rows but it's inefficient and inelegant:
c = []
for x, v in df.iterrows():
    if v["a"] >= 20 and v["b"] >= 20:
        c.append("pass")
    else:
        c.append("fail")
df["filter"] = c
One way would be to create a column of boolean values like this:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
>>> df
    a   b   c  filter
0   1  50   1   False
1  10  60  30   False
2  20  55   1    True
3   3   0   0   False
4  10   0   0   False
You can then change the boolean values to 'pass' or 'fail' using replace:
>>> df['filter'].astype(object).replace({False: 'fail', True: 'pass'})
0 fail
1 fail
2 pass
3 fail
4 fail
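The boolean test and the relabelling can also be collapsed into one step with np.where, along the lines of the indicator example earlier:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 10, 20, 3, 10], "b": [50, 60, 55, 0, 0],
                   "c": [1, 30, 1, 0, 0]})

# Condition, value where True, value where False.
df['filter'] = np.where((df['a'] >= 20) & (df['b'] >= 20), 'pass', 'fail')
print(df['filter'].tolist())  # ['fail', 'fail', 'pass', 'fail', 'fail']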
You can extend this to more columns using all. For example, to find rows where all of the chosen columns have entries greater than 0:
>>> cols = ['a', 'b', 'c'] # a list of columns to test
>>> df[cols] > 0
a b c
0 True True True
1 True True True
2 True True True
3 True False False
4 True False False
Using all across axis 1 of this DataFrame creates the new column:
>>> (df[cols] > 0).all(axis=1)
0 True
1 True
2 True
3 False
4 False
dtype: bool
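Putting the pieces together for the multi-column case, a sketch that maps the boolean result onto the 'pass'/'fail' labels:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 10, 20, 3, 10], "b": [50, 60, 55, 0, 0],
                   "c": [1, 30, 1, 0, 0]})

cols = ['a', 'b', 'c']
# Every tested column must be > 0 for the row to pass.
df['filter'] = np.where((df[cols] > 0).all(axis=1), 'pass', 'fail')
print(df['filter'].tolist())  # ['pass', 'pass', 'pass', 'fail', 'fail']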
