Using a column with a boolean to access other columns - python

I have a pandas dataframe like the following:
A B C
1 2 1
3 4 0
5 2 0
5 3 1
And would like to get the value from A if the value of C is 1 and the value of B if C is zero. How would I do this? Ultimately I'd like to end up with a vector with the values of A if C is one and B if C is 0 which would be [1,4,2,5]

Assuming you mean "from A if the value of C is 1 and from B if the value of C is 0", which makes sense given your intended output, I might use Series.where:
>>> df
A B C
0 1 2 1
1 3 4 0
2 5 2 0
3 5 3 1
>>> df.A.where(df.C, df.B)
0 1
1 4
2 2
3 5
dtype: int64
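which is read "make a Series using values of A if the corresponding value of C is true, otherwise use the corresponding value of B". Here C holds only 0 and 1, so it stands in for the condition directly; recent pandas versions may reject a non-boolean condition, in which case pass df.C == 1 or df.C.astype(bool) instead. Any other boolean Series, such as df.C*5+3 < 4, can be used in the same way.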
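A minimal sketch of the same selection with an explicit boolean mask, which should behave the same across pandas versions. The names mask, picked and picked_np are illustrative, and the numpy.where line is an alternative that is not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5, 5], "B": [2, 4, 2, 3], "C": [1, 0, 0, 1]})

# Explicit boolean mask: True where C equals 1
mask = df["C"].eq(1)

# Keep A where the mask is True, otherwise fall back to B
picked = df["A"].where(mask, df["B"])

# The same selection as a plain NumPy array
picked_np = np.where(mask, df["A"], df["B"])

print(picked.tolist())  # [1, 4, 2, 5]
```

Series.where keeps the original index, while numpy.where returns a bare array, so the first form is usually more convenient when the result goes back into the DataFrame.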

Related

Groupby selected rows by a condition on a column value and then transform another column

This seems easy, but I couldn't find a working solution for it:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0,0,2,2,2],
'B': [1,1,2,2,3],
'C': [1,1,2,3,4]})
A B C
0 0 1 1
1 0 1 1
2 2 2 2
3 2 2 3
4 2 3 4
I want to select rows based on the values of column A, then group by the values of column B, and finally transform the values of column C into their sum. Something along the lines of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for above example is:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
I also know how to split dataframe and do it. I am looking more for a solution that does it without the need of split+concat/merge. Thank you.
Is it just
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A','B'])['C'].sum().reset_index(),
df[~s])
)
Output:
A B C
0 2 2 5
1 2 3 4
0 0 1 1
1 0 1 1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
.assign(D=(~df['A'].isin([2])).cumsum())
.groupby(['D','A','B'])['C'].sum()
.reset_index('D',drop=True)
.reset_index()
)
Output:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
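If the goal is a literal in-place transform, as in the question's attempted code, a boolean mask with .loc is one possible sketch. This is an assumption about intent: unlike the desired output above, it keeps all five rows and does not collapse the duplicated (A, B) pairs. The name mask is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 2, 2, 2],
                   "B": [1, 1, 2, 2, 3],
                   "C": [1, 1, 2, 3, 4]})

# Rows to touch: those where A is 2
mask = df["A"].isin([2])

# Group only the selected rows by B and write the group sums back in place.
# Assignment aligns on the index, so unselected rows keep their C values.
df.loc[mask, "C"] = df.loc[mask].groupby("B")["C"].transform("sum")

print(df["C"].tolist())  # [1, 1, 5, 5, 4]
```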

Sequence length in dataframe in python

I have a dataframe in python that has a column like below:
Type
A
A
B
B
B
I want to add another column to my data frame according to the sequence of Type:
Type Seq
A 1
A 2
B 1
B 2
B 3
I was doing it in R with the following command:
setDT(df)[ , Seq := seq_len(.N), by = rleid(Type) ]
I am not sure how to do it in Python.
Use Series.rank,
df['seq'] = df['Type'].rank(method = 'dense').astype(int)
Type seq
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
Edit for updated question
df['seq'] = df.groupby('Type').cumcount() + 1
df
Output:
Type seq
0 A 1
1 A 2
2 B 1
3 B 2
4 B 3
Use pd.factorize:
import pandas as pd
df['seq'] = pd.factorize(df['Type'])[0] + 1
df
Output:
Type seq
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
In pandas
(df.Type!=df.Type.shift()).ne(0).cumsum()
Out[58]:
0 1
1 1
2 2
3 2
4 2
Name: Type, dtype: int32
More info
v=c('A','A','B','B','B','A')
data.table::rleid(v)
[1] 1 1 2 2 2 3
df
Type
0 A
1 A
2 B
3 B
4 B
5 A
# assign a new number to each run, as R data.table rleid does
(df.Type!=df.Type.shift()).ne(0).cumsum()
Out[60]:
0 1
1 1
2 2
3 2
4 2
5 3 # check
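Combining the run identifier above with cumcount gives a per-run counter, which appears to be what the R command in the question computes (seq_len(.N) by rleid). A sketch under that assumption; run_id is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("AABBBA")})

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those start flags labels each run (like rleid)
run_id = df["Type"].ne(df["Type"].shift()).cumsum()

# Number the rows within each run, starting at 1
df["Seq"] = df.groupby(run_id).cumcount() + 1

print(df["Seq"].tolist())  # [1, 2, 1, 2, 3, 1]
```

Grouping by Type alone would continue the count for the final A (giving 3 instead of 1), so the run identifier matters whenever a value can reappear later in the column.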
Might not be the best way, but try this:
df.loc[df['Type'] == 'A', 'Seq'] = 1
Similarly, for B:
df.loc[df['Type'] == 'B', 'Seq'] = 2
A strange (and not recommended) way of doing it is to use the built-in ord() function to get the Unicode code-point of the character.
That is:
df['Seq'] = df['Type'].apply(lambda x: ord(x.lower())-96)
A much better way of doing it is to change the type of the strings to categories:
df['Seq'] = df['Type'].astype('category').cat.codes
The codes start at 0, so you may have to increment them if you want different numbers (for example, add 1 to start at 1).

What does the function np.isreal do in a dataframe?

Can anyone explain the below code?
pima_df[~pima_df.applymap(np.isreal).all(1)]
pima_df is a dataframe.
You are extracting rows in which at least one complex number occurs.
e.g : pima_df =
a b
0 1 2
1 2 4+3j
2 3 5
result would be :
a b
1 2 (4+3j)
in short :
applymap - apply function on each and every element of dataframe.
np.isreal - returns true for real otherwise false
all - returns true if each element along an axis is true otherwise false.
~ - negates the boolean index
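The steps above can be sketched on the small example frame. Because applymap is deprecated in recent pandas releases, this sketch applies Series.map to each column instead, which is a swapped-in equivalent rather than the code from the question:

```python
import numpy as np
import pandas as pd

pima_df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4 + 3j, 5]})

# Element-wise check: True where the value has no imaginary part
is_real = pima_df.apply(lambda col: col.map(np.isreal))

# True for rows in which every column passed the check
all_real = is_real.all(axis=1)

# ~ flips the mask, keeping rows with at least one non-real value
result = pima_df[~all_real]

print(result.index.tolist())  # [1]
```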
Please look at the doc or help(np.isreal).
Returns a bool array, where True if input element is real.
If element has complex type with zero complex part, the return value
for that element is True.
To be precise, NumPy and pandas provide a set of methods for comparing and operating on arrays elementwise:
np.isreal : determines whether each element of an array is real.
np.all : determines whether all elements of an array evaluate to True.
tilde (~) : logical NOT, used here to negate a boolean index.
applymap : applies a function element-wise on a DataFrame.
all() : with axis=1, used to find rows where all the values are True.
The ~ operator calls the __invert__ dunder method, which pandas overrides to perform vectorized logical inversion on pd.DataFrame/pd.Series objects.
Example of boolean index (~):
>>> df
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
>>> df.query('a in b')
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
whereas the negation (equivalent to df.query('a not in b')) selects the remaining rows:
>>> df[~df.a.isin(df.b)] # complement of the above
a b c d
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
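A short sketch to check how isin, its negation, and the query forms relate, using only columns a and b from the example. The variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": list("aabbccddeeff"),
                   "b": list("aaaabbbbcccc")})

# Rows whose value in a also appears somewhere in b
in_b = df[df["a"].isin(df["b"])]

# ~ inverts the boolean mask, selecting every other row
not_in_b = df[~df["a"].isin(df["b"])]

print(in_b.index.tolist())      # [0, 1, 2, 3, 4, 5]
print(not_in_b.index.tolist())  # [6, 7, 8, 9, 10, 11]

# The query strings express the same two selections
same_in = in_b.equals(df.query("a in b"))
same_not_in = not_in_b.equals(df.query("a not in b"))
```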
hope this will help.

Select rows which have only zeros in columns

I want to select the rows in a dataframe which have zero in every column in a list of columns, e.g. this df:
In:
df = pd.DataFrame([[1,2,3,6], [2,4,6,8], [0,0,3,4],[1,0,3,4],[0,0,0,0]],columns =['a','b','c','d'])
df
Out:
a b c d
0 1 2 3 6
1 2 4 6 8
2 0 0 3 4
3 1 0 3 4
4 0 0 0 0
Then:
In:
mylist = ['a','b']
selection = df.loc[df['mylist']==0]
selection
I would like to see:
Out:
a b c d
2 0 0 3 4
4 0 0 0 0
Should be simple but I'm having a slow day!
You'll need to determine whether all columns of a row have zeros or not. Given a boolean mask, use DataFrame.all(axis=1) to do that.
df[df[mylist].eq(0).all(1)]
a b c d
2 0 0 3 4
4 0 0 0 0
Note that if you wanted to find rows with zeros in every column, remove the subsetting step:
df[df.eq(0).all(1)]
a b c d
4 0 0 0 0
Using reduce and NumPy's logical_and
The point of this is to eliminate the need to create new Pandas objects and simply produce the mask we are looking for using the data where it sits.
from functools import reduce
import numpy as np
# AND together one boolean array per column in mylist
df[reduce(np.logical_and, (df[c].values == 0 for c in mylist))]
a b c d
2 0 0 3 4
4 0 0 0 0
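For comparison, a self-contained sketch that builds both masks on the example frame and confirms they select the same rows. The names mask_pandas and mask_reduce are illustrative:

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 6], [2, 4, 6, 8], [0, 0, 3, 4],
                   [1, 0, 3, 4], [0, 0, 0, 0]],
                  columns=["a", "b", "c", "d"])
mylist = ["a", "b"]

# pandas route: compare the column subset to 0, then require all True per row
mask_pandas = df[mylist].eq(0).all(axis=1)

# NumPy route: AND together one boolean array per column
mask_reduce = reduce(np.logical_and, (df[c].to_numpy() == 0 for c in mylist))

selection = df[mask_pandas]
print(selection.index.tolist())  # [2, 4]
```

The pandas mask is a Series carrying the index, while the reduce mask is a bare array; either one can be passed to df[...] here because both have one entry per row.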

Pandas: select rows if a specific column satisfies a certain condition

Say I have this dataframe df:
A B C
0 1 1 2
1 2 2 2
2 1 3 1
3 4 5 2
Say I want to select all rows in which column C is > 1. If I do this:
newdf=df['C']>1
I only obtain True or False in the resulting df. Instead, in the example given I want this result:
A B C
0 1 1 2
1 2 2 2
3 4 5 2
What would you do? Do you suggest using iloc?
Use boolean indexing:
newdf=df[df['C']>1]
Or use query:
df.query('C > 1')
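On the iloc part of the question: iloc selects by integer position, so a label-aligned boolean mask is more naturally used with plain indexing or with .loc. A minimal sketch comparing the options on the example frame (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 1, 4], "B": [1, 2, 3, 5], "C": [2, 2, 1, 2]})

# The comparison alone yields a boolean Series, not the rows
mask = df["C"] > 1

# Passing the mask back to the frame selects the matching rows
newdf = df[mask]

# Equivalent spellings of the same selection
via_loc = df.loc[mask]
via_query = df.query("C > 1")

print(newdf.index.tolist())  # [0, 1, 3]
```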
