I have a pandas DataFrame where I would like to verify that column A is greater than column B (row-wise). I am doing something like:
tmp = df['B'] - df['A']
if any([v for v in tmp if v > 0]):
    ....
I was wondering if there is a better (more concise) way of doing this, or if pandas DataFrames have any built-in routines to accomplish it.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 1, 1]})
temp = df['B'] - df['A']
print(temp)
0 2
1 -1
2 -2
dtype: int64
Now you can create a Boolean series using temp > 0:
print(temp > 0)
0 True
1 False
2 False
dtype: bool
This boolean Series can be fed to the built-in any(), so you can use:
if any(temp > 0):
    print('juhu!')
Or simply (which avoids temp):
if any(df['B'] > df['A']):
    print('juhu')
Using the same logic of creating a boolean Series first:
print(df['B'] > df['A'])
0 True
1 False
2 False
dtype: bool
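Note that pandas Series also have their own .any() method, which directly answers the question about a built-in routine and avoids Python-level iteration. A minimal sketch, using the same df as above:
if (df['B'] > df['A']).any():
    print('juhu')
Both the comparison and the reduction run in vectorized code here, which matters on large frames.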
df['B'] > df['A'] will be a pandas Series of boolean dtype.
>>> (df['B']>df['A']).dtype
dtype('bool')
For example
>>> df['B']>df['A']
0 True
1 False
2 False
3 True
4 True
dtype: bool
The any() function returns True if any item in an iterable is truthy:
>>> if any(df['B']>df['A']):
... print(True)
...
True
I guess you want to check: if any df['B'] > df['A'], then do something.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 0, 6, 3]})
if np.where(df['B'] > df['A'], 1, 0).sum():
    print('do something')
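A slightly more direct variant of the same check, staying within pandas (a sketch; the boolean comparison plus .sum() counts the rows where the condition holds, and any nonzero count is truthy):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 0, 6, 3]})
# Count rows where B > A; a nonzero count means at least one match
if (df['B'] > df['A']).sum():
    print('do something')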
I am trying to very efficiently chain a variable number of boolean pandas Series, to be used as a filter on a DataFrame through boolean indexing.
Normally, when dealing with multiple boolean conditions, one chains them like this:
condition_1 = (df.A > some_value)
condition_2 = (df.B <= other_value)
condition_3 = (df.C == another_value)
full_indexer = condition_1 & condition_2 & condition_3
but this becomes a problem with a variable number of conditions.
bool_indexers = [
condition_1,
condition_2,
...,
condition_N,
]
I have tried out some possible solutions, but I am convinced it can be done more efficiently.
Option 1
Loop over the indexers and apply consecutively.
full_indexer = bool_indexers[0]
for indexer in bool_indexers[1:]:
    full_indexer &= indexer
Option 2
Stack into a DataFrame (one condition per row) and take the product along axis 0.
full_indexer = pd.DataFrame(bool_indexers).product(axis=0)
Option 3
Use numpy.prod (like in this answer) and create a new Series out of the result.
full_indexer = pd.Series(np.prod(np.vstack(bool_indexers), axis=0))
All three solutions are somewhat inefficient because they rely on looping or force you to create a new object (which can be slow if repeated many times).
Can it be done more efficiently or is this it?
Use np.logical_and.reduce:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 1, 2], 'B': [0, 1, 2], 'C': [0, 1, 2]})
m1 = df.A > 0
m2 = df.B <= 1
m3 = df.C == 1
m = np.logical_and.reduce([m1, m2, m3])
# OR m = np.all([m1, m2, m3], axis=0)
out = df[m]
Output:
>>> pd.concat([m1, m2, m3], axis=1)
A B C
0 False True False
1 True True True
2 True False False
>>> m
array([False, True, False])
>>> out
A B C
1 1 1 1
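Since the question asks about a variable number of conditions, note that np.logical_and.reduce takes a list of any length, so the same pattern applies unchanged to the dynamic case. A minimal sketch, reusing df from above:
# Build the condition list dynamically, as in the question
bool_indexers = [df.A > 0, df.B <= 1, df.C == 1]

full_indexer = np.logical_and.reduce(bool_indexers)
out = df[full_indexer]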
I have a dataframe:
A B C D
1 0 0 2
0 1 0 0
0 0 0 0
I need to select all values which are greater than 0 and put them in a list.
If a row doesn't contain any positive value, 0 should be written to the list.
So, the output for given dataframe should look like this:
[1,2,1,0]
How can this be resolved?
Here is a simple loop you could use (looping through df.values gives us rows as arrays):
output = []
for ar in df.values:
    nonzeros = ar[ar > 0]
    # If nonzeros is not empty, proceed and extend the output
    if nonzeros.size:
        output.extend(nonzeros)
    # If not, add 0
    else:
        output.append(0)
print(output)
returns:
[1, 2, 1, 0]
We can make extensive use of pandas + numpy here:
Mask all values which are greater than 0
m = df.gt(0)
A B C D
0 True False False True
1 False True False False
2 False False False False
Mask rows which don't contain any values above 0:
s1 = m.any(axis=1).astype(int).values
Get all the values greater than 0 in an array:
s2 = df.values[m]
Finally, concatenate both arrays:
np.concatenate([s2, s1[s1==0]]).tolist()
Output
[1, 2, 1, 0]
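One caveat: this appends the zeros for all-empty rows at the end of the result, so it reproduces [1, 2, 1, 0] here only because the all-zero row happens to come last. If row order matters in general, a row-wise approach is safer.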
In your case, first stack the df, then apply your condition: if a row contains non-zero values, select them; if it is all zeros, keep a single zero.
df.stack().groupby(level=0).apply(lambda x : x.head(1) if all(x==0) else x[x!=0]).tolist()
[1, 2, 1, 0]
Or without apply
np.concatenate(df.mask(df==0).stack().groupby(level=0).apply(list).reindex(df.index,fill_value=[0]).values)
array([1., 2., 1., 0.])
Shortening the process:
np.concatenate(list(map(lambda x : [x[0]] if all(x==0) else x[x!=0],df.values)))
array([1, 2, 1, 0])
You could apply a custom function which will process each row of the DataFrame and return a list, then sum the returned lists to concatenate them.
In [1]: import pandas as pd
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
A B C D
0 1 0 0 2
1 0 1 0 0
2 0 0 0 0
In [4]: def get_positive_values(row):
...: # If all elements in a row are zeros
...: # then return a list with a single zero
...: if row.eq(0).all():
...: return [0]
...: # Else return a list with positive values only.
...: return row[row.gt(0)].tolist()
...:
...:
In [5]: df.apply(get_positive_values, axis=1).sum()
Out[5]: [1, 2, 1, 0]
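A side note on the final .sum(): summing Python lists concatenates them pairwise, which can become slow with many rows. A sketch of a faster flattening step with itertools.chain, reusing get_positive_values from above:
In [6]: from itertools import chain
In [7]: list(chain.from_iterable(df.apply(get_positive_values, axis=1)))
Out[7]: [1, 2, 1, 0]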
I have a list of pandas Series objects obj and a Series of indices idx. What I want is a new Series out that, for each row, contains the value obj[idx[row]][row] if idx[row] is not 255, and -1 otherwise.
The following code does what I want to achieve, but I'd like to know if there's a better way of doing this, especially without the overhead of first creating a Python list and then converting that into a pandas series.
>>> import pandas as pd
>>> obj = [pd.Series([1, 2, 3]), pd.Series([4, 5, 6]), pd.Series([7, 8, 9])]
>>> idx = pd.Series([0, 255, 2])
>>> out = pd.Series([obj[idx[row]][row] if idx[row] != 255 else -1 for row in range(len(idx))])
>>> out
0 1
1 -1
2 9
dtype: int64
>>>
Thanks in advance.
Using reindex + lookup:
pd.Series(pd.concat(obj, axis=1).reindex(idx).lookup(idx, idx.index)).fillna(-1)
Out[822]:
0 1.0
1 -1.0
2 9.0
dtype: float64
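Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so on recent versions a plain NumPy indexing sketch can achieve the same thing without lists; the clip below only keeps the 255 sentinel in bounds before it is masked out:
import numpy as np
import pandas as pd

obj = [pd.Series([1, 2, 3]), pd.Series([4, 5, 6]), pd.Series([7, 8, 9])]
idx = pd.Series([0, 255, 2])

vals = np.column_stack([s.to_numpy() for s in obj])  # one column per Series
rows = np.arange(len(idx))
safe = idx.to_numpy().clip(max=len(obj) - 1)         # keep the 255 sentinel in bounds
out = pd.Series(np.where(idx.to_numpy() != 255, vals[rows, safe], -1))
# 0    1
# 1   -1
# 2    9
# dtype: int64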
Consider the dataframe df
df = pd.DataFrame({
1: [1, 2],
2: ['a', 3],
3: [None, 7]
})
df
1 2 3
0 1 a NaN
1 2 3 7.0
When I compare with a string
df == 'a'
TypeError: Could not compare ['a'] with block values
However, taking the transpose fixes the problem?!
(df.T == 'a').T
1 2 3
0 False True False
1 False False False
What is this error? Is it something I can fix with how I'm constructing my dataframe? What is different about comparing to the transpose?
When creating your data frame, declare dtype=object:
In [1013]: df = pd.DataFrame({
...: 1: [1, 2],
...: 2: ['a', 3],
...: 3: [None, 7]
...: }, dtype=object)
In [1014]: df
Out[1014]:
1 2 3
0 1 a None
1 2 3 7
Now, you can compare without transposition:
In [1015]: df == 'a'
Out[1015]:
1 2 3
0 False True False
1 False False False
My belief is that, to begin with, your columns aren't object dtype (pandas coerces to specific dtypes wherever possible), but transposition forces the change to object because the rows hold mixed values.
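You can see the coercion by inspecting dtypes before and after transposing (a quick check on the original, non-object frame):
In [1016]: df = pd.DataFrame({1: [1, 2], 2: ['a', 3], 3: [None, 7]})
In [1017]: df.dtypes
Out[1017]:
1      int64
2     object
3    float64
dtype: object
In [1018]: df.T.dtypes
Out[1018]:
0    object
1    object
dtype: object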
Found this in the source code pandas/internals.py:
if not isinstance(result, np.ndarray):
# differentiate between an invalid ndarray-ndarray comparison
# and an invalid type comparison
...
raise TypeError('Could not compare [%s] with block values' %
repr(other))
If the item being compared does not match the dtype of the array, this error is thrown.
I can easily check whether each element is equal to a number:
In [20]: s = pd.Series([1, 2, 3])
In [21]: s == 1
Out[21]:
0 True
1 False
2 False
dtype: bool
My problem is: is there a function like s._in([1, 2]) that would output something like
0 True
1 True
2 False
Yes, it is called isin. Use s.isin([1, 2]).
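For completeness, a quick demonstration, continuing the session above:
In [22]: s.isin([1, 2])
Out[22]:
0     True
1     True
2    False
dtype: bool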