How to count and change a value in a dataframe by row - Python

I want to count the number of times the value False appears in my dataframe, and record in each row how many times False appears in that row.
So here is how my table looks initially:
A       B    C      D      count
First   row  True   False  0
Second  row  False  True   0
Third   row  True   True   0
Fourth  row  False  False  0
This is how it should look:
A       B    C      D      count
First   row  True   False  1
Second  row  False  True   1
Third   row  True   True   0
Fourth  row  False  False  2
This is my code. I have tried to count for at least one column to start with, but it does not change the value in the count column.
import pandas as pd

data = {'A': ['One', None, 'One', None], 'B': [None, 'Two', None, 'Two'],
        'C': [True, False, True, False], 'D': [False, True, True, False],
        'count': [0, 0, 0, 0]}
df = pd.DataFrame(data)

for index, row in df.iterrows():
    if row['C'] is False:
        row['count'] += 1

print(df.head(4))

IIUC, you want to count the False (boolean) values per row?
You can subset the boolean columns with select_dtypes, invert the boolean values with ~ (so that False becomes True, which counts as 1), then sum per row:
df['count'] = (~df.select_dtypes('bool')).sum(axis=1)
output:
A B C D count
0 One None True False 1
1 None Two False True 1
2 One None True True 0
3 None Two False False 2
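As a side note on the original loop: iterrows yields a copy of each row, so row['count'] += 1 updates that copy and never writes back to df. If you did want to keep a loop (counting only column 'C', as in the attempt, and starting from the question's frame with count initialised to 0), a minimal sketch would write to the frame itself, e.g. with df.at:

for index, row in df.iterrows():
    if not row['C']:                 # avoid `is False`, which relies on object identity
        df.at[index, 'count'] += 1   # write back to the frame, not to the row copy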

Select columns 'C' and 'D', flip/invert the booleans (~) and then sum across both columns:
df['count'] = (~df[['C', 'D']]).sum(axis='columns')
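For reference, a self-contained version of this approach on the question's data (nothing new, just the pieces above put together):

import pandas as pd

df = pd.DataFrame({'A': ['One', None, 'One', None],
                   'B': [None, 'Two', None, 'Two'],
                   'C': [True, False, True, False],
                   'D': [False, True, True, False]})

# Invert the two boolean columns so each False counts as 1, then sum per row.
df['count'] = (~df[['C', 'D']]).sum(axis='columns')
print(df)
#       A     B      C      D  count
# 0   One  None   True  False      1
# 1  None   Two  False   True      1
# 2   One  None   True   True      0
# 3  None   Two  False  False      2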

Related

Compare one row of a dataframe with rows of another dataframe?

I have two dataframes, say df and thresh_df. The shape of df is, say, 1000*200 and thresh_df is 1*200.
I need to compare the thresh_df row with each row of df element-wise and fetch the corresponding column numbers whose values are less than the values in thresh_df.
I tried the following:
compared_df = df.apply(lambda x : np.where(x < thresh_df.values))
But I get an empty dataframe! If the question is unclear and needs any explanation, please let me know in the comments.
I think apply is not necessary; just compare against the one-row DataFrame converted to a Series by selecting its first row:
df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})
thresh_df = pd.DataFrame({
    'B': [4],
    'C': [7],
    'D': [4],
    'E': [5],
})
compared_df = df < thresh_df.iloc[0]
print (compared_df)
B C D E
0 False False True False
1 False False True True
2 False False False False
3 False True False False
4 False True True True
5 False True True True
Then use DataFrame.any to keep rows with at least one True and filter the index values:
idx = df.index[compared_df.any(axis=1)]
print (idx)
Int64Index([0, 1, 3, 4, 5], dtype='int64')
Detail:
print (compared_df.any(axis=1))
0 True
1 True
2 False
3 True
4 True
5 True
dtype: bool
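If you also need, per row, which columns fall below the threshold (the question mentions fetching the corresponding column numbers), a small sketch building on compared_df could look like this; cols_below is just an illustrative name:

# For each row, collect the column labels where the comparison is True.
cols_below = compared_df.apply(lambda row: list(row[row].index), axis=1)
print (cols_below)
# 0          [D]
# 1       [D, E]
# 2           []
# 3          [C]
# 4    [C, D, E]
# 5    [C, D, E]
# dtype: object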

Check if two rows in a pandas DataFrame have the same set of values, with regard to and regardless of column order

I have two dataframes with the same index but different column names. The number of columns is the same. I want to check, index by index, 1) whether they have the same set of values regardless of column order, and 2) whether they have the same values with regard to column order.
ind = ['aaa', 'bbb', 'ccc']
df1 = pd.DataFrame({'old1': ['A','A','A'], 'old2': ['B','B','B'], 'old3': ['C','C','C']}, index=ind)
df2 = pd.DataFrame({'new1': ['A','A','A'], 'new2': ['B','C','B'], 'new3': ['C','B','D']}, index=ind)
This is the output I need.
OpX OpY
-------------
aaa True True
bbb False True
ccc False False
Could anyone help me with OpX and OpY?
Using tuple and set: tuple keeps the order, set disregards it:
s1 = df1.apply(tuple, axis=1) == df2.apply(tuple, axis=1)
s2 = df1.apply(set, axis=1) == df2.apply(set, axis=1)
pd.concat([s1, s2], axis=1)
Out[746]:
0 1
aaa True True
bbb False True
ccc False False
Since cs95 mentioned that apply can be a problem here:
s = np.equal(df1.values, df2.values).all(1)
t = np.equal(np.sort(df1.values, 1), np.sort(df2.values, 1)).all(1)
pd.DataFrame(np.column_stack([s, t]), index=df1.index)
Out[754]:
0 1
aaa True True
bbb False True
ccc False False
Here's a solution that is performant and should scale. First, align the DataFrames on the index so you can compare them easily.
df3 = df2.set_axis(df1.columns, axis=1)
df4, df5 = df1.align(df3)
For OpX (comparison respecting column order), simply call DataFrame.equals (or just use the == op):
u = (df4 == df5).all(axis=1)
u
aaa True
bbb False
ccc False
dtype: bool
OpY (ignoring column order) is slightly more complex: sort the values within each row, then compare.
v = pd.Series((np.sort(df4) == np.sort(df5)).all(axis=1), index=u.index)
v
aaa True
bbb True
ccc False
dtype: bool
Concatenate the results,
pd.concat([u, v], axis=1, keys=['X', 'Y'])
X Y
aaa True True
bbb False True
ccc False False
For item 2):
(df1.values == df2.values).all(axis=1)
This checks element-wise equality of the dataframes, and gives True when all entries in a row are equal.
For item 1), sort the values along each row first:
import numpy as np
(np.sort(df1.values, axis=1) == np.sort(df2.values, axis=1)).all(axis=1)
Construct a new DataFrame and check the equality:
df3 = pd.DataFrame(index=ind)
df3['OpX'] = (df1.values == df2.values).all(1)
df3['OpY'] = (df1.apply(np.sort, axis=1).values == df2.apply(np.sort, axis=1).values).all(1)
print(df3)
Output:
OpX OpY
aaa True True
bbb False True
ccc False False

Python - Is one string column in another?

I have a pandas dataframe with two columns. I need to determine if the string value from one column is in the string value of another column. The second column could be a 'single value' like 'value1' or it could be multiple items separated by a '/' in the string, like this: 'value1/value2/value3'.
For each row, I need to determine if the string is present in the other string in the same row, so that 'value1' in 'value1/value2/value3' would evaluate to True.
My attempts thus far fail to check within each row, and instead check the first column's string against all rows of column 2.
Here is an example:
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                   'b': ['a/b', 'c/d', 'c/a', 'a/b', 'e']})
df['a'].isin(df['b'])
Expected result would evaluate to:
True
False
True
False
True
Comprehension
[a in b for a, b in zip(df.a, df.b)]
[True, False, True, False, True]
df.assign(In=[a in b for a, b in zip(df.a, df.b)])
a b In
0 a a/b True
1 b c/d False
2 c c/a True
3 d a/b False
4 e e True
Numpy
import numpy as np

a, b = df.values.astype(str).T
np.char.find(b, a) >= 0
array([ True, False,  True, False,  True])

df.assign(In=np.char.find(b, a) >= 0)
a b In
0 a a/b True
1 b c/d False
2 c c/a True
3 d a/b False
4 e e True
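Both of the above do a substring check, which matches the question's examples. If you need an exact match against the '/'-separated tokens (so that, say, 'a' does not also match inside 'ab/c'), a hedged variation is to split first:

# Split column b on '/' and test membership against the exact tokens.
df.assign(In=[a in b.split('/') for a, b in zip(df.a, df.b)])
#    a    b     In
# 0  a  a/b   True
# 1  b  c/d  False
# 2  c  c/a   True
# 3  d  a/b  False
# 4  e    e   True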

Create a new column by comparing rows in pandas

My dataframe looks like this
df = pd.DataFrame({'a': ["10001", "10001", "10002", "10002", "10002"],
                   'b': ['hello', 'hello', 'hola', 'hello', 'hola']})
I want to create a new column 'c' of boolean values with the following condition:
If the values of 'a' are the same (i.e. 1st and 2nd rows; 3rd, 4th and 5th rows), check whether the values of 'b' in those rows are the same (the 2nd row returns True; the 4th row returns False).
If the values of 'a' are not the same, skip.
My current code is the following:
def check_consistency(col1, col2):
    df['match'] = df[col1].eq(df[col1].shift())
    t = []
    for i in df['match']:
        if i == True:
            t.append(df[col2].eq(df[col2].shift()))

check_consistency('a', 'b')
And it returns an error.
I think this is a groupby:
df.groupby('a').b.apply(lambda x : x==x.shift())
Out[431]:
0 False
1 True
2 False
3 False
4 False
Name: b, dtype: bool
A bitwise & should do, checking whether both conditions are satisfied:
df['c'] = (df.a == df.a.shift()) & (df.b == df.b.shift())
df.c
#0 False
#1 True
#2 False
#3 False
#4 False
#Name: c, dtype: bool
Alternatively, if you want to make your current code work, you can do something like (essentially doing the same check as above):
def check_consistency(col1, col2):
    df['match'] = df[col1].eq(df[col1].shift())
    for i in range(len(df['match'])):
        if df['match'][i] == True:
            df.loc[i, 'match'] = (df.loc[i, col2] == df.loc[i-1, col2])

check_consistency('a', 'b')
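For a quick check, the fixed function reproduces the same flags as the vectorized versions above (assuming df is defined as in the question):

print(df['match'].tolist())
# [False, True, False, False, False]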

Cycling through a dictionary of dataframes for values above a threshold

I am trying to set up a way to cycle through a 100-element dictionary, where each element is a 42,000-row dataframe, and check whether each value is above a threshold.
I am having problems working out how to store the row data so it is not overwritten.
I've made a simple example of what I would like to do:
I have a three element dictionary (my_dic) where each element is a dataframe
I want to cycle through each row of every dataframe and check if any of the columns are above a threshold number.
I have been trying to use .any() and .where but I don't know how to capture the data for each df separately.
I would like to end up with three separate new dfs that have boolean values marking, in each column, which values are above the threshold.
Any help would be great!
import numpy as np
import pandas as pd

d1 = np.random.rand(3, 3)
d2 = np.random.rand(3, 3)
d3 = np.random.rand(3, 3)

df1 = pd.DataFrame(d1, index=['a', 'b', 'c'])
df2 = pd.DataFrame(d2, index=['a', 'b', 'c'])
df3 = pd.DataFrame(d3, index=['a', 'b', 'c'])

my_dic = {}
my_dic['a'] = df1
my_dic['b'] = df2
my_dic['c'] = df3

threshold = 0.5
I want to cycle through every row of each of the keys in my_dic and change the values to booleans indicating whether they are greater than a threshold number:
for k in my_dic:
    print(k)
    data = my_dic[k]
    for row in range(len(data)):
        print(row)
        np.where(data.iloc[row, :] > threshold)
This is the point I am struggling with; I'm not sure how to keep this data without it being overwritten.
I assume you're looking for a dict comprehension, if you just want a boolean mask.
result = {k : v > threshold for k, v in my_dic.items()}
for v in result.values():
print(v, '\n')
0 1 2
a False False False
b True False True
c False True True
0 1 2
a False True True
b True True True
c False True True
0 1 2
a False False False
b False False False
c True True False
If you want the result as 0/1, use astype:
result = {k : v.gt(threshold).astype(int) for k, v in my_dic.items()}
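And if, rather than booleans, you want to keep the values themselves wherever they exceed the threshold (NaN elsewhere), a small variation with DataFrame.where would be (above is just an illustrative name, not part of the original answer):

# Keep the original values above the threshold; everything else becomes NaN.
above = {k: v.where(v > threshold) for k, v in my_dic.items()}
print(above['a'])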
