Columns L, M, and N of my dataframe are populated with 'true' and 'false' values (1000 rows). I would like to create a new column 'count_false' that returns the number of times 'false' occurs across columns L, M, and N in each row.
Any tips appreciated!
Thank you.
You can negate your dataframe and sum over axis=1:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, (5, 3)), columns=list('LMN')).astype(bool)
df['Falses'] = (~df).sum(axis=1)
print(df)
L M N Falses
0 True False True 1
1 True False False 2
2 True True True 0
3 False True False 2
4 False False True 2
If you have additional columns, you can filter accordingly:
df['Falses'] = (~df[list('LMN')]).sum(axis=1)
Try this: df[df == False].count() (note the capital False; lowercase false is not defined in Python). That gives the per-column counts of False; for the per-row count the question asks for, use (df == False).sum(axis=1).
As explained in this Stack Overflow question, True and False equal 1 and 0 in Python, so something like line three of the following example should solve your problem:
import pandas as pd
df = pd.DataFrame([[True, False, True],[False, False, False],[True, False, True],[False, False, True]], columns=['L','M','N'])
df['count_false'] = 3 - (df['L']*1 + df['M']*1 + df['N']*1)
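One caveat the answers above skip: the question describes the columns as holding 'true' and 'false', which may be strings rather than booleans. If so, compare against the string directly; a minimal sketch with made-up data, assuming lowercase string values:
import pandas as pd

# hypothetical example data: the strings 'true'/'false', not booleans
df = pd.DataFrame({'L': ['true', 'false', 'true'],
                   'M': ['false', 'false', 'true'],
                   'N': ['true', 'true', 'false']})

# compare against the string 'false' and count matches per row
df['count_false'] = df[['L', 'M', 'N']].eq('false').sum(axis=1)
print(df['count_false'].tolist())  # [1, 2, 1]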
In a pandas dataframe, I have a column of mixed data types, such as text, integers and datetimes. I need to flag rows where the datetimes match: (1) the exact value in some cases, (2) only the date (ignoring the time), or (3) the date and time but ignoring seconds.
In the following code example, the mixed data type column contains three dates of varying precision. Mapping the conditions into a separate dataframe works for an exact value.
import pandas as pd
import numpy as np
# example data frame
inp = [{'Id': 0, 'mixCol': np.nan},
       {'Id': 1, 'mixCol': "text"},
       {'Id': 2, 'mixCol': 43831},
       {'Id': 3, 'mixCol': pd.to_datetime("2020-01-01 00:00:00")},
       {'Id': 4, 'mixCol': pd.to_datetime("2020-01-01 01:01:00")},
       {'Id': 5, 'mixCol': pd.to_datetime("2020-01-01 01:01:01")}]
df = pd.DataFrame(inp)
print(df.dtypes)
myMap = pd.DataFrame()
myMap["Exact"] = df["mixCol"] == pd.to_datetime("2020-01-01 01:01:01")
0 False
1 False
2 False
3 False
4 False
5 True
The output I need should be:
Id Exact DateOnly NoSeconds
0 False False False
1 False False False
2 False False False
3 False True False
4 False True True
5 True True True
BUT, mapping just the date, without time, maps as if the date had a time of 00:00:00.
myMap["DateOnly"] = df["mixCol"] == pd.to_datetime("2020-01-01")
Id Exact DateOnly
0 False False
1 False False
2 False False
3 False True
4 False False
5 True False
Trying to convert values in the mixed column throws an AttributeError: 'Series' object has no attribute 'date'; and trying to use ">" and "<" to define the relevant range throws a TypeError: '>=' not supported between instances of 'str' and 'Timestamp'.
myMap["DateOnly"] = df["mixCol"].date == pd.to_datetime("2020-01-01")
myMap["NoSeconds"] = (df["mixCol"] >= pd.to_datetime("2020-01-01 01:01:00")) & (df["mixCol"] < pd.to_datetime("2020-01-01 01:02:00"))
If I try to follow the solution for mixed columns in pandas proposed here, both the np.nan and the text value map as True, as if they were dates.
df["IsDate"] = df.apply(pd.to_datetime, errors='coerce', axis=1).nunique(1).eq(1).map({True: True, False: False})
I'm not sure how to proceed in this situation.
Use Series.dt.normalize to compare datetimes with the time removed (i.e. set to 00:00:00), or Series.dt.floor by days or minutes to remove the seconds:
# convert the column to datetimes; unparseable values become NaT
d = pd.to_datetime(df["mixCol"], errors='coerce')
# compare the date only: normalize() and floor('D') both set times to 00:00:00
myMap["DateOnly"] = d.dt.normalize() == pd.to_datetime("2020-01-01")
myMap["DateOnly"] = d.dt.floor('D') == pd.to_datetime("2020-01-01")
# alternative using python date objects
myMap["DateOnly"] = d.dt.date == pd.to_datetime("2020-01-01").date()
# compare without seconds by flooring to minute precision
myMap['NoSeconds'] = d.dt.floor('Min') == pd.to_datetime("2020-01-01 01:01:00")
print (myMap)
Exact DateOnly NoSeconds
0 False False False
1 False False False
2 False False False
3 False True False
4 False True True
5 True True True
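A side note on why the text and NaN rows come out False in every column without special handling: after errors='coerce' they become NaT, and NaT never compares equal to anything. A quick check:
import pandas as pd

print(pd.NaT == pd.to_datetime("2020-01-01"))  # False
print(pd.NaT == pd.NaT)                        # False, NaT behaves like NaN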
I can't for the life of me find an example of this operation in pandas... I am trying to write an IF statement: if my Check column is True, pull the value from my Proficiency_Value column, and if it is False, default to 1000.
report_skills['Check'] = report_skills.apply(lambda x: x.Skill_Specialization in x.Specialization, axis=1)
report_skills = report_skills.loc[report_skills['Check'] == True, 'Proficiency_Value'] = 1000
Any ideas why this is not working? I'm sure this is an easy fix
Let's create a small example DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Check': [True, False, False, True],
                   'Proficiency_Value': range(4)})
>>> df
Check Proficiency_Value
0 True 0
1 False 1
2 False 2
3 True 3
If you now use the np.where() function, you get the result you are asking for.
df['Proficiency_Value'] = np.where(df['Check']==True, df['Proficiency_Value'], 1000)
>>> df
Check Proficiency_Value
0 True 0
1 False 1000
2 False 1000
3 True 3
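If you prefer to stay within pandas, Series.where is an equivalent alternative (a sketch, not part of the answer above): it keeps values where the condition is True and fills in the replacement elsewhere.
# keep Proficiency_Value where Check is True, otherwise use 1000
df['Proficiency_Value'] = df['Proficiency_Value'].where(df['Check'], 1000)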
I have the following running code:
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
my_funds = [1, 2, 5, 7, 9, 11]
my_time = ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09', '2019-08']
df = pd.DataFrame({'TIME': my_time, 'FUNDS':my_funds})
for x in range(2, 3):
    df.insert(len(df.columns), f'x**{x}', df["FUNDS"]**x)
df = df.replace([1, 7, 9, 25],float('nan'))
print(df.isnull().values.ravel().sum()) #5 (obviously counting NaNs in total)
print(sum(map(any, df.isnull()))) #3 (I guess counting the NaNs in the left column)
I am getting the dataframe below. I want to count the rows that contain 1 or more NaN, which in my case is 4 (rows 0, 2, 3 and 4).
Use:
print (df.isna().any(axis=1).sum())
4
Explanation: First find the missing values with DataFrame.isna:
print (df.isna())
TIME FUNDS x**2
0 False True True
1 False False False
2 False False True
3 False True False
4 False True False
5 False False False
Then test whether at least one value per row is True with DataFrame.any:
print (df.isna().any(axis=1))
0 True
1 False
2 True
3 True
4 True
5 False
dtype: bool
Finally, count the True values with sum.
Another option:
nan_rows = len(df[df["FUNDS"].isna() | df["x**2"].isna()])
Yet another option is Series.clip, which caps the per-row NaN count at one when a row has more than one NaN:
df.isna().sum(axis=1).clip(upper=1).sum()
#4
How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like string
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN, so a == a will return False if a is NaN.
This works even for strings.
Example:
In[52]:
s = pd.Series([1, np.nan, '', 1.0])
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, so the a == a trick will not flag None as missing, while pd.isnull(None) returns True. So depending on whether you want to treat None as NaN, you can use == for comparison or pd.isnull.
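A quick demonstration of that None/NaN distinction (a minimal sketch, assuming the usual imports):
import numpy as np
import pandas as pd

print(None == None)       # True: the a == a trick misses None
print(np.nan == np.nan)   # False: NaN is never equal to itself
print(pd.isnull(None))    # True: pandas treats None as missing
print(pd.isnull(np.nan))  # True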
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False, True],
[ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
[False, True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
0 1
0 False True
1 True False
b.notna()
# same as
# b.notnull()
0 1
0 True False
1 False True
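Since the original question asks about a single value, note that these functions also accept scalars, which is arguably the cleanest answer here:
print(pd.isna(np.nan))   # True
print(pd.isna(None))     # True
print(pd.isna('text'))   # False, with no error on strings (unlike np.isnan)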
I would like to make a boolean vector by comparing two input boolean vectors. I can use a for loop, but is there a better way to do this?
My ideal solution would look like this:
df['A'] = [True, False, False, True]
df['B'] = [True, False, False, False]
C = ((df['A']==True) or (df['B']==True)).as_matrix()
print C
>>> True, False, False, True
I think this is what you are looking for:
C = (df['A']) | (df['B'])
C
0 True
1 False
2 False
3 True
dtype: bool
You could then leave this as a series or convert it to a list or array
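One caveat on the conversion: the as_matrix() method from the question was removed in modern pandas; the current equivalents look like this:
C.tolist()     # [True, False, False, True]
C.to_numpy()   # array([ True, False, False,  True])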
Alternatively you could use the any method with axis=1 to test across each row. It also works for any number of columns holding True values:
In [1105]: df
Out[1105]:
B A
0 True True
1 False False
2 False False
3 False True
In [1106]: df.any(axis=1)
Out[1106]:
0 True
1 False
2 False
3 True
dtype: bool
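If the frame also contains non-boolean columns, you can select the relevant columns first; a small sketch, assuming the column names from the question:
C = df[['A', 'B']].any(axis=1)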