Convert True/False to 1/0 in Python [duplicate]

I have a column in a pandas DataFrame that has boolean True/False values, but for further calculations I need a 1/0 representation. Is there a quick pandas/numpy way to do that?

A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)

Just multiply your DataFrame by 1 (int):

[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print(data)
       0      1     2
0   True  False  True
1  False  False  True
[3]: print(data * 1)
   0  1  2
0  1  0  1
1  0  0  1

True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note that I use "is" as an English word here, not the Python keyword is - True will not be the same object as any random 1.

This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
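A quick sketch of the behaviour on a mixed-type frame (column names invented for illustration; recent pandas versions may emit a downcasting FutureWarning here):

import pandas as pd

df = pd.DataFrame({"x": [True, False], "y": ["a", "b"], "z": [1.5, 2.5]})
df.replace({False: 0, True: 1}, inplace=True)
print(df)
#    x  y    z
# 0  1  a  1.5
# 1  0  b  2.5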

You can also do this directly on DataFrames:

In [104]: df = DataFrame(dict(A=True, B=False), index=range(3))
In [105]: df
Out[105]:
      A      B
0  True  False
1  True  False
2  True  False
In [106]: df.dtypes
Out[106]:
A    bool
B    bool
dtype: object
In [107]: df.astype(int)
Out[107]:
   A  B
0  1  0
1  1  0
2  1  0
In [108]: df.astype(int).dtypes
Out[108]:
A    int64
B    int64
dtype: object

Use Series.view to convert booleans to integers:
df["somecolumn"] = df["somecolumn"].view('i1')

You can use a transformation on your data frame:

df = pd.DataFrame(my_data)  # a frame holding True/False values

Transforming True/False into 1/0:

df = df * 1

I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Below is how to map the column 'type', which has the values FAKE/REAL, to 0/1 (note: the same approach can be applied to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
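An equivalent one-liner sketch using map (reusing the 'type' column from above):

df['type'] = df['type'].map({'FAKE': 0, 'REAL': 1})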

This is a reproducible example based on some of the existing answers:

import pandas as pd

def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert booleans to a binary representation, maintaining NaN values."""
    return s.replace({True: 1, False: 0})

# generate a random dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
    a_bool=lambda df: df["a"] > 5,
    b_bool=lambda df: df["b"] % 2 == 0,
)

# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]

# apply the new coding to a new dataframe (or replace the existing one);
# the c=c default pins each column name at definition time - a bare lambda
# would close over the loop variable and apply only the last column
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})

Tried and tested:

df[col] = df[col].map({True: 1, False: 0})

If more than one column holds True/False values, use the following:

for col in bool_cols:
    df[col] = df[col].map({True: 1, False: 0})

(If a column stores the strings 'True'/'False' rather than actual booleans, use string keys in the mapping instead.)
(AMC wrote this in a comment.)
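A sketch for building bool_cols, assuming you want every boolean column in the frame:

bool_cols = df.select_dtypes(include='bool').columns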

If the column is of type object:
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)

Related

How to evaluate conditions after each other in Pandas .loc?

I have a Pandas DataFrame where column B contains mixed types:

   A    B      C
0  1    1  False
1  2  abc  False
2  3    2  False
3  4    3  False
4  5    b  False

I want to modify column C to be True when the value in column B is of type int and also has a value greater than or equal to 3. So in this example, df['B'][3] should match this condition.
I tried to do this:
df.loc[(df['B'].astype(str).str.isdigit()) & (df['B'] >= 3)] = True
However I get the following error because of the str values inside column B:
TypeError: '>' not supported between instances of 'str' and 'int'
If I'm able to only test the second condition on the subset provided after the first condition this would solve my problem I think. What can I do to achieve this?
A good way without the use of apply would be to use pd.to_numeric with errors='coerce', which converts the str values to NaN without modifying column B itself:
df['C'] = pd.to_numeric(df.B, 'coerce') >= 3
>>> print(df)
   A    B      C
0  1    1  False
1  2  abc  False
2  3    2  False
3  4    3   True
4  5    b  False
One solution could be:
df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)
If x is not a digit, the and short-circuits and never tries to parse x to int (int() raises a ValueError when its argument is not parseable).
There are many ways around this (e.g. use a custom (lambda) function with df.apply, use df.replace() first), but I think the easiest way might be just to use an intermediate column.
First, create a new column that does the first check, then do the second check on this new column.
This works (although nikeros' answer is more elegant).
def check_maybe_int(n):
    # str() guard needed because plain ints have no .isdigit()
    return int(n) >= 3 if str(n).isdigit() else False

df.B.apply(check_maybe_int)
But the real answer is, don't do this! Mixed columns prevent a lot of Pandas' optimisations. apply is not vectorised, so it's a lot slower than vector int comparison should be.
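A sketch of the vectorised route, coercing once up front so later comparisons stay plain column operations (B_num is an invented column name):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 'abc', 2, 3, 'b']})
df['B_num'] = pd.to_numeric(df['B'], errors='coerce')  # strs become NaN
df['C'] = df['B_num'] >= 3  # NaN >= 3 is False, so strs end up False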
You can use apply(type) to test each element's type, as the following example illustrates:
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 1, 2], 'col3': [1, 2, 1, 2], 'col4': [1, 'e', True, 2.345]}
df = pd.DataFrame(data=d)
a = df.col4.apply(type)
b = [i == str for i in a]
df['col5'] = b

Compare columns in a dictionary of dataframes

I have a dictionary of dataframes (Di_1). Each dataframe has the same number of columns, column names, number of rows and row indexes. I also have a list of the names of the dataframes (dfs). I would like to compare the contents of one of the columns (A) in each dataframe with those of the last dataframe in the list to see whether they are the same. For example:
df_A = pd.DataFrame({'A': [1,0,1,0]})
df_B = pd.DataFrame({'A': [1,1,0,0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X','Y']
I tried:
for df in dfs:
    Di_1[str(df)]['True'] = Di_1[str(df)]['A'].equals(Di_1[str(dfs[-1])]['A'])
I got:
[0,0,0,0]
I would like to get:
[1,0,0,1]
My attempt is checking whether the whole column is the same but I would instead please like to get it to go through each dataframe row by row.
I think you are making things too complicated here. You can write:

series_last = Di_1[dfs[-1]]['A']
for df in map(Di_1.get, dfs):
    df['True'] = df['A'] == series_last
This will produce as result:

>>> df_A
   A   True
0  1   True
1  0  False
2  1  False
3  0   True
>>> df_B
   A  True
0  1  True
1  1  True
2  0  True
3  0  True
So each df_i gets an extra column named 'True' (you may want to use a different name) that checks whether, for a given row, the value is the same as the one in series_last.
In case dfs contains something other than strings, we can first convert these to strings:

series_last = Di_1[str(dfs[-1])]['A']
for df in map(Di_1.get, map(str, dfs)):
    df['True'] = df['A'] == series_last
Create a list:

l = [Di_1[i] for i in dfs]

Then, using isin(), you can compare the first and last df:

l[0].isin(l[-1]).astype(int)
   A
0  1
1  0
2  0
3  1
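For what it's worth, an equivalent element-wise sketch with eq, which compares aligned positions the same way the accepted loop does:

l[0]['A'].eq(l[-1]['A']).astype(int)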

Pandas: Check value exist in a Column, which is stored as list

I have the below data frame. The STATUS column stores the values as lists:

df
                  STATUS
1  [REQUESTED, RECEIVED]
2                  [XYZ]
3             [RECEIVED]
When I try the below logic:
df['STATUS'].str.upper().isin(['RECEIVED'])
It gives me
1    False
2    False
3    False
But I am expecting
1     True
2    False
3     True
as we have the value RECEIVED at rows 1 and 3.
For a simple check like this, you can join the list of strings and use contains.
EDIT:
To account for the difference between RECEIVED and RECEIVED CASH, you can join the lists with a unique character (such as '=') AND surround the resulting string with the same character, and then check for =RECEIVED=.
('=' + df['STATUS'].str.join('=') + '=').str.contains('=RECEIVED=')
It's possible you mean something like:
>>> df.STATUS.astype(str).str.upper().str.contains('RECEIVED')
1     True
2    False
3    False
(Your example has a typo, incidentally - 1. has RECEIVED and 3. has RECIEVED.)
since isin works the opposite way from what your example intends.
Data from jde:

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]},
                  index=[1, 2, 3])

df.STATUS.apply(lambda x: 'RECEIVED' in x)
Out[11]:
1     True
2    False
3     True
Name: STATUS, dtype: bool
It's hard to operate directly with list values. You can concatenate the strings into one, using some separator character, and then check the condition:
import pandas as pd

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]},
                  index=[1, 2, 3])
print(df['STATUS'].str.join('|').str.contains('RECEIVED'))

Output:
1     True
2    False
3     True
Name: STATUS, dtype: bool
A more efficient option would be to replace the strings with numerical flags. This can be done really nicely since Python 3.6 using enum.Flag.
import enum
import pandas as pd

class Status(enum.Flag):
    REQUESTED = enum.auto()
    RECEIVED = enum.auto()
    XYZ = enum.auto()

df = pd.DataFrame({'STATUS': [Status.REQUESTED | Status.RECEIVED, Status.XYZ, Status.RECEIVED]}, index=[1, 2, 3])
print(df['STATUS'] & Status.RECEIVED)
Or, if you already have a data frame with strings:
import enum
from functools import reduce

import pandas as pd

class Status(enum.Flag):
    REQUESTED = enum.auto()
    RECEIVED = enum.auto()
    XYZ = enum.auto()

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]}, index=[1, 2, 3])
df['STATUS_ENUM'] = df['STATUS'].apply(lambda v: reduce(lambda a, b: a | Status[b], v, Status(0)))
print(df['STATUS_ENUM'] & Status.RECEIVED)

Dataframe returning None value

I was selecting a dataframe of characters from GOT who are alive but predicted to die, but only if they have some house name (i.e. important characters). I was expecting it to skip NaNs, but it returned them as well. Please help.
PS I haven't attached any spoilers so you may go ahead.
import pandas

df = pandas.read_csv('character-predictions.csv')
a = df[((df['actual'] == 1) & (df['pred'] == 0)) & (df['house'] != None)]
b = a[['name', 'house']]
You need notnull, together with loc (the ix indexer used in the original answer has since been removed from pandas), for selecting columns:

b = df.loc[((df['actual'] == 1) & (df['pred'] == 0)) & (df['house'].notnull()), ['name', 'house']]
Sample:

df = pd.DataFrame({'house': [None, 'a', 'b'],
                   'pred': [0, 0, 5],
                   'actual': [1, 1, 5],
                   'name': ['J', 'B', 'C']})
print(df)
   actual house name  pred
0       1  None    J     0
1       1     a    B     0
2       5     b    C     5

b = df.loc[((df['actual'] == 1) & (df['pred'] == 0)) & (df['house'].notnull()), ['name', 'house']]
print(b)
  name house
1    B     a
You can also check the pandas documentation:

Warning
One has to be mindful that in python (and numpy), the nan's don't compare equal, but None's do. Note that pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False

So as compared to above, a scalar equality comparison versus a None/np.nan doesn't provide useful information.

In [13]: df2['one'] == np.nan
Out[13]:
a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
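This is why the idiomatic test is isna()/notna() rather than an equality comparison; a minimal sketch:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, None])  # None is stored as NaN in a float series
print(s.isna())     # True exactly where the value is missing
print(s == np.nan)  # all False, never useful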

How do I verify that two series of a pandas DataFrame have the same elements?

I wish to compare two series in a DataFrame, and get a boolean True or False answer as to whether they have exactly the same elements.
If an element differs, then I'd like to know its index number.
Thank you!
IIUC you can use isin:
In [123]:
s1 = pd.Series(np.arange(5))
s2 = pd.Series(np.arange(1, 6))
s2
Out[123]:
0    1
1    2
2    3
3    4
4    5
dtype: int32

In [125]:
s1.isin(s2)
Out[125]:
0    False
1     True
2     True
3     True
4     True
dtype: bool

From the above you can get the index values that are False by negating the mask using ~:

In [127]:
s1[~s1.isin(s2)].index
Out[127]:
Int64Index([0], dtype='int64')
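One caveat worth noting: isin tests set membership, not position. If "exactly the same elements" means the same value at each index, an element-wise comparison is closer to the mark; a sketch, reusing s1 and s2 from above:

s1.equals(s2)          # single True/False for the whole series
s1[s1.ne(s2)].index    # indexes where the aligned values differ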
EdChum, thanks for your answer!
It's better than the one I have managed to work out, which I will post below anyway:
ser1 = Series(np.arange(16))
arr = ser1.values.reshape(4, 4)  # Series.reshape was removed from pandas; reshape the underlying array
df = DataFrame(arr, columns=['a', 'b', 'c', 'd'])
ser_e = Series([2, 6, 10, 14])
df['e'] = ser_e
df['c'] > df['b']
df.loc[df['c'] != df['e']]
