I was trying to return a DataFrame of GOT characters who are alive but predicted to die, but only if they have a house name (i.e. important characters). I was expecting it to skip NaN's, but it returned them as well. I've attached a screenshot of the output. Please help.
PS I haven't attached any spoilers so you may go ahead.
import pandas
df=pandas.read_csv('character-predictions.csv')
a=df[((df['actual']==1) & (df['pred']==0)) & (df['house'] !=None)]
b=a[['name', 'house']]
You need notnull together with loc for selecting the rows and columns:
b = df.loc[((df['actual']==1) & (df['pred']==0)) & (df['house'].notnull()), ['name', 'house']]
Sample:
import pandas as pd
df = pd.DataFrame({'house':[None,'a','b'],
'pred':[0,0,5],
'actual':[1,1,5],
'name':['J','B','C']})
print (df)
actual house name pred
0 1 None J 0
1 1 a B 0
2 5 b C 5
b = df.loc[((df['actual']==1) & (df['pred']==0)) & (df['house'].notnull()), ['name', 'house']]
print (b)
name house
1 B a
You can also check the pandas documentation:
Warning
One has to be mindful that in python (and numpy), the nan's don’t compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information.
In [13]: df2['one'] == np.nan
Out[13]:
a False
b False
c False
d False
e False
f False
g False
h False
Name: one, dtype: bool
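Because of this, the reliable way to test for missing values is isna/notnull rather than comparing against None or np.nan. A minimal sketch:
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, None])
print(s.isna())      # True for both np.nan and None
print(s.notnull())   # True only where a real value is present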
Related
I have a Pandas DataFrame where column B contains mixed types
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 False
4 5 b False
I want to modify column C to be True when the value in column B is of type int and also has a value greater than or equal to 3. So in this example df['B'][3] should match this condition
I tried to do this:
df.loc[(df['B'].astype(str).str.isdigit()) & (df['B'] >= 3)] = True
However I get the following error because of the str values inside column B:
TypeError: '>' not supported between instances of 'str' and 'int'
If I'm able to only test the second condition on the subset provided after the first condition this would solve my problem I think. What can I do to achieve this?
A good way without apply would be to use pd.to_numeric with errors='coerce', which turns the non-numeric strings into NaN without modifying column B itself:
df['C'] = pd.to_numeric(df.B, 'coerce') >= 3
>>> print(df)
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 True
4 5 b False
One solution could be:
df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)
If x is not a digit, the evaluation short-circuits and never tries to parse x to int - int() would raise a ValueError if its argument is not parseable as an integer.
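A quick illustration of that short-circuiting (the right-hand side is never evaluated when the left-hand side is False):
>>> 'abc'.isdigit() and int('abc') >= 3
False
>>> '7'.isdigit() and int('7') >= 3
True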
There are many ways around this (e.g. use a custom (lambda) function with df.apply, use df.replace() first), but I think the easiest way might be just to use an intermediate column.
First, create a new column that does the first check, then do the second check on this new column.
This works (although nikeros' answer is more elegant).
def check_maybe_int(n):
    return int(n) >= 3 if str(n).isdigit() else False

df.B.apply(check_maybe_int)
But the real answer is, don't do this! Mixed columns prevent a lot of Pandas' optimisations. apply is not vectorised, so it's a lot slower than vector int comparison should be.
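If you only need this check once, a vectorised sketch along the lines of the pd.to_numeric answer above avoids apply entirely (assuming the same df with the mixed column B):
b_num = pd.to_numeric(df['B'], errors='coerce')  # non-numeric strings become NaN
df['C'] = b_num >= 3                             # NaN >= 3 evaluates to False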
You can use apply(type), as the example below illustrates:
d = {'col1': [1, 2,1, 2], 'col2': [3, 4,1, 2],'col3': [1, 2,1, 2],'col4': [1, 'e',True, 2.345]}
df = pd.DataFrame(data=d)
a = df.col4.apply(type)
b = [ i==str for i in a ]
df['col5'] = b
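With the sample data above, only the row holding the string 'e' should come out as True:
print(df['col5'].tolist())   # [False, True, False, False]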
I want to use numpy.where() to add a column to a pandas.DataFrame. I'd like to use NaN values for the rows where the condition is false (to indicate that these values are "missing").
Consider:
>>> import numpy; import pandas
>>> df = pandas.DataFrame({'A':[1,2,3,4]}); print(df)
A
0 1
1 2
2 3
3 4
>>> df['B'] = numpy.nan
>>> df['C'] = numpy.where(df['A'] < 3, 'yes', numpy.nan)
>>> print(df)
A B C
0 1 NaN yes
1 2 NaN yes
2 3 NaN nan
3 4 NaN nan
>>> df.isna()
A B C
0 False True False
1 False True False
2 False True False
3 False True False
Why does B show "NaN" but C shows "nan"? And why does DataFrame.isna() fail to detect the NaN values in C?
Should I use something other than numpy.nan inside where? None and pandas.NA both seem to work and can be detected by DataFrame.isna(), but I'm not sure these are the best choice.
Thank you!
Edit: As per @Tim Roberts and @DYZ, numpy.where returns an array with a string dtype, so the str constructor is called on numpy.nan. The values in column C are actually the string "nan". The question remains, however: what is the most elegant thing to do here? Should I use None? Or something else?
np.where coerces the second and the third parameter to the same datatype. Since the second parameter is a string, the third one is converted to a string, too, by calling str() on it:
str(numpy.nan)
# 'nan'
As the result, the values in column C are all strings.
You can first fill the non-matching rows with None and then convert them to real NaN with fillna():
df['C'] = numpy.where(df['A'] < 3, 'yes', None)
df['C'] = df['C'].fillna(numpy.nan)
B is a pure numeric column. C contains strings (or a mixture of strings and missing values), so the column has dtype "object", and it prints differently.
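You can confirm this by checking the dtypes and the elements themselves (a quick check using the df from the question):
print(df.dtypes)                 # A is int64, B is float64, C is object
print(df['C'].iloc[2] == 'nan')  # True - it is the string 'nan', not a missing value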
I'd like to add a new column to my DataFrame based on a condition - whether row.id is one of the bad_cat values.
bad_cat = [71,84]
df = pd.DataFrame({'name' : ['a','b','c','d','e'], 'id' : [1,2,71,5,84]})
df['type'] = df[df.id in bad_cat]
Output:
name id type
a 1 False
b 2 False
c 71 True
d 5 False
e 84 True
It seems my code doesn't work - could you explain how to do it?
The most intuitive answer would be the one provided by Quang Hoang, using the .isin method. This creates a mask resulting in a Series of bools:
df['type'] = df['id'].isin(bad_cat)
The other approach is to use the index - this can be faster under some circumstances. After setting the index to the column that will be checked against the values in the list, you can use .loc to set type to True for the values that match:
df.set_index('id', inplace=True)
df['type'] = False
df.loc[bad_cat, 'type'] = True
For both solutions the output will be:
name type
id
1 a False
2 b False
71 c True
5 d False
84 e True
Note that the values in the column that serves as the index do not have to be unique.
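If you want the original column layout back after the index-based approach, reset_index turns id into an ordinary column again (a minimal sketch):
df = df.reset_index()
print(df[['name', 'id', 'type']])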
I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your DataFrame by 1 (int):
[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print(data)
0 1 2
0 True False True
1 False False True
[3]: print(data*1)
0 1 2
0 1 0 1
1 0 0 1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
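Since booleans already behave as numbers, ordinary numeric operations work directly on a boolean column, e.g. (a small sketch using the same column name as above):
df["somecolumn"].sum()    # counts the True values
df["somecolumn"].mean()   # fraction of True values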
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
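For example, on a frame mixing boolean and non-boolean columns (a minimal sketch):
import pandas as pd
df = pd.DataFrame({"flag": [True, False], "name": ["x", "y"]})
df.replace({False: 0, True: 1}, inplace=True)
print(df)   # flag becomes 1/0, name is left untouched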
You can also do this directly on whole DataFrames:
In [104]: df = DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view to convert booleans to integers:
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation of your DataFrame. Assuming my_data_condition is some boolean condition on your data (a placeholder name):
df = pd.DataFrame(my_data_condition)
Transforming True/False into 1/0:
df = df * 1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Below is how to map a column named 'type' with values FAKE/REAL to 0/1 (note: the same approach works for any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
This is a reproducible example based on some of the existing answers:
import pandas as pd
def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert the boolean to binary representation, maintain NaN values."""
    return s.replace({True: 1, False: 0})
# generate a random dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
a_bool=lambda df: df["a"] > 5,
b_bool=lambda df: df["b"] % 2 == 0,
)
# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]
# apply the new coding to a new dataframe (or can replace the existing one)
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})
Tried and tested:
df[col] = df[col].map({True: 1, False: 0})
If more than one column contains True/False values, use the following:
for col in bool_cols:
    df[col] = df[col].map({True: 1, False: 0})
@AMC wrote this in a comment. If the column is of type object:
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
I am trying to update the column values of a pandas DataFrame as follows:
1234(456 should become 1234
abcde(fg should become abcde
I wrote the following code, but for some reason it is not working:
energy[(energy['Country'].str.contains('\(')) &
(energy['Country'] != np.NAN)
].apply(lambda x: x['Country'].split('(')[0])
Here is the error: ValueError: cannot index with vector containing NA / NaN values
Any ideas to refine my code and make it work?
Try this:
In [23]: df
Out[23]:
Country
0 1234(456)
1 abcde(fg xxxx
In [24]: df.Country.str.replace(r'([^\(]*).*', r'\1', regex=True)
Out[24]:
0 1234
1 abcde
Name: Country, dtype: object
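Note that the .str methods propagate missing values, so this also works when the column contains NaN (a quick sketch assuming a column with a missing value in it):
import numpy as np
import pandas as pd
s = pd.Series([np.nan, '1234(456', 'abcde(fg'])
print(s.str.replace(r'([^\(]*).*', r'\1', regex=True))   # NaN, '1234', 'abcde'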
Assume we have a DataFrame similar in format to yours:
energy = pd.DataFrame(dict(Country=[np.NaN, '1234(456', 'abcde(fg', np.NaN, 'pqrst'],
State=['A','B','C','D','E']))
energy
Let's see the first part of the boolean mask created:
mask1 = energy['Country'].str.contains('\(')
mask1
0 NaN
1 True
2 True
3 NaN
4 False
Name: Country, dtype: object
When you try to use this mask, you get:
energy[mask1]
ValueError: cannot index with vector containing NA / NaN values
which makes sense, as the mask contains both bool and float (NaN) values at the same time.
Also, the second mask:
mask2 = energy['Country'] != np.NAN # --> In python, the Nan's don't compare equal
mask2
0 True
1 True
2 True
3 True
4 True
Name: Country, dtype: bool
You can clearly see that even though we've created a boolean mask, it is True everywhere - the NaN rows are not filtered out, because NaN never compares equal to anything.
approach 1:
One hack would be to tell str.contains to treat NaN as False via its na parameter:
mask = energy['Country'].str.contains('\(', na=False)
mask
0 False
1 True
2 True
3 False
4 False
Name: Country, dtype: bool
Then, use it like:
energy[mask].apply(lambda x: x['Country'].split('(')[0], axis=1)
1 1234
2 abcde
dtype: object
approach 2:
Another way would be to use dropna and then create the mask:
mask = energy['Country'].dropna().str.contains('\(')
mask
1 True
2 True
4 False
Name: Country, dtype: bool
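Since this mask is shorter than the original frame, you cannot index energy with it directly; one way (a sketch) is to go through the surviving index labels:
idx = mask[mask].index   # labels where the mask is True
energy.loc[idx, 'Country'].str.split('(').str[0]
1     1234
2    abcde
Name: Country, dtype: object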
Try the following. It replaces the value with the part before "(" if "(" is present in the string; otherwise it returns the original value.
energy['Country'] = energy.apply(lambda x: x['Country'].split("(")[0] if "(" in x['Country'] else x['Country'], axis=1)
You can try this:
energy['Country'] = energy['Country'].astype(str).map(lambda x: x.split('(')[0] if '(' in x else x)