I want to use numpy.where() to add a column to a pandas.DataFrame. I'd like to use NaN values for the rows where the condition is false (to indicate that these values are "missing").
Consider:
>>> import numpy; import pandas
>>> df = pandas.DataFrame({'A':[1,2,3,4]}); print(df)
A
0 1
1 2
2 3
3 4
>>> df['B'] = numpy.nan
>>> df['C'] = numpy.where(df['A'] < 3, 'yes', numpy.nan)
>>> print(df)
A B C
0 1 NaN yes
1 2 NaN yes
2 3 NaN nan
3 4 NaN nan
>>> df.isna()
A B C
0 False True False
1 False True False
2 False True False
3 False True False
Why does B show "NaN" but C shows "nan"? And why does DataFrame.isna() fail to detect the NaN values in C?
Should I use something other than numpy.nan inside where? None and pandas.NA both seem to work and can be detected by DataFrame.isna(), but I'm not sure these are the best choice.
Thank you!
Edit: As per @Tim Roberts and @DYZ, numpy.where returns an array of string dtype, so the str constructor is called on numpy.nan. The values in column C are actually the strings "nan". The question remains, however: what is the most elegant thing to do here? Should I use None? Or something else?
np.where coerces the second and the third parameters to a common datatype. Since the second parameter is a string, the third one is converted to a string too, effectively by calling str():
str(numpy.nan)
# 'nan'
As a result, the values in column C are all strings.
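A quick way to confirm the coercion (this check is mine, not part of the original answer; the exact string width may differ):
numpy.where(df['A'] < 3, 'yes', numpy.nan).dtype
# dtype('<U32') -- a fixed-width unicode string array, so numpy.nan became the string 'nan'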
You can first fill the missing rows with None and then convert them to numpy.nan with fillna():
df['C'] = numpy.where(df['A'] < 3, 'yes', None)
df['C'] = df['C'].fillna(numpy.nan)
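An alternative sketch (my suggestion, not from the original answer): build the column with pandas' own Series.where(), which leaves real NaN values that DataFrame.isna() can detect; the intermediate all-'yes' Series is just a convenient way to broadcast the scalar:
df['C'] = pandas.Series('yes', index=df.index).where(df['A'] < 3)
df['C'].isna()
# 0    False
# 1    False
# 2     True
# 3     True
# Name: C, dtype: bool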
B is a pure numeric column, so its missing values are real NaN and print as "NaN". C was built from a string array, so it actually holds the strings 'yes' and 'nan'; the column has dtype "object" and prints differently.
Related
I have a Pandas DataFrame where column B contains mixed types
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 False
4 5 b False
I want to modify column C to be True when the value in column B is of type int and also has a value greater than or equal to 3. So in this example df['B'][3] should match this condition
I tried to do this:
df.loc[(df['B'].astype(str).str.isdigit()) & (df['B'] >= 3)] = True
However I get the following error because of the str values inside column B:
TypeError: '>' not supported between instances of 'str' and 'int'
If I'm able to only test the second condition on the subset provided after the first condition this would solve my problem I think. What can I do to achieve this?
A good way without the use of apply would be to use pd.to_numeric with errors='coerce', which turns the str values into NaN without changing the dtype of column B itself:
df['C'] = pd.to_numeric(df.B, errors='coerce') >= 3
>>> print(df)
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 True
4 5 b False
One solution could be:
df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)
If str(x).isdigit() is False, the and short-circuits and int(x) is never evaluated, which would otherwise raise a ValueError for arguments that cannot be parsed as an int.
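To attach the result to column C, the same lambda can simply be assigned (this line is an addition for completeness):
df['C'] = df['B'].apply(lambda x: str(x).isdigit() and int(x) >= 3)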
There are many ways around this (e.g. use a custom (lambda) function with df.apply, use df.replace() first), but I think the easiest way might be just to use an intermediate column.
First, create a new column that does the first check, then do the second check on this new column.
This works (although nikeros' answer is more elegant).
def check_maybe_int(n):
    # only digit-like values are candidates; everything else is False
    return int(n) >= 3 if str(n).isdigit() else False

df.B.apply(check_maybe_int)
But the real answer is, don't do this! Mixed columns prevent a lot of Pandas' optimisations. apply is not vectorised, so it's a lot slower than vector int comparison should be.
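For comparison, a minimal sketch of the vectorised route (this restates the pd.to_numeric answer above purely to illustrate the speed argument; the helper column name B_num is just illustrative):
df['B_num'] = pd.to_numeric(df['B'], errors='coerce')  # strings become NaN, ints survive
df['C'] = df['B_num'] >= 3                              # plain vectorised comparison, no per-row Python calls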
You can use apply(type) to inspect the type of each value, as illustrated below:
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 1, 2], 'col3': [1, 2, 1, 2], 'col4': [1, 'e', True, 2.345]}
df = pd.DataFrame(data=d)
a = df.col4.apply(type)      # the Python type of each value in col4
b = [i == str for i in a]    # True where the value is a str
df['col5'] = b
I have a dataframe similar to the below example, with a column that contains True or NaN:
df = pd.DataFrame({'Data': [1, 2, 3, 4, 5], 'T/F':[True, True, True, True, True]})
Data T/F
0 1 True
1 2 True
2 3 True
3 4 True
4 5 True
I want to try and remove the True from the final row in this dataframe, but when I do, all the other Trues become 1:
df.loc[df.last_valid_index(), 'T/F'] = np.nan
Data T/F
0 1 1.0
1 2 1.0
2 3 1.0
3 4 1.0
4 5 NaN
I was wondering if anyone knows why this happens, and whether there is any way I can stop it? I'm thinking I might need to change my code to use False instead of NaN.
You can use pd.NA instead:
df.loc[df.last_valid_index(), 'T/F'] = pd.NA
output of df:
Data T/F
0 1 True
1 2 True
2 3 True
3 4 True
4 5 <NA>
Note: Since the type of np.nan is float, pandas casts the whole column to float, converting boolean True to 1.0 and boolean False to 0.0.
Also, pd.NA preserves the existing boolean values (the column becomes object dtype, but each True stays a bool), which you can check with:
print(df['T/F'].map(type))
#output of above code:
0 <class 'bool'>
1 <class 'bool'>
2 <class 'bool'>
3 <class 'bool'>
4 <class 'pandas._libs.missing.NAType'>
Name: T/F, dtype: object
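If you would rather keep a real boolean column instead of object, a small sketch using pandas' nullable "boolean" dtype (available since pandas 1.0) also works:
df['T/F'] = df['T/F'].astype('boolean')
df.loc[df.last_valid_index(), 'T/F'] = pd.NA
print(df['T/F'].dtype)
# boolean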
A column has a single dtype, so when you set a value to np.nan, pandas converts the whole column to float. As far as I remember, df.astype() also works on whole columns, not on individual cells.
This is due to the fact that the T/F column contains bool data and the value you try to assign is of type numpy.float64, so the column is being casted to the highest mutual dtype, which is numpy.float64 in this case.
If you would like this column to contain mixed values, i.e. both bool and numpy.float64, you should cast it to object before updating it, as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Data': [1, 2, 3, 4, 5], 'T/F':[True, True, True, True, True]})
df['T/F'] = df['T/F'].astype('object')
df.loc[df.last_valid_index(), 'T/F'] = np.nan
df
Output:
Data T/F
0 1 True
1 2 True
2 3 True
3 4 True
4 5 NaN
Note:
Holding mixed values in pandas.DataFrames is usually not a good practice, as it considerably slows down performance, so it should be avoided whenever possible.
Cheers
I have a dataframe like that below and want to create a new variable that is a 1/0 or True/False if all of the available scores in certain columns are equal to or above 4.
The data is quite messy. Some cells are NaN (respondent didn't provide a response), some are white space (bad formatting or respondent pressed space bar, maybe?).
ID Var1 Var2 Var3
id0001 2 NaN 2
id0002 10 3 10
id0003 8 0
id0004 NaN NaN NaN
id0005 7 3 7
id0006 NaN 9 9
I don't want to drop the rows with a missing value because most rows have one. I can't just set the NaN and white-space cells to 0 because 0 means something here. I can easily turn all white-space cells into NaN, but I don't know how to ignore them, since I then end up comparing 'str' and 'int' instances when I do something like the following:
scoreoffouroraboveforall = [(df.Var1 >= 4) & (df.Var2 >= 4) & (df.Var3 >= 4)]
This is probably very simple to do, but I'm at a loss.
Use pd.to_numeric with the optional parameter errors='coerce' to convert each of the columns Var1, Var2 and Var3 to a numeric type, then use DataFrame.ge and DataFrame.all along axis=1 to create the boolean mask with the required True/False values:
m = df[['Var1', 'Var2', 'Var3']].apply(
    pd.to_numeric, errors='coerce').ge(4).all(axis=1)
Result:
print(m)
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
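If a 1/0 column is preferred over True/False (the question mentions either), the mask can be cast and attached; the column name here is just illustrative:
df['AllFourPlus'] = m.astype(int)  # 1 where every score is numeric and >= 4, else 0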
I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's, instead of True and False. It appears to have something to do with the amount of empty rows and type of other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the print. Mainly note row ZBA, which has the same values in both sheets, but different values in the DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?
Pandas type-casting is applied by column / series. In general, Pandas doesn't work well with mixed types, or object dtype. You should expect internalised logic to determine the most efficient dtype for a series. In this case, Pandas has chosen float dtype as applicable for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np

df = pd.DataFrame({'A': [True, False, True, True],
                   'B': [np.nan, np.nan, np.nan, False],
                   'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object
pd.read_excel will determine the dtype for each column based on the first row in the column with a value. If the first row of that column is empty, read_excel will continue to the next row until a value is found.
In Sheet1, your first row with values in column B, C, and D contains strings. Therefore, all subsequent rows will be treated as strings for these columns. In this case, FALSE = False
In Sheet2, your first row with values in column B, C, and D contains integers. Therefore, all subsequent rows will be treated as integers for these columns. In this case, FALSE = 0.
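One possible workaround (a sketch of mine, not from the original answers) is to switch off the numeric inference by reading the sheet with dtype=object, so Boolean cells keep their original Python type:
df2 = pd.read_excel('Boolean_1.xlsx', 'Sheet2', dtype=object)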
I am using pandas and want to select subsets of data and apply it to other columns.
e.g.
if there is data in column A; &
if there is NO data in column B;
then, apply the data in column A to column D
I have this working fine for now using .isnull() and .notnull().
e.g.
df = pd.DataFrame({'A' : pd.Series(np.random.randn(4)),
                   'B' : pd.Series(np.nan),
                   'C' : pd.Series(['yes','yes','no','maybe'])})
df['D'] = ''
df
Out[44]:
A B C D
0 0.516752 NaN yes
1 -0.513194 NaN yes
2 0.861617 NaN no
3 -0.026287 NaN maybe
# Now try the first conditional expression
df['D'][df['A'].notnull() & df['B'].isnull()] \
    = df['A'][df['A'].notnull() & df['B'].isnull()]
df
Out[46]:
A B C D
0 0.516752 NaN yes 0.516752
1 -0.513194 NaN yes -0.513194
2 0.861617 NaN no 0.861617
3 -0.026287 NaN maybe -0.0262874
When one adds a third condition, to also check whether data in column C matches a particular string, we get the error:
df['D'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes'] \
    = df['A'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes']
File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 763, in wrapper
res = na_op(values, other)
File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 718, in na_op
raise TypeError("invalid type comparison")
TypeError: invalid type comparison
I have read that this occurs due to the different datatypes. And I can get it working if I change all the strings in column C to integers or booleans. We also know that the string comparison on its own works, e.g. df['A'][df['C']=='yes'] gives a boolean-masked selection.
So any ideas how/why this is not working when combining these datatypes in this conditional expression? What are the more pythonic ways to do what appears to be quite long-winded?
Thanks
In case this solution doesn't work for anyone, another situation that happened to me was that even though I was reading all data in as dtype=str (and therefore doing any string comparison should be OK [ie df[col] == "some string"]), I had a column of all nulls, which becomes type float, which will give an error when comparing to a string.
To get around that, you can use .astype(str) to ensure a string to string comparison will be performed.
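A minimal sketch of that workaround (the column and string are placeholders, not from the original post):
mask = df['some_col'].astype(str) == 'some string'  # str-to-str comparison, safe even if the column is all NaN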
I think you need to add parentheses () to the conditions. It is also better to use loc for selecting with a boolean mask, which can be assigned to the variable mask first:
mask = (df['A'].notnull()) & (df['B'].isnull()) & (df['C']=='yes')
print (mask)
0 True
1 True
2 False
3 False
dtype: bool
df.loc[mask, 'D'] = df.loc[mask, 'A']
print (df)
A B C D
0 -0.681771 NaN yes -0.681771
1 -0.871787 NaN yes -0.871787
2 -0.805301 NaN no
3 1.264103 NaN maybe
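For completeness, a sketch of an equivalent vectorised assignment with numpy.where, reusing the mask built above (this is an addition, assuming numpy is imported as np as in the question's snippet):
df['D'] = np.where(mask, df['A'], df['D'])  # take A where the mask holds, keep D otherwise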