After running:
df[['column']].fillna(value=myValue, inplace=True)
or:
df['column'].fillna(value=myValue, inplace=True)
or:
# Throws warning "A value is trying to be set on a copy of a slice..."
df.fillna({'column': myValue}, inplace=True)
or:
df[['column']] = df[['column']].fillna({'column': myValue})
or:
df['column'] = df['column'].fillna({'column': myValue})
My df['column'] still contains nan (!)
list(df['column'].unique()) returns ['a', 'b', 'c', 'd', nan] and sum(pd.isnull(df['column'])) returns 1,000+.
I've tried several variations but this problem persists. How do you fillna in place on a column in pandas?
Ed Chum's comment correctly points out the difference between the methods you proposed. Here is an example I used to show how it works.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 3, 4], 'col2': [3, 4, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3.0
1 2 4.0
2 3 NaN
3 4 NaN
df['col2'].fillna(value=6, inplace=True)
col1 col2
0 1 3.0
1 2 4.0
2 3 6.0
3 4 6.0
Having posted this, I think it would be most valuable to see what your myValue variable contains and what your dataframe looks like.
I'd rule out Aditya's hypothesis: if the nan were a string, it would appear between quotation marks in the unique() output, and it doesn't.
Hope this helps!
One cause of this problem can be that the nan values in your dataset might be the string 'nan' instead of NaN.
To solve this, you can use the replace() method instead of fillna().
Eg code:
df['column'].replace(to_replace='nan', value=myValue, inplace=True)
First of all, the correct syntax from your list is
df['column'].fillna(value=myValue, inplace=True)
If list(df['column'].unique()) returns ['a', 'b', 'c', 'd', nan], this means that the values in your dataset are probably not equal to np.NaN, but rather equal to the string "nan".
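If that is what's happening, here is a minimal sketch (with a made-up toy column) showing how to tell the two cases apart, and why replace works where fillna does nothing:
import numpy as np
import pandas as pd

# Toy column where the "missing" values are literal 'nan' strings, not np.nan
df = pd.DataFrame({'column': ['a', 'b', 'nan', 'd']})
myValue = 'e'

print((df['column'] == 'nan').sum())   # 1  -> the values are strings
print(df['column'].isna().sum())       # 0  -> no real NaN, so fillna has nothing to fill

# Literal 'nan' strings need replace instead of fillna
df['column'] = df['column'].replace('nan', myValue)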
I have a large dataframe in pandas that apart from the column used as index is supposed to have only numeric values:
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
'b': [0.1, 0.2, 0.3, 0.4, 0.5],
'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
How can I find the row of the dataframe df that has a non-numeric value in it?
In this example it's the fourth row in the dataframe, which has the string 'bad' in the a column. How can this row be found programmatically?
You could use np.isreal to check the type of each element (applymap applies a function to each element in the DataFrame):
In [11]: df.applymap(np.isreal)
Out[11]:
a b
item
a True True
b True True
c True True
d False True
e True True
If all the values in a row are True then the row is entirely numeric:
In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a True
b True
c True
d False
e True
dtype: bool
So to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the rows which have at least one rogue non-numeric value):
In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
a b
item
d bad 0.4
To find the location of the first offender, you could use argmin:
In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
As @CTZhu points out, it may be slightly faster to check whether it's an instance of either int or float (there is some additional overhead with np.isreal):
df.applymap(lambda x: isinstance(x, (int, float)))
There are already some great answers to this question; however, here is a nice snippet that I use regularly to drop rows if they have non-numeric values in some columns:
# Eliminate invalid data from dataframe (see Example below for more context)
num_df = (df.drop(data_columns, axis=1)
.join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
num_df = num_df[num_df[data_columns].notnull().all(axis=1)]
The way this works is we first drop all the data_columns from the df, and then use a join to put them back in after passing them through pd.to_numeric (with option 'coerce', such that all non-numeric entries are converted to NaN). The result is saved to num_df.
On the second line we use a filter that keeps only rows where all values are not null.
Note that pd.to_numeric is coercing to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25' will be recognized as the numeric value 1.25.
Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0
Example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
...: "a": [1,2,3,"bad",5],
...: "b":[0.1,0.2,0.3,0.4,0.5]})
In [3]: df
Out[3]:
a b item
0 1 0.1 a
1 2 0.2 b
2 3 0.3 c
3 bad 0.4 d
4 5 0.5 e
In [4]: data_columns = ['a', 'b']
In [5]: num_df = (df
...: .drop(data_columns, axis=1)
...: .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
In [6]: num_df
Out[6]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
3 d NaN 0.4
4 e 5 0.5
In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
4 e 5 0.5
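As a usage note, the filter in In [7] can equivalently be written with dropna (same data_columns as above), which drops any row with a null in those columns:
num_df = num_df.dropna(subset=data_columns)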
# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
'b': [0.1, 0.2, 0.3, 0.4, 0.5],
'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
Convert to numeric using 'coerce', which fills bad values with NaN:
a = pd.to_numeric(df.a, errors='coerce')
Use isna to return a boolean index:
idx = a.isna()
Apply that index to the data frame:
df[idx]
Output: the row with the bad data in it:
a b
item
d bad 0.4
Sorry about the confusion, this should be the correct approach. Do you want to capture only 'bad', not things like 'good', or just any non-numeric value?
In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
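Note that convert_objects has been deprecated in later pandas versions; a rough equivalent of the same check, sketched with pd.to_numeric on the df from the question, would be:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# Coerce every column to numeric (non-numeric entries become NaN),
# then locate the rows that contain at least one NaN
coerced = df.apply(pd.to_numeric, errors='coerce')
np.where(coerced.isna().any(axis=1))
# (array([3]),)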
In case you are working with a column of string values, you can use the very useful Series.str.isnumeric() method like:
a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])
What I do is copy that column to a new column, strip out '.' and ',' with str.replace, and then select the numeric values.
and:
a = a.str.replace('.', '', regex=False)  # regex=False so '.' is treated literally
a = a.str.replace(',', '', regex=False)
a.str.isnumeric()
Out[15]:
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
dtype: bool
Good luck all!
Just to give an idea: convert the column to string and work with strings, which is easier. However, this does not work with strings containing digits, like bad123. The ~ takes the complement of the selection.
df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)
You can use '|'.join([str(i) for i in range(10)]) to generate the '0|1|...|8|9' pattern.
Or use the np.isreal() function, just like the most-voted answer:
df[~df['a'].apply(lambda x: np.isreal(x))]
Did you convert your data using .astype() ?
All the great answers above should solve 99% of cases, but if you are still in trouble, please also check whether you converted your data type.
Sometimes I force the data to type float16 to save memory. Using:
df[col] = df[col].astype(np.float16)
But this might silently break your code. So if you did any kind of data type transformation, double check for overflows. Disable the conversion and try again.
It worked for me!
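For example, here is a minimal sketch (with made-up values) of how a float16 cast can silently overflow:
import numpy as np
import pandas as pd

s = pd.Series([1000.0, 70000.0])   # 70000 exceeds the float16 max (~65504)
s.astype(np.float16)
# 0    1000.0
# 1       inf   <- overflows silently instead of raising
# dtype: float16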
I have a pandas DataFrame prepared with an Index and columns, all values are NaN.
Now I computed a result which can be used for more than one row of the DataFrame, and I would like to assign it to all of those rows at once. This could be done with a loop, but I am pretty sure the assignment can be done in one step.
Here is a scenario:
import pandas as pd
df = pd.DataFrame(index=['A', 'B', 'C'], columns=['C1', 'C2']) # original df
s = pd.Series({'C1': 1, 'C2': 'ham'}) # a computed result
index = pd.Index(['A', 'C']) # result is valid for rows 'A' and 'C'
The naive approach is
df.loc[index, :] = s
But this does not change the DataFrame at all. It remains as
C1 C2
A NaN NaN
B NaN NaN
C NaN NaN
How can this assignment be done?
It seems we can use the underlying array data for the assignment:
df.loc[index, :] = s.values
Now, this assumes that the order of the index in s is the same as in the columns of df. If that's not the case, as suggested by @Nras, we could use s[df.columns].values on the right-hand side of the assignment.
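Putting the pieces together, a minimal sketch using the df, s and index from the question:
import pandas as pd

df = pd.DataFrame(index=['A', 'B', 'C'], columns=['C1', 'C2'])
s = pd.Series({'C1': 1, 'C2': 'ham'})
index = pd.Index(['A', 'C'])

# Align s with df's columns, then assign the raw array;
# the 1-D array is broadcast to every selected row
df.loc[index, :] = s[df.columns].values
#     C1   C2
# A    1  ham
# B  NaN  NaN
# C    1  ham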
I'm trying to set a number of different cells in a pandas DataFrame all to the same value. I thought I understood boolean indexing for pandas, but I haven't found any resources on this specific error.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df[mask] = 30
Traceback (most recent call last):
...
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
Above, I want to replace all of the True entries in the mask with the value 30.
I could do df.replace instead, but masking feels a bit more efficient and intuitive here. Can someone explain the error, and provide an efficient way to set all of the values?
You can't use the boolean mask on mixed dtypes for this, unfortunately; you can use pandas where to set the values instead:
In [59]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df = df.where(mask, other=30)
df
Out[59]:
A B
0 1 a
1 30 30
2 3 30
Note that the above will fail if you pass inplace=True to the where method, so df.where(mask, other=30, inplace=True) will raise:
TypeError: Cannot do inplace boolean setting on mixed-types with a non
np.nan value
EDIT
OK, after a little misunderstanding, you can still use where by just inverting the mask:
In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.where(~mask, other=30)
Out[2]:
A B
0 30 30
1 2 b
2 30 f
If you want to use different columns to create your mask, you need to call the values property of the dataframe.
Example
Let's say we want to replace values in A_1 and A_2 according to a mask built from B_1 and B_2. For example, replace those values in the A columns (with 999) that correspond to nulls in the B columns.
The original dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 5 n NaN
2 3 6 NaN NaN
The desired dataframe
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 999 n NaN
2 999 999 NaN NaN
The code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)
A_1 A_2
0 1 4
1 2 999
2 999 999
I'm not 100% sure, but I suspect the error message relates to the fact that missing data is not treated identically across dtypes. Only float has NaN, but integers can be automatically converted to floats, so it's not a problem there. But it appears that mixing numeric dtypes and object dtypes does not work so easily...
Regardless of that, you could get around it pretty easily with np.where:
df[:] = np.where(mask, 30, df)
A B
0 30 30
1 2 b
2 30 f
pandas uses NaN to mark invalid or missing data, and it can be used across types. Since your DataFrame has mixed int and string data types, it will not accept an in-place assignment of a single value (other than NaN), as this would create a mixed type (int and str) in B.
@JohnE's method using np.where creates a new DataFrame in which the type of column B is an object, not a string as in the initial example.
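If you prefer to avoid the mixed-dtype restriction altogether, one alternative (a sketch using Series.mask, not taken from the answers above) is to apply the mask column by column, so each assignment stays within a single column's dtype:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])

# Replace masked entries with 30, one column at a time
for col in df.columns:
    df[col] = df[col].mask(mask[col], 30)
#     A   B
# 0  30  30
# 1   2   b
# 2  30   f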