Excluding 'None' when checking for 'NaN' values in pandas - python

I'm cleaning a dataset of NaN values to run linear regression on it; in the process, I replaced some NaN values with None.
After doing this I check for remaining columns with NaN values using the following code, where houseprice is the name of the dataframe:
def cols_NaN():
    return houseprice.columns[houseprice.isnull().any()].tolist()

print(houseprice[cols_NaN()].isnull().sum())
The problem is that the result of the above includes None values as well. I want to select only those columns which have NaN values. How can I do that?

The only thing I could think of is to check whether each element is a float, because np.nan is of type float and registers as null, while None is of type NoneType.
Consider the dataframe df:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=[1., None, np.nan]), dtype=object)
print(df)
      A
0     1
1  None
2   NaN
Then we test whether each element is both a float and null:
df.A.apply(lambda x: isinstance(x, float)) & df.A.isnull()
0    False
1    False
2     True
Name: A, dtype: bool
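The same mask can also select just the rows that hold a true NaN (a small usage sketch on the same df):
mask = df.A.apply(lambda x: isinstance(x, float)) & df.A.isnull()
print(df[mask])
     A
2  NaN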

Working with column names it is a bit different, because you need map and pandas.isnull instead: houseprice.columns.apply() and houseprice.columns.isnull() raise errors, since Index has neither method:
AttributeError: 'Index' object has no attribute 'apply'
AttributeError: 'Index' object has no attribute 'isnull'
houseprice = pd.DataFrame(columns=[np.nan, None, 'a'])
print (houseprice)
Empty DataFrame
Columns: [nan, None, a]

print (houseprice.columns[(houseprice.columns.map(type) == float) &
                          (pd.isnull(houseprice.columns))].tolist())
[nan]
And to check all values in the DataFrame, applymap is necessary:
houseprice = pd.DataFrame({'A':[1,2,3],
                           'B':[4,5,6],
                           'C':[np.nan,8,9],
                           'D':[1,3,5],
                           'E':['a','s',None],
                           'F':[np.nan,4,3]})
print (houseprice)
   A  B    C  D     E    F
0  1  4  NaN  1     a  NaN
1  2  5  8.0  3     s  4.0
2  3  6  9.0  5  None  3.0

print (houseprice.columns[(houseprice.applymap(lambda x: isinstance(x, float)) &
                           houseprice.isnull()).any()])
Index(['C', 'F'], dtype='object')
And for the sum the code is simpler: sum the True values in the boolean mask:
print ((houseprice.applymap(lambda x: isinstance(x, float)) &
        houseprice.isnull()).any().sum())
2
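Tying this back to the asker's helper: with the same houseprice frame as above, cols_NaN can be rewritten so it reports only columns containing a real float NaN (a minimal sketch; the asker's actual houseprice data is unknown, so this reuses the toy frame from the previous example):
def cols_NaN():
    # True only where a cell is a float NaN, never where it is None
    nan_mask = houseprice.applymap(lambda x: isinstance(x, float)) & houseprice.isnull()
    return houseprice.columns[nan_mask.any()].tolist()

print(cols_NaN())
['C', 'F']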

Related

pandas: cannot set column with substring extracted from other column

I'm doing something wrong when attempting to set a column for a masked subset of rows to the substring extracted from another column.
Here is some example code that illustrates the problem I am facing:
import pandas as pd

data = [
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'}
]
df = pd.DataFrame(data)
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')
print("df:")
print(df)
print("mask:")
print(mask)
print("extraction:")
print(df[mask]['base_col'].str.extract(r'key=(.*)'))
The output I get from the above code is as follows:
df:
  type   base_col derived_col
0    A    key=val         NaN
1    B  other_val         NaN
2    A    key=val         NaN
3    B  other_val         NaN
mask:
0     True
1    False
2     True
3    False
Name: type, dtype: bool
extraction:
     0
0  val
2  val
The boolean mask is as I expect and the extracted substrings on the subset of rows (indexes 0, 2) are also as I expect yet the new derived_col comes out as all NaN. The output I would expect in the derived_col would be 'val' for indexes 0 and 2, and NaN for the other two rows.
Please clarify what I am getting wrong here. Thanks!
You should assign the Series, not the DataFrame: str.extract returns a DataFrame, so you should pick column 0:
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')[0]
df
Out[449]:
  type   base_col derived_col
0    A    key=val         val
1    B  other_val         NaN
2    A    key=val         val
3    B  other_val         NaN
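Alternatively (a variant of mine, not from the original answer), str.extract accepts expand=False, which returns a Series directly when the pattern has a single capture group, so no column selection is needed:
mask = df['type'] == 'A'
# expand=False makes str.extract return a Series for a single capture group,
# so the index-aligned assignment works without picking column 0
df.loc[mask, 'derived_col'] = df.loc[mask, 'base_col'].str.extract(r'key=(.*)', expand=False)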

Reassigning Entries in a Column of Pandas DataFrame

My goal is to conditionally index a data frame and change the values in a column for those indexes.
I intend to look through column 'A' to find entries equal to 'a' and update their column 'B' with the word 'okay'.
group = ['a']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [np.nan, np.nan, np.nan, np.nan, np.nan]})
>>> df
   A    B
0  a  NaN
1  b  NaN
2  a  NaN
3  a  NaN
4  c  NaN
df[df['A'].apply(lambda x: x in group)]['B'].fillna('okay', inplace=True)
This gives me the following error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
Following the documentation (what I understood of it) I tried the following instead:
df[df['A'].apply(lambda x: x in group)].loc[:,'B'].fillna('okay', inplace=True)
I can't figure out why the reassignment of NaN to 'okay' is not occurring in place, and how this can be rectified.
Thank you.
Try this with lambda:
Solution First:
>>> df
   A    B
0  a  NaN
1  b  NaN
2  a  NaN
3  a  NaN
4  c  NaN
Using lambda with map or apply:
>>> df["B"] = df["A"].map(lambda x: "okay" if "a" in x else np.nan)
# or: df['B'] = df['A'].apply(lambda x: 'okay' if x == 'a' else np.nan)
# (avoid else "NaN": that stores the literal string "NaN", not a real missing value)
>>> df
   A     B
0  a  okay
1  b   NaN
2  a  okay
3  a  okay
4  c   NaN
Solution Second:
>>> df
   A    B
0  a  NaN
1  b  NaN
2  a  NaN
3  a  NaN
4  c  NaN
Another fancy way: create a dictionary and map it across the column:
>>> frame = {'a': "okay"}
>>> df['B'] = df['A'].map(frame)
>>> df
   A     B
0  a  okay
1  b   NaN
2  a  okay
3  a  okay
4  c   NaN
Solution Third:
This has already been posted by d_kennetz, but just to club everything together: you can also do the assignment in one shot:
>>> df.loc[df.A == 'a', 'B'] = "okay"
If I understand this correctly, you simply want to replace the values of a column on those rows matching a given condition (i.e. where column A belongs to a certain group, here with the single value 'a'). The following should do the trick:
import pandas as pd
group = ['a']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [None,None,None,None,None]})
print(df)
df.loc[df['A'].isin(group),'B'] = 'okay'
print(df)
What we're doing here is using the .loc filter, which just returns a view on the existing dataframe.
The first argument (df['A'].isin(group)) filters on those rows matching the given criterion. Notice you can use the equality operator (==) but not the in operator, and therefore have to use .isin() instead.
The second argument selects only the 'B' column.
Then you just assign the desired value (which is a constant).
Here's the output:
   A     B
0  a  None
1  b  None
2  a  None
3  a  None
4  c  None
   A     B
0  a  okay
1  b  None
2  a  okay
3  a  okay
4  c  None
If you wanted to do fancier stuff, you could do the following:
import pandas as pd
group = ['a', 'b']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [None,None,None,None,None]})
df.loc[df['A'].isin(group),'B'] = "okay, it was " + df['A']+df['A']
print(df)
Which gives you:
   A                B
0  a  okay, it was aa
1  b  okay, it was bb
2  a  okay, it was aa
3  a  okay, it was aa
4  c             None
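One more variant (my addition, not from any of the answers above): numpy.where builds the column in a single expression:
import numpy as np
import pandas as pd

group = ['a']
df = pd.DataFrame({"A": ['a', 'b', 'a', 'a', 'c'], "B": [None] * 5})

# np.where picks 'okay' where the condition holds and keeps B's value otherwise
df['B'] = np.where(df['A'].isin(group), 'okay', df['B'])
print(df)
   A     B
0  a  okay
1  b  None
2  a  okay
3  a  okay
4  c  None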

If a data frame column is a list, extracting elements of the list is giving an error

I am looking to extract the 0th member of each of the lists using the code below:
df["column"].apply(lambda x: x[0])
but I am getting the following error:
TypeError: 'float' object is not subscriptable
I think the problem is some NaN values.
You can check it:
print (df[df["column"].isnull()])
  column
2    NaN
So you can use str[0]:
df["column"].str[0]
Sample:
df = pd.DataFrame({'column':[['a','s'],['d'], np.nan, ['s','d','f']]})
print (df)
      column
0     [a, s]
1        [d]
2        NaN
3  [s, d, f]

df['new'] = df["column"].str[0]
print (df)
      column  new
0     [a, s]    a
1        [d]    d
2        NaN  NaN
3  [s, d, f]    s
print (df["column"].apply(lambda x: x[0]))
TypeError: 'float' object is not subscriptable
The same error occurs if float scalars are mixed in between the lists:
df = pd.DataFrame({'column':[[4.4,7.8],[1], 4.7, [4, 7.4, 1.2]]})
print (df)
          column
0     [4.4, 7.8]
1            [1]
2            4.7
3  [4, 7.4, 1.2]
You can check all non-list values:
print (df[df["column"].apply(lambda x: isinstance(x, float))])
   column
2     4.7
The solution is to use if-else in the lambda function:
print (df["column"].apply(lambda x: x if isinstance(x, float) else x[0]))
0    4.4
1    1.0
2    4.7
3    4.0
Name: column, dtype: float64
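A sketch of the equivalent test from the other direction (my variant, not from the original answer): treat anything that is not a list as already scalar, which also covers None and other non-float scalars:
print (df["column"].apply(lambda x: x[0] if isinstance(x, list) else x))
0    4.4
1    1.0
2    4.7
3    4.0
Name: column, dtype: float64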

Pandas Rounds int64 number when loading dictionaries

I am loading a list of dictionaries into a pandas dataframe, i.e. if d is my list of dicts, simply:
pd.DataFrame(d)
Unfortunately, one value in the dictionary is a 64-bit integer. It is getting converted to float because some dictionaries don't have a value for this column and are therefore given NaN values, thereby converting the entire column to a float.
For example:
           col1
0           NaN
1           NaN
2           NaN
3  0.000000e+00
4  1.506758e+18
5  1.508758e+18
If I try to fillna all the NaNs to zero and then recast the column with astype(np.int64), the values come out slightly off (due to float rounding). How can I avoid this and keep my original 64-bit values intact?
Demo:
In [10]: d
Out[10]: {'a': [1506758000000000000, nan, 1508758000000000000]}
Naive approach:
In [11]: pd.DataFrame(d)
Out[11]:
              a
0  1.506758e+18
1           NaN
2  1.508758e+18
Workaround (pay attention to dtype=str):
In [12]: pd.DataFrame(d, dtype=str).fillna(0).astype(np.int64)
Out[12]:
                     a
0  1506758000000000000
1                    0
2  1508758000000000000
To my knowledge there is no way to override the type inference here; you will need to fill in the missing values before passing the data to pandas. Something like this:
d = [{'col1': 1}, {'col2': 2}]
cols_to_check = ['col1']

for row in d:
    for col in cols_to_check:
        if col not in row:
            row[col] = 0

d
Out[39]: [{'col1': 1}, {'col1': 0, 'col2': 2}]
pd.DataFrame(d)
Out[40]:
   col1  col2
0     1   NaN
1     0   2.0
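The same fill-first idea can be written more compactly (a sketch of mine, not from the original answer), letting a dict comprehension supply the defaults:
d = [{'col1': 1}, {'col2': 2}]
cols_to_check = ['col1']

# defaults first; the row's own keys then override them
d = [{**{col: 0 for col in cols_to_check}, **row} for row in d]
print(d)
[{'col1': 1}, {'col1': 0, 'col2': 2}]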
You can create a Series with a comprehension and unstack with the fill_value parameter:
pd.Series(
    {(i, j): v for i, x in enumerate(d)
     for j, v in x.items()},
    dtype=np.int64
).unstack(fill_value=0)
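As a side note of mine (not from the original answers), on pandas 0.24 or newer the nullable Int64 extension dtype can hold integers alongside missing values; building the frame with dtype=object first keeps the big integers from ever passing through float:
import numpy as np
import pandas as pd

d = {'a': [1506758000000000000, np.nan, 1508758000000000000]}

# object dtype preserves the Python ints exactly; 'Int64' (capital I) then
# stores them as 64-bit integers with pd.NA for the missing entry
df = pd.DataFrame(d, dtype=object).astype('Int64')
print(df)
                     a
0  1506758000000000000
1                 <NA>
2  1508758000000000000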

Apply function for two dataframes in pandas

I have two dataframes.
df0
     a    b
c  0.3  0.6
d  0.4  NaN
df1
   a  b
c  3  2
d  0  4
I have a custom function:
def concat(d0, d1):
    if d0 is not None and d1 is not None:
        return '%s,%s' % (d0, d1)
    return None
Result I expect:
       a      b
c  0.3,3  0.6,2
d  0.4,0    NaN
How could I apply the function to those two dataframes?
Here is a solution.
The idea is first to reduce your dataframes to a flat list of values. This lets you loop over the paired values of the two dataframes using zip and apply your function.
Finally, you go back to the original shape using numpy reshape:
new_vals = [concat(d0, d1) for d0, d1 in zip(df0.values.flat, df1.values.flat)]
result = pd.DataFrame(np.reshape(new_vals, (2, 2)), index=['c', 'd'], columns=['a', 'b'])
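One caveat (my note, not part of the original answer): np.nan is not None, so the asker's concat would return 'nan,4' for the missing cell rather than NaN. A hedged sketch that swaps the None checks for pd.notnull and derives the shape and labels from df0 instead of hard-coding them:
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'a': [0.3, 0.4], 'b': [0.6, np.nan]}, index=['c', 'd'])
df1 = pd.DataFrame({'a': [3, 0], 'b': [2, 4]}, index=['c', 'd'])

def concat(d0, d1):
    # pd.notnull is False for both None and NaN, so missing cells stay missing
    if pd.notnull(d0) and pd.notnull(d1):
        return '%s,%s' % (d0, d1)
    return np.nan

new_vals = [concat(d0, d1) for d0, d1 in zip(df0.values.flat, df1.values.flat)]
# dtype=object keeps the real NaN; plain np.reshape would stringify it to 'nan'
result = pd.DataFrame(np.array(new_vals, dtype=object).reshape(df0.shape),
                      index=df0.index, columns=df0.columns)
print(result)
       a      b
c  0.3,3  0.6,2
d  0.4,0    NaN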
If it fits your specific application, you can do:
# Concatenate the two as strings
df = df0.astype(str) + "," + df1.astype(str)
# Remove the nan
df = df.applymap(lambda x: x if 'nan' not in x else np.nan)
You'll get better performance than using apply.
Output:
       a      b
c  0.3,3  0.6,2
d  0.4,0    NaN
Use add with applymap and mask:
df = df0.astype(str).add(',').add(df1.astype(str))
df = df.mask(df.applymap(lambda x: 'nan' in x))
print (df)
       a      b
c  0.3,3  0.6,2
d  0.4,0    NaN
Another solution is to apply the NaN replacement last, using mask with a condition; by default, positions where the condition is True are replaced with NaN:
df = df0.astype(str).add(',').add(df1.astype(str))
m = df0.isnull() | df1.isnull()
print (m)
       a      b
c  False  False
d  False   True

df = df.mask(m)
print (df)
       a      b
c  0.3,3  0.6,2
d  0.4,0    NaN
