I'm recoding multiple columns in a dataframe and have come across a strange result that I can't quite figure out. I'm probably not recoding in the most efficient manner possible, but it's mostly the error that I'm hoping someone can explain.
s1 = pd.DataFrame([np.nan, '1', '2', '3', '4', '5'], columns=['col1'])
s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=['col1'])
s1_dic = {np.nan: np.nan, '1': 1, '2':2, '3':3, '4':3, '5':3}
s2_dic = {np.nan: np.nan, 1: 1, 2:2, 3:3, 4:3, 5:3}
s1['col1'].apply(lambda x: s1_dic[x])
s2['col1'].apply(lambda x: s2_dic[x])
Recoding s1 works fine, but when I try to do the same thing with a list of integers and np.nan (s2), I get KeyError: nan, which is confusing. Any help would be appreciated.
A workaround is to use the dict's get method, rather than the lambda:
In [11]: s1['col1'].apply(s1_dic.get)
Out[11]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
In [12]: s2['col1'].apply(s2_dic.get)
Out[12]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
It's not clear to me right now why this is different...
Note: both dicts can be accessed with np.nan as a key:
In [21]: s1_dic[np.nan]
Out[21]: nan
In [22]: s2_dic[np.nan]
Out[22]: nan
and hash(np.nan) == 0 (on this Python version), so it's not a hashing issue...
Update: Apparently the issue is np.nan vs. a freshly created nan (e.g. np.float64(np.nan) or float('nan')): np.nan is np.nan holds because np.nan is bound to one specific instantiated nan object, whilst float('nan') is not float('nan'), since every call creates a new object and nan != nan.
This means that distinct nan objects behave as distinct dict keys:
In [21]: nans = [float('nan') for _ in range(5)]
In [22]: {f: 1 for f in nans}
Out[22]: {nan: 1, nan: 1, nan: 1, nan: 1, nan: 1}
This means you can't reliably retrieve nans from a dict: a lookup only succeeds when the key is the very same nan object, so any such retrieval is implementation-specific. In fact, since the dict is effectively relying on the id of these nans, the entire behaviour above may be implementation-specific too (if nans shared the same id, as they may do in a REPL/IPython session).
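A minimal sketch of that identity behaviour (the names here are just illustrative):
import numpy as np

d = {np.nan: 'found'}

# np.nan is a single module-level float object, so this lookup hits the
# very same key object and succeeds (dicts check identity before equality).
print(d[np.nan])                       # found

# A freshly created nan is a different object, and nan != nan, so the
# lookup cannot succeed by equality; .get falls back to the default.
print(d.get(float('nan'), 'missing'))  # missing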
You can catch the nullness beforehand:
In [31]: s2['col1'].apply(lambda x: s2_dic[x] if pd.notnull(x) else x)
Out[31]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
But I think the original suggestion of using .get is a better option.
Related
I am new to stackoverflow.
I noticed this behavior of pandas combine_first() and would simply like to understand why.
When I have the following dataframe,
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df['A'].combine_first(df['B'])
Out[1]:
0 6
1
2 7
3
Name: A, dtype: object
Whereas initializing with np.nan instead of '' gives the expected behavior of combine_first():
df = pd.DataFrame({'A':[6,np.nan,7,np.nan], 'B':[1, 3, 5, 3]})
df['A'].combine_first(df['B'])
Out[2]:
0 6.0
1 3.0
2 7.0
3 3.0
Name: A, dtype: float64
And replacing the '' with np.nan and then applying combine_first() doesn't seem to work either:
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df.replace('', np.nan)
df['A'].combine_first(df['B'])
Out[3]:
0 6
1
2 7
3
Name: A, dtype: object
I would like to understand why this happens before using an alternate method for this purpose.
This seems to have been pretty obvious to people here, but thank you for posting the comments!
My mistake was in the third dataframe I posted, as #W-B pointed out: I never assigned the result of replace back, so the empty strings were still there. The corrected version:
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df = df.replace('', np.nan)
df['A'].combine_first(df['B'])
Also, as #ALollz pointed out, df['A'] contains empty strings '', which are not null values. It sounds simple in hindsight, but I couldn't figure it out earlier!
Thank-you!
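For completeness, a minimal sketch of the corrected flow described above (convert the empty strings first, then assign the result back), using the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [6, '', 7, ''], 'B': [1, 3, 5, 3]})

# Empty strings are not missing values, so turn them into NaN first,
# and remember to assign the result back (replace is not in-place here).
df['A'] = df['A'].replace('', np.nan)

# Now combine_first can fill rows 1 and 3 from column B.
print(df['A'].combine_first(df['B']))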
This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in the NaNs with the mean value within each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
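To make that alignment concrete, here is a small sketch using the question's data; transform('mean') returns a Series with the same index as df, holding each row's group mean, which is exactly what fillna needs:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3]})

print(df.groupby('name')['value'].transform('mean'))
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    2.0
# 5    2.0
# 6    3.0
# 7    3.0
# 8    3.0
# Name: value, dtype: float64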
#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: multiple group-by columns and multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could instead select the column at the end, but then the transformation would run on all columns, only to throw away all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to.
A performance test, carried out after increasing the dataset size by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that this speeds things up in proportion to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note, you can generalize even further if you want to impute more than one column, but not all of them:
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
The featured, highly-ranked answer only works for a pandas DataFrame with just two columns. If you have more columns, use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize the above with regard to the efficiency of the possible solutions:
I have a dataset with 97,906 rows and 48 columns.
I want to fill 4 columns with the median of each group.
The column I want to group by has 26,200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I ran only on a subset, since it was taking too long.
start = time.time()
for col in continuous_variables:
x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed the times went up dramatically once a column was not numeric (which makes sense, as I was computing the median).
def groupMeanValue(group):
group['value'] = group['value'].fillna(group['value'].mean())
return group
dft = df.groupby("name").transform(groupMeanValue)
I know this is an old question, but I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second-worst thing to do after iterating over rows, from a timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, the unanimity of apply/lambda-based solutions made me doubt my instinct), and this is indeed about 2.5× faster than the most upvoted solutions.
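A rough sketch of how such a timing comparison might be reproduced (the data here is just the question's frame tiled, and the exact numbers will vary with data size and pandas version):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'] * 10000,
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3] * 10000,
})

# transform-based fill (vectorized per group)
t_transform = timeit.timeit(
    "df['value'].fillna(df.groupby('name')['value'].transform('mean'))",
    globals=globals(), number=20)

# apply/lambda-based fill (a Python-level function call per group)
t_apply = timeit.timeit(
    "df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))",
    globals=globals(), number=20)

print(t_transform, t_apply)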
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
You can also use your_dataframe.apply(lambda x: x.fillna(x.mean())), but note that without a groupby this fills with each column's overall mean rather than the per-group mean.
I am loading a list of dictionaries into a pandas dataframe, i.e. if d is my list of dicts, simply:
pd.DataFrame(d)
Unfortunately, one value in the dictionary is a 64-bit integer. It is getting converted to float because some dictionaries don't have a value for this column and are therefore given NaN values, thereby converting the entire column to a float.
For example:
col1
0 NaN
1 NaN
2 NaN
3 0.000000e+00
4 1.506758e+18
5 1.508758e+18
If I fillna the NaNs with zero and then recast the column with astype(np.int64), the values come back slightly off (due to floating-point rounding). How can I avoid this and keep my original 64-bit values intact?
Demo:
In [10]: d
Out[10]: {'a': [1506758000000000000, nan, 1508758000000000000]}
Naive approach:
In [11]: pd.DataFrame(d)
Out[11]:
a
0 1.506758e+18
1 NaN
2 1.508758e+18
Workaround (note the dtype=str: the integers are kept as exact strings, so the final astype(np.int64) does not lose precision):
In [12]: pd.DataFrame(d, dtype=str).fillna(0).astype(np.int64)
Out[12]:
a
0 1506758000000000000
1 0
2 1508758000000000000
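For context, a tiny sketch (with a made-up value) of the float64 round trip that the dtype=str workaround avoids; integers this large exceed float64's 53-bit mantissa, so the low digits get lost:
import numpy as np

x = 1506758123456789123            # hypothetical 64-bit id
print(int(np.float64(x)) == x)     # False: precision was lost in the round trip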
To my knowledge there is no way to override the type inference here; you will need to fill in the missing values before passing the data to pandas. Something like this:
d = [{'col1': 1}, {'col2': 2}]
cols_to_check = ['col1']
for row in d:
for col in cols_to_check:
if col not in row:
row[col] = 0
d
Out[39]: [{'col1': 1}, {'col1': 0, 'col2': 2}]
pd.DataFrame(d)
Out[40]:
col1 col2
0 1 NaN
1 0 2.0
You can create a Series with a dict comprehension and then unstack it with the fill_value parameter:
pd.Series(
{(i, j): v for i, x in enumerate(d)
for j, v in x.items()},
dtype=np.int64
).unstack(fill_value=0)
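Assuming d is the list of dicts from the question (e.g. [{'col1': 1}, {'col2': 2}]), this builds a Series keyed by (row number, column name) pairs, and unstack(fill_value=0) pivots it into an int64 DataFrame with 0 where a key was missing, so the values never pass through float at all.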
What is the best way to account for (not a number) nan values in a pandas DataFrame?
The following code:
import numpy as np
import pandas as pd
dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])
dfv = dfd.a.value_counts().sort_index()
print("nan: %d" % dfv[np.nan].sum())
print("1: %d" % dfv[1].sum())
print("3: %d" % dfv[3].sum())
print("total: %d" % dfv[:].sum())
Outputs:
nan: 0
1: 1
3: 3
total: 4
While the desired output is:
nan: 2
1: 1
3: 3
total: 6
I am using pandas 0.17 with Python 3.5.0 with Anaconda 2.4.0.
To count just null values, you can use isnull():
In [11]:
dfd.isnull().sum()
Out[11]:
a 2
dtype: int64
Here a is the column name, and there are 2 occurrences of the null value in the column.
If you want to count only NaN values in column 'a' of a DataFrame df, use:
len(df) - df['a'].count()
Here count() tells us the number of non-NaN values, and this is subtracted from the total number of values (given by len(df)).
To count NaN values in every column of df, use:
len(df) - df.count()
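For example, a quick sketch using the question's frame dfd:
import numpy as np
import pandas as pd

dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])

print(len(dfd) - dfd['a'].count())   # 2 -> NaNs in column 'a'
print(len(dfd) - dfd.count())        # per-column NaN counts, as a Series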
If you want to use value_counts, tell it not to drop NaN values by setting dropna=False (added in 0.14.1):
dfv = dfd['a'].value_counts(dropna=False)
This allows the missing values in the column to be counted too:
3 3
NaN 2
1 1
Name: a, dtype: int64
The rest of your code should then work as you expect (note that it's not necessary to call sum; just print("nan: %d" % dfv[np.nan]) suffices).
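A minimal sketch of the question's snippet with that single change applied (looking up dfv[np.nan] works here because NaN now has its own row in the counts):
import numpy as np
import pandas as pd

dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])
dfv = dfd['a'].value_counts(dropna=False)

print("nan: %d" % dfv[np.nan])   # 2
print("1: %d" % dfv[1])          # 1
print("3: %d" % dfv[3])          # 3
print("total: %d" % dfv.sum())   # 6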
A good, clean way to count all NaNs in all columns of your dataframe would be:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
print(df.isna().sum().sum())
With the inner sum you get the NaN count for each column; the second sum then adds up those column totals.
This one worked best for me!
If you want a quick summary (great in data science for counting missing values and checking their types), use the following (note that in newer pandas versions the argument is show_counts rather than null_counts):
df.info(verbose=True, null_counts=True)
Or another cool one is:
df['<column_name>'].value_counts(dropna=False)
Example:
df = pd.DataFrame({'a': [1, 2, 1, 2, np.nan],
...: 'b': [2, 2, np.nan, 1, np.nan],
...: 'c': [np.nan, 3, np.nan, 3, np.nan]})
This is the df:
a b c
0 1.0 2.0 NaN
1 2.0 2.0 3.0
2 1.0 NaN NaN
3 2.0 1.0 3.0
4 NaN NaN NaN
Run Info:
df.info(verbose=True, null_counts=True)
...:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
a 4 non-null float64
b 3 non-null float64
c 2 non-null float64
dtypes: float64(3)
So you can see that for column c you get 2 non-null values out of 5 rows, because rows [0, 2, 4] are null.
And this is what you get using value_counts for each column:
In [17]: df['a'].value_counts(dropna=False)
Out[17]:
2.0 2
1.0 2
NaN 1
Name: a, dtype: int64
In [18]: df['b'].value_counts(dropna=False)
Out[18]:
NaN 2
2.0 2
1.0 1
Name: b, dtype: int64
In [19]: df['c'].value_counts(dropna=False)
Out[19]:
NaN 3
3.0 2
Name: c, dtype: int64
If you only want a per-column summary of the null values, use the following code:
df.isnull().sum()
If you want to know how many null values there are in the whole data frame, use the following code:
df.isnull().sum().sum() # calculate total
Yet another way to count all the nans in a df:
num_nans = df.size - df.count().sum()
Timings:
import timeit
import numpy as np
import pandas as pd
df_scale = 100000
df = pd.DataFrame(
[[1, np.nan, 100, 63], [2, np.nan, 101, 63], [2, 12, 102, 63],
[2, 14, 102, 63], [2, 14, 102, 64], [1, np.nan, 200, 63]] * df_scale,
columns=['group', 'value', 'value2', 'dummy'])
repeat = 3
numbers = 100
setup = """import pandas as pd
from __main__ import df
"""
def timer(statement, _setup=None):
print (min(
timeit.Timer(statement, setup=_setup or setup).repeat(
repeat, numbers)))
timer('df.size - df.count().sum()')
timer('df.isna().sum().sum()')
timer('df.isnull().sum().sum()')
prints:
3.998805362999999
3.7503365439999996
3.689461442999999
So the three approaches are pretty much equivalent.
dfd['a'].isnull().value_counts()
returns:
True     695
False     60
Name: a, dtype: int64
True: the count of null values
False: the count of non-null values
This is a simple noob question, but it is vexing me. Following a tutorial, I want to select the first value in column "A". The tutorial says to run print(df[0]['A']), but Python 3 gives me an error. However, it works perfectly if I use print(df[0:1]['A']). Why is that?
Here is the full code for replication:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 3), index=pd.date_range('1/1/2000', periods=100), columns=['A', 'B', 'C'])
print(df[0:1]['A'])
Because df[0]['A'] means "select the column labeled 0, then take index 'A' of it", and your frame has no column labeled 0, it raises an error; you need to use df.iloc[0]['A'], or df['A'][0], or df.ix[0]['A'] (though .ix is deprecated and has been removed in modern pandas, so prefer the first two).
see here for the indexing and slicing.
see here for when you get a copy as opposed to a view.
See the selecting ranges section of the docs. As mentioned:
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
The flip side is that this is somewhat inconsistent: plain labels inside [] select columns, while slices inside [] select rows.
It's worth mentioning that you can often be explicit with loc/iloc:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])
In [12]: df['A']
Out[12]:
0 1
1 3
2 5
Name: A, dtype: int64
In [13]: df.loc[:, 'A'] # equivalently
Out[13]:
0 1
1 3
2 5
Name: A, dtype: int64
In [14]: df.iloc[:, 0] # accessing column by position
Out[14]:
0 1
1 3
2 5
Name: A, dtype: int64
It's worth mentioning another inconsistency with slicing:
In [15]: df.loc[0:1, 'A']
Out[15]:
0 1
1 3
dtype: int64
In [16]: df.iloc[0:1, 0] # doesn't include the row at position 1
Out[16]:
0 1
dtype: int64
To select with a position and a label use ix:
In [17]: df.ix[0:1, 'A']
Out[17]:
0 1
1 3
Name: A, dtype: int64
Note that labels take precedence with ix. (In later pandas versions, .ix has been deprecated and removed; prefer .loc or .iloc.)
It's worth emphasising that assignment is guaranteed to work when done with a single loc/iloc/ix call, but may fail when chaining:
In [18]: df.ix[0:1, 'A'] = 7 # works
In [19]: df['A'][0:1] = 7 # *sometimes* works, avoid!
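For completeness, a small sketch of the single-indexer assignment recommended above, written with loc/iloc since .ix no longer exists in modern pandas:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# One indexing operation: guaranteed to modify df itself.
df.loc[0, 'A'] = 7

# Positional version of the same assignment.
df.iloc[0, df.columns.get_loc('A')] = 7

# Chained indexing may write to a temporary copy instead -- avoid.
# df['A'][0:1] = 7

print(df)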