Having some trouble with filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column.
(You might recognize my data from the Titanic data set)...
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 Nan
I want to fill the NaNs with a value from the series 'pclass_lookup':
pclass_lookup
1 38.1
2 29.4
3 25.2
I have tried doing fillna with indexing like:
df.Age.fillna(pclass_lookup[df.Pclass]), but it gives me an error of
ValueError: cannot reindex from a duplicate axis
I tried a lambda too:
df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])
but that doesn't seem to fill it right, either. Am I totally missing the boat here?
Firstly, you have a duff value in row 4: it is actually the string 'Nan', which is not the same as NaN, so even if your code did work that value would never be replaced.
So you need to replace that duff value first, and then you can just call map to perform the lookup for the NaN values:
In [317]:
df.Age.replace('Nan', np.NaN, inplace=True)
# df1.pclass_lookup is assumed to hold the lookup values keyed by Pclass
# (the same values as the pclass_lookup Series in the question)
df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
df
Out[317]:
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 29.4
4 1 38.1
Timings
For a df with 5000 rows:
In [26]:
%timeit df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:
%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return df1[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:
%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = df1.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop
So you can see here that apply, because it iterates row-wise, scales poorly compared to the other two methods, which are vectorised; map is still the fastest.
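For reference, here is a minimal, self-contained sketch of the same idea (assuming you have the plain pclass_lookup Series from the question rather than the df1 used in the timings above); mapping Pclass to the lookup value and passing the result straight to fillna avoids the explicit boolean indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 3, 1, 2, 1],
                   'Age': [33, 24, 23, np.nan, np.nan]})
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=[1, 2, 3])

# Map each row's Pclass to its class average, then fill only the missing Ages.
df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))
print(df)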
Building on the response of #vrajs5:
# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
# Solution:
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
>>> df
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
The following should work for you:
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
df
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 NaN
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
pclass_lookup
1 38.1
2 29.4
3 25.2
dtype: float64
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
I have dataframe in pandas:
In [10]: df
Out[10]:
col_a col_b col_c col_d
0 France Paris 3 4
1 UK Londo 4 5
2 US Chicago 5 6
3 UK Bristol 3 3
4 US Paris 8 9
5 US London 44 4
6 US Chicago 12 4
I need to count unique cities. I can count unique states
In [11]: df['col_a'].nunique()
Out[11]: 3
and I can try to count unique cities
In [12]: df['col_b'].nunique()
Out[12]: 5
but it is wrong because US Paris and Paris in France are different cities. So now I'm doing it like this:
In [13]: df['col_a_b'] = df['col_a'] + ' - ' + df['col_b']
In [14]: df
Out[14]:
col_a col_b col_c col_d col_a_b
0 France Paris 3 4 France - Paris
1 UK Londo 4 5 UK - Londo
2 US Chicago 5 6 US - Chicago
3 UK Bristol 3 3 UK - Bristol
4 US Paris 8 9 US - Paris
5 US London 44 4 US - London
6 US Chicago 12 4 US - Chicago
In [15]: df['col_a_b'].nunique()
Out[15]: 6
Maybe there is a better way? Without creating an additional column.
By using ngroups
df.groupby(['col_a', 'col_b']).ngroups
Out[101]: 6
Or using set
len(set(zip(df['col_a'],df['col_b'])))
Out[106]: 6
You can select col_a and col_b, drop the duplicates, then check the shape/len of the result data frame:
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 6
len(df[['col_a', 'col_b']].drop_duplicates())
# 6
Because groupby ignores NaNs and may unnecessarily invoke a sorting process, choose the method accordingly if you have NaNs in the columns:
Consider a data frame as following:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})
print(df)
# col_a col_b
#0 1.0 2.0
#1 2.0 2.0
#2 2.0 3.0
#3 NaN NaN
#4 1.0 2.0
#5 4.0 NaN
Timing:
df = pd.concat([df] * 1000)
%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 µs per loop
%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop
%timeit df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 1000 loops, best of 3: 1.01 ms per loop
%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop
%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop
Result:
df.groupby(['col_a', 'col_b']).ngroups
# 3
len(df[['col_a', 'col_b']].drop_duplicates())
# 5
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 5
len(set(zip(df['col_a'],df['col_b'])))
# 2003
len(df.groupby(['col_a', 'col_b']))
# 2003
So the difference:
Option 1:
df.groupby(['col_a', 'col_b']).ngroups
is fast, and it excludes rows that contain NaNs.
Options 2 & 3:
len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape[0]
are reasonably fast, and they treat NaN as a distinct value.
Options 4 & 5:
len(set(zip(df['col_a'],df['col_b'])))
len(df.groupby(['col_a', 'col_b']))
are slow, and they follow the logic that numpy.nan == numpy.nan is False, so different (nan, nan) rows are considered distinct from each other.
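If you need the methods to agree, one option (a sketch, assuming the groupby semantics of excluding NaN rows is what you want) is to drop those rows before counting:
# Count unique (col_a, col_b) pairs, excluding rows with NaN in either column,
# so that drop_duplicates matches groupby(...).ngroups (both give 3 here).
n_unique = len(df[['col_a', 'col_b']].dropna().drop_duplicates())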
In [105]: len(df.groupby(['col_a', 'col_b']))
Out[105]: 6
import pandas as pd
data = {'field1':[1,4,1,68,9],'field2':[1,1,4,5,9]}
df = pd.DataFrame(data)
results = df.groupby('field1')['field2'].nunique()
results
Output:
field1
1 2
4 1
9 1
68 1
Name: field2, dtype: int64
Try this: I'm basically subtracting the number of duplicated rows from the number of rows in df. This assumes we are grouping on all the category columns in the df.
df.shape[0] - df[['col_a','col_b']].duplicated().sum()
774 µs ± 603 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have a CSV with one row for every observation per individual:
USER DATE SCORE
1 7/9/2015 37.2
1 11/18/2015 68.9
2 7/7/2015 45.1
2 11/2/2015 42.9
3 6/4/2015 56
3 10/27/2015 39
3 5/11/2016 42.9
I'd like to produce a dataframe where the first observation is assigned to round one, second to round two, and so forth. So the result would look like:
USER R1 R2 R3
1 37.2 68.9 NaN
2 45.1 42.9 NaN
3 56 39 42.9
I've played around with pd.pivot and pd.unstack, but can't get what I need.
Suggestions?
You can use groupby with apply for creating new columns:
#if necessary sort values
df = df.sort_values(by=['USER','DATE'])
df = (df.groupby('USER')['SCORE'].apply(lambda x: pd.Series(x.values))
        .unstack()
        .rename(columns=lambda x: 'R' + str(x+1))
        .reset_index())
print (df)
USER R1 R2 R3
0 1 37.2 68.9 NaN
1 2 45.1 42.9 NaN
2 3 56.0 39.0 42.9
Another solution with pivot and cumcount:
#if necessary sort values
df = df.sort_values(by=['USER','DATE'])
df = (pd.pivot(index=df['USER'],
               columns=df.groupby('USER').cumcount() + 1,
               values=df['SCORE'])
        .add_prefix('R')
        .reset_index())
print (df)
USER R1 R2 R3
0 1 37.2 68.9 NaN
1 2 45.1 42.9 NaN
2 3 56.0 39.0 42.9
First, sort the values by USER and DATE (this already seems to be done in the example data, but just to be sure).
Then create a new column ROUND that will sequentially number entries for every user.
Set index to columns USER and ROUND.
Finally, unstack the SCORE column.
Here's some example code:
import pandas as pd
from io import StringIO
data = '''USER DATE SCORE
1 7/9/2015 37.2
1 11/18/2015 68.9
2 7/7/2015 45.1
2 11/2/2015 42.9
3 6/4/2015 56
3 10/27/2015 39
3 5/11/2016 42.9'''
df = (pd.read_csv(StringIO(data), sep=r'\s+', parse_dates=['DATE'])
.sort_values(by=['USER','DATE'])
.assign(ROUND = lambda x: x.groupby('USER').cumcount() + 1)
.set_index(['USER','ROUND'])['SCORE']
.unstack()
.add_prefix('R')
)
Test data:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'AAA' : [4,5,6,7,9,10],
'BBB' : [10,20,30,40,11,10],
'CCC' : [100,50,25,10,10,11]});
In [2]:df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 25
3 7 40 10
4 9 11 10
5 10 10 11
In [3]: thresh = 2
df['aligned'] = np.where(df.AAA == df.BBB,max(df.AAA)|(df.BBB),np.nan)
The np.where statement above provides the max of df.AAA and df.BBB when they are exactly aligned. I would like to have the max when the columns are within thresh of each other, and I would also like to consider all columns. It does not have to be via np.where. Can you please show me ways of approaching this?
So for row 5 it should be 11.0 in df.aligned, as this is the max value and df.AAA and df.BBB are within thresh of each other.
Ultimately I am looking for ways to find levels across multiple columns where the values are closely aligned.
Current Output with my code:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 10.0
Desired Output:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 11.0
5 10 10 11 11.0
The desired output shows rows 4 and 5 with values in df.aligned, as these rows have values within thresh of each other (10 and 11 are within the range specified by the thresh variable).
"Within thresh distance" to me means that the difference between the max
and the min of a row should be less than thresh. We can use DataFrame.apply with parameter axis=1 so that we apply the lambda function on each row.
In [1]: filt_thresh = df.apply(lambda x: (x.max() - x.min())<thresh, axis=1)
100 loops, best of 3: 1.89 ms per loop
Alternatively, there's a faster solution as pointed out below by @root:
filt_thresh = np.ptp(df.values, axis=1) < thresh
10000 loops, best of 3: 48.9 µs per loop
Or, staying with pandas:
filt_thresh = df.max(axis=1) - df.min(axis=1) < thresh
1000 loops, best of 3: 943 µs per loop
We can now use boolean indexing and calculate the max of each matching row (hence the axis=1 parameter in max() again):
In [2]: df.loc[filt_thresh, 'aligned'] = df[filt_thresh].max(axis=1)
In [3]: df
Out[3]:
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 11.0
Update:
If you wanted to calculate the minimum distance between elements for each row, that'd be equivalent to sorting the array of values (np.sort()), calculating the difference between consecutive numbers (np.diff), and taking the min of the resulting array. Finally, compare that to thresh.
Here's the apply way, which has the advantage of being a bit clearer to understand:
filt_thresh = df.apply(lambda row: np.min(np.diff(np.sort(row))) < thresh, axis=1)
1000 loops, best of 3: 713 µs per loop
And here's the vectorized equivalent:
filt_thresh = np.diff(np.sort(df)).min(axis=1) < thresh
The slowest run took 4.31 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000 loops, best of 3: 67.3 µs per loop
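For completeness, a minimal sketch (assuming the df and thresh from the test data above, and selecting only the three value columns so the existing aligned column doesn't affect the row-wise calculation) that combines the min-gap mask with the earlier boolean-indexing step to reproduce the desired output:
vals = df[['AAA', 'BBB', 'CCC']].values
# a row qualifies if its two closest values are within thresh of each other
filt_thresh = np.diff(np.sort(vals), axis=1).min(axis=1) < thresh
df['aligned'] = np.nan
df.loc[filt_thresh, 'aligned'] = df.loc[filt_thresh, ['AAA', 'BBB', 'CCC']].max(axis=1)
# rows 4 and 5 now get 11.0, matching the desired output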
I have an array with missing values in various places.
import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)
print(df)
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
5 6.0
6 NaN
7 8.0
8 9.0
dtype: float64
For each NaN, I want to take the next valid value after it and divide it by two, and then propagate that (halving again) to any further consecutive NaNs before it, so I would end up with:
0 0.75
1 1.5
2 3.0
3 4.0
4 5.0
5 6.0
6 4.0
7 8.0
8 9.0
dtype: float64
I've tried df.interpolate(), but that doesn't seem to work with consecutive NaNs.
Another solution: backfill the values with bfill and then divide by a factor built by counting the consecutive NaNs (using cumsum and ffill):
# reverse the Series so each NaN is counted from the next valid value
b = df[::-1].isnull()
# number each NaN within its consecutive run, multiply by 2 and replace 0 with 1 to get the divisor
# (note: mul(2) equals the repeated-halving divisor 2**k only for runs of one or two NaNs, as here)
a = (b.cumsum() - b.cumsum().where(~b).ffill()).mul(2).replace({0:1})
print(a)
8 1
7 1
6 2
5 1
4 1
3 1
2 1
1 2
0 4
dtype: int32
print(df.bfill().div(a))
0 0.75
1 1.50
2 3.00
3 4.00
4 5.00
5 6.00
6 4.00
7 8.00
8 9.00
dtype: float64
Timings (len(df)=9k):
In [315]: %timeit (mat(df))
100 loops, best of 3: 11.3 ms per loop
In [316]: %timeit (jez(df1))
100 loops, best of 3: 2.52 ms per loop
Code for timings:
import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)
print(df)
df = pd.concat([df]*1000).reset_index(drop=True)
df1 = df.copy()
def jez(df):
    b = df[::-1].isnull()
    a = (b.cumsum() - b.cumsum().where(~b).ffill()).mul(2).replace({0:1})
    return df.bfill().div(a)

def mat(df):
    prev = 0
    new_list = []
    for i in df.values[::-1]:
        if np.isnan(i):
            new_list.append(prev / 2.)
            prev = prev / 2.
        else:
            new_list.append(i)
            prev = i
    return pd.Series(new_list[::-1])
print (mat(df))
print (jez(df1))
You can do something like this:
import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)
prev = 0
new_list = []
for i in df.values[::-1]:
    if np.isnan(i):
        new_list.append(prev / 2.)
        prev = prev / 2.
    else:
        new_list.append(i)
        prev = i

df = pd.Series(new_list[::-1])
It loops over the values of the df in reverse, keeping track of the previous value: it appends the actual value if it is not NaN, otherwise half of the previous value.
This might not be the most sophisticated Pandas solution, but you can change the behavior quite easily.
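For reference, a vectorised sketch of the same idea (not from the original answers): backfill the series and divide by 2 raised to each NaN's position within its run, which keeps halving correctly even for runs longer than two consecutive NaNs:
import numpy as np
import pandas as pd

x = np.arange(1, 10).astype(float)
x[[0, 1, 6]] = np.nan
s = pd.Series(x)

# scan in reverse so each NaN is counted from the next valid value
b = s[::-1].isnull()
# position of each NaN within its consecutive run (0 for valid values), back in original order
k = (b.cumsum() - b.cumsum().where(~b).ffill())[::-1]
result = s.bfill() / (2 ** k)
print(result)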
I have a pandas DataFrame with many small groups:
In [84]: n=10000
In [85]: df=pd.DataFrame({'group':sorted(range(n)*4),'val':np.random.randint(6,size=4*n)}).sort(['group','val']).reset_index(drop=True)
In [86]: df.head(9)
Out[86]:
group val
0 0 0
1 0 0
2 0 1
3 0 2
4 1 1
5 1 2
6 1 2
7 1 4
8 2 0
I want to do something special for groups where val==1 appears but not val==0, e.g. replace the 1s in such a group by 99.
But for DataFrames of this size it is quite slow:
In [87]: def f(s):
....: if (0 not in s) and (1 in s): s[s==1]=99
....: return s
....:
In [88]: %timeit df.groupby('group')['val'].transform(f)
1 loops, best of 3: 11.2 s per loop
Looping through the data frame is much uglier but much faster:
In [89]: %paste
def g(df):
    df.sort(['group','val'], inplace=True)
    last_g = -1
    for i in xrange(len(df)):
        if df.group.iloc[i] != last_g:
            has_zero = False
            last_g = df.group.iloc[i]  # remember which group we are in
        if df.val.iloc[i] == 0:
            has_zero = True
        elif has_zero and df.val.iloc[i] == 1:
            df.val.iloc[i] = 99
    return df
## -- End pasted text --
In [90]: %timeit g(df)
1 loops, best of 3: 2.53 s per loop
But I would like to optimize it further if possible.
Any idea of how to do so?
Thanks
Based on Jeff's answer, I got a solution that is very fast. I'm putting it here in case others find it useful:
In [122]: def do_fast(df):
.....: has_zero_mask=df.group.isin(df[df.val==0].group.unique())
.....: df.val[(df.val==1) & has_zero_mask]=99
.....: return df
.....:
In [123]: %timeit do_fast(df)
100 loops, best of 3: 11.2 ms per loop
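As a side note, the chained assignment df.val[...] = 99 in do_fast can trigger SettingWithCopyWarning on newer pandas and, with copy-on-write enabled, no longer updates df at all; here is a sketch of the same filtering written with .loc instead:
# groups that contain at least one val == 0 (the same mask idea as do_fast above)
zero_groups = df.loc[df['val'] == 0, 'group'].unique()
# assign through .loc so the original DataFrame is updated reliably
df.loc[(df['val'] == 1) & df['group'].isin(zero_groups), 'val'] = 99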
Not 100% sure this is what you are going for, but it should be simple to swap in different filtering/setting criteria:
In [37]: pd.set_option('max_rows',10)
In [38]: np.random.seed(1234)
In [39]: def f():
             # create the frame
             df = pd.DataFrame({'group':sorted(range(n)*4),
                                'val':np.random.randint(6,size=4*n)}).sort(['group','val']).reset_index(drop=True)
             df['result'] = np.nan
             # create a count per group
             df['counter'] = df.groupby('group').cumcount()
             # select which values you want, returning the indexes of those
             mask = df[df.val==1].groupby('group').grouper.group_info[0]
             # set em
             df.loc[df.index.isin(mask) & df['counter'] == 1,'result'] = 99
In [40]: %timeit f()
10 loops, best of 3: 95 ms per loop
In [41]: df
Out[41]:
group val result counter
0 0 3 NaN 0
1 0 4 99 1
2 0 4 NaN 2
3 0 5 99 3
4 1 0 NaN 0
... ... ... ... ...
39995 9998 4 NaN 3
39996 9999 0 NaN 0
39997 9999 0 NaN 1
39998 9999 2 NaN 2
39999 9999 3 NaN 3
[40000 rows x 4 columns]