I'm using pandas 13.0 and I'm trying to create a new column using apply() and a function named foo().
My dataframe is as follows:
df = pandas.DataFrame({
'a':[ 0.0, 0.1, 0.2, 0.3],
'b':[10.0, 20.0, 30.0, 40.0],
'c':[ 1.0, 2.0, 3.0, 4.0]
})
df.set_index(df['a'], inplace=True)
So my dataframe is:
in: print df
out:
a b c
a
0.0 0.0 10.0 1.0
0.1 0.1 20.0 2.0
0.2 0.2 30.0 3.0
0.3 0.3 40.0 4.0
My function is as follows:
def foo(arg1, arg2):
    return arg1*arg2
Now I want to create a column named 'd' using foo():
df['d'] = df.apply(foo(df['b'], df['c']), axis=1)
But I get the following error:
TypeError: ("'Series' object is not callable", u'occurred at index 0.0')
How can I use pandas.apply() with foo() when the index is made of floats?
Thanks
The problem here is that you are trying to process this row-wise, but you are passing Series as arguments, which is wrong: foo(df['b'], df['c']) is evaluated first, so apply receives a Series rather than a function. You could do it this way:
In [7]:
df['d'] = df.apply(lambda row: foo(row['b'], row['c']), axis=1)
df
Out[7]:
a b c d
a
0.0 0.0 10 1 10
0.1 0.1 20 2 40
0.2 0.2 30 3 90
0.3 0.3 40 4 160
A better way would be to just call your function directly:
In [8]:
df['d'] = foo(df['b'], df['c'])
df
Out[8]:
a b c d
a
0.0 0.0 10 1 10
0.1 0.1 20 2 40
0.2 0.2 30 3 90
0.3 0.3 40 4 160
The advantage with the above method is that it is vectorised and will perform the operation on the whole series rather than a row at a time.
In [15]:
%timeit df['d'] = df.apply(lambda row: foo(row['b'], row['c']), axis=1)
%timeit df['d'] = foo(df['b'], df['c'])
1000 loops, best of 3: 270 µs per loop
1000 loops, best of 3: 214 µs per loop
Not much difference here. Now compare with a 400,000-row df:
In [18]:
%timeit df['d'] = df.apply(lambda row: foo(row['b'], row['c']), axis=1)
%timeit df['d'] = foo(df['b'], df['c'])
1 loops, best of 3: 5.84 s per loop
100 loops, best of 3: 8.68 ms per loop
So here you see a ~673x speed-up.
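To make the difference concrete, here is a minimal sketch using the same toy data as the question. The names `not_a_function`, `row_wise` and `vectorised` are only illustrative. It shows that `foo(df['b'], df['c'])` already produces a Series (which is not callable, hence the error when it is handed to apply), and that the row-wise and vectorised forms give the same values.

```python
import pandas as pd

def foo(arg1, arg2):
    return arg1 * arg2

df = pd.DataFrame({
    'a': [0.0, 0.1, 0.2, 0.3],
    'b': [10.0, 20.0, 30.0, 40.0],
    'c': [1.0, 2.0, 3.0, 4.0],
}).set_index('a', drop=False)

# foo(...) runs immediately and returns a Series, not a function.
not_a_function = foo(df['b'], df['c'])
print(callable(not_a_function))  # a Series cannot be called by apply

# Row-wise: apply calls the lambda once per row (each row is a Series).
row_wise = df.apply(lambda row: foo(row['b'], row['c']), axis=1)

# Vectorised: a single multiplication over the two whole columns.
vectorised = foo(df['b'], df['c'])

print(row_wise.equals(vectorised))
```

The float index plays no role in the error; the same thing happens with any index.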
What's the most pythonic way to add a column (of weights) to an existing Pandas DataFrame "df" by a condition on df's column?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
A B
0 1 4
1 2 5
2 3 6
I'd like to add a "weight" column where if df['B'] >= 6 then df['weight'] = 20, else df['weight'] = 1
So my output will be:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Approach #1
Here's one with type-conversion and scaling -
df['weight'] = (df['B'] >= 6)*19+1
Approach #2
Another possibly faster one with using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multi-cores with numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows, as commented by OP, with random data in the range [0,9), for the vectorized methods posted thus far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# @jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# @jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one ...
With the output being 1 or 20, we could safely use lower precision : uint8 for a turbo speedup over already discussed ones, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
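The scaling trick above works because booleans behave as 0 and 1 in arithmetic. The following is a small sketch on the question's three-row frame (the variable names `mask`, `scaled`, `explicit` and `small` are illustrative) checking that it matches an explicit np.where mapping.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Boolean array marking the rows that satisfy the condition.
mask = df['B'].values >= 6

# False -> 0*19 + 1 = 1, True -> 1*19 + 1 = 20.
scaled = mask * 19 + 1

# The same mapping spelled out with np.where.
explicit = np.where(mask, 20, 1)

# Variant from the last timing: multiply by a uint8 scalar instead of a Python int.
small = mask * np.uint8(19) + 1

df['weight'] = scaled
print(df)
```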
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Here's a method using df.apply
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
I'm trying to group some columns into a list inside a single column.
In case one of these columns contains NaN, the result column should be just NaN instead of the list.
df = pd.DataFrame({'a.0':[11, 1.1], 'a.5':[12, 1.2], 'a.10':[13, pd.np.NaN]})
The result column of the DF should be as such:
    a.0  a.10   a.5        result
0  11.0  13.0  12.0  [11, 13, 12]
1   1.1   nan   1.2           nan
These 2 lines do the job:
df['result'] = df[['a.0','a.10','a.5']].values.tolist()
df['result'] = pd.np.where(df[['a.0','a.10','a.5']].isnull().any(axis=1), pd.np.nan, df['result'])
And I was wondering how to do it in one line. Help would be appreciated.
Using isnull + any + mask/where
df['result'] = pd.Series(df.values.tolist()).mask(df.isnull().any(axis=1))
Or,
df['result'] = pd.Series(df.values.tolist()).where(~df.isnull().any(axis=1))
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11.0, 13.0, 12.0]
1 1.1 NaN 1.2 NaN
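For readers who want to run this step by step, here is a self-contained sketch of the same mask idea. It assumes plain `np.nan` rather than `pd.np`, and the intermediate names `cols`, `rows_as_lists` and `has_nan` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a.0': [11, 1.1], 'a.5': [12, 1.2], 'a.10': [13, np.nan]})
cols = ['a.0', 'a.10', 'a.5']

# One Python list per row, wrapped in a Series that shares df's index.
rows_as_lists = pd.Series(df[cols].values.tolist(), index=df.index)

# True for each row that contains at least one NaN in the chosen columns.
has_nan = df[cols].isnull().any(axis=1)

# mask() replaces the entries where the condition is True with NaN.
df['result'] = rows_as_lists.mask(has_nan)
print(df)
```

Building the Series with `index=df.index` keeps the assignment aligned even when the frame does not use the default integer index.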
Performance
df = pd.concat([df] * 100000, ignore_index=True)
# Wen's solution
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
1 loop, best of 3: 1min 37s per loop
# Anton vBR's solution
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
1 loop, best of 3: 5.79 s per loop
# my answer
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
10 loops, best of 3: 133 ms per loop
Conclusions
apply is inefficient, more so than a loop. An apply call typically has many overheads associated with it, so a pure Python loop is generally faster.
A vectorised approach such as where/mask operates on whole arrays at once and offers improved performance over loopy solutions.
Update: With timings and large datasets, cᴏʟᴅsᴘᴇᴇᴅ's answer is the best. List comprehensions always suffer here. I've updated my previous answer with timings.
You could use itertuples and assign np.nan if any np.nan in row:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a.0':np.random.choice(np.append(np.nan,np.arange(10)), 1000),
'a.5':np.random.choice(10, 1000),
'a.10':np.random.choice(10, 1000)})
# 3 solutions to solve the problem
# Assign with df['results'] =
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
Timings:
100 loops, best of 3: 8.38 ms per loop
1000 loops, best of 3: 772 µs per loop
1 loop, best of 3: 214 ms per loop
df['result']=df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
df
Out[30]:
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11.0, 13.0, 12.0]
1 1.1 NaN 1.2 NaN
I have a dataframe and I'd like to be able to use np.where to find certain elements based on a given condition, and then use pd.drop to erase the elements corresponding to the index found with np.where.
I.e.,
idx_to_drop = np.where(myDf['column10'].isnull() | myDf['column14'].isnull())
myDf.drop(idx_to_drop)
But I get a value error since drop does not take numpy array indexes. Is there a way to achieve this using np.where and some drop function in pandas?
There are two common patterns to achieve that:
select those rows that DON'T satisfy your "dropping" condition or negate your conditions and select those rows that satisfy those conditions - @jezrael has provided a good example for that approach.
drop the rows satisfying your "dropping" conditions:
df = df.drop(np.where(df['column10'].isnull() | df['column14'].isnull())[0])
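One caveat with the second pattern: np.where returns integer positions, while drop expects index labels. The two coincide only for a default 0..n-1 index. A hedged sketch for the general case follows; the string labels 'x', 'y', 'z' are a made-up example index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'column10': [np.nan, 1.0, 5.0], 'column14': [np.nan, 1.0, np.nan]},
    index=['x', 'y', 'z'],  # hypothetical non-default labels
)

# np.where returns a tuple of arrays; [0] holds the matching row positions.
positions = np.where(df['column10'].isnull() | df['column14'].isnull())[0]

# Translate positions into labels before handing them to drop().
result = df.drop(df.index[positions])
print(result)
```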
Timing: the first approach seems to be a bit faster:
Setup:
df = pd.DataFrame(np.random.rand(100,5), columns=list('abcde'))
df.loc[::7, ::2] = np.nan
df = pd.concat([df] * 10**4, ignore_index=True)
In [117]: df.shape
Out[117]: (1000000, 5)
In [118]: %timeit df[~(df['a'].isnull() | df['e'].isnull())]
10 loops, best of 3: 46.6 ms per loop
In [119]: %timeit df[df['a'].notnull() & df['e'].notnull()]
10 loops, best of 3: 39.9 ms per loop
In [120]: %timeit df.drop(np.where(df['a'].isnull() | df['e'].isnull())[0])
10 loops, best of 3: 65.5 ms per loop
In [122]: %timeit df.drop(np.where(df[['a','e']].isnull().any(1))[0])
10 loops, best of 3: 97.1 ms per loop
In [123]: %timeit df[df[['a','e']].notnull().all(1)]
10 loops, best of 3: 72 ms per loop
I think you need boolean indexing with inverse condition by ~, isnull and | (bitwise or):
print (~(myDf['column10'].isnull() | myDf['column14'].isnull()))
0 False
1 True
2 False
dtype: bool
myDf[~(myDf['column10'].isnull() | myDf['column14'].isnull())]
Sample:
myDf = pd.DataFrame({'column10':[np.nan, 1,5], 'column14':[np.nan, 1,np.nan]})
print (myDf)
column10 column14
0 NaN NaN
1 1.0 1.0
2 5.0 NaN
myDf = myDf[~(myDf['column10'].isnull() | myDf['column14'].isnull())]
print (myDf)
column10 column14
1 1.0 1.0
Solution with notnull and & (bitwise and)
myDf = myDf[myDf['column10'].notnull() & myDf['column14'].notnull()]
print (myDf)
column10 column14
1 1.0 1.0
Other solutions with any or all:
myDf = myDf[~myDf[['column10', 'column14']].isnull().any(axis=1)]
print (myDf)
column10 column14
1 1.0 1.0
myDf = myDf[myDf[['column10', 'column14']].notnull().all(axis=1)]
print (myDf)
column10 column14
1 1.0 1.0
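As a cross-check, the boolean-indexing variants above should agree with pandas' built-in `dropna` restricted to the two columns via its `subset` parameter. This sketch uses the same sample frame; `by_mask` and `by_dropna` are illustrative names.

```python
import numpy as np
import pandas as pd

myDf = pd.DataFrame({'column10': [np.nan, 1, 5], 'column14': [np.nan, 1, np.nan]})

# Keep the rows where both columns are non-null (boolean indexing).
by_mask = myDf[myDf[['column10', 'column14']].notnull().all(axis=1)]

# Same filter expressed with dropna on a subset of columns.
by_dropna = myDf.dropna(subset=['column10', 'column14'])

print(by_mask.equals(by_dropna))
```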
I have a 130M-row dataframe; here is a sample:
id id2 date value
0 33208381500016 1927637 2014-07-31 120.0
1 77874276700016 3418498 2014-11-22 10.5
2 77874276700016 1174018 2014-11-22 8.4
3 77874276700016 1174018 2014-11-20 1.4
4 77874276700016 1643839 2014-06-27 4.2
5 77874276700016 1972929 2014-06-27 6.7
6 77874276700016 1972929 2014-06-27 12.7
7 77874276700016 1588191 2014-02-20 123.4
8 77874276700016 1966627 2014-02-20 973.1
9 77874276700016 1830252 2014-02-20 0.5
I need to perform a groupby on this dataframe (called data). For a simple groupby like a sum no problem:
data[['id','value']].groupby('id',as_index=False).sum()
time: 11.19s
But now I need to retrieve the list of values in another column (or its length). The following code works, but takes ages. Is there a more efficient way to do it?
temp = data[['id','date','id2']].drop_duplicates()
temp.groupby('id',as_index = False).agg({'date': lambda x: set(x.tolist()),'id2':lambda x: len(set(x.tolist()))})
time: 159s
First question:
Is there a more efficient way to count the number of unique id2 for every id, but still using this groupby? What I mean is I don't want to split the two groupbys as it will probably take longer (performing one groupby with 2 aggregations takes approximately 1.5 times as long as a single groupby).
Second question:
Is there a more efficient way to retrieve the list of unique dates? I know it has been addressed in this question but I can't simply use .apply(list).
To get the unique dates, use SeriesGroupBy.unique(). To count the number of unique id2 in each group, use SeriesGroupBy.nunique().
temp = data[['id', 'date', 'id2']].drop_duplicates()
temp.groupby('id', as_index=False).agg({'date': 'unique', 'id2': 'nunique'})
Not dropping duplicates beforehand may be faster — pandas only has to iterate once over all your data instead of twice.
data.groupby('id', as_index=False).agg({'date': 'unique', 'id2': 'nunique'})
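A tiny sketch of what this aggregation returns, on made-up data with the same column names (the ids and dates below are hypothetical, not the question's values):

```python
import pandas as pd

# Hypothetical miniature of the question's frame.
data = pd.DataFrame({
    'id':   [1, 2, 2, 2, 2],
    'id2':  [10, 20, 30, 30, 40],
    'date': ['2014-07-31', '2014-11-22', '2014-11-22', '2014-11-20', '2014-06-27'],
})

# 'unique' collects the distinct dates per id, 'nunique' counts distinct id2 per id.
out = data.groupby('id', as_index=False).agg({'date': 'unique', 'id2': 'nunique'})
print(out)
```

Each cell of the date column holds an array of the distinct dates for that id, in order of first appearance.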
EDIT:
Here are some benchmarks. Interestingly, SeriesGroupBy.unique() and SeriesGroupBy.nunique() do not seem to be faster than using sets. But not dropping duplicates before is.
import io
import pandas as pd
raw = io.StringIO("""\
id id2 date value
0 33208381500016 1927637 2014-07-31 120.0
1 77874276700016 3418498 2014-11-22 10.5
2 77874276700016 1174018 2014-11-22 8.4
3 77874276700016 1174018 2014-11-20 1.4
4 77874276700016 1643839 2014-06-27 4.2
5 77874276700016 1972929 2014-06-27 6.7
6 77874276700016 1972929 2014-06-27 12.7
7 77874276700016 1588191 2014-02-20 123.4
8 77874276700016 1966627 2014-02-20 973.1
9 77874276700016 1830252 2014-02-20 0.5
""")
data = pd.read_csv(raw, sep=r'\s+')
def using_sets_drop_then_group():
    temp = data[['id', 'date', 'id2']].drop_duplicates()
    temp.groupby('id', as_index=False).agg({'date': lambda x: set(x),
                                            'id2': lambda x: len(set(x))})
def using_sets_drop_just_group():
    data.groupby('id', as_index=False).agg({'date': lambda x: set(x),
                                            'id2': lambda x: len(set(x))})
def using_unique_drop_then_group():
    temp = data[['id', 'date', 'id2']].drop_duplicates()
    temp.groupby('id', as_index=False).agg({'date': 'unique', 'id2': 'nunique'})
def using_unique_just_group():
    data.groupby('id', as_index=False).agg({'date': 'unique', 'id2': 'nunique'})
%timeit using_sets_drop_then_group() # => 100 loops, best of 3: 4.82 ms per loop
%timeit using_sets_drop_just_group() # => 100 loops, best of 3: 2.91 ms per loop
%timeit using_unique_drop_then_group() # => 100 loops, best of 3: 5.14 ms per loop
%timeit using_unique_just_group() # => 100 loops, best of 3: 3.26 ms per loop
EDIT:
In a comment, @ptrj suggests SeriesGroupBy.unique() and SeriesGroupBy.nunique() may be faster if dates are converted to datetime64. Alas, it does not seem to be the case, at least for this small sample of data.
data['parsed_date'] = pd.to_datetime(data['date'])
def using_sets_and_datetime64():
    data.groupby('id', as_index=False).agg({'parsed_date': lambda x: set(x),
                                            'id2': lambda x: len(set(x))})
def using_unique_and_datetime64():
    data.groupby('id', as_index=False).agg({'parsed_date': 'unique',
                                            'id2': 'nunique'})
%timeit using_sets_and_datetime64() # => 100 loops, best of 3: 3.2 ms per loop
%timeit using_unique_and_datetime64() # => 100 loops, best of 3: 3.53 ms per loop
EDIT:
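@MaxU's suggestion of concatenating 100,000 copies of the sample data indeed leads to SeriesGroupBy.unique() and SeriesGroupBy.nunique() outperforming set, at least once the dates are parsed as datetime64 (with string dates the two are close in the timings below).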
large_data = pd.concat([data] * 10**5, ignore_index=True)
def using_sets():
    large_data.groupby('id', as_index=False).agg({'date': lambda x: set(x),
                                                  'id2': lambda x: len(set(x))})
def using_unique():
    large_data.groupby('id', as_index=False).agg({'date': 'unique',
                                                  'id2': 'nunique'})
def using_sets_and_datetime64():
    large_data.groupby('id', as_index=False).agg({'parsed_date': lambda x: set(x),
                                                  'id2': lambda x: len(set(x))})
def using_unique_and_datetime64():
    large_data.groupby('id', as_index=False).agg({'parsed_date': 'unique',
                                                  'id2': 'nunique'})
%timeit using_sets() # => 1 loops, best of 3: 295 ms per loop
%timeit using_unique() # => 1 loops, best of 3: 327 ms per loop
%timeit using_sets_and_datetime64() # => 1 loops, best of 3: 5.02 s per loop
%timeit using_unique_and_datetime64() # => 1 loops, best of 3: 248 ms per loop
If I've got a DataFrame in pandas which looks something like:
A B C
0 1 NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).
Backfill the NaNs along each row with fillna (so each cell takes the next non-null value to its right), then take the leftmost column:
df.fillna(method='bfill', axis=1).iloc[:, 0]
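The same idea can also be written with the dedicated `DataFrame.bfill` method, which avoids the `method=` keyword of fillna. A minimal sketch on the question's frame (`first_non_null` is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

# Backfill along each row, so column 0 ends up holding the first non-null value.
first_non_null = df.bfill(axis=1).iloc[:, 0]
print(first_non_null)
```

A row that is entirely NaN stays NaN, since there is nothing to backfill from.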
This is a really messy way to do this. First use first_valid_index to get the first valid column for each row, convert the returned Series to a DataFrame so we can call apply row-wise, and use this to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values but the look up is very quick:
def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
EDIT:
Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
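To see why the all-null case works out, here is a worked sketch of the intermediate arrays on the question's 4x3 example, written with plain NumPy (the array `a` is typed in by hand to mirror the question's frame):

```python
import numpy as np

a = np.array([[1.0,    np.nan, 2.0],
              [np.nan, 3.0,    np.nan],
              [np.nan, 4.0,    5.0],
              [np.nan, np.nan, np.nan]])

is_null = np.isnan(a)

# On booleans, argmin finds the first False, i.e. the first non-null column.
# For an all-True row it returns 0, which simply points at a NaN.
col_index = is_null.argmin(axis=1)

# Convert (row, col) pairs into positions in the flattened row-major array.
n_rows, n_cols = a.shape
flat_index = n_cols * np.arange(n_rows) + col_index

first = a.ravel()[flat_index]
print(col_index, flat_index, first)
```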
Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0 1
1 3
2 4
3 NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0 A 1
C 2
1 B 3
2 B 4
C 5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0 1
1 3
2 4
dtype: float64
All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
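Put together as a runnable sketch on the question's frame (the name `result` is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

# stack: columns become an inner index level; groupby(level=0): one group per
# original row; first(): first non-null value in each group; reindex: restore
# any rows that dropped out because they were entirely NaN.
result = df.stack().groupby(level=0).first().reindex(df.index)
print(result)
```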
This is nothing new, but it's a combination of the best bits of @yangie's approach with a list comprehension, and @EdChum's df.apply approach that I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0 A
1 B
2 B
3 None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.items()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64
groupby in axis=1
If we pass a callable that returns the same value, we group all columns together. This allows us to use groupby.agg, which gives us the first method that makes this easy:
df.groupby(lambda x: 'Z', 1).first()
Z
0 1.0
1 3.0
2 4.0
3 NaN
This returns a dataframe with the column name of the thing I was returning in my callable
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])
Here is a one line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over rows of df. row.first_valid_index() returns label for first non-NA/null value, which will be used as index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() would be None and thus cannot be used as an index, so I need an if-else statement.
I packed everything into a list comprehension for brevity.
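One detail worth checking in the if-else: first_valid_index() can return a label that is falsy yet valid, such as the integer 0, so comparing against None explicitly is the safer test. A small sketch follows; the frame with integer column labels and the helper `first_non_null` are hypothetical, not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose first column label is the integer 0.
df = pd.DataFrame([[1.0, np.nan],
                   [np.nan, 3.0],
                   [np.nan, np.nan]], columns=[0, 1])

def first_non_null(row):
    label = row.first_valid_index()
    # Compare with None explicitly: the label 0 is falsy but perfectly valid.
    return row.loc[label] if label is not None else None

result = [first_non_null(row) for _, row in df.iterrows()]
print(result)
```

With a plain truthiness test, the first row here would come back as None even though it holds a value in column 0.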
JoeCondron's answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
df=pandas.DataFrame({'A':[1, numpy.nan, numpy.nan, numpy.nan], 'B':[numpy.nan, 3, 4, numpy.nan], 'C':[2, numpy.nan, 5, numpy.nan]})
df
A B C
0 1.0 NaN 2.0
1 NaN 3.0 NaN
2 NaN 4.0 5.0
3 NaN NaN NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]