I have the following dataframe:
d = {'col1': [1, "Avoid", 3, "Avoid"], 'col2': ["AA", "BB", "Avoid", "Avoid"]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 AA
1 Avoid BB
2 3 Avoid
3 Avoid Avoid
I have to conditionally concatenate col1 and col2 into col3. Conditions:
Only concatenate the two columns if neither of them is "Avoid".
If either col1 or col2 is "Avoid", col3 will be "Avoid" as well.
When concatenating, " & " needs to be added between the column values in col3. For instance, the first row of col3 will be "1 & AA".
The end result is supposed to look like the following:
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
How can I do this without dealing with for loops?
Run a list comprehension on the strings in plain Python:
out = [f"{l} & {r}"
       if 'Avoid' not in {l, r}
       else 'Avoid'
       for l, r in zip(df.col1, df.col2)]
df.assign(col3 = out)
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
This is not an efficient way to work with pandas, but if you can't change the data structure, here are two solutions:
Solution 1:
def custom_merge(cols):
    if cols["col1"] == "Avoid" or cols["col2"] == "Avoid":
        return "Avoid"
    else:
        return f"{cols['col1']} & {cols['col2']}"

df['col3'] = df.apply(custom_merge, axis=1)
Solution 2:
df['col3'] = (df["col1"].astype(str) + " & " + df["col2"].astype(str)).apply(lambda x: "Avoid" if 'Avoid' in x else x)
Both solutions result in the following:
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
Execution Time comparison
In this section I will count the execution times of the proposed solutions.
@mozway proposed two clever solutions in his answer, which I will call Solution 3a and Solution 3b. Another interesting solution is @sammywemmy's, which uses a list comprehension and then adds the resulting list to the dataframe; I will call this Solution 4.
The instances of the experiments will have the following structure:
import pandas as pd
n = 100000
d = {'col1': [1, "Avoid", 3, "Avoid"] * n, 'col2': ["AA", "BB", "Avoid", "Avoid"] * n}
df = pd.DataFrame(data=d)
Execution time:
Solution 1:
%timeit 3.56 s ± 71.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
%timeit 140 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 3a:
%timeit 3.44 s ± 77.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 3b:
%timeit 893 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 4:
%timeit 191 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solutions 1 and 3a have similar execution times on the instance under consideration. Solution 3b runs about 4 times faster than solutions 1 and 3a. The fastest solution is solution 2, which runs roughly 6 times faster than solution 3b and about 25 times faster than solutions 1 and 3a; this is because solution 2 takes advantage of pandas vectorization. Solution 4 has a runtime similar to solution 2's, taking advantage of a list comprehension for the merge operation (without going through pandas row-wise).
TIPS:
If you can change the data format, the advice is to structure the data so that you can use native pandas functions for the join operations. If you can't change the data format and can do without pandas, you might get a slight speed-up over Solution 2 by using dictionaries or lists and doing the merge with a list comprehension, as sketched below.
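For illustration, here is a minimal sketch of that list-based approach, building col3 before the DataFrame is ever created (the columns are the ones from the question; the actual speed-up relative to Solution 2 will depend on your data):

import pandas as pd

col1 = [1, "Avoid", 3, "Avoid"]
col2 = ["AA", "BB", "Avoid", "Avoid"]

# merge with a plain list comprehension, outside pandas
col3 = ["Avoid" if "Avoid" in (l, r) else f"{l} & {r}"
        for l, r in zip(col1, col2)]

df = pd.DataFrame({"col1": col1, "col2": col2, "col3": col3})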
try this:
df["col3"]=df.apply(lambda x:"Avoid" if x["col1"]=="Avoid" or x["col2"]=="Avoid" else f"{x['col1']} & {x['col2']}",axis=1)
df["col3"] = df["col1"] + " & " + df["col2"]
df["col3"] = df["col3"].apply(lambda x: "Avoid" if x.contains("Avoid") else x)
Use boolean operations, this enables you to use an arbitrary number of columns:
# is any value in the row "Avoid"?
m = df.eq('Avoid').any(axis=1)
# concatenate all columns unless there was an "Avoid"
df['col3'] = df.astype(str).agg(' & '.join, axis=1).mask(m, 'Avoid')
Alternative that should be faster if you have many rows and few with "Avoid":
m = df.ne('Avoid').all(axis=1)
df.loc[m, 'col3'] = df[m].astype(str).agg(' & '.join, axis=1)
df['col3'] = df['col3'].fillna('Avoid')
output:
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
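As a sketch of how this generalizes to more columns (the extra col0 below is hypothetical, added only to illustrate the pattern):

# hypothetical third input column
df['col0'] = ['X', 'Avoid', 'Y', 'Z']

cols = ['col0', 'col1', 'col2']
m = df[cols].eq('Avoid').any(axis=1)
df['col3'] = df[cols].astype(str).agg(' & '.join, axis=1).mask(m, 'Avoid')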
Related
I need help extracting multiple numbers from a column in a dataframe, removing duplicates, and separating them with a comma.
Col1
Abcde 10 hijk20
wewrw5 gagdhdh5
Mnbjgkh10,20, 30
Expected output:
Col2
10,20
5
10,20,30
Try this:
import re

punctuations = ['!', '(', ')', '-', '[', ']', '{', '}', ';', ':', '"', '<', '>', '.', '/', '?']

for index, row in dataframe.iterrows():
    content = dataframe.iloc[index, column_index]
    for p in punctuations:
        content = content.replace(p, " ")
    only_numbers = re.sub("[^0-9]", " ", content)
    numbers_found = only_numbers.strip().split()
    no_duplicates = list(set(numbers_found))
    comma_separated = ",".join(no_duplicates)
    dataframe.iloc[index, column_index] = comma_separated
findall() from the re module with the regular expression r'\d+' returns a list of all non-overlapping matches of one or more consecutive decimal digits in the string. The built-in set() removes any duplicates from that list, and applying the built-in sorted() returns a sorted list of the set's elements. We also make use of numpy.vectorize, as it is faster than Pandas' apply() for this particular application (at least on my system), though I have shown how to use apply() as well.
Method 1
import pandas as pd
import numpy as np
import re
# compile RE - matches one or more decimal digits
p = re.compile(r'\d+')
# data
d = {'col1': ['Abcde 10 hijk20', 'wewrw5 gagdhdh5', 'Mnbjgkh10,20, 30'],
'col2': [''] * 3}
# DataFrame
df = pd.DataFrame(d)
# modify col2 based on col1
df['col2'] = np.vectorize(
    lambda y: ','.join(sorted(set(p.findall(y)))),
)(df['col1'])
print(df)
Output
col1 col2
0 Abcde 10 hijk20 10,20
1 wewrw5 gagdhdh5 5
2 Mnbjgkh10,20, 30 10,20,30
If you can only use pandas and not numpy, you can do
Method 2
# modify col2 based on col1
df['col2'] = df.apply(
    lambda x: ','.join(sorted(set(p.findall(x['col1'])))), axis=1)
or even
Method 3
# modify col2 based on col1
for index, row in df.iterrows():
    df.loc[index, 'col2'] = ','.join(sorted(set(p.findall(row['col1']))))
Efficiency
On my system, vectorize (method 1) is fastest, method 3 is second fastest and method 2 is the slowest.
# Method 1
82.9 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Method 2
399 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# Method 3
117 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
If apply is so bad, then why is it in the API?
How and when should I make my code apply-free?
Are there ever any situations where apply is good (better than other possible solutions)?
apply, the Convenience Function you Never Needed
We start by addressing the questions in the OP, one by one.
"If apply is so bad, then why is it in the API?"
DataFrame.apply and Series.apply are convenience functions defined on the DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.
Some of the things apply can do:
Run any user-defined function on a DataFrame or Series
Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
Perform index alignment while applying the function
Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
Perform element-wise transformations
Broadcast aggregated results to original rows (see the result_type argument).
Accept positional/keyword arguments to pass to the user-defined functions.
...Among others. For more information, see Row or Column-wise Function Application in the documentation.
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.
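As a rough mental model (a sketch, not pandas' actual implementation), row-wise apply behaves approximately like a Python-level loop over rows:

import pandas as pd

def apply_rowwise(df, func):
    # simplified model of df.apply(func, axis=1); the real implementation
    # also handles dtype inference, result expansion, broadcasting, etc.
    return pd.Series(
        [func(row) for _, row in df.iterrows()],  # one Python call per row
        index=df.index,
    )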
There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.
Let's address the next question.
"How and when should I make my code apply-free?"
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
Numeric Data
If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).
Contrast the performance of apply for a simple addition operation.
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.apply(np.sum)
A 16
B 28
dtype: int64
df.sum()
A 16
B 28
dtype: int64
Performance-wise, there's no comparison; the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.
%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even if you enable passing raw arrays with the raw argument, it's still twice as slow.
%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another example:
df.apply(lambda x: x.max() - x.min())
A 8
B 8
dtype: int64
df.max() - df.min()
A 8
B 8
dtype: int64
%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()
2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, seek out vectorized alternatives if possible.
String/Regex
Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.
A common problem is to check whether a value in a column is present in another column of the same row.
df = pd.DataFrame({
'Name': ['mickey', 'donald', 'minnie'],
'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
'Value': [20, 10, 86]})
df
Name Value Title
0 mickey 20 wonderland
1 donald 10 welcome to donald's castle
2 minnie 86 Minnie mouse clubhouse
This should return the second and third rows, since "donald" and "minnie" are present in their respective "Title" columns.
Using apply, this would be done using
df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)
0 False
1 True
2 True
dtype: bool
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
However, a better solution exists using list comprehensions.
df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
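For instance, a hedged sketch of such a wrapper (the matches helper and its False fallback are illustrative; adjust them to your data):

def matches(title, name):
    # fall back to False for NaN or other non-string values instead of raising
    try:
        return name.lower() in title.lower()
    except AttributeError:
        return False

df[[matches(title, name) for title, name in zip(df['Title'], df['Name'])]]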
For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.
Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']) over, say, df['date'].apply(pd.to_datetime). Read more at the docs.
A Common Pitfall: Exploding Columns of Lists
s = pd.Series([[1, 2]] * 3)
s
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
People are tempted to use apply(pd.Series). This is horrible in terms of performance.
s.apply(pd.Series)
0 1
0 1 2
1 1 2
2 1 2
A better option is to listify the column and pass it to pd.DataFrame.
pd.DataFrame(s.tolist())
0 1
0 1 2
1 1 2
2 1 2
%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())
2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Lastly,
"Are there any situations where apply is good?"
Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.
Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.
df = pd.DataFrame(
    pd.date_range('2018-12-31', '2019-01-31', freq='2D').date.astype(str).reshape(-1, 2),
    columns=['date1', 'date2'])
df
date1 date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30
df.dtypes
date1 object
date2 object
dtype: object
This is an admissible case for apply:
df.apply(pd.to_datetime, errors='coerce').dtypes
date1 datetime64[ns]
date2 datetime64[ns]
dtype: object
Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.
%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')
5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can make a similar case for other operations such as string operations, or conversion to category.
u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))

versus

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)

v = df.copy()
for c in df:
    v[c] = df[c].astype('category')
And so on...
Converting Series to str: astype versus apply
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to strings is comparable to (and sometimes faster than) using astype.
The graph was plotted using the perfplot library.
import perfplot
perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())
With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is of integer type.
GroupBy operations with chained transformations
GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.
One common requirement is to perform a GroupBy and then chain two operations, such as a "lagged cumsum":
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
A B
0 a 12
1 a 7
2 b 5
3 c 4
4 c 5
5 c 4
6 d 3
7 d 2
8 e 1
9 e 10
You'd need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
Other Caveats
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.
df = pd.DataFrame({
'A': [1, 2],
'B': ['x', 'y']
})
def func(x):
    print(x['A'])
    return x
df.apply(func, axis=1)
# 1
# 1
# 2
A B
0 1 x
1 2 y
This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)
Not all applys are alike
The chart below suggests when to consider apply¹. Green means possibly efficient; red, avoid.
Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).
If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
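A rough sketch of both suggestions (np.ptp and the row_ptp helper are illustrative choices, not from the answer above; numba only pays off for numeric work it can compile):

import numba
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10**5, 3)))

# raw=True hands each row to the function as a plain ndarray,
# skipping the per-row Series construction
df.apply(np.ptp, axis=1, raw=True)

# a numba-compiled row-wise loop over the underlying numpy array
@numba.njit
def row_ptp(arr):
    out = np.empty(arr.shape[0])
    for i in range(arr.shape[0]):
        out[i] = arr[i].max() - arr[i].min()
    return out

pd.Series(row_ptp(df.to_numpy()), index=df.index)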
GroupBy.apply: generally favoured
Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.
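A minimal sketch of that pattern (the per-group statistic here is made up purely for illustration; the point is that the body of the lambda uses vectorised Series methods, so apply only pays the per-group overhead):

import pandas as pd

df = pd.DataFrame({'g': list('aabbb'), 'x': [1, 2, 3, 4, 5]})

# a groupwise aggregation with no single native equivalent: range divided by mean
df.groupby('g')['x'].apply(lambda s: (s.max() - s.min()) / s.mean())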
pd.DataFrame.apply column-wise: a mixed bag
pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:
# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3))) # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns
# Scenario_1 | Scenario_2
%timeit df.sum() # 800 ms | 109 ms
%timeit df.apply(pd.Series.sum) # 568 ms | 325 ms
%timeit df.max() - df.min() # 1.63 s | 314 ms
%timeit df.apply(lambda x: x.max() - x.min()) # 838 ms | 473 ms
%timeit df.mean() # 108 ms | 94.4 ms
%timeit df.apply(pd.Series.mean) # 276 ms | 233 ms
¹ There are exceptions, but these are usually marginal or uncommon. A couple of examples:
df['col'].apply(str) may slightly outperform df['col'].astype(str).
df.apply(pd.to_datetime) working on strings doesn't scale well with rows versus a regular for loop.
For axis=1 (i.e. row-wise functions), you can just use the following function in lieu of apply. I wonder why this isn't the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply.)
def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)
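A hedged usage sketch (note that func receives a plain dict per row rather than a Series, so it must index by column name):

# hypothetical usage with the df/col1/col2 frame from the first question
df['col3'] = faster_df_apply(df, lambda row: f"{row['col1']} & {row['col2']}")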
Are there ever any situations where apply is good?
Yes, sometimes.
Task: transliterate Unicode strings to plain ASCII.
import numpy as np
import pandas as pd
import unidecode
s = pd.Series(['mañana','Ceñía'])
s.head()
0 mañana
1 Ceñía
s.apply(unidecode.unidecode)
0 manana
1 Cenia
Update
I was by no means advocating for the use of apply, just thinking that since NumPy cannot deal with the above situation, it could be a good candidate for pandas apply. But I was forgetting the plain old list comprehension, thanks to the reminder by @jpp.
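That list-comprehension alternative would look something like this (a sketch; it simply bypasses pandas for the element-wise call):

pd.Series([unidecode.unidecode(x) for x in s], index=s.index)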
I have a pandas DataFrame in Python as below:
df['column'] = [abc, mno]
[mno, pqr]
[abc, mno]
[mno, pqr]
I want to get the count of each item, as below:
abc = 2,
mno = 4,
pqr = 2
I can iterate over each row to count, but that is not the kind of solution I'm looking for.
If there is any way I can use iloc or anything related to that, please suggest it.
I have looked at various solutions with a similar problem but none of them satisfied my scenario.
Here is how I'd solve it using .explode() and .value_counts(); you can furthermore assign the result as a column or do as you please with the output:
In one line:
print(df.explode('column')['column'].value_counts())
Full example:
import pandas as pd
data_1 = {'index':[0,1,2,3],'column':[['abc','mno'],['mno','pqr'],['abc','mno'],['mno','pqr']]}
df = pd.DataFrame(data_1)
df = df.set_index('index')
print(df)
column
index
0 [abc, mno]
1 [mno, pqr]
2 [abc, mno]
3 [mno, pqr]
Here we perform the .explode() to create individual values from the lists and value_counts() to count repetition of unique values:
df_new = df.explode('column')
print(df_new['column'].value_counts())
Output:
mno 4
abc 2
pqr 2
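As mentioned above, if you want to keep the result around rather than just print it (the variable names here are illustrative), one sketch is:

counts = df.explode('column')['column'].value_counts()

counts['mno']  # 4, look up a single item's count

# or turn it into a regular two-column DataFrame of item/count pairs
counts.rename_axis('item').reset_index(name='count')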
Use collections.Counter
from collections import Counter
from itertools import chain
Counter(chain.from_iterable(df.column))
Out[196]: Counter({'abc': 2, 'mno': 4, 'pqr': 2})
Timings on a larger frame:
df1 = pd.concat([df] * 10000, ignore_index=True)
In [227]: %timeit pd.Series(Counter(chain.from_iterable(df1.column)))
14.3 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df1.column.explode().value_counts()
127 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have a data frame like this,
col1 col2
[1,2] [3,4]
[5,6] [7,8]
[9,5] [1,3]
[8,4] [3,6]
and I have a function f which takes two list inputs and returns a single value. I want to add a column col3 by applying the function to the col1 and col2 values; the output of the function will be the col3 values, so the final data frame would look like:
col1 col2 col3
[1,2] [3,4] 3
[5,6] [7,8] 5
[9,5] [1,3] 8
[8,4] [3,6] 9
Using a for loop and passing the list values each time, I can calculate the col3 values, but the execution time will be longer. I am looking for a Pythonic way to do the task more efficiently.
Working with lists in pandas is not well vectorized; a possible solution is a list comprehension:
df['col3'] = [func(a, b) for a,b in zip(df.col1, df.col2)]
Pandas apply solution (should be slower):
df['col3'] = df.apply(lambda x: func(x.col1, x.col2), axis=1)
But if the function can be vectorized and the lists in each column have the same length, it may be possible to rewrite it in numpy.
If not, rewriting the function with numba may help (a sketch follows the timings below).
Performance with custom function:
# [40000 rows x 2 columns]
df = pd.concat([df] * 10000, ignore_index=True)

# sample function
def func(x, y):
    return min(x + y)
In [144]: %timeit df['col31'] = [func(a, b) for a,b in zip(df.col1, df.col2)]
39.6 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [145]: %timeit df['col32'] = df.apply(lambda x: func(x.col1, x.col2), axis=1)
2.25 s ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
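As a rough illustration of the numba suggestion above (this assumes the lists are numeric and all the same length so they can be stacked into 2-D arrays; func_nb is an illustrative rewrite of the sample min(x + y), not a drop-in for an arbitrary func):

import numba
import numpy as np

@numba.njit
def func_nb(a, b):
    # min over the concatenated pair = min of the two per-row minima
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        m, n = a[i].min(), b[i].min()
        out[i] = m if m < n else n
    return out

a = np.array(df['col1'].tolist())
b = np.array(df['col2'].tolist())
df['col33'] = func_nb(a, b)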
I am trying to learn pandas, but I have been puzzled by the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work, but for some reason it fails for me. Am I missing anything? Is there something wrong with what I'm doing? Is it because it's not implemented? See link here.
import pandas as pd
import numpy as np
pd.__version__
Out[44]:
'0.15.2'
In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df
Out[45]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
In [46]:
df.fillna(df.mean(axis=1))
Out[46]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
However something like this looks to work fine
df.fillna(df.mean(axis=0))
Out[47]:
c1 c2 c3
0 1 4 7
1 2 5 8
2 3 6 9
As commented, the axis argument to fillna is NotImplemented:
df.fillna(df.mean(axis=1), axis=1)
Note: this would be critical here, as you don't want to fill in your nth column with the nth row average.
For now you'll need to iterate through:
m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)

print(df)
c1 c2 c3
0 1 4 7.0
1 2 5 3.5
2 3 6 9.0
An alternative is to fillna the transpose and then transpose, which may be more efficient...
df.T.fillna(df.mean(axis=1)).T
As an alternative, you could also use an apply with a lambda expression like this:
df.apply(lambda row: row.fillna(row.mean()), axis=1)
yielding also
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
For an efficient solution, use DataFrame.where:
We could use where on axis=0:
df.where(df.notna(), df.mean(axis=1), axis=0)
or mask on axis=0:
df.mask(df.isna(), df.mean(axis=1), axis=0)
By using axis=0, we can fill in the missing values in each column with the row averages.
These methods perform very similarly (where does slightly better on large DataFrames of shape (300_000, 20)), are ~35-50% faster than the numpy methods posted here, and are ~110x faster than the double-transpose method.
Some benchmarks:
df = creator()
>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)
I'll propose an alternative that involves casting into numpy arrays. Performance-wise, I think this is more efficient and probably scales better than the other solutions proposed so far.
The idea is to use an indicator matrix (df.isna().values, which is True/1 if the element is N/A and False/0 otherwise) and broadcast-multiply it by the row averages.
Thus, we end up with a matrix (exactly the same shape as the original df), which contains the row-average value if the original element was N/A, and 0 otherwise.
We add this matrix to the original df, making sure to fillna with 0 so that, in effect, we have filled the N/A's with the respective row averages.
# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df
giving
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
You can broadcast the mean to a DataFrame with the same index as the original and then use update with overwrite=False to get the behavior of .fillna. Unlike .fillna, update allows for filling when the indices have duplicated labels. It should be faster than the looping .fillna for DataFrames smaller than 50,000 rows or so.
fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape),
                    columns=df.columns,
                    index=df.index)
df.update(fill, overwrite=False)
print(df)
1 1 1
0 1.0 4.0 7.0
0 2.0 5.0 3.5
0 3.0 6.0 9.0
I just had the same problem. I found this workaround to work:
df.transpose().fillna(df.mean(axis=1)).transpose()
I'm not sure though about the efficiency of this solution.