I have a pandas dataframe that I'm using to store network data; it looks like:
from_id, to_id, count
X, Y, 3
Z, Y, 4
Y, X, 2
...
I am trying to add a new column, inverse_count, which gets the count value for the row where the from_id and to_id are reversed from the current row.
I'm taking the following approach. I thought that it would be fast but it is much slower than I anticipated, and I can't figure out why.
def get_inverse_val(x):
# Takes the inverse of the index for a given row
# When passed to apply with axis = 1, the index becomes the name
try:
return df.loc[(x.name[1], x.name[0]), 'count']
except KeyError:
return 0
df = df.set_index(['from_id', 'to_id'])
df['inverse_count'] = df.apply(get_inverse_val, axis = 1)
Why not do a simple merge for this?
df = pd.DataFrame({'from_id': ['X', 'Z', 'Y'], 'to_id': ['Y', 'Y', 'X'], 'count': [3,4,2]})
pd.merge(
left = df,
right = df,
how = 'left',
left_on = ['from_id', 'to_id'],
right_on = ['to_id', 'from_id']
)
from_id_x to_id_x count_x from_id_y to_id_y count_y
0 X Y 3 Y X 2.0
1 Z Y 4 NaN NaN NaN
2 Y X 2 X Y 3.0
Here we merge from (from, to) -> (to, from) to get reversed matching pairs. In general, you should avoid using apply() as it's slow. (To understand why, realize that it is not a vectorized operation: pandas calls your Python function once per row.)
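If the goal is a single inverse_count column rather than the merged frame, one way to finish the merge above (a sketch, assuming each (from_id, to_id) pair occurs at most once so the left merge preserves the row order) is:
merged = pd.merge(
    left=df,
    right=df,
    how='left',
    left_on=['from_id', 'to_id'],
    right_on=['to_id', 'from_id'],
    suffixes=('', '_rev'),  # keep the left-hand column names unchanged
)
# count_rev is the count of the reversed pair; fill missing pairs with 0
df['inverse_count'] = merged['count_rev'].fillna(0).to_numpy()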
You can use .set_index twice to create two dataframes with opposite index orders and assign to create your inverse_count column.
df = (df.set_index(['from_id','to_id'])
.assign(inverse_count=df.set_index(['to_id','from_id'])['count'])
.reset_index())
from_id to_id count inverse_count
0 X Y 3 2.0
1 Z Y 4 NaN
2 Y X 2 3.0
Since the question was regarding speed, let's look at performance on a larger dataset:
Setup:
import pandas as pd
import string
import itertools
df = pd.DataFrame(list(itertools.permutations(string.ascii_uppercase, 2)), columns=['from_id', 'to_id'])
df['count'] = df.index % 25 + 1
print(df)
from_id to_id count
0 A B 1
1 A C 2
2 A D 3
3 A E 4
4 A F 5
.. ... ... ...
645 Z U 21
646 Z V 22
647 Z W 23
648 Z X 24
649 Z Y 25
Set_index:
%timeit (df.set_index(['from_id','to_id'])
.assign(inverse_count=df.set_index(['to_id','from_id'])['count'])
.reset_index())
6 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Merge (from Ben's answer):
%timeit pd.merge(
left = df,
right = df,
how = 'left',
left_on = ['from_id', 'to_id'],
right_on = ['to_id', 'from_id'] )
1.73 ms ± 57.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, it looks like the merge approach is the faster option.
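For completeness, a third vectorized option (just a sketch, assuming each (from_id, to_id) pair is unique so the lookup is unambiguous) is to reindex the counts by the reversed pairs:
s = df.set_index(['from_id', 'to_id'])['count']
reversed_pairs = pd.MultiIndex.from_arrays([df['to_id'], df['from_id']])
# NaN where no reversed row exists; positions line up with df's rows
df['inverse_count'] = s.reindex(reversed_pairs).to_numpy()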
I want to find the most common value for each group. UPDATE: If a group has both real values and NaNs, I want to drop the NaNs. I only want NaN when NaN is all the group contains.
Some of my groups have all of their data missing, and in these cases I would like the result to be missing data (NaN) as the most common value.
In these cases the DataFrame.groupby.agg(pd.Series.mode) function returns an empty categorical. What I want is NaN.
A toy example follows ...
data = """
Group, Value
A, 1
A, 1
A, 1
B, 2
C, 3
C,
C,
D,
D,
"""
from io import StringIO
df = (
pd.read_csv(StringIO(data),
skipinitialspace=True)
.astype('category')
)
df.groupby('Group')['Value'].agg(pd.Series.mode)
Which yields ...
A 1.0
B 2.0
C 3.0
D [], Categories (3, float64): [1.0, 2.0, 3.0]
Name: Value, dtype: object
My question: is there a way to get NaN, or to detect the empty categorical and turn it into NaN? UPDATE: Note that I cannot use dropna=False, as that would give me an incorrect answer for C above.
By way of context, my original DataFrame has 27 million rows, and my grouped frame has 6 million rows. So, I want to avoid slow solutions.
You can apply pd.Series.mode and then pd.to_numeric with errors="coerce":
x = df.groupby("Group")["Value"].agg(pd.Series.mode)
print(pd.to_numeric(x, errors="coerce"))
Prints:
Group
A 1.0
B 2.0
C 3.0
D NaN
Name: Value, dtype: float64
You can use a custom aggregation and check if isna().all():
df.groupby('Group')['Value'].agg(lambda x: x.mode() if not x.isna().all() else np.nan)
# Group
# A 1.0
# B 2.0
# C 3.0
# D NaN
# Name: Value, dtype: float64
Out of curiosity, timed with df = pd.concat([df] * 100000) (900,000 rows):
>>> def coerce(df):
... x = df.groupby("Group")["Value"].agg(pd.Series.mode)
... return pd.to_numeric(x, errors="coerce")
>>> %timeit coerce(df)
22.1 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> def isna(df):
... return df.groupby('Group')['Value'].agg(lambda x: x.mode() if not x.isna().all() else np.nan)
>>> %timeit isna(df)
20.9 ms ± 732 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
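If ties are possible (pd.Series.mode can return several values per group), a small variant of the custom aggregation that always yields a single scalar might look like this sketch, which picks the first (smallest) mode:
import numpy as np
df.groupby('Group')['Value'].agg(lambda s: s.mode().iat[0] if s.notna().any() else np.nan)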
I got a dataframe which looks like this:
np.random.seed(11)
df = pd.DataFrame({
'item_id': np.random.randint(1, 4, 10),
'item_type': [chr(x) for x in np.random.randint(65, 80, 10)],
'value1': np.round(np.random.rand(10)*30, 1),
'value2': np.round(np.random.randn(10)*30, 1),
'n': np.random.randint(100, size=10)
})
item_id item_type value1 value2 n
0 2 A 26.8 -39.2 59
1 1 N 25.7 -33.6 1
2 2 A 5.0 22.1 3
3 2 N 19.0 47.2 8
4 1 M 0.6 -0.9 87
5 2 N 3.5 -20.5 81
6 3 E 9.5 32.9 68
7 1 C 4.7 -9.3 72
8 2 M 22.8 21.8 32
9 1 B 24.5 46.5 78
I would like to transform this dataframe to have a single row for each item_id.
The columns should be aggregated by finding the weighted average of value1 and value2 (weighted by n), and combining categorical variable item_type if it is not unique. The end result looks like this:
item_type value1 value2
item_id
1 B/C/M/N 9.778571 11.955882
2 A/M/N 15.089071 -15.474317
3 E 9.500000 32.900000
What I have tried
This can be done with a custom function and using apply, like this one:
def func(x):
record = ['/'.join(sorted(x.item_type.unique()))]
total_rows = x.n.sum()
for c in ['value1', 'value2']:
record.append((x[c] * x.n / total_rows).sum())
return pd.Series(record, index=['item_type', 'value1', 'value2'])
%%timeit
df.groupby('item_id').apply(func)
6.95 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is for 10 records; my real dataframe has over 40 million records. I am searching for the most efficient way to do this before I start thinking about going parallel. All other operations I've done on this dataframe take less than a minute, but this one is slow.
Any ideas appreciated!
Not sure, but it may be worth a try. We can do a similar operation on a window using transform and then group the results; string operations are quite slow in general:
(df.groupby("item_id")['item_type'].agg(lambda x: '/'.join(sorted(x.unique())))
.to_frame()
.join(
df[['value1','value2']].mul(df['n'],0).div(
df.groupby("item_id")['n'].transform('sum'),0)
.groupby(df['item_id']).sum()))
Or :
cols = ['value1','value2']
(df.groupby("item_id")['item_type'].agg(lambda x: '/'.join(sorted(x.unique())))
.to_frame().join
(pd.DataFrame(((df[cols].to_numpy() * df['n'].to_numpy()[:,None])/
df.groupby("item_id")['n'].transform('sum').to_numpy()[:,None]),
index=df.index,columns=cols)
.groupby(df['item_id']).sum()))
item_type value1 value2
item_id
1 B/C/M/N 9.778571 11.955882
2 A/M/N 15.089071 -15.474317
3 E 9.500000 32.900000
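For reference, the same idea can also be written as a plain sum-then-divide sketch without transform (assuming the column names from the question; the w1/w2 helper columns are only illustrative):
g = df.assign(w1=df['value1'] * df['n'], w2=df['value2'] * df['n']).groupby('item_id')
out = pd.DataFrame({
    'item_type': g['item_type'].agg(lambda s: '/'.join(sorted(s.unique()))),
    'value1': g['w1'].sum() / g['n'].sum(),
    'value2': g['w2'].sum() / g['n'].sum(),
})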
What about using NumPy functions only:
def numpy_func(group):
n = group['n'].values
    item_type = '/'.join(np.unique(group["item_type"].values))  # np.unique returns sorted unique values
value1 = np.average(group["value1"].values, weights = n)
value2 = np.average(group["value2"].values, weights = n)
return pd.Series([item_type, value1, value2], index=['item_type', 'value1', 'value2'])
df.groupby("item_id").apply(numpy_func)
Execution time comparison
I compared your function with mine:
from datetime import datetime
before = datetime.now()
for i in range(1000):
df.groupby("item_id").apply(numpy_func)
after = datetime.now()
print(after - before)
#0:00:03.954935
before = datetime.now()
for i in range(1000):
df.groupby("item_id").apply(func)
after = datetime.now()
print(after - before)
#0:00:06.307923
It was about a third faster.
I have a DataFrame which has three columns: nums with some values to work with, b, which is always either 1 or 0, and result, which is currently zero everywhere except in the first row (because we must have an initial value to work with).
The dataframe looks like this:
nums b result
0 20.0 1 20.0
1 22.0 0 0
2 30.0 1 0
3 29.1 1 0
4 20.0 0 0
...
The Problem
I'd like to go over each row in the dataframe starting with the second row, do some calculation, and store the result in the result column. Since I'm working with large files, I need a way to make this operation fast, which is why I want something like apply.
The calculation I want to do is to take the nums and result values from the previous row; if the b column in the current row is 0, then I want (for example) to add that previous num and result, and if b is 1, I'd like to subtract them (for example).
What have I tried?
I tried using apply, but I couldn't access the previous row, and sadly it seems that even if I do manage to access the previous row, the dataframe won't update the result column until the end.
I also tried using a loop like the one below, but it's too slow for the large files I'm working with:
for i in range(1, len(df.index)):
row = df.index[i]
new_row = df.index[i - 1] # get index of previous row for "nums" and "result"
df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'], prev_num=df.loc[new_row, 'nums'], \
current_b=df.loc[row, 'b'])
some_calc_func looks like this (just a general example):
def some_calc_func(prev_result, prev_num, current_b):
if current_b == 1:
return prev_result * prev_num / 2
else:
return prev_num + 17
Please answer with respect to some_calc_func
If you want to keep the function some_calc_func and not use another library, you should not try to access each element individually on every iteration. Instead, use zip over the columns nums and b, with a shift between the two since you access nums from the previous row, and keep prev_res in memory at each iteration. Also, append to a list instead of to the dataframe, and assign the list to the column after the loop.
prev_res = df.loc[0, 'result'] #get first result
l_res = [prev_res] #initialize the list of results
# loop with zip to get both values at same time,
# use loc to start b at second row but not num
for prev_num, curren_b in zip(df['nums'], df.loc[1:, 'b']):
# use your function to calculate the new prev_res
prev_res = some_calc_func (prev_res, prev_num, curren_b)
# add to the list of results
l_res.append(prev_res)
# assign to the column
df['result'] = l_res
print (df) #same result than with your method
nums b result
0 20.0 1 20.0
1 22.0 0 37.0
2 30.0 1 407.0
3 29.1 1 6105.0
4 20.0 0 46.1
Now with a dataframe df of 5000 rows, I got:
%%timeit
prev_res = df.loc[0, 'result']
l_res = [prev_res]
for prev_num, curren_b in zip(df['nums'], df.loc[1:, 'b']):
prev_res = some_calc_func (prev_res, prev_num, curren_b)
l_res.append(prev_res)
df['result'] = l_res
# 4.42 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and with your original solution, it was ~750x slower
%%timeit
for i in range(1, len(df.index)):
row = df.index[i]
new_row = df.index[i - 1] # get index of previous row for "nums" and "result"
df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'], prev_num=df.loc[new_row, 'nums'], \
current_b=df.loc[row, 'b'])
#3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
EDIT: using another library called numba, provided the function some_calc_func can easily be compiled with Numba's jit decorator.
from numba import jit
import numpy as np

# decorate your function
@jit
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

# create a function to do your job
# numba likes numpy arrays
@jit
def with_numba(prev_res, arr_nums, arr_b):
    # array for results, initialized with the first result
    arr_res = np.zeros_like(arr_nums)
    arr_res[0] = prev_res
    # loop over the length of arr_b (b from the second row onwards)
    for i in range(len(arr_b)):
        # do the calculation and set the value in the result array
        prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
        arr_res[i + 1] = prev_res
    return arr_res
Finally, call it like this:
df['result'] = with_numba(df.loc[0, 'result'],
df['nums'].to_numpy(),
df.loc[1:, 'b'].to_numpy())
And with timeit, I get another ~9x speed-up over my method with zip, and the speed-up could increase with the size of the data:
%timeit df['result'] = with_numba(df.loc[0, 'result'],
df['nums'].to_numpy(),
df.loc[1:, 'b'].to_numpy())
# 526 µs ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that using Numba might be problematic depending on your actual some_calc_func.
IIUC:
>>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
).fillna(df.result).cumsum()
>>> df
nums b result
0 20.0 1 20.0
1 22.0 0 42.0
2 30.0 1 12.0
3 29.1 1 -17.1
4 20.0 0 2.9
Explanation:
# replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
>>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
1 1
2 -1
3 -1
4 1
Name: b, dtype: int64
# multiply with nums
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
0 NaN
1 22.0
2 -30.0
3 -29.1
4 20.0
dtype: float64
# fill the 'NaN' with the corresponding value from df.result (which is 20 here)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
0 20.0
1 22.0
2 -30.0
3 -29.1
4 20.0
dtype: float64
# take the cumulative sum (cumsum)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
0 20.0
1 42.0
2 12.0
3 -17.1
4 2.9
dtype: float64
Given your requirement in the comments, I cannot think of a way to do this without a loop:
c1, c2 = 2, 1
l = [df.loc[0, 'result']] # store the first result in a list
# then loop over the series (df.b * df.nums)
for i, val in (df.b * df.nums).iteritems():
if i: # except for 0th index
if val == 0: # (df.b * df.nums) == 0 if df.b == 0
l.append(l[-1]) # append the last result
else: # otherwise apply the rule
t = l[-1] *c2 + val * c1
l.append(t)
>>> l
[20.0, 20.0, 80.0, 138.2, 138.2]
>>> df['result'] = l
nums b result
0 20.0 1 20.0
1 22.0 0 20.0
2 30.0 1 80.0 # [ 20 * 1 + 30 * 2]
3 29.1 1 138.2 # [ 80 * 1 + 29.1 * 2]
4 20.0 0 138.2
Seems fast enough; I did not test it on a large sample.
You have a function f(...) to apply, but you cannot apply it directly because you need to keep a memory of the previous row. You can do this either with a closure or with a class. Below is a class implementation:
import pandas as pd
class Func():
def __init__(self, value):
self._prev = value
self._init = True
def __call__(self, x):
if self._init:
res = self._prev
self._init = False
elif x.b == 0:
res = x.nums - self._prev
else:
res = x.nums + self._prev
self._prev = res
return res
#df = pd.read_clipboard()
f = Func(20)
df['result'] = df.apply(f, axis=1)
You can replace the logic inside __call__ with whatever you have in the some_calc_func body.
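A closure works too; here is a closure-based sketch of the same idea, written against the original some_calc_func (make_stepper and its internals are only illustrative):
def make_stepper(initial_result):
    state = {'prev_result': initial_result, 'prev_num': None, 'first': True}
    def step(row):
        if state['first']:
            res = state['prev_result']  # keep the given initial value for the first row
            state['first'] = False
        else:
            res = some_calc_func(prev_result=state['prev_result'],
                                 prev_num=state['prev_num'],
                                 current_b=row['b'])
        state['prev_result'] = res
        state['prev_num'] = row['nums']  # remember this row's nums for the next call
        return res
    return step

df['result'] = df.apply(make_stepper(df.loc[0, 'result']), axis=1)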
I realize this is what @Prodipta's answer was getting at, but this approach uses the global keyword instead to remember the previous result each iteration of apply:
prev_result = 20
def my_calc(row):
global prev_result
i = int(row.name) #the index of the current row
if i==0:
return prev_result
elif row['b'] == 1:
out = prev_result * df.loc[i-1,'nums']/2 #loc to get prev_num
else:
out = df.loc[i-1,'nums'] + 17
prev_result = out
return out
df['result'] = df.apply(my_calc, axis=1)
Result for your example data:
nums b result
0 20.0 1 20.0
1 22.0 0 37.0
2 30.0 1 407.0
3 29.1 1 6105.0
4 20.0 0 46.1
And here's a speed test a la @Ben T's answer - not the best but not the worst?
In[0]
df = pd.DataFrame({'nums':np.random.randint(0,100,5000),'b':np.random.choice([0,1],5000)})
prev_result = 20
%%timeit
df['result'] = df.apply(my_calc, axis=1)
Out[0]
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
re-using your loop and some_calc_func
I am using your loop and have reduced it to a bare minimum as below
for i in range(1, len(df)):
df.loc[i, 'result'] = some_calc_func(df.loc[i, 'b'], df.loc[i - 1, 'result'], df.loc[i, 'nums'])
and the some_calc_func is implemented as below
def some_calc_func(bval, prev_result, curr_num):
if bval == 0:
return prev_result + curr_num
else:
return prev_result - curr_num
The result is as below
nums b result
0 20.0 1 20.0
1 22.0 0 42.0
2 30.0 1 12.0
3 29.1 1 -17.1
4 20.0 0 2.9
This question shows how to count NAs in a dataframe for a particular column C. How do I count NAs for all columns (that aren't the groupby column)?
Here is some test code that doesn't work:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,2,2],
'b':[1,np.nan,2,np.nan],
'c':[1,np.nan,2,3]})
# result = df.groupby('a').isna().sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
# result = df.groupby('a').transform('isna').sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
result = df.isna().groupby('a').sum()
print(result)
# result:
# b c
# a
# False 2.0 1.0
result = df.groupby('a').apply(lambda _df: df.isna().sum())
print(result)
# result:
# a b c
# a
# 1 0 2 1
# 2 0 2 1
Desired output:
b c
a
1 1 1
2 1 0
It's always best to avoid groupby.apply in favor of the basic functions, which are cythonized, as this scales better with many groups and will lead to a great increase in performance. In this case, first call isnull() on the entire DataFrame, then groupby + sum.
df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
# b c
#a
#1 1 1
#2 1 0
To illustrate the performance gain:
import pandas as pd
import numpy as np
N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
'b': np.random.choice([1, np.nan], N),
'c': np.random.choice([1, np.nan], N)})
%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Your question already contains the answer (you mistyped _df as df):
result = df.groupby('a')['b', 'c'].apply(lambda _df: _df.isna().sum())
result
b c
a
1 1 1
2 1 0
Using apply with isna and sum, and selecting the correct columns so we don't get the unnecessary a column:
Note: apply can be slow; it's recommended to use one of the vectorized solutions instead, see the answers of WenYoBen, Anky or ALollz.
df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
Output
b c
a
1 1 1
2 1 0
Another way would be set_index() on a and groupby on the index and sum:
df.set_index('a').isna().groupby(level=0).sum()*1
Or:
df.set_index('a').isna().groupby(level=0).sum().astype(int)
Or without groupby, courtesy of @WenYoBen:
df.set_index('a').isna().sum(level=0).astype(int)
b c
a
1 1 1
2 1 0
I will do count and then rsub with value_counts: count gives the number of non-NA values per group, and subtracting those from the group sizes gives the NA counts. The reason I did not use apply is that it usually has bad performance.
df.groupby('a')[['b','c']].count().rsub(df.a.value_counts(dropna=False),axis=0)
Out[78]:
b c
1 1 1
2 1 0
Alternative
df.isna().drop('a',1).astype(int).groupby(df['a']).sum()
Out[83]:
b c
a
1 1 1
2 1 0
You need to drop the column after using apply.
df.groupby('a').apply(lambda x: x.isna().sum()).drop('a',1)
Output:
b c
a
1 1 1
2 1 0
Another quick-and-dirty option:
df.set_index('a').isna().astype(int).groupby(level=0).sum()
Output:
b c
a
1 1 1
2 1 0
You could write your own aggregation function as follows:
df.groupby('a').agg(lambda x: x.isna().sum())
which results in
b c
a
1 1.0 1.0
2 1.0 0.0
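If you prefer integer counts, cast afterwards; a minor variation of the same idea:
df.groupby('a').agg(lambda x: x.isna().sum()).astype(int)
   b  c
a
1  1  1
2  1  0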
I have a problem getting the rolling function of Pandas to do what I wish. I want, for each row, to calculate the maximum so far within its group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
id value
0 1 3
1 1 6
2 1 3
3 2 2
4 2 1
Now I wish to obtain the following DataFrame:
id value
0 1 3
1 1 6
2 1 6
3 2 2
4 2 2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
id value new
0 1 3 3
1 1 6 6
2 1 3 6
3 2 2 2
4 2 1 2
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from Pandas 0.20.0 what's new
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
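If you specifically want the window API the question was reaching for, expanding() is the growing-window equivalent of rolling; a sketch (cummax remains the more direct and faster route):
new = (df.groupby('id')['value']
         .expanding().max()  # running max within each group (returned as floats)
         .droplevel(0))      # drop the group level so it aligns back to df's index
df['new'] = new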
Using apply will be a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop