I'm building a time series and trying to find a more efficient way to do this, ideally vectorized.
The pandas apply with a list comprehension step is very slow on a big data set.
import datetime
import pandas as pd
# Dummy data:
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=4, freq='D')
categories = list(2*'A') + list(2*'B')
d = {'xdate': xdates, 'periods': [8]*2 + [2]*2, 'interval': [3]*2 + [12]*2}
df = pd.DataFrame(d,index=categories)
# This step is slow:
df['sdates'] = df.apply(lambda x: [x.xdate + pd.DateOffset(months=k*x.interval) for k in range(x.periods)], axis=1)
# This step is quite quick, but shown here for completeness
df = df.explode('sdates')
Maybe something like this:
df['sdates'] = [df.xdate + df.periods * [df.interval.astype('timedelta64[M]')]]
but the syntax isn't quite right.
This code
df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdate'] = df.xdate + df.m_offsets * pd.DateOffset(months=1)
is similar, I think, to one of the answers, but the last step, the pd.DateOffset addition, gives a warning:
PerformanceWarning: Adding/subtracting array of DateOffsets to DatetimeArray not vectorized
I tried building something along the lines of one answer, but as mentioned the modular arithmetic needs a lot of tweaking to deal with edge cases, and I haven't figured that out yet (calendar.monthrange wasn't playing nicely).
This function doesn't run:
from calendar import monthrange
def add_months(df, date_col, n_col):
    """ Adds n_col months to date_col """
    z = df.copy()
    # calculate new year/month/day and convert to datetime
    z['year'] = (z[date_col].dt.year * 12 + (z[date_col].dt.month - 1) + z[n_col]) // 12
    z['month'] = ((z[date_col].dt.month + z[n_col] - 1) % 12) + 1
    x, x = monthrange(z.year, z.month)
    z['days_in_month'] = monthrange(z.year, z.month)
    z['target_day'] = z[date_col].dt.day
    # z['day'] = min(z.target_day, z.days_in_month)
    z['day'] = z.days_in_month
    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])
    return z['sdates']
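(I think the pieces that fail are the monthrange call, which only accepts scalars, and the builtin min over Series; Series-friendly replacements for those two lines, along the lines of the answers below, would presumably be something like:)
z['days_in_month'] = pd.to_datetime(
    z['year'].astype(str) + '-' + z['month'].astype(str) + '-01').dt.days_in_month
z['day'] = np.minimum(z['target_day'], z['days_in_month'])   # assumes numpy is imported as np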
This works, for now, but the DateOffset is a really heavy step.
df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdates'] = df.apply(lambda x: x.xdate + pd.DateOffset(months=x.m_offsets), axis=1)
Here's one option. You're adding months, so we can actually calculate new year/month/day by only dealing with integers in a vectorized way, and then create datetime from these y/m/d combinations:
def f_proposed(df):
    z = df.copy()
    z = z.reset_index()
    # repeat xdate as many times as the number of periods
    z = z.loc[np.repeat(z.index, z['periods'])]
    # calculate k number of months to add
    z['k'] = z.groupby(level=0).cumcount() * z['interval']
    # calculate new year/month/day and convert to datetime
    z['year'] = (z['xdate'].dt.year * 12 + z['xdate'].dt.month - 1 + z['k']) // 12
    z['month'] = (z['xdate'].dt.month - 1 + z['k']) % 12 + 1
    # clip day to days_in_month
    z['days_in_month'] = pd.to_datetime(
        z['year'].astype(str) + '-' + z['month'].astype(str) + '-01').dt.days_in_month
    z['day'] = np.clip(z['xdate'].dt.day, 0, z['days_in_month'])
    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])
    # drop temporary columns
    z = z.set_index('index').drop(columns=['k', 'year', 'month', 'day', 'days_in_month'])
    return z
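For reference, a quick way to try it on the dummy data from the question (the sdates values should line up with the apply + explode result there):
df = pd.DataFrame(d, index=categories)
res = f_proposed(df)          # one output row per (row, period) combination
print(res[['xdate', 'interval', 'sdates']].head())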
To compare performance with the original, I've generated a test dataset with 10,000 rows.
Here are my timings (~23x speedup for 10K):
%timeit f_proposed(z)
82.7 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f_original(z)
1.92 s ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
P.S. For 170K it takes about 1.39s with f_proposed and 33.6 s with f_original on my machine
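(For completeness, f_original in the timings above is assumed to be the question's apply + explode approach wrapped in a function, roughly:)
def f_original(df):
    z = df.copy()
    z['sdates'] = z.apply(
        lambda x: [x.xdate + pd.DateOffset(months=k * x.interval) for k in range(x.periods)],
        axis=1)
    return z.explode('sdates')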
Semi-vectorized way
As I say below, I don't think there is a pure vectorized way to add a variable, general DateOffset to a Series of Timestamps. @perl's solution works in the case where the DateOffset is an exact multiple of 1 month.
Now, adding a single constant DateOffset is vectorized, so we can use the following. It capitalizes on the fact that there is a limited set of distinct values for the date offset. It is also relatively fast, and it is correct for any DateOffset and dates:
n = df['periods'].values
period_no = np.repeat(n - n.cumsum(), n) + np.arange(n.sum())
z = pd.DataFrame(
np.repeat(df.reset_index().values, repeats=n, axis=0),
columns=df.reset_index().columns,
).set_index('index')
z = z.assign(madd=period_no * z['interval'])
z['sdates'] = z['xdate']
for madd in set(z['madd'].unique()):
    z.loc[z['madd'] == madd, 'sdates'] += pd.DateOffset(months=madd)
Timing:
# modified large dummy data:
N = 170_000
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=N, freq='H')
categories = np.random.choice(list('ABCDE'), N)
d = {'xdate': xdates, 'periods': np.random.randint(1,10,N), 'interval': np.random.randint(1,12,N)}
df = pd.DataFrame(d,index=categories)
%%time (the above)
CPU times: user 3.49 s, sys: 13.5 ms, total: 3.51 s
Wall time: 3.51 s
(Note: for 10K rows using the generation above, I see times of ~240ms, but of course it is dependent on how many distinct month offsets you have in your data).
Example result (for one draw of 170K rows as per above):
>>> z.tail()
xdate periods interval madd sdates
index
B 2040-08-25 06:00:00 8 8 48 2044-08-25 06:00:00
B 2040-08-25 06:00:00 8 8 56 2045-04-25 06:00:00
D 2040-08-25 07:00:00 3 2 0 2040-08-25 07:00:00
D 2040-08-25 07:00:00 3 2 2 2040-10-25 07:00:00
D 2040-08-25 07:00:00 3 2 4 2040-12-25 07:00:00
Correction on the initial answer
I stand corrected: my original answer is not vectorized either. The first part, exploding the DataFrame and building the number of months to add, is vectorized and very fast. But the second part, adding a DateOffset of a variable number of months, is not.
I hope I am wrong, but I don't think there is currently a way to do that second part in a vectorized way.
Direct date-parts manipulation (e.g. month = (month - 1 + n_months) % 12 + 1, etc.) is bound to fail for corner cases (e.g. '2021-02-31'). Short of replicating the logic used in DateOffset, this is not going to work for certain cases.
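To make the '2021-02-31' point concrete, a small illustration:
ts = pd.Timestamp('2021-01-31')
ts + pd.DateOffset(months=1)   # Timestamp('2021-02-28') -- DateOffset clips the day
# naive month arithmetic keeps day=31 and would try to build '2021-02-31':
# pd.Timestamp(year=2021, month=2, day=31)  # ValueError: day is out of range for month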
Initial answer
Here is a vectorized way:
n = df.periods.values
period_no = np.repeat(n - n.cumsum(), n) + np.arange(n.sum())
z = pd.DataFrame(
np.repeat(df.reset_index().values, repeats=n, axis=0),
columns=df.reset_index().columns,
).set_index('index').assign(period_no=period_no)
z['sdates'] = z['period_no'] * z['interval'] * pd.DateOffset(months=1) + z['xdate']
Related
I have a pandas dataframe with three columns structured like this:
Sample Start End
<string> <int> <int>
The values in "Start" and "End" are intervals of positions on a larger string (e.g. from position 9000 to 11000). My goal is to subdivide the larger string into windows of 10000 positions each, and count how many positions in each window are covered by the intervals in my dataframe.
For example, window 0:10000 would contain 1000 positions and window 10000:20000 would contain the other 1000 positions from interval 9000:11000.
To do this, I am first running a function to split these intervals into windows, such that if this is the input:
Sample Start End
A 2500 5000
A 9000 11000
A 18000 19500
Then this is the output:
Sample Start End W_start W_end
A 2500 5000 0 10000
A 9000 10000 0 10000
A 10000 11000 10000 20000
A 18000 19500 10000 20000
This is the function I'm doing it with, where df_sub is a row of the dataframe and w_size is the window size (10000):
def split_into_windows(df_sub, w_size):
    start, end = df_sub.Start, df_sub.End
    w_start = start - (start % w_size)
    w_end = w_start + w_size
    if (w_start <= start <= w_end) and (w_start <= end <= w_end):
        df_out = df_sub
    elif (w_start <= start <= w_end) and (end > w_end):
        out = []
        df_tmp = df_sub.copy()
        df_tmp.End = w_end
        out.append(df_tmp.copy())
        while (end > w_end):
            w_start += w_size
            w_end += w_size
            df_tmp.Start = max(start, w_start)
            df_tmp.End = min(end, w_end)
            out.append(df_tmp.copy())
        df_out = pd.DataFrame(out)
    return df_out
I'm calling the function with apply():
df = df.apply(split_into_windows, axis=1, args=(w_size,))
But I'm getting this error:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Looking online, I found that this issue seems to be related to pandas merge, but I am not using merge. I believe it may be related to the fact that some rows produce a single output Series, while others produce a small DataFrame (the split ones).
See here:
Sample A
Start 6928
End 9422
Sample Start End
0 A 9939 10000
1 A 10000 11090
Any tips on how to fix this?
Minimal dataset to reproduce: https://file.io/iZ3fguCFlRbq
EDIT #1:
I tried changing a line in the function to have a coherent output (i.e. returning dataframes only):
df_out = df_sub.to_frame().T
And now the apply() round "works", as in it throws no errors, but the output looks like this:
0 Sample Start End
0 A 0 6915
1 Sample Start End
0 A 6928 9422
2 Sample Start End
0 A 9939 10000
...
<class 'pandas.core.series.Series'>
EDIT #2:
I cannot use .iterrows(); it takes too long (estimate: weeks) with the size of dataframe I'm operating with.
EDIT #3:
Using multiprocessing like this got me through the day, but it is still a suboptimal solution compared to what I could achieve with a working apply() call and a pandas parallelization library such as pandarallel or swifter. Still looking for any tips :)
pool = mp.Pool(processes=48)
q = mp.Manager().Queue()
start = time.time()
for index, row in df_test.iterrows():
    pool.apply_async(split_into_windows, args=(row, w_size, q))
pool.close()
pool.join()
out = []
while not q.empty():
    out.append(q.get())
df = pd.DataFrame(out)
If I understand everything correctly, here is a possible solution:
import pandas as pd
window_step = 10000
# Get indices of the window for start and end (here, the end is inclusive).
df['start_loc'] = df['Start'] // window_step
df['end_loc'] = (df['End']-1) // window_step
# Build the intervals for the W_start and W_end columns for each row.
intervals = [list((s*window_step, (s+1)*window_step) for s in range(r[0], r[1]+1))
for r in zip(df['start_loc'], df['end_loc'])]
# Insert in df and explode the interval column to get extra rows.
df['interval'] = intervals
df = df.explode(column='interval')
# Split the interval in two columns.
df[['W_start', 'W_end']] = pd.DataFrame(df['interval'].tolist(), index=df.index)
# Correct the starts and ends that are wrong because duplicated with explode.
wrong_ends = df['End'].to_numpy() > df['W_end'].to_numpy()
df.loc[wrong_ends, 'End'] = df.loc[wrong_ends, 'W_end']
wrong_starts = df['Start'].to_numpy() < df['W_start'].to_numpy()
df.loc[wrong_starts, 'Start'] = df.loc[wrong_starts, 'W_start']
df = df.drop(columns=['start_loc', 'end_loc', 'interval'])
print(df)
Sample Start End W_start W_end
0 A 2500 5000 0 10000
1 A 9000 10000 0 10000
1 A 10000 11000 10000 20000
2 A 18000 19500 10000 20000
Then, from here, to calculate the number of positions included in each window you could do:
df['included_positions'] = df['End'] - df['Start']
sample_win_cnt = df.groupby(['Sample', 'W_start', 'W_end']).sum().drop(columns=['Start', 'End'])
print(sample_win_cnt)
included_positions
Sample W_start W_end
A 0 10000 3500
10000 20000 2500
Here I grouped by 'Sample' as well. I am not sure this is what you want. If not, you can also just group by 'W_start' and 'W_end'.
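For example (if the per-sample split is not needed, only the totals per window):
win_cnt = df.groupby(['W_start', 'W_end'])['included_positions'].sum()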
Output with the other example:
Input:
Sample Start End
0 A 9939 10000
1 A 10000 11090
Interval result:
Sample Start End W_start W_end
0 A 9939 10000 0 10000
1 A 10000 11090 10000 20000
Counts:
included_positions
Sample W_start W_end
A 0 10000 61
10000 20000 1090
I tested it on a DataFrame with >1M rows and it seemed to calculate the results in less than a second.
@user2246849's answer is great; I only think it's a little hard to follow when it comes to defining the intervals.
My suggestion is to work with a single row first and define a function that takes a row and returns its intervals. I mean, given df, take x = df.iloc[1] and build a function which returns [[0, 10_000], [10_000, 20_000]] (see the single-row check right after the function below).
import pandas as pd
df = pd.DataFrame(
{'Sample': {0: 'A', 1: 'A', 2: 'A'},
'Start': {0: 2500, 1: 9000, 2: 18000},
'End': {0: 5000, 1: 11000, 2: 19500}})
def get_intervals(x, window_step):
    out = [
        [i * window_step,
         (i + 1) * window_step]
        for i in range(
            x["Start"] // window_step,
            (x["End"] - 1) // window_step + 1)]
    return out
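For example, checking it on a single row first, as suggested above:
window_step = 10_000
x = df.iloc[1]                       # Sample A, Start 9000, End 11000
get_intervals(x, window_step)        # [[0, 10000], [10000, 20000]]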
And we assign the intervals with an apply:
df["intervals"] = df.apply(
    lambda x: get_intervals(x, window_step), axis=1)
which returns:
Sample Start End intervals
0 A 2500 5000 [[0, 10000]]
1 A 9000 11000 [[0, 10000], [10000, 20000]]
2 A 18000 19500 [[10000, 20000]]
From now on you can follow the other answer.
My algorithm's runtime jumped from 35 seconds to 15 minutes when I implemented this feature over a daily timeframe. The algo retrieves daily history in bulk and iterates over a subset of the dataframe (from t0 to tX, where tX is the current row of the iteration). It does this to emulate what would happen during the real-time operation of the algo. I know there are ways of improving it by utilizing memory between frame calculations, but I was wondering if there is a more pandas-ish implementation that would see an immediate benefit.
Assume that self.Step is something like 0.00001 and self.Precision is 5; they are used for binning the OHLC bar information into discrete steps for the sake of finding the POC. _frame is a subset of rows of the entire dataframe, and _low/_high are respective to that. The following block of code executes on the entire _frame, which could be upwards of ~250 rows every time a new row is added by the algo (when calculating the yearly timeframe on daily data). I believe it's the iterrows that's causing the major slowdown. The dataframe has columns such as high, low, open, close and volume. I am calculating time-price opportunity and the volume point of control.
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - self.Step, _high + self.Step, self.Step), decimals=self.Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
    _prices = np.around(np.arange(state.low, state.high, self.Step), decimals=self.Precision)
    # Evenly distribute the bar's volume over its range
    volume_prices[_prices] += state.volume / _prices.size
    # Increment time at price
    time_prices[_prices] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax()) / 2
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax()) / 2
You can use this function as a base and adjust it:
def f(x):  # function to find the POC price and volume
    a = x['tradePrice'].value_counts().index[0]
    b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()
    return pd.Series([a, b], ['POC_Price', 'POC_Volume'])
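A usage sketch: with a trade-level DataFrame that has tradePrice and tradeVolume columns (as the function assumes), you can get one POC per group; the trades frame and its session column here are hypothetical stand-ins for your own data:
poc = trades.groupby('session').apply(f)   # DataFrame with POC_Price and POC_Volume per session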
Here's what I worked out. I'm still not sure the answer your code is producing is correct; I think your line volume_prices[_prices] += state.Volume / _prices.size is not being applied to every record in volume_prices, but here it is with benchmarking. About a 9x improvement.
def vpOriginal():
    Step = 0.00001
    Precision = 5
    _frame = getData()
    _low = 85.0
    _high = 116.4
    # Set the complete index of prices +/- 1 step due to weird floating point precision issues
    volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
    time_prices = volume_prices.copy()
    time_prices2 = volume_prices.copy()
    for index, state in _frame.iterrows():
        _prices = np.around(np.arange(state.Low, state.High, Step), decimals=Precision)
        # Evenly distribute the bar's volume over its range
        volume_prices[_prices] += state.Volume / _prices.size
        # Increment time at price
        time_prices[_prices] += 1
        time_prices2 += 1
    # Pandas only returns the 1st row of the max value,
    # so we need to reverse the series to find the other side
    # and then find the average price between those two extremes
    # print(volume_prices.head(10))
    volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
    time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
    return volume_poc, time_poc
def vpNoDF():
    Step = 0.00001
    Precision = 5
    _frame = getData()
    _low = 85.0
    _high = 116.4
    # Set the complete index of prices +/- 1 step due to weird floating point precision issues
    volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
    time_prices = volume_prices.copy()
    for index, state in _frame.iterrows():
        _prices = np.around((state.High - state.Low) / Step, 0)
        # Evenly distribute the bar's volume over its range
        volume_prices.loc[state.Low:state.High] += state.Volume / _prices
        # Increment time at price
        time_prices.loc[state.Low:state.High] += 1
    # Pandas only returns the 1st row of the max value,
    # so we need to reverse the series to find the other side
    # and then find the average price between those two extremes
    volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
    time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
    return volume_poc, time_poc
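(getData() isn't defined above; to reproduce the benchmark, a stand-in that hard-codes just the High/Low/Volume columns of the frame printed below would do — this is an assumption about what getData returns:)
def getData():
    # Stand-in for getData(): rebuilds only the columns the benchmark uses,
    # taken from the 10-row frame shown below.
    return pd.DataFrame({
        'High': [116.40, 110.53, 100.00, 95.80, 96.33, 101.50, 98.78, 106.50, 108.79, 112.36],
        'Low': [103.14, 101.02, 85.00, 86.60, 85.68, 88.95, 87.54, 94.65, 100.00, 107.39],
        'Volume': [70749800, 54967000, 79260700, 57763700, 78847900,
                   67099000, 75264900, 81942800, 57477300, 46303000],
    })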
getData()
Out[8]:
Date Open High Low Close Volume Adj Close
0 2008-10-14 116.26 116.40 103.14 104.08 70749800 104.08
1 2008-10-13 104.55 110.53 101.02 110.26 54967000 110.26
2 2008-10-10 85.70 100.00 85.00 96.80 79260700 96.80
3 2008-10-09 93.35 95.80 86.60 88.74 57763700 88.74
4 2008-10-08 85.91 96.33 85.68 89.79 78847900 89.79
5 2008-10-07 100.48 101.50 88.95 89.16 67099000 89.16
6 2008-10-06 91.96 98.78 87.54 98.14 75264900 98.14
7 2008-10-03 104.00 106.50 94.65 97.07 81942800 97.07
8 2008-10-02 108.01 108.79 100.00 100.10 57477300 100.10
9 2008-10-01 111.92 112.36 107.39 109.12 46303000 109.12
vpOriginal()
Out[9]: (142.55000000000001, 142.55000000000001)
vpNoDF()
Out[10]: (142.55000000000001, 142.55000000000001)
%timeit vpOriginal()
2.79 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vpNoDF()
300 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've managed to get it down to 2 minutes instead of 15, on daily timeframes anyway. It's still slow on lower timeframes (10 minutes on hourly data over a 2-year period with a precision of 2 for equities). Working with DataFrames as opposed to Series was FAR slower. I'm hoping for more, but I don't know what else I can do aside from the following solution:
# Upon class instantiation, I've created attributes for each timeframe
# related to `volume_at_price` and `time_at_price`. They serve as memory
# in between frame calculations
def _prices_at(self, frame, bars=0):
    # Include 1 step above high as np.arange does not
    # include the upper limit by default
    state = frame.iloc[-min(bars + 1, frame.index.size)]
    bins = np.around(np.arange(state.low, state.high + self.Step, self.Step), decimals=self.Precision)
    return pd.Series(state.volume / bins.size, index=bins)
# SetFeature/Feature implement timeframed attributes (i.e., 'volume_at_price_D')
_v = 'volume_at_price'
_t = 'time_at_price'
# Add to x_at_price histogram
_p = self._prices_at(frame)
self.SetFeature(_v, self.Feature(_v).add(_p, fill_value=0))
self.SetFeature(_t, self.Feature(_t).add(_p * 0 + 1, fill_value=0))
# Remove old data from histogram
_p = self._prices_at(frame, self.Bars)
v = self.SetFeature(_v, self.Feature(_v).subtract(_p, fill_value=0))
t = self.SetFeature(_t, self.Feature(_t).subtract(_p * 0 + 1, fill_value=0))
self.SetFeature('volume_poc', (v.idxmax() + v.iloc[::-1].idxmax()) / 2)
self.SetFeature('time_poc', (t.idxmax() + t.iloc[::-1].idxmax()) / 2)
Essentially I have data which provides a start time, the number of time slots and the duration of each slot.
I want to convert that into a dataframe of start and end times - which I've achieved, but I can't help but think it is not efficient or particularly pythonic.
The real data has multiple IDs, hence the grouping.
import pandas as pd
slots = pd.DataFrame({"ID": 1, "StartDate": pd.to_datetime("2019-01-01 10:30:00"), "Quantity": 3, "Duration": pd.to_timedelta(30, unit="minutes")}, index=[0])
grp_data = slots.groupby("ID")
bob = []
for rota_id, row in grp_data:
    start = row.iloc[0, 1]
    delta = row.iloc[0, 3]
    for quantity in range(1, int(row.iloc[0, 2] + 1)):
        data = {"RotaID": rota_id,
                "DateStart": start,
                "Duration": delta,
                "DateEnd": start + delta}
        bob.append(data)
        start = start + delta
fred = pd.DataFrame(bob)
This might be answered elsewhere, but I've no idea how to properly search for this since I'm not sure what my problem is.
EDIT: I've updated my code to be more efficient with its function calls and it is faster, but I'm still interested in knowing whether there is a vectorised approach to this.
How about this way:
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]
# Add a counter per ID; used to 'shift' the duration along StartDate
slots_ext['counter'] = slots_ext.groupby('ID').cumcount()
# Calculate DateStart and DateEnd based on counter and Duration
slots_ext['DateStart'] = (slots_ext.counter) * slots_ext.Duration.values + slots_ext.StartDate
slots_ext['DateEnd'] = (slots_ext.counter + 1) * slots_ext.Duration.values + slots_ext.StartDate
slots_ext.loc[:, ['ID', 'DateStart', 'Duration', 'DateEnd']].reset_index(drop=True)
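A quick sanity check on the question's single-ID example (expected clock times worked out by hand):
out = slots_ext[['ID', 'DateStart', 'Duration', 'DateEnd']].reset_index(drop=True)
assert list(out['DateStart'].dt.strftime('%H:%M')) == ['10:30', '11:00', '11:30']
assert list(out['DateEnd'].dt.strftime('%H:%M')) == ['11:00', '11:30', '12:00']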
Performance
Looking at performance on a larger dataframe (duplicated 1000 times) using
slots_large = pd.concat([slots] * 1000, ignore_index=True).drop('ID', axis=1).reset_index().rename(columns={'index': 'ID'})
Yields:
Old method: 289 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
New method: 8.13 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In case this ever helps anyone:
I found that my data set had varying deltas per ID, and @RubenB's initial answer doesn't handle those. Here was my final solution based on their code:
# RubenB's code
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]
# Calculate the cumulative sum of the delta per rota ID
slots_ext["delta_sum"] = slots_ext.groupby("ID")["Duration"].cumsum()
slots_ext["delta_sum"] = pd.to_timedelta(slots_ext["delta_sum"], unit="minutes")
# Use the cumulative sum to calculate the running end dates and then the start dates
first_value = slots_ext.StartDate[0]
slots_ext["EndDate"] = slots_ext.delta_sum.values + slots_ext.StartDate
slots_ext["StartDate"] = slots_ext.EndDate.shift(1)
slots_ext.loc[0, "StartDate"] = first_value
slots_ext.reset_index(drop=True, inplace=True)
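Note: since each row's end minus its own duration is exactly its start, the shift(1) plus first-row fix above could also be replaced by a single groupby-safe line (a sketch, assuming Duration is already a timedelta as in the example data):
slots_ext["StartDate"] = slots_ext["EndDate"] - slots_ext["Duration"]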
Other questions attempting to provide the Python equivalent of R's sweep function (like here) do not really address the case of multiple arguments, where it is most useful.
Say I wish to apply a 2 argument function to each row of a Dataframe with the matching element from a column of another DataFrame:
df = data.frame("A" = 1:3,"B" = 11:13)
df2= data.frame("X" = 10:12,"Y" = 10000:10002)
sweep(df,1, FUN="*",df2$X)
In Python I got the equivalent using apply on what is basically a loop over the row positions.
df = pd.DataFrame( { "A" : range(1,4),"B" : range(11,14) } )
df2 = pd.DataFrame( { "X" : range(10,13),"Y" : range(10000,10003) } )
pd.Series(range(df.shape[0])).apply(lambda row_count: np.multiply(df.iloc[row_count,:],df2.iloc[row_count,df2.columns.get_loc('X')]))
I highly doubt this is efficient in pandas; what is a better way of doing this?
Both bits of code should result in a DataFrame/matrix of 6 numbers when applying *:
A B
1 10 110
2 22 132
3 36 156
I should state clearly that the aim is to insert one's own function into this sweep-like behavior, say:
df = data.frame("A" = 1:3,"B" = 11:13)
df2= data.frame("X" = 10:12,"Y" = 10000:10002)
myFunc = function(a,b) { floor((a + b)^min(a/2,b/3)) }
sweep(df,1, FUN=myFunc,df2$X)
resulting in:
A B
[1,] 3 4
[2,] 3 4
[3,] 3 5
What is a good way of doing that in python pandas?
If I understand this correctly, you are looking to apply a binary function f(x,y) to a dataframe (for the x) row-wise, with arguments from a series for y. One way to do this is to borrow the implementation from pandas internals itself. If you want to extend this function (e.g. to apply along columns), it can be done in a similar manner, as long as f is binary. If you need more arguments, you can simply do a partial on f to make it binary.
import pandas as pd
from pandas.core.dtypes.generic import ABCSeries
def sweep(df, series, FUN):
    assert isinstance(series, ABCSeries)
    # row-wise application
    assert len(df) == len(series)
    return df._combine_match_index(series, FUN)
# define your binary operator
def f(x, y):
    return x * y
# the input data frames
df = pd.DataFrame( { "A" : range(1,4),"B" : range(11,14) } )
df2 = pd.DataFrame( { "X" : range(10,13),"Y" : range(10000,10003) } )
# apply
test1 = sweep(df, df2.X, f)
# performance
# %timeit sweep(df, df2.X, f)
# 155 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)#
# another method
import numpy as np
test2 = pd.Series(range(df.shape[0])).apply(lambda row_count: np.multiply(df.iloc[row_count,:],df2.iloc[row_count,df2.columns.get_loc('X')]))
# %timeit performance
# 1.54 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
assert all(test1 == test2)
Hope this helps.
In pandas
df.mul(df2.X,axis=0)
A B
0 10 110
1 22 132
2 36 156
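For the custom myFunc case from the question there is no single DataFrame method, but because the operations involved are elementwise, NumPy broadcasting works; a sketch (assuming, as with R's sweep, that the DataFrame element is a and the swept value from df2.X is b):
import numpy as np
a = df.to_numpy(dtype=float)                  # shape (3, 2)
b = df2['X'].to_numpy(dtype=float)[:, None]   # shape (3, 1), broadcast across columns
res = pd.DataFrame(np.floor((a + b) ** np.minimum(a / 2, b / 3)),
                   index=df.index, columns=df.columns)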
I have a data frame that contains a group ID, two distance measures (longitude/latitude type measure), and a value. For a given set of distances, I want to find the number of other groups nearby, and the average values of those other groups nearby.
I've written the following code, but it is so inefficient that it simply does not complete in a reasonable time for very large data sets. The calculation of the number of nearby groups is quick, but the calculation of the average value of nearby groups is extremely slow. Is there a better way to make this more efficient?
distances = [1,2]
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)),
columns=['Group','Dist1','Dist2','Value'])
# get one row per group, with the two distances for each row
df_groups = df.groupby('Group')[['Dist1','Dist2']].mean()
# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1','Dist2']])
# find points within a given radius
for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    # get average values of nearby groups
    for idx, val in enumerate(df_groups.index):
        val_idx = df_groups.iloc[closeby[idx]].index.values
        mean = df.loc[df['Group'].isin(val_idx), 'Value'].mean()
        df_groups.loc[val, str(i) + '_mean_values'] = mean
    # merge back to dataframe
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)
It's clear that the problem is indexing into the main dataframe with the isin method: as the dataframe grows in length, a much larger search has to be done. I propose you do that same search on the smaller df_groups data frame and calculate an updated average instead.
df = pd.DataFrame(np.random.randint(0,100,size=(100000, 4)),
columns=['Group','Dist1','Dist2','Value'])
distances = [1,2]
# get means of all values and count, the totals for each sample
df_groups = df.groupby('Group')[['Dist1','Dist2','Value']].agg({'Dist1':'mean','Dist2':'mean',
'Value':['mean','count']})
# remove multicolumn index
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]
#Rename columns
df_groups.rename(columns={'Dist1 mean':'Dist1','Dist2 mean':'Dist2','Value mean':'Value','Value count':
'Count'},inplace=True)
# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1','Dist2']])
for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    # create column to look for subsets
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]
    # set this column to prep updated mean calculation
    df_groups['ComMean'] = df_groups['Value'] * df_groups['Count']
    # perform updated mean
    df_groups[str(i) + '_mean_values'] = [(df_groups.loc[df_groups.index.isin(row), 'ComMean'].sum() /
                                           df_groups.loc[df_groups.index.isin(row), 'Count'].sum())
                                          for row in df_groups['subs']]
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)
The formula for an updated (pooled) mean is just (m1*n1 + m2*n2) / (n1 + n2).
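A tiny check of that identity:
import numpy as np
g1, g2 = np.array([2., 2., 2.]), np.array([5.])
pooled = (g1.mean() * g1.size + g2.mean() * g2.size) / (g1.size + g2.size)
assert pooled == np.concatenate([g1, g2]).mean()   # both come out to 2.75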
old setup
100000 rows
%timeit old(df)
1 loop, best of 3: 694 ms per loop
1000000 rows
%timeit old(df)
1 loop, best of 3: 6.08 s per loop
10000000 rows
%timeit old(df)
1 loop, best of 3: 6min 13s per loop
new setup
100000 rows
%timeit new(df)
10 loops, best of 3: 136 ms per loop
1000000 rows
%timeit new(df)
1 loop, best of 3: 525 ms per loop
10000000 rows
%timeit new(df)
1 loop, best of 3: 4.53 s per loop