I have a data frame that contains a group ID, two distance measures (longitude/latitude-type measures), and a value. For a given set of distances, I want to find the number of other groups nearby and the average value of those nearby groups.
I've written the following code, but it is so inefficient that it does not complete in a reasonable time for very large data sets. Counting the nearby groups is quick, but calculating their average value is extremely slow. Is there a better way to make this more efficient?
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

distances = [1, 2]
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])

# get one row per group, with the two distances for each row
df_groups = df.groupby('Group')[['Dist1', 'Dist2']].mean()

# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1', 'Dist2']])

# find points within a given radius
for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    # get average values of nearby groups
    for idx, val in enumerate(df_groups.index):
        val_idx = df_groups.iloc[closeby[idx]].index.values
        mean = df.loc[df['Group'].isin(val_idx), 'Value'].mean()
        df_groups.loc[val, str(i) + '_mean_values'] = mean
    # merge back to dataframe
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)
It's clear that the problem is the indexing of the main dataframe with the isin method: as the dataframe grows in length, a much larger search has to be done. I propose doing that same search on the smaller df_groups data frame and calculating an updated (pooled) average instead.
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])
distances = [1, 2]

# get the mean of the distances plus the mean and count of Value for each group
df_groups = df.groupby('Group')[['Dist1', 'Dist2', 'Value']].agg({'Dist1': 'mean', 'Dist2': 'mean',
                                                                  'Value': ['mean', 'count']})
# flatten the multi-level column index
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]
# rename columns
df_groups.rename(columns={'Dist1 mean': 'Dist1', 'Dist2 mean': 'Dist2',
                          'Value mean': 'Value', 'Value count': 'Count'}, inplace=True)

# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1', 'Dist2']])

for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    # create column holding the neighbouring group labels for each group
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]
    # precompute mean * count so the pooled mean can be formed from sums
    df_groups['ComMean'] = df_groups['Value'] * df_groups['Count']
    # perform the updated (pooled) mean over each group's neighbours
    df_groups[str(i) + '_mean_values'] = [
        df_groups.loc[df_groups.index.isin(row), 'ComMean'].sum() /
        df_groups.loc[df_groups.index.isin(row), 'Count'].sum()
        for row in df_groups['subs']]
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)
The formula for an updated (pooled) mean is just (m1*n1 + m2*n2)/(n1 + n2).
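As a quick numeric check of that pooled-mean identity (made-up numbers, not from the data above):
# group 1: mean 10 over 3 values; group 2: mean 20 over 1 value
m1, n1, m2, n2 = 10, 3, 20, 1
pooled = (m1*n1 + m2*n2) / (n1 + n2)        # (30 + 20) / 4 = 12.5
assert pooled == sum([10, 10, 10, 20]) / 4  # same as the mean of the raw values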
old setup
100000 rows
%timeit old(df)
1 loop, best of 3: 694 ms per loop
1000000 rows
%timeit old(df)
1 loop, best of 3: 6.08 s per loop
10000000 rows
%timeit old(df)
1 loop, best of 3: 6min 13s per loop
new setup
100000 rows
%timeit new(df)
10 loops, best of 3: 136 ms per loop
1000000 rows
%timeit new(df)
1 loop, best of 3: 525 ms per loop
10000000 rows
%timeit new(df)
1 loop, best of 3: 4.53 s per loop
Related
I am working on route-building code and have half a million records, which take around 3-4 hours to process.
To create the dataframe:
import pandas as pd

# initialize list of lists
data = [[['1027', '(K)', 'TRIM']], [['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['route'])
It will look something like this:
route
[1027, (K), TRIM]
[SJCL, (K), EJ00, (K), ZQFC, (K), DYWH]
Code I have used (the expected output is shown after it):
def func_list(hd1):
    required_list = []
    for j, i in enumerate(hd1):
        if j == 0:
            req = i
        else:
            if i[0].isupper() or i[0].isdigit():
                required_list.append(req)
                req = i
            else:
                req = req + i
    required_list.append(req)
    return required_list

df['route2'] = df['route'].apply(func_list)
#op
route2
[1027(K), TRIM]
[SJCL(K), EJ00(K), ZQFC(K), DYWH]
For half a million rows this takes 3-4 hours. I don't know how to reduce it, please help.
Use explode to flatten your dataframe:
sr1 = df['route'].explode()
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)
df['route'] = sr2[sr1.eq(sr2).shift(-1, fill_value=True)].groupby(level=0).apply(list)
print(df)
# Output:
                               route
0                    [1027(K), TRIM]
1  [SJCL(K), EJ00(K), ZQFC(K), DYWH]
For 500K records:
7.46 s ± 97.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
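To make the shift trick easier to follow, here is a minimal sketch (assuming the same imports and just the first sample row) showing the intermediate series it builds:
import numpy as np
import pandas as pd

df = pd.DataFrame({'route': [['1027', '(K)', 'TRIM']]})
sr1 = df['route'].explode()   # one element per row: 1027, (K), TRIM
# glue any '('-prefixed element onto the element before it
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)
# keep a row only if the next row was not glued onto it, then rebuild the lists
mask = sr1.eq(sr2).shift(-1, fill_value=True)
print(sr2[mask].groupby(level=0).apply(list))   # 0    [1027(K), TRIM]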
I'm building a time series and trying to find a more efficient way to do this, ideally vectorized.
The pandas apply with list comprehension step is very slow on a big data set.
import datetime
import pandas as pd
# Dummy data:
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=4, freq='D')
categories = list(2*'A') + list(2*'B')
d = {'xdate': xdates, 'periods': [8]*2 + [2]*2, 'interval': [3]*2 + [12]*2}
df = pd.DataFrame(d,index=categories)
# This step is slow:
df['sdates'] = df.apply(lambda x: [x.xdate + pd.DateOffset(months=k*x.interval) for k in range(x.periods)], axis=1)
# This step is quite quick, but shown here for completeness
df = df.explode('sdates')
Maybe something like this:
df['sdates'] = [df.xdate + df.periods * [df.interval.astype('timedelta64[M]')]]
but the syntax isn't quite right.
This code
df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdate'] = df.xdate + df.m_offsets * pd.DateOffset(months=1)
is, I think, similar to one of the answers, but the last step, pd.DateOffset, gives a warning:
PerformanceWarning: Adding/subtracting array of DateOffsets to DatetimeArray not vectorized
I tried building something along the lines of one answer, but as mentioned the modular arithmetic needs a lot of tweaking to deal with edge cases, and I haven't figured that out yet (calendar's monthrange wasn't playing nicely).
This function doesn't run:
from calendar import monthrange

def add_months(df, date_col, n_col):
    """Adds n_col months to date_col."""
    z = df.copy()
    # calculate new year/month/day and convert to datetime
    z['year'] = (z[date_col].dt.year * 12 + (z[date_col].dt.month - 1) + z[n_col]) // 12
    z['month'] = ((z[date_col].dt.month + z[n_col] - 1) % 12) + 1
    x, x = monthrange(z.year, z.month)
    z['days_in_month'] = monthrange(z.year, z.month)
    z['target_day'] = z[date_col].dt.day
    # z['day'] = min(z.target_day, z.days_in_month)
    z['day'] = z.days_in_month
    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])
    return z['sdates']
This works for now, but the DateOffset is a really heavy step.
df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdates'] = df.apply(lambda x: x.xdate + pd.DateOffset(months=x.m_offsets), axis=1)
Here's one option. You're adding months, so we can actually calculate new year/month/day by only dealing with integers in a vectorized way, and then create datetime from these y/m/d combinations:
def f_proposed(df):
    z = df.copy()
    z = z.reset_index()
    # repeat each row as many times as its number of periods
    z = z.loc[np.repeat(z.index, z['periods'])]
    # calculate k, the number of months to add
    z['k'] = z.groupby(level=0).cumcount() * z['interval']
    # calculate new year/month/day and convert to datetime
    z['year'] = (z['xdate'].dt.year * 12 + z['xdate'].dt.month - 1 + z['k']) // 12
    z['month'] = (z['xdate'].dt.month - 1 + z['k']) % 12 + 1
    # clip day to days_in_month
    z['days_in_month'] = pd.to_datetime(
        z['year'].astype(str) + '-' + z['month'].astype(str) + '-01').dt.days_in_month
    z['day'] = np.clip(z['xdate'].dt.day, 0, z['days_in_month'])
    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])
    # drop temporary columns
    z = z.set_index('index').drop(columns=['k', 'year', 'month', 'day', 'days_in_month'])
    return z
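A minimal usage sketch, assuming the dummy df from the question (xdate/periods/interval indexed by category):
out = f_proposed(df)   # one output row per generated period, with the new 'sdates' column
print(out.head())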
To compare performance with the original, I've generated a test dataset with 10,000 rows.
Here's my timings (~23x speedup for 10K):
%timeit f_proposed(z)
82.7 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f_original(z)
1.92 s ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
P.S. For 170K it takes about 1.39s with f_proposed and 33.6 s with f_original on my machine
Semi-vectorized way
As I say below, I don't think there is a pure vectorized way to add a variable, general DateOffset to a Series of Timestamps. @perl's solution works in the case where the DateOffset is an exact multiple of one month.
Now, adding a single constant DateOffset is vectorized, so we can use the following. It capitalizes on the fact that there is a limited set of distinct values for the date offset. It is also relatively fast, and it is correct for any DateOffset and dates:
n = df['periods'].values
period_no = np.repeat(n - n.cumsum(), n) + np.arange(n.sum())
z = pd.DataFrame(
    np.repeat(df.reset_index().values, repeats=n, axis=0),
    columns=df.reset_index().columns,
).set_index('index')
z = z.assign(madd=period_no * z['interval'])
z['sdates'] = z['xdate']
# add each distinct month offset to its rows in one vectorized step
for madd in set(z['madd'].unique()):
    z.loc[z['madd'] == madd, 'sdates'] += pd.DateOffset(months=madd)
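The period_no line builds a within-row counter (0, 1, 2, ...) for every original row before the repeat; a tiny sketch of what it produces:
n = np.array([3, 2])                               # two rows with 3 and 2 periods
np.repeat(n - n.cumsum(), n) + np.arange(n.sum())  # array([0, 1, 2, 0, 1])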
Timing:
# modified large dummy data:
N = 170_000
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=N, freq='H')
categories = np.random.choice(list('ABCDE'), N)
d = {'xdate': xdates, 'periods': np.random.randint(1,10,N), 'interval': np.random.randint(1,12,N)}
df = pd.DataFrame(d,index=categories)
%%time (the above)
CPU times: user 3.49 s, sys: 13.5 ms, total: 3.51 s
Wall time: 3.51 s
(Note: for 10K rows using the generation above, I see times of ~240ms, but of course it is dependent on how many distinct month offsets you have in your data).
Example result (for one draw of 170K rows as per above):
>>> z.tail()
xdate periods interval madd sdates
index
B 2040-08-25 06:00:00 8 8 48 2044-08-25 06:00:00
B 2040-08-25 06:00:00 8 8 56 2045-04-25 06:00:00
D 2040-08-25 07:00:00 3 2 0 2040-08-25 07:00:00
D 2040-08-25 07:00:00 3 2 2 2040-10-25 07:00:00
D 2040-08-25 07:00:00 3 2 4 2040-12-25 07:00:00
Correction on the initial answer
I stand corrected: my original answer is not vectorized either. The first part, exploding the DataFrame and building the number of months to add, is vectorized and very fast. But the second part, adding a DateOffset of a variable number of months, is not.
I hope I am wrong, but I don't think there is currently a way to do that second part in a vectorized way.
Direct date-parts manipulation (e.g. month = (month - 1 + n_months) % 12 + 1, etc.) is bound to fail for corner cases (e.g. '2021-02-31'). Short of replicating the logic used in DateOffset, this is not going to work for certain cases.
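A quick illustration of that corner case, assuming nothing beyond pandas itself: naive day-preserving arithmetic would try to build '2021-02-31', while DateOffset rolls back to a valid day:
import pandas as pd

print(pd.Timestamp('2021-01-31') + pd.DateOffset(months=1))   # 2021-02-28, clipped to a real date
# a plain month = (month - 1 + 1) % 12 + 1 recipe would instead ask for 2021-02-31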
Initial answer
Here is a vectorized way:
n = df.periods.values
period_no = np.repeat(n - n.cumsum(), n) + np.arange(n.sum())
z = pd.DataFrame(
    np.repeat(df.reset_index().values, repeats=n, axis=0),
    columns=df.reset_index().columns,
).set_index('index').assign(period_no=period_no)
z['sdates'] = z['period_no'] * z['interval'] * pd.DateOffset(months=1) + z['xdate']
I am calculating the RSI value for a stock price, where a previous row is needed for the result of the current row. I am currently doing this by looping over the full dataframe as many times as there are entries, which takes a lot of time (execution time on my PC is around 15 seconds).
Is there any way to improve that code?
import pandas as pd
from pathlib import Path
filename = Path("Tesla.csv")
test = pd.read_csv(filename)
data = pd.DataFrame(test[["Date","Close"]])
data["Change"] = (data["Close"].shift(-1)-data["Close"]).shift(1)
data["Gain"] = 0.0
data["Loss"] = 0.0
data.loc[data["Change"] >= 0, "Gain"] = data["Change"]
data.loc[data["Change"] <= 0, "Loss"] = data["Change"]*-1
data.loc[:, "avgGain"] = 0.0
data.loc[:, "avgLoss"] = 0.0
data["avgGain"].iat[14] = data["Gain"][1:15].mean()
data["avgLoss"].iat[14] = data["Loss"][1:15].mean()
for index in data.iterrows():
    data.loc[15:, "avgGain"] = (data.loc[14:, "avgGain"].shift(1)*13 + data.loc[15:, "Gain"])/14
    data.loc[15:, "avgLoss"] = (data.loc[14:, "avgLoss"].shift(1)*13 + data.loc[15:, "Loss"])/14
The used dataset can be downloaded here:
TSLA historic dataset from yahoo finance
The goal is to calculate the RSI value based on the to be calculated avgGain and avgLoss value.
The avgGain values on rows 0:14 do not exist.
The avgGain value on row 15 is the mean of rows [1:14] of the Gain column.
The avgGain value from row 16 onwards is calculated as:
(13*avgGain(row before)+Gain(current row))/14
itertuples is faster than iterrows, and vectorized operations generally perform best in terms of time.
Here you can calculate the average gains and losses (rolling averages) over 14 days using the rolling method with a window size of 14.
%%timeit
data["avgGain"].iat[14] = data["Gain"][1:15].mean()
data["avgLoss"].iat[14] = data["Loss"][1:15].mean()
for index in data.iterrows():
    data.loc[15:, "avgGain"] = (data.loc[14:, "avgGain"].shift(1)*13 + data.loc[15:, "Gain"])/14
    data.loc[15:, "avgLoss"] = (data.loc[14:, "avgLoss"].shift(1)*13 + data.loc[15:, "Loss"])/14
1.12 s ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
data['avgGain_alt'] = data['Gain'].rolling(window=14).mean().fillna(0)
data['avgLoss_alt'] = data['Loss'].rolling(window=14).mean().fillna(0)
1.38 ms ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
data.head(15)
Using vectorized operations to calculate the moving averages is roughly 800 times faster here (1.12 s versus 1.38 ms) than calculating with loops.
Note, however, that there is also a calculation error in your code for the averages after the first one.
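As an aside, the recursion (13*prev + current)/14 is an exponentially weighted average with alpha = 1/14 (Wilder's smoothing), so a fully vectorized sketch is possible with ewm; note the seeding differs from the question's code, which starts row 14 from a simple mean:
# adjust=False gives avg_t = (1 - 1/14)*avg_{t-1} + (1/14)*x_t = (13*avg_{t-1} + x_t)/14
data['avgGain_ewm'] = data['Gain'].ewm(alpha=1/14, adjust=False).mean()
data['avgLoss_ewm'] = data['Loss'].ewm(alpha=1/14, adjust=False).mean()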
Essentially I have data which provides a start time, the number of time slots and the duration of each slot.
I want to convert that into a dataframe of start and end times - which I've achieved but I can't help but think is not efficient or particularly pythonic.
The real data has multiple ID's hence the grouping.
import pandas as pd
slots = pd.DataFrame({"ID": 1, "StartDate": pd.to_datetime("2019-01-01 10:30:00"),
                      "Quantity": 3, "Duration": pd.to_timedelta(30, unit="minutes")}, index=[0])
grp_data = slots.groupby("ID")

bob = []
for rota_id, row in grp_data:
    start = row.iloc[0, 1]
    delta = row.iloc[0, 3]
    for quantity in range(1, int(row.iloc[0, 2] + 1)):
        data = {"RotaID": rota_id,
                "DateStart": start,
                "Duration": delta,
                "DateEnd": start + delta}
        bob.append(data)
        start = start + delta

fred = pd.DataFrame(bob)
This might be answered elsewhere but I've no idea how to properly search this since I'm not sure what my problem is.
EDIT: I've updated my code to be more efficient with its function calls and it is faster, but I'm still interested in knowing whether there is a vectorised approach to this.
How about this way:
import numpy as np

# repeat each row index 'Quantity' times
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]
# Add a counter per ID; used to 'shift' the duration along StartDate
slots_ext['counter'] = slots_ext.groupby('ID').cumcount()
# Calculate DateStart and DateEnd based on counter and Duration
slots_ext['DateStart'] = slots_ext.counter * slots_ext.Duration.values + slots_ext.StartDate
slots_ext['DateEnd'] = (slots_ext.counter + 1) * slots_ext.Duration.values + slots_ext.StartDate
slots_ext.loc[:, ['ID', 'DateStart', 'Duration', 'DateEnd']].reset_index(drop=True)
Performance
Looking at performance on a larger dataframe (duplicated 1000 times) using
slots_large = pd.concat([slots] * 1000, ignore_index=True).drop('ID', axis=1).reset_index().rename(columns={'index': 'ID'})
Yields:
Old method: 289 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
New method: 8.13 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In case this ever helps anyone:
I found that my data set had varying durations per ID, and @RubenB's initial answer doesn't handle those. Here was my final solution based on their code:
# RubenB's code
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]
# Calculate the cumulative sum of the delta per rota ID
slots_ext["delta_sum"] = slots_ext.groupby("ID")["Duration"].cumsum()
slots_ext["delta_sum"] = pd.to_timedelta(slots_ext["delta_sum"], unit="minutes")
# Use the cumulative sum to calculate the running end dates and then the start dates
first_value = slots_ext.StartDate[0]
slots_ext["EndDate"] = slots_ext.delta_sum.values + slots_ext.StartDate
slots_ext["StartDate"] = slots_ext.EndDate.shift(1)
slots_ext.loc[0, "StartDate"] = first_value
slots_ext.reset_index(drop=True, inplace=True)
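A small design note on the code above: since each row's end minus its own duration is its start, the shift and the first-row special case could arguably be replaced with a single per-row subtraction (a hedged sketch, assuming Duration holds timedeltas as in the sample data):
# each slot starts exactly one Duration before it ends, regardless of ID boundaries
slots_ext["StartDate"] = slots_ext["EndDate"] - slots_ext["Duration"]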
I am given 8000x3 data set similar to this one:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(8000,3), columns=list('XYZ'))
So for a visual reference, df.head(5) looks like this:
X Y Z
0 0.462433 0.559442 0.016778
1 0.663771 0.092044 0.636519
2 0.111489 0.676621 0.839845
3 0.244361 0.599264 0.505175
4 0.115844 0.888622 0.766014
I'm trying to implement a method that when given an index from the dataset, it will return similar items from the dataset (in some reasonable way). For now I have:
def find_similiar_items(item_id):
    tmp_df = df.sub(df.loc[item_id], axis='columns')
    tmp_series = tmp_df.apply(np.square).apply(np.sum, axis=1)
    tmp_series = tmp_series.sort_values()
    return tmp_series
This method takes your row, then subtracts it from each other row in the dataframe, then calculates the norm for each row. So this method simply returns a series of the nearest points to your given point using the euclidean distance.
So you can get the nearest 5 points, for instance, with:
df.loc[find_similiar_items(5).index].head(5)
which yields:
X Y Z
5 0.364020 0.380303 0.623393
4618 0.369122 0.399772 0.643603
4634 0.352484 0.402435 0.619763
5396 0.386675 0.370417 0.600555
3229 0.355186 0.410202 0.616844
The problem with this method is that it takes roughly half a second each time I call it. This isn't acceptable for my purpose, so I need to figure out how to improve the performance of this method in someway. So I have a few questions:
Question 1 Is there perhaps a more efficient way of simply calculating the euclidean distance as above?
Question 2 Is there some other technique that would yield reasonable results like this (the euclidean distance isn't important, for instance)? Computation time is more important than memory in this problem, and pre-processing time is not important; so I would be willing, for instance, to construct a new dataframe that has the size of the Cartesian product (n^2) of the original dataframe (but anything more than that might become unreasonable).
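Since pre-processing time doesn't matter here, one option in the spirit of Question 2 is to build a spatial index once and query it repeatedly; a minimal sketch with scipy's cKDTree (the same structure used in the first question above; the function name is just for illustration):
from scipy.spatial import cKDTree

pts = df[['X', 'Y', 'Z']].values
tree = cKDTree(pts)                       # one-off pre-processing

def find_similar_items_kdtree(item_id, k=5):
    # k nearest rows by euclidean distance; the query point itself comes back first
    dist, idx = tree.query(pts[item_id], k=k)
    return df.iloc[idx]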
Your biggest (and easiest) performance gain is likely to be from merely doing this in numpy rather than pandas. I'm seeing over a 200x improvement just from a quick conversion of the code to numpy:
arr = df.values

def fsi_numpy(item_id):
    tmp_arr = arr - arr[item_id]
    tmp_ser = np.sum(np.square(tmp_arr), axis=1)
    return tmp_ser
df['dist'] = fsi_numpy(5)
df = df.sort_values('dist').head(5)
X Y Z dist
5 0.272985 0.131939 0.449750 0.000000
5130 0.272429 0.138705 0.425510 0.000634
4609 0.264882 0.103006 0.476723 0.001630
1794 0.245371 0.175648 0.451705 0.002677
6937 0.221363 0.137457 0.463451 0.002883
Check that it gives the same result as your function (since we have different random draws):
df.loc[ pd.DataFrame( find_similiar_items(5)).index].head(5)
X Y Z
5 0.272985 0.131939 0.449750
5130 0.272429 0.138705 0.425510
4609 0.264882 0.103006 0.476723
1794 0.245371 0.175648 0.451705
6937 0.221363 0.137457 0.463451
Timings:
%timeit df.loc[ pd.DataFrame( find_similiar_items(5)).index].head(5)
1 loops, best of 3: 638 ms per loop
In [105]: %%timeit
...: df['dist'] = fsi_numpy(5)
...: df = df.sort_values('dist').head(5)
...:
100 loops, best of 3: 2.69 ms per loop
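If only the top few neighbours are ever needed, a further tweak (sketched here on the arr array from above, with a hypothetical helper name) is np.argpartition, which finds the k smallest distances without fully sorting all 8000 of them; it returns the row positions, ordered by distance:
import numpy as np

def fsi_numpy_topk(item_id, k=5):
    d = np.sum(np.square(arr - arr[item_id]), axis=1)
    idx = np.argpartition(d, k)[:k]        # indices of the k smallest distances, unordered
    return idx[np.argsort(d[idx])]         # those k indices, ordered by distance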