Very slow computation on a Dataframe based on previous rows - python

I am calculating the RSI value for a stock price, where a previous row is needed for the result of the current row. I am currently doing it by for-looping over the full dataframe once per entry, which takes a lot of time (around 15 seconds on my PC).
Is there any way to improve that code?
import pandas as pd
from pathlib import Path
filename = Path("Tesla.csv")
test = pd.read_csv(filename)
data = pd.DataFrame(test[["Date","Close"]])
data["Change"] = (data["Close"].shift(-1)-data["Close"]).shift(1)
data["Gain"] = 0.0
data["Loss"] = 0.0
data.loc[data["Change"] >= 0, "Gain"] = data["Change"]
data.loc[data["Change"] <= 0, "Loss"] = data["Change"]*-1
data.loc[:, "avgGain"] = 0.0
data.loc[:, "avgLoss"] = 0.0
data["avgGain"].iat[14] = data["Gain"][1:15].mean()
data["avgLoss"].iat[14] = data["Loss"][1:15].mean()
for index in data.iterrows():
data.loc[15:, "avgGain"] = (data.loc[14:, "avgGain"].shift(1)*13 + data.loc[15:, "Gain"])/14
data.loc[15:, "avgLoss"] = (data.loc[14:, "avgLoss"].shift(1)*13 + data.loc[15:, "Loss"])/14
The dataset used can be downloaded here:
TSLA historic dataset from yahoo finance
The goal is to calculate the RSI value based on the avgGain and avgLoss values that still have to be calculated.
The avgGain values on rows 0:14 do not exist.
The avgGain value on row 14 is the mean of rows 1:15 of the Gain column.
The avgGain value from row 15 onwards is calculated as:
(13 * avgGain(previous row) + Gain(current row)) / 14

'itertuples' is faster than 'iterrows', and vectorized operations generally perform best in terms of time.
Here you can calculate the average gains and losses (rolling averages) over 14 days with the rolling method and a window size of 14.
%%timeit
data["avgGain"].iat[14] = data["Gain"][1:15].mean()
data["avgLoss"].iat[14] = data["Loss"][1:15].mean()
for index in data.iterrows():
data.loc[15:, "avgGain"] = (data.loc[14:, "avgGain"].shift(1)*13 + data.loc[15:, "Gain"])/14
data.loc[15:, "avgLoss"] = (data.loc[14:, "avgLoss"].shift(1)*13 + data.loc[15:, "Loss"])/14
1.12 s ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
data['avgGain_alt'] = data['Gain'].rolling(window=14).mean().fillna(0)
data['avgLoss_alt'] = data['Loss'].rolling(window=14).mean().fillna(0)
1.38 ms ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
data.head(15)
Using vectorized operations to calculate the moving averages is roughly 800 times faster here (1.38 ms versus 1.12 s) than calculating them with the loop.
Note, however, that there is also a small calculation error in your code for the averages after the first one.
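If you want the exact recursive (Wilder-style) average rather than a plain rolling mean, that recursion can also be vectorised: avgGain(current) = (13 * avgGain(previous) + Gain(current)) / 14 is an exponentially weighted mean with alpha = 1/14. A minimal sketch (column names taken from the question; the seeding of the first value differs from the explicit mean over rows 1:15, so check the first rows against your expected output):
# Sketch: the recursion (13*prev + current)/14 equals an EWM with alpha = 1/14 and adjust=False
alpha = 1 / 14
data["avgGain_ewm"] = data["Gain"].ewm(alpha=alpha, adjust=False).mean()
data["avgLoss_ewm"] = data["Loss"].ewm(alpha=alpha, adjust=False).mean()
# RSI from the smoothed averages (standard formula, added only for illustration)
rs = data["avgGain_ewm"] / data["avgLoss_ewm"]
data["RSI"] = 100 - 100 / (1 + rs)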

Related

creating series_route from multiple chunks of string in list

I am working on route-building code and have half a million records, which take around 3-4 hours to process.
For creating dataframe:
import pandas as pd

# initialize list of lists
data = [[['1027', '(K)', 'TRIM']], [['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['route'])
It will look something like this:
route
[1027, (K), TRIM]
[SJCL, (K), EJ00, (K), ZQFC, (K), DYWH]
Code I have used:
def func_list(hd1):
    required_list = []
    for j, i in enumerate(hd1):
        # print(i, j)
        if j == 0:
            req = i
        else:
            if i[0].isupper() or i[0].isdigit():
                required_list.append(req)
                req = i
            else:
                req = req + i
    required_list.append(req)
    return required_list

df['route2'] = df['route'].apply(func_list)
# Output:
route2
[1027(K), TRIM]
[SJCL(K), EJ00(K), ZQFC(K), DYWH]
For half a million rows it takes 3-4 hours; I don't know how to reduce it, please help.
Use explode to flatten your dataframe:
import numpy as np
# One route element per row; the index keeps the original row number
sr1 = df['route'].explode()
# Elements starting with '(' are glued onto the previous element, e.g. '1027' + '(K)' -> '1027(K)';
# the mask then keeps only rows not absorbed into the following row and rebuilds the lists per row
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)
df['route'] = sr2[sr1.eq(sr2).shift(-1, fill_value=True)].groupby(level=0).apply(list)
print(df)
# Output:
0 [1027(K), TRIM]
1 [SJCL(K), EJ00(K), ZQFC(K), DYWH]
dtype: object
For 500K records:
7.46 s ± 97.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How to sjoin using geopandas using a common key column and also location

Suppose I have dataframe A, which consists of two columns: geometry (point) and hour.
Dataframe B also consists of geometry (shape) and hour.
I am familiar with the standard sjoin. What I want to do is make sjoin link rows from the two tables only when the hours are the same; in traditional join terminology the keys are geometry and hour. How can I achieve it?
I have reviewed two logical approaches:
1. spatial join followed by a filter
2. shard (filter) the data frames first on hour, spatially join the shards, and concatenate the results
I then test the results for equality and run some timings.
Conclusions: there is little difference between the timings on this test data set; the simple approach is quicker when the number of points is small.
import pandas as pd
import numpy as np
import geopandas as gpd
import shapely.geometry
import requests
# source some points and polygons
# fmt: off
dfp = pd.read_html("https://www.latlong.net/category/cities-235-15.html")[0]
dfp = gpd.GeoDataFrame(dfp, geometry=dfp.loc[:,["Longitude", "Latitude",]].apply(shapely.geometry.Point, axis=1))
res = requests.get("https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson")
df_poly = gpd.GeoDataFrame.from_features(res.json())
# fmt: on
# bulk up number of points
dfp = pd.concat([dfp for _ in range(1000)]).reset_index()
HOURS = 24
dfp["hour"] = np.random.randint(0, HOURS, len(dfp))
df_poly["hour"] = np.random.randint(0, HOURS, len(df_poly))
def simple():
    return gpd.sjoin(dfp, df_poly).loc[lambda d: d["hour_left"] == d["hour_right"]]

def shard():
    return pd.concat(
        [
            gpd.sjoin(*[d.loc[d["hour"].eq(h)] for d in [dfp, df_poly]])
            for h in range(HOURS)
        ]
    )
print(f"""length test: {len(simple()) == len(shard())} {len(simple())}
dataframe test: {simple().sort_index().equals(shard().sort_index())}
points: {len(dfp)}
polygons: {len(df_poly)}""")
%timeit simple()
%timeit shard()
output
length test: True 3480
dataframe test: True
points: 84000
polygons: 379
6.48 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.05 s ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Most efficient use of groupby-apply with user-defined functions in Pandas/Numpy

I'm missing information on what would be the most efficient (read: fastest) way of using user-defined functions in a groupby-apply setting in either Pandas or Numpy. I have done some of my own tests but am wondering if there are other methods out there that I have not come across yet.
Take the following example DataFrame:
import numpy as np
import pandas as pd
idx = pd.MultiIndex.from_product([range(0, 100000), ["a", "b", "c"]], names = ["time", "group"])
df = pd.DataFrame(columns=["value"], index = idx)
np.random.seed(12)
df["value"] = np.random.random(size=(len(idx),))
print(df.head())
value
time group
0 a 0.154163
b 0.740050
c 0.263315
1 a 0.533739
b 0.014575
I would like to calculate (for example, the below could be any arbitrary user-defined function) the percentage change over time per group. I could do this in a pure Pandas implementation as follows:
def pct_change_pd(series, num):
    return series / series.shift(num) - 1
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
But I could also modify the function and apply it over a numpy array:
def shift_array(arr, num, fill_value=np.nan):
    if num >= 0:
        return np.concatenate((np.full(num, fill_value), arr[:-num]))
    else:
        return np.concatenate((arr[-num:], np.full(-num, fill_value)))

def pct_change_np(series, num):
    idx = series.index
    arr = series.values.flatten()
    arr_out = arr / shift_array(arr, num=num) - 1
    return pd.Series(arr_out, index=idx)
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)
From my testing, it seems that the numpy method, even with its additional overhead of converting between np.array and pd.Series, is faster.
Pandas:
%%timeit
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
113 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy:
%%timeit
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)
94.7 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As the index grows and the user-defined function becomes more complex, the Numpy implementation will continue to outperform the Pandas implementation more and more. However, I wonder if there are alternative methods to achieving similar results that are even faster. I'm specifically after another (more efficient) groupby-apply methodology that would allow me to work with any arbitrary user-defined function, not just with the shown example of calculating the percentage change. Would be happy to hear if they exist!
Often the name of the game is to try to use whatever functions are in the toolbox (often optimized and C compiled) rather than applying your own pure Python function. For example, one alternative would be:
def f1(df, num=1):
    grb_kwargs = dict(sort=False, group_keys=False)  # avoid redundant ops
    z = df.sort_values(['group', 'time'])
    return z / z.groupby('group', **grb_kwargs).transform(pd.Series.shift, num) - 1
That is about 32% faster than the .groupby('group').apply(pct_change_pd, num=1). On your system, it would yield around 85ms.
And then, there is the trick of doing your "expensive" calculation on the whole df, but masking out the parts that are spillovers from other groups:
def f2(df, num=1):
    grb_kwargs = dict(sort=False, group_keys=False)  # avoid redundant ops
    z = df.sort_values(['group', 'time'])
    z2 = z.shift(num)
    gid = z.groupby('group', **grb_kwargs).ngroup()
    z2.loc[gid != gid.shift(num)] = np.nan
    return z / z2 - 1
That one is fully 2.1x faster (on your system it would be around 52.8 ms).
Finally, when there is no way to find a vectorized function to use directly, you can use numba to speed up your code (which can then be written with loops to your heart's content). A classic example is a cumulative sum with caps, as in this SO post and this one.
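As a rough illustration of that last point, here is a minimal numba sketch of a capped cumulative sum (my own example of the pattern, not code from the linked posts; it assumes numba is installed):
import numpy as np
from numba import njit

@njit
def cumsum_capped(arr, cap):
    # Running sum clipped at `cap` after every step; the dependency on the
    # previous clipped value is what prevents a plain vectorised cumsum.
    out = np.empty_like(arr)
    total = 0.0
    for i in range(arr.shape[0]):
        total = min(total + arr[i], cap)
        out[i] = total
    return out

vals = np.random.random(1_000_000)
capped = cumsum_capped(vals, 10.0)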
Your first function and using .apply() gives me this result:
In [42]: %%timeit
...: out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
155 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using the groups directly (dfg below is the DataFrameGroupBy object, e.g. dfg = df.sort_values(['group', 'time']).groupby('group') built once beforehand), the time goes to 56 ms.
%%timeit
num=1
outpd_list = []
for g in dfg.groups.keys():
    gc = dfg.get_group(g)
    outpd_list.append(gc['value'] / gc['value'].shift(num) - 1)
out_pd = pd.concat(outpd_list, axis=0)
56 ms ± 821 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And if you change this one line in the above code to use the built-in function, you get a bit more time savings:
outpd_list.append(gc['value'].pct_change(num))
41.2 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
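For this particular example there is also a built-in grouped pct_change, which removes the Python-level loop over groups entirely (a sketch, not part of the timings above; worth benchmarking on your own data):
# Sketch: pandas' built-in grouped pct_change on the sorted frame
out_builtin = (
    df.sort_values(["group", "time"])
      .groupby("group")["value"]
      .pct_change(1)
)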

How can I efficiently convert (start_time,[time_deltas]) to (start_time, end_time)?

Essentially I have data which provides a start time, the number of time slots and the duration of each slot.
I want to convert that into a dataframe of start and end times - which I've achieved but I can't help but think is not efficient or particularly pythonic.
The real data has multiple ID's hence the grouping.
import pandas as pd
slots = pd.DataFrame({"ID": 1, "StartDate": pd.to_datetime("2019-01-01 10:30:00"), "Quantity": 3, "Duration": pd.to_timedelta(30, unit="minutes")}, index=[0])
grp_data = slots.groupby("ID")
bob = []
for rota_id, row in grp_data:
    start = row.iloc[0, 1]
    delta = row.iloc[0, 3]
    for quantity in range(1, int(row.iloc[0, 2] + 1)):
        data = {"RotaID": rota_id,
                "DateStart": start,
                "Duration": delta,
                "DateEnd": start + delta}
        bob.append(data)
        start = start + delta
fred = pd.DataFrame(bob)
This might be answered elsewhere but I've no idea how to properly search this since I'm not sure what my problem is.
EDIT: I've updated my code to be more efficient with its function calls and it is faster, but I'm still interested in knowing whether there is a vectorised approach to this.
How about this way:
import numpy as np
# Repeat each row index 'Quantity' times, then use it to duplicate the rows
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]
# Add a counter per ID; used to 'shift' the duration along StartDate
slots_ext['counter'] = slots_ext.groupby('ID').cumcount()
# Calculate DateStart and DateEnd based on counter and Duration
slots_ext['DateStart'] = (slots_ext.counter) * slots_ext.Duration.values + slots_ext.StartDate
slots_ext['DateEnd'] = (slots_ext.counter + 1) * slots_ext.Duration.values + slots_ext.StartDate
slots_ext.loc[:, ['ID', 'DateStart', 'Duration', 'DateEnd']].reset_index(drop=True)
Performance
Looking at performance on a larger dataframe (duplicated 1000 times) using
slots_large = pd.concat([slots] * 1000, ignore_index=True).drop('ID', axis=1).reset_index().rename(columns={'index': 'ID'})
Yields:
Old method: 289 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
New method: 8.13 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In case this ever helps anyone:
I found that my data set had varying deltas per ID, which @RubenB's initial answer doesn't handle. Here is my final solution based on that code:
# RubenB's code
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]
# Calculate the cumulative sum of the delta per rota ID
slots_ext["delta_sum"] = slots_ext.groupby("ID")["Duration"].cumsum()
slots_ext["delta_sum"] = pd.to_timedelta(slots_ext["delta_sum"], unit="minutes")
# Use the cumulative sum to calculate the running end dates and then the start dates
first_value = slots_ext.StartDate[0]
slots_ext["EndDate"] = slots_ext.delta_sum.values + slots_ext.StartDate
slots_ext["StartDate"] = slots_ext.EndDate.shift(1)
slots_ext.loc[0, "StartDate"] = first_value
slots_ext.reset_index(drop=True, inplace=True)

Pandas Vectorize Instead of Loop

I have a dataframe of paths. The task is to get the last modification time for each folder, using something like datetime.fromtimestamp(os.path.getmtime('PATH_HERE')), into a separate column.
import os
from datetime import datetime

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Path': ['C:\\Path1', 'C:\\Path2', 'C:\\Path3']})
# For an MCVE use the commented-out code below. WARNING: this WILL create directories on your machine.
#for path in df1['Path']:
#    os.mkdir(r'PUT_YOUR_PATH_HERE\\' + os.path.basename(path))
I can do the task with the below, but it is a slow loop if I have many folders:
for each_path in df1['Path']:
    df1.loc[df1['Path'] == each_path, 'Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(each_path))
How would I go about vectorizing this process to improve speed? os.path.getmtime cannot accept a Series. I'm looking for something like:
df1['Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(df1['Path']))
I'm going to present 3 approaches, assuming we work with 100 paths. I think approach 3 is strictly preferable.
# Data initialisation:
paths100 = ['a_whatever_path_here'] * 100
df = pd.DataFrame(columns=['paths', 'time'])
df['paths'] = paths100
def fun1():
    # Naive for loop. High readability, slow.
    for path in df['paths']:
        mask = df['paths'] == path
        df.loc[mask, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))

def fun2():
    # Naive for loop optimised. Medium readability, medium speed.
    for i, path in enumerate(df['paths']):
        df.loc[i, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))

def fun3():
    # List comprehension. High readability, high speed.
    df['time'] = [datetime.fromtimestamp(os.path.getmtime(path)) for path in df['paths']]
%timeit fun1()
>>> 164 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit fun2()
>>> 11.6 ms ± 67.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun3()
>>> 13.1 ns ± 0.0327 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
You can use a groupby transform (so that you are doing the expensive call only once per group):
g = df1.groupby("Path")["Path"]
s = pd.to_datetime(g.transform(lambda x: os.path.getmtime(x.name)))
df1["Last Modification Time"] = s # putting this on two lines so it looks nicer...
