What is the most efficient way of indexing Numpy matrices?

What is the most efficient way of indexing Numpy matrices? - python

Question: What is the most efficient way to implement the equivalent of the following, using Pandas dataframes: temp = df[df.feature] == value] at scale (see below for context re: scale)?
Background: I have daily time series data for ~500 entities for 30 years, and for each entity and each day, need to create 90 features based on various look-backs, up to 240 days in the past. Currently, I'm looping through each day, manipulating all of the data from that day, then inserting it into a pre-allocated numpy matrix—but it's proving very slow for the size of my data set.
Naive approach:
df = pd.DataFrame()
for day in range(241, t_max):
temp_a = df_timeseries[df_timeseries.t] == day].copy()
temp_b = df_timeseries[df_timeseries.t] == day - 1].copy()
new_val = temp_a.feature_1/temp_b.feature_1
new_val['t'] = day
new_val['entity'] = temp_a.entity
df.concat([df, new_val])
Current approach (simplified):
df = np.matrix(np.zeros([num_days*num_entities, 3]))
col_dict = dict(zip(df_timeseries.columns, list(range(0,len(df_timeseries.columns)))))
mtrx_timeseries = np.matrix(df_timeseries.to_numpy())
for i, day in enumerate(range(241, t_max)):
interm = np.matrix(np.zeros([num_entities, 3]))
interm[:, 0] = day
temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :]
temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 1)[0], :]
temp_cr = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1
temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 5)[0], :]
temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == t - 10)[0], :]
temp_or = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1
interm[:, 1:] = np.concatenate([temp_cr, temp_or], axis=1)
df[i*num_entities : (i + 1)*num_entities, :] = interm
Line profiling the full version of the code I have shows that each statement of the form mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :] takes up ~23% of the time of the run in total, hence my looking for a more streamlined solution. Since indexing takes the most time, and since the loop means that this operation is performed every iteration, perhaps one solution might be to index just once, storing each day's data in a separate array element, and then looping through array elements?

This isn't a complete solution to your problem, but I think it will get you where you need to be.
Consider the following code:
entity_dict = {}
entity_idx = 0
arr = np.zeros((num_entities, t_max-240))
for entity, day, feature in df_timeseries[['entity', 'day', 'feature_1']].values:
if entity not in entity_dict:
entity_dict[entity] = entity_idx
entity_idx += 1
arr[entity_dict[entity], day-240] = feature
This will convert df_timeseries into an num_entities*num_days shaped array organized by entities, very efficiently. You won't need to do any fancy indexing at all. The most efficient way to index a numpy array or matrix is to know what indices you need ahead of time and not search the array
for them. You can then perform array operations (it looks to me like your operation is simple elementwise division, which you can do in a couple lines with no extra loop).
Then convert back to the original format.

Related

Python, how do i add two values together from same column numpy and iterate an add function on that specific column numpy

I'm working on some python binance data, and have ran into a problem. I want to do simple math on a single column and iterate math on that same column.
The data I'm working with is volume from a binance feed.
This is my code:
candles = client.get_historical_klines("BTCUSDT", Client.KLINE_INTERVAL_1HOUR, "1 day ago UTC")
candle_dataframe = df(candles)
candle_dataframe_date = candle_dataframe[0]
date_init = []
for time in candle_dataframe_date.unique():
readable = datetime.fromtimestamp(int(time/1000))
date_init.append(readable)
candle_dataframe.pop(0)
candle_dataframe.pop(11)
dataframe_final_date = df(date_init)
dataframe_final_date.columns = ['date']
final_dataframe = candle_dataframe.join(dataframe_final_date)
final_dataframe.set_index('date', inplace=True)
final_dataframe.columns = ['open', 'high', 'low', 'close', 'volume', 'close_time', 'asset_volume', 'trade_number', 'taker_buy_base', 'taker_buy_quote']
list_volume = final_dataframe.iloc[:, [4]]
np_array = list_volume.to_numpy()
arr = np_array.astype('float64')
np.transpose(arr)
#print(np.transpose(arr))
#example 1
b = np_array[::2]
a = np_array[1::2]
print(np.add(a, b))
#example two
#for i, values, values1 in np_array:
# values[0] += 1
# if i == values[1]:
# np.multiply(values, values1)
#result = np.add(a + b)
Example two kind of shows what I'm trying to do, I hope.
But is there a way to control, the id iteration so that I get the last volume and the next from there and add those together, then add the next two values and so on?
It seems impossible to create simple math on single column ids, or is there a way through it.
I just think it would be silly to have to copy the column and work with redundant data, to actually perform the math.

Why loop? Shouldn't this work?
arr = np.random.rand(10)
d = arr[:-1] + arr[1:]

The answer to my question was in the comment posting an article and is this function.
d = np.zeros(arr.size - 1)
for i in range(len(arr) - 1):
d[i] = arr[i + 1] + arr[i]
Thank you so much.
Mathias

How to optimize this pandas iterable

I have the following method in which I am eliminating overlapping intervals in a dataframe based on a set of hierarchical rules:
def disambiguate(arg):
arg['length'] = (arg.end - arg.begin).abs()
df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()
data = []
out = pd.DataFrame()
for row in df.itertuples():
test = df[df['note_id']==row.note_id].copy()
# get overlapping intervals:
# https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
iix = pd.IntervalIndex.from_arrays(test.begin.apply(pd.to_numeric), test.end.apply(pd.to_numeric), closed='neither')
span_range = pd.Interval(row.begin, row.end)
fx = test[iix.overlaps(span_range)].copy()
maxLength = fx['length'].max()
minLength = fx['length'].min()
maxScore = abs(float(fx['score'].max()))
minScore = abs(float(fx['score'].min()))
# filter out overlapping rows via hierarchy
if maxScore > minScore:
fx = fx[fx['score'] == maxScore]
elif maxLength > minLength:
fx = fx[fx['length'] == minScore]
data.append(fx)
out = pd.concat(data, axis=0)
# randomly reindex to keep random row when dropping remaining duplicates: https://gist.github.com/cadrev/6b91985a1660f26c2742
out.reset_index(inplace=True)
out = out.reindex(np.random.permutation(out.index))
return out.drop_duplicates(subset=['begin', 'end', 'note_id'])
This works fine, except for the fact that the dataframes I am iterating over have well over 100K rows each, so this is taking forever to complete. I did a timing of various methods using %prun in Jupyter, and the method that seems to eat up processing time was series.py:3719(apply) ... NB: I tried using modin.pandas, but that was causing more problems (I kept getting an error wrt to Interval needing a value where left was less than right, which I couldn't figure out: I may file a GitHub issue there).
Am looking for a way to optimize this, such as using vectorization, but honestly, I don't have the slightest clue how to convert this to a vectotrized form.
Here is a sample of my data:
begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
28,37,0365,1
38,42,0365,1
53,69,0365,0.7857142857142857
56,60,0365,1
56,69,0365,1
64,69,0365,1
83,86,0365,1
91,98,0365,0.8333333333333334
101,108,0365,1
101,127,0365,1
112,119,0365,1
112,127,0365,0.8571428571428571
120,127,0365,1
163,167,0365,1
196,203,0365,1
208,216,0365,1
208,223,0365,1
208,231,0365,1
208,240,0365,0.6896551724137931
217,223,0365,1
217,231,0365,1
224,231,0365,1
246,274,0365,0.7692307692307693
252,274,0365,1
263,274,0365,0.8888888888888888
296,316,0365,0.7222222222222222
301,307,0365,1
301,316,0365,1
301,330,0365,0.7307692307692307
301,336,0365,0.78125
308,316,0365,1
308,323,0365,1
308,330,0365,1
308,336,0365,1
317,323,0365,1
317,336,0365,1
324,330,0365,1
324,336,0365,1
361,418,0365,0.7368421052631579
370,404,0365,0.7111111111111111
370,418,0365,0.875
383,418,0365,0.8285714285714286
396,404,0365,1
396,418,0365,0.8095238095238095
405,418,0365,0.8333333333333334
432,453,0365,0.7647058823529411
438,453,0365,1
438,458,0365,0.7222222222222222

I think I know what the issue was: I did my filtering on note_id incorrectly, and thus iterating over the entire dataframe.
It should been:
cases = set(df['note_id'].tolist())
for case in cases:
test = df[df['note_id']==case].copy()
for row in df.itertuples():
# get overlapping intervals:
# https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
span_range = pd.Interval(row.begin, row.end)
fx = test[iix.overlaps(span_range)].copy()
maxLength = fx['length'].max()
minLength = fx['length'].min()
maxScore = abs(float(fx['score'].max()))
minScore = abs(float(fx['score'].min()))
if maxScore > minScore:
fx = fx[fx['score'] == maxScore]
elif maxLength > minLength:
fx = fx[fx['length'] == maxLength]
data.append(fx)
out = pd.concat(data, axis=0)
For testing on one note, before I stopped iterating over the entire, non-filtered dataframe, it was taking over 16 minutes. Now, it's at 28 seconds!

Omnet++ / Data in a pandas cell(list) vs pandas series(column)

So I'm using Omnet++, a discrete time network simulator, to simulate different networking scenarios. At some point one can further process Omnet++ output statistics and store them in a .csv file.
The interesting thing about it is that for each time (vectime) there is a value (vecvalue). Those vectime/vecvalues are stored in a single cell of such .csv file. When imported into a Pandas Dataframe, I get something like this.
In [45]: df1[['module','vectime','vecvalue']]
Out[45]:
module vectime vecvalue
237 Tictoc13.tic[1] [2.542245319062, 3.066965320033, 4.78723506093... [0.334535581612, 0.390459633837, 0.50391696492...
249 Tictoc13.tic[4] [2.649303071938, 6.02527384362, 21.42434044990... [2.649303071938, 1.654927100273, 3.11051622577...
261 Tictoc13.tic[3] [4.28876656608, 16.104821448604, 19.5989313700... [2.245250432259, 3.201153958979, 2.39023520069...
277 Tictoc13.tic[2] [13.884917126016, 21.467263378748, 29.59962616... [0.411703261805, 0.764708518232, 0.83288346614...
289 Tictoc13.tic[5] [14.146524815409, 14.349744576545, 24.95022463... [1.732060647139, 8.66456377103, 2.275388282721...
For example, if I needed to plot each vectime/vecvalue for each module, today I'm doing the following...
%pylab
def runningAvg(x):
sigma_x = np.cumsum(x)
sigma_n = np.arange(1,x.size + 1)
return sigma_x / sigma_n
for row in df1.itertuples():
t = row.vectime
x = row.vecvalue
x = runningAvg(x)
plot(t,x)
... to obtain this ...
My question is: what's best in terms of performance:
use the data as is, meaning using those arrays inside each cell, looping over the DF to plot each array;
convert those arrays as pd.Series. In this case, what would be better to still have the module as index?
would I benefit from unnesting those arrays into pd.Series?
thanks!

Well, I've wondered around and it seems that converting Omnet data into pd.Series might not be as efficient as I thought.
These are my two methods:
1) Using Omnet data as is, lists inside Pandas DF.
figure(1)
start = datetime.datetime.now()
for row in df1.itertuples():
t = row.vectime
x = row.vecvalue
x = runningAvg(x)
plot(t,x)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
When running the above, the total is 0.026571.
2) Converting Omnet data to pd.Series.
To obtain the same result, I had to transpose the series several times.
figure(2)
start = datetime.datetime.now()
t = df1.vectime
v = df1.vecvalue
t = t.apply(pd.Series)
v = v.apply(pd.Series)
t = t.T
v = v.T
sigma_v = np.cumsum(v)
sigma_n = np.arange(1,v.shape[0]+1)
sigma = sigma_v.T / sigma_n
plot(t,sigma.T)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
For the later, total is 0.57266.
So it seems that I'll stick to method 1, looping over the different rows.

Pandas and Numpy program slower than loop version of same functionality, how to speed up?

I have millions of records with each record having an integer (p) and a X*3 matrix of values. For each record, the goal is to find a row from the matrix by selection criteria (see the if-statements in the code).
I'm fairly new to Python and try to make use of vectorization in Pandas using parallel computations instead of loops. I have written the program in two versions, one with Pandas+Numpy and another one with simple loops.
I was told that using vectorization and Numpy array operations is faster than loops. But so far, the loop version is about 10x faster:
Here is the program:
import numpy as np
import pandas as pd
import time
d = {
'values': [np.array([[1400,1400,1800000],[1500,1505,4800000],[1300,1305,5000]]), np.array([[800,900,80000],[1400,1420,50000],[1250,1300,60000]]), np.array([[1700,1750,5000000],[1900,1950,5000000],[1600,1600,3000000]]), np.array([])],
'p': [1300, 1350, 1800, 1400]
}
# The Pandas+Numpy version
def selection_numpy(row):
try:
# Select rows where col[0] >= p
c1 = row['values'][row['values'][:,0] >= row['p']]
# Select rows where col[2] > 1000000
c2 = c1[c1[:,2]>1000000]
# Sort by col[0] and return the lowest row
return c2[c2[:,0].argsort()][0]
except:
pass
start = time.time()
df = pd.DataFrame(d)
df['result'] = df.apply(selection_numpy, axis=1)
# print(df.head())
print(time.time()-start)
# The loop version:
def selection_loop(values, p):
lowest_num = 9999999999
lowest_item = None
# Iterate through each row in the matrix and replace lowest_item if it's lower than the previous one
for item in values:
if item[0] >= p and item[2] > 1000000 and item[0] < lowest_num:
lowest_num = item[0]
lowest_item = item
return lowest_item
start = time.time()
d['result'] = []
for i in range(0, 4):
result = selection_loop(d['values'][i], d['p'][i])
d['result'].append(result)
# print(d['result'])
print(time.time()-start)
Both produce the same result values, but the loop version is magnitudes faster (for the actual million record dataset, not for the 4 example records).
I assume there is a simple and elegant solution to find the desired row for each record which uses vectorization and is the fastest. Not sure why the function using Numpy arrays is so slow, but I appreciate any guidance.

How to vectorize this peak finding for loop in Python?

Basically I'm writing a peak finding function that needs to be able to beat scipy.argrelextrema in benchmarking. Here is a link to the data I'm using, and the code:
https://drive.google.com/open?id=1U-_xQRWPoyUXhQUhFgnM3ByGw-1VImKB
If this link expires, the data can be found at dukascopy bank's online historical data downloader.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('EUR_USD.csv')
data.columns = ['Date', 'open', 'high', 'low', 'close','volume']
data.Date = pd.to_datetime(data.Date, format='%d.%m.%Y %H:%M:%S.%f')
data = data.set_index(data.Date)
data = data[['open', 'high', 'low', 'close']]
data = data.drop_duplicates(keep=False)
price = data.close.values
def fft_detect(price, p=0.4):
trans = np.fft.rfft(price)
trans[round(p*len(trans)):] = 0
inv = np.fft.irfft(trans)
dy = np.gradient(inv)
peaks_idx = np.where(np.diff(np.sign(dy)) == -2)[0] + 1
valleys_idx = np.where(np.diff(np.sign(dy)) == 2)[0] + 1
patt_idx = list(peaks_idx) + list(valleys_idx)
patt_idx.sort()
label = [x for x in np.diff(np.sign(dy)) if x != 0]
# Look for Better Peaks
l = 2
new_inds = []
for i in range(0,len(patt_idx[:-1])):
search = np.arange(patt_idx[i]-(l+1),patt_idx[i]+(l+1))
if label[i] == -2:
idx = price[search].argmax()
elif label[i] == 2:
idx = price[search].argmin()
new_max = search[idx]
new_inds.append(new_max)
plt.plot(price)
plt.plot(inv)
plt.scatter(patt_idx,price[patt_idx])
plt.scatter(new_inds,price[new_inds],c='g')
plt.show()
return peaks_idx, price[peaks_idx]
It basically smoothes data using a fast fourier transform (FFT) then takes the derivative to find the minimum and maximum indices of the smoothed data, then finds the corresponding peaks on the unsmoothed data. Sometimes the peaks it finds are not idea due to some smoothing effects, so I run this for loop to search for higher or lower points for each index between the bounds specified by l. I need help vectorizing this for loop! I have no idea how to do it. Without the for loop, my code is about 50% faster than scipy.argrelextrema, but the for loop slows it down. So if I can find a way to vectorize it, it'd be a very quick, and very effective alternative to scipy.argrelextrema. These two images represent the data without and with the for loop respectively.

This may do it. It's not perfect but hopefully it obtains what you want and shows you a bit how to vectorize. Happy to hear any improvements you think up
label = np.array(label[:-1]) # not sure why this is 1 unit longer than search.shape[0]?
# the idea is to make the index matrix you're for looping over row by row all in one go.
# This part is sloppy and you can improve this generation.
search = np.vstack((np.arange(patt_idx[i]-(l+1),patt_idx[i]+(l+1)) for i in range(0,len(patt_idx[:-1])))) # you can refine this.
# then you can make the price matrix
price = price[search]
# and you can swap the sign of elements so you only need to do argmin instead of both argmin and argmax
price[label==-2] = - price[label==-2]
# now find the indices of the minimum price on each row
idx = np.argmin(price,axis=1)
# and then extract the refined indices from the search matrix
new_inds = search[np.arange(idx.shape[0]),idx] # this too can be cleaner.
# not sure what's going on here so that search[:,idx] doesn't work for me
# probably just a misunderstanding
I find that this reproduces your result but I did not time it. I suspect the search generation is quite slow but probably still faster than your for loop.
Edit:
Here's a better way to produce search:
patt_idx = np.array(patt_idx)
starts = patt_idx[:-1]-(l+1)
stops = patt_idx[:-1]+(l+1)
ds = stops-starts
s0 = stops.shape[0]
s1 = ds[0]
search = np.reshape(np.repeat(stops - ds.cumsum(), ds) + np.arange(ds.sum()),(s0,s1))

Here is an alternative... it uses list comprehension which is generally faster than for-loops
l = 2
# Define the bounds beforehand, its marginally faster than doing it in the loop
upper = np.array(patt_idx) + l + 1
lower = np.array(patt_idx) - l - 1
# List comprehension...
new_inds = [price[low:hi].argmax() + low if lab == -2 else
price[low:hi].argmin() + low
for low, hi, lab in zip(lower, upper, label)]
# Find maximum within each interval
new_max = price[new_inds]
new_global_max = np.max(new_max)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

What is the most efficient way of indexing Numpy matrices? - python

Related

Python, how do i add two values together from same column numpy and iterate an add function on that specific column numpy

How to optimize this pandas iterable

Omnet++ / Data in a pandas cell(list) vs pandas series(column)

Pandas and Numpy program slower than loop version of same functionality, how to speed up?

How to vectorize this peak finding for loop in Python?

Categories

Resources