using numpy broadcasting / vectorization to build new array from other arrays - python

I am working on a stock ranking factor for a Quantopian model. They recommend avoiding the use of loops in custom factors. However, I am not exactly sure how I would avoid the loops in this case.
def GainPctInd(offset=0, nbars=2):
    class GainPctIndFact(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            # Compute the gain percents for all stocks
            asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100
            # For each industry, build a list of the per-stock gains over the given window
            gains_by_industry = {}
            for i in range(0, len(industries)):
                industry = industries[0,i]
                if industry in gains_by_industry:
                    gains_by_industry[industry].append(asset_gainpct[i])
                else:
                    gains_by_industry[industry] = [asset_gainpct[i]]
            # Loop through each stock's industry and compute a mean value for that
            # industry (caching it for reuse) and return that industry mean for
            # that stock
            mean_cache = {}
            for i in range(0, len(industries)):
                industry = industries[0,i]
                if not industry in mean_cache:
                    mean_cache[industry] = np.mean(gains_by_industry[industry])
                out[i] = mean_cache[industry]
    return GainPctIndFact()
When the compute function is called, assets is a 1-d array of the asset names, close is a multi-dimensional numpy array where there are window_length close prices for each asset listed in assets (using the same index numbers), and industries is the list of industry codes associated with each asset in a 1-d array. I know numpy vectorizes the computation of the gainpct in this line:
asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100
The result is that asset_gainpct is a 1-d array of all the computed gains for every stock. The part I am unclear about is how I would use numpy to finish the calculations without me manually looping through the arrays.
Basically, what I need to do is aggregate all of the gains for all of the stocks based on the industry they are in, then compute the average of those values, and then de-aggregate the averages back out to the full list of assets.
Right now, I am looping through all the industries and pushing the gain percentages into an industry-indexed dictionary that stores a list of the gains per industry. Then I am calculating the mean of each list and performing a reverse industry lookup to map the industry mean back to each asset based on its industry.
It seems to me like this should be possible to do using some highly optimized traversals of the arrays in numpy, but I can't seem to figure it out. I've never used numpy before today, and I'm fairly new to Python, so that probably doesn't help.
UPDATE:
I modified my industry code loop to try to handle the computation with a masked array, using the industry array to mask the asset_gainpct array like so:
# For each industry, build a list of the per-stock gains over the given window
gains_by_industry = {}
for industry in industries.T:
    masked = ma.masked_where(industries != industry[0], asset_gainpct)
    np.nanmean(masked, out=out)
It gave me the following error:
IndexError: Inconsistant shape between the condition and the input
(got (20, 8412) and (8412,))
Also, as a side note, industries is coming in as a 20x8412 array because the window_length is set to 20. The extra values are the industry codes for the stocks on the previous days, except they don't typically change, so they can be ignored. I am now iterating over industries.T (the transpose of industries) which means industry is a 20-element array with the same industry code in each element. Hence, I only need element 0.
The error above is coming from the ma.masked_where() call. The industries array is 20x8412 so I presume asset_gainpct is the one listed as (8412,). How do I make these compatible for this call to work?
UPDATE 2:
I have modified the code again, fixing several other issues I have run into. It now looks like this:
# For each industry, build a list of the per-stock gains over the given window
unique_ind = np.unique(industries[0,])
for industry in unique_ind:
    masked = ma.masked_where(industries[0,] != industry, asset_gainpct)
    mean = np.full_like(masked, np.nanmean(masked), dtype=np.float64, subok=False)
    np.copyto(out, mean, where=masked)
Basically, the new premise here is that I have to build a mean-value filled array of the same size as the number of stocks in my input data and then copy the values into my destination variable (out) while applying my previous mask so that only the unmasked indexes are filled with the mean value. In addition, I realized that I was iterating over industries more than once in my previous incarnation, so I fixed that, too. However, the copyto() call is yielding this error:
TypeError: Cannot cast array data from dtype('float64') to
dtype('bool') according to the rule 'safe'
Obviously, I am doing something wrong; but looking through the docs, I don't see what it is. This looks like it should be copying from mean (which is np.float64 dtype) to out (which I have not previously defined) and it should be using masked as the boolean array for selecting which indexes get copied. Anyone have any ideas on what the issue is?
UPDATE 3:
First, thanks for all the feedback from everyone who contributed.
After much additional digging into this code, I have come up with the following:
def GainPctInd(offset=0, nbars=2):
    class GainPctIndFact(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            num_bars, num_assets = close.shape
            newest_bar_idx = (num_bars - 1) - offset
            oldest_bar_idx = newest_bar_idx - (nbars - 1)
            # Compute the gain percents for all stocks
            asset_gainpct = ((close[newest_bar_idx] - close[oldest_bar_idx]) / close[oldest_bar_idx]) * 100
            # For each industry, build a list of the per-stock gains over the given window
            unique_ind = np.unique(industries[0,])
            for industry in unique_ind:
                ind_view = asset_gainpct[industries[0,] == industry]
                ind_mean = np.nanmean(ind_view)
                out[industries[0,] == industry] = ind_mean
    return GainPctIndFact()
For some reason, the calculations based on the masked arrays were not yielding correct results, and getting those results into the out variable was not working either. Somewhere along the line, I stumbled on a post explaining that numpy (by default) creates views of arrays instead of copies when you take a basic slice, and that you can also index an array with a Boolean condition. Reading through a Boolean index gives you a new array holding just the selected elements, which looks like a full array as far as the calculation is concerned. More importantly, assigning through a Boolean index (as in out[condition] = value) writes directly into the underlying array, so you can update just the selected elements in place. This actually simplified the logic considerably.
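A tiny illustration of the read vs. write behaviour (a generic sketch, not from the factor code):

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([True, False, True, False])

b = a[mask]      # reading through the mask returns a new array ([1.0, 3.0])
b[:] = 0.0       # so modifying b does not touch a
a[mask] = 99.0   # but assigning through the mask writes into a itself
# a is now [99.0, 2.0, 99.0, 4.0]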
I would still be interested in any ideas anyone has on how to remove the final loop over the industries and vectorize that process. I am wondering if a map / reduce approach might work, but I am still not familiar enough with numpy to figure out how to do it any more efficiently than this FOR loop. On the bright side, the remaining loop only has about 140 iterations to go through, versus the two prior loops which each went through about 8000. In addition, I am now avoiding the construction of the gains_by_industry and mean_cache dicts and all the data copying that went with them. So it is not just faster, it is also far more memory efficient.
UPDATE 4:
Someone gave me a more succinct way to accomplish this, finally eliminating the extra FOR loop. It basically hides the loop inside a Pandas DataFrame groupby, but it describes the desired steps much more clearly:
def GainPctInd2(offset=0, nbars=2):
    class GainPctIndFact2(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            df = pd.DataFrame(index=assets, data={
                "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,
                "industry_codes": industries[-1]
            })
            out[:] = df.groupby("industry_codes").transform(np.mean).values.flatten()
    return GainPctIndFact2()
It does not improve efficiency at all, according to my benchmarks, but it is probably easier to verify for correctness. The one problem with that example is that it uses np.mean instead of np.nanmean; simply swapping in np.nanmean does not work either, because nanmean drops the NaN values and produces a shape mismatch. To fix the NaN issue, someone else suggested this:
def GainPctInd2(offset=0, nbars=2):
    class GainPctIndFact2(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            df = pd.DataFrame(index=assets, data={
                "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,
                "industry_codes": industries[-1]
            })
            nans = np.isnan(df['industry_codes'])
            notnan = ~nans
            out[notnan] = df[df['industry_codes'].notnull()].groupby("industry_codes").transform(np.nanmean).values.flatten()
            out[nans] = np.nan
    return GainPctIndFact2()
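For completeness, a numpy-only alternative to the groupby (a sketch I have not benchmarked, written against the compute() variables from UPDATE 3) would be np.unique with return_inverse plus np.bincount, which makes the per-industry NaN handling explicit:

# Group-mean without an explicit industry loop:
# inv maps each asset to the position of its industry code in `codes`.
codes, inv = np.unique(industries[0], return_inverse=True)
valid = ~np.isnan(asset_gainpct)
sums = np.bincount(inv, weights=np.where(valid, asset_gainpct, 0.0))
counts = np.bincount(inv, weights=valid.astype(np.float64))
group_means = sums / counts      # NaN for industries with no valid gains
out[:] = group_means[inv]        # scatter each industry mean back to its assets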


Related

Speeding up numpy operations

Using a 2D numpy array, I want to create a new array that expands the original one using a moving window. Let me explain what I mean using an example code:
# Simulate some data
import numpy as np
np.random.seed(1)
t = 20000 # total observations
location = np.random.randint(1, 5, (t,1))
var_id = np.random.randint(1, 8, (t,1))
hour = np.repeat(np.arange(0, (t/5)), 5).reshape(-1,1)
value = np.random.rand(t,1)
df = np.concatenate((location,var_id,hour,value),axis = 1)
Having "df" I want to create a new array "results" like below:
# length of moving window
window = 10
hours = df[:,2]
# create an empty array to store the results
results = np.empty((0,4))
for i in range(len(set(hours))-window+1):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    results = np.concatenate((results, obs_data), axis=0)
My problem is that the concatenation is very slow (on my system the operation takes 1.4 seconds without the concatenation and 16 seconds with it). I have over a million data points and I want to speed up this code. Does anyone know a better way to build the new array faster (possibly without using np.concatenate)?
If you need to iterate, make the results array big enough to hold all the values.
# create an empty array to store the results
n = len(set(hours))-window+1
results = np.empty((n,4))
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i+...
    results[i,:] = obs_data
Repeated concatenate is slow; list append is faster.
It may be possible to get all obs_data from df with one indexing call, but I won't try to explore that now.
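For what it's worth, a hedged sketch of that single-indexing-call idea (not benchmarked, and memory-heavy since it materialises an n-by-t boolean mask):

n = len(set(hours)) - window + 1
starts = np.arange(n)[:, None]                              # one row per window
mask = (hours[None, :] >= starts) & (hours[None, :] <= starts + window)
win_idx, row_idx = np.nonzero(mask)   # row-major, so rows come out grouped window by window
results = df[row_idx]                 # same row order as the concatenate loop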
Not a completely for-loop-free answer either, but a working one:
window = 10
hours = df[:,2]
# collect the windows in a python list, then stack once at the end
lr = []
for i in range(len(set(hours))-window+1):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    lr.append(obs_data)
results = np.vstack(lr)
It is way faster, for the reason already given: calling concatenate in a loop is awfully slow, whereas a Python list can be expanded much more efficiently.
I would have preferred something like hpaulj's answer, with an array created up front and then filled. Even if obs_data is not a single row (as they seem to assume) but several rows, it is not really a problem. Something like
p = 0
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i+...
    results[p:p+len(obs_data),:] = obs_data
    p += len(obs_data)
would do.
But the problem here is estimating the size of results in advance. With your example, with uniformly distributed hours, it is quite easy: (len(set(hours))-window+1)*window*(len(hours)/len(set(hours)))
But I guess in reality, each obs_data has a different size.
So, the only way to compute the size of results in advance would be to do a first pass just to compute the sum of len(obs_data), and then a second pass to store obs_data. So vstack, even if not entirely satisfying, is probably the best option.
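A sketch of that two-pass idea, for reference (the masks are built once and reused for both passes):

n = len(set(hours)) - window + 1
masks = [(hours >= i) & (hours <= i + window) for i in range(n)]
total = sum(int(m.sum()) for m in masks)   # first pass: count the rows
results = np.empty((total, 4))
p = 0
for m in masks:                            # second pass: fill the preallocated array
    obs_data = df[m]
    results[p:p + len(obs_data), :] = obs_data
    p += len(obs_data)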
Anyway, it is a very visible improvement over your version (on my computer, 22 seconds versus less than 1).

How to use np.Vectorize() with Pandas function?

I have a function that operates on a row of a Pandas DataFrame. It works with pandas.apply(), but it does not work with np.vectorize(). The function is below:
def AMTTL(inputData, amortization = []):
    rate = inputData['EIR']
    payment = inputData['INSTALMENT']
    amount = inputData['OUTSTANDING']
    amortization = [amount]
    if amount - payment <= 0:
        return amortization
    else:
        while amount > 0:
            amount = BALTL(rate, payment, amount)
            if amount <= 0:
                continue
            amortization.append(amount)
        return amortization
The function receives inputData as a row of a Pandas DataFrame; EIR, INSTALMENT and OUTSTANDING are the column names. This function works well with pandas.apply():
data.apply(AMTTL, axis = 1)
However, when I tried to use np.vectorize(), it did not work with the code below:
vfunc = np.vectorize(AMTTL)
vfunc(data)
It raised an error like 'Timestamp' object is not subscriptable. So I tried dropping the unused columns, but then it raised another error: invalid index to scalar variable.
I am not sure how to adapt the pandas.apply() call to np.vectorize().
Any suggestion? Thank you in advance.
np.vectorize is nothing more than a map function applied to all the elements of the array, meaning you cannot differentiate between the columns within the function. It has no idea of column names like EIR or INSTALMENT. Therefore your current implementation will not work with numpy.
From the docs:
The vectorized function evaluates pyfunc over successive tuples of the
input arrays like the python map function, except it uses the
broadcasting rules of numpy.
The vectorize function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
Based on your problem, you should try np.apply_along_axis instead, where you can refer to the different columns by their positional indexes.
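A minimal sketch of what that could look like, assuming EIR, INSTALMENT and OUTSTANDING are passed in that column order. BALTL is not shown in the question, so a hypothetical balance update is used here, and the function returns a payment count instead of the ragged amortization list (apply_along_axis expects each call to return the same shape):

import numpy as np
import pandas as pd

def remaining_balance(rate, payment, amount):
    # hypothetical stand-in for the asker's BALTL()
    return amount * (1 + rate) - payment

def n_payments(row):
    # row arrives as a plain 1-D numpy array: [EIR, INSTALMENT, OUTSTANDING]
    rate, payment, amount = row
    count = 1
    while amount - payment > 0:
        amount = remaining_balance(rate, payment, amount)
        if amount > 0:
            count += 1
    return count

data = pd.DataFrame({
    "EIR": [0.01, 0.02],
    "INSTALMENT": [100.0, 250.0],
    "OUTSTANDING": [1000.0, 2000.0],
})

cols = data[["EIR", "INSTALMENT", "OUTSTANDING"]].to_numpy()
counts = np.apply_along_axis(n_payments, 1, cols)   # one result per row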

Referencing time and (time+10 seconds) to calc normalized price return in Pandas Dataframe

I am trying to normalize the price at a certain point in time with respect to the price 10 seconds later, using this formula: ((price(t+10sec) - price(t)) / price(t)) / spread(t)
Both price and spread are columns in my dataframe, and I have indexed the dataframe by timestamp (a pd.datetime object) because I figured that would make calculating price(t+10sec) easier.
What I've tried so far:
pos['timestamp'] = pd.to_datetime(pos['timestamp'])
pos.set_index('timestamp')

def normalize_data(pos):
    t0 = pd.to_datetime('2021-10-27 09:30:13.201')
    x = pos['mid_price']
    y = ((x[t0 + pd.Timedelta('10 sec')] - x)/x) / (spread)
    return y

pos['norm_price'] = normalize_data(pos)
This gives me an error because I'm indexing x[t0 + pd.Timedelta('10 sec')] but not the other x's in the equation. I also don't think I'm using pd.Timedelta or the x[t0+pd.Time...] lookup correctly, and I'm unsure how to fix all this or define a better function.
Any input would be much appreciated
Your problem is here:
pos.set_index('timestamp')
This line of code will return a new dataframe, and leave your original dataframe unchanged. So, your function normalize_data is working on the original version of pos, which does not have the index you want, and neither will x. Change your code to this:
pos = pos.set_index('timestamp')
And that should get things working.
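With the index in place, a hedged sketch of the full normalization (assuming pos also has 'mid_price' and 'spread' columns, and that a row exists exactly 10 seconds after each timestamp; where it does not, the reindex below produces NaN):

future_price = pos['mid_price'].reindex(pos.index + pd.Timedelta('10 sec'))
pos['norm_price'] = (
    (future_price.to_numpy() - pos['mid_price'].to_numpy())
    / pos['mid_price'].to_numpy()
) / pos['spread'].to_numpy()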

fastest way to get max value of each masked np.array for many masks?

I have two numpy arrays of the same shape. One contains information that I am interested in, and the other contains a bunch of integers that can be used as mask values.
In essence, I want to loop through each unique integer to get a mask for the array, then filter the main array using that mask and find the max value of the filtered array.
For simplicity, lets say the arrays are:
arr1 = np.random.rand(10000,10000)
arr2 = np.random.randint(low=0, high=1000, size=(10000,10000))
right now I'm doing this:
maxes = {}
ids = np.unique(arr2)
for id in ids:
    max_val = arr1[np.equal(arr2, id)].max()
    maxes[id] = max_val
My arrays are a lot bigger and this is painfully slow. I am struggling to find a quicker way of doing this... maybe there's some kind of creative method I'm not aware of. I would really appreciate any help.
EDIT
let's say the majority of arr2 is actually 0 and I don't care about the 0 id, is it possible to speed it up by dropping this entire chunk from the search?
i.e.
arr2[:, 0:4000] = 0
and just return the maxes for ids > 0 ??
much appreciated..
Generic bin-based reduction strategies
Listed below are a few approaches to tackle such scenarios where we need to perform bin-based reduction operations. Essentially, we are given two arrays and we are required to use one as the bins and the other as the values, and reduce the values array per bin.
Approach #1: One strategy would be to sort arr1 based on arr2. Once we have them both sorted in that same order, we find the group start and stop indices and then, with the appropriate ufunc.reduceat, we do our slice-based reduction operation. That's all there is!
Here's the implementation -
def binmax(bins, values, reduceat_func):
    ''' Get binned statistic from two 1D arrays '''
    sidx = bins.argsort()
    bins_sorted = bins[sidx]
    grpidx = np.flatnonzero(np.r_[True, bins_sorted[:-1] != bins_sorted[1:]])
    max_per_group = reduceat_func(values[sidx], grpidx)
    out = dict(zip(bins_sorted[grpidx], max_per_group))
    return out

out = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.maximum.reduceat)
It's applicable across ufuncs that have their corresponding ufunc.reduceat methods.
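For example, the same helper should give per-id sums or minima just by swapping in another reduceat method (a quick sketch):

sums = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.add.reduceat)
mins = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.minimum.reduceat)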
Approach #2: We can also leverage scipy.stats.binned_statistic, which is basically a generic utility for performing some of the common reduction operations based on binned array values -
from scipy.stats import binned_statistic

def binmax_v2(bins, values, statistic):
    ''' Get binned statistic from two 1D arrays '''
    num_labels = bins.max()+1
    R = np.arange(num_labels+1)
    Mx = binned_statistic(bins, values, statistic=statistic, bins=R)[0]
    idx = np.flatnonzero(~np.isnan(Mx))
    out = dict(zip(idx, Mx[idx]))  # keep the float maxima (no int cast)
    return out

out = binmax_v2(arr2.ravel(), arr1.ravel(), statistic='max')
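Regarding the EDIT: if the 0 id really can be ignored, one hedged option is simply to mask it out before calling either helper, so that chunk never enters the sort / binning step:

flat_bins = arr2.ravel()
flat_vals = arr1.ravel()
keep = flat_bins != 0
maxes = binmax(flat_bins[keep], flat_vals[keep], reduceat_func=np.maximum.reduceat)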

Python, Pandas: 80/20 Randomly Split Data; How to loop when index value is 'missing'?

I am trying to loop through a Series which was randomly sampled from an existing data set to serve as a training data set. Here is the output of my Series after the split:
Index data
0 1150
1 2000
2 1800
. .
. .
. .
1960 1800
1962 1200
. .
. .
. .
20010 1500
There is no index 1961 because the random selection process that created the training data set removed it. When I try to loop through to calculate my residual sum of squares, it does not work. Here is my loop code:
def ResidSumSquares(x, y, intercept, slope):
    out = 0
    temprss = 0
    for i in x:
        out = (slope * x.loc[i]) + intercept
        temprss = temprss + (y.loc[i] - out)
    RSS = temprss**2
    return print("RSS: {}".format(RSS))
KeyError: 'the label [1961] is not in the [index]'
I am still very new to Python and I am not sure of the best way to fix this.
Thank you in advance.
I found the answer right after I posted the question, my apologies. Posted by #mkln
How to reset index in a pandas data frame?
df = df.reset_index(drop=True)
This resets the index of the entire Series and it is not exclusive to DataFrame data type.
My updated function code works like a charm:
def ResidSumSquares(x, y, intercept, slope):
    out = 0
    myerror = 0
    x = x.reset_index(drop=True)
    y = y.reset_index(drop=True)
    for i in x:
        out = slope * x.loc[i] + float(intercept)
        myerror = myerror + (y.loc[i] - out)
    RSS = myerror**2
    return print("RSS: {}".format(RSS))
You omit your actual call to ResidSumSquares. How about not resetting the index within the function and passing the training set as x? Iterating over an unusual (not 1, 2, 3, ...) index shouldn't be a problem.
A few observations:
As currently written, your function is calculating the squared sum of the errors, not the sum of squared errors... is this intentional? The latter is typically what is used in regression-type applications. Since your variable is named RSS (I assume residual sum of squares), you will want to revisit this.
If x and y are consistent subsets of the same larger dataset, then you should have the same indices for both, right? Otherwise, by dropping the index you may be matching unrelated x and y variables and glossing over a bug earlier in the code.
Since you are using Pandas this can be easily vectorized to improve readability and speed (Python loops have high overhead)
Example of (3), assuming (2), and illustrating the differences between approaches in (1):
#assuming your indices should be aligned,
#pandas will link xs and ys by index
vectorized_error = y - (slope*x + float(intercept))
#your residual sum of squares--you have to square first!
rss = (vectorized_error**2).sum()
# if you really want the square of the summed errors...
sse = (vectorized_error.sum())**2
Edit: didn't notice this has been dead for a year.
