Speeding up numpy operations - python

Using a 2D numpy array, I want to create a new array that expands the original one using a moving window. Let me explain what I mean with some example code:
# Simulate some data
import numpy as np
np.random.seed(1)
t = 20000 # total observations
location = np.random.randint(1, 5, (t,1))
var_id = np.random.randint(1, 8, (t,1))
hour = np.repeat(np.arange(0, (t/5)), 5).reshape(-1,1)
value = np.random.rand(t,1)
df = np.concatenate((location,var_id,hour,value),axis = 1)
Having "df" I want to create a new array "results" like below:
# length of moving window
window = 10
hours = df[:,2]
# create an empty array to store the results
results = np.empty((0,4))
for i in range(len(set(hours))-window+1):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    results = np.concatenate((results, obs_data), axis=0)
My problem is that the concatenation is very slow (on my system the operation takes 1.4 and 16 seconds without and with the concatenation, respectively). I have over a million data points and I want to speed up this code. Does anyone know a better way to create the new array faster (possibly without using np.concatenate)?

If you need to iterate, make the results array big enough to hold all the values.
# create an empty array to store the results
n = len(set(hours))-window+1
results = np.empty((n,4))
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i + window)]
    results[i, :] = obs_data
Repeated concatenate is slow; list append is faster.
It may be possible to get all obs_data from df with one indexing call, but I won't try to explore that now.
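One possible sketch of that single indexing call (not part of the original answer): broadcast hours against every window start to get one big boolean mask, then index df once. The mask holds n_windows x len(df) booleans, so for very large inputs this trades memory for speed.
# Sketch of the one-call indexing idea (illustrative, not from the answer above)
starts = np.arange(len(set(hours)) - window + 1)
mask = (hours[None, :] >= starts[:, None]) & (hours[None, :] <= starts[:, None] + window)
rows = np.nonzero(mask)[1]   # df row indices, ordered window by window
results = df[rows]           # a single fancy-indexing call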

Not a completely for-loop-free answer either, but a working one.
window = 10
hours = df[:,2]
# accumulate the windows in a plain python list...
lr = []
for i in range(len(set(hours)) - window + 1):
    obs_data = df[(hours >= i) & (hours <= i + window)]
    lr.append(obs_data)
# ...and stack them once at the end
results = np.vstack(lr)
It is way faster, for the reason already given: calling concatenate in a loop is awfully slow, whereas a Python list can be grown much more efficiently.
I would have preferred something like hpaulj's answer, with an array created up front and then filled. Even if obs_data is not a single row (as they seem to assume) but several rows, it is not really a problem. Something like
p = 0
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i + window)]
    results[p:p+len(obs_data), :] = obs_data
    p += len(obs_data)
would do.
But the problem here is to estimate the size of results. With your example, with uniformly distributed hours, it is quite easy: (len(set(hours)) - window + 1) * window * (len(hours) / len(set(hours)))
But I guess in reality, each obs_data has a different size.
So, the only way to compute the size of results in advance would be to do a first iteration just to compute the sum of len(obs_data), and then a second one to store obs_data. So vstack, even if not entirely satisfying, is probably the best option.
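A minimal sketch of that two-pass idea (an illustration only, reusing the variables from the question):
# First pass: count rows per window; second pass: fill a preallocated array
n = len(set(hours)) - window + 1
sizes = [np.count_nonzero((hours >= i) & (hours <= i + window)) for i in range(n)]
results = np.empty((sum(sizes), df.shape[1]))
p = 0
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i + window)]
    results[p:p + len(obs_data), :] = obs_data
    p += len(obs_data)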
Anyway, it is a very visible improvement over your version (on my computer, 22 seconds vs. less than 1).

Related

Speeding up 3D numpy and dataframe lookup

I currently have a pretty large 3D numpy array (atlasarray - 14M elements with type int64) in which I want to create a duplicate array where every element is a float based on a separate dataframe lookup (organfile).
I'm very much a beginner, so I'm sure that there must be a better (quicker) way to do this. Currently, it takes around 90s, which isn't ages but I'm sure can probably be reduced. Most of this code below is taken from hours of Googling, so surely isn't optimised.
import numpy as np
import pandas as pd
from tqdm import tqdm

organfile = pd.read_excel('/media/sf_VMachine_Shared_Path/ValidationData/ICRP110/AF/AF_OrgansSimp.xlsx')
densityarray = atlasarray
densityarray = densityarray.astype(float)
# iterate over every element and overwrite it with its density looked up from the dataframe
for idx, x in tqdm(np.ndenumerate(densityarray), total=densityarray.size):
    densityarray[idx] = organfile.loc[x, 'Density']
All of the elements in the original numpy array are integers which correspond to an organID. I used pandas to read in the key from an excel file and generate a 4-column dataframe, where in this particular case I want to extract the 4th column (which is a float). OrganIDs go up to 142. Apologies for the table format below, I couldn't get it to work so put it in code format instead.
| OrganID | OrganName | TissueType | Density |
|---------|-----------|------------|---------|
| 0       | Air       | 53         | 0.001   |
| 1       | Adrenal   | 43         | 1.030   |
Any recommendations on ways I can speed this up would be gratefully received.
Put the density from the dataframe into a numpy array:
density = np.array(organfile['Density'])
Then run:
density[atlasarray]
Don't use loops, they are slow. The following example with 14M elements takes less than 1 second to run:
import numpy as np

density = np.random.random(143)
atlasarray = np.random.randint(0, 142, (1000, 1000, 14))
densityarray = density[atlasarray]
Shape of densityarray:
print(densityarray.shape)
(1000, 1000, 14)

Comparing values in Dataframes

I am doing a Python project and trying to cut down on some computational time at the start using Pandas.
The code currently is:
for c1 in centres1:
    for c2 in centres2:
        if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
            possible_commet.append([c1, c2])
I am trying to put centres1 and centres2 into DataFrames and then compare each value to each other value. Would pandas help me cut some time off it (currently 2 minutes)? If not, how could I work around it?
Thanks
Unfortunately this is never going to be fast, as you are going to be performing n-squared operations. For example, if you are comparing n objects where n = 1000, then you only have 1 million comparisons. If however you have n = 10_000, then you have 100 million comparisons. A problem 10x bigger becomes 100 times slower.
Nevertheless, for loops in Python are relatively expensive. Using a library like pandas may mean that you only need to make one function call, which will shave some time off. Without any input data it's hard to assist further, but the below should provide some building blocks:
import pandas

df1 = pandas.DataFrame(centres1)
df2 = pandas.DataFrame(centres2)
df3 = df1.merge(df2, how='cross')
df3['combined_centre'] = (df3['0_x'] - df3['0_y'])**2 + (df3['1_x'] - df3['1_y'])**2
df3[df3['combined_centre'] < search_rad**2]
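For example, with some made-up centre lists (illustrative data, not from the question), the building blocks above run end to end:
# Illustrative end-to-end run on random data
import numpy as np
import pandas

rng = np.random.default_rng(0)
centres1 = rng.random((1000, 2)).tolist()
centres2 = rng.random((1200, 2)).tolist()
search_rad = 0.05

df1 = pandas.DataFrame(centres1)
df2 = pandas.DataFrame(centres2)
df3 = df1.merge(df2, how='cross')
df3['combined_centre'] = (df3['0_x'] - df3['0_y'])**2 + (df3['1_x'] - df3['1_y'])**2
pairs = df3[df3['combined_centre'] < search_rad**2]   # all (c1, c2) pairs within search_rad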
Yes, for sure pandas will help in cutting off at least some time compared to what you are getting right now, but you can also try this out:
for c1, c2 in zip(centres1, centres2):
    if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
        possible_commet.append([c1, c2])

'For' loop which relies heavily on subsetting is too slow. Alternatives to optimize?

I'm switching from R to Python. Unfortunately I'm stumbling upon a variety of loops which happen to run fast in my R scripts and too slow in Python (at least in my literal translations of such scripts). This code sample is one of them.
I'm slowly getting used to the idea that, when it comes to pandas, it's advisable to drop for loops and instead rely on vectorized functions and apply.
I need a few examples of how exactly to do this, since unfortunately my loops rely too much on classic subsetting, matching and appending, operations that are too slow in their raw form.
import numpy as np
import pandas as pd

# Create two empty lists to append results during the loop
values = []
occurrences = []

# Create sample dataset and sample series. It's just a sorted column (time series) and a column of random values:
time = np.arange(0, 5000000, 1)
variable = np.random.uniform(1, 1000, 5000000).round()
data = pd.DataFrame({'time': time, 'variable': variable})

# Time datapoints to match
time_datapoints_to_match = np.random.uniform(0, 5000000, 200).round()

for i in time_datapoints_to_match:
    time_window = data[(data['time'] > i) & (data['time'] <= i + 1000)]  # Subset a time window
    first_value_1pct = time_window['variable'].iloc[0] * 0.01  # extract 1/100 of the first value in the time window
    try:  # Check if we have a value lower than this 1/100 value within the time window
        first_occurence = time_window.loc[time_window['variable'] < first_value_1pct, 'time'].iloc[0]
    except IndexError:  # In case there are no matches, return NaN
        first_occurence = float('nan')
    values.append(first_value_1pct)
    occurrences.append(first_occurence)

# Create DataFrame out of the two output lists
final_report = pd.DataFrame({'values': values, 'first_occurence': occurrences})
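One possible way to speed this up (a sketch, not a definitive answer): since data['time'] is sorted, np.searchsorted can locate each time window directly instead of building a 5-million-element boolean mask for every datapoint; only a short loop over the 200 matched points remains.
# Sketch: locate each window with searchsorted on the sorted 'time' column
t_arr = data['time'].to_numpy()
v_arr = data['variable'].to_numpy()
points = np.asarray(time_datapoints_to_match)

start = np.searchsorted(t_arr, points, side='right')        # first row with time > i
stop = np.searchsorted(t_arr, points + 1000, side='right')  # first row with time > i + 1000

values, occurrences = [], []
for s, e in zip(start, stop):
    threshold = v_arr[s] * 0.01          # assumes each window is non-empty, as the original code does
    hits = np.nonzero(v_arr[s:e] < threshold)[0]
    values.append(threshold)
    occurrences.append(t_arr[s + hits[0]] if hits.size else float('nan'))

final_report = pd.DataFrame({'values': values, 'first_occurence': occurrences})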

Python fast DataFrame concatenation

I wrote some code that concatenates parts of a DataFrame back onto the same DataFrame, in order to normalize the number of occurrences of rows according to a certain column.
import random
import pandas

def normalize(data, expectation):
    """Normalize data by duplicating existing rows"""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    for tag, group in data.groupby(expectation, sort=False):
        array = pandas.DataFrame(columns=data.columns.values)
        i = 0
        while i < (max_count // int(counts[tag])):
            array = pandas.concat([array, group])
            i += 1
        i = max_count % counts[tag]
        if i > 0:
            # .ix is from older pandas versions; .loc / DataFrame.sample would be used today
            array = pandas.concat([array, group.ix[random.sample(group.index, i)]])
        data = pandas.concat([data, array])
    return data
and this is unbelievably slow. Is there a way to concatenate DataFrames fast, without creating copies of them?
There are a couple of things that stand out.
To begin with, the loop
i = 0
while i < (max_count // int(counts[tag])):
    array = pandas.concat([array, group])
    i += 1
is going to be very slow. Pandas is not built for these dynamic concatenations, and I suspect the performance is quadratic for what you're doing.
Instead, perhaps you could try
pandas.concat([group] * (max_count // int(counts[tag])))
which just creates a list first, and then calls concat for a one-shot concatenation of the entire list. This should bring the complexity down to linear, and I suspect it will have lower constants in any case.
Another thing which would reduce these small concats is calling groupby-apply. Instead of iterating over the result of groupby, write the loop body as a function, and call apply on it. Let Pandas figure out best how to concat all of the results into a single DataFrame.
However, even if you prefer to keep the loop, I'd just append things to a list and concat everything once at the end:
stuff = []
for tag, group in data.groupby(expectation, sort=False):
    # Call stuff.append for any DataFrame you were going to concat.
    ...
result = pandas.concat(stuff)
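Putting those pieces together, a possible rewrite of normalize() along these lines (a sketch only; it uses DataFrame.sample in place of the older .ix indexing and does a single concat at the very end):
import pandas

def normalize(data, expectation):
    """Normalize data by duplicating existing rows (list-accumulation sketch)."""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    pieces = [data]
    for tag, group in data.groupby(expectation, sort=False):
        repeats = max_count // int(counts[tag])
        remainder = max_count % int(counts[tag])
        pieces.extend([group] * repeats)
        if remainder > 0:
            pieces.append(group.sample(n=remainder))   # random rows, as group.ix did
    # one concat at the end instead of many inside the loop
    return pandas.concat(pieces)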

using numpy broadcasting / vectorization to build new array from other arrays

I am working on a stock ranking factor for a Quantopian model. They recommend avoiding the use of loops in custom factors. However, I am not exactly sure how I would avoid the loops in this case.
def GainPctInd(offset=0, nbars=2):
    class GainPctIndFact(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]

        def compute(self, today, assets, out, close, industries):
            # Compute the gain percents for all stocks
            asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100

            # For each industry, build a list of the per-stock gains over the given window
            gains_by_industry = {}
            for i in range(0, len(industries)):
                industry = industries[0, i]
                if industry in gains_by_industry:
                    gains_by_industry[industry].append(asset_gainpct[i])
                else:
                    gains_by_industry[industry] = [asset_gainpct[i]]

            # Loop through each stock's industry and compute a mean value for that
            # industry (caching it for reuse) and return that industry mean for
            # that stock
            mean_cache = {}
            for i in range(0, len(industries)):
                industry = industries[0, i]
                if not industry in mean_cache:
                    mean_cache[industry] = np.mean(gains_by_industry[industry])
                out[i] = mean_cache[industry]

    return GainPctIndFact()
When the compute function is called, assets is a 1-d array of the asset names, close is a multi-dimensional numpy array where there are window_length close prices for each asset listed in assets (using the same index numbers), and industries is the list of industry codes associated with each asset in a 1-d array. I know numpy vectorizes the computation of the gainpct in this line:
asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100
The result is that asset_gainpct is a 1-d array of all the computed gains for every stock. The part I am unclear about is how I would use numpy to finish the calculations without me manually looping through the arrays.
Basically, what I need to do is aggregate all of the gains for all of the stocks based on the industry they are in, then compute the average of those values, and then de-aggregate the averages back out to the full list of assets.
Right now, I am looping through all the industries and pushing the gain percentages into an industry-indexed dictionary storing a list of the gains per industry. Then I am calculating the mean for those lists and performing a reverse-industry lookup to map the industry gains to each asset based on their industry.
It seems to me like this should be possible to do using some highly optimized traversals of the arrays in numpy, but I can't seem to figure it out. I've never used numpy before today, and I'm fairly new to Python, so that probably doesn't help.
UPDATE:
I modified my industry code loop to try to handle the computation with a masked array using the industry array to mask the asset_gainpct array like such:
# For each industry, build a list of the per-stock gains over the given window
gains_by_industry = {}
for industry in industries.T:
    masked = ma.masked_where(industries != industry[0], asset_gainpct)
    np.nanmean(masked, out=out)
It gave me the following error:
IndexError: Inconsistant shape between the condition and the input
(got (20, 8412) and (8412,))
Also, as a side note, industries is coming in as a 20x8412 array because the window_length is set to 20. The extra values are the industry codes for the stocks on the previous days, except they don't typically change, so they can be ignored. I am now iterating over industries.T (the transpose of industries) which means industry is a 20-element array with the same industry code in each element. Hence, I only need element 0.
The error above is coming from the ma.masked_where() call. The industries array is 20x8412 so I presume asset_gainpct is the one listed as (8412,). How do I make these compatible for this call to work?
UPDATE 2:
I have modified the code again, fixing several other issues I have run into. It now looks like this:
# For each industry, build a list of the per-stock gains over the given window
unique_ind = np.unique(industries[0,])
for industry in unique_ind:
    masked = ma.masked_where(industries[0,] != industry, asset_gainpct)
    mean = np.full_like(masked, np.nanmean(masked), dtype=np.float64, subok=False)
    np.copyto(out, mean, where=masked)
Basically, the new premise here is that I have to build a mean-value filled array of the same size as the number of stocks in my input data and then copy the values into my destination variable (out) while applying my previous mask so that only the unmasked indexes are filled with the mean value. In addition, I realized that I was iterating over industries more than once in my previous incarnation, so I fixed that, too. However, the copyto() call is yielding this error:
TypeError: Cannot cast array data from dtype('float64') to
dtype('bool') according to the rule 'safe'
Obviously, I am doing something wrong; but looking through the docs, I don't see what it is. This looks like it should be copying from mean (which is np.float64 dtype) to out (which I have not previously defined) and it should be using masked as the boolean array for selecting which indexes get copied. Anyone have any ideas on what the issue is?
UPDATE 3:
First, thanks for all the feedback from everyone who contributed.
After much additional digging into this code, I have come up with the following:
def GainPctInd(offset=0, nbars=2):
    class GainPctIndFact(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]

        def compute(self, today, assets, out, close, industries):
            num_bars, num_assets = close.shape
            newest_bar_idx = (num_bars - 1) - offset
            oldest_bar_idx = newest_bar_idx - (nbars - 1)

            # Compute the gain percents for all stocks
            asset_gainpct = ((close[newest_bar_idx] - close[oldest_bar_idx]) / close[oldest_bar_idx]) * 100

            # For each industry, build a list of the per-stock gains over the given window
            unique_ind = np.unique(industries[0,])
            for industry in unique_ind:
                ind_view = asset_gainpct[industries[0,] == industry]
                ind_mean = np.nanmean(ind_view)
                out[industries[0,] == industry] = ind_mean

    return GainPctIndFact()
For some reason, the calculations based on the masked views were not yielding correct results. Further, getting those results into the out variable was not working. Somewhere along the line, I stumbled on a post about how numpy (by default) creates views of arrays instead of copies when you do a slice and that you can do a sparse slice based on a Boolean condition. When running a calculation on such a view, it looks like a full array as far as the calculation is concerned, but all the values are still actually in the base array. It's sort of like having an array of pointers and the calculations happen on the data the pointers point to. Similarly, you can assign a value to all nodes in your sparse view and have it update the data for all of them. This actually simplified the logic considerably.
I would still be interested in any ideas anyone has on how to remove the final loop over the industries and vectorize that process. I am wondering if maybe a map / reduce approach might work, but I am still not familiar enough with numpy to figure out how to do it any more efficiently than this FOR loop. On the bright side, the remaining loop only has about 140 iterations to go through vs the two prior loops which would go through 8000 each. In addition to that, I am now avoiding the construction of the gains_by_industry and the mean_cache dict and avoiding all the data copying which went with them. So, it is not just faster, it is also far more memory efficient.
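For what it's worth, one way the remaining loop could be vectorized (a sketch, not from the original post) is to turn the industry codes into dense group ids with np.unique(..., return_inverse=True) and let np.bincount do the per-group sums:
# Sketch: NaN-aware per-industry mean without an explicit industry loop.
# Assumes the industry codes in industries[0, :] contain no NaNs.
codes = industries[0, :]
uniq, inv = np.unique(codes, return_inverse=True)   # inv[i] = group id of asset i

valid = ~np.isnan(asset_gainpct)
sums = np.bincount(inv[valid], weights=asset_gainpct[valid], minlength=len(uniq))
counts = np.bincount(inv[valid], minlength=len(uniq))
group_means = sums / counts        # mean gain per industry (NaN gains ignored)

out[:] = group_means[inv]          # broadcast each industry's mean back to its assets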
UPDATE 4:
Someone gave me a more succinct way to accomplish this, finally eliminating the extra FOR loop. It basically hides the loop in a Pandas DataFrame groupby, but it more succinctly describes what the desired steps are:
def GainPctInd2(offset=0, nbars=2):
    class GainPctIndFact2(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]

        def compute(self, today, assets, out, close, industries):
            df = pd.DataFrame(index=assets, data={
                "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,
                "industry_codes": industries[-1]
            })
            out[:] = df.groupby("industry_codes").transform(np.mean).values.flatten()

    return GainPctIndFact2()
It does not improve the efficiency at all, according to my benchmarks, but it's probably easier to verify correctness. The one problem with their example is that it uses np.mean instead of np.nanmean, and np.nanmean drops the NaN values resulting in a shape mismatch if you try to use it. To fix the NaN issue, someone else suggested this:
def GainPctInd2(offset=0, nbars=2):
    class GainPctIndFact2(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]

        def compute(self, today, assets, out, close, industries):
            df = pd.DataFrame(index=assets, data={
                "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,
                "industry_codes": industries[-1]
            })
            nans = np.isnan(df['industry_codes'])
            notnan = ~nans
            out[notnan] = df[df['industry_codes'].notnull()].groupby("industry_codes").transform(np.nanmean).values.flatten()
            out[nans] = np.nan

    return GainPctIndFact2()
