Python improving function speed

I am writing a script to calculate the relation between two signals. I use the mlab.csd and mlab.psd functions to compute the CSD and PSD of the signals.
My array x has the shape (120,68,68,815). The script takes several minutes to run, and this function is the hotspot.
Does anyone have an idea what I should do? I am not that familiar with performance tuning. Thanks!
# to read the list of stcs for all the epochs
with open('/home/daniel/Dropbox/F[...]', 'rb') as f:
    label_ts = pickle.load(f)
x = np.asarray(label_ts)
nfft = 512
n_freqs = nfft/2+1
n_epochs = len(x) # in this case there are 120 epochs
channels = 68
sfreq = 1017.25
def compute_mean_psd_csd(x, n_epochs, nfft, sfreq):
    '''Computes mean of PSD and CSD for signals.'''
    Rxy = np.zeros((n_epochs, channels, channels, n_freqs), dtype=complex)
    Rxx = np.zeros((n_epochs, channels, channels, n_freqs))
    Ryy = np.zeros((n_epochs, channels, channels, n_freqs))
    for i in xrange(0, n_epochs):
        print('computing connectivity for epoch %s'%(i+1))
        for j in xrange(0, channels):
            for k in xrange(0, channels):
                Rxy[i,j,k], freqs = mlab.csd(x[j], x[k], NFFT=nfft, Fs=sfreq)
                Rxx[i,j,k], _____ = mlab.psd(x[j], NFFT=nfft, Fs=sfreq)
                Ryy[i,j,k], _____ = mlab.psd(x[k], NFFT=nfft, Fs=sfreq)
    Rxy_mean = np.mean(Rxy, axis=0, dtype=np.float32)
    Rxx_mean = np.mean(Rxx, axis=0, dtype=np.float32)
    Ryy_mean = np.mean(Ryy, axis=0, dtype=np.float32)
    return freqs, Rxy, Rxy_mean, np.real(Rxx_mean), np.real(Ryy_mean)

Something that could help if the csd and psd methods are computationally intensive: you could cache the results of previous calls and reuse them instead of recomputing them.
As it seems, you will have 120 * 68 * 68 = 554880 iterations.
In the case of the psd calculation, it should be possible to cache the values without problems, as the method depends on only one signal.
Store the value in a dict and, for x[j] or x[k], check whether it already exists. If it doesn't, compute it and store it. If it does, skip the computation and reuse the stored value.
# NumPy arrays are not hashable, so the channel index serves as the cache key
if j not in cache_psd:
    cache_psd[j], ____ = mlab.psd(x[j], NFFT=nfft, Fs=sfreq)
Rxx[i,j,k] = cache_psd[j]
if k not in cache_psd:
    cache_psd[k], ____ = mlab.psd(x[k], NFFT=nfft, Fs=sfreq)
Ryy[i,j,k] = cache_psd[k]
You can do the same with the csd method. I don't know enough about it to say more. If the order of the parameters doesn't matter, you can store the two parameters in sorted order to prevent duplicates such as 2, 1 and 1, 2.
The cache will make the code faster only if looking up a stored value is cheaper than computing and storing it. This fix could easily be added with a module that does memoization.
Here's an article about memoization for further reading:
http://www.python-course.eu/python3_memoization.php
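A minimal sketch of the caching idea, under some assumptions: toy_psd is a hypothetical stand-in for mlab.psd (a plain periodogram, so the example runs without matplotlib), and the array is a small random one rather than the real data. The cache is keyed by channel index, and the call counter shows how many spectra are actually computed.

```python
import numpy as np

def toy_psd(sig):
    """Hypothetical stand-in for mlab.psd: a plain periodogram."""
    toy_psd.calls += 1
    return np.abs(np.fft.rfft(sig)) ** 2 / len(sig)
toy_psd.calls = 0

def cached_psd(x, j, cache):
    """Compute the PSD of channel j once and reuse it afterwards."""
    if j not in cache:
        cache[j] = toy_psd(x[j])
    return cache[j]

rng = np.random.default_rng(0)
channels, n_times = 4, 64
x = rng.standard_normal((channels, n_times))

cache = {}
Rxx = np.empty((channels, channels, n_times // 2 + 1))
Ryy = np.empty_like(Rxx)
for j in range(channels):
    for k in range(channels):
        Rxx[j, k] = cached_psd(x, j, cache)
        Ryy[j, k] = cached_psd(x, k, cache)

n_calls = toy_psd.calls  # one call per channel instead of two per (j, k) pair
```

With the cache, the number of PSD evaluations grows with the number of channels rather than with the number of channel pairs.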

Related

optimize this numpy operation

I have inherited some code and there is one particular operation that takes an inordinate amount of time.
The operation is defined as:
cutoff = 0.2
# X has shape (76187, 247, 20)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
weightfun = lambda x: 1.0 / np.sum(np.dot(X_flat, x) / np.dot(x, x) > 1 - cutoff)
# This is expensive...
N_list = np.array(list(map(weightfun, X_flat)))
This takes hours to compute on my machine. I am wondering if there is a way to optimize this. The code is computing normalized hamming distances between vector sequences.

weightfun performs two dot product operations for every row of X_flat. The worst one is np.dot(X_flat, x), where the dot product is performed against the whole X_flat matrix. But there's a trick to speed things up. The repeated work in the first dot product can be computed just once with:
X_matmut = X_flat @ X_flat.T
Also, I noticed that the second dot product is nothing more than the diagonal of the result of the first one.
The rewritten code looks like this:
cutoff = 0.2
# X has shape (76187, 247, 20)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
X1 = X_flat @ X_flat.T
X2 = X1.diagonal()
N_list = 1.0 / (X1/X2 > 1 - cutoff).sum(axis=0)
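As a sanity check, the two formulations can be compared on a small random array. This is only a sketch: the shape (50, 6, 4) is an arbitrary stand-in for the real (76187, 247, 20) data, and the names N_loop, N_vec and gram are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 6, 4))  # small stand-in for the real array
cutoff = 0.2
X_flat = X.reshape(X.shape[0], -1)

# Original row-by-row formulation
weightfun = lambda x: 1.0 / np.sum(np.dot(X_flat, x) / np.dot(x, x) > 1 - cutoff)
N_loop = np.array([weightfun(row) for row in X_flat])

# Gram-matrix formulation: all pairwise dot products in a single product
gram = X_flat @ X_flat.T
N_vec = 1.0 / (gram / gram.diagonal() > 1 - cutoff).sum(axis=0)

same = np.allclose(N_loop, N_vec)
```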
Edit
For such a large input, memory becomes the new bottleneck when performing the operation above, as the new matrix won't fit into RAM. So there's also the option of breaking the computation into chunks, as the code below shows.
The code gets a little messy, but at least it didn't try to destroy my PC :-P
import numpy as np
import time
# Sample data
X = np.random.random([76187, 247, 20])
start = time.time()
cutoff = 0.2
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
# Divide data into 20 chunks
X_parts = np.array_split(X_flat, 20)
# Diagonal will be saved incrementally
diagonal = []
for i in range(len(X_parts)):
    part = X_parts[i]
    X_parts[i] = part @ X_flat.T
    diagonal.extend(X_parts[i][range(len(X_parts[i])), range(len(diagonal), len(diagonal)+len(X_parts[i]))])
# Performs the second part of the calculation
diagonal = np.array(diagonal)
X_list = np.zeros(len(diagonal))
for x in X_parts:
    X_list += (x/diagonal > 1 - cutoff).sum(axis=0)
X_list = 1.0 / X_list
print('Time to solve: %.2f secs' % (time.time() - start))
I would love to perform the whole computation in a single loop and discard the used chunks, but the whole matrix has to be traversed once to retrieve the diagonal. I don't believe it's worth computing everything twice to save memory.
Even on a decent setup (16 GB of RAM, an Intel i7 and an SSD for storage), the whole run took around 15 minutes.
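One possible refinement, offered as a sketch rather than as part of the answer above: the diagonal of X_flat @ X_flat.T is just the squared norm of each row, so it can be computed up front without forming any matrix product. Each chunk can then be processed and discarded in a single loop. The function name chunked_weights and the small test array are assumptions made for illustration.

```python
import numpy as np

def chunked_weights(X_flat, cutoff=0.2, n_chunks=4):
    """Return 1 / (number of rows similar to each row), one chunk at a time."""
    # The diagonal of X_flat @ X_flat.T equals the squared norm of each row
    sq_norms = np.einsum('ij,ij->i', X_flat, X_flat)
    counts = np.zeros(len(X_flat))
    for part in np.array_split(X_flat, n_chunks):
        block = part @ X_flat.T  # shape (chunk_rows, n_rows), dropped after use
        counts += (block / sq_norms > 1 - cutoff).sum(axis=0)
    return 1.0 / counts

rng = np.random.default_rng(1)
X_flat = rng.random((60, 24))
cutoff = 0.2

# Reference result from the full (unchunked) product
gram = X_flat @ X_flat.T
reference = 1.0 / (gram / gram.diagonal() > 1 - cutoff).sum(axis=0)
result = chunked_weights(X_flat, cutoff=cutoff)
```

Only one chunk-sized block is held in memory at a time, at the cost of one extra pass over X_flat for the row norms.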

How to vectorize indexing and computation when indexed tensors are different dimensions?

I'm trying to vectorize the following for-loop in Pytorch. I'd be happy with just vectorizing the inner for-loop, but doing the whole batch would also be awesome.
# B: the batch size
# N: the number of training examples
# dim: the dimension of each feature vector
# K: the number of discrete labels. each vector has a single label
# delta: margin for hinge loss
batch_data = torch.tensor(...) # Tensor of shape [B x N x dim]
batch_labels = torch.tensor(...) # Tensor of shape [B x N x 1], each element is one of K labels (ints)
batch_losses = [] # Ultimately should be [B x 1]
batch_centroids = [] # Ultimately should be [B x K_i x dim]
for i in range(B):
    data = batch_data[i] # [N x dim]
    labels = batch_labels[i].squeeze(-1) # [N]
    centroids = [] # Keep track of the means for each class.
    classes = torch.unique(labels) # Get the unique labels for the classes.
    # NOTE: The number of classes K for each item in the batch might actually
    # be different. This may complicate batch-level operations.
    total_loss = 0
    # For each class independently. This is the part I want to vectorize.
    for cl in classes:
        # Take the subset of training examples with that label.
        subset = data[torch.where(labels == cl)]
        # Find the centroid of that subset.
        centroid = subset.mean(dim=0)
        centroids.append(centroid)
        # Get the distance between each point in the subset and the centroid.
        dists = subset - centroid
        norm = torch.linalg.norm(dists, dim=1)
        # The loss is the mean of the hinge loss across the subset.
        margin = norm - delta
        hinge = torch.clamp(margin, min=0.0) ** 2
        total_loss += hinge.mean()
    # Keep track of everything. If it's too hard to keep track of centroids, that's also OK.
    loss = total_loss.mean()
    batch_losses.append(loss)
    batch_centroids.append(centroids)
I've been scratching my head on how to deal with the irregularly sized tensors. The number of classes in each batch K_i is different, and the size of each subset is different.

It turns out it actually is possible to vectorize across ragged arrays. I'll use numpy, but code should be directly translatable to torch. The key technique is to:
Sort by ragged array membership
Perform an accumulation
Find boundary indices, compute adjacent differences
For a single (non-batch) input of an n x d matrix X and an n-length array label, the following returns the k x d centroids and the n-length squared distances to the respective centroids:
def vcentroids(X, label):
    """
    Vectorized version of centroids.
    """
    # order points by cluster label
    ix = np.argsort(label)
    label = label[ix]
    Xz = X[ix]
    # compute pos where pos[i]:pos[i+1] is span of cluster i
    d = np.diff(label, prepend=0) # binary mask where labels change
    pos = np.flatnonzero(d) # indices where labels change
    pos = np.repeat(pos, d[pos]) # repeat for 0-length clusters
    pos = np.append(np.insert(pos, 0, 0), len(X))
    Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)
    Xsums = np.cumsum(Xz, axis=0)
    Xsums = np.diff(Xsums[pos], axis=0)
    counts = np.diff(pos)
    c = Xsums / np.maximum(counts, 1)[:, np.newaxis]
    repeated_centroids = np.repeat(c, counts, axis=0)
    aligned_centroids = repeated_centroids[inverse_permutation(ix)]
    dist = np.sum((X - aligned_centroids) ** 2, axis=1)
    return c, dist
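The function above calls inverse_permutation, which is not defined in the answer. A plausible definition, given here as an assumption about what the helper does, maps rows from sorted order back to their original order:

```python
import numpy as np

def inverse_permutation(ix):
    """Return inv such that a[ix][inv] equals a (assumed helper, not from the answer)."""
    inv = np.empty_like(ix)
    inv[ix] = np.arange(len(ix))
    return inv

label = np.array([2, 0, 1, 0, 2])
ix = np.argsort(label, kind='stable')  # permutation that sorts the labels
inv = inverse_permutation(ix)
restored = label[ix][inv]              # sorted labels mapped back to input order
```

For a permutation, np.argsort(ix) gives the same result. The scatter assignment avoids a second sort.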
Batching requires little special handling. For an input B x n x d array batch_X, with B x n batch labels batch_labels, create unique labels for each batch:
batch_k = batch_labels.max(axis=1) + 1
batch_k[1:] = batch_k[:-1]
batch_k[0] = 0
base = np.cumsum(batch_k)
batch_labels += np.expand_dims(base, 1)
So now each batch element has a unique contiguous range of labels. I.e., the first batch element will have n labels in some range [0, k0) where k0 is its number of labels, the second will have range [k0, k0 + k1) where k1 is the number of labels of the second element, etc.
Then just flatten the B x n x d input to B*n x d and call the same vectorized method. Your loss function is derivable using the final distances and same position-array based reduction technique.
For a detailed explanation of how the vectorization works, see my blog post.
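To make the loss step concrete, here is a sketch for a single (non-batch) input. Note that it swaps in np.bincount and np.add.at for the grouped reductions instead of the cumsum/diff position-array technique described above, and that the function names are hypothetical. A plain loop version is included for comparison.

```python
import numpy as np

def hinge_loss_per_class(X, label, delta):
    """Sum over classes of the mean squared hinge of distances to the class centroid."""
    k = label.max() + 1
    counts = np.bincount(label, minlength=k)
    sums = np.zeros((k, X.shape[1]))
    np.add.at(sums, label, X)                      # per-class feature sums
    centroids = sums / np.maximum(counts, 1)[:, None]
    norm = np.linalg.norm(X - centroids[label], axis=1)
    hinge = np.clip(norm - delta, 0.0, None) ** 2
    class_mean = np.bincount(label, weights=hinge, minlength=k) / np.maximum(counts, 1)
    return class_mean[counts > 0].sum()            # ignore labels that never occur

def reference_loss(X, label, delta):
    """Loop version mirroring the inner loop of the question."""
    total = 0.0
    for cl in np.unique(label):
        subset = X[label == cl]
        norm = np.linalg.norm(subset - subset.mean(axis=0), axis=1)
        total += (np.clip(norm - delta, 0.0, None) ** 2).mean()
    return total

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
label = rng.integers(0, 5, size=40)
vectorized = hinge_loss_per_class(X, label, delta=0.5)
looped = reference_loss(X, label, delta=0.5)
```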

You can vectorize the whole thing if you use a one-hot encoding for your classes and a pairwise distance trick for your norms:
import torch
B = 32
N = 1000
dim = 50
K = 25
batch_data = torch.randn((B, N, dim))
batch_labels = torch.randint(0, K, size=(B, N))
batch_one_hot = torch.nn.functional.one_hot(batch_labels)
centroids = torch.matmul(
    batch_one_hot.transpose(-1, 1).type(batch_data.dtype),
    batch_data
) / batch_one_hot.sum(1)[..., None]
norms = torch.linalg.norm(batch_data[:, :, None] - centroids[:, None], dim=-1)
# Compute the rest of your loss
# ...
A couple things to watch out for:
You'll get a divide by zero for any batches that have a missing class. You can handle this by first computing the class sums (with matmul) and counts (summing the one-hot tensor along axis 1) separately. Then, mask the sums with count == 0 and divide the rest of them by their class counts.
If you have a large number of classes, this will cause memory problems because the one-hot tensor will be too big. In that case, the answer from @VF1 probably makes more sense.
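The divide-by-zero handling described in the first point could look roughly like this. The sketch uses NumPy as a stand-in for torch so that it runs without PyTorch. The sizes are arbitrary, and the labels are drawn so that the last class never occurs.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, dim, K = 2, 6, 3, 4
batch_data = rng.standard_normal((B, N, dim))
batch_labels = rng.integers(0, K - 1, size=(B, N))  # label K-1 never occurs

one_hot = np.eye(K)[batch_labels]                        # [B, N, K]
sums = np.einsum('bnk,bnd->bkd', one_hot, batch_data)    # class sums, [B, K, dim]
counts = one_hot.sum(axis=1)                             # class counts, [B, K]

# Divide only where a class is present; empty classes get a zero centroid
centroids = np.where(counts[..., None] > 0,
                     sums / np.maximum(counts, 1)[..., None],
                     0.0)
```

Using zero for empty classes is a design choice. A mask of counts > 0 can be kept alongside if the loss must skip them.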

Efficient way to implement simple filter with varying coefficients in Python/Numpy

I am looking for an efficient way to implement a simple filter with one coefficient that is time-varying and specified by a vector with the same length as the input signal.
The following is a simple implementation of the desired behavior:
def myfilter(signal, weights):
    output = np.empty_like(weights)
    val = signal[0]
    for i in range(len(signal)):
        val += weights[i]*(signal[i] - val)
        output[i] = val
    return output
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
output = myfilter(signal, weights)
Is there a way to do this more efficiently with numpy or scipy?

You can trade in the overhead of the loop for a couple of additional ops:
import numpy as np
def myfilter(signal, weights):
    output = np.empty_like(weights)
    val = signal[0]
    for i in range(len(signal)):
        val += weights[i]*(signal[i] - val)
        output[i] = val
    return output
def vectorised(signal, weights):
    wp = np.r_[1, np.multiply.accumulate(1 - weights[1:])]
    sw = weights * signal
    sw[0] = signal[0]
    sws = np.add.accumulate(sw / wp)
    return wp * sws
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
print(np.allclose(myfilter(signal, weights), vectorised(signal, weights)))
On my machine the vectorised version is several times faster. It uses a "closed form" solution of your recurrence equation.
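For reference, the closed form can be derived as follows. The symbols v_i (filter output), s_i (signal), w_i (weights) and p_i (running product, wp in the code) are notation introduced here, not taken from the answer.

```latex
% Recurrence implemented by myfilter, with v_0 = s_0:
v_i = (1 - w_i)\, v_{i-1} + w_i s_i , \qquad i \ge 1 .

% Define the running product (wp in the code):
p_i = \prod_{j=1}^{i} (1 - w_j), \qquad p_0 = 1 .

% Dividing the recurrence by p_i makes it a plain cumulative sum:
\frac{v_i}{p_i} = \frac{v_{i-1}}{p_{i-1}} + \frac{w_i s_i}{p_i}
\quad\Longrightarrow\quad
v_i = p_i \left( s_0 + \sum_{k=1}^{i} \frac{w_k s_k}{p_k} \right).
```

The sum in parentheses corresponds to np.add.accumulate(sw / wp) with sw[0] set to signal[0], and the outer factor to the final multiplication by wp.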
Edit: For very long signal / weight (100,000 samples, say) this method doesn't work because of overflow. In that regime you can still save a bit (more than 50% on my machine) using the following trick, which has the added bonus that you needn't solve the recurrence formula, only invert it.
from scipy import linalg
def solver(signal, weights):
    rw = 1 / weights[1:]
    v = np.r_[1, rw, 1-rw, 0]
    v.shape = 2, -1
    return linalg.solve_banded((1, 0), v, signal)
This trick uses the fact that your recurrence is formally similar to a Gauss elimination on a matrix with only one nonvanishing subdiagonal. It piggybacks on a library function that specialises in doing precisely that.
Actually, quite proud of this one.

How to real-time filter with scipy and lfilter?

Disclaimer: I am probably not as good at DSP as I should be and therefore have more issues than I should have getting this code to work.
I need to filter incoming signals as they happen. I tried to make this code to work, but I have not been able to so far.
Referencing scipy.signal.lfilter doc
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
from lib import fnlib
samples = 100
x = np.linspace(0, 7, samples)
y = [] # Unfiltered output
y_filt1 = [] # Real-time filtered
nyq = 0.5 * samples
f1_norm = 0.1 / nyq
f2_norm = 2 / nyq
b, a = scipy.signal.butter(2, [f1_norm, f2_norm], 'band', analog=False)
zi = scipy.signal.lfilter_zi(b,a)
zi = zi*(np.sin(0) + 0.1*np.sin(15*0))
This sets zi to zi*y[0] initially, which in this case is 0. I took it from the example code in the lfilter documentation, but I am not sure whether this is correct at all.
Then it comes to the point where I am not sure what to do with the first few samples.
The coefficient arrays a and b both have length 5 here (len(a) = 5).
As lfilter takes input values from now to n-4, do I pad it with zeroes, or do I need to wait until 5 samples have gone by and take them as a single block, then continuously sample each next step in the same way?
for i in range(0, len(a)-1): # Append 0 as initial values, wrong?
    y.append(0)
step = 0
for i in xrange(0, samples): #x:
    tmp = np.sin(x[i]) + 0.1*np.sin(15*x[i])
    y.append(tmp)
    # What to do with the inital filterings until len(y) == len(a) ?
    if (step> len(a)):
        y_filt, zi = scipy.signal.lfilter(b, a, y[-len(a):], axis=-1, zi=zi)
        y_filt1.append(y_filt[4])
print(len(y))
y = y[4:]
print(len(y))
y_filt2 = scipy.signal.lfilter(b, a, y) # Offline filtered
plt.plot(x, y, x, y_filt1, x, y_filt2)
plt.show()

I think I had the same problem, and found a solution on https://github.com/scipy/scipy/issues/5116:
import numpy as np
from scipy import signal
def filter_sbs():
    data = np.random.random(2000)
    b = signal.firwin(150, 0.004)
    z = signal.lfilter_zi(b, 1) * data[0]
    result = np.zeros(data.size)
    for i, x in enumerate(data):
        # Filter one sample and carry the filter state z into the next call
        y, z = signal.lfilter(b, 1, [x], zi=z)
        result[i] = y[0]
    return result
if __name__ == '__main__':
    result = filter_sbs()
The idea is to pass the filter state z in each subsequent call to lfilter. For the first few samples the filter may give strange results, but later (depending on the filter length) it starts to behave correctly.

The problem is not how you are buffering the input. The problem is that in the 'offline' version, the state of the filter is initialized using lfilter_zi, which computes the internal state of an LTI system so that the output is already in steady state when new samples arrive at the input. In the 'real-time' version, you skip this, so the filter's initial state is 0. You can either initialize both versions using lfilter_zi or initialize both to 0. Then it doesn't matter how many samples you filter at a time.
Note that if you initialize to 0, the filter will 'ring' for a certain amount of time before reaching a steady state. In the case of FIR filters, there is an analytic solution for determining this time. For many IIR filters, there is not.
The following is correct. For simplicity's sake I initialize to 0 and feed the input one sample at a time. However, any non-zero block size will produce equivalent output.
import numpy as np
from scipy import signal
def filter_sbs(data, b):
    z = np.zeros(b.size-1)
    result = np.zeros(data.size)
    for i, x in enumerate(data):
        y, z = signal.lfilter(b, 1, [x], zi=z)
        result[i] = y[0]
    return result
def filter(data, b):
    result = signal.lfilter(b,1,data)
    return result
if __name__ == '__main__':
    data = np.random.random(20000)
    b = signal.firwin(150, 0.004)
    result1 = filter_sbs(data, b)
    result2 = filter(data, b)
    print(result1 - result2)
Output:
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -5.55111512e-17
0.00000000e+00 1.66533454e-16]
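The claim that any block size gives equivalent output can be checked with an IIR filter as well. The following is a sketch under assumptions: the Butterworth band edges, the block size of 7 and the helper name filter_blocks are arbitrary choices for illustration, and both versions start from a zero state.

```python
import numpy as np
from scipy import signal

def filter_blocks(data, b, a, block_size):
    """Filter data block by block, carrying the filter state between calls."""
    z = np.zeros(max(len(a), len(b)) - 1)  # zero initial state
    out = []
    for start in range(0, len(data), block_size):
        y, z = signal.lfilter(b, a, data[start:start + block_size], zi=z)
        out.append(y)
    return np.concatenate(out)

rng = np.random.default_rng(0)
data = rng.random(1000)
b, a = signal.butter(2, [0.05, 0.2], 'band')

offline = signal.lfilter(b, a, data)            # whole signal at once
streamed = filter_blocks(data, b, a, block_size=7)
```

The last block is shorter than the others here (1000 is not a multiple of 7), which lfilter handles without special treatment.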

Calculating event / stimulus triggered averages efficiently in Python

I would like to calculate event / stimulus triggered averages in a computationally efficient way.
Assuming I have got a signal, e.g.
signal = [random.random() for i in xrange(0, 1000)]
with n_signal datapoints
n_signal = len(signal)
I know that this signal is sampled with a rate of
Fs = 25000 # Hz
In this case I know that the total time of the signal
T_sec = n_signal / float(Fs)
At specific times, certain events occur, e.g.
t_events = [0.01, 0.017, 0.018, 0.022, 0.034, 0.0345, 0.03456]
Now I would like to find the signal from a certain time before these events, e.g.
t_bef = 0.001
until a certain time after these events, e.g.
t_aft = 0.002
And once I have got all of these chunks of the signal, I would like to average these.
In the past I would have created the time vector of the signal
t_signal = numpy.linspace(0, T_sec, n_signal)
and looked for all of the indices for t_events in t_signal, e.g. using numpy.searchsorted.
Since I know the sampling rate of the signal, these can be done much quicker, like
indices = [int(i * Fs) for i in t_events]
This saves me the memory for t_signal and I do not have to go through the whole signal to find my indices.
Next, I would determine how many data samples t_bef and t_aft are corresponding to
nsamples_t_bef = int(t_bef * Fs)
nsamples_t_aft = int(t_aft * Fs)
and I would save the signal chunks in a list
signal_chunks = list()
for i in xrange(0, len(t_events)):
    signal_chunks.append(signal[indices[i] - nsamples_t_bef : indices[i] + nsamples_t_aft])
And finally I am averaging these
event_triggered_average = numpy.mean(signal_chunks, axis = 0)
If I am interested in the time vector, I am calculating it with
t_event_triggered_average = numpy.linspace(-t_signal[nsamples_t_bef], t_signal[nsamples_t_aft], nsamples_t_bef + nsamples_t_aft)
Now my questions: Is there a computationally more efficient way to do this? If I have got a signal with many data points and many events, this computation can take a while. Is a list the best data structure to save these chunks in? Do you know how to get the chunks of data quicker? Maybe using buffer?
Thanks in advance for your comments and advice.
Minimum working example
import numpy
import random
random.seed(0)
signal = [random.random() for i in xrange(0, 1000)]
# sampling rate
Fs = 25000 # Hz
# total time of the signal
n_signal = len(signal)
T_sec = n_signal / float(Fs)
# time of events of interest
t_events = [0.01, 0.017, 0.018, 0.022, 0.034, 0.0345, 0.03456]
# and their corresponding indices
indices = [int(i * Fs) for i in t_events]
# define the time window of interest around each event
t_bef = 0.001
t_aft = 0.002
# and the corresponding index offset
nsamples_t_bef = int(t_bef * Fs)
nsamples_t_aft = int(t_aft * Fs)
# vector of signal times
t_signal = numpy.linspace(0, T_sec, n_signal)
signal_chunks = list()
for i in xrange(0, len(t_events)):
    signal_chunks.append(signal[indices[i] - nsamples_t_bef : indices[i] + nsamples_t_aft])
# average signal value across chunks
event_triggered_average = numpy.mean(signal_chunks, axis = 0)
# not sure what's going on here
t_event_triggered_average = numpy.linspace(-t_signal[nsamples_t_bef],
                                           t_signal[nsamples_t_aft],
                                           nsamples_t_bef + nsamples_t_aft)

Since your signal is defined on a regular grid, you could do some arithmetic to find indices for all the samples that you require. Then you can construct the array with chunks using a single indexing operation.
import numpy as np
# Making some test data
n_signal = 1000
signal = np.random.rand(n_signal)
Fs = 25000 # Hz
t_events = np.array([0.01, 0.017, 0.018, 0.022, 0.034, 0.0345, 0.03456])
# Preferences
t_bef = 0.001
t_aft = 0.002
# The number of samples in a chunk
nsamples = int((t_bef+t_aft) * Fs)
# Create a vector from 0 up to nsamples
sample_idx = np.arange(nsamples)
# Calculate the index of the first sample for each chunk
# Require integers, because it will be used for indexing
start_idx = ((t_events - t_bef) * Fs).astype(int)
# Use broadcasting to create an array with indices
# Each row contains consecutive indices for each chunk
idx = start_idx[:, None] + sample_idx[None, :]
# Get all the chunks using fancy indexing
signal_chunks = signal[idx]
# Calculate the average like you did earlier
event_triggered_average = signal_chunks.mean(axis=0)
Note, the line with .astype(int) does not round to nearest integer, but rounds towards zero.
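If rounding to the nearest sample is preferred, or if some events lie too close to the edges of the signal, the same indexing idea can be wrapped as below. This is a sketch with assumed choices: np.rint for rounding, silently dropping events whose window does not fit, and a hypothetical function name. The extra event at 0.0399 s is added only to show the edge handling.

```python
import numpy as np

def event_triggered_average(signal, t_events, Fs, t_bef, t_aft):
    """Average fixed-length windows around events, skipping windows that do not fit."""
    nsamples = int(round((t_bef + t_aft) * Fs))
    # Round to the nearest sample instead of truncating towards zero
    start_idx = np.rint((np.asarray(t_events) - t_bef) * Fs).astype(int)
    # Keep only events whose whole window lies inside the signal
    valid = (start_idx >= 0) & (start_idx + nsamples <= len(signal))
    idx = start_idx[valid, None] + np.arange(nsamples)[None, :]
    chunks = signal[idx]
    t_rel = np.arange(nsamples) / Fs - t_bef  # time relative to the event
    return chunks.mean(axis=0), t_rel, chunks

signal = np.arange(1000, dtype=float)  # ramp, so each value equals its index
t_events = [0.01, 0.017, 0.018, 0.022, 0.034, 0.0345, 0.03456, 0.0399]
avg, t_rel, chunks = event_triggered_average(signal, t_events, 25000, 0.001, 0.002)
```

Because the test signal is a ramp, each row of chunks directly shows which sample indices were picked for that event.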
