Let's say that there's a "master" array of times with these values:
master = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0]
I want to find the most "compatible" array among several candidates:
candidates = [
[0.01, 0.48, 1.03, 1.17, 1.5],
[1.25, 1.4, 1.5, 1.9, 2.0],
...
]
In this case I consider the first candidate the most compatible, because after adding 1 to each value, 4 of its values are very close to values that exist in master (the 2nd candidate only has 3 values that match master). Order matters, though we can assume the arrays are already sorted with no duplicate values, since they represent times.
A physical example could be that master is an array of beat onsets for a clean recording of an audio track, while the candidates are arrays of beat onsets for various audio recordings that may or may not be of the same audio track. I'd like to find the candidate that is most likely to be a recording of (at least a portion of) the same audio track.
I'm not sure of an algorithm to choose among these candidates. I've done some searching that led me to topics like cross-correlation, string distance, and fuzzy matching, but I'd like to know if I'm missing the forest for the trees here. I'm most familiar with data analysis in NumPy and Pandas, so I will tag the question as such.
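For concreteness, here is a naive brute-force sketch of what I mean by "compatible"; the 0.05 tolerance and the choice to anchor the candidate's first time to each master time are just illustrative assumptions:
import numpy as np

def compatibility(master, candidate, tol=0.05):
    best = 0
    for m in master:
        shifted = candidate + (m - candidate[0])  # align the candidate's first time with m
        # distance from each shifted time to its nearest master time
        dists = np.abs(shifted[:, None] - master[None, :]).min(axis=1)
        best = max(best, int((dists <= tol).sum()))
    return best

master = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0])
candidates = [np.array([0.01, 0.48, 1.03, 1.17, 1.5]),
              np.array([1.25, 1.4, 1.5, 1.9, 2.0])]
print([compatibility(master, c) for c in candidates])  # -> [4, 3]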
One way would be to create those sliding 1D arrays as a stacked 2D array with broadcasting and then get the distances against that 2D array with SciPy's cdist. Finally, we take the minimum distance along each row and choose the row with the minimum of those distances. Thus, we would have an implementation like so -
import numpy as np
from scipy.spatial.distance import cdist

Na = a.shape[1]   # window length = number of columns of `a`
Nb = b.size
# All sliding windows of `b` of length Na, stacked as rows of a 2D array
b2D = b[np.arange(Nb-Na+1)[:,None] + np.arange(Na)]
# For each row of `a`, the distance to its closest window; pick the best row
closestID = cdist(a,b2D).min(1).argmin()
Sample run -
In [170]: a = np.random.randint(0,99,(400,500))
In [171]: b = np.random.randint(0,99,(700))
In [172]: b[100:100+a.shape[1]] = a[77,:] + np.random.randn(a.shape[1])
# Make b, starting at its 100th element, the same as the 77th row of 'a' plus noise
In [173]: Na = a.shape[1]
...: Nb = b.size
...: b2D = b[np.arange(Nb-Na+1)[:,None] + np.arange(Na)]
...: closestID = cdist(a,b2D).min(1).argmin()
...:
In [174]: closestID
Out[174]: 77
Note: To me, the default metric of cdist, which is the Euclidean distance, made sense for such a problem. There are numerous other metrics listed in the docs that measure the difference between inputs and could replace the default one.
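For example, switching to the city-block (L1) metric only means passing it explicitly; a quick sketch reusing a and b2D from above:
# Same sliding-window setup as before, just with the city-block (L1) metric
closestID_l1 = cdist(a, b2D, metric='cityblock').min(1).argmin()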
Related
I want to trigger a call whenever the two graphs intersect
Here's the data: https://jpn698dhc9.execute-api.us-east-1.amazonaws.com/prod/v2/historical?symbol=%22degods%22
If you have two time series as equally-indexed pd.Series objects, you can simply subtract them, take the signs of the differences (which tell you which series is higher), take the differences of these signs over time (which tell you whether the order of the two has changed), and look at wherever those sign differences are non-zero (which tells you there was a change in the order, i.e. an intersection of the graphs). The following code computes the indices of all timesteps between which and their respective predecessors there were intersections:
import numpy as np
import pandas as pd
s1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
s2 = pd.Series([0.0, 1.3, 2.6, 3.9, 5.2, 6.5])
diffs = np.sign(s1 - s2).diff()[1:]
indices = diffs[diffs.ne(0)].index.values
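For these sample series, s1 - s2 changes sign between index 3 and index 4, so:
print(indices)  # -> [4]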
Is there a simple Python 3 command that replicates MATLAB's interp1 command over multiple columns?
data_1 contains two parameters (1 per column) that go with the time_1 time vector (data_1 is a 5 by 2 array that isn't actually used in this example so can be ignored)
data_2 contains two parameters (1 per column) that go with the time_2 time vector
import numpy as np
data_2 = np.array([ [ 0.43, -0.54], [ 0.32, -0.83], [ 0.26, -0.94], [ 0.51, -0.69], [ 0.63, -0.74] ])
time_1 = np.array([ 399.87, 399.89, 399.91, 399.93, 399.95 ])
time_2 = np.array([ 399.86, 399.88, 399.90, 399.92, 399.94 ])
I'd like to interpolate the data_2 2D array into the time_1 time vector so both data sets will have the same time vector.
Desired output (which is just np.interp applied to each of the two data_2 columns onto the time_1 time vector and merged back into an array) is:
data_2_i = np.array([[ 0.375, -0.685], [ 0.290, -0.885], [ 0.385, -0.815], [ 0.570, -0.715], [ 0.630, -0.740]])
Actual arrays will contain approx 20 columns (parameters) and thousands of rows (longer time range).
I know you can just loop over each column with np.interp, but I was hoping there was a more compact and faster Python 3 (NumPy, SciPy, pandas, etc.) method that I haven't been able to track down yet. I'm still pretty new to Python (more familiar with MATLAB).
In MATLAB, you can just use interp1 on the entire multi-column array to get the multi-column result (although the edge cases are handled a bit differently - NaNs vs. the last entry in this example - I'm not worried about the edge-case differences here).
This looks to work (just made a quick script myself):
import numpy as np
def interp_multi(x_i, x, y):
    # Interpolate each column of y (sampled at x) onto the new grid x_i
    ncol = y.shape[1]
    y_i = np.zeros((len(x_i), ncol))
    for i in range(ncol):
        y_i[:, i] = np.interp(x_i, x, y[:, i])
    return y_i
data_2_i = interp_multi(time_1, time_2, data_2)
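For what it's worth, SciPy's interp1d can interpolate all columns in one call via its axis argument, which may be the more compact form you were after. A sketch reusing time_1, time_2 and data_2 from above (note the edge handling differs from np.interp: out-of-range points become NaN instead of being clamped to the last value):
from scipy.interpolate import interp1d

# Linear interpolator over axis 0 (rows = time samples), so all parameter
# columns are interpolated at once; queries outside time_2's range give NaN
f = interp1d(time_2, data_2, axis=0, bounds_error=False)
data_2_i = f(time_1)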
I have a large 1-dimensional numpy array. I want to compute a sort of "group average". More specifically:
Let my array be [1,2,3,4,5,6,7,8,9,10] and let my group_size be 3. Hence, I will average the first three elements, the 4th to 6th elements, the 7th to 9th elements, and average the remaining elements (only 1 in this case) to get [2, 5, 8, 10]. Needless to say, I need a vectorized implementation.
Finally, my purpose is to reduce the number of points in a noisy graph in order to smooth out a general pattern from a lot of oscillation. Is there a correct way to do this? I would like the answer to both questions, in case they have different answers. Thanks!
A good smoothing approach is kernel convolution: it multiplies a small array (the kernel) against a moving window over your larger array.
Say you choose a standard smoothing kernel of 1/3 * [1,1,1] (a kernel should have an odd length and be normalized) and apply it to [1,2,2,7,3,4,9,4,5,6]:
To begin with, the centre of the kernel sits on the first 2. Each value is averaged with its neighbours, and the window then moves on. The result is this:
[1.67, 3.67, 4.0, 4.67, 5.33, 5.67, 6.0, 5.0]
Note that the result is missing the first and last elements, since no full window can be centred on them.
You can do this with numpy.convolve, for example:
import numpy as np
a = np.array([1, 2, 2, 7, 3, 4, 9, 4, 5, 6])   # 1D input
k = np.array([1, 1, 1]) / 3                    # normalized 3-point kernel
smoothed = np.convolve(a, k, 'valid')
The effect of this is that your central value is smoothed with the values from its neighbours. You can change the convolution kernel by increasing its size, for example [1,1,1,1,1]/5, or by giving it a Gaussian shape, which weights the central members more than the outer ones. Read the Wikipedia article.
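As a rough illustration of the Gaussian variant mentioned above (the 5-point width and unit sigma are arbitrary choices):
# Hypothetical 5-point Gaussian kernel, normalized so the weights sum to 1;
# the centre sample gets the most weight
x = np.arange(-2, 3)
gauss = np.exp(-x**2 / 2.0)
gauss /= gauss.sum()
smoothed_gauss = np.convolve(a, gauss, 'valid')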
EDIT
This works to get a block average as the question asks for:
import numpy as np
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
size = 3

new_a = []
i = 0
while i < len(a):
    # Average each consecutive block of `size` elements (the last block may be shorter)
    new_a.append(np.mean(a[i:i+size]))
    i += size

print(new_a)
[2.0, 5.0, 8.0, 10.0]
For the group averaging, listed below are two approaches.
Approach #1 : Bin-based summing and averaging
In [77]: a
Out[77]: array([74, 48, 92, 40, 35, 38, 20, 69, 82, 37])
In [78]: N = 3 # Window size
In [79]: np.arange(a.size)//N # IDs for binning with bincount
Out[79]: array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3])
In [84]: np.bincount(np.arange(a.size)//N,a)/np.bincount(np.arange(a.size)//N)
Out[84]: array([ 71.33333333, 37.66666667, 57. , 37. ])
Approach #2 : Slice and reshape based averaging
In [134]: limit0 = N*(a.size//N)
In [135]: out = np.zeros((a.size+N-1)//N)
In [136]: out[:limit0//N] = a[:limit0].reshape(-1,N).mean(1)
In [137]: out[limit0//N:] = a[limit0:].mean()
In [138]: out
Out[138]: array([ 71.33333333, 37.66666667, 57. , 37. ])
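For convenience, Approach #1 can be wrapped into a small self-contained function (just a repackaging of the code shown above):
import numpy as np

def group_mean(a, N):
    # Bin IDs 0,0,0,1,1,1,...; the weighted bincount sums each bin and the
    # plain bincount gives each bin's size
    ids = np.arange(a.size) // N
    return np.bincount(ids, a) / np.bincount(ids)

group_mean(np.array([1., 2, 3, 4, 5, 6, 7, 8, 9, 10]), 3)  # -> array([ 2.,  5.,  8., 10.])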
To smooth the data, I might suggest using MATLAB's smooth function ported to NumPy, which is essentially convolution-based averaging and should be similar to @Roman's post.
I really, really wish numpy.ma.MaskedArray.resize worked; it would allow a one-step answer to this question. As it is:
import numpy as np

def groupAverage(arr, idx):
    # idx is the group size; rem counts the leftover elements in a partial last group
    rem = arr.size % idx
    if rem == 0:
        return arr.reshape(-1, idx).mean(axis=1)
    else:
        # Pad with zeros up to a multiple of idx, average each row, then rescale
        # the last average, which only contains rem real values
        padded = np.append(arr, np.zeros(idx - rem))
        averages = padded.reshape(-1, idx).mean(axis=1)
        averages[-1] *= idx / rem
        return averages
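A quick check against the example from the question:
groupAverage(np.array([1., 2, 3, 4, 5, 6, 7, 8, 9, 10]), 3)
# -> array([ 2.,  5.,  8., 10.])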
Let's say I have a standard 2D numpy array, call it my2darray, with values. In this array there are two major sections. Let's say that for each column there is a specific row which separates "scenario1" and "scenario2". How can I create 2 masked arrays that represent the top section of my2darray and the bottom of my2darray? For example, I am interested in calculating the mean of the top half and the mean of the bottom half. One idea is to have a mask of the same shape as my2darray, but that seems like a waste of memory. Is there a better idea? Let's say I have a vector whose length is equal to the number of rows in my2darray (in this case 6), i.e. I have
myvector=np.array([9, 15, 5,7,11,11])
I am using python 2.6 with numpy 1.5.0
Using NumPy's broadcasted comparison, we can create such a 2D mask in a vectorized manner. The rest of the work is sum-reduction along the first axis, for which we can take help from np.einsum. Thus, we would have an implementation like so -
N = my2darray.shape[0]
# mask[i,j] is True for the rows at or below the split index of column j
mask = myvector <= np.arange(N)[:,None]
# Column-wise sums of the upper (unmasked) and lower (masked) parts, divided
# by the number of elements contributing to each
uout = np.true_divide(np.einsum('ij,ij->j',my2darray,~mask),myvector)
lout = np.true_divide(np.einsum('ij,ij->j',my2darray,mask),N-myvector)
Sample run to verify results -
In [184]: N = my2darray.shape[0]
...: mask = myvector <= np.arange(N)[:,None]
...: uout = np.true_divide(np.einsum('ij,ij->j',my2darray,~mask),myvector)
...: lout = np.true_divide(np.einsum('ij,ij->j',my2darray,mask),N-myvector)
...:
In [185]: uout
Out[185]: array([ 6. , 4.6, 4. , 0. ])
In [186]: [my2darray[:item,i].mean() for i,item in enumerate(myvector)]
Out[186]: [6.0, 4.5999999999999996, 4.0, 0.0] # Loopy version results
In [187]: lout
Out[187]: array([ 5.2 , 4. , 2.66666667, 2. ])
In [188]: [my2darray[item:,i].mean() for i,item in enumerate(myvector)]
Out[188]: [5.2000000000000002, 4.0, 2.6666666666666665, 2.0] # Loopy version
Another, potentially faster, way would be to calculate the summations for the upper mask, store them, and subtract them from the sum of the entire 2D input array along the first axis; this can then be used to compute the lower-part averages. Thus, after we store N and calculate mask, we would have -
usums = np.einsum('ij,ij->j',my2darray,~mask)
uout = np.true_divide(usums,myvector)
lout = np.true_divide(my2darray.sum(0) - usums,N-myvector)
I have a 1-dimensional numpy array scores of scores associated with some objects. These objects belong to some disjoint groups, and all the scores of the items in the first group are first, followed by the scores of the items in the second group, etc.
I'd like to create a 2-dimensional array where each row corresponds to a group, and each entry is the score of one of its items. If all the groups are of the same size I can just do:
scores.reshape((numGroups, groupSize))
Unfortunately, my groups may be of varying size. I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length.
To make this concrete, suppose I have set A with 3 items, set B with 2 items, and set C with 4 items.
scores = numpy.array([f(a[0]), f(a[1]), f(a[2]), f(b[0]), f(b[1]),
f(c[0]), f(c[1]), f(c[2]), f(c[3])])
rowStarts = numpy.array([0, 3, 5])
paddingValue = -1.0
scoresByGroup = groupIntoRows(scores, rowStarts, paddingValue)
The desired value of scoresByGroup would be:
[[f(a[0]), f(a[1]), f(a[2]), -1.0],
[f(b[0]), f(b[1]), -1.0, -1.0],
[f(c[0]), f(c[1]), f(c[2]), f(c[3])]]
Is there some numpy function or composition of functions I can use to create groupIntoRows?
Background:
This operation will be used in calculating the loss for a minibatch for a gradient descent algorithm in Theano, so that's why I need to keep it as a composition of numpy functions if possible, rather than falling back on native Python.
It's fine to assume there is some known maximum row size.
The original objects being scored are vectors and the scoring function is a matrix multiplication, which is why we flatten things out in the first place. It would be possible to pad everything to the maximum item set size before doing the matrix multiplication, but the biggest set is over ten times bigger than the average set size, so this is undesirable for speed reasons.
Try this:
import numpy as np

scores = np.random.rand(9)
row_starts = np.array([0, 3, 5])
# Group boundaries: each group ends where the next begins, the last at len(scores)
row_ends = np.concatenate((row_starts, [len(scores)]))
lens = np.diff(row_ends)
# How many pad values each group needs to reach the longest group's length
pad_len = np.max(lens) - lens
# Positions (just after each group's end) at which to insert the padding
where_to_pad = np.repeat(row_ends[1:], pad_len)
padding_value = -1.0
padded_scores = np.insert(scores, where_to_pad,
                          padding_value).reshape(-1, np.max(lens))
>>> padded_scores
array([[ 0.05878244, 0.40804443, 0.35640463, -1. ],
[ 0.39365072, 0.85313545, -1. , -1. ],
[ 0.133687 , 0.73651147, 0.98531828, 0.78940163]])
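If you want it in exactly the form the question asked for, the same steps can be wrapped into a groupIntoRows function (just a repackaging of the code above, with the argument names from the question):
def groupIntoRows(scores, row_starts, padding_value):
    # Group boundaries: each group ends where the next begins, the last at len(scores)
    ends = np.concatenate((row_starts, [len(scores)]))
    lens = np.diff(ends)
    # Pad each group up to the longest group's length, inserting just after its end
    pad_len = np.max(lens) - lens
    where_to_pad = np.repeat(ends[1:], pad_len)
    return np.insert(scores, where_to_pad, padding_value).reshape(-1, np.max(lens))

scoresByGroup = groupIntoRows(scores, row_starts, -1.0)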