I have a large one-dimensional numpy array. I want to find a sort of "group average". More specifically,
Let my array be [1,2,3,4,5,6,7,8,9,10] and let my group_size be 3. Hence, I will average the first three elements, the 4th to 6th elements, the 7th to 9th elements, and then the remaining elements (only one in this case) to get [2, 5, 8, 10]. Needless to say, I need a vectorized implementation.
Finally, my purpose is to reduce the number of points in a noisy graph in order to smooth out a general pattern that has a lot of oscillation. Is there a correct way to do this? I would like the answer to both questions, in case they have different answers. Thanks!
A good smoothing approach is kernel convolution: it slides a small array (the kernel) over your larger array in a moving window, multiplying and summing at each position.
Say you choose a standard smoothing kernel of 1/3 * [1,1,1] (a kernel should have an odd length and be normalized) and apply it to the array [1,2,2,7,3,4,9,4,5,6]:
The centre of the kernel starts on the first 2. It averages that element with its two neighbours, then moves on. The result is this:
[1.67, 3.67, 4.0, 4.67, 5.33, 5.67, 6.0, 5.0]
Note that the result is missing the first and last elements, since a full window cannot be centred on them.
You can do this with numpy.convolve, for example:
import numpy as np

a = np.array([1,2,2,7,3,4,9,4,5,6])
k = np.array([1,1,1])/3
smoothed = np.convolve(a, k, 'valid')
The effect of this is that your central value is smoothed with the values from its neighbours. You can change the convolution kernel by increasing its size, for example to 5 with [1,1,1,1,1]/5, or by giving it a Gaussian shape, which weights the central members more heavily than the outer ones. Read the Wikipedia article on kernel smoothing.
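For instance, here is a minimal sketch of both options applied with np.convolve (the kernel values are illustrative choices, not from the original answer):

import numpy as np

a = np.array([1, 2, 2, 7, 3, 4, 9, 4, 5, 6])

# Wider flat kernel: each output value is the average of 5 neighbours.
k5 = np.ones(5) / 5
smoothed_flat = np.convolve(a, k5, 'valid')

# Gaussian-like kernel: the central element is weighted most heavily.
kg = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
kg /= kg.sum()                        # normalize so the weights sum to 1
smoothed_gauss = np.convolve(a, kg, 'valid')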
EDIT
This works to get a block average as the question asks for:
import numpy as np

a = [1,2,3,4,5,6,7,8,9,10]
size = 3

new_a = []
i = 0
while i < len(a):
    val = np.mean(a[i:i+size])
    new_a.append(val)
    i += size

print(new_a)
[2.0, 5.0, 8.0, 10.0]
To solve for the group averaging, two approaches are listed below.
Approach #1 : Bin-based summing and averaging
In [77]: a
Out[77]: array([74, 48, 92, 40, 35, 38, 20, 69, 82, 37])
In [78]: N = 3 # Window size
In [79]: np.arange(a.size)//N # IDs for binning with bincount
Out[79]: array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3])
In [84]: np.bincount(np.arange(a.size)//N,a)/np.bincount(np.arange(a.size)//N)
Out[84]: array([ 71.33333333, 37.66666667, 57. , 37. ])
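Wrapped up as a small reusable helper (a sketch of the same bincount idea; the name group_average is mine), checked against the array from the question:

import numpy as np

def group_average(a, N):
    # Average every N consecutive elements; a shorter trailing group is averaged too.
    a = np.asarray(a, dtype=float)
    ids = np.arange(a.size) // N          # bin ID for each element
    return np.bincount(ids, a) / np.bincount(ids)

print(group_average([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))
# [ 2.  5.  8. 10.]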
Approach #2 : Slice and reshape based averaging
In [134]: limit0 = N*(a.size//N)
In [135]: out = np.zeros((a.size+N-1)//N)
In [136]: out[:limit0//N] = a[:limit0].reshape(-1,N).mean(1)
In [137]: out[limit0//N:] = a[limit0:].mean()
In [138]: out
Out[138]: array([ 71.33333333, 37.66666667, 57. , 37. ])
To smooth the data, I might suggest using MATLAB's smooth function ported to NumPy, which is essentially convolved averaging and should be similar to @Roman's post.
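As a rough sketch of what such a port boils down to (a plain centred moving average via np.convolve, not the actual ported code; the helper name smooth is mine and the edge handling differs from MATLAB's):

import numpy as np

def smooth(y, span=5):
    # Centred moving average; mode='same' zero-pads at the edges.
    y = np.asarray(y, dtype=float)
    k = np.ones(span) / span
    return np.convolve(y, k, mode='same')

noisy = np.sin(np.linspace(0, 4 * np.pi, 50)) + 0.3 * np.random.randn(50)
smoothed = smooth(noisy, span=5)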
Really, really wish numpy.ma.MaskedArray.resize worked. It would allow a one-step answer to this question.
As it is:
import numpy as np

def groupAverage(arr, idx):
    rem = arr.size % idx
    if rem == 0:
        return arr.reshape(-1, idx).mean(axis=1)
    else:
        newsize = arr.size // idx + 1
        padded = np.zeros(newsize * idx)
        padded[:arr.size] = arr                 # zero-pad the tail to a full group
        averages = padded.reshape(newsize, idx).mean(axis=1)
        averages[-1] *= idx / rem               # undo the effect of the zero padding
        return averages
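A quick check with the array from the question, using the fixed-up version above:

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(groupAverage(a, 3))
# [ 2.  5.  8. 10.]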
Let's suppose I have two arrays that represent pixels in pictures.
I want to build an array of tensordot products of pixels of a smaller picture with a bigger picture as it "scans" the latter. By "scanning" I mean iteration over rows and columns while creating overlays with the original picture.
For instance, a 2x2 picture can be overlaid on top of 3x3 in four different ways, so I want to produce a four-element array that contains tensordot products of matching pixels.
Tensordot is calculated by multiplying a[i,j] with b[i,j] element-wise and summing the terms.
Please examine this code:
import numpy as np

a = np.array([[0,1,2],
              [3,4,5],
              [6,7,8]])
b = np.array([[0,1],
              [2,3]])

shape_diff = (a.shape[0] - b.shape[0] + 1,
              a.shape[1] - b.shape[1] + 1)

def compute_pixel(x, y):
    sub_matrix = a[x : x + b.shape[0],
                   y : y + b.shape[1]]
    return np.tensordot(sub_matrix, b, axes=2)

def process():
    arr = np.zeros(shape_diff)
    for i in range(shape_diff[0]):
        for j in range(shape_diff[1]):
            arr[i,j] = compute_pixel(i,j)
    return arr

print(process())
Computing a single pixel is very easy: all I need is the starting coordinates within a. From there I take a sub-matrix matching the size of b and compute the tensordot product.
However, because I need to do this all over again for each x and y location as I iterate over rows and columns, I've had to use a loop, which is of course suboptimal.
In the next piece of code I have tried to utilize a handy feature of tensordot, which also accepts higher-dimensional tensors as arguments. In other words, I can feed it an array of sub-arrays for the different windows of a, while keeping b the same.
However, to create such an array of windows, I couldn't think of anything better than another loop, which sounds rather silly in this case.
def try_vector():
    tensor = np.zeros(shape_diff + b.shape)
    for i in range(shape_diff[0]):
        for j in range(shape_diff[1]):
            tensor[i,j] = a[i : i + b.shape[0],
                            j : j + b.shape[1]]
    return np.tensordot(tensor, b, axes=2)

print(try_vector())
Note: the tensor's shape is the sum (concatenation) of the two tuples, which in this case gives (2, 2, 2, 2).
Yet even if I produced such an array, it would be prohibitively large to be of any practical use; doing this for a 1000x1000 picture could probably consume all the available memory.
So, is there any other way to avoid loops in this problem?
In [111]: process()
Out[111]:
array([[19., 25.],
[37., 43.]])
tensordot with axes=2 is the same as an element-wise multiply followed by a sum:
In [116]: np.tensordot(a[0:2,0:2],b, axes=2)
Out[116]: array(19)
In [126]: (a[0:2,0:2]*b).sum()
Out[126]: 19
A lower-memory way of generating your tensor is:
In [121]: np.lib.stride_tricks.sliding_window_view(a,(2,2))
Out[121]:
array([[[[0, 1],
[3, 4]],
[[1, 2],
[4, 5]]],
[[[3, 4],
[6, 7]],
[[4, 5],
[7, 8]]]])
We can do a broadcasted multiply, and sum on the last 2 axes:
In [129]: (Out[121]*b).sum((2,3))
Out[129]:
array([[19, 25],
[37, 43]])
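Putting it together outside the IPython session, a sketch of the whole vectorized computation (sliding_window_view requires NumPy 1.20+; the view itself costs no extra memory, only the broadcasted product does):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])
b = np.array([[0, 1],
              [2, 3]])

windows = sliding_window_view(a, b.shape)    # shape (2, 2, 2, 2), a view, no copy
result = (windows * b).sum(axis=(2, 3))      # broadcasted multiply, sum over each window
print(result)
# [[19 25]
#  [37 43]]

np.einsum('ijkl,kl->ij', windows, b) gives the same result without materializing the intermediate product, which matters for large pictures.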
Basically, I want to reimplement this video.
Given a corpus of documents, I want to find the terms that are most similar to each other.
I was able to generate a cooccurrence matrix using this SO thread and used the video to generate an association matrix. Next, I would like to generate a second-order cooccurrence matrix.
Problem statement: Consider a matrix where the rows correspond to terms and the entries in each row are the IDs of the top k terms similar to that term. Say k = 4 and we have n terms in our dictionary; then the matrix M has n rows and 4 columns.
HAVE:
M = [[18,34,54,65], # Term IDs similar to Term t_0
[18,12,54,65], # Term IDs similar to Term t_1
...
[21,43,55,78]] # Term IDs similar to Term t_n.
So, M contains, for each term ID, the most similar term IDs. Now, I would like to check how many of those similar terms match. In the example of M above, it seems that term t_0 and term t_1 are quite similar, because three out of four terms match, whereas terms t_0 and t_n are not similar, because no terms match. Let's write M as a series of lists.
M = [list_0, # Term IDs similar to Term t_0
list_1, # Term IDs similar to Term t_1
...
list_n] # Term IDs similar to Term t_n.
WANT:
C = [[f(list_0, list_0), f(list_0, list_1), ..., f(list_0, list_n)],
[f(list_1, list_0), f(list_1, list_1), ..., f(list_1, list_n)],
...
[f(list_n, list_0), f(list_n, list_1), ..., f(list_n, list_n)]]
I'd like to find the matrix C, that has as its elements, a function f applied to the lists of M. f(a,b) measures the degree of similarity between two lists a and b. Going, with the example above, the degree of similarity between t_0 and t_1 should be high, whereas the degree of similarity of t_0 and t_n should be low.
My questions:
What is a good choice for comparing the ordering of two lists? That is, what is a good choice for function f?
Is there a transformation already available that takes as an input a matrix like M and produces a matrix like C? Preferably a python package?
Thank you, r0f1
In fact, cosine similarity might not be too bad in this case. The problem is that you don't want to use the index vectors (i.e. [18,34,54,65] and so on in your case); instead you want vectors of length n that are zero everywhere except at the positions given in your index vector. Luckily, you don't have to create those vectors explicitly: you can just count how many indices the two index vectors have in common:
def f(u, v):
    return len(set(u).intersection(set(v)))
Here, I omitted a constant normalization factor k. There are some more elaborate things that one could do (for example the TF-IDF kernel), but I would stay with this for the start.
In order to run this efficiently using numpy, you would want to do two things:
Convert f to a ufunc, i.e. a numpy vectorized function. You can do that by uf = np.frompyfunc(f, 2, 1) (assuming that you did import numpy as np at some point).
Store M as a 1d array of lists (basically what you show in your second code listing). That's a little more tricky, because numpy tries to be smart here, but you want something else. Here is how to do that:
n = len(M)
Marray = np.empty(n, dtype='O')  # dtype='O' allows you to have elements of type list
for i in range(n):
    Marray[i] = M[i]
Now, Marray contains essentially what you described in your second code listing. You can then use the new ufunc's outer method to get your similarity matrix. Here is how all of that would work together for your M from the example (assuming n=3):
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]
n = len(M)  # i.e. 3
uf = np.frompyfunc(f, 2, 1)
Marray = np.empty(n, dtype='O')
for i in range(n):
    Marray[i] = M[i]
similarities = uf.outer(Marray, Marray).astype('d')  # convert to float instead of object dtype
print(similarities)
# [[4. 3. 0.]
#  [3. 4. 0.]
#  [0. 0. 4.]]
I hope that answers your questions.
You asked two questions, one somewhat open-ended (the first one) and one with a definitive answer, so I will start with the second one:
Is there a transformation already available that takes as an input a
matrix like M and produces a matrix like C? Preferably, a python
package?
The answer is yes: the scipy.spatial.distance module contains a function (pdist) that takes a matrix like M and produces a matrix like C. The following example shows how to use it:
import numpy as np
from scipy.spatial.distance import pdist, squareform
# initial data
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]
# convert to numpy array
arr = np.array(M)
result = squareform(pdist(arr, metric='euclidean'))
print(result)
Output
[[ 0. 22. 16.1245155 ]
[22. 0. 33.76388603]
[16.1245155 33.76388603 0. ]]
As seen in the example above, pdist takes the M matrix and generates a C matrix. Note that the output of pdist is a condensed distance matrix, so you need to convert it to square form using squareform. Now onto the first question:
What is a good choice for comparing the ordering of two lists? That
is, what is a good choice for function f?
Given that order does matter in your particular case, I suggest you look at rank correlation coefficients such as Kendall's tau or Spearman's rho; both are provided in scipy.stats, along with a whole bunch of other coefficients. Usage example:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau, spearmanr
# distance function
kendall = lambda x, y : kendalltau(x, y)[0]
spearman = lambda x, y : spearmanr(x, y)[0]
# initial data
M = [[18, 34, 54, 65],
[18, 12, 54, 65],
[21, 43, 55, 78]]
# convert to numpy array
arr = np.array(M)
# compute kendall C and convert to square form
kendall_result = 1 - squareform(pdist(arr, kendall)) # subtract 1 because you want a similarity
print(kendall_result)
print()
# compute spearman C and convert to square form
spearman_result = 1 - squareform(pdist(arr, spearman)) # subtract 1 because you want a similarity
print(spearman_result)
print()
Output
[[1. 0.33333333 0. ]
[0.33333333 1. 0.33333333]
[0. 0.33333333 1. ]]
[[1. 0.2 0. ]
[0.2 1. 0.2]
[0. 0.2 1. ]]
If those do not fit your needs you can take a look at the Hamming distance, for example:
import numpy as np
from scipy.spatial.distance import pdist, squareform
# initial data
M = [[18, 34, 54, 65],
[18, 12, 54, 65],
[21, 43, 55, 78]]
# convert to numpy array
arr = np.array(M)
# compute match_rank C and convert to square form
result = 1 - squareform(pdist(arr, 'hamming'))
print(result)
Output
[[1. 0.75 0. ]
[0.75 1. 0. ]
[0. 0. 1. ]]
In the end the choice of the similarity function will depend on your final application, so you will need to try out different functions and see the ones that fit your needs. Both scipy.spatial.distance and scipy.stats provide a plethora of distance and coefficient functions you can try out.
Further
The following paper contains a section on list similarity
I would suggest cosine similarity, as every list is a vector.
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([list0], [list1])  # sklearn expects 2D inputs, hence the extra brackets
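For the full pairwise matrix, cosine_similarity also accepts a single 2D array and compares every row against every other row; a minimal sketch with the M from the question:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

M = np.array([[18, 34, 54, 65],
              [18, 12, 54, 65],
              [21, 43, 55, 78]])

print(cosine_similarity(M))  # n x n matrix of pairwise cosine similarities of the rows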
Let's say that there's a "master" array of times with these values:
master = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0]
I want to find the most "compatible" array among several candidates:
candidates = [
[0.01, 0.48, 1.03, 1.17, 1.5],
[1.25, 1.4, 1.5, 1.9, 2.0],
...
]
In this case I consider the first candidate most compatible because, after adding 1 to each value, 4 of the values are very close to values that exist in master (the 2nd candidate only has 3 values that match master), and order matters (though we can say the arrays are already sorted with no duplicate values, since they represent times).
A physical example could be that master is an array of beat onsets for a clean recording of an audio track, while the candidates are arrays of beat onsets for various audio recordings that may or may not be of the same audio track. I'd like to find the candidate that is most likely to be a recording of (at least a portion of) the same audio track.
I'm not sure of an algorithm to choose among these candidates. I've done some searching that led me to topics like cross-correlation, string distance, and fuzzy matching, but I'd like to know if I'm missing the forest for the trees here. I'm most familiar with data analysis in NumPy and Pandas, so I will tag the question as such.
One way would be to create those sliding 1D arrays as a stacked 2D array with broadcasting and then get the distances against the 2D array with Scipy's cdist. Finally, we get the minimum distance along each row and choose the row with minimum of such distances. Thus, we would have an implementation like so -
from scipy.spatial.distance import cdist
Na = a.shape[1]
Nb = b.size
b2D = b[np.arange(Nb-Na+1)[:,None] + np.arange(Na)]
closestID = cdist(a,b2D).min(1).argmin()
Sample run -
In [170]: a = np.random.randint(0,99,(400,500))
In [171]: b = np.random.randint(0,99,(700))
In [172]: b[100:100+a.shape[1]] = a[77,:] + np.random.randn(a.shape[1])
# Make b starting at 100th col same as 77th row from 'a' with added noise
In [173]: Na = a.shape[1]
...: Nb = b.size
...: b2D = b[np.arange(Nb-Na+1)[:,None] + np.arange(Na)]
...: closestID = cdist(a,b2D).min(1).argmin()
...:
In [174]: closestID
Out[174]: 77
Note: To me it looked like using the default option of cdist, the Euclidean distance, makes sense for such a problem. There are numerous other options listed in the docs that measure the difference between the inputs in other ways and could replace the default one.
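For example, the whole pipeline wrapped as a small helper that exposes the metric choice (a sketch; the name closest_row is mine, and 'cityblock' below is just one of the metrics cdist supports):

import numpy as np
from scipy.spatial.distance import cdist

def closest_row(a, b, metric='euclidean'):
    # Index of the row of `a` closest to some length-Na sliding window of `b`.
    Na = a.shape[1]
    Nb = b.size
    b2D = b[np.arange(Nb - Na + 1)[:, None] + np.arange(Na)]
    return cdist(a, b2D, metric=metric).min(1).argmin()

# e.g. closest_row(a, b, metric='cityblock')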
Let's say I have a standard 2D numpy array, call it my2darray, with values. In this array there are two major sections. Say that for each column there is a specific row which separates "scenario1" and "scenario2". How can I create two masked arrays that represent the top section of my2darray and the bottom of my2darray? For example, I am interested in calculating the mean of the top half and the mean of the bottom half. One idea is to have a mask that is the same shape as my2darray, but that seems like a waste of memory. Is there a better idea? Let's say I have a vector whose length is equal to the number of rows in my2darray (in this case 6), i.e. I have
myvector=np.array([9, 15, 5,7,11,11])
I am using python 2.6 with numpy 1.5.0
Using NumPy's broadcasted comparison, we can create such a 2D mask in a vectorized manner. The rest of the work is all about sum-reduction along the first axis, for which we can take help from np.einsum. Thus, we would have an implementation like so -
N = my2darray.shape[0]
mask = myvector <= np.arange(N)[:,None]
uout = np.true_divide(np.einsum('ij,ij->j',my2darray,~mask),myvector)
lout = np.true_divide(np.einsum('ij,ij->j',my2darray,mask),N-myvector)
Sample run to verify results -
In [184]: N = my2darray.shape[0]
...: mask = myvector <= np.arange(N)[:,None]
...: uout = np.true_divide(np.einsum('ij,ij->j',my2darray,~mask),myvector)
...: lout = np.true_divide(np.einsum('ij,ij->j',my2darray,mask),N-myvector)
...:
In [185]: uout
Out[185]: array([ 6. , 4.6, 4. , 0. ])
In [186]: [my2darray[:item,i].mean() for i,item in enumerate(myvector)]
Out[186]: [6.0, 4.5999999999999996, 4.0, 0.0] # Loopy version results
In [187]: lout
Out[187]: array([ 5.2 , 4. , 2.66666667, 2. ])
In [188]: [my2darray[item:,i].mean() for i,item in enumerate(myvector)]
Out[188]: [5.2000000000000002, 4.0, 2.6666666666666665, 2.0] # Loopy version
Another potentially faster way would be to compute the summations for the upper part once, store them, and subtract them from the column sums of the entire 2D input array; that difference can then be used for the lower-part averages. Thus, after we store N and calculate mask, we would have -
usums = np.einsum('ij,ij->j',my2darray,~mask)
uout = np.true_divide(usums,myvector)
lout = np.true_divide(my2darray.sum(0) - usums,N-myvector)
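A quick self-contained check of both outputs against a loopy reference (the arrays below are illustrative, not the ones from the question):

import numpy as np

my2darray = np.array([[6., 5., 4., 0.],
                      [6., 4., 4., 0.],
                      [5., 5., 4., 2.],
                      [5., 4., 2., 2.],
                      [6., 3., 3., 2.],
                      [5., 4., 3., 2.]])
myvector = np.array([2, 3, 4, 1])           # split row per column

N = my2darray.shape[0]
mask = myvector <= np.arange(N)[:, None]
usums = np.einsum('ij,ij->j', my2darray, ~mask)
uout = np.true_divide(usums, myvector)
lout = np.true_divide(my2darray.sum(0) - usums, N - myvector)

# loopy reference
print(uout, [my2darray[:m, i].mean() for i, m in enumerate(myvector)])
print(lout, [my2darray[m:, i].mean() for i, m in enumerate(myvector)])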
I have a large 2D numpy matrix that needs to be made smaller (ex: convert from 100x100 to 10x10).
My goal is essentially: break the nxn matrix into smaller mxm matrices, average the cells in these mxm slices, and then construct a new (smaller) matrix out of these mxm slices.
I'm thinking about using something like matrix[a::b, c::d] to extract the smaller matrices, and then averaging those values, but this seems overly complex. Is there a better way to accomplish this?
You could split your array into blocks with the view_as_blocks function (in scikit-image).
For a 2D array, this returns a 4D array with the blocks ordered row-wise:
>>> import skimage.util as ski
>>> import numpy as np
>>> a = np.arange(16).reshape(4,4) # 4x4 array
>>> ski.view_as_blocks(a, (2,2))
array([[[[ 0, 1],
[ 4, 5]],
[[ 2, 3],
[ 6, 7]]],
[[[ 8, 9],
[12, 13]],
[[10, 11],
[14, 15]]]])
Taking the mean along the last two axes returns a 2D array with the mean in each block:
>>> ski.view_as_blocks(a, (2,2)).mean(axis=(2,3))
array([[ 2.5, 4.5],
[ 10.5, 12.5]])
Note: view_as_blocks returns a view of the array by modifying the strides (it also works with arrays with more than two dimensions). It is implemented purely in NumPy using as_strided, so if you don't have access to the scikit-image library you can copy the code from here.
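If scikit-image isn't available, a minimal sketch of the same block view built directly on as_strided (the helper name is mine; it assumes a 2D array whose shape is an exact multiple of the block shape) might look like this:

import numpy as np
from numpy.lib.stride_tricks import as_strided

def view_as_blocks_np(arr, block):
    # Non-overlapping (h, w) blocks of a 2D array, as a 4D strided view.
    h, w = block
    H, W = arr.shape
    shape = (H // h, W // w, h, w)
    strides = (arr.strides[0] * h, arr.strides[1] * w, arr.strides[0], arr.strides[1])
    return as_strided(arr, shape=shape, strides=strides)

a = np.arange(16).reshape(4, 4)
print(view_as_blocks_np(a, (2, 2)).mean(axis=(2, 3)))
# [[ 2.5  4.5]
#  [10.5 12.5]]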
Without scikit-image, you can simply reshape and take the appropriate mean.
M=np.arange(10000).reshape(100,100)
M1=M.reshape(10,10,10,10)
M2=M1.mean(axis=(1,3))
A quick check to see if I got the right axes:
In [127]: M2[0,0]
Out[127]: 454.5
In [128]: M[:10,:10].mean()
Out[128]: 454.5
In [131]: M[-10:,-10:].mean()
Out[131]: 9544.5
In [132]: M2[-1,-1]
Out[132]: 9544.5
Adding .transpose([0,2,1,3]) puts the 2 averaging dimensions at the end, as view_as_blocks does.
For this (100,100) case, the reshape approach is about 2x faster than the as_strided approach, but both are quite fast, so the direct strided solution isn't much slower in absolute terms:
as_strided(M,shape=(10,10,10,10),strides=(8000,80,800,8)).mean((2,3))
as_strided(M,shape=(10,10,10,10),strides=(8000,800,80,8)).mean((1,3))
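The hard-coded strides above assume 8-byte elements in a 100-column C-contiguous array; a variant that derives them from the array itself is less fragile (a sketch of the same idea):

import numpy as np
from numpy.lib.stride_tricks import as_strided

M = np.arange(10000).reshape(100, 100)
m = 10                                     # block size
sr, sc = M.strides                         # (800, 8) for C-contiguous int64 here
blocks = as_strided(M, shape=(10, 10, m, m), strides=(sr * m, sc * m, sr, sc))
M2 = blocks.mean(axis=(2, 3))
print(M2[0, 0], M[:10, :10].mean())        # both 454.5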
I'm coming in late but I'd recommend scipy.ndimage.zoom() as an off-the-shelf solution for this. It does down-sizing (or upsizing) using spline interpolations of arbitrary order from 0 to 5. Sounds like order 0 would be sufficient for you based on your question.
from scipy import ndimage as ndi
import numpy as np
M=np.arange(1000000).reshape(1000,1000)
shrinkby=10
Mfilt = ndi.uniform_filter(input=M, size=shrinkby)
Msmall = ndi.zoom(input=Mfilt, zoom=1./shrinkby, order=0)
That's all you need. It's perhaps slightly less convenient to specify a zoom rather than a desired output size, but at least for order=0 this method is very fast.
The output size is 10% of the input in each dimension, i.e.
print(M.shape, Msmall.shape)
gives (1000, 1000) (100, 100), and the speed you can get from
%timeit Mfilt = ndi.uniform_filter(input=M, size=shrinkby)
%timeit Msmall = ndi.zoom(input=Mfilt, zoom=1./shrinkby, order=0)
which on my machine gave 10 loops, best of 3: 20.5 ms per loop for the uniform_filter call and 1000 loops, best of 3: 1.67 ms per loop for the zoom call.