Stretch numpy vector to arbitrary length without interpolation - python

I am working with time series data with different sample frequencies.
I need to accurately stretch a set of 1d vectors of different lengths into a common arbitrary length.
Values should be repeated rather than interpolated.
However, the number of repetitions per element should be rounded up or down as needed across the output so as to arrive at a specific target length.
I can't seem to use np.repeat as it rounds off fractional numbers of repeats and the final length is always an exact multiple of repeats.
Basically I am looking for a function with roughly the following behavior:
stretch_func(np.array([1,2,4]), length=11)
out: [1,1,1,2,2,2,2,4,4,4,4]
stretch_func(np.array(["A","B"]), length=11)
out: ["A","A","A","A","A","B","B","B","B","B","B"]
EDIT:
Looks like this functionality is not standard in numpy or pandas. I went ahead and implemented this so here it is for anyone else that might need it:
def stretch_func(arr, length=1):
    # Round the cumulative boundaries, then take differences to get the
    # (uneven) number of repetitions for each element.
    bounds = np.round(np.linspace(0, length, arr.shape[0] + 1))
    repetitions = (bounds[1:] - bounds[:-1]).astype(int)
    return np.repeat(arr, repetitions)

As you found out, repeat can use a different number of repetitions for each element. But choosing how to allocate those repetitions is ambiguous, so it's not surprising that there isn't a packaged form of your function.
By way of illustration look at what split does in the reverse direction:
In [3]: arr = np.array([1,1,1,2,2,2,2,4,4,4,4])
In [4]: np.split(arr,3)
...
ValueError: array split does not result in an equal division
array_split does the uneven split without complaint - but it short-changes the last array, not the first as you chose to do:
In [5]: np.array_split(arr,3)
Out[5]: [array([1, 1, 1, 2]), array([2, 2, 2, 4]), array([4, 4, 4])]
Another point - calculating the number of repetitions, even when uneven, is fast, with little dependency on the size of the array. So there's no need to perform such calculations in compiled code. Even if this kind of expansion were a common need (which I don't think it is), it would be implemented as a function similar to what you've written. Look at the code for array_split to see how it handles edge cases. (What if, for example, the desired length were less than the initial one?)
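For what it's worth, the function posted in the edit already copes with that case: boundary intervals that round to zero width yield zero repetitions, so the output can be shorter than the input. A minimal check (restating stretch_func from the edit above):
import numpy as np

def stretch_func(arr, length=1):
    # same logic as the function in the question's edit
    bounds = np.round(np.linspace(0, length, arr.shape[0] + 1))
    return np.repeat(arr, (bounds[1:] - bounds[:-1]).astype(int))

# Elements whose boundary interval rounds to zero width get zero
# repetitions and are dropped:
print(stretch_func(np.arange(11), length=3))  # [1 5 9]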

If I understood correctly, you could use np.repeat and slice:
import numpy as np
def stretch_func(arr, length=1):
    reps = length // len(arr) + 1
    repeated = np.repeat(arr, reps)
    return repeated[-length:]
print(stretch_func(np.array([1,2,4]), length=11))
print(stretch_func(np.array(["A", "B"]), length=11))
Output
[1 1 1 2 2 2 2 4 4 4 4]
['A' 'A' 'A' 'A' 'A' 'B' 'B' 'B' 'B' 'B' 'B']
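Note that slicing from the end reproduces the asker's choice of short-changing the first element; slicing from the front with [:length] would short-change the last one instead:
# Slicing from the front keeps all repetitions of the first element
# and short-changes the last one:
print(np.repeat(np.array([1, 2, 4]), 4)[:11])  # [1 1 1 1 2 2 2 2 4 4 4]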

An alternative to using repeat is to select the indices using a linear space:
def stretch_func(arr, length=1, axis=0):
    idxs = np.round(np.linspace(0, arr.shape[axis] - 1, length)).astype(int)
    return arr.take(indices=idxs, axis=axis)
This would result in the following output:
print(stretch_func(np.array([1, 2, 4]), length=11))
[1 1 1 2 2 2 2 2 4 4 4]
print(stretch_func(np.array(["A", "B"]), length=11))
['A' 'A' 'A' 'A' 'A' 'A' 'B' 'B' 'B' 'B' 'B']
This function supports stretching along any axis as well as "shrinking", e.g.:
print(stretch_func(np.arange(10), length=5))
[0 2 4 7 9]

Related

Stable conversion of a multi-column (2D) numpy array to an indicator vector

I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order-preserving) manner.
For example, I have the following numpy array:
import numpy as np
arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])
The output I'd like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear an expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10 3]
# [ 1 20 3]
# [ 2 20 1]
# [ 2 20 2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items in a manner that they start from 0 but I feel that there is a simple numpy trick that I can use and therefore avoid adding sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
    if item not in dct:
        dct[item] = len(dct)
    indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution is already posted on Stack Overflow and correctly returns a stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) to represent the values in a stable order starting from 0.
3. Pandas's get_dummies function: But it returns a one-hot encoding (a matrix of indicator values). In contrast, I would like to have an indicator vector. It is indeed possible to convert the one-hot encoding to an indicator vector with a few lines of code and data manipulation, but again that approach is not going to be highly efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that np.argsort is its own inverse to fix the order. Note that idx.argsort() places unq into sorted order. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course the byproduct
unq = unq[idx.argsort()]
Of course, nothing about these operations is specific to 2D.
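Putting it together on the arr from the question (a quick check; order and indicator are just local names here):
import numpy as np

arr = np.array([[2, 20, 1], [1, 10, 3], [2, 20, 2],
                [2, 20, 1], [1, 20, 3], [2, 20, 2]])
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
order = idx.argsort()             # permutation putting unq into first-occurrence order
indicator = order.argsort()[inv]  # relabel the inverse accordingly
print(indicator)                  # [0 1 2 0 3 2]
print(unq[order])                 # unique rows in order of first appearance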
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index that tells you what elements of x are placed at each of the locations in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
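A quick sanity check of that identity:
import numpy as np

x = np.array([7, 3, 0, 1, 4])
i = x.argsort()
# argsorting the argsort yields the inverse permutation:
print(np.array_equal(np.sort(x)[i.argsort()], x))                # True
print(np.array_equal(x[x.argsort()][x.argsort().argsort()], x))  # True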

Determining index each group duplicate values in an array in Python with the fastest way

I want to find the indices of each group of duplicate values, like this:
s = [2,6,2,88,6,...]
The result must give the indices into the original s, e.g. [[0,2],[1,4],..], though another output format would also work.
I tried many solutions, and the fastest way I found to get the duplicate groups is:
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
But after sorting, I lose the indices into the original s.
In my case, I have ~200 million values in the list and I want the fastest way to do this. I use an array to store the values because I want to use a GPU to make it faster.
Using hash structures like dict helps.
For example:
import numpy as np
from collections import defaultdict
a=np.array([2,4,2,88,15,4])
table=defaultdict(list)
for ind, num in enumerate(a):
    table[num] += [ind]
Outputs:
{2: [0, 2], 4: [1, 5], 88: [3], 15: [4]}
If you want to show duplicated elements in the order from small to large:
for k, v in sorted(table.items()):
    if len(v) > 1:
        print(k, ":", v)
Outputs:
2 : [0, 2]
4 : [1, 5]
The speed is determined by how many distinct values there are in the list.
See if this meets your performance requirements (here, s is your input array):
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = np.split(sorted_inds, cum_counts[:-1])
Notes:
The result would be a list of arrays.
Each of these arrays would contain the indices of one repeated value in s. E.g., if the value 13 is repeated 7 times in s, there would be an array with those 7 indices among the arrays of result.
If you want to ignore singleton values of s (values that occur only once in s), you can filter the result, keeping only the arrays of length greater than 1, as in the sketch below.
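A minimal sketch of the whole recipe on the example from the question (np.bincount assumes non-negative integers):
import numpy as np

s = np.array([2, 6, 2, 88, 6])
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
groups = np.split(sorted_inds, cum_counts[:-1])
# Most groups are empty (values that never occur); keep the true duplicates:
print([g for g in groups if len(g) > 1])  # [array([0, 2]), array([1, 4])]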
(This is a variation of my other answer. Here, instead of splitting the large array sorted_inds, we take slices from it, so it's likely to have a different kind of performance characteristic)
If s is the input array:
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = [sorted_inds[:cum_counts[0]]] + [sorted_inds[cum_counts[i]:cum_counts[i+1]] for i in range(cum_counts.size-1)]

get a vector from a matrix and a vector of indices in numpy

I have a matrix m = [[1,2,3],[4,5,6],[7,8,9]] and a vector v=[1,2,0] that contains the indices of the rows I want to return for each column of my matrix.
The result I expect is r=[4,8,3], but I cannot find out how to get it using numpy.
By applying the vector as a row index for each column, I get m[v,[0,1,2]] = [4, 8, 3], which is roughly what I'm after.
To avoid hardcoding the columns, I'm using np.arange(m.shape[1]), and my final expression looks like r=m[v,np.arange(m.shape[1])].
This seems weird to me and a little complicated for something that should be quite common.
Is there a clean way to get such a result?
In [157]: m = np.array([[1,2,3],[4,5,6],[7,8,9]]);v=np.array([1,2,0])
In [158]: m
Out[158]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [159]: v
Out[159]: array([1, 2, 0])
In [160]: m[v,np.arange(3)]
Out[160]: array([4, 8, 3])
We are choosing 3 elements, with indices (1,0),(2,1),(0,2).
Closer to the MATLAB approach:
In [162]: np.ravel_multi_index((v,np.arange(3)),(3,3))
Out[162]: array([3, 7, 2])
In [163]: m.flat[_]
Out[163]: array([4, 8, 3])
Octave/MATLAB equivalent
>> m = [1 2 3;4 5 6;7 8 9];
>> v = [2 3 1];
>> sub2ind([3,3],v,[1 2 3])
ans =
   2   6   7
>> m(sub2ind([3,3],v,[1 2 3]))
ans =
   4   8   3
The same broadcasting is used to access a block, as illustrated in this recent question:
Is there a way in Python to get a sub matrix as in Matlab?
Well, this 'weird/complicated' thing is actually mentioned as a "straight forward" scenario in the documentation of Integer array indexing, which is a sub-topic under the broader topic of "Advanced Indexing".
To quote some extract:
When the index consists of as many integer arrays as the array being indexed has dimensions, the indexing is straight forward, but different from slicing. Advanced indexes always are broadcast and iterated as one. Note that the result shape is identical to the (broadcast) indexing array shapes.
If it makes it seem any less complicated/weird, you could use range(m.shape[1]) instead of np.arange(m.shape[1]). It just needs to be any array or array-like structure.
Visualization / Intuition:
When I was learning this (integer array indexing), it helped me to visualize things in the following way:
I visualized the indexing arrays standing side-by-side, all having exactly the same shape (perhaps as a consequence of getting broadcasted together). I also visualized the result array, which also has the same shape as the indexing arrays. In each of these indexing arrays and the result array, I visualized a monkey, capable of doing a walk-through of its own array, hopping to successive elements of its own array. Note that, in general, this identical shape of the indexing arrays and the result array, can be n-dimensional, and this identical shape can be very different from the shape of the source array whose values are actually being indexed.
In your own example, the source array m has shape (3,3), and the indexing arrays and the result array each have a shape of (3,).
In your example, there is a monkey in each of those three arrays (the two indexing arrays and the result array). We then visualize the monkeys doing a walk-through of their respective array elements in tandem. Here, "in tandem" means all three monkeys start at the first element of their respective arrays, and whenever a monkey hops to the next element of its own array, the other monkeys also hop to the next element of theirs. As it hops to each successive element, the monkey in each indexing array calls out the value of the element it has just visited. The monkey in the result array hops in tandem with the monkeys in the indexing arrays, hears the values they call out, uses those values as indices into the source array m, and thus determines the value to be picked from m. It picks up this value and stores it in the result array, at the location it has just hopped to. Thus, for example, when all three monkeys are at the second element of their respective arrays, the second position of the result array gets its value determined.
As stated by the numpy documentation, I think the way you mentioned is the standard way to do this task:
Example
From each row, a specific element should be selected. The row index is just [0, 1, 2] and the column index specifies the element to choose for the corresponding row, here [0, 1, 0]. Using both together the task can be solved using advanced indexing:
x = np.array([[1, 2], [3, 4], [5, 6]])
x[[0, 1, 2], [0, 1, 0]]
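This returns array([1, 4, 5]): the elements at positions (0, 0), (1, 1) and (2, 0).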

remove duplicate elements from two numpy arrays

I have two numpy arrays a and b, each with twenty million elements (floats). If the pair formed by the elements of those two arrays at the same index occurs more than once, we call it a duplicate, which should be removed from the two arrays. For instance,
a = numpy.array([1,3,6,3,7,8,3,2,9,10,14,6])
b = numpy.array([2,4,15,4,7,9,2,2,0,11,4,15])
From those two arrays, a[2]&b[2] is the same as a[11]&b[11], so we call it a duplicate element, which should be removed. The same goes for a[1]&b[1] vs a[3]&b[3]. Although each array has duplicate elements of its own, those are not treated as duplicates. So I want the returned arrays to be:
a = numpy.array([1,3,6,7,8,3,2,9,10,14])
b = numpy.array([2,4,15,7,9,2,2,0,11,4])
Does anyone have a clever way to implement such a reduction?
First you have to pack a and b to identify duplicates.
If the values are positive integers (see the edit for other cases), this can be achieved by:
base=a.max()+1
c=a+base*b
Then just find unique values in c:
val,ind=np.unique(c,return_index=True)
and retrieve the associated values in a and b.
ind.sort()
print(a[ind])
print(b[ind])
Note that the duplicates (two here) are gone:
[ 1 3 6 7 8 3 2 9 10 14]
[ 2 4 15 7 9 2 2 0 11 4]
EDIT
Regardless of dtype, the c array can be made as follows, packing the data to bytes:
ab = np.ascontiguousarray(np.vstack((a, b)).T)
dtype = 'S' + str(2 * a.itemsize)
c = ab.view(dtype=dtype)
This is done in one pass and without requiring any extra memory for the resulting arrays.
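Putting the byte-view variant together, here is a consolidated sketch (the function name remove_duplicate_pairs is mine, and a and b are assumed to share the same dtype):
import numpy as np

def remove_duplicate_pairs(a, b):
    # View each (a[i], b[i]) row as one opaque byte string so np.unique
    # can deduplicate pairs regardless of dtype.
    ab = np.ascontiguousarray(np.vstack((a, b)).T)
    c = ab.view(dtype='S' + str(2 * a.itemsize))
    _, ind = np.unique(c, return_index=True)
    ind.sort()  # keep the first occurrences in their original order
    return a[ind], b[ind]

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])
a, b = remove_duplicate_pairs(a, b)
print(a)  # [ 1  3  6  7  8  3  2  9 10 14]
print(b)  # [ 2  4 15  7  9  2  2  0 11  4]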
Pair up the elements at each index and iterate over them. Keep track of which pairs have been seen so far, along with a counter for the write index into the arrays. When a pair has not been seen before, the index increases by 1, effectively writing the pair back to its original place. For a duplicate pair, however, you don't increase the index, effectively shifting every later new pair one position to the left. At the end, keep the first index elements to shorten the arrays.
import itertools as it
def delete_duplicate_pairs(*arrays):
    unique = set()
    arrays = list(arrays)
    n = range(len(arrays))
    index = 0
    for pair in it.izip(*arrays):
        if pair not in unique:
            unique.add(pair)
            for i in n:
                arrays[i][index] = pair[i]
            index += 1
    return [a[:index] for a in arrays]
If you are on Python 2, zip() creates the list of pairs up front. If you have a lot of elements in your arrays, it'll be more efficient to use itertools.izip() which will create the pairs as you request them. However, zip() in Python 3 behaves like that by default.
For your case,
>>> import numpy as np
>>> a = np.array([1,3,6,3,7,8,3,2,9,10,14,6])
>>> b = np.array([2,4,15,4,7,9,2,2,0,11,4,15])
>>> a, b = delete_duplicate_pairs(a, b)
>>> a
array([ 1, 3, 6, 7, 8, 3, 2, 9, 10, 14])
>>> b
array([ 2, 4, 15, 7, 9, 2, 2, 0, 11, 4])
Now, it all comes down to what values your arrays hold. If you have only the values 0-9, there are only 100 unique pairs and most elements will be duplicates, which saves you time. For 20 million elements in both a and b, with values only between 0-9, the process completes in 6 seconds. For values between 0-999, it takes 12 seconds.

Skipping Same Values when Reading csv into Python

I am trying to subtract the previous item in a list from the following item in the list, but I think my type is preventing me from doing so. The type of each item in the list is int. If I have a list of integers such as
1 2 3 4 5 6 7
How will I subtract 1 from 2, 2 from 3, 3 from 4, etc., and print this value after each operation?
My list is torcount, which I acquired from a numpy operation, and this is the code I tried:
TorCount=len(np.unique(TorNum))
for i in range(TorCount):
    TorCount=TorCount[i]-TorCount[i-1]
    print TorCount
Thank you
Use np.diff:
Example:
>>> xs = np.array([1, 2, 3, 4])
>>> np.diff(xs, n=1)
array([1, 1, 1])
numpy.diff(a, n=1, axis=-1)
Calculate the n-th order discrete difference along given axis.
The first order difference is given by out[n] = a[n+1] - a[n] along the given axis; higher order differences are calculated by using diff recursively.
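Applied to the list from the question, a minimal sketch that prints each difference as it is computed:
import numpy as np

xs = np.array([1, 2, 3, 4, 5, 6, 7])
for d in np.diff(xs):
    print(d)  # prints 1 six times, one difference per line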
