I'm using the following code to search for the maximum of pred_flat. Is there a direct way to find the second maximum?
line_max_flat = np.max(pred_flat, axis=1)  # creates an array with 2500 entries, each containing the max of its row
The variable pred_flat is an array of size (2500,5) and the other questions regarding the second maximum only address arrays with 1 column or lists.
EDIT:
An example of the input is:
pred_flat = [[0.1, 0.2, 0.3, 0.5, 0.7],
             [0.5, 0.4, 0.9, 0.7, 0.3],
             [0.9, 0.7, 0.8, 0.4, 0.1]]
and the output should be:
line_max_flat=[0.5,0.7,0.8]
We could use the nlargest method of the heapq module, which returns a list of the n largest values of an iterable. It's not exactly direct, but the code is simple enough and it works:
import numpy as np
import heapq

pred_flat = np.array([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [5, 6, 7, 8, 9]])  # example used in my code

line_max_flat = []
for row in pred_flat:
    # assuming unique elements, otherwise use set(row) below
    _, sec_max = heapq.nlargest(2, row)  # returns a list [max, second_max] - here n=2
    line_max_flat.append(sec_max)
line_max_flat = np.array(line_max_flat)  # make array
print(line_max_flat)
Output:
[4 8 8] # which is the expected result from my example array
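As an aside, if a loop-free version is preferred, NumPy's np.partition can do the same selection vectorized over all rows at once. A minimal sketch (standard NumPy, but not part of the original answer; like nlargest without set, it treats a repeated maximum as the second maximum):

import numpy as np

pred_flat = np.array([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [5, 6, 7, 8, 9]])

# partition each row so the value at column -2 is the 2nd largest;
# columns to its right hold values >= it, columns to its left values <= it
second_max = np.partition(pred_flat, -2, axis=1)[:, -2]
print(second_max)  # [4 8 8]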
I am working with time series data with different sample frequencies.
I need to accurately stretch a set of 1d vectors of different lengths into a common arbitrary length.
Values should be repeated rather than interpolated.
However, the number of repetitions should be rounded up or down appropriately throughout the target to arrive at a specific target length.
I can't simply use np.repeat, since it doesn't accept fractional repeat counts, and with a single integer count the final length is always an exact multiple of the input length.
Basically I am looking for a function with roughly the following behavior:
stretch_func(np.array([1,2,4]), length=11)
out: [1,1,1,2,2,2,2,4,4,4,4]
stretch_func(np.array(["A","B"]), length=11)
out: ["A","A","A","A","A","B","B","B","B","B","B"]
EDIT:
Looks like this functionality is not standard in numpy or pandas, so I went ahead and implemented it. Here it is for anyone else who might need it:
def stretch_func(arr, length=1):
    # block boundaries of each element within the stretched output
    edges = np.round(np.linspace(0, length, arr.shape[0] + 1))
    repetitions = edges[1:] - edges[:-1]
    repeated = np.repeat(arr, repetitions.astype(int))
    return repeated
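A quick sanity check against the example above (output verified by hand; note that np.round may hand the extra repeat to a different element than the hand-written example, which is an equally valid allocation):

print(stretch_func(np.array([1, 2, 4]), length=11))
# [1 1 1 1 2 2 2 4 4 4 4]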
As you found out, repeat can use a different number of repetitions for each element. But choosing how to allocate those repetitions is ambiguous, so it's not surprising that there isn't a packaged form of your function.
By way of illustration, look at what split does in the reverse direction:
In [3]: arr = np.array([1,1,1,2,2,2,2,4,4,4,4])
In [4]: np.split(arr,3)
...
ValueError: array split does not result in an equal division
array_split does the uneven split without complaint, but it short-changes the last array rather than the first as you chose to do:
In [5]: np.array_split(arr,3)
Out[5]: [array([1, 1, 1, 2]), array([2, 2, 2, 4]), array([4, 4, 4])]
Another point: calculating the number of repetitions, even when uneven, is fast, with little dependency on the size of the array, so there's no need to perform such calculations in compiled code. Even if this kind of expansion were a common need (which I don't think it is), it would be implemented as a function similar to what you've written. Look at the code for array_split to see how it handles edge cases. (What if, for example, the desired length were less than the initial one?)
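For instance, under that edge case your stretch_func simply assigns zero repeats to some elements, so they drop out (a hand-checked example, not from the original post):

print(stretch_func(np.array([1, 2, 4]), length=2))
# [1 4]  -- the middle element gets zero repeats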
If I understood correctly you could use np.repeat and slice:
import numpy as np
def stretch_func(arr, length=1):
    reps = length // len(arr) + 1
    repeated = np.repeat(arr, reps)
    return repeated[-length:]
print(stretch_func(np.array([1,2,4]), length=11))
print(stretch_func(np.array(["A", "B"]), length=11))
Output
[1 1 1 2 2 2 2 4 4 4 4]
['A' 'A' 'A' 'A' 'A' 'B' 'B' 'B' 'B' 'B' 'B']
An alternative to using repeat is to select the indices using a linear space:
def stretch_func(arr, length=1, axis=0):
    idxs = np.round(np.linspace(0, arr.shape[axis] - 1, length)).astype(int)
    return arr.take(indices=idxs, axis=axis)
This would result in the following output:
print(stretch_func(np.array([1, 2, 4]), length=11))
[1 1 1 2 2 2 2 2 4 4 4]
print(stretch_func(np.array(["A", "B"]), length=11))
['A' 'A' 'A' 'A' 'A' 'A' 'B' 'B' 'B' 'B' 'B']
This function supports stretching along any axis as well as "shrinking", e.g.:
print(stretch_func(np.arange(10), length=5))
[0 2 4 7 9]
I have an array y_filtered that contains some masked values. I want to replace these values by some value I calculate based on their neighbouring values. I can get the indices of the masked values by using masked_slices = ma.clump_masked(y_filtered). This returns a list of slices, e.g. [slice(194, 196, None)].
I can easily get the values from my masked array using y_filtered[masked_slices], and even loop over them. However, I need to access the index of the values as well, so I can calculate the new value based on its neighbours. enumerate() (logically) returns 0, 1, etc. instead of the indices I need.
Here's the solution I came up with.
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
y_enum = [(i, y_i) for i, y_i in zip(range(len(y_filtered)), y_filtered)]

for sl in masked_slices:
    for i, y_i in y_enum[sl]:
        # simplified example calculation
        y_filtered[i] = np.average(y_filtered[i-2:i+2])
It is a very ugly method IMO and I think there has to be a better way to do this. Any suggestions?
Thanks!
EDIT:
I figured out a better way to achieve what I think you want to do. This code picks every window of 5 elements and computes its (masked) average, then uses those values to fill the gaps in the original array. If some index does not have any unmasked value close enough, it is just left masked:
import numpy as np
from numpy.lib.stride_tricks import as_strided

SMOOTH_MARGIN = 2
x = np.ma.array(data=[1, 2, 3, 4, 5, 6, 8, 9, 10],
                mask=[0, 1, 0, 0, 1, 1, 1, 1, 0])
print(x)
# [1 -- 3 4 -- -- -- -- 10]

pad_data = np.pad(x.data, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant')
pad_mask = np.pad(x.mask, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant',
                  constant_values=True)
k = 2 * SMOOTH_MARGIN + 1
isize = x.dtype.itemsize
msize = x.mask.dtype.itemsize
# sliding windows of width k over the padded data and mask
x_pad = np.ma.array(
    data=as_strided(pad_data, (len(x), k), (isize, isize), writeable=False),
    mask=as_strided(pad_mask, (len(x), k), (msize, msize), writeable=False))
x_avg = np.ma.average(x_pad, axis=1).astype(x_pad.dtype)
# fill only positions that were masked but have a valid windowed average
fill_mask = ~x_avg.mask & x.mask
result = x.copy()
result[fill_mask] = x_avg[fill_mask]
print(result)
# [1 2 3 4 3 4 10 10 10]
(note all the values are integers here because x was originally of integer type)
The original posted code has a couple of errors. First, it both reads and writes values from y_filtered in the loop, so the results at later indices are affected by the previous iterations; this can be fixed by reading from a copy of the original y_filtered. Second, [i-2:i+2] should probably be [max(i-2, 0):i+3], so that the window is symmetric and always starts at zero or later.
You could do this:
from itertools import chain
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
for idx in chain.from_iterable(range(s.start, s.stop) for s in masked_slices):
    y_filtered[idx] = np.average(y_filtered[max(idx - 2, 0):idx + 3])
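To also address the first point (reading and writing y_filtered in the same loop), the averages can be taken from an untouched copy. A minimal sketch, assuming y_filtered is a 1-D masked array as in the question:

import numpy as np
import numpy.ma as ma
from itertools import chain

y_source = y_filtered.copy()  # read neighbours from the original values only
masked_slices = ma.clump_masked(y_filtered)
for idx in chain.from_iterable(range(s.start, s.stop) for s in masked_slices):
    y_filtered[idx] = np.average(y_source[max(idx - 2, 0):idx + 3])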
I have two numpy arrays a and b, each with twenty million elements (floats). If the combination of elements at one index of the two arrays equals the combination at another index, we call it a duplicate, which should be removed from the two arrays. For instance,
a = numpy.array([1,3,6,3,7,8,3,2,9,10,14,6])
b = numpy.array([2,4,15,4,7,9,2,2,0,11,4,15])
From those two arrays, a[2]&b[2] is the same as a[11]&b[11], so we call it a duplicate element, which should be removed. The same goes for a[1]&b[1] vs a[3]&b[3]. Although each array has duplicate elements of its own, those are not treated as duplicates. So I want the returned arrays to be:
a = numpy.array([1,3,6,7,8,3,2,9,10,14])
b = numpy.array([2,4,15,7,9,2,2,0,11,4])
Does anyone have a clever way to implement such a reduction?
First you have to pack a and b to identify duplicates.
If the values are positive integers (see the edit for other cases), this can be achieved by:
base = a.max() + 1
c = a + base * b
Then just find unique values in c:
val, ind = np.unique(c, return_index=True)
and retrieve the associated values in a and b.
ind.sort()
print(a[ind])
print(b[ind])
showing the disappearance of the duplicates (two here):
[ 1 3 6 7 8 3 2 9 10 14]
[ 2 4 15 7 9 2 2 0 11 4]
EDIT
Regardless of the datatype, the c array can be made as follows, packing the data to bytes:
ab = np.ascontiguousarray(np.vstack((a, b)).T)
dtype = 'S' + str(2 * a.itemsize)
c = ab.view(dtype=dtype)
This is done in one pass and without requiring any extra memory for the resulting arrays.
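Putting the EDIT together with the unique step from above into one runnable sketch (the np. prefixes and the example arrays are filled in; hand-checked against the question's data):

import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

ab = np.ascontiguousarray(np.vstack((a, b)).T)  # shape (n, 2): one row per pair
c = ab.view(dtype='S' + str(2 * a.itemsize))    # each row viewed as one byte string
val, ind = np.unique(c, return_index=True)      # first occurrence of each pair
ind.sort()
print(a[ind])  # [ 1  3  6  7  8  3  2  9 10 14]
print(b[ind])  # [ 2  4 15  7  9  2  2  0 11  4]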
Pair up the elements at each index and iterate over the pairs. Keep track of which pairs have been seen so far, along with a write index into the arrays. When a pair has not been seen before, write it back at the current index and increase the index by 1, effectively keeping it in place. For a duplicate pair, don't increase the index, effectively shifting every subsequent new pair one position to the left. At the end, keep only the first index elements to shorten the arrays.
import itertools as it

def delete_duplicate_pairs(*arrays):
    unique = set()
    arrays = list(arrays)
    n = range(len(arrays))
    index = 0
    for pair in it.izip(*arrays):
        if pair not in unique:
            unique.add(pair)
            for i in n:
                arrays[i][index] = pair[i]
            index += 1
    return [a[:index] for a in arrays]
If you are on Python 2, zip() creates the full list of pairs up front. If you have a lot of elements in your arrays, it will be more efficient to use itertools.izip(), which creates the pairs lazily as you request them. zip() in Python 3 behaves like that by default.
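If the same function needs to run under both versions, a small compatibility shim (an illustration, not part of the original answer) lets the body stay unchanged:

import itertools
try:
    pair_iter = itertools.izip  # Python 2: lazy pairing
except AttributeError:
    pair_iter = zip             # Python 3: built-in zip is already lazy

Then use pair_iter(*arrays) in place of it.izip(*arrays).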
For your case,
>>> import numpy as np
>>> a = np.array([1,3,6,3,7,8,3,2,9,10,14,6])
>>> b = np.array([2,4,15,4,7,9,2,2,0,11,4,15])
>>> a, b = delete_duplicate_pairs(a, b)
>>> a
array([ 1, 3, 6, 7, 8, 3, 2, 9, 10, 14])
>>> b
array([ 2, 4, 15, 7, 9, 2, 2, 0, 11, 4])
Now, it all comes down to what values your arrays hold. If you have only the values 0-9, there are only 100 unique pairs and most elements will be duplicates, which saves you time. For 20 million elements for both a and b and containing values only between 0-9, the process completes in 6 seconds. For values between 0-999, it takes 12 seconds.
With the list a = [1, 2, 3, 5, 4] I wish to find the index of the nth largest value, e.g. function(a, 4) = 2, since 2 is the index of the 4th largest value. NOTE: it needs to work for lists containing 500 or more elements, and a solution that uses looping is fine.
You could index into the result of sorted(a) to find the n-th largest value:
>>> a = [1, 2, 3, 5, 4]
>>> n = 4
>>> x = sorted(a)[-n]
>>> x
2
Then use a.index() to find the element's index in the original list (assuming the elements are unique):
>>> a.index(x) + 1 # use 1-based indexing
2
P.S. If n is small, you could also use heapq.nlargest() to get the n-th largest element:
>>> import heapq
>>> heapq.nlargest(n, a)[-1]
2
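To recover the index as well, feed that value back through a.index() (same uniqueness assumption as above):

>>> a.index(heapq.nlargest(n, a)[-1]) + 1
2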
If not using index() and max() (the easiest way), I would just do:
def function(a):
    highest_value = [0, 0]  # [value, index]
    for x in range(len(a), 0, -1):
        value = a.pop()
        if value > highest_value[0]:
            highest_value[0] = value
            highest_value[1] = len(a)  # index the popped value had in the original list
    return highest_value
This way, you get the highest value and save its index at the same time, so it should be quite efficient. Using pop() is fast because it takes elements from the back of the list, so the list is consumed backwards.
I think you're using 1-based indexing. This assumes the values are unique.
a.index(sorted(a)[-4]) + 1
will give you 2.