Calculate distance between arrays that contain NaN - python

Consider array1 and array2, with:
array1 = [a1, a2, NaN, ..., an]
array2 = [[NaN, b2,  b3,  ..., bn],
          [b21, NaN, b23, ..., b2n],
          ...]
Both arrays are numpy arrays. There is an easy way to compute the Euclidean distance between array1 and each row of array2:
EuclideanDistance = np.sqrt(((array1 - array2)**2).sum(axis=1))
What messes up this computation are the NaN values. Of course, I could easily replace NaN with some number. But instead, I want to do the following:
When I compare array1 with row_x of array2, I count the columns in which one of the arrays has NaN and the other doesn't. Let's assume the count is 3. I will then delete these columns from both arrays and compute the Euclidean distance between the two. In the end, I add a minus_value * count to the calculated distance.
Now, I cannot think of a fast and efficient way to do this. Can somebody help me?
Here are a few of my ideas:
minus = 1000
dist = np.zeros(shape=(array1.shape[0])) # this array will store the distance of array1 to each row of array2
array1 = np.repeat(array1, array2.shape[0], axis=0) # now array1 has the same dimensions as array2
for i in range(0, array1.shape[0]):
    boolarray = np.logical_or(np.isnan(array1[i]), np.isnan(array2[i]))
    count = boolarray.sum()
    deleteIdxs = boolarray.nonzero()  # this should give the indices where boolarray is True
    dist[i] = np.sqrt(((np.delete(array1[i], deleteIdxs, axis=0) - np.delete(array2[i], deleteIdxs, axis=0))**2).sum(axis=0))
    dist[i] = dist[i] + count*minus
These lines look more than ugly to me, however. Also, I keep getting an index error: apparently deleteIdxs contains an index that is out of range for array1. I don't know how this can even happen.

You can find all the indices where the value is NaN using:
indices_1 = np.isnan(array1)
indices_2 = np.isnan(array2)
You can combine these into:
indices_total = indices_1 + indices_2
And you can keep all the non-NaN values using:
array_1_not_nan = array1[~indices_total]
array_2_not_nan = array2[~indices_total]
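To tie these masks back to the penalized distance from the question, one straightforward option is a loop over the rows. This is only a sketch reusing the question's array1, array2 and minus, and it penalizes every column where either value is NaN, which may differ slightly from the question's exact counting rule:
import numpy as np

dist = np.empty(array2.shape[0])
for i, row in enumerate(array2):
    bad = np.isnan(array1) | np.isnan(row)  # columns to drop for this row
    diff = array1[~bad] - row[~bad]         # distance over the remaining columns only
    dist[i] = np.sqrt((diff ** 2).sum()) + bad.sum() * minus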

I would write a function to handle the distance calculation. I am sure there is a faster and more efficient way to write this (list comprehensions, aggregations, etc.), but readability counts, right? :)
import numpy as np
def calculate_distance(fixed_arr, var_arr, penalty):
    s_sum = 0.0
    counter = 0
    for num_1, num_2 in zip(fixed_arr, var_arr):
        if np.isnan(num_1) or np.isnan(num_2):
            counter += 1
        else:
            s_sum += (num_1 - num_2) ** 2
    return np.sqrt(s_sum) + penalty * counter, counter
array1 = np.array([1, 2, 3, np.NaN, 5, 6])
array2 = np.array(
    [
        [3, 4, 9, 3, 4, 8],
        [3, 4, np.NaN, 3, 4, 8],
        [np.NaN, 9, np.NaN, 3, 4, 8],
        [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN],
    ]
)
dist = np.zeros(len(array2))
minus = 10
for index, arr in enumerate(array2):
    dist[index], _ = calculate_distance(array1, arr, minus)
print(dist)
You have to think about the value of the minus variable very carefully. Is adding an arbitrary penalty really useful?
As @Nathan suggested, a more resource-efficient version can easily be implemented:
fixed_arr = array1
penalty = minus
dist = [
    (
        lambda indices=(np.isnan(fixed_arr) + np.isnan(var_arr)): np.linalg.norm(
            fixed_arr[~indices] - var_arr[~indices]
        )
        + (indices == True).sum() * penalty
    )()
    for var_arr in array2
]
print(dist)
However, I would only implement something like this if I absolutely needed to (i.e. if it's the bottleneck). In all other cases I would happily sacrifice some resources to gain readability and extensibility.

You can filter out the columns containing NaN with:
mask1 = np.isnan(arr1)         # NaN positions in the 1-D array
mask2 = np.isnan(arr2).any(0)  # columns of the 2-D array containing any NaN
mask = ~(mask1 | mask2)        # columns to keep
# the two filtered arrays
arr1[mask], arr2[:, mask]
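The same per-row dropping from the question can also be written without an explicit loop by using broadcasting. Again just a sketch reusing the question's array1, array2 and minus, penalizing every column where either value is NaN:
import numpy as np

bad = np.isnan(array1) | np.isnan(array2)   # shape (n_rows, n_cols), True where a column is dropped
diff = np.where(bad, 0.0, array1 - array2)  # zero out the dropped columns
dist = np.sqrt((diff ** 2).sum(axis=1)) + bad.sum(axis=1) * minus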

Related

Converting an array of numbers from an old range to a new range, where the lowest valued number is a 100 and the highest valued number is a 0?

Say we have an array of values [2, 5, 7, 9, 3]. I would want the 2 to become 100 since it's the lowest value, the 9 to become 0 since it's the highest, and everything in between to be interpolated. How would I go about converting this array? When I say interpolated, I want the numbers to keep the same relative distance apart on the new scale, so the 3 wouldn't quite be 100, but close, maybe around 95 or so.
Just scale the array into the range [0, 100], then subtract each value from 100. So, the solution is:
import numpy as np
arr = [2, 5, 7, 9, 3]
min_val = np.min(arr)
max_val = np.max(arr)
total_range = max_val - min_val
new_arr = [(100 - int(((i - min_val)/total_range) * 100.0)) for i in arr]
Notice that if you want the values to be mapped uniformly across the range from minimum to maximum, your example for 3 cannot happen. In this solution, 3 maps to roughly 86 (not 95).
I hope I'm understanding the question correctly. If so, my solution would be to sort the list and then scale each element proportionally, using 100 divided by the difference between the highest and lowest values of the list. Here is a quick code example that works fine:
a = [2, 5, 7, 9, 3]
a.sort()
b = []
for element in a:
    b.append(int(100 - (element - a[0]) * (100 / (a[-1] - a[0]))))
print(a)
print(b)
Along the same lines, but broken into smaller steps (plus some practice in naming variables):
a = [2,5,7,9,3]
a_min = min(a)
a_max = max(a)
a_diff = a_max - a_min
b=[]
for x in a:
    b += [(x - a_min)]
b_max = max(b)
c = []
for x in b:
    c += [(1 - (x / b_max)) * 100]
print('c: ', c)
# c: [100.0, 57.14285714285714, 28.57142857142857, 0.0, 85.71428571428572]
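Since the related questions use NumPy anyway, the same rescaling can also be written as a vectorized one-liner. This is just a sketch of the idea, not taken from any of the answers above:
import numpy as np

a = np.array([2, 5, 7, 9, 3])
scaled = 100 * (a.max() - a) / (a.max() - a.min())  # min -> 100, max -> 0
print(scaled)  # approximately [100., 57.14, 28.57, 0., 85.71]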

How to append values to multidimensional numpy array?

How can I efficiently append values to a multidimensional numpy array?
import numpy as np
a = np.array([[1,2,3], [4,5,6]])
print(a)
I want to append np.NaN k=2 times to each row of the outer array.
One option would be to use a loop, but I guess there must be something smarter (vectorized) in numpy.
Expected result would be:
np.array([[1, 2, 3, np.NaN, np.NaN], [4, 5, 6, np.NaN, np.NaN]])
I.e. I am looking for a way to:
np.concatenate((a, np.NaN))
on all the inner dimensions.
A call like
np.append(a, [[np.NaN, np.NaN]], axis=0)
fails with:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 2
For your problem np.hstack() or np.pad() should do the job.
Using np.hstack():
k = 2
a_mat = np.array([[1,2,3], [4, 5, 6]])
nan_mat = np.zeros((a_mat.shape[0], k))
nan_mat.fill(np.nan)
a_mat = np.hstack((a_mat, nan_mat))
Using np.pad():
k = 2
padding_shape = [(0, 0), (0, k)]  # [(dim1_before_pads, dim1_after_pads), (dim2_before_pads, dim2_after_pads)]
a_mat = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)  # float dtype, so the padded values can actually hold np.nan
np.pad(a_mat, padding_shape, mode='constant', constant_values=np.nan)
Note: In case you are using np.pad() for filling with np.nan, check this post out as well: about padding with np.nan
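A third option, closest to the np.concatenate idea in the question, is to build the NaN block with np.full and concatenate along axis 1. Just a sketch, not taken from the answers above:
import numpy as np

k = 2
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)  # float so the result can hold NaN
out = np.concatenate((a, np.full((a.shape[0], k), np.nan)), axis=1)
print(out)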

Find index of max element in numpy array excluding few indexes

Say:
p = array([4, 0, 8, 2, 7])
Want to find the index of the max value, excluding a few indexes, say:
excptIndx = [2, 3]
Ans: 4, as 7 will be max.
If excptIndx = [1, 3], Ans: 2, as 8 will be max.
In numpy, you can mask all values at excptIndx and run argmax to obtain the index of the max element:
import numpy as np
p = np.array([4, 0, 8, 2, 7])
excptIndx = [2, 3]
m = np.zeros(p.size, dtype=bool)
m[excptIndx] = True
a = np.ma.array(p, mask=m)
print(np.argmax(a))
# 4
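As a small usage note (my addition, not part of the original answer): the same masked array also gives the max value itself, with the masked entries excluded:
print(a.max())
# 7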
The setup:
In [153]: p = np.array([4,0,8,2,7])
In [154]: exceptions = [2,3]
Original indexes in p:
In [155]: idx = np.arange(p.shape[0])
delete exceptions from both:
In [156]: np.delete(p,exceptions)
Out[156]: array([4, 0, 7])
In [157]: np.delete(idx,exceptions)
Out[157]: array([0, 1, 4])
Find the argmax in the deleted array:
In [158]: np.argmax(np.delete(p,exceptions))
Out[158]: 2
Use that to find the max value (could just as well use np.max(_156)):
In [159]: _156[_158]
Out[159]: 7
Use the same index to find the index in the original p
In [160]: _157[_158]
Out[160]: 4
In [161]: p[_160] # another way to get the max value
Out[161]: 7
For this small example, the pure Python alternatives might well be faster. They often are in small cases. We need test cases with 1000 or more values to really see the advantages of numpy.
Another method
Set the exceptions to a small enough value, and take the argmax:
In [162]: p1 = p.copy(); p1[exceptions] = -1000
In [163]: np.argmax(p1)
Out[163]: 4
Here a small enough value is easy to pick; more generally it may require some thought.
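One generic choice for integer arrays (my assumption, not part of the original answer) is the dtype's own minimum:
p1 = p.copy()
p1[exceptions] = np.iinfo(p.dtype).min  # smallest representable value for p's integer dtype
np.argmax(p1)  # 4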
Or taking advantage of the np.nan... functions:
In [164]: p1 = p.astype(float); p1[exceptions]=np.nan
In [165]: np.nanargmax(p1)
Out[165]: 4
A solution is
mask = ~np.isin(np.arange(len(p)), excptIndx)  # True at the indices we keep
subset_idx = np.argmax(p[mask])
parent_idx = np.arange(len(p))[mask][subset_idx]
See http://seanlaw.github.io/2015/09/10/numpy-argmin-with-a-condition/
p = np.array([4, 0, 8, 2, 7])  # given
exceptions = [2, 3]  # given
idx = list(range(0, len(p)))  # simple list of indices
a1 = np.delete(idx, exceptions)  # remove exceptions from idx (i.e., the indices)
a2 = np.argmax(np.delete(p, exceptions))  # index of the max value after removing exceptions from the actual p array
a1[a2]  # as a1 and a2 are in sync, this gives the original index (as asked) of the max value

Python - remove elements from array

I have an array called a and another array b. The array a is the main array where I store float data, and b is an array which contains some indexes belonging to a.
Example:
a = [1.3, 1.7, 18.4, 56.2, 82.2, 18.1, 81.9, 56.9, -274.45]
b = [0, 1, 2, 3, 4, 5, 6, 7]
In this example b contains indexes of a from 0 to 7.
What I'm trying to do in Python is to remove "duplicates": I want to remove from b all indexes whose value in a already has a similar value earlier in the array. For example, notice the pair 1.3 and 1.7. Also, there are 18.4 and 18.1, etc. I want to find all these values and write -1 in all the places in array b that refer to them.
Output should be the following:
b = [0, -1, 2, 3, 4, -1, -1, -1]
I think it is obvious what I am trying to achieve. Here index 1 is replaced with -1 because in a it represents 1.7, which has the "pair" 1.3. Also, the last 3 indexes represent 18.1, 81.9 and 56.9, which also have their "pairs" earlier, so they are replaced with -1.
Of course, I have a parameter x which represents how "similar" values have to be. So, here x = 2, which means that any two values that differ by less than 2 are considered similar.
What have I tried? I used 2 nested for loops and a lot of unnecessary variables, and my algorithm eats memory and performance. Is there an elegant numpy-ish way to achieve this?
Approach #1: Here's a vectorized approach using broadcasting; it's a bit memory intensive -
x = 2 # threshold that decides similarity
a_b = a[b]
mask = np.triu(np.abs(a_b[:,None]-a_b)<x,1).any(0)
b[mask[:len(b)]] = -1
Sample run -
In [95]: a = np.array([1.3, 1.7, 18.4, 56.2, 82.2, 18.1, 81.9, 56.9, -274.45])
...: b = np.array([0, 1, 2, 3, 4, 5, 6, 7])
...:
# After code run ...
In [97]: b
Out[97]: array([ 0, -1, 2, 3, 4, -1, -1, -1])
Approach #2: A less memory-intensive approach -
import pandas as pd

def set_mask(a, b, thresh):
    a_b = a[b]
    N = len(a_b)
    sidx = a_b.argsort()
    sorted_a_b = a_b[sidx]
    mask0 = sorted_a_b[1:] - sorted_a_b[:-1] < thresh
    id_arr = np.zeros(N, dtype=int)
    id_arr[np.flatnonzero(~mask0) + 1] = 1
    ids = id_arr.cumsum()
    d = np.column_stack((ids, sidx))
    df0 = pd.DataFrame(d, columns=('ids', 'sidx'))
    pp = df0['sidx'].groupby([ids]).min()
    maskc = np.ones(N, dtype=bool)
    maskc[pp.values] = 0
    return maskc
Use this mask to replace the mask needed in the last step of the previous approach.
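For example, a usage sketch following the same pattern as Approach #1 (my addition, assuming the same a, b and threshold x = 2 as above):
mask = set_mask(a, b, 2)
b[mask[:len(b)]] = -1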

Python: turn single array of sorted, repeat values into an array of arrays?

I have a sorted array with some repeated values. How can this array be turned into an array of arrays with the subarrays grouped by value (see below)? In actuality, my_first_array has ~8 million entries, so the solution would preferably be as time efficient as possible.
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]
itertools.groupby makes this trivial:
import itertools
wanted_array = [list(grp) for _, grp in itertools.groupby(my_first_array)]
With no key function, it just yields groups consisting of runs of identical values, so you list-ify each one in a list comprehension; easy-peasy. You can think of it as basically a within-Python API for doing the work of the GNU toolkit program, uniq, and related operations.
In CPython (the reference interpreter), groupby is implemented in C, and it operates lazily and linearly; the data must already appear in runs matching the key function (otherwise it would need sorting first, which might make it too expensive), but for already sorted data like you have, nothing will be more efficient.
Note: If the inputs might be value identical, but different objects, it may make sense for memory reasons to change list(grp) for _, grp to [k] * len(list(grp)) for k, grp. The former would retain the original (possibly value but not identity duplicate) objects in the final result, the latter would replicate the first object from each group instead, reducing the final cost per group to the cost of N references to a single object, instead of N references to between 1 and N objects.
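Spelled out, the memory-saving variant from that note would look like this:
wanted_array = [[k] * len(list(grp)) for k, grp in itertools.groupby(my_first_array)]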
I am assuming that the input is a NumPy array and you are looking for a list of arrays as output. You can split the input array with np.split at the indices where the value shifts (i.e. at the boundaries between groups of repeats). To find such indices, there are two ways: using np.unique with its optional argument return_index set to True, or a combination of np.where and np.diff. Thus, we have the two approaches listed next.
With np.unique -
import numpy as np
_,idx = np.unique(my_first_array, return_index=True)
out = np.split(my_first_array, idx)[1:]
With np.where and np.diff -
idx = np.where(np.diff(my_first_array)!=0)[0] + 1
out = np.split(my_first_array, idx)
Sample run -
In [28]: my_first_array
Out[28]: array([ 1, 1, 1, 3, 5, 5, 9, 9, 9, 9, 9, 10, 23, 23])
In [29]: _,idx = np.unique(my_first_array, return_index=True)
...: out = np.split(my_first_array, idx)[1:]
...:
In [30]: out
Out[30]:
[array([1, 1, 1]),
array([3]),
array([5, 5]),
array([9, 9, 9, 9, 9]),
array([10]),
array([23, 23])]
In [31]: idx = np.where(np.diff(my_first_array)!=0)[0] + 1
...: out = np.split(my_first_array, idx)
...:
In [32]: out
Out[32]:
[array([1, 1, 1]),
array([3]),
array([5, 5]),
array([9, 9, 9, 9, 9]),
array([10]),
array([23, 23])]
Here is a solution, although it might not be very efficient:
my_first_array = [1, 1, 1, 3, 5, 5, 9, 9, 9, 9, 9, 10, 23, 23]
wanted_array = [[1, 1, 1], [3], [5, 5], [9, 9, 9, 9, 9], [10], [23, 23]]
new_array = [[my_first_array[0]]]
count = 0
for i in range(1, len(my_first_array)):
    a = my_first_array[i]
    if a == my_first_array[i - 1]:
        new_array[count].append(a)
    else:
        count += 1
        new_array.append([])
        new_array[count].append(a)
new_array == wanted_array
This is O(n):
a = [1, 1, 1, 3, 5, 5, 9, 9, 9, 9, 9, 10, 23, 23, 24]
res = []
s = 0
e = 0
length = len(a)
while s < length:
    b = []
    while e < length and a[s] == a[e]:
        b.append(a[s])
        e += 1
    res.append(b)
    s = e
print(res)
