How to compare two numpy arrays with some NaN values? - python

I need to compare some numpy arrays which should have the same elements in the same order, except for some NaN values in the second one.
I need a function more or less like this:
def func( array1, array2 ):
    if ???:
        return True
    else:
        return False
Example:
x = np.array( [ 1, 2, 3, 4, 5 ] )
y = np.array( [ 11, 2, 3, 4, 5 ] )
z = np.array( [ 1, 2, np.nan, 4, 5] )
func( x, z ) # returns True
func( y, z ) # returns False
The arrays always have the same length and the NaN values are always in the third one (x and y always contain numbers only). I imagine there is already a function for this, but I just can't find it.
Any ideas?

You can use masked arrays, which have the behaviour you're asking for when combined with np.all:
zm = np.ma.masked_where(np.isnan(z), z)
np.all(x == zm) # returns True
np.all(y == zm) # returns False
Or you could just write out your logic explicitly, noting that numpy has to use | instead of or, and the difference in operator precedence that results:
def func(a, b):
    return np.all((a == b) | np.isnan(a) | np.isnan(b))
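The explicit version above can be tried end to end with the arrays from the question. This is only a minimal sketch: the helper name equal_ignoring_nan and the conversion to float via np.asarray are assumptions added for illustration, not part of the original answer.

```python
import numpy as np

def equal_ignoring_nan(a, b):
    """Return True if a and b match at every position where neither is NaN."""
    # Converting to float is an assumption so that np.isnan accepts integer input.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # | is the elementwise "or"; each comparison is parenthesised because
    # | binds more tightly than ==.
    return bool(np.all((a == b) | np.isnan(a) | np.isnan(b)))

x = np.array([1, 2, 3, 4, 5])
y = np.array([11, 2, 3, 4, 5])
z = np.array([1, 2, np.nan, 4, 5])

print(equal_ignoring_nan(x, z))
print(equal_ignoring_nan(y, z))
```

The first call compares x with z and skips the NaN position; the second fails on the first element (11 vs 1).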

You could use isclose to check for equality (or closeness to within a given tolerance -- this is particularly useful when comparing floats) and use isnan to check for NaNs in the second array.
Combine the two with bitwise-or (|), and use all to demand every pair is either close or contains a NaN to obtain the desired result:
In [62]: np.isclose(x,z)
Out[62]: array([ True, True, False, True, True], dtype=bool)
In [63]: np.isnan(z)
Out[63]: array([False, False, True, False, False], dtype=bool)
So you could use:
def func(a, b):
    return (np.isclose(a, b) | np.isnan(b)).all()
In [67]: func(x, z)
Out[67]: True
In [68]: func(y, z)
Out[68]: False

What about:
from math import isnan
def fun(array1,array2):
    return all(isnan(x) or isnan(y) or x == y for x,y in zip(array1,array2))
This function works in both directions (if there are NaNs in the first list, these are also ignored). If you do not want that (which is a bit odd, since equality is usually symmetric), you can define:
from math import isnan
def fun(array1,array2):
    return all(isnan(y) or x == y for x,y in zip(array1,array2))
The code works as follows: we use zip to emit tuples of elements of both arrays. Next we check if either the element of the first list is NaN, or the second, or they are equal.
If you want the function to be more robust, you should also perform a length check:
from math import isnan
def fun(array1,array2):
    return len(array1) == len(array2) and all(isnan(y) or x == y for x,y in zip(array1,array2))

numpy.isclose() now provides an argument equal_nan for this case!
>>> import numpy as np
>>> np.isclose([1.0, np.nan], [1.0, np.nan])
array([ True, False])
>>> np.isclose([1.0, np.nan], [1.0, np.nan], equal_nan=True)
array([ True, True])
docs https://numpy.org/doc/stable/reference/generated/numpy.isclose.html
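One caveat worth checking: equal_nan=True only treats a NaN as equal to another NaN at the same position. In the original question the NaNs appear in just one of the two arrays, so the flag alone would not be enough there. A small sketch, with made-up sample arrays, contrasting the two behaviours:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([1.0, np.nan, 3.0])  # NaN only in the second array

# equal_nan=True matches NaN with NaN, not NaN with a number.
both_nan_only = np.isclose(x, z, equal_nan=True)

# Ignoring NaNs that occur in z alone needs the explicit isnan mask.
one_sided = np.isclose(x, z) | np.isnan(z)

print(both_nan_only.all())
print(one_sided.all())
```

So equal_nan fits the case where both arrays are expected to carry NaN in the same places, while the isnan mask fits the case where one array has gaps.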

Related

Counting the number of None values using a lambda function

I have an array consisting of a bunch of values, where some of them are NaN and others are None. I want to count each of them. I can achieve this with a simple for loop, as shown:
xx = np.array([2,3,4,None,np.nan,None])
count_None = 0
count_nan = 0
for i in xx:
    if i is None:
        count_None += 1
    if i is np.nan:
        count_nan += 1
I want to find out if I can achieve the same result in one line, perhaps using a lambda function. I tried writing it as so. But of course, the syntax is incorrect. Any ideas?
lambda xx: count_None =+1 if xx is None
One way of achieving it as a one liner is :
len([i for i in xx if i is None])
# or the list count method (numpy arrays have no count, so convert first)
list(xx).count(None)
or you can use the numpy.count_nonzero:
np.count_nonzero(xx == None)
Using a lambda function, you can create a list.count() - like function:
>>> counter = lambda x,y:len([i for i in x if i == y])
>>> counter(xx,None)
2
This isn't a lambda but it creates a new list of just the None values and counts the length of that list.
import numpy as np
xx = np.array([2,3,4,None,np.nan,None])
print(len([elem for elem in xx if elem is None]))
if you don't need it to be in numpy you can use the list count method
xx = [2,3,4,None,np.nan,None]
print(xx.count(None))
A third approach:
>>> nan_count, none_count = np.sum([i is np.nan for i in xx]), np.sum([i is None for i in xx])
>>> print(nan_count, none_count)
1 2
I'd tend to prefer two lines (one for each computation), but this works. It works by adding 1 for each True value, and 0 for each False value.
Another approach if you really want to use a lambda is to use functools.reduce which will perform the sum iteratively. Here, we start with a value of 0, and add 1 for each element that evaluates true:
>>> import functools
>>> functools.reduce(lambda x,y: x+(y is np.nan), xx, 0)
1
>>> functools.reduce(lambda x,y: x+(y is None), xx, 0)
2
l= len(list(filter(lambda x:x is None, xx)))
It will return the number of None values. filter accepts any iterable, but in Python 3 it returns an iterator, so it has to be wrapped in list before calling len.
You can use this approach if you want to use lambda.
I prefer using numpy function (np.count_nonzero)
lambda is just a restricted format for creating a function. It is 'one-line' and returns a value. It should not be used for side effects. Your use of count_None += 1 is a side effect, so it can't be used in a lambda.
A lambda that identifies the None values, can be used with map:
In [27]: alist = [2,3,4,None,np.nan,None]
In [28]: list(map(lambda x: x is None, alist))
Out[28]: [False, False, False, True, False, True]
map returns an iterable, which has to be expanded with list, or with sum:
In [29]: sum(map(lambda x: x is None, alist))
Out[29]: 2
But as others have shown, the list count method is simpler.
In [43]: alist.count(None)
Out[43]: 2
In [44]: alist.count(np.nan)
Out[44]: 1
An array containing None will be object dtype. Iteration on such an array is slower than iteration on the list:
In [45]: arr = np.array(alist)
In [46]: arr
Out[46]: array([2, 3, 4, None, nan, None], dtype=object)
The array doesn't have the count method. Also testing for np.nan is trickier.
In [47]: arr == None
Out[47]: array([False, False, False, True, False, True])
In [48]: arr == np.nan
Out[48]: array([False, False, False, False, False, False])
There is a np.isnan function, but that only works for float dtype arrays.
In [51]: arr.astype(float)
Out[51]: array([ 2., 3., 4., nan, nan, nan])
In [52]: np.isnan(arr.astype(float))
Out[52]: array([False, False, False, True, True, True])
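Note that astype(float) turns None into nan, so the last mask no longer distinguishes the two. If both counts are needed from the object array, one option is to test each element in plain Python. This is a sketch only; the helper is_nan is a hypothetical name introduced here.

```python
import math
import numpy as np

arr = np.array([2, 3, 4, None, np.nan, None], dtype=object)

def is_nan(value):
    # Only floats can be NaN; the isinstance check keeps None and ints
    # away from math.isnan, which would raise TypeError on None.
    return isinstance(value, float) and math.isnan(value)

none_count = sum(value is None for value in arr)
nan_count = sum(is_nan(value) for value in arr)

print(none_count, nan_count)
```

Unlike the `is np.nan` identity test, math.isnan also catches NaN values that are not the np.nan object itself, such as float('nan').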

python numpy - unable to compare 2 arrays

I have the 2 arrays as follows:
x = array(['2019-02-28', '2019-03-01'], dtype=object)
z = array(['2019-02-28', '2019-03-02', '2019-03-01'], dtype=object)
I'm trying to use np.where to determine at which indices the two arrays match.
I'm doing
i = np.where(z == x)
but it doesn't work: I get an empty array as a result. It looks like it's checking whether one whole array is equal to the other, whereas I'm looking for the matching values between the two. How should I do it?
Thanks
Regards
edit: expected outcome is yes [True, False, False]
The where result is only as good as the boolean it searches. If the argument does not have any True values, where returns empty:
In [308]: x = np.array(['2019-02-28', '2019-03-01'], dtype=object)
...: z = np.array(['2019-02-28', '2019-03-02', '2019-03-01'], dtype=object)
In [309]: x==z
/usr/local/bin/ipython3:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
#!/usr/bin/python3
Out[309]: False
If you aren't concerned about order:
In [311]: np.isin(z,x)
Out[311]: array([ True, False, True])
or trimming z:
In [312]: x==z[:2]
Out[312]: array([ True, False])
To extend x, you could first use np.pad, or use itertools.zip_longest:
In [353]: list(itertools.zip_longest(x,z))
Out[353]:
[('2019-02-28', '2019-02-28'),
('2019-03-01', '2019-03-02'),
(None, '2019-03-01')]
In [354]: [i==j for i,j in itertools.zip_longest(x,z)]
Out[354]: [True, False, False]
zip_longest accepts other fill values if that makes the comparison better.
Is this what you need:
print([i for i, (x, y) in enumerate(zip(x, z)) if x == y])
As the two arrays have different sizes compare over the minimum of the two sizes.
Edit:
I just reread the question and comments.
result= np.zeros( max(x.size, z.size), dtype=bool) # result size of the biggest array.
size = min(x.size, z.size)
result[:size] = z[:size] == x[:size] # Comparison at smallest size.
result
# array([ True, False, False])
This gives the boolean mask the comment asks for.
Original answer
import numpy as np
x = np.array(['2019-02-28', '2019-03-01'], dtype=object)
z = np.array(['2019-02-28', '2019-03-02', '2019-03-01'], dtype=object)
size = min(x.size, z.size)
np.where(z[:size]==x[:size]) # Select the common range
# (array([0], dtype=int64),)
On my machine this is slower than the list comprehension from @U10-Forward for dtype=object, but faster if numpy selects the dtype, 'Unicode 10'.
x = np.array(['2019-02-28', '2019-03-01'])
z = np.array(['2019-02-28', '2019-03-02', '2019-03-01'])
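For the order-independent np.isin approach mentioned earlier, the matching positions can be recovered from the boolean mask with np.flatnonzero. A brief sketch using the same dates, with numpy choosing the string dtype:

```python
import numpy as np

x = np.array(['2019-02-28', '2019-03-01'])
z = np.array(['2019-02-28', '2019-03-02', '2019-03-01'])

mask = np.isin(z, x)            # True where an element of z occurs anywhere in x
indices = np.flatnonzero(mask)  # positions in z that have a match in x

print(mask)
print(indices)
```

This answers "which elements of z appear in x", which is a different question from the position-by-position comparison that produces [True, False, False].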

Adding numpy array elements to new array only if conditions met

I need to copy elements from one numpy array to another, but only if a condition is met. Let's say I have two arrays:
x = ([1,2,3,4,5,6,7,8,9])
y = ([])
I want to add numbers from x to y, but only if they match a condition, lets say check if they are divisible by two. I know I can do the following:
y = x%2 == 0
which makes y an array of True and False values. This is not what I am trying to accomplish, however: I want the actual values (2,4,6,8), and only those for which the condition evaluates to true.
You can get the values you want like this:
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9])
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = x[x%2==0]
# y is now: array([2, 4, 6, 8])
And, you can sum them like this:
np.sum(x[x%2==0])
# 20
Explanation: As you noticed, x%2==0 gives you a boolean array array([False, True, False, True, False, True, False, True, False], dtype=bool). You can use this as a "mask" on your original array, by indexing it with x[x%2==0], returning the values of x where your "mask" is True. Take a look at the numpy indexing documentation for more info.
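The same masking idea extends to several conditions at once, as long as each condition is parenthesised and combined with & or | rather than and/or. A short sketch; the second condition (x > 4) is an arbitrary example, not something from the question:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

even_mask = x % 2 == 0
evens = x[even_mask]                       # values where the mask is True

# Combine conditions elementwise; the parentheses are required
# because & binds more tightly than == and >.
large_evens = x[(x % 2 == 0) & (x > 4)]

print(evens)
print(large_evens)
```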

Python: Elementwise comparison of same shaped arrays

I have n matrices of the same size and want to see how many cells are equal to each other across all matrices. Code:
import numpy as np
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[5,6,7], [4,2,6], [7, 8, 9]])
c = np.array([[2,3,4],[4,5,6],[1,2,5]])
#Intuition is below but is wrong
a == b == c
How do I get Python to return a value of 2 (cells 2,1 and 2,3 match in all 3 matrices) or an array of [[False, False, False], [True, False, True], [False, False, False]]?
You can do:
(a == b) & (b==c)
[[False False False]
[ True False True]
[False False False]]
For n items in, say, a list like x=[a, b, c, a, b, c], one could do:
r = x[0] == x[1]
for temp in x[2:]:
    r &= x[0]==temp
The result is now in r.
If the structure is already in a 3D numpy array, one could also use:
np.amax(x,axis=2)==np.amin(x,axis=2)
The idea for the above line is that although it would be ideal to have an equal function with an axis argument, there isn't one so this line notes that if amin==amax along the axis, then all elements are equal.
If the different arrays to be compared aren't already in a 3D numpy array (or won't be in the future), looping the list is a fast and easy approach. Although I generally agree with avoiding Python loops for Numpy arrays, this seems like a case where it's easier and faster (see below) to use a Python loop since the loop is only along a single axis and it's easy to accumulate the comparisons in place. Here's a timing test:
import numpy as np
from timeit import timeit

def f0(x):
    r = x[0] == x[1]
    for y in x[2:]:
        r &= x[0]==y
    return r
def f1(x): # from @Divakar
    return ~np.any(np.diff(np.dstack(x),axis=2),axis=2)
def f2(x):
    x = np.dstack(x)
    return np.amax(x,axis=2)==np.amin(x,axis=2)
# speed test
for n, size, reps in ((1000, 3, 1000), (10, 1000, 100)):
    x = [np.ones((size, size)) for i in range(n)]
    print(n, size, reps)
    print("f0: ", timeit("f0(x)", "from __main__ import x, f0", number=reps))
    print("f1: ", timeit("f1(x)", "from __main__ import x, f1", number=reps))
    print("f2: ", timeit("f2(x)", "from __main__ import x, f2", number=reps))
    print()
1000 3 1000
f0: 1.14673900604 # loop
f1: 3.93413209915 # diff
f2: 3.93126702309 # min max
10 1000 100
f0: 2.42633581161 # loop
f1: 27.1066679955 # diff
f2: 25.9518558979 # min max
If arrays are already in a single 3D numpy array (eg, from using x = np.dstack(x) in the above) then modifying the above function defs appropriately and with the addition of the min==max approach gives:
def g0(x):
    r = x[:,:,0] == x[:,:,1]
    for iy in range(2, x.shape[2]): # remaining layers 2..n-1
        r &= x[:,:,0]==x[:,:,iy]
    return r
def g1(x): # from @Divakar
    return ~np.any(np.diff(x,axis=2),axis=2)
def g2(x):
    return np.amax(x,axis=2)==np.amin(x,axis=2)
which yields:
1000 3 1000
g0: 3.9761030674 # loop
g1: 0.0599548816681 # diff
g2: 0.0313589572906 # min max
10 1000 100
g0: 10.7617051601 # loop
g1: 10.881870985 # diff
g2: 9.66712999344 # min max
Note also that for a list of large arrays f0 = 2.4, while for a pre-built array g0, g1, g2 ~= 10, so if the input arrays are large, then the fastest approach, by about 4x, is to store them separately in a list. I find this a bit surprising and guess that it might be due to cache swapping (or bad code?), but I'm not sure anyone really cares, so I'll stop here.
Concatenate along the third axis with np.dstack and perform differencing with np.diff, so that identical elements show up as zeros. Then check for the positions where all differences are zero with ~np.any. Thus, you would have a one-liner solution like so -
~np.any(np.diff(np.dstack((a,b,c)),axis=2),axis=2)
Sample run -
In [39]: a
Out[39]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [40]: b
Out[40]:
array([[5, 6, 7],
[4, 2, 6],
[7, 8, 9]])
In [41]: c
Out[41]:
array([[2, 3, 4],
[4, 5, 6],
[1, 2, 5]])
In [42]: ~np.any(np.diff(np.dstack((a,b,c)),axis=2),axis=2)
Out[42]:
array([[False, False, False],
[ True, False, True],
[False, False, False]], dtype=bool)
Try this:
z1 = a == b
z2 = a == c
z = np.logical_and(z1,z2)
print("count:", np.sum(z))
You can do this in a single statement:
count = np.sum( np.logical_and(a == b, a == c) )
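For an arbitrary number of same-shaped matrices, another way to express "all equal at this cell" is to stack them and compare every layer against the first one. This is a sketch of that variant (it uses np.stack along a new leading axis rather than np.dstack, which is an illustrative choice rather than something from the answers above):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[5, 6, 7], [4, 2, 6], [7, 8, 9]])
c = np.array([[2, 3, 4], [4, 5, 6], [1, 2, 5]])

stacked = np.stack([a, b, c])  # shape (3, 3, 3): one layer per matrix

# A cell matches across all matrices if every layer equals the first layer there.
all_equal = (stacked == stacked[0]).all(axis=0)
count = int(all_equal.sum())

print(all_equal)
print(count)
```

Unlike the max == min trick, comparing against the first layer also works for dtypes without an ordering, at the cost of building the stacked array.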

Numpy: Subtract array element by element

The title might be ambiguous, didn't know how else to word it.
I have gotten fairly far with my particle simulator in Python using numpy and matplotlib. I have managed to implement Coulomb, gravity and wind forces; now I just want to add temperature and pressure, but I have a pre-optimization question (root of all evil). I want to see when particles crash:
Q: Is it in numpy possible to take the difference of an array with each of its own element based on a bool condition? I want to avoid looping.
Eg: (x - any element in x) < a
Should return something like
[True, True, False, True]
If element 0,1 and 3 in x meets the condition.
Edit:
The loop equivalent would be:
for i in range(len(x)):
    for j in range(len(x)):
        #!= not so important
        ##earlier question I asked lets me figure that one out
        if i!=j:
            if x[j] - x[i] < a:
                True
I notice numpy operations are far faster than if tests and this has helped me speed things up a lot.
Here is a sample code if anyone wants to play with it.
#Simple circular box simulator, part of part_sim
#Restructure to import into gravity() or coloumb () or wind() or pressure()
#Or to use all forces: sim_full()
#Note: Implement crashing as backbone to all forces
import numpy as np
import matplotlib.pyplot as plt
N = 1000 #Number of particles
R = 8000 #Radius of box
r = np.random.randint(0,R/2,2*N).reshape(N,2)
v = np.random.randint(-200,200,r.shape)
v_limit = 10000 #Speedlimit
plt.ion()
line, = plt.plot([],'o')
plt.axis([-10000,10000,-10000,10000])
while True:
    r_hit = np.sqrt(np.sum(r**2,axis=1))>R #Who let the dogs out, who, who?
    r_nhit = ~r_hit
    N_rhit = r_hit[r_hit].shape[0]
    r[r_hit] = r[r_hit] - 0.1*v[r_hit] #Get the dogs back inside
    r[r_nhit] = r[r_nhit] +0.1*v[r_nhit]
    #Dogs should turn tail before they crash!
    #---
    #---crash code here....
    #---crash end
    #---
    vmin, vmax = np.min(v), np.max(v)
    #Give the particles a random kick when they hit the wall
    v[r_hit] = -v[r_hit] + np.random.randint(vmin, vmax, (N_rhit,2))
    #Slow down honey
    v_abs = np.abs(v) > v_limit
    #Hit the wall at too high v honey? You are getting a speed reduction
    v[v_abs] *=0.5
    line.set_ydata(r[:,1])
    line.set_xdata(r[:,0])
    plt.draw()
I plan to add colors to the datapoints above once I figure out how...such that high velocity particles can easily be distinguished in larger boxes.
Eg: x - any element in x < a Should return something like
[True, True, False, True]
If elements 0, 1 and 3 in x meet the condition. I notice numpy operations are far faster than if tests and this has helped me speed things up a lot.
Yes, it's just m < a. For example:
>>> m = np.array((1, 3, 10, 5))
>>> a = 6
>>> m2 = m < a
>>> m2
array([ True, True, False, True], dtype=bool)
Now, to the question:
Q: Is it in numpy possible to take the difference of an array with each of its own element based on a bool condition? I want to avoid looping.
I'm not sure what you're asking for here, but it doesn't seem to match the example directly below it. Are you trying to, e.g., subtract 1 from each element that satisfies the predicate? In that case, you can rely on the fact that False==0 and True==1 and just subtract the boolean array:
>>> m3 = m - m2
>>> m3
array([ 0, 2, 10, 4])
From your clarification, you want the equivalent of this pseudocode loop:
for i in range(len(x)):
    for j in range(len(x)):
        #!= not so important
        ##earlier question I asked lets me figure that one out
        if i!=j:
            if x[j] - x[i] < a:
                True
I think the confusion here is that this is the exact opposite of what you said: you don't want "the difference of an array with each of its own element based on a bool condition", but "a bool condition based on the difference of an array with each of its own elements". And even that only really gets you to a square matrix of len(m)*len(m) bools, but I think the part left over is the "any".
At any rate, you're asking for an implicit cartesian product, comparing each element of m to each element of m.
You can easily reduce this from two loops to one (or, rather, implicitly vectorize one of them, gaining the usual numpy performance benefits). For each value, create a new array by subtracting that value from each element and comparing the result with a, and then join those up:
>>> a = -2
>>> comparisons = np.array([m - x < a for x in m])
>>> flattened = np.any(comparisons, 0)
>>> flattened
array([ True, True, False, True], dtype=bool)
But you can also turn this into a simple matrix operation pretty easily. Subtracting every element of m from every other element of m is just m - m.T. (You can make the product more explicit, but the way numpy handles adding row and column vectors, it isn't necessary.) And then you just compare every element of that to the scalar a, and reduce with any, and you're done:
>>> a = -2
>>> m = np.matrix((1, 3, 10, 5))
>>> subtractions = m - m.T
>>> subtractions
matrix([[ 0, 2, 9, 4],
[-2, 0, 7, 2],
[-9, -7, 0, -5],
[-4, -2, 5, 0]])
>>> comparisons = subtractions < a
>>> comparisons
matrix([[False, False, False, False],
[False, False, False, False],
[ True, True, False, True],
[ True, False, False, False]], dtype=bool)
>>> np.any(comparisons, 0)
matrix([[ True, True, False, True]], dtype=bool)
Or, putting it all together in one line:
>>> np.any((m - m.T) < a, 0)
matrix([[ True, True, False, True]], dtype=bool)
If you need m to be an array rather than a matrix, you can replace the subtraction line with m - np.matrix(m).T.
For higher dimensions, you actually do need to work in arrays, because you're trying to cartesian-product a 2D array with itself to get a 4D array, and numpy doesn't do 4D matrices. So, you can't use the simple "row vector - column vector = matrix" trick. But you can do it manually:
>>> m = np.array([[1,2], [3,4]]) # 2x2
>>> m4d = m.reshape(1, 1, 2, 2) # 1x1x2x2
>>> m4d
array([[[[1, 2],
[3, 4]]]])
>>> mt4d = m4d.T # 2x2x1x1
>>> mt4d
array([[[[1]],
[[3]]],
[[[2]],
[[4]]]])
>>> subtractions = m - mt4d # 2x2x2x2
>>> subtractions
array([[[[ 0, 1],
[ 2, 3]],
[[-2, -1],
[ 0, 1]]],
[[[-1, 0],
[ 1, 2]],
[[-3, -2],
[-1, 0]]]])
And from there, the remainder is the same as before. Putting it together into one line:
>>> np.any((m - m.reshape(1, 1, 2, 2).T) < a, 0)
(If you remember my original answer, I'd somehow blanked on reshape and was doing the same thing by multiplying m by a column vector of 1s, which obviously is a much stupider way to proceed.)
One last quick thought: If your algorithm really is "the bool result of (for any element y of m, x - y < a) for each element x of m", you don't actually need "for any element y", you can just use "for the maximal element y". So you can simplify from O(N^2) to O(N):
>>> (m - m.max()) < a
Or, if a is positive, that's always true (x - x = 0 is already less than a), so you can simplify to O(1):
>>> np.ones(m.shape, dtype=bool)
But I'm guessing your real algorithm is actually using abs(x - y), or something more complicated, which can't be simplified in this way.
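The pairwise comparison can also be written with plain arrays and broadcasting instead of np.matrix. The sketch below reuses the sample values from above; masking the diagonal to mirror the i != j test of the loop is an added assumption about what the question intends.

```python
import numpy as np

m = np.array([1, 3, 10, 5])
a = -2

# Broadcasting a row against a column gives every pairwise difference:
# diff[i, j] = m[j] - m[i], the same layout as m - m.T for a matrix.
diff = m[None, :] - m[:, None]

close = diff < a
np.fill_diagonal(close, False)  # skip i == j, as in the loop version

result = close.any(axis=0)      # True if any other element satisfies the test
print(result)
```

This builds an N x N array, so memory grows quadratically with the number of particles; for large N a spatial index or chunked comparison may be more practical.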
