Comparing two numpy arrays and removing elements - python

I have been going through several solutions, but I am not able to find a solution I need.
I have two numpy arrays. Let's take a small example here.
x = [1,2,3,4,5,6,7,8,9]
y = [3,4,5]
I want to compare x and y, and remove those values of x that are in y.
So I expect my final_x to be
final_x = [1,2,6,7,8,9]
I found out that np.in1d returns a boolean array the same length as x that is True where an element of x is in y and False otherwise. But how do I use it, if not any other method to get my final_x.??

If you really do have numpy arrays then you can use numpy.setdiff1d as below
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9])
y = np.array([3,4,5])
z = np.setdiff1d(x, y)
# array([1, 2, 6, 7, 8, 9])

Simply pass the negated version of boolean array returned by np.in1d to array x:
>>> x = np.array([1,2,3,4,5,6,7,8,9])
>>> y = [3,4,5]
>>> x[~np.in1d(x, y)]
array([1, 2, 6, 7, 8, 9])

You can use built-in sets:
final_x = set(x) - set(y)
and subtract the second from the first. You can convert final_x to a list or numpy.array if you feel so inclined.

Related

What is a best way to intersect multiple arrays with numpy array?

Suppose I have an example of numpy array:
import numpy as np
X = np.array([2,5,0,4,3,1])
And I also have a list of arrays, like:
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
I want to leave only these items of each list that are also in X. I expect also to do it in a most efficient/common way.
Solution I have tried so far:
Sort X using X.sort().
Find locations of items of each array in X using:
locations = [np.searchsorted(X, n) for n in A]
Leave only proper ones:
masks = [X[locations[i]] == A[i] for i in range(len(A))]
result = [A[i][masks[i]] for i in range(len(A))]
But it doesn't work because locations of third array is out of bounds:
locations = [array([0, 0, 2], dtype=int64), array([0, 1, 2, 3, 4, 5], dtype=int64), array([2, 5, 4, 6], dtype=int64)]
How to solve this issue?
Update
I ended up with idx[idx==len(Xs)] = 0 solution. I've also noticed two different approaches posted between the answers: transforming X into set vs np.sort. Both of them has plusses and minuses: set operations uses iterations which is quite slow in compare with numpy methods; however np.searchsorted speed increases logarithmically unlike acceses of set items which is instant. That why I decided to compare performance using data with huge sizes, especially 1 million items for X, A[0], A[1], A[2].
One idea would be less compute and minimal work when looping. So, here's one with those in mind -
a = np.concatenate(A)
m = np.isin(a,X)
l = np.array(list(map(len,A)))
a_m = a[m]
cut_idx = np.r_[0,l.cumsum()]
l_m = np.add.reduceat(m,cut_idx[:-1])
cl_m = np.r_[0,l_m.cumsum()]
out = [a_m[i:j] for (i,j) in zip(cl_m[:-1],cl_m[1:])]
Alternative #1 :
We can also use np.searchsorted to get the isin mask, like so -
Xs = np.sort(X)
idx = np.searchsorted(Xs,a)
idx[idx==len(Xs)] = 0
m = Xs[idx]==a
Another way with np.intersect1d
If you are looking for the most common/elegant one, think it would be with np.intersect1d -
In [43]: [np.intersect1d(X,A_i) for A_i in A]
Out[43]: [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 4, 5])]
Solving your issue
You can also solve your out-of-bounds issue, with a simple fix -
for l in locations:
l[l==len(X)]=0
How about this, very simple and efficent:
import numpy as np
X = np.array([2,5,0,4,3,1])
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
X_set = set(X)
A = [np.array([a for a in arr if a in X_set]) for arr in A]
#[array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]
According to the docs, set operations all have O(1) complexity, therefore the overall is O(N)

Split a numpy array using masking in python

I have a numpy array my_array of size 100x20. I want to create a function that receives as an input a 2d numpy array my_arr and an index x and will return two arrays one with size 1x20 test_arr and one with 99x20 train_arr. The vector test_arr will correspond to the row of the matrix my_arr with the index x and the train_arr will contain the rest rows. I tried to follow a solution using masking:
def split_train_test(my_arr, x):
a = np.ma.array(my_arr, mask=False)
a.mask[x, :] = True
a = np.array(a.compressed())
return a
Apparently this is not working as i wanted. How can i return a numpy array as a result and the train and test arrays properly?
You can use simple index and numpy.delete for this:
def split_train_test(my_arr, x):
return np.delete(my_arr, x, 0), my_arr[x:x+1]
my_arr = np.arange(10).reshape(5,2)
train, test = split_train_test(my_arr, 2)
train
#array([[0, 1],
# [2, 3],
# [6, 7],
# [8, 9]])
test
#array([[4, 5]])
You can also use a boolean index as the mask:
def split_train_test(my_arr, x):
# define mask
mask=np.zeros(my_arr.shape[0], dtype=bool)
mask[x] = True # True only at index x, False elsewhere
return my_arr[mask, :], my_arr[~mask, :]
Sample run:
test_arr, train_arr = split_train_test(np.random.rand(100, 20), x=10)
print(test_arr.shape, train_arr.shape)
((1L, 20L), (99L, 20L))
EDIT:
If someone is looking for the general case where more than one element needs to be allocated to the test array (say 80%-20% split), x can also accept an array:
my_arr = np.random.rand(100, 20)
x = np.random.choice(np.arange(my_arr.shape[0]), int(my_arr .shape[0]*0.8), replace=False)
test_arr, train_arr = split_train_test(my_arr, x)
print(test_arr.shape, train_arr.shape)
((80L, 20L), (20L, 20L))

Python numpy index set from Boolean array

How do I transform a Boolean array into an iterable of indexes?
E.g.,
import numpy as np
import itertools as it
x = np.array([1,0,1,1,0,0])
y = x > 0
retval = [i for i, y_i in enumerate(y) if y_i]
Is there a nicer way?
Try np.where or np.nonzero.
x = np.array([1, 0, 1, 1, 0, 0])
np.where(x)[0] # returns a tuple hence the [0], see help(np.where)
# array([0, 2, 3])
x.nonzero()[0] # in this case, the same as above.
See help(np.where) and help(np.nonzero).
Possibly worth noting that in the np.where page it mentions that for 1D x it's basically equivalent to your longform in the question.

first order differences along a given axis in NumPy array

#compute first differences of 1d array
from numpy import *
x = arange(10)
y = zeros(len(x))
for i in range(1,len(x)):
y[i] = x[i] - x[i-1]
print y
The above code works but there must be at least one easy, pythonesque way to do this without having to use a for loop. Any suggestions?
What about:
diff(x)
# array([1, 1, 1, 1, 1, 1, 1, 1, 1])
Yes, this exactly the kind of loop numpy elementwise operations is designed for. You just need to learn to take the right slices of the arrays.
x = numpy.arange(10)
y = numpy.zeros(x.shape)
y[1:] = x[1:] - x[:-1]
print y
several NumPy builtins will do the job--in particular, diff, ediff1d, and gradient.
i suspect ediff1d is the better choice for the specific cast described in the OP--unlike the other two, ediff1d is acdtually directed/limited to this specific use case--ie, first-order differences along a single axis (or axis of a 1D array).
>>> import numpy as NP
>>> x = NP.random.randint(1, 10, 10)
>>> x
array([4, 6, 6, 8, 1, 2, 1, 1, 5, 4])
>>> NP.ediff1d(x)
array([ 2, 0, 2, -7, 1, -1, 0, 4, -1])
Here's a pattern I used a lot for a while:
from itertools import izip
d = [a-b for a,b in izip(x[1:],x[:-1])]
y = [item - x[i - 1] for i, item in enumerate(x[1:])]
If you need to access the index of an item while looping over it, enumerate() is the Pythonic way. Also, a list comprehension is, in this case, more readable.
Moreover, you should never use wild imports (from numpy import *). It will always import more than you need and leads to unnecessary ambiguity. Rather, just import numpy or import what you need, e.g.
from numpy import arange, zeros

Numpy meshgrid in 3D

Numpy's meshgrid is very useful for converting two vectors to a coordinate grid. What is the easiest way to extend this to three dimensions? So given three vectors x, y, and z, construct 3x3D arrays (instead of 2x2D arrays) which can be used as coordinates.
Numpy (as of 1.8 I think) now supports higher that 2D generation of position grids with meshgrid. One important addition which really helped me is the ability to chose the indexing order (either xy or ij for Cartesian or matrix indexing respectively), which I verified with the following example:
import numpy as np
x_ = np.linspace(0., 1., 10)
y_ = np.linspace(1., 2., 20)
z_ = np.linspace(3., 4., 30)
x, y, z = np.meshgrid(x_, y_, z_, indexing='ij')
assert np.all(x[:,0,0] == x_)
assert np.all(y[0,:,0] == y_)
assert np.all(z[0,0,:] == z_)
Here is the source code of meshgrid:
def meshgrid(x,y):
"""
Return coordinate matrices from two coordinate vectors.
Parameters
----------
x, y : ndarray
Two 1-D arrays representing the x and y coordinates of a grid.
Returns
-------
X, Y : ndarray
For vectors `x`, `y` with lengths ``Nx=len(x)`` and ``Ny=len(y)``,
return `X`, `Y` where `X` and `Y` are ``(Ny, Nx)`` shaped arrays
with the elements of `x` and y repeated to fill the matrix along
the first dimension for `x`, the second for `y`.
See Also
--------
index_tricks.mgrid : Construct a multi-dimensional "meshgrid"
using indexing notation.
index_tricks.ogrid : Construct an open multi-dimensional "meshgrid"
using indexing notation.
Examples
--------
>>> X, Y = np.meshgrid([1,2,3], [4,5,6,7])
>>> X
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
>>> Y
array([[4, 4, 4],
[5, 5, 5],
[6, 6, 6],
[7, 7, 7]])
`meshgrid` is very useful to evaluate functions on a grid.
>>> x = np.arange(-5, 5, 0.1)
>>> y = np.arange(-5, 5, 0.1)
>>> xx, yy = np.meshgrid(x, y)
>>> z = np.sin(xx**2+yy**2)/(xx**2+yy**2)
"""
x = asarray(x)
y = asarray(y)
numRows, numCols = len(y), len(x) # yes, reversed
x = x.reshape(1,numCols)
X = x.repeat(numRows, axis=0)
y = y.reshape(numRows,1)
Y = y.repeat(numCols, axis=1)
return X, Y
It is fairly simple to understand. I extended the pattern to an arbitrary number of dimensions, but this code is by no means optimized (and not thoroughly error-checked either), but you get what you pay for. Hope it helps:
def meshgrid2(*arrs):
arrs = tuple(reversed(arrs)) #edit
lens = map(len, arrs)
dim = len(arrs)
sz = 1
for s in lens:
sz*=s
ans = []
for i, arr in enumerate(arrs):
slc = [1]*dim
slc[i] = lens[i]
arr2 = asarray(arr).reshape(slc)
for j, sz in enumerate(lens):
if j!=i:
arr2 = arr2.repeat(sz, axis=j)
ans.append(arr2)
return tuple(ans)
Can you show us how you are using np.meshgrid? There is a very good chance that you really don't need meshgrid because numpy broadcasting can do the same thing without generating a repetitive array.
For example,
import numpy as np
x=np.arange(2)
y=np.arange(3)
[X,Y] = np.meshgrid(x,y)
S=X+Y
print(S.shape)
# (3, 2)
# Note that meshgrid associates y with the 0-axis, and x with the 1-axis.
print(S)
# [[0 1]
# [1 2]
# [2 3]]
s=np.empty((3,2))
print(s.shape)
# (3, 2)
# x.shape is (2,).
# y.shape is (3,).
# x's shape is broadcasted to (3,2)
# y varies along the 0-axis, so to get its shape broadcasted, we first upgrade it to
# have shape (3,1), using np.newaxis. Arrays of shape (3,1) can be broadcasted to
# arrays of shape (3,2).
s=x+y[:,np.newaxis]
print(s)
# [[0 1]
# [1 2]
# [2 3]]
The point is that S=X+Y can and should be replaced by s=x+y[:,np.newaxis] because
the latter does not require (possibly large) repetitive arrays to be formed. It also generalizes to higher dimensions (more axes) easily. You just add np.newaxis where needed to effect broadcasting as necessary.
See http://www.scipy.org/EricsBroadcastingDoc for more on numpy broadcasting.
i think what you want is
X, Y, Z = numpy.mgrid[-10:10:100j, -10:10:100j, -10:10:100j]
for example.
Here is a multidimensional version of meshgrid that I wrote:
def ndmesh(*args):
args = map(np.asarray,args)
return np.broadcast_arrays(*[x[(slice(None),)+(None,)*i] for i, x in enumerate(args)])
Note that the returned arrays are views of the original array data, so changing the original arrays will affect the coordinate arrays.
Instead of writing a new function, numpy.ix_ should do what you want.
Here is an example from the documentation:
>>> ixgrid = np.ix_([0,1], [2,4])
>>> ixgrid
(array([[0],
[1]]), array([[2, 4]]))
>>> ixgrid[0].shape, ixgrid[1].shape
((2, 1), (1, 2))'
You can achieve that by changing the order:
import numpy as np
xx = np.array([1,2,3,4])
yy = np.array([5,6,7])
zz = np.array([9,10])
y, z, x = np.meshgrid(yy, zz, xx)

Categories

Resources