Remove duplicate tuples in numpy array (ones directly next to each other)

I am fairly new to Python/NumPy and I have this problem:
I have NumPy arrays in which the first and last tuples are always the same. In between, there are sometimes duplicate tuples (only ones directly next to each other) that I want to get rid of. The nesting of the brackets should be preserved.
I already tried np.unique, but it changes my original order (which has to be maintained). My sample array looks like this:
myarray = np.array([[[1,1],[1,1],[4,4],[4,4],[2,2],[3,3],[1,1]]])
I need a result that looks like this:
myarray = np.array([[[1,1],[4,4],[2,2],[3,3],[1,1]]])
Thank you in advance for your support!

Compare each row with its predecessor along the second axis (a one-element offset) and use boolean indexing to keep only the rows that differ -
myarray[:,np.r_[True,(myarray[0,1:] != myarray[0,:-1]).any(-1)]]
Sample run -
In [42]: myarray
Out[42]:
array([[[1, 1],
[1, 1],
[4, 4],
[4, 4],
[2, 2],
[3, 3],
[1, 1]]])
In [43]: myarray[:,np.r_[True,(myarray[0,1:] != myarray[0,:-1]).any(-1)]]
Out[43]:
array([[[1, 1],
[4, 4],
[2, 2],
[3, 3],
[1, 1]]])
Or use an equality comparison, require ALL elements of a row to match, and negate the result -
In [47]: myarray[:,np.r_[True,~((myarray[0,1:] == myarray[0,:-1]).all(-1))]]
Out[47]:
array([[[1, 1],
[4, 4],
[2, 2],
[3, 3],
[1, 1]]])
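The one-liner above can be wrapped in a small helper. This is only a sketch: the function name drop_consecutive_duplicates is made up for illustration, and it assumes the same layout as the question (shape (1, N, k) with at least one row).

```python
import numpy as np

def drop_consecutive_duplicates(arr):
    """Drop rows along axis 1 that equal their immediate predecessor.

    Hypothetical helper; assumes arr has shape (1, N, k) with N >= 1.
    """
    rows = arr[0]
    # True wherever a row differs from the row directly before it
    changed = (rows[1:] != rows[:-1]).any(axis=-1)
    # The first row has no predecessor, so it is always kept
    keep = np.concatenate(([True], changed))
    return arr[:, keep]

myarray = np.array([[[1, 1], [1, 1], [4, 4], [4, 4], [2, 2], [3, 3], [1, 1]]])
result = drop_consecutive_duplicates(myarray)
print(result.tolist())
```

Only neighbouring duplicates are removed, so the final [1, 1] survives even though it matches the first row, and the outer bracket level is kept because the boolean mask is applied on axis 1.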

Related

Numpy - combine two feature arrays but keep original index

I have two feature arrays, e.g.
a = [1, 2, 3]
b = [4, 5, 6]
Now I want to combine these arrays in the following way:
[[1, 4], [2, 5], [3, 6]]
The location in the array corresponds to a timestep. I tried appending and then reshaping, but then I get:
[[1, 2], [3, 4], [5, 6]]

You can use np.dstack to stack your lists depth-wise:
>>> np.dstack([a, b])
array([[[1, 4],
[2, 5],
[3, 6]]])
As noted by @BramVanroy, this does add an unwanted dimension. Two ways around that are to squeeze the result, or to use column_stack instead:
np.dstack([a, b]).squeeze()
# or
np.column_stack([a, b])
Both of which return:
array([[1, 4],
[2, 5],
[3, 6]])

As an alternative to sacuL's reply, you can also simply do
>>> np.array(list(zip(a, b)))
array([[1, 4],
[2, 5],
[3, 6]])
In fact, this is closer to the expected result in terms of the number of dimensions (two, rather than three in sacuL's answer which you still need to .squeeze() to achieve the correct result).
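To make the dimension difference concrete, here is a short comparison of the shapes produced by the approaches in both answers. The np.stack call is an extra option that neither answer mentions; it is included as an assumption about what might also fit.

```python
import numpy as np

a = [1, 2, 3]
b = [4, 5, 6]

depth = np.dstack([a, b])            # adds a leading axis
columns = np.column_stack([a, b])    # one row per timestep
zipped = np.array(list(zip(a, b)))   # same layout via plain Python
stacked = np.stack([a, b], axis=1)   # not from the answers, shown for comparison

for name, arr in [("dstack", depth), ("column_stack", columns),
                  ("zip", zipped), ("stack", stacked)]:
    print(name, arr.shape)
```

All of the two-dimensional variants hold the same values; only np.dstack needs the extra .squeeze().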

Numpy 3D array (NetCDF data) slicing same element - the fastest way

I need to extract the same element from every 2D slice of a 3D NumPy array (actually a masked array, but it works the same way). I usually do it by iterating; however, the current data is huge and the process has to be repeated on thousands of datasets, so it would take weeks (a rough estimate). What is the quickest way to slice a 3D array without looping through all the 2D arrays?
In this simple example I need to take the [1, 0] element of each 2D array, which is 3 in all of them, and store the values in a result array.
NetCDF example (slicing element [500, 400])
import netCDF4
url = "http://eip.ceh.ac.uk/thredds/dodsC/public-chess/PET/aggregation/PETAggregation.ncml"
dataset = netCDF4.Dataset(url)
result = dataset.variables['pet'][:, 500, 400]
myarray example (now superseded by the NetCDF example above)
myarray = np.array([
[[1, 2], [3, 4], [5, 6]],
[[1, 2], [3, 4], [5, 6]],
[[1, 2], [3, 4], [5, 6]],
[[1, 2], [3, 4], [5, 6]],
])
result = []
for i in myarray:
    result.append(i[1][0])
# result is [3, 3, 3, 3]
EDIT
FirefoxMetzger suggested to slice it simply with
result = myarray[:, 1, 0]. However, I'm getting the following error message with this:
RuntimeError: NetCDF: DAP server error

The minimal numpy example you provided can be efficiently sliced using standard slicing mechanisms:
myarray = np.array([
[[1, 2], [3, 4], [5, 6]],
[[1, 2], [3, 4], [5, 6]],
[[1, 2], [3, 4], [5, 6]],
[[1, 2], [3, 4], [5, 6]],
])
result = myarray[:, 1, 0]
The NetCDF error seems to come from the resulting slice being too large to be returned by the server, causing a crash. As per your comment, the solution here is to query the server in chunks and aggregate the results locally.
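A minimal sketch of the chunked approach, assuming the variable supports NumPy-style slicing on its first axis. The helper name read_point_in_chunks and the chunk_size default are hypothetical, and a plain NumPy array stands in for the remote variable here, so nothing is actually fetched from a server.

```python
import numpy as np

def read_point_in_chunks(variable, row, col, chunk_size=100):
    """Collect variable[:, row, col] a few time steps at a time.

    Hypothetical helper; chunk_size is an arbitrary value to tune.
    """
    n_steps = variable.shape[0]
    pieces = []
    for start in range(0, n_steps, chunk_size):
        stop = min(start + chunk_size, n_steps)
        # Each request only asks for chunk_size values
        pieces.append(variable[start:stop, row, col])
    # np.ma.concatenate keeps masks if the pieces are masked arrays
    return np.ma.concatenate(pieces)

# Local stand-in for dataset.variables['pet']
myarray = np.array([
    [[1, 2], [3, 4], [5, 6]],
    [[1, 2], [3, 4], [5, 6]],
    [[1, 2], [3, 4], [5, 6]],
    [[1, 2], [3, 4], [5, 6]],
])
result = read_point_in_chunks(myarray, 1, 0, chunk_size=3)
print(result.tolist())
```

Whether a given chunk size avoids the server error depends on the server's limits, so it would need to be found by experiment.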

How to find missing combinations/sequences in a 2D array with finite element values

In the case of the set np.array([1, 2, 3]), there are only 9 possible combinations/sequences of its constituent elements: [1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3].
If we have the following array:
np.array([[1, 1],
          [1, 2],
          [1, 3],
          [2, 2],
          [2, 3],
          [3, 1],
          [3, 2]])
What is the best way, with NumPy/SciPy, to determine that [2, 1] and [3, 3] are missing? Put another way, how do we find the inverse list of sequences (when we know all of the possible element values)? Manually doing this with a couple of for loops is easy to figure out, but that would negate whatever speed gains we get from using NumPy over native Python (especially with larger datasets).

You can generate a list of all possible pairs using itertools.product and collect those that are not in your array:
from itertools import product
pairs = [ [1, 1], [1, 2], [1, 3], [2, 2], [2, 3], [3, 1], [3, 2] ]
allPairs = list(map(list, product([1, 2, 3], repeat=2)))
missingPairs = [ pair for pair in allPairs if pair not in pairs ]
print(missingPairs)
Result:
[[2, 1], [3, 3]]
Note that map(list, ...) is needed to convert the tuples returned by product into lists, so that they can be compared with the pairs in your list of lists. This can be simplified if your input is already a list of tuples.

This is one way using itertools.product and set.
The trick here is to note that sets may only contain immutable types such as tuples.
import numpy as np
from itertools import product
x = np.array([1, 2, 3])
y = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
[2, 3], [3, 1], [3, 2]])
set(product(x, repeat=2)) - set(map(tuple, y))
{(2, 1), (3, 3)}
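If the result is needed as an array again, the set difference can be converted back. This is a small sketch on top of the answer above; calling .tolist() first and sorting the result are additions, made so the set holds plain Python ints and the row order is predictable.

```python
import numpy as np
from itertools import product

x = np.array([1, 2, 3])
y = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
              [2, 3], [3, 1], [3, 2]])

# Tuples are hashable, so both sides can be turned into sets
all_pairs = set(product(x.tolist(), repeat=2))
present = set(map(tuple, y.tolist()))
missing = all_pairs - present

# Sets are unordered, so sort before building the array
missing_array = np.array(sorted(missing))
print(missing_array)
```

When nothing is missing, np.array(sorted(missing)) is an empty one-dimensional array, so that case may need separate handling.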

If you want to stay in NumPy instead of going back to raw Python sets, you can do it using void views (based on @Jaime's answer here) and NumPy's built-in set routines such as in1d (np.isin is the recommended replacement in recent NumPy releases):
def vview(a):
    return np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
x = np.array([1, 2, 3])
y = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
[2, 3], [3, 1], [3, 2]])
xx = np.array([i.ravel() for i in np.meshgrid(x, x)]).T
xx[~np.in1d(vview(xx), vview(y))]
array([[2, 1],
[3, 3]])

import itertools
import numpy as np
a = np.array([1, 2, 3])
b = np.array([[1, 1],
              [1, 2],
              [1, 3],
              [2, 2],
              [2, 3],
              [3, 1],
              [3, 2]])
# c holds all 9 possible pairs, shape (9, 2)
c = np.array(list(itertools.product(a, repeat=2)))
If you want to use numpy methods, try this...
Compare the array being tested against the product using broadcasting
d = b == c[:,None,:]
#d.shape is (9,7,2)
Check if both elements of a pair matched
e = np.all(d, -1)
#e.shape is (9,7)
Check if any of the test items match an item of the product.
f = np.any(e, 1)
#f.shape is (9,)
Use f as a boolean index into the product to see what is missing.
>>> print(c[np.logical_not(f)])
[[2 1]
[3 3]]
>>>

Every combination corresponds to a number in the range 0..L^2-1, where L = len(array). For example, [2, 2] => 3*(2-1) + (2-1) = 4. The -1 offsets arise because the elements start from 1, not from zero. Such a mapping can be considered a natural perfect hash for this data type.
If operations on integer sets in NumPy are faster than operations on pairs (for example, an integer set of known size can be represented by a bit sequence), then it is worth traversing the pair list, marking the corresponding bits in the integer set, then looking for the unset ones and retrieving the corresponding pairs.
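The encoding idea can be sketched as follows. A boolean array plays the role of the bit set, and the code assumes the values are exactly 1..L, as in the question; other value sets would first need to be mapped to 0..L-1.

```python
import numpy as np

values = np.array([1, 2, 3])
pairs = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
                  [2, 3], [3, 1], [3, 2]])
L = len(values)

# Map each pair [p, q] to the integer (p-1)*L + (q-1)
codes = (pairs[:, 0] - 1) * L + (pairs[:, 1] - 1)

# Mark every code that occurs; this is the "bit set"
seen = np.zeros(L * L, dtype=bool)
seen[codes] = True

# Decode the unmarked positions back into pairs
missing_codes = np.flatnonzero(~seen)
missing = np.column_stack((missing_codes // L + 1, missing_codes % L + 1))
print(missing)
```

Memory for the mask grows with L^2, while the work on the input is a single pass over the pair list.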

Indexing numpy multidimensional arrays depends on a slicing method

I have a 3-D array. When I take a 2-D slice of it the result depends on whether it is indexed with a list or with a slice. In the first case the result is transposed. Didn't find this behaviour in the manual.
>>> import numpy as np
>>> x = np.array([[[1,1,1],[2,2,2]], [[3,3,3],[4,4,4]]])
>>> x
array([[[1, 1, 1],
[2, 2, 2]],
[[3, 3, 3],
[4, 4, 4]]])
>>> x[0,:,[0,1]]
array([[1, 2],
[1, 2]])
>>> x[0,:,slice(2)]
array([[1, 1],
[2, 2]])
>>>
Could anyone point a rationale for this?

Because you are actually using advanced indexing when you use [0,1]. From the docs:
Combining advanced and basic indexing When there is at least one
slice (:), ellipsis (...) or np.newaxis in the index (or the array has
more dimensions than there are advanced indexes), then the behaviour
can be more complicated. It is like concatenating the indexing result
for each advanced index element
In the simplest case, there is only a single advanced index. A single
advanced index can for example replace a slice and the result array
will be the same, however, it is a copy and may have a different
memory layout. A slice is preferable when it is possible.
Two parts of that passage matter for your example.
In particular, in this construction:
>>> x[0,:,[0,1]]
array([[1, 2],
[1, 2]])
is the case where there is at least one "slice, ellipsis, or np.newaxis" in the index, and the behavior is like concatenating the indexing result for each advanced index element. So:
>>> x[0,:,[0]]
array([[1, 2]])
>>> x[0,:,[1]]
array([[1, 2]])
>>> np.concatenate((x[0,:,[0]], x[0,:,[1]]))
array([[1, 2],
[1, 2]])
However, this construction contains no advanced index at all, only an integer and slices, so ordinary basic slicing applies:
>>> x[0,:,slice(2)]
array([[1, 1],
[2, 2]])
>>> x[slice(0,1),:,slice(2)]
array([[[1, 1],
[2, 2]]])
Note, though, that the latter result is three-dimensional: the first index is now a slice rather than an integer, so all three indices are slices and all three dimensions are kept.
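A non-square example makes the axis reordering easier to see than the 2 x 2 result in the question. The array y below is made up for illustration; indexing in two steps is one possible way to keep the slice-like layout, not something the documentation quote prescribes.

```python
import numpy as np

y = np.arange(24).reshape(2, 3, 4)

mixed = y[0, :, [0, 1]]     # integer and list are separated by a slice
basic = y[0, :, 0:2]        # integer and slices only
two_step = y[0][:, [0, 1]]  # take y[0] first, then use the list

print(mixed.shape)     # list dimension comes first
print(basic.shape)     # axes stay in their original order
print(two_step.shape)
```

Here mixed is the transpose of basic, which is the same relationship as in the question, and two_step matches basic.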

As I understand it, NumPy follows the axis numbering philosophy when it builds the result for a list-like index.
array([[[1, 1, 1],
[2, 2, 2]],
[[3, 3, 3],
[4, 4, 4]]])
Once the first two indices are fixed (x[0, :, ...]), the remaining question is how to extract along the third dimension. When you pass the list [0, 1], NumPy first takes index 0 along that axis, which gives [1, 2], then takes index 1 likewise and stacks it below the already existing row [1, 2].
[[1, 1, 1], array([[1, 2],
[2, 2, 2]] =====> [1, 2]])
(visualize this stacking as below (not on top of) the already existing row since axis-0 grows downwards)
In contrast, it follows the slicing philosophy (start:stop:step) when slice(n) is given as the index. Note that slice(2) is equivalent to 0:2 in your example. So, it extracts [1, 1] first, then [2, 2]. Note how [1, 1] comes on top of [2, 2] here, again following the same axis philosophy, since we have not left the third dimension yet. This is why this result is the transpose of the other.
array([[1, 1],
[2, 2]])
Also, note that starting from 3-D arrays this consistency is preserved. Below is an example from 4-D array and the slicing results.
In [327]: xa
Out[327]:
array([[[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]]],
[[[18, 19, 20],
[21, 22, 23],
[24, 25, 26]],
[[27, 28, 29],
[30, 31, 32],
[33, 34, 35]]]])
In [328]: xa[0, 0, :, [0, 1]]
Out[328]:
array([[0, 3, 6],
[1, 4, 7]])
In [329]: xa[0, 0, :, 0:2]
Out[329]:
array([[0, 1],
[3, 4],
[6, 7]])

Python Matrix sorting via one column

I have an n x 2 matrix of integers. The first column is a series 0, 1, -1, 2, -2; however, these appear in the order in which they were compiled from their constituent matrices. The second column is a list of indices from another list.
I would like to sort the matrix via this second column. This would be equivalent to selecting two columns of data in Excel, and sorting via Column B (where the data is in columns A and B). Keep in mind, the adjacent data in the first column of each row should be kept with its respective second column counterpart. I have looked at solutions using the following:
data[np.argsort(data[:, 0])]
But this does not seem to work. The matrix in question looks like this:
matrix([[1, 1],
[1, 3],
[1, 7],
...,
[2, 1021],
[2, 1040],
[2, 1052]])

You could use np.lexsort:
numpy.lexsort(keys, axis=-1)
Perform an indirect sort using a sequence of keys.
Given multiple sorting keys, which can be interpreted as columns in a
spreadsheet, lexsort returns an array of integer indices that
describes the sort order by multiple columns.
In [13]: data = np.matrix(np.arange(10)[::-1].reshape(-1,2))
In [14]: data
Out[14]:
matrix([[9, 8],
[7, 6],
[5, 4],
[3, 2],
[1, 0]])
In [15]: temp = data.view(np.ndarray)
In [16]: np.lexsort((temp[:, 1], ))
Out[16]: array([4, 3, 2, 1, 0])
In [17]: temp[np.lexsort((temp[:, 1], ))]
Out[17]:
array([[1, 0],
[3, 2],
[5, 4],
[7, 6],
[9, 8]])
Note if you pass more than one key to np.lexsort, the last key is the primary key. The next to last key is the second key, and so on.
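The key order can be checked on a small array with ties. The sample data below is invented for illustration: the second column is the primary key and the first column breaks ties.

```python
import numpy as np

data = np.array([[1, 3],
                 [2, 1],
                 [1, 1],
                 [2, 3]])

# Last key (column 1) is primary; column 0 breaks ties
order = np.lexsort((data[:, 0], data[:, 1]))
sorted_data = data[order]
print(order)
print(sorted_data)
```

Swapping the two keys in the tuple would sort by the first column first instead.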
Using np.lexsort as I show above requires the use of a temporary array because np.lexsort does not work on numpy matrices. Since
temp = data.view(np.ndarray) creates a view, rather than a copy of data, it does not require much extra memory. However,
temp[np.lexsort((temp[:, 1], ))]
is a new array, which does require more memory.
There is also a way to sort by columns in-place. The idea is to view the array as a structured array with two columns. Unlike plain ndarrays, structured arrays have a sort method which allows you to specify columns as keys:
In [65]: data.dtype
Out[65]: dtype('int32')
In [66]: temp2 = data.ravel().view('int32, int32')
In [67]: temp2.sort(order = ['f1', 'f0'])
Notice that since temp2 is a view of data, it does not require allocating new memory and copying the array. Also, sorting temp2 modifies data at the same time:
In [69]: data
Out[69]:
matrix([[1, 0],
[3, 2],
[5, 4],
[7, 6],
[9, 8]])

You had the right idea, just off by a few characters:
>>> import numpy as np
>>> data = np.matrix([[9, 8],
... [7, 6],
... [5, 4],
... [3, 2],
... [1, 0]])
>>> data[np.argsort(data.A[:, 1])]
matrix([[1, 0],
[3, 2],
[5, 4],
[7, 6],
[9, 8]])
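On an np.matrix, data[:, 1] stays two-dimensional, which is why the snippet above goes through data.A to get a plain array. If the data can be kept as an ordinary ndarray, the same idea might look like the sketch below. The sample values are made up, and kind="stable" is an addition that keeps rows with equal keys in their original order.

```python
import numpy as np

data = np.array([[0, 5],
                 [1, 2],
                 [-1, 5],
                 [2, 2]])

# Sort rows by the second column; ties keep their original order
order = np.argsort(data[:, 1], kind="stable")
sorted_data = data[order]
print(sorted_data)
```

Each row moves as a unit, so the first-column value stays with its second-column counterpart, as in a spreadsheet sort on column B.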
