python most common pair of indices in 3 x n array - python

I have a numpy array with shape (3, 600219), which is a list of indices.
i.e.
array([[ 0, 0, 0, ..., 2879, 2879, 2879],
[ 40, 40, 40, ..., 162, 165, 168],
[ 249, 250, 251, ..., 195, 196, 198]])
The first row are time indices, the second and third rows are indices of the coordinates. I am trying to figure out which pair of coordinates most frequently occurred, disregarding the time.
e.g. Was it (49,249) or (40,250)...etc.?

I just used a small sample of your data, but I think you'll get the point:
import numpy as np
array = np.array([[ 0, 0, 0, 2879, 2879, 2879],
[ 40, 40, 40, 162, 165, 168],
[ 249, 250, 251, 195, 196, 198]])
# Zip together only the second and third rows
only_coords = zip(array[1,:], array[2,:])
from collections import Counter
Counter(only_coords).most_common()
Produces:
Out[11]:
[((40, 249), 1),
((165, 196), 1),
((162, 195), 1),
((168, 198), 1),
((40, 251), 1),
((40, 250), 1)]

Here's one vectorized approach -
IDs = a[1].max()+1 + a[2]
unq, idx, count = np.unique(IDs, return_index=1,return_counts=1)
out = a[1:,idx[count.argmax()]]
If there could be negative coordinates, use a[1].max()-a[1].min()+1 + a[2] to compute IDs.
Sample run -
In [44]: a
Out[44]:
array([[8, 3, 6, 6, 8, 5, 1, 6, 6, 5],
[5, 2, 1, 1, 5, 1, 5, 1, 1, 4],
[8, 2, 3, 3, 8, 1, 7, 3, 3, 3]])
In [47]: IDs = a[1].max()+1 + a[2]
In [48]: unq, idx, count = np.unique(IDs, return_index=1,return_counts=1)
In [49]: a[1:,idx[count.argmax()]]
Out[49]: array([1, 3])

This might seem a little abstract, but you could try saving each co-ordinate as a number, e.g. [2,1] = 2.1. And put your data into a list of these co-ordinates. For example, a 2nd row of [1,1,2] and 3rd row of [2,2,1] would be [1.2, 1.2, 2.1] You could then use the code:
from collections import Counter
list1=[1.2,1.2,2.1]
data = Counter(list1)
print (data.most_common(1)) # Returns the highest occurring item
which prints the most common number, and how many times it occurs, then you can simply convert the number back to a co-ordinate if you need to use it in your code.

Here is a sample code that does the count:
import numpy as np
import collections
a = np.array([[0, 1, 2, 3], [10, 10, 30 ,40], [25, 25, 10, 50]])
# You don't care about time
b = np.transpose(a[1:])
# convert list items to tuples
c = map(lambda v:tuple(v), b)
collections.Counter(c)
The output:
Counter({(10, 25): 2, (30, 10): 1, (40, 50): 1})

Related

Clustering a list with nearest values without sorting

I have a list like this
tst = [1,3,4,6,8,22,24,25,26,67,68,70,72,0,0,0,0,0,0,0,4,5,6,36,38,36,31]
I want to group the elements from above list into separate groups/lists based on the difference between the consecutive elements in the list (differing by 1 or 2 or 3).
I have tried following code
def slice_when(predicate, iterable):
i, x, size = 0, 0, len(iterable)
while i < size-1:
if predicate(iterable[i], iterable[i+1]):
yield iterable[x:i+1]
x = i + 1
i += 1
yield iterable[x:size]
tst = [1,3,4,6,8,22,24,25,26,67,68,70,72,0,0,0,0,0,0,0,4,5,6,36,38,36,31]
slices = slice_when(lambda x,y: (y - x > 2), tst)
whola=(list(slices))
I got this results
[[1, 3, 4, 6, 8], [22, 24, 25, 26], [67, 68, 70, 72, 0, 0, 0, 0, 0, 0, 0], [4, 5, 6], [36, 38, 36, 31]]
In 3rd list it doesn't separate the sequence of zeros into another list. Any kind of help highly appreciate. Thank you
I guess this is what you want?
tst = [1,3,4,6,8,22,24,25,26,67,68,70,72,0,0,0,0,0,0,0,4,5,6,36,38,36,31]
slices = slice_when(lambda x,y: (abs(y - x) > 2), tst) # Use abs!
whola=(list(slices))
print(whola)

How to raise every element of a vector to the power of every element of another vector?

I would like to raise a vector by ascending powers form 0 to 5:
import numpy as np
a = np.array([1, 2, 3]) # list of 11 components
b = np.array([0, 1, 2, 3, 4]) # power
c = np.power(a,b)
desired results are:
c = [[1**0, 1**1, 1**2, 1**3, 1**4], [2**0, 2**1, ...], ...]
I keep getting this error:
ValueError: operands could not be broadcast together with shapes (3,) (5,)
One solution will be to add a new dimension to your array a
c = a[:,None]**b
# Using broadcasting :
# (3,1)**(4,) --> (3,4)
#
# [[1],
# c = [2], ** [0,1,2,3,4]
# [3]]
For more information check the numpy broadcasting documentation
Here's a solution:
num_of_powers = 5
num_of_components = 11
a = []
for i in range(1,num_of_components + 1):
a.append(np.repeat(i,num_of_powers))
b = list(range(num_of_powers))
c = np.power(a,b)
The output c would look like:
array([[ 1, 1, 1, 1, 1],
[ 1, 2, 4, 8, 16],
[ 1, 3, 9, 27, 81],
[ 1, 4, 16, 64, 256],
[ 1, 5, 25, 125, 625],
[ 1, 6, 36, 216, 1296],
[ 1, 7, 49, 343, 2401],
[ 1, 8, 64, 512, 4096],
[ 1, 9, 81, 729, 6561],
[ 1, 10, 100, 1000, 10000],
[ 1, 11, 121, 1331, 14641]], dtype=int32)
Your solution shows a broadcast error because as per the documentation:
If x1.shape != x2.shape, they must be broadcastable to a common shape (which becomes the shape of the output).
c = [[x**y for y in b] for x in a]
c = np.asarray(list(map(lambda x: np.power(a,x), b))).transpose()
You need to first create a matrix where the rows are repetitions of each number. This can be done with np.tile:
mat = np.tile(a, (len(b), 1)).transpose()
And then raise that to the power of b elementwise:
np.power(mat, b)
All together:
import numpy as np
nums = np.array([1, 2, 3]) # list of 11 components
powers = np.array([0, 1, 2, 3, 4]) # power
print(np.power(np.tile(nums, (len(powers), 1)).transpose(), powers))
Which will give:
[[ 1 1 1 1 1] # == [1**0, 1**1, 1**2, 1**3, 1**4]
[ 1 2 4 8 16] # == [2**0, 2**1, 2**2, 2**3, 2**4]
[ 1 3 9 27 81]] # == [3**0, 3**1, 3**2, 3**3, 3**4]

How to add padding in a dataset to fill up to 50 items in a list and replace NaN with 0?

I have the following encoded text column in my dataset:
[182, 4]
[14, 2, 31, 42, 72]
[362, 685, 2, 399, 21, 16, 684, 682, 35, 7, 12]
Somehow I want this column to be filled up to 50 items on each row, assuming no row is larger than 50 items. And where there is no numeric value I want a 0 to be placed.
In the example the wanted outcome would be:
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,182, 4]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14, 2, 31, 42, 72]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,362, 685, 2, 399, 21, 16, 684, 682, 35, 7, 12]
Try this:
>>> y=[182,4]
>>> ([0]*(50-len(y))+y)
Assuming you parsed the lists from the string columns already, a very basic approach could be as follows:
a = [182, 4]
b = [182, 4, 'q']
def check_numeric(element):
# assuming only integers are valid numeric values
try:
element = int(element)
except ValueError:
element = 0
return element
def replace_nonnumeric(your_list):
return [check_numeric(element) for element in your_list]
# change the desired length to your needs (change 15 to 50)
def fill_zeros(your_list, desired_length=15):
prepend = (desired_length - len(your_list)) * [0]
result = prepend + your_list
return result
aa = replace_nonnumeric(a)
print(fill_zeros(aa))
bb = replace_nonnumeric(b)
print(fill_zeros(bb))
This code outputs:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 182, 4] # <-- aa
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 182, 4, 0] # <-- bb
However, I suggest using this code as a basis and adopt it to your needs.
Especially when parsing a lot of entries from the "list as strings" column, writing a parsing function and calling it via pandas' .apply() would be nice approach.

How to convert python dict to 3D numpy array?

I have a python dict with n_keys where each value is a 2D array (dim1,dim2).
I want to transfer this into a 3D numpy array of (dim1,dim2,n_keys).
How can I do it fast without a lot of nested loops?
EDIT:
Example:
featureMatrix = np.empty((len(featureDict.values()[0]),
len(featureDict.values()[0][0,:]),
len(featureDict.keys())))
for k,keys in enumerate(featureDict.keys()):
value=featureDict[keys]
for i in range(0,len(value[:,0]),1):
for j in range(0,len(value[0,:]),1):
featureMatrix[i,j,k]=value[i,j]
dict-ionaries are unordered so you probably don't want to simply stack them but you can simply stack the values nevertheless with array3d = np.dstack(somedict.values()).
Here is some example case:
>>> somedict = dict(a = np.arange(4).reshape(2,2),
b = np.arange(4).reshape(2,2) + 10,
c = np.arange(4).reshape(2,2) + 100,
d = np.arange(4).reshape(2,2) + 1000)
>>> array3d = np.dstack(somedict.values())
>>> array3d.shape
(2, 2, 4)
>>> array3d # unordered because of dict unorderedness, order depends for all practical purposes on chance
array([[[ 10, 0, 1000, 100],
[ 11, 1, 1001, 101]],
[[ 12, 2, 1002, 102],
[ 13, 3, 1003, 103]]])
or in case you want to stack it sorted by the key of the dictionary:
>>> array3d = np.dstack((somedict[i] for i in sorted(somedict.keys())))
>>> array3d # sorted by the keys!
array([[[ 0, 10, 100, 1000],
[ 1, 11, 101, 1001]],
[[ 2, 12, 102, 1002],
[ 3, 13, 103, 1003]]])

Numpy: fast calculations considering items' neighbors and their position inside the array

I have 4 2D numpy arrays, called a, b, c, d, each of them made of n rows and m columns. What I need to do is giving to each element of b and d a value calculated as follows (pseudo-code):
min_coords = min_of_neighbors_coords(x, y)
b[x,y] = a[x,y] * a[min_coords];
d[x,y] = c[min_coords];
Where min_of_neighbors_coords is a function that, given the coordinates of an element of the array, returns the coordinates of the 'neighbor' element that has the lower value. I.e., considering the array:
1, 2, 5
3, 7, 2
2, 3, 6
min_of_neighbors_coords(1, 1) will refer to the central element with the value of 7, and will return the tuple (0, 0): the coordinates of the number 1.
I managed to do this using for loops (element per element), but the algorithm is VERY slow and I'm searching a way to improve it, avoiding loops and demanding the calculations to numpy.
Is it possible?
EDIT I have kept my original answer at the bottom. As Paul points out in the comments, the original answer didn't really answer the OP's question, and could be more easily achieved with an ndimage filter. The following much more cumbersome function should do the right thing. It takes two arrays, a and c, and returns the windowed minimum of a and the values in c at the positions of the windowed minimums in a:
def neighbor_min(a, c):
ac = np.concatenate((a[None], c[None]))
rows, cols = ac.shape[1:]
ret = np.empty_like(ac)
# Fill in the center
win_ac = as_strided(ac, shape=(2, rows-2, cols, 3),
strides=ac.strides+ac.strides[1:2])
win_ac = win_ac[np.ogrid[:2, :rows-2, :cols] +
[np.argmin(win_ac[0], axis=2)]]
win_ac = as_strided(win_ac, shape=(2, rows-2, cols-2, 3),
strides=win_ac.strides+win_ac.strides[2:3])
ret[:, 1:-1, 1:-1] = win_ac[np.ogrid[:2, :rows-2, :cols-2] +
[np.argmin(win_ac[0], axis=2)]]
# Fill the top, bottom, left and right borders
win_ac = as_strided(ac[:, :2, :], shape=(2, 2, cols-2, 3),
strides=ac.strides+ac.strides[2:3])
win_ac = win_ac[np.ogrid[:2, :2, :cols-2] +
[np.argmin(win_ac[0], axis=2)]]
ret[:, 0, 1:-1] = win_ac[:, np.argmin(win_ac[0], axis=0),
np.ogrid[:cols-2]]
win_ac = as_strided(ac[:, -2:, :], shape=(2, 2, cols-2, 3),
strides=ac.strides+ac.strides[2:3])
win_ac = win_ac[np.ogrid[:2, :2, :cols-2] +
[np.argmin(win_ac[0], axis=2)]]
ret[:, -1, 1:-1] = win_ac[:, np.argmin(win_ac[0], axis=0),
np.ogrid[:cols-2]]
win_ac = as_strided(ac[:, :, :2], shape=(2, rows-2, 2, 3),
strides=ac.strides+ac.strides[1:2])
win_ac = win_ac[np.ogrid[:2, :rows-2, :2] +
[np.argmin(win_ac[0], axis=2)]]
ret[:, 1:-1, 0] = win_ac[:, np.ogrid[:rows-2],
np.argmin(win_ac[0], axis=1)]
win_ac = as_strided(ac[:, :, -2:], shape=(2, rows-2, 2, 3),
strides=ac.strides+ac.strides[1:2])
win_ac = win_ac[np.ogrid[:2, :rows-2, :2] +
[np.argmin(win_ac[0], axis=2)]]
ret[:, 1:-1, -1] = win_ac[:, np.ogrid[:rows-2],
np.argmin(win_ac[0], axis=1)]
# Fill the corners
win_ac = ac[:, :2, :2]
win_ac = win_ac[:, np.ogrid[:2],
np.argmin(win_ac[0], axis=-1)]
ret[:, 0, 0] = win_ac[:, np.argmin(win_ac[0], axis=-1)]
win_ac = ac[:, :2, -2:]
win_ac = win_ac[:, np.ogrid[:2],
np.argmin(win_ac[0], axis=-1)]
ret[:, 0, -1] = win_ac[:, np.argmin(win_ac[0], axis=-1)]
win_ac = ac[:, -2:, -2:]
win_ac = win_ac[:, np.ogrid[:2],
np.argmin(win_ac[0], axis=-1)]
ret[:, -1, -1] = win_ac[:, np.argmin(win_ac[0], axis=-1)]
win_ac = ac[:, -2:, :2]
win_ac = win_ac[:, np.ogrid[:2],
np.argmin(win_ac[0], axis=-1)]
ret[:, -1, 0] = win_ac[:, np.argmin(win_ac[0], axis=-1)]
return ret
The return is a (2, rows, cols) array that can be unpacked into the two arrays:
>>> a = np.random.randint(100, size=(5,5))
>>> c = np.random.randint(100, size=(5,5))
>>> a
array([[42, 54, 18, 88, 26],
[80, 65, 83, 31, 4],
[51, 52, 18, 88, 52],
[ 1, 70, 5, 0, 89],
[47, 34, 27, 67, 68]])
>>> c
array([[94, 94, 29, 6, 76],
[81, 47, 67, 21, 26],
[44, 92, 20, 32, 90],
[81, 25, 32, 68, 25],
[49, 43, 71, 79, 77]])
>>> neighbor_min(a, c)
array([[[42, 18, 18, 4, 4],
[42, 18, 18, 4, 4],
[ 1, 1, 0, 0, 0],
[ 1, 1, 0, 0, 0],
[ 1, 1, 0, 0, 0]],
[[94, 29, 29, 26, 26],
[94, 29, 29, 26, 26],
[81, 81, 68, 68, 68],
[81, 81, 68, 68, 68],
[81, 81, 68, 68, 68]]])
The OP's case could then be solved as:
def bd_from_ac(a, c):
b,d = neighbor_min(a, c)
return a*b, d
And while there is a serious performance hit, it is pretty fast still:
In [3]: a = np.random.rand(1000, 1000)
In [4]: c = np.random.rand(1000, 1000)
In [5]: %timeit bd_from_ac(a, c)
1 loops, best of 3: 570 ms per loop
You are not really using the coordinates of the minimum neighboring element for anything else than fetching it, so you may as well skip that part and create a min_neighbor function. If you don't want to resort to cython for fast looping, you are going to have to go with rolling window views, such as outlined in Paul's link. This will typically convert your (m, n) array into a (m-2, n-2, 3, 3) view of the same data, and you would then apply np.min over the last two axes.
Unfortunately you have to apply it one axis at a time, so you will have to create a (m-2, n-2, 3) copy of your data. Fortunately, you can compute the minimum in two steps, first windowing and minimizing along one axis, then along the other, and obtain the same result. So at most you are going to have intermediate storage the size of your input. If needed, you could even reuse the output array as intermediate storage and avoid memory allocations, but that is left as exercise...
The following function does that. It is kind of lengthy because it has to deal not only with the central area, but also with the special cases of the four edges and four corners. Other than that it is a pretty compact implementation:
def neighbor_min(a):
rows, cols = a.shape
ret = np.empty_like(a)
# Fill in the center
win_a = as_strided(a, shape=(m-2, n, 3),
strides=a.strides+a.strides[:1])
win_a = win_a.min(axis=2)
win_a = as_strided(win_a, shape=(m-2, n-2, 3),
strides=win_a.strides+win_a.strides[1:])
ret[1:-1, 1:-1] = win_a.min(axis=2)
# Fill the top, bottom, left and right borders
win_a = as_strided(a[:2, :], shape=(2, cols-2, 3),
strides=a.strides+a.strides[1:])
ret[0, 1:-1] = win_a.min(axis=2).min(axis=0)
win_a = as_strided(a[-2:, :], shape=(2, cols-2, 3),
strides=a.strides+a.strides[1:])
ret[-1, 1:-1] = win_a.min(axis=2).min(axis=0)
win_a = as_strided(a[:, :2], shape=(rows-2, 2, 3),
strides=a.strides+a.strides[:1])
ret[1:-1, 0] = win_a.min(axis=2).min(axis=1)
win_a = as_strided(a[:, -2:], shape=(rows-2, 2, 3),
strides=a.strides+a.strides[:1])
ret[1:-1, -1] = win_a.min(axis=2).min(axis=1)
# Fill the corners
ret[0, 0] = a[:2, :2].min()
ret[0, -1] = a[:2, -2:].min()
ret[-1, -1] = a[-2:, -2:].min()
ret[-1, 0] = a[-2:, :2].min()
return ret
You can now do things like:
>>> a = np.random.randint(10, size=(5, 5))
>>> a
array([[0, 3, 1, 8, 9],
[7, 2, 7, 5, 7],
[4, 2, 6, 1, 9],
[2, 8, 1, 2, 3],
[7, 7, 6, 8, 0]])
>>> neighbor_min(a)
array([[0, 0, 1, 1, 5],
[0, 0, 1, 1, 1],
[2, 1, 1, 1, 1],
[2, 1, 1, 0, 0],
[2, 1, 1, 0, 0]])
And your original question can be solved as:
def bd_from_ac(a, c):
return a*neighbor_min(a), neighbor_min(c)
As a performance benchmark:
In [2]: m, n = 1000, 1000
In [3]: a = np.random.rand(m, n)
In [4]: c = np.random.rand(m, n)
In [5]: %timeit bd_from_ac(a, c)
1 loops, best of 3: 123 ms per loop
Finding a[min_coords] is a rolling window operation. Several clever solutions our outlined in this post. You'll want to make the creation of the c[min_coords] array a side-effect of whichever solution you choose.
I hope this helps. I can post some sample code later when I have some time.
I have interest in helping you, and I believe there are possibly better solutions outside the scope of your question, but in order to put my own time into writing code, I must have some feedback of yours, because I am not 100% sure I understand what you need.
One thing to consider: if you are a C# developer, maybe a "brute-force" implementation of C# can outperform a clever implementation of Numpy, so you could consider at least testing your rather simple operations implemented in C#. Geotiff (which I suppose you are reading) has a relatively friendly specification, and I guess there might be .NET GeoTiff libraries around.
But supposing you want to give Numpy a try (and I believe you should), let's take a look at what you're trying to achieve:
If you are going to run min_coords(array) in every element of arrays a and c, you might consider to "stack" nine copies of the same array, each copy rolled by some offset, using numpy.dstack() and numpy.roll(). Then, you apply numpy.argmin(stacked_array, axis=2) and you get an array containing values between 0 and 8, where each of these values map to a tuple containing the offset indexes.
Then, using this principle, your min_coords() function would be vectorized, operating in the whole array at once, and giving back an array that gives you an offset which would be the index of a lookup table containing the offsets.
If you have interest in elaborating this, please leave a comment.
Hope this helps!

Categories

Resources