I have a (large) data array and a (large) list of lists of (a few) indices, e.g.,
data = [1.0, 10.0, 100.0]
contribs = [[1, 2], [0], [0, 1]]
For each entry in contribs, I'd like to sum up the corresponding values of data and put those into an array. For the above example, the expected result would be
out = [110.0, 1.0, 11.0]
Doing this in a loop works,
c = numpy.zeros(len(contribs))
for k, indices in enumerate(contribs):
    for idx in indices:
        c[k] += data[idx]
but since data and contribs are large, it's taking too long.
I have the feeling this can be improved using numpy's fancy indexing.
Any hints?
One possibility would be
data = np.array(data)
out = [np.sum(data[c]) for c in contribs]
This should be faster than the double loop, at least; for the example above it gives [110.0, 1.0, 11.0].
Here's an almost vectorized* approach -
# Get lengths of list element in contribs and the cumulative lengths
# to be used for creating an ID array later on.
clens = np.cumsum([len(item) for item in contribs])
# Setup an ID array that assigns the same ID to entries from the same list
# element in contribs. These IDs are used to accumulate values from a
# corresponding array created by indexing into data with a flattened contribs.
id_arr = np.zeros(clens[-1], dtype=int)
id_arr[clens[:-1]] = 1
out = np.bincount(id_arr.cumsum(), np.take(data, np.concatenate(contribs)))
This approach involves some setup work, so the benefit would hopefully show up when it is fed decent-sized input arrays and a decent number of list elements in contribs, i.e. where an otherwise loopy solution would spend its time looping.
*Please note that this is coined as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not computationally demanding and should have minimal effect on the total runtime.
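As a quick sanity check, running the snippet above on the question's sample data gives the expected result:

data = np.array([1.0, 10.0, 100.0])
contribs = [[1, 2], [0], [0, 1]]
# ... run the snippet above ...
print(out)  # [110.   1.  11.]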
I'm not certain that all cases work, but for your example, with data as a numpy.array:
# Flatten "contribs"
f = [j for i in contribs for j in i]
# Get the "ranges" of data[f] that will be summed in the next step
i = [0] + numpy.cumsum([len(i) for i in contribs]).tolist()[:-1]
# Take the required sums
numpy.add.reduceat(data[f], i)
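For the example data this produces the expected sums:
numpy.add.reduceat(data[f], i)
# array([110.,   1.,  11.])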
Related
I have to write an ABM (agent-based model) project in Python, and I need to initialize 50 agents, each of which will hold a different set of numbers. I cannot use a matrix of 50 rows because each agent (each row) can have a different number of elements, so the agents' vectors do not all have the same length: when certain conditions for agent_i occur in the algorithm, a number calculated by the algorithm is added to its vector.
The simplest way would be to write every one out manually, like this
agent_1 = np.array([])
agent_2 = np.array([])
agent_3 = np.array([])
...
but of course I can't. I don't know if there is a way to initialize them automatically with a loop, something like
for i in range(50):
    agent_i = np.array([])
If such a way existed, it would be useful because then, when certain conditions occur in the algorithm, I could add a calculated number to agent_i:
agent_i = np.append(agent_i, result_of_algorithm)
Maybe another way is to use an array of arrays
[[agent_1_collection],[agent_2_collection], ... , [agent_50_collection]]
Once again, I don't know how to initialize an array of arrays, nor how to add a number to a specific sub-array. In fact, I think it can't be done like this (assume, for simplicity, that I have this little array of only 3 agents and that I know how to create it):
vect = np.array([[1], [2, 3], [4, 5, 6]], dtype=object)  # dtype=object is needed for ragged arrays in recent NumPy
result_of_algorithm_for_agent_2 = ...some calculations that we assume give as result... = 90
vect[1] = np.append(vect[1], result_of_algorithm_for_agent_2)
output:
array([[1], array([ 2, 3, 90]), [4, 5, 6]], dtype=object)
Why does it change in that way?
Do you have any advice on how to manipulate arrays of arrays? For example, how to add elements to a specific point of a sub-array (agent)?
Thanks.
List of Arrays
You can create a list of arrays:
agents = [np.array([]) for _ in range(50)]
And then to append values to some agent, say agents[0], use:
items_to_append = [1, 2, 3] # Or whatever you want.
agents[0] = np.append(agents[0], items_to_append)
List of Lists
Alternatively, if you don't need np.array, you can use plain lists for the agents' values (a list appends in place, whereas np.append copies the whole array on every call). In that case you would initialize with:
a = [[] for _ in range(50)]
And you can append to agents[0] with either
single_value = 1 # Or whatever you want.
agents[0].append(single_value)
Or with
items_to_append = [1, 2, 3] # Or whatever you want
agents[0] += items_to_append
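To tie this back to the simulation loop, here is a minimal sketch of how the list of arrays could be used (some_condition and compute_result are hypothetical stand-ins for your algorithm):

import numpy as np

def some_condition(i, step):    # hypothetical stand-in for your condition
    return (i + step) % 7 == 0

def compute_result(i, step):    # hypothetical stand-in for your calculation
    return i * step

agents = [np.array([]) for _ in range(50)]

for step in range(100):         # simulation loop
    for i in range(50):
        if some_condition(i, step):
            agents[i] = np.append(agents[i], compute_result(i, step))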
I want to find the indices of each group of duplicate values, like this:
s = [2,6,2,88,6,...]
The result must contain the indices into the original s, e.g. [[0, 2], [1, 4], ...], though another representation would also be fine.
I have looked at many solutions, and the fastest way I found to get the duplicate groups is:
srt = np.sort(s, axis=None)
srt[:-1][srt[1:] == srt[:-1]]
But after sorting I lose the correct indices into the original s.
In my case I have ~200 million values in the list and I want the fastest way to do this. I store the values in an array because I want to use a GPU to make it faster.
Using hash structures like dict helps.
For example:
import numpy as np
from collections import defaultdict
a = np.array([2, 4, 2, 88, 15, 4])
table = defaultdict(list)
for ind, num in enumerate(a):
    table[num].append(ind)
Outputs:
{2: [0, 2], 4: [1, 5], 88: [3], 15: [4]}
If you want to show duplicated elements in the order from small to large:
for k, v in sorted(table.items()):
    if len(v) > 1:
        print(k, ":", v)
Outputs:
2 : [0, 2]
4 : [1, 5]
The speed is mostly determined by how many distinct values there are in the list.
See if this meets your performance requirements (here, s is your input array):
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = np.split(sorted_inds, cum_counts[:-1])
Notes:
The result will be a list of arrays, one per value from 0 to s.max(); result[v] holds the indices where s == v.
Each of these arrays contains the indices of one repeated value in s. E.g., if the value 13 is repeated 7 times in s, one of the arrays in result will contain those 7 indices.
If you want to ignore singleton values of s (values that occur only once in s), you can filter the result afterwards, e.g. [r for r in result if len(r) > 1]
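For instance, with the sample values from the question (a small check; note that np.bincount requires non-negative integers):

import numpy as np

s = np.array([2, 6, 2, 88, 6])
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = np.split(sorted_inds, cum_counts[:-1])
print(result[2])  # [0 2]
print(result[6])  # [1 4]
# values that never occur in s get empty arrays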
(This is a variation of my other answer. Here, instead of splitting the large array sorted_inds, we take slices from it, so it is likely to have different performance characteristics)
If s is the input array:
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = [sorted_inds[:cum_counts[0]]] + [sorted_inds[cum_counts[i]:cum_counts[i+1]] for i in range(cum_counts.size-1)]
I have a list of ints, a, between 0 and 3000, with len(a) = 3000. I have a for loop that iterates through this list, searching for the indices of each element in a larger array.
import numpy as np
a = [i for i in range(3000)]
array = np.random.randint(0, 3000, size=(12, 1000, 1000))
newlist = []
for i in range(len(a)):
    coord = np.where(array == a[i])
    newlist.append(coord)
As you can see, coord will be a tuple of 3 arrays holding the x, y, z coordinates of the entries in the 3D array that equal the value from the list.
Is there a way to do this in a vectorized manner without the for loop?
The output should be a list of tuples, one for each element in a:
# each coord looks like this:
print(coord)
(array([...]), array([...]), array([...]))  # the x, y and z indices into the 3D array
# combined over all the iterations:
print(newlist)
[coord1, coord2, ..., coord3000]
There is actually a fully vectorized solution to this, despite the fact that the resulting arrays are all of different sizes. The idea is this:
Sort all the elements of the array along with their coordinates. argsort is ideal for this sort of thing.
Find the cut-points in the sorted data, so you know where to split the array, e.g. with diff and flatnonzero.
Split the coordinate array at the indices you found. If some values may be missing from the array, you may need to generate a key based on the first element of each run.
Here is an example to walk you through it. Let's say you have a d-dimensional array of size n. Your coordinates will be a (d, n) array:
d = arr.ndim
n = arr.size
You can generate the coordinate arrays with np.indices directly:
coords = np.indices(arr.shape)
Now ravel/reshape the data and the coordinates into an (n,) and (d, n) array, respectively:
arr = arr.ravel() # Ravel guarantees C-order no matter the source of the data
coords = coords.reshape(d, n) # C-order by default as a result of `indices` too
Now sort the data:
order = np.argsort(arr)
arr = arr[order]
coords = coords[:, order]
Find the locations where the data changes values. You want the indices of the new values, so we can make a fake first element that is 1 less than the actual first element.
change = np.diff(arr, prepend=arr[0] - 1)
The indices of the locations give the break-points in the array:
locs = np.flatnonzero(change)
You can now split the data at those locations:
result = np.split(coords, locs[1:], axis=1)
And you can create the key of values actually found:
key = arr[locs]
If you are very confident that all the values are present in the array, then you don't need the key. Instead, you can compute the break-points as just np.flatnonzero(np.diff(arr)) + 1 and result as just np.split(coords, break_points, axis=1).
Each element in result is already consistent with the indexing used by where/nonzero, but as a numpy array. If you specifically want a tuple, you can map it to one:
result = [tuple(inds) for inds in result]
TL;DR
Combining all this into a function:
def find_locations(arr):
    coords = np.indices(arr.shape).reshape(arr.ndim, arr.size)
    arr = arr.ravel()
    order = np.argsort(arr)
    arr = arr[order]
    coords = coords[:, order]
    locs = np.flatnonzero(np.diff(arr, prepend=arr[0] - 1))
    return arr[locs], np.split(coords, locs[1:], axis=1)
You can return a list of index arrays, with empty arrays for missing elements, by replacing the last line with
splits = np.split(coords, locs[1:], axis=1)
result = [np.empty(0, dtype=int)] * 3000  # empty placeholder; OK to reuse the same reference
for i, j in enumerate(arr[locs]):
    result[j] = splits[i]
return result
You can optionally filter for values that are in the specific range you want (e.g. 0-2999).
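A quick usage sketch of find_locations on a tiny array (a sanity check, not part of the original answer):

import numpy as np

arr = np.array([[1, 0],
                [0, 1]])
keys, locations = find_locations(arr)
print(keys)          # [0 1]
print(locations[0])  # [[0 1]
                     #  [1 0]] -> the 0s sit at coordinates (0, 1) and (1, 0)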
You can use logical OR in numpy to pass all those equality conditions at once instead of one by one.
import numpy as np
conditions = False
for value in a:  # a is the list of target values from the question
    conditions = np.logical_or(conditions, array3d == value)
newlist = np.where(conditions)
This lets numpy run the final np.where once on the combined mask, instead of once per condition.
Another way to do it more compactly
np.where(np.isin(array3d, a))
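A small illustration (note that both variants return one combined coordinate tuple for all the listed values, not one tuple per value as in the original loop):

import numpy as np

arr = np.array([[0, 1],
                [2, 1]])
print(np.where(np.isin(arr, [1, 2])))
# (array([0, 1, 1]), array([1, 0, 1]))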
This is a simple example.
Assume that I have an input tensor M. I also have a tensor of indices into M, of size 2 x 3, such as [[0, 1], [2, 2], [0, 1]], and an array of values [1, 2, 3] corresponding to those index pairs. I want to assign these values to M such that, when several values target the same element (here the element at index [0, 1]), the element receives the minimum of those values (1 in this example).
That is, M[0, 1] = 1 and M[2, 2] = 2.
Can I do that by using some available functions in Pytorch without a loop?
It can be done without loops, but I am generally not sure whether it is such a great idea, due to significantly increased runtime.
The basic idea is relatively simple: since, when the same position is indexed several times, tensor assignment keeps the last value written, it is sufficient to sort your tuples in M in descending order of the corresponding values in the value list (let's call it v).
To do this in pytorch, let us consider the following example:
import torch as t
X = t.randn([3, 3]) # Random matrix of size 3x3
v = t.tensor([1, 2, 3])
M = t.tensor([[0, 2, 0],
              [1, 2, 1]])  # accesses the elements described above
# Showcase pytorch's result with "naive" tensor assignment:
X[tuple(M)] = v # This would assign the value 3 to position (0, 1)
# To correct the behavior, sort v in descending order.
v_desc = v.sort(descending=True)
# v_desc now contains both the sorted values and their original indices
print(v_desc)
# torch.return_types.sort(
# values=tensor([3, 2, 1]),
# indices=tensor([2, 1, 0]))
# Access M in the correct order:
M_desc = M[:, v_desc.indices]
# Finally assign correct order:
X[tuple(M_desc)] = v_desc.values
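# Quick check: the duplicated position (0, 1) now holds the minimum value
print(X[0, 1])  # tensor(1.)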
Again, this is relatively complicated, because it involves sorting the values, and "re-shuffling" of the tensors. You can surely save at least some memory if you perform operations in-place, which is something I disregarded for the sake of legibility.
As to whether this can also be achieved without sorting, I am fairly certain that the answer is "no"; tensor assignments can only be driven by fairly simple conditions, not by inter-dependent ones like yours would require.
I have the following two arrays of equal length. My goal is to split array B into groups defined by array A, so that in the end there are 3 arrays, or a list of arrays. The final list of arrays should consist of the following rows of array B:
First and second
Third and fifth
Fourth
The order is not really relevant.
A = array([[-1],
[ 1],
[ 0],
[ 0],
[ 1]])
B = array([[ 624.5 , 548. ],
[ 912.8201, 564.3444],
[1564.5 , 764. ],
[1463.4163, 785.9251],
[1698.0757, 846.6306]])
The problem occurred to me when using the dbscan clustering function. The A array describes the clusters (0, 1) of the points in array B; the value -1 marks a point as an outlier. (The values used are not precise.)
My goal is to calculate the compactness, ... of each found cluster
The numpy_indexed package (disclaimer: I am its author) was designed with this type of use case in mind.
import numpy_indexed as npi
C = npi.group_by(A).split(B)
Not sure what you mean by the compactness of each group; but rather than splitting and computing afterwards, it is typically more efficient to compute reductions over the groups directly, reusing the grouping object:
groups = npi.group_by(A)
mean = groups.mean(B)
std = groups.std(B)
Keep it simple (here labels plays the role of A.ravel() and data the role of B):
[data[labels == l] for l in np.unique(labels)]
Similarly, you can build a dict in a one-liner.
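For instance, a minimal sketch of that dict variant (assuming labels = A.ravel() and data = B from the question):

groups = {l: data[labels == l] for l in np.unique(labels)}
# keys are the labels (-1, 0, 1); values are the corresponding rows of B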
This is a bit lengthy, but it should work.
final_dict = {}
for counter in range(len(A)):
    key = int(A[counter])  # use a scalar key; numpy arrays are not hashable
    if key not in final_dict:
        final_dict[key] = [B[counter]]
    else:
        final_dict[key].append(B[counter])

final_array = []
for key, value in final_dict.items():
    final_array.append(np.array(value))
Basically, since you have values like -1 to work with, you can use them as dictionary keys; then you iterate over the dictionary to get the groups of rows, which you append to a final output list.
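A more compact variant of the same idea (a sketch using collections.defaultdict, with A and B as in the question):

import numpy as np
from collections import defaultdict

groups = defaultdict(list)
for label, row in zip(A.ravel(), B):
    groups[int(label)].append(row)
final_array = [np.array(rows) for rows in groups.values()]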