Calculating Mean of arrays with different lengths - python

Is it possible to calculate the mean of multiple arrays, when they may have different lengths? I am using numpy. So let's say I have:
numpy.array([[1, 2, 3, 4, 8], [3, 4, 5, 6, 0]])
numpy.array([[5, 6, 7, 8, 7, 8], [7, 8, 9, 10, 11, 12]])
numpy.array([[1, 2, 3, 4], [5, 6, 7, 8]])
Now I want to calculate the mean while ignoring elements that are 'missing'. (Naturally, I cannot just append zeros, as that would distort the mean.)
Is there a way to do this without iterating through the arrays?
PS. These arrays are all 2-D, but will always have the same amount of coordinates for that array. I.e. the 1st array is 5 and 5, 2nd is 6 and 6, 3rd is 4 and 4.
An example:
np.array([[1, 2], [3, 4]])
np.array([[1, 2, 3], [3, 4, 5]])
np.array([[7], [8]])
This must give
(1+1+7)/3 (2+2)/2 3/1
(3+3+8)/3 (4+4)/2 5/1
And graphically:
[1, 2] [1, 2, 3] [7]
[3, 4] [3, 4, 5] [8]
Now imagine that these 2-D arrays are placed on top of each other with coordinates overlapping contributing to that coordinate's mean.

I often need this for plotting the mean of performance curves with different lengths.
I solved it with a simple function (based on the answer by @unutbu):
def tolerant_mean(arrs):
    lens = [len(i) for i in arrs]
    arr = np.ma.empty((np.max(lens), len(arrs)))
    arr.mask = True
    for idx, l in enumerate(arrs):
        arr[:len(l), idx] = l
    return arr.mean(axis=-1), arr.std(axis=-1)

y, error = tolerant_mean(list_of_ys_diff_len)
ax.plot(np.arange(len(y)) + 1, y, color='green')
Applying that function to a list of curves with different lengths yields their mean curve.
[Figure: the mean curve plotted in green]

numpy.ma.mean allows you to compute the mean of non-masked array elements. However, to use numpy.ma.mean, you have to first combine your three numpy arrays into one masked array:
import numpy as np
x = np.array([[1, 2], [3, 4]])
y = np.array([[1, 2, 3], [3, 4, 5]])
z = np.array([[7], [8]])
arr = np.ma.empty((2,3,3))
arr.mask = True
arr[:x.shape[0],:x.shape[1],0] = x
arr[:y.shape[0],:y.shape[1],1] = y
arr[:z.shape[0],:z.shape[1],2] = z
print(arr.mean(axis = 2))
yields
[[3.0 2.0 3.0]
[4.66666666667 4.0 5.0]]
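The same masked-array idea can be wrapped in a small helper so the shapes do not have to be hard-coded. This is only a sketch: the name masked_mean_2d is made up for illustration, and it assumes every input is a 2-D array.

```python
import numpy as np

def masked_mean_2d(arrays):
    """Element-wise mean of 2-D arrays whose shapes may differ (hypothetical helper)."""
    rows = max(a.shape[0] for a in arrays)
    cols = max(a.shape[1] for a in arrays)
    # One slot per input array along the last axis; everything starts out masked.
    stacked = np.ma.empty((rows, cols, len(arrays)))
    stacked.mask = True
    for k, a in enumerate(arrays):
        # Assigning real values unmasks exactly the cells this array covers.
        stacked[:a.shape[0], :a.shape[1], k] = a
    # Masked cells are ignored, so each position averages only the arrays that reach it.
    return stacked.mean(axis=2)

x = np.array([[1, 2], [3, 4]])
y = np.array([[1, 2, 3], [3, 4, 5]])
z = np.array([[7], [8]])
result = masked_mean_2d([x, y, z])
print(result)
```

With the three example arrays this should give the same means as the hand-written version above.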

The function below also works; it averages lists of different lengths position by position:
import numpy as np

def avgNestedLists(nested_vals):
    """
    Averages a 2-D array and returns a 1-D array of all of the columns
    averaged together, regardless of their dimensions.
    """
    output = []
    maximum = 0
    for lst in nested_vals:
        if len(lst) > maximum:
            maximum = len(lst)
    for index in range(maximum): # Go through each index of longest list
        temp = []
        for lst in nested_vals: # Go through each list
            if index < len(lst): # If not an index error
                temp.append(lst[index])
        output.append(np.nanmean(temp))
    return output
Going off of your first example:
avgNestedLists([[1, 2, 3, 4, 8], [5, 6, 7, 8, 7, 8], [1, 2, 3, 4]])
Outputs:
[2.3333333333333335,
3.3333333333333335,
4.333333333333333,
5.333333333333333,
7.5,
8.0]
The reason np.amax(nested_lst) or np.max(nested_lst) was not used in the beginning to find the max value is because it will return an array if the nested lists are of different sizes.
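As a variation on the loop above, the ragged lists can be padded with NaN first and then averaged with np.nanmean. This is a sketch, not part of the original answer: the helper name nan_padded_mean is hypothetical, and it assumes the inputs are flat lists of numbers.

```python
from itertools import zip_longest

import numpy as np

def nan_padded_mean(lists):
    """Position-wise mean of lists with different lengths (hypothetical helper)."""
    # zip_longest yields one tuple per position and pads shorter lists with NaN.
    padded = np.array(list(zip_longest(*lists, fillvalue=np.nan)), dtype=float)
    # nanmean skips the padding, so each position averages only the values present.
    return np.nanmean(padded, axis=1)

means = nan_padded_mean([[1, 2, 3, 4, 8], [5, 6, 7, 8, 7, 8], [1, 2, 3, 4]])
print(means)
```

The Python-level loop is still there (inside zip_longest), but the averaging itself is done by NumPy in one call.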

OP, I know you were looking for a non-iterative built-in solution, but the following really only takes 3 lines (2 if you combine transpose and means but then it just gets messy):
import numpy as np

arrays = [
    np.array([[1, 2], [3, 4]]),
    np.array([[1, 2, 3], [3, 4, 5]]),
    np.array([[7], [8]])
]
mean = lambda x: float(sum(x)) / len(x)
transpose = [[item[i] for item in arrays] for i in range(len(arrays[0]))]
means = [[mean([j[i] for j in t if i < len(j)]) for i in range(len(max(t, key=len)))] for t in transpose]
Outputs:
>>> means
[[3.0, 2.0, 3.0], [4.666666666666667, 4.0, 5.0]]

Related

PyTorch slice matrix with vector

Say I have one matrix and one vector as follows:
import torch
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
y = torch.tensor([0, 2, 1])
Is there a way to index x with y, something like x[y], so that the result is:
res = [1, 6, 8]
So basically, for each row of x, I take the element whose column is given by the corresponding entry of y.
You can specify the corresponding row index as:
import torch
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
y = torch.tensor([0, 2, 1])
x[range(x.shape[0]), y]
# tensor([1, 6, 8])
Advanced indexing in PyTorch works just like NumPy's, i.e. the indexing arrays are broadcast together across the axes. So you could do as in FBruzzesi's answer.
Alternatively, similar to np.take_along_axis, PyTorch has torch.gather to take values along a specific axis:
x.gather(1, y.view(-1,1)).view(-1)
# tensor([1, 6, 8])
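For comparison, here is the NumPy side of that analogy, written with NumPy arrays rather than PyTorch tensors. It is a sketch using the same example data; the variable names picked_indexing and picked_take are made up for illustration.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
y = np.array([0, 2, 1])

# Advanced indexing: pair each row index with the column index stored in y.
picked_indexing = x[np.arange(x.shape[0]), y]

# take_along_axis expects the index array to have as many dimensions as x,
# so y is turned into a column first and the result is flattened afterwards.
picked_take = np.take_along_axis(x, y[:, None], axis=1).ravel()

print(picked_indexing, picked_take)
```

Both expressions select one element per row, which mirrors what the indexing and gather calls do in the PyTorch answers.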

How to get max (top) N values across entire numpy matrix

I want to get the top N (maximal) args & values across an entire numpy matrix, as opposed to across a single dimension (rows / columns).
Example input (with N=3):
import numpy as np
mat = np.matrix([[9,8, 1, 2], [3, 7, 2, 5], [0, 3, 6, 2], [0, 2, 1, 5]])
print(mat)
[[9 8 1 2]
[3 7 2 5]
[0 3 6 2]
[0 2 1 5]]
Desired output: [9, 8, 7]
Taking the maximum along a single dimension first doesn't work, because several of the overall top values can share a row or a column.
# by rows, no 8
np.squeeze(np.asarray(mat.max(1).reshape(-1)))[:3]
array([9, 7, 6])
# by cols, no 7
np.squeeze(np.asarray(mat.max(0)))[:3]
array([9, 8, 6])
I have code that works, but looks really clunky to me.
# reshape into single vector
mat_as_vector = np.squeeze(np.asarray(mat.reshape(-1)))
# get top 3 arg positions
top3_args = mat_as_vector.argsort()[::-1][:3]
# subset the reshaped matrix
top3_vals = mat_as_vector[top3_args]
print(top3_vals)
array([9, 8, 7])
Would appreciate any shorter way / more efficient way / magic numpy function to do this!
Using numpy.partition() is significantly faster than performing a full sort for this purpose:
np.partition(np.asarray(mat), mat.size - N, axis=None)[-N:]
assuming N<=mat.size.
If you also need the final result to be sorted (besides being the top N), then sort the previous result (presumably a much smaller array than the original one):
np.sort(np.partition(np.asarray(mat), mat.size - N, axis=None)[-N:])
If you need the result sorted from largest to smallest, append [::-1] to the previous command:
np.sort(np.partition(np.asarray(mat), mat.size - N, axis=None)[-N:])[::-1]
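Since the question also asks for the positions ("args") and not only the values, a similar sketch with np.argpartition may help. It assumes a plain ndarray instead of np.matrix, and the variable names are made up for illustration.

```python
import numpy as np

mat = np.array([[9, 8, 1, 2], [3, 7, 2, 5], [0, 3, 6, 2], [0, 2, 1, 5]])
N = 3

flat = mat.ravel()
# Flat indices of the N largest values, in no particular order.
top_flat = np.argpartition(flat, flat.size - N)[-N:]
# Order just those N indices from largest to smallest value.
top_flat = top_flat[np.argsort(flat[top_flat])[::-1]]

top_values = flat[top_flat]
# Convert flat indices back to (row, column) positions in the original matrix.
top_rows, top_cols = np.unravel_index(top_flat, mat.shape)

print(top_values, top_rows, top_cols)
```

Only the N selected entries are fully sorted here, so the cost of the sort does not grow with the size of the matrix.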
One way is to flatten the matrix, sort it in descending order and slice the top N values:
sorted(mat.flatten().tolist()[0], reverse=True)[:3]
Result:
[9, 8, 7]
The idea is from this answer: How to get indices of N maximum values in a numpy array?
import numpy as np
import heapq
mat = np.matrix([[9,8, 1, 2], [3, 7, 2, 5], [0, 3, 6, 2], [0, 2, 1, 5]])
ind = heapq.nlargest(3, range(mat.size), mat.take)
print(mat.take(ind).tolist()[0])
Output
[9, 8, 7]

index 3 is out of bounds for axis 0 with size 3 (python)

I looked this up a couple of times and just can't figure it out, so I think it is best to ask for help. Here is the deal: t0 is [1000,1], x is [1000,1] and y is [1000,1], m=1000 and suma1=0, and every time I run it I get this error:
index 3 is out of bounds for axis 0 with size 3
for i in range(m):
    suma1+=((t0[i] + x[i]- y[i])**2)
Note that t0 is [1000,1], x is [1000,1] and y is [1000,1]. This means that each of these is a two-dimensional list. However, the code you show appears to expect one-dimensional lists whose values are summed.
If t0, x and y are two-dimensional lists, then you are concatenating lists rather than adding values:
t0 = [[1, 2, 3, 4], 1], [[9, 8, 7, 6], 2]
x = [[8, 7, 4, 7], 3], [[6, 4, 2, 8], 4]
t0[1] + x[1] = [[9, 8, 7, 6], 2, [6, 4, 2, 8], 4]
To do the arithmetic you need one-dimensional arrays such as:
t0 = [1, 2, 3, 4, 5]
x = [8, 7, 4, 7, 1]
y = [3, 7, 9, 4, 8]
You need to show more code and some sample values (for example with m = 10).
Also run your code as individual lines:
suma1 = 0
for i in range(m):
    suma1 += ((t0[i] + x[i] - y[i])**2)
Are x, y and t0 1-D numpy arrays, or something else?
Try to find out how to index those arrays (or lists) correctly. For example:
t0[999, 0] vs t0[0, 999]
(Python indices start at 0, so the only valid index along an axis of size 1 is 0.)
Edit:
In that case I would try using:
suma1 += ((t0[i, 0] + x[i, 0] - y[i, 0])**2)
or
suma1 += ((t0[0, i] + x[0, i] - y[0, i])**2)
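If the three inputs really are NumPy arrays of shape (1000, 1), the loop can be checked against a vectorized sum. The sketch below uses random placeholder data (an assumption, since the post does not show the real values). The error message in the question suggests that one of the actual arrays has only 3 entries along axis 0, so printing each array's .shape first is probably the quickest diagnostic.

```python
import numpy as np

m = 1000
rng = np.random.default_rng(0)
# Placeholder data with the shapes described in the question (assumed, not the real values).
t0 = rng.random((m, 1))
x = rng.random((m, 1))
y = rng.random((m, 1))

# Check the shapes before looping; a mismatch here explains an out-of-bounds index.
print(t0.shape, x.shape, y.shape)

# Loop version: index the single column explicitly so every term is a scalar.
suma1 = 0.0
for i in range(m):
    suma1 += (t0[i, 0] + x[i, 0] - y[i, 0]) ** 2

# Vectorized version: the arithmetic is applied element-wise, then summed.
suma1_vec = np.sum((t0 + x - y) ** 2)
print(suma1, suma1_vec)
```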

How to use a pair of nested for loops to iterate over a 2-d array?

Need to take the values from one array, put them through a function and put them in another array. It is meant to be done using a pair of nested for loops. Please help. Complete beginner here.
EDIT: Ok to clarify, I have a 2-d array with various values in it. I want to apply a function to all of these values and have a 2-d array returned with the values after they have gone through the function. I am working in python. Thanks for the quick responses and any help you can give!
EDIT3: Example code:
import numpy as N

def makeGrid(dim):
    ''' Function to return a grid of distances from the centre of an array.
    This version uses loops to fill the array and is thus slow.'''
    tabx = N.arange(dim) - float(dim/2.0) + 0.5
    taby = N.arange(dim) - float(dim/2.0) + 0.5
    grid = N.zeros((dim,dim), dtype='float')
    for y in range(dim):
        for x in range(dim):
            grid[y,x] = N.sqrt(tabx[x]**2 + taby[y]**2)
    return grid

import math

def BigGrid(dim):
    l= float(raw_input('Enter a value for lambda: '))
    p= float(raw_input('Enter a value for phi: '))
    a = makeGrid
    b= N.zeros ((10,10),dtype=float) #Create an array to take the returned values
    for i in range(10):
        for j in range (10):
            b[i,j] = a[i][j]*2

if __name__ == "__main__":
    ''' Module test code '''
    size = 10 #Dimension of the array
    newGrid = BigGrid(size)
    newGrid = N.round(newGrid, decimals=2)
    print newGrid
def map_row(row):
    return list(map(some_function, row))

list(map(map_row, my_2d_list))
This is probably how I would do it.
Based on your question, it appears you're using Numpy. If you're not too concerned about speed, you can simply call the function with a numpy array; the function will operate on the entire array for you.
There's no need to write the iteration explicitly, though if you can find a way to take advantage of numpy's special features, that will be faster than using a function designed to operate on one element at a time. Unless you're working with a very large dataset, though, this should be fine:
>>> import numpy as np
>>> g = np.array( [ [1,2,3], [ 4,5,6] ] )
>>> g
array([[1, 2, 3],
       [4, 5, 6]])
>>> def myfunc( myarray ):
...     return 2 * myarray
...
>>> myfunc(g)
array([[ 2,  4,  6],
       [ 8, 10, 12]])
First, you have a bug in your code in the following line:
a = makeGrid
You are setting a to be a function, not an array. You should have the following:
a = makeGrid(dim)
That is why you had the TypeError when you tried the answer by @abought.
Now, to apply an operation element-wise in numpy there are many possibilities. If you want to perform the same operation for every element in the array, the simplest way is to use array operations:
b = a * 2
(Note that you don't need to declare b beforehand, and you don't need any loops either.) Numpy also has many C-optimised functions that perform the same operation on each element of an array. These are called ufuncs. You can combine ufuncs to evaluate complex expressions element-wise. For example:
b = N.sin(a**2) + N.log(N.abs(a))
Your a array from makeGrid() can also be created much more efficiently using array operations and numpy's mgrid:
grid = N.mgrid[0:dim, 0:dim] - dim/2.0 + 0.5
grid = N.sqrt(grid[0]**2 + grid[1]**2)
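A broadcasting-based variant of the same idea is sketched below, together with the loop version from the question so the two can be compared. The function names make_grid_loops and make_grid_vectorized are made up for illustration, and np.hypot is a substitute for the explicit square root of a sum of squares.

```python
import numpy as np

def make_grid_loops(dim):
    """Loop version, following makeGrid from the question."""
    tab = np.arange(dim) - dim / 2.0 + 0.5
    grid = np.zeros((dim, dim), dtype=float)
    for y in range(dim):
        for x in range(dim):
            grid[y, x] = np.sqrt(tab[x] ** 2 + tab[y] ** 2)
    return grid

def make_grid_vectorized(dim):
    """Same grid without Python loops (hypothetical helper)."""
    offsets = np.arange(dim) - dim / 2.0 + 0.5
    # A row of x offsets and a column of y offsets broadcast to a (dim, dim) grid.
    return np.hypot(offsets[None, :], offsets[:, None])

grid = make_grid_vectorized(10)
print(np.allclose(grid, make_grid_loops(10)))
```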
If you want to perform different operations on each array element, things get more complicated and it may not be possible to avoid loops. For these cases, numpy has a way to collapse the nested loops over an n-D array using ndenumerate or ndindex. Your example with ndenumerate:
for index, x in N.ndenumerate(a):
    b[index] = x * 2
This is faster than multiple loops, but the array operations should be used whenever possible.
From the context of the question, and from what a 2-D array typically means, it looks like you are trying to do the following:
>>> array2d = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
>>> def add_two( v ):
...     return v + 2
...
>>> [ [ add_two( v ) for v in row ] for row in array2d ]
[[2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6]]
The above uses a list comprehension, which is equivalent to two nested for loops. In this case it is more readable and involves less direct use of list methods, because you describe what the list is rather than building it step by step.
Here is a one-liner with a double map:
list(map(lambda x: list(map(func, x)), l))
Example:
l = [[1,2,3],[4,3,1]]
list(map(lambda x: list(map(lambda v: v*10, x)), l))
[[10, 20, 30], [40, 30, 10]]
It is easy to do with a nested loop:
def my_function(n): # n will become y from the next part
    new_num = n + 1 # placeholder that matches the output shown below; do whatever you want with it
    return new_num

my_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # just an example
new_list, final_list = [], [] # multiple assignment
for x in my_list:
    print(x)
    new_list = []
    for y in x:
        # y is now the first value of the first value of my_list--- 1.
        my_num = my_function(y)
        new_list.append(my_num)
    final_list.append(new_list)
print(final_list)
That should do it.
Returns: [[2, 3, 4], [5, 6, 7], [8, 9, 10]].
for (int i = 0; i < x; i++)
    for (int j = 0; j < y; j++)
        array2[i][j] = func(array2[i][j]);
Something like that?

How to split an array according to a condition in numpy?

For example, I have an ndarray:
a = np.array([1, 3, 5, 7, 2, 4, 6, 8])
Now I want to split a into two parts, one is all numbers <5 and the other is all >=5:
[array([1,3,2,4]), array([5,7,6,8])]
Certainly I can traverse a and create two new arrays, but I want to know whether numpy provides a better way.
Similarly, for multidimensional array, e.g.
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[2, 4, 7]])
I want to split it according to whether the first column is <3 or >=3, which results in:
[array([[1, 2, 3],
[2, 4, 7]]),
array([[4, 5, 6],
[7, 8, 9]])]
Is there a better way than traversing it? Thanks.
import numpy as np

def split(arr, cond):
    return [arr[cond], arr[~cond]]

a = np.array([1,3,5,7,2,4,6,8])
print(split(a, a<5))

a = np.array([[1,2,3],[4,5,6],[7,8,9],[2,4,7]])
print(split(a, a[:,0]<3))
This produces the following output:
[array([1, 3, 2, 4]), array([5, 7, 6, 8])]
[array([[1, 2, 3],
[2, 4, 7]]), array([[4, 5, 6],
[7, 8, 9]])]
It might be a quick solution
a = np.array([1,3,5,7])
b = a >= 3 # variable with condition
a[b] # to slice the array
len(a[b]) # count the elements in sliced array
For a 1-D array:
a = numpy.array([2,3,4,...])
a_new = a[(a < 4)] # to get elements less than 4
For a 2-D array, filtering rows on one column (say the value in column i should be less than 5):
a = numpy.array([[1,2],[5,6],...])
a = a[(a[:,i] < 5)]
If your condition involves multiple columns, you can build a new array by applying the conditions to those columns. Then compare that new array with the value 5 (per my assumption) to get the index, and proceed as above.
Note that the expression written inside the parentheses returns a boolean index array.
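One possible reading of the multi-column case is to combine several per-column conditions into a single boolean mask. The sketch below uses made-up conditions on columns 0 and 2 purely for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [2, 4, 7]])

# Rows where column 0 is below 5 AND column 2 is above 3.
# The parentheses are needed because & binds more tightly than the comparisons.
mask = (a[:, 0] < 5) & (a[:, 2] > 3)

selected = a[mask]   # rows that satisfy both conditions
rest = a[~mask]      # all remaining rows

print(selected)
print(rest)
```

Using | instead of & would select rows that satisfy at least one of the conditions.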
