Calculate the mean over a mixed data structure - python

I have a list of lists that looks something like:
data = [
    [1., np.array([2., 3., 4.]), ...],
    [5., np.array([6., 7., 8.]), ...],
    ...
]
where each of the internal lists is the same length and contains the same data type/shape at each entry. I would like to calculate the mean over corresponding entries and return something of the same structure as the internal lists. For example, in the above case (assuming only two entries) I want the result to be:
[3., np.array([4., 5., 6.]), ...]
What is the best way to do this with Python?

data is a list, so a list comprehension seems like a natural option. Even if data were converted to a numpy array, it's jagged, so it wouldn't benefit from being wrapped in an ndarray anyway; a list comprehension would still be the best option, in my opinion.
Anyway, use zip() to "transpose" data and call np.mean() in a loop to find the mean along the first axis.
[np.mean(x, axis=0) for x in zip(*data)]
# [3.0, array([4., 5., 6.]), array([[2., 2.], [2., 2.]])]

If you have a list structured exactly like the one shown in the example, you can do it with the following code.
First we declare some variables to store our results:
number_sum = 0
list_sum = np.zeros(3)
It is important that list_sum starts with one zero per entry of the inner arrays: if those arrays contained 5 elements, it should be list_sum = np.zeros(5). Using np.zeros also makes the accumulator a float array, so the sums are not truncated to integers.
The next step is to sum all the elements in data: first we add the number values, and then we add up each element of the arrays, as follows:
for number, nparray in data:
    number_sum += number
    for index, item in enumerate(nparray):
        list_sum[index] += item
Since we know how the variable data is structured (each row is made up of a number and an np.array), we can do the addition that way. Be careful with the computational complexity, though: with longer arrays this can become expensive, since two for loops are nested.
Finally, you can check that dividing each sum by the length of data gives the desired values:
print(number_sum/len(data))
print(list_sum/len(data))
Now you just have to put those two new values into a new list (a consolidated, runnable version of the whole approach follows below). I hope it helps, greetings and good luck!
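Putting the pieces together, a minimal runnable version of this approach could look like the following (my consolidation, assuming, as in the example, that every row of data is exactly [number, array]):
import numpy as np

data = [
    [1., np.array([2., 3., 4.])],
    [5., np.array([6., 7., 8.])],
]

number_sum = 0.0
list_sum = np.zeros(3)  # one zero per entry of the inner arrays

for number, nparray in data:
    number_sum += number
    for index, item in enumerate(nparray):
        list_sum[index] += item

result = [number_sum / len(data), list_sum / len(data)]
print(result)  # [3.0, array([4., 5., 6.])]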

The following works:
import numpy as np
data = [
    [1., np.array([2., 3., 4.]), np.array([[1., 1.], [1., 1.]])],
    [5., np.array([6., 7., 8.]), np.array([[3., 3.], [3., 3.]])],
]

number_of_samples = len(data)
number_of_elements = len(data[0])

means = []
for ielement in range(number_of_elements):
    mean_list = []
    for isample in range(number_of_samples):
        mean_list.append(data[isample][ielement])
    mean_list = np.stack(mean_list)
    mean = mean_list.mean(axis=0)
    means.append(mean)
print(means)
but it is a bit ugly, nests for loops, and does not seem very pythonic. Any improvements over this are welcome.

Related

Replace values in 2D/3D-np.array with lookup-table (= np.array with 2 columns: key + value)

How can I replace all values in a 2D (or 3D) np.array x using a two-column lookup table lookup (another np.array whose columns are key and replacement value)?
x = np.array([[0., 1., 5., 2.],
              [5., 1., 3., 5.],
              [4., 1., 1., 2.],
              [0., 1., 3., 2.],
              [2., 4., 1., 0.]])
x may also be 3D and the shape is more or less arbitrary.
lookup = np.array([[0, 1.2],
                   [1, 3.4],
                   [2, 0.1],
                   [3, 2.1],
                   [4, 5.4],
                   [5, 2.2]])
Result:
>>> x
array([[1.2, 3.4, 2.2, 0.1],
       [2.2, 3.4, 2.1, 2.2],
       [5.4, 3.4, 3.4, 0.1],
       [1.2, 3.4, 2.1, 0.1],
       [0.1, 5.4, 3.4, 1.2]])
Bonus: Normally all values in x are represented in the first column of lookup. How can I best handle values in x that are not represented in lookup, e.g. by leaving them unchanged or by setting them to nan?
A somewhat inefficient approach so far (only working for 2D, but it could easily be adapted to 3D): iterate through all elements in x and compare them with the keys in lookup.
def replaceByLookup(x, lookup):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            for k in range(lookup.shape[0]):
                if x[i,j] == lookup[k,0]:
                    x[i,j] = lookup[k,1]
                    break
I am looking for a more efficient and maybe simpler solution. I wonder if there isn't a vectorized solution within numpy. It would also be totally ok if the function does not work by reference but return a new array with the replaced values.
You can use np.searchsorted to efficiently locate the row of the associated key in lookup. Then you can get the associated value with simple direct indexing. Here is an example:
lookup[np.searchsorted(lookup[:,0], x),1]
Note that this requires every key to exist in lookup and lookup to be sorted by key. Moreover, you should be careful with floating-point keys, as the values in x may be slightly different from the stored keys; one way to address this is to round the values. Furthermore, the lookup array should not contain special values like np.nan in the keys (that said, they could be supported separately).
Bonus answer:
np.searchsorted searches for the indices at which elements would have to be inserted to maintain order. If a value does not exist, the function returns the index of the next bigger item in the searched array (or the array length, if the value is bigger than everything). You can compare the key found at that index with the searched value to know whether the lookup succeeded. That said, you first need to make sure the index is actually valid, which is a bit cumbersome. Here is the resulting code:
idx = np.searchsorted(lookup[:,0], x)
corrected_idx = np.minimum(idx, len(lookup)-1)
is_valid = lookup[corrected_idx, 0] == x
x[~is_valid] = np.nan
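A minimal sketch combining both parts, looking up known keys and setting unknown ones to np.nan (this combined snippet is mine, assuming lookup is sorted by key):
import numpy as np

x = np.array([[0., 1., 5., 2.],
              [5., 1., 7., 5.]])          # 7. has no key in lookup
lookup = np.array([[0, 1.2], [1, 3.4], [2, 0.1],
                   [3, 2.1], [4, 5.4], [5, 2.2]])

idx = np.searchsorted(lookup[:, 0], x)
idx = np.minimum(idx, len(lookup) - 1)    # clamp out-of-range indices
is_valid = lookup[idx, 0] == x            # does the found key match the value?
result = np.where(is_valid, lookup[idx, 1], np.nan)
print(result)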
Couldn't you use a Python dictionary in place of the lookup array? Dictionaries are very efficient for lookups, much better than searching through the array on every iteration.
It would be very worthwhile to convert the lookup array into a Python dictionary first and then just assign from that dictionary. The code would be very simple, require no searching, and would benefit from the dictionary's efficient random access.
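A minimal sketch of that suggestion; the dictionary comprehension and the use of np.vectorize are illustrative rather than from the original comment, and unknown keys fall back to np.nan:
import numpy as np

x = np.array([[0., 1., 5., 2.],
              [5., 1., 3., 5.]])
lookup = np.array([[0, 1.2], [1, 3.4], [2, 0.1],
                   [3, 2.1], [4, 5.4], [5, 2.2]])

lookup_dict = {key: value for key, value in lookup}            # float key -> value
replace = np.vectorize(lambda v: lookup_dict.get(v, np.nan))   # unknown keys -> nan
print(replace(x))                                              # same shape as x, works for 2D or 3D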

Populate with new value(s) to a fixed-shape numpy array filled with zeros

Given that a numpy array is stored contiguously, if we try to append or extend to it then this happens not in-place but, instead, a new copy of the array is created with adequate 'room' for the append or extend to occur contiguously (see https://stackoverflow.com/a/13215559/3286832).
To avoid that, and assuming we are lucky enough to know the specific number of elements we expect the array to have, we can create a numpy array with a fixed size filled with zeros:
import numpy as np
a = np.zeros(shape=(100,)) # [0. 0. 0. ... 0. 0. 0.]
Say that we want to populate this array each element with a new value each time (e.g. provided by the user) by editing this array in-place:
pos = 0
a[pos] = 0.002 # [0.002 0. 0. ... 0. 0. 0.]
pos = pos + 1
a[pos] = 0.101 # [0.002 0.101 0. ... 0. 0. 0.]
# etc.
pos = -1
a[pos] = 42.00 # [0.002 0.101 ... ... ... 42.]
Question:
Is there a way to keep track of the next available position pos (i.e. the next position not yet populated with an input value) instead of manually incrementing pos each time?
Is there a way of achieving this efficiently, preferably in numpy? Or is there a way of achieving this in another Python library (e.g. scipy or pandas)?
(I edited the question according to the comments and initial answers, which pointed out how unclearly my initial question was phrased. I hope it is clearer now.)
Actually, your question is still confusing to me. How do you define the new value you want to insert at the new position? Does it come from outside your code? Do you have all the new values for your array, or only part of them?
You can probably use slices in numpy, which are exactly meant for fast updates of an array, but I'm not entirely sure that this is what you want to do.
Some samples for you:
>>> import numpy as np
>>> a = np.zeros(shape=(10,))
>>> a
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> a[3:6] += 1
>>> a
array([0., 0., 0., 1., 1., 1., 0., 0., 0., 0.])
>>> a[:4] += .001
>>> a
array([1.000e-03, 1.000e-03, 1.000e-03, 1.001e+00, 1.000e+00, 1.000e+00,
0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00])
>>> a[3:5] = [2, 1]
>>> a
array([1.e-03, 1.e-03, 1.e-03, 2.e+00, 1.e+00, 1.e+00, 0.e+00, 0.e+00,
0.e+00, 0.e+00])
>>>
If I understand you correctly, you need some kind of circular buffer. Python has collections.deque for this purpose.
Here is my custom implementation of a circular buffer using h5py, but you can change it to numpy.
Update: As already mentioned in the comments, it is impossible to track changes to an np.array out of the box. Instead, you can implement your own class and track all the necessary changes there (see my implementation as an example, which concatenates arrays to extend the size). I'd suggest using a Python list if you need appending, or a deque if you need a fixed size; both can then be converted to an np.array.
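As a minimal sketch of the "implement your own class" suggestion, here is a tiny fixed-size buffer backed by np.zeros that remembers the next free position; the class name and API are just an illustration, not part of the original answer:
import numpy as np

class FixedBuffer:
    def __init__(self, size):
        self.data = np.zeros(shape=(size,))
        self.pos = 0                     # next position to be filled

    def append(self, value):
        if self.pos >= self.data.shape[0]:
            raise IndexError("buffer is full")
        self.data[self.pos] = value      # in-place write, no reallocation
        self.pos += 1

buf = FixedBuffer(100)
buf.append(0.002)
buf.append(0.101)
print(buf.data[:3])                      # [0.002 0.101 0.   ]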

Deleting multiple elements at once from a numpy 2d array

Is there a way to delete from a numpy 2d array when I have the indexes? For example:
a = np.random.random((4,5))
idxs = [(0,1), (1,3), (2, 1), (3,4)]
I want to remove the indexes specified above. I tried:
np.delete(a, idxs)
but it just removes the top row.
To give an example, for the following input:
[
[0.15393912, 0.08129568, 0.34958515, 0.21266128, 0.92372852],
[0.42450441, 0.1027468 , 0.13050591, 0.60279229, 0.41168151],
[0.06330729, 0.60704682, 0.5340644 , 0.47580567, 0.42528617],
[0.27122323, 0.42713967, 0.94541073, 0.21462462, 0.07293321]
]
and with the indexes as mentioned above, I want the result to be:
[
[0.15393912, 0.34958515, 0.21266128, 0.92372852],
[0.42450441, 0.1027468 , 0.13050591, 0.41168151],
[0.06330729, 0.5340644 , 0.47580567, 0.42528617],
[0.27122323, 0.42713967, 0.94541073, 0.21462462]
]
Without an axis argument, np.delete works on the flattened array, so your indices need to be flat indices; otherwise it can only remove whole rows or columns.
Here is how you can convert the indices and use them:
arr = np.array([
    [0.15393912, 0.08129568, 0.34958515, 0.21266128, 0.92372852],
    [0.42450441, 0.1027468 , 0.13050591, 0.60279229, 0.41168151],
    [0.06330729, 0.60704682, 0.5340644 , 0.47580567, 0.42528617],
    [0.27122323, 0.42713967, 0.94541073, 0.21462462, 0.07293321]
])
idxs = [(0, 1), (1, 3), (2, 1), (3, 4)]
idxs = [i * arr.shape[1] + j for i, j in idxs]
np.delete(arr, idxs).reshape(4, 4)
For the reshape to work you must delete the same number of items from each row, so that the remaining elements still form a rectangular array.
Numpy doesn't know that you are removing exactly one element per row when you give it arbitrary indices like that. Since you do know that, I would suggest using a mask to shrink the array. Masking has the same problem: it doesn't assume anything about the shape of the result (because it can't in general), and returns a raveled array. You can reinstate the shape you want quite easily, though. In fact, I would suggest dropping the row index from each pair entirely and passing only the column indices, since there is exactly one per row:
def remove_indices(a, idx):
    # idx holds one column index per row of a
    if len(idx) != len(a):
        raise ValueError('Wrong number of indices')
    mask = np.ones(a.shape, dtype=np.bool_)
    mask[np.arange(len(idx)), idx] = False
    return a[mask].reshape(a.shape[0], a.shape[1] - 1)
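For example, with the array from the question (this usage snippet is mine, not part of the original answer; only the column index of each pair is passed in):
import numpy as np

a = np.random.random((4, 5))
idxs = [(0, 1), (1, 3), (2, 1), (3, 4)]
cols = [j for _, j in idxs]               # one column index per row
print(remove_indices(a, cols).shape)      # (4, 4)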
Here is a method using np.where
import numpy as np
import operator as op
a = np.arange(20.0).reshape(4,5)
idxs = [(0,1), (1,3), (2, 1), (3,4)]
m,n = a.shape
# extract column indices
# there are simpler ways but this is fast
columns = np.fromiter(map(op.itemgetter(1),idxs),int,m)
# build decimated array
result = np.where(columns[...,None]>np.arange(n-1),a[...,:-1],a[...,1:])
result
# array([[ 0.,  2.,  3.,  4.],
#        [ 5.,  6.,  7.,  9.],
#        [10., 12., 13., 14.],
#        [15., 16., 17., 18.]])
As the documentation says:
Return a new array with sub-arrays along an axis deleted.
np.delete deletes a row or a column based on the value of the axis parameter. Secondly, np.delete expects an int or an array of ints as the index parameter, not a list of tuples.
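To illustrate what the axis parameter does (this small example is mine, not from the documentation):
import numpy as np

a = np.arange(12).reshape(3, 4)
print(np.delete(a, 1, axis=0))   # removes the second row
print(np.delete(a, 1, axis=1))   # removes the second column
print(np.delete(a, 1))           # no axis: removes element 1 of the flattened array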
You need to specify what exactly the requirement is.
As @Divakar suggested, look at other answers on Stack Overflow regarding deleting individual items from a numpy array.

How does numpy array typing interact with object?

I am currently trying to implement a datatype that stores floats in a numpy array. However, trying to build such an array from rows of various lengths obviously breaks the code: one would have to assign a sequence to a single array element, which is not possible.
One can bypass this by using the data type object instead of float. Why is that? How could one resolve this problem using floats, without creating a sequence?
Example code that does not work.
from numpy import *
foo= dtype(float32, [])
x = array([[2., 3.], [3.]], dtype=foo)
Example code that does work:
from numpy import *
foo= dtype(float32, [])
x = array([[2., 3.], [3., 2.]], dtype=foo)
Example code that does work, I try to replicate for float:
from numpy import *
foo= dtype(object, [])
x = array([[2., 3.], [3.]], dtype=foo)
The object dtype in Numpy simply creates an array of pointers to Python objects. This means you lose the performance advantage you usually get from Numpy, but it's still sometimes useful to do this.
Your last example creates a one-dimensional Numpy array of length two, so that's two pointers to Python objects. Both of these objects happen to be lists, and Python lists have arbitrary dynamic length.
I don't know what you were trying to achieve with this, but note that
>>> np.dtype(np.float32, []) == np.float32
True
Arrays require the same number of elements in each row. So, if you feed a list of lists into numpy and all the sublists have the same number of elements, it will happily convert it to an array. This is why your second example works.
If the sublists are not the same length, then each sublist is treated as a single object and you end up with a 1D array of objects. This is why your third example works. Your first example doesn't work because you try to cast a sequence of objects to floats, which isn't possible.
In short, you can't create an array of floats if your sublists are of different lengths. At best, you can create an array of 1D arrays, since they are still considered objects.
>>> x = np.array(list(map(np.array, [[2., 3.], [3.]])))
>>> x
array([array([ 2., 3.]), array([ 3.])], dtype=object)
>>> x[0]
array([ 2., 3.])
>>> x[0][1]
3.0
>>> # but you can't do this
>>> x[0,1]
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
x[0,1]
IndexError: too many indices for array
If you're bent on creating a float 2D array, you have to extend all your sublists to the same size with None, which will be converted to np.nan.
>>> lists = [[2., 3.], [3.]]
>>> max_len = max(map(len, lists))
>>> for i, sublist in enumerate(lists):
        sublist = sublist + [None] * (max_len - len(sublist))
        lists[i] = sublist
>>> np.array(lists, dtype=np.float32)
array([[ 2., 3.],
[ 3., nan]], dtype=float32)

More pythonic way of getting a mean value of an array

I am still having trouble adjusting to 'more pythonic' ways of writing code sometimes ... right now I am iterating over some values (x). I have many arrays, and I always compare the first value of all the arrays, then the second value, and so on. In short: I want the mean of the entries across all the arrays, position by position.
sum_mean_x = []
mean_x = []
for i in range(0, int_points):
    for j in range(0, len(x)):
        mean_x.append(x[j][i])
    sum_mean_x.append(sum(mean_x)/len(x))
    mean_x = []
I am pretty sure this can be done much more elegantly. I know I could change the second-to-last line to something like sum_mean_x.append(mean_x.mean), but I guess I am missing some serious magic this way.
Use the numpy package for numeric processing. Suppose you have the following three lists in plain Python:
a1 = [1., 4., 6.]
a2 = [3., 7., 3.]
a3 = [2., 0., -1.]
And you want to get the mean value for each position. Arrange the vectors in a single array:
import numpy as np
a = np.array([a1, a2, a3])
Then you can get the per-column mean like this:
>>> a.mean(axis=0)
array([ 2. , 3.66666667, 2.66666667])
It sounds like what you're trying to do is treat your list of lists as a 2D array where each list is a row, and then average each column.
The obvious way to do this is to use NumPy, make it an actual 2D array, and just call mean by columns. See simleo's answer, which is better than what I was going to add here. :)
But if you want to stick with lists of lists, going by column effectively means transposing, and that means zip:
>>> from statistics import mean
>>> arrs = [[1., 2., 3.], [0., 0., 0.], [2., 4., 6.]]
>>> column_means = [mean(col) for col in zip(*arrs)]
>>> column_means
[1.0, 2.0, 3.0]
That statistics.mean is only in the stdlib in 3.4+, but it's based on stats on PyPI, and if your Python is too old even for that, you can write it on your own. Getting the error handling right on the edge cases is tricky, so you probably want to look at the code from statistics, but if you're only dealing with values near 1, you can just do it the obvious way:
def mean(iterable):
    total, length = 0.0, 0
    for value in iterable:
        total += value
        length += 1
    return total / length
ar1 = [1, 2, 3, 4, 5, 6]
ar2 = [3, 5, 7, 2, 5, 7]
means = [(i + j) / 2.0 for (i, j) in zip(ar1, ar2)]
print(means)
You mean something like
import numpy as np

ar1 = [1, 2, 3, 4, 5, 6]
ar2 = [3, 5, 7, 2, 5, 7]

mean_list = []
for i, j in zip(ar1, ar2):
    mean_list.append(np.array([i, j]).mean())

print(mean_list)
[2.0, 3.5, 5.0, 3.0, 5.0, 6.5]
