Mean value of each element in multiple lists - Python

If I have two lists
a = [2,5,1,9]
b = [4,9,5,10]
How can I find the mean value of each element, so that the resultant list would be:
[3,7,3,9.5]

>>> a = [2,5,1,9]
>>> b = [4,9,5,10]
>>> [(g + h) / 2 for g, h in zip(a, b)]
[3.0, 7.0, 3.0, 9.5]
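The same idea generalizes to any number of lists, since zip() accepts any number of iterables; a minimal sketch, assuming all lists have equal length (the third list is made up for illustration):
a = [2, 5, 1, 9]
b = [4, 9, 5, 10]
c = [6, 1, 6, 2]
# zip groups the i-th elements of every list together
[sum(vals) / len(vals) for vals in zip(a, b, c)]
# [4.0, 5.0, 4.0, 7.0]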

Referring to the title of your question, you can achieve this simply with:
import numpy as np
multiple_lists = [[2,5,1,9], [4,9,5,10]]
arrays = [np.array(x) for x in multiple_lists]
[np.mean(k) for k in zip(*arrays)]
The above script handles multiple lists, not just two. If you want to compare the performance of the two approaches, try:
%%time
import random
import statistics
random.seed(33)
multiple_list = []
for seed in random.sample(range(100), 100):
    random.seed(seed)
    multiple_list.append(random.sample(range(100), 100))
result = [statistics.mean(k) for k in zip(*multiple_list)]
or alternatively:
%%time
import random
import numpy as np
random.seed(33)
multiple_list = []
for seed in random.sample(range(100), 100):
    random.seed(seed)
    multiple_list.append(np.array(random.sample(range(100), 100)))
result = [np.mean(k) for k in zip(*multiple_list)]
In my experience, the NumPy approach is much faster.
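Note that [np.mean(k) for k in zip(*multiple_list)] still loops in Python. A fully vectorized variant (a sketch, assuming the inner lists all have equal length) is usually faster still:
import numpy as np
multiple_list = [[2, 5, 1, 9], [4, 9, 5, 10]]  # or the benchmark data built above
result = np.mean(np.array(multiple_list), axis=0)  # column-wise means in one call
# array([3. , 7. , 3. , 9.5])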

What you want is the mean of two arrays (or vectors in math).
Since Python 3.4, there is a statistics module which provides a mean() function:
statistics.mean(data)
Return the sample arithmetic mean of data, a sequence or iterator of real-valued numbers.
You can use it like this:
import statistics
a = [2, 5, 1, 9]
b = [4, 9, 5, 10]
result = [statistics.mean(k) for k in zip(a, b)]
# -> [3.0, 7.0, 3.0, 9.5]
Note: this solution can be used for more than two lists, because zip() accepts any number of arguments.
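For example, if the lists are collected in one container, you can unpack it with the * operator (a sketch; the third list is made up for illustration):
import statistics
lists = [[2, 5, 1, 9], [4, 9, 5, 10], [6, 1, 6, 2]]
result = [statistics.mean(k) for k in zip(*lists)]
# -> [4, 5, 4, 7]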

An alternative to using lists and a for loop is to use a NumPy array.
import numpy as np
# an array can perform element-wise calculations, unlike lists
a, b = np.array([2, 5, 1, 9]), np.array([4, 9, 5, 10])
mean = (a + b) / 2
print(mean)
# [3.  7.  3.  9.5]

Put the two lists into a numpy array using vstack and then take the mean (using 'tolist' to get back from the numpy array):
import numpy as np
a = [2,5,1,9]
b = [4,9,5,10]
np.mean(np.vstack([a,b]), axis=0).tolist()
[3.0, 7.0, 3.0, 9.5]

It seems you are looking for an element-wise mean. Setting axis=0 in np.mean is what you need.
>>> import numpy as np
>>> a = [2,5,1,9]
>>> b = [4,9,5,10]
Create a list containing all your lists
>>> a_b = [a,b]
>>> a_b
[[2, 5, 1, 9], [4, 9, 5, 10]]
Use np.mean and set the axis to 0
>>> np.mean(a_b, axis=0)
array([3. , 7. , 3. , 9.5])

Related

Compute stats function on non-overlapping day-wide time window with Pandas

Preamble
How can I apply a function to a list with a non-overlapping sliding window? E.g., for data = {x_1, x_2, ..., x_n}, applying f with window size 2 gives {f(x_1, x_2), f(x_3, x_4), ..., f(x_{n-1}, x_n)}.
I understand that I can partition the list and use map on the partitioned list. But are there more efficient ways to handle this operation, especially for ndarrays and dataframes? Something analogous to Mathematica's BlockMap.
Question
The ultimate goal of this is: suppose the dataframe is a time series with values for each hour of the day. How can I apply a function (e.g. mean, variance) for each day, i.e. function blockmaps with a non-overlapping window of 24 hour size?
EDIT 1:
Here is some code that builds a pandas dataframe:
import pandas as pd
import numpy as np
dat = np.random.uniform(0,10,40)
xpd = pd.DataFrame(dat)
xpd.rename(columns = {0:'new_name'}, inplace = True)
date_rng = pd.date_range(start='1/1/2018 03:00:00', periods=40, freq='H')
xpd.set_index(date_rng, inplace=True)
How can I calculate the variance for each day from the hourly data and return it as a dataframe?
I tried the below line but it didn't work:
xpd.groupby(by=lambda x: pd.Series.dt.floor(x, freq='d'))
EDIT 2
This worked, problem seems to be solved:
xpd.groupby(by=lambda x: x.floor('d')).var()
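As an aside (not in the original post), pandas also offers resample(), the idiomatic tool for fixed calendar windows, which expresses the same daily variance directly:
import pandas as pd
import numpy as np
dat = np.random.uniform(0, 10, 40)
date_rng = pd.date_range(start='1/1/2018 03:00:00', periods=40, freq='H')
xpd = pd.DataFrame({'new_name': dat}, index=date_rng)
daily_var = xpd.resample('D').var()  # one variance per calendar day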
(EDIT: answered when the question was without edits and was titled: map a function with non-overlapping window on a dataframe or ndarray.)
One way, assuming that n is always even, is:
def pairwise_map(func, items):
    iterators = [iter(items)] * 2
    return map(func, zip(*iterators))
list(pairwise_map(sum, range(10)))
# [1, 5, 9, 13, 17]
This consists of two steps: the separation into groups and the mapping.
A more general version of the group separation can be found in flyingcircus.base.group_by().
(Disclaimer: I am the main author of the package).
While the above works for the general case, if you have a NumPy array arr and the function func() is vectorized, one can simply use:
import numpy as np
arr = np.arange(10)
def func(x, y):
    return x + y
func(arr[::2], arr[1::2])
# array([ 1, 5, 9, 13, 17])
EDIT
This can be generalized to any size, e.g.:
def pairwise_map(func, items, window=2):
    iterators = [iter(items)] * window
    return map(func, zip(*iterators))
list(pairwise_map(sum, range(10), 3))
# [3, 12, 21]
This obviously relies on func() being able to accept the correct (or a variable) number of arguments.
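One caveat: zip() silently drops a trailing incomplete group. A sketch that keeps it, based on itertools.zip_longest (the name blockwise_map and the fillvalue default are mine, not from the original):
from itertools import zip_longest

def blockwise_map(func, items, window=2, fillvalue=0):
    # zip_longest keeps the final incomplete group, padded with fillvalue
    iterators = [iter(items)] * window
    return map(func, zip_longest(*iterators, fillvalue=fillvalue))

list(blockwise_map(sum, range(10), 3))
# [3, 12, 21, 9]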
Similarly, for NumPy arrays and NumPy-aware functions:
import numpy as np
arr = np.arange(9)
def func(*args):
    return sum(args)
window = 3
func(*(arr[i::window] for i in range(window)))
# array([ 3, 12, 21])
Note that this requires len(arr) % window == 0.
For NumPy functions that support the axis keyword (e.g. np.mean(), np.std(), etc.), one can simply use the following reshaping trick:
import numpy as np
arr = np.arange(56)
window = 8
np.mean(arr.reshape(-1, window), axis=1)
# array([ 3.5, 11.5, 19.5, 27.5, 35.5, 43.5, 51.5])
Note that this also strictly requires len(arr) % window == 0, which can be enforced with e.g. np.concatenate() to pad zeros at the end of the input:
import numpy as np
arr = np.arange(53)
window = 8
remainder = len(arr) % window
padder = np.zeros(window - remainder if remainder else 0, dtype=arr.dtype)
np.mean(np.concatenate((arr, padder)).reshape(-1, window), axis=1)
# array([ 3.5 , 11.5 , 19.5 , 27.5 , 35.5 , 43.5 , 31.25])
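Note that the zero padding biases the last value (31.25 rather than the true mean of the final five elements). A sketch that pads with NaN and uses np.nanmean instead yields the unbiased mean of each window:
import numpy as np
arr = np.arange(53)
window = 8
remainder = len(arr) % window
padder = np.full(window - remainder if remainder else 0, np.nan)
np.nanmean(np.concatenate((arr, padder)).reshape(-1, window), axis=1)
# array([ 3.5, 11.5, 19.5, 27.5, 35.5, 43.5, 50. ])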

Aggregate elements based on position vector

I'm trying to vectorize a very simple operation but can't seem to figure out how.
Given a very large numerical vector (over 1M positions) and another array of size n containing lists of positions, I would like to get back a vector of size n whose elements are the averages of the values of the first vector at the positions specified by the second:
a = np.array([1,2,3,4,5,6,7])
b = np.array([[0,1],[2],[3,5],[4,6]], dtype=object)  # dtype=object is needed for ragged arrays
c = [1.5,3,5,6]
I need to repeat this operation many times so performance is an issue.
Vanilla python solution:
import numpy as np
import time
a = np.array([1,2,3,4,5,6,7])
b = np.array([[0,1],[2],[3,5],[4,6]], dtype=object)  # dtype=object for the ragged index lists
begin = time.time()
for i in range(100000):
    c = []
    for d in b:
        c.append(np.mean(a[d]))
print(time.time() - begin, c)
# 3.7529971599578857 [1.5, 3.0, 5.0, 6.0]
I'm not sure if this is necessarily faster but you may as well try:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([[0, 1], [2], [3, 5], [4, 6]], dtype=object)
# Get the length of each subset of indices
lens = np.fromiter((len(bi) for bi in b), count=len(b), dtype=np.int32)
# Compute reduction indices
reduce_idx = np.roll(np.cumsum(lens), 1)
reduce_idx[0] = 0
# Make flattened array of index lists
idx = np.fromiter((i for bi in b for i in bi), count=lens.sum(), dtype=np.int32)
# Reorder according to indices
a2 = a[idx]
# Sum reordered array at reduction indices and divide by number of indices
c = np.add.reduceat(a2, reduce_idx) / lens
print(c)
# [1.5 3. 5. 6. ]
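Another fully vectorized option (my sketch, not from the original answers) replaces reduceat with np.bincount, which sums weights per group id:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7])
b = [[0, 1], [2], [3, 5], [4, 6]]
# Flatten the index lists and record which group each index belongs to
idx = np.concatenate([np.asarray(bi) for bi in b])
groups = np.repeat(np.arange(len(b)), [len(bi) for bi in b])
# Sum the selected values per group, then divide by the group sizes
c = np.bincount(groups, weights=a[idx]) / np.bincount(groups)
print(c)
# [1.5 3.  5.  6. ]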

Decrease array size by averaging adjacent values with numpy

I have a large array of thousands of vals in numpy. I want to decrease its size by averaging adjacent values.
For example:
a = [2,3,4,8,9,10]
#average down to 2 values here
a = [3,9]
#it averaged 2,3,4 and 8,9,10 together
So, basically, I have n elements in an array, and I want to tell it to average down to X values, and it averages like above.
Is there some way to do that with numpy (I'm already using it for other things, so I'd like to stick with it)?
Using reshape and mean, you can average every m adjacent values of a 1D array of size N*m, with N being any positive integer. For example:
import numpy as np
m = 3
a = np.array([2, 3, 4, 8, 9, 10])
b = a.reshape(-1, m).mean(axis=1)
#array([3., 9.])
1) a.reshape(-1, m) creates a 2D view of the array without copying data:
array([[ 2, 3, 4],
[ 8, 9, 10]])
2) taking the mean along the second axis (axis=1) then calculates the mean value of each row, resulting in:
array([3., 9.])
Try this:
n_averaged_elements = 3
averaged_array = []
a = np.array([ 2, 3, 4, 8, 9, 10])
for i in range(0, len(a), n_averaged_elements):
    slice_from_index = i
    slice_to_index = slice_from_index + n_averaged_elements
    averaged_array.append(np.mean(a[slice_from_index:slice_to_index]))
>>> averaged_array
[3.0, 9.0]
Looks like a simple non-overlapping moving window average to me, how about:
In [3]:
import numpy as np
a = np.array([2,3,4,8,9,10])
window_sz = 3
a[:len(a)//window_sz*window_sz].reshape(-1,window_sz).mean(1)
# you want to be sure your array can be reshaped properly, hence the [:len(a)//window_sz*window_sz] part
Out[3]:
array([ 3., 9.])
In this example, I presume that a is the 1D numpy array that needs to be averaged. In the method given below, we first find the factors of the length of this array a, and then choose an appropriate factor as the step size to average the array with.
Here is the code.
import numpy as np
from functools import reduce
''' Function to find the factors of a given number n '''
def factors(n):
    # sorted, so that indexing into the factor list is predictable
    return sorted(set(reduce(list.__add__,
        ([i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0))))
a = [2,3,4,8,9,10] #Given array.
'''fac: list of factors of length of a.
In this example, len(a) = 6. So, fac = [1, 2, 3, 6] '''
fac = factors(len(a))
'''step: choose an appropriate step size from the list 'fac'.
In this example, we choose one of the middle numbers in fac
(3). '''
step = fac[int(len(fac)/3) + 1]
'''avg: initialize an empty array. '''
avg = np.array([])
for i in range(0, len(a), step):
    avg = np.append(avg, np.mean(a[i:i+step]))  # append averaged values to avg
print(avg)  # prints the final result
# [3. 9.]
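If the length of a is not divisible by the number of chunks you want, a sketch based on np.array_split (which allows uneven chunks) still works:
import numpy as np
a = np.array([2, 3, 4, 8, 9, 10, 7])  # length 7, not divisible by 2
x = 2  # desired number of output values
np.array([chunk.mean() for chunk in np.array_split(a, x)])
# array([4.25      , 8.66666667])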

How to normalize a NumPy array to a unit vector?

I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm
This function handles the situation where vector v has the norm value of 0.
Are there any similar functions provided in sklearn or numpy?
If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True
I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np
def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)
A = np.random.randn(3,3,3)
print(normalized(A,0))
print(normalized(A,1))
print(normalized(A,2))
print(normalized(np.arange(3)[:,None]))
print(normalized(np.arange(3)))
This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but fails when v is the zero vector (norm 0).
In that case, introducing a small constant to prevent the zero division solves this.
As proposed in the comments one could also use
v/np.linalg.norm(v)
To avoid zero division I use eps, but that's maybe not great.
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    return v / norm
If you have multidimensional data and want each axis normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
    # d is an (n x dimension) np array
    d = _d if not copy else np.copy(_d)
    d -= np.min(d, axis=0)
    d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
    return d
This uses NumPy's peak-to-peak function, np.ptp().
a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0)  # array([1., 1., 1.]): each column sums to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0)  # array([1., 1., 1.]): the max of each column is 1
If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)
You mentioned sci-kit learn, so I want to share another solution.
sci-kit learn MinMaxScaler
In sci-kit learn, there is an API called MinMaxScaler which lets you customize the value range as you like.
It also deals with NaN issues for us.
NaNs are treated as missing values: disregarded in fit, and maintained
in transform. ... see reference [1]
Code sample
The code is simple, just type
# Let's say X_train is your input dataframe
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# create a MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up if you need a dataframe
df = pd.DataFrame(X_train_norm)
Reference
[1] sklearn.preprocessing.MinMaxScaler
There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import transformations as trafo
import numpy as np
data = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))
If you work with multidimensional arrays, the following fast solution is possible.
Say we have a 2D array that we want to normalize along the last axis, while some rows have zero norm.
import numpy as np
arr = np.array([
    [1, 2, 3],
    [0, 0, 0],
    [5, 6, 7]
], dtype=float)  # dtype=np.float was removed from NumPy; use the builtin float
lengths = np.linalg.norm(arr, axis=-1)
print(lengths) # [ 3.74165739 0. 10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
# [0. 0. 0. ]
# [0.47673129 0.57207755 0.66742381]]
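The same zero-safe division can also be written without fancy indexing, using the where argument of np.divide (a sketch under the same setup):
import numpy as np
arr = np.array([[1., 2., 3.], [0., 0., 0.], [5., 6., 7.]])
lengths = np.linalg.norm(arr, axis=-1, keepdims=True)
# divide only where the norm is nonzero; zero rows stay zero
unit = np.divide(arr, lengths, out=np.zeros_like(arr), where=lengths > 0)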
If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()
If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print(np.all(norm1 == norm2))
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.
Without sklearn and using just numpy: define a function, assuming that the rows are the variables and the columns the samples (axis=1):
import numpy as np
# Example array
X = np.array([[1,2,3],[4,5,6]])
def stdmtx(X):
    means = X.mean(axis=1)
    stds = X.std(axis=1, ddof=1)
    X = X - means[:, np.newaxis]
    X = X / stds[:, np.newaxis]
    return np.nan_to_num(X)
output:
X
array([[1, 2, 3],
[4, 5, 6]])
stdmtx(X)
array([[-1., 0., 1.],
[-1., 0., 1.]])
For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
If you want all values in [0, 1] for a 1d array, just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
where a is your 1d array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
A note on this method: to preserve the proportions between values there is a restriction: the 1d array must contain at least one 0 and consist only of 0 and positive numbers.
A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x

Find the min/max excluding zeros in a numpy array (or a tuple) in python

I have an array. The valid values are nonzero (either positive or negative). I want to find the minimum and maximum within the array without taking zeros into account. For example, if the numbers are all negative, zeros will be problematic.
How about:
import numpy as np
minval = np.min(a[np.nonzero(a)])
maxval = np.max(a[np.nonzero(a)])
where a is your array.
If you can choose the "invalid" value in your array, it is better to use nan instead of 0:
>>> a = numpy.array([1.0, numpy.nan, 2.0])
>>> numpy.nanmax(a)
2.0
>>> numpy.nanmin(a)
1.0
If this is not possible, you can use an array mask:
>>> a = numpy.array([1.0, 0.0, 2.0])
>>> masked_a = numpy.ma.masked_equal(a, 0.0, copy=False)
>>> masked_a.max()
2.0
>>> masked_a.min()
1.0
Compared to Josh's answer using advanced indexing, this has the advantage of not creating a copy of the array.
Here's another way of masking which I think is easier to remember (although it does copy the array). For the case in point, it goes like this:
>>> import numpy
>>> a = numpy.array([1.0, 0.0, 2.0])
>>> ma = a[a != 0]
>>> ma.max()
2.0
>>> ma.min()
1.0
>>>
It generalizes to other expressions such as a > 0, numpy.isnan(a), ...
And you can combine masks with standard operators (+ means OR, * means AND, ~ means NOT), e.g.:
# Identify elements that are outside interpolation domain or NaN
outside = (xi < x[0]) + (eta < y[0]) + (xi > x[-1]) + (eta > y[-1])
outside += numpy.isnan(xi) + numpy.isnan(eta)
inside = ~outside
xi = xi[inside]
eta = eta[inside]
You could use a generator expression to filter out the zeros:
array = [-2, 0, -4, 0, -3, -2]
max(x for x in array if x != 0)
Masked arrays in general are designed exactly for this kind of purpose. You can leverage masking zeros from an array (or ANY other kind of mask you desire, even masks that are more complicated than a simple equality) and do pretty much all the stuff you do on regular arrays on your masked array. You can also specify an axis along which you wish to find the min:
import numpy.ma as ma
mx = ma.masked_array(x, mask=x==0)
mx.min()
Example input:
x = np.array([1.0, 0.0, 2.0])
output:
1.0
A simple way would be to use a list comprehension to exclude zeros.
>>> tup = (0, 1, 2, 5, 2)
>>> min([x for x in tup if x !=0])
1
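One caveat: if the filter removes every element (i.e. the input is all zeros), min() raises ValueError on the empty sequence. Since Python 3.4 you can supply a default instead:
>>> tup = (0, 0)
>>> min((x for x in tup if x != 0), default=0)
0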
