The question of how to compute a running mean of a series of numbers has been asked and answered before. However, I am trying to compute the running mean of a series of ndarrays, with an unknown length of series. So, for example, I have an iterator data where I would do:
running_mean = np.zeros((1000,3))
while True:
    datum = next(data)
    running_mean = calc_running_mean(datum)
What would calc_running_mean look like? My primary concern here is memory, as I can't have the entirety of the data in memory, and I don't know how much data I will be receiving. datum would be an ndarray, let's say that for this example it's a (1000,3) array, and the running mean would be an array of the same size, with each element containing the elementwise mean of every element we've seen in that position so far.
The key distinction this question has from previous questions is that it's calculating the elementwise mean of a series of ndarrays, and the number of arrays is unknown.
You can use itertools together with standard operators:
>>> import itertools, operator
>>> running_sum = itertools.accumulate(data)
>>> running_mean = map(operator.truediv, running_sum, itertools.count(1))
Example:
>>> data = (np.linspace(-i, i*i, 6) for i in range(10))
>>>
>>> running_sum = itertools.accumulate(data)
>>> running_mean = map(operator.truediv, running_sum, itertools.count(1))
>>>
>>> for i in running_mean:
...     print(i)
...
[0. 0. 0. 0. 0. 0.]
[-0.5 -0.3 -0.1 0.1 0.3 0.5]
[-1. -0.46666667 0.06666667 0.6 1.13333333 1.66666667]
[-1.5 -0.5 0.5 1.5 2.5 3.5]
[-2. -0.4 1.2 2.8 4.4 6. ]
[-2.5 -0.16666667 2.16666667 4.5 6.83333333 9.16666667]
[-3. 0.2 3.4 6.6 9.8 13. ]
[-3.5 0.7 4.9 9.1 13.3 17.5]
[-4. 1.33333333 6.66666667 12. 17.33333333 22.66666667]
[-4.5 2.1 8.7 15.3 21.9 28.5]
I am trying to find "missing" values in a python array of floats.
Such that in this case [1.1, 1.3, 2.1, 2.2, 2.3] I would like to print "1.2"
I dont have much experience with floats, I have tried something like this How to find a missing number from a list? but it doesn't work on floats.
Thanks!
To solve this, the problem needs to be simplified first. I am assuming that all the values are floats with one decimal place, that there can be multiple ranges such as 1.1-1.3 and 2.1-2.3, and that the numbers are in sorted order. Here is a solution, written in Python 3:
vals = [1.1, 1.3, 2.1, 2.2, 2.3]  # the values in which to find the missing number
# The logic starts from here
for i in range(len(vals) - 1):
    # round() guards against floating-point drift in the 10x scaling
    if round(vals[i + 1] * 10) - round(vals[i] * 10) == 2:
        print((round(vals[i] * 10) + 1) / 10)
print("\nfinished")
You might want to use numpy.arange (https://numpy.org/doc/stable/reference/generated/numpy.arange.html) to create a list of floats, if you know the start, end, and step values.
Then you can create two sets and use their difference to find the missing values.
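A sketch of that set-difference idea, assuming the start, end, and step are known (the half-step padding on the stop value and the rounding are there to guard against float drift in np.arange):

```python
import numpy as np

vals = [1.1, 1.3]
# build the expected full range; pad the stop by half a step so 1.3 is included
full = np.arange(1.1, 1.3 + 0.05, 0.1).round(1)
# set difference: values expected but not present
missing = sorted(float(v) for v in set(full) - set(np.round(vals, 1)))
print(missing)  # [1.2]
```

Rounding both sides to one decimal place is essential; otherwise 1.2000000000000002 from arange won't match a literal 1.2.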
Simplest yet dumb way:
Split float to integer and decimal parts.
Create cartesian product of both to generate Full array.
Use set and XOR to find out missing ones.
from itertools import product
source = [1.1, 1.3, 2.1, 2.2, 2.3]
separated = [str(n).split(".") for n in source]
integers, decimals = map(set, zip(*separated))
products = [float(f"{i}.{d}") for i, d in product(integers, decimals)]
print(*(set(products) ^ set(source)))
output:
1.2
I guess the solutions to the question you linked probably work in your case; you just need to swap the built-in range function for numpy.arange, which lets you create a range of numbers with floats.
It looks something like this (just a simple example):
import numpy as np
np_range = np.arange(1, 2, 0.1)
float_list = [1.2, 1.3, 1.4, 1.6]
for i in np_range:
    if round(i, 1) not in float_list:
        print(round(i, 1))
output:
1.0
1.1
1.5
1.7
1.8
1.9
This is an absolutely AWFUL way to do this, but depending on how many numbers you have in the list and how difficult the other solutions are you might appreciate it.
If you write
firstvalue = 1.1
secondvalue = 1.2
thirdvalue = 1.3
#assign these for every value you are keeping track of
if firstvalue in val:  # (or whatever you named your list)
    print("1.1 is in the list")
else:
    print("1.1 is missing!")
if secondvalue in val:
    print("1.2 is in the list")
else:
    print("1.2 is missing!")
# etc etc etc for every value in the list. It's tedious and dumb but if you
# have few enough values in your list it might be your simplest option
With numpy
import numpy as np
arr = [1.1, 1.3, 2.1, 2.2, 2.3]
find_gaps = np.array(arr).round(1)
find_gaps[np.r_[np.diff(find_gaps).round(1), False] == 0.2] + 0.1
Output
array([1.2])
Test with random data
import numpy as np
np.random.seed(10)
arr = np.arange(0.1, 10.4, 0.1)
mask = np.random.randint(0, 2, len(arr)).astype(bool)
gaps = arr[mask]
print(gaps)
find_gaps = np.array(gaps).round(1)
print('missing values:')
print(find_gaps[np.r_[np.diff(find_gaps).round(1), False] == 0.2] + 0.1)
Output
[ 0.1 0.2 0.4 0.6 0.7 0.9 1. 1.2 1.3 1.6 2.2 2.5 2.6 2.9
3.2 3.6 3.7 3.9 4. 4.1 4.2 4.3 4.5 5. 5.2 5.3 5.4 5.6
5.8 5.9 6.1 6.4 6.8 6.9 7.3 7.5 7.6 7.8 7.9 8.1 8.7 8.9
9.7 9.8 10. 10.1]
missing values:
[0.3 0.5 0.8 1.1 3.8 4.4 5.1 5.5 5.7 6. 7.4 7.7 8. 8.8 9.9]
More general solution
Find all missing value with specific gap size
import numpy as np
def find_missing(find_gaps, gaps=1):
    find_gaps = np.array(find_gaps)
    gaps_diff = np.r_[np.diff(find_gaps).round(1), False]
    gaps_index = find_gaps[(gaps_diff >= 0.2) & (gaps_diff <= round(0.1 * (gaps + 1), 1))]
    gaps_values = np.searchsorted(find_gaps, gaps_index)
    ranges = np.vstack([(find_gaps[gaps_values] + 0.1).round(1), find_gaps[gaps_values + 1]]).T
    return np.concatenate([np.arange(start, end, 0.1001) for start, end in ranges]).round(1)
vals = [0.1,0.3, 0.6, 0.7, 1.1, 1.5, 1.8, 2.1]
print('Vals:', vals)
print('gap=1', find_missing(vals, gaps = 1))
print('gap=2', find_missing(vals, gaps = 2))
print('gap=3', find_missing(vals, gaps = 3))
Output
Vals: [0.1, 0.3, 0.6, 0.7, 1.1, 1.5, 1.8, 2.1]
gap=1 [0.2]
gap=2 [0.2 0.4 0.5 1.6 1.7 1.9 2. ]
gap=3 [0.2 0.4 0.5 0.8 0.9 1. 1.2 1.3 1.4 1.6 1.7 1.9 2. ]
I have a pandas DataFrame (shape = (34, 19)) which I want to use as a lookup table.
But the values I want to look up fall "between" the values in the DataFrame.
For example:
0.1 0.2 0.3 0.4 0.5
0.1 4.01 31.86 68.01 103.93 139.2
0.2 24.07 57.49 91.37 125.21 158.57
0.3 44.35 76.4 108.97 141.57 173.78
0.4 59.66 91.02 122.8 154.62 186.13
0.5 87.15 117.9 148.86 179.83 210.48
0.6 106.92 137.41 168.26 198.99 229.06
0.7 121.73 152.48 183.4 213.88 243.33
I now want to look up the value for x = 5.5, y = 1.004, so the answer should be around 114.
I tried it with different methods from scipy but the values I get are always way off.
The last method I used was:
inter = interpolate.interpn([np.array(np.arange(34)), np.array(np.arange(19))], np_matrix, [x_value, y_value])
I even get wrong values for points in the grid which do exist.
Can someone tell me what I'm doing wrong or recommend an easy solution for the task?
EDIT:
An additional problem is:
My raw data, from an .xlsx file, look like:
0.1 0.2 0.3 0.4 0.5
0.1 4.01 31.86 68.01 103.93 139.2
0.2 24.07 57.49 91.37 125.21 158.57
0.3 44.35 76.4 108.97 141.57 173.78
0.4 59.66 91.02 122.8 154.62 186.13
0.5 87.15 117.9 148.86 179.83 210.48
0.6 106.92 137.41 168.26 198.99 229.06
0.7 121.73 152.48 183.4 213.88 243.33
But pandas adds an Index column:
0.1 0.2 0.3 0.4 0.5
0 0.1 4.01 31.86 68.01 103.93 139.2
1 0.2 24.07 57.49 91.37 125.21 158.57
2 0.3 44.35 76.4 108.97 141.57 173.78
3 0.4 59.66 91.02 122.8 154.62 186.13
4 0.8 87.15 117.9 148.86 179.83 210.48
5 1.0 106.92 137.41 168.26 198.99 229.06
6 1.7 121.73 152.48 183.4 213.88 243.33
So if I want to access x = 0.4 y = 0.15 I have to input x = 3, y = 0.15.
Data are read with:
model_references = pd.ExcelFile(model_references_path)
Matrix = model_references.parse('Model_References')
n = Matrix.stack().reset_index().values
out = interpolate.griddata(n[:,0:2], n[:,2], (Stroke, Current), method='cubic')
You can reshape the data to 3 columns with stack - the first column for the index, the second for the columns, and the last for the values - then get the interpolated value with scipy.interpolate.griddata:
from scipy.interpolate import griddata
a = 5.5
b = 1.004
n = df.stack().reset_index().values
#https://stackoverflow.com/a/8662243
out = griddata(n[:,0:2], n[:,2], [(a, b)], method='linear')
print (out)
[104.563]
Detail:
n = df.stack().reset_index().values
print (n)
[[ 1. 1. 4.01]
[ 1. 2. 31.86]
[ 1. 3. 68.01]
[ 1. 4. 103.93]
[ 1. 5. 139.2 ]
[ 2. 1. 24.07]
[ 2. 2. 57.49]
[ 2. 3. 91.37]
[ 2. 4. 125.21]
[ 2. 5. 158.57]
[ 3. 1. 44.35]
[ 3. 2. 76.4 ]
[ 3. 3. 108.97]
[ 3. 4. 141.57]
[ 3. 5. 173.78]
[ 4. 1. 59.66]
[ 4. 2. 91.02]
[ 4. 3. 122.8 ]
[ 4. 4. 154.62]
[ 4. 5. 186.13]
[ 5. 1. 87.15]
[ 5. 2. 117.9 ]
[ 5. 3. 148.86]
[ 5. 4. 179.83]
[ 5. 5. 210.48]
[ 5. 1. 106.92]
[ 5. 2. 137.41]
[ 5. 3. 168.26]
[ 5. 4. 198.99]
[ 5. 5. 229.06]
[ 6. 1. 121.73]
[ 6. 2. 152.48]
[ 6. 3. 183.4 ]
[ 6. 4. 213.88]
[ 6. 5. 243.33]]
Try interp2d from scipy.
import numpy as np
from scipy.interpolate import interp2d
x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 3, 4, 5]
z = [[4.01, 31.86, 68.01, 103.93, 139.2],
     [24.07, 57.49, 91.37, 125.21, 158.57],
     [44.35, 76.4, 108.97, 141.57, 173.78],
     [59.66, 91.02, 122.8, 154.62, 186.13],
     [87.15, 117.9, 148.86, 179.83, 210.48],
     [106.92, 137.41, 168.26, 198.99, 229.06],
     [121.73, 152.48, 183.4, 213.88, 243.33]]
z = np.array(z).T
f = interp2d(x, y, z)
f(x = 5.5, y = 1.004) # returns 97.15748
Try changing the method's kind argument to experiment with the returned value.
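Note that interp2d was deprecated and has been removed in recent SciPy releases; if it is unavailable, a similar lookup can be sketched with RegularGridInterpolator (the integer axis positions below are placeholders - substitute your real x/y grid values):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

x = np.arange(1, 8)  # 7 row positions (placeholder axis)
y = np.arange(1, 6)  # 5 column positions (placeholder axis)
z = np.array([[4.01, 31.86, 68.01, 103.93, 139.2],
              [24.07, 57.49, 91.37, 125.21, 158.57],
              [44.35, 76.4, 108.97, 141.57, 173.78],
              [59.66, 91.02, 122.8, 154.62, 186.13],
              [87.15, 117.9, 148.86, 179.83, 210.48],
              [106.92, 137.41, 168.26, 198.99, 229.06],
              [121.73, 152.48, 183.4, 213.88, 243.33]])

# z has shape (len(x), len(y)), as RegularGridInterpolator expects
f = RegularGridInterpolator((x, y), z, method="linear")
val = f([[5.5, 1.004]])[0]
print(val)  # ~97.157, matching the interp2d result above
```

Unlike interp2d, the query takes points as (x, y) pairs, so no transpose of z is needed here.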
I am trying to convert a multi-index pandas DataFrame into a numpy.ndarray. The DataFrame is below:
s1 s2 s3 s4
Action State
1 s1 0.0 0 0.8 0.2
s2 0.1 0 0.9 0.0
2 s1 0.0 0 0.9 0.1
s2 0.0 0 1.0 0.0
I would like the resulting numpy.ndarray to be the following with np.shape() = (2,2,4):
[[[ 0.0 0.0 0.8 0.2 ]
[ 0.1 0.0 0.9 0.0 ]]
[[ 0.0 0.0 0.9 0.1 ]
[ 0.0 0.0 1.0 0.0]]]
I have tried df.as_matrix() but this returns:
[[ 0. 0. 0.8 0.2]
[ 0.1 0. 0.9 0. ]
[ 0. 0. 0.9 0.1]
[ 0. 0. 1. 0. ]]
How do I return a list of lists for the first level, with each list representing an Action's records?
You could use the following:
dim = len(df.index.get_level_values(0).unique())
result = df.values.reshape((dim, -1, df.shape[1]))
print(result)
[[[ 0. 0. 0.8 0.2]
[ 0.1 0. 0.9 0. ]]
[[ 0. 0. 0.9 0.1]
[ 0. 0. 1. 0. ]]]
The first line just finds the number of groups that you want to group by.
Why this (or groupby) is needed: as soon as you use .values, you lose the dimensionality of the MultiIndex from pandas. So you need to re-pass that dimensionality to NumPy in some way.
One way
In [151]: df.groupby(level=0).apply(lambda x: x.values.tolist()).values
Out[151]:
array([[[0.0, 0.0, 0.8, 0.2],
[0.1, 0.0, 0.9, 0.0]],
[[0.0, 0.0, 0.9, 0.1],
[0.0, 0.0, 1.0, 0.0]]], dtype=object)
Using Divakar's suggestion, np.reshape() worked:
>>> print(P)
s1 s2 s3 s4
Action State
1 s1 0.0 0 0.8 0.2
s2 0.1 0 0.9 0.0
2 s1 0.0 0 0.9 0.1
s2 0.0 0 1.0 0.0
>>> np.reshape(P,(2,2,-1))
[[[ 0. 0. 0.8 0.2]
[ 0.1 0. 0.9 0. ]]
[[ 0. 0. 0.9 0.1]
[ 0. 0. 1. 0. ]]]
>>> np.shape(np.reshape(P, (2, 2, -1)))
(2, 2, 4)
Elaborating on Brad Solomon's answer, to get a slightly more generic solution - indexes of different sizes and an unfixed number of indexes - one could do something like this:
def df_to_numpy(df):
    try:
        shape = [len(level) for level in df.index.levels]
    except AttributeError:
        shape = [len(df.index)]
    ncol = df.shape[-1]
    if ncol > 1:
        shape.append(ncol)
    return df.to_numpy().reshape(shape)
If df has missing sub-indexes reshape will not work. One way to add them would be (maybe there are better solutions):
def enforce_df_shape(df):
    try:
        ind = pd.MultiIndex.from_product([level.values for level in df.index.levels])
    except AttributeError:
        return df
    fulldf = pd.DataFrame(-1, columns=df.columns, index=ind)  # remove -1 to fill fulldf with nan
    fulldf.update(df)
    return fulldf
If you are just trying to pull out one column, say s1, and get an array with shape (2,2) you can use the .index.levshape like this:
x = df.s1.to_numpy().reshape(df.index.levshape)
This will give you a (2,2) containing the value of s1.
I want to implement a rolling concatenation function for numpy array of arrays. For example, if my numpy array is the following:
[[1.0]
[1.5]
[1.6]
[1.8]
...
...
[1.2]
[1.3]
[1.5]]
then, for a window size of 3, my function should return:
[[1.0]
[1.0 1.5]
[1.0 1.5 1.6]
[1.5 1.6 1.8]
...
...
[1.2 1.3 1.5]]
The input array could have elements of different shapes as well. For example, if input is:
[[1.0]
[1.5]
[1.6 1.7]
[1.8]
...
...
[1.2]
[1.3]
[1.5]]
then output should be:
[[1.0]
[1.0 1.5]
[1.0 1.5 1.6 1.7]
[1.5 1.6 1.7 1.8]
...
...
[1.2 1.3 1.5]]
First, make your array into a list. There's no purpose in having an array of arrays in numpy.
l = arr.tolist() #l is a list of arrays
Now use list comprehension to get your elements, and concatenate them with np.r_
l2 = [np.r_[tuple(l[max(i - n, 0):i])] for i in range(1, len(l)+1)]
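A quick check of that comprehension on the ragged example from the question (plain nested lists stand in for the array of arrays; window size 3):

```python
import numpy as np

l = [[1.0], [1.5], [1.6, 1.7], [1.8]]
n = 3  # window size
# each output row concatenates up to the last n input rows, flattened
l2 = [np.r_[tuple(l[max(i - n, 0):i])] for i in range(1, len(l) + 1)]
for row in l2:
    print(row)
```

The elements with two values simply contribute two entries to the concatenated window, which matches the expected output in the question.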
Here's a simple code in python.
end = np.zeros((11, 2))
alpha = 0
while alpha <= 1:
    end[int(10 * alpha)] = alpha
    print(end[int(10 * alpha)])
    alpha += 0.1
print('')
print(end)
and output:
[ 0. 0.]
[ 0.1 0.1]
[ 0.2 0.2]
[ 0.3 0.3]
[ 0.4 0.4]
[ 0.5 0.5]
[ 0.6 0.6]
[ 0.7 0.7]
[ 0.8 0.8]
[ 0.9 0.9]
[ 1. 1.]
[[ 0. 0. ]
[ 0.1 0.1]
[ 0.2 0.2]
[ 0.3 0.3]
[ 0.4 0.4]
[ 0.5 0.5]
[ 0.6 0.6]
[ 0.8 0.8]
[ 0. 0. ]
[ 1. 1. ]
[ 0. 0. ]]
It is easy to notice that 0.7 is missing, and after 0.8 comes 0 instead of 0.9, etc. Why do these outputs differ?
It's because of floating point errors. Run this:
import numpy as np
end = np.zeros((11, 2))
alpha=0
while alpha <= 1:
    print("alpha is ", alpha)
    end[int(10 * alpha)] = alpha
    print(end[int(10 * alpha)])
    alpha += 0.1
print('')
print(end)
and you will see that alpha is, successively:
alpha is 0
alpha is 0.1
alpha is 0.2
alpha is 0.30000000000000004
alpha is 0.4
alpha is 0.5
alpha is 0.6
alpha is 0.7
alpha is 0.7999999999999999
alpha is 0.8999999999999999
alpha is 0.9999999999999999
Basically, floating point numbers like 0.1 are stored inexactly on your computer. If you add 0.1 to itself, say, 8 times, you won't necessarily get 0.8 -- the small errors can accumulate and give you a slightly different number, in this case 0.7999999999999999. Numpy arrays must take integers as indexes, however, so the int function is used to force this down to the nearest integer -- 7 -- which causes that row to be overwritten.
To solve this, you must rewrite your code so that you only ever use integers to index into an array. One slightly crude way would be to round the float to the nearest integer using the round function. But really you should rewrite your code so that it iterates over integers and converts them into floats, rather than iterating over floats and converting them into integers.
You can read more about floating point numbers here:
https://docs.python.org/3/tutorial/floatingpoint.html
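Applying that advice to the loop above: iterate over an exact integer index and derive alpha from it, instead of accumulating floats:

```python
import numpy as np

end = np.zeros((11, 2))
for i in range(11):  # exact integer loop index
    alpha = i / 10   # derive the float from the integer
    end[i] = alpha   # index with the integer, never int(10 * alpha)
print(end)           # rows 0.0, 0.1, ..., 1.0 with none skipped or overwritten
```

Each i/10 may still be stored inexactly, but the array index is now always correct because it never passes through a float.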
As @Denziloe pointed out, this is due to floating point errors.
If you look at the definition of int():
"If x is floating point, the conversion truncates towards zero."
To solve your problem, use round() instead of int().
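For example, with the accumulated alpha from the question (my own minimal reproduction):

```python
alpha = 0.0
for _ in range(8):
    alpha += 0.1

print(alpha)              # 0.7999999999999999 due to accumulated error
print(int(10 * alpha))    # 7 -- truncation toward zero loses the row
print(round(10 * alpha))  # 8 -- rounds to the intended index
```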