Binning in Numpy - python

I have an array A which I am trying to put into 10 bins. Here is what I've done:

import numpy as np

A = range(1, 94)
hist = np.histogram(A, bins=10)
np.digitize(A, hist[1])

But the output has 11 bins, not 10, with the last value (93) placed in bin 11, when it should have been in bin 10. I can fix it with a hack, but what's the most elegant way of doing this? How do I tell digitize that the last bin in hist[1] is inclusive on the right - [ ] instead of [ )?

The output of np.histogram actually has 10 bins; the last (right-most) bin includes the greatest element because its right edge is inclusive (unlike for the other bins).
np.digitize makes no such exception (its purpose is different), so the largest element(s) of the list get placed into an extra, eleventh bin. To get bin assignments that are consistent with histogram, just clamp the output of digitize to the number of bins, e.g. using np.fmin:
import numpy as np

A = range(1, 94)
bin_count = 10
hist = np.histogram(A, bins=bin_count)
np.fmin(np.digitize(A, hist[1]), bin_count)
Output:
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  3,  3,  3,  3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,
        4,  4,  4,  5,  5,  5,  5,  5,  5,  5,  5,  5,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  8,  8,  8,
        8,  8,  8,  8,  8,  8,  9,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10])
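For what it's worth, the same clamping can also be written with np.clip, and newer NumPy versions expose the edge computation directly. A minimal sketch (assuming np.histogram_bin_edges is available, i.e. NumPy >= 1.15):

import numpy as np

A = np.arange(1, 94)
# np.histogram_bin_edges computes the same 11 edges np.histogram would use
edges = np.histogram_bin_edges(A, bins=10)
# digitize, then clip the out-of-range index (for the maximum) back into bin 10
bins = np.clip(np.digitize(A, edges), 1, 10)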

Related

Python: How to efficiently create all possible 2 element swaps of an array?

I am trying to generate all possible 2-element swaps of a given array.
For example:
candidate = [ 5, 9, 1, 8, 3, 7, 10, 6, 4, 2]
result = [[ 9, 5, 1, 8, 3, 7, 10, 6, 4, 2]
[ 1, 9, 5, 8, 3, 7, 10, 6, 4, 2]
[ 8, 9, 1, 5, 3, 7, 10, 6, 4, 2]
[ 3, 9, 1, 8, 5, 7, 10, 6, 4, 2]
[ 7, 9, 1, 8, 3, 5, 10, 6, 4, 2]
[10, 9, 1, 8, 3, 7, 5, 6, 4, 2]
[ 6, 9, 1, 8, 3, 7, 10, 5, 4, 2]
[ 4, 9, 1, 8, 3, 7, 10, 6, 5, 2]
[ 2, 9, 1, 8, 3, 7, 10, 6, 4, 5]
[ 5, 1, 9, 8, 3, 7, 10, 6, 4, 2]
[ 5, 8, 1, 9, 3, 7, 10, 6, 4, 2]
[ 5, 3, 1, 8, 9, 7, 10, 6, 4, 2]
[ 5, 7, 1, 8, 3, 9, 10, 6, 4, 2]
[ 5, 10, 1, 8, 3, 7, 9, 6, 4, 2]
[ 5, 6, 1, 8, 3, 7, 10, 9, 4, 2]
[ 5, 4, 1, 8, 3, 7, 10, 6, 9, 2]
[ 5, 2, 1, 8, 3, 7, 10, 6, 4, 9]
[ 5, 9, 8, 1, 3, 7, 10, 6, 4, 2]
[ 5, 9, 3, 8, 1, 7, 10, 6, 4, 2]
[ 5, 9, 7, 8, 3, 1, 10, 6, 4, 2]
[ 5, 9, 10, 8, 3, 7, 1, 6, 4, 2]
[ 5, 9, 6, 8, 3, 7, 10, 1, 4, 2]
[ 5, 9, 4, 8, 3, 7, 10, 6, 1, 2]
[ 5, 9, 2, 8, 3, 7, 10, 6, 4, 1]
[ 5, 9, 1, 3, 8, 7, 10, 6, 4, 2]
[ 5, 9, 1, 7, 3, 8, 10, 6, 4, 2]
[ 5, 9, 1, 10, 3, 7, 8, 6, 4, 2]
[ 5, 9, 1, 6, 3, 7, 10, 8, 4, 2]
[ 5, 9, 1, 4, 3, 7, 10, 6, 8, 2]
[ 5, 9, 1, 2, 3, 7, 10, 6, 4, 8]
[ 5, 9, 1, 8, 7, 3, 10, 6, 4, 2]
[ 5, 9, 1, 8, 10, 7, 3, 6, 4, 2]
[ 5, 9, 1, 8, 6, 7, 10, 3, 4, 2]
[ 5, 9, 1, 8, 4, 7, 10, 6, 3, 2]
[ 5, 9, 1, 8, 2, 7, 10, 6, 4, 3]
[ 5, 9, 1, 8, 3, 10, 7, 6, 4, 2]
[ 5, 9, 1, 8, 3, 6, 10, 7, 4, 2]
[ 5, 9, 1, 8, 3, 4, 10, 6, 7, 2]
[ 5, 9, 1, 8, 3, 2, 10, 6, 4, 7]
[ 5, 9, 1, 8, 3, 7, 6, 10, 4, 2]
[ 5, 9, 1, 8, 3, 7, 4, 6, 10, 2]
[ 5, 9, 1, 8, 3, 7, 2, 6, 4, 10]
[ 5, 9, 1, 8, 3, 7, 10, 4, 6, 2]
[ 5, 9, 1, 8, 3, 7, 10, 2, 4, 6]
[ 5, 9, 1, 8, 3, 7, 10, 6, 2, 4]]
I currently achieve this with two nested for loops:

neighborhood = []
for node1 in range(candidate.size - 1):
    for node2 in range(node1 + 1, candidate.size):
        neighbor = np.copy(candidate)
        neighbor[node1] = candidate[node2]
        neighbor[node2] = candidate[node1]
        neighborhood.append(neighbor)

The larger the array gets, the more inefficient and slower this becomes. Is there a more efficient way that can also handle arrays with lengths in the hundreds?
Thank you!
You can use a generator if you need to consume those arrays one by one (that way you don't need to keep them all in memory, so very little space is needed). Note that the single swapped elements have to be spliced back in as one-element lists:

from itertools import combinations

def gen(lst):
    for i, j in combinations(range(len(lst)), 2):
        yield lst[:i] + [lst[j]] + lst[i+1:j] + [lst[i]] + lst[j+1:]
And then you can use it in this way:

for lst in gen(candidate):
    # do something with your list with two swapped elements
    ...

This is going to save a lot of space, but it will probably still be slow overall.
Here is a solution using NumPy. It is not space efficient (it materializes all the swapped lists at once), but it might be much faster thanks to NumPy's vectorized operations. Give it a try!

import numpy as np
from itertools import combinations
from math import comb

arr = np.tile(candidate, (comb(len(candidate), 2), 1))
indices = np.array(list(combinations(range(len(candidate)), 2)))
arr[np.arange(arr.shape[0])[:, None], indices] = arr[np.arange(arr.shape[0])[:, None], np.flip(indices, axis=-1)]
Example (with candidate = [0, 1, 2, 3]):

>>> arr
array([[1, 0, 2, 3],
       [2, 1, 0, 3],
       [3, 1, 2, 0],
       [0, 2, 1, 3],
       [0, 3, 2, 1],
       [0, 1, 3, 2]])
Notice that math.comb (which gives you the total number of possible lists with 2 swapped elements) is available only with Python >= 3.8. Please have a look at this question for how to replace math.comb if you're using an older Python version.
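For instance, a minimal drop-in for this particular use could look like the sketch below (comb2 is a hypothetical helper name, counting unordered pairs):

# fallback for Python < 3.8, where math.comb is unavailable
def comb2(n):
    # number of unordered pairs of n items, i.e. comb(n, 2)
    return n * (n - 1) // 2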
To swap just two items in any given list, I'd recommend using range with itertools.combinations. It is probably good to use a generator with the yield statement too, though if you are getting all results at once, it probably doesn't matter much.

from itertools import combinations

def swap2(l):
    for pair in combinations(range(len(l)), 2):
        l2 = l[:]
        l2[pair[0]], l2[pair[1]] = l2[pair[1]], l2[pair[0]]
        yield l2

if __name__ == "__main__":
    candidate = [5, 9, 1, 8, 3, 7, 10, 6, 4, 2]
    result = [l for l in swap2(candidate)]

Making a 10x10 grid from a list of arrays

I'm struggling to arrange my array as a 10x10 grid; the output I keep getting isn't what I'm looking for. I was hoping someone could help me out.
import numpy as np

x = 1
y = 1
scale = 10
nn = []
for x in range(1, scale + 1):
    mm = []
    for y in range(1, scale + 1):
        xy = [x, y]
        mm.append(xy)
        #print(xy)
        y=+1
    nn.append(mm)
    x=+1
nn
grid_array = np.array(nn)
grid = np.meshgrid(grid_array)
But the output I get isn't displayed as 10x10:
[array([ 1,  1,  1,  2,  1,  3,  1,  4,  1,  5,  1,  6,  1,  7,  1,  8,  1,
         9,  1, 10,  2,  1,  2,  2,  2,  3,  2,  4,  2,  5,  2,  6,  2,  7,
         2,  8,  2,  9,  2, 10,  3,  1,  3,  2,  3,  3,  3,  4,  3,  5,  3,
         6,  3,  7,  3,  8,  3,  9,  3, 10,  4,  1,  4,  2,  4,  3,  4,  4,
         4,  5,  4,  6,  4,  7,  4,  8,  4,  9,  4, 10,  5,  1,  5,  2,  5,
         3,  5,  4,  5,  5,  5,  6,  5,  7,  5,  8,  5,  9,  5, 10,  6,  1,
         6,  2,  6,  3,  6,  4,  6,  5,  6,  6,  6,  7,  6,  8,  6,  9,  6,
        10,  7,  1,  7,  2,  7,  3,  7,  4,  7,  5,  7,  6,  7,  7,  7,  8,
         7,  9,  7, 10,  8,  1,  8,  2,  8,  3,  8,  4,  8,  5,  8,  6,  8,
         7,  8,  8,  8,  9,  8, 10,  9,  1,  9,  2,  9,  3,  9,  4,  9,  5,
         9,  6,  9,  7,  9,  8,  9,  9,  9, 10, 10,  1, 10,  2, 10,  3, 10,
         4, 10,  5, 10,  6, 10,  7, 10,  8, 10,  9, 10, 10])]
Edited.
This is what I have thus far, thanks for the help guys.

import numpy as np

scale = 10
array = np.empty(shape=(scale, scale, 2)).astype(int)
for x in range(1, scale + 1):
    for y in range(1, scale + 1):
        #print([x,y])
        array[x-1, y-1] = [x, y]
print(array)
You can use numpy to do that, like this:

np.reshape(arr, (-1, 10))

See: Convert a 1D array to a 2D array in numpy
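Applied to the output above, something along these lines should work (a minimal sketch; the flat 200-element array of interleaved x, y values is rebuilt here for illustration):

import numpy as np

# rebuild the flat interleaved (x, y) sequence shown in the question's output
flat = np.array([(x, y) for x in range(1, 11) for y in range(1, 11)]).ravel()
# pair the values back up, then arrange the 100 pairs into a 10x10 grid
grid = flat.reshape(-1, 2).reshape(10, 10, 2)
print(grid.shape)  # (10, 10, 2); grid[i, j] == [i+1, j+1]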
It's pretty far from clear what you want to achieve, but if you simply want to know how to fill a 10x10 numpy array using two for loops, here is what you can do (not the most Pythonic way to do it, though):

import numpy as np

scale = 10
array = np.empty(shape=(scale, scale))
for x in range(scale):
    for y in range(scale):
        array[x, y] = 42  # replace with whatever dynamically assigned value you want there
print(array)
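For completeness, the same grid of (x, y) pairs can also be built without any loops; a sketch using np.meshgrid (the variable names are mine):

import numpy as np

scale = 10
xs, ys = np.meshgrid(np.arange(1, scale + 1), np.arange(1, scale + 1), indexing='ij')
grid = np.stack([xs, ys], axis=-1)  # shape (10, 10, 2); grid[i, j] == [i+1, j+1]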

numpy.unique gives non-unique output?

I am trying to get the indices of the unique elements of a numpy array (a long vector of 3628621 elements).
However, I must be doing something wrong, because when I try to select the unique elements, I still find duplicates:
Vector
Out[165]: array([712450, 714390, 718560, ..., 384390, 992041, 94852])

Loc = np.where(np.unique(Vector))  # find indices of unique elements
Vector_New = Vector[Loc]           # create new vector with all unique elements

np.where(Vector_New == 173020)     # see how often/where 173020 occurs
Out[166]: (array([ 7098, 11581], dtype=int64),)
So the integer 173020 still occurs twice in the new vector, although I expected all its elements to be unique. The new vector is 11594 elements long.
Thanks for the help!
Regards,
Timen
np.unique has several optional parameters that will give you the information you need. Its calling signature is:

np.unique(ar, return_index=False, return_inverse=False, return_counts=False)

Read the docs. As an aside, np.where(np.unique(Vector)) does not do what your comment says: np.unique(Vector) already returns the sorted unique values themselves, and np.where on that array just gives the positions of its non-zero entries, not indices into Vector.
In [50]: keys
Out[50]:
array([1, 3, 5, 2, 0, 7, 4, 7, 7, 2, 7, 5, 5, 3, 6, 2, 3, 5, 5, 5, 6, 9, 6,
       5, 2, 1, 6, 6, 5, 9, 9, 6, 5, 5, 9, 9, 6, 3, 7, 0, 5, 1, 7, 6, 2, 4,
       1, 0, 6, 5, 4, 8, 8, 4, 2, 1, 8, 3, 1, 9, 8, 4, 4, 2, 4, 7, 2, 6, 8,
       6, 5, 2, 4, 9, 1, 5, 3, 1, 5, 6, 2, 2, 8, 4, 0, 4, 9, 0, 8, 1, 5, 3,
       1, 3, 7, 1, 5, 8, 5, 8])

In [51]: np.unique(keys, return_counts=True, return_index=True)
Out[51]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([ 4,  0,  3,  1,  6,  2, 14,  5, 51, 21], dtype=int32),
 array([ 5, 11, 11,  8, 10, 18, 12,  8,  9,  8]))
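Applied to your question, a minimal sketch (the sample values here are made up; return_index gives the positions of first occurrences, which can then rebuild the deduplicated vector):

import numpy as np

Vector = np.array([712450, 714390, 173020, 718560, 173020, 94852])
_, idx = np.unique(Vector, return_index=True)  # indices of first occurrences
Vector_New = Vector[np.sort(idx)]              # unique values, original order kept
print(Vector_New)                              # 173020 now appears exactly once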

How to make duplicates of values in a list python

Quick question... searched and didn't find anything. I have this list:
list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
I want each value to be repeated 3 more times in the same list... How do I do so? I tried some for loops that .append to the list, but things got messy: I ended up with lists nested inside lists. I have a feeling .append is not right for this scenario.
In [1]: my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
In [2]: sorted(my_list * 4)

Note that you should not use list as a variable name, because it shadows the built-in list type (it is not a keyword, but shadowing it breaks later calls such as list(range(3))).
One other option is to use numpy:
In [8]: import numpy as np
In [9]: np.repeat(my_list, 4)
Use nested loops in a list comprehension.
>>> L = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
>>> [x for x in L for y in range(4)]
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
You can try something like this!

baselist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
[z for y in [[x]*4 for x in baselist] for z in y]

This is equivalent to:

listofLists = []
for x in baselist:
    listofLists.append([x]*4)

finalList = []
for y in listofLists:
    for z in y:
        finalList.append(z)

You see, the list comprehension simply shortens the logic, but whether it's more readable will depend on your grasp of comprehension syntax.
You could do it like this:
>>> lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
>>> [j for i in lst for j in [i,i,i,i]]
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
Another itertools variation using the flatten recipe (note that itertools.izip is Python 2 only; on Python 3 use the built-in zip instead):

>>> m = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
>>> import itertools
>>> list(itertools.chain.from_iterable(itertools.izip(m, m, m, m)))
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

Python using Kalman Filter to improve simulation but getting worse results

I have some questions about the behavior I am seeing when applying a Kalman filter (KF) to the following forecast problem. I have included a simple code sample.
Goal: I would like to know whether KF is suitable for improving a forecast/simulation result one day ahead (at t+24 hours), using the measurement obtained now (at t). The goal is to get the forecast as close to the measurement as possible.
Assumptions:
We assume the measurement is perfect (i.e. if we can get the forecast to match the measurement perfectly, we are happy).
We have a single measurement variable (z, real wind speed) and a single simulated variable (x, predicted wind speed).
The simulated wind speed x is produced by NWP (numerical weather prediction) software from a variety of meteorological data (a black box to me). The simulation file is produced daily and contains data for every half hour.
I attempt to correct the t+24 hr forecast using the measurement obtained now and the forecast data for now (generated t-24 hr ago) with a scalar Kalman filter. For reference, I used:
http://www.swarthmore.edu/NatSci/echeeve1/Ref/Kalman/ScalarKalman.html
Code:
#! /usr/bin/python
import numpy as np
import pylab
import os

def main():
    # x = 336 data points of simulated wind speed for 7 days * 24 hours * 2 (every half hour).
    # Imagine that at time t we get the x value for t+48, i.e. 24 hours later.
    x = load_x()

    # this list will hold the 336 data points of our corrected data
    x_sample_predict_list = []

    # z = 336 data points of actual measured wind speed for 7 days * 24 hours * 2 (every half hour)
    z = load_z()

    # Here is the setup of the scalar Kalman filter.
    # reference: http://www.swarthmore.edu/NatSci/echeeve1/Ref/Kalman/ScalarKalman.html

    # State transition matrix (we simply have a scalar):
    # what you multiply last time's state by to get the newest state.
    # We have x_t+1 = A * x_t; since we get x_t+1 directly from the simulation,
    # we will have a = 1.
    a = 1.0

    # Observation matrix:
    # what you multiply the state by to convert it to the same form as the
    # incoming measurement. Both state and measurement are wind speeds, so h = 1.
    h = 1.0

    Q = 16.0  # expected process variance of the predicted wind speed
    R = 9.0   # expected measurement variance of the wind speed
    p_j = Q   # process covariance starts at the initial process covariance estimate

    # The Kalman gain is k = h*p-_j / (h^2*p-_j + R). With a perfect measurement,
    # R = 0 and k reduces to k = 1/h, which is 1.
    k = 1.0

    # one week of data
    # original R2 = 0.183
    # with delay = 6,  R2 = 0.295
    # with delay = 12, R2 = 0.147
    # with delay = 48, R2 = 0.075
    delay = 6

    # Kalman loop
    for t, x_sample in enumerate(x):
        if t <= delay:
            # for the start of the forecast we don't yet have forecast data
            # and a measurement from a day before to do the correction with
            x_sample_predict = x_sample
        else:  # t > delay
            # for the a priori estimate we take x_sample as is:
            # x_sample = x^-_j = a * x^-_j-1 + b * u_j
            # (inside the NWP, x_sample should be based on x_sample_j-1 -- assumption)
            x_sample_predict_prior = a * x_sample

            # we use the measurement from t-delay (i.e. possibly a day ago) and the
            # forecast data from t-delay to produce a leading residual that can be
            # used to correct the forecast
            residual = z[t-delay] - h * x_sample_predict_list[t-delay]

            p_j_prior = a**2 * p_j + Q
            k = h * p_j_prior / (h**2 * p_j_prior + R)

            # we update our prediction based on the residual
            x_sample_predict = x_sample_predict_prior + k * residual
            p_j = p_j_prior * (1 - h * k)
            #print k
            #print p_j_prior
            #print p_j
            #raw_input()

        x_sample_predict_list.append(x_sample_predict)

    # initial goodness of fit
    R2_val_initial = calculate_regression(x, z)
    R2_string_initial = "R2 initial: {0:10.3f}, ".format(R2_val_initial)
    print R2_string_initial  # R2_val_initial = 0.183

    # final goodness of fit
    R2_val_final = calculate_regression(x_sample_predict_list, z)
    R2_string_final = "R2 final: {0:10.3f}, ".format(R2_val_final)
    print R2_string_final  # R2_val_final = 0.117, which is worse

    timesteps = xrange(len(x))
    pylab.plot(timesteps, x, 'r-', timesteps, z, 'b:', timesteps, x_sample_predict_list, 'g--')
    pylab.xlabel('Time')
    pylab.ylabel('Wind Speed')
    pylab.title('Simulated Wind Speed vs Actual Wind Speed')
    pylab.legend(('predicted', 'measured', 'kalman'))
    pylab.show()

def calculate_regression(x, y):
    A = np.array([x, np.ones(len(x))])
    model, resid = np.linalg.lstsq(A.T, y)[:2]
    R2_val = 1 - resid[0] / (y.size * y.var())
    return R2_val
def load_x():
    return np.array([2, 3, 3, 5, 4, 4, 4, 5, 5, 6, 5, 7, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11,
11, 10, 8, 8, 8, 8, 6, 3, 4, 5, 5, 5, 6, 5, 5, 5, 6, 5, 5, 6, 6, 7, 6, 8, 9, 10,
12, 11, 10, 10, 10, 11, 11, 10, 8, 8, 9, 8, 9, 9, 9, 9, 8, 9, 8, 11, 11, 11, 12,
12, 13, 13, 13, 13, 13, 13, 13, 14, 13, 13, 12, 13, 13, 12, 12, 13, 13, 12, 12,
11, 12, 12, 19, 18, 17, 15, 13, 14, 14, 14, 13, 12, 12, 12, 12, 11, 10, 10, 10,
10, 9, 9, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 7, 7, 8, 8, 8, 6, 5, 5,
5, 5, 5, 5, 6, 4, 4, 4, 6, 7, 8, 7, 7, 9, 10, 10, 9, 9, 8, 7, 5, 5, 5, 5, 5, 5,
5, 5, 6, 5, 5, 5, 4, 4, 6, 6, 7, 7, 7, 7, 6, 6, 5, 5, 4, 2, 2, 2, 1, 1, 1, 2, 3,
13, 13, 12, 11, 10, 9, 10, 10, 8, 9, 8, 7, 5, 3, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6,
7, 7, 7, 6, 6, 6, 7, 6, 6, 5, 4, 4, 3, 3, 3, 2, 2, 1, 5, 5, 3, 2, 1, 2, 6, 7,
7, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 8, 8, 8, 8, 7, 7,
7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 11, 11, 11, 11, 10, 10, 9, 10, 10, 10, 2, 2,
2, 3, 1, 1, 3, 4, 5, 8, 9, 9, 9, 9, 8, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7,
7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 7, 5, 5, 5, 5, 5, 6, 5])
def load_z():
    return np.array([3, 2, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 2, 2,
2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 4, 4, 5, 4, 4, 5, 5, 5, 6, 6,
6, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 8, 8, 8, 8, 8, 8, 9, 10, 9, 9, 10, 10, 9,
9, 10, 9, 9, 10, 9, 8, 9, 9, 7, 7, 6, 7, 6, 6, 7, 7, 8, 8, 8, 8, 8, 8, 7, 6, 7,
8, 8, 7, 8, 9, 9, 9, 9, 10, 9, 9, 9, 8, 8, 10, 9, 10, 10, 9, 9, 9, 10, 9, 8, 7,
7, 7, 7, 8, 7, 6, 5, 4, 3, 5, 3, 5, 4, 4, 4, 2, 4, 3, 2, 1, 1, 2, 1, 2, 1, 4, 4,
4, 4, 4, 3, 3, 3, 1, 1, 1, 1, 2, 3, 3, 2, 3, 3, 3, 2, 2, 5, 4, 2, 5, 4, 1, 1, 1,
1, 1, 1, 1, 2, 2, 1, 1, 3, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 4, 4, 3, 3, 4, 4, 4, 4,
4, 4, 5, 5, 5, 4, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 2, 3, 3, 1, 2, 1, 1, 2, 4, 3, 1,
1, 2, 0, 0, 0, 2, 1, 0, 0, 2, 3, 2, 4, 4, 3, 3, 4, 5, 5, 5, 4, 5, 4, 4, 4, 5, 5,
4, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4, 4, 5, 5, 5, 4, 5, 5, 5, 5, 6, 5, 5, 8, 9, 8, 9,
9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 9, 10, 9, 8, 8, 9, 8, 9, 9, 10, 9, 9, 9,
7, 7, 9, 8, 7, 6, 6, 5, 5, 5, 5, 3, 3, 3, 4, 6, 5, 5, 6, 5])
if __name__ == '__main__':
    main()  # this avoids executing main when the module is imported
-------------------------
Observations:
1) If yesterday's forecast was over-predicting (positive bias), then today I would correct it by subtracting away that bias. In practice, if today I happen to be under-predicting, subtracting the positive bias leads to an even worse prediction. And I actually observe a wider swing of the data with a poorer overall fit. What is wrong with my example?
2) Most Kalman filter resources indicate that the Kalman filter minimizes the a posteriori covariance p_j = E{(x_j - x^_j)^2}, and give the proof of selecting K to minimize p_j. But can someone explain how minimizing the a posteriori covariance actually minimizes the effects of the process white noise w? In a real-time case, let's say the actual and measured wind speed is 5 m/s and the predicted wind speed is 6 m/s, i.e. there was a noise of w = 1 m/s. The residual is 5 - 6 = -1 m/s. You correct by subtracting the 1 m/s from your prediction to get back to 5 m/s. Is that how the effect of the process noise is minimized?
3) Here is a paper that mentions applying KF to smooth weather forecasts: http://hal.archives-ouvertes.fr/docs/00/50/59/93/PDF/Louka_etal_jweia2008.pdf. The interesting point is on pg 9, eq (7): "as soon as the new observation value y_t is known, the estimate of x at time t becomes x_t = x_t/t-1 + K(y_t - H_t * x_t/t-1)". If I paraphrase it in terms of actual time: "as soon as the new observation value is known now, the estimate now becomes x_t ....". I get how KF can bring your data close to the measurement in real time. But if you are correcting the forecast data at t=now with measurement data from t=now, how is that a forecast anymore?
Thanks!
UPDATE1:
4) I have added a delay to the code to investigate how much later the forecast can be than the current bias computed from the current measurement, if we want the R2 of the Kalman-processed data vs. the measured data to improve over that of the unprocessed data vs. the measured data. In this example, if the measurement is used to improve the forecast 6 time steps ahead (3 hours from now), it is still useful (R2 goes from 0.183 to 0.295). But if the measurement is used to improve the forecast 1 day from now, it destroys the correlation (R2 goes down to 0.075).
I have updated my test scalar implementation, dropping the perfect-measurement assumption (R = 0) that had reduced the Kalman gain to a constant value of 1. Now I am seeing an improvement in the time series, with a reduced RMSE error.
#! /usr/bin/python
import numpy as np
import pylab
import os

# RMSE improved
def main():
    # x = 336 data points of simulated wind speed for 7 days * 24 hours * 2 (every half hour).
    # Imagine that at time t we get the x value for t+48, i.e. 24 hours later.
    x = load_x()

    # this list will hold the 336 data points of our corrected data
    x_sample_predict_list = []

    # z = 336 data points of actual measured wind speed for 7 days * 24 hours * 2 (every half hour)
    z = load_z()

    # Here is the setup of the scalar Kalman filter.
    # reference: http://www.swarthmore.edu/NatSci/echeeve1/Ref/Kalman/ScalarKalman.html

    # State transition matrix (we simply have a scalar):
    # what you multiply last time's state by to get the newest state.
    # We have x_t+1 = A * x_t; since we get x_t+1 directly from the simulation,
    # we will have a = 1.
    a = 1.0

    # Observation matrix:
    # what you multiply the state by to convert it to the same form as the
    # incoming measurement. Both state and measurement are wind speeds, so h = 1.
    h = 1.0

    Q = 1.0  # expected process noise of the predicted wind speed
    R = 1.0  # expected measurement noise of the wind speed
    p_j = Q  # process covariance starts at the initial process covariance estimate

    # The Kalman gain is k = h*p-_j / (h^2*p-_j + R). With a perfect measurement,
    # R = 0 and k reduces to k = 1/h, which is 1.
    k = 1.0

    # one week of data
    # original R2 = 0.183
    # with delay = 6,  R2 = 0.295
    # with delay = 12, R2 = 0.147
    # with delay = 48, R2 = 0.075
    delay = 6

    # Kalman loop
    for t, x_sample in enumerate(x):
        if t <= delay:
            # for the start of the forecast we don't yet have forecast data
            # and a measurement from a day before to do the correction with
            x_sample_predict = x_sample
        else:  # t > delay
            # for the a priori estimate we take x_sample as is:
            # x_sample = x^-_j = a * x^-_j-1 + b * u_j
            # (inside the NWP, x_sample should be based on x_sample_j-1 -- assumption)
            x_sample_predict_prior = a * x_sample

            # we use the measurement from t-delay (i.e. possibly a day ago) and the
            # forecast data from t-delay to produce a leading residual that can be
            # used to correct the forecast
            residual = z[t-delay] - h * x_sample_predict_list[t-delay]

            p_j_prior = a**2 * p_j + Q
            k = h * p_j_prior / (h**2 * p_j_prior + R)

            # we update our prediction based on the residual
            x_sample_predict = x_sample_predict_prior + k * residual
            p_j = p_j_prior * (1 - h * k)
            #print k
            #print p_j_prior
            #print p_j
            #raw_input()

        x_sample_predict_list.append(x_sample_predict)

    # initial goodness of fit
    R2_val_initial = calculate_regression(x, z)
    R2_string_initial = "R2 original: {0:10.3f}, ".format(R2_val_initial)
    print R2_string_initial  # R2_val_original = 0.183

    original_RMSE = (((x - z)**2).mean())**0.5
    print "original_RMSE"
    print original_RMSE
    print "\n"

    # final goodness of fit
    R2_val_final = calculate_regression(x_sample_predict_list, z)
    R2_string_final = "R2 final: {0:10.3f}, ".format(R2_val_final)
    print R2_string_final  # R2_val_final = 0.267, which is better

    # note: compute the RMSE over the whole corrected series,
    # not over the last scalar x_sample_predict
    final_RMSE = (((np.array(x_sample_predict_list) - z)**2).mean())**0.5
    print "final_RMSE"
    print final_RMSE
    print "\n"

    timesteps = xrange(len(x))
    pylab.plot(timesteps, x, 'r-', timesteps, z, 'b:', timesteps, x_sample_predict_list, 'g--')
    pylab.xlabel('Time')
    pylab.ylabel('Wind Speed')
    pylab.title('Simulated Wind Speed vs Actual Wind Speed')
    pylab.legend(('predicted', 'measured', 'kalman'))
    pylab.show()

def calculate_regression(x, y):
    A = np.array([x, np.ones(len(x))])
    model, resid = np.linalg.lstsq(A.T, y)[:2]
    R2_val = 1 - resid[0] / (y.size * y.var())
    return R2_val
def load_x():
    return np.array([2, 3, 3, 5, 4, 4, 4, 5, 5, 6, 5, 7, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11,
11, 10, 8, 8, 8, 8, 6, 3, 4, 5, 5, 5, 6, 5, 5, 5, 6, 5, 5, 6, 6, 7, 6, 8, 9, 10,
12, 11, 10, 10, 10, 11, 11, 10, 8, 8, 9, 8, 9, 9, 9, 9, 8, 9, 8, 11, 11, 11, 12,
12, 13, 13, 13, 13, 13, 13, 13, 14, 13, 13, 12, 13, 13, 12, 12, 13, 13, 12, 12,
11, 12, 12, 19, 18, 17, 15, 13, 14, 14, 14, 13, 12, 12, 12, 12, 11, 10, 10, 10,
10, 9, 9, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 7, 7, 8, 8, 8, 6, 5, 5,
5, 5, 5, 5, 6, 4, 4, 4, 6, 7, 8, 7, 7, 9, 10, 10, 9, 9, 8, 7, 5, 5, 5, 5, 5, 5,
5, 5, 6, 5, 5, 5, 4, 4, 6, 6, 7, 7, 7, 7, 6, 6, 5, 5, 4, 2, 2, 2, 1, 1, 1, 2, 3,
13, 13, 12, 11, 10, 9, 10, 10, 8, 9, 8, 7, 5, 3, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6,
7, 7, 7, 6, 6, 6, 7, 6, 6, 5, 4, 4, 3, 3, 3, 2, 2, 1, 5, 5, 3, 2, 1, 2, 6, 7,
7, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 8, 8, 8, 8, 7, 7,
7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 11, 11, 11, 11, 10, 10, 9, 10, 10, 10, 2, 2,
2, 3, 1, 1, 3, 4, 5, 8, 9, 9, 9, 9, 8, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7,
7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 7, 5, 5, 5, 5, 5, 6, 5])
def load_z():
    return np.array([3, 2, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 2, 2,
2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 4, 4, 5, 4, 4, 5, 5, 5, 6, 6,
6, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 8, 8, 8, 8, 8, 8, 9, 10, 9, 9, 10, 10, 9,
9, 10, 9, 9, 10, 9, 8, 9, 9, 7, 7, 6, 7, 6, 6, 7, 7, 8, 8, 8, 8, 8, 8, 7, 6, 7,
8, 8, 7, 8, 9, 9, 9, 9, 10, 9, 9, 9, 8, 8, 10, 9, 10, 10, 9, 9, 9, 10, 9, 8, 7,
7, 7, 7, 8, 7, 6, 5, 4, 3, 5, 3, 5, 4, 4, 4, 2, 4, 3, 2, 1, 1, 2, 1, 2, 1, 4, 4,
4, 4, 4, 3, 3, 3, 1, 1, 1, 1, 2, 3, 3, 2, 3, 3, 3, 2, 2, 5, 4, 2, 5, 4, 1, 1, 1,
1, 1, 1, 1, 2, 2, 1, 1, 3, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 4, 4, 3, 3, 4, 4, 4, 4,
4, 4, 5, 5, 5, 4, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 2, 3, 3, 1, 2, 1, 1, 2, 4, 3, 1,
1, 2, 0, 0, 0, 2, 1, 0, 0, 2, 3, 2, 4, 4, 3, 3, 4, 5, 5, 5, 4, 5, 4, 4, 4, 5, 5,
4, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4, 4, 5, 5, 5, 4, 5, 5, 5, 5, 6, 5, 5, 8, 9, 8, 9,
9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 9, 10, 9, 8, 8, 9, 8, 9, 9, 10, 9, 9, 9,
7, 7, 9, 8, 7, 6, 6, 5, 5, 5, 5, 3, 3, 3, 4, 6, 5, 5, 6, 5])
if __name__ == '__main__':
    main()  # this avoids executing main when the module is imported
This line does not follow the scalar Kalman filter equations:

residual = z[t-delay] - h * x_sample_predict_list[t-delay]

In my opinion you should have done:

residual = z[t-delay] - h * x_sample_predict_prior
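In context, the corrected update step would look roughly like this (a sketch following the suggestion above; the names come from the question's code):

# inside the Kalman loop, for t > delay:
x_sample_predict_prior = a * x_sample
p_j_prior = a**2 * p_j + Q
k = h * p_j_prior / (h**2 * p_j_prior + R)
# compare the delayed measurement against the prior estimate itself,
# not against an already-corrected past prediction
residual = z[t-delay] - h * x_sample_predict_prior
x_sample_predict = x_sample_predict_prior + k * residual
p_j = p_j_prior * (1 - h * k)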
