I have 16 Gaussian curves that I have to fit with a single curve. I have been unable to implement the sum of Gaussians (multiple regression) in Python.
Here is the code I am using:
import matplotlib.pyplot as plt
import numpy as np
a=np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1=np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
x=[]
s=[]
v5=9.9e2
for j in range(0,len(a)):
    for i in range(-1500,1500):
        v11=a[j]+i
        x.append(v11)
        z=np.exp((-4*np.log(2)*((v11-a[j])/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
        s.append(z*v1[j])
plt.plot(x,s,'--r',)
plt.stem(a,v1)
Which generates the following plot (with the problem circled):
Instead of the desired output:
The output of your code shows this overlapping because you are not summing the 16 gaussians but instead creating an array containing [x1_g1,x1_g1,...,x3000_g1,x1_g2,...,x3000_g16] and the same for s. It is a 1d array containing the 3000 x values of the first gaussian, then the 3000 x values of the second gaussian and so on. But they are not added. Thus, the plot shows the 16 independent gaussians instead of the sum which is the desired output.
In the actual code, the x values of each gaussian are different (going -1500 and +1500 around its center) which makes adding the 16 gaussians more complicated.
If we consider only the first 2 gaussians for instance, centered at 3750 and -250, the values appended in x from the first gaussian go from 2250 to 5250 in steps of 1, as well as their images in s which are s(2250)... Afterwards, the values of the second gaussian (x between -1750 and 1250) are appended (not added), which will result in an x list like that:
x = [2250,2251,<in steps of 1>,5249,5250,-1750,-1749,<in steps of 1>,1250]
And s is a list where each position contains the image of the same position in x. Starting from this format, getting the final output, which is the sum of the gaussians, is difficult, because we would have to check for equivalent values of x and sum their contributions...
However, if instead we always evaluate the gaussians at the same positions (in the example, between -1750 and 5250 in steps of 1), we will have many more values stored, and most of them will be close to zero, but adding them will be straightforward.
Half-way vectorization
One option similar to the code in the question is the following:
a = np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1 = np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
v5 = 9.9e2
xrange = np.arange(a.min()-1500,a.max()+1500)
# This generates an array between the minimum of a minus 1500 and the maximum of a
# plus 1500. This way, all the values in the old x list are contained in this array.
# Therefore, it becomes really easy to sum the contribution of each gaussian,
# because only an element-wise sum is needed.
s = np.zeros(len(xrange))
for j,aj in enumerate(a):
    z = np.exp((-4*np.log(2)*((xrange-aj)/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
    s += z*v1[j]
plt.plot(xrange,s,'--r')
plt.stem(a,v1)
The output plot is the same as for the completely vectorized solution.
Completely vectorized solution
One simple solution is to define a unique xrange for all 16 gaussians, then calculate s for each of them (on the same x values) and finally sum over the 16 gaussians:
a = np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1 = np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
v5 = 9.9e2
xrange = np.arange(a.min()-1500,a.max()+1500)
z = np.exp((-4*np.log(2)*((xrange-a.reshape((len(a),1)))/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
s = z*v1.reshape((len(a),1))
plt.plot(xrange,s.sum(axis=0),'--r')
plt.stem(a,v1)
Note that I have removed the 2 nested loops using numpy.
The loop over range(-1500,1500) can be avoided by defining i=np.arange(-1500,1500) instead of the for i in ... and leaving the rest of the code untouched (only the indentation has to be updated). That is because numpy operates element-wise over arrays.
The second loop is a bit trickier than that. The a and v1 arrays are reshaped to 2d arrays of shape (16,1) in order to generate a z with the shape (16,len(xrange)). That is why combining an array xrange of length much larger than 16 with a does not raise any error about mismatched dimensions: one array varies along the first dimension and the other along the second.
The code above generates the following plot:
Groupby solution
There is also the option of working with the same code to generate x and s and afterwards, plot every unique value of x (the same value of x can be found in x[i1],x[i2],x[i3]) versus s[i1]+s[i2]+s[i3].
This can be done adding the following code after the loops:
x,s = np.array(x),np.array(s)
ind = np.argsort(x)
x,s = x[ind],s[ind]
unique_x = np.unique(x)
catsums=[]
for k in unique_x:
    catsums.append(np.sum(s[np.where(x==k)]))
plt.plot(unique_x,catsums,'--r')
plt.stem(a,v1)
This groupby can also be vectorized using numpy or pandas as it is explained in this other SO answer
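As a rough sketch of what such a vectorized groupby could look like in plain numpy (the helper name sum_by_unique_x is hypothetical, and this is one possible approach, not necessarily the one in the linked answer): np.unique with return_inverse=True labels every x with the index of its unique value, and np.bincount sums s per label.

```python
import numpy as np

def sum_by_unique_x(x, s):
    """Sum the values of s that share the same x (hypothetical helper)."""
    x = np.asarray(x).ravel()
    s = np.asarray(s, dtype=float).ravel()
    # inverse[k] is the position of x[k] inside unique_x
    unique_x, inverse = np.unique(x, return_inverse=True)
    # bincount with weights adds up s for every label in one pass
    sums = np.bincount(inverse, weights=s)
    return unique_x, sums

ux, sums = sum_by_unique_x([3, 1, 3, 2, 1], [10, 1, 5, 2, 4])
print(ux, sums)
```

With the x and s lists from the question, the result could be plotted with plt.plot(ux, sums, '--r') in place of the explicit loop.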
I have an HxW "feature map", F. Let us assume that it is a HxWx1 map. Through some other operation, I have a set of pixels that are of interest to me, (say N pixels). Each of these pixels is associated with a different value, thus my set is of the form Nx3 where each pixel is of the form x, y and val. Note that this val is different from the feature map value at the location.
Here is my question. Is it possible to vectorize a neighbourhood operation for each of these points? For each pixel n from N, I wish to multiply the corresponding val to its 3x3 neighbourhood in the feature map F. For the 3x3 neighbourhood, this gives a new 3x3 set of elements new val. I want to replace the x y with the pixel with the maximum of new val (multiplied feature map) in the 3x3 window.
This sounds similar to a convolution (slight abuse of terminology here) followed by a max pool operation, but not exactly since each pixel location has a different val to be multiplied.
Sample input and output, and walkthrough for required solution
Let us assume H=10 and W=10
Here is a sample F
0.635955 0.922379 0.993406 0.007837 0.818661 0.983730 0.199866 0.757519 0.073152 0.015831
0.397718 0.097353 0.231351 0.177886 0.343099 0.419940 0.017342 0.087294 0.402266 0.366337
0.978686 0.476594 0.067836 0.148977 0.058994 0.810586 0.542894 0.797419 0.386559 0.225982
0.479860 0.033354 0.353366 0.431562 0.336208 0.674272 0.398151 0.713732 0.598623 0.829230
0.940838 0.869564 0.287100 0.669844 0.631836 0.748982 0.762292 0.597999 0.540236 0.758802
0.925995 0.141296 0.466772 0.672663 0.929746 0.544029 0.991860 0.197474 0.762866 0.798973
0.543519 0.128332 0.624323 0.876569 0.050709 0.223705 0.708381 0.380842 0.818092 0.163447
0.283125 0.329618 0.283481 0.672950 0.136922 0.897785 0.385479 0.764824 0.132671 0.091148
0.661984 0.369459 0.501181 0.352681 0.554113 0.133283 0.593048 0.108534 0.397813 0.836065
0.654929 0.928576 0.539204 0.931213 0.344114 0.591214 0.126809 0.456681 0.036531 0.725228
My structure of pixels, let us say N=3
The three values in the order of row,col,val: (for simplicity I assume x is rows, and y is cols, though it isn't necessarily the case). This is completely independent of the feature map in the previous step.
3,2,0.38
4,4,0.602
7,5,0.9647
The neighborhood around (3,2) is:
[[0.4765941 , 0.06783561, 0.14897662],
[0.03335438, 0.35336647, 0.4315618 ],
[0.86956374, 0.28709952, 0.66984412]]
Thus val * neighborhood yields (here val is 0.38):
[[0.18110576, 0.02577753, 0.05661112],
[0.01267466, 0.13427926, 0.16399349],
[0.33043422, 0.10909782, 0.25454077]]
The location of max value here is (2,0) i.e. (1,-1) with respect to center pixel. Thus my updated (x,y) should be (3,2) + (1,-1) = (4,1).
Similarly for the other two, the updated pixels are : (5,4) and (7,5)
How can I parallelize this entire thing?
(Hopefully to be loaded onto a GPU using Pytorch, but not necessarily, I have not come to that stage yet.)
Note: I had asked this question a few days ago, but it was poorly framed without proper info. Hopefully this solves the issue.
Edit: For this specific instance, F can be produced as a random array:
F = np.random.rand(10,10)
If I understand correctly, you want this:
import numpy as np
from skimage.util.shape import view_as_windows
# pixels is the (N, 3) array of (row, col, val) entries from the question
idx = pixels[:,0:2].astype(int)
print((np.unravel_index((view_as_windows(F,(3,3))[tuple(idx.T-1)]*pixels[:,-1][:,None,None]).reshape(-1,9).argmax(1),(3,3))+idx.T).T-1)
#if you need to replace the values of F with new values
F[tuple(idx.T)] = (view_as_windows(F,(3,3))[tuple(idx.T-1)]*pixels[:,-1][:,None,None]).reshape(-1,9).max(1)
I assumed your window shape is (3,3). Of course, you can change it. And if you need to deal with edge neighborhoods, pad your F with enough 0s (depending on your window size) using np.pad before using the view_as_windows.
output:
[[4 1]
[5 4]
[7 5]]
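For readers without scikit-image, the same idea can be sketched with numpy alone, including the zero padding mentioned above. This is only a sketch under assumptions: it relies on numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20 or later), the function name move_to_local_max is hypothetical, and zero padding only behaves sensibly if F and val are non-negative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def move_to_local_max(F, pixels, size=3):
    """Return updated (row, col) for each pixel (hypothetical helper)."""
    r = size // 2
    # Zero padding so that border pixels also get a full window
    padded = np.pad(F, r, mode="constant", constant_values=0.0)
    # windows[i, j] is the size x size neighbourhood centred on F[i, j]
    windows = sliding_window_view(padded, (size, size))
    idx = pixels[:, :2].astype(int)
    vals = pixels[:, 2]
    # One scaled neighbourhood per pixel, shape (N, size, size)
    local = windows[idx[:, 0], idx[:, 1]] * vals[:, None, None]
    flat = local.reshape(len(idx), -1).argmax(axis=1)
    dr, dc = np.unravel_index(flat, (size, size))
    # Convert window-local offsets back to coordinates in F
    return idx + np.stack([dr, dc], axis=1) - r

F_demo = np.arange(25, dtype=float).reshape(5, 5)
pixels_demo = np.array([[2, 2, 0.5], [0, 0, 2.0], [4, 4, 1.0]])
moved = move_to_local_max(F_demo, pixels_demo)
print(moved)
```

Since everything here is indexing and a reduction over the last axes, the same steps should translate fairly directly to PyTorch tensors, although that has not been tried here.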
I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test on for each coordinate and not just each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. Then to repeat this process for every single coordinate in each array. In total the M-K test would be done 27840 times and leave me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it, and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this breaks the for loop, so it only keeps the results from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.
I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test
data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)
start = time()
for i in range(145):
    for j in range(192):
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro 2.7 GHz Intel Core I7 with 16 GB Ram so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.
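The TypeError quoted in the question is what Python raises when a plain list is indexed with a tuple such as [i, j], so it is likely that out was a list there rather than an array. A minimal sketch of the difference, using small illustrative shapes that are not from the question:

```python
import numpy as np

# A nested list cannot be indexed with [i, j]
out_list = [[0.0] * 3 for _ in range(2)]
try:
    out_list[0, 1] = 1.0
except TypeError as exc:
    message = str(exc)  # mentions that list indices cannot be a tuple

# A preallocated numpy array accepts the same index
out_arr = np.empty((2, 3))
out_arr[0, 1] = 1.0
print(message, out_arr[0, 1])
```

Preallocating the result with np.empty, as in the example above, avoids the error.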
Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on code Grr suggested using.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out,(27840,46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of values I have at every coordinate that will be on my map (i.e. it's just 145*192) and 46 is the number of years the data spans.
I then used the following loop, modified from Grr's code, to find Kendall's tau and its respective p-value at each latitude and longitude over the 46-year period.
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x,out2[j,:])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time as shown:
newdata = np.reshape(y,(145,192,2))
so the final array is in a suitable format to be used to create a global map of both tau and p-values.
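A more compact variant of the same steps might look like the sketch below. It assumes yrmax has the layout (years, lat, lon) described above; the function name kendall_maps and the tiny demo array are made up for illustration. A single reshape replaces the append-based flattening, and preallocated arrays replace np.append inside the loop.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_maps(yrmax):
    """Kendall tau and p-value per grid cell (hypothetical helper)."""
    n_years, n_lat, n_lon = yrmax.shape
    years = np.arange(n_years)
    # One row per coordinate, one column per year
    series = yrmax.reshape(n_years, -1).T
    tau = np.empty(series.shape[0])
    p = np.empty(series.shape[0])
    for k, row in enumerate(series):
        tau[k], p[k] = kendalltau(years, row)
    return tau.reshape(n_lat, n_lon), p.reshape(n_lat, n_lon)

# Demo: every cell rises steadily over 10 "years"
demo = np.arange(10.0)[:, None, None] * np.ones((1, 2, 3))
tau_map, p_map = kendall_maps(demo)
print(tau_map)
```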
Thanks everyone for the assistance!
Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
SIZE = (145,192)
year_matrices = load_years() # list of one 145x192 arrays per year
result_matrix = numpy.zeros(SIZE)
for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = map(lambda d: d[x][y], year_matrices)
        result_matrix[x][y] = analyze_trend(coord_trend)
print result_matrix
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" might work with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4
coord_arrays = list(zip(*year_arrays)) # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')
flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap, then the copies needn't occur: itertools wraps the original arrays and keeps track of where it should be fetching values from internally. (In Python 3, the built-in zip and map already behave lazily in this way.)
There's a catch, though: to take advantage of itertools you have to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just shown flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.
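The numpy route mentioned above can be sketched as follows. The array sizes and variable names are illustrative only. Stacking the yearly arrays makes one copy, and the transposed array that puts the time axis last is just a view onto that copy.

```python
import numpy as np

# Four tiny "years", each a 2x3 grid filled with the year index
years = [np.full((2, 3), float(k)) for k in range(4)]

cube = np.stack(years)              # shape (4, 2, 3); this step copies once
by_coord = cube.transpose(1, 2, 0)  # shape (2, 3, 4); a view, no new copy

shares = np.shares_memory(cube, by_coord)
trend_at_origin = by_coord[0, 0]    # the time series at coordinate (0, 0)
print(shares, trend_at_origin)
```

Each by_coord[i, j] can then be passed to a trend function without building a separate array per coordinate.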
I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formulas are the same as on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is to work out the adjustment for the tied values. I modified the code as in this answer to compute the number of tied values for each record, in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm
def countTies(x):
    '''Count number of ties in rows of a 2D matrix
    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each
            row, the number of ties are inserted at (not really) arbitrary
            locations.
            The locations of tie numbers are not important, since
            they will be subsequently put into a formula of sum(t*(t-1)*(2t+5)).
    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")
    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')
    # sort a copy so the caller's array is left untouched
    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)
    # mark runs of equal neighbours, padded so every run has a start and an end
    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))
    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1
    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()
    return result
def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns
    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each
            row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            corresponding to data in each column in <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            corresponding to data in each row in <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")
    # always put records in rows and do M-K test on each row
    if axis == 0:
        data = data.T
    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))
    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)
    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8 # avoid dividing 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p>0.5, 1-p, p)
    if tails==2:
        p=p*2
    return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])
# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)
# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)
import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.
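To make the vectorized S computation easier to follow, here is a minimal single-series sketch of the same statistic. It assumes there are no tied values, so the tie correction is dropped, and the function name mann_kendall_1d is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def mann_kendall_1d(x):
    """Mann-Kendall S, z and two-tailed p for one series without ties."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # signs[i, j] = sign(x[j] - x[i]); keep only pairs with i < j
    signs = np.sign(x[None, :] - x[:, None])
    s = int(np.triu(signs, k=1).sum())
    var = n * (n - 1) * (2 * n + 5) / 18.0  # no tie correction (assumption)
    z = (s - np.sign(s)) / np.sqrt(var)     # continuity correction
    p = 2 * norm.sf(abs(z))
    return s, z, p

s_up, z_up, p_up = mann_kendall_1d([1, 2, 3, 4, 5])
s_down, z_down, p_down = mann_kendall_1d([5, 4, 3, 2, 1])
print(s_up, z_up, p_up)
```

MannKendallTrend2D applies the same pairwise sign sum to every row at once by adding a leading axis.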
I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this :
for j in range(n):
    for i in range(alphaLen):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_,m-j, m,alpha, "cpu")
        costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)
normedValues=costListTrain/np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
As you can see, the values are very close together!
I am trying to plot this data so that the two quantities are on the x and y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(x, y, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error value by dividing all values by the max, but it didn't work !
The only way that I could make it work (which is incorrect) is to normalize my data in two different ways. One is based on each column (which means factor 1 is constant and factor 2 changes), and the other is based on each row (which means factor 2 is constant and factor 1 changes). But it doesn't really make sense, because I need a single plot to show the tradeoff between the two quantities on the error values.
UPDATE
This is what I mean by the last paragraph.
Normalizing values based on the max of each row, which corresponds to eigenvalues:
maxsEigBasedTrain= np.amax(costListTrain.T,1)[:,np.newaxis]
maxsEigBasedTest= np.amax(costListTest.T,1)[:,np.newaxis]
normEigCostTrain=costListTrain.T/maxsEigBasedTrain
normEigCostTest=costListTest.T/maxsEigBasedTest
Normalizing values based on the max of each column, which corresponds to alphas:
maxsAlphaBasedTrain= np.amax(costListTrain,1)[:,np.newaxis]
maxsAlphaBasedTest= np.amax(costListTest,1)[:,np.newaxis]
normAlphaCostTrain=costListTrain/maxsAlphaBasedTrain
normAlphaCostTest=costListTest/maxsAlphaBasedTest
plot 1:
where no. eigenvalue = 10 and alpha changes (should correspond to column 10 of plot 1) :
where alpha = 0.0001 and eigenvalues change (should correspond to first row of plot1)
but as you can see the results are different from plot 1!
UPDATE:
just to clarify more stuff this is how I read my data:
import numpy as np
from sklearn import datasets
rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes=np.c_[np.ones(len(X_diabetes)),X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross= math.ceil(0.7*len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val=ind[cross:]
X_val,y_val= X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contain the original value before normalization for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all, there was one problem in my code. I was expecting the "No. of eigenvalue" to correspond to rows, but in my for loop they fill the columns. The correct loop is this:
for i in range(alphaLen):
    for j in range(n):
        alpha=alpha_list[i]
        c=train.eig(xt_, yt_,m-j,m,alpha,"cpu")
        costListTrain[i,j]=cost.err(xt_,xt_,yt_,c)
        costListTest[i,j]=cost.err(xt_,xv_,yv_,c)
After asking questions from friends and colleagues I got this answer :
I would assume on default imshow and other plotting commands you
might want to use, do equally sized intervals on the values you are
plotting. if you can set that to logarithmic you should be fine.
Ideally, equally "populated bins" would proof most effective, i guess.
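The "equally populated bins" idea from the quote could be sketched as a rank transform: each value is replaced by its rank, scaled to [0, 1], so that a colormap spreads evenly over the data regardless of how tightly the raw values cluster. The helper name rank_normalize is hypothetical, and this is an alternative illustration, not what was used for the plot in this answer.

```python
import numpy as np
from scipy.stats import rankdata

def rank_normalize(values):
    """Map values to [0, 1] by rank (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # Rank the flattened data, then restore the original shape
    ranks = rankdata(values.ravel(), method="average").reshape(values.shape)
    return (ranks - 1) / (values.size - 1)

demo = np.array([[1.0, 1.0000001], [3.0, 2.0]])
colors = rank_normalize(demo)
print(colors)
```

The output could be passed as the color array to plt.scatter or plt.imshow. The trade-off is that the colorbar then shows ranks rather than error values.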
For plotting, I just subtract the min value from the error, then add a small number, and finally take the log.
temp=costListTrain- costListTrain.min()
temp+=0.00000001
extent = [0, 20,alpha_list[0], alpha_list[-1]]
plt.imshow(np.log(temp),interpolation="nearest",cmap=plt.get_cmap('spectral'), extent = extent, origin="lower")
plt.colorbar()
and result is :
In R, I am using ccf or acf to compute the pair-wise cross-correlation function so that I can find out which shift gives me the maximum value. From the looks of it, R gives me a normalized sequence of values. Is there something similar in Python's scipy or am I supposed to do it using the fft module? Currently, I am doing it as follows:
xcorr = lambda x,y : irfft(rfft(x)*rfft(y[::-1]))
x = numpy.array([0,0,1,1])
y = numpy.array([1,1,0,0])
print xcorr(x,y)
To cross-correlate 1d arrays use numpy.correlate.
For 2d arrays, use scipy.signal.correlate2d.
There is also scipy.stsci.convolve.correlate2d.
There is also matplotlib.pyplot.xcorr which is based on numpy.correlate.
See this post on the SciPy mailing list for some links to different implementations.
Edit: @user333700 added a link to the SciPy ticket for this issue in a comment.
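Since the question asks for a normalized sequence and the shift with the maximum value, here is a minimal sketch built on numpy.correlate. The function name normalized_xcorr is hypothetical. The normalization (remove the means, divide by the product of the norms) is one common choice and is not claimed to match R's ccf exactly.

```python
import numpy as np

def normalized_xcorr(x, y):
    """Lags and normalized cross-correlation of two 1d series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    # corr[k] = sum_n x[n + lag] * y[n] for every possible lag
    corr = np.correlate(x, y, mode="full")
    corr = corr / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    lags = np.arange(-(len(y) - 1), len(x))
    return lags, corr

sig = np.array([0, 1, 2, 1, 0, 0, 0])
shifted = np.array([0, 0, 0, 1, 2, 1, 0])  # sig delayed by two samples

lags_same, corr_same = normalized_xcorr(sig, sig)
lags_shift, corr_shift = normalized_xcorr(sig, shifted)
best_same = lags_same[corr_same.argmax()]
best_shift = lags_shift[corr_shift.argmax()]
print(best_same, best_shift)
```

With this sign convention, a negative best lag means the second series is a delayed copy of the first.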
If you are looking for a rapid, normalized cross correlation in either one or two dimensions
I would recommend the openCV library (see http://opencv.willowgarage.com/wiki/ http://opencv.org/). The cross-correlation code maintained by this group is the fastest you will find, and it will be normalized (results between -1 and 1).
While this is a C++ library the code is maintained with CMake and has python bindings so that access to the cross correlation functions is convenient. OpenCV also plays nicely with numpy. If I wanted to compute a 2-D cross-correlation starting from numpy arrays I could do it as follows.
import numpy
import cv
#Create a random template and place it in a larger image
templateNp = numpy.random.random( (100,100) )
image = numpy.random.random( (400,400) )
image[:100, :100] = templateNp
#create a numpy array for storing result
resultNp = numpy.zeros( (301, 301) )
#convert from numpy format to openCV format
templateCv = cv.fromarray(numpy.float32(templateNp))
imageCv = cv.fromarray(numpy.float32(image))
resultCv = cv.fromarray(numpy.float32(resultNp))
#perform cross correlation (the image comes first, then the template)
cv.MatchTemplate(imageCv, templateCv, resultCv, cv.CV_TM_CCORR_NORMED)
#convert result back to numpy array
resultNp = numpy.asarray(resultCv)
For just a 1-D cross-correlation create a 2-D array with shape equal to (N, 1 ). Though there is some extra code involved to convert to an openCV format the speed-up over scipy is quite impressive.
I just finished writing my own optimised implementation of normalized cross-correlation for N-dimensional arrays. You can get it from here.
It will calculate cross-correlation either directly, using scipy.ndimage.correlate, or in the frequency domain, using scipy.fftpack.fftn/ifftn depending on whichever will be quickest.
For 1D arrays, numpy.correlate is faster than scipy.signal.correlate. Across different sizes, I see a consistent 5x performance gain using numpy.correlate. When the two arrays are of similar size (the bright line along the diagonal), the performance difference is even larger (50x or more).
# a simple benchmark
from datetime import datetime
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
res = []
for x in range(1, 1000):
    list_x = []
    for y in range(1, 1000):
        # generate different sizes of series to compare
        l1 = np.random.choice(range(1, 100), size=x)
        l2 = np.random.choice(range(1, 100), size=y)
        time_start = datetime.now()
        np.correlate(a=l1, v=l2)
        t_np = datetime.now() - time_start
        time_start = datetime.now()
        scipy.signal.correlate(in1=l1, in2=l2)
        t_scipy = datetime.now() - time_start
        list_x.append(t_scipy / t_np)
    res.append(list_x)
plt.imshow(np.matrix(res))
By default, scipy.signal.correlate calculates a few extra numbers by padding, and that might explain the performance difference.
>> l1 = [1,2,3,2,1,2,3]
>> l2 = [1,2,3]
>> print(numpy.correlate(a=l1, v=l2))
>> print(scipy.signal.correlate(in1=l1, in2=l2))
[14 14 10 10 14]
[ 3 8 14 14 10 10 14 8 3] # the first 3 is [0,0,1]dot[1,2,3]
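The difference in output length comes from the default mode: numpy.correlate defaults to mode='valid', while scipy.signal.correlate defaults to mode='full'. A small check with the same inputs, where the variable names are illustrative:

```python
import numpy as np
from scipy import signal

l1 = [1, 2, 3, 2, 1, 2, 3]
l2 = [1, 2, 3]

np_valid = np.correlate(l1, l2)                     # default mode='valid'
sp_valid = signal.correlate(l1, l2, mode="valid")   # same window in scipy
np_full = np.correlate(l1, l2, mode="full")
sp_full = signal.correlate(l1, l2)                  # default mode='full'
print(np_valid, sp_full)
```

For a like-for-like timing comparison, the mode should therefore be set explicitly in both calls.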
This may be a question for a different forum, if so please let me know. I noticed that only 14 people follow the wavelet tag.
I have here an elegant way of extending the wavelet decomposition in pywt (pyWavelets package) to multiple dimensions. This should run out of the box if pywt is installed. Test 1 shows the decomposition and recomposition of a 3D array. All one has to do is increase the number of dimensions, and the code will work in decomposing/recomposing with 4, 6 or even 18 dimensions of data.
I've replaced the pywt.wavedec and pywt.waverec functions here. Also, in fn_dec, I show how the new wavedec function works just like the old one.
There is one catch though: it represents the wavelet coefficients as an array of the same shape as the data. As a consequence, with my limited knowledge of wavelets, I've only been able to use it for Haar wavelets. Others, like DB4 for example, bleed coefficients over the edges of these strict bounds (which is not a problem with the current representation of coefficients as a list of arrays [CA, CD1 ... CDN]). Another catch is that I've only worked this with 2^N edge cuboids of data.
Theoretically, I think it should be possible to make sure that the "bleeding" does not occur. An algorithm for this sort of wavelet decomposition and recomposition is discussed in "Numerical Recipes in C" by William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery (Second Edition). Though this algorithm assumes reflection at the edges rather than the other forms of edge extension (like zpd), the method is general enough to work for other forms of extension.
Any suggestion on how to extend this work to other wavelets?
NOTE: This query is also posted on http://groups.google.com/group/pywavelets
Thanks,
Ajo
import sys

import numpy as np
import pywt


def waveFn(wavelet):
    # accept either a wavelet name or a pywt.Wavelet object
    if not isinstance(wavelet, pywt.Wavelet):
        return pywt.Wavelet(wavelet)
    else:
        return wavelet


# given a single dimensional array ... returns the coefficients.
# (the modes 'symmetric' and 'zero' were called 'sym' and 'zpd' in older pywt releases)
def wavedec(data, wavelet, mode='symmetric'):
    wavelet = waveFn(wavelet)
    dLen = len(data)
    coeffs = np.zeros_like(data)
    level = pywt.dwt_max_level(dLen, wavelet.dec_len)
    a = data
    end_idx = dLen
    for idx in range(level):
        a, d = pywt.dwt(a, wavelet, mode)
        # details of each level go into the upper half of the remaining slots
        begin_idx = end_idx // 2
        coeffs[begin_idx:end_idx] = d
        end_idx = begin_idx
    coeffs[:end_idx] = a
    return coeffs


def waverec(data, wavelet, mode='symmetric'):
    wavelet = waveFn(wavelet)
    dLen = len(data)
    level = pywt.dwt_max_level(dLen, wavelet.dec_len)
    end_idx = 1
    a = data[:end_idx]  # approximation ... also the original data
    d = data[end_idx:end_idx * 2]
    for idx in range(level):
        a = pywt.idwt(a, d, wavelet, mode)
        end_idx *= 2
        d = data[end_idx:end_idx * 2]
    return a


def fn_dec(arr):
    # reference result: pywt.wavedec on each row, coefficients concatenated
    return np.array([np.hstack(pywt.wavedec(row, 'haar', 'zero')) for row in arr])
    # return np.array([row * 2 for row in arr])


if __name__ == '__main__':
    test = 1
    np.random.seed(10)
    wavelet = waveFn('haar')
    if test == 0:
        # Single dimensional test.
        a = np.random.randn(1, 8)
        print("original values A")
        print(a)
        print("decomposition of A by method in pywt")
        print(fn_dec(a))
        print(" decomposition of A by my method")
        coeffs = wavedec(a[0], 'haar', 'zero')
        print(coeffs)
        print("recomposition of A by my method")
        print(waverec(coeffs, 'haar', 'zero'))
        sys.exit()
    if test == 1:
        a = np.random.randn(4, 4, 4)
        # 3 D test
        print("original value of A")
        print(a)
        # decompose the signal into wavelet coefficients.
        dimensions = a.shape
        for dim in dimensions:
            # move the current first axis to the end and transform along it
            a = np.rollaxis(a, 0, a.ndim)
            ndim = a.shape
            # a = fn_dec(a.reshape(-1, dim))
            a = np.array([wavedec(row, wavelet) for row in a.reshape(-1, dim)])
            a = a.reshape(ndim)
        print(" decomposition of signal into coefficients")
        print(a)
        # re-composition of the coefficients into original signal
        for dim in dimensions:
            a = np.rollaxis(a, 0, a.ndim)
            ndim = a.shape
            a = np.array([waverec(row, wavelet) for row in a.reshape(-1, dim)])
            a = a.reshape(ndim)
        print("recomposition of coefficients to signal")
        print(a)
First of all, I would like to point you to pywt.dwtn, the function that already implements the single-level multi-dimensional transform (Source). It returns a dictionary of n-dimensional coefficient arrays. Coefficients are addressed by keys that describe the type of transform (approximation/details) applied to each of the dimensions.
For example for a 2D case the result is a dictionary with approximation and details coefficients arrays:
>>> pywt.dwtn([[1,2,3,4],[3,4,5,6],[5,6,7,8],[7,8,9,10]], 'db1')
{'aa': [[5.0, 9.0], [13.0, 17.0]],
'ad': [[-1.0, -1.0], [-1.0, -1.0]],
'da': [[-2.0, -2.0], [-2.0, -2.0]],
'dd': [[0.0, 0.0], [0.0, -0.0]]}
Where aa is the coefficients array with approximation transform applied to both dimensions (LL) and da is the coefficients array with details transform applied to the first dimension and approximation transform applied to the second one (HL) (compare with dwt2 output).
Based on that it should be fairly easy to extend it to the multi-level case.
Here's my take on the decomposition part: https://gist.github.com/934166.
I would also like to address one issue you mention in your question:
"There is one catch though: It represents the wavelet coefficients as an array of the same shape as the data."
The approach of representing results as an array of the same shape/size as the input data is, in my opinion, harmful. It makes the whole thing unnecessarily complex to understand and work with, because you have to make assumptions or maintain a secondary data structure with indexes to be able to access coefficients in the output array and perform an inverse transform (see Matlab's documentation for wavedec/waverec).
Also, even though it works great on paper, it does not always fit real-world applications because of the problems you have mentioned: most of the time the input data size is not 2^n, and the decimated result of convolving the signal with the wavelet filter is larger than the "storage space", which in turn can lead to data loss and non-perfect reconstruction.
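To put numbers on the "storage space" problem, here is a small sketch of the coefficient-length arithmetic. It assumes the length rule documented for pywt.dwt: floor((N + filter_len - 1) / 2) coefficients per band for the padding-based modes, and ceil(N / 2) for the periodization mode. The helper name coeff_len is made up for this illustration.

```python
def coeff_len(data_len, filter_len, periodization=False):
    """Number of approximation (or detail) coefficients after one DWT level."""
    if periodization:
        return (data_len + 1) // 2           # ceil(N / 2)
    return (data_len + filter_len - 1) // 2  # floor((N + L - 1) / 2)


# (label, signal length N, decomposition filter length L)
cases = [
    ("haar, N=8", 8, 2),  # 4 + 4 = 8 coefficients: fits in 8 slots
    ("db4,  N=8", 8, 8),  # 7 + 7 = 14 coefficients: does not fit
    ("haar, N=5", 5, 2),  # 3 + 3 = 6 coefficients: does not fit either
]
for label, n, flen in cases:
    total = 2 * coeff_len(n, flen)
    print(label, "->", total, "coefficients for", n, "input samples")
```

With the periodization mode the count stays at ceil(N / 2) per band regardless of the filter length, so an even-length signal would fit again, but that changes the boundary handling, so treat it as something to try rather than a drop-in fix.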
To avoid these problems I would recommend using more natural data structures to represent the result data hierarchy, like Python's lists, dictionaries and tuples (where available).
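As a rough illustration of such a layout, and of how the single-level dictionary can be extended to several levels, here is a sketch that keeps the result as a list: the final approximation array followed by one dictionary of detail arrays per level. To stay runnable with NumPy alone it uses a hand-written Haar step instead of pywt, and it assumes even edge lengths at every level. The names haar_dwtn, haar_idwtn, haar_wavedecn and haar_waverecn are made up for this example and are not PyWavelets functions. The keys follow the same 'a'/'d' per-axis convention as the pywt.dwtn output shown above.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_step(x, axis):
    """One Haar level along a single axis (even length assumed)."""
    x = np.moveaxis(x, axis, -1)
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / SQRT2
    detail = (even - odd) / SQRT2
    return np.moveaxis(approx, -1, axis), np.moveaxis(detail, -1, axis)


def haar_istep(approx, detail, axis):
    """Inverse of haar_step along a single axis."""
    a = np.moveaxis(approx, axis, -1)
    d = np.moveaxis(detail, axis, -1)
    out = np.empty(a.shape[:-1] + (2 * a.shape[-1],))
    out[..., 0::2] = (a + d) / SQRT2
    out[..., 1::2] = (a - d) / SQRT2
    return np.moveaxis(out, -1, axis)


def haar_dwtn(data):
    """Single-level n-D transform: {'aa..': array, 'ad..': array, ...}."""
    coeffs = {"": np.asarray(data, dtype=float)}
    for axis in range(coeffs[""].ndim):
        nxt = {}
        for key, arr in coeffs.items():
            nxt[key + "a"], nxt[key + "d"] = haar_step(arr, axis)
        coeffs = nxt
    return coeffs


def haar_idwtn(coeffs):
    """Inverse of haar_dwtn; the i-th key character belongs to axis i."""
    coeffs = dict(coeffs)
    ndim = len(next(iter(coeffs)))
    for axis in reversed(range(ndim)):
        prefixes = {key[:-1] for key in coeffs}
        coeffs = {p: haar_istep(coeffs[p + "a"], coeffs[p + "d"], axis)
                  for p in prefixes}
    return coeffs[""]


def haar_wavedecn(data, levels):
    """Multi-level transform: [approx, details_coarsest, ..., details_finest]."""
    approx = np.asarray(data, dtype=float)
    all_approx_key = "a" * approx.ndim
    details = []
    for _ in range(levels):
        coeffs = haar_dwtn(approx)
        approx = coeffs.pop(all_approx_key)  # only this band is split further
        details.append(coeffs)
    return [approx] + details[::-1]


def haar_waverecn(coeffs):
    """Inverse of haar_wavedecn."""
    approx = coeffs[0]
    all_approx_key = "a" * approx.ndim
    for level_details in coeffs[1:]:
        level = dict(level_details)
        level[all_approx_key] = approx
        approx = haar_idwtn(level)
    return approx


signal = np.random.default_rng(10).standard_normal((4, 4, 4))
decomposed = haar_wavedecn(signal, levels=2)
restored = haar_waverecn(decomposed)
print(np.allclose(restored, signal))  # True
```

Because every level carries its own arrays, nothing has to fit into a fixed-size buffer, and no index bookkeeping is needed to find a given band again. Replacing haar_dwtn and haar_idwtn with the corresponding pywt calls should give the same structure for other wavelets, although I have not checked that variant here.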