ValueError when applying theano's scan function

I'm trying to evaluate a multivariate normal CDF several times using theano's scan function, but I'm getting a ValueError.
Here's an example of the original function I'm trying to vectorize:
from scipy.stats.mvn import mvnun # library that calculates MVN CDF
low = [-1.96, 0 ] # lower bounds of integration
upp = [0 , 1.96] # upper bounds of integration
mean = [0 , 0 ] # means of the jointly distributed random variables
cov = [[1,0.25],[0.25,1]] # covariance matrix
print(mvnun(low,upp,mean,cov))
This produces the following output:
(0.19620339269649473, 0)
Simple and straightforward, right?
What I'm really trying to do is create 4 large input objects with 1500 elements each. That way, I get to evaluate the mvnun function 1500 times. The idea is that at each iteration, all inputs are different from the last, and no information from the previous iteration is necessary.
Here is my setup:
import theano
import numpy as np
lower = theano.tensor.dmatrix("lower") # lower bounds - dim: 1500 x 2
upper = theano.tensor.dmatrix("upper") # upper bounds - dim: 1500 x 2
means = theano.tensor.dmatrix("means") # means - dim: 1500 x 2
covs = theano.tensor.dtensor3("covs") # cov matrices - dim: 1500 x 2 x 2
results, updates = theano.scan(fn=mvnun,
                               sequences=[lower,upper,means,covs])
f = theano.function(inputs=[lower, upper, means, covs],
                    outputs=results,
                    updates=updates)
However, when I try to run this block of code, I get an error on the line with the scan command. The error states "ValueError: setting an array element with a sequence." The full traceback is below:
Traceback (most recent call last):
File "", line 7, in
sequences=[lower,upper,means,covs])
File "C:\Anaconda2\lib\site-packages\theano\scan_module\scan.py",
line 745, in scan
condition, outputs, updates = scan_utils.get_updates_and_outputs(fn(*args))
ValueError: setting an array element with a sequence.
I originally thought that the code wasn't working because the mvnun function returns a two-element tuple instead of a single value.
However, when I tried to vectorize a test function (that I created) that also returned a two-element tuple, things worked just fine. Here's the full example:
# Some weird crazy function that takes in three Nx1 vectors
# and an NxN matrix and spits out a tuple of scalars.
def test_func(low_i,upp_i,mean_i,cov_i):
    r1 = low_i.sum() + upp_i.sum()
    r2 = np.dot(mean_i,cov_i).sum()
    test_func_out = (r1,r2)
    return(test_func_out)
lower = theano.tensor.dmatrix("lower") # lower
upper = theano.tensor.dmatrix("upper") # upper
means = theano.tensor.dmatrix("means") # means
covs = theano.tensor.dtensor3("covs") # covs
results, updates = theano.scan(fn=test_func,
                               sequences=[lower,upper,means,covs])
f = theano.function(inputs=[lower, upper, means, covs],
                    outputs=results,
                    updates=updates)
np.random.seed(666)
obs = 1500 # number of elements in the dataset
dim = 2 # dimension of multivariate normal distribution
# Generating random values for the lower bounds, upper bounds and means
lower_vals = np.random.rand(obs,dim)
upper_vals = lower_vals + np.random.rand(obs,dim)
means_vals = np.random.rand(obs,dim)
# Creates a symmetric matrix - used for the random covariance matrices
def make_sym_matrix(dim,vals):
    m = np.zeros([dim,dim])
    xs,ys = np.triu_indices(dim,k=1)
    m[xs,ys] = vals[:-dim]
    m[ys,xs] = vals[:-dim]
    m[ np.diag_indices(dim) ] = vals[-dim:]
    return m
# Generating the random covariance matrices
covs_vals = []
for i in range(obs):
    # dim**2 is the square (dim^2 is bitwise XOR); // keeps the count an integer
    cov_vals = np.random.rand((dim**2 - dim)//2+dim)
    cov_mtx = make_sym_matrix(dim,cov_vals)
    covs_vals.append(cov_mtx)
covs_vals = np.array(covs_vals)
# Evaluating the test function on all 1500 elements
print(f(lower_vals,upper_vals,means_vals,covs_vals))
When I run this block of code, everything works out fine and the output I get is a list with 2 arrays, each containing 1500 elements:
[array([ 4.24700864, 3.80830129, 2.60806493, ..., 3.12995381, 4.41907055, 4.12880839]),
array([ 0.87814314, 1.01768617, 0.45072405, ..., 1.15788282, 0.15766754, 1.32393402])]
It's also worth noting that the order in which the vectorized function is getting elements from the sequences is perfect. I ran a sanity check with the first 3 numbers in the list:
for i in range(3):
    print(test_func(lower_vals[i],upper_vals[i],means_vals[i],covs_vals[i]))
And the results are:
(4.2470086396797502, 0.87814313729162796)
(3.808301289302495, 1.017686166097616)
(2.6080649327828564, 0.45072405177076169)
These values are practically identical to the first 3 output values in the vectorized approach.
So back to the main problem: why can't I get the mvnun function to work when I use it in the scan statement? Why am I getting this odd ValueError?
Any kind of advice would be really helpful!!!
Thanks!!!
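A possible way to look at this, offered as a hedged sketch rather than a confirmed diagnosis: scan calls fn with symbolic variables while it builds the graph, so test_func succeeds because it only uses operations those variables support, whereas a compiled numerical routine such as mvnun probably expects concrete numbers. Since the iterations are independent, a plain loop over the rows of the concrete NumPy arrays avoids scan altogether. The helper apply_rowwise and the function stand_in below are hypothetical names introduced for illustration; stand_in only mimics the call shape of mvnun and does not compute a CDF.

```python
import numpy as np

def apply_rowwise(fn, *sequences):
    """Call fn once per row, passing the i-th row of every sequence."""
    n_rows = len(sequences[0])
    if any(len(seq) != n_rows for seq in sequences):
        raise ValueError("all sequences need the same first dimension")
    return [fn(*(seq[i] for seq in sequences)) for i in range(n_rows)]

# Hypothetical stand-in with the call shape mvnun(low, upp, mean, cov).
# It returns a (value, status) tuple but does NOT compute a CDF.
def stand_in(low_i, upp_i, mean_i, cov_i):
    value = float(np.sum(upp_i - low_i) + np.dot(mean_i, cov_i).sum())
    return value, 0

rng = np.random.RandomState(0)
lower_vals = rng.rand(5, 2)
upper_vals = lower_vals + rng.rand(5, 2)
means_vals = rng.rand(5, 2)
covs_vals = np.tile(np.eye(2), (5, 1, 1))  # five 2 x 2 identity matrices

results = apply_rowwise(stand_in, lower_vals, upper_vals, means_vals, covs_vals)
values = np.array([value for value, status in results])
```

If the installed SciPy still provides mvnun, it could be passed as fn in place of stand_in, with the 1500-row arrays as the sequences.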

Related

How to make a graph between order of the matrix and the time taken to multiply the two matrices?

import numpy as np
from time import time
import matplotlib.pyplot as plt
np.random.seed(27)
mysetup = "from math import sqrt"
begin=time()
i=int(input("Number of rows in first matrix"))
k=int(input("Number of column in first and rows in second matrix"))
j=int(input("Number of columns in second matrix"))
A = np.random.randint(1,10,size = (i,k))
B = np.random.randint(1,10,size = (k,j))
def multiply_matrix(A,B):
    global C
    if A.shape[1]==B.shape[0]:
        C=np.zeros((A.shape[0],B.shape[1]),dtype=int)
        for row in range(i):
            for col in range(j):
                for elt in range(0,len(B)):
                    C[row,col] += A[row,elt]*B[elt,col]
        return C
    else:
        return "Cannot multiply A and B"
print(f"Matrix A:\n {A}\n")
print(f"Matrix B:\n {B}\n")
D=print(multiply_matrix(A, B))
end=time()
t=print(end-begin)
x=[0,100,10]
y=[100,100,1000]
plt.plot(x,y)
plt.xlabel('Time taken for the program to run')
plt.ylabel('Order of the matrix multiplication')
plt.show()
In the program, I generate random elements for the matrices to be multiplied. Basically, I am trying to measure the time it takes to multiply two matrices. i, j and k are taken as the order of the matrices. Since we cannot multiply matrices where the number of columns of the first differs from the number of rows of the second, I use the single variable 'k' for both.
Initially I considered incrementing the order of the matrix using a for loop, but wasn't able to do so. I want the graph to display the time it took to multiply the matrices on the x axis and the order of the resultant matrix on the y axis.
There is a problem in the logic I applied, but I am not able to work out how to solve it, as I am a beginner in programming.
I was expecting to get a Y axis with a scale ranging from 0 to 100 in steps of 10, and an x axis with a scale from 100 to 1000 in steps of 100.
The thousandth entry on the x axis will correspond to the time it took to multiply two matrices with 1000 rows and columns.
Suppose the time it took to compute this was 200 seconds. Then the graph should show the point (1000, 200).

Some problematic points I'd like to address:
1. You're starting the timer before the user enters the input, and the time that takes varies. We want to be as precise as possible, so we should only measure how long the multiply_matrix function takes to run.
2. Because you're taking user input, each run gives you one result, and one result is a single point, not a full graph. So we need to drop the user input and generate our own data.
3. Following on from point 2, we don't want a single shot for each matrix order. To test how long it takes to multiply two matrices of order 300 (for example), we should do it N times and take the average to be more precise, especially since we are generating random numbers and some randomly generated matrices may be quicker to multiply than others. Averaging over N tests is not 100% accurate, but it does help.
4. You don't need to make C a global variable; it can be a local variable of multiply_matrix, which returns it anyway. The global statement also doesn't do what you might expect here: C does not exist at module level until the function has run.
5. This is not a must, but it can improve your program a little: use time.perf_counter(), which uses the clock with the highest available resolution to measure a short duration.
6. You need to swap the axes, because we want to see how the time is affected by the order of the matrices, not the opposite (so the X axis is now the order and the Y axis is the average time it took to multiply them).
Those fixes translate to this code:
1. Timing multiply_matrix only (this needs import time rather than from time import time):
import time
begin = time.perf_counter()
C = multiply_matrix(A, B)
end = time.perf_counter()
2+3. Generating our own data, looping from order 1 up to (but not including) maximum_order, taking 50 tests for each order:
maximum_order = 50
tests_number_for_each_order = 50
def generate_matrices_to_graph():
    matrix_orders = [] # our X
    multiply_average_time = [] # our Y
    for order in range(1, maximum_order):
        print(order)
        times_for_each_order = []
        for _ in range(tests_number_for_each_order):
            # generating random square matrices of size order.
            A = np.random.randint(1, 10, size=(order, order))
            B = np.random.randint(1, 10, size=(order, order))
            # getting the time it took to compute
            begin = time.perf_counter()
            multiply_matrix(A, B)
            end = time.perf_counter()
            # adding it to the times list
            times_for_each_order.append(end - begin)
        # adding the data about the order and the average time it took to compute
        matrix_orders.append(order)
        multiply_average_time.append(sum(times_for_each_order) / tests_number_for_each_order) # average
    return matrix_orders, multiply_average_time
Minor changes to multiply_matrix as we don't need i, j, k from the user:
def multiply_matrix(A, B):
    matrix_order = A.shape[1]
    C = np.zeros((matrix_order, matrix_order), dtype=int)
    for row in range(matrix_order):
        for col in range(matrix_order):
            for elt in range(0, len(B)):
                C[row, col] += A[row, elt] * B[elt, col]
    return C
and finally call generate_matrices_to_graph:
# calling generate_matrices_to_graph and plotting its result
plt.plot(*generate_matrices_to_graph())
plt.xlabel('Matrix order')
plt.ylabel('Time [in seconds]')
plt.show()
Some outputs:
We can see that when tests_number_for_each_order is small, the graph is noisier and less crisp.
[Plot: orders 1-40 with 1 test for each order]
[Plot: orders 1-40 with 30 tests for each order]
[Plot: orders 1-40 with 80 tests for each order]

I love this kind of question:
import numpy as np
from time import time
import matplotlib.pyplot as plt
np.random.seed(27)
dim = []
times = []
for i in range(1,10001,10):
    A = np.random.randint(1,10,size=(1,i))
    B = np.random.randint(1,10,size=(i,1))
    begin = time()
    C = np.dot(A, B) # matrix product; A*B would broadcast element-wise instead
    times.append(time()-begin)
    dim.append(i)
plt.plot(times,dim)
plt.show()
This is a simplified test in which I used one-dimensional matrices: (1,1)(1,1), (1,11)(11,1), (1,21)(21,1) and so on.
But you can use a double iteration to also change the "outer" dimension of the matrices and see how this affects the computation time.
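The double iteration mentioned above could look roughly like the sketch below. The function name time_products, the chosen sizes and the number of repeats are all assumptions made for illustration; it uses np.dot for the product and time.perf_counter for the measurement.

```python
import time
import numpy as np

def time_products(row_sizes, inner_sizes, repeats=3):
    """Return mean timings in seconds, shape (len(row_sizes), len(inner_sizes))."""
    rng = np.random.RandomState(27)
    timings = np.zeros((len(row_sizes), len(inner_sizes)))
    for r, rows in enumerate(row_sizes):          # "outer" dimension
        for c, inner in enumerate(inner_sizes):   # shared "inner" dimension
            A = rng.randint(1, 10, size=(rows, inner))
            B = rng.randint(1, 10, size=(inner, rows))
            samples = []
            for _ in range(repeats):
                begin = time.perf_counter()
                np.dot(A, B)  # matrix product, result is rows x rows
                samples.append(time.perf_counter() - begin)
            timings[r, c] = sum(samples) / repeats
    return timings

timings = time_products(row_sizes=[1, 10, 50], inner_sizes=[10, 100])
```

Each row of timings can then be plotted against inner_sizes to compare how the cost grows for different outer dimensions.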

Calculating cross-correlation with fft returning backwards output

I'm trying to cross correlate two sets of data, by taking the fourier transform of both and multiplying the conjugate of the first fft with the second fft, before transforming back to time space. In order to test my code, I am comparing the output with the output of numpy.correlate. However, when I plot my code, (restricted to a certain window), it seems the two signals go in opposite directions/are mirrored about zero.
This is what my output looks like
My code:
import numpy as np
import matplotlib.pyplot as plt
phl_data = np.sin(np.arange(0, 10, 0.1))
mlac_data = np.cos(np.arange(0, 10, 0.1))
N = phl_data.size
zeroes = np.zeros(N-1)
phl_data = np.append(phl_data, zeroes)
mlac_data = np.append(mlac_data, zeroes)
# cross-correlate x = phl_data, y = mlac_data:
# take FFTs:
phl_fft = np.fft.fft(phl_data)
mlac_fft = np.fft.fft(mlac_data)
# fft of cross-correlation
Cw = np.conj(phl_fft)*mlac_fft
#Cw = np.fft.fftshift(Cw)
# transform back to time space:
Cxy = np.fft.fftshift(np.fft.ifft(Cw))
dt = 1 # lag step in samples (assumed; dt is not defined in the original post)
times = np.append(np.arange(-N+1, 0, dt),np.arange(0, N, dt))
plt.plot(times, Cxy)
plt.xlim(-250, 250)
# test against convolving:
c = np.correlate(phl_data, mlac_data, mode='same')
plt.plot(times, c)
plt.show()
(both data sets have been padded with N-1 zeroes)

The documentation to numpy.correlate explains this:
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
and:
Notes
The definition of correlation above is not unique and sometimes correlation may be defined differently. Another common definition is:
c'_{av}[k] = sum_n a[n] conj(v[n+k])
which is related to c_{av}[k] by c'_{av}[k] = c_{av}[-k].
Thus, there is not a unique definition, and the two common definitions lead to a reversed output.
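The reversal can be checked numerically. The sketch below uses short random signals (an assumption for illustration, not the data from the question) and compares both FFT products against numpy.correlate: conjugating the first transform gives the mirrored result, while conjugating the second matches numpy.correlate directly.

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(8)
v = rng.rand(8)
N = a.size
L = 2 * N - 1  # pad so the circular correlation does not wrap around

A = np.fft.fft(a, n=L)  # n=L zero-pads the input to length L
V = np.fft.fft(v, n=L)

# conj(A) * V corresponds to sum_n a[n] v[n+k], the alternative definition c'[k]
c_prime = np.fft.fftshift(np.fft.ifft(np.conj(A) * V)).real
# A * conj(V) corresponds to sum_n a[n+k] v[n], the definition numpy.correlate uses
c = np.fft.fftshift(np.fft.ifft(A * np.conj(V))).real

reference = np.correlate(a, v, mode="full")
```

Here c agrees with reference, and c_prime agrees with reference[::-1], so swapping which transform is conjugated is enough to make the two plots line up.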

Using Mann Kendall in python with a lot of data

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test for each coordinate and not just for each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. Then I would repeat this process for every single coordinate in each array. In total the M-K test would be done 27840 times and leave me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it out and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this breaks the for loop, so it only keeps the results from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.
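The TypeError suggests that out is a plain Python list, which cannot be indexed as out[i, j]; this is an assumption, since the post does not show how out was created. A minimal sketch of one alternative is to preallocate a NumPy array with a trailing axis of length 2, so each coordinate holds both returned numbers. The function two_value_test is a hypothetical stand-in for the modified mk_test, and the grid is shrunk for illustration.

```python
import numpy as np

def two_value_test(series):
    """Hypothetical stand-in for the modified mk_test: returns two floats."""
    return float(series.mean()), float(series[-1] - series[0])

years, n_lat, n_lon = 46, 4, 5  # small grid for illustration; the post uses 145 x 192
yrmax = np.random.RandomState(0).rand(years, n_lat, n_lon)

out = np.empty((n_lat, n_lon, 2))  # an ndarray accepts out[i, j]; a list does not
for i in range(n_lat):
    for j in range(n_lon):
        out[i, j] = two_value_test(yrmax[:, i, j])

first_map = out[..., 0]   # first returned value at every coordinate
second_map = out[..., 1]  # second returned value at every coordinate
```

Each of first_map and second_map then has the same (lat, lon) shape as one year of data and can be plotted directly.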

I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test
data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)
start = time()
for i in range(145):
    for j in range(192):
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro 2.7 GHz Intel Core I7 with 16 GB Ram so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.

Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on code Grr suggested using.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out,(27840,46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of coordinates on my map (i.e. it's just 145*192) and 46 is the number of years the data spans.
I then used the following loop, modified from Grr's code, to find Kendall's tau and its p-value at each latitude and longitude over the 46-year period.
# sc is assumed to be an alias for scipy (the import is not shown in the post)
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x,out2[j,:])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time, as shown: newdata = np.reshape(y,(145,192,2)), so the final array is in a suitable format to be used to create a global map of both tau and p-values.
Thanks everyone for the assistance!
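For readers looking for the "easier way" to flatten that the post alludes to, here is a small sketch: moving the year axis to the end and reshaping appears to give the same (coordinates, years) table as the append loop. The grid below is shrunk for illustration, and the variable names looped and reshaped are not from the post.

```python
import numpy as np

yrmax = np.random.RandomState(0).rand(46, 3, 4)  # (years, lat, lon), small grid
n_years, n_lat, n_lon = yrmax.shape

# Loop version, as in the post: one time series per coordinate, row by row
rows = [yrmax[:, i, j] for i in range(n_lat) for j in range(n_lon)]
looped = np.array(rows)

# Same layout without the loop: put the year axis last, then merge lat and lon
reshaped = yrmax.transpose(1, 2, 0).reshape(n_lat * n_lon, n_years)
```

Row i * n_lon + j of reshaped holds the time series for coordinate (i, j), matching the order produced by the nested loops.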

Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
SIZE = (145,192)
year_matrices = load_years() # list of one 145x192 arrays per year
result_matrix = numpy.zeros(SIZE)
for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = map(lambda d: d[x][y], year_matrices)
        result_matrix[x][y] = analyze_trend(coord_trend)
print result_matrix
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" might work with data like yours (as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4
coord_arrays = zip(*year_arrays) # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')
flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap, then the copies needn't occur: itertools wraps the original arrays and keeps track internally of where it should be fetching values from.
There's a catch, though: to take advantage of itertools you have to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just show flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.
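The quoted remark about transposes not copying can be illustrated with a short NumPy sketch. The fake yearly matrices and variable names below are assumptions for illustration; the point is that rearranging axes yields a view on the stacked data, so pulling out one coordinate's trend needs no manual copying.

```python
import numpy as np

year_matrices = [np.full((3, 4), float(y)) for y in range(5)]  # 5 fake years
stacked = np.stack(year_matrices)        # shape (5, 3, 4); stacking itself copies
by_coord = np.moveaxis(stacked, 0, -1)   # shape (3, 4, 5); a view, not a copy

coord_trend = by_coord[1, 2]             # all years for one coordinate
shares = np.shares_memory(by_coord, stacked)
```

Because each fake year is filled with its own index, coord_trend simply counts 0.0 through 4.0, and shares confirms that by_coord reuses the memory of stacked.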

I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formulas are the same as on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is to work out the adjustment for the tied values. I modified the code as in this answer to compute the number of tied values for each record, in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm
def countTies(x):
    '''Count number of ties in rows of a 2D matrix
    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each
            row, the number of ties are inserted at (not really) arbitrary
            locations.
            The locations of tie numbers are not important, since
            they will be subsequently put into a formula of sum(t*(t-1)*(2t+5)).
    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")
    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')
    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)
    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))
    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1
    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()
    return result
def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns
    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each
            row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            corresponding to data in each column in <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            corresponding to data in each row in <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")
    # always put records in rows and do M-K test on each row
    if axis == 0:
        data = data.T
    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))
    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)
    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8 # avoid dividing 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p>0.5, 1-p, p)
    if tails==2:
        p=p*2
    return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])
# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)
# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)
import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.
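To build some confidence in the broadcasting step, the S statistic (the sum of sign(x_j - x_i) over all pairs i < j) can be cross-checked against a plain double loop. This is a sketch covering only the S statistic, not the tie correction or p-values; the function names mk_s_loop and mk_s_vectorized and the random test data are assumptions for illustration.

```python
import numpy as np

def mk_s_loop(series):
    """Mann-Kendall S statistic: sum of sign(x_j - x_i) over all pairs i < j."""
    s = 0
    for i in range(len(series) - 1):
        for j in range(i + 1, len(series)):
            s += int(np.sign(series[j] - series[i]))
    return s

def mk_s_vectorized(data):
    """Same statistic for every row of a 2D array, via broadcasting."""
    # diffs[row, i, j] = sign(x_j - x_i) for that row
    diffs = np.sign(data[:, None, :] - data[:, :, None])
    # strictly upper triangle keeps only the pairs with i < j
    upper = np.triu(np.ones(diffs.shape[1:], dtype=int), k=1)
    return (diffs * upper).sum(axis=(1, 2)).astype(int)

data = np.random.RandomState(1).rand(6, 20)
s_fast = mk_s_vectorized(data)
s_slow = np.array([mk_s_loop(row) for row in data])
```

The intermediate array has shape (rows, n, n), so memory grows with the square of the series length; for 27840 series of 50 points that is still manageable.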

Haar Transform matrix from Matlab to Python

I've ported code for the Haar transform matrix from Matlab to Python. It works when n is 2 or 4, but when I try n = 8 I get an error:
"Traceback (most recent call last):
File "python", line 20, in
ValueError: shape too large to be a matrix."
here's my code
import numpy as np
import math
n=8
# check input parameter and make sure it's the power of 2
Level1 = math.log(n, 2)
Level = int(Level1)+1
#Initialization
H = [1]
NC = 1 / math.sqrt(2) #normalization constant
LP = [1, 1]
HP = [1,-1]
for i in range(1,Level):
    H = np.dot(NC, [np.matrix(np.kron(H, LP)), np.matrix(np.kron(np.eye(len(H)), HP))])
print H

I'm assuming you got the definition of the Haar transform from the Wikipedia article or a similar source, so I'll try to stick to their notation.
The problem with your code is that the Wikipedia article uses a slight abuse of notation. In the equation defining H_2N in terms of H_N, two matrices are stacked on top of each other with brackets around them. Technically, this would be something like an array consisting of 2 arrays, but they mean it to be a single array where the top half of the values is equal to the one matrix and the bottom half is equal to the other matrix.
In your code, the array of two matrices is the following part:
[np.matrix(np.kron(H, LP)), np.matrix(np.kron(np.eye(len(H)), HP))]
You can make this into a single matrix as described above using the np.concatenate function as follows:
H = np.dot(NC, np.concatenate([np.matrix(np.kron(H, LP)), np.matrix(np.kron(np.eye(len(H)), HP))]))

Python cross correlation

I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the max cross correlation of the 2 series in python. In matlab, the xcorr() function returns it without problems.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but the values I get from python are different from what comes out of matlab. Python gives me integer values > 1, whereas matlab gives actual correlation values between 0 and 1.
I have tried normalizing the 2 arrays first ((value - mean)/SD), but the cross correlation values I get are in the thousands, which doesn't seem correct.
Matlab will also give you the lag value at which the cross correlation is greatest. I assume it is easy to do this using indices, but what's the most appropriate way of doing this if my arrays contain tens of thousands of values?
I would like to mimic the xcorr() function that matlab has. Any thoughts on how I would do that in python?

numpy.correlate(arr1,arr2,"full")
gave me same output as
xcorr(arr1,arr2)
gives in matlab
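To get values between 0 and 1 and the lag of the peak, one option is to scale the full correlation by sqrt(sum(x^2) * sum(y^2)), which bounds every entry by 1 in magnitude. As far as I know this is the same idea as the normalized option of matlab's xcorr, but treat that equivalence as an assumption; matlab also zero-pads the shorter input, so its lag range can differ. The function name xcorr_normalized is hypothetical.

```python
import numpy as np

def xcorr_normalized(x, y):
    """Full cross-correlation scaled by sqrt(sum(x^2) * sum(y^2)), plus lags."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.correlate(x, y, mode="full")
    corr = corr / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    lags = np.arange(-(len(y) - 1), len(x))  # lag belonging to each entry of corr
    return lags, corr

data1 = [0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
data2 = [0, 1, 1, 0, 1, 0, 0, 1]
lags, corr = xcorr_normalized(data1, data2)
best_lag = lags[np.argmax(corr)]  # lag at which the correlation peaks
```

For arrays with tens of thousands of values, scipy.signal.correlate with method="fft" is a likely faster drop-in for the np.correlate call, with the same scaling and argmax step afterwards.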

Implementation of MATLAB xcorr(x,y) and comparison of the result with an example.
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as signal
def xcorr(x,y):
    """
    Perform Cross-Correlation on x and y
    x : 1st signal
    y : 2nd signal
    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr
n = np.array([i for i in range(0,15)])
x = 0.84**n
y = np.roll(x,5)
lags,c = xcorr(x,y)
plt.figure()
plt.stem(lags,c)
plt.show()

This code will help in finding the delay between two channels in an audio file:
import numpy as np
import soundfile as sf # assumed: sf.read matches the soundfile package
xin, fs = sf.read('recording1.wav')
frame_len = int(fs*5*1e-3)
dim_x =xin.shape
M = dim_x[0] # No. of rows
N= dim_x[1] # No. of col
sample_lim = frame_len*100
tau = [0]
M_lim = 20000 # for testing as processing takes time
for i in range(1,N):
    c = np.correlate(xin[0:M_lim,0],xin[0:M_lim,i],"full")
    maxlags = M_lim-1
    c = c[M_lim -1 -maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos-M_lim+1
    tau.append(pos)
print(tau)
