pearson correlation using np.random.rand failing - python

I have the following code to calculate the correlation coefficient using two different ways to generate number series. It fails to work for the first way (corr_coeff_pearson) but works for the second way (corr_coeff_pearson_1). Why is this so? In both cases, the variables are of class 'numpy.ndarray'.
import numpy as np
np.random.seed(1000)
inp_vct_lngt = 5
X = 2*np.random.rand(inp_vct_lngt,1)
y=4+3*X+np.random.randn(inp_vct_lngt,1)
print(type(X))
corr_coeff_pearson=0
corr_coeff_pearson = np.corrcoef(X,y)
print("Pearson Correlation:")
print(corr_coeff_pearson)
X_1 = np.random.randint(0,50,5)
y_1 = X_1 + np.random.normal(0,10,5)
print(type(X_1))
corr_coeff_pearson_1 = np.corrcoef(X_1,y_1)
print("Pearson Correlation:")
print(corr_coeff_pearson_1)
Is there some way to "convert" the numbers in the first way of generating the series that I am missing?

The issue is that X and y are 2 dimensional:
>>> X
array([[1.9330627 ],
       [0.19204405],
       [0.21168505],
       [0.65018234],
       [0.83079548]])
>>> y
array([[8.60619212],
       [6.09210226],
       [5.33097283],
       [5.71649684],
       [5.18771916]])
So corrcoef assumes that "each row of x represents a variable, and each column a single observation of all those variables" (quoted from the docs).
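With the (5, 1) arrays above, that means X and y get stacked into ten one-observation "variables", so the result is a 10x10 matrix full of nan rather than a single coefficient. A quick check (using the seeded X and y from the question):

>>> np.corrcoef(X, y).shape
(10, 10)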
What you can do is either flatten the two to one dimension:
>>> np.corrcoef(X.flatten(),y.flatten())
array([[1.        , 0.84196446],
       [0.84196446, 1.        ]])
Or use rowvar=False:
>>> np.corrcoef(X,y,rowvar=False)
array([[1.        , 0.84196446],
       [0.84196446, 1.        ]])
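Either way, if you only need the single scalar coefficient rather than the full 2x2 matrix, you can index into the result, for example:

>>> np.corrcoef(X.flatten(), y.flatten())[0, 1]   # roughly 0.84196446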

Related

Min/max scaling with additional points

I'm trying to normalize an array within a range, e.g. [10, 100].
But I also want to manually specify additional points in my result array, for example:
num = [1,2,3,4,5,6,7,8]
num_expected = [min(num), 5, max(num)]
expected_range = [10, 20, 100]
result_array = normalize(num, num_expected, expected_range)
Intended results:
Values from 1-5 are normalized to range (10,20].
5 in num array is mapped to 20 in expected range.
Values from 6-8 are normalized to range (20,100].
I know I can do it by normalizing the array twice, but I might have many additional points to add. I was wondering if there's any built-in function in numpy or scipy to do this?
I've checked MinMaxScaler in sklearn, but did not find the functionality I want.
Thanks!
Linear interpolation will do exactly what you want:
import scipy.interpolate
interp = scipy.interpolate.interp1d(num_expected, expected_range)
Then just pass numbers or arrays of numbers that you want to interpolate:
In [20]: interp(range(1, 9))
Out[20]:
array([ 10.        ,  12.5       ,  15.        ,  17.5       ,
        20.        ,  46.66666667,  73.33333333, 100.        ])
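If you prefer to stay within numpy, np.interp performs the same piecewise-linear mapping for 1-D data. A small equivalent sketch, reusing num, num_expected and expected_range from the question (note that np.interp clips values outside the given range instead of raising an error):

import numpy as np
result_array = np.interp(num, num_expected, expected_range)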

numpy: efficiently obtain a statistic over array elements grouped by the elements of another array

Apologies in advance for the potentially misleading title. I could not think of a way to properly word the problem without an illustrative example.
I have some data array (e.g.):
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
and a corresponding array of equal length which indicates which elements of x are grouped:
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
In this example, there are two groupings in x: [2,2,2,3,3,3,4,4,4] where y=0, and [1,1,2,2,3,3] where y=1. I want to obtain a statistic on all elements of x where y is 0, then 1, and I would like this to be extendable to large arrays with many groupings. y is always ordered from lowest to highest AND is always sequentially increasing without any missing integers between the min and max. For example, y could be np.array([0,0,1,2,2,2,2,3,3,3]) for some x array of the same length, but not y = np.array([0,0,2,2,2,2,2,3,3,3]), as the latter skips 1.
I can do this by brute force quite easily for this example.
import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
y_max = np.max(y)
stat_min = np.zeros(y_max+1)
stat_sum = np.zeros(y_max+1)
for i in np.arange(y_max+1):
    stat_min[i] = np.min(x[y==i])
    stat_sum[i] = np.sum(x[y==i])
print(stat_min)
print(stat_sum)
Gives: [2. 1.] and [27. 12.] for the minimum and sum statistics for each grouping, respectively. I need a way to make this efficient for large numbers of groupings and where the arrays are very large (> 1 million elements).
EDIT
A bit better with list comprehension.
import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
y_max = np.max(y)
stat_min = np.array([np.min(x[y==i]) for i in range(y_max+1)])
stat_sum = np.array([np.sum(x[y==i]) for i in range(y_max+1)])
print(stat_min)
print(stat_sum)
You'd put your arrays into a DataFrame, then use groupby and its various methods: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
import pandas as pd
df = pd.DataFrame({'x': x, 'y': y})
mins = df.groupby('y').min()
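For example, to get both the minimum and the sum per group in one go, a short sketch along the same lines:

import numpy as np
import pandas as pd

x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])

stats = pd.DataFrame({'x': x, 'y': y}).groupby('y')['x'].agg(['min', 'sum'])
print(stats)   # min: 2 and 1, sum: 27 and 12 for groups 0 and 1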

optimize function that reads and mirrors a half numpy matrix

I have a text file that holds some values of a matrix, but only half of them, like this:
1. 1. 0.01
2. 1. 0.052145
2. 2. 0.045
3. 1. 0.054521
3. 2. 0.05424
3. 3. 0.05459898
The first two columns refer to the matrix (x, y) position, and the last one is the value it holds. The positions are actually 1-based, so the array indices are value - 1.
I made a function that reads the file and mirrors these values to a full matrix:
import numpy as np

def expand_mirror_matrix(matrix_path='data.txt'):
    data = np.loadtxt(matrix_path)
    shape = (int(data[-1][0]), int(data[-1][1]))
    m = np.zeros(shape)
    for d in data:
        x, y, z = int(d[0]), int(d[1]), d[2]
        m[x-1, y-1] = z
        m[shape[0]-x, shape[1]-y] = z
    return m
But it does some unnecessary work, like the first and the last iterations, and the iteration that sets the center of the matrix. Is there a way of optimizing it? The file actually has thousands of lines, so it would be great to cut down the loop's execution time.
I believe this does what you want, at least without the mirroring:
def expand_mirror_matrix(matrix_path='data.txt'):
    data = np.loadtxt(matrix_path)
    shape = (int(data[-1][0]), int(data[-1][1]))
    xs = data[:,0].astype(int) - 1  # NumPy uses zero-based indexing.
    ys = data[:,1].astype(int) - 1
    m = np.zeros(shape)
    m[(xs, ys)] = data[:,2]
    return m
For your example file above this returns:
array([[0.01      , 0.        , 0.        ],
       [0.052145  , 0.045     , 0.        ],
       [0.054521  , 0.05424   , 0.05459898]])
If you wish to mirror it you probably want to edit the above function with the following:
m[(xs, ys)] = data[:,2]
m[(ys, xs)] = data[:,2] # Mirrored.
The result of that is:
array([[0.01      , 0.052145  , 0.054521  ],
       [0.052145  , 0.045     , 0.05424   ],
       [0.054521  , 0.05424   , 0.05459898]])
Note that this assumes the matrix is square.
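Putting both pieces together, a combined version could look like the sketch below (still assuming a square matrix and 1-based indices in the file):

import numpy as np

def expand_mirror_matrix(matrix_path='data.txt'):
    data = np.loadtxt(matrix_path)
    n = int(data[-1][0])             # the last row holds the largest index
    xs = data[:, 0].astype(int) - 1  # convert 1-based file positions to 0-based indices
    ys = data[:, 1].astype(int) - 1
    m = np.zeros((n, n))
    m[xs, ys] = data[:, 2]
    m[ys, xs] = data[:, 2]           # mirror across the diagonal
    return m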

How to solve real life difference equations using python

I want to solve a difference equation using python.
y(n) = x(n-1) - 0.5*(x(n-2) + x(n))
x here is a long array of values. I want to plot y with respect to another time sequence array t using Plotly. I can plot x with t, but I am not able to generate the filtered signal y. I have tried the code below, but it seems I'm missing something. I am not getting the desired output.
from scipy import signal
from plotly.offline import plot, iplot
x = array(...)
t = array(...) # x and t are big arrays
b = [-0.5, 1, -0.5]
a = 0
y = signal.lfilter(b, a, x, axis=-1, zi=None)
iplot([{"x": t, "y": y}])
However, the output is something like this.
>>>y
>>> array([-inf, ..., nan])
Therefore, I am getting a blank graph.
UPDATE with examples of x and t (9 values each):
x = [3.1137561664814495,
-1.4589810840917137,
-0.12631870857936914,
-1.2695030212226599,
2.7600637824592158,
-1.7810937909691049,
0.050527483431747656,
0.27158522344564368,
0.48001109260160274]
t = [0.0035589523041146265,
0.011991765409288035,
0.020505576424579175,
0.028935389041247817,
0.037447199517441021,
0.045880011487565042,
0.054462819797731044,
0.062835632533346342,
0.071347441874490158]
It appears that your problem is defining a = 0. When running your example, you get the following warning from SciPy:
/usr/local/lib/python2.7/site-packages/scipy/signal/signaltools.py:1353: RuntimeWarning:
divide by zero encountered in true_divide
[-inf inf nan nan nan inf -inf nan nan]
This division by zero comes from the value of a. If you look at the documentation of scipy.signal.lfilter, it points out the following:
a : array_like
The denominator coefficient vector in a 1-D sequence. If a[0] is not 1, then both a and b are normalized by a[0].
If you change a = 0 to a = 1 you should get the output you desire, although do consider that you might want to normalize the data by a different factor.
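For reference, the corrected call on the sample data would then look something like this (a sketch reusing the x array from the question):

from scipy import signal

b = [-0.5, 1, -0.5]          # y[n] = -0.5*x[n] + x[n-1] - 0.5*x[n-2]
a = 1                        # FIR filter: the denominator is just 1
y = signal.lfilter(b, a, x)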

Correlate a single time series with a large number of time series

I have a large number (M) of time series, each with N time points, stored in an MxN matrix. Then I also have a separate time series with N time points that I would like to correlate with all the time series in the matrix.
An easy solution is to go through the matrix row by row and run numpy.corrcoef. However, I was wondering if there is a faster or more concise way to do this?
Let's use the Pearson correlation formula:

corr(A, B) = sum((A - mean(A)) * (B - mean(B))) / sqrt(sum((A - mean(A))**2) * sum((B - mean(B))**2))
You can implement this with X as the M x N array and Y as the other, separate time series of N elements to be correlated with X. Referring to X and Y as A and B respectively, a vectorized implementation would look something like this -
import numpy as np
# Row-wise mean of input arrays & subtract from input arrays themselves
A_mA = A - A.mean(1)[:,None]
B_mB = B - B.mean()
# Sum of squares across rows
ssA = (A_mA**2).sum(1)
ssB = (B_mB**2).sum()
# Finally get corr coeff
out = np.dot(A_mA,B_mB.T).ravel()/np.sqrt(ssA*ssB)
# OR out = np.einsum('ij,j->i',A_mA,B_mB)/np.sqrt(ssA*ssB)
Verify results -
In [115]: A
Out[115]:
array([[ 0.1001229 ,  0.77201334,  0.19108671,  0.83574124],
       [ 0.23873773,  0.14254842,  0.1878178 ,  0.32542199],
       [ 0.62674274,  0.42252403,  0.52145288,  0.75656695],
       [ 0.24917321,  0.73416177,  0.40779406,  0.58225605],
       [ 0.91376553,  0.37977182,  0.38417424,  0.16035635]])
In [116]: B
Out[116]: array([ 0.18675642, 0.3073746 , 0.32381341, 0.01424491])
In [117]: out
Out[117]: array([-0.39788555, -0.95916359, -0.93824771, 0.02198139, 0.23052277])
In [118]: np.corrcoef(A[0],B), np.corrcoef(A[1],B), np.corrcoef(A[2],B)
Out[118]:
(array([[ 1.        , -0.39788555],
        [-0.39788555,  1.        ]]),
 array([[ 1.        , -0.95916359],
        [-0.95916359,  1.        ]]),
 array([[ 1.        , -0.93824771],
        [-0.93824771,  1.        ]]))
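If you need this regularly, the same steps can be wrapped into a small helper (a sketch based on the code above, not a library function):

import numpy as np

def corr_rows_with_series(A, B):
    # Pearson correlation of each row of A (M x N) with the 1-D series B (length N).
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean()
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum()
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

out = corr_rows_with_series(A, B)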
