I use the plt.xcorr() function to plot the cross-correlation between two time series:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
df = pd.DataFrame({'x': [0, 0, -2, -0.5, 0, 1, 1, 2, 0, -0.5, 0, 1],
                   'y': [-0.77, -0.22, -0.58, 0.34, 0.08, 0.17, 0.13, 0.11, -0.76, -0.57, -0.24, 0.17]})
z = plt.xcorr(df['x'], df['y'], maxlags=5, detrend=mlab.detrend_mean)
z
Output:
(array([-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]),
 array([-0.00951067, -0.13596502, -0.47753569, -0.4611423 , -0.4055174 ,
         0.52296164,  0.40426599,  0.49317823,  0.03960693,  0.13358735,
        -0.29820952]))
But when I compared the result with the output of the np.corrcoef() function, I was confused. For lag 0, the values are identical:
np.corrcoef(df['x'], df['y'])[0,1]
0.52296164
But for lag 1+ the values are different:
np.corrcoef(df['x'].shift(1).dropna(),
            df['y'].loc[df['x'].shift(1).dropna().index[0]:])[0,1]
-0.46265864
What is the reason for this difference?
Which result is more correct?
Is it possible to make the result of plt.xcorr() function equal to np.corrcoef()?
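For reference, here is a sketch (my addition, based on matplotlib's documented behaviour, not part of the original question) of roughly what plt.xcorr computes with normed=True: both series are detrended, the full cross-correlation is taken, and every lag is divided by the same zero-lag norm of the complete series. np.corrcoef on shifted, truncated series instead re-estimates the means and variances on each overlapping segment, which is why the two agree only at lag 0, as far as I can tell.
# Sketch of plt.xcorr's internal computation (assumes normed=True, mean detrending)
x = np.asarray(df['x'], dtype=float)
y = np.asarray(df['y'], dtype=float)
x = x - x.mean()                                # detrend_mean
y = y - y.mean()
c = np.correlate(x, y, mode='full')
c = c / np.sqrt(np.dot(x, x) * np.dot(y, y))    # single normalisation shared by all lags
maxlags = 5
lags = np.arange(-maxlags, maxlags + 1)
c = c[len(x) - 1 - maxlags:len(x) + maxlags]    # should reproduce the z values above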
I am solving a problem in which I am using two large matrices A and B. Matrix A is formed by ones and zeros and matrix B is formed by integers in the range [0,...,10].
At some point, I have to update the components of A as follows: if a component is 1, it stays 1; if it is 0, it stays 0 with probability p or changes to 1 with probability 1-p. The probability p depends on the corresponding component of B, i.e. I have a list probabilities such that, when updating A[i,j], p equals probabilities[B[i,j]].
I can do the updating with the following code:
import numpy as np

for i in range(n):
    for j in range(n):
        if A[i, j] == 0:
            # a zero stays 0 with probability p, flips to 1 with probability 1-p
            A[i, j] = np.random.choice([0, 1], p=[probabilities[B[i, j]], 1 - probabilities[B[i, j]]])
I think that there should be a faster way to update matrix A using slice notation. Any advice?
Note that this problem is equivalent to the following: given a vector a and a matrix B with non-negative integer entries, obtain a matrix C in which C[i,j] = a[B[i,j]].
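In NumPy that lookup is plain fancy indexing; a minimal sketch (assuming probabilities is converted to an array):
a = np.asarray(probabilities)   # lookup table indexed by the values in B
C = a[B]                        # C[i, j] == a[B[i, j]] for every i, j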
The general answer
You can achieve this by generating a random value from a continuous distribution and then comparing the generated value with the inverse cumulative distribution function of that distribution.
To be practical
Let's use a uniform random variable X in the range [0, 1); then the probability that X < a is exactly a. Assume that A and B have the same shape.
You can use numpy.where for this (make sure that your probabilities variable is a NumPy array):
A[:,:] = np.where(A == 0, np.where(np.random.rand(*A.shape) < probabilities[B], 0, 1), A)  # stay 0 with probability p, flip to 1 otherwise
If you want to avoid computing the random values for positions where A is non-zero, the indexing becomes a bit more involved:
A[A == 0] = np.where(np.random.rand(*A[A == 0].shape) < probabilities[B[A == 0]], 0, 1)
Make many p values in the 0-1 range:
In [36]: p=np.arange(1000)/1000
Your np.random.choice approach, wrapped in sum to get an overall statistic:
In [37]: sum([np.random.choice([0,1],p=[p[i],1-p[i]]) for i in range(p.shape[0])])
Out[37]: 485
and a statistically similar random array without the comprehension:
In [38]: (np.random.rand(p.shape[0])>p).astype(int).sum()
Out[38]: 496
You are welcome to perform other tests to verify their equivalence.
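One such test (a quick sketch of mine, not from the original answer) is to repeat both computations a number of times and compare the averages; both should land near p.size - p.sum(), i.e. about 500.5 here:
trials = 50
choice_mean = np.mean([
    sum(np.random.choice([0, 1], p=[p[i], 1 - p[i]]) for i in range(p.shape[0]))
    for _ in range(trials)
])
rand_mean = np.mean([
    (np.random.rand(p.shape[0]) > p).astype(int).sum()
    for _ in range(trials)
])
print(choice_mean, rand_mean)   # both should be close to 500.5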
If probabilities is a function that takes the whole B or B[mask], we should be able to do:
mask = A==0
n = mask.ravel().sum() # the number of true elements
A[mask] = (np.random.rand(n)>probabilities(B[mask])).astype(int)
To test this:
In [39]: A = np.random.randint(0,2, (4,5))
In [40]: A
Out[40]:
array([[1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 1, 1, 0]])
In [41]: mask = A==0
In [42]: A[mask]
Out[42]: array([0, 0, 0, 0, 0, 0])
In [43]: B = np.arange(1,21).reshape(4,5)
In [44]: def foo(B):  # scale the B values to probabilities
...: return B/20
...:
In [45]: foo(B)
Out[45]:
array([[0.05, 0.1 , 0.15, 0.2 , 0.25],
       [0.3 , 0.35, 0.4 , 0.45, 0.5 ],
       [0.55, 0.6 , 0.65, 0.7 , 0.75],
       [0.8 , 0.85, 0.9 , 0.95, 1.  ]])
In [46]: n = mask.ravel().sum()
In [47]: n
Out[47]: 6
In [51]: (np.random.rand(n)>foo(B[mask])).astype(int)
Out[51]: array([1, 1, 0, 1, 1, 0])
In [52]: A[mask] = (np.random.rand(n)>foo(B[mask])).astype(int)
In [53]: A
Out[53]:
array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 1, 1, 0]])
In [54]: foo(B[mask])
Out[54]: array([0.15, 0.2 , 0.25, 0.65, 0.7 , 1. ])
Let's say I have a binary vector representing two phases:
signal = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1])
I would like to compute, for each value of this vector, its "position" relative to its chunk, expressed, for instance, as a fraction, as follows:
desired output:
[0, 0.33, 0.66, 0.99,
0, 0.5, 1,
0, 1,
0, 0.33, 0.66, 0.99]
I wonder what the most efficient or Pythonic way of obtaining that is. One way would be to loop back and forth, compute the length of each "phase", and divide the index accordingly, but that seems quite convoluted.
Thanks a lot :)
There's nothing un-Pythonic about writing loops, but if you absolutely must do everything with comprehensions and itertools, here's one way:
import numpy as np
from itertools import chain, groupby
signal = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1])
result = list(chain.from_iterable(
    np.linspace(0, 1, sum(1 for _ in v))
    for _, v in groupby(signal)
))
Result (do your own rounding if really necessary):
[0.0, 0.3333333333333333, 0.6666666666666666, 1.0,
0.0, 0.5, 1.0,
0.0, 1.0,
0.0, 0.3333333333333333, 0.6666666666666666, 1.0]
Explanation:
groupby(signal) groups the contiguous sequences of 0s or 1s,
sum(1 for _ in v) gets the length of the current sequence,
np.linspace(0, 1, ...) creates an array of that length containing evenly-spaced numbers from 0 to 1,
list(chain.from_iterable(...)) concatenates those arrays together, into a list.
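If you want to stay entirely within NumPy, a vectorised alternative based on run lengths (a sketch of mine, not from the original answer) could look like this; it produces the same values as the itertools version above, with runs of length 1 mapped to 0:
import numpy as np

signal = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1])

starts = np.r_[True, signal[1:] != signal[:-1]]      # True where a new run begins
run_ids = np.cumsum(starts) - 1                      # run index of every element
run_lengths = np.bincount(run_ids)                   # length of every run
run_starts = np.cumsum(run_lengths) - run_lengths    # start index of every run
pos_in_run = np.arange(signal.size) - run_starts[run_ids]
result = pos_in_run / np.maximum(run_lengths[run_ids] - 1, 1)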
I have a multidimensional array with a shape of (2, 2, 3) like this:
array([[[  0.64,   0.49,   2.56],
        [  7.84,  13.69,  21.16]],

       [[ 33.64,  44.89,  57.76],
        [ 77.44,  94.09, 112.36]]])
I would like to find the indices of the minimum in each row. So for this example there are 4 minima, which are 0.49, 7.84, 33.64 and 77.44.
To get the indices of those minima I thought this would work:
idx_arr = np.unravel_index(np.argmin(my_array,axis=2),my_array.shape)
This yields the following tuple of index arrays:
(array([[0, 0],
        [0, 0]]),
 array([[0, 0],
        [0, 0]]),
 array([[1, 0],
        [0, 0]]))
However, the minima are not correctly retrieved, as one can see:
my_array[idx_arr]
array([[0.49, 0.64],
       [0.64, 0.64]])
What am I missing there ?
argmin is actually computing the indices correctly, but you misunderstand what np.unravel_index is expecting.
From docs:
Converts a flat index or array of flat indices into a tuple of
coordinate arrays.
To see what kind of input would give the desired output here, we need to focus on the main point: it converts flat indices into the corresponding coordinate arrays for the non-flat shape. Essentially, what it expects are the indices of your desired points as if your input array were flattened.
import numpy as np
inp = np.array([[[  0.64,   0.49,   2.56],
                 [  7.84,  13.69,  21.16]],
                [[ 33.64,  44.89,  57.76],
                 [ 77.44,  94.09, 112.36]]])
idx = inp.argmin(axis=-1)
#Output:
array([[1, 0],
       [0, 0]], dtype=int64)
Note that you cannot pass this idx directly, because it does not represent coordinates in the flattened version of the inp array.
That would look more like the following:
flat_idx = np.arange(0, idx.size*inp.shape[-1], inp.shape[-1]) + idx.flatten()
#Output:
array([1, 3, 6, 9], dtype=int64)
And we can see unravel_index accepts it happily.
temp = np.unravel_index(flat_idx, inp.shape)
#Output:
(array([0, 0, 1, 1], dtype=int64),
 array([0, 1, 0, 1], dtype=int64),
 array([1, 0, 0, 0], dtype=int64))
inp[temp]
Output:
array([ 0.49, 7.84, 33.64, 77.44])
Also, taking a look at the output tuple, we can see that it is not too difficult to recreate it ourselves. Notice that the last array corresponds to a flattened form of idx, while the first two arrays simply index through the first two axes of inp.
And to prepare that, we can actually use the unravel_index function in a rather nifty way, as follows:
real_idx = (*np.unravel_index(np.arange(idx.size), idx.shape), idx.flatten())
inp[real_idx]
#Output:
array([ 0.49, 7.84, 33.64, 77.44])
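As a side note (my addition, not part of the original answer), if you only need the minima themselves rather than the full coordinate tuple, np.take_along_axis looks them up directly from idx:
mins = np.take_along_axis(inp, idx[..., None], axis=-1).squeeze(-1)
#Output:
array([[ 0.49,  7.84],
       [33.64, 77.44]])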
In my Python script I have floats that I want to bin. Right now I'm doing:
import numpy

min_val = 0.0
max_val = 1.0
num_bins = 20
my_bins = numpy.linspace(min_val, max_val, num_bins)
hist, my_bins = numpy.histogram(myValues, bins=my_bins)  # myValues holds the floats to bin
But now I want to add two more bins to account for values that are < 0.0 and for those that are >= 1.0. One bin should thus include all values in (-inf, 0), the other all values in [1, inf).
Is there any straightforward way to do this while still using numpy's histogram function?
The function numpy.histogram() happily accepts infinite values in the bins argument:
numpy.histogram(my_values, bins=numpy.r_[-numpy.inf, my_bins, numpy.inf])
Alternatively, you could use a combination of numpy.searchsorted() and numpy.bincount(), though I don't see much advantage to that approach.
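For completeness, that alternative could look roughly like this (my sketch; the bin-edge conventions match numpy.histogram here because the outermost edges are infinite):
bin_edges = numpy.r_[-numpy.inf, my_bins, numpy.inf]
idx = numpy.searchsorted(bin_edges, my_values, side='right') - 1   # bin index of each value
hist = numpy.bincount(idx, minlength=len(bin_edges) - 1)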
You can specify numpy.inf as the upper and -numpy.inf as the lower bin limits.
Since NumPy version 1.16 you have histogram_bin_edges. With this, today's solution is to call histogram_bin_edges to get the bins, concatenate -inf and +inf, and pass the result as bins to histogram:
a=[1,2,3,4,2,3,4,7,4,6,7,5,4,3,2,3]
np.histogram(a, bins=np.concatenate(([np.NINF], np.histogram_bin_edges(a), [np.PINF])))
Results in:
(array([0, 1, 3, 0, 4, 0, 4, 1, 0, 1, 0, 2]),
array([-inf, 1. , 1.6, 2.2, 2.8, 3.4, 4. , 4.6, 5.2, 5.8, 6.4, 7. , inf]))
If you prefer to have the last bin empty (as I do), you can use the range parameter and add a small number to the max:
a=[1,2,3,4,2,3,4,7,4,6,7,5,4,3,2,3]
np.histogram(a, bins=np.concatenate(([np.NINF], np.histogram_bin_edges(a, range=(np.min(a), np.max(a)+.1)), [np.PINF])))
Results in:
(array([0, 1, 3, 0, 4, 4, 0, 1, 0, 1, 2, 0]),
array([-inf, 1. , 1.61, 2.22, 2.83, 3.44, 4.05, 4.66, 5.27, 5.88, 6.49, 7.1 , inf]))