Error calculating entropy over pandas series - python

I'm trying to calculate entropy over a pandas series. Specifically, I group the strings in Direction into runs of consecutive identical values using this line:
diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
which labels each run of identical Direction strings, incrementing the label when the value changes. For each run of the same Direction string, I want to calculate the entropy of X and Y.
With that code, the run labels are:
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
This code used to work but it's now returning an error. I'm not sure if this was after an upgrade.
import pandas as pd
import numpy as np

def ApEn(U, m = 2, r = 0.2):
    '''
    Approximate Entropy
    Quantify the amount of regularity over time-series data.
    Input parameters:
        U = Time series
        m = Length of compared run of data (subseries length)
        r = Filtering level (tolerance). A positive number
    '''
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])
    def _phi(m):
        x = [U.tolist()[i:i + m] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))
    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

def Entropy(df):
    '''
    Calculate entropy for individual direction
    '''
    df = df[['Time','Direction','X','Y']]
    diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
    # Calculate ApEn grouped by direction.
    df['ApEn_X'] = df.groupby(diff_dir)['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby(diff_dir)['Y'].transform(ApEn)
    return df

df = pd.DataFrame(np.random.randint(0,50, size = (10, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Left','Left','Left','Left','Right','Right','Right','Left','Left']
df['Direction'] = direction
# Calculate defensive regularity
entropy = Entropy(df)
Error:
return (N - m + 1.0)**(-1) * sum(np.log(C))
ZeroDivisionError: 0.0 cannot be raised to a negative power

The issue comes from this expression:
(N - m + 1.0)**(-1)
Consider the situation when N == 1. Since N = len(U), this happens whenever a group produced by the groupby has size 1. With m == 2 this becomes
(1 - 2 + 1)**(-1) == 0**(-1)
and 0**(-1) is undefined, hence the error. (The same thing happens inside _phi(m + 1) for groups of size 2, since then N - (m + 1) + 1 == 0.)
Now, if we look at it theoretically, how do you define the approximate entropy of a time series with just one value? It is highly unpredictable, so it should be as high as possible. For this case let us set it to np.nan to denote that it is not defined (entropy is always greater than or equal to 0).
Code:
import pandas as pd
import numpy as np

def ApEn(U, m = 2, r = 0.2):
    '''
    Approximate Entropy
    Quantify the amount of regularity over time-series data.
    Input parameters:
        U = Time series
        m = Length of compared run of data (subseries length)
        r = Filtering level (tolerance). A positive number
    '''
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])
    def _phi(m):
        if (N - m + 1) == 0:
            return np.nan
        x = [U.tolist()[i:i + m] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1)**(-1) * sum(np.log(C))
    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

def Entropy(df):
    '''
    Calculate entropy for individual direction
    '''
    df = df[['Time','Direction','X','Y']]
    diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
    # Calculate ApEn grouped by direction.
    df['ApEn_X'] = df.groupby(diff_dir)['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby(diff_dir)['Y'].transform(ApEn)
    return df

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,50, size = (10, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Left','Left','Left','Left','Right','Right','Right','Left','Left']
df['Direction'] = direction
# Calculate defensive regularity
print (Entropy(df))
Output:
Time Direction X Y ApEn_X ApEn_Y
0 1 Left 6 16 0.287682 0.287682
1 2 Left 22 6 0.287682 0.287682
2 3 Left 16 5 0.287682 0.287682
3 4 Left 5 48 0.287682 0.287682
4 5 Left 11 21 0.287682 0.287682
5 6 Right 44 25 0.693147 0.693147
6 7 Right 14 12 0.693147 0.693147
7 8 Right 43 40 0.693147 0.693147
8 9 Left 46 44 NaN NaN
9 10 Left 49 2 NaN NaN
Larger sample (which results in 0**-1 issue)
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,50, size = (100, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Right','Up','Down']
df['Direction'] = np.random.choice((direction), len(df))
print (Entropy(df))
Output:
Time Direction X Y ApEn_X ApEn_Y
0 1 Left 44 47 NaN NaN
1 2 Left 0 3 NaN NaN
2 3 Down 3 39 NaN NaN
3 4 Right 9 19 NaN NaN
4 5 Up 21 36 NaN NaN
.. ... ... .. .. ... ...
95 96 Up 19 33 NaN NaN
96 97 Left 40 32 NaN NaN
97 98 Up 36 6 NaN NaN
98 99 Left 21 31 NaN NaN
99 100 Right 13 7 NaN NaN

It appears that when the ApEn._phi() function is invoked, the specific values of N and m can make (N - m + 1.0) equal to 0, which then has to be raised to the power of -1; that is undefined (see also "Why does zero raised to the power of negative one equal infinity?").
To illustrate, I tried to replicate your scenario, and in the first iteration of the transform operation this is what happens:
U is: 1     0
      2    48
(the first groupby group has 2 elements)
N is: 2
m is: 3
So when you get to the return value of _phi(), you are computing (N - m + 1.0)**-1 = (2 - 3 + 1)**-1 = 0**-1, which is undefined. Perhaps the key here is that you say you are grouping by individual direction and passing each group's values as U into the approximate entropy function, yet you are actually grouping by the run counter diff_dir, which produces very small groups by the nature of that method. As far as I understand, if you want to calculate the approximate entropy per direction, you simply need to group by 'Direction':
def Entropy(df):
    '''
    Calculate entropy for individual direction
    '''
    # Calculate ApEn grouped by direction.
    df['ApEn_X'] = df.groupby('Direction')['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby('Direction')['Y'].transform(ApEn)
    return df
This results in a dataframe like this:
entropy.head()
Time Direction X Y ApEn_X ApEn_Y
0 1 Left 28 47 0.035091 0.035091
1 2 Up 8 47 0.013493 0.046520
2 3 Up 0 32 0.013493 0.046520
3 4 Right 34 8 0.044452 0.044452
4 5 Right 49 27 0.044452 0.044452

You have to handle the zero-division cases yourself. Maybe this way:
def _phi(m):
    if N == m - 1:
        return 0
    ...
You will then still run into length mismatches on the groupby: df and the grouping series have to be the same length (see the sketch below).
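For completeness, here is a minimal sketch (my own addition, not the original poster's code) of how such a guard can sit inside the entropy helper itself, so degenerate groups simply yield np.nan instead of raising:

import numpy as np

def ApEn_safe(U, m=2, r=0.2):
    # Approximate entropy that returns np.nan for groups too short to evaluate.
    U = np.asarray(U, dtype=float)
    N = len(U)
    def _maxdist(x_i, x_j):
        return max(abs(ua - va) for ua, va in zip(x_i, x_j))
    def _phi(m):
        n_windows = N - m + 1
        if n_windows <= 0:   # guard: group too small for this window length
            return np.nan
        x = [U[i:i + m] for i in range(n_windows)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / float(n_windows) for x_i in x]
        return sum(np.log(C)) / float(n_windows)
    return abs(_phi(m + 1) - _phi(m))

Because diff_dir is built from the same frame, it already has the same length as df, so df.groupby(diff_dir)['X'].transform(ApEn_safe) keeps one value per row.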

Related

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1
for i in range(len(df)):
    if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding calculations; see pandas-rolling-apply-using-multiple-columns for reference.
I have also included a simpler calculation to demonstrate the behaviour in a simpler way.
df = pd.DataFrame(index=range(5000))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1

import numpy_ext as npe

# for i in range(len(df)):
#     if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25

# SO example - function of previous values in A and B
def f(A, B):
    r = np.sum(A[:-1]/3) - np.sum(B[:-1] + 25) if len(A) > 1 else A[0]
    return r

# much simpler example, sum of previous values
def g(A):
    return np.sum(A[:-1])

df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
sample output
|    |   A |   B |   AB_combo |   A_running |
|---:|----:|----:|-----------:|------------:|
|  0 |   1 |   5 |    1       |           0 |
|  1 |   2 |   5 |  -29.6667  |           1 |
|  2 |   2 |   5 |  -59       |           3 |
|  3 |   2 |   5 |  -88.3333  |           5 |
|  4 |   2 |   5 | -117.667   |           7 |
|  5 |   2 |   5 | -147       |           9 |
|  6 |   2 |   5 | -176.333   |          11 |
|  7 |   2 |   5 | -205.667   |          13 |
|  8 |   2 |   5 | -235       |          15 |
|  9 |   2 |   5 | -264.333   |          17 |

Index and save last N points from a list that meets conditions from dataframe Python

I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;

for i = 1:length(valveW_c)-1
    if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
        test = (i-numPoints):i;
        ind_noaaHigh_end(test) = 1;
        n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
        co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
        co_noaaHigh = [co_noaaHigh mean(co_c(test))];
        h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
    end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
t_c_High = []; # time

for i in range(len(valveW_c)):
    # NOAA HIGH
    if (valveW_c[i] == 1):
        t_c_High.append(t_c[i])
        n2o_noaaHigh.append(n2o_c[i])
        co2_noaaHigh.append(co2_c[i])
        co_noaaHigh.append(co_c[i])
        h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use groupby method to get the average for the last n points of each cycle:
n = 3 #we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle
1    246.9768
2    246.6579
3    246.7269
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
    results[k] = v
    print(f'[group {k}]')
    print(v)
shift(), as its name suggests, shifts the valve column so that changes in the number sequence can be detected. cumsum() then assigns a unique number to each group with the same run of values. We can then do a groupby() on this column (which was not possible before, because the groups were simply all the ones or all the twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then, to get the mean for each cycle, you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
    if len(v) < n:
        mean_n_last.append(np.nan)
    else:
        mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3 !
If your dataframe is sorted by time you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted you could easily sort it like this.
df.sort_values(by=['time'])
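Building on the cycle idea from the other answers, here is a hedged sketch that produces one averaged value per cycle from the last N rows of that cycle (the data are the rows from the question; N is arbitrary):

import pandas as pd

# toy data in the shape described in the question
df = pd.DataFrame({
    'gas1':   [246.9438, 247.5367, 246.7167, 246.6770, 245.9197,
               245.9518, 246.9207, 246.1517, 246.9015, 246.3712, 247.0826],
    'valveW': [2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2],
    'time':   range(1, 12),
})

df = df.sort_values(by=['time'])
df['cycle'] = (df['valveW'].diff() != 0).cumsum()   # label consecutive valve runs

N = 3                                               # trailing points per cycle to average
last_n = df.groupby('cycle').tail(N)                # keep only the last N rows of each cycle
cycle_means = last_n.groupby('cycle')['gas1'].mean()
print(cycle_means)

You can then split cycle_means by valve, e.g. with last_n.groupby(['valveW', 'cycle'])['gas1'].mean(), if you need the valve-1 and valve-2 cycles separately.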

swap cells in a dataframe to minimize the effect to the sum of the difference of correlation

Assume there is a dataframe below:
set.seed(100)
toydata <- data.frame(x = sample(1:50,50,replace = T),
y = sample(1:50,50,replace = T),
z = sample(1:50,50,replace = T)
)
Then I find all the cells whose values are below 10. For the first column:
toydata[toydata$x<10,1]
I get
[1] 3 9 9 7
For the second column,
toydata[toydata$y<10,2]
I get
[1] 7 5 2 7 2
For the third column,
toydata[toydata$z<10,3]
I get
[1] 3 1 5 2 2 6 1 3 5 8 7 3 1
and their positions
which(toydata$x<10)
[1] 4 10 26 40
which(toydata$y<10)
[1] 7 30 35 48 49
which(toydata$z<10)
[1] 3 9 13 16 26 30 36 38 42 43 45 48 49
I want to swap the values among the cells whose values are less than 10. The values in the other cells, whose values are equal to or greater than 10, remain unchanged.
The condition is that each cell whose value is less than 10 must be replaced by a new value.
The objective is to minimize the sum of the differences in correlation before and after the swap, i.e. minimize |cor(x,y)-cor(x',y')| + |cor(x,z)-cor(x',z')| + |cor(y,z)-cor(y',z')|,
where x', y', z' are the new columns after swapping,
and |·| means the absolute value.
Are there any good suggestions to fulfill this in R or Python with any packages?
Thanks.
If all you want to do is to swap the values below a certain threshold, meaning a permutation of those values, sample is your friend.
swapFun <- function(x, n = 10){
  inx <- which(x < n)
  x[sample(inx)] <- x[inx]
  x
}
toydata[toydata$x < 10, 1]
#[1] 3 9 9 7
which(toydata$x < 10)
#[1] 4 10 26 40
toy <- toydata # Work with a copy
toy[] <- lapply(toydata, swapFun)
toy[toy$x < 10, 1]
#[1] 9 7 3 9
which(toy$x < 10)
#[1] 4 10 26 40
Note that the order of the values less than 10 has changed but not where they can be found.
If you want another threshold, say 25, just do
toydata[] <- lapply(toydata, swapFun, n = 25)
To swap between columns, use another function. It starts by transforming the input data.frame into a vector. The swapping is done in the same way. Then back to data.frame.
swapFun2 <- function(DF, n = 10){
  x <- unlist(DF)
  inx <- which(x < n)
  x[sample(inx)] <- x[inx]
  x <- as.data.frame(matrix(x, ncol = ncol(DF)))
  names(x) <- names(DF)
  x
}
toy2 <- swapFun2(toydata)
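Since the question also allows Python, here is a rough pandas/numpy sketch of the same shuffle-below-threshold idea (my own translation, not a tested equivalent of the R code; column names follow toydata):

import numpy as np
import pandas as pd

rng = np.random.default_rng(100)
toydata = pd.DataFrame(rng.integers(1, 51, size=(50, 3)), columns=['x', 'y', 'z'])

def swap_within_columns(df, n=10):
    # Permute, separately per column, the values that are below the threshold n.
    out = df.copy()
    for col in out.columns:
        mask = out[col] < n
        out.loc[mask, col] = rng.permutation(out.loc[mask, col].to_numpy())
    return out

def swap_across_columns(df, n=10):
    # Permute the below-threshold values across the whole frame at once.
    flat = df.to_numpy().ravel().copy()
    idx = np.flatnonzero(flat < n)
    flat[idx] = rng.permutation(flat[idx])
    return pd.DataFrame(flat.reshape(df.shape), columns=df.columns)

toy = swap_within_columns(toydata)
toy2 = swap_across_columns(toydata)

Note that this only performs the permutation; it does not yet search over permutations to minimize the correlation objective from the question.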

Pandas sequentially apply function using output of previous value

I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val*decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = test.loc[0, 'val'] + (test['val']*decay).cumsum() - (test.loc[0, 'val']*decay)
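As a quick sanity check (my addition, reusing decayed, test and decay from the question's snippet above), the vectorized column should reproduce the loop exactly:

import numpy as np

# 'decayed' still holds the result of the explicit loop
assert np.allclose(test['loop_decay'], decayed)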
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a series that shifts the rows by 1
test['val2'] = test.val.shift()
# Set the first row on the shifted series to 0
test.loc[0, 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)

Finding periodicity in an algorithmic signal

In testing a conjecture about the recursive relation
x_{i+1} = p - 1 - ((p*i - 1) mod x_i), with p prime and x_0 = 1,
which claims a periodicity of some kind for the sequence of numbers, I wrote a Python program which computes the sequences and prints them in a table.
# Consider the recursive relation x_{i+1} = p-1 - (p*i-1 mod x_i)
# with p prime and x_0 = 1. What is the shortest period of the
# sequence?

from __future__ import print_function
import numpy as np
from matplotlib import pyplot as plt

# The length of the sequences.
seq_length = 100

upperbound_primes = 30

# Computing a list of prime numbers up to n
def primes(n):
    sieve = [True] * n
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i]:
            sieve[i*i::2*i]=[False]*((n-i*i-1)/(2*i)+1)
    return [2] + [i for i in xrange(3,n,2) if sieve[i]]

# The list of prime numbers up to upperbound_primes
p = primes(upperbound_primes)

# The amount of primes numbers
no_primes = len(p)

# Generate the sequence for the prime number p
def sequence(p):
    x = np.empty(seq_length)
    x[0] = 1
    for i in range(1,seq_length):
        x[i] = p - 1 - (p * (i-1) - 1) % x[i-1]
    return x

# List with the sequences.
seq = [sequence(i) for i in p]

"""
# Print the sequences in a table where the upper row
# indicates the prime numbers.
for i in range(seq_length):
    if not i:
        for n in p:
            print('\t',n,end='')
        print('')
    print(i+1,'\t',end='')
    for j in range(no_primes):
        print(seq[j][i],end='\t')
    print('\n',end='')
"""

def autocor(x):
    result = np.correlate(x,x,mode='full')
    return result[result.size/2:]

fig = plt.figure('Finding period in the sequences')
k = 0
for s in seq:
    k = k + 1
    fig.add_subplot(no_primes,1,k)
    plt.title("Prime number %d" % p[k-1])
    plt.plot(autocor(s))
plt.show()
Now I want to investigate periodicities in these sequences that I computed. After looking around on the net, I seem to have two options:
Perform autocorrelation on the data and look for the first peak. This should give an approximation of the period.
Perform an FFT on the data. This shows the frequencies present in the numbers. I do not see how this can give any useful information about the periodicity of a sequence of numbers.
The last lines show my attempt at using autocorrelation, inspired by the accepted answer of "How can I use numpy.correlate to do autocorrelation?".
It gives the following plot
Clearly we see a descending sequence of numbers for all the primes.
When testing the same method on a sine function with the following simplified Python code snippet
# Testing the autocorrelation of numpy

import numpy as np
from matplotlib import pyplot as plt

num_samples = 1000
t = np.arange(num_samples)
dt = 0.1

def autocor(x):
    result = np.correlate(x,x,mode='full')
    return result[result.size/2:]

def f(x):
    return [np.sin(i * 2 * np.pi * dt) for i in range(num_samples)]

plt.plot(autocor(f(t)))
plt.show()
I get a similar result; it gives the following plot for the sine function.
How could I read off the periodicity in the sine-function case, for example?
Anyhow, I do not understand the mechanism of the autocorrelation leading to peaks that give information of the periodicity of a signal. Can someone elaborate on that? How do you properly use autocorrelation in this context?
Also what am I doing wrong in my implementation of the autocorrelation?
Suggestions on alternative methods of determining periodicity in a sequence of numbers are welcome.
There are quite a few questions here, so I'll start by describing how an autocorrelation produces the period from the case of "3", i.e., your second sub-plot of the first image.
For prime number 3, the sequence is (after a less consistent start) 1,2,1,2,1,2,1,2,.... To calculate the autocorrelation, the array is basically translated relative to itself, all the elements that align are multiplied, and all of these results are added. So it looks something like this, for a few test cases, where A is the autocorrelation:
0 1 2 3 4 5 6 7 ... 43 44 45 46 47 48 49 # indices 0
1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 # values 0
1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 # values 1
0 1 2 3 4 5 6 7 ... 43 44 45 46 47 48 49 # indices 1
1 4 1 4 1 4 1 4 4 1 4 1 4 1 4 # products
# above result A[0] = 5*25 5=1+4 25=# of pairs # A[0] = 125
0 1 2 3 4 5 6 7 ... 43 44 45 46 47 48 49 # indices 0
1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 # values 0
1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 # values 1
0 1 2 3 4 5 6 7 ... 43 44 45 46 47 48 49 # indices 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 # products
# above result A[1] = 4*24 4=2+2 24=# of pairs # A[1] = 96
0 1 2 3 4 5 6 7 ... 43 44 45 46 47 48 49 # indices 0
1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 # values 0
1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 # values 1
0 1 2 3 4 5 6 7 ... 43 44 45 46 47 48 49 # indices 1
1 4 1 4 1 4 1 4 4 1 4 1 4 # products
# above result A[2] = 5*23 5=4+1 23=# of pairs # A[2] = 115
There are three take-home messages from the above: 1. the autocorrelation, A, has a larger value when like elements are lined up and multiplied, here at every other step. 2. The index of the autocorrelation corresponds to the relative shift. 3. When doing the autocorrelation over the full arrays, as shown here, there's always a downward ramp, since the number of points added together to produce each value is reduced at each successive shift.
So here you can see why there's a periodic 20% bump in your graph for "Prime number 3": because the terms that are summed are 1+4 when they are aligned, vs 2+2 when they aren't, i.e., 5 vs 4. It's this bump that you're looking for when reading off the period. That is, here it shows that the period is 2, since that is the index of your first bump. (Also note that in the above I only do the calculation in terms of the number of pairs, to see how this known periodicity leads to the result you see in the autocorrelation; one doesn't in general want to think in terms of the number of pairs.)
In these calculations, the values of the bump relative to the base will be increased if you first subtract the mean before doing the autocorrelation. The ramp can be removed if you do the calculation using arrays with trimmed ends, so there's always the same overlap; this usually makes sense since usually one is looking for a periodicity of much shorter wavelength than the full sample (because it takes many oscillations to define a good period of oscillation).
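As a hedged illustration of those two suggestions (my own sketch, not tom10's code): subtract the mean first, and only use lags for which the overlap has a constant length, so the downward ramp disappears:

import numpy as np

def autocor_detrended(x, max_lag=None):
    # Autocorrelation with the mean removed and a fixed overlap per lag,
    # so there is no ramp caused by a shrinking number of summed terms.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 2
    overlap = n - max_lag               # same number of products at every lag
    return np.array([np.dot(x[:overlap], x[lag:lag + overlap])
                     for lag in range(max_lag)])

# toy check on the period-2 sequence from the prime-3 example
s = np.array([1, 2] * 50, dtype=float)
print(autocor_detrended(s)[:6])         # alternating high/low; first bump at lag 2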
For the autocorrelation of the sine wave, the basic answer is that the period is shown as the first bump. I redid the plot, except with the time axis applied. It's always clearest in these things to use a real time axis, so I changed your code a bit to include that. (Also, I replaced the list comprehension with a proper vectorized numpy expression for calculating the sine wave, but that's not important here. And I also explicitly defined the frequency in f(x), just to make it clearer what's going on, since an implicit frequency of 1 is confusing.)
The main point is that since the autocorrelation is calculated by shifting along the axis one point at a time, the axis of the autocorrelation is just the time axis. So I plot that as the axis, and then can read the period off of that. Here I zoomed in to see it clearly (and the code is below):
# Testing the autocorrelation of numpy
import numpy as np
from matplotlib import pyplot as plt

num_samples = 1000
dt = 0.1
t = dt*np.arange(num_samples)

def autocor(x):
    result = np.correlate(x,x,mode='full')
    return result[result.size/2:]

def f(freq):
    return np.sin(2*np.pi*freq*t)

plt.plot(t, autocor(f(.3)))
plt.xlabel("time (sec)")
plt.show()
That is, in the above, I set the frequency to 0.3, and the graph shows the period as about 3.3, which is what's expected.
All of this said, in my experience the autocorrelation generally works well for physical signals but not so reliably for algorithmic signals. It's fairly easy to throw off, for example, if a periodic signal skips a step, which can happen with an algorithm but is less likely with a vibrating object. You'd think that it should be trivial to calculate the period of an algorithmic signal, but a bit of searching around will show that it's not, and it's even difficult to define what's meant by period. For example, the series:
1 2 1 2 1 2 0 1 2 1 2 1 2
won't work well with the autocorrelation test.
Update.
@tom10 gave a thorough survey of autocorrelation and explained why the first bump in the autocorrelation can give the period of the periodic signal.
I tried both approaches, FFT as well as autocorrelation. Their results agree, although I would prefer FFT over autocorrelation since it gives you the period more directly.
When using autocorrelation, we simply determine the coordinate of the first peak. A manual inspection of the autocorrelation graph will reveal whether you have the 'right' peak, since you can notice the period (although for primes above 7 this becomes less clear). I'm sure you could also work out a simple algorithm which calculates the 'right' peak. Perhaps someone could elaborate on a simple algorithm which does the job? (One possible sketch is shown below.)
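One simple way to automate that (a sketch of my own, not part of the original answers) is to take the first local maximum of the autocorrelation after lag 0:

import numpy as np

def first_autocor_peak(x):
    # Lag of the first local maximum of the autocorrelation after lag 0,
    # used as a naive period estimate; returns None if no peak is found.
    x = np.asarray(x, dtype=float) - np.mean(x)   # remove the mean first
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]
    for lag in range(1, len(ac) - 1):
        if ac[lag - 1] < ac[lag] and ac[lag] > ac[lag + 1]:
            return lag
    return None

# e.g. the prime-3 sequence 1,2,1,2,... should give a period of 2
print(first_autocor_peak([1, 2] * 50))   # -> 2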
See, for instance, the following plot of the sequences next to their autocorrelation.
Code:
# Plotting sequences satisfying, x_{i+1} = p-1 - (p*i-1 mod x_i)
# with p prime and x_0 = 1, next to their autocorrelation.

from __future__ import print_function
import numpy as np
from matplotlib import pyplot as plt

# The length of the sequences.
seq_length = 10000

upperbound_primes = 12

# Computing a list of prime numbers up to n
def primes(n):
    sieve = [True] * n
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i]:
            sieve[i*i::2*i]=[False]*((n-i*i-1)/(2*i)+1)
    return [2] + [i for i in xrange(3,n,2) if sieve[i]]

# The list of prime numbers up to upperbound_primes
p = primes(upperbound_primes)

# The amount of primes numbers
no_primes = len(p)

# Generate the sequence for the prime number p
def sequence(p):
    x = np.empty(seq_length)
    x[0] = 1
    for i in range(1,seq_length):
        x[i] = p - 1 - (p * (i-1) - 1) % x[i-1]
    return x

# List with the sequences.
seq = [sequence(i) for i in p]

# Autocorrelation function.
def autocor(x):
    result = np.correlate(x,x,mode='full')
    return result[result.size/2:]

fig = plt.figure("The sequences next to their autocorrelation")
plt.suptitle("The sequences next to their autocorrelation")

# Proper spacing between subplots.
fig.subplots_adjust(hspace=1.2)

# Set up pyplot to use TeX.
plt.rc('text',usetex=True)
plt.rc('font',family='serif')

# Maximize plot window by command.
mng = plt.get_current_fig_manager()
mng.resize(*mng.window.maxsize())

k = 0
for s in seq:
    k = k + 1
    fig.add_subplot(no_primes,2,2*(k-1)+1)
    plt.title("Sequence of the prime %d" % p[k-1])
    plt.plot(s)
    plt.xlabel(r"Index $i$")
    plt.ylabel(r"Sequence number $x_i$")
    plt.xlim(0,100)

    # Constrain the number of ticks on the y-axis, for clarity.
    plt.locator_params(axis='y',nbins=4)

    fig.add_subplot(no_primes,2,2*k)
    plt.title(r"Autocorrelation of the sequence $^{%d}x$" % p[k-1])
    plt.plot(autocor(s))
    plt.xlabel(r"Index $i$")
    plt.ylabel("Autocorrelation")

    # Proper scaling of the y-axis.
    ymin = autocor(s)[1]-int(autocor(s)[1]/10)
    ymax = autocor(s)[1]+int(autocor(s)[1]/10)
    plt.ylim(ymin,ymax)
    plt.xlim(0,500)

    plt.locator_params(axis='y',nbins=4)

    # Use scientific notation when 0 < t < 1 or t > 10
    plt.ticklabel_format(style='sci',axis='y',scilimits=(0,1))

plt.show()
When using FFT, we Fourier transform our sequence and look for the first peak. The coordinate of this first peak gives the frequency that represents our signal most coarsely. This gives our period, since the coarsest frequency is the frequency at which our sequence (ideally) oscillates.
See the following plot of the sequences next to their Fourier transforms.
Code:
# Plotting sequences satisfying, x_{i+1} = p-1 - (p*i-1 mod x_i)
# with p prime and x_0 = 1, next to their Fourier transforms.

from __future__ import print_function
import numpy as np
from matplotlib import pyplot as plt

# The length of the sequences.
seq_length = 10000

upperbound_primes = 12

# Computing a list of prime numbers up to n
def primes(n):
    sieve = [True] * n
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i]:
            sieve[i*i::2*i]=[False]*((n-i*i-1)/(2*i)+1)
    return [2] + [i for i in xrange(3,n,2) if sieve[i]]

# The list of prime numbers up to upperbound_primes
p = primes(upperbound_primes)

# The amount of primes numbers
no_primes = len(p)

# Generate the sequence for the prime number p
def sequence(p):
    x = np.empty(seq_length)
    x[0] = 1
    for i in range(1,seq_length):
        x[i] = p - 1 - (p * (i-1) - 1) % x[i-1]
    return x

# List with the sequences.
seq = [sequence(i) for i in p]

fig = plt.figure("The sequences next to their FFT")
plt.suptitle("The sequences next to their FFT")

# Proper spacing between subplots.
fig.subplots_adjust(hspace=1.2)

# Set up pyplot to use TeX.
plt.rc('text',usetex=True)
plt.rc('font',family='serif')

# Maximize plot window by command.
mng = plt.get_current_fig_manager()
mng.resize(*mng.window.maxsize())

k = 0
for s in seq:
    f = np.fft.rfft(s)
    f[0] = 0
    freq = np.fft.rfftfreq(seq_length)
    k = k + 1
    fig.add_subplot(no_primes,2,2*(k-1)+1)
    plt.title("Sequence of the prime %d" % p[k-1])
    plt.plot(s)
    plt.xlabel(r"Index $i$")
    plt.ylabel(r"Sequence number $x_i$")
    plt.xlim(0,100)

    # Constrain the number of ticks on the y-axis, for clarity.
    plt.locator_params(nbins=4)

    fig.add_subplot(no_primes,2,2*k)
    plt.title(r"FFT of the sequence $^{%d}x$" % p[k-1])
    plt.plot(freq,abs(f))
    plt.xlabel("Frequency")
    plt.ylabel("Amplitude")
    plt.locator_params(nbins=4)

    # Use scientific notation when 0 < t < 1 or t > 10
    plt.ticklabel_format(style='sci',axis='y',scilimits=(0,1))

plt.show()
To see why the FFT method is more convenient than autocorrelation, notice that we have a clear algorithm for determining the period: find the first peak of the Fourier transform. For a sufficient number of samples this always works.
See the following table, obtained by the FFT method, which agrees with the autocorrelation method.
prime frequency period
2 0.00 1000.00
3 0.50 2.00
5 0.08 12.00
7 0.02 59.88
11 0.00 1000.00
The following code implements the algorithm, printing a table specifying the frequency and period of the sequences per prime number.
# Print a table of periods, determined by the FFT method,
# of sequences satisfying,
# x_{i+1} = p-1 - (p*i-1 mod x_i) with p prime and x_0 = 1.

from __future__ import print_function
import numpy as np
from matplotlib import pyplot as plt

# The length of the sequences.
seq_length = 10000

upperbound_primes = 12

# Computing a list of prime numbers up to n
def primes(n):
    sieve = [True] * n
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i]:
            sieve[i*i::2*i]=[False]*((n-i*i-1)/(2*i)+1)
    return [2] + [i for i in xrange(3,n,2) if sieve[i]]

# The list of prime numbers up to upperbound_primes
p = primes(upperbound_primes)

# The amount of primes numbers
no_primes = len(p)

# Generate the sequence for the prime number p
def sequence(p):
    x = np.empty(seq_length)
    x[0] = 1
    for i in range(1,seq_length):
        x[i] = p - 1 - (p * (i-1) - 1) % x[i-1]
    return x

# List with the sequences.
seq = [sequence(i) for i in p]

# Function that finds the first peak.
# Assumption: seq_length >> 10 so the Fourier transformed
# signal is sufficiently smooth.
def firstpeak(x):
    for i in range(10,len(x)-1):
        if x[i+1] < x[i]:
            return i
    return len(x)-1

k = 0
for s in seq:
    f = np.fft.rfft(s)
    freq = np.fft.rfftfreq(seq_length)
    k = k + 1
    if k == 1:
        print("prime \t frequency \t period")
    print(p[k-1],'\t %.2f' % float(freq[firstpeak(abs(f))]), \
        '\t\t %.2f' % float(1/freq[firstpeak(abs(f))]))
I used 10000 samples (seq_length) in all the above code. As we increase the number of samples, the periods are seen to converge to a certain integer value (using the FFT method).
The FFT method seems to me an ideal tool for determining periods in algorithmic signals, limited only by how large a sample count your equipment can handle.
