I want to make sure I am using numpy's correlate correctly; it is not giving me the answer I expect. Perhaps I am misunderstanding the correlate function. Here is a code snippet with comments:
import numpy as np
ref = np.sin(np.linspace(-2*np.pi, 2*np.pi, 10000)) # make some data
fragment = ref[2149:7022] # create a fragment of data from ref
corr = np.correlate(ref, fragment) # Find the correlation between the two
maxLag = np.argmax(corr) # find the maximum lag, this should be the offset that we chose above, 2149
print(maxLag)
2167 # I expected this to be 2149.
Isn't the index in the corr array where the correlation is maximum the lag between these two datasets? I would think the starting index I chose for the smaller dataset would be the offset with the greatest correlation.
Why is there a discrepancy between what I expect, 2149, and the result, 2167?
Thanks
That looks like a precision error to me. Cross-correlation is an integral, and it will always have problems when represented in discrete space; I guess the problem arises when the values are close to 0. Maybe the difference will disappear if you increase the numbers or the precision, but I don't think that is really necessary, since you are already dealing with an approximation when using the discrete cross-correlation. A plot of the correlation shows that the values near the peak are indeed close.
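As a possible alternative reading (an assumption, not something confirmed in this thread): np.correlate returns an unnormalized sliding dot product, so its value also depends on how much energy each window of ref contains, and its peak need not sit at the true offset. A minimal sketch that instead locates the fragment by the smallest sum of squared differences, using the identity SSD = sum(window^2) - 2*corr + sum(fragment^2). The names window_energy and ssd are introduced here for illustration only:

```python
import numpy as np

ref = np.sin(np.linspace(-2 * np.pi, 2 * np.pi, 10000))
fragment = ref[2149:7022]
m = len(fragment)

# Sliding dot product of the fragment against every full-overlap window of ref
corr = np.correlate(ref, fragment, mode="valid")

# Energy (sum of squares) of every window of ref with the fragment's length,
# computed from a cumulative sum so no large intermediate array is needed
csum = np.concatenate(([0.0], np.cumsum(ref ** 2)))
window_energy = csum[m:] - csum[:-m]

# Sum of squared differences between the fragment and each window
ssd = window_energy - 2.0 * corr + np.sum(fragment ** 2)

offset = int(np.argmin(ssd))
print(offset)
```

The squared difference is (numerically) zero only where the window and the fragment coincide, so the minimum does not depend on the local amplitude of ref.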
I am currently gathering X gyro data of a device. I need to find the differences between them in terms of their y values in the graphs such as this one.
I have written an algorithm that finds the min and max y values, but when the data has minor fluctuations like this one, my algorithm returns faulty answers. I wrote it in Python as follows:
import numpy as np
x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")  # two comma-separated columns: x, y
max_value = max(y)  # largest y value
min_value = min(y)  # smallest y value
print(min_value)
borders = max_value - min_value  # peak-to-peak range of y
I now need to write an algorithm that will:
Determine the max and min y value and draw their borders.
If it sees minor fluctuations it will ignore them.
Would writing such an algorithm be possible, and if so how could I go about writing one? Are there any libraries or any reading material you would recommend? Thanks for your help.
1. Generally, maths like this should be done in plain code, with little or no help from external APIs, so that it is easier to debug the algorithm and its steps.
2. Now for a possible answer:
Since you do not want any outliers (those buggy and irritating minor fluctuations), you need to calculate the standard deviation of your data.
What is the standard deviation, you might ask?
It measures how far your data typically spreads from the mean (symbol: µ), i.e. the average of your data set.
If you do not know what it is, here is a quick tutorial on the standard deviation and its partner, the variance.
First, the MEAN:
It should not be that hard to calculate the mean of your x and y arrays/lists. Just loop (a for loop would be optimal) through each list, add up all the values and divide the total by the length of the list. There you have the mean.
Second, the VARIANCE (σ squared):
If you followed the website above: to calculate the variance, loop through the x and y lists again, subtract the respective mean from each value to get the difference, square this difference, add all the squared differences up and divide by the length of the respective list, and you have the variance.
For the Standard Deviation (Symbol: σ), just take the square root of the variance.
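The three steps above can be sketched in plain Python as follows. This is only an illustration on a made-up list y; the function names mean, variance and std_dev are chosen here and are not from the original answer. It computes the population variance (dividing by the length of the list), as described above:

```python
import math

def mean(values):
    """Sum of the values divided by how many there are."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

def variance(values):
    """Average squared difference between each value and the mean."""
    mu = mean(values)
    total = 0.0
    for v in values:
        total += (v - mu) ** 2
    return total / len(values)

def std_dev(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))

y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample data
print(mean(y), variance(y), std_dev(y))  # prints 5.0 4.0 2.0
```

For real data, numpy's mean, var and std functions give the same quantities with their default arguments.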
Now the standard deviation can be used to find the real max and min (leaving out those buggy outliers).
Use this graph as an approximate reference to find where most of your values may be:
Since your data is mostly uniform, you should get a pretty accurate answer.
Do test the different situations: σ + µ or 2σ + µ; to find the optimum max and min.
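One way to read the suggestion above is to keep only the values within k standard deviations of the mean and take the max and min of what remains. A minimal sketch under that assumption; the function name robust_min_max, the parameter k and the sample data are hypothetical, and the right k has to be found by experiment as noted above:

```python
import numpy as np

def robust_min_max(y, k=2.0):
    """Min and max of the values lying within k standard deviations of the mean."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    sigma = y.std()
    keep = np.abs(y - mu) <= k * sigma  # True for values that are not treated as outliers
    return y[keep].min(), y[keep].max()

# Made-up readings: mostly close to 1.0, with one spike at 9.0
y = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 9.0])
lo, hi = robust_min_max(y, k=2.0)
print(lo, hi)  # prints 0.8 1.2
```

Note that a large spike also inflates the mean and the standard deviation themselves, so a smaller k or a second pass over the filtered data may be needed for noisy recordings.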
Edit:
Sorry, only y now:
Sorry for the horrible representation and drawing. This is what it should look like graphically. Also, do experiment yourself with the standard deviation from the mean (like above: σ + µ or 2σ + µ) to find the max and min that suit you.
I am fairly new to Python and learning some basic Python for data science, and I am trying to ascertain the min and max values for an array when you run randn.
To clarify the question: how low and how high could the numbers below potentially get?
Does it have anything to do with the row/column values entered?
I thought they were values between -1 and 1, but this is not the case when I test.
import numpy as np
np.random.randn(3,3)
array([[ 1.61311526, -1.20028357, -0.41723647],
[-0.31983635, -3.05411198, -0.43453723],
[ 0.09385744, -0.28239577, -1.17262933]])
As mentioned by others, graphically, the probability distribution looks like this.
Probability of getting a value from -0.5 to 0.5: 19.1% + 19.1% = 38.2%
Probability of getting a value larger than 3 = 0.1%
np.random.randn returns a sample (or samples) from the “standard normal” distribution (see documentation here).
The standard normal distribution is not bounded. However, the probability for example to get a sample smaller than -3 is 0.0013.
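The probabilities quoted in these answers can be checked from the standard normal cumulative distribution function, which can be written with the error function from Python's math module. A small sketch; the helper name std_normal_cdf is introduced here for illustration:

```python
import math

def std_normal_cdf(x):
    """P(Z <= x) for a standard normal variable Z."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_below_minus3 = std_normal_cdf(-3.0)                    # about 0.0013
p_central = std_normal_cdf(0.5) - std_normal_cdf(-0.5)   # about 0.383

print(round(p_below_minus3, 4), round(p_central, 3))
```

Both probabilities are strictly positive for any bounds, which is another way of seeing that the distribution has no hard minimum or maximum.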
The function numpy.random.randn returns values from the standard normal distribution, which can be anything between negative and positive infinity, so there's no max or min. These values are distributed along the "bell curve" centered at 0, and are exponentially less likely to occur the farther you get from 0.
The row/column parameters don't determine any (non-existent) max/min; they just determine the shape of the output array (see the documentation).
So in your example, passing (3,3) into np.random.randn(3,3) returns a 3x3 array of values from the standard normal distribution.
Basically, there's no max or min value, but since higher numbers are less likely to come up or in other words have lower probabilities, you're usually only looking at -3.5 to 3.5. But the larger the size of random data you are trying to generate, the higher the chances of generating a larger value.
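The last point can be illustrated by looking at growing prefixes of one large sample. This sketch uses numpy's Generator API (np.random.default_rng and standard_normal) in place of np.random.randn, with a seed chosen arbitrarily here; both draw from the standard normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for repeatability
samples = rng.standard_normal(1_000_000)  # same distribution as np.random.randn

sizes = (100, 10_000, 1_000_000)
# Largest absolute value seen among the first n draws, for each n
largest = [float(np.abs(samples[:n]).max()) for n in sizes]

for n, value in zip(sizes, largest):
    print(n, value)
```

Because each prefix contains the previous one, the largest magnitude can only stay the same or grow as more values are drawn.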
The numpy.random.randn function is based on a standard normal distribution, meaning that there is no maximum value and no minimum value. However, values far from zero (in either direction) are less likely to be produced than ones closer to zero.
I want to apply a Fourier transform, using the fft function, to my time series data to find "patterns" by extracting the dominant frequency components in the observed data, i.e. the lowest 5 dominant frequencies, to predict the y value (bacteria count) at the end of each time series.
I would like to preserve the smallest 5 coefficients as features, and eliminate the rest.
My code is as below:
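import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
X = df.iloc[0:2,0:10000]  # first 2 rows (one time series each), first 10000 columns
dft_X = np.fft.fft(X)  # FFT along the last axis, i.e. one transform per row
print(dft_X)
print(len(dft_X))
plt.plot(dft_X)
plt.grid(True)
plt.show()
# What is the graph about(freq/amplitude)? How much data did it use?
for i in dft_X:
    m = i[np.argpartition(i,5)[:5]]
    n = i[np.argpartition(i,range(5))[:5]]
    print(m,'\n',n)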
Here is the output:
But I am not sure how to interpret this graph. To be precise,
1) Does the graph show the transformed values of the input data? I only used 2 rows of data(each row is a time series), thus data is 2x10000, why are there so many lines in the graph?
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
Parameters:
n : int
Window length.
d : scalar, optional
Sample spacing (inverse of the sampling rate). Defaults to 1.
Returns:
f : ndarray
Array of length n containing the sample frequencies.
How to determine n(window length) and sample spacing?
3) Why are transformed values all complex numbers?
Thanks
I'm gonna answer in reverse order of your questions
3) Why are transformed values all complex numbers?
The output of a Fourier Transform is always complex numbers. To get around this fact, you can either apply the absolute value on the output of the transform, or only plot the real part using:
plt.plot(dft_X.real)
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
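You don't need it to compute the transform: the FFT output is already ordered by frequency bin. np.fft.fftfreq(n, d=timestep) is only needed if you want to label each bin with its frequency, with n being the number of samples in each series and d the time between two samples.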
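To make the roles of n and d concrete, here is a small sketch on a synthetic signal. The sampling interval timestep = 0.001 and the 50 Hz test tone are assumptions made up for this example; for real data, n is the number of samples per series and d is the actual time between samples:

```python
import numpy as np

n = 10000             # number of samples in the series
timestep = 0.001      # assumed time between samples, in seconds
t = np.arange(n) * timestep
signal = np.sin(2 * np.pi * 50.0 * t)   # synthetic 50 Hz tone

spectrum = np.fft.fft(signal)
freqs = np.fft.fftfreq(n, d=timestep)   # frequency (Hz) of each FFT bin

# Look only at the positive-frequency bins and find the strongest one
positive = freqs > 0
peak_freq = freqs[positive][np.argmax(np.abs(spectrum[positive]))]
print(peak_freq)
```

The FFT values themselves do not change with d; only the frequency labels attached to the bins do.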
1) Does the graph show the transformed values of the input data? I only used 2 rows of data(each row is a time series), thus data is 2x10000, why are there so many lines in the graph?
Your graph has so many lines because it's making a line for each column of your data set. Apply the FFT on each row separately (or possibly just transpose your dataframe) and then you'll get more actual frequency domain plots.
Follow up
Would using absolute value or real part of the output as features for a later model have different effect than using the original output?
Absolute values are easier to work with usually.
Using real part
Using absolute value
Here's the Octave code that generated this:
Fs = 4000; % Sampling rate of signal
T = 1/Fs; % Period
L = 4000; % Length of signal
t = (0:L-1)*T; % Time axis
freq = 1000; % Frequency of our sinusoid
sig = sin(freq*2*pi*t); % Fill Time-Domain with 1000 Hz sinusoid
f_sig = fft(sig); % Apply FFT
f = Fs*(0:(L/2))/L; % Frequency axis
figure
plot(f,abs(f_sig/L)(1:end/2+1)); % absolute value (peak at 1 kHz)
figure
plot(f,real(f_sig/L)(1:end/2+1)); % real part (main peak at 1 kHz)
In my example, you can see the absolute value returned no noise at frequencies other than the sinusoid of frequency 1kHz I generated while the real part had a bigger peak at 1kHz but also had much more noise.
As for effects, I don't know what you mean by that.
is it expected that "frequency values" always be complex numbers
Always? No. A Fourier series expresses a periodic function as a sum of sines and cosines, and the Fourier coefficients are the weights of those terms. Sines and cosines can be written in complex form through Euler's formula, which is the most convenient way to store Fourier coefficients. The real and imaginary parts of each frequency-domain value together encode the magnitude and the phase of that component (i.e. if I have 2 sine functions of the same frequency, they can have different complex forms depending on the time shift). That is why most libraries that provide an FFT function store FFT coefficients as complex numbers by default: it facilitates phase and magnitude calculations.
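The remark about two sines of the same frequency can be checked numerically. In this sketch the signal length, the 10-cycle frequency and the phase offset of pi/4 are arbitrary choices made for illustration:

```python
import numpy as np

n = 1000
t = np.arange(n) / n     # one unit of time, n samples
k = 10                   # number of full cycles in the window

a = np.sin(2 * np.pi * k * t)
b = np.sin(2 * np.pi * k * t + np.pi / 4)   # same frequency, shifted in time

A = np.fft.fft(a)[k]     # complex coefficient of the bin holding that frequency
B = np.fft.fft(b)[k]

mag_a, mag_b = abs(A), abs(B)               # magnitudes are the same
phase_shift = np.angle(B) - np.angle(A)     # phases differ by the offset
print(mag_a, mag_b, phase_shift)
```

Taking only the real part or only the absolute value therefore discards the information that distinguishes the two signals.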
Is it convention that FFT use each column of dataset when plotting a line
I think it comes from how matplotlib's plot handles 2-D input, not from np.fft.
Could you please show me how to apply FFT on each row separately
There are many ways to go about this and I don't want to force you down one path, so I will propose the general solution: iterate over each row of your dataframe and apply the FFT to that row. Otherwise, in your case, I believe transposing your output could also work.
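A minimal sketch of both routes, using random data as a hypothetical stand-in for the 2 x 10000 dataframe. Keeping the first five bins as features is only one possible reading of "the lowest 5 frequencies", and it includes the zero-frequency (mean) term:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 256))   # stand-in data: 2 time series, one per row

# Route 1: iterate over the rows and transform each one
rows = np.array([np.fft.fft(row) for row in X])

# Route 2: let numpy transform along axis 1 (each row) in one call
all_at_once = np.fft.fft(X, axis=1)

magnitude = np.abs(all_at_once)     # real-valued, shape (2, 256)
features = magnitude[:, :5]         # first five bins of each series
print(features.shape)
```

When plotting, plt.plot draws one line per column of a 2-D array, so passing magnitude.T (or plotting magnitude[0] and magnitude[1] separately) gives one line per time series.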
I want to combine phase spectrum of one image and magnitude spectrum of different image into one image.
I have got phase spectrum and magnitude spectrum of image A and image B.
Here is the code.
import numpy as np

# grayA and grayB are the two grayscale images (2-D arrays)
f = np.fft.fft2(grayA)
fshift1 = np.fft.fftshift(f)
phase_spectrumA = np.angle(fshift1)
magnitude_spectrumA = 20*np.log(np.abs(fshift1))
f2 = np.fft.fft2(grayB)
fshift2 = np.fft.fftshift(f2)
phase_spectrumB = np.angle(fshift2)
magnitude_spectrumB = 20*np.log(np.abs(fshift2))
I am trying to figure it out, but I still do not know how to do that.
Below is my test code.
imgCombined = abs(f) * math.exp(1j*np.angle(f2))
I wish the result could come out just like this:
Here are the few things that you would need to fix for your code to work as intended:
The math.exp function supports scalar exponentiation. For an element-wise matrix exponentiation you should use numpy.exp instead.
Similarly, note that the * operator performs matrix multiplication only for np.matrix objects; for ordinary NumPy arrays it already multiplies element-wise. Either way, what you want is element-wise multiplication, which np.multiply performs explicitly.
With these fixes you should get the frequency-domain combined matrix as follows:
combined = np.multiply(np.abs(f), np.exp(1j*np.angle(f2)))
To obtain the corresponding spatial-domain image, you would then need to compute the inverse transform (and take the real part, since there may be small residual imaginary parts due to numerical errors) with:
imgCombined = np.real(np.fft.ifft2(combined))
Finally the result can be shown with:
import matplotlib.pyplot as plt
plt.imshow(imgCombined, cmap='gray')
Note that imgCombined may contain values outside the [0,1] range. You would then need to decide how you want to rescale the values to fit the expected [0,1] range.
The default scaling (resulting in the image shown above) is to linearly scale the values such that the minimum value is set to 0, and the maximum value is set to 1.
Another way could be to limit the values to that range (i.e. forcing all negative values to 0 and all values greater than 1 to 1).
Finally another approach, which seems to provide a result closer to the screenshot provided, would be to take the absolute value with imgCombined = np.abs(imgCombined)
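Putting the steps together, here is a self-contained sketch that uses small random arrays as hypothetical stand-ins for grayA and grayB (real images would be loaded separately). It combines the magnitude of A with the phase of B and then applies the first two rescaling options described above; the names rescaled and clipped are introduced here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
grayA = rng.random((32, 32))   # stand-in for the first grayscale image
grayB = rng.random((32, 32))   # stand-in for the second grayscale image

f = np.fft.fft2(grayA)
f2 = np.fft.fft2(grayB)

# Magnitude of A combined with the phase of B, element by element
combined = np.multiply(np.abs(f), np.exp(1j * np.angle(f2)))
imgCombined = np.real(np.fft.ifft2(combined))

# Option 1: linear rescaling so that min maps to 0 and max maps to 1
span = imgCombined.max() - imgCombined.min()
rescaled = (imgCombined - imgCombined.min()) / span

# Option 2: clip everything outside [0, 1]
clipped = np.clip(imgCombined, 0.0, 1.0)
```

Either array can then be passed to plt.imshow with cmap='gray'; which option looks better depends on how far the values fall outside the expected range.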
I am trying to find the minimum of the second derivative for the elements in the array below:
scores = [-100.07, -40.04, -26.97, -17.31, -13.12, -9.02, -7.22,
-5.23, -4.37, -3.44, -2.92, -2.36, -2.11, -1.78,
-1.59, -1.37, -1.23, -1.1 , -0.97, -0.87]
Plotting these data when k is 20 results in the plot below. I've circled in red the "optimal" number of clusters based on this score measure.
The issue, though, is that I had to manually intervene (i.e. the computer didn't select this point of minimization). I would rather have the computer select the appropriate k, so I'm trying to find the point at which the second derivative is smallest.
I have tried differencing, e.g.
import numpy as np
first_diff = np.diff(scores, 1)
second_diff = np.diff(scores, 2)
But this isn't satisfactory, as the sequence of second differences produces some positive numbers and then some negative numbers, which won't produce the results I want when using np.argmin. Using percent changes doesn't work well, either.
Is there a reliable way to difference these types of vectors?
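One detail worth checking before using np.argmin on the differences is how the indices line up. The sketch below only illustrates that alignment and is not offered as a reliable way to pick k: np.diff(scores, 2) has two fewer elements than scores, and its element i is the second difference centered on scores[i + 1]. How a position in scores maps to a value of k is not stated in the question, so that last step is left open:

```python
import numpy as np

scores = np.array([-100.07, -40.04, -26.97, -17.31, -13.12, -9.02, -7.22,
                   -5.23, -4.37, -3.44, -2.92, -2.36, -2.11, -1.78,
                   -1.59, -1.37, -1.23, -1.1, -0.97, -0.87])

second_diff = np.diff(scores, 2)

# The same quantity written out: scores[i+2] - 2*scores[i+1] + scores[i]
manual = scores[2:] - 2 * scores[1:-1] + scores[:-2]

i = int(np.argmin(second_diff))   # position within second_diff
center_index = i + 1              # corresponding position within scores
print(i, center_index)            # prints 0 1
```

For these data the most negative second difference sits at the very first bend of the curve, which shows that the raw argmin follows the largest drop in slope and not necessarily the visually chosen elbow.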