How to create spacing of points per decade for logarithmic plot - python

I have some range of frequencies that goes from freq_start_hz = X to freq_stop_hz = Y.
I am trying to logarithmically (base 10) space out samples between the range [freq_start_hz, freq_stop_hz], based on a number of samples per decade (num_samp_per_decade), inclusive of the endpoint.
I noticed numpy has a function, logspace, which creates logarithmic divisions of a range from base ** start to base ** stop based on a total number of samples, num.
Can you help me create Python code that will create even logarithmic spacing per decade?
freq_start_hz = 10, freq_stop_hz = 100, num_samp_per_decade = 5
This is easy, since it's just one decade. So one could create it using the following:
import numpy as np
from math import log10
freq_start_hz = 10
freq_stop_hz = 100
num_samp_per_decade = 5
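freq_list = np.logspace(
    start=log10(freq_start_hz),
    stop=log10(freq_stop_hz),
    num=num_samp_per_decade - 1,
    endpoint=False,
    base=10,
)
freq_list = np.append(freq_list, freq_stop_hz)  # Appending end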
Output is [10.0, 17.78279410038923, 31.622776601683793, 56.23413251903491, 100.0]
Note: this worked nicely because I designed it this way. If freq_start_hz = 8, this method no longer works since it now spans multiple decades.
I am hoping somewhere out there, there's a premade method in math, numpy, another scipy library, or some other library that my internet searching hasn't turned up.

Calculate the number of points based on the number of decades in the range.
from math import log10
import numpy as np
start = 10
end = 1500
samples_per_decade = 5
ndecades = log10(end) - log10(start)
npoints = int(ndecades) * samples_per_decade
#a = np.linspace(log10(start), log10(end), num = npoints)
#points = np.power(10, a)
points = np.logspace(log10(start), log10(end), num=npoints, endpoint=True, base=10)
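When the range is not a whole number of decades, one possible approach is to step through log10 space in fixed increments and append the stop value if the grid does not land on it. The sketch below is only an illustration: the helper name log_spaced is hypothetical, and it treats samples_per_decade as the number of steps per decade (so one full decade yields samples_per_decade + 1 points, endpoint included).

```python
import numpy as np

def log_spaced(start_hz, stop_hz, samples_per_decade):
    """Points from start_hz to stop_hz with a fixed number of steps per decade.

    Hypothetical helper: samples_per_decade is interpreted as steps per decade.
    """
    n_decades = np.log10(stop_hz / start_hz)
    # Whole steps of size 1/samples_per_decade that fit in the range
    # (the small epsilon guards against floating-point round-off).
    n_steps = int(np.floor(n_decades * samples_per_decade + 1e-9))
    exponents = np.log10(start_hz) + np.arange(n_steps + 1) / samples_per_decade
    points = 10.0 ** exponents
    # Include the stop value when the regular grid stops short of it.
    if not np.isclose(points[-1], stop_hz):
        points = np.append(points, stop_hz)
    return points

print(log_spaced(10, 100, 4))
print(log_spaced(8, 100, 4))
```

With a start of 8 the regular grid ends before 100, so the last interval is shorter than the others. Whether that is acceptable depends on how the points are used.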


Simulate the compound random variable S

Let S = X_1 + X_2 + ... + X_N, where N is a nonnegative integer-valued random variable and X_1, X_2, ... are i.i.d. random variables. (If N = 0, we set S = 0.)
Simulate S in the case where N ~ Poi(100) and X_i ~ Exp(0.5) (draw histograms and use the numpy or scipy built-in functions), and check the equations E(S) = E(N)*E(X_1) and Var(S) = E(N)*Var(X_1) + E(X_1)^2*Var(N).
I was trying to solve it, but I'm not sure of everything yet and also got stuck on the histogram part. Note: I'm new to Python and, more generally, new to programming.
My work:
import scipy.stats as stats
import matplotlib as plt
N = stats.poisson(100)
X = stats.expon(0.5)
arr = X.rvs(N.rvs())
S = 0
for i in arr:
    S = S + i
expected_S = (N.mean())*(X.mean())
variance_S = (N.mean()*X.var()) + (X.mean()*X.mean()*N.var())
Your existing code mostly looks sensible, but I'd simplify:
arr = X.rvs(N.rvs())
S = 0
for i in arr:
    S = S + i
down to:
S = X.rvs(N.rvs()).sum()
To draw a histogram, you need many samples from this distribution, which is now easily accomplished via:
arr = []
for _ in range(10_000):
    arr.append(X.rvs(N.rvs()).sum())
or, equivalently, using a list comprehension:
arr = [X.rvs(N.rvs()).sum() for _ in range(10_000)]
To plot these in a histogram, you need the pyplot module from Matplotlib, so your import should be:
import matplotlib.pyplot as plt
plt.hist(arr, 50)
The 50 above says to use that number of "bins" when drawing the histogram. We can also compare these to the mean and variance you calculated by assuming the distribution is well approximated by a normal:
import numpy as np
approx = stats.norm(expected_S, np.sqrt(variance_S))
_, x, _ = plt.hist(arr, 50, density=True)
plt.plot(x, approx.pdf(x))
This works because the second value returned from matplotlib's hist method is the array of bin edges. I used density=True so I could work with probability densities, but another option would be to multiply the densities by the number of samples and the bin width to get expected counts like the previous histogram.
Running this gives me:
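To check the two moment equations numerically, a minimal sketch using NumPy alone could look like the following. It assumes that Exp(0.5) means a rate of 0.5 (mean 2); the seed, the number of draws, and the variable names are choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 100.0      # N ~ Poisson(100)
rate = 0.5       # assumption: Exp(0.5) is parameterized by its rate, so the mean is 1/rate
n_draws = 10_000

counts = rng.poisson(lam, size=n_draws)
# One draw of S per count: the sum of that many exponential variates (0.0 when the count is 0).
samples = np.array([rng.exponential(1.0 / rate, size=k).sum() for k in counts])

mean_x = 1.0 / rate
var_x = 1.0 / rate**2
expected_s = lam * mean_x                    # E(N) * E(X_1)
variance_s = lam * var_x + mean_x**2 * lam   # E(N) * Var(X_1) + E(X_1)^2 * Var(N)

print(samples.mean(), expected_s)
print(samples.var(), variance_s)
```

Note that scipy's stats.expon takes loc and scale arguments, so under the rate assumption the distribution would be written stats.expon(scale=2), whereas stats.expon(0.5) shifts a unit-scale exponential by 0.5.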

How do I force two arrays to be equal for use in pyplot?

I'm trying to plot a simple moving averages function but the resulting array is a few numbers short of the full sample size. How do I plot such a line alongside a more standard line that extends for the full sample size? The code below results in this error message:
ValueError: x and y must have same first dimension, but have shapes (96,) and (100,)
This is using standard matplotlib.pyplot. I've tried just deleting X values using remove and del as well as switching all arrays to numpy arrays (since that's the output format of my moving averages function) then tried adding an if condition to the append in the while loop but neither has worked.
import random
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
def movingaverage(values, window):
    weights = np.repeat(1.0, window) / window
    smas = np.convolve(values, weights, 'valid')
    return smas
sampleSize = 100
min = -10
max = 10
window = 5
vX = np.array([])
vY = np.array([])
x = 0
val = 0
while x < sampleSize:
    val += (random.randint(min, max))
    vY = np.append(vY, val)
    vX = np.append(vX, x)
    x += 1
plt.plot(vX, vY)
plt.plot(vX, movingaverage(vY, window))
Expected results would be two lines on the same graph - one a simple moving average of the other.
Just change this line to the following:
smas = np.convolve(values, weights,'same')
The 'valid' option only computes the convolution where the window completely overlaps the values array, so the result is shorter than the input. What you want is 'same', which returns an array of the same length as the input.
Edit: This, however, also comes with its own issues as it acts like there are extra bits of data with value 0 when your window does not fully sit on top of the data. This can be ignored if chosen, as is done in this solution, but another approach is to pad the array with specific values of your choosing instead (see Mike Sperry's answer).
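The difference between the two modes, and the edge effect mentioned in the edit, can be seen on a constant input. This is a small sketch with made-up data (an array of ones), not the asker's random walk:

```python
import numpy as np

values = np.ones(100)
window = 5
weights = np.repeat(1.0, window) / window

valid = np.convolve(values, weights, 'valid')  # only full overlaps
same = np.convolve(values, weights, 'same')    # same length as the input

print(len(valid), len(same))
# Near the edges 'same' averages in implicit zeros, so the values dip below 1.
print(same[:3], same[-3:])
```

The interior of the 'same' result matches the 'valid' result; only the first and last few entries are affected by the implicit zero padding.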
Here is how you would pad a numpy array out to the desired length with NaNs (replace np.nan with other values, or replace 'constant' with another mode depending on desired results):
import numpy as np
bob = np.asarray([1, 2, 3], dtype=float)  # NaN requires a float array
alice = np.pad(bob, (0, 100 - len(bob)), 'constant', constant_values=np.nan)
So in your code it would look something like this:
import random
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
def movingaverage(values, window):
    weights = np.repeat(1.0, window) / window
    smas = np.convolve(values, weights, 'valid')
    # pad both ends with NaN so the result has the same length as the input
    shorted = (len(values) - len(smas)) // 2
    smas = np.pad(smas, (shorted, shorted), 'constant', constant_values=np.nan)
    return smas
sampleSize = 100
min = -10
max = 10
window = 5
vX = np.array([])
vY = np.array([])
x = 0
val = 0
while x < sampleSize:
    val += (random.randint(min, max))
    vY = np.append(vY, val)
    vX = np.append(vX, x)
    x += 1
plt.plot(vX, vY)
plt.plot(vX, movingaverage(vY, window))
To answer your basic question, the key is to take a slice of the x-axis appropriate to the data of the moving average. Since you have a convolution of 100 data elements with a window of size 5, the result is valid for the last 96 elements. You would plot it like this:
plt.plot(vX[window - 1:], movingaverage(vY, window))
That being said, your code could stand to have some optimization done on it. For example, numpy arrays are stored in fixed size static buffers. Any time you do append or delete on them, the entire thing gets reallocated, unlike Python lists, which have amortization built in. It is always better to preallocate if you know the array size ahead of time (which you do).
Secondly, running an explicit loop is rarely necessary. You are generally better off using the under-the-hood loops implemented at the lowest level in the numpy functions instead. This is called vectorization. Random number generation, cumulative sums and incremental arrays are all fully vectorized in numpy. In a more general sense, it's usually not very effective to mix Python and numpy computational functions, including random.
Finally, you may want to consider a different convolution method. I would suggest something based on numpy.lib.stride_tricks.as_strided. This is a somewhat arcane, but very effective way to implement a sliding window with numpy arrays. I will show it here as an alternative to the convolution method you used, but feel free to ignore this part.
All in all:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
def movingaverage(values, window):
    # this step creates a view into the same buffer
    values = np.lib.stride_tricks.as_strided(values, shape=(window, values.size - window + 1), strides=values.strides * 2)
    # true division returns a float array even when the input is integer
    smas = values.sum(axis=0) / window
    return smas
sampleSize = 100
min = -10
max = 10
window = 5
v_x = np.arange(sampleSize)
# np.random.randint excludes the upper bound, hence max + 1
v_y = np.cumsum(np.random.randint(min, max + 1, sampleSize))
plt.plot(v_x, v_y)
plt.plot(v_x[window - 1:], movingaverage(v_y, window))
A note on names: in Python, variable and function names are conventionally name_with_underscore. CamelCase is reserved for class names. np.random.random_integers used inclusive bounds just like random.randint, but it is deprecated. Confusingly, np.random.randint has an exclusive upper bound, more like random.randrange, and it lets you specify the number of samples to generate.
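If as_strided feels too arcane, recent NumPy versions (1.20 and later) provide sliding_window_view, which builds the same kind of windowed view with bounds checking. The following is a sketch of that alternative, not part of the answer's original code; the function name moving_average and the seeded generator are choices made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def moving_average(values, window):
    # Each row of the view is one window; no data is copied to build it.
    windows = sliding_window_view(values, window)
    return windows.mean(axis=-1)

window = 5
rng = np.random.default_rng(0)
data = np.cumsum(rng.integers(-10, 11, size=100))  # upper bound is exclusive
smas = moving_average(data, window)
x_for_smas = np.arange(data.size)[window - 1:]     # x values aligned with the averages

print(smas.shape, x_for_smas.shape)
```

The result has the same length as the 'valid' convolution, so the same x-axis slice applies when plotting.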

Autocorrelation code in Python produces errors (guitar pitch detection)

This link provides code for an autocorrelation-based pitch detection algorithm. I am using it to detect pitches in simple guitar melodies.
In general, it produces very good results. For example, for the melody C4, C#4, D4, D#4, E4 it outputs:
Which correlates to the correct notes.
However, in some cases like this audio file (E4, F4, F#4, G4, G#4, A4, A#4, B4) it produces errors:
More specifically, there are three errors here: 13381Hz is wrongly detected instead of F4 (~350Hz) (weird error), and also 218Hz instead of A4 (440Hz) and 244Hz instead of B4 (~493Hz), which are octave errors.
I assume the two kinds of errors are caused by something different? Here is the code:
slices = segment_signal(y, sr)
for segment in slices:
    pitch = freq_from_autocorr(segment, sr)
    print(pitch)
def segment_signal(y, sr, onset_frames=None, offset=0.1):
    if onset_frames is None:
        onset_frames = remove_dense_onsets(librosa.onset.onset_detect(y=y, sr=sr))
    offset_samples = int(librosa.time_to_samples(offset, sr=sr))
    print(onset_frames)
    slices = np.array([y[i : i + offset_samples] for i
                       in librosa.frames_to_samples(onset_frames)])
    return slices
You can see the freq_from_autocorr function in the first link above.
The only thing that I have changed is this line:
corr = corr[len(corr)/2:]
Which I have replaced with:
corr = corr[int(len(corr)/2):]
I noticed that the smaller the offset I use (the shorter the signal segment I use to detect each pitch), the more high-frequency (10000+ Hz) errors I get.
Specifically, I noticed that what differs in those cases (10000+ Hz) is the computed i_peak value. In cases with no error it is in the range 50-150; in the error cases it is 3-5.
The autocorrelation function in the code snippet that you linked is not particularly robust. In order to get the correct result, it needs to locate the first peak on the left hand side of the autocorrelation curve. The method that the other developer used (calling the numpy.argmax() function) does not always find the correct value.
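To make the "first peak" idea concrete, here is a minimal sketch on a synthetic tone, using NumPy only. It uses one common heuristic (skip the initial decline from lag zero, then take the largest value after that point); the function name first_peak_pitch, the 440 Hz test tone, and the 22050 Hz sample rate are assumptions for illustration, and no sub-sample interpolation is done.

```python
import numpy as np

def first_peak_pitch(signal, fs):
    """Estimate pitch from the first autocorrelation peak after lag zero."""
    signal = signal - np.mean(signal)          # remove DC offset (on a copy)
    corr = np.correlate(signal, signal, mode='full')
    corr = corr[len(corr) // 2:]               # keep non-negative lags only
    # Skip the initial decline from the lag-0 maximum: first index where corr rises.
    rising = np.nonzero(np.diff(corr) > 0)[0]
    start = rising[0]
    i_peak = start + np.argmax(corr[start:])   # lag (in samples) of the strongest peak
    return fs / i_peak

fs = 22050
t = np.arange(int(0.1 * fs)) / fs              # 0.1 s window, as in the question
tone = np.sin(2 * np.pi * 440.0 * t)
estimate = first_peak_pitch(tone, fs)
print(estimate)
```

On a clean sine this lands within a few hertz of 440 Hz. On real guitar segments the largest value after the initial decline is not always the first period, which is one way octave errors can arise.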
I've implemented a slightly more robust version, using the peakutils package. I don't promise that it's perfectly robust either, but in any case it achieves a better result than the version of the freq_from_autocorr() function that you were previously using.
My example solution is listed below:
import librosa
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import fftconvolve
from pprint import pprint
import peakutils
def freq_from_autocorr(signal, fs):
    # Calculate autocorrelation (same thing as convolution, but with one input
    # reversed in time), and throw away the negative lags
    signal -= np.mean(signal)  # Remove DC offset
    corr = fftconvolve(signal, signal[::-1], mode='full')
    corr = corr[len(corr)//2:]
    # Find the first peak on the left
    i_peak = peakutils.indexes(corr, thres=0.8, min_dist=5)[0]
    i_interp = parabolic(corr, i_peak)[0]
    return fs / i_interp, corr, i_interp
def parabolic(f, x):
    """
    Quadratic interpolation for estimating the true position of an
    inter-sample maximum when nearby samples are known.
    f is a vector and x is an index for that vector.
    Returns (vx, vy), the coordinates of the vertex of a parabola that goes
    through point x and its two neighbors.
    Example:
    Defining a vector f with a local maximum at index 3 (= 6), find local
    maximum if points 2, 3, and 4 actually defined a parabola.
    In [3]: f = [2, 3, 1, 6, 4, 2, 3, 1]
    In [4]: parabolic(f, argmax(f))
    Out[4]: (3.2142857142857144, 6.1607142857142856)
    """
    xv = 1/2. * (f[x-1] - f[x+1]) / (f[x-1] - 2 * f[x] + f[x+1]) + x
    yv = f[x] - 1/4. * (f[x-1] - f[x+1]) * (xv - x)
    return (xv, yv)
# Time window after initial onset (in units of seconds)
window = 0.1
# Open the file and obtain the sampling rate
y, sr = librosa.core.load("./Vocaroo_s1A26VqpKgT0.mp3")
idx = np.arange(len(y))
# Set the window size in terms of number of samples
winsamp = int(window * sr)
# Calculate the onset frames in the usual way
onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
onstm = librosa.frames_to_time(onset_frames, sr=sr)
fqlist = [] # List of estimated frequencies, one per note
crlist = [] # List of autocorrelation arrays, one array per note
iplist = [] # List of interpolated peak indices, one per note
for tm in onstm:
    startidx = int(tm * sr)
    freq, corr, ip = freq_from_autocorr(y[startidx:startidx+winsamp], sr)
    fqlist.append(freq)
    crlist.append(corr)
    iplist.append(ip)
pprint(fqlist)
# Choose which notes to plot (it's set to show all 8 notes in this case)
plidx = [0, 1, 2, 3, 4, 5, 6, 7]
# Plot amplitude curves of all notes in the plidx list
fgwin = plt.figure(figsize=[8, 10])
fgwin.subplots_adjust(bottom=0.0, top=0.98, hspace=0.3)
axwin = []
ii = 1
for tm in onstm[plidx]:
    axwin.append(fgwin.add_subplot(len(plidx)+1, 1, ii))
    startidx = int(tm * sr)
    axwin[-1].plot(np.arange(startidx, startidx+winsamp), y[startidx:startidx+winsamp])
    ii += 1
axwin[-1].set_xlabel('Sample ID Number', fontsize=18)
# Plot autocorrelation function of all notes in the plidx list
fgcorr = plt.figure(figsize=[8,10])
fgcorr.subplots_adjust(bottom=0.0, top=0.98, hspace=0.3)
axcorr = []
ii = 1
for cr, ip in zip([crlist[ii] for ii in plidx], [iplist[ij] for ij in plidx]):
    # Share the x-axis of the first subplot with all later ones
    if ii == 1:
        shax = None
    else:
        shax = axcorr[0]
    axcorr.append(fgcorr.add_subplot(len(plidx)+1, 1, ii, sharex=shax))
    axcorr[-1].plot(np.arange(500), cr[0:500])
    # Plot the location of the leftmost peak
    axcorr[-1].axvline(ip, color='r')
    ii += 1
axcorr[-1].set_xlabel('Time Lag Index (Zoomed)', fontsize=18)
The printed output looks like:
In [1]: %run
The first figure produced by my code sample depicts the amplitude curves for the next 0.1 seconds following each detected onset time:
The second figure produced by the code shows the autocorrelation curves, as computed inside of the freq_from_autocorr() function. The vertical red lines depict the location of the first peak on the left for each curve, as estimated by the peakutils package. The method used by the other developer was getting incorrect results for some of these red lines; that's why his version of that function was occasionally returning the wrong frequencies.
My suggestion would be to test the revised version of the freq_from_autocorr() function on other recordings, see if you can find more challenging examples where even the improved version still gives incorrect results, and then get creative and try to develop an even more robust peak finding algorithm that never, ever mis-fires.
The autocorrelation method is not always right. You may want to implement a more sophisticated method like YIN:
or MPM:
Both of the above papers are good reads.

How to split dataframe according to intersection point in Python?

I am working on a project that aims to show the difference between good form and bad form of an exercise. To do this we collected acceleration data with a wrist-based accelerometer. The image above shows 2 sets of a fitness exercise (bench press). Each set has 10 repetitions. The image below shows the 10 repetitions of 1 set. I have a raw data set which consists of 10 sets of an exercise. What I want to do is split the raw data into 10 parts, each containing the part between 2 black lines in the image above, so I can analyze the data easily. My supervisor gave me a starting point, which is choosing a cutpoint in each set. He said to take a cutpoint, find the first interruption time, start cutting 3 seconds before that time, count to 10, and finish cutting.
This is an idea that I don't know how to apply. At least, if you can tell me how to cut a dataframe according to a cutpoint, I would be grateful.
Well, I found another way to detect the periodic parts of my accelerometer data. So here is my code:
import numpy as np
import pandas as pd
from peakdetect import peakdetect
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib import style
def get_periodic(path):
    periodics = []
    # DataFrame.from_csv was removed from pandas; read_csv with these arguments matches its defaults
    data_frame = pd.read_csv(path, index_col=0, parse_dates=True)
    data_frame.columns = ['z', 'y', 'x']
    if '1' in path:
        if 'bench' in path:
            bench_press_1_week = data_frame.between_time('11:24', '11:52')
            peak_indexes = get_peaks(bench_press_1_week.y, lookahead=3000)
            for i in range(0, len(peak_indexes)):
                time_indexes = bench_press_1_week.index.tolist()
                start_time = time_indexes[0]
                # peak index is in samples; divide by the 100 Hz sampling rate to get seconds
                periodic_start = start_time.to_pydatetime() + dt.timedelta(0, peak_indexes[i] / 100)
                periodic_end = periodic_start + dt.timedelta(0, 60)
                periodic = bench_press_1_week.between_time(periodic_start.time(), periodic_end.time())
                periodics.append(periodic)
    return periodics
def get_peaks(data, lookahead):
    peak_indexes = []
    correlation = np.correlate(data, data, mode='full')
    # keep the non-negative lags; integer division is needed for slicing
    realcorr = correlation[correlation.size // 2:]
    maxpeaks, minpeaks = peakdetect(realcorr, lookahead=lookahead)
    for i in range(0, len(maxpeaks)):
        peak_indexes.append(maxpeaks[i][0])
    return peak_indexes
def show_segment_plot(data, periodic_area, exercise_name):
    gs = gridspec.GridSpec(7, 2)
    ax = plt.subplot(gs[:2, :])
    k = 0
    for i in range(2, 7):
        for j in range(0, 2):
            ax = plt.subplot(gs[i, j])
            title = "{} {}".format(k + 1, ".Set")
            k = k + 1
Firstly, this question gave me another perspective on my problem. The image below shows the raw accelerometer data of a bench press with 10 sets. It has 3 axes (x, y, z) and its major axis is y (blue in the image).
I used the autocorrelation function to detect the periodic parts. In the image above, every peak represents 1 set of exercises. With this peak detection algorithm I found each peak's x-axis value:
In[196]: maxpeaks
[[16204, 32910.14013671875],
[32281, 28726.95849609375],
[48515, 24583.898681640625],
[64436, 22088.130859375],
[80335, 19582.248291015625],
[96699, 16436.567626953125],
[113081, 12100.027587890625],
[129027, 8098.98486328125],
[145184, 5387.788818359375]]
Basically, each x-value is a sample index. My sampling frequency was 100 Hz, so 16204/100 = 162.04 seconds. To find the start time of each periodic part, I added 162.04 s to the start time. Each bench press set took approximately 1 minute; in this example the exercise started at 11:24, so the first periodic part starts at about 11:26 and ends 1 minute later. There is some lag, but this is the best solution I have found.
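The sample-index-to-time arithmetic described above can be sketched as follows. The synthetic frame, its calendar date, and the variable names are assumptions for illustration; only the 100 Hz rate, the 11:24 start, the peak index 16204, and the 60 s duration come from the text.

```python
import numpy as np
import pandas as pd

fs_hz = 100  # sampling frequency stated in the text
# Hypothetical recording: 5 minutes of single-axis data starting at 11:24.
index = pd.date_range("2024-01-01 11:24:00", periods=5 * 60 * fs_hz, freq="10ms")
frame = pd.DataFrame({"y": np.zeros(len(index))}, index=index)

peak_sample = 16204  # first autocorrelation peak from the output above
# Convert the sample index to an offset in whole milliseconds to avoid float round-off.
offset = pd.Timedelta(milliseconds=peak_sample * 1000 // fs_hz)
segment_start = frame.index[0] + offset
segment_end = segment_start + pd.Timedelta(seconds=60)

# Label-based slicing on a DatetimeIndex keeps both endpoints.
segment = frame.loc[segment_start:segment_end]
print(segment_start.time(), segment_end.time(), len(segment))
```

Slicing with .loc on full timestamps, rather than between_time on times of day, keeps working if a recording crosses midnight.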

properly not getting frequencies in numpy.fft

I have a signal in the frequency domain. I took numpy.fft.ifft of the signal and got a time-domain signal. When I then take the fft of that time-domain signal, I do not get both the negative and positive frequencies (Plot 3 in the figure).
import numpy as np
time = np.arange(0, 10, .01)
N = len(time)
signal_td = np.cos(2.0*np.pi*2.0*time)
signal_fd = np.fft.fft(signal_td)
signal_fd2 = signal_fd[0:N//2]  # integer division: slice indices must be ints
inv_td2 = np.fft.ifft(signal_fd2)
fd2 = np.fft.fft(inv_td2)
General comment: I avoid using time as a variable name because IPython loads it as a "magic" command.
Something I find at times confusing about matplotlib is that when you plot a complex array, it actually plots the real part. In the code snippet:
tt = np.arange(0, 10, .01)
N = len(tt)
signal_td = np.cos(2.0*np.pi*2.0*tt)
signal_fd = np.fft.fft(signal_td)
signal_fd2 = signal_fd[0:N//2]
inv_td2 = np.fft.ifft(signal_fd2)
fd2 = np.fft.fft(inv_td2)
The following arrays have dtype of float64: tt and signal_td. The others are complex128. The reason you only see one peak in fd2 is because it is a transform of exp(4j*np.pi*tt) rather than cos(4*np.pi*tt).
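A short check of that explanation, under the assumption of exactly 1000 samples at a 0.01 s spacing (the threshold of half the maximum magnitude is an arbitrary choice for picking out peaks):

```python
import numpy as np

n = 1000
dt_s = 0.01
tt = np.arange(n) * dt_s
signal_td = np.cos(2.0 * np.pi * 2.0 * tt)

signal_fd = np.fft.fft(signal_td)
freqs = np.fft.fftfreq(n, d=dt_s)
# A real cosine has one peak at +2 Hz and one at -2 Hz.
strong = np.abs(signal_fd) > 0.5 * np.abs(signal_fd).max()
peaks_full = np.sort(freqs[strong])

# Keeping only the first half of the spectrum discards the negative-frequency peak.
signal_fd2 = signal_fd[: n // 2]
inv_td2 = np.fft.ifft(signal_fd2)   # complex exponential, no longer a real cosine
fd2 = np.fft.fft(inv_td2)
n_peaks_half = np.count_nonzero(np.abs(fd2) > 0.5 * np.abs(fd2).max())

print(peaks_full, n_peaks_half, np.abs(inv_td2.imag).max())
```

Plotting inv_td2 directly would show only its real part, which still looks like a cosine even though the array carries a non-zero imaginary part.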

