bin one column and sum the other of (2,N) array - python

Question:
I have a dataset like the following:
import numpy as np
x = np.arange(0,10000,0.5)
y = np.arange(x.size)/x.size
Plotting in log-log space, it looks like this:
import matplotlib.pyplot as plt
plt.loglog(x, y)
plt.show()
Obviously there is a lot of redundant information in this log-log plot.
I don't need 10000 points to represent this trend.
My question is this: how can I bin this data so that it displays a uniform number of points in each order of magnitude of the logarithmic scale? At each order of magnitude I'd like to get about ten points. Hence I need to bin 'x' with an exponentially growing bin size, and then take the average of all elements of y corresponding to each bin.
Attempt:
First I generate the bins I want to use for x.
# need a nicer way to do this.
# what if I want more than 10 bins per order of magnitude?
bins = 10**np.arange(1,int(round(np.log10(x.max()))))
bins = np.unique((bins.reshape(-1,1)*np.arange(0,11)).flatten())
#array([ 0, 10, 20, 30, 40, 50, 60, 70, 80,
# 90, 100, 200, 300, 400, 500, 600, 700, 800,
# 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000,
# 9000, 10000])
Second, I find the index of the bin to which each element of x corresponds:
digits = np.digitize(x, bins)
Now the part I can really use help with. I want to take the average of every element in y corresponding to each bin, and then plot these averages versus the bin midpoints:
# need a nicer way to do this.. is there an np.searchsorted() solution?
# this way is quick and dirty, but it does not scale with acceptable speed
averages = []
for d in np.unique(digits):
    mask = digits==d
    y_mean = np.mean(y[mask])
    averages.append(y_mean)
del mask, y_mean, d
# now plot the averages within each bin against the center of each bin
plt.loglog((bins[1:]+bins[:-1])/2.0, averages)
plt.show()
Summary:
Is there a smoother way to do this? How can I generate an arbitrary n points per order of magnitude instead of 10?

I will answer two of your questions: how to create the bins more cleanly, and how to generate an arbitrary n points per order of magnitude instead of 10.
You can make use of np.logspace and np.outer to create your bins for any arbitrary n, as follows. The default base in logspace is 10; it generates logarithmically spaced points, much as linspace generates a linearly spaced mesh.
For n=10
n = 10
bins = np.unique(np.outer(np.logspace(0, 3, 4), np.arange(0, n+1)))
# array([0.e+00, 1.e+00, 2.e+00, 3.e+00, 4.e+00, 5.e+00, 6.e+00, 7.e+00,
# 8.e+00, 9.e+00, 1.e+01, 2.e+01, 3.e+01, 4.e+01, 5.e+01, 6.e+01,
# 7.e+01, 8.e+01, 9.e+01, 1.e+02, 2.e+02, 3.e+02, 4.e+02, 5.e+02,
# 6.e+02, 7.e+02, 8.e+02, 9.e+02, 1.e+03, 2.e+03, 3.e+03, 4.e+03,
# 5.e+03, 6.e+03, 7.e+03, 8.e+03, 9.e+03, 1.e+04])
For n=20
n = 20
bins = np.unique(np.outer(np.logspace(0, 3, 4), np.arange(0, n+1)))
# array([0.0e+00, 1.0e+00, 2.0e+00, 3.0e+00, 4.0e+00, 5.0e+00, 6.0e+00, 7.0e+00, 8.0e+00, 9.0e+00, 1.0e+01, 1.1e+01, 1.2e+01, 1.3e+01, 1.4e+01, 1.5e+01, 1.6e+01, 1.7e+01, 1.8e+01, 1.9e+01, 2.0e+01, 3.0e+01, 4.0e+01, 5.0e+01, 6.0e+01, 7.0e+01, 8.0e+01, 9.0e+01, 1.0e+02, 1.1e+02, 1.2e+02, 1.3e+02, 1.4e+02, 1.5e+02, 1.6e+02, 1.7e+02, 1.8e+02, 1.9e+02, 2.0e+02, 3.0e+02, 4.0e+02, 5.0e+02, 6.0e+02, 7.0e+02, 8.0e+02, 9.0e+02, 1.0e+03, 1.1e+03, 1.2e+03, 1.3e+03, 1.4e+03, 1.5e+03, 1.6e+03, 1.7e+03, 1.8e+03, 1.9e+03, 2.0e+03, 3.0e+03, 4.0e+03, 5.0e+03, 6.0e+03, 7.0e+03, 8.0e+03, 9.0e+03, 1.0e+04, 1.1e+04, 1.2e+04, 1.3e+04, 1.4e+04, 1.5e+04, 1.6e+04, 1.7e+04, 1.8e+04, 1.9e+04, 2.0e+04])
EDIT
If you want 0, 10, 20, 30...90, 100, 200, 300... you can do the following
n = 10
bins = np.unique(np.outer(np.logspace(1, 3, 3), np.arange(0, n+1)))
# array([ 0., 10., 20., 30., 40., 50., 60., 70.,
# 80., 90., 100., 200., 300., 400., 500., 600.,
# 700., 800., 900., 1000., 2000., 3000., 4000., 5000.,
# 6000., 7000., 8000., 9000., 10000.])
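For the averaging step itself, which the answer above leaves open, one option (a sketch, not from the original answer) is scipy.stats.binned_statistic, which computes the per-bin mean of y in a single vectorized call. It reuses the question's x and y and the bins built above; empty bins come back as NaN and are masked out before plotting:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic

x = np.arange(0, 10000, 0.5)
y = np.arange(x.size) / x.size
n = 10
bins = np.unique(np.outer(np.logspace(0, 3, 4), np.arange(0, n + 1)))

# mean of y within each x-bin; empty bins yield NaN
averages, edges, _ = binned_statistic(x, y, statistic='mean', bins=bins)
centers = 0.5 * (edges[1:] + edges[:-1])
valid = ~np.isnan(averages)

plt.loglog(centers[valid], averages[valid])
plt.show()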

Related

Why is the sum of the absolute values of np.sin(x) and np.cos(x) from 0 to 2*pi not the same?

I am trying to compute the sum of the absolute values of these two expressions, and I am somehow confused, since the sum should be the same, right?
Consider the integrals:
∫ |sin(x)| dx from 0 to 2*pi
∫ |cos(x)| dx from 0 to 2*pi
It's easy to see that the area under the two curves is the same, and indeed both integrals evaluate to 4.
I wrote a simple script to evenly sample those functions and add all the sampled values together. Scaling aside, both expressions should yield the same result, but they don't. Here is the code:
import numpy as np
angles = np.linspace(0, 2*np.pi, 1000)
line1, line2= [], []
for n in angles:
    line1.append(np.abs(np.cos(n)))
    line2.append(np.abs(np.sin(n)))
print(sum(line1), sum(line2))
The result is the following:
636.983414656738 635.9826284722284
The sums are off by almost exactly 1. I know they are not 4, because there is some constant factor missing, but the point is that the values should be the same. Am I completely missing something here, or is this a bug?
Consider the extreme case of reducing the samples to 3:
In [93]: angles = np.linspace(0, 2*np.pi, 3)
In [94]: angles
Out[94]: array([0. , 3.14159265, 6.28318531])
In [95]: np.cos(angles)
Out[95]: array([ 1., -1., 1.])
In [96]: np.sin(angles)
Out[96]: array([ 0.0000000e+00, 1.2246468e-16, -2.4492936e-16])
The extra 1 for cos persists for larger samples.
In [97]: angles = np.linspace(0, 2*np.pi, 1001)
In [98]: np.sum(np.abs(np.cos(angles)))
Out[98]: 637.6176779711009
In [99]: np.sum(np.abs(np.sin(angles)))
Out[99]: 636.6176779711009
But if we tell linspace to skip the 2*np.pi end point (which samples the same angle as 0, where |cos| is 1 but |sin| is 0), the values match:
In [100]: angles = np.linspace(0, 2*np.pi, 1001, endpoint=False)
In [101]: np.sum(np.abs(np.cos(angles)))
Out[101]: 637.256653677874
In [102]: np.sum(np.abs(np.sin(angles)))
Out[102]: 637.2558690641631
Because your integration method, a simple sum of samples, has systematic errors that cannot be ignored in this case.
Now try this:
import numpy as np
angles = np.linspace(0, 2*np.pi, 1000)
line1, line2 = [], []
for n in angles:
    line1.append(np.abs(np.cos(n)))
    line2.append(np.abs(np.sin(n)))
i1 = np.trapz(line1, angles)
i2 = np.trapz(line2, angles)
print(i1, i2, abs(2*(i1-i2)/(i1+i2)))
The result is:
4.000001648229352 3.9999967035417043 1.2361721666315376e-06
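As a side note (not part of the original answers), the sampling loop can be written without a Python loop at all, since np.cos and np.sin accept arrays directly; a small sketch:
import numpy as np

angles = np.linspace(0, 2*np.pi, 1000)
line1 = np.abs(np.cos(angles))   # |cos| evaluated on the whole grid at once
line2 = np.abs(np.sin(angles))   # |sin| evaluated on the whole grid at once

i1 = np.trapz(line1, angles)     # both integrals come out close to 4
i2 = np.trapz(line2, angles)
print(i1, i2)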

how to insert a sequence with slight modification in a numpy array?

I have a fixed list and a fixed size numpy array.
data = [10,20,30]
arr = np.zeros(9)
I want to insert the data into the NumPy array arr with slight modifications, so that the expected output looks like this:
arr = [10, 20, 30, 9, 22, 32, 13, 16, 28]
The repeated values may differ from the originals by anything in the range (-5, 5).
My attempt was:
import numpy as np
import random
data = [10,20,30]
dataArr = np.zeros(9)
for i in range(9):
    for j in data:
        dataArr[3*i:3*(i+1)] = random.randint(int(j - 5), int(j + 5))
dataArr
but it gives me this output:
array([34., 34., 34., 33., 33., 33., 35., 35., 35.])
can somebody please help?
When you do
dataArr[3*i:3*(i+1)] = value
you are setting the entire slice to value. In fact, once i >= 3 your indices even go beyond the range of dataArr; NumPy doesn't raise an exception for an out-of-range slice, it simply assigns to an empty view.
See for yourself after execution:
print(dataArr[3*i:3*(i+1)])
# output:
# []
Instead, iterate over the values of data and fill in the corresponding entries of dataArr like this:
nbOfRepetitions = 3
dataArr = np.zeros(len(data)*nbOfRepetitions)
for i in range(len(data)):
    dataArr[i] = data[i]
    for j in range(1, nbOfRepetitions):
        dataArr[i+len(data)*j] = random.randint(int(data[i] - 5), int(data[i] + 5))
Where nbOfRepetitions is the number of time you want to have data in dataArr (this includes the non-modified copy at the start).
This gives the expected result.
array([10., 20., 30., 12., 18., 30., 10., 24., 28.])
Edited to generalize for different sizes for data and dataArr.
If you want to skip the first N items that are already in your data array, you can change your inner loop and access indices to something like:
import random
import numpy as np
MAX_LENGTH = 9
data = [10,20,30]
dataArr = np.zeros(MAX_LENGTH)
for i in range(MAX_LENGTH):
    for j in range(len(data),MAX_LENGTH):
        dataArr[j] = random.randint((min(data) - 5), (max(data) + 5))
dataArr[0:len(data)] = data
# output:
array([10., 20., 30., 28., 31., 17., 30., 30., 10.])
Apart from the above solutions, here is one more way:
import numpy as np
(
    (np.random.randint(10, size=(3, 3)) - 5) +
    np.array([10, 20, 30])
).reshape(-1)
First we get a 3x3 matrix of random integers from 0 to 9 and subtract 5, giving offsets from -5 to 4 (use np.random.randint(11, ...) if you want the offsets to reach +5).
When adding, np.array([10, 20, 30]) is broadcast along axis 0, as if it were np.array([10, 20, 30])[np.newaxis, :], and the result is flattened at the end with reshape(-1).
You need to set up your array to use integer types:
dataArr = np.zeros(9, dtype=int)
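A vectorized sketch combining the ideas above, if you want the first copy of data to stay exactly unmodified (as in the expected output); note that np.random.randint's upper bound is exclusive, hence randint(-5, 6) for offsets up to ±5:
import numpy as np

data = np.array([10, 20, 30])
n_repeats = 3                                  # total copies, including the exact one

offsets = np.random.randint(-5, 6, size=(n_repeats, data.size))
offsets[0] = 0                                 # first copy stays exactly equal to data
arr = (data + offsets).ravel()
# first three entries are exactly 10, 20, 30; the rest differ by at most 5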

sampling from a clipped normal distribution

How do I sample from a normal distribution that is clipped?
I want to sample from N(0, 1), but I want the values to lie in [-1, +1]. I cannot apply np.clip as that would increase the probability mass at -1 and +1. I can do stochastic clipping, but then there's no guarantee that the resampled values won't fall outside the range again.
# standard
s = np.random.normal(0, 1, [10, 10])
s = np.clip(s, -1, 1)
# stochastic
for j in range(10):
    edge1 = np.where(s[j] >= 1.)[0]
    edge2 = np.where(s[j] <= -1)[0]
    if edge1.shape[0] > 0:
        rand_el1 = np.random.normal(0, 1, size=(1, edge1.shape[0]))
        s[j, edge1] = rand_el1
    if edge2.shape[0] > 0:
        rand_el2 = np.random.normal(0, 1, size=(1, edge2.shape[0]))
        s[j, edge2] = rand_el2
The scipy library implements the truncated normal distribution as scipy.stats.truncnorm. In your case, you can use sample = truncnorm.rvs(-1, 1, size=sample_size).
For example,
In [55]: import matplotlib.pyplot as plt
In [56]: from scipy.stats import truncnorm, norm
Sample 100000 points from the normal distribution truncated to [-1, 1]:
In [57]: sample = truncnorm.rvs(-1, 1, size=100000)
Make a histogram, and plot the theoretical PDF curve. The PDF can be computed with truncnorm.pdf, or with a scaled version of norm.pdf.
In [58]: _ = plt.hist(sample, bins=51, normed=True, facecolor='g', edgecolor='k', alpha=0.4)
In [59]: x = np.linspace(-1, 1, 101)
In [60]: plt.plot(x, truncnorm.pdf(x, -1, 1), 'k', alpha=0.4, linewidth=5)
Out[60]: [<matplotlib.lines.Line2D at 0x11f78c160>]
In [61]: plt.plot(x, norm.pdf(x)/(norm.cdf(1) - norm.cdf(-1)), 'k--', linewidth=1)
Out[61]: [<matplotlib.lines.Line2D at 0x11f779f60>]
Here's the plot:
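One detail worth noting, since it often trips people up: truncnorm's a and b are expressed in standard deviations of the underlying untruncated normal. For the standard normal clipped to [-1, 1] they are simply -1 and 1, but for any other mean or scale you have to standardize the bounds yourself. A sketch with hypothetical loc and scale values:
from scipy.stats import truncnorm

loc, scale = 0.5, 2.0                     # example mean and standard deviation
lo, hi = -1.0, 1.0                        # desired clipping interval
a, b = (lo - loc) / scale, (hi - loc) / scale
sample = truncnorm.rvs(a, b, loc=loc, scale=scale, size=100000)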
I believe that the simplest (though perhaps not the most efficient) way to do this is basic rejection sampling: simulate values from N(0, 1), reject those that fall outside the wanted bounds, and keep the others until they stack up to the wanted number of samples.
kept = []
while len(kept) < 1000:
    s = np.random.normal()
    if -1 <= s <= 1:
        kept.append(s)
Here I stack things in a basic list; feel free to use an np.array instead and replace the length condition with one based on the array's dimensions.
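A vectorized variant of the same rejection idea (just a sketch, assuming the 10x10 target shape from the question): draw batches, keep only the in-range values, and repeat until enough have accumulated:
import numpy as np

target = 100                                   # 10 x 10 samples in total
kept = np.empty(0)
while kept.size < target:
    batch = np.random.normal(0, 1, size=target)
    kept = np.concatenate([kept, batch[np.abs(batch) <= 1]])
s = kept[:target].reshape(10, 10)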
Do the stochastic clipping iteratively until you no longer need it. This basically means turning your ifs into a while. You can also take this opportunity to simplify the out-of-bounds condition into a single check rather than a separate check on each side:
s = np.random.normal(0, 1, (10, 10))
while True:
    out_of_bounds = np.abs(s) > 1
    count = np.count_nonzero(out_of_bounds)
    if count:
        s[out_of_bounds] = np.random.normal(0, 1, count)
    else:
        break

Extract histogram modes by detecting the local maxima of a vector with NumPy/SciPy

Is there a way with NumPy/SciPy to keep only the histogram modes when extracting the local maxima (shown as blue dots on the image below)?:
These maxima were extracted using scipy.signal.argrelmax, but I only need to get the two modes values and ignore the rest of the maxima detected:
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins = np.histogram(data, 100, normed=True)
# trim data
x = np.linspace(np.min(data), np.max(data), num=100)
# find index of minimum between two modes
ind_max = argrelmax(n)
x_max = x[ind_max]
y_max = n[ind_max]
# plot
plt.hist(data, bins=100, normed=True, color='y')
plt.scatter(x_max, y_max, color='b')
plt.show()
Note:
I've managed to use this Smoothing filter to get a curve that matches the histogram (but I don't have the equation of the curve).
This can be achieved by appropriately adjusting the order parameter of argrelmax, i.e. by adjusting how many points on each side are considered when detecting local maxima.
I've used this code to create mock data (you can play around with the values of the different variables to figure out their effect on the generated histogram):
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema, argrelmax
m = 50
s = 10
samples = 50000
peak_start = 30
peak_width = 10
peak_gain = 4
np.random.seed(3)
data = np.random.normal(loc=m, scale=s, size=samples)
bell, edges = np.histogram(data, bins=np.arange(2*(m + 1) - .5), normed=True)
x = np.int_(.5*(edges[:-1] + edges[1:]))
bell[peak_start + np.arange(peak_width)] *= np.linspace(1, peak_gain, peak_width)
plt.bar(x, bell)
plt.show()
As shown below, it is important to carefully select the value of order. Indeed, if order is too small you are likely to detect noisy local maxima, whereas if order is too large you might fail to detect some of the modes.
In [185]: argrelmax(bell, order=1)
Out[185]: (array([ 3, 5, 7, 12, 14, 39, 47, 51, 86, 90], dtype=int64),)
In [186]: argrelmax(bell, order=2)
Out[186]: (array([14, 39, 47, 51, 90], dtype=int64),)
In [187]: argrelmax(bell, order=3)
Out[187]: (array([39, 47, 51], dtype=int64),)
In [188]: argrelmax(bell, order=4)
Out[188]: (array([39, 51], dtype=int64),)
In [189]: argrelmax(bell, order=5)
Out[189]: (array([39, 51], dtype=int64),)
In [190]: argrelmax(bell, order=11)
Out[190]: (array([39, 51], dtype=int64),)
In [191]: argrelmax(bell, order=12)
Out[191]: (array([39], dtype=int64),)
These results are strongly dependent on the shape of the histogram (if you change just one of the parameters used to generate data the range of valid values for order may vary). To make mode detection more robust, I would recommend you to pass a smoothed histogram to argrelmax rather than the original histogram.
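A sketch of that last suggestion, smoothing the histogram with scipy.ndimage.gaussian_filter1d before calling argrelmax (reusing bell and x from the code above; the sigma value is just a starting guess to be tuned):
from scipy.ndimage import gaussian_filter1d
from scipy.signal import argrelmax

smoothed = gaussian_filter1d(bell, sigma=2)    # smooth the histogram counts
modes = argrelmax(smoothed, order=3)[0]        # local maxima of the smoothed curve
print(x[modes])                                # positions of the detected modes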
I guess you want to find the second largest value in y_max. I hope this example helps:
np.random.seed(4) # for reproducibility
data = np.zeros(0)
for i in xrange(10):
    data = np.hstack(( data, np.random.normal(i, 0.25, 100*i) ))
# data histogram
n, bins = np.histogram(data, 100, normed=True)
# trim data
x = np.linspace(np.min(data), np.max(data), num=100)
# find index of minimum between two modes
ind_max = argrelmax(n)
x_max = x[ind_max]
y_max = n[ind_max]
# find first and second max values in y_max
index_first_max = np.argmax(y_max)
maximum_y = y_max[index_first_max]
second_max_y = max(n for n in y_max if n!=maximum_y)
index_second_max = np.where(y_max == second_max_y)
# plot
plt.hist(data, bins=100, normed=True, color='y')
plt.scatter(x_max, y_max, color='b')
plt.scatter(x_max[index_first_max], y_max[index_first_max], color='r')
plt.scatter(x_max[index_second_max], y_max[index_second_max], color='g')
plt.show()
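Alternatively (a small sketch, not part of the original answer), the two tallest local maxima can be picked directly with np.argsort instead of comparing values:
import numpy as np

order_desc = np.argsort(y_max)[::-1]   # local maxima sorted from tallest to smallest
top_two = order_desc[:2]               # indices (into x_max / y_max) of the two modes
print(x_max[top_two], y_max[top_two])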

Python plot frequency of fft.rfft

this is my first question here on stackoverflow and I hope I will not make huge mistakes.
I am analyzing a set of time series with a sampling rate of 1 Hz. I need to plot their Fourier transform in order to study their spectra.
Here it is my piece of code:
from obspy.core import read
import numpy as np
import matplotlib.pyplot as plt
st = read('../SC_noise/*HEC_109C*_s', format='SAC')
stp = st.copy()
stp.detrend('linear')
stp.taper('cosine')
for tr in stp:
    dataonly = tr.data
    spec = np.fft.rfft(dataonly)
    plt.plot(abs(spec))
plt.show()
This works just fine: the plot is the same as I get using SAC. But the x-axis does not show frequencies. I've looked around a bit and found different ideas, but none of them is working.
For example, in the case of an fft (here I am using an rfft) this should do the job:
samp_rate=1
freq = np.fft.fftfreq(len(spec), d=1./samp_rate)
But if I use it, it gives me negative frequencies.
Does anybody have an idea?
Thank you very much in advance for all the help!
Piero
If your NumPy version is new enough (1.8 or better), use numpy.fft.rfftfreq. Otherwise, here is the definition:
import numpy as np

def rfftfreq(n, d=1.0):
    """
    Return the Discrete Fourier Transform sample frequencies
    (for usage with rfft, irfft).

    The returned float array `f` contains the frequency bin centers in cycles
    per unit of the sample spacing (with zero at the start). For instance, if
    the sample spacing is in seconds, then the frequency unit is cycles/second.

    Given a window length `n` and a sample spacing `d`::

        f = [0, 1, ..., n/2-1, n/2] / (d*n)         if n is even
        f = [0, 1, ..., (n-1)/2-1, (n-1)/2] / (d*n) if n is odd

    Unlike `fftfreq` (but like `scipy.fftpack.rfftfreq`)
    the Nyquist frequency component is considered to be positive.

    Parameters
    ----------
    n : int
        Window length.
    d : scalar, optional
        Sample spacing (inverse of the sampling rate). Defaults to 1.

    Returns
    -------
    f : ndarray
        Array of length ``n//2 + 1`` containing the sample frequencies.

    Examples
    --------
    >>> signal = np.array([-2, 8, 6, 4, 1, 0, 3, 5, -3, 4], dtype=float)
    >>> fourier = np.fft.rfft(signal)
    >>> n = signal.size
    >>> sample_rate = 100
    >>> freq = np.fft.fftfreq(n, d=1./sample_rate)
    >>> freq
    array([  0.,  10.,  20.,  30.,  40., -50., -40., -30., -20., -10.])
    >>> freq = np.fft.rfftfreq(n, d=1./sample_rate)
    >>> freq
    array([  0.,  10.,  20.,  30.,  40.,  50.])
    """
    if not isinstance(n, (int, np.integer)):
        raise ValueError("n should be an integer")
    val = 1.0 / (n * d)
    N = n // 2 + 1
    results = np.arange(0, N, dtype=int)
    return results * val
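With that in place, the frequency axis for the question's setup (1 Hz sampling rate) can be built like this; a sketch reusing dataonly and spec from the question's loop:
samp_rate = 1.0
freq = np.fft.rfftfreq(len(dataonly), d=1./samp_rate)  # one frequency per rfft bin
plt.plot(freq, abs(spec))
plt.xlabel('Frequency [Hz]')
plt.show()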
