How to coarsen ordered 1D data into irregular bins with Python

I have a high-frequency, ordered 1D data set relating to observations of a property with respect to depth, consisting of a continuous float-valued observation versus a monotonically increasing depth.
I'd like to find a way to coarsen this data set into a user-defined number of contiguous bins (or zones), each of which is described by a single mean value and a lower depth limit (the top depth limit being defined by the end of the zone above it). The criterion for splitting the zones should be k-means-like: within the bounds of the number of zones specified, there should be minimum property variance within each zone and maximum variation between adjacent zones.
As an example, if I had a small high-frequency dataset as follows:
depth = [2920.530612, 2920.653061, 2920.734694, 2920.857143, 2920.938776, 2921.102041, 2921.22449, 2921.346939, 2921.469388, 2921.510204, 2921.55, 2921.632653, 2921.795918, 2922, 2922.081633, 2922.122449, 2922.244898, 2922.326531, 2922.489796, 2922.612245, 2922.857143, 2922.979592, 2923.020408, 2923.142857, 2923.265306]
value = [0.0098299, 0.009827939, 0.009826632, 1004.042327, 3696.000306, 3943.831644, 3038.254723, 3693.543377, 3692.806616, 50.04989348, 15.0127, 2665.2111, 3690.842641, 3238.749497, 429.4979635, 18.81228993, 1800.889643, 2662.199897, 3454.082382, 3934.140146, 3030.184014, 0.556587319, 8.593768956, 11.90163067, 26.01012696]
and I were to request a split into 7 zones, it would return something like the following:
depth_7zone =[2920.530612, 2920.857143, 2920.857143, 2921.510204, 2921.510204, 2921.632653, 2921.632653, 2922.081633, 2922.081633, 2922.244898, 2922.244898, 2922.979592, 2922.979592, 2923.265306]
value_7zone = [0.009828157, 0.009828157, 3178.079832, 3178.079832, 32.53129674, 32.53129674, 3198.267746, 3198.267746, 224.1551267, 224.1551267, 2976.299216, 2976.299216, 11.76552848, 11.76552848]
which can be visualized as follows (blue = original data, red = data split into 7 zones):
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
plt.plot(value, depth, '-o')
plt.plot(value_7zone, depth_7zone, '-', color='red')
plt.gca().invert_yaxis()
plt.xlabel('Values')
plt.ylabel('Depth')
plt.show()
I've tried standard k-means clustering, and it doesn't appear suited to this ordered 1D problem. I also thought of methods used in digital signal processing, but everything I could find discretizes into constant bin sizes; image-compression techniques seem like overkill and typically expect 2D data.
Can anyone suggest an avenue to explore further? (I'm fairly new to Python, so apologies in advance.)
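One avenue worth exploring: because the data are ordered, the optimal contiguous split can be computed exactly with dynamic programming, the 1D, order-preserving analogue of k-means (sometimes called Fisher's method or Jenks natural breaks). Below is a minimal sketch, assuming that minimising the within-zone sum of squared deviations is an acceptable criterion; segment_1d is just an illustrative name, and the simple O(zones x n^2) loops are fine at this data size. (The ruptures package implements the same idea off the shelf, e.g. rpt.Dynp, if an extra dependency is acceptable.)
import numpy as np

def segment_1d(values, n_zones):
    """Exact DP split of an ordered sequence into n_zones contiguous zones,
    minimising the total within-zone sum of squared deviations."""
    v = np.asarray(values, dtype=float)
    n = len(v)
    # prefix sums give the SSE of any slice v[i:j] in O(1)
    csum = np.concatenate(([0.0], np.cumsum(v)))
    csum2 = np.concatenate(([0.0], np.cumsum(v * v)))
    def sse(i, j):
        s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
        return s2 - s * s / m
    # cost[k, j] = best cost of splitting v[:j] into k zones
    cost = np.full((n_zones + 1, n + 1), np.inf)
    split = np.zeros((n_zones + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, n_zones + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1, i] + sse(i, j)
                if c < cost[k, j]:
                    cost[k, j], split[k, j] = c, i
    # walk back through the split table to recover the zone boundaries
    bounds, j = [n], n
    for k in range(n_zones, 0, -1):
        j = split[k, j]
        bounds.append(j)
    return bounds[::-1]   # zone k covers v[bounds[k]:bounds[k + 1]]

# usage with the depth/value lists above
bounds = segment_1d(value, 7)
for lo, hi in zip(bounds[:-1], bounds[1:]):
    print(depth[lo], depth[hi - 1], np.mean(value[lo:hi]))
Each printed row is a zone's top depth, the depth of its last sample, and its mean value, which can then be reshaped into the paired depth_7zone/value_7zone arrays for the step plot.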

Related

Calculating total depth of a variable

I have calculated the Moist Brunt-Vaisala frequency.
Let's say that the variable is moistb and has a dimension of [height, lat, lon].
I would like to plot the horizontal distribution of the total depth of moistb.
How do I calculate the total depth? The idea is to sum all the depths of moistb at each grid point. Is there a way to do this with MetPy?
For reference, here's an example as shown by Schumacher and Johnson (2008)
where they plot the horizontal distribution of total depth (m).
It sounds like in this case you're working with data stored in an Xarray DataArray. If so, the way to do what you're looking for is:
moistb.sum(dim='height')
You can also do this with regular NumPy arrays (or a DataArray) by using the axis argument, which refers to the dimension's position in the order listed. So for the order listed above this would be:
moistb.sum(axis=0)
For more information see the Xarray docs or the Numpy docs.
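For illustration, a minimal sketch with a made-up DataArray (the dimension sizes, and the variable name moistb standing in for your real data, are assumptions):
import numpy as np
import xarray as xr

# hypothetical stand-in for moistb: [height, lat, lon], sizes made up
moistb = xr.DataArray(np.random.rand(10, 4, 5), dims=['height', 'lat', 'lon'])

total_depth = moistb.sum(dim='height')   # collapses the height dimension
print(total_depth.dims)                  # ('lat', 'lon'): one value per grid point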

Is there a way to plot Matplotlib's Imshow against a specific array rather than the indices?

I'm trying to use imshow to plot a 2D Fourier transform of my data. However, imshow plots the data against its index in the array. I would like to plot the data against a set of arrays I have containing the corresponding frequency values (one array for each dimension), but can't figure out how.
I have a 2D array of data (a Gaussian pulse signal) that I Fourier transform with np.fft.fft2. This all works fine. I then get the corresponding frequency bins for each dimension with np.fft.fftfreq(len(data))*sampling_rate. I can't figure out how to use imshow to plot the data against these frequencies, though. The 1D equivalent of what I'm trying to do is using plt.plot(x, y) rather than just plt.plot(y).
My first attempt was to use imshow's "extent" argument, but as far as I can tell that just changes the axis limits, not the actual bins.
My next solution was to use np.fft.fftshift to arrange the data in numerical order and then simply rescale the axis using this answer: Change the axis scale of imshow. However, the index-to-frequency mapping is not a pure scaling factor; there is typically a constant offset as well.
I also tried a 2D histogram instead of imshow, but that doesn't work, since hist2d plots the number of times an ordered pair occurs, while I want to plot a scalar value corresponding to specific ordered pairs (i.e. the power of the signal at specific frequency combinations).
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
f = 200
st = 2500
x = np.linspace(-1,1,2*st)
y = signal.gausspulse(x, fc=f, bw=0.05)
data = np.outer(np.ones(len(y)),y) # A simple example with constant y
Fdata = np.abs(np.fft.fft2(data))**2
freqx = np.fft.fftfreq(len(x))*st # What I want to plot my data against
freqy = np.fft.fftfreq(len(y))*st
plt.imshow(Fdata)
I should see a peak at (200, 0) corresponding to the frequency of my signal (with some fall-off around it corresponding to the bandwidth), but instead my maximum occurs at some arbitrary position corresponding to the frequency's index in my data array. If anyone has any ideas, fixes, or other functions to use, I would greatly appreciate it!
I cannot run your code, but I think you are looking for the extent= argument to imshow(). See the page on origin and extent for more information.
Something like this may work:
plt.imshow(Fdata, extent=(freqx[0],freqx[-1],freqy[0],freqy[-1]))
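If the unshifted frequency ordering is the issue, one option (a sketch, assuming Fdata, freqx and freqy from the snippet above, and that rows of Fdata correspond to freqy) is to fftshift both the spectrum and the frequency axes into ascending order before passing them to extent:
import numpy as np
import matplotlib.pyplot as plt

# put the zero-frequency component in the centre so both axes are monotonic
Fshift = np.fft.fftshift(Fdata)
fx = np.fft.fftshift(freqx)
fy = np.fft.fftshift(freqy)

plt.imshow(Fshift, origin='lower', aspect='auto',
           extent=(fx[0], fx[-1], fy[0], fy[-1]))
plt.xlabel('frequency (x)')
plt.ylabel('frequency (y)')
plt.colorbar()
plt.show()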

A way to maintain index pointers over a contiguous array

In Python, I am currently trying to create a per-note frequency visualisation of a specific guitar riff I like.
In order to do this and have the points plotted with matplotlib.pyplot, I am doing something like the following for each note, but will ultimately be summing y-values at specific points for two specific frequencies:
import numpy
import matplotlib.pyplot as plt
t_per_beat = 110/60.0 #tempo is 110 bpm, finding no of seconds per beat
#creating range of x values for 8 beats of music, in this case 2 bars
x0 = numpy.linspace(0, t_per_beat * 8, 100)
a = []
#generate y-axis values
for i in x0:
    a.append(numpy.sin(<note_freq> * i))  # <note_freq> is a placeholder for the note's frequency
I want the y-axis values to be contiguous like the x-axis, so an array of plotted points is best, but I also want to be able to index specific intervals in the array, down to a granularity of a 'sixteenth note' (t_per_beat/4)
Because the frequency value of my note may increase (so I will need to increase the number of points in my numpy.linspace array), I cannot be assured that the interval of index numbers across the array will be consistent.
Of course, splitting into a container of separate arrays (i.e. a 2-dimensional list) would be preferable, but where the modelling of the waves means that two waves coalesce over beat boundaries, this is not really ideal.
In essence my question is (in the absence of a better solution that I haven't thought of): is there a way to store a reference to a piece of data in an array such that, when called, I can always find the index of that data in the array?
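One possible avenue (a sketch under the assumption that you keep the time axis x0 alongside the samples; note_freq and the helper name index_of_sixteenth are made up for illustration) is to derive the index from the time value with np.searchsorted instead of trying to keep pointers stable:
import numpy as np

t_per_beat = 110 / 60.0                    # seconds per beat at 110 bpm
x0 = np.linspace(0, t_per_beat * 8, 100)   # time axis; resolution may change
note_freq = 440.0                          # hypothetical note frequency
a = np.sin(2 * np.pi * note_freq * x0)     # samples over the same time axis

def index_of_sixteenth(n):
    """Index into x0/a of the n-th sixteenth note (quarter of a beat),
    regardless of how many points x0 was built with."""
    t = n * t_per_beat / 4.0
    return min(int(np.searchsorted(x0, t)), len(x0) - 1)

print(a[index_of_sixteenth(5)])            # sample nearest the 5th sixteenth note
Because the index is recomputed from the time value, it stays valid even if you rebuild x0 with more points.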

Python fastKDE beyond limits of data points

I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above:
#!python
import numpy as np
from fastkde import fastKDE
import pylab as PP
#Generate two random variables dataset (representing 200000 pairs of datapoints)
N = int(2e5)  # must be an integer for np.random.normal(size=N)
var1 = 50*np.random.normal(size=N) + 0.1
var2 = 0.01*np.random.normal(size=N) - 300
#Do the self-consistent density estimate
myPDF,axes = fastKDE.pdf(var1,var2)
#Extract the axes from the axis list
v1,v2 = axes
#Plot contours of the PDF; it should be a set of concentric ellipsoids centered
#on (0.1, -300). Comparatively, the y axis range should be tiny and the x axis
#range should be large
PP.contour(v1,v2,myPDF)
PP.show()
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for, say, the point (0, 300) without having to include it in var1 and var2? I don't want the KDE to be calculated with this data point; I want to know the KDE at that point.
I guess what I really want is to be able to give fastKDE a histogram of the data, so that I can set its axes myself. I just don't know if this is possible?
Cheers
I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the nearest grid point to the point I want to evaluate. I then look up the corresponding pdf value at that point (it should be small near the edge of the pdf grid, if not identically zero) and assign that value accordingly.
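A minimal sketch of that approach, reusing myPDF and the axes v1, v2 from the snippet above, and assuming myPDF is laid out as (len(v2), len(v1)) as the contour call implies:
import numpy as np
from scipy.spatial import cKDTree

V1, V2 = np.meshgrid(v1, v2)                   # grid matching myPDF's layout
tree = cKDTree(np.column_stack([V1.ravel(), V2.ravel()]))

query = np.array([[0.0, 300.0]])               # point(s) to evaluate
_, idx = tree.query(query)                     # index of the nearest grid point
pdf_at_query = myPDF.ravel()[idx]              # pdf value at that grid point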
I came across this post while searching for a solution to this problem. Similar to building a KDTree, you could just calculate your step size in every grid dimension, and then get the index of your query point by subtracting the start of the axis from the point value, dividing by the step size of that dimension, and finally rounding off and converting to an integer. So, for example, in 1D:
def fastkde_test(test_x):
    # num_p is the user-chosen number of grid points for the KDE
    kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
    # grid spacing along the single axis
    x_step = (max(axes) - min(axes)) / len(axes)
    # index of the grid point nearest each query value
    x_ind = np.int32(np.round((test_x - min(axes)) / x_step))
    return kde[x_ind]
where test_x in this case is both the set used to define the KDE and the query set. Doing it this way is roughly a factor of 10 faster in my case (at least in 1D; I haven't tested higher dimensions yet) and does basically the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if you're querying points outside of the range over which the KDE was calculated, this method can of course only give you the same result as the KDTree query, namely the value at the corresponding border of your KDE grid. You would, however, have to handle this explicitly by clipping the resulting x_ind at the highest valid index, i.e. len(axes) - 1.
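For example (assuming x_ind comes from the function above), clipping at both ends of the grid would look like:
x_ind = np.clip(x_ind, 0, len(axes) - 1)   # clamp out-of-range queries to the grid edges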

How can I account for identical data points in a scatter plot?

I'm working with some data that has several identical data points. I would like to visualize the data in a scatter plot, but scatter plotting doesn't do a good job of showing the duplicates.
If I change the alpha value, then the identical data points become darker, which is nice, but not ideal.
Is there some way to map the color of a dot to how many times it occurs in the data set? What about size? How can I assign the size of the dot to how many times it occurs in the data set?
As it was pointed out, whether this makes sense depends a bit on your dataset. If you have reasonably discrete points and exact matches make sense, you can do something like this:
import numpy as np
import matplotlib.pyplot as plt
test_x=[2,3,4,1,2,4,2]
test_y=[1,2,1,3,1,1,1] # I am just generating some test x and y values. Use your data here
#Generate a list of unique points
points=list(set(zip(test_x,test_y)))
#Generate a list of point counts
count=[len([x for x,y in zip(test_x,test_y) if x==p[0] and y==p[1]]) for p in points]
#Now for the plotting:
plot_x=[i[0] for i in points]
plot_y=[i[1] for i in points]
count=np.array(count)
plt.scatter(plot_x,plot_y,c=count,s=100*count**0.5,cmap='Spectral_r')
plt.colorbar()
plt.show()
Note: you will need to adjust the scale factor (the value 100 in the s argument) to your point density. I also used the square root of the count so that the marker size grows sub-linearly with the count.
Also note: if you have very dense points, it might be more appropriate to use a different kind of plot. Histograms, for example (I personally like hexbin for 2D data), are a decent alternative in these cases.
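As an aside, a more compact way to get the unique points and their counts (a sketch assuming a reasonably recent NumPy, since np.unique only accepts the axis= argument from version 1.13 on):
import numpy as np
import matplotlib.pyplot as plt

test_x = [2, 3, 4, 1, 2, 4, 2]
test_y = [1, 2, 1, 3, 1, 1, 1]

# stack the points as rows and count duplicate rows in one call
pts = np.column_stack([test_x, test_y])
unique_pts, count = np.unique(pts, axis=0, return_counts=True)

plt.scatter(unique_pts[:, 0], unique_pts[:, 1],
            c=count, s=100 * np.sqrt(count), cmap='Spectral_r')
plt.colorbar()
plt.show()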
