I am trying to generate latitudes from latitude boundary(min,max latitudes and number of points) .
I have two choices to generate this :
Using multiple probability distribution and generating x with 1 probability distribution ,as that wont be normal
Or
make a distribution with skew introduced.
The main aim is getting a non normal distributed array points .
I might be doing a conceptual mistake here.
def array_gen(min,max):
new_lat_ = np.linspace(min,max,1000)
a = 3
r = skewnorm.rvs(a, size=1000)
#not sure how to use the new_lat_
arr = skewnorm.pdf(new_lat_, *skewnorm.fit(new_lat_))
ax.hist(r, density=True,histtype='stepfilled', alpha=0.2)
ax.hist(arr, density=True,histtype='stepfilled', alpha=0.2)
return arr
array = array_gen(12.786963,13.140868)
fig
I expect a distribution that is skewed I get it but it isn't in the range I expected .Expected range = (12.786963,13.140868)
0.43939953, 0.75707352, 0.63534797, 0.40377254, 0.27907808,
0.23454434, 0.11875663, 0.07422289, 0.02375133, 0.00296892
-0.1491852 , 0.1876381 , 0.5244614 , 0.8612847 , 1.198108,
1.53493129, 1.87175459, 2.20857789, 2.54540119, 2.88222449,
3.21904779
Both seaborn and pandas provide APIs in order to plot bivariate histograms as a hexbin plot (example plotted below). However, I am searching to execute a query for the points that are located in the same hexbin. Is there a function to retrieve the rows associated with the data points in the hexbin?
The give an example:
My data frame contains 3 rows: A, B and C. I use sns.jointplot(x=A,y=B) to plot the density. Now, I want to execute a query on each data point located in the same bin. For instance, for each bin compute the mean of the C value associated with each point.
Current solution -- Quick Hack
Currently, I have implemented the following function to apply a function to the data associated with a (x,y) coordinate located in the same hexbin:
def hexagonify(x, y, values, func=None):
hexagonized_list = []
fig = plt.figure()
fig.set_visible(False)
if func is not None:
image = plt.hexbin(x=x, y=y, C=values, reduce_C_function=func)
else:
image = plt.hexbin(x=x, y=y, C=values)
values = image.get_array()
verts = image.get_offsets()
for offc in range(verts.shape[0]):
binx, biny = verts[offc][0], verts[offc][1]
val = values[offc]
if val:
hexagonized_list.append((binx, biny, val))
fig.clear()
plt.close(fig)
return hexagonized_list
The values (with the same size as x or y) are passed through the values parameter. The hexbins are computed through the hexbin function of matplotlib. The values are retrieved through the get_array() function of the returned PolyCollection. By default, the np.mean function is applied to the accumalated values per bin. This functionality can be changed by providing a function to the func paramater. Subsequently, the get_offsets() method allows us to calculate the center of the bins (discussed here). In this way, we can associate (by default) mean value of the provided values per hexbin. However, this solution is a hack, so any improvements to this solution are welcome.
From matplotlib
If you have already drawn the plot, you can get Bin Counts from polycollection returned by matplotlib:
polycollection: A PolyCollection instance; use PolyCollection.get_array on this to get the counts in each hexagon.
This functionality is also available in:
matplotlib.pyplot.hist2d;
numpy.histogram2d;
Pure pandas
Here a MCVE using only pandas that can handle the C property:
import numpy as np
import pandas as pd
# Trial Dataset:
N=1000
d = np.array([np.random.randn(N), np.random.randn(N), np.random.rand(N)]).T
df = pd.DataFrame(d, columns=['x', 'y', 'c'])
# Create bins:
df['xb'] = pd.cut(df.x, 3)
df['yb'] = pd.cut(df.y, 3)
# Group by and Aggregate:
p = df.groupby(['xb', 'yb']).agg('mean')['c']
p.unstack()
First we create bins using pandas.cut. Then we group by and aggregate. You can pick the agg function you like to aggregate C (eg. max, median, etc.).
The output is about:
yb (-2.857, -0.936] (-0.936, 0.98] (0.98, 2.895]
xb
(-2.867, -0.76] 0.454424 0.519920 0.507443
(-0.76, 1.34] 0.535930 0.484818 0.513158
(1.34, 3.441] 0.441094 0.493657 0.385987
Given an array of values, I want to be able to fit a density function to it and find the pdf of an arbitrary input value. Is this possible, and how would I go about it? There aren't necessarily assumptions of normality, and I don't need the function itself.
For instance, given:
x = array([ 0.62529759, -0.08202699, 0.59220673, -0.09074541, 0.05517865,
0.20153703, 0.22773723, -0.26229708, 0.76137555, -0.61229314,
0.27292745, 0.35596795, -0.01373896, 0.32464979, -0.22932331,
1.14796175, 0.17268531, 0.40692172, 0.13846154, 0.22752953,
0.13087359, 0.14111479, -0.09932381, 0.12800392, 0.02605917,
0.18776078, 0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
-0.09171721, 0.23083624, -0.21603973, 0.05425592, 0.47910286,
0.26359565, -0.19917942, 0.40182097, -0.0797546 , 0.47239264,
-0.36654449, 0.4513859 , -0.00282486, -0.13950512, -0.05375369,
0.03331833, 0.48951555, -0.13760504, 2.788 , -0.15017848,
0.02930675, 0.10910646, 0.03868301, -0.048482 , 0.7277376 ,
0.08841259, -0.10968462, 0.50371324, 0.86379698, 0.01674877,
0.19542421, -0.06639165, 0.74500856, -0.10148342, 0.02482331,
0.79195804, 0.40401969, 0.25120005, 0.21020794, -0.01767013,
-0.13453783, -0.09605592, -0.88044229, 0.04689623, 0.09043851,
0.21232286, 0.34129982, -0.3736799 , 0.17313858])
I would like to find how a value of 0.3 compares to all of the above, and what percent of the above values it is greater than.
I personally like using the scipy.stats package. It has a useful implementation of Kernel Density Estimation. Bascially what this does is it estimates a probability density function of certain data, using combinations of gaussian (or other) distributions. Which distributions are used is a parameter you can set. Look at the documentation and related examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#kernel-density-estimation
And for more about KDE: https://en.wikipedia.org/wiki/Kernel_density_estimation
Once you have built your KDE, then you can perform the same operations on it to get probabilities. For example, if you want to calculate the probability that a value occurs that is as large or larger than 0.3 you would do the following:
kde = stats.gaussian_kde(np.array(x))
#visualize KDE
fig = plt.figure()
ax = fig.add_subplot(111)
x_eval = np.linspace(-.2, .2, num=200)
ax.plot(x_eval, kde(x_eval), 'k-')
#get probability
kde.integrate_box_1d( 0.3, np.inf)
TLDR:
Calculate a KDE, then use the KDE as if it were a PDF.
You can use openTURNS for that. You can use a Gaussian kernel smoothing to do that easily! From the doc:
import openturns as ot
kernel = ot.KernelSmoothing()
estimated = kernel.build(x)
That's it, now you have a distribution object :)
This library is very cool for statistics! (I am not related to them).
We have first to create the Sample from the Numpy array.
Then we compute the complementary CDF with the complementaryCDF method of the distribution (a small improvement over Yoda's answer).
import numpy as np
x = np.array([ 0.62529759, -0.08202699, 0.59220673, -0.09074541, 0.05517865,
0.20153703, 0.22773723, -0.26229708, 0.76137555, -0.61229314,
0.27292745, 0.35596795, -0.01373896, 0.32464979, -0.22932331,
1.14796175, 0.17268531, 0.40692172, 0.13846154, 0.22752953,
0.13087359, 0.14111479, -0.09932381, 0.12800392, 0.02605917,
0.18776078, 0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
-0.09171721, 0.23083624, -0.21603973, 0.05425592, 0.47910286,
0.26359565, -0.19917942, 0.40182097, -0.0797546 , 0.47239264,
-0.36654449, 0.4513859 , -0.00282486, -0.13950512, -0.05375369,
0.03331833, 0.48951555, -0.13760504, 2.788 , -0.15017848,
0.02930675, 0.10910646, 0.03868301, -0.048482 , 0.7277376 ,
0.08841259, -0.10968462, 0.50371324, 0.86379698, 0.01674877,
0.19542421, -0.06639165, 0.74500856, -0.10148342, 0.02482331,
0.79195804, 0.40401969, 0.25120005, 0.21020794, -0.01767013,
-0.13453783, -0.09605592, -0.88044229, 0.04689623, 0.09043851,
0.21232286, 0.34129982, -0.3736799 , 0.17313858])
import openturns as ot
kernel = ot.KernelSmoothing()
sample = ot.Sample(x,1)
distribution = kernel.build(sample)
q = distribution.computeComplementaryCDF(0.3)
print(q)
which prints:
0.29136124840835353
Is there a way to extract the data from an array, which corresponds to a line of a contourplot in python? I.e. I have the following code:
n = 100
x, y = np.mgrid[0:1:n*1j, 0:1:n*1j]
plt.contour(x,y,values)
where values is a 2d array with data (I stored the data in a file but it seems not to be possible to upload it here). The picture below shows the corresponding contourplot. My question is, if it is possible to get exactly the data from values, which corresponds e.g. to the left contourline in the plot?
Worth noting here, since this post was the top hit when I had the same question, that this can be done with scikit-image much more simply than with matplotlib. I'd encourage you to check out skimage.measure.find_contours. A snippet of their example:
from skimage import measure
x, y = np.ogrid[-np.pi:np.pi:100j, -np.pi:np.pi:100j]
r = np.sin(np.exp((np.sin(x)**3 + np.cos(y)**2)))
contours = measure.find_contours(r, 0.8)
which can then be plotted/manipulated as you need. I like this more because you don't have to get into the deep weeds of matplotlib.
plt.contour returns a QuadContourSet. From that, we can access the individual lines using:
cs.collections[0].get_paths()
This returns all the individual paths. To access the actual x, y locations, we need to look at the vertices attribute of each path. The first contour drawn should be accessible using:
X, Y = cs.collections[0].get_paths()[0].vertices.T
See the example below to see how to access any of the given lines. In the example I only access the first one:
import matplotlib.pyplot as plt
import numpy as np
n = 100
x, y = np.mgrid[0:1:n*1j, 0:1:n*1j]
values = x**0.5 * y**0.5
fig1, ax1 = plt.subplots(1)
cs = plt.contour(x, y, values)
lines = []
for line in cs.collections[0].get_paths():
lines.append(line.vertices)
fig1.savefig('contours1.png')
fig2, ax2 = plt.subplots(1)
ax2.plot(lines[0][:, 0], lines[0][:, 1])
fig2.savefig('contours2.png')
contours1.png:
contours2.png:
plt.contour returns a QuadContourSet which holds the data you're after.
See Get coordinates from the contour in matplotlib? (which this question is probably a duplicate of...)
I am facing a problem while running a script ( please find the code below ).
I am trying to plot an array of values, write it into a FITS file format, read it back again and plot it ---> I don't get the same plots!
If you could please help me with this it would be great.
The following are the versions of my packages and compiler:
matplotlib : '2.0.0b1'
numpy : '1.11.0'
astropy : u'1.1.2'
python : 2.7
Sincerely,
Anik Halder
import numpy as np
from pylab import *
from astropy.io import fits
# Just making a 10x10 meshgrid
x = np.arange(10)
X , Y = np.meshgrid(x,x)
# finding the distance of different points on the meshgrid from a point suppose at (5,5)
Z = ((X-5)**2 + (Y-5)**2)**0.5
# plotting Z (see image [link below] - left one)
imshow(Z, origin = "lower")
colorbar()
show()
# writing the Z data into a fits file
fits.writeto("my_file.fits", Z)
# reading the same fits file and storing the data
Z_read = fits.open("my_file.fits")[0].data
# plotting Z_read : we expect it to show the same plot as before
imshow(Z_read, origin = "lower")
colorbar()
show()
# Lo! That's not the case for me! It's not the same plot! (see image - right one)
# Hence, I try to check whether the values stored in Z and Z_read are different..
print Z - Z_read
# No! It returns an array full of zeros! This means Z and Z_read are the same! I don't get why the plots look different!
Please find the image in this link: http://imgur.com/1TklSjU
Actually it turns out to be that it's to do with the version of matplotlib.
Answered by a developer of matplotlib - Jens Nielsen
This doesn't occur on matplotlib version 1.51
In version 2 beta 1, it seems that the FITS data is converted from float32 to big endian float8. Please look at the following link:
https://gist.github.com/jenshnielsen/86d4a86d8f667fadddc09f88c5fb87e6
The issue has been posted and you can look it up over here:
https://github.com/matplotlib/matplotlib/issues/6671
In the meantime for having the same plots (Z and Z _read) we should rather use the following code (for matplotlib 2 beta 1) :
imshow(Z_read.astype('float64'), origin = "lower")