How to retrieve seaborn's histplot data [duplicate] - python

I use
sns.distplot
to plot a univariate distribution of observations. Still, I need not only the chart, but also the data points. How do I get the data points from matplotlib Axes (returned by distplot)?

The Axes returned by distplot holds the drawn artists, so you can read the data back from them. For instance, to get the first line (the KDE curve):
sns.distplot(x).get_lines()[0].get_data()
This returns two numpy arrays containing the x and y values for the line.
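For the bars, the information is stored in:
sns.distplot(x).patches
Each patch is a matplotlib Rectangle, so you can read a bar's height with its get_height() method. (Every call to sns.distplot(x) draws the plot again, so in practice keep the returned Axes in a variable and reuse it.)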
[h.get_height() for h in sns.distplot(x).patches]
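If only the numbers are needed, the bar heights can also be computed without drawing anything. The sketch below uses numpy's histogram with density normalization; the sample data x and the bin rule are assumptions for illustration, and the bins may not match the ones seaborn picks by default.

```python
import numpy as np

# Hypothetical sample standing in for the observations passed to distplot.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# density=True normalizes the bars so that their total area is 1,
# which is the scale a KDE curve is drawn on.
heights, edges = np.histogram(x, bins="auto", density=True)

# Bar geometry: width and center of every bin.
widths = np.diff(edges)
centers = edges[:-1] + widths / 2

print(heights[:5])
print(centers[:5])
```

The heights array plays the role of the get_height() values, and edges gives the bin boundaries that the patches only expose indirectly.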

If you want the KDE values themselves rather than reading them off the plot, you can use scikit-learn's KernelDensity class instead:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
ds=pd.read_csv('data-to-plot.csv')
X=ds.loc[:,'Money-Spent'].values[:, np.newaxis]
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X)  # bandwidth is a parameter you can supply
x=np.linspace(0,5,100)[:, np.newaxis]
log_density_values=kde.score_samples(x)
density=np.exp(log_density_values)
array([1.88878660e-05, 2.04872903e-05, 2.21864649e-05, 2.39885206e-05,
2.58965064e-05, 2.79134003e-05, 3.00421245e-05, 3.22855645e-05,
3.46465903e-05, 3.71280791e-05, 3.97329392e-05, 4.24641320e-05,
4.53246933e-05, 4.83177514e-05, 5.14465430e-05, 5.47144252e-05,
5.81248850e-05, 6.16815472e-05, 6.53881807e-05, 6.92487062e-05,
7.32672057e-05, 7.74479375e-05, 8.17953578e-05, 8.63141507e-05,
..........................
..........................
3.93779919e-03, 4.15788216e-03, 4.38513011e-03, 4.61925890e-03,
4.85992626e-03, 5.10672757e-03, 5.35919187e-03, 5.61677855e-03])
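For intuition, a Gaussian kernel density estimate can be written out directly: the density at a point is the average of Gaussian bumps centered on the observations. This numpy-only sketch uses made-up data and the same bandwidth of 0.75; it is meant to show the calculation, not to reproduce scikit-learn's implementation.

```python
import numpy as np

def gaussian_kde_values(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample.
    z = (grid[:, np.newaxis] - samples[np.newaxis, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over the samples and rescale by the bandwidth.
    return kernels.mean(axis=1) / bandwidth

# Hypothetical observations standing in for the 'Money-Spent' column.
rng = np.random.default_rng(1)
samples = rng.normal(loc=2.5, scale=1.0, size=200)

grid = np.linspace(-5.0, 10.0, 601)
density = gaussian_kde_values(samples, grid, bandwidth=0.75)

# The estimate should integrate to roughly 1 over a wide enough grid.
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```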

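With newer versions of seaborn this no longer works as written. First, distplot is deprecated in favor of displot (figure-level) and histplot (axes-level). Second, displot returns a FacetGrid rather than an Axes, so calling get_lines() on its result raises AttributeError: 'FacetGrid' object has no attribute 'get_lines'. The underlying Axes is still reachable through the grid's ax attribute (or axes when there are several facets), and histplot returns an Axes directly.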

This will get the KDE curve you want:
line = sns.distplot(data).get_lines()[0]
plt.plot(line.get_xdata(), line.get_ydata())

Related

Problem with matplotlib.pyplot with matplotlib.pyplot.scatter in the argument s

My name is Luis Francisco Gomez and I am taking the course Intermediate Python > 1 Matplotlib > Sizes, which belongs to Data Scientist with Python on DataCamp. I am reproducing the course exercises; in this part you have to make a scatter plot in which the size of each point reflects the population of the country. I tried to reproduce DataCamp's result with this code:
# load subpackage
import matplotlib.pyplot as plt
## load other libraries
import pandas as pd
import numpy as np
## import data
gapminder = pd.read_csv("https://assets.datacamp.com/production/repositories/287/datasets/5b1e4356f9fa5b5ce32e9bd2b75c777284819cca/gapminder.csv")
gdp_cap = gapminder["gdp_cap"].tolist()
life_exp = gapminder["life_exp"].tolist()
# create an np array that contains the population
pop = gapminder["population"].tolist()
pop_np = np.array(pop)
plt.scatter(gdp_cap, life_exp, s = pop_np*2)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])
# Display the plot
plt.show()
However, I get this:
But in theory you should get this:
I don't understand what the problem is with the argument s in plt.scatter.
You need to scale your s:
plt.scatter(gdp_cap, life_exp, s = pop_np*2/1000000)
Per the docs, s is the marker size in points**2.
This is because your sizes are too large; scale them down. Also, there's no need to create all the intermediate arrays:
plt.scatter(gapminder.gdp_cap,
gapminder.life_exp,
s=gapminder.population/1e6)
Output:
I think you should use
plt.scatter(gdp_cap, life_exp, s = np.array(gdp_cap)*2)
(gdp_cap is a list, so it has to be converted to an array before multiplying), or maybe reduce or scale pop_np.
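To see why the unscaled sizes fail, it helps to turn s back into a length. Since s is an area in points**2 and a point is 1/72 inch, a marker is roughly sqrt(s)/72 inches wide. The populations below are made-up round numbers, not values from the gapminder file, and the helper name is hypothetical.

```python
import numpy as np

POINTS_PER_INCH = 72  # a typographic point is 1/72 inch

def marker_width_inches(s):
    """Approximate marker width for a scatter size `s` given in points**2."""
    return np.sqrt(s) / POINTS_PER_INCH

# Hypothetical populations: 1 million, 50 million and 1.3 billion people.
population = np.array([1e6, 5e7, 1.3e9])

unscaled = population * 2        # what the question passes as s
scaled = population * 2 / 1e6    # what the answers suggest

print(marker_width_inches(unscaled))  # tens to hundreds of inches: every marker covers the figure
print(marker_width_inches(scaled))    # fractions of an inch
```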

Python - fitting data with exponential function

I am aware that there are a few questions about a similar subject, but I couldn't find a proper answer.
I would like to fit some data with a function (called Bastenaire) and get the parameter values. Here is the code:
import numpy as np
from matplotlib import pyplot as plt
from scipy import optimize
def bastenaire(s, A, B, C, sd):
    logNB = np.log(A) - C*(s - sd) - np.log(s - sd)
    return np.exp(logNB) - B
S=np.array([659,646,634,623,613,595,580,565,551,535,515,493,473,452,432,413,394,374,355,345])
N=np.array([46963,52934,59975,65522,74241,87237,101977,116751,133665,157067,189426,233260,281321,355558,428815,522582,630257,768067,902506,1017280])
fitmb,fitmob=optimize.curve_fit(bastenaire,S,N,p0=(30000,2000000000,0.2,250))
plt.scatter(N,S)
plt.plot(bastenaire(S,*fitmb),S,label='bastenaire')
plt.legend()
plt.show()
However, the curve fit cannot identify the correct parameters and I get: OptimizeWarning: Covariance of the parameters could not be estimated.
I get the same result when I give no initial parameter values.
Is there any way to tweak something and get results? Should my dataset cover a wider range of values?
Thank you!
Broc
Fitting is tough: you need to restrict the parameter space using bounds and, often, adjust your initial values.
To make it work, I searched for initial values for which the function had roughly the right shape, then estimated some constraints:
bounds = np.array([(1e4, 1e12), (-np.inf, np.inf), (1e-20, 1e-2), (-2000., 20000)]).T
fitmb, fitmob = optimize.curve_fit(bastenaire,S, N,p0=(1e7,-100.,1e-5,250.), bounds=bounds)
returns
(array([ 1.00000000e+10, 1.03174824e+04, 7.53169772e-03, -7.32901325e+01]), array([[ 2.24128391e-06, 6.17858390e+00, -1.44693602e-07,
-5.72040842e-03],
[ 6.17858390e+00, 1.70326029e+07, -3.98881486e-01,
-1.57696515e+04],
[-1.44693602e-07, -3.98881486e-01, 1.14650323e-08,
4.68707940e-04],
[-5.72040842e-03, -1.57696515e+04, 4.68707940e-04,
1.93358414e+01]]))
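A quick way to sanity-check parameters like these is to evaluate the model with them and compare against the measurements. The sketch below copies S, N and the fitted values from the question and the output above; the relative-error measure is my own choice of summary, not something the answer prescribes.

```python
import numpy as np

def bastenaire(s, A, B, C, sd):
    logNB = np.log(A) - C * (s - sd) - np.log(s - sd)
    return np.exp(logNB) - B

S = np.array([659, 646, 634, 623, 613, 595, 580, 565, 551, 535,
              515, 493, 473, 452, 432, 413, 394, 374, 355, 345])
N = np.array([46963, 52934, 59975, 65522, 74241, 87237, 101977, 116751,
              133665, 157067, 189426, 233260, 281321, 355558, 428815,
              522582, 630257, 768067, 902506, 1017280])

# Parameters copied from the curve_fit output quoted above.
fitted = np.array([1.00000000e+10, 1.03174824e+04, 7.53169772e-03, -7.32901325e+01])

predicted = bastenaire(S, *fitted)
relative_error = np.abs(predicted - N) / N

# Largest deviation of the fitted curve from the data, as a fraction of N.
print(relative_error.max())
```

A small maximum relative error suggests the bounded fit follows the data; a large one would point back at the bounds or the initial values.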

python - What produces the same plot as autocorrelation_plot()?

I need the values of the autocorrelation coefficients coming from the autocorrelation_plot(). The problem is that the output coming from this function is not accessible, so I need another function to get such values. That's why I used acf() from statsmodels but it didn't get the same plot as autocorrelation_plot() does. Here is my code:
from statsmodels.tsa.stattools import acf
from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt
import numpy as np
y = np.sin(np.arange(1,6*np.pi,0.1))
plt.plot(acf(y))
plt.show()
So the result is not the same as this:
autocorrelation_plot(y)
plt.show()
This seems to be related to the nlags parameter of acf:
nlags: int, optional
Number of lags to return autocorrelation for.
I don't know exactly what this does, but in the source of acf there is a slicing that shortens the array:
avf = acovf(x, unbiased=unbiased, demean=True, fft=fft, missing=missing)
acf = avf[:nlags + 1] / avf[0]
If you use statsmodels.tsa.stattools.acovf directly, nothing is cut off, and the result covers the same number of lags as autocorrelation_plot:
avf = acovf(x, unbiased=unbiased, demean=True, fft=fft, missing=missing)
So you can pass nlags explicitly, like
plt.plot(acf(y, nlags=len(y)))
to make it work.
An explanation of lag: https://math.stackexchange.com/questions/2548314/what-is-lag-in-a-time-series/2548350
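If the goal is just the coefficient values, they can also be computed by hand for every lag. This numpy-only sketch implements the usual biased sample autocorrelation (each lag's sum divided by n, then normalized by the lag-0 value); as far as I can tell that is the quantity both functions are built on, but treat that as an assumption and compare against your own plot.

```python
import numpy as np

def autocorrelation(x):
    """Sample autocorrelation of `x` for lags 0 .. len(x) - 1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    c0 = np.sum(centered**2) / n  # lag-0 autocovariance (the variance)
    # Autocovariance at lag h: overlap the series with itself shifted by h.
    acov = np.array([np.sum(centered[: n - h] * centered[h:]) / n for h in range(n)])
    return acov / c0

y = np.sin(np.arange(1, 6 * np.pi, 0.1))
r = autocorrelation(y)

print(len(y), len(r))
print(r[:3])
```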

Transformation to dgamma function in Python

I want to transform data
[0.54667, 0.471447, 0.826591, 0.330514, 0.7263, 0.496063, 0.520698, 0.321594, 0.351358, 0.894333]
to distribution
'dgamma(a=0.91, loc=0.48, scale=0.15)'
How can I do this in Python?
First of all, you don't need to create a distribution object in advance. All you need are the distribution parameters, which the code below estimates (floc=0 and fscale=1 hold loc and scale fixed, so only the shape parameter is fitted).
from scipy.stats import gamma
import numpy as np
data = [1,2,3,4,5] # your data
fit_alpha, fit_loc, fit_beta = gamma.fit(np.array(data), floc=0, fscale=1)
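Then you can use the scipy.stats.gamma functions to get the PDF, CDF, etc. For example:
print(gamma.pdf(0.9, fit_alpha))
Check the documentation for other useful calls. scipy.stats.dgamma, the double gamma distribution named in the question, exposes the same methods.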
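For reference, the double gamma density named in the question can be written out by hand: with y = (x - loc) / scale, it is |y|**(a - 1) * exp(-|y|) / (2 * Gamma(a)), divided by scale. The sketch below evaluates it on the question's data using only numpy and the standard library; the function name dgamma_pdf is hypothetical, and whether evaluating the density is the "transformation" the question has in mind is an assumption.

```python
import math
import numpy as np

def dgamma_pdf(x, a, loc=0.0, scale=1.0):
    """Double gamma density with shape `a`, shifted by `loc` and scaled by `scale`."""
    y = np.abs((np.asarray(x, dtype=float) - loc) / scale)
    return y ** (a - 1.0) * np.exp(-y) / (2.0 * math.gamma(a)) / scale

data = np.array([0.54667, 0.471447, 0.826591, 0.330514, 0.7263,
                 0.496063, 0.520698, 0.321594, 0.351358, 0.894333])

# Density of dgamma(a=0.91, loc=0.48, scale=0.15) at each data point.
pdf_values = dgamma_pdf(data, a=0.91, loc=0.48, scale=0.15)
print(pdf_values)
```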

get numpy array of matplotlib tricontourf

I had x, y and height variables to build a contour in Python.
I created a Triangulation grid (x, y, height and triang are numpy arrays):
tri = Tri.Triangulation(x, y, triang)
Then I drew a contour using tricontourf:
tricontourf(tri,height)
How can I get the output of tricontourf into a numpy array? I can display the image using pyplot, but I don't want to.
When I tried this:
triout = tricontourf(tri,height)
print triout
I got:
<matplotlib.tri.tricontour.TriContourSet instance at 0xa9ab66c>
I need to get the image data, and if I could get a numpy array it would be easy for me.
Is it possible to do this?
If it's not possible, can I do what tricontourf does without matplotlib in Python?
You should try this:
cs = tricontourf(tri,height)
for collection in cs.collections:
    for path in collection.get_paths():
        print(path.to_polygons())
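as I learned from:
https://github.com/matplotlib/matplotlib/issues/367
(it is better to use path.to_polygons()). Note that in recent matplotlib releases cs.collections is deprecated; there the paths should be available from cs.get_paths() instead.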
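If what is needed is the rendered image rather than the polygon outlines, one option is to draw the figure off-screen with the Agg canvas and read the pixel buffer as an array. This is a sketch under assumptions: the points and heights are random stand-ins for x, y and height, the triangulation is computed automatically instead of using triang, and the figure size is arbitrary.

```python
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure
from matplotlib.tri import Triangulation

# Hypothetical scattered points and heights standing in for x, y and height.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
y = rng.uniform(0.0, 1.0, 50)
height = np.sin(4.0 * x) * np.cos(4.0 * y)

tri = Triangulation(x, y)  # Delaunay triangulation, since no triangles are passed

# Build the figure without pyplot so that no window is ever opened.
fig = Figure(figsize=(4, 3), dpi=100)
canvas = FigureCanvasAgg(fig)
ax = fig.add_axes([0, 0, 1, 1])  # axes fill the whole canvas
ax.set_axis_off()
ax.tricontourf(tri, height)

canvas.draw()
image = np.asarray(canvas.buffer_rgba())  # shape (rows, columns, 4), dtype uint8
print(image.shape, image.dtype)
```

The resulting array holds RGBA pixel colors, not height values, so it suits image processing rather than recovering the contour levels themselves.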
