Matplotlib - Plot histogram truncate bar - python

I am plotting a histogram of observed values from a population against a normal distribution (derived from the mean and std of the sample). The sample has an unusual number of observations of value 0 (not to be confused with "NAN"). As a result, the tall bar at 0 dominates the y-axis and the rest of the plot is hard to read.
How can I best truncate the one bar in the histogram to allow the rest of the plot to fill the frame?

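Why not cap the y-axis, for example at 0.00004? The tall bar is then clipped at the top of the frame and the rest of the plot becomes readable.
import matplotlib.pyplot as plt
axes = plt.gca()
axes.set_xlim([xmin, xmax])  # xmin, xmax: the x-range you want to show
axes.set_ylim([ymin, ymax])  # e.g. ymin = 0, ymax = 0.00004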
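As a minimal sketch of the same idea, the cap can be derived from the data instead of typed in by hand. The sample below is made up (normal values plus an excess of zeros), and taking 1.1 times the second-tallest bar as the limit is just one possible choice, not something from the question.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

# Hypothetical sample: normally distributed values plus many exact zeros.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(50, 10, 1000), np.zeros(2000)])

fig, ax = plt.subplots()
heights, edges, _ = ax.hist(sample, bins=40, density=True)

# Cap the y-axis just above the second-tallest bar, so only the zero spike is clipped.
second_tallest = np.sort(heights)[-2]
ax.set_ylim(0, 1.1 * second_tallest)
```

The bar at 0 now runs off the top of the frame, so it is worth saying so in the caption or annotating the bar with its true height.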

Related

Why is distplot changing the range of values plotted?

I have an array of values for which I am trying to fit a probability density function. I plotted the histogram using distplot as shown below:
x = [17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76, 122.86, 230.22, 208.66, 271.36, 122.68, 188.42,
     171.82, 102.30, 196.40, 107.38, 192.35, 179.66, 173.30, 254.66, 176.12, 75.365, 135.78, 103.66, 183.50, 166.08,
     207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48]
sns.distplot(x)
and the plot looks like this:
The minimum value in my array is about 17 and the maximum is about 271, so I don't understand the range on the x-axis in the figure, as I have not passed any extra arguments. Does sns.distplot standardize the data before plotting?
A KDE curve is built by placing a Gaussian (normal) curve on every data point and adding them up. Each normal curve has an infinite tail, so the KDE extends beyond the smallest and largest data values; the plot cuts it off once its height gets close enough to zero.
Note that sns.distplot has been deprecated since seaborn 0.11 and is replaced (in this case) by sns.histplot(..., kde=True). The KDE has a parameter cut= that sets how far, in multiples of the bandwidth, the curve extends past the extreme data points. It defaults to 3 in sns.kdeplot, but to 0 for the KDE drawn by sns.histplot, which cuts the curve at the data limits. cut is one of the kde_kws in histplot, so you can also set it explicitly: sns.histplot(x, kde=True, kde_kws={'cut': 0}).
import seaborn as sns
x = [17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76, 122.86, 230.22, 208.66, 271.36, 122.68, 188.42,
171.82, 102.30, 196.40, 107.38, 192.35, 179.66, 173.30, 254.66, 176.12, 75.365, 135.78, 103.66, 183.50, 166.08,
207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48]
sns.histplot(x, kde=True)
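To see where the extra x-range comes from, here is a hand-rolled Gaussian KDE in plain NumPy. It is only a sketch and not seaborn's implementation: the function name gaussian_kde_curve is made up, and the Scott's-rule bandwidth is an assumption. The cut argument mimics the idea of extending the evaluation grid by a multiple of the bandwidth.

```python
import numpy as np

def gaussian_kde_curve(data, cut=3.0, gridsize=200):
    """Evaluate a Gaussian KDE on a grid reaching `cut` bandwidths past the data."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Scott's rule of thumb for the bandwidth (an assumption of this sketch).
    bw = data.std(ddof=1) * n ** (-1 / 5)
    grid = np.linspace(data.min() - cut * bw, data.max() + cut * bw, gridsize)
    # One normal curve per data point, averaged over all points.
    z = (grid[:, None] - data[None, :]) / bw
    density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bw * np.sqrt(2 * np.pi))
    return grid, density

x = [17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76, 122.86, 230.22, 208.66, 271.36,
     122.68, 188.42, 171.82, 102.30, 196.40, 107.38, 192.35, 179.66, 173.30, 254.66, 176.12, 75.365,
     135.78, 103.66, 183.50, 166.08, 207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48]

grid_wide, dens_wide = gaussian_kde_curve(x, cut=3)    # tails reach past the data
grid_tight, dens_tight = gaussian_kde_curve(x, cut=0)  # stops at min(x) and max(x)
print("cut=3 grid:", grid_wide[0], "to", grid_wide[-1])
print("cut=0 grid:", grid_tight[0], "to", grid_tight[-1])
```

With cut=3 the grid starts below zero even though no data value is negative, which is the effect seen in the question. The data are not standardized; only the curve is wider than the data.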

Seaborn distplot() won't display frequency in the y-axis

I am trying to display the weighted frequency on the y-axis of a seaborn.distplot() graph, but it keeps displaying the density (which is the default in distplot()).
I read the documentation and also many similar questions here on Stack Overflow.
The common answer is to set norm_hist=False and to pass the weights as a NumPy array, as in a standard histogram. However, it keeps showing the density and not the probability/frequency of each bin.
My code is
plt.figure(figsize=(10, 4))
plt.xlim(-0.145, 0.145)
plt.axvline(0, color='grey')
data = df['col1']
x = np.random.normal(data.mean(), scale=data.std(), size=(100000))
normal_dist = sns.distplot(x, hist=False, color="red", label="Gaussian")
data_viz = sns.distplot(data, color="blue", bins=31, label="data", norm_hist=False)
# I also tried adding the weights inside the argument
#hist_kws={'weights': np.ones(len(data))/len(data)})
plt.legend(bbox_to_anchor=(1, 1), loc=1)
And I keep receiving this output:
Does anyone have an idea of what could be the problem here?
Thanks!
[EDIT]: The problem is that the y-axis shows the KDE values and not those of the weighted histogram. If I set kde=False, the y-axis displays the frequency. However, I still want to keep the KDE, so I am not considering that option.
Showing the KDE and the frequency/count on a single y-axis will not work, because they are on different scales. It is better to create a plot with two y-axes, one for the histogram and one for the KDE.
From the documentation of norm_hist: "If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted."
versusnja in https://github.com/mwaskom/seaborn/issues/479 has a workaround:
# Plot hist without kde.
# Create another Y axis.
# Plot kde without hist on the second Y axis.
# Remove Y ticks from the second axis.
first_ax = sns.distplot(data, kde=False)
second_ax = first_ax.twinx()
sns.distplot(data, ax=second_ax, kde=True, hist=False)
second_ax.set_yticks([])
If you need this just for visualization it should be good enough.
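If a single y-axis is preferred, another option is to rescale the density curve into the units of the histogram. The sketch below uses plain NumPy with made-up data standing in for df['col1'], and a simple Gaussian KDE with a Scott's-rule bandwidth (an assumption, not seaborn's exact computation). It relies on the fact that a density has area 1, while a count histogram has area N times the bin width.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 0.03, 5000)  # hypothetical stand-in for df['col1']

counts, edges = np.histogram(data, bins=31)
bin_width = edges[1] - edges[0]
centers = 0.5 * (edges[:-1] + edges[1:])

# Gaussian KDE evaluated at the bin centers (Scott's rule bandwidth, an assumption).
bw = data.std(ddof=1) * data.size ** (-1 / 5)
z = (centers[:, None] - data[None, :]) / bw
density = np.exp(-0.5 * z**2).sum(axis=1) / (data.size * bw * np.sqrt(2 * np.pi))

# Density -> expected count per bin: multiply by N * bin_width.
kde_as_counts = density * data.size * bin_width
# Density -> expected frequency (probability) per bin: multiply by bin_width only.
kde_as_frequency = density * bin_width
```

Drawing counts as bars (for example with ax.bar(centers, counts, width=bin_width)) and kde_as_counts as a line on the same axes keeps both on the count scale; dividing counts by data.size and using kde_as_frequency gives the frequency scale instead.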

Density plot using seaborn

I'm trying to make a density plot of the hourly demand:
data
The 'hr' means different hours, 'cnt' means demand.
I know how to make a density plot such as:
sns.kdeplot(bike['hr'])
However, this only works when each row is a single observation, so that the number of rows per hour stands in for its demand. Now that I already know the demand count for each hour, how can I make a density plot of such data?
A density plot aims to show an estimate of a distribution. To make a graph showing the density of hourly demand, we would really expect to see many iid samples of demand, with time-stamps, i.e. one row per sample. Then a density plot would make sense.
But in the type of data here, where the demand ('cnt') is sampled regularly and aggregated over that sample period (the hour), a density plot is not directly meaningful. But a bar graph as a histogram does make sense, using the hours as the bins.
Below I show how to use pandas functions to produce such a plot -- really simple. For reference I also show how we might produce a density plot, through a sort of reconstruction of "original" samples.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("../data/hour.csv")  # load dataset, inc cols hr, cnt, no NaNs
# using the bar plotter built in to pandas objects
fig, ax = plt.subplots(1, 2)
df.groupby('hr')['cnt'].sum().plot.bar(ax=ax[0])
# reconstructed samples - has df.cnt.sum() rows, each one containing the hour of one rental
samples = np.repeat(df.hr.to_numpy(), df.cnt.to_numpy())
# plot a density estimate (seaborn before 0.11 called bw_method `bw`)
sns.kdeplot(samples, bw_method=0.5, linewidth=3, color="r", ax=ax[1])
# to make a useful comparison with a density estimate, we need to have our bar areas
# sum up to 1, so we divide the per-hour totals by the total of all counts.
tot = float(df.cnt.sum())
(df.groupby('hr')['cnt'].sum() / tot).plot.bar(ax=ax[1], color='C0')
Demand for bikes seems to be low during the night... But it is also apparent that they are probably used for commuting, with peaks at hours 8am and 5-6pm.
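Rebuilding one sample per rental can get large. As a sketch with made-up numbers (the hr and cnt arrays below are hypothetical, not the bike dataset), the same normalized histogram can be obtained by weighting each row with its count:

```python
import numpy as np

# Hypothetical aggregated data: one row per (day, hour) with a demand count.
hr = np.array([0, 8, 17, 0, 8, 17])
cnt = np.array([5, 40, 55, 3, 45, 52])

bins = np.arange(25) - 0.5  # one bin per hour, centered on 0..23

# Option A: rebuild one sample per rental, then histogram the samples.
samples = np.repeat(hr, cnt)
dens_from_samples, _ = np.histogram(samples, bins=bins, density=True)

# Option B: skip the reconstruction and weight each row by its count.
dens_from_weights, _ = np.histogram(hr, bins=bins, weights=cnt, density=True)

print(np.allclose(dens_from_samples, dens_from_weights))
```

Both options give the same bar heights, so the weighted form is a drop-in replacement when memory matters. Recent seaborn versions also accept a weights argument in sns.kdeplot, which may serve the same purpose for the smoothed curve.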

Plotting a column with millions of rows

I have a data-frame with millions of rows (almost 8 million). I need to see the distribution of the values in one of the columns. This column is called 'price_per_mile'. I also have a column called 'Borough'. The final goal is doing a t-test.
First I want to see the distribution of data in 'price_per_mile', to see if data is normal and if I need to do some data cleaning. Then group-by based on five categories in 'borough' column and then do the t-test for each possible pair of boroughs.
I have tried to plot the distribution with sns.distplot() but it doesn't give me a clear plot as it seems there's a scaling of the values on the y-axis. Also, the range of values contained in 'price_per_mile' is big.
Then I tried to plot a section of values, again the plot doesn't look clear and informative enough. Scaling happens again.
result.drop(result[(result.price_per_mile <1) | (result.price_per_mile>200)].index, inplace=True)
What do I need to do to have a better-looking plot which gives me the true value of each bin and not just a normalized value?
I read the documentation for sns.distplot() but didn't find something helpful.
As per the documentation for distplot (emphasis mine):
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
This means that if you want the non-normalized histogram, you have to tell seaborn not to plot the KDE at the same time:
sns.distplot(a, kde=True, norm_hist=False)   # still a density: the KDE forces normalization
sns.distplot(a, kde=False, norm_hist=False)  # counts
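For a column with millions of rows it can also help to compute the raw counts directly and to restrict the bins rather than dropping rows. The sketch below uses NumPy only; the lognormal data are a made-up stand-in for result['price_per_mile'], and the 1 to 200 window is taken from the question.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-in for result['price_per_mile']: skewed, with large outliers.
price_per_mile = rng.lognormal(mean=1.5, sigma=1.0, size=200_000)

# density=False (the default) returns the raw number of rows in each bin;
# range= restricts the bins to 1..200 without modifying the data.
counts, edges = np.histogram(price_per_mile, bins=50, range=(1, 200))

in_range = ((price_per_mile >= 1) & (price_per_mile <= 200)).sum()
print(counts.sum() == in_range)
```

plt.hist(price_per_mile, bins=50, range=(1, 200)) draws the same counts, and in current seaborn sns.histplot reports counts by default (stat="count").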

Plotting histogram with matplotlib

I am trying to plot data as a histogram or bar chart in Python. The array holds between 0 and 10000 entries. Each entry depends on the input and lies between 0 and 1e+20 (mostly the values are in the same range). I want to draw a histogram with matplotlib that shows how often the values fall into each interval (to illustrate the mean and deviation). Sometimes it works, like this:
hist1.
But sometimes there is a problem with the interval size, like this:
hist2.
In this plot I need more bars in the region 0-100, etc.
Can anyone help me with this?
The plots are just made with:
from numpy.linalg import *
import matplotlib.pyplot as plt
plt.hist(numbers,bins=100)
plt.show()
By default, hist produces a plot with an x range that covers the full range of your data.
If you have one outlier at a very high x compared with the other values, almost all the data ends up in a few bins and the figure looks 'compressed'.
If you always want the same view, you can fix the limits with xlim.
Alternatively, if you want the distribution to be centered and shown as clearly as possible, you can calculate the mean and the standard deviation of your data and set the x range accordingly (e.g. mean +/- 5 stdev).
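A minimal sketch of the mean +/- 5 stdev suggestion, with made-up data (a bulk around 50 and one outlier at 1000). It passes the interval to hist through range=, so that all 100 bins fall inside it; setting xlim alone would only zoom in on the few wide bins.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Hypothetical data: 5000 values around 50 plus a single outlier.
numbers = np.append(rng.normal(50, 10, 5000), 1000.0)

center, spread = numbers.mean(), numbers.std()
lo, hi = center - 5 * spread, center + 5 * spread

fig, ax = plt.subplots()
# range= places all 100 bins inside [lo, hi]; values outside are ignored.
counts, edges, _ = ax.hist(numbers, bins=100, range=(lo, hi))
ax.set_xlim(lo, hi)
```

Note that a very extreme outlier inflates the mean and standard deviation themselves, so the window can still come out too wide. In that case a range based on the median or on percentiles (for example np.percentile(numbers, [1, 99])) may behave better.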
