I have a dataset with 17 features and 14k observations.
I would like to plot the price distribution to get a better understanding of the data. The price feature has a float64 dtype.
Plotting the price distribution gives me the following:
Why does the plot look like this? Is something wrong with my data, and what is the proper way to fix it?
code:
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
data['sale_price'].hist(bins=50, ax=ax)
plt.xlabel('Price')
plt.title('Distribution of prices')
plt.ylabel('Number of houses')
It seems your distribution is heavily long-tailed: you have prices up to about 3e7 while the majority of your data is much smaller, on the order of 1e6. With bins=50 the bin edges are spread evenly over that whole range, so the first bin ends up containing almost all of the data. Possible treatments:
Use logarithmic bins (see this post)
Choose bins according to the 0-75% quantile range
Note, however, that the second option piles the value counts beyond the cutoff into the right tail of the histogram, which may not be desired. Still, it depends on the data. For house prices I would use logarithmic bins, as that makes more sense for visualization.
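For illustration, a minimal sketch of the logarithmic-bin option, reusing the data['sale_price'] column from the question (this assumes all prices are strictly positive, which log bins require):
import numpy as np
import matplotlib.pyplot as plt

prices = data['sale_price']  # `data` is assumed to be the DataFrame from the question
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
log_bins = np.logspace(np.log10(prices.min()), np.log10(prices.max()), 50)  # 50 edges, evenly spaced in log space
prices.hist(bins=log_bins, ax=ax)
ax.set_xscale('log')  # match the axis scale to the bins
ax.set_xlabel('Price')
ax.set_ylabel('Number of houses')
ax.set_title('Distribution of prices (log bins)')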
Related
As the title explains, I am trying to reproduce a stacked bar chart where the y-axis scale is linear but the fill of the plot (i.e. the stacked bars) is grouped logarithmically, in orders of 10.
I have made this plot before on R-Studio with an in-house package, however I am trying to reproduce the plot with other programs (python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with, to give you an idea of my data:
I am trying to reproduce this following plot I made with R-Studio:
this one here
This plot has the dataset divided into groups based on frequency: the top 10 most frequent clones are grouped in red, followed by the next top 100, the next 1000, and so on. The y-axis runs from 0.00 to 1.00; a 0-100% scale would mean the same thing in this context.
This is just to visualize whether I have big clones (the top 10) and how much of the overall frequency they occupy: the bigger the red stack, the larger my clones, signifying a significant clonal expansion of a few selected cells in my sample.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
(MYDATAFRAME.groupby(['Sample', 'cloneFraction'])
            .size()
            .groupby(level=0)
            .apply(lambda x: 100 * x / x.sum())
            .unstack()
            .plot(kind='bar', stacked=True, legend=None))
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis misbehaving and not showing a % when I use the log scale (which is a secondary issue), I don't know how to group the data entries by frequency as described above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
Just to see if I could separate them in a correct way but this does not achieve what I would like (other than being a pie chart, but I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't get that working either: I get errors when I add it to the plotting code above, and when I use it without the other extras (% scale, etc.) the stacked bar plot just shows the top 10 across all 4 samples in my data rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, and so on.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column holding the category each entry falls into, based on its value (i.e. whether it is among the top 10 most frequent, the next 100, etc.):
df['category'] = '10001+'  # default: everything beyond the top 10000
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    # overwrite from the largest group down to the smallest, per sample
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index, 'category'] = '1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index, 'category'] = '101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index, 'category'] = '11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index, 'category'] = 'top10'
This code assigns the biggest group (10001+) as the default and then overwrites it with progressively smaller, more frequent groups, so entries that overlap several thresholds end up in the most specific category.
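For comparison, a possibly tidier equivalent of the same idea (a sketch, not the original code) ranks clones within each sample and bins the ranks with pd.cut:
import pandas as pd

# rank 1 = most frequent clone within its sample; method='first' breaks ties arbitrarily
rank = df.groupby('sample_ref')['cloneCount'].rank(method='first', ascending=False)
df['category'] = pd.cut(rank,
                        bins=[0, 10, 100, 1000, 10000, float('inf')],
                        labels=['top10', '11-100', '101-1000', '1001-10000', '10001+'])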
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample', 'category'])['cloneFraction'].sum().unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!
I am plotting a histogram of observed values from a population against a normal distribution (derived from the mean and std of the sample). The sample has an unusually large number of observations with value 0 (not to be confused with "NaN"). As a result, the comparison of the two does not show clearly.
How can I best truncate the one bar in the histogram to allow the rest of the plot to fill the frame?
Why don't you set the y-limit to 0.00004? Then you can analyze the plot better:
axes = plt.gca()
axes.set_xlim([xmin, xmax])  # xmin, xmax: your chosen x-range
axes.set_ylim([ymin, ymax])  # e.g. ymax=0.00004 to clip the tall zero bar
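For concreteness, a self-contained sketch with made-up data (the original sample isn't shown), illustrating how clipping the y-axis lets the rest of the plot fill the frame; the ylim value here suits the fake data, not necessarily yours:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(50, 10, 5000), np.zeros(20000)])  # normal data plus a spike of zeros

fig, ax = plt.subplots()
ax.hist(sample, bins=100, density=True, label='observed')
x = np.linspace(sample.min(), sample.max(), 500)
ax.plot(x, stats.norm.pdf(x, sample.mean(), sample.std()), 'r', label='normal fit')  # fit from sample mean/std
ax.set_ylim(0, 0.01)  # clip the zero spike; pick the limit that suits your data
ax.legend()
plt.show()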
In my program I calculate N values of each of three parameters, and I want to create a histogram for each parameter. I have strict conditions for the histograms: first, a condition on the range (at certain points the histogram must go strictly to zero), and second, the curve should be smooth.
I use np.histogram, as following:
hist, bins = np.histogram(Gamm1, bins=100)
center = (bins[:-1] + bins[1:]) / 2  # midpoints of the bins
plt.plot(center, hist)
plt.show()
but the result is too jagged. After that, I use the following construction with seaborn:
snsplot = sns.kdeplot(data['Third'], shade=True)
fig = snsplot.get_figure()
fig.savefig("output2.png")
but here the approximation goes outside the allowed range (the range comes from physical conditions).
I think that adjusting the bins in the seaborn solution, as can be done for np.histogram, might help.
But, in the end, I'm looking for a solution that is both smooth and confined to the range I specify.
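One possible direction (an assumption on my part, not from the original post): seaborn's kdeplot takes a clip argument that confines the curve to a given interval, and bw_adjust controls smoothness. A minimal sketch, with lo and hi standing in for the physical bounds and data['Third'] taken from the snippet above:
import seaborn as sns
import matplotlib.pyplot as plt

lo, hi = 0.0, 1.0  # hypothetical physical bounds; substitute the real ones

fig, ax = plt.subplots()
# clip truncates the drawn curve to [lo, hi] (note: the area is not renormalised);
# a larger bw_adjust gives a smoother curve
sns.kdeplot(data['Third'], clip=(lo, hi), bw_adjust=1.5, fill=True, ax=ax)
ax.set_xlim(lo, hi)
fig.savefig("output2.png")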
I wanted to use seaborn to visualize my entire pandas DataFrame with violin plots, and I thought I had made the adjustments needed to generate a graph large enough for the 270 variables my DataFrame contains.
However, no matter what I do, the violin plots only display their inner mini-boxplots for each variable (as another question here describes), and not their KDEs:
fig, ax = plt.subplots(figsize=(50,5))
ax.set_ylim(-6, 6)
a = sns.violinplot(x='variable', y='value', data=pd.melt(train_norm), ax=ax)
a.set_xticklabels(a.get_xticklabels(), rotation=90);
plt.savefig('massive_violinplot.png', dpi=220)  # size comes from figsize in plt.subplots, not savefig
(apologies for the cropped graph, the whole thing is too big to post)
Whereas the following code, using the same pd.Dataframe, but only showing the first six variables, displays correctly:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_ylim(-6, 6)
a = sns.violinplot(x='variable', y='value', data=pd.melt(train_norm.iloc[:,:6]), ax=ax)
a.set_xticklabels(a.get_xticklabels(), rotation=90);
plt.savefig('massive_violinplot.png', dpi=220)
How could I get a graph like the one above for all the variables, with proper violin plots showing their KDEs?
This is not related to the number of variables or the plot size but to the huge differences in the distributions of the variables. I can't access your data right now, so I will illustrate with a made-up dataset. You can follow along with your own dataset by selecting the three variables with the most dispersion and the three with the least (see the sketch below). As a dispersion measure you can use the variance, or even the data range (if you don't have crazy long tails), or something else; I am not sure what would work best.
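A minimal sketch of that selection step, assuming train_norm is the wide DataFrame from the question and using variance as the dispersion measure:
variances = train_norm.var()  # per-column variance
cols = variances.nlargest(3).index.tolist() + variances.nsmallest(3).index.tolist()
subset = train_norm[cols]  # the 3 most and 3 least dispersed variables
Now for the made-up illustration: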
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
rs = np.random.RandomState(42)
data = rs.randn(100, 6)
data[:, :3] *= 20  # first three columns get 20x the dispersion
df = pd.DataFrame(data)
See what happens if we plot the density with common axes so they are directly comparable.
df.plot(kind='kde', subplots=True, layout=(3, 2), sharex=True, sharey=True)
plt.tight_layout()
This is more or less the same you can see in the seaborn violin plot but of course transposed.
sns.violinplot(x='variable', y='value', data=pd.melt(df))
This is usually great for comparing the variables because you can read differences in width as differences in density. Unfortunately the violins for the variables with more dispersion are so narrow that you can't see the width at all, and you lose any sense of the shape. On the other hand, the variables with less dispersion appear too short (in your dataset some of them are actually just horizontal lines).
For the first problem you can make the violins use all the available horizontal space by passing scale='width', but then you can no longer compare the density across variables. The width is the same at the peaks, but the density is not.
sns.violinplot(x='variable', y='value', data=pd.melt(df), scale='width')
By the way, this is what matplotlib's violin plot does by default.
plt.violinplot(df.T)
For the second problem I think your only option is to normalize or standardize the variables in some way.
sns.violinplot(x='variable', y='value', data=pd.melt((df - df.mean()) / df.std()))
Now you have a clearer view of each variable separately (how many modes they have, how skewed they are, how long the tails are...) but you can compare neither the scale nor the dispersion across variables.
The moral of the story is that you can't see everything at once, you have to pick and choose depending on what you are looking for in the data.
I'm trying to make a density plot of the hourly demand:
data
Here 'hr' is the hour of the day and 'cnt' is the demand.
I know how to make a density plot such as:
sns.kdeplot(bike['hr'])
However, this only works when each row is a single observation, so that counting rows per hour gives the demand. In my data I already know the demand count for each hour; how can I make a density plot of such data?
A density plot aims to show an estimate of a distribution. To make a graph showing the density of hourly demand, we would really expect to see many iid samples of demand, with time-stamps, i.e. one row per sample. Then a density plot would make sense.
But in the type of data here, where the demand ('cnt') is sampled regularly and aggregated over that sample period (the hour), a density plot is not directly meaningful. But a bar graph as a histogram does make sense, using the hours as the bins.
Below I show how to use pandas functions to produce such a plot -- really simple. For reference I also show how we might produce a density plot, through a sort of reconstruction of "original" samples.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("../data/hour.csv")  # load dataset, incl. cols hr, cnt; no NaNs
# using the bar plotter built in to pandas objects
fig, ax = plt.subplots(1,2)
df.groupby('hr').agg({'cnt':sum}).plot.bar(ax=ax[0])
# reconstructed samples - has df.cnt.sum() rows, each one containing an hour of a rental.
samples = np.hstack([ np.repeat(h, df.cnt.iloc[i]) for i, h in enumerate(df.hr)])
# plot a density estimate
sns.kdeplot(samples, bw_method=0.5, lw=3, color="r", ax=ax[1])
# to make a useful comparison with a density estimate, we need to have our bar areas
# sum up to 1, so we use groupby.apply to divide by the total of all counts.
tot = float(df.cnt.sum())
df.groupby('hr').apply(lambda x: x['cnt'].sum()/tot).plot.bar(ax=ax[1], color='C0')
Demand for bikes seems to be low during the night... but it is also apparent that they are probably used for commuting, with peaks around 8am and 5-6pm.