I'm trying to make a density plot of the hourly demand:
[Image: sample of the data]
The 'hr' column is the hour of the day and 'cnt' is the demand in that hour.
I know how to make a density plot such as:
sns.kdeplot(bike['hr'])
However, this only works when the demand for each hour is not given explicitly, so that the number of rows for an hour stands in for its demand. Now that I know the demand count for each hour, how can I make a density plot of such data?
A density plot aims to show an estimate of a distribution. To make a graph showing the density of hourly demand, we would really expect to see many iid samples of demand, with time-stamps, i.e. one row per sample. Then a density plot would make sense.
With the type of data here, however, where the demand ('cnt') is sampled regularly and aggregated over the sample period (the hour), a density plot is not directly meaningful. A bar graph used as a histogram does make sense, with the hours as the bins.
Below I show how to use pandas functions to produce such a plot -- really simple. For reference I also show how we might produce a density plot, through a sort of reconstruction of "original" samples.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("../data/hour.csv")  # load dataset, inc cols hr, cnt, no NaNs
# using the bar plotter built in to pandas objects
fig, ax = plt.subplots(1, 2)
df.groupby('hr')['cnt'].sum().plot.bar(ax=ax[0])
# reconstructed samples - has df.cnt.sum() entries, each one the hour of a single rental.
samples = np.repeat(df['hr'].to_numpy(), df['cnt'].to_numpy())
# plot a density estimate (bw_method replaces the older bw argument in seaborn >= 0.11)
sns.kdeplot(x=samples, bw_method=0.5, linewidth=3, color="r", ax=ax[1])
# to make a useful comparison with a density estimate, we need to have our bar areas
# sum up to 1, so we divide the hourly totals by the total of all counts.
tot = float(df['cnt'].sum())
(df.groupby('hr')['cnt'].sum() / tot).plot.bar(ax=ax[1], color='C0')
plt.show()
Demand for bikes seems to be low during the night. It is also apparent that they are probably used for commuting, with peaks at 8am and 5-6pm.
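If reconstructing one entry per rental is too heavy, the hourly totals could instead be used as weights in the kernel estimate. Below is a minimal sketch in plain NumPy, assuming a Gaussian kernel and a hand-picked bandwidth of 1 hour. The function name `weighted_kde` and the toy demand values are made up for illustration and do not come from the dataset.

```python
import numpy as np

def weighted_kde(points, weights, grid, bandwidth=1.0):
    """Gaussian kernel density estimate where each point carries a weight."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the density integrates to ~1
    # one Gaussian bump per point, evaluated at every grid position
    z = (grid[:, None] - points[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels @ weights

hours = np.arange(24)
# hypothetical hourly totals, for illustration only
demand = np.array([5, 3, 2, 1, 1, 4, 15, 40, 70, 35, 25, 30,
                   38, 36, 33, 37, 50, 80, 75, 50, 35, 25, 18, 10])
grid = np.linspace(-3, 26, 500)
density = weighted_kde(hours, demand, grid)
```

With the real data, `hours` and `demand` would be the index and values of `df.groupby('hr')['cnt'].sum()`. As far as I know, recent seaborn versions also accept a `weights` argument in `kdeplot`, which may achieve the same thing more directly.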
Related
I am working on a relatively large dataset (5000 rows) in pandas and would like to draw a bar plot that is continuous and uses different colors [1].
For every depth value there is a value of SBT.
Initially, I thought to generate a bar for each depth, but due to the amount of data, the graph does not display it very well and it takes a really long time to load.
In the meantime, I generated a plot of the data, but with lines.
I added the code and the picture of this plot below [2].
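import matplotlib.pyplot as plt

# datos (a DataFrame) and SBT are defined earlier (not shown)
fig, SBTcla = plt.subplots()
SBTcla.plot(SBT, datos['Depth (m)'], color='black', label='SBT')
plt.xlim(0, 10)
plt.grid(color='grey', linestyle='--', linewidth=1)
plt.title('SBT')
plt.xlabel('SBT')
plt.ylabel('Profundidad (mts)')  # "Depth (m)"
plt.gca().invert_yaxis()  # depth increases downwards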
Your graph consists of a lot of points that carry no information. Consecutive rows which contain the same SBT could be eliminated. Grouping consecutive rows with equal content can be done with a shift and a cumulative sum. The boolean expression looks for steps from one region to the next: at a step it returns True, and the sum increases by one.
x = datos.groupby((datos['SBT'].shift() != datos['SBT']).cumsum())
Each group can then be plotted on its own, with a filled style.
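One way to draw the groups as filled blocks is `fill_betweenx`, with one call per group. The sketch below is only a guess at what such a plot could look like. The toy `datos` frame, the colour map and the choice to extend each block down to where the next one starts are assumptions, not part of the original post.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# toy stand-in for the real data: SBT is constant over runs of consecutive depths
datos = pd.DataFrame({
    "Depth (m)": np.arange(0, 10, 0.5),
    "SBT": [3] * 4 + [5] * 6 + [4] * 3 + [6] * 7,
})

# label runs of consecutive equal SBT values: 1, 1, 1, 1, 2, 2, ...
run_id = (datos["SBT"].shift() != datos["SBT"]).cumsum().rename("run")
runs = datos.groupby(run_id).agg(
    sbt=("SBT", "first"),
    top=("Depth (m)", "min"),
    bottom=("Depth (m)", "max"),
)
# extend each block to where the next one starts, so no gaps are left between blocks
runs["bottom"] = runs["top"].shift(-1).fillna(runs["bottom"])

fig, ax = plt.subplots()
cmap = plt.get_cmap("viridis")
for _, run in runs.iterrows():
    # one filled rectangle per run, from x=0 to x=SBT, coloured by SBT value
    ax.fill_betweenx([run["top"], run["bottom"]], 0, run["sbt"],
                     color=cmap(run["sbt"] / 10))
ax.set_xlim(0, 10)
ax.set_xlabel("SBT")
ax.set_ylabel("Depth (m)")
ax.invert_yaxis()
```

This draws one polygon per run instead of one bar per row, which should keep the figure light even with thousands of rows. Call `plt.show()` or `fig.savefig(...)` to see the result.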
I have a dataset with 17 features and 14k observations.
I would like to plot the price distribution to get a better understanding. The price feature has a float64 data type.
Plotting the price distribution gives me the following:
[Image: histogram of the price distribution]
Why does this plot look like this? Is something wrong with my data? What's the proper way to solve this?
code:
fig, ax = plt.subplots(1, 1, figsize = (9,5))
data['sale_price'].hist(bins=50, ax=ax)
plt.xlabel('Price')
plt.title('Distribution of prices')
plt.ylabel('Number of houses')
It seems your histogram is heavily long-tailed: you have prices up to 3*1e7, while the majority of your data are much smaller, on the order of 1e6. With bins=50 the bins are equally wide, so the first bin ends up containing almost all of the data. Possible treatments:
Use logarithmic bins (see this post)
Choose bins according to the 0-75 quantiles
However, note that the 2nd solution creates an ugly accumulation of value counts at the right tail of the histogram, which may not be desired. Still, it depends on the data. I'd use a logarithmic histogram for house prices; I guess it makes more sense in terms of visualization.
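The logarithmic-bins option could look roughly like the sketch below. The prices are synthetic (a log-normal sample standing in for `data['sale_price']`), and the approach assumes all prices are strictly positive, since `np.geomspace` cannot start at zero.

```python
import matplotlib.pyplot as plt
import numpy as np

# synthetic, long-tailed stand-in for the sale prices (not real data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=13.5, sigma=0.8, size=14_000)

# 51 edges (50 bins) spaced evenly on a log scale between the smallest and largest price
log_bins = np.geomspace(prices.min(), prices.max(), 51)

fig, ax = plt.subplots(figsize=(9, 5))
counts, edges, _ = ax.hist(prices, bins=log_bins)
ax.set_xscale("log")  # so the bins appear equally wide on screen
ax.set_xlabel("Price")
ax.set_ylabel("Number of houses")
ax.set_title("Distribution of prices")
```

Each bin then covers the same ratio of prices rather than the same absolute width, so the bulk of the data is spread over many bins instead of being squeezed into the first one.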
I am plotting a histogram of observed values from a population against a normal distribution (derived from the mean and std of the sample). The sample has an unusual number of observations with value 0 (not to be confused with "NaN"). As a result, the graph of the two does not show clearly.
How can I best truncate the one bar in the histogram to allow the rest of the plot to fill the frame?
Why don't you set the y-limit to 0.00004? Then you can analyze the plot better.
axes = plt.gca()
axes.set_xlim([xmin,xmax])
axes.set_ylim([ymin,ymax])
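Rather than hard-coding the limit, it could be derived from the histogram itself, for instance a little above the tallest bar that is not the spike at 0. Below is a sketch with synthetic data. The sample, the bin count and the 10% headroom are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# synthetic sample: normal values plus an inflated number of exact zeros
sample = np.concatenate([rng.normal(50_000, 10_000, 2_000), np.zeros(3_000)])

fig, ax = plt.subplots()
heights, edges, _ = ax.hist(sample, bins=60, density=True)

# overlay the normal curve derived from the sample mean and standard deviation
mu, sigma = sample.mean(), sample.std()
x = np.linspace(edges[0], edges[-1], 400)
ax.plot(x, np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))

# cap the y-axis just above the second-tallest bar, so the bar at 0 is cut off
second_tallest = np.sort(heights)[-2]
ax.set_ylim(0, 1.1 * second_tallest)
```

The bar at 0 then runs off the top of the frame, and the rest of the histogram fills the plot.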
In my program, I calculate N values of each of three parameters and want to create three histograms, one for each parameter. I have strict conditions on the histograms: first, a condition on the range (beyond certain points the histogram must be strictly zero), and second, the curve should be smooth.
I use np.histogram, as follows:
hist, bins = np.histogram(Gamm1, bins=100)
center = (bins[:-1] + bins[1:]) / 2  # bin midpoints (bins[:-1] alone gives the left edges)
plt.plot(center, hist)
plt.show()
but the result is too jagged. After that, I tried the following construction with seaborn,
snsplot = sns.kdeplot(data['Third'], fill=True)  # fill replaces the deprecated shade argument
fig = snsplot.get_figure()
fig.savefig("output2.png")
but here the approximation extends beyond the allowed range (the range is set by physical conditions).
I think that changing the bins in the seaborn solution, as can be done for np.histogram, could help.
In the end, though, I'm looking for a solution that is smooth and stays within the range I specify.
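One possible approach is to bin the data over exactly the physical range and then smooth the bin heights, so that the curve is only ever defined inside that range. The sketch below uses a Gaussian smoothing kernel whose width is given in bins. The helper name `smooth_hist`, the kernel width and the synthetic beta-distributed sample are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

def smooth_hist(values, value_range, bins=100, sigma_bins=3.0):
    """Histogram over a fixed range, smoothed with a Gaussian kernel (width in bins)."""
    hist, edges = np.histogram(values, bins=bins, range=value_range, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    half = int(4 * sigma_bins)
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (offsets / sigma_bins) ** 2)
    kernel /= kernel.sum()
    # mode="same" keeps the output on the original bin centers; positions outside
    # the range count as zero, so nothing leaks beyond the given limits
    return centers, np.convolve(hist, kernel, mode="same")

rng = np.random.default_rng(2)
gamm1 = rng.beta(2, 5, size=10_000)  # synthetic stand-in, confined to [0, 1]
centers, smooth = smooth_hist(gamm1, value_range=(0.0, 1.0))

fig, ax = plt.subplots()
ax.plot(centers, smooth)
ax.set_xlim(0.0, 1.0)
```

A larger `sigma_bins` gives a smoother curve at the cost of flattening real features. As far as I know, recent seaborn versions also provide `cut` and `clip` arguments on `kdeplot` to limit where the density is evaluated, which may be enough on its own.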
I have a dataset where I have observations at a range of depths in the sea. I am trying to plot the frequency on the x-axis and the depth on the y-axis of a histogram.
(This section is sorted.)
In order to do this, I wanted to bin the data into 5-metre categories. The only problem is that the depths in the dataset start at 2 m, but I want the bins to start at 0 m and increase in 5 m intervals. I don't know how to set the bins to start at 0 m.
In addition, I'm plotting a histogram to show this, and I would like the depth to be on the y-axis (so that the plot is a bit more intuitive to look at)
Currently I have code that sorts the data into bins (I don't know the size) and then plots a histogram, but with the depth on the x-axis. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
#Data read in at this point
depthout = []
Dout = np.array(depthout)
bins = np.linspace(0,260,61)
plt.hist(Dout, bins)
plt.show()
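A sketch of one way to get both things: explicit 5 m bin edges starting at 0, and depth on the y-axis via `orientation="horizontal"`. The depths here are synthetic, and the upper edge is rounded up to the next multiple of 5 so that the deepest observation still falls inside a bin.

```python
import matplotlib.pyplot as plt
import numpy as np

# synthetic depths starting at 2 m, standing in for the real observations
rng = np.random.default_rng(3)
Dout = rng.uniform(2, 258, size=500)

# bin edges 0, 5, 10, ... up to the first multiple of 5 at or above the deepest value
bin_width = 5
upper = bin_width * np.ceil(Dout.max() / bin_width)
bins = np.arange(0, upper + bin_width, bin_width)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(Dout, bins=bins, orientation="horizontal")
ax.set_xlabel("Frequency")
ax.set_ylabel("Depth (m)")
ax.invert_yaxis()  # put 0 m at the top, as in a depth profile
```

For comparison, `np.linspace(0, 260, 61)` produces 60 bins of about 4.33 m each, which is why the bin size in the original code is hard to read off.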