Frequency in seaborn histograms - python

I have a dataset of used cars. I have made a histogram plot for the count of cars by their age (in months).
sns.distplot(df['Age'],kde=False,bins=6)
And the plot looks something like this:
Is there any way I can depict the frequency values of each bin in the plot itself
PS: I know I can fetch the values using the numpy histogram function which is
np.histogram(df['Age'],bins=6)
Basically I want the plot to look somewhat like this I guess so:

You can iterate over the patches belonging to the ax, get their position and height and use these to create annotations.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
sns.set_style()
df = pd.DataFrame({'Age': np.random.triangular(1, 80, 80, 1000).astype(np.int)})
ax = sns.distplot(df['Age'], kde=False, bins=6)
for p in ax.patches:
ax.annotate(f'{p.get_height():.0f}\n',
(p.get_x() + p.get_width() / 2, p.get_height()), ha='center', va='center', color='crimson')
plt.show()

Related

Histogram with stacked percentage for each bin

I am trying to plot a histogram with the proportion of the class (0/1) for each bin.
I have already plotted a barplot with stacked percentage (image below), but it doesn't look the way I want to.
Stacked percentage barplot
I want something like this (it was on this post, but it is coded in R, I want it in python), and if possible, using the seaborn library:
Stacked percentage histplot
My dataset is super simple, it contains a column with the age and another one for classification (0/1):
df.head()
[dataset
With seaborn, you can use sns.histplot(..., multiple='fill').
Here is an example starting from the titanic dataset:
from matplotlib import pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import numpy as np
titanic = sns.load_dataset('titanic')
ax = sns.histplot(data=titanic, x='age', hue='alive', multiple='fill', bins=np.arange(0, 91, 10), palette='spring')
for bars in ax.containers:
heights = [b.get_height() for b in bars]
labels = [f'{h * 100:.1f}%' if h > 0.001 else '' for h in heights]
ax.bar_label(bars, labels=labels, label_type='center')
ax.yaxis.set_major_formatter(PercentFormatter(1))
ax.set_ylabel('Percentage of age group')
plt.tight_layout()
plt.show()

Overlaying Pandas plot with Matplotlib is sensitive to the plotting order

I have the following problem: I'm trying to overlay two plots: One Pandas plot via plot.area() for a dataframe, and a second plot that is a standard Matplotlib plot. Depending the coder order for those two, the Matplotlib plot is displayed only if the code is before the Pandas plot.area() on the same axes.
Example: I have a Pandas dataframe called revenue that has a DateTimeIndex, and a single column with "revenue" values (float). Separately I have a dataset called projection with data along the same index (revenue.index)
If the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Pandas area plot
revenue.plot.area(ax = ax)
# Second -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
plt.tight_layout()
plt.show()
Then the only thing displayed is the pandas plot.area() like this:
1/ Pandas plot.area() and 2/ Matplotlib line plot
However, if the order of the plotting is reversed:
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
# Second -- Pandas area plot
revenue.plot.area(ax = ax)
plt.tight_layout()
plt.show()
Then the plots are overlayed properly, like this:
1/ Matplotlib line plot and 2/ Pandas plot.area()
Can someone please explain me what I'm doing wrong / what do I need to do to make the code more robust ? Kind TIA.
The values on the x-axis are different in both plots. I think DataFrame.plot.area() formats the DateTimeIndex in a pretty way, which is not compatible with pyplot.plot().
If you plot of the projection first, plot.area() can still plot the data and does not format the x-axis.
Mixing the two seems tricky to me, so I would either use pyplot or Dataframe.plot for both the area and the line:
import pandas as pd
from matplotlib import pyplot as plt
projection = [1000, 2000, 3000, 4000]
datetime_series = pd.to_datetime(["2021-12","2022-01", "2022-02", "2022-03"])
datetime_index = pd.DatetimeIndex(datetime_series.values)
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]})
revenue = revenue.set_index(datetime_index)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Option 1: only pyplot
ax[0].fill_between(revenue.index, revenue.value)
ax[0].plot(revenue.index, projection, color='black', linewidth=3)
ax[0].set_title("Pyplot")
# Option 2: only DataFrame.plot
revenue["projection"] = projection
revenue.plot.area(y='value', ax=ax[1])
revenue.plot.line(y='projection', ax=ax[1], color='black', linewidth=3)
ax[1].set_title("DataFrame.plot")
The results then look like this, where DataFrame.plot gives a much cleaner looking result:
If you do not want the projection in the revenue DataFrame, you can put it in a separate DataFrame and set the index to match revenue:
projection_df = pd.DataFrame({"projection": projection})
projection_df = projection_df.set_index(datetime_index)
projection_df.plot.line(ax=ax[1], color='black', linewidth=3)

Python - Pandas histogram width

I am doing a histogram plot of a bunch of data that goes from 0 to 1. When I plot I get this
As you can see, the histogram 'blocks' do not align with the y-axis.
Is there a way to set my histogram in order to get the histograms in a constant width of 0.1? Or should I try a diferent package?
My code is quite simple:
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
np.set_printoptions(precision=10,
threshold=10000,
linewidth=150,suppress=True)
E=pd.read_csv("FQCoherentSeparableBons5.csv")
E = E.ix[0:,1:]
E=np.array(E,float)
P0=E[:,0]
P0=pd.DataFrame(P0,columns=['P0'])
scatter_matrix(P0, alpha=0.2, figsize=(6, 6), diagonal='hist',color="red")
plt.suptitle('Distribucio p0')
plt.ylabel('Frequencia p0')
plt.show()
PD: If you are wondering about the data, I is just a random distribution from 0 to 1.
You can pass additional arguments to the pandas histogram using the hist_kwds argument of the scatter_matrix function. If you want ten bins of width 0.1, then your scatter_matrix call should look like
scatter_matrix(P0, alpha=0.2, figsize=(6, 6), diagonal='hist', color="red",
hist_kwds={'bins':[i*0.1 for i in range(11)]})
Additional arguments for the pandas histogram can be found in documentation.
Here is a simple example. I've added a grid to the plot so that you can see the bins align correctly.
import numpy as np
import pandas as pd
from pandas import scatter_matrix
import matplotlib.pyplot as plt
x = np.random.uniform(0,1,100)
scatter_matrix(pd.DataFrame(x), diagonal='hist',
hist_kwds={'bins':[i*0.1 for i in range(11)]})
plt.xlabel('x')
plt.ylabel('frequency')
plt.grid()
plt.show()
By default, the number of bins in the histogram is 10, but just because your data is distributed between 0 and 1 doesn't mean the bins will be evenly spaced over the range. For example, if you do not actually have a data point equal to 1, you will get a result similar to the one in your question.

Extending the range of bins in seaborn histogram

I'm trying to create a histogram with seaborn, where the bins start at 0 and go to 1. However, there is only date in the range from 0.22 to 0.34. I want the empty space more for a visual effect to better present the data.
I create my sheet with
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf')
df = pd.read_excel('test.xlsx', sheetname='IvT')
Here I create a variable for my list and one that I think should define the range of the bins of the histogram.
st = pd.Series(df['Short total'])
a = np.arange(0, 1, 15, dtype=None)
And the histogram itself looks like this
sns.set_style("white")
plt.figure(figsize=(12,10))
plt.xlabel('Ration short/total', fontsize=18)
plt.title ('CO3 In vitro transcription, Na+', fontsize=22)
ax = sns.distplot(st, bins=a, kde=False)
plt.savefig("hist.svg", format="svg")
plt.show()
Histogram
It creates a graph bit the range in x goes from 0 to 0.2050 and in y from -0.04 to 0.04. So completely different from what I expect. I google searched for quite some time but can't seem to find an answer to my specific problem.
Already, thanks for your help guys.
There are a few approaches to achieve the desired results here. For example, you can change the xaxis limits after you have plotted the histogram, or adjust the range over which the bins are created.
import seaborn as sns
# Load sample data and create a column with values in the suitable range
iris = sns.load_dataset('iris')
iris['norm_sep_len'] = iris['sepal_length'] / (iris['sepal_length'].max()*2)
sns.distplot(iris['norm_sep_len'], bins=10, kde=False)
Change the xaxis limits (the bins are still created over the range of your data):
ax = sns.distplot(iris['norm_sep_len'], bins=10, kde=False)
ax.set_xlim(0,1)
Create the bins over the range 0 to 1:
sns.distplot(iris['norm_sep_len'], bins=10, kde=False, hist_kws={'range':(0,1)})
Since the range for the bins is larger, you now need to use more bins if you want to have the same bin width as when adjusting the xlim:
sns.distplot(iris['norm_sep_len'], bins=45, kde=False, hist_kws={'range':(0,1)})

Plot start-end time slots - matplotlib python

I'm trying to plot time slots. I have two ndarrays of 'start' and 'end' points.
I want to draw it as chunks on a figure. Keep in mind that the chunks are not consecutive and there are gaps between the slots.
Until now I have tried to use patches:
for x_1 , x_2 in zip(s_data['begin'].values ,s_data['end'].values):
ax1.add_patch(Rectangle((x_1,0),x_2-x_1,0.5))
plt.show()
But its only giving me hald blue figure.
While I want something like this
The approach is correct. You just need to scale the axes such that the complete plot is within its range.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"begin": [1,4,6,9], "end" : [3,5,8,12]})
fig, ax = plt.subplots()
for x_1 , x_2 in zip(df['begin'].values ,df['end'].values):
ax.add_patch(plt.Rectangle((x_1,0),x_2-x_1,0.5))
ax.autoscale()
ax.set_ylim(-2,2)
plt.show()
It is worth noting that matplotlib has a function broken_barh, which simplifies the creation of such charts.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"begin": [1,4,6,9], "end" : [3,5,8,12]})
fig, ax = plt.subplots()
ax.broken_barh(list(zip(df["begin"].values, (df["end"] - df["begin"]).values)), (0, 0.5))
ax.set_ylim(-2,2)
plt.show()
Giving the same diagram as the above.

Categories

Resources