Difficulty aligning xticks to edge of Histogram bin - python

I am trying to show the frequency of my data throughout the hours of the day, using a histogram, in 3 hour intervals. I therefore use 8 bins.
plt.style.use('seaborn-colorblind')
plt.figure(figsize=(10,5))
plt.hist(comments19['comment_hour'], bins = 8, alpha = 1, align='mid', edgecolor = 'white', label = '2019', density=True)
plt.title('2019 comments, 8 bins')
plt.xticks([0,3,6,9,12,15,18,21,24])
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
However, the ticks are not aligning with the bin edges, as seen from the image below.

You can do either:
plt.figure(figsize=(10,5))
# define the bins and pass them to plt.hist
bins = [0,3,6,9,12,15,18,21,24]
plt.hist(comments19['comment_hour'], bins = bins, alpha = 1, align='mid',
         edgecolor = 'white', label = '2019', density=True)
# remove this line
# plt.xticks([0,3,6,9,12,15,18,21,24])
plt.title('2019 comments, 8 bins')
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
Or:
fig, ax = plt.subplots()
bins = np.arange(0,25,3)
comments19['comment_hour'].plot.hist(ax=ax,bins=bins)
# other plt format

If you set bins=8, matplotlib will create 9 evenly spaced boundaries, from the lowest value in the input array (0) to the highest (23), so at [0.0, 2.875, 5.75, 8.625, 11.5, 14.375, 17.25, 20.125, 23.0]. To get the 9 boundaries at 0, 3, 6, ... you need to set them explicitly.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
plt.style.use('seaborn-colorblind')
comments19 = pd.DataFrame({'comment_hour': np.random.randint(0, 24, 100)})
plt.figure(figsize=(10, 5))
plt.hist(comments19['comment_hour'], bins=np.arange(0, 25, 3), alpha=1, align='mid', edgecolor='white', label='2019',
density=True)
plt.title('2019 comments, 8 bins')
plt.xticks(np.arange(0, 25, 3))
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
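The boundaries quoted above can be checked without plotting. This is a minimal sketch, assuming (as in current matplotlib) that plt.hist delegates the binning to NumPy; the `hours` array is a made-up stand-in for comments19['comment_hour'] that contains every hour from 0 to 23 once.

```python
import numpy as np

# Stand-in data (an assumption): every hour 0..23 appears once
hours = np.arange(24)

# bins=8 spreads 9 edges evenly between the data minimum and maximum
auto_edges = np.histogram_bin_edges(hours, bins=8)

# Explicit edges every 3 hours, including the right edge at 24
explicit_edges = np.arange(0, 25, 3)

print(auto_edges)
print(explicit_edges)
```

Only the explicit edges coincide with the tick positions 0, 3, 6, ..., 24.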
Note that your density=True means that the total area of the histogram is 1. As each bin is 3 hours wide, the sum of all the bin heights will be 0.33 and not 1.00 as you might expect. To really get a y-axis with relative frequencies, you could make the internal bin widths 1 by dividing the hours by 3. Afterwards you can relabel the x-axis back to hours.
So, the following changes could be made so that all the bins sum to 100 %:
from matplotlib.ticker import PercentFormatter
plt.hist(comments19['comment_hour'] / 3, bins=np.arange(9), alpha=1, align='mid', edgecolor='white', label='2019',
density=True)
plt.xticks(np.arange(9), np.arange(0, 25, 3))
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
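The arithmetic behind the density remark can be verified with np.histogram. The sketch below uses made-up random hours and also shows a different route to relative frequencies, namely passing weights of 1/N. That weights variant is an alternative, not the approach of the answer above, which rescales the hours instead.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.integers(0, 24, 1000)   # made-up stand-in for comment_hour
edges = np.arange(0, 25, 3)         # 8 bins, each 3 hours wide

# density=True normalizes the total area, so heights * width sum to 1
density, _ = np.histogram(hours, bins=edges, density=True)
height_sum = density.sum()          # one third, because each bin is 3 wide
area = (density * 3).sum()          # total area of the histogram

# Alternative (an assumption, not from the answer): weight each sample by 1/N
weights = np.ones(len(hours)) / len(hours)
rel_freq, _ = np.histogram(hours, bins=edges, weights=weights)

print(height_sum, area, rel_freq.sum())
```

The same `weights` array can be passed to plt.hist in place of density=True if relabeling the x-axis feels too indirect.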

Related

Rounding the marker sizes to a given list of ranges

I have marker sizes varied based on a column in my geodataframe but I want the sizes in 5 groups. I don't want every value to have its own size, instead I'd like a range of values to have one marker size.
Here is the code:
fig, ax = mpl.pyplot.subplots(1, figsize = (10,10))
sns.scatterplot(
data=fishpts_clip, x="Lon", y="Lat", color='Green', size='SpeciesCatch',
sizes=(100, 300), legend="full"
)
plt.legend(loc='center left', bbox_to_anchor=(1.05, 0.5), ncol=1, title='Sizes')
This is what I got:
Instead, I'd like something like this:
You could create an extra column with each of the values rounded up to one of the desired bounds. That new column can be used for the sizes and the hue. To update the legend, the values are located in the list of bounds; the value itself and the previous one form the new legend label.
The following code illustrates the concept with simplified test data.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from scipy import interpolate
df = pd.DataFrame({'val': np.arange(1, 61),
'x': np.arange(60) % 10,
'y': np.arange(60) // 10 * 10})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 5))
sns.scatterplot(data=df, x="x", y="y", hue='val', palette='flare',
size='val', sizes=(100, 300), legend='full', ax=ax1)
sns.move_legend(ax1, loc='center left', bbox_to_anchor=(1.01, 0.5), ncol=6, title='Sizes')
ax1.set_title('using the given values')
# create an extra column with the values rounded up towards one of the bounds
bounds = [0, 5, 10, 20, 40, 60]
round_to_bound = interpolate.interp1d(bounds, bounds, kind='next', fill_value='extrapolate', bounds_error=False)
df['rounded'] = round_to_bound(df['val']).astype(int)
sns.scatterplot(data=df, x="x", y="y", hue='rounded', palette='flare',
size='rounded', sizes=(100, 300), ax=ax2)
sns.move_legend(ax2, loc='center left', bbox_to_anchor=(1.01, 0.5), ncol=1, title='Sizes')
for t in ax2.legend_.texts:
    v = int(t.get_text())
    t.set_text(f"{bounds[bounds.index(v) - 1]} - {v}")
ax2.set_title('rounding up the values towards given bounds')
sns.despine()
plt.tight_layout()
plt.show()
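If pulling in scipy only for the rounding step feels heavy, the same "round up to the next bound" logic can be sketched with np.searchsorted. This is an alternative to the interp1d call above, not the method the answer uses; the helper names `round_up_to_bound` and `range_label` are made up for illustration, and values larger than the last bound are not handled.

```python
import numpy as np

bounds = [0, 5, 10, 20, 40, 60]

def round_up_to_bound(values, bounds):
    """Return, for each value, the smallest bound that is >= the value."""
    bounds = np.asarray(bounds)
    # side='left' yields the index of the first bound >= value
    idx = np.searchsorted(bounds, values, side='left')
    return bounds[idx]

def range_label(upper, bounds):
    """Build a 'lower - upper' legend label from the list of bounds."""
    i = list(bounds).index(upper)
    return f"{bounds[i - 1]} - {upper}"

vals = np.array([1, 5, 6, 10, 11, 40, 60])
rounded = round_up_to_bound(vals, bounds)
labels = [range_label(int(v), bounds) for v in np.unique(rounded)]
print(rounded)
print(labels)
```

The `rounded` array can be stored in the extra column, and `range_label` produces the same legend texts as the loop over the legend entries.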
Combining a seaborn legend with other elements can be complicated, depending on the situation. If you just add a pandas plot on top of the seaborn scatter plot, it seems to work out well. In this case, pandas adds a new element to the existing legend, which can be moved via sns.move_legend() at the end.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from scipy import interpolate
df = pd.DataFrame({'val': np.arange(1, 61),
'x': np.arange(60) % 10,
'y': np.arange(60) // 10 * 10})
fig, ax = plt.subplots(figsize=(16, 5))
# create an extra column with the values rounded up towards one of the bounds
bounds = [0, 5, 10, 20, 40, 60]
round_to_bound = interpolate.interp1d(bounds, bounds, kind='next', fill_value='extrapolate', bounds_error=False)
df['rounded'] = round_to_bound(df['val']).astype(int)
sns.scatterplot(data=df, x="x", y="y", hue='rounded', palette='flare',
size='rounded', sizes=(100, 300), ax=ax)
for t in ax.legend_.texts:
    v = int(t.get_text())
    t.set_text(f"{bounds[bounds.index(v) - 1]} - {v}")
# add a pandas plot on top, which extends the legend
xs = np.linspace(0, 9, 200)
ys = np.random.randn(len(xs)).cumsum() * 2 + 25
dams_clip = pd.DataFrame({'dams_ys': ys}, index=xs)
dams_clip.plot(ax=ax, color="Red", linewidth=0.5, markersize=150, zorder=3)
sns.move_legend(ax, loc='center left', bbox_to_anchor=(1.01, 0.5), ncol=1, title='Sizes')
sns.despine()
plt.tight_layout()
plt.show()

Bar color became white when plotting many bins in catplot

I am trying to plot a histogram of an exponential distribution ranging from 0 to 20, with mean value 2.2 and bin width 0.05. However, the bars turn white when I plot it. The following is my code:
bins = np.linspace(0, 20, 401)
x = np.random.exponential(2.2, 3000)
counts, _ = np.histogram(x, bins)
df = pd.DataFrame({'bin': bins[:-1], 'count': counts})
p = sns.catplot(data = df, x = 'bin', y = 'count', yerr = [i**(1/2) for i in counts], kind = 'bar', height = 4, aspect = 2, palette = 'Dark2_r')
p.set(xlabel = r'Muon decay times ($\mu s$)', ylabel = 'Count', title = 'Distribution for muon decay times')
for ax in p.axes.flat:
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        if (i%40 != 0):
            labels[i] = ""
    ax.set_xticklabels(labels, rotation=30)
I believe that this is caused by the number of bins. If the first line of the code is changed to bins = np.linspace(0, 20, 11), the plot would be:
But I have no idea how to resolve this.
As @JohanC points out, if you're trying to draw elements that are close to or smaller than the resolution of your raster graphic, you have to expect some artifacts. But it also seems like you'd have an easier time making this plot directly in matplotlib, since catplot is not designed to make histograms:
f, ax = plt.subplots(figsize=(8, 4), dpi=96)
ax.bar(
    bins[:-1], counts,
    yerr=[i**(1/2) for i in counts],
    width=(bins[1] - bins[0]), align="edge",
    linewidth=0, error_kw=dict(linewidth=1),
)
ax.set(
    xmargin=.01,
    xlabel=r'Muon decay times ($\mu s$)',
    ylabel='Count',
    title='Distribution for muon decay times'
)
Matplotlib doesn't have a good way to deal with bars that are thinner than one pixel. If you save to an image file, you can increase the dpi and/or the figsize.
Some white space is due to the bars being 0.8 wide, leaving a gap of 0.2. Seaborn's barplot doesn't let you set the bar widths, but you could iterate through the generated bars and change their width (also updating their x-value to keep them centered around the tick position).
The edges of the bars get a fixed color (default 'none', or fully transparent). While iterating through the generated bars, you could set the edge color equal to the face color.
from matplotlib import pyplot as plt
from matplotlib.ticker import MultipleLocator
import seaborn as sns
import pandas as pd
import numpy as np
bins = np.linspace(0, 20, 401)
x = np.random.exponential(2.2, 3000)
counts, _ = np.histogram(x, bins)
df = pd.DataFrame({'bin': bins[:-1], 'count': counts})
g = sns.catplot(data=df, x='bin', y='count', yerr=[i ** (1 / 2) for i in counts], kind='bar',
height=4, aspect=2, palette='Dark2_r', lw=0.5)
g.set(xlabel=r'Muon decay times ($\mu s$)', ylabel='Count', title='Distribution for muon decay times')
for ax in g.axes.flat:
    ax.xaxis.set_major_locator(MultipleLocator(40))
    ax.tick_params(axis='x', labelrotation=30)
    for bar in ax.patches:
        bar.set_edgecolor(bar.get_facecolor())
        bar.set_x(bar.get_x() - (1 - bar.get_width()) / 2)
        bar.set_width(1)
plt.tight_layout()
plt.show()
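To get a feeling for why 400 bars run into the one-pixel limit, a back-of-the-envelope estimate helps. The helper below is hypothetical, and the `axes_fraction` of 0.8 is only an assumed share of the figure width occupied by the axes; the real value depends on the layout.

```python
def approx_bar_width_px(fig_width_in, dpi, n_bars, axes_fraction=0.8):
    """Rough width of one bar in pixels.

    axes_fraction is an assumed share of the figure width taken up by
    the axes area; it is not read from matplotlib.
    """
    return fig_width_in * dpi * axes_fraction / n_bars

# An 8 inch wide figure with 400 bars, at screen dpi and at print dpi
width_96 = approx_bar_width_px(8, 96, 400)
width_300 = approx_bar_width_px(8, 300, 400)
print(width_96, width_300)
```

Under these assumptions each bar is well under two pixels wide at 96 dpi, which is why saving with a higher dpi, for example via the dpi argument of savefig, gives the renderer more room.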

Matplotlib histogram shifted xticks

For some reason xticks on my histogram are shifted:
Here is the code:
data = list(df['data'].to_numpy())
bin = 40
plt.style.use('seaborn-colorblind')
plt.grid(axis='y', alpha=0.5, linestyle='--')
plt.hist(data, bins=bin, rwidth=0.7, align='mid')
plt.yticks(np.arange(0, 13000, 1000))
ticks = np.arange(0, 100000, 2500)
plt.xticks(ticks, rotation='-90', ha='center')
plt.show()
I'm wondering why the x ticks are shifted at the very beginning of the x-axis.
When setting bins=40, 40 equally sized bins will be created between the lowest and highest data value. In this case, the highest data value seems to be around 90000, and the lowest about 0. Dividing this into 40 regions will result in boundaries with non-rounded values. Therefore, it seems better to explicitly set the bins boundaries to the values you really want, for example dividing the range 0-100000 into 40 (so 41 boundaries).
from matplotlib import pyplot as plt
import numpy as np
plt.style.use('seaborn-colorblind')
data = np.random.lognormal(10, 0.4, 100000)
data[data > 90000] = np.nan
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
for ax in axes:
    if ax == axes[0]:
        bins = 40
        ax.set_title('bins = 40')
    else:
        bins = np.linspace(0, 100000, 41)
        ax.set_title('bins = np.linspace(0, 100000, 41)')
    ax.grid(axis='y', alpha=0.5, linestyle='--')
    ax.hist(data, bins=bins, rwidth=0.7, align='mid')
    ax.set_yticks(np.arange(0, 13000, 1000))
    xticks = np.arange(0, 100000, 2500)
    ax.set_xticks(xticks)
    ax.tick_params(axis='x', labelrotation=-90)
plt.tight_layout()
plt.show()
The issue is related to the way bins are constructed.
You have two choices:
Set the range for bins directly
plt.hist(data, bins=bin, rwidth=0.7, range=(0, 100_000), align='mid')
Set the x-axis ticks according to the binning:
_, bin_edges, _ = plt.hist(data, bins=bin, rwidth=0.7, align='mid')
ticks = bin_edges
I recommend the second option. The histogram will have a more natural scale compared to the boundaries of the bins.
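The effect of the range argument from the first option can be checked numerically. This is a sketch with made-up lognormal data, assuming that plt.hist forwards bins and range to NumPy's binning.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(10, 0.4, 10_000)   # made-up data, roughly 0..90000
data = data[data <= 90_000]

# Without range, the 41 edges span the data minimum to the data maximum
edges_auto = np.histogram_bin_edges(data, bins=40)

# With range, the edges span 0..100000 in steps of 2500
edges_range = np.histogram_bin_edges(data, bins=40, range=(0, 100_000))

ticks = np.arange(0, 100_001, 2500)
print(np.allclose(edges_range, ticks))
print(np.allclose(edges_auto, ticks))
```

With range=(0, 100_000) the edges are the same as those produced by np.linspace(0, 100000, 41), so ticks at multiples of 2500 fall exactly on bin boundaries.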

Add standard deviation bars from existing std column in data?

I'm plotting a chart based on the following data (head only):
Date noctiluca_ave density_ave density_sd
0 2018-03-07 2.0 1027.514332 0.091766
1 2018-03-14 4.0 1027.339988 0.285309
2 2018-03-21 1.0 1027.346413 0.183336
3 2018-03-31 1.0 1027.372996 0.170423
4 2018-04-07 0.0 1027.292119 0.187385
How do I add standard deviation ('density_sd') bars to the density_ave line?
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(hydro_weekly2 ['Date'], hydro_weekly2 ['density_ave'], label='density weekly ave', color='purple')
ax2=ax.twinx()
ax2.plot(hydro_weekly2['Date'], hydro_weekly2['noctiluca_ave'], label='noctiluca abundance' , color='r')
ax.set_ylabel('Density')
ax.set_xlabel('Date')
ax2.set_ylabel('Noctiluca abundance/cells per m3')
ax.set(title="Noctiluca Abundance and Density 2018")
lines, labels = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2, labels + labels2, loc="upper left")
You could either replace ax.plot with ax.errorbar() or use ax.fill_between() to show a colored band.
Here is an example with toy data and both approaches combined:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 20
dates = pd.date_range('2018-03-07', periods=N, freq='W-WED')
hydro_weekly2 = pd.DataFrame({'Date': dates,
'noctiluca_ave': np.random.randint(0, 14000, N),
'density_ave': 1027 + np.random.randn(N).cumsum() / 5,
'density_sd': 0.1 + np.abs(np.random.randn(N) / 5)})
fig, ax = plt.subplots(figsize=(10, 10))
ax.errorbar(hydro_weekly2['Date'], hydro_weekly2['density_ave'], yerr=hydro_weekly2['density_sd'],
label='density weekly ave', color='purple')
ax.fill_between(hydro_weekly2['Date'], hydro_weekly2['density_ave'] - hydro_weekly2['density_sd'],
hydro_weekly2['density_ave'] + hydro_weekly2['density_sd'], color='purple', alpha=0.3)
ax2 = ax.twinx()
ax2.plot(hydro_weekly2['Date'], hydro_weekly2['noctiluca_ave'], label='noctiluca abundance', color='r')
ax.set_ylabel('Density')
ax.set_xlabel('Date')
ax2.set_ylabel('Noctiluca abundance/cells per m3')
ax.set(title="Noctiluca Abundance and Density 2018")
plt.show()
Since you are dealing with a dataframe, here is another way to show the density_sd bars, using pandas plotting. The method works for bar plots as well.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('test.csv',parse_dates=['Date'],index_col=0)
ax = df['density_ave'].plot(yerr=df['density_sd'].T)
df['noctiluca_ave'].plot(yerr=df['density_sd'].T,secondary_y=True, ax=ax)
## added errorbar in the secondary y as well.
plt.show()
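For the bar plot case mentioned above, a minimal sketch could look as follows. The values are rounded from the head of the data in the question, and the y-limits, color and capsize are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Rounded copy of the first five rows shown in the question
df = pd.DataFrame(
    {'density_ave': [1027.51, 1027.34, 1027.35, 1027.37, 1027.29],
     'density_sd': [0.09, 0.29, 0.18, 0.17, 0.19]},
    index=pd.to_datetime(['2018-03-07', '2018-03-14', '2018-03-21',
                          '2018-03-31', '2018-04-07']),
)

# yerr takes the existing std column; capsize adds small caps to the bars
ax = df['density_ave'].plot.bar(yerr=df['density_sd'], color='purple', capsize=3)
ax.set_ylim(1026.5, 1028)  # zoom in, since all densities are close to 1027
ax.set_ylabel('Density')
plt.tight_layout()
```

In an interactive session, drop the matplotlib.use('Agg') line and finish with plt.show().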

Seaborn scatterplot legend showing true values and normalized continuous color

I have a dataframe that I'd like to use to build a scatterplot where different points have different colors:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
dat=pd.DataFrame(np.random.rand(20, 2), columns=['x','y'])
dat['c']=np.random.randint(0,100,20)
dat['c_norm']=(dat['c']-dat['c'].min())/(dat['c'].max()-dat['c'].min())
dat['group']=np.append(np.repeat('high',10), np.repeat('low',10))
As you can see, the column c_norm holds the c column normalized between 0 and 1. I would like to show a continuous legend whose color range reflects the normalized values, but labeled with the original c values. Say, the minimum (1), the maximum (86), and the median (49). I also want to have differing markers depending on group.
So far I was able to do this:
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
for row in dat.index:
    if(dat.loc[row,'group']=='low'):
        i_marker='.'
    else:
        i_marker='x'
    ax.scatter(
        x=dat.loc[row,'x'],
        y=dat.loc[row,'y'],
        s=50, alpha=0.5,
        marker=i_marker
    )
ax.legend(dat['c_norm'], loc='center right', bbox_to_anchor=(1.5, 0.5), ncol=1)
Questions:
- How to generate a continuous legend based on the values?
- How to adapt its ticks to show the original ticks in c, or at least a min, max, and mean or median?
Thanks in advance
Partial answer. Do you actually need to determine your marker colors based on the normed values? See the output of the snippet below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dat = pd.DataFrame(np.random.rand(20, 2), columns=['x', 'y'])
dat['c'] = np.random.randint(0, 100, 20)
dat['c_norm'] = (dat['c'] - dat['c'].min()) / (dat['c'].max() - dat['c'].min())
dat['group'] = np.append(np.repeat('high', 10), np.repeat('low', 10))
fig, (ax, bx) = plt.subplots(nrows=1, ncols=2, num=0, figsize=(16, 8))
mask = dat['group'] == 'low'
scat = ax.scatter(dat['x'][mask], dat['y'][mask], s=50, c=dat['c'][mask],
marker='s', vmin=np.amin(dat['c']), vmax=np.amax(dat['c']),
cmap='plasma')
ax.scatter(dat['x'][~mask], dat['y'][~mask], s=50, c=dat['c'][~mask],
marker='X', vmin=np.amin(dat['c']), vmax=np.amax(dat['c']),
cmap='plasma')
cbar = fig.colorbar(scat, ax=ax)
scat = bx.scatter(dat['x'][mask], dat['y'][mask], s=50, c=dat['c_norm'][mask],
marker='s', vmin=np.amin(dat['c_norm']),
vmax=np.amax(dat['c_norm']), cmap='plasma')
bx.scatter(dat['x'][~mask], dat['y'][~mask], s=50, c=dat['c_norm'][~mask],
marker='X', vmin=np.amin(dat['c_norm']),
vmax=np.amax(dat['c_norm']), cmap='plasma')
cbar2 = fig.colorbar(scat, ax=bx)
plt.show()
You could definitely modify the second colorbar so that it matches the first one, but is that necessary?
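For the second question (ticks at the minimum, median and maximum of c), one possible sketch is to color by the original c values and set the colorbar ticks explicitly. The data here is made up and deterministic in c, and the marker choices are arbitrary.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dat = pd.DataFrame(rng.random((20, 2)), columns=['x', 'y'])
dat['c'] = np.arange(0, 100, 5)              # made-up values 0, 5, ..., 95
dat['group'] = np.repeat(['high', 'low'], 10)

fig, ax = plt.subplots(figsize=(8, 8))
vmin, vmax = dat['c'].min(), dat['c'].max()
# One scatter call per group, sharing vmin/vmax so the colors are comparable
for group, marker in [('low', 's'), ('high', 'X')]:
    sub = dat[dat['group'] == group]
    scat = ax.scatter(sub['x'], sub['y'], s=50, c=sub['c'], marker=marker,
                      vmin=vmin, vmax=vmax, cmap='plasma', label=group)

cbar = fig.colorbar(scat, ax=ax)
# Label the colorbar only at the minimum, median and maximum of c
ticks = [float(vmin), float(dat['c'].median()), float(vmax)]
cbar.set_ticks(ticks)
ax.legend(title='group')
```

Because the color scale is linear between vmin and vmax, coloring by c or by c_norm gives the same colors; only the tick labels differ.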
