How to Add Average Values to a Categorical Plot in Python


I am trying to add the average value of each category to the plot. I have been trying to add these average values independently, per category, but without success. Is there a way for catplot to average the values from the data set and plot that extra value in a different color? My goal is to add the average value and differentiate it from the individual values so that it can be identified visually.
plt.rcParams["figure.figsize"] = [5.50, 5.50]
plt.rcParams["figure.autolayout"] = True
ax = sns.catplot(x="Sample Set", y="Values [%]", data=df)
ax.set_xticklabels(rotation=90)
ax.despine(right=True, top=True)
sp = 100
delta = 5
plt.axhline(y=sp, color='gray', linestyle='--', label='Target')
plt.axhline(y=sp*((100+(delta*2))/100), color='r', linestyle='--', label='10%')
plt.axhline(y=sp*((100-(delta*2))/100), color='r', linestyle='--')
plt.ylim(80, 120)
plt.title('Sample Location', fontsize = 14, y=1.05)
plt.legend(frameon=False, loc ="lower right")
plt.savefig(outputFileName, dpi=300, bbox_inches = 'tight')
plt.show()
plt.draw()

You probably ran into strange error messages because you named the return value of sns.catplot ax. sns.catplot is a "figure-level" function and returns a FacetGrid, which is usually assigned to a variable named g. A figure-level function can have one or more subplots, accessible via g.axes. When there is only one subplot, g.ax points to that subplot.
Also note that the catplot's figsize isn't set via the rcParams. The figure size comes from the height= parameter (height in inches of one subplot) and the aspect= parameter (ratio between width and height of a subplot), multiplied by the number of rows/columns of subplots.
Further, you seem to be mixing the "object-oriented" and the pyplot interface for matplotlib. For readability and code maintenance, it is preferred to stick to one interface.
To indicate the means, sns.pointplot without confidence interval might be suited. ax.axhspan might be used to visualize the range around the target.
Here is some example code starting from seaborn's iris dataset.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
iris = sns.load_dataset('iris')
g = sns.catplot(data=iris, x="species", y="sepal_length", height=5.50, aspect=1)
ax = g.ax
ax.tick_params(axis='x', rotation=0, length=0)
sns.pointplot(data=iris, x="species", y="sepal_length", estimator=np.mean,
              join=False, ci=None, markers=['D'], color='black', size=20, zorder=3, ax=ax)
sns.despine(right=True, top=True)
sp = 6
delta = 10
ax.axhline(y=sp, color='gray', linestyle='--', label='Target')
ax.axhspan(ymin=sp * (100 - delta) / 100, ymax=sp * (100 + delta) / 100,
           color='r', alpha=0.15, linestyle='--', label='10%')
ax.collections[-1].set_label('Mean')
ax.legend(frameon=False, loc="lower right")
# plt.savefig(outputFileName, dpi=300, bbox_inches='tight')
plt.tight_layout()
plt.show()
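
As an alternative sketch (not part of the answer above), the means can also be computed explicitly with pandas and drawn on the catplot's single subplot with plain matplotlib. The small DataFrame and the names df, means and positions are made up for illustration. The code assumes that seaborn places string categories at x = 0, 1, 2, ... in their order of first appearance.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data standing in for the question's df
df = pd.DataFrame({
    "Sample Set": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Values [%]": [98.0, 101.0, 104.0, 95.0, 97.0, 99.0, 102.0, 105.0, 108.0],
})

g = sns.catplot(data=df, x="Sample Set", y="Values [%]", height=5.5, aspect=1)
ax = g.ax  # the single subplot of the FacetGrid

# Mean per category, keeping the order in which the categories first appear
means = df.groupby("Sample Set", sort=False)["Values [%]"].mean()

# Assumption: the categories sit at integer positions 0, 1, 2, ... on the x-axis
positions = range(len(means))
ax.scatter(positions, means.to_numpy(), marker="D", s=80, color="black",
           zorder=3, label="Mean")
ax.legend(frameon=False, loc="lower right")
```

Computing the means by hand keeps them available as numbers (for annotations or a table), at the cost of having to match the category order yourself. Call g.savefig(...) or plt.show() afterwards to view the result.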

According to the answer to "Plotting with seaborn using the matplotlib object-oriented interface", catplot is a figure-level function, which makes adding to it harder than it is for axes-level plot types:
The second group of functions (Figure-level) are distinguished by the
fact that the resulting plot can potentially include several Axes
which are always organized in a "meaningful" way. That means that the
functions need to have total control over the figure, so it isn't
possible to plot, say, an lmplot onto one that already exists. Calling
the function always initializes a figure and sets it up for the
specific plot it's drawing.

Related

How to fix transparency overlaps in Matplotlib when plotting multiple figures?

I have a function that takes a string (the name of the dataframe we're visualizing) and returns two histograms that visualize that data. The first plot (on the left) shows the raw data; the one on the right shows the same data after normalization (plotted with the matplotlib parameter density=True). But as you can see, this leads to transparency issues when the plots overlap. This is my code for this particular plot:
plt.rcParams["figure.figsize"] = [12, 8]
plt.rcParams["figure.autolayout"] = True
ax0_1 = plt.subplot(121)
_,bins,_ = ax0_1.hist(filtered_0,alpha=1,color='b',bins=15,label='All apples')
ax0_1.hist(filtered_1,alpha=0.9,color='gold',bins=bins,label='Less than two apples')
ax0_1.set_title('Condition 0 vs Condition 1: '+'{}'.format(apple_data),fontsize=14)
ax0_1.set_xlabel('{}'.format(apple_data),fontsize=13)
ax0_1.set_ylabel('Frequency',fontsize=13)
ax0_1.grid(axis='y',linewidth=0.4)
ax0_1.tick_params(axis='x',labelsize=13)
ax0_1.tick_params(axis='y',labelsize=13)
ax0_1_norm = plt.subplot(122)
_,bins,_ = ax0_1_norm.hist(filtered_0,alpha=1,color='b',bins=15,label='All apples',density=True)
ax0_1_norm.hist(filtered_1,alpha=0.9,color='gold',bins=bins,label='Less than two apples',density=True)
ax0_1_norm.set_title('Condition 0 vs Condition 1: '+'{} - Normalized'.format(apple_data),fontsize=14)
ax0_1_norm.set_xlabel('{}'.format(apple_data),fontsize=13)
ax0_1_norm.set_ylabel('Frequency',fontsize=13)
ax0_1_norm.legend(bbox_to_anchor=(2, 0.95))
ax0_1_norm.grid(axis='y',linewidth=0.4)
ax0_1_norm.tick_params(axis='x',labelsize=13)
ax0_1_norm.tick_params(axis='y',labelsize=13)
plt.tight_layout(pad=0.5)
plt.show()
What my current plot looks like
Any ideas on how to make the colors blend a bit better would be helpful. Alternatively, if there are any other combinations you know of that would work instead, feel free to share. I'm not picky about the color choice. Thanks!

I think it is better to distinguish overlapping histograms by their shape (outline versus filled bars) or by a difference in transparency than by color alone. I have coded an example based on the official reference, with additional overlap.
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(20211021)
N_points = 100000
n_bins = 20
x = np.random.randn(N_points)
y = .4 * x + np.random.randn(N_points) + 2
fig, axs = plt.subplots(2, 2, sharey=True, tight_layout=True)
# We can set the number of bins with the `bins` kwarg
axs[0,0].hist(x, color='b', alpha=0.9, bins=n_bins, ec='b', fc='None')
axs[0,0].hist(y, color='gold', alpha=0.6, bins=21)
axs[0,0].set_title('edgecolor and facecolor None')
axs[0,1].hist(x, color='b', alpha=0.9, bins=n_bins)
axs[0,1].hist(y, color='gold', alpha=0.6, bins=21, ec='b')
axs[0,1].set_title('edgecolor and facecolor')
axs[1,0].hist(x, alpha=0.9, bins=n_bins, histtype='step', facecolor='b')
axs[1,0].hist(y, color='gold', alpha=0.6, bins=21)
axs[1,0].set_title('step')
axs[1,1].hist(x, color='b', alpha=0.9, bins=n_bins, histtype='bar', rwidth=0.8)
axs[1,1].hist(y, color='gold', alpha=0.6, bins=21, ec='b')
axs[1,1].set_title('bar')
plt.show()
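
Another way to sidestep the overlap, offered here only as an alternative sketch and not taken from the answer above, is to pass both samples to a single ax.hist call. Matplotlib then draws the bars of the two samples side by side within each bin, so nothing is hidden and no transparency is needed. The synthetic data and the variable names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(20211021)
x = rng.normal(size=1000)                 # synthetic stand-in for filtered_0
y = 0.4 * x + rng.normal(size=1000) + 2   # synthetic stand-in for filtered_1

# Common bin edges spanning both samples (16 edges give 15 bins)
bins = np.linspace(min(x.min(), y.min()), max(x.max(), y.max()), 16)

fig, ax = plt.subplots(figsize=(6, 4))
# A list of datasets produces grouped bars: one bar per dataset in every bin
counts, edges, _ = ax.hist([x, y], bins=bins, color=["b", "gold"],
                           label=["All apples", "Less than two apples"])
ax.set_ylabel("Frequency")
ax.legend()
```

The trade-off is that each bar is narrower, so this works best with a moderate number of bins.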

Plotting Two Histograms. Why can't one have kde while other not have it?

So I was going through the Kaggle Data Visualization Micro Course, and I reached the lesson on plotting histograms.
So the exercise asked me to plot two histograms. I did that and it worked, but if I add kde = False to one of the plots, only that plot is visible and the other plot isn't displayed:
sns.distplot(a = cancer_b_data['Area (mean)'], kde = False)
sns.distplot(a = cancer_m_data['Area (mean)'])
Don't know how stupid I sound, but any clarification would help. Thanks

With the default kde=True the kde is normalized such that the area under the curve is one. To go together in the same plot, the histogram will also be normalized such that the complete area of all bars sums to one.
With kde=False, the default histogram shows the frequency (the count in each bin), which are much larger numbers. If you display both inside the same plot with the same axes, the normalized histogram will not disappear, but it will be very small. With the zoom tool you can verify that it is still there. To see both at the same scale, sns.distplot(..., kde=False, norm_hist=True) can be used.
You'll note that the two histograms don't use the same bin boundaries. These boundaries are calculated from the number of samples and the minimum and maximum of each individual set of samples.
To really compare two histograms, explicit bins can be set, so both use the same bin boundaries.
The following code and plot compare the three approaches:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
x1 = np.random.randn(100).cumsum()
x2 = np.random.randn(100).cumsum()
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(15, 4))
sns.distplot(a=x1, kde=False, ax=ax1)
sns.distplot(a=x2, ax=ax1)
ax1.set_title('one histogram without kde')
sns.distplot(a=x1, kde=False, norm_hist=True, ax=ax2)
sns.distplot(a=x2, ax=ax2)
ax2.set_title('setting norm_hist=True')
xmin = min(x1.min(), x2.min())
xmax = max(x1.max(), x2.max())
bins = np.linspace(xmin, xmax, 11)
sns.distplot(a=x1, kde=False, norm_hist=True, bins=bins, ax=ax3)
sns.distplot(a=x2, bins=bins, ax=ax3)
ax3.set_title('using the same bins')
plt.tight_layout()
plt.show()
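
Note that sns.distplot has been deprecated since seaborn 0.11. As a rough sketch of the same idea without it, the comparison with shared bins can be drawn with matplotlib's ax.hist, using density=True so that the bars of each histogram have a total area of one. The random data and the names area1 and area2 are assumptions used only to check the normalization.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100).cumsum()
x2 = rng.normal(size=100).cumsum()

# One set of bin edges covering both samples
bins = np.linspace(min(x1.min(), x2.min()), max(x1.max(), x2.max()), 11)

fig, ax = plt.subplots(figsize=(5, 4))
# density=True rescales the bars so that their total area is 1
d1, _, _ = ax.hist(x1, bins=bins, density=True, alpha=0.5, label="x1")
d2, _, _ = ax.hist(x2, bins=bins, density=True, alpha=0.5, label="x2")
ax.set_title("same bins, both normalized")
ax.legend()

# Area under each histogram: sum of bar height times bar width
area1 = float(np.sum(d1 * np.diff(bins)))
area2 = float(np.sum(d2 * np.diff(bins)))
```

Because both histograms are normalized the same way and share bin edges, their bar heights can be compared directly.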

Configuring grid-lines in matplotlib plot

Consider the figure below.
This image has been set up with the following code.
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
fig, ax = plt.subplots()
ax.set_xlabel("Run Number", fontsize=25)
plt.grid(True, linestyle='--')
plt.tick_params(labelsize=20)
ax.set_xticklabels(map(str,range(number_of_runs)))
ax.minorticks_on()
ax.set_ylim([0.75,1.75])
I have not included the code that actually generates the data for plotting for the sake of clarity.
Unlike the diagram above, I would like to draw grid-lines perpendicular to the X-axis through each orange (and hence blue) dot. How do I do this?
The x-coordinates of the successive orange and blue dots form the same arithmetic progression in my code.
Also I notice that the tick numbers numbered 1,2,... are wrong for my application. Instead, I would like each successive grid-line, which I ask for as perpendicular to the X-axis in the previous step, to be numbered sequentially from 1 along the X-axis. How do I configure the Xtick marks for this?

The grid lines cross the xticks (or yticks).
You need to define the xticks so that the grid lines cross your data points (the dots).
Example below:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
number_of_runs = range(1,10) # use your actual number_of_runs
ax.set_xticks(number_of_runs, minor=False)
ax.xaxis.grid(True, which='major')
In case you want to have only vertical lines, add this:
ax.yaxis.grid(False, which='major')
Similar question here.

You should specify the exact places where you want the grids using a call to ax.set_xticks and then specify the exact numbers you want on the axis using a call to ax.set_xticklabels.
I am plotting two random arrays in the example below:
import numpy as np
import matplotlib.pyplot as plt
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
y1 = np.random.random(10)
y2 = np.random.random(10)
fig, ax = plt.subplots(ncols=2, figsize=(8, 3))
# equivalent to your figure
ax[0].plot(y1, 'o-')
ax[0].plot(y2, 'o-')
ax[0].grid(True, linestyle='--')
ax[0].set_title('Before')
# hopefully what you want
ax[1].plot(y1, 'o-')
ax[1].plot(y2, 'o-')
ax[1].set_title('After')
ax[1].set_xticks(range(0, len(y1)))
ax[1].set_xticklabels(range(1, len(y1)+1))
ax[1].grid(True, linestyle='--')
plt.show()
This is the output:
A note: Looking at your plot, it seems that the actual x-axis values are not integers, but you want integer labels starting from 1. Probably the easiest way to do this is to pass only the y data array to the plot command (plt.plot(y) instead of plt.plot(x, y)), as I have done above. You should decide whether this is appropriate for your case.
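
If the real x-values have to be kept, the same two calls still work: put the ticks at the data's x-positions and relabel them 1, 2, 3, and so on. The following is a minimal sketch under the assumption that x is an arithmetic progression, as described in the question. The particular values of x, y1 and y2 are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Assumed x-coordinates: an arithmetic progression that is not 1, 2, 3, ...
x = 0.5 + 2.5 * np.arange(8)
y1 = rng.random(len(x)) + 0.75
y2 = rng.random(len(x)) + 0.75

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, y1, "o-")
ax.plot(x, y2, "o-")

# One major tick (and therefore one grid line) at every data x-position
ax.set_xticks(x)
# Label those ticks 1, 2, 3, ... regardless of the underlying x-values
ax.set_xticklabels([str(i) for i in range(1, len(x) + 1)])
ax.xaxis.grid(True, which="major", linestyle="--")
ax.set_xlabel("Run Number")
```

Only the labels change here; the data keeps its original x-coordinates, so axis limits and any later annotations still refer to the real values.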

Getting values and probabilities into a Matplotlib cumulative distribution function

I have a plot that shows Cumulative Distribution values and a 'survival function' as well.
import numpy as np
import matplotlib.pyplot as plt
values, base = np.histogram(serWEB, bins=100)
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
plt.title('Cumulative Distribution')
plt.xlabel('X Data')
plt.ylabel('Y Data')
# survival function next
plt.plot(base[:-1], len(serWEB)-cumulative, c='green')
plt.show()
This plot shows the values as the main Y-Axis.
I am interested in adding a 2nd Y-Axis on the right to show the percentages.
How to do that?

Using a combination of matplotlib.axes.Axes.twinx and matplotlib.ticker.Formatter should do what you need. First get the current axis, then create a "twin" of it using twinx():
import matplotlib.ticker as mtick
...
ax = plt.gca() # returns the current axes
ax = ax.twinx() # create a twin Axes that shares the x-axis and has its own y-axis
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
There are quite a few ways you can format the percentage:
PercentFormatter() accepts three arguments, xmax, decimals, symbol. xmax allows you to set the value that corresponds to 100% on the axis.
This is nice if you have data from 0.0 to 1.0 and you want to display
it from 0% to 100%. Just do PercentFormatter(1.0).
The other two parameters allow you to set the number of digits after the decimal point and the symbol. They default to None and '%',
respectively. decimals=None will automatically set the number of
decimal points based on how much of the axes you are showing.[1]
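
Putting the two pieces together, here is a minimal sketch with synthetic data standing in for serWEB. It assumes that the right-hand axis should read the same curves as a share of the total, so its limits are set to the left-hand limits divided by the number of samples. The names data, n, lo and hi are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=50, size=1000)   # synthetic stand-in for serWEB

values, base = np.histogram(data, bins=100)
cumulative = np.cumsum(values)
n = len(data)

fig, ax = plt.subplots()
ax.plot(base[:-1], cumulative, c="blue", label="cumulative")
ax.plot(base[:-1], n - cumulative, c="green", label="survival")
ax.set_xlabel("X Data")
ax.set_ylabel("Count")
ax.legend(loc="center right")

# Twin Axes: shares the x-axis, has its own y-axis on the right
ax2 = ax.twinx()
# Same limits as the left axis, divided by n, so both scales line up
lo, hi = ax.get_ylim()
ax2.set_ylim(lo / n, hi / n)
# A value of 1.0 on the right axis is displayed as 100%
ax2.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))
ax2.set_ylabel("Share of samples")
```

Nothing is plotted on the twin axis in this sketch. It only provides a second scale for the curves drawn on the left axis.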

Check out ax.twinx(), as used e.g. in this question and its answers.
Try this approach:
ax = plt.gca() # get current active axis
ax2 = ax.twinx()
ax2.plot(base[:-1], (len(serWEB)-cumulative)/len(serWEB), 'r')

Here is the basic shell of plotting values on the left Y-Axis and ratio on the right Y-Axis (ax2). I used the twinx() function as per the comments.
import numpy as np
import matplotlib.pyplot as plt
# evaluate the histogram
import matplotlib.ticker as mtick
values, base = np.histogram(serWEB, bins=100)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
plt.title('Chart Title')
plt.xlabel('X Label')
plt.ylabel('Y Label')
#plot the survival function
plt.plot(base[:-1], len(serWEB)-cumulative, c='green')
# clone the Y-axis and make a Y-Axis #2
ax = plt.gca()
ax2 = ax.twinx()
# on the second Y-Axis, plot the ratio from 0 to 1
n, bins, patches = ax2.hist(serWEB, 100, density=True, histtype='step',
                            cumulative=True)
plt.xticks(np.arange(0,900, step = 14))
plt.xlim(0, 200)
plt.show()

scatter plot with single pixel marker in matplotlib

I am trying to plot a large dataset with a scatter plot.
I want to use matplotlib to plot it with single pixel marker.
It seems to have been solved.
https://github.com/matplotlib/matplotlib/pull/695
But I cannot find a mention of how to get a single pixel marker.
My simplified dataset (data.csv)
Length,Time
78154393,139.324091
84016477,229.159305
84626159,219.727537
102021548,225.222662
106399706,221.022827
107945741,206.760239
109741689,200.153263
126270147,220.102802
207813132,181.67058
610704756,50.59529
623110004,50.533158
653383018,52.993885
659376270,53.536834
680682368,55.97628
717978082,59.043843
My code is below.
import pandas as pd
import os
import numpy
import matplotlib.pyplot as plt
inputfile='data.csv'
iplevel = pd.read_csv(inputfile)
base = os.path.splitext(inputfile)[0]
fig = plt.figure()
plt.yscale('log')
#plt.xscale('log')
plt.title(' My plot: '+base)
plt.xlabel('x')
plt.ylabel('y')
plt.scatter(iplevel['Time'], iplevel['Length'],color='black',marker=',',lw=0,s=1)
fig.tight_layout()
fig.savefig(base+'_plot.png', dpi=fig.dpi)
You can see below that the points are not single pixel.
Any help is appreciated

The problem
I fear that the bugfix discussed at matplotlib git repository that you're citing is only valid for plt.plot() and not for plt.scatter()
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(4,2))
ax = fig.add_subplot(121)
ax2 = fig.add_subplot(122, sharex=ax, sharey=ax)
ax.plot([1, 2],[0.4,0.4],color='black',marker=',',lw=0, linestyle="")
ax.set_title("ax.plot")
ax2.scatter([1,2],[0.4,0.4],color='black',marker=',',lw=0, s=1)
ax2.set_title("ax.scatter")
ax.set_xlim(0,8)
ax.set_ylim(0,1)
fig.tight_layout()
print(fig.dpi) #prints 80 in my case
fig.savefig('plot.png', dpi=fig.dpi)
The solution: Setting the markersize
The solution is to use a usual "o" or "s" marker, but set the markersize to be exactly one pixel. Since the markersize is given in points, one would need to use the figure dpi to calculate the size of one pixel in points. This is 72./fig.dpi.
For a plot, the markersize is set directly:
ax.plot(..., marker="o", ms=72./fig.dpi)
For a scatter the markersize is given through the s argument, which is in square points,
ax.scatter(..., marker='o', s=(72./fig.dpi)**2)
Complete example:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(4,2))
ax = fig.add_subplot(121)
ax2 = fig.add_subplot(122, sharex=ax, sharey=ax)
ax.plot([1, 2],[0.4,0.4], marker='o',ms=72./fig.dpi, mew=0,
        color='black', linestyle="", lw=0)
ax.set_title("ax.plot")
ax2.scatter([1,2],[0.4,0.4],color='black', marker='o', lw=0, s=(72./fig.dpi)**2)
ax2.set_title("ax.scatter")
ax.set_xlim(0,8)
ax.set_ylim(0,1)
fig.tight_layout()
fig.savefig('plot.png', dpi=fig.dpi)

For anyone still trying to figure this out, the solution I found was to specify the s argument in plt.scatter.
The s argument refers to the area of the point you are plotting.
It doesn't seem to be quite perfect, since s=1 seems to cover about 4 pixels of my screen, but this definitely makes them smaller than anything else I've been able to find.
https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.scatter.html
s : scalar or array_like, shape (n, ), optional
size in points^2. Default is rcParams['lines.markersize'] ** 2.
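
The arithmetic behind this observation can be sketched in a few lines. Since s is an area in square points and one point is 1/72 inch, a marker with s=1 is nominally one point wide, which is more than one pixel whenever the dpi is above 72. The helper names below are hypothetical and not part of matplotlib.

```python
def one_pixel_sizes(dpi):
    """Return (markersize, s) for a marker one pixel wide at the given dpi.

    Hypothetical helper for illustration; not part of matplotlib.
    """
    points_per_pixel = 72.0 / dpi      # 1 point = 1/72 inch
    markersize = points_per_pixel      # `ms` for plot(), in points
    s = points_per_pixel ** 2          # `s` for scatter(), in points squared
    return markersize, s


def width_in_pixels(s, dpi):
    """Nominal width in pixels of a marker whose scatter size is s (points^2)."""
    return (s ** 0.5) * dpi / 72.0


ms, s = one_pixel_sizes(100)            # sizes for a figure saved at 100 dpi
default_width = width_in_pixels(1, 100) # width of an s=1 marker at 100 dpi
```

At 100 dpi, for example, an s=1 marker is about 1.39 pixels wide, so it necessarily spills over more than one pixel. Using s=(72./dpi)**2 brings the nominal width down to exactly one pixel.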

Set the plt.scatter() parameter to linewidths=0 and figure out the right value for the parameter s.
Source: https://stackoverflow.com/a/45803960/4063622
