Getting values and probabilities into a Matplotlib cumulative distribution function - python

I have a plot that shows Cumulative Distribution values and a 'survival function' as well.
import numpy as np
import matplotlib.pyplot as plt
values, base = np.histogram(serWEB, bins=100)
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
plt.title('Cumulative Distribution')
plt.xlabel('X Data')
plt.ylabel('Y Data')
# survival function next
plt.plot(base[:-1], len(serWEB)-cumulative, c='green')
plt.show()
This plot shows the values as the main Y-Axis.
I am interested in adding a 2nd Y-Axis on the right to show the percentages.
How to do that?

Using a combination of matplotlib.axes.Axes.twinx and matplotlib.ticker.Formatter should do what you need. First get the current axis, then create a "twin" of it using twinx():
import matplotlib.ticker as mtick
...
ax = plt.gca() # returns the current axes
ax = ax.twinx() # create a twin Axes that shares the x-axis
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
There are quite a few ways you can format the percentage:
PercentFormatter() accepts three arguments, max, decimals, symbol. max allows you to set the value that corresponds to 100% on the axis.
This is nice if you have data from 0.0 to 1.0 and you want to display
it from 0% to 100%. Just do PercentFormatter(1.0).
The other two parameters allow you to set the number of digits after the decimal point and the symbol. They default to None and '%',
respectively. decimals=None will automatically set the number of
decimal points based on how much of the axes you are showing.[1]
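As a rough end-to-end sketch of the twinx() plus PercentFormatter idea: the data below is synthetic (a stand-in for serWEB, which is not shown in the question), and the choice to copy the left axis limits onto the right axis is an assumption made here so that the two scales line up. Setting xmax to the number of samples makes a count of len(data) read as 100%.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import numpy as np

# Synthetic stand-in for serWEB (assumption: any 1-D numeric sample works)
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=20, size=500)

values, base = np.histogram(data, bins=100)
cumulative = np.cumsum(values)

fig, ax = plt.subplots()
ax.plot(base[:-1], cumulative, c="blue")               # cumulative counts
ax.plot(base[:-1], len(data) - cumulative, c="green")  # survival counts
ax.set_ylabel("Count")

# Twin Axes sharing the x-axis; reuse the left limits so both scales align
ax2 = ax.twinx()
ax2.set_ylim(ax.get_ylim())

# xmax=len(data) means a tick at len(data) is labelled 100%
percent_fmt = mtick.PercentFormatter(xmax=len(data), decimals=0)
ax2.yaxis.set_major_formatter(percent_fmt)
ax2.set_ylabel("Share of samples")
```

Nothing is plotted on ax2 here; it only relabels the same vertical scale, so the curves stay attached to the left axis.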

Check out ax.twinx(), as used e.g. in this question and its answers.
Try this approach:
ax = plt.gca() # get current active axis
ax2 = ax.twinx()
ax2.plot(base[:-1], (len(serWEB)-cumulative)/len(serWEB), 'r')

Here is the basic shell of plotting values on the left Y-Axis and ratio on the right Y-Axis (ax2). I used the twinx() function as per the comments.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
# evaluate the histogram
values, base = np.histogram(serWEB, bins=100)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
plt.title('Chart Title')
plt.xlabel('X Label')
plt.ylabel('Y Label')
#plot the survival function
plt.plot(base[:-1], len(serWEB)-cumulative, c='green')
# clone the Y-axis and make a Y-Axis #2
ax = plt.gca()
ax2 = ax.twinx()
# on the second Y-Axis, plot the ratio from 0 to 1
# density=True replaces normed=True, which newer Matplotlib releases no longer accept
n, bins, patches = ax2.hist(serWEB, 100, density=True, histtype='step',
cumulative=True)
plt.xticks(np.arange(0,900, step = 14))
plt.xlim(0, 200)
plt.show()

Related

How Add Average Values to a Categorical Plot in Python

Trying to add the average value to each category in the plot. I have been trying to add these average values independently, per category, but without success. Is there a way that catplot can average the values from the data set and plot that extra value with a different color? My goal is to add and differentiate the average value from the individual values so can be visually identified.
plt.rcParams["figure.figsize"] = [5.50, 5.50]
plt.rcParams["figure.autolayout"] = True
ax = sns.catplot(x="Sample Set", y="Values [%]", data=df)
ax.set_xticklabels(rotation=90)
ax.despine(right=True, top=True)
sp = 100
delta = 5
plt.axhline(y=sp, color='gray', linestyle='--', label='Target')
plt.axhline(y=sp*((100+(delta*2))/100), color='r', linestyle='--', label='10%')
plt.axhline(y=sp*((100-(delta*2))/100), color='r', linestyle='--')
plt.ylim(80, 120)
plt.title('Sample Location', fontsize = 14, y=1.05)
plt.legend(frameon=False, loc ="lower right")
plt.savefig(outputFileName, dpi=300, bbox_inches = 'tight')
plt.show()
plt.draw()
You probably run into strange error messages, as you named the return value of sns.catplot as ax. sns.catplot is a "figure-level" function and returns a FacetGrid, often assigned to a variable named g. A figure-level function can have one or more subplots, accessible via g.axes. When there is only one subplot, g.ax points to that subplot.
Also note that the catplot's figsize isn't set via the rcParams. The figure size comes from the height= parameter (height in inches of one subplot) and the aspect= parameter (ratio between width and height of a subplot), multiplied by the number of rows/columns of subplots.
Further, you seem to be mixing the "object-oriented" and the pyplot interface for matplotlib. For readability and code maintenance, it is preferred to stick to one interface.
To indicate the means, sns.pointplot without confidence interval might be suited. ax.axhspan might be used to visualize the range around the target.
Here is some example code starting from seaborn's iris dataset.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
iris = sns.load_dataset('iris')
g = sns.catplot(data=iris, x="species", y="sepal_length", height=5.50, aspect=1)
ax = g.ax
ax.tick_params(axis='x', rotation=0, length=0)
sns.pointplot(data=iris, x="species", y="sepal_length", estimator=np.mean,
join=False, ci=None, markers=['D'], color='black', size=20, zorder=3, ax=ax)
sns.despine(right=True, top=True)
sp = 6
delta = 10
ax.axhline(y=sp, color='gray', linestyle='--', label='Target')
ax.axhspan(ymin=sp * (100 - delta) / 100, ymax=sp * (100 + delta) / 100,
color='r', alpha=0.15, linestyle='--', label='10%')
ax.collections[-1].set_label('Mean')
ax.legend(frameon=False, loc="lower right")
# plt.savefig(outputFileName, dpi=300, bbox_inches='tight')
plt.tight_layout()
plt.show()
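The figure-level behaviour described above can be checked with a small sketch. The DataFrame and its column names are made up for illustration, and drawing the means with ax.scatter is a swapped-in alternative to sns.pointplot, chosen here only because it relies on plain matplotlib. It assumes the categories sit at x positions 0, 1, ... in order of appearance.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up dataset (hypothetical column names)
df = pd.DataFrame({
    "sample_set": ["A", "A", "A", "B", "B", "B"],
    "value": [98.0, 101.0, 104.0, 95.0, 99.0, 100.0],
})

# catplot is figure-level: it creates its own figure and returns a FacetGrid
g = sns.catplot(data=df, x="sample_set", y="value", height=4, aspect=1.5)
ax = g.ax  # the single matplotlib Axes inside the grid

# Per-category means, kept in order of appearance to match the x positions
means = df.groupby("sample_set", sort=False)["value"].mean()
ax.scatter(list(range(len(means))), means.to_numpy(), marker="D",
           color="black", zorder=3, label="Mean")
ax.legend(frameon=False, loc="lower right")

# Figure size follows height and aspect, not rcParams["figure.figsize"]
fig_width, fig_height = ax.figure.get_size_inches()
```

With one subplot, the figure should come out height * aspect inches wide and height inches tall.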
According to "Plotting with seaborn using the matplotlib object-oriented interface", catplot is a figure-level function, so customizing it through the object-oriented interface is much harder than for some other types of graph.
The second group of functions (Figure-level) are distinguished by the
fact that the resulting plot can potentially include several Axes
which are always organized in a "meaningful" way. That means that the
functions need to have total control over the figure, so it isn't
possible to plot, say, an lmplot onto one that already exists. Calling
the function always initializes a figure and sets it up for the
specific plot it's drawing.

Removing legend from mpl parallel coordinates plot?

I have a parallel coordinates plot with lots of data points so I'm trying to use a continuous colour bar to represent that, which I think I have worked out. However, I haven't been able to remove the default key that is put in when creating the plot, which is very long and hinders readability. Is there a way to remove this table to make the graph much easier to read?
This is the code I'm currently using to generate the parallel coordinates plot:
parallel_coordinates(data[[' male_le', ' female_le', 'diet', 'activity', 'obese_perc', 'median_income']],
'median_income', colormap = 'rainbow', alpha = 0.5)
fig, ax = plt.subplots(figsize=(6, 1))
fig.subplots_adjust(bottom=0.5)
cmap = mpl.cm.rainbow
bounds = [0.00,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
norm = mpl.colors.BoundaryNorm(bounds, cmap.N,)
plt.colorbar(mpl.cm.ScalarMappable(norm = norm, cmap=cmap),cax = ax, orientation = 'horizontal',
label = 'normalised median income', alpha = 0.5)
plt.show()
Current Output:
I want my legend to be represented as a color bar, like this:
Any help would be greatly appreciated. Thanks.
You can use ax.legend_.remove() to remove the legend.
The cax parameter of plt.colorbar indicates the subplot in which to draw the colorbar. If you leave it out, matplotlib creates a new subplot, "stealing" space from the current subplot (subplots are often referred to as ax in matplotlib). So here, leaving out cax creates the desired colorbar; adding ax=ax isn't strictly necessary, because ax is already the current subplot.
The code below uses seaborn's penguin dataset to create a standalone example.
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
from pandas.plotting import parallel_coordinates
penguins = sns.load_dataset('penguins')
fig, ax = plt.subplots(figsize=(10, 4))
cmap = plt.get_cmap('rainbow')
bounds = np.arange(penguins['body_mass_g'].min(), penguins['body_mass_g'].max() + 200, 200)
norm = mpl.colors.BoundaryNorm(bounds, 256)
penguins = penguins.dropna(subset=['body_mass_g'])
parallel_coordinates(penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']],
'body_mass_g', colormap=cmap, alpha=0.5, ax=ax)
ax.legend_.remove()
plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cmap),
ax=ax, orientation='horizontal', label='body mass', alpha=0.5)
plt.show()
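To see the two effects in isolation, here is a minimal sketch that does not depend on any dataset: it removes a legend and then lets fig.colorbar create its own subplot by passing ax= instead of cax=. The line data and labels are placeholders chosen for illustration.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2], [1.0, 0.5, 0.8], label="series A")
ax.plot([0, 1, 2], [0.2, 0.9, 0.4], label="series B")
ax.legend()

# Remove the legend that is attached to the Axes
ax.legend_.remove()
legend_after_removal = ax.get_legend()

# A mappable that is not tied to any artist; it only drives the colorbar
mappable = mpl.cm.ScalarMappable(
    norm=mpl.colors.Normalize(vmin=0.0, vmax=1.0), cmap="rainbow")

axes_before = len(fig.axes)
# With ax= (and no cax=), matplotlib makes a new Axes and shrinks `ax` to fit
cbar = fig.colorbar(mappable, ax=ax, orientation="horizontal",
                    label="normalised value")
axes_after = len(fig.axes)
```

Passing cax=ax instead would draw the colorbar into the existing subplot, replacing the space where the plot itself lives.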

Plotting Two Histograms. Why can't one have kde while other not have it?

So I was going through the Kaggle Data Visualization Micro Course, and I reached the lesson on plotting histograms.
So the exercise asked to plot two histograms, and I did that and it worked. But if I add kde = False to one of the plots, only that plot is visible; the other plot isn't displayed:
sns.distplot(a = cancer_b_data['Area (mean)'], kde = False)
sns.distplot(a = cancer_m_data['Area (mean)'])
Don't know how stupid I sound, but any clarification would help. Thanks
With the default kde=True the kde is normalized such that the area under the curve is one. To go together in the same plot, the histogram will also be normalized such that the complete area of all bars sums to one.
With kde=False, the default histogram shows the frequency (the count in each bin), which gives much larger numbers. If you display both inside the same plot with the same axes, the normalized histogram will not disappear, but it will be very small. With the zoom tool you can verify that it is still there. To see both at the same size, sns.distplot(..., kde=False, norm_hist=True) can be used.
You'll note that the two histograms don't use the same bin boundaries. These boundaries are calculated using the number of samples and the minimum and maximum of the individual sets of samples.
To really compare two histograms, explicit bins can be set, so both use the same bin boundaries.
The following code and plot show the three approaches side by side:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
x1 = np.random.randn(100).cumsum()
x2 = np.random.randn(100).cumsum()
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(15, 4))
sns.distplot(a=x1, kde=False, ax=ax1)
sns.distplot(a=x2, ax=ax1)
ax1.set_title('one histogram without kde')
sns.distplot(a=x1, kde=False, norm_hist=True, ax=ax2)
sns.distplot(a=x2, ax=ax2)
ax2.set_title('setting norm_hist=True')
xmin = min(x1.min(), x2.min())
xmax = max(x1.max(), x2.max())
bins = np.linspace(xmin, xmax, 11)
sns.distplot(a=x1, kde=False, norm_hist=True, bins=bins, ax=ax3)
sns.distplot(a=x2, bins=bins, ax=ax3)
ax3.set_title('using the same bins')
plt.tight_layout()
plt.show()
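The difference between counts and a normalized histogram can also be checked numerically. The sketch below uses plain np.histogram rather than seaborn (sns.distplot is deprecated in newer seaborn releases), so it is an illustration of the normalization itself, not of what distplot does internally.

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.normal(size=100).cumsum()
x2 = rng.normal(size=100).cumsum()

# Raw counts: the bar heights add up to the number of samples
counts, edges = np.histogram(x1, bins=10)

# Normalized: bar height * bin width adds up to 1, like the area under a kde
density, _ = np.histogram(x1, bins=10, density=True)
area_counts = np.sum(counts * np.diff(edges))
area_density = np.sum(density * np.diff(edges))

# Shared bin edges so the two samples are directly comparable
shared_bins = np.linspace(min(x1.min(), x2.min()),
                          max(x1.max(), x2.max()), 11)
d1, _ = np.histogram(x1, bins=shared_bins, density=True)
d2, _ = np.histogram(x2, bins=shared_bins, density=True)

print(f"area with counts:  {area_counts:.2f}")
print(f"area with density: {area_density:.2f}")
```

The count-based area scales with the sample size and bin width, which is why an unnormalized histogram dwarfs a normalized one on shared axes.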

Change range of colors in plot(imshow)?

The values in my matrix called 'energy' are close to each other: e.g. one value can be 500, another 520. I want to see the color differences in my plot more clearly: the smallest value in my data should get a very dark color and the highest value a very bright color.
I have the following code:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
plt.imshow(energy[0:60, 0:5920], cmap='Reds')
ax.axes.set_aspect(aspect=100)
plt.grid(color='yellow')
plt.title('My plot')
plt.xlabel('Length points')
plt.ylabel('Time points(seconds)')
import matplotlib.ticker as plticker
loc = plticker.MultipleLocator(base=500)
ax.xaxis.set_major_locator(loc)
plt.show()
I get the following plot:
plot of energy
In other words, I'd love to get this plot more colorful.
Thanks in advance.
You can set a custom range either through a custom colormap or adjusting the range value to show using the keywords vmin and vmax. For example:
from matplotlib.pyplot import subplots
import numpy as np
fig, ax = subplots()
h = ax.imshow(np.random.rand(10,10) * 10, vmin = 0,\
vmax = 2, cmap = 'Reds')
fig.colorbar(h)
fig.show()
This maps the colormap onto the value range 0 to 2; values outside that range are drawn with the colors at the ends of the colormap.
Alternatively, you can rescale your data or adjust your colormap; see the matplotlib docs for more info.
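One possible way to pick vmin and vmax from the data is to use percentiles, so that a few extreme values do not stretch the color scale. This is only a sketch under assumptions: the array below is made-up stand-in data for energy, the outlier is added deliberately, and the 2nd/98th percentile cut-offs are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
energy = rng.uniform(500, 520, size=(60, 200))  # made-up stand-in data
energy[0, 0] = 5000  # one outlier that would otherwise dominate the scale

# Color limits from percentiles instead of the raw min and max
vmin, vmax = np.percentile(energy, [2, 98])

fig, ax = plt.subplots()
im = ax.imshow(energy, cmap="Reds", vmin=vmin, vmax=vmax, aspect="auto")
fig.colorbar(im, ax=ax)
```

Without explicit limits, imshow scales the colormap to the full data range, so a single value of 5000 would push everything between 500 and 520 into nearly the same shade.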

Configuring grid-lines in matplotlib plot

Consider the figure below.
This image has been set up with the following code.
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
fig, ax = plt.subplots()
ax.set_xlabel("Run Number", fontsize=25)
plt.grid(True, linestyle='--')
plt.tick_params(labelsize=20)
ax.set_xticklabels(map(str,range(number_of_runs)))
ax.minorticks_on()
ax.set_ylim([0.75,1.75])
I have not included the code that actually generates the data for plotting for the sake of clarity.
Unlike the diagram above, I would like to draw grid-lines perpendicular to the X-axis through each orange (and hence blue) dot. How do I do this?
The x-coordinates of the successive orange and blue dots form the same arithmetic progression in my code.
Also I notice that the tick numbers numbered 1,2,... are wrong for my application. Instead, I would like each successive grid-line, which I ask for as perpendicular to the X-axis in the previous step, to be numbered sequentially from 1 along the X-axis. How do I configure the Xtick marks for this?
The grid lines cross the xticks (or yticks).
You need to define xticks properly so that the grid lines cross your data points (the dots)
example below:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
number_of_runs = range(1,10) # use your actual number_of_runs
ax.set_xticks(number_of_runs, minor=False)
ax.xaxis.grid(True, which='major')
In case you want to have only vertical lines, add this:
ax.yaxis.grid(False, which='major')
Similar question here.
You should specify the exact places where you want the grids using a call to ax.set_xticks and then specify the exact numbers you want on the axis using a call to ax.set_xticklabels.
I am plotting two random arrays in the example below:
import matplotlib.pyplot as plt
import numpy as np
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
y1 = np.random.random(10)
y2 = np.random.random(10)
fig, ax = plt.subplots(ncols=2, figsize=(8, 3))
# equivalent to your figure
ax[0].plot(y1, 'o-')
ax[0].plot(y2, 'o-')
ax[0].grid(True, linestyle='--')
ax[0].set_title('Before')
# hopefully what you want
ax[1].plot(y1, 'o-')
ax[1].plot(y2, 'o-')
ax[1].set_title('After')
ax[1].set_xticks(range(0, len(y1)))
ax[1].set_xticklabels(range(1, len(y1)+1))
ax[1].grid(True, linestyle='--')
plt.show()
This is the output:
A note: looking at your plot, it seems that the actual x-axis values are not integers, but you want integers starting from 1. Probably the best way to do this is to pass only the y-axis data array to the plot command (plt.plot(y) instead of plt.plot(x, y)), as I have done above. You should decide whether this is appropriate for your case.
