How would I make a plot of this style in python with matplotlib? (Cumulative probability plot) I don't need complete code, mostly just need a place to start and a general idea of what I need to do for it.
A cumulative probability plot is really easy to make:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)
fig,ax = plt.subplots()
# sorted data on x, empirical cumulative fraction on y
ax.plot(np.sort(data), np.linspace(0.0, 1.0, len(data)))
plt.xlabel(r'$x$')
plt.ylabel(r'$P(X \leq x)$')
plt.show()
Note that it has a strong advantage over a probability density plot: it does not require binning of your data. (Should you be looking for the latter, see the sketch below.)
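For completeness, here is a minimal density-histogram sketch using the same synthetic data as above (the bin count and labels are my own choices, not part of the original answer):
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)

fig, ax = plt.subplots()
# density=True rescales the counts so the total area under the histogram is 1
ax.hist(data, bins=30, density=True, histtype='step')
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$p(x)$')
plt.show()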
The square attribute in sns.heatmap works in a weird manner. When I plot a heatmap using random numbers and use the square attribute, it works fine.
When I plot the heatmap with my matrix, it creates the heatmap properly.
However, when I use the square attribute, the plot becomes a tiny square.
I can't figure out what is going wrong over here.
Well, square=True means: "show all cells as squares". The only way to fit 7x560 squares into the plot region is to reduce the height by a factor of about 80. In other words, it is strongly recommended to use square=False for data with such a large difference between the horizontal and vertical extents. Seaborn isn't doing anything wrong here; it just gives you what you asked for.
If you want the heatmap to be square (instead of the cells), you can use ax = sns.heatmap(data, square=False) and then ax.set_aspect(data.shape[1] / data.shape[0]).
Here is an example:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
data = np.random.randn(7, 560).cumsum(axis=1).cumsum(axis=0)
data -= data.min(axis=1, keepdims=True)
data /= data.max(axis=1, keepdims=True)
ax = sns.heatmap(data, cmap='turbo', cbar=True, xticklabels=50,
                 yticklabels=['Grumpy', 'Dopey', 'Doc', 'Happy', 'Bashful', 'Sneezy', 'Sleepy'])
ax.set_aspect(data.shape[1] / data.shape[0])
ax.tick_params(labelrotation=0)
plt.tight_layout()
plt.show()
I have data in a CSV file. I am trying to plot a histogram using matplotlib.
Here is the code that I am trying.
data.hist(bins=10)
plt.ylabel('Frequency')
plt.xlabel('Data')
plt.show()
This is the plot that I get.
Now, using the same code, I need to create a normalized histogram that shows the probability distribution of the data: instead of plotting the number of data points that fall in each bin on the y-axis, I need to plot the number of data points in that bin divided by the total number of data points.
How should I do it?
Pandas' histogram adds some functionality to the underlying pyplot.hist(). Many of the parameters are passed through. One of them is density=.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
data = pd.DataFrame(np.random.uniform(258.1, 262.3, 20))
data.hist(bins=10, density=True)
plt.ylabel('Density')
plt.xlabel('Data')
plt.show()
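Note that density=True divides by the total number of points and by the bin width, so the y-axis is a density rather than a fraction. If you literally want counts divided by the total, one common trick (my own sketch, not part of the original answer) is to pass per-point weights of 1/N:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

data = pd.DataFrame(np.random.uniform(258.1, 262.3, 20))
values = data[0].to_numpy()
# every point contributes 1/N, so the bar heights sum to 1 across all bins
plt.hist(values, bins=10, weights=np.ones_like(values) / len(values))
plt.ylabel('Fraction of data points')
plt.xlabel('Data')
plt.show()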
A related library, seaborn, has a command to create a density histogram together with a kde curve as an approximation of the probability distribution.
import seaborn as sns
sns.distplot(data, bins=10)
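Be aware that distplot is deprecated in recent seaborn versions; as far as I know, the closest modern equivalent is histplot with a KDE overlay, roughly:
import seaborn as sns
# stat='density' matches the density-normalized histogram above; kde=True adds the smooth curve
sns.histplot(data, bins=10, stat='density', kde=True)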
How can I make a figure like the following one, but with a smooth curve, using matplotlib in Python?
Instead of using a histogram to bin your data, have a look at using a kernel density estimate (KDE) for a continuous estimate of the probability distribution. There is an implementation using a Gaussian kernel in scipy.stats.gaussian_kde.
As an example:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
data = np.random.normal(0.0, 1.0, 10000) #Generate some data
kde = gaussian_kde(data)
xplot = np.linspace(-5,5,1000)
plt.plot( xplot, kde(xplot), label='KDE' )
plt.hist(data, bins=50, histtype='step', density=True, label='histogram')
plt.legend()
plt.show()
Will produce the plot:
Note that when using KDEs, the bandwidth of the kernel you choose can have a very big impact on the representation of the data that gets produced; this is similar to the effect that the bin size has when making a histogram. Both the scipy documentation linked above and the Wikipedia page on kernel density estimation have good write-ups on how to make this selection in a well-motivated way.
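As a rough sketch of how to experiment with this, gaussian_kde accepts a bw_method argument (a rule name such as 'scott' or 'silverman', a scalar factor, or a callable); the specific values below are arbitrary choices for illustration:
# compare a few bandwidth choices on the same data as above
for bw in ['scott', 'silverman', 0.2]:
    kde = gaussian_kde(data, bw_method=bw)
    plt.plot(xplot, kde(xplot), label='bw_method={}'.format(bw))
plt.legend()
plt.show()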
I'm trying to reproduce this plot in python with little luck:
It's a simple number density contour currently done in SuperMongo. I'd like to drop SuperMongo in favor of Python, but the closest I can get is:
which I produced using hexbin(). How could I go about getting the python plot to resemble the SuperMongo one? I don't have enough rep to post images, sorry for the links. Thanks for your time!
Example simple contour plot from a fellow SuperMongo => python sufferer:
import numpy as np
from matplotlib import pyplot as plt
plt.interactive(True)
fig=plt.figure(1)
plt.clf()
# generate input data; you already have that
x1 = np.random.normal(0,10,100000)
y1 = np.random.normal(0,7,100000)/10.
x2 = np.random.normal(-15,7,100000)
y2 = np.random.normal(-10,10,100000)/10.
x=np.concatenate([x1,x2])
y=np.concatenate([y1,y2])
# calculate the 2D density of the data given
counts, xbins, ybins = np.histogram2d(x, y, bins=100, density=True)
# make the contour plot
plt.contour(counts.transpose(), extent=[xbins.min(), xbins.max(), ybins.min(), ybins.max()],
            linewidths=3, colors='black', linestyles='solid')
plt.show()
produces a nice contour plot.
The contour function offers a lot of fancy adjustments; for example, let's set the levels by hand:
plt.clf()
mylevels=[1.e-4, 1.e-3, 1.e-2]
plt.contour(counts.transpose(), mylevels, extent=[xbins.min(), xbins.max(), ybins.min(), ybins.max()],
            linewidths=3, colors='black', linestyles='solid')
plt.show()
producing this plot:
And finally, in SM one can do contour plots on linear and log scales, so I spent a little time trying to figure out how to do this in matplotlib. Here is an example when the y points need to be plotted on the log scale and the x points still on the linear scale:
plt.clf()
# this is our new data which ought to be plotted on the log scale
ynew=10**y
# but the binning needs to be done in linear space
counts, xbins, ybins = np.histogram2d(x, y, bins=100, density=True)
mylevels=[1.e-4,1.e-3,1.e-2]
# and the plotting needs to be done in the data (i.e., exponential) space
plt.contour(xbins[:-1], 10**ybins[:-1], counts.transpose(), mylevels,
            linewidths=3, colors='black', linestyles='solid')
plt.yscale('log')
plt.show()
This produces a plot which looks very similar to the linear one, but with a nice vertical log axis, which is what was intended:
Have you checked out matplotlib's contour plot?
Unfortunately I couldn't view your images. Do you mean something like this? It was done with MathGL, a GPL plotting library which also has a Python interface, and you can use arbitrary data arrays as input (including numpy arrays).
You can use numpy.histogram2d to get a number density distribution of your array.
Try this example:
http://micropore.wordpress.com/2011/10/01/2d-density-plot-or-2d-histogram/
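A minimal sketch of that histogram2d idea (the synthetic data and plotting choices are mine, not taken from the linked post):
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(0, 10, 100000)
y = np.random.normal(0, 7, 100000)

# 2D number density on a regular grid
counts, xedges, yedges = np.histogram2d(x, y, bins=100, density=True)
# contour wants bin centres, and the array transposed so y runs along rows
xc = 0.5 * (xedges[:-1] + xedges[1:])
yc = 0.5 * (yedges[:-1] + yedges[1:])
plt.contour(xc, yc, counts.T, colors='black')
plt.show()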
I would like to use Matplotlib to generate a scatter plot with a huge amount of data (about 3 million points). I have 3 vectors of the same dimension, and I currently plot them in the following way.
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)
# delta, vf and dS are the three equally sized data vectors
plt.scatter(delta, vf, c=dS, alpha=0.7, cmap=cm.Paired)
Nothing special, really. But it takes too long to generate (I'm working on my MacBook Pro with 4 GB RAM, Python 2.7 and Matplotlib 1.0). Is there any way to improve the speed?
Unless your graphic is huge, many of those 3 million points are going to overlap.
(A 400x600 image only has 240K dots...)
So the easiest thing to do would be to take a sample of say, 1000 points, from your data:
import numpy as np
# sample 1000 points without replacement directly from the array
delta_sample = np.random.choice(delta, 1000, replace=False)
and just plot that.
For example:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import random
fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)
N=3*10**6
delta=np.random.normal(size=N)
vf=np.random.normal(size=N)
dS=np.random.normal(size=N)
idx=random.sample(range(N),1000)
plt.scatter(delta[idx],vf[idx],c=dS[idx],alpha=0.7,cmap=cm.Paired)
plt.show()
Or, if you need to pay more attention to outliers, then perhaps you could bin your data using np.histogram and then compose a delta_sample which has representatives from each bin.
Unfortunately, when using np.histogram I don't think there is any easy way to associate bins with individual data points. A simple but approximate solution is to use the bin edges themselves as proxies for the points that fall in each bin:
xedges = np.linspace(-10, 10, 100)
yedges = np.linspace(-10, 10, 100)
zedges = np.linspace(-10, 10, 10)
# 3D histogram over (delta, vf, dS); every non-empty bin contributes one proxy point
hist, edges = np.histogramdd((delta, vf, dS), (xedges, yedges, zedges))
xidx, yidx, zidx = np.where(hist > 0)
plt.scatter(xedges[xidx], yedges[yidx], c=zedges[zidx], alpha=0.7, cmap=cm.Paired)
plt.show()
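If you would rather keep actual data points than bin-edge proxies, one possibility (my own addition, not part of the original answer) is np.digitize, which tells you which bin each point falls into, so you can keep one representative per occupied bin:
# assign each point to a bin along delta and keep the first point of every occupied bin
bin_ids = np.digitize(delta, xedges)
keep = [np.flatnonzero(bin_ids == b)[0] for b in np.unique(bin_ids)]
plt.scatter(delta[keep], vf[keep], c=dS[keep], alpha=0.7, cmap=cm.Paired)
plt.show()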
What about trying pyplot.hexbin? It generates a sort of heatmap based on point density in a set number of bins.
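A minimal sketch of that approach, reusing the synthetic delta/vf/dS arrays from the answer above (the gridsize value is an arbitrary choice):
# colour each hexagon by how many points fall into it
plt.hexbin(delta, vf, gridsize=50, cmap=cm.Paired)
plt.colorbar(label='points per bin')
# or colour by the median dS of the points in each hexagon:
# plt.hexbin(delta, vf, C=dS, reduce_C_function=np.median, gridsize=50, cmap=cm.Paired)
plt.show()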
You could take the heatmap approach shown here. In this example the color represents the quantity of data in the bin, not the median value of the dS array, but that should be easy to change. More later if you are interested.
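As a hedged sketch of that heatmap idea, including the switch from bin counts to the median dS per bin, scipy.stats.binned_statistic_2d can do the aggregation (this is my own illustration, not the code behind the link):
from scipy.stats import binned_statistic_2d

# median of dS in each 2D bin of (delta, vf); use statistic='count' for plain number density
stat, xed, yed, _ = binned_statistic_2d(delta, vf, dS, statistic='median', bins=100)
plt.imshow(stat.T, origin='lower', aspect='auto',
           extent=[xed[0], xed[-1], yed[0], yed[-1]], cmap=cm.Paired)
plt.colorbar(label='median dS per bin')
plt.show()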