I have a dataset that includes all the batting averages of baseball players. I assign each player in this dataset randomly to a cluster. Now I want to visually display each cluster in a stacked histogram. I use the following:
import numpy as np
import matplotlib.pyplot as plt

def chart(k=2):
    x = np.arange(0, 0.4, 0.001)
    for j in range(k):
        cluster = df.loc[df['cluster'] == j].reset_index()
        plt.hist(cluster['Average'], bins=50, density=1, stacked=True)
    plt.xlim(0, 0.4)
    plt.xlabel('Batting Average')
    plt.ylabel('Density')
    plt.show()
This gives me the following output:
However, I would like to see the following:
I created this chart by hard-coding the split of the dataset. Ideally, I want to do it dynamically in a loop. How can I also add a legend with the cluster names and specify the color for each cluster? Again, all in a loop. k can also be 10, for example.
Thanks in advance
Not providing data and a Minimal, Complete, and Verifiable example before asking a question makes it difficult to answer your problem. This is something you should keep in mind for next time. Nevertheless, here is one way that should work for you. The idea is to create an axis object ax and pass it to the function so that all the histograms are plotted on the same figure. Then you can modify the labels, limits etc. outside the function after plotting everything.
P.S.: As pointed out by Paul H in the comments below, the DataFrame df and the column names should also be passed as arguments to the chart function to make it more robust.
import numpy as np
import matplotlib.pyplot as plt

def chart(ax1, k=2):
    x = np.arange(0, 0.4, 0.001)
    for j in range(k):
        # df is the DataFrame from the question, with 'cluster' and 'Average' columns
        cluster = df.loc[df['cluster'] == j].reset_index()
        ax1.hist(cluster['Average'], bins=50, density=1, stacked=True)
    return ax1

fig, ax = plt.subplots()
ax = chart(ax, k=2)
plt.xlim(0, 0.4)
plt.xlabel('Batting Average')
plt.ylabel('Density')
plt.show()
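Note that calling hist once per cluster overlays the histograms rather than stacking them, because stacked=True only has an effect when several datasets are passed in a single call. A minimal sketch of that variant, which also adds the legend and one color per cluster, assuming a DataFrame with 'cluster' and 'Average' columns as in the question. The function name stacked_chart, the tab10 colormap and the synthetic data are my own choices for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def stacked_chart(ax, df, value_col="Average", cluster_col="cluster", bins=50):
    """Draw one stacked histogram with a colored, labeled layer per cluster."""
    cluster_ids = sorted(df[cluster_col].unique())
    # One array of values per cluster; hist() stacks them when given a list.
    groups = [df.loc[df[cluster_col] == c, value_col].to_numpy()
              for c in cluster_ids]
    labels = [f"Cluster {c}" for c in cluster_ids]
    # One color per cluster from a qualitative colormap with 10 entries.
    cmap = plt.get_cmap("tab10")
    colors = [cmap(i % 10) for i in range(len(cluster_ids))]
    ax.hist(groups, bins=bins, range=(0, 0.4), density=True, stacked=True,
            color=colors, label=labels)
    ax.legend()
    return ax


# Synthetic stand-in for the real data (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Average": rng.normal(0.26, 0.03, size=500).clip(0, 0.4),
    "cluster": rng.integers(0, 3, size=500),
})

fig, ax = plt.subplots()
stacked_chart(ax, df)
ax.set_xlim(0, 0.4)
ax.set_xlabel("Batting Average")
ax.set_ylabel("Density")
```

Call plt.show() afterwards to display the figure. With more than 10 clusters the colors of this particular colormap would repeat, so a larger colormap or an explicit list of colors would be needed.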
I need some help with a pyplot bar chart that isn't doing what it should, and I cannot figure out why.
So basically what I need to do is draw the power function of a binomial distribution test. First I plot the binomial distribution and mark important values.
from scipy.stats import binom
import numpy as np
import matplotlib.pyplot as plt
n = 20
p = 1/2
x_values = list(range(n + 1))
prob = [binom.pmf(x, n, p) for x in x_values ]
cumult = 0
index_count = 0
for px in prob:
    cumult += px
    print(cumult)
    if cumult > 0.1:
        print(index_count-1)
        break
    else:
        index_count = index_count + 1
plt.bar(x_values,prob)
plt.axvline(x=6, color='red', linestyle='-', label='Grenze')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Binomial distribution plot
So far so good. Looks exactly like it should. Now for the power function what I do is add up the single probabilities from prob, and for each one, I calculate their probability of failing the test. Now the graph for this should look something like this for example
Example Graph
(ofc as a bar chart in my case)
Yet, my code
p_values = []
err_p = []
cumul = 0
for p in prob:
    cumul = cumul + p
    p_values.append(cumul)
    err_p.append(1-cumul)
x_pos = np.arange(len(p_values))
plt.bar(p_values, err_p)
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Produces this weird bar chart
which has values in the negatives and over 1 on the x-axis even though there are no values like this in the data??? I know that it worked once before I marked the values in this chart as well, but I haven't been able to reproduce it. I always get the one with non-existent values. I also don't know if it may have to do with the weirdly wide bars since in the first graph they look normal but here they sort of flow into each other.
For your task, you don't want to use a bar plot but a step plot:
plt.step(x=p_values, y=err_p, where="mid", label="err")
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Sample output:
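Bars usually have a constant width (0.8 by default in plt.bar, centered on the x value), hence they leak into x-ranges that are not actually in your dataset. That is why the chart shows bars below 0 and above 1 even though all your x-values lie between 0 and 1, and why closely spaced bars overlap. You could manually calculate the necessary width of each bar, but thankfully matplotlib has implemented the step function for this task.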
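For completeness, here is a rough sketch of the manual-width alternative: each bar starts at its own x value and ends where the next one begins. It recomputes the cumulative probabilities with the standard library instead of scipy, and the names p0, widths, bars, left and right are mine, not from the question:

```python
import math

import numpy as np
import matplotlib.pyplot as plt

n, p0 = 20, 0.5
# Binomial pmf for k = 0..n, then the cumulative sums as in the question.
prob = [math.comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
p_values = np.cumsum(prob)
err_p = 1 - p_values

# Width of each bar = distance to the next x value, so no bar can reach
# outside the range covered by the data.
widths = np.diff(p_values)
fig, ax = plt.subplots()
bars = ax.bar(p_values[:-1], err_p[:-1], width=widths, align="edge")
ax.axvline(x=0.5, color="red", linestyle="-", label="p0")
ax.axhline(y=0.1, color="green", linestyle="--", label="Signifikanzniveau")
ax.legend()

# Left and right edge of everything that was drawn as a bar.
left = min(b.get_x() for b in bars)
right = max(b.get_x() + b.get_width() for b in bars)
```

Here left and right stay within the interval from 0 to 1, in contrast to the default fixed-width bars. The step function remains the shorter solution; this only illustrates what it saves you from doing by hand.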
If you wanted a filled plot like a histogram, you could use fill_between:
plt.fill_between(x=p_values, y1=err_p, step="mid", color="lightblue", label="err")
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Sample output:
I want to create a plot that looks like the plot attached below.
My data frame is built at this format:
  Playlist       Type  Streams
0        a  classical       94
1        b    hip-hop       12
2        c  classical        8
The 'popularity' category can be replaced by the 'streams' - the only thing is that the streams variable has a high variance of values (goes from 0 to 10,000+) and therefore I believe the density graph might look weird.
However, my first question is how can I plot a graph similar to this in Pandas, when grouping by the 'Type' column and then creating the density graph.
I tried various methods but did not find a good one to achieve my goal.
To augment the answer of @Student240, you could make use of the seaborn library, which makes it easy to fit kernel density estimates (KDE). In other words, you get smooth curves similar to the one in your question rather than a binned histogram. This is done with the kdeplot function. A related plot type is distplot, which gives the KDE estimate but also shows the histogram bins (note that distplot is deprecated in recent seaborn releases in favor of histplot and displot).
Another difference in my answer is the use of the explicit object-oriented approach in matplotlib/seaborn. This involves initially declaring figure and axes objects with plt.subplots() rather than using the implicit approach of plt.hist. See this really good tutorial for more details.
import random

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## This block of code is copied from Student240's answer:
categories = ['classical', 'hip-hop', 'indiepop', 'indierock', 'jazz',
              'metal', 'pop', 'rap', 'rock']
# NB I use a slightly different random variable assignment to introduce a bit more variety in my random numbers.
df = pd.DataFrame({'Type': [random.choice(categories) for _ in range(1000)],
                   'stream': [random.normalvariate(i, random.randint(0, 15)) for i in
                              range(1000)]})
### split the data into groups based on types
g = df.groupby('Type')

## From here things change as I make use of the seaborn library
classical = g.get_group('classical')
hiphop = g.get_group('hip-hop')
indiepop = g.get_group('indiepop')
indierock = g.get_group('indierock')

fig, ax = plt.subplots()
ax = sns.kdeplot(data=classical['stream'], label='classical streams', ax=ax)
ax = sns.kdeplot(data=hiphop['stream'], label='hiphop streams', ax=ax)
ax = sns.kdeplot(data=indiepop['stream'], label='indiepop streams', ax=ax)
# for this final one I use the fill option (called shade in older seaborn versions) just to show how it is done:
ax = sns.kdeplot(data=indierock['stream'], label='indierock streams', ax=ax, fill=True)
ax.set_xlabel('Count')
ax.set_ylabel('Density')
ax.set_title('KDE plot example from seaborn')
# the labels only show up once a legend is requested
ax.legend()
plt.show()
Hi, you can try the following example. I have used random normal values just for this example; obviously it wouldn't be possible to have negative streams. Anyway, disclaimer over, here is the code:
import random

import pandas as pd
import matplotlib.pyplot as plt

categories = ['classical','hip-hop','indiepop','indierock','jazz',
              'metal','pop','rap','rock']
df = pd.DataFrame({'Type':[random.choice(categories) for _ in range(10000)],
                   'stream':[random.normalvariate(0,random.randint(0,15)) for _ in
                             range(10000)]})
###split the data into groups based on types
g = df.groupby('Type')
###access the classical group
classical = g.get_group('classical')
plt.figure(figsize=(15,6))
plt.hist(classical.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="Classical Streams", color="#D73A30", density=True)
plt.legend(loc="upper left")
###hip hop
hiphop = g.get_group('hip-hop')
plt.hist(hiphop.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="hiphop Streams", color="#2A3586", density=True)
plt.legend(loc="upper left")
###indie pop
indiepop = g.get_group('indiepop')
plt.hist(indiepop.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="indie pop streams", color="#5D271B", density=True)
plt.legend(loc="upper left")
#indierock
indierock = g.get_group('indierock')
plt.hist(indierock.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="indie rock Streams", color="#30A9D7", density=True)
plt.legend(loc="upper left")
##jazz
jazz = g.get_group('jazz')
plt.hist(jazz.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="jazz Streams", color="#30A9D7", density=True)
plt.legend(loc="upper left")
####you can add other here if you wish
##modify this to control x-axis, possibly useful for high-variance data
plt.xlim([-20,20])
plt.title('Distribution of Streams by Genre')
plt.xlabel('Count')
plt.ylabel('Density')
You can Google 'Hex color picker' if you want to get a specific '#000000' color in the format I have used in this example.
Modify the 'alpha' parameter if you want to change how opaque the color is. You can also play around with 'bins' in the example I provided, as this should allow you to make the plot look better if 50 is too large or too small.
I hope this helps, plotting in matplotlib can be a pain to learn, but it is surely worth it!!
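The per-genre blocks above all repeat the same call, so they can be folded into a loop over the groupby object. A minimal sketch under the same assumptions (random normal values standing in for real stream counts); the function name plot_genre_hists, the shortened category list and the fixed seed are my additions:

```python
import random

import pandas as pd
import matplotlib.pyplot as plt

random.seed(0)
categories = ['classical', 'hip-hop', 'indiepop', 'indierock', 'jazz']
df = pd.DataFrame({
    'Type': [random.choice(categories) for _ in range(5000)],
    'stream': [random.normalvariate(0, random.randint(1, 15))
               for _ in range(5000)],
})


def plot_genre_hists(df, ax, bins=50):
    """Overlay one semi-transparent density histogram per genre."""
    for genre, group in df.groupby('Type'):
        # No color is given, so each call takes the next color
        # from matplotlib's default color cycle.
        ax.hist(group['stream'], histtype='stepfilled', bins=bins, alpha=0.2,
                label=f"{genre} streams", density=True)
    ax.legend(loc="upper left")
    return ax


fig, ax = plt.subplots(figsize=(15, 6))
plot_genre_hists(df, ax)
ax.set_xlim(-20, 20)
ax.set_title('Distribution of Streams by Genre')
ax.set_xlabel('Count')
ax.set_ylabel('Density')
```

If specific colors are wanted, a dictionary mapping each genre to a hex color could be passed in and looked up inside the loop via the color argument.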
say I was testing a range of parameters of a clustering algorithm and I wanted to write python code that would plot all the results of the algorithm in subplots 2 to a row
is there a way to do this without pre-calculating how many total plots you would need?
something like:
for c in range(3,10):
    k = KMeans(n_clusters=c)
    plt.subplots(_, 2, _)
    plt.scatter(data=data, x='x', y='y', c=k.fit_predict(data))
... and then it would just plot 'data' with 'c' clusters 2 plots per row until it ran out of stuff to plot.
thanks!
This answer from the question Dynamically add/create subplots in matplotlib explains a way to do it:
https://stackoverflow.com/a/29962074/3827277
verbatim copy & paste:
import matplotlib.pyplot as plt
# Start with one
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot([1,2,3])
# Now later you get a new subplot; change the geometry of the existing
n = len(fig.axes)
for i in range(n):
    fig.axes[i].change_geometry(n+1, 1, i+1)
# Add the new
ax = fig.add_subplot(n+1, 1, n+1)
ax.plot([4,5,6])
plt.show()
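Be aware that change_geometry has, as far as I know, been deprecated and later removed in newer Matplotlib releases, so the snippet above may fail on a current installation. If the list of parameter values is known up front (which the range(3, 10) in the question already implies), a simpler sketch is to derive the number of rows from the length of that list. The helper name grid_for and the random data are hypothetical, and the random labels only stand in for the KMeans result:

```python
import math

import numpy as np
import matplotlib.pyplot as plt


def grid_for(n_plots, ncols=2):
    """Create a grid with ncols columns and just enough rows for n_plots."""
    nrows = math.ceil(n_plots / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                             squeeze=False)
    flat = axes.ravel()
    # Hide the leftover cells in the last row, if any.
    for ax in flat[n_plots:]:
        ax.set_visible(False)
    return fig, flat[:n_plots]


params = list(range(3, 10))          # 7 parameter values -> 4 rows of 2
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))     # placeholder for the real data

fig, axes = grid_for(len(params))
for ax, c in zip(axes, params):
    # Placeholder coloring; replace with KMeans(n_clusters=c).fit_predict(data).
    labels = rng.integers(0, c, size=len(data))
    ax.scatter(data[:, 0], data[:, 1], c=labels)
    ax.set_title(f"{c} clusters")
fig.tight_layout()
```

This does compute the total once, but only from the length of the parameter list, so nothing has to be counted by hand.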
Here is the code for plotting the figures. Why are there always two empty figures before the third, expected figure? It seems I created two blank figures.
I also cannot save the figure on my local computer with fig.savefig('Sens.png'). It raises the error "The C++ part of the object has been deleted, attribute access no longer allowed" (saving actually succeeded only once).
fig = plt.figure(figsize=(10,10))
m = 1
for s in dataList:
    plt.subplot(2,2,m)
    f = interp1d(FXSpotList, s, 'cubic')
    xnew = np.linspace(FXSpotList[0], FXSpotList[-1], 40, True)
    plt.plot(xnew, f(xnew), '-')
    plt.xlabel('Spot')
    plt.ylabel(titleList[m-1])
    plt.axvline(x=tradeTest.Pair().Spot(), linestyle='--')
    plt.axhline(y=0, linestyle='--')
    m = m + 1
    plt.figtext(0.5, 0.01, 'Type='+str(tradeTest.Types()[0]), ha='center')
    plt.tight_layout()
    plt.show()
    plt.close()
    fig.savefig('Sens.png')
Although you did not provide a Minimal, Complete, and Verifiable example, it is obvious that there are things wrong with your loop construction. You show, close, and then save the plot in every loop iteration, which is probably not what you intend to do. A minimal example of your loop would be
import numpy as np
from matplotlib import pyplot as plt
#sample list to iterate over
dataList = ["fig1", "fig2", "fig3"]
plt.figure(figsize=(10,10))
#loop over the list, retrieve data entries and index
for i, s in enumerate(dataList):
    #define position of the plot in a 2 x 2 grid
    plt.subplot(2, 2, i + 1)
    #random plot, insert your calculations here
    plt.plot(range(3), np.random.randint(0, 10, 3))
    #utilize list data
    plt.title(s)
#save figure
plt.savefig('test.png')
#show figure
plt.show()
I'm trying to reproduce the following chart:
But I'm not sure if it's actually possible to create such a plot using Python, R, or Tableau.
Here is my first attempt using Plotly in R:
Do you have any suggestion for creating such a chart?
You can use R and the highcharter package to create a plot like this one:
spiderweb plot
The JS code for the plot is at www.highcharts.com/demo/polar-spider
While I was working on creating this plot with matplotlib, someone mentioned that I can create this chart in Excel in less than 2 minutes, so I didn't complete the code. Still, as I had already figured out how to create the different elements of the plot in matplotlib, I'm putting the code here in case anyone wants to create such a thing.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig1 = plt.figure()
#create the axes once; all grid polygons are added to it
ax1 = fig1.add_subplot(111, aspect='equal')
#Adding grids
for rad in reversed(range(1,10)): #ranks 1 to 9, drawn from the outside in
    ax1.add_patch(
        patches.RegularPolygon(
            (0,0), #center of the shape
            11, #number of vertices
            radius=rad,
            fill=False,
            ls='--',
        ))
plt.xlim(-10, 10)
plt.ylim(-10, 10)
fig1.show()
#plotting the trend
plt.scatter(xs,ys) #xs = list of x coordinates, the same for ys
for k in range(len(xs)-1):
    x, y = [xs[k], xs[k+1]], [ys[k], ys[k+1]]
    plt.plot(x, y,color = 'b')
plt.grid(False)
plt.show()
Result plot
(As I said the code doesn't create the whole trends, labels,...but it's pretty much all you need to create the plot)
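The snippet leaves xs and ys undefined. One way to obtain them, assuming each of the 11 categories has a rank between 1 and 9 and the categories are spread evenly around the circle, is a polar-to-Cartesian conversion. The function name ranks_to_xy and the sample ranks are made up for illustration. I start at the top (angle pi/2), which I believe matches the default orientation of RegularPolygon, but the points may need a rotation if they do not line up with the grid:

```python
import math


def ranks_to_xy(ranks, start_angle=math.pi / 2):
    """Convert one rank per category into x/y points on evenly spaced spokes.

    The first point is repeated at the end so that a single plot call
    draws a closed outline.
    """
    n = len(ranks)
    xs, ys = [], []
    for i, r in enumerate(ranks):
        # Spoke i sits at an equal angular step from the starting angle.
        theta = start_angle + 2 * math.pi * i / n
        xs.append(r * math.cos(theta))
        ys.append(r * math.sin(theta))
    # Repeat the first point to close the polygon.
    return xs + xs[:1], ys + ys[:1]


ranks = [3, 5, 2, 8, 6, 4, 7, 1, 9, 5, 3]   # made-up ranks for 11 categories
xs, ys = ranks_to_xy(ranks)
```

With closed lists like these, a single plt.plot(xs, ys, color='b') draws the whole outline, so the segment-by-segment loop is not strictly needed.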