I want to create a plot that looks like the plot attached below.
My data frame is built in this format:
   Playlist       Type  Streams
0         a  classical       94
1         b    hip-hop       12
2         c  classical        8
The 'popularity' category in the attached plot can be replaced by 'Streams'; the only issue is that the streams variable has a high variance (values range from 0 to 10,000+), so I suspect the density plot might look odd.
However, my first question is how I can produce a similar plot in Pandas by grouping on the 'Type' column and then drawing a density curve for each group. I have tried various approaches but have not found one that achieves this.
To augment the answer of #Student240, you could make use of the seaborn library, which makes it easy to fit kernel density estimates. In other words, it gives you smooth curves like those in your question rather than a binned histogram. This is done with the kdeplot function. A related plot type is distplot, which shows the KDE estimate together with the histogram bins.
Another difference in my answer is the use of the explicit object-oriented approach in matplotlib/seaborn. This involves first creating figure and axes objects with plt.subplots(), rather than the implicit approach of calling plt.hist directly. See this really good tutorial for more details.
import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
## This block of code is copied from Student240's answer:
categories = ['classical','hip-hop','indiepop','indierock','jazz'
              ,'metal','pop','rap','rock']
# NB I use a slightly different random variable assignment to introduce
# a bit more variety in my random numbers.
df = pd.DataFrame({'Type':[random.choice(categories) for _ in range(1000)],
                   'stream':[random.normalvariate(i, random.randint(0, 15)) for i in
                             range(1000)]})
###split the data into groups based on types
g = df.groupby('Type')
## From here things change as I make use of the seaborn library
classical = g.get_group('classical')
hiphop = g.get_group('hip-hop')
indiepop = g.get_group('indiepop')
indierock = g.get_group('indierock')
fig, ax = plt.subplots()
ax = sns.kdeplot(data=classical['stream'], label='classical streams', ax=ax)
ax = sns.kdeplot(data=hiphop['stream'], label='hiphop streams', ax=ax)
ax = sns.kdeplot(data=indiepop['stream'], label='indiepop streams', ax=ax)
# for this final one I use the shade option just to show how it is done
# (in newer seaborn versions, use fill=True instead of shade=True):
ax = sns.kdeplot(data=indierock['stream'], label='indierock streams', ax=ax, shade=True)
ax.set_xlabel('Count')
ax.set_ylabel('Density')
ax.set_title('KDE plot example from seaborn')
ax.legend()
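As a side note, recent seaborn versions (0.11 and later) can draw all of the curves in a single call by passing the whole DataFrame and a hue column, which avoids the manual groupby entirely. A minimal sketch, assuming the df built above:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
# one KDE curve per value of 'Type', coloured and labelled automatically
sns.kdeplot(data=df, x='stream', hue='Type', ax=ax)
ax.set_xlabel('Count')
ax.set_ylabel('Density')
plt.show()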
Hi, you can try the following example. I have used random normals just for this example; obviously it wouldn't be possible to have negative streams. Anyway, disclaimer over, here is the code:
import random
import matplotlib.pyplot as plt
import pandas as pd
categories = ['classical','hip-hop','indiepop','indierock','jazz'
              ,'metal','pop','rap','rock']
df = pd.DataFrame({'Type':[random.choice(categories) for _ in range(10000)],
                   'stream':[random.normalvariate(0, random.randint(0, 15)) for _ in
                             range(10000)]})
###split the data into groups based on types
g = df.groupby('Type')
###access the classical group
classical = g.get_group('classical')
plt.figure(figsize=(15,6))
plt.hist(classical.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="Classical Streams", color="#D73A30", density=True)
plt.legend(loc="upper left")
###hip hop
hiphop = g.get_group('hip-hop')
plt.hist(hiphop.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="hiphop Streams", color="#2A3586", density=True)
plt.legend(loc="upper left")
###indie pop
indiepop = g.get_group('indiepop')
plt.hist(indiepop.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="indie pop streams", color="#5D271B", density=True)
plt.legend(loc="upper left")
#indierock
indierock = g.get_group('indierock')
plt.hist(indierock.stream, histtype='stepfilled', bins=50, alpha=0.2,
label="indie rock Streams", color="#30A9D7", density=True)
plt.legend(loc="upper left")
##jazz (note: use a colour different from indie rock's so the two groups are distinguishable)
jazz = g.get_group('jazz')
plt.hist(jazz.stream, histtype='stepfilled', bins=50, alpha=0.2,
         label="jazz Streams", color="#9467BD", density=True)
plt.legend(loc="upper left")
####you can add other here if you wish
##modify this to control x-axis, possibly useful for high-variance data
plt.xlim([-20,20])
plt.title('Distribution of Streams by Genre')
plt.xlabel('Count')
plt.ylabel('Density')
You can Google 'Hex color picker' if you want a specific '#000000'-style color in the format I have used in this example.
Modify the 'alpha' value if you want to change how opaque the colors are, and you can also play around with 'bins'; this should let you improve the look if 50 is too large or too small.
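On the high-variance concern from the question (streams ranging from 0 to 10,000+): rather than clipping with xlim, one option is to histogram on logarithmically spaced bins and use a log x-axis. A minimal sketch, assuming your real DataFrame is called df and has the 'Type' and 'Streams' columns shown in the question (adjust names as needed):
import numpy as np
import matplotlib.pyplot as plt

classical = df[df['Type'] == 'classical']
# logarithmically spaced bins from 1 up to the largest stream count
# (streams of exactly 0 fall outside these bins)
bins = np.logspace(0, np.log10(df['Streams'].max()), 50)
plt.hist(classical['Streams'], bins=bins, histtype='stepfilled',
         alpha=0.2, label='Classical Streams', density=True)
plt.xscale('log')
plt.legend(loc='upper left')
plt.xlabel('Streams')
plt.ylabel('Density')
plt.show()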
I hope this helps; plotting in matplotlib can be a pain to learn, but it is surely worth it!
I have a dataset that includes all the batting averages of baseball players. I assign each player in this dataset randomly to a cluster. Now I want to visually display each cluster in a stacked histogram. I use the following:
import matplotlib.pyplot as plt
import numpy as np

def chart(k=2):
    x = np.arange(0, 0.4, 0.001)
    for j in range(k):
        cluster = df.loc[df['cluster'] == j].reset_index()
        plt.hist(cluster['Average'], bins=50, density=1, stacked=True)
    plt.xlim(0, 0.4)
    plt.xlabel('Batting Average')
    plt.ylabel('Density')
    plt.show()
This gives me the following output:
However, I would like to see the following:
I created this chart by splitting the dataset "hard-coded". Ideally, I want to do it dynamically in a loop. How can I also add a legend with the cluster names and specify a color for each cluster, again all inside the loop? K could also be 10, for example.
Thanks in advance
Not providing data and a Minimal, Complete, and Verifiable example makes it difficult for people to answer your question; this is something to keep in mind next time. Nevertheless, here is one way that should work for you. The idea is to create an axes object ax and pass it into the function so that all the histograms are plotted on the same figure. Then you can modify the labels, limits, etc. outside the function after plotting everything.
P.S.: As pointed out by Paul H in the comments below, the DataFrame df and the column names should be passed as arguments to the chart function as well, to make it more robust.
import matplotlib.pyplot as plt
import numpy as np

def chart(ax1, k=2):
    x = np.arange(0, 0.4, 0.001)
    for j in range(k):
        cluster = df.loc[df['cluster'] == j].reset_index()
        ax1.hist(cluster['Average'], bins=50, density=1, stacked=True)
    return ax1

fig, ax = plt.subplots()
ax = chart(ax, k=2)
plt.xlim(0, 0.4)
plt.xlabel('Batting Average')
plt.ylabel('Density')
plt.show()
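Following the P.S. above, here is a sketch of a more robust version that also takes the DataFrame and column names as arguments and covers the legend/color part of the question; the parameter names and the choice of the tab10 colormap are mine, not from the original answer:
import matplotlib.pyplot as plt
import numpy as np

def chart(ax1, df, value_col='Average', cluster_col='cluster', k=2):
    # one color per cluster, drawn from a qualitative colormap
    colors = plt.cm.tab10(np.linspace(0, 1, k))
    for j in range(k):
        cluster = df.loc[df[cluster_col] == j]
        ax1.hist(cluster[value_col], bins=50, density=True,
                 color=colors[j], alpha=0.6, label='Cluster {}'.format(j))
    ax1.legend()
    return ax1

fig, ax = plt.subplots()
ax = chart(ax, df, k=2)
ax.set_xlim(0, 0.4)
ax.set_xlabel('Batting Average')
ax.set_ylabel('Density')
plt.show()
With larger k the same loop simply produces more colors and legend entries.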
I am working on a task called knowledge tracing which estimates the student mastery level over time. I would like to plot a similar figure as below using the Matplotlib or Seaborn.
It uses different colors to represent each knowledge concept, instead of text labels. However, I have googled around and found no article on how to do this.
I tried the following
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# simulate a record of student mastery level
student_mastery = np.random.rand(5, 30)
df = pd.DataFrame(student_mastery)
# plot the heatmap using seaborn
marker = matplotlib.markers.MarkerStyle(marker='o', fillstyle='full')
sns_plot = sns.heatmap(df, cmap="RdYlGn", vmin=0.0, vmax=1.0)
y_limit = 5
y_labels = [marker for i in range(y_limit)]
plt.yticks(range(y_limit), y_labels)
Yet it simply returns the __repr__ of the marker, e.g., <matplotlib.markers.MarkerStyle at 0x1c5bb07860> on the yticks.
Thanks in advance!
While "How can I make the xtick labels of a plot be simple drawings using matplotlib?" gives you a general solution for arbitrary shapes, for the shapes shown here it may make sense to use unicode symbols as text and colorize them according to your needs.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
fig, ax = plt.subplots()
ax.imshow(np.random.rand(3,10), cmap="Greys")
symbolsx = ["⚪", "⚪", "⚫", "⚫", "⚪", "⚫","⚪", "⚫", "⚫","⚪"]
colorsx = np.random.choice(["#3ba1ab", "#b43232", "#8ecc3a", "#893bab"], 10)
ax.set_xticks(range(len(symbolsx)))
ax.set_xticklabels(symbolsx, size=40)
for tick, color in zip(ax.get_xticklabels(), colorsx):
    tick.set_color(color)
symbolsy = ["◾", "◾", "◾"]
ax.set_yticks(range(len(symbolsy)))
ax.set_yticklabels(symbolsy, size=40)
for tick, color in zip(ax.get_yticklabels(), ["crimson", "gold", "indigo"]):
    tick.set_color(color)
plt.show()
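Applied to the seaborn heatmap from your question, a minimal sketch might look like the following; the symbol and hex-color choices here are arbitrary placeholders for your knowledge concepts:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

np.random.seed(0)
student_mastery = np.random.rand(5, 30)

fig, ax = plt.subplots()
sns.heatmap(student_mastery, cmap="RdYlGn", vmin=0.0, vmax=1.0, ax=ax)

# one unicode symbol per knowledge concept (row), each with its own colour
symbols = ["●", "●", "●", "●", "●"]
colors = ["#3ba1ab", "#b43232", "#8ecc3a", "#893bab", "#d7a430"]
ax.set_yticks(np.arange(len(symbols)) + 0.5)  # heatmap cells are centred at i + 0.5
ax.set_yticklabels(symbols, size=20, rotation=0)
for tick, color in zip(ax.get_yticklabels(), colors):
    tick.set_color(color)
plt.show()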
I am using matplotlib.pyplot.specgram and matplotlib.pyplot.pcolormesh to make spectrogram plots of a seismic signal.
Background information: the reason for using pcolormesh is that I need to do arithmetic on the spectrogram data array and then replot the resulting spectrogram (for a three-component seismogram - east, north and vertical - I need to work out the horizontal spectral magnitude and divide the vertical spectra by the horizontal spectra). It is easier to do this on the spectrogram array data than on individual amplitude spectra.
I have found that the plots of the spectrograms after doing my arithmetic have unexpected values. Upon further investigation, it turns out that the plot produced directly by pyplot.specgram has different values from the plot produced by passing the data array returned by pyplot.specgram to pyplot.pcolormesh. Both plots/arrays should contain the same values, and I cannot work out why they do not.
Example:
The plot of
plt.subplot(513)
PxN, freqsN, binsN, imN = plt.specgram(trN.data, NFFT = 20000, noverlap = 0, Fs = trN.stats.sampling_rate, detrend = 'mean', mode = 'magnitude')
plt.title('North')
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')
plt.clim(0, 150)
plt.colorbar()
#np.savetxt('PxN.txt', PxN)
looks different to the plot of
plt.subplot(514)
plt.pcolormesh(binsZ, freqsZ, PxN)
plt.clim(0,150)
plt.colorbar()
even though the "PxN" data array (that is, the spectrogram data values for each segment) is generated by the first method and re-used in the second.
Is anyone aware why this is happening?
P.S. I realise that my value for NFFT is not a power of 2, but that's not important at this stage of my coding.
P.P.S. I am not aware of what the "imN" value (the fourth variable returned by pyplot.specgram) is or what it is used for...
First off, let's show an example of what you're describing so that other folks can reproduce the problem:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# Brownian noise sequence
x = np.random.normal(0, 1, 10000).cumsum()
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(8, 10))
values, ybins, xbins, im = ax1.specgram(x, cmap='gist_earth')
ax1.set(title='Specgram')
fig.colorbar(im, ax=ax1)
mesh = ax2.pcolormesh(xbins, ybins, values, cmap='gist_earth')
ax2.axis('tight')
ax2.set(title='Raw Plot of Returned Values')
fig.colorbar(mesh, ax=ax2)
plt.show()
Magnitude Differences
You'll immediately notice the difference in magnitude of the plotted values.
By default, plt.specgram doesn't plot the "raw" values it returns. Instead, it scales them to decibels (in other words, it plots the 10 * log10 of the amplitudes). If you'd like it not to scale things, you'll need to specify scale="linear". However, for looking at frequency composition, a log scale is going to make the most sense.
With that in mind, let's mimic what specgram does:
plotted = 10 * np.log10(values)
fig, ax = plt.subplots()
mesh = ax.pcolormesh(xbins, ybins, plotted, cmap='gist_earth')
ax.axis('tight')
ax.set(title='Plot of $10 * log_{10}(values)$')
fig.colorbar(mesh)
plt.show()
Using a Log Color Scale Instead
Alternatively, we could use a log norm on the image and get a similar result, but communicate that the color values are on a log scale more clearly:
from matplotlib.colors import LogNorm
fig, ax = plt.subplots()
mesh = ax.pcolormesh(xbins, ybins, values, cmap='gist_earth', norm=LogNorm())
ax.axis('tight')
ax.set(title='Log Normalized Plot of Values')
fig.colorbar(mesh)
plt.show()
imshow vs pcolormesh
Finally, note that the examples we've shown have had no interpolation applied, while the original specgram plot did. specgram uses imshow, while we've been plotting with pcolormesh. In this case (regular grid spacing) we can use either.
Both imshow and pcolormesh are very good options in this case. However, imshow will have significantly better performance if you're working with a large array. Therefore, you might consider using it instead, even if you don't want interpolation (e.g. pass interpolation='nearest' to turn interpolation off).
As an example:
extent = [xbins.min(), xbins.max(), ybins.min(), ybins.max()]
fig, ax = plt.subplots()
mesh = ax.imshow(values, extent=extent, origin='lower', aspect='auto',
cmap='gist_earth', norm=LogNorm())
ax.axis('tight')
ax.set(title='Log Normalized Plot of Values')
fig.colorbar(mesh)
plt.show()
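As a follow-up, if you want the specgram plot itself to match the raw returned array (so the two subplots agree without any extra scaling), you can tell specgram not to convert to decibels. A minimal sketch reusing the x defined in the first example above:
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(8, 10))
# plot the unscaled (linear) values directly instead of 10 * log10(values)
values, ybins, xbins, im = ax1.specgram(x, cmap='gist_earth', scale='linear')
ax1.set(title='Specgram (scale="linear")')
fig.colorbar(im, ax=ax1)

# the raw returned values now match what specgram drew
mesh = ax2.pcolormesh(xbins, ybins, values, cmap='gist_earth')
ax2.axis('tight')
ax2.set(title='Raw Plot of Returned Values')
fig.colorbar(mesh, ax=ax2)
plt.show()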
I am trying to display a Zipf plot, which is typically displayed on a log-log scale.
I'm using a library which gives rank in linear scale and frequencies in log scale. I have the following code which plots my data fairly correctly:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
# These frequencies are already log-scale
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414,
-19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]
data = {
'ranks': ranks,
'freqs': freqs,
}
df = pd.DataFrame(data=data)
_, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="linear")
ax.set_title("Zipf plot")
sns.regplot("ranks", "freqs", data=df, ax=ax, fit_reg=False)
ax.set_xlabel("Frequency rank of token")
ax.set_ylabel("Absolute frequency of token")
ax.grid(True, which="both")
plt.show()
The resulting plot is:
The plot looks good, but the y-axis tick labels are weird. I'd like them to be displayed in log increments as well. My current workaround is to raise 10 to the power of each element in the freqs list; i.e.,
freqs = [10**freq for freq in freqs]
# ...
and change the yscale in ax.set to log; i.e.,
_, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="log")
ax.set_title("Zipf plot")
# ...
This gives me the expected plot (below), but it requires a transform of the data which is a) relatively expensive, b) redundant, c) lossy.
Is there a way to mimic the log scale of the axes in a matplotlib plot without transforming the data?
First, a comment: personally I would prefer the method of rescaling the data, since it makes everything much easier at the expense of some more memory/CPU time, and accuracy should not matter here.
Now to the question, which is actually how to mimic a log scale on a linear axis.
Solution 1: mimic the log scale
This is not easy. Setting the axes to log scale changes a lot in the background and one needs to mimic all of that.
The easy part is to set the major tickmark frequency to 1 by using matplotlib.ticker.MultipleLocator()
Creating the minor tickmarks at positions which look logarithmic is harder. The best solution I could come up with is to set them manually using the matplotlib.ticker.FixedLocator()
Last we need to change the tickmarks to represent the actual numbers, meaning that they should look like 10^(-x) instead of -x. I am aware of two options here:
Using a FuncFormatter that sets the values 10**x in scientific format.
Using a FuncFormatter that sets the values as 10^x in LaTeX format. This looks much nicer but contrasts with the rest of the plot.
I do not know any better solution for that last point, but maybe someone else does.
Here is the code and how it looks.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from matplotlib.ticker import MultipleLocator, FixedLocator, FuncFormatter
###### Locators for Y-axis
# set major tickmarks at multiples of 1.
majorLocator = MultipleLocator(1.)
# create custom minor ticklabels at logarithmic positions
ra = np.array([[n + (1. - np.log10(i))] for n in range(10, 20)
               for i in [2, 3, 4, 5, 6, 7, 8, 9][::-1]]).flatten() * -1.
minorLocator = FixedLocator(ra)
###### Formatter for Y-axis (choose either of the following two)
# show labels as powers of 10 in scientific notation (looks ugly)
majorFormatter = FuncFormatter(lambda x, p: "{:.1e}".format(10**x))
# or using MathText (looks nicer, but does not conform to the rest of the layout)
majorFormatter = FuncFormatter(lambda x, p: r"$10^{" + "{x:d}".format(x=int(x)) + r"}$")
ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
# These frequencies are already log-scale
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414,
-19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]
data = {
'ranks': ranks,
'freqs': freqs,
}
df = pd.DataFrame(data=data)
_, ax = plt.subplots(figsize=(6, 6))
ax.set(xscale="log", yscale="linear")
ax.set_title("Zipf plot")
sns.regplot("ranks", "freqs", data=df, ax=ax, fit_reg=False)
# Set the locators
ax.yaxis.set_major_locator(majorLocator)
ax.yaxis.set_minor_locator(minorLocator)
# Set formatter if you like to have the ticklabels consistently in power notation
ax.yaxis.set_major_formatter(majorFormatter)
ax.set_xlabel("Frequency rank of token")
ax.set_ylabel("Absolute frequency of token")
ax.grid(True, which="both")
plt.show()
Solution 2: Use a second axes
A different solution, which I did not think of at first, would be to use two different axes: one with a log-log scale, which looks nice and produces the correct labels and ticks, and another one on which the data is actually plotted.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
# These frequencies are already log-scale
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414,
-19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]
data = {
'ranks': ranks,
'freqs': freqs,
}
df = pd.DataFrame(data=data)
fig, ax = plt.subplots(figsize=(6, 6))
# use 2 axes
# ax is the log, log scale which produces nice labels and ticks
ax.set(xscale="log", yscale="log")
ax.set_title("Zipf plot")
# ax2 is the axes the values are actually plotted on
ax2 = ax.twinx()
# plot values to ax2
sns.regplot(x="ranks", y="freqs", data=df, ax=ax2, fit_reg=False)
# set the limits of the log-log axis to 10 to the power of the limits of ax2
ax.set_ylim(10**np.array(ax2.get_ylim()))
ax.set_xlabel("Frequency rank of token")
ax.set_ylabel("Absolute frequency of token")
# remove ticklabels and axislabel from ax2
ax2.set_yticklabels([])
ax2.set_ylabel("")
ax.grid(True, which="both")
plt.show()