Download Zip with signal.csv
I can create a psd plot like this:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('signal.csv')/1000
dt = df.iloc[1,0] - df.iloc[0,0]
print(f'time delta: {dt*1e12:.0f} ps')
print(f'time: {(len(df)*dt*1e6):.3f} \u03BCs')
resolution = 2000
plt.xlabel('Frequency / GHz')
plt.ylabel('Power Spectral Density / dB/Hz')
How can I change the spacing of its y-ticks? When I try adding:
the whole graph is morphed:
How can I change the psd plot so that the y axis is represented in 10's without changing the plot?
You just have to set your y-lim again:
and you can adjust the limits to match your desired output.
If you want to make it automatic you can use axs.get_ylim():
fig, axs = plt.subplots(figsize=(10, 10),constrained_layout=True)
axs.set_xlabel('Frequency / GHz')
axs.set_ylabel('Power Spectral Density / dB/Hz')
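For illustration, here is a minimal sketch of the automatic variant. It assumes a synthetic sine-plus-noise signal in place of signal.csv, and the sampling rate fs and the NFFT value are made up for the example. The idea is to read the limits with axs.get_ylim(), put a tick on every multiple of 10 dB inside them, and then set the same limits again so the curve is not rescaled.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Assumption: a sine wave plus weak noise stands in for the data in signal.csv
rng = np.random.default_rng(0)
fs = 1e10  # hypothetical sampling rate in Hz
t = np.arange(4096) / fs
signal = np.sin(2 * np.pi * 1e9 * t) + 0.01 * rng.standard_normal(t.size)

fig, axs = plt.subplots(figsize=(10, 10), constrained_layout=True)
axs.psd(signal, NFFT=512, Fs=fs)

# Remember the limits matplotlib chose for the PSD
ymin, ymax = axs.get_ylim()

# Put a tick on every multiple of 10 dB that lies inside those limits
ticks = np.arange(np.ceil(ymin / 10) * 10, ymax + 1e-9, 10)
axs.set_yticks(ticks)

# Set the same limits again so the curve itself keeps its shape
axs.set_ylim(ymin, ymax)
```

Only the tick positions change here; the plotted data and the visible range stay as they were.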
I am struggling to make a histogram plot where the total percentage of events sums to 100%. Instead, for this particular example, it sums to approximately 3%. Will anyone be able to show me how I make the percentages of my events sum to 100% for any array used?
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
data = np.array([0,9,78,6,44,23,88,77,12,29])
length_of_data = len(data) # Length of data
bins = int(np.sqrt(length_of_data)) # Choose number of bins
y = data
plt.ylabel('Percentage Of Events')
plt.hist(y,bins=bins, density = True)
One way of doing it is to get the bin heights that plt.hist returns, then re-set the patch heights to the normalized height you want. It's not that involved if you know what to do, but not that ideal. Here's your case:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
plt.gca().yaxis.set_major_formatter(PercentFormatter(100)) # <-- changed here
data = np.array([0,9,78,6,44,23,88,77,12,29])
length_of_data = len(data) # Length of data
bins = int(np.sqrt(length_of_data)) # Choose number of bins
y = data
plt.ylabel('Percentage Of Events')
#### Setting new heights
n, bins, patches = plt.hist(y, bins=bins, density = True, edgecolor='k')
scaled_n = n / n.sum() * 100
for new_height, patch in zip(scaled_n, patches):
    patch.set_height(new_height)
# Setting cumulative sum as verification
plt.plot((bins[1:] + bins[:-1])/2, scaled_n.cumsum())
# If you want the cumsum to start from 0, uncomment the line below
#plt.plot(np.concatenate([[0], (bins[1:] + bins[:-1])/2]), np.concatenate([[0], scaled_n.cumsum()]))
This is the resulting picture:
As others said, you can use seaborn. Here's how to reproduce my code above. You'd still need to add all the labels and styling you want.
import seaborn as sns
sns.histplot(data, bins=int(np.sqrt(length_of_data)), stat='percent')
sns.histplot(data, bins=int(np.sqrt(length_of_data)), stat='percent', cumulative=True, element='poly', fill=False, color='C1')
This is the resulting picture:
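A further option, sketched here rather than taken from the answer above, is to skip the patch manipulation and pass per-sample weights to plt.hist. If every observation carries a weight of 100/N, the bar heights add up to 100 for any array. The data and the bin choice are the ones from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np

data = np.array([0, 9, 78, 6, 44, 23, 88, 77, 12, 29])
bins = int(np.sqrt(len(data)))  # same bin choice as in the question

# Each observation contributes 100/N, so the bar heights sum to 100
weights = np.full(len(data), 100.0 / len(data))
heights, edges, patches = plt.hist(data, bins=bins, weights=weights, edgecolor='k')

plt.gca().yaxis.set_major_formatter(PercentFormatter(100))
plt.ylabel('Percentage Of Events')

# Verification: total percentage over all bins
print(f'total: {heights.sum():.1f} %')
```

Note that density=True is left out on purpose: combined with weights it would normalize by the bin width again.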
I have two DataFrames. One of them has data for a month. The other one has averages for the past quarters. I want to plot the averages in front of the monthly data. How can I do it? Please note that I am trying to plot the averages as dots and the monthly data as a line chart.
So far my best result was achieved with ax1=ax.twiny(), but it is still not ideal, as the data points appear throughout the chart rather than just in front.
import pandas as pd
import numpy as np
import matplotlib.dates as mdates
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter, FuncFormatter
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
date_base = pd.date_range(start='1/1/2018', end='1/30/2018')
df_base = pd.DataFrame(np.random.randn(30,4), columns=list("ABCD"), index=date_base)
date_ext = pd.date_range(start='1/1/2017', end='1/1/2018', freq="Q")
df_ext = pd.DataFrame(np.random.randn(4,4), columns=list("ABCD"), index=date_ext)
def drawChartsPlt(df_base, df_ext):
    fig = plt.figure(figsize=(10,5))
    ax = fig.add_subplot(111)
    number_of_plots = len(df_base.columns)
    LINE_STYLES = ['-', '--', '-.', 'dotted']
    colormap =
    ax.set_prop_cycle("color", [colormap(i) for i in np.linspace(0,1,number_of_plots)])
    date_base = df_base.index
    date_base = [i.strftime("%Y-%m-%d") for i in date_base]
    q_ends = df_ext.index
    q_ends = [i.strftime("%Y-%m-%d") for i in q_ends]
    date_base.insert(0, "") #to shift xticks so they match chart
    date_base += q_ends
    for i in range(number_of_plots):
        # .ix was removed from pandas: select the column, then drop the last 3 rows by position
        df_base[df_base.columns[i]].iloc[:-3].plot(kind="line", linestyle=LINE_STYLES[i%2], subplots=False, ax=ax)
    # ax.xaxis.set_major_locator(ticker.MultipleLocator(20))
    # ax1=ax.twinx()
    ax1 = ax.twiny()  # second x-axis for the quarterly averages, as described above
    ax1.set_prop_cycle("color", [colormap(i) for i in np.linspace(0,1,number_of_plots)])
    for i in range(len(df_ext.columns)):
        ax1.scatter(x=df_ext.index, y=df_ext[df_ext.columns[i]])
    ax.legend(loc="center left", bbox_to_anchor=(1,0.5))
    plt.xlabel("TEST X Label")
    plt.ylabel("TEST Y Label")
drawChartsPlt(df_base, df_ext)
The way I ended up coding it was to save the quarterly index of df_ext to a temp variable, overwrite it with dates that are close to df_base.index using pd.date_range(start=df_base.index[-1], periods=len(df_ext), freq='D'), and then finally set the dates that I need with ax.set_xticklabels(list(date_base)+list(date_ext)).
It looks like it could also be achieved using broken axes, as indicated in "Break // in x axis of matplotlib" and "Python/Matplotlib - Is there a way to make a discontinuous axis?", but I haven't tried that solution.
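A related way to get the same picture, shown here only as a sketch, is to plot both frames against integer positions instead of overwritten dates and then label those positions with the real dates. The random data, the weekly tick spacing and the variable names x_base and x_ext are assumptions made for the example; the quarter-end dates are written out by hand.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
date_base = pd.date_range(start='2018-01-01', end='2018-01-30')
df_base = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"), index=date_base)
# Quarter-end dates listed explicitly (assumed values for the illustration)
date_ext = pd.to_datetime(['2017-03-31', '2017-06-30', '2017-09-30', '2017-12-31'])
df_ext = pd.DataFrame(rng.standard_normal((4, 4)), columns=list("ABCD"), index=date_ext)

# Monthly rows sit at positions 0..29, the quarterly averages right after them
x_base = np.arange(len(df_base))
x_ext = np.arange(len(df_base), len(df_base) + len(df_ext))

fig, ax = plt.subplots(figsize=(10, 5))
for col in df_base.columns:
    line, = ax.plot(x_base, df_base[col].to_numpy(), label=col)
    # Dots reuse the colour of the matching line
    ax.scatter(x_ext, df_ext[col].to_numpy(), color=line.get_color())

# Label every 7th monthly position and all quarterly positions with real dates
tick_pos = np.concatenate([x_base[::7], x_ext])
tick_labels = ([d.strftime("%Y-%m-%d") for d in date_base[::7]]
               + [d.strftime("%Y-%m-%d") for d in date_ext])
ax.set_xticks(tick_pos)
ax.set_xticklabels(tick_labels, rotation=45, ha='right')
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
```

Because everything lives on one axis, no twin axis is needed and the dots cannot drift across the line chart.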
I want to plot a histogram with Matplotlib, but I'd like the bins' values to represent the percentage of the total observations. A MWE would be like this:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import seaborn as sns
import numpy
imagen2 = plt.figure(1, figsize=(5, 2))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')
luminance = numpy.random.randn(1000, 1000)
# "Luminance" should range from 0.0...1.0 so we normalize it
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())
top_left = plt.subplot(121)
bottom_left = plt.subplot(122)
sns.distplot(luminance.flatten(), kde_kws={"cumulative": True})
# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
The CDF here is OK (range: [0, 1]), but the resulting histogram doesn't match my expectations:
Why are the histogram's results in the range [0, 4]? Is there any way to fix this?
What you think you want
Here's how to plot the histogram such that the bins sum to 1:
import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import seaborn as sns
import numpy as np
imagen2 = plt.figure(1, figsize=(5, 2))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')
luminance = np.random.randn(1000, 1000)
# "Luminance" should range from 0.0...1.0 so we normalize it
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())
# get the histogram values
heights,edges = np.histogram(luminance.flat, bins=30)
binCenters = (edges[:-1] + edges[1:])/2
# norm the heights
heights = heights/heights.sum()
# get the cdf
cdf = heights.cumsum()
left = plt.subplot(121)
right = plt.subplot(122)
right.plot(binCenters, cdf, binCenters, heights)
# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
# confirm that the hist vals sum to 1
print('heights sum: %.2f' % heights.sum())
heights sum: 1.00
The actual answer
This one is actually super easy. Just do
sns.distplot(luminance.flatten(), kde_kws={"cumulative": True}, norm_hist=True)
Here's what I get when I run your script with the above modification:
Surprise twist!
So it turns out that your histogram was normalized all along, as per the formal identity: the sum over all bins i of w_i * h_i equals 1, where w_i is the width and h_i the height of bin i.
In plain(er) English, the general practice is to norm continuously valued histograms (i.e. those whose observations can be expressed as floating point numbers) in terms of their density. So in this case the sum of the bin widths times the bin heights will be 1.0, as you can see by running this simplified version of your script:
import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import numpy as np
imagen2 = plt.figure(1, figsize=(4,3))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')
luminance = np.random.randn(1000, 1000)
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())
heights,edges,patches = plt.hist(luminance.ravel(), density=True, bins=30)
widths = edges[1:] - edges[:-1]
totalWeight = (heights*widths).sum()
# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
And the totalWeight will indeed be exactly equal to 1.0, give or take a smidge of rounding error.
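To make the two normalizations easier to compare side by side, here is a small NumPy-only sketch. It uses a smaller random array than the script above (an assumption to keep it quick) and computes both the density heights, whose area is 1, and the fraction heights, which themselves sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
luminance = rng.standard_normal((200, 200))
luminance = (luminance - luminance.min()) / (luminance.max() - luminance.min())

counts, edges = np.histogram(luminance.ravel(), bins=30)
widths = np.diff(edges)

# Density normalization: heights may exceed 1, but the total area is 1
density = counts / (counts.sum() * widths)
area = (density * widths).sum()

# Fraction normalization: the heights themselves sum to 1
fractions = counts / counts.sum()

print(f'max density height:  {density.max():.2f}, area: {area:.6f}')
print(f'max fraction height: {fractions.max():.3f}, sum:  {fractions.sum():.6f}')
```

The density heights are what density=True returns, which is why the y axis of the original plot runs well above 1 when the data are squeezed into [0, 1].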
tel's answer is great! I just want to provide an alternative that gives you the histogram you want in fewer lines. The key idea is to use the weights argument of the matplotlib hist function to normalize the counts. You can replace your sns.distplot(luminance.flatten(), kde_kws={"cumulative": True}) with the following three lines of code:
lf = luminance.flatten()
sns.kdeplot(lf, cumulative=True)
sns.distplot(lf, kde=False,
hist_kws={'weights': numpy.full(len(lf), 1/len(lf))})
If you want to see the histogram on a second y-axis (better visual), add ax=bottom_left.twinx() to sns.distplot:
Plotting Differences between bar and hist
Given some data in a pandas.Series, rv, there is a difference between
1. Calling hist directly on the data to plot
2. Calculating the histogram results (with numpy.histogram) then plotting with bar
Example Data Generation
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib
matplotlib.rcParams['figure.figsize'] = (12.0, 8.0)
matplotlib.style.use('ggplot')
# Setup size and distribution
size = 50000
distribution = stats.norm()
# Create random data
rv = pd.Series(distribution.rvs(size=size))
# Get sane start and end points of distribution
start = distribution.ppf(0.01)
end = distribution.ppf(0.99)
# Build PDF and turn into pandas Series
x = np.linspace(start, end, size)
y = distribution.pdf(x)
pdf = pd.Series(y, x)
# Get histogram of random data
y, x = np.histogram(rv, bins=50, density=True)
# Correct bin edge placement
x = [(a+x[i+1])/2.0 for i,a in enumerate(x[0:-1])]
hist = pd.Series(y, x)
hist() Plotting
ax = pdf.plot(lw=2, label='PDF', legend=True)
rv.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Random Samples', legend=True, ax=ax)
bar() Plotting
ax = pdf.plot(lw=2, label='PDF', legend=True)
hist.plot(kind='bar', alpha=0.5, label='Random Samples', legend=True, ax=ax)
How can the bar plot be made to look like the hist plot?
The use case for this is needing to save only the histogrammed data to use and plot later (it is typically smaller in size than the original data).
Bar plotting differences
Obtaining a bar plot that looks like the hist plot requires some manipulating of default behavior for bar.
1. Force bar to use the actual x data for the plotting range by passing both x (hist.index) and y (hist.values). The default bar behavior is to plot the y data against an arbitrary range and put the x data as the label.
2. Set the width parameter to match the actual step size of the x data (the default is 0.8).
3. Set the align parameter to 'center'.
4. Manually set the axis legend.
These changes need to be made via matplotlib's bar() called on the axis (ax) instead of pandas's bar() called on the data (hist).
Example Plotting
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib
matplotlib.rcParams['figure.figsize'] = (12.0, 8.0)
matplotlib.style.use('ggplot')
# Setup size and distribution
size = 50000
distribution = stats.norm()
# Create random data
rv = pd.Series(distribution.rvs(size=size))
# Get sane start and end points of distribution
start = distribution.ppf(0.01)
end = distribution.ppf(0.99)
# Build PDF and turn into pandas Series
x = np.linspace(start, end, size)
y = distribution.pdf(x)
pdf = pd.Series(y, x)
# Get histogram of random data
y, x = np.histogram(rv, bins=50, density=True)
# Correct bin edge placement
x = [(a+x[i+1])/2.0 for i,a in enumerate(x[0:-1])]
hist = pd.Series(y, x)
# Plot previously histogrammed data
ax = pdf.plot(lw=2, label='PDF', legend=True)
w = hist.index[1] - hist.index[0]  # bin width, taken from the spacing of the bin centers
ax.bar(hist.index, hist.values, width=w, alpha=0.5, align='center')
ax.legend(['PDF', 'Random Samples'])
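Since the stated use case is storing only the histogrammed data, here is a short sketch of that round trip. It keeps the bin edges instead of the bin centers, so the widths come straight from np.diff and the bars can be aligned on their left edge. The in-memory buffer is an assumption that stands in for a real file on disk.

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(50000)

# Histogram once, keeping the bin edges rather than the bin centers
heights, edges = np.histogram(samples, bins=50, density=True)

# Assumption: an in-memory buffer plays the role of a file on disk
buffer = io.BytesIO()
np.savez(buffer, heights=heights, edges=edges)
buffer.seek(0)

# Later: reload and draw the bars without the original samples
stored = np.load(buffer)
heights2, edges2 = stored["heights"], stored["edges"]

fig, ax = plt.subplots()
bars = ax.bar(edges2[:-1], heights2, width=np.diff(edges2), align="edge", alpha=0.5)
```

With align="edge" each bar starts at its left bin edge, so no correction of the bin placement is needed.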
Another, simpler solution is to create fake samples that reproduce the same histogram and then simply use hist().
I.e., after retrieving bins and counts from stored data, do
fake = np.array([])
for i in range(len(counts)):
    a, b = bins[i], bins[i+1]
    sample = a + (b-a)*np.random.rand(counts[i])
    fake = np.append(fake, sample)
plt.hist(fake, bins=bins)
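A variation on the same idea that avoids generating random samples, offered as a sketch: pass one pseudo-sample per bin (its left edge) and give it the stored count as its weight. The counts and bins below come from random data purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the stored histogram data (assumed for the example)
rng = np.random.default_rng(0)
counts, bins = np.histogram(rng.standard_normal(1000), bins=20)

# One pseudo-sample per bin, weighted by that bin's stored count
fig, ax = plt.subplots()
replotted, _, _ = ax.hist(bins[:-1], bins=bins, weights=counts)
```

This also works when the stored heights are not integers, for example after density normalization, which the fake-sample loop cannot handle.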
I have a series of data that I'm reading in from a tutorial site.
I've managed to plot the distribution of the TV column in that data, however I also want to overlay a normal distribution curve with StdDev ticks on a second x-axis (so I can compare the two curves). I'm struggling to work out how to do it.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('', index_col=0)
# draw distribution curve
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
plt.plot(h, pdf)
Here is a diagram close to what I'm after, where x is the StdDeviations. All this example needs is a second x axis to show the values of data.TV
Not sure what you really want, but you could probably use a second axis like this:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('Advertising.csv', index_col=0)
fig, ax1 = plt.subplots()
# draw distribution curve
h = sorted(data.TV)
ax1.set_ylabel('Count', color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
ax2 = ax1.twinx()
ax2.plot(h, pdf, 'r.')
ax2.set_ylabel('pdf', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
Ok, assuming that you want to plot the distribution of your data and the fitted normal distribution with two x-axes, one way to achieve this is as follows.
Plot the normalized data together with the standard normal distribution. Then use matplotlib's twiny() to add a second x-axis to the plot. Use the same tick positions as the original x-axis on the second axis, but scale the labels so that you get the corresponding original TV values. The result looks like this:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('', index_col=0)
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
h_n = (h - hmean) / hstd
pdf = stats.norm.pdf( h_n )
# plot data
f,ax1 = plt.subplots()
ax1.hist( h_n, 20, density=True )
ax1.plot( h_n , pdf, lw=3, c='r')
ax1.set_xlim( [h_n.min(), h_n.max()] )
ax1.set_xlabel( r'TV $[\sigma]$' )
ax1.set_ylabel( r'Relative Frequency')
ax2 = ax1.twiny()
ax2.grid( False )
ax2.set_xlim( ax1.get_xlim() )
ax2.set_ylim( ax1.get_ylim() )
ax2.set_xlabel( r'TV' )
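ticklocs = ax2.xaxis.get_ticklocs()
ax2.xaxis.set_ticks( ticklocs )   # fix the tick positions before relabelling them
ax2.set_xlim( ax1.get_xlim() )    # fixing the ticks may widen the view, so reset it
ticklabels = [ round( t*hstd + hmean, 2) for t in ticklocs ]
ax2.xaxis.set_ticklabels( [ str(t) for t in ticklabels ] )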
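In newer matplotlib versions (3.1 and later) the relabelling can probably be left to Axes.secondary_xaxis, which takes a pair of conversion functions and keeps the second axis in sync. This is only a sketch: uniformly distributed random numbers stand in for data.TV, the helper names sigma_to_tv and tv_to_sigma are made up for the example, and the standard normal pdf is written out with NumPy.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Assumption: synthetic values stand in for data.TV
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
hmean, hstd = tv.mean(), tv.std()
h_n = np.sort((tv - hmean) / hstd)

# Standard normal pdf evaluated at the normalized values
pdf = np.exp(-h_n**2 / 2) / np.sqrt(2 * np.pi)

fig, ax1 = plt.subplots()
ax1.hist(h_n, 20, density=True)
ax1.plot(h_n, pdf, lw=3, c='r')
ax1.set_xlabel(r'TV $[\sigma]$')
ax1.set_ylabel('Relative Frequency')

def sigma_to_tv(s):
    """Convert standard deviations back to original TV units."""
    return s * hstd + hmean

def tv_to_sigma(v):
    """Convert original TV units to standard deviations."""
    return (v - hmean) / hstd

# The top axis shows the same positions in the original TV units
ax2 = ax1.secondary_xaxis('top', functions=(sigma_to_tv, tv_to_sigma))
ax2.set_xlabel('TV')
```

The secondary axis follows the main axis when the view changes, so no tick labels have to be computed by hand.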