I have a similar but larger data set, with more dates and over ten thousand rows. The code usually takes three minutes or longer to run and plot. I think the problem comes from the loop, since looping is time-consuming in Python. I would appreciate it if someone could show how to rewrite the code to make it faster.
import pandas as pd
import matplotlib.pyplot as plt
data = {'Date' : ["2022-07-01"]*5000 + ["2022-07-02"]*5000+ ["2022-07-03"]*5000,
        'OB1' : range(1,15001),
        'OB2' : range(1,15001)}
df = pd.DataFrame(data)
# index by date
df = df.set_index(['Date'])
# loop for plot
i = 1
fig, axs = plt.subplots(nrows = 1, ncols = 3, sharey = True)
fig.subplots_adjust(wspace=0)
for j, sub_df in df.groupby(level=0):
    plt.subplot(130 + i)
    x = sub_df['OB1']
    y = sub_df['OB2']
    plt.barh(x, y)
    i = i + 1
plt.show()
The slowness comes from the barh function, which draws one rectangle per row. Your example is already pretty slow (about a minute on my laptop), whereas the version below runs in less than a second. I replaced barh with fill_betweenx, which fills the area between two curves (here 0 and the bar lengths) instead of drawing individual rectangles. It is much faster, but the result is not strictly the same plot. I also pass step="post", so if you zoom in you still get a bar-style graph.
import pandas as pd
import matplotlib.pyplot as plt
data = {
"Date": ["2022-07-01"] * 5000
+ ["2022-07-02"] * 5000
+ ["2022-07-03"] * 5000,
"OB1": range(1, 15001),
"OB2": range(1, 15001),
}
df = pd.DataFrame(data)
# multi-indexing
df = df.set_index(["Date"])
# one subplot per date
fig, axs = plt.subplots(nrows=1, ncols=3, sharey=True)
fig.subplots_adjust(wspace=0)
for ax, (date, sub_df) in zip(axs, df.groupby(level=0)):
    x = sub_df["OB1"]
    y = sub_df["OB2"]
    # ax.barh(x, y)
    # same roles as barh(x, y): x gives the positions, y the bar lengths
    ax.fill_betweenx(x, 0, y, step="post")
plt.show()
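To see why the swap helps, here is a minimal sketch (my own stand-in data, not the dataframe from the question) that counts the artists each call adds to an axes. It assumes the cost is dominated by the number of objects matplotlib has to draw.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

n = 1000
positions = np.arange(1, n + 1)   # stand-in for OB1
widths = np.arange(1, n + 1)      # stand-in for OB2

fig, (ax_bar, ax_fill) = plt.subplots(1, 2, sharey=True)

# barh adds one Rectangle patch per bar
ax_bar.barh(positions, widths)
n_rectangles = len(ax_bar.patches)

# fill_betweenx adds a single collection for the whole outline
ax_fill.fill_betweenx(positions, 0, widths, step="post")
n_collections = len(ax_fill.collections)

print(n_rectangles, n_collections)
plt.close(fig)
```

With 5000 rows per date, barh has to create and render 5000 separate rectangles per subplot, while fill_betweenx hands matplotlib a single polygon.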
Related
I have a large subplot-based figure to produce in Python using matplotlib. In total the figure has more than 500 individual plots, each with thousands of data points. It can be plotted with a for-loop-based approach modelled on the minimal example given below:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
# define main plot names and subplot names
mains = ['A','B','C','D']
subs = list(range(9))
# generate mimic data in pd dataframe
col = [letter+str(number) for letter in mains for number in subs]
col.insert(0,'Time')
df = pd.DataFrame(columns=col)
for title in df.columns:
    df[title] = [i for i in range(100)]
# although alphabet and mains are the same in this minimal example this may not always be true
alphabet = ['A', 'B', 'C', 'D']
column_names = [column for column in df.columns if column != 'Time']
# define figure size and main gridshape
fig = plt.figure(figsize=(15, 15))
outer = gridspec.GridSpec(2, 2, wspace=0.2, hspace=0.2)
for i, letter in enumerate(alphabet):
    # define inner grid size and shape
    inner = gridspec.GridSpecFromSubplotSpec(3, 3,
                    subplot_spec=outer[i], wspace=0.1, hspace=0.1)
    # select only columns with correct letter
    plot_array = [col for col in column_names if col.startswith(letter)]
    # set title for each letter plot
    ax = plt.Subplot(fig, outer[i])
    ax.set_title(f'Letter {letter}')
    ax.axis('off')
    fig.add_subplot(ax)
    # create each subplot
    for j, col in enumerate(plot_array):
        ax = plt.Subplot(fig, inner[j])
        X = df['Time']
        Y = df[col]
        # plot waveform
        ax.plot(X, Y)
        # hide all axis ticks
        ax.axis('off')
        # set y_axis limits so all plots share same y_axis
        ax.set_ylim(df[column_names].min().min(),df[column_names].max().max())
        fig.add_subplot(ax)
However, this is slow, requiring minutes to plot the figure. Is there a more efficient (potentially for-loop-free) method to achieve the same result?
The issue with the loop is not the plotting but the setting of the axis limits with df[column_names].min().min() and df[column_names].max().max(), which rescans the whole dataframe on every iteration.
Testing with 6 main plots, 64 subplots and 375,000 data points, the plotting section of the example takes approx. 360 s to complete when the axis limits are set by searching df for the min and max values in each loop iteration. However, by moving the search for the min and max outside the loops, e.g.
# set y_lims
y_upper = df[column_names].max().max()
y_lower = df[column_names].min().min()
and changing
ax.set_ylim(df[column_names].min().min(),df[column_names].max().max())
to
ax.set_ylim(y_lower,y_upper)
the plotting time is reduced to approx 24 seconds.
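Put together, the pattern looks roughly like the sketch below. The dataframe here is a small stand-in with made-up sizes, and a plain plt.subplots grid replaces the nested GridSpec layout to keep it short. The point is only that the limits are computed once and then reused.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# small stand-in for the question's dataframe (hypothetical sizes)
cols = [f"{letter}{k}" for letter in "AB" for k in range(4)]
df = pd.DataFrame({c: np.arange(100) + i for i, c in enumerate(cols)})
df.insert(0, "Time", np.arange(100))

# scan the data once, outside any loop
values = df[cols].to_numpy()
y_lower, y_upper = values.min(), values.max()

fig, axes = plt.subplots(2, 4, figsize=(8, 4))
for ax, c in zip(axes.flat, cols):
    ax.plot(df["Time"], df[c])
    ax.axis("off")
    ax.set_ylim(y_lower, y_upper)  # reuses the cached limits, no rescan
plt.close(fig)
```

Converting to a NumPy array first gives a single min/max over all columns, in place of the chained .min().min() call.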
Given the following dataframe:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import joypy
sample1 = np.random.normal(5, 10, size = (200, 5))
sample2 = np.random.normal(40, 5, size = (200, 5))
sample3 = np.random.normal(10, 5, size = (200, 5))
b = []
for i in range(0, 3):
    a = "Sample" + "{}".format(i)
    lst = np.repeat(a, 200)
    b.append(lst)
b = np.asarray(b).reshape(600,1)
data_arr = np.vstack((sample1,sample2, sample3))
df1 = pd.DataFrame(data = data_arr, columns = ["foo", "bar", "qux", "corge", "grault"])
df1.insert(0, column="sampleNo", value = b)
I am able to produce the following ridgeplot:
fig, axes = joypy.joyplot(df1, column = ['foo'], by = 'sampleNo',
alpha=0.6,
linewidth=.5,
linecolor='w',
fade=True)
Now, let's say I have the following vector:
vectors = np.asarray([10, 40, 50])
How do I plot each one of those points into the density plots? E.g., on the distribution plot of sample 1, I'd like to have a single point (or line) on 10; sample 2 on 40, etc..
I've tried to use axvline, and I sort of expected this to work, but no luck:
for ax in axes:
    ax.axvline(vectors(ax))
I am not sure if what I want is possible at all...
You almost had the correct approach.
axes holds 4 axes objects, in order: the three stacked plots from top to bottom, followed by the big one that contains the other three. So,
for ax, v in zip(axes, vectors):
    ax.axvline(v)
zip() stops at the end of the shorter iterable, which here is vectors. It therefore pairs each point in vectors with the corresponding axis of the stacked plots and skips the enclosing one.
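If you want to check the pairing without joypy installed, here is a small sketch that uses a plain 4-row plt.subplots grid as a stand-in for the list joyplot returns. That stand-in is my assumption, not joypy's actual layout.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

vectors = np.asarray([10, 40, 50])

# stand-in for joypy's return value: three row axes followed by one extra axis
fig, axes = plt.subplots(4, 1)
axes = list(axes)

for ax, v in zip(axes, vectors):   # stops after the third pair
    ax.axvline(v, color="k")

# number of lines drawn on each axis; the fourth axis is never touched
lines_per_axis = [len(ax.lines) for ax in axes]
print(lines_per_axis)
plt.close(fig)
```

The first three axes each get one vertical line and the last one stays empty, which is the behaviour you want for the ridge plot.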
I have about 200 pairs of columns in a dataframe that I would like to plot in a single plot. Each pair of columns can be thought of as related "x" and "y" variables. Some of the "y" variables are 0 at certain points in the data. I don't want to plot those; I would rather they show up as a discontinuity in the plot. I have not been able to figure out an efficient way to exclude those values. There is also a "date" variable that I don't need in the plot, but I am keeping it in the sample data to mirror the real data.
Here is a sample data set and what I have done with it. I created the sample dataset in a hurry; the original data has a unique "y" value for a given "x" value in every pair of columns.
import pandas as pd
from numpy.random import randint
data1y = [n**3 -n**2+n for n in range(12)]
data1x = [randint(0, 100) for n in range(12)]
data1x.sort()
data2y = [n**3 for n in range(12)]
data2x = [randint(0, 100) for n in range(12)]
data2x.sort()
data3y = [n**3 - n**2 for n in range(12)]
data3x = [randint(0, 100) for n in range(12)]
data3x.sort()
data1y = [0 if x%7==0 else x for x in data1y]
data2y = [0 if x%7==0 else x for x in data2y]
data3y = [0 if x%7==0 else x for x in data3y]
date = ['Jan','Feb','Mar','Apr','May', 'Jun','Jul','Aug','Sep','Oct','Nov','Dec']
df = pd.DataFrame({'Date':date,'Var1':data1y, 'Var1x':data1x, 'Vartwo':data2y, 'Vartwox':data2x,'datatree':data3y, 'datatreex':data3x})
print(df)
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for k in ['Var1','Vartwo','datatree']:
    df.plot(x=k+'x', y=k, kind = 'line',ax=ax)
The output I get is this:
I would like to see discontinuity where the 'y variables' are zero.
I have tried:
import numpy as np
df2 = df.copy()
df2[df2.Var1 < 0.5] = np.nan
But this makes an entire row NaN when I only want it to be a particular variable.
I'm trying this but it isn't working:
ax = plt.gca()
fig = plt.figure()
for k in ['Var1','Vartwo','datatree']:
    filter = df.k.values > 0
    x = df.k+'x'
    y = df.k
    plot(x[filter], y[filter], kind = 'line',ax=ax)
This works for a single variable but I don't know how to loop it across 200 variables and this also doesn't show the discontinuities.
import matplotlib.pyplot as plt
ax = plt.gca()
fig = plt.figure()
for k in ['Var1','Vartwo','datatree']:
    filter = df.Var1.values > 0
    x = df.Var1x[filter]
    y = df.Var1[filter]
    plt.plot(x, y)
You're looking for .replace():
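import numpy as np
import matplotlib.pyplot as plt
df2 = df.copy()
# replace zeros only in the "y" columns; the "x" columns stay untouched
cols_to_replace = ['Var1','Vartwo','datatree']
df2[cols_to_replace] = df2[cols_to_replace].replace({0:np.nan})
fig, ax = plt.subplots()
for k in ['Var1','Vartwo','datatree']:
    df2.plot(x=k+'x', y=k, kind = 'line',ax=ax)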
Result:
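For the roughly 200 pairs in the real data, listing the columns by hand gets tedious. One possible sketch, assuming every "y" column has a partner named with an extra trailing "x" as in the sample (that naming rule is my assumption), derives the list from the column names:

```python
import numpy as np
import pandas as pd

# tiny stand-in for the question's dataframe
df = pd.DataFrame({
    "Date": ["Jan", "Feb", "Mar", "Apr"],
    "Var1": [0, 1, 0, 21],
    "Var1x": [0, 5, 9, 12],
    "Vartwo": [0, 1, 8, 27],
    "Vartwox": [3, 4, 8, 20],
})

# assumption: a column is a "y" column when "<name>x" also exists
y_cols = [c for c in df.columns if c + "x" in df.columns]

df2 = df.copy()
# zeros become NaN in the y columns only; the x columns keep their zeros
df2[y_cols] = df2[y_cols].replace({0: np.nan})
print(y_cols)
```

The plotting loop can then iterate over y_cols and call df2.plot(x=c + "x", y=c, ax=ax) for each c. Matplotlib breaks the line wherever y is NaN.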
I'm hoping to get some help, please. I'm trying to plot simulation data in separate subplots using pandas and matplotlib. My code so far is:
import matplotlib.pylab as plt
import pandas as pd
fig, ax = plt.subplots(2, 3)
for i in range(2):
    for j in range(50, 101, 10):
        for e in range(3):
            Var=(700* j)/ 100
            Names1 = ['ig','M_GZ']
            Data1 = pd.read_csv('~/File/JTL_'+str(Var)+'/GZ.csv', names=Names1)
            ig = Data1['ig']
            M_GZ=Data1['M_GZ']
            MGZ = Data1[Data1.M_GZ != 0]
            ax[i, e].plot(MGZ['ig'][:4], MGZ['M_GZ'][:4], '--v', linewidth=1.75)
plt.tight_layout()
plt.show()
But the code gives me 6 duplicate copies of the same plot,
instead of each value of Var having its own plot. I've tried changing the loop and using different variations like:
fig = plt.figure()
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    for j in range(50, 101, 10):
        Var=(700* j)/ 100
        Names1 = ['ig','M_GZ']
        Data1 = pd.read_csv('~/File/JTL_'+str(Var)+'/GZ.csv', names=Names1)
        ig = Data1['ig']
        M_GZ=Data1['M_GZ']
        MGZ = Data1[Data1.M_GZ != 0]
        ax.plot(MGZ['ig'][:4], MGZ['M_GZ'][:4], '--v', linewidth=1.75)
plt.tight_layout()
plt.show()
But that changes nothing; I still get the same plot as above. Any help would be appreciated. I'm hoping that each subplot will contain one set of data instead of all six.
Here is a link to one of the dataframes. Each subdirectory ~/File/JTL_'+str(Var)+'/ contains a copy of this file; there are 6 in total.
The problem is in your loop:
for i in range(2):                  # iterating the rows of the plot
    for j in range(50, 101, 10):    # iterating your file names
        for e in range(3):          # iterating the columns of the plot
The end result is that you iterate over all the columns for each file name, so every file is drawn into every subplot.
For it to work, you should have only two nesting levels in your loop. Potential code (updated):
import matplotlib.pylab as plt
import pandas as pd
fig, ax = plt.subplots(2, 3)
for row in range(2):
    for col in range(3):
        # flatten (row, col) into an index 0..5 over the six file names
        f_index = range(50, 101, 10)[row * 3 + col]
        print(row, col, f_index)
        # integer division, assuming the directories are named like JTL_350 (no ".0")
        Var = (700 * f_index) // 100
        Names1 = ['ig','M_GZ']
        Data1 = pd.read_csv('~/File/JTL_'+str(Var)+'/GZ.csv', names=Names1)
        ig = Data1['ig']
        M_GZ=Data1['M_GZ']
        MGZ = Data1[Data1.M_GZ != 0]
        ax[row, col].plot(MGZ['ig'][:4], MGZ['M_GZ'][:4], '--v',linewidth=1.75)
plt.tight_layout()
plt.show()
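An alternative that avoids the index arithmetic altogether is to walk the flattened axes array alongside the file values. This is only a sketch: it builds the paths with the pattern from the question but does not read any files, and it assumes the directory names use integers.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3)
percentages = range(50, 101, 10)  # 50, 60, ..., 100

paths = []
for ax, pct in zip(axes.flat, percentages):
    var = (700 * pct) // 100          # integer, e.g. 350
    # path pattern copied from the question; the files are not read here
    path = '~/File/JTL_' + str(var) + '/GZ.csv'
    paths.append(path)
    ax.set_title(str(var))            # one value of Var per subplot
plt.close(fig)
```

axes.flat visits the grid row by row, so the first file lands in the top-left subplot and the sixth in the bottom-right one. In the real script the pd.read_csv and ax.plot calls would go inside this single loop.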
I am trying to make a profile plot for two columns of a pandas.DataFrame. I would not expect this to be in pandas directly but it seems there is nothing in matplotlib either. I have searched around and cannot find it in any package other than rootpy. Before I take the time to write this myself I thought I would ask if there was a small package that contained profile histograms, perhaps where they are known by a different name.
If you don't know what I mean by "profile histogram" have a look at the ROOT implementation. http://root.cern.ch/root/html/TProfile.html
You can easily do it using scipy.stats.binned_statistic.
import scipy.stats
import numpy
import matplotlib.pyplot as plt
x = numpy.random.rand(10000)
y = x + scipy.stats.norm(0, 0.2).rvs(10000)
means_result = scipy.stats.binned_statistic(x, [y, y**2], bins=50, range=(0,1), statistic='mean')
means, means2 = means_result.statistic
standard_deviations = numpy.sqrt(means2 - means**2)
bin_edges = means_result.bin_edges
bin_centers = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(x=bin_centers, y=means, yerr=standard_deviations, linestyle='none', marker='.')
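If you would rather show the standard error of the mean than the per-bin standard deviation, binned_statistic also accepts statistic='std' and statistic='count'. A sketch under that variation, with my own seeded random data in place of the arrays above:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
x = rng.random(10000)
y = x + rng.normal(0, 0.2, size=10000)

bins, value_range = 50, (0, 1)
mean, edges, _ = scipy.stats.binned_statistic(
    x, y, statistic="mean", bins=bins, range=value_range)
std, _, _ = scipy.stats.binned_statistic(
    x, y, statistic="std", bins=bins, range=value_range)
count, _, _ = scipy.stats.binned_statistic(
    x, y, statistic="count", bins=bins, range=value_range)

# standard error of the mean in each bin
sem = std / np.sqrt(count)
centers = (edges[:-1] + edges[1:]) / 2
```

centers, mean and sem can then be passed to plt.errorbar in the same way as bin_centers, means and standard_deviations.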
Use seaborn. Data as from @MaxNoe:
import numpy as np
import seaborn as sns
# just some random numbers to get started
x = np.random.uniform(-2, 2, 10000)
y = np.random.normal(x**2, np.abs(x) + 1)
sns.regplot(x=x, y=y, x_bins=10, fit_reg=False)
You can do much more (error bands are from bootstrap, you can change the estimator on the y-axis, add regression, ...)
While @Keith's answer seems to fit what you mean, it is quite a lot of code. I think this can be done much more simply, so that one gets the key concepts and can adjust and build on top of it.
Let me stress one thing: what ROOT calls a ProfileHistogram is not a special kind of plot. It is an errorbar plot, which can simply be done in matplotlib.
It is a special kind of computation, and that's not the task of a plotting library. This lies in the pandas realm, and pandas is great at stuff like this. It's symptomatic of ROOT, as the giant monolithic pile it is, to have an extra class for this.
So what you want to do is: discretize in some variable x and, for each bin, calculate something in another variable y.
This can easily be done using np.digitize together with the pandas groupby and aggregate methods.
Putting it all together:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# just some random numbers to get started
x = np.random.uniform(-2, 2, 10000)
y = np.random.normal(x**2, np.abs(x) + 1)
df = pd.DataFrame({'x': x, 'y': y})
# calculate which bin each row belongs to, based on `x`
# bins needs the bin edges, so this will give us 100 equally sized bins
bins = np.linspace(-2, 2, 101)
df['bin'] = np.digitize(x, bins=bins)
bin_centers = 0.5 * (bins[:-1] + bins[1:])
bin_width = bins[1] - bins[0]
# group by bin, so we can calculate stuff
binned = df.groupby('bin')
# calculate mean and standard error of the mean for y in each bin
result = binned['y'].agg(['mean', 'sem'])
result['x'] = bin_centers
result['xerr'] = bin_width / 2
# plot it
result.plot(
    x='x',
    y='mean',
    xerr='xerr',
    yerr='sem',
    linestyle='none',
    capsize=0,
    color='black',
)
plt.savefig('result.png', dpi=300)
Just like ROOT ;)
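One caveat: assigning bin_centers to the result only lines up if every bin received at least one row. A possible variation for sparse data is sketched below. It swaps np.digitize for pd.cut (my substitution, not part of the answer above) and uses a smaller made-up sample. The categorical bins keep empty bins in the grouped result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 500)
y = rng.normal(x**2, np.abs(x) + 1)
df = pd.DataFrame({"x": x, "y": y})

edges = np.linspace(-2, 2, 11)            # 10 equally sized bins
df["bin"] = pd.cut(df["x"], bins=edges, include_lowest=True)

# observed=False keeps bins that received no rows, so there is one row per bin
result = df.groupby("bin", observed=False)["y"].agg(["mean", "sem"])
result["x"] = 0.5 * (edges[:-1] + edges[1:])
result["xerr"] = (edges[1] - edges[0]) / 2
```

Empty bins simply carry NaN for mean and sem, and the errorbar plot skips them.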
I made a module myself for this functionality.
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
def Profile(x,y,nbins,xmin,xmax,ax):
    df = DataFrame({'x' : x , 'y' : y})
    binedges = xmin + ((xmax-xmin)/nbins) * np.arange(nbins+1)
    df['bin'] = np.digitize(df['x'],binedges)
    bincenters = xmin + ((xmax-xmin)/nbins)*np.arange(nbins) + ((xmax-xmin)/(2*nbins))
    ProfileFrame = DataFrame({'bincenters' : bincenters, 'N' : df['bin'].value_counts(sort=False)},index=range(1,nbins+1))
    bins = ProfileFrame.index.values
    # per-bin mean, standard deviation and error on the mean of y
    for bin in bins:
        ProfileFrame.loc[bin,'ymean'] = df.loc[df['bin']==bin,'y'].mean()
        ProfileFrame.loc[bin,'yStandDev'] = df.loc[df['bin']==bin,'y'].std()
        ProfileFrame.loc[bin,'yMeanError'] = ProfileFrame.loc[bin,'yStandDev'] / np.sqrt(ProfileFrame.loc[bin,'N'])
    # fmt='none' draws only the error bars, without markers or a connecting line
    ax.errorbar(ProfileFrame['bincenters'], ProfileFrame['ymean'], yerr=ProfileFrame['yMeanError'], xerr=(xmax-xmin)/(2*nbins), fmt='none')
    return ax
def Profile_Matrix(frame):
    #Much of this is stolen from https://github.com/pydata/pandas/blob/master/pandas/tools/plotting.py
    from matplotlib.artist import setp
    range_padding=0.05
    # keep only the numeric columns
    df = frame.select_dtypes(include='number')
    n = df.columns.size
    fig, axes = plt.subplots(nrows=n, ncols=n, squeeze=False)
    # no gaps between subplots
    fig.subplots_adjust(wspace=0, hspace=0)
    mask = pd.notnull(df)
    boundaries_list = []
    for a in df.columns:
        values = df[a].values[mask[a].values]
        rmin_, rmax_ = np.min(values), np.max(values)
        rdelta_ext = (rmax_ - rmin_) * range_padding / 2.
        boundaries_list.append((rmin_ - rdelta_ext, rmax_+ rdelta_ext))
    for i, a in enumerate(df.columns):
        for j, b in enumerate(df.columns):
            # rows where both columns are non-null
            common = (mask[a] & mask[b]).values
            nbins = 100
            (xmin,xmax) = boundaries_list[i]
            ax = axes[i, j]
            Profile(df[a][common],df[b][common],nbins,xmin,xmax,ax)
            ax.set_xlabel(b)
            ax.set_ylabel(a)
            # show axis labels only on the left column and the bottom row
            if j!= 0:
                ax.yaxis.set_visible(False)
            if i != n-1:
                ax.xaxis.set_visible(False)
    for ax in axes.flat:
        setp(ax.get_xticklabels(), fontsize=8)
        setp(ax.get_yticklabels(), fontsize=8)
    return axes
To my knowledge, matplotlib still doesn't allow you to produce profile histograms directly.
You can instead take a look at Hippodraw, a package developed at SLAC that can be used as a Python extension module.
Here is a profile histogram example:
http://www.slac.stanford.edu/grp/ek/hippodraw/datareps_root.html#datareps_profilehist