I recently figured out that I can use the plot function directly from pandas, without Seaborn, for quick visualisations.
I used the following code to generate a series of graphs from a data frame that contains years in the first column and the prices of different products in the remaining columns.
df_annual_price.plot.line(x='Date',
                          subplots=True,
                          layout=(5,5),
                          figsize=(60,60),
                          fontsize=20,
                          sharex=False,
                          title=list_of_products
                          )
It neatly graphs the lineplot for all the columns. However, one thing I can't figure out is how to control the fontsize of the title for each plot. I have tried to look it up in other threads but couldn't find an answer.
Is there a simple and elegant answer to this?
Pandas's plot() with the subplots=True option returns a NumPy array of axes (a 2-D array when a layout is given).
We can iterate over the axes and call set_title() on each one with a title and font size.
This is how you change the title font size of each subplot.
We could pick any one of the axes and call its get_figure() to obtain the Figure object of the overall plot. Then we could call Figure's suptitle() with title and font size. This is how you change the title font size of the overall figure.
The example below creates a 2 x 2 grid of subplots and illustrates functions which may be useful for people who are new to Matplotlib and Pandas's plot() function.
import numpy as np
import pandas as pd
labels = ['y1', 'y2', 'y3', 'y4']
x = 'x'
columns = [x] + labels
matrix = np.random.rand(10, 5)
df = pd.DataFrame(matrix, columns=columns)
df = df.sort_values(by=x)
axes = df.plot(
    x=x,
    y=labels,
    subplots=True,
    layout=(2,2),
    kind='hist',
    figsize=(8,8)
)
for i, row in enumerate(axes):
    for j, ax in enumerate(row):
        ax.set_title(f'Subplot {i, j}', fontsize=12)
        ax.set_xlabel('Width')
        ax.set_ylabel('Percentage')
fig = axes[0, 0].get_figure()
fig.subplots_adjust(top=0.9, wspace=0.3, hspace=0.3)
_ = fig.suptitle('Distribution of Widths', fontsize=16) # suppress printing of title
Pandas's plot() accepts **kwargs parameters which could be passed to its underlying matplotlib.pyplot.plot(). See https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.plot.html for various parameters.
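Applied to the line plots from the question, the same idea might look like the minimal sketch below. The product names and prices are made up, and df_prices is a hypothetical stand-in for df_annual_price; the Agg backend is only selected so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_annual_price: a 'Date' column plus one column per product
products = ["apples", "bread", "coffee", "milk"]
rng = np.random.default_rng(0)
df_prices = pd.DataFrame(rng.random((10, len(products))), columns=products)
df_prices.insert(0, "Date", np.arange(2010, 2020))

# With subplots=True and a layout, pandas returns a 2-D NumPy array of Axes
axes = df_prices.plot.line(x="Date", subplots=True, layout=(2, 2),
                           figsize=(10, 8), sharex=False, legend=False)

# .flat walks the grid row by row, which matches the column order
for ax, name in zip(axes.flat, products):
    ax.set_title(name, fontsize=18)

title_sizes = [ax.title.get_fontsize() for ax in axes.flat]
```

Setting the titles in the loop instead of through title=list_of_products is what gives control over the font size of each subplot title.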
Related
I have the following dataframe "freqs2" with an index (SD to SD17) and associated values (frequencies):
freqs
SD 101
SD2 128
...
SD17 65
I would like to assign a list of specific colors (in order) to the index entries. I've tried the following code:
colors=['#e5243b','#DDA63A', '#4C9F38','#C5192D','#FF3A21','#26BDE2','#FCC30B','#A21942','#FD6925','#DD1367','#FD9D24','#BF8B2E','#3F7E44','#0A97D9','#56C02B','#00689D','#19486A']
freqs2.plot.bar(freqs2.index, legend=False,rot=45,width=0.85, figsize=(12, 6),fontsize=(14),color=colors )
plt.ylabel('Frequency',fontsize=(17))
As a result, all the bars in my chart are red (the first color of the list).
Based on similar questions, I tried passing "freqs2.index" to indicate that the list of colors applies to the index, but the problem stays the same.
It looks like a bug in pandas; plotting directly in matplotlib or using seaborn (which I recommend) works:
import matplotlib.pyplot as plt
import seaborn as sns
colors=['#e5243b','#dda63a', '#4C9F38','#C5192D','#FF3A21','#26BDE2','#FCC30B','#A21942','#FD6925','#DD1367','#FD9D24','#BF8B2E','#3F7E44','#0A97D9','#56C02B','#00689D','#19486A']
# # plotting directly with matplotlib works too:
# fig = plt.figure()
# ax = fig.add_axes([0,0,1,1])
# ax.bar(x=df.index, height=df['freqs'], color=colors)
ax = sns.barplot(data=df, x= df.index, y='freqs', palette=colors)
ax.tick_params(axis='x', labelrotation=45)
plt.ylabel('Frequency',fontsize=17)
plt.show()
Edit: an issue already exists on GitHub.
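For completeness, here is a self-contained sketch of the plain matplotlib route mentioned in the comments above. The three-row freqs2 frame is a shortened, hypothetical version of the one in the question, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex
import pandas as pd

# Shortened, hypothetical version of the freqs2 dataframe from the question
freqs2 = pd.DataFrame({"freqs": [101, 128, 65]}, index=["SD", "SD2", "SD17"])
colors = ["#e5243b", "#dda63a", "#4c9f38"]

fig, ax = plt.subplots(figsize=(12, 6))
# Axes.bar accepts one color per bar, applied in order
ax.bar(freqs2.index, freqs2["freqs"], color=colors, width=0.85)
ax.set_ylabel("Frequency", fontsize=17)
ax.tick_params(axis="x", labelrotation=45, labelsize=14)

# Read the face colors back from the drawn bars
bar_colors = [to_hex(patch.get_facecolor()) for patch in ax.patches]
```

Here bar_colors comes back in the same order as colors, so each index entry gets its own color.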
I managed to plot a few charts using this for loop:
for i in df.columns[2:7]:
    df.plot.scatter(x='out_date', y=i, figsize = (10,7))
    plt.axvline(x=cutoff_date, color='r')
    plt.xlabel('out date')
    plt.ylabel('successes')
I wanted to add titles to the plots using the column headers of a different dataframe with the following code, but the title is only added to the last plot instead of every plot:
for x in df2.columns[67:72]:
    plt.title(x)
Is there a way to fix this?
Try zipping the column names of your two dataframes:
for i, title in zip(df.columns[2:7], df2.columns[67:72]):
    ax = df.plot.scatter(x='out_date', y=i, figsize = (10,7), title=title)
    ax.axvline(x=cutoff_date, color='r')
    ax.set_xlabel('out date')
    ax.set_ylabel('successes')
Note that this uses the Axes instance returned by the df.plot methods instead of the functions of the plt module, so each call acts on the plot it belongs to.
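A small runnable check of this pattern, under the assumption of toy data: the frames, column names and cutoff value below are all hypothetical, and a numeric out_date is used instead of real dates to keep the sketch short.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd

# Hypothetical toy frames standing in for df and df2
df = pd.DataFrame({"out_date": [1, 2, 3, 4], "a": [3, 1, 4, 1], "b": [5, 9, 2, 6]})
df2 = pd.DataFrame(columns=["Title A", "Title B"])
cutoff = 2.5  # hypothetical stand-in for cutoff_date

axes_list = []
for col, title in zip(["a", "b"], df2.columns):
    # Each call creates a new figure and returns its Axes
    ax = df.plot.scatter(x="out_date", y=col, title=title)
    ax.axvline(x=cutoff, color="r")
    axes_list.append(ax)

titles = [ax.get_title() for ax in axes_list]
```

Because the title is set on the Axes returned by each call, every plot keeps its own title instead of only the last one.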
I have a dataframe (df) with two columns: 'Foundation Type', which has 4 types of foundations (Shafts, Piles, Combination, Spread), and 'Vs30', which holds the values of the parameter Vs30. Each row represents a bridge, with a type of foundation and a Vs30 value.
First, I create a new column 'binVs30' in df by assigning each element of 'Vs30' to one of 5 bins ([0-200],[200-400]...[800-1000]).
df['binVs30'] = pd.cut(df.Vs30, bins=np.arange(0, 1100, 200))
Then I created a stacked area plot with the following code:
color_table = pd.crosstab(df['binVs30'], df['Foundation Type'], dropna=False)
ax = color_table.plot(kind='area', figsize=(8, 8), stacked=True, rot=0)
display(ax)
plt.xlabel('')
plt.ylabel('Frequency', fontsize=12)
plt.legend(title='Foundation Type', loc='upper right')
plt.title('Column Database', fontsize='20')
plt.show()
The resulting picture shows some extra tick positions between the bins that shouldn't be there. Therefore, I had to fix the xticks manually by adding the following code:
locs, labels = plt.xticks()
plt.xticks(locs, ['','0-200','','200-400','','400-600','','600-800','','800-1000'], fontsize=10, rotation=45)
Is there a reason why Python creates those extra bins that shouldn't exist? Is this a bug? If I change it to a stacked bar plot, the problem vanishes. Is there a way to fix it without setting the tick labels manually?
I have two other questions. First, how can I add an edgecolor to an area plot? Something like:
color_table.plot(kind='area', figsize=(8, 8), stacked=True, edgecolor='black', legend=None, rot=0)
The argument edgecolor='black' doesn't work in a stacked area plot.
Second, can I create bins for 'Vs30' like ([0-200],[200-400]...[>800])? The way I create the 'binVs30' column doesn't allow me to create a '>800' bin.
There are a couple of questions here. Firstly, about including an open-ended bin in your pd.cut(): you can use np.inf as the last bin edge to capture everything above 800 in the last bin, and assign it a custom label. Secondly, since you're already using matplotlib, I'd recommend using its stackplot directly rather than via pandas. Then you can use the edgecolor argument without any issues.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame(data={
    "foundation" : np.random.choice(list("ABCD"), 1000),
    "binVs30" : np.random.randint(0, 1200, 1000)
})
bins = [0, 200, 400, 600, 800, np.inf]
labels = ["0-199", "200-399", "400-599", "600-799", "800+"]
df["bins"] = pd.cut(
    df["binVs30"], bins=bins, labels=labels,
    right=False, include_lowest=True)
stack_data = pd.crosstab(df['bins'], df['foundation'], dropna=False)
stack_array = stack_data.values.T.tolist()
pal = sns.color_palette("Set1")
plt.figure(figsize=(8,4))
plt.stackplot(
    labels, stack_array, labels=list("ABCD"),
    colors=pal, alpha=0.4, edgecolor="black")
plt.legend(loc='upper left')
plt.show()
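To see what the open-ended bin does on its own, here is a minimal sketch on a handful of made-up Vs30 values (the series vs30 is hypothetical; the bin edges and labels are the ones used above):

```python
import numpy as np
import pandas as pd

# Hypothetical Vs30 values, including edge cases at the bin boundaries
vs30 = pd.Series([0, 199, 200, 799, 800, 1500])

bins = [0, 200, 400, 600, 800, np.inf]
labels = ["0-199", "200-399", "400-599", "600-799", "800+"]

# right=False makes each bin closed on the left: [0, 200), [200, 400), ..., [800, inf)
binned = pd.cut(vs30, bins=bins, labels=labels, right=False)
counts = binned.value_counts()
```

With right=False a value of exactly 200 falls into "200-399", and everything from 800 upwards, however large, ends up in "800+".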
I've been trying to plot multiple graphs using a for loop and seaborn. I have tried different approaches (with subplots, and displaying them sequentially) and I can't manage to get all the graphs to display (the best I've achieved is plotting the last one in the list). Here are the two approaches I've tried:
fig, ax = plt.subplots(1, 3, sharex = True) # Just hardcoding 3 here (number of slicers to plot) for testing
for i, col in enumerate(slicers):
    plt.sca(ax[i])
    ax[i] = sns.catplot(x = 'seq', kind = 'count', hue = col
        , order = dfFirst['seq'].value_counts().index, height=6, aspect=11.7/6
        , data = dfFirst) # distribution.set_xticklabels(rotation=65, horizontalalignment='right')
display(fig)
I have tried all combinations of plt.sca(ax[i]) and ax[i] = sns.catplot (both together, as in the example, and one at a time), but fig is always empty when displayed. In addition, I tried displaying the figures sequentially using:
for i, col in enumerate(slicers):
    plt.figure(i)
    sns.catplot(x = 'seq', kind = 'count', hue = col
        , order = dfFirst['seq'].value_counts().index, height=6, aspect=11.7/6
        , data = dfFirst) # distribution.set_xticklabels(rotation=65, horizontalalignment='right')
    display(figure)
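catplot is a figure-level function: it produces its own figure rather than drawing into existing axes. See "Plotting with seaborn using the matplotlib object-oriented interface".
Hence, here it's just
for col in slicers:
    sns.catplot(x='seq', kind='count', hue=col, data=dfFirst)
plt.show()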
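If the plots should share one figure, as in the first attempt, one option is to use the axes-level counterpart sns.countplot, which accepts an ax argument. This is only a sketch: the small dfFirst frame and the slicer columns s1 and s2 are hypothetical stand-ins for the question's data, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for dfFirst and the list of slicer columns
dfFirst = pd.DataFrame({
    "seq": ["a", "b", "a", "c", "b", "a"],
    "s1": ["x", "x", "y", "y", "x", "y"],
    "s2": ["u", "v", "u", "v", "u", "v"],
})
slicers = ["s1", "s2"]
order = list(dfFirst["seq"].value_counts().index)

fig, axes = plt.subplots(1, len(slicers), sharex=True, figsize=(10, 4))
for ax, col in zip(axes, slicers):
    # countplot is axes-level, so it draws into the Axes passed via ax=
    sns.countplot(data=dfFirst, x="seq", hue=col, order=order, ax=ax)
```

Each subplot of fig then holds one count plot, whereas catplot would open a separate figure per call.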
I would like to use code that shows all the histograms of a dataframe, i.e. df.hist(bins=10). However, I would also like to add a second set of histograms showing the CDF: df_hist=df.hist(cumulative=True,bins=100,density=1,histtype="step")
I tried separating their matplotlib axes by using fig=plt.figure() and plt.subplot(211), but df.hist is a pandas function, not a matplotlib function. I also tried creating axes and passing ax=ax1 and ax=ax2 to each histogram, but it didn't work.
How can I combine these histograms?
The histograms that I want to combine look like these. I want to show them side by side, or put the second one on top of the first one.
Sorry that I didn't take the time to make them look good.
It is possible to draw them together:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# toy data frame
df = pd.DataFrame(np.random.normal(0,1,(100,20)))
# draw hist
fig, axes = plt.subplots(5,4, figsize=(16,10))
df.plot(kind='hist', subplots=True, ax=axes, alpha=0.5)
# clone axes so they have different scales
ax_new = [ax.twinx() for ax in axes.flatten()]
df.plot(kind='kde', ax=ax_new, subplots=True)
plt.show()
It's also possible to draw them in separate subplots of the same figure. For example
fig, axes = plt.subplots(10,4, figsize=(16,10))
hist_axes = axes.flatten()[:20]
df.plot(kind='hist', subplots=True, ax=hist_axes, alpha=0.5)
kde_axes = axes.flatten()[20:]
df.plot(kind='kde', subplots=True, ax=kde_axes, alpha=0.5)
will plot the histograms in the top five rows of the grid and the KDEs in the bottom five rows.
You can find more info here: "Multiple histograms in Pandas" (possible duplicate, by the way), but apparently Pandas cannot handle multiple histograms on the same graph.
That's OK, because np.histogram and matplotlib.pyplot can; check the above link for a more complete answer.
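As a rough sketch of that route (not taken from the linked answer), one column can be binned with np.histogram and the counts and their cumulative share drawn on twin axes. The sample values are random, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=200)  # made-up sample standing in for one dataframe column

# Bin once, then derive the empirical CDF from the same counts
counts, edges = np.histogram(values, bins=10)
cdf = np.cumsum(counts) / counts.sum()

fig, ax = plt.subplots()
# Normal histogram: one bar per bin, anchored at the bin's left edge
ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge", alpha=0.5)

# Cumulative step curve on a second y-axis that shares the same x-axis
ax_cdf = ax.twinx()
ax_cdf.step(edges[1:], cdf, where="pre", color="tab:orange")
ax_cdf.set_ylim(0, 1.05)
```

Looping this over df.columns with one subplot per column would give the combined figure for the whole dataframe.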
Solution for overlapping histograms with df.hist with any number of subplots
You can combine two dataframe histogram figures by creating twin axes using the grid of axes returned by df.hist. Here is an example of normal histograms combined with cumulative step histograms where the size of the figure and the layout of the grid of subplots are taken care of automatically:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
# Create sample dataset stored in a pandas dataframe
rng = np.random.default_rng(seed=1) # random number generator
letters = [chr(i) for i in range(ord('A'), ord('G')+1)]
df = pd.DataFrame(rng.exponential(1, size=(100, len(letters))), columns=letters)
# Set parameters for figure dimensions and grid layout
nplots = df.columns.size
ncols = 3
nrows = int(np.ceil(nplots/ncols))
subp_w = 10/ncols # 10 is the total figure width in inches
subp_h = 0.75*subp_w
bins = 10
# Plot grid of histograms with pandas function (with a shared y-axis)
grid = df.hist(grid=False, sharey=True, figsize=(ncols*subp_w, nrows*subp_h),
               layout=(nrows, ncols), bins=bins, edgecolor='white', linewidth=0.5)
# Create list of twin axes containing second y-axis: note that due to the
# layout, the grid object may contain extra unused axes that are not shown
# (here in the H and I positions). The ax parameter of df.hist only accepts
# a number of axes that corresponds to the number of numerical variables
# in df, which is why the flattened array of grid axes is sliced here.
grid_twinx = [ax.twinx() for ax in grid.flat[:nplots]]
# Plot cumulative step histograms over normal histograms: note that the grid layout is
# preserved in grid_twinx so no need to set the layout parameter a second time here.
df.hist(ax=grid_twinx, histtype='step', bins=bins, cumulative=True, density=True,
        color='tab:orange', linewidth=2, grid=False)
# Adjust space between subplots after generating twin axes
plt.gcf().subplots_adjust(wspace=0.4, hspace=0.4)
plt.show()
Solution for displaying histograms of different types side-by-side with matplotlib
To my knowledge, it is not possible to show the different types of plots side-by-side with df.hist. You need to create the figure from scratch, like in this example using the same dataset as before:
# Set parameters for figure dimensions and grid layout
nvars = df.columns.size
plot_types = 2 # normal histogram and cumulative step histogram
ncols_vars = 2
nrows = int(np.ceil(nvars/ncols_vars))
subp_w = 10/(plot_types*ncols_vars) # 10 is the total figure width in inches
subp_h = 0.75*subp_w
bins = 10
# Create figure with appropriate size
fig = plt.figure(figsize=(plot_types*ncols_vars*subp_w, nrows*subp_h))
fig.subplots_adjust(wspace=0.4, hspace=0.7)
# Create subplots by adding a new axes per type of plot for each variable
# and create lists of axes of normal histograms and their y-axis limits
axs_hist = []
axs_hist_ylims = []
for idx, var in enumerate(df.columns):
    axh = fig.add_subplot(nrows, plot_types*ncols_vars, idx*plot_types+1)
    axh.hist(df[var], bins=bins, edgecolor='white', linewidth=0.5)
    axh.set_title(f'{var} - Histogram', size=11)
    axs_hist.append(axh)
    axs_hist_ylims.append(axh.get_ylim())
    axc = fig.add_subplot(nrows, plot_types*ncols_vars, idx*plot_types+2)
    axc.hist(df[var], bins=bins, density=True, cumulative=True,
             histtype='step', color='tab:orange', linewidth=2)
    axc.set_title(f'{var} - Cumulative step hist.', size=11)
# Set shared y-axis for histograms
for ax in axs_hist:
    ax.set_ylim(max(axs_hist_ylims))
plt.show()