Values missing: Overlaying points on boxplot subplots from a pandas dataframe - python

I have two dataframes: df1 with 5000 rows and 12 columns and df2 with a single row and identical 12 columns as df1. I would like to plot each column on a separate subplot as a boxplot with values from df1, and overlay it with scatter plots (just one value per boxplot) from df2.
UPDATE: I tried using a transposed df2 with a numerical index. The following code does not raise any error, but only one value, on the second boxplot, is visible. I can't see the rest.
labels = df1.columns.values
fig, axes = plt.subplots(nrows=3, ncols=4, gridspec_kw=dict(hspace=0.4))
targets = zip(labels, axes.flatten())
for i, (col, ax) in enumerate(targets):
    print(col)
    pd.DataFrame(df1[col]).boxplot(ax=ax, return_type='axes')
    ax.scatter(df2.index[i], df2['mycol'].values[i])
    ax.legend()
plt.show()
Tried attaching the image here but I am slow today...
https://drive.google.com/file/d/0B0Bhe_qx3BgGWkJUdE45UEJqSFk/view?usp=sharing
Using pandas plotting:
Getting only the boxplot subplots is easy using
ax = df1.plot(kind='box', subplots=True)
But I don't know how to specify y for the scatter plot:
ax = df2.plot(x=df2.index, y=???, subplots=True)
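One possible reading of the symptom, offered as a sketch rather than a confirmed diagnosis: a single box drawn by DataFrame.boxplot sits at x = 1, so a point placed at x = df2.index[i] falls inside the visible range only when that index is 1, which is the second subplot. Assuming df2 keeps its original one-row shape with the same columns as df1, placing every point at x = 1 puts it on its box. The random data and the column names c0..c11 are stand-ins, not the asker's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data (assumption): 12 columns; df2 is one row with the same columns
rng = np.random.default_rng(0)
cols = [f"c{i}" for i in range(12)]
df1 = pd.DataFrame(rng.normal(size=(5000, 12)), columns=cols)
df2 = pd.DataFrame(rng.normal(size=(1, 12)), columns=cols)

fig, axes = plt.subplots(nrows=3, ncols=4, gridspec_kw=dict(hspace=0.4))
for col, ax in zip(df1.columns, axes.flat):
    # one box per subplot; matplotlib places a lone box at x = 1
    df1[[col]].boxplot(ax=ax)
    # overlay the single df2 value at the same x position as the box
    ax.scatter(1, df2[col].iloc[0], color="red", zorder=3, label="df2")
    ax.legend()
```

The zorder argument only keeps the marker on top of the box lines; it can be dropped.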

Related

plotting with subplots in a loop python [duplicate]

Case:
I receive a dataframe with (say 50) columns.
I extract the necessary columns from that dataframe using a condition.
So we have a list of selected columns of our dataframe now. (Say this variable is sel_cols)
I need a bar chart for each of these columns value_counts().
And I need to arrange all these bar charts in 3 columns, and varying number of rows based on number of columns selected in sel_cols.
So, if say 8 columns were selected, I want the figure to have 3 columns and 3 rows, with last subplot empty or just 8 subplots in 3x3 matrix if that is possible.
I could generate each chart separately using following code:
for col in sel_cols:
    df[col].value_counts().plot(kind='bar')
    plt.show()
I put plt.show() inside the loop so that each chart is shown and not just the last one.
I also tried appending these charts to a list this way:
charts = []
for col in sel_cols:
    charts.append(df[col].value_counts().plot(kind='bar'))
I could convert this list into a NumPy array and reshape() it, but then its length would have to divide evenly into that shape, so 8 chart objects cannot be reshaped into a 3x3 array.
Then I tried creating the subplots first in this way:
row = len(sel_cols)//3
fig, axes = plt.subplots(nrows=row,ncols=3)
This way I get the subplots, but I run into two problems:
1. I end up with extra subplots in the 3 columns, which will go unplotted (in the 8-column example).
2. I do not know how to plot onto each subplot in a loop.
I tried this:
for row in axes:
    for chart, col in zip(row, sel_cols):
        chart = data[col].value_counts().plot(kind='bar')
But this only draws the last column, on the last subplot. All other subplots stay blank.
How to do this with minimal lines of code, possibly without any need for human verification of the final subplots placements?
You may use this sample dataframe:
pd.DataFrame({'A':['Y','N','N','Y','Y','N','N','Y','N'],
'B':['E','E','E','E','F','F','F','F','E'],
'C':[1,1,0,0,1,1,0,0,1],
'D':['P','Q','R','S','P','Q','R','P','Q'],
'E':['E','E','E','E','F','F','G','G','G'],
'F':[1,1,0,0,1,1,0,0,1],
'G':['N','N','N','N','Y','N','N','Y','N'],
'H':['G','G','G','E','F','F','G','F','E'],
'I':[1,1,0,0,1,1,0,0,1],
'J':['Y','N','N','Y','Y','N','N','Y','N'],
'K':['E','E','E','E','F','F','F','F','E'],
'L':[1,1,0,0,1,1,0,0,1],
})
Selected columns are: sel_cols = ['A','B','D','E','G','H','J','K']
Total 8 columns.
Expected output is bar charts for value_counts() of each of these columns arranged in subplots in a figure with 3 columns. Rows to be decided based on number of columns selected, here 8 so 3 rows.
Given OP's sample data:
df = pd.DataFrame({'A':['Y','N','N','Y','Y','N','N','Y','N'],'B':['E','E','E','E','F','F','F','F','E'],'C':[1,1,0,0,1,1,0,0,1],'D':['P','Q','R','S','P','Q','R','P','Q'],'E':['E','E','E','E','F','F','G','G','G'],'F':[1,1,0,0,1,1,0,0,1],'G':['N','N','N','N','Y','N','N','Y','N'],'H':['G','G','G','E','F','F','G','F','E'],'I':[1,1,0,0,1,1,0,0,1],'J':['Y','N','N','Y','Y','N','N','Y','N'],'K':['E','E','E','E','F','F','F','F','E'],'L':[1,1,0,0,1,1,0,0,1]})
sel_cols = list('ABDEGHJK')
data = df[sel_cols].apply(pd.Series.value_counts)  # pd.value_counts is deprecated in recent pandas
We can plot the columns of data in several ways (in order of simplicity):
1. DataFrame.plot with subplots param
2. seaborn.catplot
3. Loop through plt.subplots
1. DataFrame.plot with subplots param
Set subplots=True with the desired layout dimensions. Unused subplots will be auto-disabled:
data.plot.bar(subplots=True, layout=(3, 3), figsize=(8, 6),
              sharex=False, sharey=True, legend=False)
plt.tight_layout()
2. seaborn.catplot
Melt the data into long form (i.e., 1 variable per column, 1 observation per row) and pass it to seaborn.catplot:
import seaborn as sns
melted = data.melt(var_name='var', value_name='count', ignore_index=False).reset_index()
sns.catplot(data=melted, kind='bar', x='index', y='count',
            col='var', col_wrap=3, sharex=False)
3. Loop through plt.subplots
Zip the columns and axes to iterate in pairs. Use the ax param to place each column onto its corresponding subplot.
If the grid size is larger than the number of columns (e.g., 3*3 > 8), disable the leftover axes with set_axis_off:
fig, axes = plt.subplots(3, 3, figsize=(8, 8), constrained_layout=True, sharey=True)
# plot each col onto one ax
for col, ax in zip(data.columns, axes.flat):
    data[col].plot.bar(ax=ax, rot=0)
    ax.set_title(col)
# disable leftover axes
for ax in axes.flat[data.columns.size:]:
    ax.set_axis_off()
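The question also asks for the number of rows to follow from the number of selected columns. A minimal sketch of that, assuming a fixed width of 3 and rounding the row count up with math.ceil; the variable names ncols and nrows are illustrative, and only the selected columns of the sample data are rebuilt here:

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Selected columns of the sample data from the question
df = pd.DataFrame({
    'A': ['Y', 'N', 'N', 'Y', 'Y', 'N', 'N', 'Y', 'N'],
    'B': ['E', 'E', 'E', 'E', 'F', 'F', 'F', 'F', 'E'],
    'D': ['P', 'Q', 'R', 'S', 'P', 'Q', 'R', 'P', 'Q'],
    'E': ['E', 'E', 'E', 'E', 'F', 'F', 'G', 'G', 'G'],
    'G': ['N', 'N', 'N', 'N', 'Y', 'N', 'N', 'Y', 'N'],
    'H': ['G', 'G', 'G', 'E', 'F', 'F', 'G', 'F', 'E'],
    'J': ['Y', 'N', 'N', 'Y', 'Y', 'N', 'N', 'Y', 'N'],
    'K': ['E', 'E', 'E', 'E', 'F', 'F', 'F', 'F', 'E'],
})
sel_cols = list('ABDEGHJK')

ncols = 3
nrows = math.ceil(len(sel_cols) / ncols)  # 8 columns -> 3 rows

# squeeze=False keeps axes 2-D even when there is a single row
fig, axes = plt.subplots(nrows, ncols, figsize=(8, 8),
                         constrained_layout=True, squeeze=False)

# one bar chart of value_counts() per selected column
for col, ax in zip(sel_cols, axes.flat):
    df[col].value_counts().plot.bar(ax=ax, rot=0)
    ax.set_title(col)

# switch off whatever is left of the grid
for ax in axes.flat[len(sel_cols):]:
    ax.set_axis_off()
```

Floor division (len(sel_cols)//3) would give 2 rows for 8 columns, which is one row short; rounding up avoids that.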
As an alternative to the answer by tdy, I tried to do it without seaborn, using Matplotlib and a for loop.
This may suit those who want finer control over each subplot's formatting and other parameters:
fig = plt.figure(1, figsize=(16, 12))
for i, col in enumerate(sel_cols, 1):
    fig.add_subplot(3, 4, i)
    # df is the original DataFrame, not the value-counts table from the other answer
    df[col].value_counts().plot(kind='bar', ax=plt.gca())
    plt.title(col)
plt.tight_layout()
plt.show()
fig.add_subplot creates a subplot and makes it the current one, while plt.gca() returns the current subplot.

Show first and last label in pandas plot

I have a DataFrame with 361 columns. I want to plot it but showing only the first and last columns in the legend. For instance:
d = {'col1':[1,2],'col2':[3,4],'col3':[5,6],'col4':[7,8]}
df = pd.DataFrame(data=d)
If I plot through df.plot() all the legends will be displayed, but I only want 'col1' and 'col4' in my legend with the proper color code (I am using a colormap) and legend title.
One way to do this is to plot each column separately through matplotlib without using legends and then plot two more empty plots with only the labels (example below), but I wonder if there is a direct way to do it with pandas.
for columns in df:
    plt.plot(df[columns])
plt.plot([],[],label=df.columns[0])
plt.plot([],[],label=df.columns[-1])
plt.legend()
plt.show()
Let's try extracting the handles and labels from the axis and defining a new legend:
ax = df.plot()
handlers, labels = ax.get_legend_handles_labels()
new_handlers, new_labels = [], []
for h, l in zip(handlers, labels):
    if l in ['col1', 'col4']:
        new_handlers.append(h)
        new_labels.append(l)
ax.legend(new_handlers, new_labels)
You can try splitting your df into two dfs, the second of which contains only the columns of interest, and then plot both while showing only the second one's legend.
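Both suggestions come down to handing ax.legend only the handles you want. A compact sketch of that idea, assuming df.plot adds exactly one line per column in column order; the viridis colormap and the legend title 'columns' are arbitrary placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6], 'col4': [7, 8]})

# draw every column with the colormap, but skip the automatic legend
ax = df.plot(colormap='viridis', legend=False)

# the lines keep their colormap colors, so reusing them as legend
# handles gives the matching color code
lines = ax.get_lines()
ax.legend([lines[0], lines[-1]],
          [df.columns[0], df.columns[-1]],
          title='columns')  # hypothetical legend title
```

Because the handles are the plotted lines themselves, this scales to 361 columns without drawing extra empty plots.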

for loop with seaborn not displaying all plots

I've been trying to plot multiple graphs using a for loop and seaborn. I have tried different approaches (with subplots, and trying to display them sequentially) and I can't manage to get all the graphs to display (the best I've achieved is plotting the last one in the list). Here are the two approaches I've tried:
fig, ax = plt.subplots(1, 3, sharex = True) # Just hardcoding the 3 here (number of slicers to plot) for testing
for i, col in enumerate(slicers):
    plt.sca(ax[i])
    ax[i] = sns.catplot(x = 'seq', kind = 'count', hue = col
                        , order = dfFirst['seq'].value_counts().index, height=6, aspect=11.7/6
                        , data = dfFirst) # distribution.set_xticklabels(rotation=65, horizontalalignment='right')
display(fig)
I have tried all combinations of plt.sca(ax[i]) and ax[i] = sns.catplot (using both, as in the example, and one at a time), but fig is always empty when displayed. In addition, I tried displaying figures sequentially using:
for i, col in enumerate(slicers):
    plt.figure(i)
    sns.catplot(x = 'seq', kind = 'count', hue = col
                , order = dfFirst['seq'].value_counts().index, height=6, aspect=11.7/6
                , data = dfFirst) # distribution.set_xticklabels(rotation=65, horizontalalignment='right')
    display(figure)
catplot is a figure-level function, so it produces its own figure. See "Plotting with seaborn using the matplotlib object-oriented interface".
Hence, here it's just
for whatever:
    sns.catplot(...)
plt.show()
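If the goal is to keep all plots in one figure, as in the first attempt, one option is an axes-level function that accepts ax. The sketch below swaps catplot(kind='count') for sns.countplot; that substitution is a suggestion, not part of the answer above, and the small dfFirst frame with slicer columns s1..s3 is made-up stand-in data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in data (assumption): 'seq' plus three hypothetical slicer columns
dfFirst = pd.DataFrame({
    'seq': ['a', 'b', 'a', 'c', 'b', 'a'],
    's1': ['x', 'y', 'x', 'y', 'x', 'y'],
    's2': ['p', 'p', 'q', 'q', 'p', 'q'],
    's3': ['m', 'n', 'n', 'm', 'm', 'n'],
})
slicers = ['s1', 's2', 's3']
order = dfFirst['seq'].value_counts().index

fig, axes = plt.subplots(1, len(slicers), sharex=True,
                         figsize=(12, 4), squeeze=False)
for col, ax in zip(slicers, axes.flat):
    # countplot is axes-level: it draws on the given ax instead of opening a new figure
    sns.countplot(data=dfFirst, x='seq', hue=col, order=order, ax=ax)
    ax.set_title(col)
```

Figure-level arguments such as height and aspect do not apply here; the figure size is set through plt.subplots instead.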

How to change the color of lines within a subplot?

My goal is to create a time series plot for each column in my data with their corresponding rolling mean. I'd like the color of the lines across subplots to be different. For example, for gym and rolling_mean_gym in the second subplot, the color of the lines should be purple and red. How do I do this?
When I set the color option inside plot(), it changes the color of both the raw data plot and the rolling mean plot, which is not ideal.
I created the plot below by calculating the rolling mean of each column of the time series using the following code:
# calculate rolling mean
def rolling_mean(col):
    rolling_mean_col = 'rolling_mean_{}'.format(col)
    df[rolling_mean_col] = df[col].rolling(12).mean()
# create rolling mean columns
cols = ['diet', 'gym', 'finance']
for col in cols:
    rolling_mean(col)
# plot data in subplots
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(13,10))
df[['diet', 'rolling_mean_diet']].plot(ax=axes[0])
df[['gym', 'rolling_mean_gym']].plot(ax=axes[1])
df[['finance', 'rolling_mean_finance']].plot(ax=axes[2])
One option is to provide a list of colors: .plot(..., color=['red', 'blue']).
Pandas plot() method is just a thin wrapper around matplotlib plotting methods. Any non-consumed keyword argument will be passed on to them.
df = pd.DataFrame()
df['diet'] = np.random.random_sample(100)
df['rolling_mean_diet'] = np.random.random_sample(100) / 10 + 0.5
df['gym'] = np.random.random_sample(100)
df['rolling_mean_gym'] = np.random.random_sample(100) / 10 + 0.5
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(13,10));
df[['diet', 'rolling_mean_diet']].plot(ax=axes[0], color=['red', 'green']);
df[['gym', 'rolling_mean_gym']].plot(ax=axes[1], color=['purple', 'red']);
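The same idea can be written as a loop over all three columns from the question. This is only a sketch: the color_pairs mapping is a hypothetical helper, the colors other than purple and red for gym are arbitrary, and the data is random.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ['diet', 'gym', 'finance']
df = pd.DataFrame(rng.random((100, 3)), columns=cols)

# hypothetical mapping: [raw series color, rolling mean color] per subplot
color_pairs = {
    'diet': ['tab:blue', 'tab:orange'],
    'gym': ['purple', 'red'],
    'finance': ['tab:green', 'black'],
}

fig, axes = plt.subplots(nrows=len(cols), ncols=1, figsize=(13, 10))
for col, ax in zip(cols, axes):
    mean_col = 'rolling_mean_{}'.format(col)
    df[mean_col] = df[col].rolling(12).mean()
    # the color list is matched to the two plotted columns in order
    df[[col, mean_col]].plot(ax=ax, color=color_pairs[col])
```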

Python Matplotlib plot with x-axis labels correctly aligned and matching colors for series and errors

I am looking to plot some data from Pandas Dataframes using Matplotlib. I need to have control over the various properties of the plot. I am having difficulty with 2 properties:
a. Correct spacing for custom x-axis labels
b. How to plot a data series and its error bars with the same color
I have the following Dataframes in Python Pandas:
x = pd.DataFrame(np.random.rand(4,5), columns = list('ABCDE'))
y = pd.DataFrame(np.random.rand(4,5), columns = list('ABCDE'))
x_err = pd.DataFrame(np.random.rand(4,5), columns = list('ABCDE'))
y_err = pd.DataFrame(np.random.rand(4,5), columns = list('ABCDE'))
x.insert(0,'Name',['Temp_C','Pressure_Rear','Barometric_High','Facility_depletion_rate'])
y.insert(0,'Name',['Temp_C','Pressure_Rear','Barometric_High','Facility_depletion_rate'])
x_err.insert(0,'Name',['Temp_C','Pressure_Rear','Barometric_High','Facility_depletion_rate'])
y_err.insert(0,'Name',['Temp_C','Pressure_Rear','Barometric_High','Facility_depletion_rate'])
In the dataframe x, each column gives the x co-ordinates. In the dataframe y, each column gives the corresponding y co-ordinates.
I am looking to plot y vs x in a scatter plot, where the points are connected by lines like this plot: http://40.media.tumblr.com/2bf0909609003f549e0d03406dc5a2dd/tumblr_mik00mS7rv1s6xcwuo1_1280.png
I also need to put error bars (x_err is the x-error and y_err is the y_error)
In each of these dataframes, there are 5 columns. The column headers must be in the legend and the x-axis needs to have labels.
The x-axis labels should be 'Temp_C','Pressure_Rear','Barometric_High' and 'Facility_depletion_rate'.
Here is the code that I have and a sample output is shown at the end of this post:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
fig = plt.figure(1)
ax = fig.add_subplot(111)
fig.set_facecolor('white')
for i in range(1,len(x.columns.tolist())):
    ax.errorbar(x.iloc[:,i], y.iloc[:,i], yerr=y_err.iloc[:,i], xerr=x_err.iloc[:,i], fmt='o') #generate plot for each set of errors (x and y) in the x-variable column list
    plt.plot(x.iloc[:,i], y.iloc[:,i], linestyle='-', marker='o', linewidth=1.5, label = x.columns.tolist()[i])
ttl = 'Room conditions - tracking monitor'
xtitle = 'Type of reading'
ytitle = 'Reading value (units)'
title_font = {'fontname':'Times New Roman', 'size':'28', 'color':'black', 'weight':'bold','verticalalignment':'bottom'} #Bottom vertical alignment for more space
axis_font = {'fontname':'Constantia', 'size':'26'}
axis_tick_font = {'fontname':'Times New Roman', 'size':'20'}
#plt.legend(loc='upper left')
ax.set_xticklabels(x.Name.tolist())
ax.tick_params(axis='x', pad=10)
plt.title(ttl, **title_font)
plt.xlabel(xtitle,**axis_font)
plt.ylabel(ytitle,**axis_font)
plt.xticks(**axis_tick_font)
plt.yticks(**axis_tick_font)
params = {'legend.fontsize': 20} #set legend properties
pl.rcParams.update(params)
plt.legend(loc = 1, prop={'family':title_font['fontname']}, numpoints = 1)
plt.show()
Here are the two problems I am having and questions about these:
1. The x-axis labels are completely misaligned. Is there a way to print the labels such that:
a. they are aligned with the data points. By this I mean that the x-axis labels should be aligned with the x co-ordinates given in each column of the dataframe x, i.e. x.iloc[:,1], y.iloc[:,1] are the x and y co-ordinates for dataset A, and x.iloc[:,2], y.iloc[:,2] are the x and y co-ordinates for dataset B. I need the x-axis labels to be aligned ONLY with x.iloc[:,1], y.iloc[:,1].
b. the labels are equally spaced from each other and from the left and right borders
2. The color of the error bars is not the same as the color of the data series. For example, if data series 'A' is blue, is there a way to force the error bars for 'A' to also be blue?
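A sketch that may help with both points, under stated assumptions. The colors probably differ because ax.errorbar and plt.plot each take the next color from the property cycle; drawing the line, markers and error bars in a single errorbar call (fmt='-o') keeps them in one color. For the labels, ticks are placed explicitly at the x positions of series A before the names are assigned. The 'Name' column is kept as a separate list here, and the errors are scaled down only for readability. Equal spacing (point b) is not addressed, since it would need evenly spaced x positions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

names = ['Temp_C', 'Pressure_Rear', 'Barometric_High', 'Facility_depletion_rate']
cols = list('ABCDE')
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.random((4, 5)), columns=cols)
y = pd.DataFrame(rng.random((4, 5)), columns=cols)
x_err = pd.DataFrame(rng.random((4, 5)) * 0.05, columns=cols)  # scaled down for readability
y_err = pd.DataFrame(rng.random((4, 5)) * 0.05, columns=cols)

fig, ax = plt.subplots()
for col in cols:
    # one call draws markers, connecting line and error bars in the same color
    ax.errorbar(x[col], y[col], xerr=x_err[col], yerr=y_err[col],
                fmt='-o', linewidth=1.5, label=col)

# one tick per x position of series A, labelled with the matching reading name
order = np.argsort(x['A'].to_numpy())
ax.set_xticks(x['A'].to_numpy()[order])
ax.set_xticklabels([names[i] for i in order], rotation=45, ha='right')
ax.legend(loc=1, numpoints=1)
```

Setting the tick positions before the tick labels matters: set_xticklabels on its own relabels whatever ticks the locator happened to choose, which would explain the misalignment.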
