I'm getting something weird with the legend in a seaborn jointplot. I want to plot some quantity y as function of a quantity x for 8 different datasets. These datasets have only two columns for x and y and a different number of rows. First of all I concatenate all rows of all datasets using numpy
y = np.concatenate(((data1[:,1]), (data2[:,1]), (data3[:,1]), (data4[:,1]),(data5[:,1]), (data6[:,1]), (data7[:,1]), (data8[:,1])), axis=0)
x = np.concatenate(((data1[:,0]), (data2[:,0]), (data3[:,0]), (data4[:,0]), (data5[:,0]), (data6[:,0]), (data7[:,0]), (data8[:,0])), axis=0)
Then I create the array of values which I will use for the parameter "hue" in the jointplot, which will distinguish the several datasets in the legend/colors. I do this by assigning at every dataset one number from 1 to 8,which is repeated for every row of the cumulative dataset:
indexes = np.concatenate((np.ones(len(data1[:,0])), 2*np.ones(len(data2[:,0])), 3*np.ones(len(data3[:,0])), 4*np.ones(len(data4[:,0])), 5*np.ones(len(data5[:,0])), 6*np.ones(len(data6[:,0])), 7*np.ones(len(data7[:,0])), 8*np.ones(len(data8[:,0]))), axis=0)
Then I create the dataset:
all_together = np.column_stack((x, y, indexes))
df = pd.DataFrame(all_together, columns = ['x','y','Dataset'])
So now I can create the jointplot. This is simply done by:
g = sns.jointplot(y="y", x="x", data=df, hue="Dataset", palette='turbo')
handles, labels = g.ax_joint.get_legend_handles_labels()
g.ax_joint.legend(handles=handles, labels=['data1', 'data2', 'data3', 'data4', 'data5', 'data6', 'data7', 'data8'], fontsize=10)
At this point, the problem is: all points are getting plotted (at least I think), but the legend only shows: data1, data2, data3, data4 and data5. I don't understand why it is not showing also the other three labels, and in this way the plot is difficult to read. I have checked and the cumulative dataset df has the correct shape. Any ideas?
You can add legend='full' to obtain a full legend. By default, sns.jointplot uses sns.scatterplot for the central plot. The keyword parameters which aren't used by jointplot are sent to scatterplot. The legend parameter can be "auto", "brief", "full", or False.
From the docs:
If “brief”, numeric hue and size variables will be represented with a sample of evenly spaced values. If “full”, every group will get an entry in the legend. If “auto”, choose between brief or full representation based on number of levels. If False, no legend data is added and no legend is drawn.
The following code is tested with seaborn 0.11.2:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
N = 200
k = np.repeat(np.arange(1, 9), N // 8)
df = pd.DataFrame({'x': 5 * np.cos(2 * k * np.pi / 8) + np.random.randn(N),
'y': 5 * np.sin(2 * k * np.pi / 8) + np.random.randn(N),
'Dataset': k})
g = sns.jointplot(y="y", x="x", data=df, hue="Dataset", palette='turbo', legend='full')
plt.show()
Related
In Have gradient colours in sns.pairplot for one column of dataframe so that I can see which datapoints are connected to each other
very good answers were given how to solve the challenge to recognize which data points are related to the same data points in other sub plots.
To have a self containing question, I state here my requirement (which is somehow an extension of the linked question):
I would like to see the interdependence of my data.
For that I want to have a gradual color gradient for one column of my DataFrame (so
that low numerical values of that column are e.g. yellow and high values are blue).
For a second column of my data, I would like to have increasing marker sizes with
increasing values of this column.
These colors and marker sizes should be visible for all non diagonal subplots of my
plot, based on the data points of a and b.
The solution to the gradient color is given in the linked question. I put here both solutions that presently exist:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
f, axes = plt.subplots(1, 1)
np.random.seed(1)
a = np.arange(0, 10, 0.1)
def myFunc(x):
myReturn = +10 + 10*x -x**2 + 1*np.random.random(x.shape[0])
return myReturn
b = myFunc(a)
c = a * np.sin(a)
df = pd.DataFrame({'a': a, 'b': b, 'c': c})
if False:
sns.pairplot(
df,
corner=True,
diag_kws=dict(color=".6"),
plot_kws=dict(
hue=df.index,
palette="blend:gold,dodgerblue",
),
)
else:
from matplotlib.colors import LinearSegmentedColormap
cmap = LinearSegmentedColormap.from_list('blue-yellow', ['gold', 'lightblue', 'darkblue']) # plt.get_cmap('viridis_r')
g = sns.pairplot(df, corner=True)
for ax in g.axes.flat:
if ax is not None and not ax in g.diag_axes:
for collection in ax.collections:
collection.set_cmap(cmap)
collection.set_array(df['a'])
plt.show()
A (basic) solution for the increasing marker sizes would be (using simply matplotlib):
import numpy as np
import matplotlib.pyplot as plt
# Fixing random state for reproducibility
np.random.seed(19680801)
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = (30 * np.random.rand(N))**2 # 0 to 15 point radii
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
My question is:
I could work on a manual solution to iterate over all columns of my DataFrame and build the sub plots by myself. Is there any more convenient (and probably more robust) way to do this?
You can modify the sizes and hue for the off-diagonal data easily by adding the parameters you'd use in Matplotlib to the plot_kws dictionary:
sns.pairplot(df, corner=True,
diag_kws=dict(color=".6"),
plot_kws=dict(
hue=df['a'],
palette="blend:gold,dodgerblue",
size = df['b']
)
)
I am trying to create multiple box plot charts for about 5 columns in my dataframe (df_summ):
columns = ['dimension_a','dimension_b']
for i in columns:
sns.set(style = "ticks", palette = "pastel")
box_plot = sns.boxplot(y="measure", x=i,
palette=["m","g"],
data=df_summ_1500_delta)
sns.despine(offset=10, trim=True)
medians = df_summ_1500_delta.groupby([i])['measure'].median()
vertical_offset=df_summ_1500_delta['measure'].median()*-0.5
for xtick in box_plot.get_xticks():
box_plot.text(xtick,medians[xtick] + vertical_offset,medians[xtick],
horizontalalignment='center',size='small',color='blue',weight='semibold')
My only issue is that they aren't be separated on different facets, but rather on top of each other.
Any help on how I can make both on their own separate chart with the x axis being 'dimension a' and the x axis of the second chart being 'dimension b'.
To draw two boxplots next to each other at each x-position, you can use a hue for dimension_a and dimension_b separately. These two columns need to be transformed (with pd.melt()) to "long form".
Here is a some example code starting from generated test data. Note that the order both for the x-values as for the hue-values needs to be enforced to be sure of their exact position. The individual box plots are distributed over a width of 0.8.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
df = pd.DataFrame({'dimension_a': np.random.choice(['hot', 'cold'], 100),
'dimension_b': np.random.choice(['hot', 'cold'], 100),
'measure': np.random.uniform(100, 500, 100)})
df.loc[df['dimension_a'] == 'hot', 'measure'] += 100
df.loc[df['dimension_a'] == 'cold', 'measure'] -= 100
x_order = ['hot', 'cold']
columns = ['dimension_a', 'dimension_b']
df1 = df.melt(value_vars=columns, var_name='dimension', value_name='value', id_vars='measure')
sns.set(style="ticks", palette="pastel")
ax = sns.boxplot(data=df1, x='value', order=x_order, y='measure',
hue='dimension', hue_order=columns, palette=["m", "g"], dodge=True)
ax.set_xlabel('')
sns.despine(offset=10, trim=True)
for col, dodge_dist in zip(columns, np.linspace(-0.4, 0.4, 2 * len(x_order) + 1)[1::2]):
medians = df.groupby([col])['measure'].median()
vertical_offset = df['measure'].median() * -0.5
for x_ind, xtick in enumerate(x_order):
ax.text(x_ind + dodge_dist, medians[xtick] + vertical_offset, f'{medians[xtick]:.2f}',
horizontalalignment='center', size='small', color='blue', weight='semibold')
plt.show()
I have a number of charts, made with matplotlib and seaborn, that look like the example below.
I show how certain quantities evolve over time on a lineplot
The x-axis labels are not numbers but strings (e.g. 'Q1' or '2018 first half' etc)
I need to "extend" the x-axis to the right, with an empty period. The chart must show from Q1 to Q4, but there is no data for Q4 (the Q4 column is full of nans)
I need this because I need the charts to be side-by-side with others which do have data for Q4
matplotlib doesn't display the column full of nans
If the x-axis were numeric, it would be easy to extend the range of the plot; since it's not numeric, I don't know which x_range each tick corresponds to
I have found the solution below. It works, but it's not elegant: I use integers for the x-axis, add 1, then set the labels back to the strings. Is there a more elegant way?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.ticker import FuncFormatter
import seaborn as sns
df =pd.DataFrame()
df['period'] = ['Q1','Q2','Q3','Q4']
df['a'] = [3,4,5,np.nan]
df['b'] = [4,4,6,np.nan]
df = df.set_index( 'period')
fig, ax = plt.subplots(1,2)
sns.lineplot( data = df, ax =ax[0])
df_idx = df.index
df2 = df.set_index( np.arange(1, len(df_idx) + 1 ))
sns.lineplot(data = df2, ax = ax[1])
ax[1].set_xlim(1,4)
ax[1].set_xticklabels(df.index)
You can add these lines of code for ax[0]
left_buffer,right_buffer = 3,2
labels = ['Q1','Q2','Q3','Q4']
extanded_labels = ['']*left_buffer + labels + ['']*right_buffer
left_range = list(range(-left_buffer,0))
right_range = list(range(len(labels),len(labels)+right_buffer))
ticks_range = left_range + list(range(len(labels))) + right_range
aux_range = list(range(len(extanded_labels)))
ax[0].set_xticks(ticks_range)
ax[0].set_xticklabels(extanded_labels)
xticks = ax[0].xaxis.get_major_ticks()
for ind in aux_range[0:left_buffer]: xticks[ind].tick1line.set_visible(False)
for ind in aux_range[len(labels)+left_buffer:len(labels)+left_buffer+right_buffer]: xticks[ind].tick1line.set_visible(False)
in which left_buffer and right_buffer are margins you want to add to the left and to the right, respectively. Running the code, you will get
I may have actually found a simpler solution: I can draw a transparent line (alpha = 0 ) by plotting x = index of the dataframe, ie with all the labels, including those for which all values are nans, and y = the average value of the dataframe, so as to be sure it's within the range:
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * df.mean().mean() , ax = ax[0], alpha =0 )
This assumes the scale of the y a xis has not been changed manually; a better way of doing it would be to check whether it has:
y_centre = np.mean([ax[0].get_ylim()])
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * y_centre , ax = ax[0], alpha =0 )
Drawing a transparent line forces matplotlib to extend the axes so as to show all the x values, even those for which all the other values are nans.
Im trying to smooth a graph line out but since the x-axis values are dates im having great trouble doing this. Say we have a dataframe as follows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now to smooth this line there are some usefull staekoverflow questions allready like:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just cant seem to get some code working to do this for my example, any suggestions?
You can use interpolation functionality that is shipped with pandas. Because your dataframe has a value for every index already, you can populate it with an index that is more sparse, and fill every previously non-existent indices with NaN values. Then, after choosing one of many interpolation methods available, interpolate and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
There is one workaround, we will create two plots - 1) non smoothed /interploted with date labels 2) smoothed without date labels.
Plot the 1) using argument linestyle=" " and convert the dates to be plotted on x-axis to string type.
Plot the 2) using the argument linestyle="-" and interpolating the x-axis and y-axis using np.linespace and make_interp_spline respectively.
Following is the use of the discussed workaround for your code.
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround by creating linespace for length of your x axis
x_new = np.linspace(0, len(df.index), 300)
a_BSpline = make_interp_spline(
[i for i in range(0, len(df.index))],
df.value,
k=5,
)
y_new = a_BSpline(x_new)
# plot this new plot with linestyle = "-"
axs.plot(
x_new[:-5], # removing last 5 entries to remove noise, because interpolation outputs large values at the end.
y_new[:-5],
"-",
label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overalpped plot to see the smoothing.
Depending on what exactly you mean by "smoothing," the easiest way can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
Here is an example.
For a dataframe
import pandas as pd
df=pd.DataFrame({'group':list("AADABCBCCCD"),'Values':[1,0,1,0,1,0,0,1,0,1,0]})
I am trying to plot a barplot showing percentage of times A, B, C, D takes zero (or one).
I have a round about way which works but I am thinking there has to be more straight forward way
tempdf=df.groupby(['group','Values']).Values.count().unstack().fillna(0)
tempdf['total']=df['group'].value_counts()
tempdf['percent']=tempdf[0]/tempdf['total']*100
tempdf.reset_index(inplace=True)
print tempdf
sns.barplot(x='group',y='percent',data=tempdf)
If it were plotting just the mean value, I could simply do sns.barplot on df dataframe than tempdf. I am not sure how to do it elegantly if I am interested in plotting percentages.
Thanks,
You can use Pandas in conjunction with seaborn to make this easier:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("tips")
x, y, hue = "day", "proportion", "sex"
hue_order = ["Male", "Female"]
(df[x]
.groupby(df[hue])
.value_counts(normalize=True)
.rename(y)
.reset_index()
.pipe((sns.barplot, "data"), x=x, y=y, hue=hue))
You could use your own function in sns.barplot estimator, as from docs:
estimator : callable that maps vector -> scalar, optional
Statistical function to estimate within each categorical bin.
For you case you could define function as lambda:
sns.barplot(x='group', y='Values', data=df, estimator=lambda x: sum(x==0)*100.0/len(x))
You can follow these steps so that you can see the count and percentages on top of the bars in your plot. Check the example outputs down below
with_hue function will plot percentages on the bar graphs if you have the 'hue' parameter in your plots. It takes the actual graph, feature, Number_of_categories in feature, and hue_categories(number of categories in hue feature) as a parameter.
without_hue function will plot percentages on the bar graphs if you have a normal plot. It takes the actual graph and feature as a parameter.
def with_hue(plot, feature, Number_of_categories, hue_categories):
a = [p.get_height() for p in plot.patches]
patch = [p for p in plot.patches]
for i in range(Number_of_categories):
total = feature.value_counts().values[i]
for j in range(hue_categories):
percentage = '{:.1f}%'.format(100 * a[(j*Number_of_categories + i)]/total)
x = patch[(j*Number_of_categories + i)].get_x() + patch[(j*Number_of_categories + i)].get_width() / 2 - 0.15
y = patch[(j*Number_of_categories + i)].get_y() + patch[(j*Number_of_categories + i)].get_height()
ax.annotate(percentage, (x, y), size = 12)
plt.show()
def without_hue(plot, feature):
total = len(feature)
for p in ax.patches:
percentage = '{:.1f}%'.format(100 * p.get_height()/total)
x = p.get_x() + p.get_width() / 2 - 0.05
y = p.get_y() + p.get_height()
ax.annotate(percentage, (x, y), size = 12)
plt.show()
You can use the library Dexplot, which has the ability to return relative frequencies for categorical variables. It has a similar API to Seaborn. Pass the column you would like to get the relative frequency for to the count function. If you would like to subdivide this by another column, do so with the split parameter. The following returns raw counts.
import dexplot as dxp
dxp.count('group', data=df, split='Values')
To get the relative frequencies, set the normalize parameter to the column you want to normalize over. Use True to normalize over the overall total count.
dxp.count('group', data=df, split='Values', normalize='group')
Normalizing over the 'Values' column would produce the following graph, where the total of all the '0' bars are 1.
dxp.count('group', data=df, split='Values', normalize='Values')