I have a dataframe df containing ages for students and non students, which looks something like this:
Subject Student Age
001 yes 21
002 yes 45
003 no 61
004 no 37
...
I would like to plot the proportions of each group under the age of 40. I can do this in R with plot(factor(age < 40) ~ Student, data = df) which gives me:
Is there a way to replicate this in Python, ideally using either matplotlib or seaborn?
There is no built-in option to create such a plot. You can, of course, create it with matplotlib by calculating the respective numbers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = np.random.poisson(lam=40, size=6000)
b = ((a>50).astype(int)+np.random.rand(6000))>0.9
df = pd.DataFrame({"Subject" : np.arange(6000),
"Age" : a, "Student" : b})
df["Age>40"] = df["Age"] > 40
def propplot(x, y, data):
    xdata = data[[x,y]].groupby(x)
    xcount = xdata.count()
    fig, axes = plt.subplots(ncols=len(xcount),
                             gridspec_kw={"width_ratios":list(xcount[y].values)})
    for ax, (n,grp) in zip(axes, xdata):
        ycount = grp.groupby(y).count().T
        ycount /= float(ycount.values.sum())
        ycount.plot.bar(stacked=True, ax=ax, width=1, legend=False)
        ax.set_xlabel(n)
        ax.set_xlim(-.5,.5)
        ax.set_ylim(0,1)
        ax.set_xticks([])
    axes[0].set_ylabel(y)
    axes[0].legend(ncol=100, title=y, loc=(0,1.02))
    fig.text(0.5,0.02, x)
propplot("Student", "Age>40", df)
plt.show()
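The proportions themselves can also be computed with pandas alone. The following is a minimal sketch, assuming the column names from the question and a small made-up frame (the extra rows and the name `props` are illustrative only). `pd.crosstab` with `normalize="index"` returns the within-group proportions; unlike the R plot, the bars here all have the same width.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data for illustration, with the same columns as the question
df = pd.DataFrame({
    "Subject": ["001", "002", "003", "004", "005", "006"],
    "Student": ["yes", "yes", "no", "no", "no", "yes"],
    "Age": [21, 45, 61, 37, 52, 33],
})
df["Age<40"] = df["Age"] < 40

# Rows: Student groups; columns: False/True; each row sums to 1
props = pd.crosstab(df["Student"], df["Age<40"], normalize="index")

# One stacked bar per Student group
ax = props.plot.bar(stacked=True, width=0.9)
ax.set_ylabel("proportion")
ax.set_ylim(0, 1)
ax.legend(title="Age<40", loc="upper left", bbox_to_anchor=(1, 1))
```

Call plt.show() afterwards to display the figure.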
Related
I have an issue with axis labels when using groupby and trying to plot with seaborn. Here is my problem:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.DataFrame({'user': ['Bob', 'Jane','Alice','Bob','Jane','Alice'],
'income': [40000, 50000, 42000,47000,53000,46000]})
df_group_user = df.groupby(['user']).sum().reset_index()
I then plot a horizontal bar plot using seaborn:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
#Prettify the plot
bar.set_yticklabels( bar.get_yticks(), size = 10)
bar.set_xticklabels( bar.get_xticks(), size = 10)
bar.set_ylabel("User", fontsize = 20)
bar.set_xlabel("Income ($)", fontsize = 20)
bar.set_title("Total income per user", fontsize = 20)
sns.set_theme(style="whitegrid")
sns.set_color_codes("muted")
Unfortunately, when I run the code this way, the y-axis ticks are labelled 0, 1, 2 instead of Bob, Jane, Alice as I'd like them to be.
I can get around the issue if I use matplotlib in the following manner:
df_group_user = df.groupby(['user']).sum()
df_group_user['income'].plot(kind="barh")
plt.title("Total income per user")
plt.ylabel("User")
plt.xlabel("Income ($)")
Ideally, I'd like to use seaborn for plotting, but if I don't use reset_index() like above, I get an error when calling sns.barplot:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
ValueError: Could not interpret input 'user'
Try swapping the positions of the x and y arguments.
I'm using a different dataframe to illustrate a similar situation.
gp = df.groupby("Gender")['Salary'].sum().reset_index()
gp
Output:
Gender Salary
0 Female 8870
1 Male 23667
Now, when plotting the bar chart, pass the x argument first and then the y argument, and check the result:
bar = sns.barplot(x = 'Salary', y = "Gender", data = gp);
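A possible explanation for the numeric labels in the question, offered as a sketch rather than a confirmed diagnosis: `bar.set_yticklabels(bar.get_yticks(), size=10)` replaces the category names with the tick positions (0, 1, 2). Assuming the goal was only to change the font size, `tick_params` does that without touching the label text.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice', 'Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000, 47000, 53000, 46000]})

# reset_index() turns 'user' back into a regular column that seaborn can find
df_group_user = df.groupby('user')['income'].sum().reset_index()

ax = sns.barplot(x="income", y="user", data=df_group_user, color="b")

# Change the tick font size without overwriting the tick label text
ax.tick_params(axis="both", labelsize=10)
ax.set_ylabel("User", fontsize=20)
ax.set_xlabel("Income ($)", fontsize=20)
ax.set_title("Total income per user", fontsize=20)
```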
I am trying to create a violin plot using seaborn.
My df looks like this:
drought
Out[65]:
Dataset TGLO TAM TAFR TAA Type
0 ACCESS1-0 0.181017 0.068988 0.166761 0.069303 AMIP
1 ACCESS1-3 0.109676 -0.001961 -0.008700 0.373162 AMIP
2 BNU-ESM 0.277070 0.272242 0.266324 -0.077017 AMIP
3 CCSM4 0.385075 0.258976 0.304438 0.211241 AMIP
...
21 CMAP 0.087274 -0.062214 -0.079958 0.372267 OBS
22 ERA 0.179999 -0.010484 0.134584 0.204052 OBS
23 GPCC 0.173947 -0.020719 0.021819 0.370157 OBS
24 GPCP 0.151394 0.036450 -0.021462 0.336876 OBS
25 UEA 0.223828 -0.018237 0.088486 0.398062 OBS
26 UofD 0.190969 0.094744 0.036374 0.310938 OBS
I want to have a split violin plot based on Type and this is the code I am using
sns.violinplot(data=drought, hue='Type', split=True)
And this is the error:
Cannot use `hue` without `x` or `y`
I do not have an x or y value because what I want is to have the columns as x, and the values in the rows as y.
Thanks for your help!
Do you want to ignore the 'Dataset' column and have split violins for the 4 other columns? In that case, you need to convert these 4 columns to "long form" (via pandas' melt()).
Here is an example:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
drought = pd.DataFrame({'Dataset': ["".join(np.random.choice([*'VWXYZ'], 5)) for _ in range(40)],
'TGLO': np.random.randn(40),
'TAM': np.random.randn(40),
'TAFR': np.random.randn(40),
'TAA': np.random.randn(40),
'Type': np.repeat(['AMIP', 'OBS'], 20)})
drought_long = drought.melt(id_vars=['Dataset', 'Type'], value_vars=['TGLO', 'TAM', 'TAFR', 'TAA'])
sns.set_style('white')
ax = sns.violinplot(data=drought_long, x='variable', y='value', hue='Type', split=True, palette='flare')
ax.legend()
sns.despine()
plt.tight_layout()
plt.show()
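To see what "long form" means here, a small sketch with two made-up rows and only two of the value columns (the frame names `wide` and `long` are illustrative): melt() stacks the value columns into a 'variable' column holding the former column name and a 'value' column holding the number.

```python
import pandas as pd

# Two rows of made-up values in the wide layout of the question
wide = pd.DataFrame({
    "Dataset": ["A", "B"],
    "TGLO": [0.18, 0.09],
    "TAM": [0.07, -0.06],
    "Type": ["AMIP", "OBS"],
})

# One row per (Dataset, Type, former column name) combination
long = wide.melt(id_vars=["Dataset", "Type"], value_vars=["TGLO", "TAM"])
print(long)
```

The resulting 'variable' and 'value' columns are what the x and y arguments of violinplot refer to in the example above.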
I have a dataframe:
df.head()[['price', 'volumen']]
price volumen
48 45 3
49 48 1
50 100 2
51 45 1
52 56 1
It represents the number of objects with particular price.
I created a histogram based on the volume column:
I would like to add information about the price distribution of each bin. My idea is to use heatmaps instead of single-color columns. E.g. a color red will show a high price, and yellow a low price.
Here is an example plot to illustrate the general idea:
The following example uses seaborn's tips dataset. A histogram is created by grouping the total_bill into bins. And then the bars are colored depending on the tips in each group.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
sns.set_theme(style='white')
tips = sns.load_dataset('tips')
tips['bin'] = pd.cut(tips['total_bill'], 10) # histogram bin
grouped = tips.groupby('bin')
min_tip = tips['tip'].min()
max_tip = tips['tip'].max()
cmap = 'RdYlGn_r'
fig, ax = plt.subplots(figsize=(12, 4))
for bin, binned_df in grouped:
    bin_height = len(binned_df)
    binned_tips = np.sort(binned_df['tip']).reshape(-1, 1)
    ax.imshow(binned_tips, cmap=cmap, vmin=min_tip, vmax=max_tip, extent=[bin.left, bin.right, 0, bin_height],
              origin='lower', aspect='auto')
    ax.add_patch(mpatches.Rectangle((bin.left, 0), bin.length, bin_height, fc='none', ec='k', lw=1))
ax.autoscale()
ax.set_ylim(0, 1.05 * ax.get_ylim()[1])
ax.set_xlabel('total bill')
ax.set_ylabel('frequency')
plt.colorbar(ax.images[0], ax=ax, label='tip')
plt.tight_layout()
plt.show()
Here is how it looks with a banded colormap (cmap = plt.get_cmap('Spectral', 9)):
Here is another example using the 'mpg' dataset, with a histogram over car weight and coloring via miles per gallon.
You can generate a heatmap using seaborn. Bin and reshape the dataframe first. This is random data, so the heatmap is not very interesting.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
s = 50
df = pd.DataFrame({"price":np.random.randint(30,120, s),"volume":np.random.randint(1,5, s)})
fig, ax = plt.subplots(2, figsize=[10,6])
df.loc[:,"volume"].plot(ax=ax[0], kind="hist", bins=3)
# reshape for a heatmap... put price into bins and make 2D
dfh = df.assign(pbin=pd.qcut(df.price,5)).groupby(["pbin","volume"]).mean().unstack(1).droplevel(0,axis=1)
axh = sns.heatmap(dfh, ax=ax[1])
I'm trying to add a bar-plot (stacked or otherwise) for each row in a seaborn clustermap.
Let's say that I have a dataframe like this:
import pandas as pd
import numpy as np
import random
df = pd.DataFrame(np.random.randint(0,100,size=(100, 8)), columns=["heatMap_1","heatMap_2","heatMap_3","heatMap_4","heatMap_5", "barPlot_1","barPlot_1","barPlot_1"])
df['index'] = [ random.randint(1,10000000) for k in df.index]
df.set_index('index', inplace=True)
df.head()
heatMap_1 heatMap_2 heatMap_3 heatMap_4 heatMap_5 barPlot_1 barPlot_1 barPlot_1
index
4552288 9 3 54 37 23 42 94 31
6915023 7 47 59 92 70 96 39 59
2988122 91 29 59 79 68 64 55 5
5060540 68 80 25 95 80 58 72 57
2901025 86 63 36 8 33 17 79 86
I can use the first 5 columns (in this example starting with the prefix heatMap_) to create a seaborn clustermap using this (or the seaborn equivalent):
sns.clustermap(df.iloc[:,0:5], )
and the stacked barplot for the last three columns (in this example starting with the prefix barPlot_) using this:
df.iloc[:,5:8].plot(kind='bar', stacked=True)
but I'm a bit confused about how to merge both plot types. I understand that clustermap creates its own figure, and I'm not sure if I can extract just the heatmap from clustermap and then use it with subfigures. (Discussed here: Adding seaborn clustermap to figure with other plots). This creates a weird output.
Edit:
Using this:
import pandas as pd
import numpy as np
import random
import seaborn as sns; sns.set(color_codes=True)
import matplotlib.pyplot as plt
import matplotlib.gridspec
df = pd.DataFrame(np.random.randint(0,100,size=(100, 8)), columns=["heatMap_1","heatMap_2","heatMap_3","heatMap_4","heatMap_5", "barPlot_1","barPlot_2","barPlot_3"])
df['index'] = [ random.randint(1,10000000) for k in df.index]
df.set_index('index', inplace=True)
g = sns.clustermap(df.iloc[:,0:5], )
g.gs.update(left=0.05, right=0.45)
gs2 = matplotlib.gridspec.GridSpec(1,1, left=0.6)
ax2 = g.fig.add_subplot(gs2[0])
df.iloc[:,5:8].plot(kind='barh', stacked=True, ax=ax2)
creates this:
which does not really match well (i.e. due to dendrograms there is a shift).
Another option is to manually perform clustering and create a matplotlib heatmap, and then add associated subfigures like barplots (discussed here: How to get flat clustering corresponding to color clusters in the dendrogram created by scipy).
Is there a way I can use clustermap as a subplot along with other plots ?
This is the result I'm looking for[1]:
While not a proper answer, I decided to break it down and do everything manually.
Taking inspiration from the answer here, I decided to cluster and reorder the heatmap separately:
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as dist

def heatMapCluter(df):
    row_method = "ward"
    column_method = "ward"
    row_metric = "euclidean"
    column_metric = "euclidean"
    # Cluster the columns and reorder them by dendrogram leaf order
    if column_method == "ward":
        # linkage expects the condensed distances from pdist, not the square form
        d2 = dist.pdist(df.transpose(), metric=column_metric)
        Y2 = sch.linkage(d2, method=column_method)
        Z2 = sch.dendrogram(Y2, no_plot=True)
        ind2 = sch.fcluster(Y2, 0.7 * max(Y2[:, 2]), "distance")
        idx2 = Z2["leaves"]
        df = df.iloc[:, idx2]
        ind2 = ind2[idx2]
    else:
        idx2 = range(df.shape[1])
    # Cluster the rows and reorder them the same way
    if row_method:
        d1 = dist.pdist(df, metric=row_metric)
        Y1 = sch.linkage(d1, method=row_method)
        Z1 = sch.dendrogram(Y1, orientation="right", no_plot=True)
        ind1 = sch.fcluster(Y1, 0.7 * max(Y1[:, 2]), "distance")
        idx1 = Z1["leaves"]
        df = df.iloc[idx1, :]
        ind1 = ind1[idx1]
    else:
        idx1 = range(df.shape[0])
    return df
Rearranged the original dataframe:
clusteredHeatmap = heatMapCluter(df.iloc[:, 0:5].copy())
# Extract the "barplot" rows and merge them
clusteredDataframe = df.reindex(list(clusteredHeatmap.index.values))
clusteredDataframe = clusteredDataframe.reindex(
list(clusteredHeatmap.columns.values)
+ list(df.iloc[:, 5:8].columns.values),
axis=1,
)
and then used the gridspec to plot both "subfigures" (clustermap and barplot):
# Now let's plot this - first the heatmap and then the barplot.
# Since it is a "two" part plot which shares the same axis, it is
# better to use gridspec
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(12, 12))
gs = GridSpec(3, 3)
gs.update(wspace=0.015, hspace=0.05)
ax_main = plt.subplot(gs[0:3, :2])
ax_yDist = plt.subplot(gs[0:3, 2], sharey=ax_main)
im = ax_main.imshow(
    clusteredDataframe.iloc[:, 0:5],
    cmap="Greens",
    interpolation="nearest",
    aspect="auto",
)
clusteredDataframe.iloc[:, 5:8].plot(
    kind="barh", stacked=True, ax=ax_yDist, sharey=True
)
ax_yDist.spines["right"].set_color("none")
ax_yDist.spines["top"].set_color("none")
ax_yDist.spines["left"].set_visible(False)
ax_yDist.xaxis.set_ticks_position("bottom")
ax_yDist.set_xlim([0, 100])
ax_yDist.set_yticks([])
ax_yDist.xaxis.grid(False)
ax_yDist.yaxis.grid(False)
Jupyter notebook: https://gist.github.com/siddharthst/2a8b7028d18935860062ac7379b9279f
Image:
1 - http://code.activestate.com/recipes/578175-hierarchical-clustering-heatmap-python/
I have a dataframe called df that looks like this:
Qname X Y Magnitude
Bob 5 19 10
Tom 6 20 20
Jim 3 30 30
I would like to make a visual text plot of the data: plot each Qname on a figure at its coordinates (X, Y), with the text size given by Magnitude.
I have tried:
fig = plt.figure()
ax = fig.add_axes((0,0,1,1))
X = df.X
Y = df.Y
S = df.Magnitude
Name = df.Qname
ax.text(X, Y, Name, size=S, color='red', rotation=0, alpha=1.0, ha='center', va='center')
fig.show()
However nothing is showing up on my plot. Any help is greatly appreciated.
This should get you started. Matplotlib does not handle the text placement for you so you will probably need to play around with this.
import pandas as pd
import matplotlib.pyplot as plt
# replace this with your existing code to read the dataframe
df = pd.read_clipboard()
plt.scatter(df.X, df.Y, s=df.Magnitude)
# annotate the plot
# unfortunately you have to iterate over your points
# see http://stackoverflow.com/q/5147112/553404
# for better annotation options
for idx, row in df.iterrows():
    plt.annotate(row['Qname'], xy=(row['X'], row['Y']))
plt.show()
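If the names themselves should be drawn at a size given by Magnitude, as the question describes, one option is to call ax.text once per row. This is a minimal sketch that rebuilds the three-row frame from the question; the axis margins are arbitrary choices. Note that text is not taken into account when the axes autoscale, so with the default limits of 0 to 1 the labels would fall outside the visible area, which may be why nothing showed up in the original attempt.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Qname": ["Bob", "Tom", "Jim"],
                   "X": [5, 6, 3],
                   "Y": [19, 20, 30],
                   "Magnitude": [10, 20, 30]})

fig, ax = plt.subplots()

# ax.text draws one string at one position, so call it once per row
for row in df.itertuples(index=False):
    ax.text(row.X, row.Y, row.Qname, fontsize=row.Magnitude,
            color="red", ha="center", va="center")

# Text does not affect autoscaling, so set the limits explicitly
ax.set_xlim(df.X.min() - 1, df.X.max() + 1)
ax.set_ylim(df.Y.min() - 5, df.Y.max() + 5)
```

Call plt.show() afterwards to display the figure.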