I have the following dataframe
Class Age Percentage
0 2004 3 43.491170
1 2004 2 29.616607
2 2004 4 13.838925
3 2004 6 10.049712
4 2004 5 2.637445
5 2004 1 0.366142
6 2005 2 51.267369
7 2005 3 19.589268
8 2005 6 13.730432
9 2005 4 11.155305
10 2005 5 3.343524
11 2005 1 0.913590
12 2005 9 0.000511
I would like to make a bar plot using seaborn with 'Percentage' on the y-axis and 'Class' on the x-axis, and label the bars using the 'Age' column. I would also like to arrange the bars in descending order, i.e. from the tallest bar to the shortest.
To do that, I thought of setting the hue_order parameter based on the order of the 'Percentage' variable. For example, if I sort the 'Percentage' column in descending order for Class == 2004, then hue_order = [3, 2, 4, 6, 5, 1].
Here is my code:
import matplotlib.pyplot as plt
import seaborn as sns
def hue_order():
    for cls in dataset.Class.unique():
        temp_df = dataset[dataset['Class'] == cls]
        order = temp_df.sort_values('Percentage', ascending = False)['Age']
    return order
sns.barplot(x="Class", y="Percentage", hue = 'Age',
            hue_order= hue_order(),
            data=dataset)
plt.show()
However, the bars are in descending order only for the Class == 2005. Any help?
In my question, I am using the hue parameter, thus, it is not a duplicate as proposed.
The seaborn hue parameter adds another dimension to the plot. The hue_order determines in which order this dimension is handled. However you cannot split that order. This means you may well change the order such that Age == 2 is in the third place in the plot. But you cannot change it partially, such that in some part it is in the first and in some other it'll be in the third place.
To achieve what is desired here, namely a different order of the auxiliary dimension for each group within the same axes, you need to handle the bar placement manually.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame({"Class" : [2004]*6+[2005]*7,
"Age" : [3,2,4,6,5,1,2,3,6,4,5,1,9],
"Percentage" : [50,40,30,20,10,30,20,35,40,50,45,30,15]})
def sortedgroupedbar(ax, x,y, groupby, data=None, width=0.8, **kwargs):
    order = np.zeros(len(data))
    df = data.copy()
    for xi in np.unique(df[x].values):
        group = data[df[x] == xi]
        a = group[y].values
        b = sorted(np.arange(len(a)),key=lambda x:a[x],reverse=True)
        c = sorted(np.arange(len(a)),key=lambda x:b[x])
        order[data[x] == xi] = c
    df["order"] = order
    u, df["ind"] = np.unique(df[x].values, return_inverse=True)
    step = width/len(np.unique(df[groupby].values))
    for xi,grp in df.groupby(groupby):
        ax.bar(grp["ind"]-width/2.+grp["order"]*step+step/2.,
               grp[y],width=step, label=xi, **kwargs)
    ax.legend(title=groupby)
    ax.set_xticks(np.arange(len(u)))
    ax.set_xticklabels(u)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
fig, ax = plt.subplots()
sortedgroupedbar(ax, x="Class",y="Percentage", groupby="Age", data=df)
plt.show()
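The per-group order that sortedgroupedbar builds with two sorted() calls can also be computed directly in pandas. This is only a sketch of that one step, assuming the same sample df as above; the "order" column name is taken from the function, and using rank() here is my substitution, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Class": [2004] * 6 + [2005] * 7,
    "Age": [3, 2, 4, 6, 5, 1, 2, 3, 6, 4, 5, 1, 9],
    "Percentage": [50, 40, 30, 20, 10, 30, 20, 35, 40, 50, 45, 30, 15],
})

# Position of each bar inside its Class group: 0 is the tallest bar.
# method="first" gives tied values distinct, consecutive positions.
df["order"] = (
    df.groupby("Class")["Percentage"]
      .rank(method="first", ascending=False)
      .astype(int) - 1
)

print(df)
```

The resulting column can then be used as the horizontal offset of each bar within its group, exactly as grp["order"] is used above.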
I would like to print the DataFrame beside the plot. What would be a pythonic way to do that?
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Age':[21,22,23,24,25,26,27,28,29,30],'Count':[4,1,3,7,2,3,5,1,1,5]})
print(df)
Age Count
0 21 4
1 22 1
2 23 3
3 24 7
4 25 2
5 26 3
6 27 5
7 28 1
8 29 1
9 30 5
plt.rcParams['figure.figsize']=(10,6)
fig,ax = plt.subplots()
font_used={'fontname':'pristina', 'color':'Black'}
ax.set_ylabel('Count',fontsize=20,**font_used)
ax.set_xlabel('Age',fontsize=20,**font_used)
plt.plot(df['Age'],df['Count'])
I would like to have a graph like this. How can I have the DataFrame's plotted values printed alongside it?
You can use ax.text to add the DataFrame to the plot. DataFrames have a .to_string method which makes formatting nice. Supply index=False to remove the row index.
plt.rcParams['figure.figsize']=(10, 6)
fig,ax = plt.subplots()
font_used={'fontname':'pristina', 'color':'Black'}
ax.set_ylabel('Count',fontsize=20,**font_used)
ax.set_xlabel('Age',fontsize=20,**font_used)
# Adjust to where you want.
ax.text(x=28.5, y=4.5, s=df.to_string(index=False))
plt.plot(df['Age'],df['Count'])
plt.show()
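The x=28.5, y=4.5 position above is in data coordinates, so it has to be retuned whenever the data changes. A possible variation, sketched here on a shortened version of the same DataFrame, is to place the text in axes coordinates with transform=ax.transAxes and use a monospace font so the columns of to_string() stay aligned. The 0.75 margin and the 1.05 offset are arbitrary values I picked for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"Age": [21, 22, 23], "Count": [4, 1, 3]})

fig, ax = plt.subplots()
fig.subplots_adjust(right=0.75)  # leave free space to the right of the axes
ax.plot(df["Age"], df["Count"])

table_text = ax.text(
    1.05, 0.5, df.to_string(index=False),
    transform=ax.transAxes,  # (0, 0) is the lower-left corner of the axes, (1, 1) the upper-right
    family="monospace",      # fixed-width font keeps the columns lined up
    va="center",
)
```

With axes coordinates the text stays just outside the right edge of the plot regardless of the axis limits.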
Another option is to use the function plt.table():
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Age':[21,22,23,24,25,26,27,28,29,30],'Count':[4,1,3,7,2,3,5,1,1,5]})
plt.rcParams['figure.figsize']=(10,15)
fig,ax = plt.subplots()
plt.subplots_adjust(left=0.1, right=0.85, top=0.9, bottom=0.1)
font_used={'fontname':'pristina', 'color':'Black'}
ax.set_ylabel('Count',fontsize=20,**font_used)
ax.set_xlabel('Age',fontsize=20,**font_used)
plt.plot(df['Age'],df['Count'])
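# cellText must be a 2D list of strings: one inner list per table row
ax.table(cellText=df[['Count']].astype(str).values.tolist(),
         rowLabels=df['Age'].astype(str).tolist(),
         colWidths=[0.2,0.25],
         loc='right')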
plt.show()
This approach draws a table, with cell borders, next to the plot. Just make sure to leave room for it with subplots_adjust().
Pandas has a to_html function you can use and place the html next to it. What are you placing the graph and Dataframe into?
df.to_html()
I want to plot a box plot with my DataFrame:
A B C
max 10 11 14
min 3 4 10
q1 5 6 12
q3 9 7 13
how can I plot a box plot with these fixed values?
You can use the Axes.bxp method in matplotlib, based on this helpful answer. The input is a list of dictionaries containing the relevant values, but the median is a required key in these dictionaries. Since the data you provided does not include medians, I have made up medians in the code below (but you will need to calculate them from your actual data).
import matplotlib.pyplot as plt
import pandas as pd
# reproducing your data
df = pd.DataFrame({'A':[10,3,5,9],'B':[11,4,6,7],'C':[14,10,12,13]})
# add a row for median, you need median values!
sample_medians = {'A':7, 'B':6.5, 'C':12.5}
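# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df = pd.concat([df, pd.DataFrame([sample_medians], dtype=float)], ignore_index=True)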
df.index = ['max','min','q1','q3','med']
Here is the modified df with medians included:
>>> df
A B C
max 10.0 11.0 14.0
min 3.0 4.0 10.0
q1 5.0 6.0 12.0
q3 9.0 7.0 13.0
med 7.0 6.5 12.5
Now we transform the df into a list of dictionaries:
labels = list(df.columns)
# create dictionaries for each column as items of a list
bxp_stats = df.apply(lambda x: {'med':x.med, 'q1':x.q1, 'q3':x.q3, 'whislo':x['min'], 'whishi':x['max']}, axis=0).tolist()
# add the column names as labels to each dictionary entry
for index, item in enumerate(bxp_stats):
    item.update({'label':labels[index]})
_, ax = plt.subplots()
ax.bxp(bxp_stats, showfliers=False);
plt.show()
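As noted above, the medians have to come from the actual data. If the raw observations are available, the dictionaries that Axes.bxp expects can be built straight from them. The sketch below assumes a hypothetical raw DataFrame (the values are made up so that they reproduce the summary table in the question) and uses the column minimum and maximum as whisker ends, as the answer does.

```python
import pandas as pd

# Hypothetical raw observations; not part of the original question.
raw = pd.DataFrame({
    "A": [3, 5, 7, 9, 10],
    "B": [4, 6, 6.5, 7, 11],
    "C": [10, 12, 12.5, 13, 14],
})

def bxp_stats_from_raw(frame):
    """Build one stats dictionary per column in the format Axes.bxp accepts."""
    stats = []
    for name in frame.columns:
        col = frame[name]
        stats.append({
            "label": name,
            "whislo": float(col.min()),
            "q1": float(col.quantile(0.25)),
            "med": float(col.median()),
            "q3": float(col.quantile(0.75)),
            "whishi": float(col.max()),
        })
    return stats

stats = bxp_stats_from_raw(raw)
print(stats[0])
```

The list can then be passed to ax.bxp(stats, showfliers=False) in place of bxp_stats.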
Unfortunately, the median line is required, so it must be specified for every box. We therefore draw it as thin as possible, in the same color as the box, so that it is virtually invisible.
If you want each box to be drawn with different specifications, they will have to be in different subplots. I understand if this looks kind of ugly, so you can play around with the spacing between subplots or consider removing some of the y-axes.
fig, axes = plt.subplots(nrows=1, ncols=3, sharey=True)
# specify list of background colors, median line colors same as background with as thin of a width as possible
colors = ['LightCoral', '#FEF1B5', '#EEAEEE']
medianprops = [dict(linewidth = 0.1, color='LightCoral'), dict(linewidth = 0.1, color='#FEF1B5'), dict(linewidth = 0.1, color='#EEAEEE')]
# create a list of boxplots of length 3
bplots = [axes[i].bxp([bxp_stats[i]], medianprops=medianprops[i], patch_artist=True, showfliers=False) for i in range(len(df.columns))]
# set each boxplot a different color
for i, bplot in enumerate(bplots):
    for patch in bplot['boxes']:
        patch.set_facecolor(colors[i])
plt.show()
I'm trying to visualise a large (pandas) dataframe in Python as a heatmap. This dataframe has two types of variables: strings ("Absent" or "Unknown") and floats.
I want the heatmap to show cells with "Absent" in black and "Unknown" in red, and the rest of the dataframe as a normal heatmap, with the floats in a scale of greens.
I can do this easily in Excel with conditional formatting of cells, but I can't find any help online to do this with Python either with matplotlib, seaborn, ggplot. What am I missing?
Thank you for your time.
You could use cmap_custom.set_under('red') and cmap_custom.set_over('black') to apply custom colors to values below vmin and above vmax:
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.axes_grid1 as axes_grid1
import pandas as pd
# make a random DataFrame
np.random.seed(1)
arr = np.random.choice(['Absent', 'Unknown']+list(range(10)), size=(5,7))
df = pd.DataFrame(arr)
# find the largest and smallest finite values
finite_values = pd.to_numeric(list(set(np.unique(df.values))
                                   .difference(['Absent', 'Unknown'])))
vmin, vmax = finite_values.min(), finite_values.max()
# change Absent and Unknown to numeric values
df2 = df.replace({'Absent': vmax+1, 'Unknown': vmin-1})
# make sure the values are numeric
for col in df2:
    df2[col] = pd.to_numeric(df2[col])
fig, ax = plt.subplots()
cmap_custom = plt.get_cmap('Greens')
cmap_custom.set_under('red')
cmap_custom.set_over('black')
im = plt.imshow(df2, interpolation='nearest', cmap = cmap_custom,
vmin=vmin, vmax=vmax)
# add a colorbar (https://stackoverflow.com/a/18195921/190597)
divider = axes_grid1.make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax, extend='both')
plt.show()
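To see what set_under and set_over do in isolation, here is a minimal sketch that maps three values through the colormap by hand. It assumes a fixed range of 0 to 9 (matching the random DataFrame above) and works on a copy of the 'Greens' colormap so the shared instance is not modified; the copy is my addition, not part of the original answer.

```python
import copy

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import Normalize

cmap = copy.copy(plt.get_cmap("Greens"))
cmap.set_under("red")    # used for values below vmin ('Unknown' was mapped to vmin-1)
cmap.set_over("black")   # used for values above vmax ('Absent' was mapped to vmax+1)

norm = Normalize(vmin=0, vmax=9)
# One value below the range, one inside it, one above it.
rgba = cmap(norm(np.array([-1.0, 4.0, 10.0])))
print(rgba)
```

The first row comes out as opaque red and the last as opaque black, while the middle row is a shade of green taken from the regular scale.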
The DataFrame
In [117]: df
Out[117]:
0 1 2 3 4 5 6
0 3 9 6 7 9 3 Absent
1 Absent Unknown 5 4 7 0 2
2 3 0 2 9 8 0 2
3 5 5 7 Unknown 5 Absent 4
4 7 7 5 4 7 Unknown Absent
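becomes a heatmap with the numeric cells in shades of green, the 'Absent' cells in black and the 'Unknown' cells in red.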
I have a data frame similar to this
import pandas as pd
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['age','data']
print(df) #printing dataframe.
I performed the groupby function on it to get the required output.
df['COUNTER'] =1 #initially, set that counter to 1.
group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
print(group_data)
Now I want to plot the output using matplotlib. Please help me with it; I am not able to figure out how to start or what to do.
I want to plot the counter values in something similar to a bar graph.
Try:
group_data = group_data.reset_index()
in order to get rid of the multiple index that the groupby() has created for you.
Your print(group_data) will give you this:
In [24]: group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
In [25]: print(group_data)
age data
1 ONE 3
THREE 1
TWO 1
2 ONE 2
TWO 1
3 ONE 1
TWO 1
Name: COUNTER, dtype: int64
Whereas resetting will 'simplify' the new index:
In [26]: group_data = group_data.reset_index()
In [27]: group_data
Out[27]:
age data COUNTER
0 1 ONE 3
1 1 THREE 1
2 1 TWO 1
3 2 ONE 2
4 2 TWO 1
5 3 ONE 1
6 3 TWO 1
Then depending on what it is exactly that you want to plot, you might want to take a look at the Matplotlib docs
EDIT
I now read more carefully that you want to create a 'bar' chart.
If that is the case then I would take a step back and not use reset_index() on the groupby result. Instead, try this:
In [46]: fig = group_data.plot.bar()
In [47]: fig.figure.show()
I hope this helps
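If one bar group per age is wanted rather than one bar per (age, data) pair, a possible variation is to unstack the second index level before plotting. This is a sketch under that assumption: it uses size() instead of the COUNTER column, and the variable name counts is mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({
    "age": ['1', '3', '1', '2', '3', '1', '2', '2', '1', '1'],
    "data": ['ONE', 'TWO', 'ONE', 'ONE', 'ONE', 'TWO', 'ONE', 'TWO', 'ONE', 'THREE'],
})

# Count rows per (age, data) pair, then move 'data' into the columns;
# pairs that never occur are filled with 0.
counts = df.groupby(["age", "data"]).size().unstack(fill_value=0)

# One group of bars per age, one bar per 'data' value within the group.
ax = counts.plot.bar(rot=0)
ax.set_ylabel("count")
print(counts)
```

Each column of counts becomes one legend entry, so the 'data' values are distinguished by color instead of by tick label.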
Try with this:
# This is a great tool to add plots to jupyter notebook
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Params get plot bigger
plt.rcParams["axes.labelsize"] = 16
plt.rcParams["xtick.labelsize"] = 14
plt.rcParams["ytick.labelsize"] = 14
plt.rcParams["legend.fontsize"] = 12
plt.rcParams["figure.figsize"] = [15, 7]
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['age','data']
df['COUNTER'] = 1
group_data = df.groupby(['age','data']).sum()[['COUNTER']].plot.bar(rot = 90) # If you want to rotate labels from x axis
_ = group_data.set(xlabel = 'xlabel', ylabel = 'ylabel'), group_data.legend(['Legend']) # you can add labels and legend
I intend to plot multiple columns of a pandas DataFrame, all grouped by another column, using the groupby option of seaborn.boxplot. There is a nice answer for a similar problem in matplotlib (matplotlib: Group boxplots), but given that seaborn.boxplot comes with a groupby option, I thought it could be much easier to do this in seaborn.
Here we go with a reproducible example that fails:
import seaborn as sns
import pandas as pd
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
[10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
columns=['a1', 'a2', 'a3', 'a4', 'b'])
# display(df)
a1 a2 a3 a4 b
0 2 4 5 6 1
1 4 5 6 7 2
2 5 4 5 5 1
3 10 4 7 8 2
4 9 3 4 6 2
5 3 3 4 4 1
#Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)
What I get is something that completely ignores groupby option:
Whereas if I do this with one column it works thanks to another SO question Seaborn groupby pandas Series :
sns.boxplot(df.a1, groupby=df.b)
So I would like to get all my columns in one plot (all columns come in a similar scale).
EDIT:
The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.
As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box.
However, you can accomplish what I think you're hoping for with the factorplot function (renamed catplot in seaborn 0.9), using kind="box". But you'll first have to "melt" the sample dataframe into what is called long-form or "tidy" format, where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
Then it's very simple to plot:
sns.catplot(x="a", hue="b", y="c", data=df_long, kind="box")
You can use directly boxplot (I imagine when the question was asked, that was not possible, but with seaborn version > 0.6 it is).
As explained by @mwaskom, you have to "melt" the sample dataframe into its "long-form" where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
# display(df_long.head())
b a c
0 1 a1 2
1 2 a1 4
2 1 a1 5
3 2 a1 10
4 2 a1 9
Then you just plot it:
sns.boxplot(x="a", hue="b", y="c", data=df_long)
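Before plotting, it can help to confirm that the melt produced what the boxplot expects: one row per original cell, and the same number of observations in every (a, b) box. A small check along those lines, using the df from the question; value_vars is spelled out here only for clarity, and per_box is a name I made up.

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# Long form: 'b' is kept as an identifier, the four a-columns are stacked
# into a variable column 'a' and a value column 'c'.
df_long = pd.melt(df, id_vars="b", value_vars=["a1", "a2", "a3", "a4"],
                  var_name="a", value_name="c")

# Number of observations that will end up in each box of the grouped boxplot.
per_box = df_long.groupby(["a", "b"])["c"].count()
print(per_box)
```

Six rows times four melted columns gives 24 rows in df_long, split evenly across the eight boxes.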
Seaborn's groupby function takes Series not DataFrames, that's why it's not working.
As a work around, you can do this :
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,2, sharey=True)
for i, grp in enumerate(df.filter(regex="a").groupby(by=df.b)):
    sns.boxplot(data=grp[1], ax=ax[i])
it gives :
Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]
a1 a2 a3 a4
0 2 4 5 6
1 4 5 6 7
2 5 4 5 5
3 10 4 7 8
4 9 3 4 6
5 3 3 4 4
Hope this helps
It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.
Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.
g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')
It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.
Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):
combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)
If you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# set_axis returns a new DataFrame (its inplace argument was removed in pandas 2.0)
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')
graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=['cluster'],
    # value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
    # value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6']
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)
The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):
index  cluster  psychometric_test  standard deviations from the mean
0      0.0      outcome_var_1      -1.276182
1      0.0      outcome_var_1      -1.118813
2      0.0      outcome_var_1      -1.276182
...
9754   0.0      outcome_var_6      0.892548
9755   0.0      outcome_var_6      1.420480
If you want to use indices with melt:
graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=cluster_data_df.columns[-1],
    # value_vars=cluster_data_df.columns[:-1],
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)
And here's the graphing code:
(Done with column headings - just note that y-axis=value_name, x-axis = var_name, hue = id_vars):
# plot graph grouped by cluster
# style and font scale are seaborn settings, not Figure methods
sns.set_theme(style="white", font_scale=1.2)
fig = plt.figure(figsize=(10, 10))
# create boxplot
ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
                 data=graph_data)
# set box alpha (the boxes are in ax.patches on matplotlib >= 3.5, in ax.artists on older versions):
for patch in ax.patches:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))
# create scatterplot
ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
                   dodge=True, alpha=.25, zorder=1)
# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes # custom method
legend_labels: List[str] = []
while i < cluster_n:
    label: str = f"cluster {i+1}, n = {cluster_info[i]}"
    legend_labels.append(label)
    i += 1
if -1 in cluster_info.keys():
    cluster_n += 1
    label: str = f"Unclustered, n = {cluster_info[-1]}"
    legend_labels.insert(0, label)
## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()
Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.
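For anyone hitting the same problem, here is a small sketch of the working construction with a shape check before melting. The arrays are toy stand-ins for dbscan_output.data and dbscan_output.labels (I don't have the real ones), and the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the outcome variables and the DBSCAN labels.
data = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
labels = np.array([0, 0, 1])

# reshape(-1, 1) turns the 1D label vector into a single column,
# so it can be appended to the data along axis=1.
combined = np.concatenate([data, labels.reshape(-1, 1)], axis=1)

frame = pd.DataFrame(combined, columns=["var_1", "var_2", "cluster"])

# Each cell should hold a plain number, not a nested list.
assert all(np.isscalar(v) for v in frame["cluster"])

long_form = pd.melt(frame, id_vars=["cluster"],
                    var_name="psychometric_test", value_name="value")
print(long_form)
```

Checking combined.shape and the cell types right after the concatenate step would have pointed at the extra brackets long before melt complained.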