Scatter plot with custom ticks - python

I want to make a scatter plot with wavelength (float) on the y-axis and spectral class (a list of characters/strings) on the x-axis, labels = ['B','A','F','G','K','M']. The data are saved in a pandas dataframe, df.
df['Spec Type Index']
0 NaN
1 A
2 G
. .
. .
167 K
168 NaN
169 G
Then,
df['Disk Major Axis "']
0 4.30
1 4.50
2 22.00
. .
. .
167 1.32
168 0.28
169 25.00
Thus, I thought this should be done simply with
plt.scatter(df['Spec Type Index'], df['Disk Major Axis "'])
But I get this annoying error
could not convert string to float: 'G'
After fixing this, I want to set custom xticks as follows. How can I do that?
labels = ['B','A','F','G','K','M']
ticks = np.arange(len(labels))
plt.xticks(ticks, labels)

First, I think you have to map those strings to integers so that matplotlib can decide where to place the points.
labels = ['B','A','F','G','K','M']
mapping = {'B': 0,'A': 1,'F': 2,'G': 3,'K': 4,'M': 5}
df = df.replace({'Spec Type Index': mapping})
Then plot the scatter,
fig, ax = plt.subplots()
ax.scatter(df['Spec Type Index'], df['Disk Major Axis "'])
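Finally, fix the tick positions and then label them (calling set_xticklabels alone would attach the labels to whatever ticks matplotlib picked automatically),
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)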
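The steps above can be put together as a small self-contained sketch. The six-row dataframe below is made up for illustration (it only mimics the columns shown in the question), and Series.map is used in place of DataFrame.replace so that the original column stays untouched; both are assumptions, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the real dataframe
df = pd.DataFrame({
    "Spec Type Index": [np.nan, "A", "G", "K", np.nan, "G"],
    'Disk Major Axis "': [4.30, 4.50, 22.00, 1.32, 0.28, 25.00],
})

labels = ["B", "A", "F", "G", "K", "M"]
# Position of each spectral class on the x-axis: B -> 0, A -> 1, ...
mapping = {label: i for i, label in enumerate(labels)}

# Unmapped values (NaN) stay NaN and are simply not drawn
x = df["Spec Type Index"].map(mapping)

fig, ax = plt.subplots()
ax.scatter(x, df['Disk Major Axis "'])
# Pin one tick per class, then name the ticks
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
fig.savefig("spec_type_scatter.png")
```

Setting the tick positions explicitly also shows classes that have no data points (here B, F and M), which keeps the axis in spectral order.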

Related

Modify a bar plot into a stacked plot keeping the original values

I have a pandas DataFrame containing the percentage of students that have a certain skill in each subject stratified according to their gender
iterables = [['Above basic','Basic','Low'], ['Female','Male']]
index = pd.MultiIndex.from_product(iterables, names=["Skills", "Gender"])
df = pd.DataFrame(data=[[36,36,8,8,6,6],[46,46,2,3,1,2],[24,26,10,11,16,13]], index=["Math", "Literature", "Physics"], columns=index)
print(df)
Skills     Above basic      Basic        Low
Gender          Female Male Female Male Female Male
Math                36   36      8    8      6    6
Literature          46   46      2    3      1    2
Physics             24   26     10   11     16   13
Next I want to see how the skills are distributed according to the subjects
#plot how the skills are distributed according to the subjects
df.sum(axis=1,level=[0]).plot(kind='bar')
df.plot(kind='bar')
Now I would like to add the percentage of Male and Female to each bar in a stacked manner, e.g. for the first bar ("Math", "Above basic") it should be 50/50. For the bar ("Literature", "Basic") it should be 40/60, for the bar ("Literature", "Low") it should be 33.3/66.7, and so on.
Could you give me a hand?
Using the level keyword in DataFrame and Series aggregations, df.sum(axis=1,level=[0]), is deprecated.
Use df.groupby(level=0, axis=1).sum()
df.div(dfg).mul(100).round(1).astype(str) creates a DataFrame of strings with the 'Female' and 'Male' percent for each of the 'Skills', which can be used to create a custom bar label.
As shown in this answer, use matplotlib.pyplot.bar_label to annotate the bars, which has a labels= parameter for custom labels.
Tested in python 3.11, pandas 1.5.3, matplotlib 3.7.0, seaborn 0.12.2
# group df to create the bar plot
dfg = df.groupby(level=0, axis=1).sum()
# calculate the Female / Male percent for each Skill
percent_s = df.div(dfg).mul(100).round(1).astype(str)
# plot the bars
ax = dfg.plot(kind='bar', figsize=(10, 7), rot=0, width=0.9, ylabel='Total Percent\n(Female/Male split)')
# iterate through the bar containers
for c in ax.containers:
    # get the Skill label
    label = c.get_label()
    # use the Skill label to get the current group based on level, join the strings, and get an array of custom labels
    labels = percent_s.loc[:, percent_s.columns.get_level_values(0).isin([label])].agg('/'.join, axis=1).values
    # add the custom labels to the center of the bars
    ax.bar_label(c, labels=labels, label_type='center')
    # add total percent to the top of the bars
    ax.bar_label(c, weight='bold', fmt='%g%%')
percent_s
Skills     Above basic       Basic         Low
Gender          Female  Male Female  Male Female  Male
Math              50.0  50.0   50.0  50.0   50.0  50.0
Literature        50.0  50.0   40.0  60.0   33.3  66.7
Physics           48.0  52.0   47.6  52.4   55.2  44.8
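As a cross-check of the table above, here is a minimal sketch that computes the same Female/Male split one Skill at a time. It avoids groupby(..., axis=1), which newer pandas releases (2.1 and later, as far as I know) flag as deprecated; the loop and the name percent are illustrative assumptions, not part of the tested answer.

```python
import pandas as pd

iterables = [["Above basic", "Basic", "Low"], ["Female", "Male"]]
columns = pd.MultiIndex.from_product(iterables, names=["Skills", "Gender"])
df = pd.DataFrame(
    data=[[36, 36, 8, 8, 6, 6], [46, 46, 2, 3, 1, 2], [24, 26, 10, 11, 16, 13]],
    index=["Math", "Literature", "Physics"],
    columns=columns,
)

parts = {}
for skill in df.columns.get_level_values("Skills").unique():
    block = df[skill]  # the Female / Male columns of one Skill
    # divide each row by its Female + Male total, then express as percent
    parts[skill] = block.div(block.sum(axis=1), axis=0).mul(100).round(1)

# stitch the per-Skill blocks back together under the Skill names
percent = pd.concat(parts, axis=1)
print(percent)
```

Calling percent.astype(str) gives the same kind of string table as percent_s, so the label-joining step in the loop over ax.containers should work unchanged.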
Optionally, melt df into a long form, and use sns.catplot with kind='bar' to plot each 'Gender' in a separate Facet.
# melt df into a long form
dfm = df.melt(ignore_index=False).reset_index(names='Subject')
# plot the melted dataframe
g = sns.catplot(kind='bar', data=dfm, x='Subject', y='value', col='Gender', hue='Skills')
# Flatten the axes for ease of use
axes = g.axes.ravel()
# relabel the yaxis
axes[0].set_ylabel('Percent')
# add bar labels
for ax in axes:
    for c in ax.containers:
        ax.bar_label(c, fmt='%0.1f%%')
Or swap x= and col= to col='Subject' and x='Gender'.

Show Count and percentage labels for grouped bar chart python

I would like to add count and percentage labels to a grouped bar chart, but I haven't been able to figure it out.
I've seen examples for count or percentage for single bars, but not for grouped bars.
the data looks something like this (not the real numbers):
  age_group  Mis  surv  unk  death  total  surv_pct  death_pct
0       0-9    1     2    0      3      6     100.0        0.0
1     10-19    2     1    0      1      4      99.9        0.0
2     20-29    0     3    0      1      4      99.9        0.0
3     30-39    0     7    1      2     10     100.0        0.0
4     40-49    0     5    0      1      6      99.7        0.3
5     50-59    0     6    0      4     10      99.3        0.3
6     60-69    0     7    1      4     12      98.0        2.0
7     70-79    1     8    2      5     16      92.0        8.0
8       80+    0    10    0      7     17      81.0       19.0
And the chart looks something like this
I created the chart with this code:
ax = df.plot(y=['death', 'surv'],
             kind='barh',
             figsize=(20,9),
             rot=0,
             title= '\n\n surv and deaths by age group')
ax.legend(['Deaths', 'Survivals']);
ax.set_xlabel('\nCount');
ax.set_ylabel('Age Group\n');
How could I add count and percentage labels to the grouped bars? I would like it to look something like this chart
Since nobody else has suggested anything, here is one way to approach it with your dataframe structure.
from matplotlib import pyplot as plt
import pandas as pd
#reading the whitespace-separated sample data
df = pd.read_csv("test.txt", sep=r"\s+")
cat = ['death', 'surv']
ax = df.plot(y=cat,
             kind='barh',
             figsize=(20, 9),
             rot=0,
             title= '\n\n surv and deaths by age group')
#making space for the annotation
xmin, xmax = ax.get_xlim()
ax.set_xlim(xmin, 1.05 * xmax)
#connecting bar series with df columns
for cont, col in zip(ax.containers, cat):
    #connecting each bar of the series with its absolute and relative values
    for rect, vals, perc in zip(cont.patches, df[col], df[col+"_pct"]):
        #annotating each bar
        ax.annotate(f"{vals} ({perc:.1f}%)", (rect.get_width(), rect.get_y() + rect.get_height() / 2.),
                    ha='left', va='center', fontsize=10, color='black', xytext=(3, 0),
                    textcoords='offset points')
ax.set_yticklabels(df.age_group)
ax.set_xlabel('\nCount')
ax.set_ylabel('Age Group\n')
ax.legend(['Deaths', 'Survivals'], loc="lower right")
plt.show()
Sample output:
If the percentages per category add up to 100, one could also calculate them on the fly, so the percentage columns would not need to follow exactly the same naming structure. Another problem is that the font size of the annotation, the scaling that makes room for the label of the largest bar, and the distance between bar and annotation do not adapt automatically and may need fine-tuning.
However, I am not fond of this mixing of pandas and matplotlib plotting functions. I had cases where the axis definition by pandas interfered with matplotlib, and datetime objects ... well, let's not talk about that.
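The on-the-fly variant mentioned above could look roughly like the following sketch. It assumes that the percentage is each category's share of death + surv within an age group (which may not be how surv_pct and death_pct were defined in the question), uses made-up counts, and relies on Axes.bar_label, available in matplotlib 3.4 and later, instead of annotate.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts, not the real data
df = pd.DataFrame({
    "age_group": ["0-9", "10-19", "20-29"],
    "surv": [2, 1, 3],
    "death": [3, 1, 1],
})
cat = ["death", "surv"]
ax = df.plot(x="age_group", y=cat, kind="barh", rot=0)

# Assumed definition: share of each category within its age group
pct = df[cat].div(df[cat].sum(axis=1), axis=0).mul(100)

bar_texts = {}
# one bar container per plotted column, in the order given by cat
for cont, col in zip(ax.containers, cat):
    bar_texts[col] = [f"{v} ({p:.1f}%)" for v, p in zip(df[col], pct[col])]
    ax.bar_label(cont, labels=bar_texts[col], padding=3)

# leave some room on the right for the label of the longest bar
ax.margins(x=0.15)
ax.figure.savefig("count_and_percent.png")
```

Passing x="age_group" to df.plot lets pandas set the tick labels, so the separate set_yticklabels call is not needed in this version.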

Why don't all of the factor variables appear in the legend?

I'm pretty new to plotting using matplotlib and I'm having a few problems with the legends. I have this data set:
      Wavelength       CD  Time
0          250.0  0.00000     1
1          249.8 -0.04278     1
2          249.6 -0.03834     1
3          249.4 -0.02384     1
4          249.2 -0.04817     1
...          ...      ...   ...
3760       200.8  0.99883    15
3761       200.6  0.50277    15
3762       200.4 -0.19228    15
3763       200.2  0.81317    15
3764       200.0  0.90226    15
[3765 rows x 3 columns]
Column types:
Wavelength float64
CD float64
Time int64
dtype: object
Why, when plotted with Time as the categorical variable, are not all of the values shown in the legend?
x = df1['Wavelength']
y = df1['CD']
z = df1['Time']
sns.lineplot(x=x, y=y, hue=z)
plt.tight_layout()
plt.show()
But I can plot using pandas' built-in matplotlib function with a colorbar like this:
df1.plot.scatter('Wavelength', 'CD', c='Time', cmap='RdYlBu')
What's the best way of choosing between discrete and continuous legends using matplotlib/seaborn?
Many thanks!

plot data with different scale on same y axis on subplots

I have a dataframe whose columns are on very different scales, and I am trying to get a plot with subplots, something like this.
raw_data = {'strike_date': ['2019-10-31', '2019-11-31','2019-12-31','2020-01-31', '2020-02-31'],
            'strike': [100.00, 113.00, 125.00, 126.00, 135.00],
            'lastPrice': [42, 32, 36, 18, 23],
            'volume': [4, 24, 31, 2, 3],
            'openInterest': [166, 0, 0, 62, 12]}
ploty_df = pd.DataFrame(raw_data, columns = ['strike_date', 'strike', 'lastPrice', 'volume', 'openInterest'])
ploty_df
  strike_date  strike  lastPrice  volume  openInterest
0  2019-10-31   100.0         42       4           166
1  2019-11-31   113.0         32      24             0
2  2019-12-31   125.0         36      31             0
3  2020-01-31   126.0         18       2            62
4  2020-02-31   135.0         23       3            12
This is what I have tried so far with twinx. As you can see, the output is flat, without any scale difference between strike and volume.
fig, ax = plt.subplots()
fig.subplots_adjust(right=0.75)
mm = ax.twinx()
yy = ax.twinx()
for col in ploty_df.columns:
    mm.plot(ploty_df.index,ploty_df[[col]],label=col)
mm.set_ylabel('volume')
yy.set_ylabel('strike')
yy.spines["right"].set_position(("axes", 1.2))
yy.set_ylim(mm.get_ylim()[0]*12, mm.get_ylim()[1]*12)
plt.tick_params(axis='both', which='major', labelsize=16)
handles, labels = mm.get_legend_handles_labels()
mm.legend(fontsize=14, loc=6)
plt.show()
and the output
The main problem with your script is that you are generating 3 axes but only plotting on one of them. You need to think of each axes as a separate object with its own y-scale, y-limits and so on. For example, when you call fig, ax = plt.subplots() you generate the first axes, which you call ax (this is the standard y-axis with the scale on the left side of your plot). If you want to plot something on this axes you should call ax.plot(), but in your case you are plotting everything on the axes that you called mm.
I think you should really go through the matplotlib documentation to understand these concepts better. For plotting on multiple y-axes I would recommend having a look at this example.
Below you can find a basic example that plots your data on 3 different y-axes. You can take it as a starting point to produce the graph you are looking for.
#convert the index of your dataframe to datetime
ploty_df.index = pd.DatetimeIndex(ploty_df.strike_date)
fig, ax = plt.subplots(figsize=(15,7))
fig.subplots_adjust(right=0.75)
#first axes: scale on the left
l1, = ax.plot(ploty_df['strike'], 'r')
ax.set_ylabel('Strike')
#second axes: scale on the right
ax2 = ax.twinx()
l2, = ax2.plot(ploty_df['lastPrice'], 'g')
ax2.set_ylabel('lastPrice')
#third axes: scale on the right, shifted outwards
ax3 = ax.twinx()
l3, = ax3.plot(ploty_df['volume'], 'b')
ax3.set_ylabel('volume')
ax3.spines["right"].set_position(("axes", 1.2))
ax3.spines["right"].set_visible(True)
ax.legend((l1,l2,l3), ('Strike','lastPrice','volume'), loc='center left')
Here is the result:
P.S. Your example dataframe contains non-existent dates (31st November 2019 and 31st February 2020), so you have to modify those in order to be able to convert the index to datetime.
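For reference, a self-contained variant of the three-axes idea is sketched below. The two impossible dates are replaced with the last valid day of each month, which is an assumption about what was intended, and the loop over axes is just one way to arrange the code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Question data with the invalid dates replaced (assumed month ends)
df = pd.DataFrame({
    "strike_date": ["2019-10-31", "2019-11-30", "2019-12-31", "2020-01-31", "2020-02-29"],
    "strike": [100.0, 113.0, 125.0, 126.0, 135.0],
    "lastPrice": [42, 32, 36, 18, 23],
    "volume": [4, 24, 31, 2, 3],
})
df.index = pd.to_datetime(df["strike_date"])

fig, ax = plt.subplots(figsize=(10, 5))
fig.subplots_adjust(right=0.75)

# One axes per column; each twin gets its own independent y-scale
axes = [ax, ax.twinx(), ax.twinx()]
axes[2].spines["right"].set_position(("axes", 1.2))  # move the third scale outwards

lines = []
for a, col, color in zip(axes, ["strike", "lastPrice", "volume"], ["r", "g", "b"]):
    (line,) = a.plot(df.index, df[col], color=color, label=col)
    a.set_ylabel(col)
    lines.append(line)

# A single legend collecting the lines from all three axes
ax.legend(handles=lines, loc="center left")
fig.savefig("three_y_axes.png")
```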

plot dataframe with two y-axes

I have the following dataframe:
   land_cover          1         2         3         4         5         6       size
0          20  19.558872  6.856950  3.882243  1.743048  1.361306  1.026382  16.520265
1          30   9.499454  3.513521  1.849498  0.836386  0.659660  0.442690   8.652517
2          40  10.173790  3.123167  1.677257  0.860317  0.762718  0.560290  11.925280
3          50  10.098777  1.564575  1.280729  0.894287  0.884028  0.887448  12.647710
4          60   6.166109  1.588687  0.667839  0.230659  0.143044  0.070628   2.160922
5         110  17.846565  3.884678  2.202129  1.040551  0.843709  0.673298  30.406541
I want to plot the data so that:
. land_cover is the x-axis
. cols 1 - 6 should be stacked bar plots per land_cover class (row)
. and the column 'size' should be a second y-axis and could be a simple point symbol for every row and additionally a smooth line connecting the points
Any ideas?
Your code is pretty much fine. I only added two more lines:
import matplotlib.pyplot as plt
df.plot(x="land_cover", y=[1, 2, 3, 4, 5, 6], stacked=True, kind="bar")
ax = df['size'].plot(secondary_y=True, color='k', marker='o')
ax.set_ylabel('size')
plt.show()
In general just add one extra argument to your plot call: secondary_y=['size'].
In this case a separate plot is easier though, because of line vs bars etc.
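A self-contained sketch of the same idea follows. It uses only three of the six value columns, rounded values, and string column names to keep it short; those choices, the explicit ax= argument, and the set_xlim call are assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Shortened, rounded copy of the question's data (string column names assumed)
df = pd.DataFrame({
    "land_cover": [20, 30, 40],
    "1": [19.56, 9.50, 10.17],
    "2": [6.86, 3.51, 3.12],
    "3": [3.88, 1.85, 1.68],
    "size": [16.52, 8.65, 11.93],
})

# Stacked bars on the primary (left) y-axis
ax = df.plot(x="land_cover", y=["1", "2", "3"], kind="bar", stacked=True)

# Line with point markers on a secondary (right) y-axis of the same plot
ax2 = df["size"].plot(ax=ax, secondary_y=True, color="k", marker="o")
ax2.set_ylabel("size")

# Keep the outer bars fully visible in case the line plot tightened the x-limits
ax.set_xlim(-0.5, len(df) - 0.5)
fig = ax.get_figure()
fig.savefig("bars_with_secondary_axis.png")
```

The bars sit at positions 0, 1, 2, ... on the x-axis, so the line lines up with them only because df keeps its default integer index.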
