Sort boxplot and colour by pairs - python

I have some data for conditions that go together by pairs, structured like this:
mydata = {
"WT_before": [11,12,13],
"WT_after": [16,17,18],
"MRE11_before": [21,22,23,24,25],
"MRE11_after": [26,27,28,29,30],
"NBS1_before": [31,32,33,34],
"NBS1_after": [36,37,38,39]
}
(my real data has more conditions and more values per condition, this is just an example)
I looked into colouring the boxplots by pairs to help reading the figure, but it seemed quite convoluted to do in matplotlib.
For the moment I'm doing it this way:
bxplt_labels, bxplt_data = mydata.keys(), mydata.values()
bxplt_colors = ["pink", "pink", "lightgreen", "lightgreen", "lightblue", "lightblue"]
fig2, ax = plt.subplots(figsize=(20, 10), dpi=500)
bplot = plt.boxplot(bxplt_data, vert=False, showfliers=False, notch=False, patch_artist=True,)
for patch, color in zip(bplot['boxes'], bxplt_colors):
    patch.set_facecolor(color)
plt.yticks(range(1, len(bxplt_labels) + 1), bxplt_labels)
fig2.show()
which produces the figure:
I would like:
to sort the condition names, so that I can order them to my choosing, and
to get a more elegant way of choosing the colours used, in particular because I will need to reuse this data for more figures afterwards (like scatterplot before/after for each condition)
If needed, I can rearrange the data structure, but the conditions don't all have the same number of values, so a dictionary seemed like the best option to me. Alternatively, I can use seaborn, which I saw offers quite a few possibilities, but I'm not familiar with it, so I would need more time to understand it.
Could you help me figure this out?

Seaborn works easiest with a dataframe in "long form". In this case, there would be rows with the condition repeated for every value with that condition.
Seaborn's boxplot accepts an order= keyword, where you can change the order of the x-values. E.g. order=sorted(mydata.keys()) to sort the values alphabetically. Or list(mydata.keys())[::-1] to use the original order, but reversed. The default order would be how the values appear in the dataframe.
For a horizontal boxplot, you can use x='value', y='condition'. The order will apply to either x or y, depending on which column contains strings.
For coloring, you can use the palette= keyword. This can be a string naming one of matplotlib's or seaborn's colormaps, or a list of colors. Many more options are possible.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
mydata = {
"WT_before": [11, 12, 13],
"WT_after": [16, 17, 18],
"MRE11_before": [21, 22, 23, 24, 25],
"MRE11_after": [26, 27, 28, 29, 30],
"NBS1_before": [31, 32, 33, 34],
"NBS1_after": [36, 37, 38, 39]
}
df = pd.DataFrame([[k, val] for k, vals in mydata.items() for val in vals],
                  columns=['condition', 'value'])
fig, ax = plt.subplots(figsize=(12, 5))
sns.boxplot(data=df, x='condition', y='value',
            order=['WT_before', 'WT_after', 'MRE11_before', 'MRE11_after', 'NBS1_before', 'NBS1_after'],
            palette='turbo', ax=ax)
plt.tight_layout()
plt.show()
Here is an example with horizontal boxes:
sns.boxplot(data=df, x='value', y='condition', palette='Paired')
sns.despine()
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()
plt.show()
The dataframe would look like:
    condition     value
0   WT_before     11
1   WT_before     12
2   WT_before     13
3   WT_after      16
4   WT_after      17
5   WT_after      18
6   MRE11_before  21
7   MRE11_before  22
8   MRE11_before  23
9   MRE11_before  24
10  MRE11_before  25
11  MRE11_after   26
12  MRE11_after   27
13  MRE11_after   28
14  MRE11_after   29
15  MRE11_after   30
16  NBS1_before   31
17  NBS1_before   32
18  NBS1_before   33
19  NBS1_before   34
20  NBS1_after    36
21  NBS1_after    37
22  NBS1_after    38
23  NBS1_after    39
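To colour by pairs and reuse the same colours in later figures, one option is to derive everything from the condition names. The following is only a sketch: it assumes every name follows a "<group>_<timepoint>" pattern, and the names group, timepoint, group_colors, order and palette are made up for illustration.

```python
import pandas as pd

mydata = {
    "WT_before": [11, 12, 13],
    "WT_after": [16, 17, 18],
    "MRE11_before": [21, 22, 23, 24, 25],
    "MRE11_after": [26, 27, 28, 29, 30],
    "NBS1_before": [31, 32, 33, 34],
    "NBS1_after": [36, 37, 38, 39],
}

# Long-form dataframe: one row per (condition, value) pair
df = pd.DataFrame([[k, val] for k, vals in mydata.items() for val in vals],
                  columns=["condition", "value"])

# Assumption: names look like "<group>_<timepoint>"; split on the last underscore
df[["group", "timepoint"]] = df["condition"].str.rsplit("_", n=1, expand=True)

# One colour per pair (an arbitrary choice), defined once and reused everywhere
group_colors = {"WT": "pink", "MRE11": "lightgreen", "NBS1": "lightblue"}

# Explicit plotting order, and a condition -> colour lookup derived from the groups
order = [f"{group}_{timepoint}"
         for group in group_colors
         for timepoint in ("before", "after")]
palette = {condition: group_colors[condition.rsplit("_", 1)[0]]
           for condition in order}

print(df.head())
print(palette)
```

A dictionary such as palette can be passed as palette=palette, together with order=order, to sns.boxplot; newer seaborn versions may ask for hue= to be set as well. Keeping group and timepoint as separate columns should also make the later before/after figures easier, since they can be used directly for x= and hue=.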

Related

How to align the x position of the dots in seaborn scatterplot to a nested bar plot

I am trying to plot a scatter plot on top of a bar plot using sns.scatterplot() and df.plot(kind='bar'). The figure turns out fine, but it would be even nicer if I could align each scatter point with the bar that has the same label.
I read in the matplotlib documentation that Rectangle has a get_x() method that can "Return the left coordinate of the rectangle".
I wonder if there is a way to assign these coordinates to the scatter points plotted by seaborn?
Code
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(9, 9)
fig.set_dpi(300)
bar_df.plot(kind='bar', ax=ax)
ax2 = ax.twinx()
sns.scatterplot(data=line_df, ax=ax2)
Dataframes
bar_df
year  apple  banana  citrus  ...
2020  12     34      56      78
2025  12     34      56      78
2030  12     34      56      78
2035  12     34      56      78
line_df
year  apple  banana  citrus  ...
2020  23     45      67      89
2025  23     45      67      89
2030  23     45      67      89
2035  23     45      67      89
It'd be really nice if I could make the points in the same vertical line as the bar with the same header;
sns.scatterplot interprets the x-axis as numeric. As such, it doesn't align well with a bar plot, nor does it have a dodge= parameter.
You can use sns.stripplot instead.
Seaborn works easiest with its data in "long form", which can be achieved via pandas pd.melt.
Here is some example code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
bar_df, line_df = pd.read_html('https://stackoverflow.com/questions/73191315')
bar_df_long = bar_df.melt(id_vars='year', var_name='fruit', value_name='bar_value')
line_df_long = line_df.melt(id_vars='year', var_name='fruit', value_name='line_value')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6,6), dpi=300)
sns.barplot(data=bar_df_long, x='year', y='bar_value', hue='fruit', dodge=True, ax=ax)
ax2 = ax.twinx()
sns.stripplot(data=line_df_long, x='year', y='line_value', hue='fruit', dodge=True, jitter=False,
              edgecolor='black', linewidth=1, ax=ax2)
ax2.legend_.remove() # remove the second legend
plt.tight_layout()
plt.show()
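If you would rather keep the pandas bar plot, the idea from the question of reading the bar coordinates can also work. The sketch below is one possible way, not the method used above: it assumes 'year' is set as the index, leaves out the elided "..." column, and places each point at the horizontal centre of its bar (get_x() plus half the width).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

years = [2020, 2025, 2030, 2035]
bar_df = pd.DataFrame({"year": years, "apple": [12] * 4,
                       "banana": [34] * 4, "citrus": [56] * 4}).set_index("year")
line_df = pd.DataFrame({"year": years, "apple": [23] * 4,
                        "banana": [45] * 4, "citrus": [67] * 4}).set_index("year")

fig, ax = plt.subplots(figsize=(6, 6))
bar_df.plot(kind="bar", ax=ax)
ax2 = ax.twinx()

# ax.containers holds one group of bars per plotted column, in column order
for container, column in zip(ax.containers, bar_df.columns):
    # centre of each bar = left edge + half of its width
    centers = [rect.get_x() + rect.get_width() / 2 for rect in container]
    ax2.scatter(centers, line_df[column],
                color=container[0].get_facecolor(),  # same colour as the bars
                edgecolor="black", zorder=3)

fig.tight_layout()
fig.savefig("bars_with_points.png")  # hypothetical output file name
```

This relies on bar_df and line_df having the same columns and the same row order, so it is more fragile than the stripplot version.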

How to plot a line graph for each column

I have a dataset of four years' worth of ACT participation percentages by state entitled 'part_ACT'. Here's a snippet of it:
Index State ACT17 ACT18 ACT19 ACT20
0 Alabama 100 100 100 100
1 Alaska 65 33 38 33
2 Arizona 62 66 73 71
3 Arkansas 100 100 100 100
4 California 31 27 23 19
5 Colorado 100 30 27 25
6 Connecticut 31 26 22 19
I'm trying to produce a line graph with each of the four column headings on the x-axis and their values on the y-axis (1-100). I would prefer to display all of these line graphs into a single figure.
What's the easiest way to do this? I'm fine with Pandas, Matplotlib, Seaborn, or whatever. Thanks much!
One solution is to melt the df and plot with hue
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
    'State': ['A', 'B', 'C', 'D'],
    'x18': sorted(np.random.randint(0, 100, 4)),
    'x19': sorted(np.random.randint(0, 100, 4)),
    'x20': sorted(np.random.randint(0, 100, 4)),
    'x21': sorted(np.random.randint(0, 100, 4)),
})
df_melt = df.melt(id_vars='State', var_name='year')
sns.relplot(
    kind='line',
    data=df_melt,
    x='year', y='value',
    hue='State'
)
Creating a plot is all about the shape of the DataFrame.
One way to accomplish this is by converting the DataFrame from wide to long, with melt, but this isn't necessary.
The primary requirement is to set 'State' as the index.
Plots can be generated directly with df, or df.T (.T is the transpose of the DataFrame).
The OP requests a line plot, but this is discrete data, and the correct way to visualize discrete data is with a bar plot, not a line plot.
pandas v1.2.3, seaborn v0.11.1, and matplotlib v3.3.4
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut'],
        'ACT17': [100, 65, 62, 100, 31, 100, 31],
        'ACT18': [100, 33, 66, 100, 27, 30, 26],
        'ACT19': [100, 38, 73, 100, 23, 27, 22],
        'ACT20': [100, 33, 71, 100, 19, 25, 19]}
df = pd.DataFrame(data)
# set State as the index - this is important
df.set_index('State', inplace=True)
# display(df)
ACT17 ACT18 ACT19 ACT20
State
Alabama 100 100 100 100
Alaska 65 33 38 33
Arizona 62 66 73 71
Arkansas 100 100 100 100
California 31 27 23 19
Colorado 100 30 27 25
Connecticut 31 26 22 19
# display(df.T)
State Alabama Alaska Arizona Arkansas California Colorado Connecticut
ACT17 100 65 62 100 31 100 31
ACT18 100 33 66 100 27 30 26
ACT19 100 38 73 100 23 27 22
ACT20 100 33 71 100 19 25 19
Plot 1
Use pandas.DataFrame.plot
df.T.plot()
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')
# get rid of the ticks between the labels - not necessary
plt.xticks(ticks=range(0, len(df.T)))
plt.show()
Plot 2 & 3
Use pandas.DataFrame.plot with kind='bar' or kind='barh'
The bar plot is much better at conveying the yearly changes in the data, and allows for an easy comparison between states.
df.plot(kind='bar')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
kind='bar'
kind='barh'
Plot 4
Use seaborn.lineplot
Will correctly plot a line plot from a wide dataframe with the columns and index labels.
sns.lineplot(data=df.T)
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

How to properly plot a bar chart in a Pandas dataframe? The x-values keep showing up as the index

I have a small dataframe with student_id, exam_1, exam_2, exam_3, exam_4, and exam_5 as columns. There are 5 students as well for the rows. What I'd like to do is plot a bar graph showing the exam grades of one student aka one specific row, and ultimately doing it for each or a specific student from user input.
For now, though, I'm stuck on how to plot a bar graph for just one specific student.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'student_id': [83838, 16373, 93538, 29383, 58585],
'exam_1': [80, 95, 90, 75, 50],
'exam_2': [60, 92, 88, 85, 40],
'exam_3': [70, 55, 75, 45, 60],
'exam_4': [55, 95, 45, 80, 55],
'exam_5': [91, 35, 92, 90, 75]})
print(df)
Which produces this as output:
student_id exam_1 exam_2 exam_3 exam_4 exam_5
0 83838 80 60 70 55 91
1 16373 95 92 55 95 35
2 93538 90 88 75 45 92
3 29383 75 85 45 80 90
4 58585 50 40 60 55 75
Adding this code below will allow me to select just one specific student ID aka row:
df = df.loc[df['student_id'] == 29383]
print(df)
student_id exam_1 exam_2 exam_3 exam_4 exam_5
3 29383 75 85 45 80 90
From here is where I'd like to plot this particular student's exams in a bar plot.
I tried the code below but it doesn't display it how I'd like. It seems that the index of this particular student is being used for the tick on the x-axis, if you can see the image. It will show '3' with some bar plots around it.
exam_plots_for_29383 = df.plot.bar()
plt.show()
Which will output this bar plot:
Dataframe with bar plot. Looks weird.
I tried to transpose the dataframe, which kind of gets me to what I want. I used this code below:
df = df.T
exam_plots_for_29383_T = df.plot.bar()
plt.show()
But I end up with this as a graph:
Transpose of dataframe with bar plot. Looks weird still.
I'm a bit stuck. I know there's a logical way of properly plotting a bar plot from the dataframe, I just can't for the life of me figure it out.
I'd like the bar plot to have:
Exams 1 through 5 show up on the x-axis.
Their values on the y-axis.
Each exam bar in separate color.
The legend showing the colors.
I think the last two options are done automatically. It's just the first two that are breaking my brain. I appreciate any help or tips.
Here's the code in full in case anyone would like to see it without it being split like above.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'student_id': [83838, 16373, 93538, 29383, 58585],
'exam_1': [80, 95, 90, 75, 50],
'exam_2': [60, 92, 88, 85, 40],
'exam_3': [70, 55, 75, 45, 60],
'exam_4': [55, 95, 45, 80, 55],
'exam_5': [91, 35, 92, 90, 75]})
print(df)
df = df.loc[df['student_id'] == 29383]
print(df)
exam_plots_for_29383 = df.plot.bar()
plt.show()
df = df.T
exam_plots_for_29383_T = df.plot.bar()
plt.show()
You are very close. The issue is that your numeric-like student ID is messing up all of the plots (which is why ID 29383 is giving you a bar close to 30,000 in all of your graphs).
Set 'student_id' as the index so that it doesn't get plotted. You can then plot each student separately by selecting a row with .loc[student_id], or plot the entire DataFrame to show all students at once.
df = df.set_index('student_id')
df.loc[29383].plot(kind='bar', figsize=(4,3), rot=30)
Knowing there are 5 exams you can give each its own color if you really want. Use a categorical color palette (tab10). (This also only works with Series.plot)
from matplotlib import cm
df.loc[29383].plot(kind='bar', figsize=(4,3), rot=30, color=cm.tab10.colors[0:5])
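Since the goal is eventually to plot any student chosen by the user, the same idea can be wrapped in a small function. This is only a sketch: plot_student is a made-up helper name, and the title and axis label are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'student_id': [83838, 16373, 93538, 29383, 58585],
                   'exam_1': [80, 95, 90, 75, 50],
                   'exam_2': [60, 92, 88, 85, 40],
                   'exam_3': [70, 55, 75, 45, 60],
                   'exam_4': [55, 95, 45, 80, 55],
                   'exam_5': [91, 35, 92, 90, 75]}).set_index('student_id')


def plot_student(scores, student_id):
    """Bar plot of one student's exam grades (hypothetical helper)."""
    if student_id not in scores.index:
        raise KeyError(f"unknown student_id: {student_id}")
    # one tab10 colour per exam column
    colors = list(plt.get_cmap("tab10").colors[:scores.shape[1]])
    ax = scores.loc[student_id].plot(kind='bar', figsize=(4, 3), rot=30, color=colors)
    ax.set_title(f"Student {student_id}")
    ax.set_ylabel("grade")
    return ax


ax = plot_student(df, 29383)
plt.tight_layout()
```

For user input, something like plot_student(df, int(input("Student ID: "))) should work, with the KeyError caught if the ID does not exist.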

plot data with different scale on same y axis on subplots

I have a dataframe whose columns are on very different scales, and I am trying to get a plot with subplots, something like this.
raw_data = {'strike_date': ['2019-10-31', '2019-11-31','2019-12-31','2020-01-31', '2020-02-31'],
'strike': [100.00, 113.00, 125.00, 126.00, 135.00],
'lastPrice': [42, 32, 36, 18, 23],
'volume': [4, 24, 31, 2, 3],
'openInterest': [166, 0, 0, 62, 12]}
ploty_df = pd.DataFrame(raw_data, columns = ['strike_date', 'strike', 'lastPrice', 'volume', 'openInterest'])
ploty_df
strike_date strike lastPrice volume openInterest
0 2019-10-31 100.0 42 4 166
1 2019-11-31 113.0 32 24 0
2 2019-12-31 125.0 36 31 0
3 2020-01-31 126.0 18 2 62
4 2020-02-31 135.0 23 3 12
This is what I have tried so far with twinx. As you can see, the output is flat, without any scale difference between strike and volume.
fig, ax = plt.subplots()
fig.subplots_adjust(right=0.75)
mm = ax.twinx()
yy = ax.twinx()
for col in ploty_df.columns:
    mm.plot(ploty_df.index,ploty_df[[col]],label=col)
mm.set_ylabel('volume')
yy.set_ylabel('strike')
yy.spines["right"].set_position(("axes", 1.2))
yy.set_ylim(mm.get_ylim()[0]*12, mm.get_ylim()[1]*12)
plt.tick_params(axis='both', which='major', labelsize=16)
handles, labels = mm.get_legend_handles_labels()
mm.legend(fontsize=14, loc=6)
plt.show()
and the output
The main problem with your script is that you are generating 3 axes but only plotting on one of them. You need to think of each axes as a separate object with its own y-scale, y-limits and so on. For example, when you call fig, ax = plt.subplots() you generate the first axes, which you call ax (this is the standard y-axis with the scale on the left side of your plot). If you want to plot something on this axes you should call ax.plot(), but in your case you are plotting everything on the axes that you called mm.
I think you should really go through the matplotlib documentation to understand these concepts better. For plotting on multiple y-axes I would recommend having a look at this example.
Below you can find a basic example that plots your data on 3 different y-axes; you can take it as a starting point to produce the graph you are looking for.
# convert the index of your dataframe to datetime
ploty_df.index = pd.DatetimeIndex(ploty_df.strike_date)
fig, ax = plt.subplots(figsize=(15,7))
fig.subplots_adjust(right=0.75)
l1, = ax.plot(ploty_df['strike'], 'r')
ax.set_ylabel('Strike')
ax2 = ax.twinx()
l2, = ax2.plot(ploty_df['lastPrice'], 'g')
ax2.set_ylabel('lastPrice')
ax3 = ax.twinx()
l3, = ax3.plot(ploty_df['volume'], 'b')
ax3.set_ylabel('volume')
ax3.spines["right"].set_position(("axes", 1.2))
ax3.spines["right"].set_visible(True)
ax.legend((l1, l2, l3), ('Strike', 'lastPrice', 'volume'), loc='center left')
here the result:
P.S. Your example dataframe contains nonexistent dates (31 February 2020, and also 31 November 2019), so you have to fix those before you can convert the index to datetime.
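One way to spot the nonexistent dates before converting the index is to let pandas turn anything it cannot parse into NaT. The sketch below also rebuilds the dates as month ends, which is only a guess at what was intended; the variable names are illustrative.

```python
import pandas as pd

raw = pd.Series(['2019-10-31', '2019-11-31', '2019-12-31', '2020-01-31', '2020-02-31'])

# errors='coerce' turns unparseable dates into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
invalid = raw[parsed.isna()].tolist()
print(invalid)

# Assumption: each entry was meant to be the last day of its month,
# so keep only "YYYY-MM" and roll forward to the month end
month_end = pd.to_datetime(raw.str[:7], format='%Y-%m') + pd.offsets.MonthEnd(0)
print(month_end.dt.strftime('%Y-%m-%d').tolist())
```

The resulting month_end values can then be used as the DatetimeIndex in the example above.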

plot dataframe with two y-axes

I have the following dataframe:
land_cover 1 2 3 4 5 6 size
0 20 19.558872 6.856950 3.882243 1.743048 1.361306 1.026382 16.520265
1 30 9.499454 3.513521 1.849498 0.836386 0.659660 0.442690 8.652517
2 40 10.173790 3.123167 1.677257 0.860317 0.762718 0.560290 11.925280
3 50 10.098777 1.564575 1.280729 0.894287 0.884028 0.887448 12.647710
4 60 6.166109 1.588687 0.667839 0.230659 0.143044 0.070628 2.160922
5 110 17.846565 3.884678 2.202129 1.040551 0.843709 0.673298 30.406541
I want to plot the data in the way that:
. land_cover is the x-axis
. cols 1 - 6 should be stacked bar plots per land_cover class (row)
. and the column 'size' should be a second y-axis and could be a simple point symbol for every row and additionally a smooth line connecting the points
Any ideas?
Your code is pretty much fine. I only added two more lines.
import matplotlib.pyplot as plt
df.plot(x="land_cover", y=[1, 2, 3, 4, 5, 6], stacked=True, kind="bar")
ax = df['size'].plot(secondary_y=True, color='k', marker='o')
ax.set_ylabel('size')
plt.show()
In general just add one extra argument to your plot call: secondary_y=['size'].
In this case a separate plot is easier though, because of line vs bars etc.
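For reference, here is a self-contained version of the same approach. It is a sketch under a few assumptions: the values are rounded from the question, the left axis label is made up, and the explicit set_xlim call is only a precaution in case the line plot narrows the x-range. It does not smooth the line.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# values rounded from the question's dataframe
df = pd.DataFrame({
    "land_cover": [20, 30, 40, 50, 60, 110],
    1: [19.56, 9.50, 10.17, 10.10, 6.17, 17.85],
    2: [6.86, 3.51, 3.12, 1.56, 1.59, 3.88],
    3: [3.88, 1.85, 1.68, 1.28, 0.67, 2.20],
    4: [1.74, 0.84, 0.86, 0.89, 0.23, 1.04],
    5: [1.36, 0.66, 0.76, 0.88, 0.14, 0.84],
    6: [1.03, 0.44, 0.56, 0.89, 0.07, 0.67],
    "size": [16.52, 8.65, 11.93, 12.65, 2.16, 30.41],
})

# stacked bars on the left y-axis; bars sit at positions 0..n-1
ax = df.plot(x="land_cover", y=[1, 2, 3, 4, 5, 6], stacked=True, kind="bar")
ax.set_ylabel("columns 1-6 (stacked)")  # hypothetical label

# 'size' on a secondary y-axis; the default 0..n-1 index lines up with the bars
ax2 = df["size"].plot(secondary_y=True, color="k", marker="o", ax=ax)
ax2.set_ylabel("size")

ax.set_xlim(-0.5, len(df) - 0.5)  # keep the first and last bars fully visible
plt.tight_layout()
```

The alignment works because the dataframe has the default integer index; with a different index, the points would need to be plotted against range(len(df)) instead.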
