Labels at the end of curves (matplotlib-seaborn) [duplicate] - python

This question already has answers here:
How to annotate end of lines using python and matplotlib?
(3 answers)
Closed 4 years ago.
I have multiple data frames in this format:
year count cum_sum
2001 5 5
2002 15 20
2003 14 34
2004 21 55
2005 44 99
2006 37 136
2007 55 191
2008 69 260
2009 133 393
2010 94 487
2011 133 620
2012 141 761
2013 206 967
2014 243 1210
2015 336 1546
2016 278 1824
2017 285 2109
2018 178 2287
I generated a plot with the following code:
fig, ax = plt.subplots(figsize=(12,8))
sns.pointplot(x="year", y="cum_sum", data=china_papers_by_year_sorted, color='red')
sns.pointplot(x="year", y="cum_sum", data=usa_papers_by_year_sorted, color='blue')
sns.pointplot(x="year", y="cum_sum", data=korea_papers_by_year_sorted, color='lightblue')
sns.pointplot(x="year", y="cum_sum", data=japan_papers_by_year_sorted, color='yellow')
sns.pointplot(x="year", y="cum_sum", data=brazil_papers_by_year_sorted, color='green')
ax.set_ylim([0,2000])
ax.set_ylabel("Cumulative frequency")
fig.text(x = 0.91, y = 0.76, s = "China", color = "red", weight = "bold") #Here I have had to indicate manually x and y coordinates
fig.text(x = 0.91, y = 0.72, s = "South Korea", color = "lightblue", weight = "bold") #Here I have had to indicate manually x and y coordinates
plt.show()
The problem is that fig.text does not interpret its x and y values as data coordinates, so I had to set the position of each label manually (see "China" and "South Korea"). Is there a cleverer way to do this? I have seen an example that uses the ".last_valid_index()" method, but since the data coordinates are not recognized, it does not work here.

You don't need to make repeated calls to pointplot and add labels manually. Instead, add a country column to each data frame, combine the data frames, and then plot cumulative sum vs. year using country as the hue:
# Add a country label to dataframe itself
china_papers_by_year_sorted['country'] = 'China'
usa_papers_by_year_sorted['country'] = 'USA'
korea_papers_by_year_sorted['country'] = 'Korea'
japan_papers_by_year_sorted['country'] = 'Japan'
brazil_papers_by_year_sorted['country'] = 'Brazil'
# List of dataframes with same columns
frames = [china_papers_by_year_sorted, usa_papers_by_year_sorted,
korea_papers_by_year_sorted, japan_papers_by_year_sorted,
brazil_papers_by_year_sorted]
# Combine into one dataframe
result = pd.concat(frames)
# Plot.. hue will make country name a label
ax = sns.pointplot(x="year", y="cum_sum", hue="country", data=result)
ax.set_ylim([0,2000])
ax.set_ylabel("Cumulative frequency")
plt.show()
Edit: if you want to annotate the lines themselves instead of using the legend, the answers to the existing question linked at the top (How to annotate end of lines using python and matplotlib?) show how to annotate the end of each line.
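As a rough illustration of annotating the end of each line in data coordinates, here is a minimal sketch. It uses plain matplotlib lines rather than sns.pointplot, and the two small data frames are made-up stand-ins for the per-country frames in the question, so treat the names and numbers as assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for the per-country data frames in the question
china = pd.DataFrame({"year": [2016, 2017, 2018], "cum_sum": [1824, 2109, 2287]})
korea = pd.DataFrame({"year": [2016, 2017, 2018], "cum_sum": [900, 1000, 1100]})

fig, ax = plt.subplots(figsize=(12, 8))
labels = []
for name, frame, color in [("China", china, "red"),
                           ("South Korea", korea, "lightblue")]:
    ax.plot(frame["year"], frame["cum_sum"], marker="o", color=color)
    last = frame.iloc[-1]  # last row = right-hand end of the curve
    # xy is in data coordinates; the text is shifted 5 points to the right
    labels.append(ax.annotate(name, xy=(last["year"], last["cum_sum"]),
                              xytext=(5, 0), textcoords="offset points",
                              color=color, weight="bold", va="center"))
ax.set_ylabel("Cumulative frequency")
```

Call plt.show() to display the figure. Note that sns.pointplot treats x as categorical by default, so on a pointplot axis the x position of the last point would be its index (len(frame) - 1) rather than the year itself.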

Related

How to annotate grouped bars with group count instead of bar height

I am using seaborn to draw the plot. My code is below:
import seaborn as sns
sns.set_theme(style="whitegrid")
tips = sns.load_dataset("tips")
tips=tips.head()
ax = sns.barplot(x="day", y="total_bill",hue="sex", data=tips, palette="tab20_r")
I want to label each bar with its frequency, that is, the number of rows it represents.
To add labels to the bars, I used the code below:
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = "{:.0f}".format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
With the above code I can display the height of each bar, but I don't want the height. I want the frequency/count of rows in each group. In the example above (tips.head()), 3 males and 2 females gave a tip on Sunday, so the bars should display 3 and 2 rather than the amount.
Below is the full code:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
df = sns.load_dataset("tips")
ax = sns.barplot(x='day', y='tip',hue="sex", data=df, palette="tab20_r")
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = "{:.0f}".format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
How to display custom values on a bar plot does not clearly show how to annotate grouped bars, nor does it show how to determine the frequency of each hue category for each day.
How to plot and annotate grouped bars in seaborn / matplotlib shows how to annotate grouped bars, but not with custom labels.
for rect in ax.patches is an obsolete way to annotate bars. Use matplotlib.pyplot.bar_label, as fully described in How to add value labels on a bar chart.
Use pandas.crosstab or pandas.DataFrame.groupby to calculate the count of each category by the hue group.
As tips.info() shows, several columns have a category Dtype, which ensures the plotting order and is why the tp.index and tp.columns order matches the x-axis and hue order of ax. Use pandas.Categorical to set a column to a category Dtype.
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
import pandas as pd
import seaborn as sns
# load the data
tips = sns.load_dataset('tips')
# determine the number of each gender for each day
tp = pd.crosstab(tips.day, tips.sex)
# or use groupby
# tp = tips.groupby(['day', 'sex']).sex.count().unstack('sex')
# plot the data
ax = sns.barplot(x='day', y='total_bill', hue='sex', data=tips)
# move the legend if needed
sns.move_legend(ax, bbox_to_anchor=(1, 1.02), loc='upper left', frameon=False)
# iterate through each group of bars, zipped to the corresponding column name
for c, col in zip(ax.containers, tp):
    # add bar labels with custom annotation values
    ax.bar_label(c, labels=tp[col], padding=3, label_type='center')
DataFrame Views
tips
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null category
3 smoker 244 non-null category
4 day 244 non-null category
5 time 244 non-null category
6 size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
tp
sex Male Female
day
Thur 30 32
Fri 10 9
Sat 59 28
Sun 58 18
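To illustrate the point about pandas.Categorical, here is a small sketch with made-up rows (not the tips dataset). It assumes plain string columns and shows how setting the category order controls the row and column order of the crosstab, which would otherwise be alphabetical.

```python
import pandas as pd

# Small made-up sample with plain string columns
df = pd.DataFrame({
    "day": ["Sun", "Thur", "Sun", "Sat", "Thur", "Sun"],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Male"],
})

# Set an explicit category order instead of the default alphabetical order
df["day"] = pd.Categorical(df["day"], categories=["Thur", "Sat", "Sun"], ordered=True)
df["sex"] = pd.Categorical(df["sex"], categories=["Male", "Female"])

# Count the rows for each (day, sex) pair; index and columns follow the category order
counts = pd.crosstab(df["day"], df["sex"])
print(counts)
```

With the categories set this way, the index and columns of counts line up with the x-axis and hue order that seaborn would use for the same columns.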

The median is not matching and going out of the box plot in SNS

I have a problem with my box plot and I can't figure out how to make it work. It works for other data frames in my code, but not for this one. There are no NaN values or zeros in the data frame, so I don't know what the cause might be.
combined_dfs1 = pd.DataFrame ({'GE 3.2 130': df_GE_32_130_mean,
'GE 3.4 130': df_GE_34_130_mean,
'GE 5.3 158': df_GE_53_158_mean,
'Nordex N131': df_Nordex_N131_mean,
'SG 4.1 142': df_SG_41_142_mean,
'SG 4.5 145': df_SG_45_145_mean,
'Vestas V90': df_Vestas_V90_mean,
'Vestas V100': df_Vestas_V100_mean,
'Vestas V110': df_Vestas_V110_mean,
'Vestas V136': df_Vestas_V136_mean,
'Vestas V150': df_Vestas_V150_mean})
box_plot = sns.boxplot(data = combined_dfs1)
ax = box_plot.axes
lines = ax.get_lines()
categories = ax.get_xticks()
for cat in categories:
    # every 4th line at the interval of 6 is median line
    # 0 -> p25 1 -> p75 2 -> lower whisker 3 -> upper whisker 4 -> p50 5 -> upper extreme value
    y = round(lines[4+cat*6].get_ydata()[0],1)
    ax.text(
        cat,
        y,
        f'{y}',
        ha='center',
        va='center',
        fontweight='semibold',
        size=10,
        color='black',
        bbox=dict(facecolor='none', edgecolor='none'))
box_plot.set(xlabel="Different wind mills", ylabel="Annually averaged wind capacity factors")
plt.title("Annually averaged wind capacity factors over the different \n wind mills suppliers with the most CF")
plt.setp(ax.get_xticklabels(), rotation=30)
box_plot.figure.tight_layout()
Each of the mean data frames looks similar to this:
time CF
2005 0.2627484028342433
2006 0.2824635223048327
2007 0.2675500697026022
2008 0.2911540057636888
2009 0.2762751475865262
2010 0.28497178538390383
2011 0.27058531492866256
2012 0.26133452904420273
2013 0.27250318397591755
2014 0.2781640652098508
2015 0.2671843141797961
2016 0.27489229696758455
2017 0.283594425087108
2018 0.267121473536955
2019 0.2582981070723493
2020 0.2753802263802264
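One way to avoid depending on the order of the line artists is to compute the medians from the data and place the labels at those values. The sketch below is only an illustration under assumptions: the numbers are made up, each column is assumed to hold the capacity factors of one turbine, and it uses matplotlib's ax.boxplot with positions 0, 1, ... to mimic the tick positions that sns.boxplot(data=...) uses.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up capacity factors; two of the column names are taken from the question
combined = pd.DataFrame({
    "GE 3.2 130": [0.26, 0.28, 0.27, 0.29],
    "Vestas V90": [0.21, 0.25, 0.22, 0.24],
})

# Median of each column, computed from the data rather than read off the plot
medians = combined.median().round(3)

fig, ax = plt.subplots()
positions = list(range(len(combined.columns)))
ax.boxplot(combined.values, positions=positions)
ax.set_xticks(positions)
ax.set_xticklabels(combined.columns, rotation=30)

# Write each median at its box position
for pos, (name, value) in zip(positions, medians.items()):
    ax.text(pos, value, f"{value}", ha="center", va="center",
            fontweight="semibold", size=10, color="black")
```

Because the label values come from combined.median(), they stay correct even if the number or order of line artists per box differs between library versions.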

could not convert string to float: '12-31'

I need to plot 2 lines with minimum and maximum temperature per day
My dataframe looks like this:
Date min max min2015 max2015
0 01-01 -160 156 -133 11
1 01-02 -267 139 -122 39
2 01-03 -267 133 -67 39
3 01-04 -261 106 -88 44
I formatted the Date column with day and month only, without the year, because the values are the min and max temperatures for each day across 2004-2014. That is why the year is not present.
I tried to plot it like this:
fig, ax = plt.subplots(figsize=(10, 6))
axb = ax.twinx()
# Same as above
ax.set_xlabel('Date')
ax.set_ylabel('Temp')
ax.set_title('Min and Max temperature 2004-2014')
ax.grid(True)
# Plotting on the first y-axis
ax.plot(new_df.Date, new_df['min'], color='tab:orange', label='Min')
ax.plot(new_df.Date, new_df['max'], color='tab:olive', label='Max')
But I get this:
ValueError: could not convert string to float: '12-31'
The 'plot' function cannot handle the string values in the Date column; you can use the 'plot_date' function instead.
plt.xlabel('Date')
plt.ylabel('Temp')
plt.title('Min and Max temperature 2004-2014')
plt.grid(True)
plt.plot_date(['01-01', '01-02', '01-03'], [13, 15, 12], color='tab:orange', label='Min')
plt.show()
You should then get a plot with the dates along the x-axis.
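An alternative that may be worth trying is to convert the month-day strings to real datetimes before plotting and then format the ticks without the year. This is a minimal sketch, not the answerer's method: the placeholder year 2000 is an arbitrary assumption (chosen because it is a leap year, so "02-29" would also parse), and the data frame repeats the first rows of the question's sample.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# First rows of the sample data from the question
new_df = pd.DataFrame({
    "Date": ["01-01", "01-02", "01-03", "01-04"],
    "min": [-160, -267, -267, -261],
    "max": [156, 139, 133, 106],
})

# Prepend an arbitrary placeholder year so the strings parse as dates
dates = pd.to_datetime("2000-" + new_df["Date"], format="%Y-%m-%d")

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(dates, new_df["min"], color="tab:orange", label="Min")
ax.plot(dates, new_df["max"], color="tab:olive", label="Max")

# Show only month and day on the ticks, hiding the placeholder year
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))
ax.set_xlabel("Date")
ax.set_ylabel("Temp")
ax.set_title("Min and Max temperature 2004-2014")
ax.grid(True)
ax.legend()
```

Call plt.show() to display the figure. With real datetimes on the x-axis, matplotlib can also pick sensible tick spacing for a full year of daily values.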

Background with range on seaborn based on two columns

I am trying to add a background to several line plots that shows a range from value x (column "Min") to value y (column "Max") for each year. My dataset looks like this:
Country Model Year Costs Min Max
494 FR 1 1990 300 250 350
495 FR 1 1995 250 300 400
496 FR 1 2000 220 330 640
497 FR 1 2005 210 289 570
498 FR 2 1990 400 250 350
555 JPN 8 1990 280 250 350
556 JPN 8 1995 240 300 400
557 JPN 8 2000 200 330 640
558 JPN 8 2005 200 289 570
I used the following code:
example_1 = sns.relplot(data=example, x = "Year", y = "Costs", hue = "Model", style = "Model", col = "Country", kind="line", col_wrap=4,height = 4, dashes = True, markers = True, palette = palette, style_order = style_order)
I would like a shaded background range for each year, bounded by my "Min" and "Max" columns.
Is it possible to do it?
Thank you very much !
Usually, grid.map is the tool for this, as shown in many examples in the multi-plot grids tutorial. But you are using relplot to combine lineplot with a FacetGrid, as suggested in the docs (last example), which lets you use some extra styling parameters.
Because relplot processes the data a bit differently than if you would first initiate a FacetGrid and then map a lineplot (you can check this with grid.data), using grid.map(plt.bar, ...) to plot the ranges is quite cumbersome as it requires editing the grid.data dataframe as well as the x- and y-axis labels.
The simplest way to plot the ranges is to loop through the grid.axes. This can be done with grid.axes_dict.items() which provides the column names (i.e. countries) that you can use to select the appropriate data for the bars (useful if the ranges were to differ, contrary to this example).
The default figure legend does not include the key for the ranges, but the legend of the first ax object does, so that one is displayed instead of the default legend in the following example. Note that I have edited the data you shared so that the min/max ranges make more sense:
import io
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import seaborn as sns # v 0.11.0
data ='''
Country Model Year Costs Min Max
494 FR 1 1990 300 250 350
495 FR 1 1995 250 200 300
496 FR 1 2000 220 150 240
497 FR 1 2005 210 189 270
555 JPN 8 1990 280 250 350
556 JPN 8 1995 240 200 300
557 JPN 8 2000 200 150 240
558 JPN 8 2005 200 189 270
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')  # delim_whitespace is deprecated in newer pandas
# Create seaborn FacetGrid with line plots
grid = sns.relplot(data=df, x='Year', y='Costs', hue='Model', style='Model',height=3.9,
col='Country', kind='line', markers=True, palette='tab10')
# Loop through axes of the FacetGrid to plot bars for ranges and edit x ticks
for country, ax in grid.axes_dict.items():
    df_country = df[df['Country'] == country]
    cost_range = df_country['Max']-df_country['Min']
    ax.bar(x=df_country['Year'], height=cost_range, bottom=df_country['Min'],
           color='black', alpha=0.1, label='Min/max\nrange')
    ax.set_xticks(df_country['Year'])
# Remove default seaborn figure legend and show instead full legend stored in first ax
grid._legend.remove()
grid.axes.flat[0].legend(bbox_to_anchor=(2.1, 0.5), loc='center left',
frameon=False, title=grid.legend.get_title().get_text());
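If a continuous shaded band is preferred over separate bars, ax.fill_between is one possible alternative. The sketch below is an assumption-laden illustration on a single plain matplotlib axis, using the FR rows from the edited sample data above; inside the loop over grid.axes_dict.items(), the same call could be applied to each ax.

```python
import matplotlib.pyplot as plt
import pandas as pd

# FR rows from the edited sample data in the answer
fr = pd.DataFrame({
    "Year": [1990, 1995, 2000, 2005],
    "Costs": [300, 250, 220, 210],
    "Min": [250, 200, 150, 189],
    "Max": [350, 300, 240, 270],
})

fig, ax = plt.subplots()
ax.plot(fr["Year"], fr["Costs"], marker="o", label="Costs")

# Shade the area between the Min and Max columns for each year
band = ax.fill_between(fr["Year"], fr["Min"], fr["Max"],
                       color="black", alpha=0.1, label="Min/max range")
ax.set_xticks(fr["Year"])
ax.legend()
```

The band interpolates linearly between the years, so it suits ranges that are meant to be read as continuous, whereas the bars make clear that the ranges are defined only at the listed years.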

how can I plot with multiple colors based on values on x-axis

Say I have a pandas DataFrame:
Year Delta_T
0 2000 23
1 2001 25
2 2002 22
2 2002 22
4 2004 30
5 2005 21
I want to plot Delta_T as a function of time, with one color for the dates from 2000 to 2003 and another color for 2004 to 2005.
Can someone please tell me how to do this?
I have tried this:
plt.figure()
plt.scatter(delta_T_all.iloc[:,0].pd.Timestamp('2010-04-17'),
pd.Timestamp('2016-01-01'),delta_T_all.iloc[:,1], label= '220-250m'),
plt.scatter(delta_T_all.iloc[:,0].pd.Timestamp('2016-01-01'),
pd.Timestamp('2019-09-14'),delta_T_all.iloc[:,1], label= '220-250m')
plt.xlabel('Time')
plt.setp(plt.gca().get_xticklabels(), rotation=60, ha="right")
plt.ylabel('Delta_T')
plt.legend()
plt.title('Delta_T in function of time')
This is the answer:
plt.figure()
# rows 0-3 hold the years up to 2003, rows 4-5 hold 2004 and 2005
plt.scatter(delta_T_all.iloc[0:4,0],delta_T_all.iloc[0:4,1],color='r',label='2000-2003')
plt.scatter(delta_T_all.iloc[4:6,0],delta_T_all.iloc[4:6,1],color='b',label='2004-2005')
plt.xlabel('Time')
plt.ylabel('Delta_T')
plt.legend()
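Slicing by row position breaks as soon as rows are added or reordered, so selecting by the Year values may be more robust. This is a minimal sketch under assumptions: the data frame is a made-up copy of the sample in the question (without the repeated row), and 2003 is used as the cut-off year.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up copy of the sample data from the question
delta_T_all = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2004, 2005],
    "Delta_T": [23, 25, 22, 30, 21],
})

# Boolean mask: True for rows up to and including 2003
early = delta_T_all["Year"] <= 2003

fig, ax = plt.subplots()
ax.scatter(delta_T_all.loc[early, "Year"], delta_T_all.loc[early, "Delta_T"],
           color="r", label="2000-2003")
ax.scatter(delta_T_all.loc[~early, "Year"], delta_T_all.loc[~early, "Delta_T"],
           color="b", label="2004-2005")
ax.set_xlabel("Time")
ax.set_ylabel("Delta_T")
ax.legend()
```

Call plt.show() to display the figure. The same mask works unchanged if more years are appended to the data frame.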
