Background with range on seaborn based on two columns - python

I am trying to add to my several line plots a background that shows a range from value x (column "Min") to value y (column "Max") for each year. My dataset looks like that:
Country Model Year Costs Min Max
494 FR 1 1990 300 250 350
495 FR 1 1995 250 300 400
496 FR 1 2000 220 330 640
497 FR 1 2005 210 289 570
498 FR 2 1990 400 250 350
555 JPN 8 1990 280 250 350
556 JPN 8 1995 240 300 400
557 JPN 8 2000 200 330 640
558 JPN 8 2005 200 289 570
I used the following code:
example_1 = sns.relplot(data=example, x = "Year", y = "Costs", hue = "Model", style = "Model", col = "Country", kind="line", col_wrap=4,height = 4, dashes = True, markers = True, palette = palette, style_order = style_order)
I would like something like this with the range being my "Min" and "Max" by year.
Is it possible to do it?
Thank you very much !

Usually, grid.map is the tool for this, as shown in many examples in the mutli-plot grids tutorial. But you are using relplot to combine lineplot with a FacetGrid as it is suggested in the docs (last example) which lets you use some extra styling parameters.
Because relplot processes the data a bit differently than if you would first initiate a FacetGrid and then map a lineplot (you can check this with grid.data), using grid.map(plt.bar, ...) to plot the ranges is quite cumbersome as it requires editing the grid.data dataframe as well as the x- and y-axis labels.
The simplest way to plot the ranges is to loop through the grid.axes. This can be done with grid.axes_dict.items() which provides the column names (i.e. countries) that you can use to select the appropriate data for the bars (useful if the ranges were to differ, contrary to this example).
The default figure legend does not contain the complete legend including the key for ranges, but the first ax object does so that one displayed instead of the default legend in the following example. Note that I have edited the data you shared so that the min/max ranges make more sense:
import io
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import seaborn as sns # v 0.11.0
data ='''
Country Model Year Costs Min Max
494 FR 1 1990 300 250 350
495 FR 1 1995 250 200 300
496 FR 1 2000 220 150 240
497 FR 1 2005 210 189 270
555 JPN 8 1990 280 250 350
556 JPN 8 1995 240 200 300
557 JPN 8 2000 200 150 240
558 JPN 8 2005 200 189 270
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
# Create seaborn FacetGrid with line plots
grid = sns.relplot(data=df, x='Year', y='Costs', hue='Model', style='Model',height=3.9,
col='Country', kind='line', markers=True, palette='tab10')
# Loop through axes of the FacetGrid to plot bars for ranges and edit x ticks
for country, ax in grid.axes_dict.items():
df_country = df[df['Country'] == country]
cost_range = df_country['Max']-df_country['Min']
ax.bar(x=df_country['Year'], height=cost_range, bottom=df_country['Min'],
color='black', alpha=0.1, label='Min/max\nrange')
ax.set_xticks(df_country['Year'])
# Remove default seaborn figure legend and show instead full legend stored in first ax
grid._legend.remove()
grid.axes.flat[0].legend(bbox_to_anchor=(2.1, 0.5), loc='center left',
frameon=False, title=grid.legend.get_title().get_text());

Related

Problem animating polar plots from measured data

Problem
I'm trying to animate a polar plot from a measured temperature data from a cylinder using the plotly.express command line_polar by using a dataset of 6 radial values (represented by columns #1 - #6) over 10 rows (represented by column Time) distributed over a polar plot. I'm struggling to make it animate and get the following error:
Error
ValueError: All arguments should have the same length. The length of column argument df[animation_frame] is 10, whereas the length of previously-processed arguments ['r', 'theta'] is 6
According to the help for the parameter "animation_frame" it should be specified as following:
animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to animation frames.
I'm a bit stumped with this problem since I don't see why this shouldn't work, since other use cases seem to use multi-dimensional data with the data with equal rows.
Example of polar plot for t=1
Polar plot
Dataset:
Time
#1
#2
#3
#4
#5
#6
1
175
176
179
182
178
173
2
174
175
179
184
178
172
3
175
176
178
183
179
174
4
173
174
178
184
179
174
5
173
174
177
185
180
175
6
173
174
177
185
180
175
7
172
173
176
186
181
176
8
172
173
176
186
181
176
9
171
172
175
187
182
177
10
171
172
175
187
182
177
Code:
import pandas as pd
import plotly.express as px
df = pd.read_excel('TempData.xlsx')
sensor = ["0", "60", "120", "180", "240","300"]
radial_all = ['#1', '#2', '#3', '#4', '#5', '#6']
fig = px.line_polar(df, r=radial_all, theta=sensor, line_close=True,
color_discrete_sequence=px.colors.sequential.Plasma_r, template="plotly_dark", animation_frame="Time")
fig.update_polars(radialaxis_range=[160, 190])
fig.update_polars(radialaxis_rangemode="normal")
fig.update_polars(radialaxis=dict(tickvals = [150, 160, 170, 180, 190, 200]))
Thanks in advance!
I have found the solution to this problem, its also possible with scatterpolar but I recommend line_polar from plotly express, its way more elegant and easy. What you need to do is format the data from wide to long format using the pandas command melt(). This will allow you to correctly walk through the data and match it to the animation steps (in this case "Time" column). See following links for helpful info.
Pandas - reshaping-by-melt
pandas.melt()
Resulting code:
import plotly.express as px
import pandas as pd
df = pd.read_excel('TempData.xlsx')
df_1 = df.melt(id_vars=['Time'], var_name="Sensor", value_name="Temperature",
value_vars=['#1', '#2', '#3', '#4','#5','#6'])
fig = px.line_polar(df_1, r="Temperature", theta="Sensor", line_close=True,
line_shape="linear", direction="clockwise",
color_discrete_sequence=px.colors.sequential.Plasma_r, template="plotly_dark",
animation_frame="Time")
fig.show()
Resulting animating plot

How to create grouped bars charts with matplotlib with data in DataFrame

This is my current output:
Now i want the next bars next to the already plotted bars.
My DataFrame has 3 columns: 'Block', 'Cluster', and 'District'.
'Block' and 'Cluster' contain the numbers for plotting and the grouping is based
on the strings in 'District'.
How can I plot the other bars next to the existing bars?
df=pd.read_csv("main_ds.csv")
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
plt.xticks(rotation=90)
bwidth=0.30
indic1=ax.bar(df["District"],df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"],df["Cluster"], width=bwidth, color='b')
ax.autoscale(tight=False)
def autolabel(rects):
for rect in rects:
h = rect.get_height()
ax.text(rect.get_x()+rect.get_width()/2., 1.05*h, '%d'%int(h),
ha='center', va='top')
autolabel(indic1)
autolabel(indic2)
plt.show()
Data:
District Block Cluster Villages Schools Decadal_Growth_Rate Literacy_Rate Male_Literacy Female_Literacy Primary ... Govt_School Pvt_School Govt_Sch_Rural Pvt_School_Rural Govt_Sch_Enroll Pvt_Sch_Enroll Govt_Sch_Enroll_Rural Pvt_Sch_Enroll_Rural Govt_Sch_Teacher Pvt_Sch_Teacher
0 Dimapur 5 30 278 494 23.2 85.4 88.1 82.5 147 ... 298 196 242 90 33478 57176 21444 18239 3701 3571
1 Kiphire 3 3 94 142 -58.4 73.1 76.5 70.4 71 ... 118 24 118 24 5947 7123 5947 7123 853 261
2 Kohima 5 5 121 290 22.7 85.6 89.3 81.6 128 ... 189 101 157 49 10116 26464 5976 8450 2068 2193
3 Longleng 2 2 37 113 -30.5 71.1 75.6 65.4 60 ... 90 23 90 23 3483 4005 3483 4005 830 293
4 Mon 5 5 139 309 -3.8 56.6 60.4 52.4 165 ... 231 78 219 58 18588 16578 17108 8665 1667 903
5 rows × 26 columns
Try using pandas.DataFrame.plot
import pandas as pd
import numpy as np
from io import StringIO
from datetime import date
import matplotlib.pyplot as plt
def add_value_labels(ax, spacing=5):
for rect in ax.patches:
y_value = rect.get_height()
x_value = rect.get_x() + rect.get_width() / 2
space = spacing
# Vertical alignment for positive values
va = 'bottom'
# If value of bar is negative: Place label below bar
if y_value < 0:
# Invert space to place label below
space *= -1
# Vertically align label at top
va = 'top'
# Use Y value as label and format number with one decimal place
label = "{:.1f}".format(y_value)
# Create annotation
ax.annotate(
label, # Use `label` as label
(x_value, y_value), # Place label at end of the bar
xytext=(0, space), # Vertically shift label by `space`
textcoords="offset points", # Interpret `xytext` as offset in points
ha='center', # Horizontally center label
va=va) # Vertically align label differently for
# positive and negative values.
first3columns = StringIO("""District Block Cluster
Dimapur 5 30
Kiphire 3 3
Kohima 5 5
Longleng 2
Mon 5 5
""")
df_plot = pd.read_csv(first3columns, delim_whitespace=True)
fig, ax = plt.subplots()
#df_plot.set_index(['District'], inplace=True)
df_plot[['Block', 'Cluster']].plot.bar(ax=ax, color=['r', 'b'])
ax.set_xticklabels(df_plot['District'])
add_value_labels(ax)
plt.show()
Try changing
indic1=ax.bar(df["District"],df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"],df["Cluster"], width=bwidth, color='b')
to
indic1=ax.bar(df["District"]-bwidth/2,df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"]+bwidth/2,df["Cluster"], width=bwidth, color='b')

Labels at the end of curves (matplotlib-seaborn) [duplicate]

This question already has answers here:
How to annotate end of lines using python and matplotlib?
(3 answers)
Closed 4 years ago.
I have multiple data frames in this format:
year count cum_sum
2001 5 5
2002 15 20
2003 14 34
2004 21 55
2005 44 99
2006 37 136
2007 55 191
2008 69 260
2009 133 393
2010 94 487
2011 133 620
2012 141 761
2013 206 967
2014 243 1210
2015 336 1546
2016 278 1824
2017 285 2109
2018 178 2287
I have generated a plot as the followig:
enter image description here
The following code has been utilized for this purpose:
fig, ax = plt.subplots(figsize=(12,8))
sns.pointplot(x="year", y="cum_sum", data=china_papers_by_year_sorted, color='red')
sns.pointplot(x="year", y="cum_sum", data=usa_papers_by_year_sorted, color='blue')
sns.pointplot(x="year", y="cum_sum", data=korea_papers_by_year_sorted, color='lightblue')
sns.pointplot(x="year", y="cum_sum", data=japan_papers_by_year_sorted, color='yellow')
sns.pointplot(x="year", y="cum_sum", data=brazil_papers_by_year_sorted, color='green')
ax.set_ylim([0,2000])
ax.set_ylabel("Cumulative frequency")
fig.text(x = 0.91, y = 0.76, s = "China", color = "red", weight = "bold") #Here I have had to indicate manually x and y coordinates
fig.text(x = 0.91, y = 0.72, s = "South Korea", color = "lightblue", weight = "bold") #Here I have had to indicate manually x and y coordinates
plt.show()
The problem is that the method for adding text to the plot is not recognizing the data coordinates. So, I have had to manually indicate the coordinates of the labels of each dataframe (please see "China" and "Korea"). Is there a clever way of doing it? I have seen an example using ".last_valid_index()" method. However, since the data coordinates are not being recognized, it is not working.
You don't need to make repeated calls to pointplot and add labels manually. Instead add a country column to your data frames to indicate the country, combine the data frames and then simply plot cumulative sum vs year using country as the hue.
Instead, do the following:
# Add a country label to dataframe itself
china_papers_by_year_sorted['country'] = 'China'
usa_papers_by_year_sorted['country'] = 'USA'
korea_papers_by_year_sorted['country'] = 'Korea'
japan_papers_by_year_sorted['country'] = 'Japan'
brazil_papers_by_year_sorted['country'] = 'Brazil'
# List of dataframes with same columns
frames = [china_papers_by_year_sorted, usa_papers_by_year_sorted,
korea_papers_by_year_sorted, japan_papers_by_year_sorted,
brazil_papers_by_year_sorted]
# Combine into one dataframe
result = pd.concat(frames)
# Plot.. hue will make country name a label
ax = sns.pointplot(x="year", y="cum_sum", hue="country", data=result)
ax.set_ylim([0,2000])
ax.set_ylabel("Cumulative frequency")
plt.show()
Edit: Editing to add that if you want to annotate the lines themselves instead of using the legend, the answers to this existing question indicate how to annotate end of lines.

Python plot data against date

I have a dataset:
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
How to plot each value against 'yearweek'?
I tried for example:
import matplotlib.pyplot as plt
import pandas as pd
new = pd.DataFrame([df['A'].values, df['yearweek'].values])
plt.plot(new)
but it doesn't work and shows
ValueError: could not convert string to float: '2014-48'
Then I tried this:
plt.scatter(df['Total'], df['yearweek'])
turns out:
ValueError: could not convert string to float: '2015-37'
Is this means the type of yearweek has some problem? How can I fix it?
Or if it's possible to change the index into date?

The best solution I see is to calculate the date from scratch and add it to a new column as a datetime. Then you can plot it easily.
df['date'] = df['yearweek'].map(lambda x: datetime.datetime.strptime(x,"%Y-%W")+datetime.timedelta(days=7*(int(x.split('-')[1])-1)))
df.plot('date','A')
So I start with the first january of the current year and go forward 7*(week-1) days, then generate the date from it.
As of pandas 0.20.X, you can use DataFrame.plot() to generate your required plots. It uses matplotlib under the hood -
import pandas as pd
data = pd.read_csv('Your_Dataset.csv')
data.plot(['yearweek'], ['A'])
Here, yearweek will become the x-axis and A will become the y. Since it's a list, you can use multiple in both cases
Note: If it still doesn't look good then you could go towards parsing the yearweek column correctly into dateformat and try again.

Plotting a histogram in Pandas with very heavy-tailed data

I am often working with data that has a very 'long tail'. I want to plot histograms to summarize the distribution, but when I try to using pandas I wind up with a bar graph that has one giant visible bar and everything else invisible.
Here is an example of the series I am working with. Since it's very long, I used value_counts() so it will fit on this page.
In [10]: data.value_counts.sort_index()
Out[10]:
0 8012
25 3710
100 10794
200 11718
300 2489
500 7631
600 34
700 115
1000 3099
1200 1766
1600 63
2000 1538
2200 41
2500 208
2700 2138
5000 515
5500 201
8800 10
10000 10
10900 465
13000 9
16200 74
20000 518
21500 65
27000 64
53000 82
56000 1
106000 35
530000 3
I'm guessing that the answer involves binning the less common results into larger groups somehow (53000, 56000, 106000, and 53000 into one group of >50000, etc.), and also changing the y index to represent percentages of the occurrence rather than the absolute number. However, I don't understand how I would go about doing that automatically.
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
mydict = {0: 8012,25: 3710,100: 10794,200: 11718,300: 2489,500: 7631,600: 34,700: 115,1000: 3099,1200: 1766,1600: 63,2000: 1538,2200: 41,2500: 208,2700: 2138,5000: 515,5500: 201,8800: 10,10000: 10,10900: 465,13000: 9,16200: 74,20000: 518,21500: 65,27000: 64,53000: 82,56000: 1,106000: 35,530000: 3}
mylist = []
for key in mydict:
for e in range(mydict[key]):
mylist.insert(0,key)
df = pd.DataFrame(mylist,columns=['value'])
df2 = df[df.value <= 5000]
Plot as a bar:
fig = df.value.value_counts().sort_index().plot(kind="bar")
plt.savefig("figure.png")
As a histogram (limited to values 5000 & under which is >97% of your data):
I like using linspace to control buckets.
df2 = df[df.value <= 5000]
df2.hist(bins=np.linspace(0,5000,101))
plt.savefig('hist1')
EDIT: Changed np.linspace(0,5000,100) to np.linspace(0,5000,101) & updated histogram.

Categories

Resources