I've been trying to work through the code in this function and cannot get my series to show up on my plots. Possibly there is an easier way to do this. In each plot I want to display each of the 7 entities as a time series of one indicator.
I'm struggling with how to group values by both year and country. I am new to Python and data science, so I appreciate any help.
Here is a link to the csv data from the World Bank
https://datacatalog.worldbank.org/search/dataset/0037712
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (14, 7)
raw = pd.read_csv('WDIData.csv')
countries = ['BIH', 'HRV', 'MKD', 'MNE', 'SRB', 'SVN', 'EUU']
colors = {
    'Bosnia and Herzegovina': "#66C2A5",
    'Croatia': "#FA8D62",
    'North Macedonia': "#F7BA20",
    'Montenegro': "#E68AC3",
    'Serbia': "#8D9FCA",
    'Slovenia': "#A6D853",
    'avg. EU': "#CCCCCC"
}
i = 0
df = raw[raw['Country Code'].isin(countries)].copy()
pre_1990 = [str(x) for x in range(1960, 1990)]
df.drop(pre_1990, axis=1, inplace=True)
df = df.rename(columns={'Country Name': 'CountryName', 'Country Code': 'CountryCode', 'Indicator Name': 'IndicatorName', 'Indicator Code': 'IndicatorCode'})
columns = ['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode']
df = pd.melt(df, id_vars=columns, var_name='Year', value_name='Value')
df.dropna(inplace=True)
def plot_indicator(indicators, title=None,
                   xlim=None, ylim=None, xspace=None,
                   loc=0, loc2=0,
                   drop_eu=False, filename=None):
    lines = ['-', '--']
    line_styles = []
    fig, ax = plt.subplots()
    indicators = indicators if isinstance(indicators, list) else [indicators]
    for line, (name, indicator) in zip(lines, indicators):
        ls, = plt.plot(np.nan, linestyle=line, color='#999999')
        line_styles.append([ls, name])
        df_ind = df[(df.IndicatorCode == indicator)]
        group = df_ind.groupby(['CountryName'])
        for country, values in group:
            country_values = values.groupby('Year').mean()
            if country == 'European Union':
                if drop_eu:
                    continue
                ax.plot(country_values, label=country,
                        linestyle='--', color='#666666', linewidth=1, zorder=1)
            elif country_values.shape[0] > 1:
                ax.plot(country_values, label=country, linestyle=line,
                        color=colors[country], linewidth=2.5)
        if line == lines[0]:
            legend = plt.legend(loc=loc)
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    if xlim and xspace:
        ax.set_xticks(np.arange(xlim[0], xlim[1]+1, xspace))
    plt.tight_layout()
    fig.subplots_adjust(top=0.94)
    if title:
        ax.set_title(title)
    else:
        ax.set_title(df_ind.IndicatorName.values[0])
    if len(indicators) > 1:
        plt.legend(*zip(*line_styles), loc=loc2)
        ax.add_artist(legend)
population = [
    ('pop_dens', 'EN.POP.DNST'),      # Population density
    ('rural', 'SP.RUR.TOTL.ZS'),      # Rural population
    ('under14', 'SP.POP.0014.TO.ZS'), # Population, ages 0-14
    ('above65', 'SP.POP.65UP.TO.ZS'), # Population, ages 65 and above
]
for indicator in population:
    plot_indicator(indicator, loc=0, xlim=(1990, 2020))
EDIT
I have rewritten this answer to be clearer and more concise.
This is a clever bit of code! I found the problem: it was with xlim. As the years are strings, not integers, the x-axis is index-based, not integer-based. This means that when you pass the range 1990 to 2020 you are asking for the 1990th to 2020th positions! Obviously, there are nowhere near that many values (only 30 years between 1990 and 2020), so there was no data within that range, hence the blank plot.
If you change the code within the function to ax.set_xlim(xlim[0]-int(df_ind['Year'].min()), xlim[1]-int(df_ind['Year'].min())), then you can pass the year and it will subtract the minimum year to give the appropriate index values. I would also add plt.xticks(rotation=45) underneath to stop the tick labels overlapping.
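To see why string years break set_xlim, here is a minimal, self-contained sketch with toy data (not the WDI frame): matplotlib treats string x-values as categories at positions 0, 1, 2, ..., so the axis limits live near those small positions, far away from 1990-2020.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this demo
import matplotlib.pyplot as plt

# Toy data: years stored as strings, as in the melted WDI frame.
years = [str(y) for y in range(1990, 2000)]
fig, ax = plt.subplots()
ax.plot(years, range(len(years)))

# The categorical axis puts ticks at positions 0..9, not 1990..1999,
# so set_xlim(1990, 2020) would look far past all of the data.
print(ax.get_xlim())  # roughly (-0.45, 9.45)
```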
ALTERNATIVELY!! (this is the option I would choose):
You can simply change the DataFrame column type to integer, then everything you have remains unchanged. Underneath df.dropna(inplace=True) (just before the function), you can add df['Year'] = df['Year'].astype(int), which solves the problem with the non-integer x-axis above.
Once one or the other has been changed, you should be able to see the lines of the plots.
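A minimal sketch of the dtype fix, using a toy frame standing in for the melted df:

```python
import pandas as pd

# After pd.melt, the year column names become string values in 'Year'.
df = pd.DataFrame({'Year': ['1990', '1991', '1992'],
                   'Value': [1.0, 2.0, 3.0]})

# Converting to int restores a numeric x-axis, so xlim=(1990, 2020)
# refers to real year values rather than positional indices.
df['Year'] = df['Year'].astype(int)
print(df['Year'].dtype)  # int64
```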
Related
I use seaborn to make a categorical barplot of a df containing Pearson correlation R-values for 17 vegetation classes, 3 carbon species and 4 regions. I try to recreate a smaller sample df here:
import pandas as pd
import seaborn as sns
import random
import numpy as np
df = pd.DataFrame({
    'veg class': 12*['Tree bl dc', 'Shrubland', 'Grassland'],
    'Pearson R': np.random.uniform(0, 1, 36),
    'Pearson p': np.random.uniform(0, 0.1, 36),
    'carbon': 4*['CO2', 'CO2', 'CO2', 'CO', 'CO', 'CO', 'CO2 corr', 'CO2 corr', 'CO2 corr'],
    'spatial': 9*['SH'] + 9*['larger AU region'] + 9*['AU'] + 9*['SE-AU']
})
#In my original df, the number of vegetation classes where R-values are
#available is not the same for all spatial scales, so I drop random rows
#to make it more similar:
df.drop([11,14,17,20,23,26,28,29,31,32,34,35], inplace=True)
#I added colums indicating where hatching should be
#boolean:
df['significant'] = 1
df.loc[df['Pearson p'] > 0.05, 'significant'] = 0
#string:
df['hatch'] = ''
df.loc[df['Pearson p'] > 0.05, 'hatch'] = 'x'
df.head()
This is my plotting routine:
sns.set(font_scale=2.1)
#Draw a nested barplot by veg class
g = sns.catplot(
    data=df, kind="bar", row="spatial",
    x="veg class", y="Pearson R", hue="carbon",
    ci=None, palette="YlOrBr", aspect=5
)
g.despine(left=True)
g.set_titles("{row_name}")
g.set_axis_labels("", "Pearson R")
g.set(xlabel=None)
g.legend.set_title("")
g.set_xticklabels(rotation = 60)
(The plot looks as follows: seaborn categorical barplot)
The plot is exactly how I would like it, except that now I would like to add hatching (or any kind of distinction) for all bars where the Pearson R value is insignificant, i.e. where the p value is larger than 0.05. I found this stackoverflow entry, but my problem differs from it, as the bars that should be hatched do not appear in a repeating order.
Any hints will be highly appreciated!
To determine which bars to hatch, we iterate over the axes of each facet, loop over the bar containers, read the height of the corresponding patch, compare it against a specified threshold, and then set the hatching and edge color. Please add the following code at the end.
for ax in g.axes.flat:
    for k in range(len(ax.containers)):
        h = ax.patches[k].get_height()
        if h >= 0.8:
            ax.patches[k].set_hatch('*')
            ax.patches[k].set_edgecolor('k')
Edit: The data has been updated to match the actual data, and the code has been modified accordingly. The logic is now conditioned on the value of the hatch column.
for i, ax in enumerate(g.axes.flat):
    s = ax.get_title()
    # query references Python variables with '@', not '#'
    dff = df.query('spatial == @s')
    dff = dff.sort_values('veg class', ascending=False)
    ha = dff['hatch'].tolist()
    p = dff['Pearson R'].tolist()
    print(ha)
    for k in range(len(dff)):
        if ha[k] == 'x':
            ax.patches[k].set_hatch('*')
            ax.patches[k].set_edgecolor('k')
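One detail worth noting: df.query interpolates local Python variables with the @ prefix (e.g. 'spatial == @s'), not #. A minimal check with toy data:

```python
import pandas as pd

# df.query references local Python variables with the '@' prefix.
df = pd.DataFrame({'spatial': ['SH', 'AU', 'SH'], 'v': [1, 2, 3]})
s = 'SH'
sub = df.query('spatial == @s')
print(len(sub))  # 2
```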
I have created the following df, which contains the number of times a specific type of disaster occurred in a given year, and I want to create a graph with multiple lines depicting the changes over time in the number of each disaster per year. Each disaster type would have its own line, so one could see, for example, whether winter storms are decreasing while droughts are increasing.
Currently, I've attempted to define X and y; however, I'm not sure how to group by flood and still add the number per year over time. When this is run, I get a KeyError: 'Start_year' -- which could be because the start year was used as an index, but I've reset it as seen below, which should have taken care of that. Sorry, a bit new at this.
#Number of each type of disaster each year
df_yearly_tcount = df_time.groupby(['Start_year', 'Disaster_Type']).size()
yearly_tcount=pd.DataFrame(df_yearly_tcount)
yearly_tcount.reset_index()
X = yearly_tcount['Start_year']
y = yearly_tcount(['Disaster_type']=='Flood')
plt.plot(X, y, label = 'Flood')
Entire code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import zscore
#Import Dataset
df = pd.read_csv('database.csv')
df_time = (df[['County','Disaster Type','Start Date', 'End Date']][0: :])
#Preprocessing
#Number of NaN values
df_nan = df[['County','Disaster Type','Start Date', 'End Date']].isna().sum()
#NaN values as a percentage as total
df_nan_number = [(df_nan.sum(axis=0)), str((((539/45330)*100))) +'%']
#Remove NaN values
df_time.dropna(subset = ["County", 'End Date'], inplace=True)
#Set Date Format
df_time['Start_Date_A'] = pd.to_datetime(df['Start Date'], format='%m/%d/%Y')
df_time['End_Date_A'] = pd.to_datetime(df['End Date'], format='%m/%d/%Y')
#Create new column == Disaster Length
df_time['Disaster_Length'] = (df_time.Start_Date_A - df_time.End_Date_A).dt.days
#Create new column == start year
df_time['Start_year'] = df_time['Start_Date_A'].dt.year
#Dropped Old Date Formats from df
df_time = df_time.drop(columns=['Start Date', 'End Date'], axis=1)
#Replace 0 day values with 1 to indicate a Disaster length of 1 Day
df_time['Disaster_Length'] = df_time['Disaster_Length'].replace({0:1})
#Replace all values with absolute values so all days are represented as positive numeric values
df_time['Disaster_Length'] = df_time['Disaster_Length'].abs()
# Locating man-made and non 'natural' disasters, sorting Disaster types, and analyzing value counts
df_DTypes= df_time['Disaster Type'].values
df_DTypes=pd.DataFrame(df_DTypes)
df_DType_VCounts=(df_DTypes.value_counts()).sort_values(ascending=True)
df_DType_Natural=(df_DType_VCounts.drop(['Human Cause', 'Chemical', 'Dam/Levee Break', 'Terrorism','Other'],axis=0)).sort_values(ascending=True)
df_time = df_time.rename(columns={'Disaster Type': 'Disaster_Type'})
#Removing non-natural disasters from main df_time
df_time = df_time[(df_time.Disaster_Type != 'Human Cause') & (df_time.Disaster_Type != 'Chemical') & (df_time.Disaster_Type != 'Dam/Levee Break') & (df_time.Disaster_Type != 'Terrorism') & (df_time.Disaster_Type != 'Other') ]
#Resetting index for final df Analysis
df_time.reset_index(drop=True, inplace = True)
#Analysis
#Dataframe with mean disaster length for each year
df_yearly_mean_len = df_time.groupby(['Start_year']).mean()
df_yearly_mean_len.reset_index().plot('Start_year','Disaster_Length')
#Number of disasters declared per year
yearly_dcount = df_time.groupby(['Start_year']).size()
yearly_dcount=pd.DataFrame(yearly_dcount)
yearly_dcount.columns=['Number_of_Disasters']
#Visualizing change in total number of disasters over time
yearly_dcount.reset_index().plot('Start_year','Number_of_Disasters')
#Number of each type of disaster each year
df_yearly_tcount = df_time.groupby(['Start_year', 'Disaster_Type']).size()
yearly_tcount=pd.DataFrame(df_yearly_tcount)
yearly_tcount.reset_index()
X = yearly_tcount['Start_year']
y = yearly_tcount(['Disaster_type']=='Flood')
plt.plot(X, y, label = 'Flood')
Df:

                              0
Start_year  Disaster_Type
1959        Flood             1
1964        Flood           115
1965        Drought          51
            Earthquake        6
            Flood           198
            Hurricane        56
            Storm             6
            Tornado         112
1966        Flood           113
            Tornado           2
            Typhoon           5
1967        Fire             10
            Flood           121
            Hurricane        29
            Tornado          36
            Typhoon           1
1968        Flood            76
            Hurricane        14
            Ice              21
            Tornado          50
            Typhoon           1
1969        Flood           394
            Hurricane        64
            Storm             1
            Tornado          46
1970        Fire              6
            Flood           180
            Hurricane         7
            Storm            17
            Tornado          11
Original Data set
https://www.kaggle.com/fema/federal-disasters
Looks like you are on the right track; a lot of your code and styles are trending in the correct direction. I put your data into a CSV and reset the multi-index. After this it is fairly simple to plot your data.
It may look better with more data, but currently there are multiple outliers and disasters with missing data (1959 and 1964, for example). Furthermore, if you use a line graph, everything is compared against the same y-axis, which can make it difficult to compare low- and high-frequency disasters (e.g. earthquakes vs. floods). You could alternatively plot the percent change, but this doesn't look very good either with the data provided. Lastly, you could use a stacked bar graph instead; personally, I think this looks the best.
How you choose to present your data depends on the goals of your chart, how quantitative or qualitative you want to be, and whether you want to show raw data, such as with a scatter plot. Regardless, here are some graphs and some code that should help.
types = ['Flood', 'Drought', 'Earthquake', 'Hurricane', 'Storm', 'Tornado',
         'Typhoon', 'Fire', 'Ice']

fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(16,14))
axes = axes.flatten()

ax = axes[0]
for i in range(len(types)):
    disaster_df = df[df.Disaster_Type == types[i]]
    ax.plot(disaster_df.Start_year, disaster_df.Size, linewidth=2.5, label=types[i])
ax.legend(ncol=3, edgecolor='w')
[ax.spines[s].set_visible(False) for s in ['top','right']]
ax.set_title('Disasters Raw', fontsize=16, fontweight='bold')

#remove 1959
ax = axes[1]
df2 = df.iloc[1:]
for i in range(len(types)):
    disaster_df = df2[df2.Disaster_Type == types[i]]
    ax.plot(disaster_df.Start_year, disaster_df.Size, linewidth=2.5, label=types[i])
ax.legend(ncol=3, edgecolor='w')
[ax.spines[s].set_visible(False) for s in ['top','right']]
ax.set_title('Remove 1959', fontsize=16, fontweight='bold')

#remove 1964
ax = axes[2]
df2 = df.iloc[2:]
for i in range(len(types)):
    disaster_df = df2[df2.Disaster_Type == types[i]]
    ax.plot(disaster_df.Start_year, disaster_df.Size, linewidth=2.5, label=types[i])
ax.legend(ncol=3, edgecolor='w')
[ax.spines[s].set_visible(False) for s in ['top','right']]
ax.set_title('Remove 1959 and 1964', fontsize=16, fontweight='bold')

#plot percent change
ax = axes[3]
df2 = df.iloc[2:]
for i in range(len(types)):
    disaster_df = df2[df2.Disaster_Type == types[i]]
    ax.plot(disaster_df.Start_year, disaster_df.Size.pct_change(), linewidth=2.5, label=types[i])
ax.legend(ncol=1, edgecolor='w', loc=(1, 0.5))
[ax.spines[s].set_visible(False) for s in ['top','right']]
ax.set_title('Try plotting percent change', fontsize=16, fontweight='bold')

fig, ax = plt.subplots(figsize=(12,8))
df.pivot(index='Start_year', columns='Disaster_Type', values='Size').plot.bar(stacked=True, ax=ax, zorder=3)
ax.legend(ncol=3, edgecolor='w')
[ax.spines[s].set_visible(False) for s in ['top','right', 'left']]
ax.tick_params(axis='both', left=False, bottom=False)
ax.grid(axis='y', dashes=(8,3), color='gray', alpha=0.3)
I have a dataframe that looks like this:
and using the code snippet below, I calculate the species observed each month and their respective count per month (I think only these variables need to be shown):
code:
state_bird_month_sum = state_bird_month_sorted.groupby(['Month','COMMON NAME'])[['OBSERVATION COUNT']].agg('sum')
state_bird_month_sum
that gives me this:
Basically, there can be multiple species observed in a month and each species has a value 'observation count' associated.
I want to make a plot like this where x is 'COMMON NAME', y is 'Month', and the annotated values are 'OBSERVATION COUNT'. In general, the plot should show the species count (cumulative = all observation counts in a month) for each month.
I tried to plot it with seaborn using the code below, but it doesn't work:
import matplotlib.pyplot as plt
import seaborn as sns
flights = state_bird_month_sum.pivot('Month','COMMON NAME', 'OBSERVATION COUNT')
myflights = flights.copy()
arr = flights.values
vmin, vmax = arr.min(), arr.max()
sns.heatmap(myflights, annot=True, fmt="d", vmin=vmin, vmax=vmax)
plt.show()
Considering the dataframe state_bird_month_sum:
df_temp = state_bird_month_sum.pivot_table("OBSERVATION COUNT", "COMMON NAME", "Month")
plt.figure(figsize = (25,15))
plt.title("NAME OF THE GRAPH", size=20)
plot = sns.heatmap(df_temp)
plot.set(xlabel="MONTH", ylabel="COMMON NAME")
plt.show()
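If you also want the cumulative counts written into each cell, as in your own attempt, annot can be passed to heatmap. Here is a sketch with toy stand-in data (the column names match yours; the values are made up). Note that fmt='g' avoids the error that fmt='d' raises when the pivot contains float or NaN cells:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import seaborn as sns

# Toy stand-in for state_bird_month_sum with the same column names.
df = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Feb', 'Feb'],
    'COMMON NAME': ['Robin', 'Crow', 'Robin', 'Crow'],
    'OBSERVATION COUNT': [10, 5, 7, 12],
})
pivoted = df.pivot_table('OBSERVATION COUNT', 'COMMON NAME', 'Month')

# annot=True writes each value into its cell; fmt='g' handles floats.
ax = sns.heatmap(pivoted, annot=True, fmt='g')
```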
I have retail beef ad counts time series data, and I intend to make a stacked line chart showing, on a three-week moving-average basis, the quantity of average ads that grocers posted per store in the previous week. To do so, I aggregated the data for plotting and tried to make the line chart I want. The main motivation comes from the context of the problem and the desired plot. In my attempt, I couldn't get a very nice line chart, because it is not informative enough to understand. I am wondering how I can achieve this goal in matplotlib. Can anyone suggest what I should change in my current attempt? Any thoughts?
reproducible data and current attempt
Here is minimal reproducible data that I used in my current attempt:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from datetime import timedelta, datetime
url = 'https://gist.githubusercontent.com/adamFlyn/96e68902d8f71ad62a4d3cda135507ad/raw/4761264cbd55c81cf003a4219fea6a24740d7ce9/df.csv'
df = pd.read_csv(url, parse_dates=['date'])
df.drop(columns=['Unnamed: 0'], inplace=True)
df_grp = df.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_grp["percentage"] = df_grp.groupby(level=0).apply(lambda x:100 * x / float(x.sum()))
df_grp = df_grp.reset_index(level=[0,1])
for item in df_grp['retail_item'].unique():
    dd = df_grp[df_grp['retail_item'] == item].groupby(['date', 'percentage'])[['number_of_ads']].sum().reset_index(level=[0,1])
    dd['weakly_change'] = dd[['percentage']].rolling(7).mean()
    fig, ax = plt.subplots(figsize=(8, 6), dpi=144)
    sns.lineplot(dd.index, 'weakly_change', data=dd, ax=ax)
    ax.set_xlim(dd.index.min(), dd.index.max())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))
    plt.gcf().autofmt_xdate()
    plt.style.use('ggplot')
    plt.xticks(rotation=90)
    plt.show()
Current Result
but I couldn't get the correct line chart that I expected; I want to reproduce the plot from this site. Is that doable? Any idea?
desired plot
here is the example desired plot that I want to make from this minimal reproducible data:
I don't know what changes I should make to my current attempt to get my desired plot above. Does anyone know a possible way of doing this in matplotlib? What else should I do? Any help would be appreciated. Thanks
Also see How to create a min-max plot by month with fill_between?
See in-line comments for details
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
#################################################################
# setup from question
url = 'https://gist.githubusercontent.com/adamFlyn/96e68902d8f71ad62a4d3cda135507ad/raw/4761264cbd55c81cf003a4219fea6a24740d7ce9/df.csv'
df = pd.read_csv(url, parse_dates=['date'])
df.drop(columns=['Unnamed: 0'], inplace=True)
df_grp = df.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_grp["percentage"] = df_grp.groupby(level=0).apply(lambda x:100 * x / float(x.sum()))
df_grp = df_grp.reset_index(level=[0,1])
#################################################################
# create a month map from long to abbreviated calendar names
month_map = dict(zip(calendar.month_name[1:], calendar.month_abbr[1:]))
# update the month column name
df_grp['month'] = df_grp.date.dt.month_name().map(month_map)
# set month as categorical so they are plotted in the correct order
df_grp.month = pd.Categorical(df_grp.month, categories=month_map.values(), ordered=True)
# use groupby to aggregate min mean and max
dfmm = df_grp.groupby(['retail_item', 'month'])['percentage'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
# create a palette map for line colors
cmap = {'min': 'k', 'max': 'k', 'mean': 'b'}
# iterate through each retail item and plot the corresponding data
for g, d in dfmm.groupby('retail_item'):
    plt.figure(figsize=(7, 4))
    sns.lineplot(x='month', y='vals', hue='mm', data=d, palette=cmap)
    # select only min or max data for fill_between
    y1 = d[d.mm == 'max']
    y2 = d[d.mm == 'min']
    plt.fill_between(x=y1.month, y1=y1.vals, y2=y2.vals, color='gainsboro')
    # add lines for specific years
    for year in [2016, 2018, 2020]:
        data = df_grp[(df_grp.date.dt.year == year) & (df_grp.retail_item == g)]
        sns.lineplot(x='month', y='percentage', ci=None, data=data, label=year)
    plt.ylim(0, 100)
    plt.margins(0, 0)
    plt.legend(bbox_to_anchor=(1., 1), loc='upper left')
    plt.ylabel('Percentage of Ads')
    plt.title(g)
    plt.show()
Below I have my code to plot my graph.
#can change the 'iloc[x:y]' component to plot sections of chart
#ax = df['Data'].iloc[300:].plot(color = 'black', title = 'Past vs. Expected Future Path')
ax = df.plot('Date','Data',color = 'black', title = 'Past vs. Expected Future Path')
df.loc[df.index >= idx, 'up2SD'].plot(color = 'r', ax = ax)
df.loc[df.index >= idx, 'down2SD'].plot(color = 'r', ax = ax)
df.loc[df.index >= idx, 'Data'].plot(color = 'b', ax = ax)
plt.show()
#resize the plot
plt.rcParams["figure.figsize"] = [10,6]
plt.show()
Lines 2 (commented out) and 3 both work to plot all of the lines together as seen; however, I wish to have the dates on the x-axis and also be able to plot sections of the graph (defined on the x-axis, i.e. date1 to date2).
Using line 3 I can plot with dates on the x-axis; however, using ".iloc[300:]" as in line 2 does not appear to work, as the 3 coloured lines disconnect from the main line, as seen below:
ax = df.iloc[300:].plot('Date','Data',color = 'black', title = 'Past vs. Expected Future Path')
Using line 2, I can edit the x-axis range; however, it doesn't have dates on the x-axis.
Does anyone have any advice on how to both have dates and be able to edit the x-axis periods?
For this to work as desired, you need to set the 'date' column as the index of the dataframe. Otherwise, df.plot has no way of knowing what should be used as the x-axis. With the date set as the index, pandas accepts expressions such as df.loc[df.index >= '20180101', 'data2'] to select a time range and a specific column.
Here is some example code to demonstrate the concept.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
dates = pd.date_range('20160101', '20191231', freq='D')
data1 = np.random.normal(-0.5, 0.2, len(dates))
data2 = np.random.normal(-0.7, 0.2, len(dates))
df = pd.DataFrame({'date': dates, 'data1':data1, 'data2':data2})
df.set_index('date', inplace=True)
df['data1'].iloc[300:].plot(color='crimson')
df.loc[df.index >= '20180101', 'data2'].plot(color='dodgerblue')
plt.tight_layout()
plt.show()
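With the date index in place, a specific date1-to-date2 section can also be selected with string slices on .loc, which keeps the dates on the x-axis when the result is plotted. A small sketch reusing the same kind of frame as above (the column name data1 follows the example):

```python
import numpy as np
import pandas as pd

# A frame indexed by date, as in the example above.
dates = pd.date_range('20160101', '20191231', freq='D')
df = pd.DataFrame({'data1': np.random.normal(-0.5, 0.2, len(dates))},
                  index=dates)

# .loc slicing with date strings is inclusive on both ends, so this
# selects Jan 1 2018 through Mar 1 2018 (31 + 28 + 1 = 60 rows).
window = df.loc['20180101':'20180301', 'data1']
print(len(window))  # 60
```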