How do I split a grouped bar chart into sub-groups? - python

I have this dataset:
     group   sub_group  value       date
0   Animal        Cats     12      today
1   Animal        Dogs     32      today
2   Animal       Goats     38      today
3   Animal        Fish      1      today
4    Plant        Tree     48      today
5   Object         Car     55      today
6   Object      Garage     61      today
7   Object  Instrument     57      today
8   Animal        Cats     44  yesterday
9   Animal        Dogs     12  yesterday
10  Animal       Goats     18  yesterday
11  Animal        Fish      9  yesterday
12   Plant        Tree      8  yesterday
13  Object         Car     12  yesterday
14  Object      Garage     37  yesterday
15  Object  Instrument     77  yesterday
I want two series in a bar chart: one for today and one for yesterday. Within each series, I want the bars split by sub-group. For example, there would be one bar called "Animal - today" that sums to 83 and, within that bar, there would be segments for cats, dogs, etc.
I want a chart very similar to the one shown under "Bar charts with Long Format Data" in the docs, except that I have two series.
This is what I tried:
fig = make_subplots(rows=1, cols=1)
fig.add_trace(
    go.Bar(
        y=df[df['date'] == 'today']['amount'],
        x=df[df['date'] == 'today']['group'],
        color=df[df['date'] == 'today']['sub_group']
    ),
    row=1, col=1
)
fig.add_trace(
    go.Bar(
        y=df[df['date'] == 'yesterday']['amount'],
        x=df[df['date'] == 'yesterday']['group'],
        color=df[df['date'] == 'yesterday']['sub_group']
    ),
    row=1, col=1
)
fig.show()
I added a bounty because I want to be able to add the chart as a trace in my subplot.

Apart from the color setting (go.Bar has no color argument), I think your code would draw the intended graph, but separating each stacked bar by color takes a few tricks. There may be another way to do this, but my approach is to create two Plotly Express figures, one per date, and reuse their traces. To build the x-axis, add a column that encodes the group column as category codes. The second set of bars needs to be shifted along the x-axis, so I add 0.5 to its x values. Then set the tick positions ([0.25, 1.25, 2.25]) and tick labels so that each label is centered under its pair of bars. Finally, duplicate legend items are made unique.
import plotly.express as px
import plotly.graph_objects as go
# Add a numeric group id; category codes follow sorted order
# (Animal=0, Object=1, Plant=2)
df['gr_id'] = df['group'].astype('category').cat.codes
# One Express figure per date; their traces are reused below
fig_t = px.bar(df.query('date == "today"'), x="gr_id", y="amount", color='sub_group')
fig_y = px.bar(df.query('date == "yesterday"'), x="gr_id", y="amount", color='sub_group')
fig = go.Figure()
for t in fig_t.data:
    fig.add_trace(go.Bar(t))
for y in fig_y.data:
    # Shift yesterday's bars to the right of today's
    y['x'] = y['x'] + 0.5
    fig.add_trace(go.Bar(y))
fig.update_layout(bargap=0.05, barmode='stack')
# Tick labels must follow the category-code order
fig.update_xaxes(tickvals=[0.25, 1.25, 2.25], ticktext=['Animal', 'Object', 'Plant'])
# Show each sub_group only once in the legend
names = set()
fig.for_each_trace(
    lambda trace:
        trace.update(showlegend=False)
        if (trace.name in names) else names.add(trace.name))
fig.update_layout(legend_tracegroupgap=0)
fig.show()
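The tick positions and labels do not have to be hard-coded. As a rough sketch in plain Python (no pandas or Plotly needed; the variable names here are illustrative only), they can be derived from the group names and the 0.5 shift. It assumes that pandas assigns category codes to string values in sorted order, which is why the labels come out alphabetically:

```python
groups = ["Animal", "Plant", "Object"]   # group names from the question
shift = 0.5                              # offset applied to yesterday's bars

# pandas orders string categories alphabetically, so codes follow sorted order
categories = sorted(set(groups))
codes = {name: code for code, name in enumerate(categories)}

# Center each tick between the unshifted bar and the shifted bar
tickvals = [code + shift / 2 for code in range(len(categories))]
ticktext = categories

print(codes)     # {'Animal': 0, 'Object': 1, 'Plant': 2}
print(tickvals)  # [0.25, 1.25, 2.25]
print(ticktext)  # ['Animal', 'Object', 'Plant']
```

The resulting lists are what would be passed to fig.update_xaxes(tickvals=..., ticktext=...).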
EDIT:
This can also be achieved with the faceting (subplot) functionality built into Plotly Express:
import plotly.express as px
px.bar(df, x='group', y='amount', color='sub_group', facet_col='date')

I know this is kind of late, and maybe not as useful, but you could use dataclasses:
from dataclasses import dataclass
@dataclass
class DataObject:
    value: str
    index: int
@dataclass
class TwoDayData:
    today: DataObject
    yesterday: DataObject
data = DataObject(value="hello", index=1)
data_yesterday = DataObject(value="Mom", index=1)
two_day_point = TwoDayData(today=data, yesterday=data_yesterday)
print(two_day_point.yesterday, two_day_point.today)
print(two_day_point.yesterday.index, two_day_point.today.index)

I didn't see your expected output, so I went with my best guess. See whether this is what you wanted:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Bar(x=[tuple(df[df['date'] == 'today']['group']),
                        tuple(df[df['date'] == 'today']['sub_group'])],
                     y=list(df[df['date'] == 'today']['value']),
                     name='today'),
              row=1, col=1)
fig.add_trace(go.Bar(x=[tuple(df[df['date'] == 'yesterday']['group']),
                        tuple(df[df['date'] == 'yesterday']['sub_group'])],
                     y=list(df[df['date'] == 'yesterday']['value']),
                     name='yesterday'),
              row=1, col=2)
fig.show()
[Output figure: two subplots, "today" and "yesterday"]

I find it more aesthetically pleasing to put data like this into figures with multi-categorical axes.
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=1)
# One trace per sub_group; x is [outer level, inner level]
for sg in df['sub_group'].unique():
    mask = df['sub_group'] == sg
    fig.add_trace(go.Bar(x=[df['date'][mask], df['group'][mask]],
                         y=df['value'][mask],
                         name=sg,
                         text=sg),
                  row=1, col=1)
fig.update_layout(barmode='stack')
fig.show()
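A multi-categorical x is a pair of equal-length sequences, with the outer level first. Here is a minimal sketch of how those pairs line up, using plain Python lists and a few rows from the question's table instead of the DataFrame (the rows list and the multicategory_xy helper are illustrative, not part of Plotly):

```python
# A few rows from the question's table: (group, sub_group, value, date)
rows = [
    ("Animal", "Cats", 12, "today"),
    ("Animal", "Dogs", 32, "today"),
    ("Animal", "Cats", 44, "yesterday"),
    ("Animal", "Dogs", 12, "yesterday"),
]

def multicategory_xy(rows, sub_group):
    """Return ([outer, inner], y) for one sub_group, in the shape go.Bar expects."""
    selected = [r for r in rows if r[1] == sub_group]
    outer = [r[3] for r in selected]   # date
    inner = [r[0] for r in selected]   # group
    y = [r[2] for r in selected]
    return [outer, inner], y

x, y = multicategory_xy(rows, "Cats")
print(x)  # [['today', 'yesterday'], ['Animal', 'Animal']]
print(y)  # [12, 44]
```

Each sub_group becomes one trace, and with barmode='stack' the traces that share an (outer, inner) pair are stacked into one bar.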

Related

Python plotly how to change X and Y axis in button

I'm working on a graph that illustrates computer usage each day. I want a button that groups the dates by month for the last year, sets y to the average (mean), and draws an average line.
My code:
import datetime
import numpy as np
import pandas as pd
import plotly.graph_objects as go
example_data = {"date": ["29/07/2022", "30/07/2022", "31/07/2022", "01/08/2022", "02/08/2022"],
"time_spent" : [15840, 21720, 40020, 1200, 4200]}
df = pd.DataFrame(example_data)
df["date"] = pd.to_datetime(df["date"], dayfirst=True)
df['Time spent'] = df['time_spent'].apply(lambda x:str(datetime.timedelta(seconds=x)))
df['Time spent'] = pd.to_datetime(df['Time spent'])
df = df.drop("time_spent", axis=1)
dfall = df.resample("M", on="date").mean().copy()
dfyearly = dfall.tail(12).copy()
dfweekly = df.tail(7).copy()
dfmonthly = df.tail(30).copy()
del df
dfs = {'Week':dfweekly, 'Month': dfmonthly, 'Year' : dfyearly, "All" : dfall}
for dframe in list(dfs.values()):
dframe['StfTime'] = dframe['Time spent'].apply(lambda x: x.strftime("%H:%M"))
frames = len(dfs) # number of dataframes organized in dict
columns = len(dfs['Week'].columns) - 1 # number of columns i df, minus 1 for Date
scenarios = [list(s) for s in [e==1 for e in np.eye(frames)]]
visibility = [list(np.repeat(e, columns)) for e in scenarios]
lowest_value = datetime.datetime.combine(datetime.date.today(), datetime.datetime.min.time())
highest_value = dfweekly["Time spent"].max().ceil("H")
buttons = []
fig = go.Figure()
for i, (period, df) in enumerate(dfs.items()):
print(i)
for column in df.columns[1:]:
fig.add_bar(
name = column,
x = df['date'],
y = df[column],
customdata=df[['StfTime']],
text=df['StfTime'],
visible=True if period=='Week' else False # 'Week' values are shown from the start
)
#Change display data to more friendly format
fig.update_traces(textfont=dict(size=20), hovertemplate='<b>Time ON</b>: %{customdata[0]}</br>')
#Change range for better scalling
this_value =df["Time spent"].max().ceil("H")
if highest_value <= this_value:
highest_value = this_value
fig.update_yaxes(range=[lowest_value, highest_value])
#Add average value indicator
average_value = df["Time spent"].mean()
fig.add_hline(y=average_value, line_width=3, line_dash="dash",
line_color="green")
# one button per dataframe to trigger the visibility
# of all columns / traces for each dataframe
button = dict(label=period,
method = 'restyle',
args = ['visible',visibility[i]])
buttons.append(button)
fig.update_yaxes(dtick=60*60*1000, tickformat='%H:%M')
fig.update_xaxes(type='date', dtick='D1')
fig.update_layout(updatemenus=[dict(type="dropdown",
direction="down",
buttons = buttons)])
fig.show()
EDIT 1.
Thanks to vestland I managed to get a semi-working dropdown.
The problem is that the line added with add_hline affects all bar charts; I want it displayed only on the chart it was added for. Also, after passing in custom data for nicer display, the space between bars doubled. Is there any way to fix these issues?
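One possible explanation, offered as an assumption rather than a tested fix: add_hline draws a layout shape, and a restyle button only changes trace attributes, so every line stays visible whichever period is selected. A possible workaround is to switch the buttons to method='update' and pass the matching shape in the layout part of args. The sketch below only builds the button dictionaries in plain Python. The averages values, the one-trace-per-period assumption, and the helper names are all hypothetical, and in the question the y values would be datetimes rather than floats.

```python
periods = ["Week", "Month", "Year", "All"]
traces_per_period = 1   # assumed: one bar trace per period
averages = {"Week": 4.6, "Month": 4.1, "Year": 3.9, "All": 3.7}  # made-up values

def average_line(y):
    """Layout shape: a dashed horizontal line across the full plot width."""
    return dict(type="line", xref="paper", x0=0, x1=1, yref="y", y0=y, y1=y,
                line=dict(color="green", width=3, dash="dash"))

def make_button(index, period):
    # Show only the traces that belong to this period ...
    n_traces = len(periods) * traces_per_period
    visible = [i // traces_per_period == index for i in range(n_traces)]
    # ... and replace all layout shapes with this period's average line
    return dict(label=period, method="update",
                args=[{"visible": visible},
                      {"shapes": [average_line(averages[period])]}])

buttons = [make_button(i, p) for i, p in enumerate(periods)]
print(buttons[1]["args"][0]["visible"])  # [False, True, False, False]
```

These dictionaries would go into updatemenus in place of the restyle buttons. The initial figure would then need only the first period's shape in its layout, instead of one add_hline call per period.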

Choropleth map slider only showing the first year , and not including other steps - no error provided

I've been working on trying to create a choropleth map with a date slider, and while the map and initial year of data is output, when the slider is moved past the initial year, the map goes white and the 'range' indicator on the right side noting the number of 'Total Deaths' disappears.
I receive no error, and am not sure what could be happening, any help would be great. Thanks!
The Kaggle dataset link
The code
df_total = pd.read_csv('../input/air-pollution/death-rates-total-air-pollution.csv')
scl = [[0.0, '#ffffff'],[0.2, '#ff9999'],[0.4, '#ff4d4d'],
       [0.6, '#ff1a1a'],[0.8, '#cc0000'],[1.0, '#4d0000']]
data_slider = []
for year in df_total.Year.unique():
    # I select the year (and remove DC for now)
    dff = df_total[(df_total['Year']== year )]
    dff = df_total = dff.rename(columns={"Deaths - Air pollution - Sex: Both - Age: Age-standardized (Rate)":"Total Deaths"})
    for col in dff.columns: # I transform the columns into string type so I can:
        dff[col] = dff[col].astype(str)
    ### I create the text for mouse-hover for each state, for the current year
    '''dff['Total Deaths'] = dff['Entity'] '''
    ### create the dictionary with the data for the current year
    data_one_year = dict(
        type='choropleth',
        locations = dff['Entity'],
        z=dff['Total Deaths'].astype(float),
        locationmode='country names',
        colorscale = scl,
        text = dff['Entity'],
    )
    data_slider.append(data_one_year) # I add the dictionary to the list of dictionaries for the slider
steps = []
for i in range(len(data_slider)):
    step = dict(method='restyle',
                args=['visible', [False] * len(data_slider)],
                label='Year {}'.format(i + 1990)) # label to be displayed for each step (year)
    step['args'][1][i] = True
    steps.append(step)
sliders = [dict(active=0, pad={"t": 1}, steps=steps)]
layout = dict(geo=dict(scope='world',
                       projection={'type': 'equirectangular'}),
              sliders=sliders)
#create the figure object:
fig = dict(data=data_slider, layout=layout)
#plot in the notebook
plotly.offline.iplot(fig)
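This line inside the for loop is the problem. The chained assignment also rebinds df_total to the current year's subset, so every later year filters an already-filtered frame and comes back empty:
dff = df_total = dff.rename(
    columns={"Deaths - Air pollution - Sex: Both - Age: Age-standardized (Rate)": "Total Deaths"})
Replace it with:
dff = dff.rename(
    columns={"Deaths - Air pollution - Sex: Both - Age: Age-standardized (Rate)": "Total Deaths"})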
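The effect of the chained assignment can be reproduced without pandas. The following sketch uses a made-up list of records in place of df_total (the table contents and the rows_per_year helper are hypothetical) and counts how many rows each year's filter returns, with and without rebinding the source:

```python
# Hypothetical stand-in for df_total: two rows for each of three years
table = [{"Year": y, "Entity": e} for y in (1990, 1991, 1992) for e in ("A", "B")]
years = sorted({row["Year"] for row in table})

def rows_per_year(rebind_source):
    source = list(table)
    counts = []
    for year in years:
        subset = [row for row in source if row["Year"] == year]
        if rebind_source:
            # Mirrors `dff = df_total = dff.rename(...)`: the full table is lost
            source = subset
        counts.append(len(subset))
    return counts

print(rows_per_year(rebind_source=True))   # [2, 0, 0]
print(rows_per_year(rebind_source=False))  # [2, 2, 2]
```

With the rebinding, only the first year has data, which matches the map going blank once the slider moves past the initial year.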

How to create a plot with dynamic variables

Using the matplotlib library in Python, I would like to plot some graphs with dynamic y variables, i.e. variables that change according to another variable set before my plot functions.
From my imported data frame, I have extracted the concentrations (M**_conc) and fluxes (M**_flux) of different gases.
M33_conc = ec_top["M 33(ppbv)"]
M39_conc = ec_top["M 39(ncps)"]
M45_conc = ec_top["M 45(ppbv)"]
M59_conc = ec_top["M 59(ppbv)"]
M69_conc = ec_top["M 69(ppbv)"]
M71_conc = ec_top["M 71(ppbv)"]
M81_conc = ec_top["M 81(ppbv)"]
M137_conc = ec_top["M 137(ppbv)"]
M87_conc = ec_top["M 87(ppbv)"]
M47_conc = ec_top["M 47(ppbv)"]
M61_conc = ec_top["M 61(ppbv)"]
M33_flux = ec_top["Flux_M 33"]
M45_flux = ec_top["Flux_M 45"]
M59_flux = ec_top["Flux_M 59"]
M69_flux = ec_top["Flux_M 69"]
M71_flux = ec_top["Flux_M 71"]
M81_flux = ec_top["Flux_M 81"]
M137_flux = ec_top["Flux_M 137"]
M87_flux = ec_top["Flux_M 87"]
M47_flux = ec_top["Flux_M 47"]
M61_flux = ec_top["Flux_M 61"]
I want to be able to plot the evolution of these gas concentrations/fluxes over time with a single function that lets me choose between plotting the concentrations or the fluxes.
Here is what I have written so far:
color_1 = 'black'
graph_type='conc'
fig, ((ax1, ax2, ax3), (ax5, ax7, ax8),(ax9,ax10,ax11)) = plt.subplots(3, 3, sharex=True, sharey=False)
fig.suptitle('Influence of wind direction of BVOCs concentration')
ax1.plot(wind_dir,'M33_'+graph_type,linestyle='',marker='.',color=color_1)
ax1.set_title('Methanol')
ax1.set(ylabel='Concentration [ppbv]')
ax2.plot(wind_dir,M39_conc,linestyle='',marker='.',color=color_1)
ax2.set_title('Water cluster')
ax2.set(ylabel='Concentration [ncps]')
ax3.plot(wind_dir,M45_conc,linestyle='',marker='.',color=color_1)
ax3.set_title('Acetaldehyde')
ax3.set(ylabel='Concentration [ppbv]')
# ax4.plot(wind_dir,M47_conc,linestyle='',marker='.',color='color_1')
# ax4.set_title('Unknown')
ax5.plot(wind_dir,M59_conc,linestyle='',marker='.',color=color_1)
ax5.set_title('Acetone')
ax5.set(ylabel='Concentration [ppbv]')
# ax6.plot(wind_dir,M61_conc,linestyle='',marker='.',color='color_1')
# ax6.set_title('Unknown')
ax7.plot(wind_dir,M69_conc,linestyle='',marker='.',color=color_1)
ax7.set_title('Isoprene')
ax7.set(ylabel='Concentration [ppbv]')
ax8.plot(wind_dir,M71_conc,linestyle='',marker='.',color=color_1)
ax8.set_title('Methyl vinyl, ketone and methacrolein')
ax8.set(ylabel='Concentration [ppbv]')
ax9.plot(wind_dir,M81_conc,linestyle='',marker='.',color=color_1)
ax9.set_title('Fragment of monoterpenes')
ax9.set(ylabel='Concentration [ppbv]',xlabel='Wind direction [°]')
ax10.plot(wind_dir,M87_conc,linestyle='',marker='.',color=color_1)
ax10.set_title('Methylbutenols')
ax10.set(ylabel='Concentration [ppbv]',xlabel='Wind direction [°]')
ax11.plot(wind_dir,M137_conc,linestyle='',marker='.',color=color_1)
ax11.set_title('Monoterpenes')
ax11.set(ylabel='Concentration [ppbv]',xlabel='Wind direction [°]')
plt.show()
When I try to parametrize the data I want to plot, I write, for example :
'M33_'+graph_type
which I am expecting to take the value 'M33_conc'.
Could someone help me to do this?
Thanks in advance
You mention wanting to plot the evolution of the gases with time, but in the code sample you use wind_dir as the x variable. In this answer, I disregard this and use time as the x variable instead.
Looking at your code, I understand that you want to create two different figures made of small multiples, one for gas concentrations and one for gas fluxes. For this kind of plot, I recommend using pandas or seaborn so that you can plot all the variables contained in a pandas dataframe at once. Here I share an example using pandas.
Because you want to plot different measurements of the same substances, I recommend creating a table that lists the variable names and units associated with each unique substance (see df_subs below). I create one here with code that extracts the units, but this is easier to do with spreadsheet software.
Having a table like that makes it easier to create a plotting function that selects the group of variables you want to plot from the ec_top dataframe. You can then use the pandas plotting function like this: df.plot(subplots=True).
Most of the code shown below creates sample data based on your code, so that you (or anyone else who would like to give this a try) can recreate exactly what I show here. If you want to use this solution, you can skip most of it: all you need to do is create the substances table your way and then adjust the plotting function to fit your preferences.
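As a smaller-scale illustration of the same lookup idea: 'M33_' + graph_type in the question produces a string, not the variable of that name, so it cannot be passed to plot directly. A string can, however, be used as a key. The sketch below uses a plain dict as a hypothetical stand-in for ec_top, and the column_name helper is an assumption made up for this example:

```python
# Hypothetical stand-in for ec_top: column name -> list of values
ec_top = {
    "M 33(ppbv)": [1.2, 1.4, 1.1],
    "Flux_M 33": [0.02, 0.05, 0.03],
}

def column_name(mass, graph_type, unit="ppbv"):
    """Build the ec_top column name for a mass number and graph type."""
    if graph_type == "conc":
        return f"M {mass}({unit})"
    if graph_type == "flux":
        return f"Flux_M {mass}"
    raise ValueError(f"graph_type must be 'conc' or 'flux', got {graph_type!r}")

graph_type = "flux"
y = ec_top[column_name(33, graph_type)]
print(column_name(33, "conc"))  # M 33(ppbv)
print(y)                        # [0.02, 0.05, 0.03]
```

With the real dataframe, the same key could be used in the question's code as ax1.plot(wind_dir, ec_top[column_name(33, graph_type)], ...).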
Create sample dataset
import io # from Python v 3.8.5
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import matplotlib.dates as mdates
pd.set_option("display.max_columns", 6)
rng = np.random.default_rng(seed=1) # random number generator
# Copy paste variable names from sample given in question
var_strings = '''
"M 33(ppbv)"
"M 39(ncps)"
"M 45(ppbv)"
"M 59(ppbv)"
"M 69(ppbv)"
"M 71(ppbv)"
"M 81(ppbv)"
"M 137(ppbv)"
"M 87(ppbv)"
"M 47(ppbv)"
"M 61(ppbv)"
"Flux_M 33"
"Flux_M 45"
"Flux_M 59"
"Flux_M 69"
"Flux_M 71"
"Flux_M 81"
"Flux_M 137"
"Flux_M 87"
"Flux_M 47"
"Flux_M 61"
'''
variables = pd.read_csv(io.StringIO(var_strings), header=None, names=['var'])['var']
# Create datetime variable
nperiods = 60
time = pd.date_range('2021-01-15 12:00', periods=nperiods, freq='min')
# Create range of numbers to compute sine waves for fake data
x = np.linspace(0, 2*np.pi, nperiods)
# Create dataframe containing gas concentrations
var_conc = np.array([var for var in variables if '(' in var])
conc_sine_wave = np.reshape(np.sin(x), (len(x), 1))
loc = rng.exponential(scale=10, size=var_conc.size)
scale = loc/10
var_conc_noise = rng.normal(loc, scale, size=(x.size, var_conc.size))
data_conc = conc_sine_wave + var_conc_noise + 2
df_conc = pd.DataFrame(data_conc, index=time, columns=var_conc)
# Create dataframe containing gas fluxes
var_flux = np.array([var for var in variables if 'Flux' in var])
flux_sine_wave = np.reshape(np.sin(x)**2, (len(x), 1))
loc = rng.exponential(scale=10, size=var_flux.size)
scale = loc/10
var_flux_noise = rng.normal(loc, scale, size=(x.size, var_flux.size))
data_flux = flux_sine_wave + var_flux_noise + 1
df_flux = pd.DataFrame(data_flux, index=time, columns=var_flux)
# Merge concentrations and fluxes into single dataframe
ec_top = pd.merge(left=df_conc, right=df_flux, how='outer',
left_index=True, right_index=True)
ec_top.head()
# M 33(ppbv) M 39(ncps) M 45(ppbv) ... Flux_M 87 Flux_M 47 Flux_M 61
# 2021-01-15 12:00:00 11.940054 5.034281 53.162767 ... 8.079255 2.402073 31.383911
# 2021-01-15 12:01:00 13.916828 4.354558 45.706391 ... 10.229084 2.494649 26.816754
# 2021-01-15 12:02:00 13.635604 5.500438 53.202743 ... 12.772899 2.441369 33.219213
# 2021-01-15 12:03:00 13.146823 5.409585 53.346907 ... 11.373669 2.817323 33.409331
# 2021-01-15 12:04:00 14.124752 5.491555 49.455010 ... 11.827497 2.939942 28.639749
Create substances table containing variable names and units
The substances are shown in the figure subplots in the order that they are listed here. Information from this table is used to create the labels and titles of the subplots.
# Copy paste substance codes and names from sample given in question
subs_strings = """
M33 "Methanol"
M39 "Water cluster"
M45 "Acetaldehyde"
M47 "Unknown"
M59 "Acetone"
M61 "Unknown"
M69 "Isoprene"
M71 "Methyl vinyl, ketone and methacrolein"
M81 "Fragment of monoterpenes"
M87 "Methylbutenols"
M137 "Monoterpenes"
"""
# Create dataframe containing substance codes and names
df_subs = pd.read_csv(io.StringIO(subs_strings), header=None,
names=['subs', 'subs_name'], index_col=False,
delim_whitespace=True)
# Add units and variable names matching the substance codes
# Do this for gas concentrations
for var in var_conc:
    var_subs, var_unit_raw = var.split('(')
    var_subs_num = var_subs.lstrip('M ')
    var_unit = var_unit_raw.rstrip(')')
    for i, subs in enumerate(df_subs['subs']):
        if var_subs_num == subs.lstrip('M'):
            df_subs.loc[i, 'conc_unit'] = var_unit
            df_subs.loc[i, 'conc_var'] = var
# Do this for gas fluxes
for var in var_flux:
    var_subs_num = var.split('M')[1].lstrip()
    var_unit = rng.choice(['unit_a', 'unit_b', 'unit_c'])
    for i, subs in enumerate(df_subs['subs']):
        if var_subs_num == subs.lstrip('M'):
            df_subs.loc[i, 'flux_unit'] = var_unit
            df_subs.loc[i, 'flux_var'] = var
df_subs
# subs subs_name conc_unit conc_var flux_unit flux_var
# 0 M33 Methanol ppbv M 33(ppbv) unit_c Flux_M 33
# 1 M39 Water cluster ncps M 39(ncps) NaN NaN
# 2 M45 Acetaldehyde ppbv M 45(ppbv) unit_a Flux_M 45
# 3 M47 Unknown ppbv M 47(ppbv) unit_b Flux_M 47
# 4 M59 Acetone ppbv M 59(ppbv) unit_a Flux_M 59
# 5 M61 Unknown ppbv M 61(ppbv) unit_c Flux_M 61
# 6 M69 Isoprene ppbv M 69(ppbv) unit_a Flux_M 69
# 7 M71 Methyl vinyl, ketone and methacrolein ppbv M 71(ppbv) unit_a Flux_M 71
# 8 M81 Fragment of monoterpenes ppbv M 81(ppbv) unit_c Flux_M 81
# 9 M87 Methylbutenols ppbv M 87(ppbv) unit_c Flux_M 87
# 10 M137 Monoterpenes ppbv M 137(ppbv) unit_b Flux_M 137
Create plotting function based on pandas
Here is one way of creating a plotting function that lets you select the variables for the plot with the graph_type argument. It works by selecting the relevant variables from the substances table using the if/elif statement. This and the ec_top[variables].plot(...) function are all that is really necessary to create the plot, the rest is all for formatting the figure. The variables are plotted in the order of the variables list. I draw only two columns of subplots because of width constraints here (max 10 inches width to get a sharp image on Stack Overflow).
# Create plotting function that creates a single figure showing all
# variables of the chosen type
def plot_grid(graph_type):
    # Set the type of variables and units to fetch in df_subs: using if
    # statements for the strings lets you use a variety of strings
    if 'conc' in graph_type:
        var_type = 'conc_var'
        unit_type = 'conc_unit'
    elif 'flux' in graph_type:
        var_type = 'flux_var'
        unit_type = 'flux_unit'
    else:
        return (f'Error: "{graph_type}" is not a valid string, '
                'it must contain "conc" or "flux".')
    # Create list of variables to plot depending on type
    variables = df_subs[var_type].dropna()
    # Set parameters for figure dimensions
    nvar = variables.size
    cols = 2
    rows = int(np.ceil(nvar/cols))
    width = 10/cols
    height = 3
    # Draw grid of line plots: note that x_compat is used to override the
    # default x-axis time labels, remove it if you do not want to use custom
    # tick locators and formatters like the ones created in the loop below
    grid = ec_top[variables].plot(subplots=True, figsize=(cols*width, rows*height),
                                  layout=(rows, cols), marker='.', linestyle='',
                                  xlabel='Time', x_compat=True)
    # The code in the following loop is optional formatting based on my
    # preferences, if you remove it the plot should still look ok but with
    # fewer informative labels and the legends may not all be in the same place
    # Loop through the subplots to edit format, including creating labels and
    # titles based on the information in the substances table (df_subs):
    for ax in grid.flatten()[:nvar]:
        # Edit tick locations and format
        plt.setp(ax.get_xticklabels(which='both'), fontsize=8, rotation=0, ha='center')
        loc = mdates.AutoDateLocator()
        ax.xaxis.set_major_locator(loc)
        ax.set_xticks([], minor=True)
        fmt = mdates.ConciseDateFormatter(loc, show_offset=False)
        ax.xaxis.set_major_formatter(fmt)
        # Edit legend
        handle, (var_name,) = ax.get_legend_handles_labels()
        subs = df_subs[df_subs[var_type] == var_name]['subs']
        ax.legend(handle, subs, loc='upper right')
        # Add y label
        var_unit, = df_subs[df_subs[var_type] == var_name][unit_type]
        ylabel_type = f'{"Concentration" if "conc" in graph_type else "Flux"}'
        ax.set_ylabel(f'{ylabel_type} [{var_unit}]')
        # Add title
        subs_name, = df_subs[df_subs[var_type] == var_name]['subs_name']
        ax.set_title(subs_name)
    # Edit figure format
    fig = plt.gcf()
    date = df_conc.index[0].strftime('%b %d %Y')
    title_type = f'{"concentrations" if "conc" in graph_type else "fluxes"}'
    fig.suptitle(f'BVOCs {title_type} on {date} from 12:00 to 13:00',
                 y=0.93, fontsize=15)
    fig.subplots_adjust(wspace=0.3, hspace=0.4)
    plt.show()
plot_grid('conc') # any kind of string works if it contains 'conc' or 'flux'
plot_grid('graph fluxes')
Documentation: matplotlib date ticks
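The grid dimensions in plot_grid follow from simple arithmetic. As a quick check with the standard library only (the grid_shape helper is hypothetical and just mirrors the rows/cols/figsize lines of the function), the substances table has 11 concentration variables and 10 flux variables:

```python
import math

def grid_shape(nvar, cols=2, total_width=10, row_height=3):
    """Rows, columns and figure size needed to fit nvar subplots."""
    rows = math.ceil(nvar / cols)
    figsize = (total_width, rows * row_height)
    return rows, cols, figsize

print(grid_shape(11))  # (6, 2, (10, 18))
print(grid_shape(10))  # (5, 2, (10, 15))
```

With an odd number of variables the last grid cell stays empty, which is why the formatting loop only visits grid.flatten()[:nvar].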

Waterfall chart python matplotlib

I am having a problem with waterfall. I took this chart from matplotlib site and added my own data frame with 2 simple columns with some integer numbers. My waterfall was produced but without numbers, just empty bars. I am a bit lost and I would appreciate any suggestions.
What I am trying to build is the custom waterfall that takes one dataframe with column names, values, and some values for filters like countries. I haven't found anything like that anywhere so I am trying to build my own.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
dataset = pd.read_csv('waterfall_test_data.csv')
#Use python 2.7+ syntax to format currency
def money(x, pos):
    'The two args are the value and tick position'
    return "${:,.0f}".format(x)
formatter = FuncFormatter(money)
#Data to plot. Do not include a total, it will be calculated
index = dataset['columns']
data = dataset['amount']
#Store data and create a blank series to use for the waterfall
trans = pd.DataFrame(data=data,index=index)
blank = trans.amount.cumsum().shift(1).fillna(0)
#Get the net total number for the final element in the waterfall
total = trans.sum().amount
trans.loc["net"]= total
blank.loc["net"] = total
#The steps graphically show the levels as well as used for label placement
step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan
#When plotting the last element, we want to show the full bar,
#Set the blank to 0
blank.loc["net"] = 0
#Plot and label
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(15, 5), title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")
#Format the axis for dollars
my_plot.yaxis.set_major_formatter(formatter)
#Get the y-axis position for the labels
y_height = trans.amount.cumsum().shift(1).fillna(0)
#Get an offset so labels don't sit right on top of the bar
max = trans.max()
neg_offset = max / 25
pos_offset = max / 50
plot_offset = int(max / 15)
#Start label loop
loop = 0
for index, row in trans.iterrows():
    # For the last item in the list, we don't want to double count
    if row['amount'] == total:
        y = y_height[loop]
    else:
        y = y_height[loop] + row['amount']
    # Determine if we want a neg or pos offset
    if row['amount'] > 0:
        y += pos_offset
    else:
        y -= neg_offset
    my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center")
    loop+=1
#Scale up the y axis so there is room for the labels
my_plot.set_ylim(0,blank.max()+int(plot_offset))
#Rotate the labels
my_plot.set_xticklabels(trans.index,rotation=0)
my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
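The blank series in the code above is the running total shifted by one step, which gives each bar the level it starts from. Here is a sketch of that arithmetic using only the standard library and made-up amounts (none of these numbers come from the question's CSV):

```python
from itertools import accumulate

amounts = [350000, -30000, -7500, -25000, 95000, -7000]  # made-up values

running = list(accumulate(amounts))   # cumulative total after each step
bottoms = [0] + running[:-1]          # level where each bar starts
total = running[-1]                   # height of the final "net" bar

print(bottoms)  # [0, 350000, 320000, 312500, 287500, 382500]
print(total)    # 375500
```

Separately, and only as an untested guess about the empty bars: pd.DataFrame(data=data, index=index) aligns the amount Series on the new index labels rather than relabelling it, which can leave every value as NaN when those labels do not match the Series' own integer index. Building the frame with dataset.set_index('columns')[['amount']] might avoid that.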

Plot categorical data with matplotlib - transposed pandas dataframe

I have data for two groups in a pandas dataframe, with for each group the mean of 3 different items of a scale:
item1 item2 item3
group
1 2.807692 3.115385 3.923077
2 2.909091 2.454545 3.909091
I would like to plot the means for both groups in a bar plot. I found some code for doing just that here with the following function:
def groupedbarplot(x_data, y_data_list, y_data_names, colors, x_label, y_label, title):
    _, ax = plt.subplots()
    # Total width for all bars at one x location
    total_width = 0.5
    # Width of each individual bar
    ind_width = total_width / len(y_data_list)
    # This centers each cluster of bars about the x tick mark
    alteration = np.arange(-(total_width/2), total_width/2, ind_width)
    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # Move the bar to the right on the x-axis so it doesn't
        # overlap with previously drawn ones
        ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
This is working perfectly fine with the mentioned dataframe, however, instead of having the bars for all items grouped together per group I wanted the bars to be grouped per item so I can see for each item the difference per groups. Therefore, I've transposed the dataframe, but this is giving me errors when plotting:
groupedbarplot(x_data = data.index.values
, y_data_list = [data[1],data[2]]
, y_data_names = ['group1', 'group2']
, colors = ['blue', 'orange']
, x_label = 'Scale'
, y_label = 'Score'
, title = 'title')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-111-b910d304a19e> in <module>()
5 , x_label = 'Verloning scale'
6 , y_label = 'Score'
----> 7 , title = 'Score op elke item voor werknemer en oud-werknemers')
<ipython-input-66-9fa4a515d5e9> in groupedbarplot(x_data, y_data_list, y_data_names, colors, x_label, y_label, title)
11 # Move the bar to the right on the x-axis so it doesn't
12 # overlap with previously drawn ones
---> 13 ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
14 ax.set_ylabel(y_label)
15 ax.set_xlabel(x_label)
TypeError: Can't convert 'float' object to str implicitly
I have checked all the variables to see where the difference is, but can't seem to find it. Any ideas?
When I transpose the dataframe you provide, the result looks like this:
              1         2
item1  2.807692  2.909091
item2  3.115385  2.454545
item3  3.923077  3.909091
Therefore data.index.values returns array(['item1', 'item2', 'item3'], dtype=object), and I suspect that your error comes from x_data + alteration[i], which tries to add a float to an array that contains strings.
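A possible way around this, sketched here in plain Python under the assumption that the item names are only needed as tick labels, is to do the offset arithmetic on numeric positions and attach the strings afterwards:

```python
labels = ["item1", "item2", "item3"]   # index of the transposed dataframe
n_series = 2                           # group 1 and group 2
total_width = 0.5
ind_width = total_width / n_series

# Numeric tick positions replace the string labels in the arithmetic
positions = list(range(len(labels)))
offsets = [-total_width / 2 + i * ind_width for i in range(n_series)]
bar_x = [[p + offset for p in positions] for offset in offsets]

for i, xs in enumerate(bar_x):
    print(i, xs)
# 0 [-0.25, 0.75, 1.75]
# 1 [0.0, 1.0, 2.0]
```

Each inner list would be passed as the x argument of ax.bar for one group, and the names restored with ax.set_xticks(positions) followed by ax.set_xticklabels(labels).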
