Plot categorical data with matplotlib - transposed pandas dataframe - python

I have data for two groups in a pandas DataFrame. For each group it holds the mean of three different items of a scale:
          item1     item2     item3
group
1      2.807692  3.115385  3.923077
2      2.909091  2.454545  3.909091
I would like to plot the means for both groups in a bar plot. I found some code for doing just that here with the following function:
import numpy as np
import matplotlib.pyplot as plt
def groupedbarplot(x_data, y_data_list, y_data_names, colors, x_label, y_label, title):
    _, ax = plt.subplots()
    # Total width for all bars at one x location
    total_width = 0.5
    # Width of each individual bar
    ind_width = total_width / len(y_data_list)
    # This centers each cluster of bars about the x tick mark
    alteration = np.arange(-(total_width/2), total_width/2, ind_width)
    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # Move the bar to the right on the x-axis so it doesn't
        # overlap with previously drawn ones
        ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
This works perfectly fine with the dataframe above. However, instead of having the bars for all items grouped together per group, I want the bars grouped per item, so that I can see the difference between the groups for each item. I therefore transposed the dataframe, but this gives me an error when plotting:
groupedbarplot(x_data = data.index.values
, y_data_list = [data[1],data[2]]
, y_data_names = ['group1', 'group2']
, colors = ['blue', 'orange']
, x_label = 'Scale'
, y_label = 'Score'
, title = 'title')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-111-b910d304a19e> in <module>()
5 , x_label = 'Verloning scale'
6 , y_label = 'Score'
----> 7 , title = 'Score op elke item voor werknemer en oud-werknemers')
<ipython-input-66-9fa4a515d5e9> in groupedbarplot(x_data, y_data_list, y_data_names, colors, x_label, y_label, title)
11 # Move the bar to the right on the x-axis so it doesn't
12 # overlap with previously drawn ones
---> 13 ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
14 ax.set_ylabel(y_label)
15 ax.set_xlabel(x_label)
TypeError: Can't convert 'float' object to str implicitly
I have checked all the variables to see where the difference is, but I can't find it. Any ideas?

When I transpose the dataframe you provided, the result looks like this:
              1         2
item1  2.807692  2.909091
item2  3.115385  2.454545
item3  3.923077  3.909091
Therefore data.index.values returns array(['item1', 'item2', 'item3'], dtype=object), and I suspect your error comes from x_data + alteration[i], which tries to add a float to an array of strings.
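One way around this, as a minimal sketch rather than a drop-in replacement for your function: place the bars at numeric positions and use the item names only as tick labels. The names positions and the f"group{col}" labels are my own choices, and I use align="edge" so each cluster is centered on its tick.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Transposed frame from the question: items as rows, groups as columns.
data = pd.DataFrame(
    {1: [2.807692, 3.115385, 3.923077], 2: [2.909091, 2.454545, 3.909091]},
    index=["item1", "item2", "item3"],
)

positions = np.arange(len(data.index))  # numeric x positions: 0, 1, 2
total_width = 0.5
ind_width = total_width / len(data.columns)

fig, ax = plt.subplots()
for i, col in enumerate(data.columns):
    # The offset is added to numbers here, never to the string labels.
    left_edges = positions - total_width / 2 + i * ind_width
    ax.bar(left_edges, data[col], width=ind_width, align="edge", label=f"group{col}")

ax.set_xticks(positions)
ax.set_xticklabels(data.index)  # the strings are used only as tick labels
ax.legend(loc="upper right")
```

If you do not need the custom function, data.plot.bar() on the transposed frame should give a similar grouped chart directly.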

Related

How do I split a grouped bar chart into sub-groups?

I have this dataset-
group sub_group value date
0 Animal Cats 12 today
1 Animal Dogs 32 today
2 Animal Goats 38 today
3 Animal Fish 1 today
4 Plant Tree 48 today
5 Object Car 55 today
6 Object Garage 61 today
7 Object Instrument 57 today
8 Animal Cats 44 yesterday
9 Animal Dogs 12 yesterday
10 Animal Goats 18 yesterday
11 Animal Fish 9 yesterday
12 Plant Tree 8 yesterday
13 Object Car 12 yesterday
14 Object Garage 37 yesterday
15 Object Instrument 77 yesterday
I want to have two series in a bar chart: one series for today and another for yesterday. Within each series, I want the bars to be split up by their sub-groups. For example, there would be one bar called "Animal - today" that sums to 83 and, within that bar, there would be cats, dogs, etc.
I want to make a chart that is very similar to the chart shown under "Bar charts with Long Format Data" in the docs, except that I have two series.
This is what I tried:
fig = make_subplots(rows = 1, cols = 1)
fig.add_trace(go.Bar(
        y = df[df['date'] == 'today']['amount'],
        x = df[df['date'] == 'today']['group'],
        color = df[df['date'] == 'today']['sub_group']
    ),
    row = 1, col = 1
)
fig.add_trace(go.Bar(
        y = df[df['date'] == 'yesterday']['amount'],
        x = df[df['date'] == 'yesterday']['group'],
        color = df[df['date'] == 'yesterday']['sub_group']
    ),
    row = 1, col = 1
)
fig.show()
I added a bounty because I want to be able to add the chart as a trace in my subplot.
I think your code will draw the intended graph except for the color settings; separating each stacked bar by color takes a few tricks. There may be another way to do this, but my approach is to create two Plotly Express figures, one per date, and reuse their traces. To build the x-axis, add a column that encodes the group column as category codes. The second series needs to be shifted along the x-axis, so I add 0.5 to its x values. Set tickvals to [0.25, 1.25, 2.25] with matching ticktext so that each tick label is centered under its pair of bars. Finally, duplicate legend items are made unique.
import plotly.express as px
import plotly.graph_objects as go
# Add a numeric group id; category codes follow alphabetical order (Animal=0, Object=1, Plant=2)
df['gr_id'] = df['group'].astype('category').cat.codes
fig_t = px.bar(df.query('date == "today"'), x="gr_id", y="amount", color='sub_group')
fig_y = px.bar(df.query('date == "yesterday"'), x="gr_id", y="amount", color='sub_group')
fig = go.Figure()
for t in fig_t.data:
    fig.add_trace(go.Bar(t))
for i,y in enumerate(fig_y.data):
    # Shift the "yesterday" bars to the right of the "today" bars
    y['x'] = y['x'] + 0.5
    fig.add_trace(go.Bar(y))
fig.update_layout(bargap=0.05, barmode='stack')
# Tick labels must follow the category-code order
fig.update_xaxes(tickvals=[0.25,1.25,2.25], ticktext=['Animal','Object','Plant'])
names = set()
# Show each sub_group only once in the legend
fig.for_each_trace(
    lambda trace:
        trace.update(showlegend=False)
        if (trace.name in names) else names.add(trace.name))
fig.update_layout(legend_tracegroupgap=0)
fig.show()
EDIT:
This can also be achieved with the faceting (subplot) support built into Plotly Express:
import plotly.express as px
px.bar(df, x='group', y='amount', color='sub_group', facet_col='date')
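As a quick sanity check on any of these charts, the height each stacked bar should reach can be computed directly with pandas. This is only a sketch: it rebuilds the dataset printed in the question and assumes the numeric column is called value as shown there (the plotting snippets use amount, so adjust the name to whatever your dataframe really has).

```python
import pandas as pd

# Dataset as printed in the question.
df = pd.DataFrame({
    "group": (["Animal"] * 4 + ["Plant"] + ["Object"] * 3) * 2,
    "sub_group": ["Cats", "Dogs", "Goats", "Fish", "Tree",
                  "Car", "Garage", "Instrument"] * 2,
    "value": [12, 32, 38, 1, 48, 55, 61, 57,
              44, 12, 18, 9, 8, 12, 37, 77],
    "date": ["today"] * 8 + ["yesterday"] * 8,
})

# Total per (date, group): the height of each stacked bar.
totals = df.groupby(["date", "group"])["value"].sum()
print(totals)
```

The entry for ("today", "Animal") is 83, matching the "Animal - today" bar described in the question.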
I know this is kind of late and maybe not as useful, but you could use dataclasses:
from dataclasses import dataclass
@dataclass
class DataObject:
    value: str
    index: int
@dataclass
class TwoDayData:
    today: DataObject
    yesterday: DataObject
data = DataObject(value="hello", index=1)
data_yesterday = DataObject(value="Mom", index=1)
two_day_point = TwoDayData(today=data, yesterday=data_yesterday)
print(two_day_point.yesterday, two_day_point.today)
print(two_day_point.yesterday.index, two_day_point.today.index)
I didn't see your expected output, so I went with my best guess. See whether this is what you wanted:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(rows = 1, cols = 2)
fig.add_trace(go.Bar(x=[tuple(df[df['date'] == 'today']['group']),
                        tuple(df[df['date'] == 'today']['sub_group'])],
                     y=list(df[df['date'] == 'today']['value']),
                     name='today'),
              row = 1, col = 1)
fig.add_trace(go.Bar(x=[tuple(df[df['date'] == 'yesterday']['group']),
                        tuple(df[df['date'] == 'yesterday']['sub_group'])],
                     y=list(df[df['date'] == 'yesterday']['value']),
                     name='yesterday'),
              row = 1, col = 2)
fig.show()
I find it more aesthetically pleasing to put data like this into figures with multi-categorical axes.
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(rows = 1, cols = 1)
for sg in df['sub_group'].unique():
    fig.add_trace(go.Bar(x=[df['date'][df['sub_group']==sg], df['group'][df['sub_group']==sg]],
                         y=df['value'][df['sub_group']==sg],
                         name=sg,
                         text=sg),
                  col=1,
                  row=1
                  )
fig.update_layout(barmode='stack')
fig.show()

How can I simplify and create conditional colours on this Waterfall Chart?

This is the code for a waterfall chart. I would like to ask:
1. Is there a way to simplify this code? It is far too long, and I'm sure many of the lines could be removed.
2. How can I make the first and last bars black? Since this is a waterfall chart, the first and last values should always be black, and the values in between green or red depending on whether they are positive or negative.
3. Bars greater than zero should be green.
4. Bars less than zero should be red.
Any help would be greatly appreciated.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
#Use python 2.7+ syntax to format currency
def money(x, pos):
    'The two args are the value and tick position'
    return "${:,.0f}".format(x)
formatter = FuncFormatter(money)
#Data to plot. Do not include a total, it will be calculated
index = ['sales','returns','credit fees','rebates','late charges','shipping']
data = {'amount': [350000,-30000,-7500,-25000,95000,-7000]}
#Store data and create a blank series to use for the waterfall
trans = pd.DataFrame(data=data,index=index)
blank = trans.amount.cumsum().shift(1).fillna(0)
#Get the net total number for the final element in the waterfall
total = trans.sum().amount
trans.loc["net"]= total
blank.loc["net"] = total
#The steps graphically show the levels as well as used for label placement
step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan
#When plotting the last element, we want to show the full bar,
#Set the blank to 0
blank.loc["net"] = 0
#Plot and label
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(10, 5), title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")
#Format the axis for dollars
my_plot.yaxis.set_major_formatter(formatter)
#Get the y-axis position for the labels
y_height = trans.amount.cumsum().shift(1).fillna(0)
#Get an offset so labels don't sit right on top of the bar
max = trans.max()
neg_offset = max / 25
pos_offset = max / 50
plot_offset = int(max / 15)
#Start label loop
loop = 0
for index, row in trans.iterrows():
    # For the last item in the list, we don't want to double count
    if row['amount'] == total:
        y = y_height[loop]
    else:
        y = y_height[loop] + row['amount']
    # Determine if we want a neg or pos offset
    if row['amount'] > 0:
        y += pos_offset
    else:
        y -= neg_offset
    my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center")
    loop+=1
#Scale up the y axis so there is room for the labels
my_plot.set_ylim(0,blank.max()+int(plot_offset))
#Rotate the labels
my_plot.set_xticklabels(trans.index,rotation=0)
my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
Answer to questions 2, 3 and 4: set the colors of the bar patches after plotting them:
for p, c in zip(my_plot.containers[0].patches, np.r_[0, np.sign(trans.amount[1:-1]), 0]):
    p.set_color({0: 'k', 1: 'g', -1: 'r'}[c])
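The same color rule can be written out as a small plain-Python helper, which may be easier to read and to test. This is a sketch; waterfall_colors is a name I made up, and like the one-liner above it paints a bar whose amount is exactly zero black, which the question does not specify.

```python
def waterfall_colors(amounts):
    """Return one color code per bar: black for the first and last bar,
    green for positive amounts in between, red for negative ones."""
    last = len(amounts) - 1
    colors = []
    for i, amount in enumerate(amounts):
        if i == 0 or i == last or amount == 0:
            colors.append("k")  # first bar, last bar, or a zero amount (assumption)
        elif amount > 0:
            colors.append("g")
        else:
            colors.append("r")
    return colors


# Amounts from the question, with the net total appended as the last bar.
amounts = [350000, -30000, -7500, -25000, 95000, -7000]
amounts.append(sum(amounts))
colors = waterfall_colors(amounts)
print(colors)
```

The resulting list can then be applied to the patches in the same way, by zipping it with my_plot.containers[0].patches and calling p.set_color(c) on each pair.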

Waterfall chart python matplotlib

I am having a problem with a waterfall chart. I took this chart from the matplotlib site and swapped in my own data frame, which has two simple columns with some integer numbers. The waterfall was produced, but without numbers: just empty bars. I am a bit lost and would appreciate any suggestions.
What I am trying to build is a custom waterfall that takes one dataframe with column names, values, and some values for filters such as countries. I haven't found anything like that anywhere, so I am trying to build my own.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
dataset = pd.read_csv('waterfall_test_data.csv')
#Use python 2.7+ syntax to format currency
def money(x, pos):
    'The two args are the value and tick position'
    return "${:,.0f}".format(x)
formatter = FuncFormatter(money)
#Data to plot. Do not include a total, it will be calculated
index = dataset['columns']
data = dataset['amount']
#Store data and create a blank series to use for the waterfall
trans = pd.DataFrame(data=data,index=index)
blank = trans.amount.cumsum().shift(1).fillna(0)
#Get the net total number for the final element in the waterfall
total = trans.sum().amount
trans.loc["net"]= total
blank.loc["net"] = total
#The steps graphically show the levels as well as used for label placement
step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan
#When plotting the last element, we want to show the full bar,
#Set the blank to 0
blank.loc["net"] = 0
#Plot and label
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(15, 5), title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")
#Format the axis for dollars
my_plot.yaxis.set_major_formatter(formatter)
#Get the y-axis position for the labels
y_height = trans.amount.cumsum().shift(1).fillna(0)
#Get an offset so labels don't sit right on top of the bar
max = trans.max()
neg_offset = max / 25
pos_offset = max / 50
plot_offset = int(max / 15)
#Start label loop
loop = 0
for index, row in trans.iterrows():
    # For the last item in the list, we don't want to double count
    if row['amount'] == total:
        y = y_height[loop]
    else:
        y = y_height[loop] + row['amount']
    # Determine if we want a neg or pos offset
    if row['amount'] > 0:
        y += pos_offset
    else:
        y -= neg_offset
    my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center")
    loop+=1
#Scale up the y axis so there is room for the labels
my_plot.set_ylim(0,blank.max()+int(plot_offset))
#Rotate the labels
my_plot.set_xticklabels(trans.index,rotation=0)
my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
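I can't see the CSV, so this is a guess, but one likely cause of the empty bars is the line that builds trans. When data is a pandas Series, pd.DataFrame(data=data, index=index) aligns the Series on its own 0..n-1 index against the new labels, finds no matches, and fills the column with NaN. The sketch below reproduces that with a made-up three-row stand-in for waterfall_test_data.csv and shows one way to avoid it with set_index.

```python
import pandas as pd

# Hypothetical stand-in for waterfall_test_data.csv
dataset = pd.DataFrame({
    "columns": ["sales", "returns", "shipping"],
    "amount": [350000, -30000, -7000],
})

# As in the question: the Series still carries the index 0, 1, 2, so pandas
# reindexes it to the string labels and every value becomes NaN.
broken = pd.DataFrame(data=dataset["amount"], index=dataset["columns"])
print(broken["amount"].isna().all())

# Using the label column as the index keeps the values attached to their labels.
trans = dataset.set_index("columns")[["amount"]]
print(trans)
```

With trans built this way, the rest of the script (trans.amount.cumsum() and so on) should see real numbers instead of NaN.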

How to draw proper chart of distributional tree?

I am using Python with matplotlib and need to visualize the percentage distribution of the sub-groups of a data set.
Imagine this tree:
Data --- group1 (40%)
     -
     --- group2 (25%)
     -
     --- group3 (35%)
group1 --- A (25%)
       -
       --- B (25%)
       -
       --- C (50%)
And it can go on: each group can have several sub-groups, and the same holds for each sub-group.
How can I plot a proper chart for this information?
I created a minimal reproducible example that I think fits your description, but please let me know if that is not what you need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)
For instance, we could get the following counts for the subgroups.
In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]:
group  subgroup
1      A           17
       C           16
       B            5
2      A           23
       C           10
       B            7
3      C            8
       A            7
       B            7
Name: subgroup, dtype: int64
I created a function that computes the necessary counts given an ordering of the columns (e.g. ['group', 'subgroup']) and incrementally plots the bars with the corresponding percentages.
import matplotlib.pyplot as plt
import matplotlib.cm
def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of bar plots reflecting how the data
    is distributed at different levels. The order of the
    levels is given by the ordering parameter.
    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.
    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """
    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        plt.axis('off')
    counts=[data.shape[0]]
    # Get colormap
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel is nice but has few colors. Change for a larger map if needed
    # (plt.get_cmap is used because matplotlib.cm.get_cmap was removed in matplotlib 3.9)
    cmap = plt.get_cmap('Pastel1', len(labels))
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))
    # Group the counts
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
    for i, o in enumerate(ordering[:-1], 1):
        if ordering[:i]:
            counts['c_' + o]=counts.groupby(ordering[:i]).transform('sum')['c_' + ordering[-1]]
    # Calculate percentages
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]
    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
    comb = 1 # keeps track of the number of possible combinations at each level
    for bar, col in enumerate(ordering):
        labels = sorted(data[col].unique())*comb
        comb *= len(data[col].unique())
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0 # start from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size # stack the bars
    ax.legend(colors)
    return fig
With the data shown above we would get the following.
fig = plot_tree(data, ['group', 'subgroup'], axis=True)
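If you only need the percentages that appear in the tree, and not the plot itself, pandas can compute them at each level with value_counts(normalize=True). A minimal sketch on a made-up 20-row table that matches the tree in the question; the way groups 2 and 3 split into subgroups is invented, since the question does not give it.

```python
import pandas as pd

rows = (
    [("1", "A")] * 2 + [("1", "B")] * 2 + [("1", "C")] * 4  # group 1: 8 of 20 rows
    + [("2", "A")] * 5                                      # group 2: 5 of 20 rows (split invented)
    + [("3", "B")] * 7                                      # group 3: 7 of 20 rows (split invented)
)
data = pd.DataFrame(rows, columns=["group", "subgroup"])

# Share of each group in the whole data set.
group_share = data["group"].value_counts(normalize=True)

# Share of each subgroup within its parent group.
sub_share = data.groupby("group")["subgroup"].value_counts(normalize=True)

print(group_share)
print(sub_share)
```

Here group 1 holds 40% of the rows, and within group 1 the subgroups A, B and C hold 25%, 25% and 50%, as in the tree.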
Have you tried stacked bar graph?
https://matplotlib.org/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py

python - pandas - setting x ticks labels

I have the following data:
from collections import OrderedDict
dico = OrderedDict({"Cisco" :54496.923851069776,
"Citrix" :75164.2973859488,
"Datacore/veritas/docker/quest" :7138.499540816414,
"Dell / EMC" : 34836.42983441935,
"HPE": 40265.33070005489,
"IBM Hard Ware / IBM services" : 220724.89293359307,
"Microsoft cloud" : 3159.7624999999994,
"Netapp":48898.21721115539,
"Nutanix / Lenovo DCG":38761.815197677075,
"Oracle/Microfocus":100877.21884162886,
"Other brands":13825.151033348895,
"VM Ware":21267.66907692287,
"Veeam / Redhat":5006.715599405339})
That I can plot:
import pandas as pd
df = pd.DataFrame(list(dico.values()))
df.index = dico.keys()
ax = df.sort_values(0).plot.barh()
but I want to format the xtick labels:
ax = df.sort_values(0).plot.barh()
new_labels = [str(pow(10,i-1))+"€" if i>0 else str(i) for i, tick_label in enumerate(ax.get_xticklabels())]
print(new_labels)
ax.set_xticklabels(new_labels)
Giving:
['0', '1€', '10€', '100€', '1000€', '10000€']
Why don't I get 20 000 in the list of new labels?
Why is 10 000 itself not displayed?
You don't get 20000 because you are creating powers of 10 with pow(10,i-1); that expression can never produce 20000. Moreover, 10000 is not displayed because ax.set_xticklabels only relabels the xticks that already exist. Since you have only 5 major ticks in your first plot, only 5 labels are shown: 0, 1, 10, 100, 1000, as per your definition.
To get what you want, just replace the last three lines of your code (after plotting) with:
locs = ax.get_xticks()
labels = [ '{}{}'.format(int(i), '\u20ac') for i in locs]
ax.set_xticklabels(labels)
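The difference between the two approaches can be seen without plotting at all. In the sketch below, locs is a made-up list standing in for what ax.get_xticks() might return; the real values depend on your axis limits.

```python
# Hypothetical tick locations standing in for ax.get_xticks()
locs = [0.0, 50000.0, 100000.0, 150000.0, 200000.0, 250000.0]

# Labels from the question: built from the tick's position in the list,
# so they are always 0 followed by powers of ten, whatever the axis shows.
by_index = [str(pow(10, i - 1)) + "€" if i > 0 else str(i)
            for i, _ in enumerate(locs)]

# Labels from the answer: built from the tick's actual location.
by_value = ["{}{}".format(int(v), "\u20ac") for v in locs]

print(by_index)
print(by_value)
```

by_index reproduces the list printed in the question, while by_value follows the real tick values.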