I have a DataFrame built from this dictionary:
from collections import OrderedDict
dico = OrderedDict({"Cisco" :54496.923851069776,
"Citrix" :75164.2973859488,
"Datacore/veritas/docker/quest" :7138.499540816414,
"Dell / EMC" : 34836.42983441935,
"HPE": 40265.33070005489,
"IBM Hard Ware / IBM services" : 220724.89293359307,
"Microsoft cloud" : 3159.7624999999994,
"Netapp":48898.21721115539,
"Nutanix / Lenovo DCG":38761.815197677075,
"Oracle/Microfocus":100877.21884162886,
"Other brands":13825.151033348895,
"VM Ware":21267.66907692287,
"Veeam / Redhat":5006.715599405339})
which I can plot:
import pandas as pd
df = pd.DataFrame(list(dico.values()))
df.index = dico.keys()
ax = df.sort_values(0).plot.barh()
But I want to format the xtick labels:
ax = df.sort_values(0).plot.barh()
new_labels = [str(pow(10,i-1))+"€" if i>0 else str(i) for i, tick_label in enumerate(ax.get_xticklabels())]
print(new_labels)
ax.set_xticklabels(new_labels)
Giving :
['0', '1€', '10€', '100€', '1000€', '10000€']
Why don't I get 20 000 in the list of new labels?
Why is 10 000 itself not displayed?
You don't get 20000 because you build the labels as powers of 10 with pow(10,i-1), and 20000 is not a power of 10, so that expression can never produce it. Moreover, 10000 is not displayed because ax.set_xticklabels only relabels the xticks that already exist, in order; it neither moves them nor creates new ones. Since your first plot has only a handful of major ticks, you create just that many labels (0, 1, 10, 100, 1000, ...) as per your definition, and they no longer correspond to the values at the tick positions.
To get what you want, just replace the last three lines of your code (after plotting) with:
locs = ax.get_xticks()
labels = [ '{}{}'.format(int(i), '\u20ac') for i in locs]
ax.set_xticklabels(labels)
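As an alternative to rewriting the label list by hand, here is a minimal sketch that attaches a FuncFormatter to the x axis, so the labels stay correct if the ticks change (for example after resizing or zooming). The function name euro and the three-row subset of the data are only for illustration; the Agg backend line is there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove for an interactive window
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter


def euro(x, pos):
    """Format a tick value as whole euros; pos is the tick index (unused)."""
    return "{}\u20ac".format(int(x))


# A small subset of the data from the question, enough to show the idea
df = pd.DataFrame({0: [54496.92, 75164.30, 220724.89]},
                  index=["Cisco", "Citrix", "IBM Hard Ware / IBM services"])

ax = df.sort_values(0).plot.barh()
# The formatter is called for every tick matplotlib decides to draw
ax.xaxis.set_major_formatter(FuncFormatter(euro))
```

Call plt.show() (with an interactive backend) or ax.get_figure().savefig(...) afterwards to see the result.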
This is a code for a waterfall chart. I'd kindly like to ask:
1. Is there a way to simplify this code? It is far too long and I'm sure there are a lot of extra lines that could be removed.
2. How can I make the first and last bars black? Since I am creating a waterfall chart, I want the first and last bars to be black at all times and the bars in between to be green or red depending on whether the value is positive or negative.
3. Bars greater than zero should be green.
4. Bars less than zero should be red.
Any help would be greatly appreciated.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
#Use python 2.7+ syntax to format currency
def money(x, pos):
    'The two args are the value and tick position'
    return "${:,.0f}".format(x)
formatter = FuncFormatter(money)
#Data to plot. Do not include a total, it will be calculated
index = ['sales','returns','credit fees','rebates','late charges','shipping']
data = {'amount': [350000,-30000,-7500,-25000,95000,-7000]}
#Store data and create a blank series to use for the waterfall
trans = pd.DataFrame(data=data,index=index)
blank = trans.amount.cumsum().shift(1).fillna(0)
#Get the net total number for the final element in the waterfall
total = trans.sum().amount
trans.loc["net"]= total
blank.loc["net"] = total
#The steps graphically show the levels as well as used for label placement
step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan
#When plotting the last element, we want to show the full bar,
#Set the blank to 0
blank.loc["net"] = 0
#Plot and label
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(10, 5), title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")
#Format the axis for dollars
my_plot.yaxis.set_major_formatter(formatter)
#Get the y-axis position for the labels
y_height = trans.amount.cumsum().shift(1).fillna(0)
#Get an offset so labels don't sit right on top of the bar
max_amount = trans.amount.max()
neg_offset = max_amount / 25
pos_offset = max_amount / 50
plot_offset = int(max_amount / 15)
#Start label loop
loop = 0
for index, row in trans.iterrows():
    # For the last item in the list, we don't want to double count
    if row['amount'] == total:
        y = y_height.iloc[loop]
    else:
        y = y_height.iloc[loop] + row['amount']
    # Determine if we want a neg or pos offset
    if row['amount'] > 0:
        y += pos_offset
    else:
        y -= neg_offset
    my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center")
    loop+=1
#Scale up the y axis so there is room for the labels
my_plot.set_ylim(0,blank.max()+int(plot_offset))
#Rotate the labels
my_plot.set_xticklabels(trans.index,rotation=0)
my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
Answer to questions 2, 3 and 4: set the colors of the bar patches after plotting them:
for p, c in zip(my_plot.containers[0].patches, np.r_[0, np.sign(trans.amount[1:-1]), 0]):
    p.set_color({0: 'k', 1: 'g', -1: 'r'}[c])
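If the one-liner is hard to read, the same colour rule can be spelled out as a small helper. This is only a sketch: waterfall_colors is a hypothetical name, and the amounts are the values from the question with the net total appended.

```python
def waterfall_colors(amounts):
    """Return one colour code per bar: black for the first and last bar,
    green for positive values and red for negative values in between."""
    amounts = list(amounts)
    last = len(amounts) - 1
    colors = []
    for i, amount in enumerate(amounts):
        if i == 0 or i == last:
            colors.append('k')   # first and last bars are always black
        elif amount > 0:
            colors.append('g')   # increase
        elif amount < 0:
            colors.append('r')   # decrease
        else:
            colors.append('k')   # zero change, same fallback as the mapping above
    return colors


# Amounts from the question, followed by the net total
amounts = [350000, -30000, -7500, -25000, 95000, -7000, 375500]
print(waterfall_colors(amounts))
```

The returned list can be zipped with my_plot.containers[0].patches and applied with p.set_color(c), exactly as in the loop above.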
I am trying to plot column data against the row labels of a data frame. The plot looks good, but the Y axis becomes illegible as the number of rows increases. What I don't get is why the automatic spacing works fine for the X axis but not for the Y axis.
x1 = M.iloc[:,1]
plt.plot(x1,x)
Where the variable "x" represents Column 0 values of dataframe "M" below
The "M" dataframe:
            0.0         0.5   1.0
0           300  300.000000  1550
1.00e-01 s  300  300.769527  1550
2.00e-01 s  300  301.538106  1550
3.00e-01 s  300  302.305739  1550
...
2.80e+00 s  300  321.192396  1550
2.90e+00 s  300  321.935830  1550
Edit
So it seems that the first column being formatted in scientific notation is what messes things up, though I'm still not sure why.
import pandas as pd
x = [0]
i = 1
while i < 30:
    q = i*0.1
    xx = str('{:.2e}'.format(q)) + ' s'
    x.append(xx)
    i = i + 1
# three columns, labelled as in the table above
M = pd.DataFrame(index=x, columns=[0.0, 0.5, 1.0])
So in the code above, it is the line xx = str('{:.2e}'.format(q)) + ' s' that is making the Y-labels go crazy. I unfortunately can't take it out as I need them to be in scientific notation.
You can try setting the tick spacing, if it is okay to drop some of the tick labels. Other options are to increase your plot size or decrease the font size of the y labels.
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
x1 = M.iloc[:,1]
tick_spacing = 2 # or whatever label gap you want to use.
fig, ax = plt.subplots(1,1)
ax.plot(x1,x)
ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.show()
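On the "why": the y values are strings, so matplotlib treats the axis as categorical and draws one tick per label, whereas the numeric x axis gets automatic tick placement. A different approach, sketched below, is to keep the times numeric and apply the scientific-notation format only when the ticks are drawn. The temperature values here are made up to stand in for column 0.5 of M, and sci_seconds is a hypothetical helper name.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove for an interactive window
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FuncFormatter


def sci_seconds(value, pos):
    """Render a numeric tick value like the original labels, e.g. '1.00e-01 s'."""
    return "{:.2e} s".format(value)


t = np.arange(30) * 0.1      # numeric times 0.0, 0.1, ..., 2.9
temp = 300 + 7.6 * t         # made-up values, only to have something to plot

fig, ax = plt.subplots()
ax.plot(temp, t)
# Numeric axis: matplotlib picks a readable number of ticks on its own,
# and the formatter only changes how each tick is printed
ax.yaxis.set_major_formatter(FuncFormatter(sci_seconds))
```

With this, the index of M can still hold the formatted strings for display in the table; only the plot uses the numeric values.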
I am having a problem with waterfall. I took this chart from matplotlib site and added my own data frame with 2 simple columns with some integer numbers. My waterfall was produced but without numbers, just empty bars. I am a bit lost and I would appreciate any suggestions.
What I am trying to build is the custom waterfall that takes one dataframe with column names, values, and some values for filters like countries. I haven't found anything like that anywhere so I am trying to build my own.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
dataset = pd.read_csv('waterfall_test_data.csv')
#Use python 2.7+ syntax to format currency
def money(x, pos):
    'The two args are the value and tick position'
    return "${:,.0f}".format(x)
formatter = FuncFormatter(money)
#Data to plot. Do not include a total, it will be calculated
index = dataset['columns']
data = dataset['amount']
#Store data and create a blank series to use for the waterfall
trans = pd.DataFrame(data=data,index=index)
blank = trans.amount.cumsum().shift(1).fillna(0)
#Get the net total number for the final element in the waterfall
total = trans.sum().amount
trans.loc["net"]= total
blank.loc["net"] = total
#The steps graphically show the levels as well as used for label placement
step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan
#When plotting the last element, we want to show the full bar,
#Set the blank to 0
blank.loc["net"] = 0
#Plot and label
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(15, 5), title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")
#Format the axis for dollars
my_plot.yaxis.set_major_formatter(formatter)
#Get the y-axis position for the labels
y_height = trans.amount.cumsum().shift(1).fillna(0)
#Get an offset so labels don't sit right on top of the bar
max = trans.max()
neg_offset = max / 25
pos_offset = max / 50
plot_offset = int(max / 15)
#Start label loop
loop = 0
for index, row in trans.iterrows():
    # For the last item in the list, we don't want to double count
    if row['amount'] == total:
        y = y_height.iloc[loop]
    else:
        y = y_height.iloc[loop] + row['amount']
    # Determine if we want a neg or pos offset
    if row['amount'] > 0:
        y += pos_offset
    else:
        y -= neg_offset
    my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center")
    loop+=1
#Scale up the y axis so there is room for the labels
my_plot.set_ylim(0,blank.max()+int(plot_offset))
#Rotate the labels
my_plot.set_xticklabels(trans.index,rotation=0)
my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
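One possible cause, which cannot be confirmed without seeing the CSV: when pd.DataFrame is given a Series as data together with an explicit index, pandas aligns the Series to that index instead of pairing the values up by position. If dataset['amount'] has the default 0, 1, 2, ... index and dataset['columns'] holds names, nothing matches and every amount becomes NaN, which would give bars with no height. The sketch below reproduces this with a small stand-in frame (its contents are hypothetical) and shows one way around it.

```python
import pandas as pd

# Stand-in for waterfall_test_data.csv; the real file's contents are unknown
dataset = pd.DataFrame({"columns": ["sales", "returns", "rebates"],
                        "amount": [350000, -30000, -25000]})

# As in the question: the Series is re-indexed to the new labels,
# none of which exist in its own 0..n-1 index, so all values are NaN
broken = pd.DataFrame(data=dataset["amount"], index=dataset["columns"])
print(broken)

# Using the label column as the index keeps the values paired by row
fixed = dataset.set_index("columns")[["amount"]]
print(fixed)
```

If this is what is happening, replacing the trans = pd.DataFrame(...) line with the set_index version should be enough; the rest of the script can stay as it is.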
I am using Python with matplotlib and need to visualize the percentage distribution of the sub-groups of a data set.
Imagine this tree:
Data --- group1 (40%)
     |
     --- group2 (25%)
     |
     --- group3 (35%)

group1 --- A (25%)
       |
       --- B (25%)
       |
       --- c (50%)
And it can go on: each group can have several sub-groups, and the same goes for each sub-group.
How can I plot a proper chart for this information?
I created a minimal reproducible example that I think fits your description, but please let me know if that is not what you need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)
For instance, we could get the following counts for the subgroups.
In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]:
group  subgroup
1      A           17
       C           16
       B            5
2      A           23
       C           10
       B            7
3      C            8
       A            7
       B            7
Name: subgroup, dtype: int64
I created a function that computes the necessary counts given an ordering of the columns (e.g. ['group', 'subgroup']) and incrementally plots the bars with the corresponding percentages.
import matplotlib.pyplot as plt
import numpy as np

def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of bar plots reflecting how the data
    is distributed at different levels. The order of the
    levels is given by the ordering parameter.

    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.

    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """
    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        plt.axis('off')
    counts=[data.shape[0]]

    # Get colormap
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel is nice but has few colors. Change for a larger map if needed
    # (plt.get_cmap is used because matplotlib.cm.get_cmap was removed in recent releases)
    cmap = plt.get_cmap('Pastel1', len(labels))
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))

    # Group the counts
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
    for i, o in enumerate(ordering[:-1], 1):
        if ordering[:i]:
            counts['c_' + o]=counts.groupby(ordering[:i]).transform('sum')['c_' + ordering[-1]]
    # Calculate percentages
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]

    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
    comb = 1  # keeps track of the number of possible combinations at each level
    for bar, col in enumerate(ordering):
        labels = sorted(data[col].unique())*comb
        comb *= len(data[col].unique())
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0  # start from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size  # stack the bars
    ax.legend(colors)
    return fig
With the data shown above we would get the following.
fig = plot_tree(data, ['group', 'subgroup'], axis=True)
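To check the percentages that end up in the annotations, the shares at each level can also be computed directly with pandas. This is a small sketch on a hand-made data set (the values are hypothetical, chosen so the result is easy to verify by eye), not part of plot_tree itself.

```python
import pandas as pd

# Tiny hand-made data set: 4 rows in group "1", 6 rows in group "2"
data = pd.DataFrame({
    "group":    ["1", "1", "1", "1", "2", "2", "2", "2", "2", "2"],
    "subgroup": ["A", "A", "A", "B", "A", "B", "B", "C", "C", "C"],
})

# Share of each group in the whole data set
group_share = data["group"].value_counts(normalize=True)

# Share of each subgroup within its own group
subgroup_share = data.groupby("group")["subgroup"].value_counts(normalize=True)

print(group_share)
print(subgroup_share)
```

Here group "1" makes up 40% of the rows, and within group "1" subgroup "A" accounts for 75%.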
Have you tried a stacked bar graph?
https://matplotlib.org/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py
I have data for two groups in a pandas dataframe, with for each group the mean of 3 different items of a scale:
          item1     item2     item3
group
1      2.807692  3.115385  3.923077
2      2.909091  2.454545  3.909091
I would like to plot the means for both groups in a bar plot. I found some code for doing just that here with the following function:
import numpy as np
import matplotlib.pyplot as plt

def groupedbarplot(x_data, y_data_list, y_data_names, colors, x_label, y_label, title):
    _, ax = plt.subplots()
    # Total width for all bars at one x location
    total_width = 0.5
    # Width of each individual bar
    ind_width = total_width / len(y_data_list)
    # This centers each cluster of bars about the x tick mark
    alteration = np.arange(-(total_width/2), total_width/2, ind_width)
    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # Move the bar to the right on the x-axis so it doesn't
        # overlap with previously drawn ones
        ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
This works perfectly fine with the dataframe above. However, instead of having the bars for all items grouped together per group, I want the bars grouped per item, so that I can see the difference between the groups for each item. I've therefore transposed the dataframe, but this gives me an error when plotting:
groupedbarplot(x_data = data.index.values
, y_data_list = [data[1],data[2]]
, y_data_names = ['group1', 'group2']
, colors = ['blue', 'orange']
, x_label = 'Scale'
, y_label = 'Score'
, title = 'title')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-111-b910d304a19e> in <module>()
5 , x_label = 'Verloning scale'
6 , y_label = 'Score'
----> 7 , title = 'Score op elke item voor werknemer en oud-werknemers')
<ipython-input-66-9fa4a515d5e9> in groupedbarplot(x_data, y_data_list, y_data_names, colors, x_label, y_label, title)
11 # Move the bar to the right on the x-axis so it doesn't
12 # overlap with previously drawn ones
---> 13 ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
14 ax.set_ylabel(y_label)
15 ax.set_xlabel(x_label)
TypeError: Can't convert 'float' object to str implicitly
I have checked all the variables to see where the difference is, but can't seem to find it. Any ideas?
When I transpose the dataframe you provide, the result looks like this:
              1         2
item1  2.807692  2.909091
item2  3.115385  2.454545
item3  3.923077  3.909091
therefore data.index.values returns array(['item1', 'item2', 'item3'], dtype=object), and I suspect that your error results from x_data + alteration[i] which tries to add a float to your array which contains strings.
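One way around this, sketched below under the assumption that the transposed frame looks like the one shown above, is to position the bars with numeric x values from np.arange and only use the item names as tick labels. The offset calculation differs slightly from the original function so that each cluster is centred on its tick.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Transposed frame from the question: items as rows, groups as columns
data = pd.DataFrame({1: [2.807692, 3.115385, 3.923077],
                     2: [2.909091, 2.454545, 3.909091]},
                    index=["item1", "item2", "item3"])

x_pos = np.arange(len(data.index))          # numeric positions 0, 1, 2
total_width = 0.5                           # width of one cluster of bars
ind_width = total_width / len(data.columns)

fig, ax = plt.subplots()
for i, col in enumerate(data.columns):
    # Shift each group's bars so the whole cluster is centred on the tick
    offset = -total_width / 2 + (i + 0.5) * ind_width
    ax.bar(x_pos + offset, data[col], width=ind_width,
           label="group{}".format(col))

# Put the item names back as labels on the numeric positions
ax.set_xticks(x_pos)
ax.set_xticklabels(data.index)
ax.legend(loc="upper right")
```

The existing groupedbarplot function should also work if it is called with x_data = np.arange(len(data.index)) and the tick labels are set afterwards in the same way.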