I'm struggling to get a stacked vbar working.
With Python/pandas and Bokeh I want to plot several statistics about the players of a football team. The dataframe is filled correctly: the values are strings where they should be strings and numeric where they should be numeric.
I used the Bokeh sample and tried to adjust it for my purpose, but I'm stuck on this error:
'ValueError: Keyword argument sequences for broadcasting must be the same length as stackers'
My code (without imports and scraping pieces) is:
source = ColumnDataSource(data=statsdfsource[['goals','assists','naam']])
p = figure(plot_height=250, title="Fruit Counts by Year",
toolbar_location=None, tools="")
p.vbar_stack(['goals','assists'], x='naam', width=0.9, color=colors,
source=source)
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
show(p)
The dataframe I fill the columndatasource with is
    goals  assists  naam
0     NaN      NaN  Miguel Santos
1     NaN      NaN  Aykut Özer
2     NaN      NaN  Job van de Walle
3     NaN      NaN  Rowen Koot
4     8.0      6.0  Perr Schuurs
5     4.0      2.0  Wessel Dammers
6    12.0      2.0  Stefan Askovski
7     1.0      NaN  Mica Pinto
8     NaN      NaN  Christopher Braun
9     1.0      4.0  Marco Ospitalieri
10    NaN      1.0  Clint Esser
The result I want is a stacked column chart with the player names on the x-axis and two stacked bars above each name: one for the goals the player scored and one for the assists.
I think I'm messing up somewhere in how my dataframe is built, but I'm unsure how it should be shaped (on the other hand, I can't really imagine that the dataframe doesn't fit the purpose).
When using categorical ranges, you have to tell figure what the categories for the axis are and what order you want them to show up, e.g. provide x_range something like:
# specify all the factors for the x-axis by passing x_range
p = figure(..., x_range=sorted(df.naam.unique()))
It's also possible the NaN values are a problem, since they are "contagious". I'd recommend changing them to zeros instead in any case.
Finally, the error message probably indicates that your colors list is the wrong length. You are stacking two bars in each column, so the list of colors also needs to have two entries (one color for each "row" in the stack).
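Putting the three points together, a minimal sketch might look like the following. The sample rows are a subset of the question's data; the helper names `prepare` and `make_plot`, the two hex colors, and the plot title are my own placeholders, and `legend_label` assumes a reasonably recent Bokeh (1.4 or later).

```python
import pandas as pd

# A few rows taken from the question's dataframe (None becomes NaN).
stats = pd.DataFrame({
    "goals": [None, 8.0, 4.0, 1.0, None],
    "assists": [None, 6.0, 2.0, None, 1.0],
    "naam": ["Miguel Santos", "Perr Schuurs", "Wessel Dammers",
             "Mica Pinto", "Clint Esser"],
})

stackers = ["goals", "assists"]
colors = ["#718dbf", "#e84d60"]  # placeholder colors: one per stacker


def prepare(df, stackers):
    """Return a copy with NaN replaced by 0 in the stacked columns."""
    out = df.copy()
    out[stackers] = out[stackers].fillna(0)
    return out


def make_plot(df, stackers, colors):
    """Build the stacked bar figure (Bokeh is imported only when called)."""
    from bokeh.models import ColumnDataSource
    from bokeh.plotting import figure

    if len(colors) != len(stackers):
        raise ValueError("need exactly one color per stacker")

    # Categorical axis: pass the factors explicitly via x_range.
    p = figure(x_range=sorted(df["naam"].unique()),
               title="Goals and assists per player",
               toolbar_location=None, tools="")
    p.vbar_stack(stackers, x="naam", width=0.9, color=colors,
                 source=ColumnDataSource(df), legend_label=stackers)
    p.y_range.start = 0
    return p


clean = prepare(stats, stackers)
```

Calling `show(make_plot(clean, stackers, colors))` (with `show` imported from `bokeh.plotting`) should then render the chart.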
Related
I am using pandas and matplotlib to generate some charts.
My DataFrame:
   Journal                                              Papers per year in journal
0  Information and Software Technology                  4
1  2012 International Conference on Cyber Securit...    4
2  Journal of Network and Computer Applications         4
3  IEEE Security & Privacy                              5
4  Computers & Security                                 11
My Dataframe is a result of a groupby out of a larger dataframe. What I want now, is a simple barchart, which in theory works fine with a df_groupby_time.plot(kind='bar'). However, I get this:
What I want are different colored bars, and a legend which states which color corresponds to which paper.
Playing around with relabeling hasn't gotten me anywhere so far. And I have no idea anymore on how to achieve what I want.
EDIT:
Resetting the index and plotting isn't what I want:
df_groupby_time.set_index("Journal").plot(kind='bar')
I found a solution, based on this question here.
So, the dataframe needs to be transformed into a matrix where the values exist only on the main diagonal.
First, I save the Journal column in a variable for later.
new_cols = df["Journal"].values
Secondly, I wrote a function that takes a series (the column Papers per year in journal) and the previously saved new column labels as input parameters, and returns a dataframe where the values are only on the main diagonal:
def values_into_main_diagonal(some_series, new_cols):
    """Puts the values of a series onto the main diagonal of a new df.
    some_series - any series given
    new_cols - the new column labels as list or numpy.ndarray"""
    x = [{i: some_series[i]} for i in range(len(some_series))]
    main_diag_df = pd.DataFrame(x)
    main_diag_df.columns = new_cols
    return main_diag_df
Thirdly, feeding the function the Papers per year in journal column and our saved new column names returns the following dataframe:
new_df:
   1_journal  2_journal  3_journal  4_journal  5_journal
0          4        NaN        NaN        NaN        NaN
1        NaN          4        NaN        NaN        NaN
2        NaN        NaN          4        NaN        NaN
3        NaN        NaN        NaN          5        NaN
4        NaN        NaN        NaN        NaN         11
Finally, plotting new_df via new_df.plot(kind='bar', stacked=True) gives me what I want: the journals in different colors in the legend and NOT on the axis.
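For reference, here is a self-contained sketch of those steps run on the sample data from the question. One change of mine: the values are read positionally with `enumerate` rather than by label, so the function should also work when the series does not have a default 0..n-1 index (which can happen after a groupby).

```python
import pandas as pd

df = pd.DataFrame({
    "Journal": [
        "Information and Software Technology",
        "2012 International Conference on Cyber Securit...",
        "Journal of Network and Computer Applications",
        "IEEE Security & Privacy",
        "Computers & Security",
    ],
    "Papers per year in journal": [4, 4, 4, 5, 11],
})


def values_into_main_diagonal(some_series, new_cols):
    """Put the values of a series onto the main diagonal of a new df."""
    # Row i holds a single value under key i; pandas fills the rest with NaN.
    rows = [{i: value} for i, value in enumerate(some_series)]
    main_diag_df = pd.DataFrame(rows)
    main_diag_df.columns = new_cols
    return main_diag_df


new_cols = df["Journal"].values
new_df = values_into_main_diagonal(df["Papers per year in journal"], new_cols)
```

`new_df.plot(kind='bar', stacked=True)` then draws one bar per row, each in its own color, with the journal names in the legend.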
I have a dataframe with a list of surfaces and depths. Some of the surfaces are labeled with the suffix _top and _base.
How can I write a function that will create a column that calculates the thickness of only the surfaces that have the same name with the _top and _base suffix (e.g. red_top - red_base = thickness)?
Example:
df = pd.DataFrame({'Surface': ['red_top', 'red_base',
'blue_top', 'blue_base', 'green_top', 'pink'],
'Depth':[2, 6, 12, 45, 55, 145]})
I've tried to split the surface column to create one for the surfaces and one for the top/base, but I'm not sure if that is necessary and am still stuck on how to calculate the thickness based on meeting those conditions.
Many thanks
I would first split the "Surface" column into two parts, "Color" and "Level", then pivot the table by "Color", and then calculate the thickness as follows:
split = df.Surface.str.split("_", expand=True)
split.columns = ["Color", "Level"]
df = pd.concat([df, split], axis=1)
df_pivoted = df.pivot(index="Color", columns="Level", values="Depth")
df_pivoted["Thickness"] = df_pivoted.base - df_pivoted.top
df_pivoted for your example looks like this:
Level    NaN  base   top  Thickness
Color
blue     NaN  45.0  12.0       33.0
green    NaN   NaN  55.0        NaN
pink   145.0   NaN   NaN        NaN
red      NaN   6.0   2.0        4.0
The NaN column has non-empty values for surfaces without a suffix.
The line below provides the thickness calculation just for surfaces that have both _top and _base:
thickness = (df_pivoted.base-df_pivoted.top).dropna()
print(thickness)
results in
Color
blue 33.0
red 4.0
dtype: float64
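Since the question asked for a function that adds a column, here is one possible sketch. The name `add_thickness` is made up, and it uses `str.extract` with a regular expression instead of `str.split`, so that only names ending in `_top` or `_base` are matched; treat it as one way to do it rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({'Surface': ['red_top', 'red_base',
                               'blue_top', 'blue_base', 'green_top', 'pink'],
                   'Depth': [2, 6, 12, 45, 55, 145]})


def add_thickness(df):
    """Return a copy of df with a Thickness column (base depth - top depth).

    Rows whose surface lacks a matching _top/_base pair get NaN.
    """
    # Split "red_top" into name="red", level="top"; non-matching rows are NaN.
    parts = df["Surface"].str.extract(r"^(?P<name>.+)_(?P<level>top|base)$")
    paired = df.assign(name=parts["name"], level=parts["level"])
    paired = paired.dropna(subset=["level"])

    # One row per surface name, one column per level.
    wide = paired.pivot(index="name", columns="level", values="Depth")
    thickness = (wide["base"] - wide["top"]).dropna()

    out = df.copy()
    # Map each row's surface name back to its thickness.
    out["Thickness"] = parts["name"].map(thickness)
    return out


result = add_thickness(df)
```

With the example data, the red and blue rows get a thickness, while green_top (no base) and pink (no suffix) stay NaN.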
I'm plotting some data that requires Day 0 to not be shown on the x-axis. The dataframe has no column for Day 0, but Matplotlib creates a space for it between day -1 and 1. I've looked through the documentation, but can't find a way to adjust spacing between only two ticks. The dataframe is:
group stat -1.0 1.0 2.0 3.0 4.0 5.0
abc mean 8.362999 17.043362 3.526539 22.931884 10.835121 6.035011
abc sem 1.481135 5.029173 0.822778 13.768812 2.149704 0.840965
abc std 3.311919 11.245573 1.839788 30.787999 4.806885 1.880455
Code to plot:
df.set_index(['stat'], inplace=True)
df.drop(['group'],axis=1,inplace=True)
x = df.columns.values
y = df.loc['mean'].values
sem = df.loc['sem'].values
plt.errorbar(x, y, sem, color='#0075d9', marker='o', clip_on=False)
This is an example of the chart (please ignore the shading):
You can see that it has more space between -1 and 1 than the other ticks. Is there a way to 'drop' the Day 0 tick from the X-axis?
What is the best way to automate the graph production in the following case:
I have a data frame with different plan and type in the columns
I want a graph for each combination of plan and type
Dataframe:
plan type hour ok notok other
A cont 0 60.0 40.0 0.0
A cont 1 56.6 31.2 12.2
A vend 2 30.0 50.0 20.0
B test 5 20.0 50.0 30.0
For one df with only one plan and type, I wrote the following code:
fig_ = df.set_index('hour').plot(kind='bar', stacked=True, colormap='YlOrBr')
plt.xlabel('Hour')
plt.ylabel('(%)')
fig_.figure.savefig('p_hour.png', dpi=1000)
plt.show()
In the end, I would like to save one different figure for each combination of plan and type.
Thanks in advance!
You can try iterating over groups using groupby:
for (plan, type_), group in df.groupby(['plan', 'type']):
    fig_ = group.set_index('hour').plot(kind='bar', stacked=True, colormap='YlOrBr')
    plt.xlabel('Hour') # Maybe add plan and type
    plt.ylabel('(%)') # Maybe add plan and type
    fig_.figure.savefig('p_hour_{}_{}.png'.format(plan, type_), dpi=1000)
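A fuller, runnable sketch of the same loop using the sample data from the question. The function name `save_group_plots`, the temporary output directory, the title text, and the lower dpi are my own choices for illustration. Each figure is closed after saving so that figures don't pile up in memory when there are many groups.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "plan": ["A", "A", "A", "B"],
    "type": ["cont", "cont", "vend", "test"],
    "hour": [0, 1, 2, 5],
    "ok": [60.0, 56.6, 30.0, 20.0],
    "notok": [40.0, 31.2, 50.0, 50.0],
    "other": [0.0, 12.2, 20.0, 30.0],
})


def save_group_plots(df, out_dir):
    """Save one stacked bar chart per (plan, type) pair; return the paths."""
    paths = []
    for (plan, type_), group in df.groupby(["plan", "type"]):
        # Keep only the value columns so plan/type are not plotted.
        values = group.set_index("hour")[["ok", "notok", "other"]]
        ax = values.plot(kind="bar", stacked=True, colormap="YlOrBr")
        ax.set_title("Plan {} / type {}".format(plan, type_))
        ax.set_xlabel("Hour")
        ax.set_ylabel("(%)")
        path = os.path.join(out_dir, "p_hour_{}_{}.png".format(plan, type_))
        ax.figure.savefig(path, dpi=150)
        plt.close(ax.figure)  # free the figure before the next group
        paths.append(path)
    return paths


out_dir = tempfile.mkdtemp()
saved = save_group_plots(df, out_dir)
```

For the sample data this writes three files, one each for A/cont, A/vend, and B/test.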
I'm generating some plots based on data that I'm holding in a pandas DataFrame; a snapshot of what this data (call it data) looks like is below:
CIG CLD DPT OBV P06 P12 POS POZ Q06 Q12 TMP \
2010-10-01 18:00:00 8 CL 54 N NaN NaN 0 0 NaN NaN 85
2010-10-01 21:00:00 8 CL 50 N NaN NaN 0 0 NaN NaN 89
2010-10-02 00:00:00 8 CL 51 N 0 NaN 0 0 0 NaN 81
2010-10-02 03:00:00 8 CL 52 N NaN NaN 0 0 NaN NaN 67
2010-10-02 06:00:00 8 CL 52 N 0 NaN 0 0 0 NaN 62
2010-10-02 09:00:00 8 CL 51 N NaN NaN 0 0 NaN NaN 59
...
The idea for one of the plots is to overlay traces of the TMP and DPT fields (generated by using data['TMP'].plot()) on top of shading corresponding to the CLD field. So for instance, the block of time between 2010-10-01 18:00:00-2010-10-01 19:30:00 might be a light gray, and if the next entry for CLD were something else other than "CL", then the block 2010-10-01 19:30:00-2010-10-01 22:30:00 might be a darker color, that way I can see how the CLD field changes contemporaneously with the other fields.
My idea was to use a Rectangle patch from matplotlib.patches to accomplish this shading. Since I'm basing the bounds of the plot on the traces of TMP and DPT, I'll always know exactly what the height of the patch is, and I also always know its left boundary and its width. The wrinkle is that I know them in datetime coordinates, not in x-y coordinates. So, if bnd_left is the left boundary as a datetime, ylo and height are floats, and width is a datetime.timedelta, I'm trying to make a patch like this:
shading_patch = Rectangle([bnd_left, ylo], width, height)
But this doesn't work. There is a TypeError when the patch tries to create itself, since one cannot add a float and a datetime.timedelta. In the documentation, I can't find anything on how to transform the datetime coordinates to floats in the native transform of the plot I created by using the DataFrame.plot() method when I created the traces I'm trying to draw underneath.
Is there any simple way to draw patches on those plots generated with DataFrame.plot()?
OK, after some more digging a much easier solution came up: use the axvspan method. There is a caveat, though. In pandas v0.12, if you slice a DataFrame or TimeSeries using the .ix attribute, the formatting of the x-axis dates gets messed up for some reason. When you plot, you must plot with my_dataframe.plot(ax=ax, x_compat=True) and configure the ticks yourself, or the shading from axvspan won't work.
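A minimal sketch of the axvspan approach, assuming a current pandas/matplotlib rather than v0.12. The TMP and DPT values come from the snapshot in the question, but the CLD codes other than "CL" and the gray levels in `shades` are invented purely to make the shading visible. Each observation shades a block of 1.5 hours on either side of its timestamp, and the bounds are converted with `matplotlib.dates.date2num` so that they are plain floats in the same units as the x-axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

index = pd.to_datetime([
    "2010-10-01 18:00", "2010-10-01 21:00", "2010-10-02 00:00",
    "2010-10-02 03:00", "2010-10-02 06:00", "2010-10-02 09:00",
])
data = pd.DataFrame({
    "TMP": [85, 89, 81, 67, 62, 59],
    "DPT": [54, 50, 51, 52, 52, 51],
    # Hypothetical cloud codes; the real snapshot only shows "CL".
    "CLD": ["CL", "CL", "FW", "FW", "SC", "CL"],
}, index=index)

# Hypothetical mapping from cloud code to a gray level (matplotlib
# interprets a string such as "0.9" as a shade of gray).
shades = {"CL": "0.9", "FW": "0.7", "SC": "0.5"}
half_width = pd.Timedelta(hours=1.5)

fig, ax = plt.subplots()
# x_compat=True makes pandas use matplotlib's own date handling on the x-axis.
data[["TMP", "DPT"]].plot(ax=ax, x_compat=True)

for stamp, cld in data["CLD"].items():
    left = mdates.date2num((stamp - half_width).to_pydatetime())
    right = mdates.date2num((stamp + half_width).to_pydatetime())
    # zorder=0 keeps the shading underneath the TMP/DPT traces.
    ax.axvspan(left, right, color=shades[cld], zorder=0)

fig.autofmt_xdate()
```

Because axvspan spans the full height of the axes, there is no need to work out ylo or height as there would be with a Rectangle patch.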