pandas - automate graph using the combination of two columns - python

What is the best way to automate graph production in the following case:
I have a data frame with different plan and type values in its columns,
and I want one graph for each combination of plan and type.
Dataframe:
plan  type  hour    ok  notok  other
   A  cont     0  60.0   40.0    0.0
   A  cont     1  56.6   31.2   12.2
   A  vend     2  30.0   50.0   20.0
   B  test     5  20.0   50.0   30.0
For one df with only one plan and type, I wrote the following code:
import matplotlib.pyplot as plt

fig_ = df.set_index('hour').plot(kind='bar', stacked=True, colormap='YlOrBr')
plt.xlabel('Hour')
plt.ylabel('(%)')
fig_.figure.savefig('p_hour.png', dpi=1000)
plt.show()
In the end, I would like to save a separate figure for each combination of plan and type.
Thanks in advance!

You can try iterating over groups using groupby:
for (plan, type), group in df.groupby(['plan', 'type']):
    fig_ = group.set_index('hour').plot(kind='bar', stacked=True, colormap='YlOrBr')
    plt.xlabel('Hour')  # maybe add plan and type to the label
    plt.ylabel('(%)')   # maybe add plan and type to the label
    fig_.figure.savefig('p_hour_{}_{}.png'.format(plan, type), dpi=1000)
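If you generate many figures in one run, it can also help to title each plot with its plan/type combination and close the figure after saving. A minimal sketch of that variation (same assumed df, otherwise untested) could look like:

import matplotlib.pyplot as plt

for (plan, type_), group in df.groupby(['plan', 'type']):  # type_ avoids shadowing the built-in
    ax = group.set_index('hour')[['ok', 'notok', 'other']].plot(
        kind='bar', stacked=True, colormap='YlOrBr')
    ax.set_title('plan {} / type {}'.format(plan, type_))  # label the figure with its combination
    ax.set_xlabel('Hour')
    ax.set_ylabel('(%)')
    ax.figure.savefig('p_hour_{}_{}.png'.format(plan, type_), dpi=1000)
    plt.close(ax.figure)  # free memory so matplotlib does not keep every figure open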

Related

How to turn dataframe with categorical rows and numerical columns into coherent chart (sparkline like)

After a bit of wrangling, I've got a dataframe that sort of looks like how it needs to be, with rows representing days of the week, and columns as weeks as they progress. I'd like to present each day as a timeseries plot, almost like a sparkline - but not sure how to do it.
(Here's my original question and thinking: How to create a multiindex chart in Pandas that combines categories and numericals)
The data as it stands:
week of year     1     2     3
Monday        22.8   0.0  22.8
Tuesday        7.6   0.0  22.8
Wednesday     30.4  19.0  19.0
Thursday      15.2  28.0   0.0
Friday        15.2  19.0   0.0
Saturday      15.2  19.0   0.0
Sunday         0.0  26.6   0.0
Update: I wish I had a working Excel license to give a rough idea of what I'm thinking of. I'm using Numbers and, well, it's not Excel. (Unless someone can prove me stupid yet again.)
This would be the line (except for the gaps) representing Wednesday's data. I anticipate that I can fill the gaps over time using averaging.
A simple way is to use seaborn.relplot:
import seaborn as sns
sns.relplot(data=df.melt(id_vars='week of year'),
            y='value', x='variable',
            kind='line', row='week of year',
            height=1, aspect=4,
            )
output:
Or with pure matplotlib:
import matplotlib.pyplot as plt
f, axes = plt.subplots(nrows=len(df), sharex=True)
for i, (name, s) in enumerate(df.set_index('week of year').iterrows()):
    axes[i].plot(s)
    axes[i].set_title(name)
output:
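With seven stacked axes the subplot titles can collide; a call to tight_layout (a small addition of my own, not part of the original answer) usually cleans that up before saving:

f.tight_layout()                      # keep subplot titles from overlapping
f.savefig('sparklines.png', dpi=150)  # 'sparklines.png' is just an example filename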

plotting a vbar_stack using a dataframe

I'm struggling to get a stacked vbar working.
With python/pandas and bokeh I want to plot several statistics about the players of a football team. The dataframe is nicely filled: the values are strings where they should be strings and numeric where they should be numeric.
I used the bokeh sample and tried to adjust it for my purpose, but I'm stuck on this error:
'ValueError: Keyword argument sequences for broadcasting must be the same length as stackers'
My code (without imports and scraping pieces) is:
source = ColumnDataSource(data=statsdfsource[['goals','assists','naam']])
p = figure(plot_height=250, title="Fruit Counts by Year",
           toolbar_location=None, tools="")
p.vbar_stack(['goals','assists'], x='naam', width=0.9, color=colors,
             source=source)
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
show(p)
The dataframe I fill the columndatasource with is
    goals  assists               naam
0     NaN      NaN      Miguel Santos
1     NaN      NaN         Aykut Özer
2     NaN      NaN   Job van de Walle
3     NaN      NaN         Rowen Koot
4     8.0      6.0       Perr Schuurs
5     4.0      2.0     Wessel Dammers
6    12.0      2.0    Stefan Askovski
7     1.0      NaN         Mica Pinto
8     NaN      NaN  Christopher Braun
9     1.0      4.0  Marco Ospitalieri
10    NaN      1.0        Clint Esser
The result I want to reach is a stacked bar chart where the x-axis shows the player's name, with two stacked segments above it: one for the goals the player made and one for the assists.
I think I'm messing up somewhere with how my dataframe is built, but I'm a bit unsure how it should be formed (on the other hand, I can't really imagine that the dataframe doesn't fit the purpose).
When using categorical ranges, you have to tell figure what the categories for the axis are and what order you want them to show up, e.g. provide x_range something like:
# specify all the factors for the x-axis by passing x_range
p = figure(..., x_range=sorted(df.naam.unique()))
It's also possible the NaN values are a problem, since they are "contagious". I'd recommend changing them to zeros instead in any case.
Finally, the error message probably indicates that your colors list is the wrong length: you are stacking two bars in each column, so the list of colors also needs to have length two (one color for each "row" in the stack).
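Putting those three points together, a rough sketch (assuming the dataframe above is called statsdfsource, keeping the question's Bokeh arguments, and picking two arbitrary hex colors) might look like this:

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

df = statsdfsource[['goals', 'assists', 'naam']].fillna(0)  # NaNs break the stacking, so zero them
source = ColumnDataSource(df)

colors = ['#718dbf', '#e84d60']  # one color per stacker: goals, assists

p = figure(plot_height=250, title="Goals and assists per player",
           x_range=sorted(df.naam.unique()),  # declare the categorical x-axis factors
           toolbar_location=None, tools="")
p.vbar_stack(['goals', 'assists'], x='naam', width=0.9, color=colors, source=source)
p.y_range.start = 0
p.xgrid.grid_line_color = None
show(p)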

Plotting irregular time-series (multiple) from dataframe using ggplot

I have a df structured as so:
   UnitNo                 Time  Sensor
0     1.0  2016-07-20 18:34:44    19.0
1     1.0  2016-07-20 19:27:39    19.0
2     3.0  2016-07-20 20:45:39    17.0
3     3.0  2016-07-20 23:05:29    17.0
4     3.0  2016-07-21 01:23:30    11.0
5     2.0  2016-07-21 04:23:59    11.0
6     2.0  2016-07-21 17:33:29     2.0
7     2.0  2016-07-21 18:55:04     2.0
I want to create a time-series plot where each UnitNo has its own line (color) and the y-axis values correspond to Sensor and the x-axis is Time. I want to do this in ggplot, but I am having trouble figuring out how to do this efficiently. I have looked at previous examples but they all have regular time series, i.e., observations for each variable occur at the same times which makes it easy to create a time index. I imagine I can loop through and add data to plot(?), but I was wondering if there was a more efficient/elegant way forward.
df.set_index('Time').groupby('UnitNo').Sensor.plot();
I think you need pivot or set_index and unstack with DataFrame.plot:
df.pivot('Time', 'UnitNo','Sensor').plot()
Or:
df.set_index(['Time', 'UnitNo'])['Sensor'].unstack().plot()
If there are duplicates:
df.groupby(['Time', 'UnitNo'])['Sensor'].mean().unstack().plot()
Or:
df.pivot_table(index='Time', columns='UnitNo', values='Sensor', aggfunc='mean').plot()
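Note that recent pandas releases expect keyword arguments for pivot, so the equivalent, more explicit form of the first suggestion would be:

df.pivot(index='Time', columns='UnitNo', values='Sensor').plot()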

Simple way to subset multiple data frames with pandas groupby or other function?

I have a DataFrame with two columns that I got with this command result = pd.concat([Value, Date], axis=1)
import pandas as pd
>>> result
      Value      Date
189     9.0  11/14/15
191    10.0  11/14/15
192     1.0  11/14/15
193     4.0  11/14/15
...     ...       ...
2920    6.0   2/20/16
2921    8.0   2/20/16
2923   10.0   2/20/16
2925    2.0   2/20/16
But what I need is multiple dataframes of all the Value data for each Date. I know that I can execute something like x = result.groupby('Date').mean() which gives me the mean Value for each Date, but I want the actual data in its own dataframe that was used to produce the mean.
Is there another argument or function to simply get this data frame?
From your comments you can use seaborn directly to plot a distplot of all dates without any grouping or looping with FacetGrid. Here is some fake data for 12 days and then the plot.
Create fake data and then plot
import numpy as np
import pandas as pd
import seaborn as sns

date = pd.date_range('1-1-2016', '1-13-2016', freq='h', closed='left').date
df = pd.DataFrame({'num': np.random.rand(len(date)), 'date': date})

g = sns.FacetGrid(df, col='date', col_wrap=4)
g.map(sns.distplot, "num", hist=False, rug=True)
Your specific data:
g = sns.FacetGrid(result, col='Date', col_wrap=4)
g.map(sns.distplot, 'Value', hist=False, rug=True)
If you do want each group as its own DataFrame, you need a place to put them. Let's say you put them in a dictionary d:
d = {day: group for day, group in result.groupby('Date')}
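Each value in d is then the full sub-DataFrame for one date; using one of the dates shown above, for example:

sub = d['11/14/15']         # all rows whose Date is 11/14/15
print(sub['Value'].mean())  # same figure that result.groupby('Date').mean() reports for that date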

how to load a long panel dataset in Pandas?

I have a panel dataset in long format; that is, observations are at the Panel_ID-Day level. I have, say, m Panel_IDs and each Panel_ID has T(m) Day observations.
For instance, the data would look like this. I show an example with 2 panel IDs (1 and 2) but the data contains a lot of them. X is one variable of interest.
Panel_ID    Day    X
       1  2-feb    5
       1  3-feb  4.3
       1  5-feb    3
       2  2-feb    0
       2  5-feb  0.5
       2  8-feb  3.2
etc. Days are not necessarily the same across Panel_IDs and each Panel_ID has its own number of daily observations.
How can I load this dataset in Pandas so that Pandas recognizes its panel structure?
Many thanks!
Just load it normally, with read_csv() or whatever. I copied your data and used read_clipboard() myself.
Then, set the index:
df = df.set_index(['Panel_ID','Day'])
                  X
Panel_ID Day
1        2-feb  5.0
         3-feb  4.3
         5-feb  3.0
2        2-feb  0.0
         5-feb  0.5
         8-feb  3.2
If you want, you are done at this point. But if you want to convert from dataframe to panel, it is easy after you have indexed the df (note that Panel and to_panel() were deprecated and later removed from pandas, so this step only works on older versions):
pan = df.to_panel()
Honestly, I generally prefer to keep things as a multi-indexed dataframe rather than add the complexity of the panel structure, but you can do things either way. Note that even keeping it as a standard dataframe, you can do lots of reshaping easily with things like stack(). For example, convert from narrow to wide with unstack():
df.unstack(level=1)
              X
Day       2-feb  3-feb  5-feb  8-feb
Panel_ID
1             5    4.3    3.0    NaN
2             0    NaN    0.5    3.2
Also see the pandas docs on reshaping and hierarchical indexing.
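As a small illustration of working with the multi-indexed frame directly (not part of the original answer), selecting one panel member or one day is straightforward:

df.loc[1]                    # every Day observation for Panel_ID 1
df.xs('2-feb', level='Day')  # the 2-feb row of every Panel_ID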
