Pandas: bar plot with multiIndex dataframe - python

I have a pandas DataFrame with a TIMESTAMP column (not the index), and the timestamp format is as follows:
2015-03-31 22:56:45.510
I also have columns called CLASS and AXLES. I would like to compute the count of records for each month separately for each unique value of AXLES (AXLES can take an integer value between 3-12).
I came up with a combination of resample and groupby:
resamp = dfWIM.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
This seems to give me a multiIndex dataframe object, as shown below.
In [72]: resamp
Out [72]:
AXLES TIMESTAMP
3 2014-07-31 5517
2014-08-31 31553
2014-09-30 42816
2014-10-31 49308
2014-11-30 44168
2014-12-31 45518
2015-01-31 54782
2015-02-28 52166
2015-03-31 47929
4 2014-07-31 3147
2014-08-31 24810
2014-09-30 39075
2014-10-31 46857
2014-11-30 42651
2014-12-31 48282
2015-01-31 42708
2015-02-28 43904
2015-03-31 50033
From here, how can I access different components of this multiIndex object to create a bar plot for the following conditions?
show data when AXLES = 3
show x ticks in the Month - Year format (no days, hours, minutes etc.)
Thanks!
EDIT: Following code gives me the plot, but I could not change the xtick formatting to MM-YY.
resamp[3].plot(kind='bar')
EDIT 2: Below is a code snippet that generates a small sample of data similar to what I have:
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
resamp[3].plot(kind='bar')
EDIT 3:
Here below is the solution:
A. Plot the whole resampled dataframe (based on @Ako's suggestion):
df = resamp.unstack(0)
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
B. Plot an individual index from the resampled dataframe (based on @Alexander's suggestion):
df = resamp[3]
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)

You could generate and set the labels explicitly using ax.xaxis.set_major_formatter with a ticker.FuncFormatter. This will allow you to keep your DataFrame's MultiIndex with timestamp values, while displaying the timestamps in the desired %m-%Y format:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.ticker as ticker
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
ax = resamp[3].plot(kind='bar')
# build the labels from the AXLES == 3 slice only, so they line up with the plotted bars
ticklabels = [timestamp.strftime('%m-%Y') for timestamp in resamp[3].index]
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: ticklabels[int(x)]))
plt.gcf().autofmt_xdate()
plt.show()

The following should work, but it is difficult to test without some data.
Start by resetting your index to get access to the TIMESTAMP column. Then use strftime to format it to your desired text representation (e.g. mm-yy). Finally, reset the index back to AXLES and TIMESTAMP.
df = resamp.reset_index()
df['TIMESTAMP'] = [ts.strftime('%m-%y') for ts in df.TIMESTAMP]
df.set_index(['AXLES', 'TIMESTAMP'], inplace=True)
df.xs(3, level=0).plot(kind='bar')
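The `how=` argument to resample used throughout this thread was removed in later pandas releases. Below is a minimal sketch of the same idea for newer versions, assuming that grouping on a monthly period (via `dt.to_period('M')`) is an acceptable stand-in for `resample('M')`. Unlike resample, this grouping does not emit rows for months with no records. The variable names `month` and `counts` are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

dfTest = pd.DataFrame({
    'TIMESTAMP': pd.to_datetime(['2014-08-31', '2014-09-30', '2014-10-31']),
    'AXLES': [3, 3, 3],
    'CLASS': [5, 6, 7],
})

# Count records per (AXLES, calendar month); the monthly period replaces resample('M')
month = dfTest['TIMESTAMP'].dt.to_period('M')
resamp = dfTest.groupby(['AXLES', month])['CLASS'].count()

# Select AXLES == 3 and turn the monthly periods into MM-YYYY strings
counts = resamp.xs(3, level='AXLES')
counts.index = counts.index.strftime('%m-%Y')

# With a string index the bar plot uses the strings as tick labels
ax = counts.plot(kind='bar', rot=0)
```

Because the index is converted only on the selected slice, `resamp` itself keeps its period-valued MultiIndex for further processing.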

Related

How to aggregate a metric and plot groups separately

I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
df['date_m'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
And I want to make a line plot grouped by month and campaign, so I have tried this code:
df['sales'].groupby(df['date_m','campaign']).mean().plot.line()
But I get this error message KeyError: ('date_m', 'campaign'). Please, any help will be greatly appreciated.
Plotting is typically dependent upon the shape of the DataFrame.
.groupby creates a long format DataFrame, which is great for seaborn
.pivot_table creates a wide format DataFrame, which easily works with pandas.DataFrame.plot
.groupby the DataFrame
df['sales'].groupby(...) is incorrect, because df['sales'] selects one column of the dataframe; none of the other columns are available
.groupby converts the DataFrame into a long format, which is great for plotting with seaborn.lineplot.
Specify the hue parameter to separate by 'campaign'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# perform groupby and reset the index
dfg = df.groupby(['date_m','campaign'])['sales'].mean().reset_index()
# display(dfg.head())
date_m campaign sales
0 2011-01-01 0 10000
1 2011-01-01 1 7000
2 2011-02-01 0 11000
3 2011-02-01 1 8000
4 2011-03-01 0 12000
# plot with seaborn
sns.lineplot(data=dfg, x='date_m', y='sales', hue='campaign')
.pivot_table the DataFrame
.pivot_table shapes the DataFrame correctly for plotting with pandas.DataFrame.plot, and it has an aggregation parameter.
The DataFrame is shaped into a wide format.
# pivot the dataframe into the correct shape for plotting
dfp = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')
# display(dfp.head())
campaign 0 1
date_m
2011-01-01 10000 7000
2011-02-01 11000 8000
2011-03-01 12000 5000
2011-04-01 10500 6000
2011-05-01 10000 6000
# plot the dataframe
dfp.plot()
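The long and wide shapes described above are two views of the same aggregation. As a small sketch (the name `wide` is illustrative, and `date_m` is built here from year/month/day columns rather than from string concatenation), unstacking the groupby result should give the same table as pivot_table:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2011] * 12,
    'month': [1, 2, 3, 4, 5, 6] * 2,
    'campaign': [0] * 6 + [1] * 6,
    'sales': [10000, 11000, 12000, 10500, 10000, 9500,
              7000, 8000, 5000, 6000, 6000, 7000],
})
# Assemble a datetime column from year/month, using the first day of each month
df['date_m'] = pd.to_datetime(df[['year', 'month']].assign(day=1))

# Wide format via pivot_table
dfp = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')

# Long format via groupby, then moved to wide format with unstack
wide = df.groupby(['date_m', 'campaign'])['sales'].mean().unstack('campaign')
```

Either `dfp.plot()` or `wide.plot()` then draws one line per campaign.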
Plotting with matplotlib directly
fig, ax = plt.subplots(figsize=(8, 6))
for v in df.campaign.unique():
    # select the data based on the campaign
    data = df[df.campaign.eq(v)]
    # this is only necessary if there is more than one value per date
    data = data.groupby(['date_m','campaign'])['sales'].mean().reset_index()
    ax.plot('date_m', 'sales', data=data, label=f'{v}')
plt.legend(title='campaign')
plt.show()
Notes
Package versions:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4

Pandas reindex and interpolate time series efficiently (reindex drops data)

Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data,index)
# I want rounded date-times
desired_index = pandas.date_range("2010-10-08",periods=10,freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that will lose precision, so I came up with the following; concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
The only (simple) way I can see of doing this is to use resample to upsample to your time resolution (say 1 second), then reindex.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
.assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
+ pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
Value=np.random.randn(337)
)
.set_index(['SampleTime'])
)
Let's see what the data looks like:
df.head()
Value
SampleTime
2018-10-01 00:00:03 0.033171
2018-10-01 00:30:03 0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
.reindex(df.index.union(desired_index))
.interpolate(method='time')
.reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are a number of ways to deal with this (pad, for example).
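To make the union-reindex approach easier to check by hand, here is a small deterministic sketch in which each value equals the minute offset of its timestamp. The final `bfill()` is only one assumed way of handling the leading NaN; it copies the first interpolated value backwards and is not part of the answer above.

```python
import pandas as pd

# Irregular observations; each value equals the number of minutes past midnight
series = pd.Series(
    [10.0, 50.0, 70.0],
    index=pd.to_datetime(['2018-10-01 00:10', '2018-10-01 00:50', '2018-10-01 01:10']),
)
desired_index = pd.to_datetime(
    ['2018-10-01 00:00', '2018-10-01 00:30', '2018-10-01 01:00']
)

interpolated = (
    series
    .reindex(series.index.union(desired_index))  # keep old and new timestamps
    .interpolate(method='time')                  # weight by the actual time gaps
    .reindex(desired_index)                      # keep only the new timestamps
)

# 00:00 lies before the first observation, so it is still NaN at this point;
# back-filling is one possible (assumed) choice for that edge
filled = interpolated.bfill()
```

The values at 00:30 and 01:00 come out near 30 and 60, which is what linear interpolation in time should give for this data.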
my methods
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df.interpolate(method='linear')
method 2
prices = data.loc[~data.index.duplicated(keep='last')]
#prices = data.reset_index()
idx1 = prices.index
idx1 = pd.to_datetime(idx1, errors='coerce')
merged = idx1.union(idx2)
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data=df

Arrange pandas DataFrame for color Plotting

I have a dataframe which looks like this (left column is the index):
YYYY-MO-DD HH-MI-SS_SSS ATMOSPHERIC PRESSURE (hPa) mean
2016-11-07 14:00:00 1014.028782
2016-11-07 15:00:00 1014.034111
.... ....
2016-11-30 09:00:00 1006.516436
2016-11-30 10:00:00 1006.216156
Now I want to plot a colormap with this data - so I want to create an X (horizontal axis) to be just the dates:
2016-11-07, 2016-11-08,...,2016-11-30
and the Y (Vertical axis) to be the time:
00:00:00, 01:00:00, 02:00:00, ..., 23:00:00
And finally the Z (color map) to be the pressure data for each date and time [f(x,y)].
How can I arrange the data for this kind of plotting ?
Thank you !
With test data prepared like so:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
samples = 24 * 365
index = pd.date_range('2017-01-01', freq='1H', periods=samples)
data = pd.DataFrame(np.random.rand(samples), index=index, columns=['data'])
I would do something like this:
data = data.reset_index()
data['date'] = data['index'].apply(lambda x: x.date())
data['time'] = data['index'].apply(lambda x: x.time())
pivoted = data.pivot(index='time', columns='date', values='data')
fig, ax = plt.subplots(1, 1)
ax.imshow(pivoted, origin='lower', cmap='viridis')
plt.show()
To improve the axis labeling, this is a start:
ax.set_yticklabels(['{:%H:%M:%S}'.format(t) for t in data['time'].unique()])
ax.set_xticklabels(['{:%Y-%m-%d}'.format(t) for t in data['date'].unique()])
You will still need to choose how often a label appears, using set_xticks() and set_yticks(), so that the labels line up with the tick positions.
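One way to pick the tick positions is sketched below, assuming a label every 7 days and every 6 hours is wanted. The step sizes and the names `x_pos` and `y_pos` are arbitrary choices for illustration. Setting the positions first and then labelling exactly those positions keeps labels and cells aligned.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Thirty days of hourly test data
samples = 24 * 30
index = pd.date_range('2017-01-01', periods=samples, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
data = pd.DataFrame({'date': index.date, 'time': index.time, 'data': rng.random(samples)})

# Rows are times of day, columns are dates
pivoted = data.pivot(index='time', columns='date', values='data')

fig, ax = plt.subplots()
ax.imshow(pivoted.to_numpy(), origin='lower', cmap='viridis', aspect='auto')

# Label every 7th column (date) and every 6th row (hour)
x_pos = np.arange(0, pivoted.shape[1], 7)
y_pos = np.arange(0, pivoted.shape[0], 6)
ax.set_xticks(x_pos)
ax.set_xticklabels([f'{d:%Y-%m-%d}' for d in pivoted.columns[x_pos]], rotation=45)
ax.set_yticks(y_pos)
ax.set_yticklabels([f'{t:%H:%M}' for t in pivoted.index[y_pos]])
```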

Matplotlib: Making a line graph's datetime x axis labels look like Excel

I have a simple pandas DataFrame with yearly values that I am plotting as a line graph:
import matplotlib.pyplot as plt
import pandas as pd
>>>df
a b
2010-01-01 9.7 9.0
2011-01-01 8.8 14.2
2012-01-01 8.4 7.6
2013-01-01 9.6 8.4
2014-01-01 8.2 5.5
The expected format for the X axis is to use no margins for the labels:
fig = plt.figure(0)
ax = fig.add_subplot(1, 1, 1)
df.plot(ax = ax)
But I would like to force the values to plot in the middle of the year range, as is done in Excel:
I have tried setting the x axis margins:
ax.margins(xmargin = 1)
But can see no difference.
If you just want to move the dates, you could try adding this line at the end:
ax.set_xlim(ax.get_xlim()[0] - 0.5, ax.get_xlim()[1] + 0.5)
If you need to format the dates as well you could either modify your index or make changes in the plotted ticks like so:
(presuming that your df.index is a datetime object)
ax.set_xticklabels(df.index.to_series().apply(lambda x: x.strftime('%d/%m/%Y')))
This will format the dates to look like your Excel example.
Or you could change your index to look like you want and then call .plot():
df.index = df.index.to_series().apply(lambda x: x.strftime('%d/%m/%Y'))
print(df.index.tolist())
['01/01/2010', '01/01/2011', '01/01/2012', '01/01/2013', '01/01/2014']
And, if your index is not datetime, you need to convert it first like this:
df.index = pd.to_datetime(df.index)
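An alternative sketch that plots with matplotlib directly instead of df.plot, so that the x axis stays a true date axis. It assumes that padding the limits by roughly half a year on each side (182 days, an arbitrary choice) is close enough to the Excel look, and it uses matplotlib.dates to put one tick on each year.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

df = pd.DataFrame(
    {'a': [9.7, 8.8, 8.4, 9.6, 8.2], 'b': [9.0, 14.2, 7.6, 8.4, 5.5]},
    index=pd.to_datetime(['2010-01-01', '2011-01-01', '2012-01-01',
                          '2013-01-01', '2014-01-01']),
)

fig, ax = plt.subplots()
ax.plot(df.index, df['a'], label='a')
ax.plot(df.index, df['b'], label='b')

# Pad the x limits by about half a year so the first and last points sit inside the axes
pad = pd.Timedelta(days=182)
ax.set_xlim(df.index[0] - pad, df.index[-1] + pad)

# One tick per year, labelled with the year only
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
ax.legend()
```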

When plotting datetime index data, put markers in the plot on specific days (e.g. weekend)

I create a pandas dataframe with a DatetimeIndex like so:
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create datetime index and random data column
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=14, freq='D')
data = np.random.randint(1, 10, size=14)
columns = ['A']
df = pd.DataFrame(data, index=index, columns=columns)
# initialize new weekend column, then set all values to 'yes' where the index corresponds to a weekend day
df['weekend'] = 'no'
df.loc[(df.index.weekday == 5) | (df.index.weekday == 6), 'weekend'] = 'yes'
print(df)
Which gives
A weekend
2014-10-13 7 no
2014-10-14 6 no
2014-10-15 7 no
2014-10-16 9 no
2014-10-17 4 no
2014-10-18 6 yes
2014-10-19 4 yes
2014-10-20 7 no
2014-10-21 8 no
2014-10-22 8 no
2014-10-23 1 no
2014-10-24 4 no
2014-10-25 3 yes
2014-10-26 8 yes
I can easily plot the A column with pandas by doing:
df.plot()
plt.show()
which plots a line of the A column but leaves out the weekend column as it does not hold numerical data.
How can I put a "marker" on each spot of the A column where the weekend column has the value yes?
Meanwhile I found out, it is as simple as using boolean indexing in pandas. Doing the plot directly with pyplot instead of pandas' own plot wrapper (which is more convenient to me):
plt.plot(df.index, df.A)
plt.plot(df[df.weekend=='yes'].index, df[df.weekend=='yes'].A, 'ro')
Now, the red dots mark all weekend days, which are the rows where df.weekend == 'yes'.
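The same boolean-indexing idea also works without the helper column. The sketch below assumes fixed dates and the values from the printed table above, and builds the mask straight from the index; `is_weekend` is an illustrative name.

```python
import pandas as pd
import matplotlib.pyplot as plt

index = pd.date_range('2014-10-13', periods=14, freq='D')
df = pd.DataFrame({'A': [7, 6, 7, 9, 4, 6, 4, 7, 8, 8, 1, 4, 3, 8]}, index=index)

# weekday is 0 for Monday, so 5 and 6 are Saturday and Sunday
is_weekend = df.index.weekday >= 5

fig, ax = plt.subplots()
ax.plot(df.index, df['A'])
# Overlay red dots only on the weekend rows
(markers,) = ax.plot(df.index[is_weekend], df.loc[is_weekend, 'A'], 'ro')
```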
