Pandas Seaborn Heatmap Error - python

I have a DataFrame that looks like this when unstacked.
Start Date 2016-07-11 2016-07-12 2016-07-13
Period
0 1.000000 1.000000 1.0
1 0.684211 0.738095 NaN
2 0.592105 NaN NaN
I'm trying to plot it in Seaborn as a heatmap but it's giving me unintended results.
Here's my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.array(data), columns=['Start Date', 'Period', 'Users'])
df = df.fillna(0)
df = df.set_index(['Start Date', 'Period'])
sizes = df['Users'].groupby(level=0).first()
df = df['Users'].unstack(0).divide(sizes, axis=1)
plt.title("Test")
sns.heatmap(df.T, mask=df.T.isnull(), annot=True, fmt='.0%')
plt.tight_layout()
plt.savefig(table._v_name + "fig.png")
I want it so that text doesn't overlap and there aren't 6 heat legends on the side. Also if possible, how do I fix the date so that it only displays %Y-%m-%d?

Since no reproducible data was posted, the example below uses the values from your snippet. It runs pivot_table() to reproduce the posted structure, with the start dates across the columns. The multiple color bars and overlapping text possibly come from the unstack() step, where you seem to be dividing by users (look into seaborn.FacetGrid if you want to split the plot). The example therefore passes the pivoted frame to heatmap() as is, and strftime() reformats the dates to %Y-%m-%d.
from io import StringIO
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
data = '''Period,StartDate,Value
0,2016-07-11,1.000000
0,2016-07-12,1.000000
0,2016-07-13,1.0
1,2016-07-11,0.684211
1,2016-07-12,0.738095
1,2016-07-13
2,2016-07-11,0.592105
2,2016-07-12
2,2016-07-13'''
df = pd.read_csv(StringIO(data))
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['StartDate'] = df['StartDate'].dt.strftime('%Y-%m-%d')
# 'mean' leaves the empty cells as NaN; each cell holds at most one value
pvtdf = df.pivot_table(values='Value', index=['Period'],
                       columns='StartDate', aggfunc='mean')
print(pvtdf)
# StartDate 2016-07-11 2016-07-12 2016-07-13
# Period
# 0 1.000000 1.000000 1.0
# 1 0.684211 0.738095 NaN
# 2 0.592105 NaN NaN
sns.set()
plt.title("Test")
ax = sns.heatmap(pvtdf.T, mask=pvtdf.T.isnull(), annot=True, fmt='.0%')
plt.tight_layout()
plt.show()
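If the heatmap is drawn inside a loop (the table._v_name in the question hints at one, but that is an assumption since the loop is not shown), every sns.heatmap call adds another color bar to the current figure. A minimal sketch of one way around that is to create a fresh figure per heatmap and close it afterwards. The tables dictionary and its values are hypothetical stand-ins for the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical retention tables standing in for the per-table data in the question.
tables = {
    "a": pd.DataFrame([[1.0, 1.0], [0.68, np.nan]]),
    "b": pd.DataFrame([[1.0, 1.0], [0.59, 0.74]]),
}

axes_per_figure = {}
for name, frame in tables.items():
    fig, ax = plt.subplots(figsize=(6, 4))   # fresh figure for every heatmap
    sns.heatmap(frame, mask=frame.isnull(), annot=True, fmt=".0%", ax=ax)
    ax.set_title(name)
    fig.tight_layout()
    axes_per_figure[name] = len(fig.axes)    # heatmap axes plus one color bar axes
    # fig.savefig(name + "_fig.png")         # uncomment to write the image
    plt.close(fig)                           # release the figure before the next pass
```

Each figure then holds exactly one heatmap and one color bar, however many tables are processed.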

Related

How to aggregate a metric and plot groups separately

I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
df['date_m'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
And I want to make a line plot grouped by month and campaign, so I have tried this code:
df['sales'].groupby(df['date_m','campaign']).mean().plot.line()
But I get this error message KeyError: ('date_m', 'campaign'). Please, any help will be greatly appreciated.
Plotting is typically dependent upon the shape of the DataFrame.
.groupby creates a long format DataFrame, which is great for seaborn
.pivot_table creates a wide format DataFrame, which easily works with pandas.DataFrame.plot
.groupby the DataFrame
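df['sales'].groupby(...) is incorrect, because df['sales'] selects one column of the dataframe; none of the other columns are available by name. The KeyError itself comes from df['date_m','campaign'], which looks up a single column labelled with the tuple ('date_m', 'campaign') rather than two columns.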
.groupby converts the DataFrame into a long format, which is great for plotting with seaborn.lineplot.
Specify the hue parameter to separate by 'campaign'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# perform groupby and reset the index
dfg = df.groupby(['date_m','campaign'])['sales'].mean().reset_index()
# display(dfg.head())
date_m campaign sales
0 2011-01-01 0 10000
1 2011-01-01 1 7000
2 2011-02-01 0 11000
3 2011-02-01 1 8000
4 2011-03-01 0 12000
# plot with seaborn
sns.lineplot(data=dfg, x='date_m', y='sales', hue='campaign')
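To make the cause of the KeyError concrete, here is a small sketch using the data from the question. It contrasts the tuple lookup with a list of column names; the variable names raised and selected are only illustrative, and date_m is built with an assumed day of 1.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2011] * 12,
    'month': [1, 2, 3, 4, 5, 6] * 2,
    'campaign': [0] * 6 + [1] * 6,
    'sales': [10000, 11000, 12000, 10500, 10000, 9500,
              7000, 8000, 5000, 6000, 6000, 7000],
})
# build a month-start date from the year and month columns (day fixed to 1)
df['date_m'] = pd.to_datetime(df[['year', 'month']].assign(day=1))

raised = False
try:
    df['date_m', 'campaign']             # a tuple is treated as ONE column label
except KeyError:
    raised = True                        # this is the error from the question

selected = df[['date_m', 'campaign']]    # a list selects two columns

# group on the column names, then pick the column to aggregate
dfg = df.groupby(['date_m', 'campaign'])['sales'].mean().reset_index()
```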
.pivot_table the DataFrame
.pivot_table shapes the DataFrame correctly for plotting with pandas.DataFrame.plot, and it has an aggregation parameter.
The DataFrame is shaped into a wide format.
# pivot the dataframe into the correct shape for plotting
dfp = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')
# display(dfp.head())
campaign 0 1
date_m
2011-01-01 10000 7000
2011-02-01 11000 8000
2011-03-01 12000 5000
2011-04-01 10500 6000
2011-05-01 10000 6000
# plot the dataframe
dfp.plot()
Plotting with matplotlib directly
fig, ax = plt.subplots(figsize=(8, 6))
for v in df.campaign.unique():
    # select the data based on the campaign
    data = df[df.campaign.eq(v)]
    # this is only necessary if there is more than one value per date
    data = data.groupby(['date_m','campaign'])['sales'].mean().reset_index()
    ax.plot('date_m', 'sales', data=data, label=f'{v}')
plt.legend(title='campaign')
plt.show()
Notes
Package versions:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4

Python Seaborn Lineplot

I am new to Python and have a question regarding a lineplot.
I have a data set which I would like to display as a Seaborn lineplot.
In this dataset I have 3 categories which should be on the Y axis. I have no data for an X axis, but I want to use the index.
Unfortunately I did not get it right. I would like to use it like the Excel picture.
The columns are also of different lengths.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("Testdata.csv", delimiter= ";")
df
Double Single Triple
0 50.579652 24.498143 60.954680
1 53.313919 24.497490 60.494626
2 54.174343 24.490651 60.052566
3 56.622435 24.485605 59.622501
4 59.656155 26.201791 59.199581
... ... ... ...
410 NaN NaN 75.478118
411 NaN NaN 73.780804
412 NaN NaN 72.716096
413 NaN NaN 72.468472
414 NaN NaN 71.179819
How do I do that?
I appreciate your help.
First reset the index and melt your columns, then use the hue parameter to plot each line:
fig, ax = plt.subplots(figsize=(10, 10))
sns.lineplot(
    data=df.reset_index().melt(id_vars='index'),
    x='index',
    y='value',
    hue='variable',
    ax=ax,
)

Plot Price as Horizontal Line for Non Zero Volume Values

My Code:
import matplotlib.pyplot as plt
plt.style.use('seaborn-ticks')
import pandas as pd
import numpy as np
path = 'C:\\File\\Data.txt'
df = pd.read_csv(path, sep=",")
df.columns = ['Date','Time','Price','volume']
df = df[df.Date == '08/02/2019'].reset_index(drop=True)
df['Volume'] = np.where((df.volume/1000) < 60, 0, (df.volume/1000))
df.plot('Time','Price')
dff = df[df.Volume > 60].reset_index(drop=True)
dff = dff[['Date','Time','Price','Volume']]
print(dff)
plt.subplots_adjust(left=0.05, bottom=0.05, right=0.95, top=0.95, wspace=None, hspace=None)
plt.show()
My Plot Output is as below:
The output of the dff DataFrame is below:
Date Time Price Volume
0 08/02/2019 13:39:43 685.35 97.0
1 08/02/2019 13:39:57 688.80 68.0
2 08/02/2019 13:43:50 683.00 68.0
3 08/02/2019 13:43:51 681.65 92.0
4 08/02/2019 13:49:42 689.95 70.0
5 08/02/2019 13:52:00 695.20 64.0
6 08/02/2019 14:56:42 686.25 68.0
7 08/02/2019 15:03:15 685.35 63.0
8 08/02/2019 15:03:31 683.15 69.0
9 08/02/2019 15:08:08 684.00 61.0
I want to plot the Prices of this table as Vertical Lines as per the below image. Any Help..
Based on your image, I think you mean horizontal lines. Either way it's pretty simple, Pyplot has hlines/vlines builtins. In your case, try something like
plt.hlines(dff['Price'], '08/02/2019', '09/02/2019')
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
path = 'File.txt'
df = pd.read_csv(path, sep=",")
df.columns = ['Date','Time','Price','volume']
df = df[df.Date == '05/02/2019'].reset_index(drop=True)
df['Volume'] = np.where((df.volume/7500) < 39, 0, (df.volume/7500))
df["Time"] = pd.to_datetime(df['Time'])
df.plot(x="Time",y='Price', rot=0)
plt.title("Date: " + str(df['Date'].iloc[0]))
dff = df[df.Volume > 39].reset_index(drop=True)
dff = dff[['Date','Time','Price','Volume']]
print(dff)
# draw one horizontal line per high-volume price
for price in dff['Price']:
    plt.axhline(y=price, linewidth=2, color='blue')
plt.subplots_adjust(left=0.05, bottom=0.06, right=0.95, top=0.96, wspace=None, hspace=None)
plt.show()
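axhline always spans the full width of the axes. If each line should instead start at the time of the trade, hlines accepts per-line start and end positions. The sketch below is only an illustration: it uses three prices from the table above, assumes the date means 8 February 2019, and uses a made-up session_end time.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Three rows from the posted table; the date interpretation is an assumption.
dff = pd.DataFrame({
    'Time': pd.to_datetime(['2019-02-08 13:39:43',
                            '2019-02-08 13:49:42',
                            '2019-02-08 15:08:08']),
    'Price': [685.35, 689.95, 684.00],
})
session_end = np.datetime64('2019-02-08T15:30:00')  # hypothetical end of the session

fig, ax = plt.subplots()
# one segment per row: starts at the trade time, ends at session_end
segments = ax.hlines(y=dff['Price'].to_numpy(),
                     xmin=dff['Time'].to_numpy(),
                     xmax=session_end,
                     colors='blue', linewidth=2)
fig.autofmt_xdate()
```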

Remove interpolation Time series plot for missing values

I'm trying to plot a time series data but I have some problems.
I'm using this code:
from matplotlib import pyplot as plt
plt.figure('Fig')
plt.plot(data.index,data.Colum,'g', linewidth=2.0,label='Data')
And I get this:
But I don't want the interpolation between missing values!
How can I achieve this?
Since you are using pandas you could do something like this:
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1234)
idx = pd.date_range(end=datetime.today().date(), periods=10, freq='D')
# float dtype so that NaN can be assigned below
vals = pd.Series(np.random.randint(1, 10, size=idx.size), index=idx, dtype=float)
vals.iloc[4:8] = np.nan
print(vals)
Here is an example of a column from a DataFrame with DatetimeIndex
2016-03-29 4.0
2016-03-30 7.0
2016-03-31 6.0
2016-04-01 5.0
2016-04-02 NaN
2016-04-03 NaN
2016-04-04 NaN
2016-04-05 NaN
2016-04-06 9.0
2016-04-07 1.0
Freq: D, dtype: float64
To plot it without dates where data is NaN you could do something like this:
clean = vals.dropna()
fig, ax = plt.subplots()
ax.plot(range(clean.size), clean)
# put one tick on every point so the labels line up with the data
ax.set_xticks(range(clean.size))
ax.set_xticklabels(clean.index.date.tolist())
fig.autofmt_xdate()
Which should produce a plot like this:
The trick here is to replace the dates with some range of values that do not trigger matplotlib's internal date processing when you call .plot method.
Later, when the plotting is done, replace the ticklabels with actual dates.
Optionally, call .autofmt_xdate() to make labels readable.
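If the gap should stay visible on a real time axis instead of being removed, another option is to make sure the missing dates are present as NaN, because matplotlib does not draw line segments through NaN values. This is a sketch under the assumption that the missing days are absent from the index altogether; the values are made up and with_gaps is just an illustrative name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical series in which four days are missing from the index entirely.
idx = pd.to_datetime(['2016-03-29', '2016-03-30', '2016-03-31', '2016-04-01',
                      '2016-04-06', '2016-04-07'])
vals = pd.Series([4.0, 7.0, 6.0, 5.0, 9.0, 1.0], index=idx)

# asfreq('D') re-inserts the missing days as NaN rows
with_gaps = vals.asfreq('D')

fig, ax = plt.subplots()
# the line is broken wherever the data is NaN, so nothing is interpolated
ax.plot(with_gaps.index, with_gaps.to_numpy(), 'g', linewidth=2.0, label='Data')
fig.autofmt_xdate()
```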

Pandas: bar plot with multiIndex dataframe

I have a pandas DataFrame with a TIMESTAMP column (not the index), and the timestamp format is as follows:
2015-03-31 22:56:45.510
I also have columns called CLASS and AXLES. I would like to compute the count of records for each month separately for each unique value of AXLES (AXLES can take an integer value between 3-12).
I came up with a combination of resample and groupby:
resamp = dfWIM.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
This seems to give me a multiIndex dataframe object, as shown below.
In [72]: resamp
Out [72]:
AXLES TIMESTAMP
3 2014-07-31 5517
2014-08-31 31553
2014-09-30 42816
2014-10-31 49308
2014-11-30 44168
2014-12-31 45518
2015-01-31 54782
2015-02-28 52166
2015-03-31 47929
4 2014-07-31 3147
2014-08-31 24810
2014-09-30 39075
2014-10-31 46857
2014-11-30 42651
2014-12-31 48282
2015-01-31 42708
2015-02-28 43904
2015-03-31 50033
From here, how can I access different components of this multiIndex object to create a bar plot for the following conditions?
show data when AXLES = 3
show x ticks in the Month - Year format (no days, hours, minutes etc.)
Thanks!
EDIT: Following code gives me the plot, but I could not change the xtick formatting to MM-YY.
resamp[3].plot(kind='bar')
EDIT 2 below is a code snippet that generates a small sample of the data similar to what I have:
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
resamp[3].plot(kind='bar')
EDIT 3:
Here below is the solution:
A. Plot the whole resampled dataframe (based on @Ako's suggestion):
df = resamp.unstack(0)
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
B. Plot an individual index from the resampled dataframe (based on @Alexander's suggestion):
df = resamp[3]
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
You could generate and set the labels explicitly using ax.xaxis.set_major_formatter with a ticker.FixedFormatter. This will allow you to keep your DataFrame's MultiIndex with timestamp values, while displaying the timestamps in the desired %m-%Y format:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.ticker as ticker
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
# resample(..., how='count') was removed from pandas; aggregate after resampling instead
# (newer pandas versions spell the month-end alias 'ME' rather than 'M')
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M')['CLASS'].count()
ax = resamp[3].plot(kind='bar')
ticklabels = [timestamp.strftime('%m-%Y') for timestamp in resamp[3].index]
ax.xaxis.set_major_formatter(ticker.FixedFormatter(ticklabels))
plt.gcf().autofmt_xdate()
plt.show()
yields
The following should work, but it is difficult to test without some data.
Start by resetting your index to get access to the TIMESTAMP column. Then use strftime to format it to your desired text representation (e.g. mm-yy). Finally, reset the index back to AXLES and TIMESTAMP.
df = resamp.reset_index()
df['TIMESTAMP'] = [ts.strftime('%m-%y') for ts in df.TIMESTAMP]
df.set_index(['AXLES', 'TIMESTAMP'], inplace=True)
df.xs(3, level=0).plot(kind='bar')
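For what it's worth, here is a sketch of the same monthly count that avoids resample altogether by grouping on a monthly period. It uses the small sample from EDIT 2; the names month, counts and axle3 are my own. One difference to be aware of: unlike resample, this does not create rows for months that have no records.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

df_test = pd.DataFrame({
    'TIMESTAMP': pd.to_datetime(['2014-08-31', '2014-09-30', '2014-10-31']),
    'AXLES': [3, 3, 3],
    'CLASS': [5, 6, 7],
})

# monthly period of every record, used as a second grouping key
month = df_test['TIMESTAMP'].dt.to_period('M').rename('MONTH')
counts = df_test.groupby(['AXLES', month])['CLASS'].count()

# keep only AXLES == 3 and turn the periods into MM-YYYY strings for the ticks
axle3 = counts.xs(3, level='AXLES')
axle3.index = axle3.index.strftime('%m-%Y')

ax = axle3.plot(kind='bar', rot=0)
```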
