I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
df['date_m'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
And I want to make a line plot grouped by month and campaign, so I have tried this code:
df['sales'].groupby(df['date_m','campaign']).mean().plot.line()
But I get this error message KeyError: ('date_m', 'campaign'). Please, any help will be greatly appreciated.
Plotting typically depends on the shape of the DataFrame.
.groupby creates a long format DataFrame, which is great for seaborn
.pivot_table creates a wide format DataFrame, which easily works with pandas.DataFrame.plot
.groupby the DataFrame
df['sales'].groupby(...) is incorrect, because df['sales'] selects one column of the dataframe; none of the other columns are available
.groupby converts the DataFrame into a long format, which is great for plotting with seaborn.lineplot.
Specify the hue parameter to separate by 'campaign'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# perform groupby and reset the index
dfg = df.groupby(['date_m','campaign'])['sales'].mean().reset_index()
# display(dfg.head())
date_m campaign sales
0 2011-01-01 0 10000
1 2011-01-01 1 7000
2 2011-02-01 0 11000
3 2011-02-01 1 8000
4 2011-03-01 0 12000
# plot with seaborn
sns.lineplot(data=dfg, x='date_m', y='sales', hue='campaign')
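The KeyError in the question comes from df['date_m','campaign'], which pandas reads as a lookup of a single column whose label is the tuple ('date_m', 'campaign'). A minimal sketch of the difference, using a cut-down frame whose name (demo) and values are assumptions made for illustration:

```python
import pandas as pd

# small illustrative frame; the name `demo` and the values are made up for this sketch
demo = pd.DataFrame({
    'date_m': pd.to_datetime(['2011-01-01', '2011-01-01', '2011-02-01', '2011-02-01']),
    'campaign': [0, 1, 0, 1],
    'sales': [10000, 7000, 11000, 8000],
})

# a comma inside single brackets builds a tuple key, i.e. ONE column label
try:
    demo['date_m', 'campaign']
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True

# a list of labels selects several columns
two_columns = demo[['date_m', 'campaign']]

# passing the labels to groupby avoids selecting the columns by hand
means = demo.groupby(['date_m', 'campaign'])['sales'].mean()
```

Grouping the whole DataFrame and selecting 'sales' afterwards keeps the grouping columns available, which is what df['sales'].groupby(...) loses.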
.pivot_table the DataFrame
.pivot_table shapes the DataFrame correctly for plotting with pandas.DataFrame.plot, and it has an aggregation parameter.
The DataFrame is shaped into a wide format.
# pivot the dataframe into the correct shape for plotting
dfp = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')
# display(dfp.head())
campaign 0 1
date_m
2011-01-01 10000 7000
2011-02-01 11000 8000
2011-03-01 12000 5000
2011-04-01 10500 6000
2011-05-01 10000 6000
# plot the dataframe
dfp.plot()
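To make the long versus wide distinction concrete, here is a small sketch on a made-up four-row frame (the name demo and its values are assumptions). It also shows that unstacking the grouped result gives the same wide layout as pivot_table:

```python
import numpy as np
import pandas as pd

# illustrative data only
demo = pd.DataFrame({
    'date_m': pd.to_datetime(['2011-01-01', '2011-01-01', '2011-02-01', '2011-02-01']),
    'campaign': [0, 1, 0, 1],
    'sales': [10000, 7000, 11000, 8000],
})

# long format: one row per (date_m, campaign) pair
long_df = demo.groupby(['date_m', 'campaign'])['sales'].mean().reset_index()

# wide format: one row per date_m, one column per campaign
wide_df = demo.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')

# unstacking the grouped Series moves 'campaign' from the index into the columns
wide_from_groupby = demo.groupby(['date_m', 'campaign'])['sales'].mean().unstack('campaign')
same_values = np.allclose(wide_df.to_numpy(), wide_from_groupby.to_numpy())
```

Which shape to prefer depends on the plotting API: seaborn takes the long frame plus hue, while DataFrame.plot draws one line per column of the wide frame.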
Plotting with matplotlib directly
fig, ax = plt.subplots(figsize=(8, 6))
for v in df.campaign.unique():
    # select the data based on the campaign
    data = df[df.campaign.eq(v)]
    # this is only necessary if there is more than one value per date
    data = data.groupby(['date_m','campaign'])['sales'].mean().reset_index()
    ax.plot('date_m', 'sales', data=data, label=f'{v}')
plt.legend(title='campaign')
plt.show()
Notes
Package versions:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4
import plotly.offline as pyo
import plotly.express as px
import matplotlib.pyplot as pls
pyo.init_notebook_mode()
data = pd.read_csv(r'C:.......Coronovirus Datasets\time_series_covid19_deaths_global.csv')
countries = ['US']
filtered_data = data[data['Country/Region'].isin(countries)]
wanted_values = filtered_data[['Country/Region','1/22/2020','1/23/2020','1/24/2020', '1/25/2020','1/26/2020','1/27/2020','1/28/2020','1/28/2020','1/29/2020',
'1/30/2020','1/31/2020','2/1/2020','2/2/2020','2/3/2020','2/4/2020','2/5/2020','2/6/2020','2/7/2020','2/8/2020','2/9/2020','2/10/2020',
'2/11/2020','2/12/2020','2/13/2020','2/14/2020','2/15/2020','2/16/2020','2/17/2020','2/18/2020','2/19/2020','2/20/2020','2/21/2020','2/22/2020','2/23/2020',
'2/24/2020','2/25/2020','2/26/2020','2/27/2020','2/28/2020','2/29/2020','3/1/2020','3/2/2020','3/3/2020','3/4/2020','3/5/2020','3/6/2020','3/7/2020',
'3/8/2020','3/9/2020','3/10/2020','3/11/2020','3/12/2020','3/13/2020','3/14/2020','3/15/2020','3/16/2020','3/17/2020','3/18/2020','3/19/2020',
'3/20/2020','3/21/2020','4/1/2020','4/2/2020','4/3/2020','4/4/2020','4/5/2020','4/6/2020','4/7/2020','4/8/2020','4/9/2020','4/10/2020',
'4/11/2020','4/12/2020','4/13/2020','4/14/2020','4/15/2020','4/16/2020','4/17/2020','4/18/2020','4/19/2020','4/20/2020','4/21/2020','4/22/2020','4/23/2020',
'4/24/2020','4/25/2020','4/26/2020','4/27/2020','4/28/2020','4/29/2020','5/1/2020','5/2/2020','5/3/2020','5/4/2020','5/5/2020','5/6/2020','5/7/2020','5/8/2020','5/9/2020']]
fig = px.scatter(wanted_values, x ='Country/Region', y = 'dates' , title = 'Number of Deaths Per Day')
fig.show()
#wanted_values.plot(x="5/9/2020, 5/8/2020", y = 'filtered_data' kind = 'bar')
#pls.show()
How can I plot all the dates with their corresponding deaths as a scatter plot? I plan to use linear regression to predict the amount of deaths since January first. I have been having a lot of trouble with plotting these values as I am really new to Python.
The data set can be found here: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases
This is what your data looks like:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("time_series_covid19_deaths_global.csv")
data.iloc[:2,:7]
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0
1 NaN Albania 41.1533 20.1683 0 0 0
First, keep only the US row, select the date columns by giving the first and last column labels (they must match the column names), and melt the result into long format:
data = data[data['Country/Region']=='US']
data = data.loc[:,'1/22/20':'5/9/20'].melt(var_name="date")
data['date'] = pd.to_datetime(data['date'])
Looks like this now:
date value
0 2020-01-22 0
1 2020-01-23 0
2 2020-01-24 0
Plotting is simply:
data.plot.scatter(x="date",y="value",rot=45)
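The reshaping step can be checked on a tiny frame before running it on the full file. This is only a sketch: the frame below is made up to mimic the file's layout (one column per date), and the explicit format string is an assumption based on the column labels shown above.

```python
import pandas as pd

# made-up wide frame shaped like the source file: one column per date
wide = pd.DataFrame({
    'Country/Region': ['US'],
    '1/22/20': [0],
    '1/23/20': [1],
    '1/24/20': [3],
})

# keep only the date columns, then turn each column into a row
long = wide.loc[:, '1/22/20':'1/24/20'].melt(var_name='date')

# the labels are month/day/two-digit year, so state the format explicitly
long['date'] = pd.to_datetime(long['date'], format='%m/%d/%y')
```

The result has one row per date with the columns date and value, which is the shape the scatter call expects.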
Data df is in this format:
Id Timestamp Data Group
0 1 2013-08-12 10:29:19.673 40.0 1
1 2 2013-08-13 10:29:20.687 50.0 2
2 3 2013-09-14 10:29:20.687 40.0 3
3 4 2013-10-14 10:29:20.687 30.0 4
4 5 2013-11-15 10:29:20.687 50.0 5
...
I plotted the graph to observe how Data varies over time with code:
%matplotlib notebook
%matplotlib inline
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df= df[(df['Timestamp'] > '2013-12-05 ') & (df['Timestamp'] <= '2013-12-30 ')]
df1 = df[df['Group'] ==1]
df1.plot(x = 'Timestamp', y = 'Data',figsize=(20, 10))
The graph looks fine:
But when I was trying to narrow down the time interval to 2013-12-05 ~2013-12-11(from 2013-12-05 ~2013-12-30), with code:
%matplotlib notebook
%matplotlib inline
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df= df[(df['Timestamp'] > '2013-12-05 ') & (df['Timestamp'] <= '2013-12-11')]
df1 = df[df['Group'] ==1]
df1.plot(x = 'Timestamp', y = 'Data',figsize=(20, 10))
the graph looks off. Since the new interval covers the first part of the old one, I expected the new graph to match the first part of the old graph. But the graph looks like this:
The x-axis marks also no longer make sense. What could have gone wrong? Any help is appreciated. Thanks.
I have a large data set with names of stores, dates and profits.
My data set is not the most organized but I now have it in this df.
df
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
I proudly created a function to get each day into one df by itself until I realized it would be very time consuming to do one for each day.
def avg(n):
    return df.loc[df['Date'] == "May" + " " + str(n) + " " +str(2018)]
where n is the day I want, so the function gets me just the dates I want.
What I really need is a way to pass all the dates I want as a list and get a df for each day. I tried this, but it did not work:
def avg(n):
    dlist= []
    for i in n:
        dlist= df.loc[df['Date'] == "May" + " " + str(i) + " " +str(2018)]
        dlist=pd.DataFrame(dlist)
        dlist.append(i)
    return dlist
df2=avg([21,23,24,25])
My goal was to have each of the dates (21, 23, 24, 25) in May in its own df.
But it failed with this error:
cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I am not sure if it's also possible to add a rolling average or mean, to columns for each day of (21,23,24,25), but that's where analysis will conclude.
output desired
Store Date Profit Rolling Mean
ABC May 1 2018 234 250
XYZ May 1 2018 410 401
AZY May 1 2018 145 415
where the rolling mean is for the past 30 days. Above all, I would like to have each day in its own df so that I can save it to a CSV file at the end.
Rolling Mean:
The example data in the question has dates in the format May 1 2018, which are plain strings. A time-based rolling window (e.g. rolling('30D')) requires a datetime index, and a datetime column is also easier to filter and plot.
Instead of string splitting the original Date column, it should be converted to datetime, using df.Date = pd.to_datetime(df.Date), which will give dates in the format 2018-05-01
With a properly formatted datetime column, use df['Day'] = df.Date.dt.day and df['Month'] = df.Date.dt.month_name() to get a Day and Month column, if desired.
Given the original data:
Original Data:
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
Transformed Original Data:
df.Date = pd.to_datetime(df.Date)
df['Day'] = df.Date.dt.day
df['Month'] = df.Date.dt.month_name()
Store Date Profit Day Month
ABC 2018-05-01 234 1 May
XYZ 2018-05-01 410 1 May
AZY 2018-05-01 145 1 May
ABC 2018-05-02 234 2 May
XYZ 2018-05-02 410 2 May
AZY 2018-05-02 145 2 May
Rolling Example:
The example dataset is insufficient to produce a 30-day rolling average.
For rolling(30) to produce a value, each store needs at least 30 rows of data (i.e. the first mean appears on the 30th row and covers that row and the previous 29).
The following example will set up a dataframe consisting of every day in 2018, a random profit between 100 and 1000 (inclusive), and a random store, chosen from ['ABC', 'XYZ', 'AZY'].
Extended Sample:
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
'Date': list_of_dates,
'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
Store Date Profit
ABC 2018-01-01 901
AZY 2018-01-02 540
AZY 2018-01-03 417
XYZ 2018-01-04 280
XYZ 2018-01-05 384
XYZ 2018-01-06 104
XYZ 2018-01-07 691
ABC 2018-01-08 376
XYZ 2018-01-09 942
XYZ 2018-01-10 297
df.set_index('Date', inplace=True)
df_rolling = df.groupby(['Store']).rolling(30).mean()
df_rolling.rename(columns={'Profit': '30-Day Rolling Mean'}, inplace=True)
df_rolling.reset_index(inplace=True)
df_rolling.head():
Note that the first 29 rows for each store will be NaN, because rolling(30) needs 30 rows before it returns a value
Store Date 30-Day Rolling Mean
ABC 2018-01-01 NaN
ABC 2018-01-03 NaN
ABC 2018-01-07 NaN
ABC 2018-01-11 NaN
ABC 2018-01-13 NaN
df_rolling.tail():
Store Date 30-Day Rolling Mean
XYZ 2018-12-17 556.966667
XYZ 2018-12-18 535.633333
XYZ 2018-12-19 534.733333
XYZ 2018-12-24 551.066667
XYZ 2018-12-27 572.033333
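Whether the leading values are NaN depends on how the window is specified. A minimal sketch with made-up values, contrasting an integer window (which counts rows) with an offset window (which counts calendar time); the series name s and its contents are assumptions for illustration:

```python
import pandas as pd

# six consecutive days of made-up values
s = pd.Series([10, 20, 30, 40, 50, 60],
              index=pd.date_range('2018-01-01', periods=6, freq='D'))

# integer window: needs 3 rows, so the first 2 results are NaN
by_rows = s.rolling(3).mean()

# offset window: looks back 3 calendar days and needs a datetime-like index;
# min_periods defaults to 1 for offset windows, so there is no leading NaN
by_days = s.rolling('3D').mean()
```

With df.groupby('Store').rolling(30) the window counts rows, so for a store with gaps in its dates the window spans more than 30 calendar days.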
Plot:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
g = sns.lineplot(x='Date', y='30-Day Rolling Mean', data=df_rolling, hue='Store')
for item in g.get_xticklabels():
    item.set_rotation(60)
plt.show()
Alternatively: A dataframe for each store:
It's also possible to create a separate dataframe for each store and put it inside a dict
This alternative makes it easier to plot a more detailed graph with less code
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
'Date': list_of_dates,
'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
df_dict = dict()
for store in df.Store.unique():
    # select this store's rows and index them by date
    store_df = df.loc[df.Store == store, ['Date', 'Profit']].set_index('Date')
    # rolling mean of the Profit column over a window of 30 rows
    store_df['Profit: 30-Day Rolling Mean'] = store_df['Profit'].rolling(30).mean()
    df_dict[store] = store_df
print(df_dict.keys())
>>> dict_keys(['ABC', 'XYZ', 'AZY'])
print(df_dict['ABC'].head())
Plot:
import matplotlib.pyplot as plt
_, axes = plt.subplots(1, 1, figsize=(13, 8), sharex=True)
for k, v in df_dict.items():
    axes.plot(v['Profit'], marker='.', linestyle='-', linewidth=0.5, label=k)
    axes.plot(v['Profit: 30-Day Rolling Mean'], marker='o', markersize=4, linestyle='-', linewidth=0.5, label=f'{k} Rolling')
# place the legend outside the axes, to the right
axes.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Profit ($)')
plt.xlabel('Date')
plt.title('Recorded Profit vs. 30-Day Rolling Mean of Profit')
plt.show()
Get a dataframe for a specific month:
Recall, this is randomly generated data, so the stores don't have data for every day of the month.
may_df = dict()
for k, v in df_dict.items():
    # Date is the index of each store dataframe, so filter on the index directly
    may_df[k] = v[v.index.month_name() == 'May']
print(may_df['XYZ'])
Plot: May data only:
Save dataframes:
pandas.DataFrame.to_csv()
# may_df is a dict of dataframes, so write one file per store
for k, v in may_df.items():
    v.reset_index().to_csv(f'may_{k}.csv', index=False)
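The question's end goal was one dataframe per day, each saved to its own CSV. A minimal sketch of that, assuming the Date column parses with the format '%B %d %Y'; the rows, the output directory and the file names below are all made up for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

# made-up rows in the question's layout
df = pd.DataFrame({
    'Store': ['ABC', 'XYZ', 'ABC', 'XYZ'],
    'Date': ['May 21 2018', 'May 21 2018', 'May 23 2018', 'May 23 2018'],
    'Profit': [234, 410, 250, 401],
})
df['Date'] = pd.to_datetime(df['Date'], format='%B %d %Y')

# keep only the wanted days of May, then build one dataframe per day
wanted_days = [21, 23, 24, 25]
selected = df[(df['Date'].dt.month == 5) & df['Date'].dt.day.isin(wanted_days)]
per_day = {day.strftime('%Y-%m-%d'): group for day, group in selected.groupby('Date')}

# write each day to its own CSV in a throwaway directory (hypothetical file names)
out_dir = Path(tempfile.mkdtemp())
for day, group in per_day.items():
    group.to_csv(out_dir / f'profit_{day}.csv', index=False)
```

Days with no rows (24 and 25 here) simply do not appear in per_day, so no empty files are written.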
A simple solution may be groupby().
Check out this example:
import pandas as pd
listt = [['a',2,3],
['b',5,7],
['a',3,9],
['a',1,3],
['b',9,4],
['a',4,7],
['c',7,2],
['a',2,5],
['c',4,7],
['b',5,5]]
my_df = pd.DataFrame(listt)
my_df.columns=['Class','Day_1','Day_2']
my_df.groupby('Class')['Day_1'].mean()
Output:
Class
a 2.400000
b 6.333333
c 5.500000
Name: Day_1, dtype: float64
Note: Similarly, you can group your data by Date and get the average of your Profit.
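Applied to the store data, the same idea might look like the sketch below. It rebuilds the six rows shown in the question and assumes the Date strings parse with the format '%B %d %Y':

```python
import pandas as pd

# the rows shown in the question
df = pd.DataFrame({
    'Store': ['ABC', 'XYZ', 'AZY', 'ABC', 'XYZ', 'AZY'],
    'Date': ['May 1 2018'] * 3 + ['May 2 2018'] * 3,
    'Profit': [234, 410, 145, 234, 410, 145],
})
df['Date'] = pd.to_datetime(df['Date'], format='%B %d %Y')

# average profit across stores for each day
daily_mean = df.groupby('Date')['Profit'].mean()

# average profit across days for each store
store_mean = df.groupby('Store')['Profit'].mean()
```

Note that this is a plain per-group average, not the 30-day rolling mean the question asks about.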
I have a pandas DataFrame with a TIMESTAMP column (not the index), and the timestamp format is as follows:
2015-03-31 22:56:45.510
I also have columns called CLASS and AXLES. I would like to compute the count of records for each month separately for each unique value of AXLES (AXLES can take an integer value between 3-12).
I came up with a combination of resample and groupby:
resamp = dfWIM.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
This seems to give me a multiIndex dataframe object, as shown below.
In [72]: resamp
Out [72]:
AXLES TIMESTAMP
3 2014-07-31 5517
2014-08-31 31553
2014-09-30 42816
2014-10-31 49308
2014-11-30 44168
2014-12-31 45518
2015-01-31 54782
2015-02-28 52166
2015-03-31 47929
4 2014-07-31 3147
2014-08-31 24810
2014-09-30 39075
2014-10-31 46857
2014-11-30 42651
2014-12-31 48282
2015-01-31 42708
2015-02-28 43904
2015-03-31 50033
From here, how can I access different components of this multiIndex object to create a bar plot for the following conditions?
show data when AXLES = 3
show x ticks in the Month - Year format (no days, hours, minutes etc.)
Thanks!
EDIT: Following code gives me the plot, but I could not change the xtick formatting to MM-YY.
resamp[3].plot(kind='bar')
EDIT 2 below is a code snippet that generates a small sample of the data similar to what I have:
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
resamp[3].plot(kind='bar')
EDIT 3:
Here below is the solution:
A. Plot the whole resampled dataframe (based on @Ako's suggestion):
df = resamp.unstack(0)
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
B. Plot an individual index from the resampled dataframe (based on @Alexander's suggestion):
df = resamp[3]
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
You could generate and set the labels explicitly using ax.xaxis.set_major_formatter with a ticker.FuncFormatter. This will allow you to keep your DataFrame's MultiIndex with timestamp values, while displaying the timestamps in the desired %m-%Y format:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.ticker as ticker
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
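# `how=` was removed from resample; call .count() on the resampler instead
# (pandas >= 2.2 spells the month-end alias 'ME')
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M')['CLASS'].count()
ax = resamp[3].plot(kind='bar')
# build the labels from the plotted slice so they line up with its bars
ticklabels = [timestamp.strftime('%m-%Y') for timestamp in resamp[3].index]
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: ticklabels[int(x)]))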
plt.gcf().autofmt_xdate()
plt.show()
yields
The following should work, but it is difficult to test without some data.
Start by resetting your index to get access to the TIMESTAMP column. Then use strftime to format it to your desired text representation (e.g. mm-yy). Finally, reset the index back to AXLES and TIMESTAMP.
df = resamp.reset_index()
df['TIMESTAMP'] = [ts.strftime('%m-%y') for ts in df.TIMESTAMP]
df.set_index(['AXLES', 'TIMESTAMP'], inplace=True)
>>> df.xs(3, level=0).plot(kind='bar')
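Another way to get month labels without days is to count by monthly period instead of resampling. This is a different technique from the answers above, sketched on made-up records; the frame contents are assumptions, and only the column names come from the question:

```python
import pandas as pd

# made-up records; TIMESTAMP is a regular column, as in the question
dfWIM = pd.DataFrame({
    'TIMESTAMP': pd.to_datetime(['2014-08-03', '2014-08-31', '2014-09-30',
                                 '2014-09-12', '2014-10-31']),
    'AXLES': [3, 3, 3, 4, 3],
    'CLASS': [5, 6, 7, 5, 9],
})

# monthly period for every record, e.g. 2014-08
month = dfWIM['TIMESTAMP'].dt.to_period('M')

# count CLASS records per axle count and month
counts = dfWIM.groupby(['AXLES', month])['CLASS'].count()

# select AXLES == 3 and relabel the index as MM-YYYY strings
axles3 = counts.xs(3, level='AXLES')
axles3.index = axles3.index.strftime('%m-%Y')
```

Calling axles3.plot(kind='bar', rot=0) on the result would then use the MM-YYYY strings as tick labels, since a bar plot labels its ticks with the index values.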