Show time series and mean for grouped data - python

I have some data of different products and its corresponding sales with Datetime index. I managed to group them by product using:
grouped_df = data.loc[:, ['ProductID', 'Sales']].groupby('ProductID')
for key, item in grouped_df:
print(grouped_df.get_group(key), "\n\n")
And the output I got is:
ProductID Sales
Datetime
2014-03-31 1 2475.03
2014-09-27 1 10033.06
2015-02-03 1 5329.33
ProductID Sales
Datetime
2014-12-17 2 1960.0
2015-06-17 2 1400.0
2016-08-29 2 230.0
.
.
.
I would like to be able to plot each grouped data on the same graph to show a time-series of sales.
I also want to get the monthly mean sales of each productID and plot them on a bar chart. I have tried re-sampling but it did not work out for me.
How do I go about doing the above?
Help will be greatly appreciated!

you can use seaborn.lineplot here
you can use
import seaborn as sns
ax = sns.lineplot(x=data.index, y="Sales", hue = 'ProductID ', data=data)
you don't even need to group them

Related

How to aggregate a metric and plot groups separately

I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
df['date_m'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
And I want to make a line plot grouped by month and campaign, so I have tried this code:
df['sales'].groupby(df['date_m','campaign']).mean().plot.line()
But I get this error message KeyError: ('date_m', 'campaign'). Please, any help will be greatly appreciated.
Plotting is typically dependant upon the shape of the DataFrame.
.groupby creates a long format DataFrame, which is great for seaborn
.pivot_table creates a wide format DataFrame, which easily works with pandas.DataFrame.plot
.groupby the DataFrame
df['sales'].groupby(...) is incorrect, because df['sales'] selects one column of the dataframe; none of the other columns are available
.groupby converts the DataFrame into a long format, which is great for plotting with seaborn.lineplot.
Specify the hue parameter to separate by 'campaign'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# perform groupby and reset the index
dfg = df.groupby(['date_m','campaign'])['sales'].mean().reset_index()
# display(dfg.head())
date_m campaign sales
0 2011-01-01 0 10000
1 2011-01-01 1 7000
2 2011-02-01 0 11000
3 2011-02-01 1 8000
4 2011-03-01 0 12000
# plot with seaborn
sns.lineplot(data=dfg, x='date_m', y='sales', hue='campaign')
.pivot_table the DataFrame
.pivot_table shapes the DataFrame correctly for plotting with pandas.DataFrame.plot, and it has an aggregation parameter.
The DataFrame is shaped into a wide format.
# pivot the dataframe into the correct shape for plotting
dfp = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')
# display(dfp.head())
campaign 0 1
date_m
2011-01-01 10000 7000
2011-02-01 11000 8000
2011-03-01 12000 5000
2011-04-01 10500 6000
2011-05-01 10000 6000
# plot the dataframe
dfp.plot()
Plotting with matplotlib directly
fig, ax = plt.subplots(figsize=(8, 6))
for v in df.campaign.unique():
# select the data based on the campaign
data = df[df.campaign.eq(v)]
# this is only necessary if there is more than one value per date
data = data.groupby(['date_m','campaign'])['sales'].mean().reset_index()
ax.plot('date_m', 'sales', data=data, label=f'{v}')
plt.legend(title='campaign')
plt.show()
Notes
Package versions:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4

Search column for specific phrases and count the amount of times they appear in the column and plot to bar graph

Search column for each month of the year. Column is organized like this "01-Jan-2018". I want to find how many times "Jan-2018" appears in the column. Basically count it and plot it on a bar graph. I want it to show all the quantities for "Jan-2018" , "Feb-2018", etc. Should be 12 bars on the graph. Maybe using count or sum. I am pulling the data from a CSV using pandas and python.
I have tried to printing it out onto the console with some success. But I am getting confused as correct way to search a portion of the date.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import seaborn as sns
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile1.csv', error_bad_lines=False, encoding="ISO-8859-1", skiprows=6)
cols = data.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, (str)) else x)
data.columns = cols
print(data.groupby('Case_Date').mean().plot(kind='bar'))
I am expecting the a bar graph that will show the total quantity for each month. So there should be 12 bar graphs. But I am not sure how to search the column 12 times and each time only looking for the data of each month. While excluding the date, only searching for the month and year.
IIUC, this is what you need.
Let's work with the below dataframe as input dataframe.
date
0 1/31/2018
1 2/28/2018
2 2/28/2018
3 3/31/2018
4 4/30/2018
5 5/31/2018
6 6/30/2018
7 6/30/2018
8 7/31/2018
9 8/31/2018
10 9/30/2018
11 9/30/2018
12 9/30/2018
13 9/30/2018
14 10/31/2018
15 11/30/2018
16 12/31/2018
The below mentioned lines of code will get the number of count for each month as a bar graph. When you have a column as as datetime object, a lot of function are much easy & the contents of the column are much more flexible. With that, you don't need search string of the name of the month.
df['date'] = pd.to_datetime(df['date'])
df['my']=df.date.dt.strftime('%b-%Y')
ax = df.groupby('my', sort=False)['my'].value_counts().plot(kind='bar')
ax.set_xticklabels(df.my, rotation=90);
Output

Python Plotnine - Create a stacked bar chart

I've been trying to draw a stacked bar chart using plotnine. This graphic represents the end of month inventory within the same "Category". The "SubCategory" its what should get stacked.
I've built a pandas dataframe from a query to a database. The query retrieves the sum(inventory) for each "subcategory" within a "category" in a date range.
This is the format of the DataFrame:
SubCategory1 SubCategory2 SubCategory3 .... Dates
0 1450.0 130.5 430.2 .... 2019/Jan
1 1233.2 1000.0 13.6 .... 2019/Feb
2 1150.8 567.2 200.3 .... 2019/Mar
Dates should be in the X axis, and Y should be determined by the sum of "SubCategory1" + "SubCategory2" + "SubCategory3" and being color distinguishable.
I tried this because I thought it made sense but had no luck:
g = ggplot(df)
for key in subcategories:
g = g + geom_bar(aes(x='Dates', y=key), stat='identity', position='stack')
Where subcategories is a dictionary with the SubCategories name.
Maybe the format of the dataframe is not ideal. Or I don't know how to properly use it with plotnine/ggplot.
Thanks for the help.
You need the data in tidy format
from io import StringIO
import pandas as pd
from plotnine import *
from mizani.breaks import date_breaks
io = StringIO("""
SubCategory1 SubCategory2 SubCategory3 Dates
1450.0 130.5 430.2 2019/Jan
1233.2 1000.0 13.6 2019/Feb
1150.8 567.2 200.3 2019/Mar
""")
data = pd.read_csv(io, sep='\s+', parse_dates=[3])
# Make the data tidy
df = pd.melt(data, id_vars=['Dates'], var_name='categories')
"""
Dates categories value
0 2019-01-01 SubCategory1 1450.0
1 2019-02-01 SubCategory1 1233.2
2 2019-03-01 SubCategory1 1150.8
3 2019-01-01 SubCategory2 130.5
4 2019-02-01 SubCategory2 1000.0
5 2019-03-01 SubCategory2 567.2
6 2019-01-01 SubCategory3 430.2
7 2019-02-01 SubCategory3 13.6
8 2019-03-01 SubCategory3 200.3
"""
(ggplot(df, aes('Dates', 'value', fill='categories'))
+ geom_col()
+ scale_x_datetime(breaks=date_breaks('1 month'))
)
Do you really need to use plotnine? You can do it with just:
df.plot.bar(x='Dates', stacked=True)
Output:

Multiple plots in 1 figure using for loop

I've got data in the below format, and what I'm trying to do is to:
1) loop over each value in Region
2) For each region, plot a time series of the aggregated (across Category) sales number.
Date |Region |Category | Sales
01/01/2016| USA| Furniture|1
01/01/2016| USA| Clothes |0
01/01/2016| Europe| Furniture|2
01/01/2016| Europe| Clothes |0
01/02/2016| USA| Furniture|3
01/02/2016| USA|Clothes|0
01/02/2016| Europe| Furniture|4
01/02/2016| Europe| Clothes|0 ...
The plot should look like the attached (done in excel).
However, if I try to do it in Python using the below, I get multiple charts when I really want all the lines to show up in one figure.
Python code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\wusm\Desktop\Book7.csv')
plt.legend()
for index, group in df.groupby(["Region"]):
group.plot(x='Date',y='Sales',title=str(index))
plt.show()
Short of reformatting the data, could anyone advise on how to get the graphs in one figure please?
You can use pivot_table:
df = df.pivot_table(index='Date', columns='Region', values='Sales', aggfunc='sum')
print (df)
Region Europe USA
Date
01/01/2016 2 1
01/02/2016 4 3
or groupby + sum + unstack:
df = df.groupby(['Date', 'Region'])['Sales'].sum().unstack(fill_value=0)
print (df)
Region Europe USA
Date
01/01/2016 2 1
01/02/2016 4 3
and then DataFrame.plot
df.plot()

How to plot by category over time

I have two columns, categorical and year, that I am trying to plot. I am trying to take the sum total of each categorical per year to create a multi-class time series plot.
ax = data[data.categorical=="cat1"]["categorical"].plot(label='cat1')
data[data.categorical=="cat2"]["categorical"].plot(ax=ax, label='cat3')
data[data.categorical=="cat3"]["categorical"].plot(ax=ax, label='cat3')
plt.xlabel("Year")
plt.ylabel("Number per category")
sns.despine()
But am getting an error stating no numeric data to plot. I am looking for something similar to the above, perhaps with data[data.categorical=="cat3"]["categorical"].lambda x : (1 for x in data.categorical)
I will use the following lists as examples.
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2","cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3","cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
My goal is to obtain something similar to the following picture
I'm hesitant to call this a "solution", as it's basically just a summary of basic Pandas functionality, which is explained in the same documentation where you found the time series plot you've placed in your post. But seeing as there's some confusion around groupby and plotting, a demo may help clear things up.
We can use two calls to groupby().
The first groupby() gets a count of category appearances per year, using the count aggregation.
The second groupby() is used to plot the time series for each category.
To start, generate a sample data frame:
import pandas as pd
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2",
"cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3",
"cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,
2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
df = pd.DataFrame({'categorical':categorical,
'year':year})
categorical year
0 cat1 2013
1 cat1 2014
...
21 cat1 2015
22 cat3 2013
Now get counts per category, per year:
# reset_index() gives a column for counting, after groupby uses year and category
ctdf = (df.reset_index()
.groupby(['year','categorical'], as_index=False)
.count()
# rename isn't strictly necessary here, it's just for readability
.rename(columns={'index':'ct'})
)
year categorical ct
0 2013 cat1 2
1 2013 cat2 2
2 2013 cat3 3
3 2014 cat1 5
4 2014 cat2 3
5 2014 cat3 1
6 2015 cat1 1
7 2015 cat2 2
8 2015 cat3 4
Finally, plot time series for each category, keyed by color:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
# key gives the group name (i.e. category), data gives the actual values
for key, data in ctdf.groupby('categorical'):
data.plot(x='year', y='ct', ax=ax, label=key)
Have you tried groupby?
df.groupby(["year","categorical"]).count()

Categories

Resources