I have a dataframe/series containing hourly sampled data over a couple of years. I'd like to sum the values for each month, then calculate the mean of those monthly totals over all the years.
I can get a multi-index dataframe/series of the totals using:
df.groupby([df.index.year, df.index.month]).sum()
Date & Time Date & Time
2016 3 220.246292
4 736.204574
5 683.240291
6 566.693919
7 948.116766
8 761.214823
9 735.168033
10 771.210572
11 542.314915
12 434.467037
2017 1 728.983901
2 639.787918
3 709.944521
4 704.610437
5 685.729297
6 760.175060
7 856.928659
But I don't know how to then combine the data to get the means.
I might be on the wrong track entirely. I'm also not sure I've titled the question very well.
I think you need the mean per year, i.e. per the first index level:
df.groupby([df.index.year, df.index.month]).sum().mean(level=0)
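Note that in pandas 2.0 and later the level argument to mean() has been removed, so the same idea has to be written as a second groupby on the first index level. A minimal sketch, assuming df is the hourly series from the question:
monthly = df.groupby([df.index.year, df.index.month]).sum()
# average the monthly totals within each year (the first index level)
monthly.groupby(level=0).mean()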
You can use groupby twice: once to get the monthly sums, and once more to get the mean of those monthly sums:
(df.groupby(pd.Grouper(freq='M')).sum()
   .groupby(pd.Grouper(freq='Y')).mean()
)
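For completeness, a small self-contained sketch of that chain on synthetic hourly data (the column name value, the date range and the random values are made up for the example):
import numpy as np
import pandas as pd

# hypothetical hourly data covering a couple of years
idx = pd.date_range('2016-03-01', '2017-07-31 23:00', freq='H')
df = pd.DataFrame({'value': np.random.rand(len(idx))}, index=idx)

monthly_totals = df.groupby(pd.Grouper(freq='M')).sum()             # one row per calendar month
yearly_means = monthly_totals.groupby(pd.Grouper(freq='Y')).mean()  # mean of the monthly totals per year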
I have a CSV file which contains two columns: the first is a date column in the format 01/01/2020, and the second is a number for each month representing that month's sales volume. The dates range from 2004 to 2019, and my task is to create a 12-bar chart, with each bar representing the average sales volume for that month across every year's data. I attempted to use a groupby function but got an error relating to not having numeric types to aggregate. I am very new to Python, so apologies for the beginner questions. I have posted my code so far below. Thanks in advance for any help with this :)
# -*- coding: utf-8 -*-
import csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
file = "GlasgowSalesVolume.csv"
data = pd.read_csv(file)
typemean = (data.groupby(['Date', 'SalesVolume'], as_index=False).mean().groupby('Date')
['SalesVolume'].mean())
Output:
DataError: No numeric types to aggregate
I prepared a DataFrame limited to just 2 years and 3 months:
Date Sales
0 01/01/2019 3
1 01/02/2019 4
2 01/03/2019 8
3 01/01/2020 10
4 01/02/2020 20
5 01/03/2020 30
For now the Date column is of string type, so the first step is to convert it to datetime64:
df.Date = pd.to_datetime(df.Date, dayfirst=True)
Now to compute your result, run:
result = df.groupby(df.Date.dt.month).Sales.mean()
The result is a Series containing:
Date
1 6.5
2 12.0
3 19.0
Name: Sales, dtype: float64
The index is the month number (1 through 12) and the value is the mean for the respective month, across all years.
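Since the original goal was a 12-bar chart, a possible finishing step (not part of the answer above, just a matplotlib sketch) is to plot that Series directly:
import matplotlib.pyplot as plt

result.plot(kind='bar')          # one bar per month number
plt.xlabel('Month')
plt.ylabel('Average sales volume')
plt.show()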
Hi, I have a dataframe that looks like this:
ID Date Total_Amount priority
1 2007 4488 High
2 2007 40981 Low
3 2017 450 Medium
4 2008 1000 Low
Each row is a different person (ID), and each row shows how much they spent in that year (Total_Amount).
I want to create a bar chart with the years on the x-axis and the Total_Amount as the bar height, but it needs to be stacked by priority. E.g. if 10 people spent money in 2007 and their Total_Amount sum is £100,000, the height of that bar needs to be 100,000, split by priority (e.g. 5 may have been High, 4 Low and 1 Medium).
I tried using crosstab with Date as the rows and priority as the columns, but I don't get a dataframe of the Total_Amount spent; I get one of the number of people in each priority.
You can use groupby() and then unstack():
df2 = df.groupby(['Date','priority'])['Total_Amount'].sum().unstack('priority').fillna(0)
df2.plot(kind='bar', stacked=True)
This produces a stacked bar chart with one bar per year, split by priority.
Almost the same, but still using crosstab:
pd.crosstab(index=df.Date, columns=df.priority, values=df.Total_Amount, aggfunc='sum')\
  .fillna(0).plot(kind='bar', stacked=True)
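For reference, a minimal self-contained version of the groupby/unstack approach, built from the four rows shown in the question (the DataFrame construction itself is assumed, not part of the original post):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Date': [2007, 2007, 2017, 2008],
                   'Total_Amount': [4488, 40981, 450, 1000],
                   'priority': ['High', 'Low', 'Medium', 'Low']})

# sum spending per year and priority, pivot priorities into columns, plot stacked bars
df2 = df.groupby(['Date', 'priority'])['Total_Amount'].sum().unstack('priority').fillna(0)
df2.plot(kind='bar', stacked=True)
plt.show()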
I'm working in Python and I have a Pandas DataFrame of Uber data from New York City. A part of the DataFrame looks like this:
Year Week_Number Total_Dispatched_Trips
2015 51 1,109
2015 5 54,380
2015 50 8,989
2015 51 1,025
2015 21 10,195
2015 38 51,957
2015 43 266,465
2015 29 66,139
2015 40 74,321
2015 39 3
2015 50 854
As it is right now, the same week appears multiple times for each year. I want to sum the values for "Total_Dispatched_Trips" for every week for each year. I want each week to appear only once per year. (So week 51 can't appear multiple times for year 2015 etc.). How do I do this? My dataset is over 3k rows, so I would prefer not to do this manually.
Thanks in advance.
Okidoki, here it is, borrowing from Convert number strings with commas in pandas DataFrame to float:
import locale
from locale import atof
locale.setlocale(locale.LC_NUMERIC, '')

# strip the thousands separators ("1,109" -> 1109.0) and convert to numbers
df['numeric_trip'] = pd.to_numeric(df.Total_Dispatched_Trips.apply(atof), errors='coerce')
# then sum per (Year, Week_Number) so each week appears only once per year
df.groupby(['Year', 'Week_Number']).numeric_trip.sum()
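An alternative that avoids locale configuration (a sketch, not from the original answer) is to strip the thousands separators with string methods before converting:
# remove the commas, then convert the strings to numbers
df['numeric_trip'] = pd.to_numeric(
    df['Total_Dispatched_Trips'].astype(str).str.replace(',', ''),
    errors='coerce')
weekly_totals = df.groupby(['Year', 'Week_Number'])['numeric_trip'].sum()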
I have two columns, categorical and year, that I am trying to plot. I am trying to take the total count of each category per year to create a multi-class time series plot.
ax = data[data.categorical=="cat1"]["categorical"].plot(label='cat1')
data[data.categorical=="cat2"]["categorical"].plot(ax=ax, label='cat3')
data[data.categorical=="cat3"]["categorical"].plot(ax=ax, label='cat3')
plt.xlabel("Year")
plt.ylabel("Number per category")
sns.despine()
But I am getting an error stating there is no numeric data to plot. I am looking for something similar to the above, perhaps with data[data.categorical=="cat3"]["categorical"].lambda x : (1 for x in data.categorical)
I will use the following lists as examples.
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2","cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3","cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
My goal is to obtain something similar to the following picture
I'm hesitant to call this a "solution", as it's basically just a summary of basic Pandas functionality, which is explained in the same documentation where you found the time series plot you've placed in your post. But seeing as there's some confusion around groupby and plotting, a demo may help clear things up.
We can use two calls to groupby().
The first groupby() gets a count of category appearances per year, using the count aggregation.
The second groupby() is used to plot the time series for each category.
To start, generate a sample data frame:
import pandas as pd
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2",
"cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3",
"cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,
2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
df = pd.DataFrame({'categorical': categorical, 'year': year})
categorical year
0 cat1 2013
1 cat1 2014
...
21 cat1 2015
22 cat3 2013
Now get counts per category, per year:
# reset_index() gives a column for counting, after groupby uses year and category
ctdf = (df.reset_index()
.groupby(['year','categorical'], as_index=False)
.count()
# rename isn't strictly necessary here, it's just for readability
.rename(columns={'index':'ct'})
)
year categorical ct
0 2013 cat1 2
1 2013 cat2 2
2 2013 cat3 3
3 2014 cat1 5
4 2014 cat2 3
5 2014 cat3 1
6 2015 cat1 1
7 2015 cat2 2
8 2015 cat3 4
Finally, plot time series for each category, keyed by color:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
# key gives the group name (i.e. category), data gives the actual values
for key, data in ctdf.groupby('categorical'):
    data.plot(x='year', y='ct', ax=ax, label=key)
Have you tried groupby?
df.groupby(["year","categorical"]).count()