I have a simple DataFrame with 2 columns, date and value. I need to create another DataFrame which would contain the average value for every month of every year. For example, I have daily data in the range from 2015-01-01 to 2018-12-31,
so I need averages for every month in 2015, 2016, etc.
What is the easiest way to do that?
You can aggregate by month period with Series.dt.to_period and mean:
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.to_period('M'))['col'].mean().reset_index()
Another solution with year and months in separate columns:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df1 = df.groupby(['year','month'])['col'].mean().reset_index()
Sample:
df = pd.DataFrame({'date':['2015-01-02','2016-03-02','2015-01-23','2016-01-12','2015-03-02'],
'col':[1,2,5,4,6]})
print (df)
date col
0 2015-01-02 1
1 2016-03-02 2
2 2015-01-23 5
3 2016-01-12 4
4 2015-03-02 6
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.to_period('M'))['col'].mean().reset_index()
print (df1)
date col
0 2015-01 3
1 2015-03 6
2 2016-01 4
3 2016-03 2
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df2 = df.groupby(['year','month'])['col'].mean().reset_index()
print (df2)
year month col
0 2015 1 3
1 2015 3 6
2 2016 1 4
3 2016 3 2
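If the Period values in the grouped result need to be real timestamps again (e.g. for plotting), the column can be converted back with Series.dt.to_timestamp — a minimal sketch based on the sample above:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2015-01-02', '2016-03-02', '2015-01-23'],
                   'col': [1, 2, 5]})
df['date'] = pd.to_datetime(df['date'])

# group by month period, then convert the Period column back to timestamps
df1 = df.groupby(df['date'].dt.to_period('M'))['col'].mean().reset_index()
df1['date'] = df1['date'].dt.to_timestamp()  # first day of each month
print(df1)
```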
To get the monthly average values of a DataFrame when the DataFrame has daily data rows, I would:
Convert the column with the dates, df['date'], into the index of the DataFrame df: df.set_index('date', inplace=True)
Then convert the index dates into a month index: df.index.month
Finally, calculate the mean of the DataFrame grouped by month: df.groupby(df.index.month).data.mean()
I'll go slowly through each step here:
Generating a DataFrame with dates and values
You first need to import pandas and NumPy, as well as the datetime module:
import pandas as pd
import numpy as np
from datetime import datetime
Generate a column 'date' between 1/1/2018 and 3/05/2018, at weekly ('W') intervals, and a column 'data' with random values between 0 and 99:
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['data']=np.random.randint(0,100,size=(len(date_rng)))
Now df has two columns, 'date' and 'data':
date data
0 2018-01-07 42
1 2018-01-14 54
2 2018-01-21 30
3 2018-01-28 43
4 2018-02-04 65
5 2018-02-11 40
6 2018-02-18 3
7 2018-02-25 55
8 2018-03-04 81
Set the 'date' column as the index of the DataFrame:
df.set_index('date',inplace=True)
df has one column 'data' and the index is 'date':
data
date
2018-01-07 42
2018-01-14 54
2018-01-21 30
2018-01-28 43
2018-02-04 65
2018-02-11 40
2018-02-18 3
2018-02-25 55
2018-03-04 81
Capture the month number from the index:
months=df.index.month
Obtain the mean value of each month, grouping by month:
monthly_avg=df.groupby(months).data.mean()
The mean of the dataset by month 'monthly_avg' is:
date
1 42.25
2 40.75
3 81.00
Name: data, dtype: float64
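One caveat: grouping by df.index.month alone merges the same calendar month of different years (January 2018 and January 2019 would land in one bucket). If the data spans several years, group by both year and month, as in the first answer above. A deterministic sketch of the same pipeline (with fixed values 0..8 instead of random ones, so the means are checkable):

```python
import pandas as pd

# same weekly index as above, but with deterministic values 0..8
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame({'date': date_rng, 'data': range(len(date_rng))})
df.set_index('date', inplace=True)

# group by (year, month) so different years stay separate
monthly_avg = df.groupby([df.index.year, df.index.month]).data.mean()
print(monthly_avg)
```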
I want to filter rows of df1 using its date column (dtype datetime64[ns]) against df2 (same column name and dtype). I tried searching for a solution but I get errors such as:
Can only compare identically-labeled Series objects or 'Timestamp' object is not iterable.
sample df1
   id       date  value
0   1 2018-10-09    120
1   2 2018-10-09     60
2   3 2018-10-10     59
3   4 2018-11-25    120
4   5 2018-08-25    120
sample df2
date
2018-10-09
2018-10-10
sample result that I want
   id       date  value
0   1 2018-10-09    120
1   2 2018-10-09     60
2   3 2018-10-10     59
In fact, I want this program to run once every 7 days, counting back from the day it started, so I want it to remove dates that are not within those past 7 days.
# create new dataframe -> df2
import pandas as pd
from datetime import date, timedelta

df2 = pd.DataFrame({'date': []})
# Set the date to the last 7 days.
days_use = 7  # 7 -> 1
for x in range(days_use, 0, -1):
    use_day = date.today() - timedelta(days=x)
    df2.loc[x] = use_day
# Change to datetime64[ns]
df2['date'] = pd.to_datetime(df2['date'])
Use isin:
>>> df1[df1["date"].isin(df2["date"])]
id date value
0 1 2018-10-09 120
1 2 2018-10-09 60
2 3 2018-10-10 59
If you want to create df2 with the dates for the past week, you can simply use pd.date_range:
df2 = pd.DataFrame({"date": pd.date_range(pd.Timestamp.today().date()-pd.DateOffset(7),periods=7)})
>>> df2
date
0 2022-05-03
1 2022-05-04
2 2022-05-05
3 2022-05-06
4 2022-05-07
5 2022-05-08
6 2022-05-09
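Putting the two pieces together, you don't even need to materialize df2 at all: the isin mask can take the date range directly. A sketch with a hypothetical fixed "run date" so the result is reproducible (in real use you would substitute pd.Timestamp.today().normalize()):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'date': pd.to_datetime(['2018-10-09', '2018-10-09', '2018-10-10',
                                            '2018-11-25', '2018-08-25']),
                    'value': [120, 60, 59, 120, 120]})

# hypothetical "run date"; in production: pd.Timestamp.today().normalize()
today = pd.Timestamp('2018-10-11')
last_week = pd.date_range(today - pd.DateOffset(days=7), periods=7)  # the 7 days before today

result = df1[df1['date'].isin(last_week)]
print(result)
```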
I have two DataFrames:
df1:
A B
Date
01/01/2020 2 4
02/01/2020 6 8
df2:
A B
Date
01/01/2020 5 10
I want to get the following:
df3:
A B
Date
01/01/2020 10 40
02/01/2020 30 80
What I want is to multiply the column entries based on year and month in DatetimeIndex. But I'm not sure how to iterate over datetime.
Use to_numpy(); NumPy broadcasting stretches df2's single row across every row of df1:
df3=pd.DataFrame(df1.to_numpy()*df2.to_numpy(),index=df1.index,columns=df1.columns)
output of df3:
A B
Date
01/01/2020 10 40
02/01/2020 30 80
You may need reindex:
df1.index = pd.to_datetime(df1.index,dayfirst=True)
df2.index = pd.to_datetime(df2.index,dayfirst=True)
df2.index = df2.index.strftime('%Y-%m')
df1[:] *= df2.reindex(df1.index.strftime('%Y-%m')).values
df1
Out[529]:
A B
Date
2020-01-01 10 40
2020-01-02 30 80
I have a dataset like below-
Store Date Weekly_Sales
0 1 2010-05-02 1643690.90
1 1 2010-12-02 1641957.44
2 1 2010-02-19 1611968.17
3 1 2010-02-26 1409727.59
4 1 2010-05-03 1554806.68
It has 100 stores in all. I want to filter the data of the year 2012 by quarter.
# Filter out only the data in 2012 from the dataset
import datetime as dt
df['Date'] = pd.to_datetime(df['Date'])
ds_2012 = df[df['Date'].dt.year == 2012]
# Calculate Q on the dataset
ds_2012 = ds_2012.sort_values(['Date'],ascending=True)
quarterly_sales = ds_2012.groupby(['Store', pd.Grouper(key='Date', freq='Q')])['Weekly_Sales'].sum()
quarterly_sales.head(20)
Output Received
Store Date
1 2012-03-31 18951097.69
2012-06-30 21036965.58
2012-09-30 18633209.98
2012-12-31 9580784.77
The sums for Q2 (2012-06-30) and Q3 (2012-09-30) are both incorrect when checked against a filter in Excel. I am a newbie to pandas.
You can groupby store and resample the DataFrame quarterly:
import pandas as pd
df=pd.concat([pd.DataFrame({'Store':[i]*12, 'Date':pd.date_range(start='2020-01-01', periods=12, freq='M'), 'Sales':list(range(12))}) for i in [1,2]])
df.groupby('Store').resample('Q', on='Date').sum().drop('Store', axis=1)
Sales
Store Date
1 2020-03-31 3
2020-06-30 12
2020-09-30 21
2020-12-31 30
2 2020-03-31 3
2020-06-30 12
2020-09-30 21
2020-12-31 30
Maybe check the groupby and resample docs as well.
I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a 6 column similar dataframe with datetimeindexes belonging to 2019.
You can create from the index 3 additional columns that represent the hour, day and month, and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
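With those helper columns on both frames, the actual join is just a merge on the calendar parts, ignoring the year. A minimal sketch with hypothetical meteo/consumption frames (the column names 'temp' and 'consumption' are made up for illustration):

```python
import pandas as pd

# hypothetical meteo data for 2019 and power consumption data for 2017
meteo = pd.DataFrame({'temp': [1.0, 2.0]},
                     index=pd.to_datetime(['2019-01-01 00:00', '2019-01-01 01:00']))
power = pd.DataFrame({'consumption': [10.0, 20.0]},
                     index=pd.to_datetime(['2017-01-01 00:00', '2017-01-01 01:00']))

# add the calendar-part columns to both frames
for d in (meteo, power):
    d['month'] = d.index.month
    d['day'] = d.index.day
    d['hour'] = d.index.hour

# join on month/day/hour, so the differing years don't matter
joined = meteo.merge(power, on=['month', 'day', 'hour'])
print(joined)
```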
Have a df like that:
Client Status Dat_Start Dat_End
1 A 2015-01-01 2015-01-19
1 B 2016-01-01 2016-02-02
1 A 2015-02-12 2015-02-20
1 B 2016-01-30 2016-03-01
I'd like to get the average difference between the two dates (Dat_End and Dat_Start) for Status='A', grouping by the Client column, using pandas syntax.
So it would be something SQL-like:
Select Client, AVG(Dat_End - Dat_Start) as Date_Diff
from Table
where Status='A'
Group by Client
Thanks!
Calculate the timedeltas:
df['duration'] = df.Dat_End-df.Dat_Start
df
Out[92]:
Client Status Dat_Start Dat_End duration
0 1 A 2015-01-01 2015-01-19 18 days
1 1 B 2016-01-01 2016-02-02 32 days
2 1 A 2015-02-12 2015-02-20 8 days
3 1 B 2016-01-30 2016-03-01 31 days
Filter and ask for sum and count for pandas <0.20:
df[df.Status=='A'].groupby('Client').duration.agg(['sum', 'count'])
Out[98]:
sum count
Client
1 26 days 2
For pandas 0.20 and later, mean was added to groupby for timedeltas, so this will work directly:
df[df.Status=='A'].groupby('Client').duration.mean()
In [10]: df.loc[df.Status == 'A'].groupby('Client') \
.apply(lambda x: (x.Dat_End-x.Dat_Start).mean()).reset_index()
Out[10]:
Client 0
0 1 13 days
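If a plain number of days is needed instead of a Timedelta, the grouped mean can be converted afterwards. A sketch using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'Client': [1, 1, 1, 1],
                   'Status': ['A', 'B', 'A', 'B'],
                   'Dat_Start': pd.to_datetime(['2015-01-01', '2016-01-01',
                                                '2015-02-12', '2016-01-30']),
                   'Dat_End': pd.to_datetime(['2015-01-19', '2016-02-02',
                                              '2015-02-20', '2016-03-01'])})

df['duration'] = df.Dat_End - df.Dat_Start
avg = df[df.Status == 'A'].groupby('Client')['duration'].mean()

# convert the Timedelta result to a float number of days
avg_days = avg.dt.total_seconds() / 86400
print(avg_days)
```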