I have a dataframe of hourly electricity price data spanning many years. I am trying to calculate the average of the n lowest-price hourly periods in each day. Synthetic data can be created using the following:
import numpy as np
import pandas as pd

np.random.seed(0)
rng = pd.date_range('2020-01-01', periods=24, freq='H')
df = pd.DataFrame({'Date': rng, 'Price': np.random.randn(len(rng))})
I have managed to get the lowest price for each day by using:
df_min = df.groupby(pd.Grouper(key='Date', freq='D')).min()
Is there a way to get the average of the n lowest periods in a day?
Thanks in advance for any help.
We can group the dataframe with a Grouper object at daily frequency, then aggregate Price using nsmallest to obtain the n smallest values per day; finally, take the mean within level=0 (the day) to get each day's average of its n smallest values.
df.groupby(pd.Grouper(key='Date', freq='D'))['Price'].nsmallest(5).groupby(level=0).mean()
Result of calculating the average of 5 smallest values daily
Date
2020-01-01 -1.066337
Name: Price, dtype: float64
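nsmallest leaves a two-level index (day, original row position), which is why the mean is taken over level=0. An equivalent way to write it, with the number of periods as an explicit parameter (n = 5 here is just a stand-in; set it to whatever you need), is:

n = 5
df.groupby(pd.Grouper(key='Date', freq='D'))['Price'].apply(lambda s: s.nsmallest(n).mean())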
You can also try the following, which sorts by price rather than by index (note that this computes the means over the whole frame, not per day):
bottom_5_prices_mean = df.sort_values('Price').head(5)['Price'].mean()
top_5_prices_mean = df.sort_values('Price').tail(5)['Price'].mean()
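To make that per-day, the sort can be combined with a daily grouping; a minimal sketch reusing the df defined above:

bottom_5_daily = (df.sort_values('Price')
                    .groupby(pd.Grouper(key='Date', freq='D'))['Price']
                    .apply(lambda s: s.head(5).mean()))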
Related
I have to calculate the number of days in the last 30 days on which the temperature was more than 32 degrees C. I am trying to use a rolling window; the issue is that the number of days in a month varies. My first attempt was (note that len('month') is just 5, the length of the string, not the number of days in a month):
weather_2['highTemp_days'] = weather_2.groupby(['date','station'])['over32'].apply(lambda x: x.rolling(len('month')).sum())
weather_2 has 66 stations.
date varies from 1950 to 2020.
over32 is boolean data: 1 if the temperature on that date is > 32, otherwise 0.
month is taken from the date column: weather_2['month'] = weather_2['date'].dt.month
I used this:
weather_2['highTemp_days'] = weather_2.groupby(['year','station'])['over32'].apply(lambda x: x.rolling(30).sum())
The issue was that I had been grouping by date; that is why the answer was wrong.
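Note that grouping on 'year' resets each window at year boundaries, and rolling(30) counts 30 rows rather than 30 calendar days, so any gaps in the record silently widen the window. A time-based window avoids both problems; here is a sketch under the column layout described above (the small frame is a hypothetical stand-in for weather_2):

import numpy as np
import pandas as pd

# Hypothetical stand-in for weather_2, with the columns from the question.
dates = pd.date_range('1950-01-01', '1950-06-30', freq='D')
weather_2 = pd.DataFrame({
    'date': np.tile(dates, 2),
    'station': np.repeat(['A', 'B'], len(dates)),
    'over32': np.random.default_rng(0).integers(0, 2, 2 * len(dates)),
})

# Sort so the concatenated group results line up with the rows, then count
# exceedances in a trailing 30-calendar-day window per station.
weather_2 = weather_2.sort_values(['station', 'date']).reset_index(drop=True)
weather_2['highTemp_days'] = (
    weather_2.groupby('station')
             .rolling('30D', on='date')['over32']
             .sum()
             .values
)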
I have a dataset that I want to use to calculate the average quarterly growth rate, broken down by each year in the dataset.
Right now I have a dataframe with a multi-level grouping, and I'd like to apply the gmean function from scipy.stats to each year within the dataset.
The code I use to get the quarterly growth rates looks like this:
df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)
That gives me a Series of quarter-over-quarter growth ratios indexed by year and quarter. So basically I want the geometric mean of the 2014 ratios (1.162409, 1.659756, 1.250600), and the same for the quarterly growth rates of every other year.
Instinctively, I want to do something like this:
(df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)).apply(gmean, level=0)
But this doesn't work.
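Series.apply has no level argument, which is why the call above fails. The nearest working form of the same idea groups on the year level of the resulting MultiIndex; a sketch, reusing the expression from the question:

from scipy.stats import gmean

ratios = (df.groupby(df.index.year).resample('Q')['Sales'].sum()
          / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1))
ratios.dropna().groupby(level=0).apply(gmean)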
I don't know what your data looks like, so I'll generate some random sample data:
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', '2017-12-31')
n = 5000
np.random.seed(1)
df = pd.DataFrame({
    'Order Date': np.random.choice(dates, n),
    'Sales': np.random.uniform(1, 100, n)
})
Order Date Sales
0 2016-11-27 82.458720
1 2014-08-24 66.790309
2 2017-01-01 75.387001
3 2016-06-24 9.272712
4 2015-12-17 48.278467
And the code:
# Total sales per quarter
q = df.groupby(pd.Grouper(key='Order Date', freq='Q'))['Sales'].sum()
# Quarter-over-quarter growth ratio (the first quarter has no predecessor, so fill with 1)
q = (q / q.shift()).fillna(1)
# Geometric mean of the quarterly ratios within each year, expressed as a rate
from scipy.stats import gmean
y = q.groupby(pd.Grouper(freq='Y')).agg(gmean) - 1
y.index = y.index.year
y.index.name = 'Year'
y.to_frame('Avg. Quarterly Growth').style.format('{:.1%}')
Result:
Avg. Quarterly Growth
Year
2014 -4.1%
2015 -0.7%
2016 3.5%
2017 -1.1%
I have a dataset containing a big list of orders, with their traveled distance, for the entire year of 2018. To predict future orders I want to calculate the average number of orders per hour across all the Mondays of the year: the average orders between 00:00:00-01:00:00, between 01:00:00-02:00:00, and so on up to 23:00:00-24:00:00, counting Mondays only. Orders on other weekdays should not be included.
What I have so far is:
from pprint import pprint
import pandas as pd

df_data = pd.read_csv('Finalorders.csv', parse_dates=['datetime'])
week_dfsum = df_data.groupby(df_data['datetime'].dt.day_name()).sum()
week_dfmean = df_data.groupby(df_data['datetime'].dt.day_name()).mean()
pprint(week_dfsum)
pprint(week_dfmean)
But I don't know how to only include the orders on Monday.
You're close. After you produce a day-of-week column, filter it for Mondays; with dt.dayofweek, pandas numbers Monday as 0:
df[df['Day_of_Week'] == 0]
This will return only the rows from Mondays.
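Putting it together for the hourly averages; a sketch assuming Finalorders.csv has a 'datetime' column and one row per order:

import pandas as pd

df_data = pd.read_csv('Finalorders.csv', parse_dates=['datetime'])
mondays = df_data[df_data['datetime'].dt.dayofweek == 0]  # Monday == 0

# Orders per hour on each individual Monday...
per_hour = mondays.groupby([mondays['datetime'].dt.date,
                            mondays['datetime'].dt.hour]).size()

# ...then the average over all Mondays for each hour 0-23.
avg_per_hour = per_hour.groupby(level=1).mean()
print(avg_per_hour)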
I have two arrays, dates and temp, and I wish to calculate the mean temperature per day. The observations are minutely, but it is not as simple as looping and averaging every 1,440 values: the sensor turns on to record the temperature at random times, so a day can have 8 minutes of observations or 1,440. Therefore I have to iterate through each day.
Data:
Two equal-length NumPy arrays:
dates = ['2017-10-24 06:18:00.000' '2017-10-24 06:19:00.000' '2017-10-24 06:20:00.000' ... '2018-11-23 16:56:00.000' '2018-11-23 16:57:00.000' '2018-11-23 16:58:00.000']
temp = [1 2 3 ... 5 2 9]
I figure I need to select out the day component of each date and step through the range one day at a time.
Pseudo code:
AmountOfDays = max(dates.%d) - min(dates.%d)
day_index = 0
for i in days:
    for j in AmountOfDays:
        np.mean(temp)
Using pandas:
import numpy as np
import pandas as pd

# Parse the date strings into datetime64 values and pair them with the temperatures.
df = pd.DataFrame({'date': np.array(dates, dtype=np.datetime64), 'temp': temp})

# Group by calendar day and average; each day's mean is taken over however
# many readings that day happens to have (8 or 1,440 alike).
df.groupby(df.date.dt.date)['temp'].mean()
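A quick check with stand-in values shaped like the arrays in the question (the real arrays would drop straight in):

dates = ['2017-10-24 06:18:00.000', '2017-10-24 06:19:00.000',
         '2018-11-23 16:58:00.000']
temp = [1, 2, 9]
df = pd.DataFrame({'date': np.array(dates, dtype=np.datetime64), 'temp': temp})
print(df.groupby(df.date.dt.date)['temp'].mean())
# 2017-10-24    1.5
# 2018-11-23    9.0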
I am trying to calculate the weighted average of the number of times a social media post was made on a given weekday between 2009 and 2018.
This is the code I have:
weight = fb_posts2[fb_posts2['title']=='status'].groupby('year',as_index=False).apply(lambda x: (x.count())/x.sum())
What I am trying to do is group by year and weekday, count the number of times each weekday occurs in a year, and divide that by the total number of posts in that year. The idea is to return a dataframe with a weighted average of how many times each weekday occurred between 2009 and 2018.
Use .value_counts() with the normalize argument, grouping only on year.
Sample Data
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'year': np.random.choice([2010, 2011], 1000),
                   'weekday': np.random.choice(list('abcdefg'), 1000),
                   'val': np.random.normal(1, 10, 1000)})
Code:
df.groupby('year').weekday.value_counts(normalize=True)
Output:
year weekday
2010 d 0.152083
f 0.147917
g 0.147917
c 0.143750
e 0.139583
b 0.137500
a 0.131250
2011 d 0.182692
a 0.163462
e 0.153846
b 0.148077
c 0.128846
f 0.111538
g 0.111538
Name: weekday, dtype: float64
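If a wide year-by-weekday table is easier to read, the same result can be pivoted with unstack:

df.groupby('year').weekday.value_counts(normalize=True).unstack(fill_value=0)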