finding first and last available days of a month in pandas - python

I have a pandas dataframe from 2007 to 2017. The data is like this:
date closing_price
2007-12-03 728.73
2007-12-04 728.83
2007-12-05 728.83
2007-12-07 728.93
2007-12-10 728.22
2007-12-11 728.50
2007-12-12 728.51
2007-12-13 728.65
2007-12-14 728.65
2007-12-17 728.70
2007-12-18 728.73
2007-12-19 728.73
2007-12-20 728.73
2007-12-21 728.52
2007-12-24 728.52
2007-12-26 728.90
2007-12-27 728.90
2007-12-28 728.91
2008-01-05 728.88
2008-01-08 728.86
2008-01-09 728.84
2008-01-10 728.85
2008-01-11 728.85
2008-01-15 728.86
2008-01-16 728.89
As you can see, some days are missing in each month. I want to take the first and last available days of each month, compute the difference between their closing_price values, and put the results in a new dataframe. For example, for the first month the days are 2007-12-03 and 2007-12-28, the closing prices are 728.73 and 728.91, so the result would be 0.18. How can I do this?

You can group df by month and apply a function to each group. Note the to_period call: it converts the datetime values to periods with the desired frequency, so each group covers one month. The function below assumes the rows are indexed in chronological order, so within each group the smallest and largest index labels correspond to the first and last available days.
def calculate(x):
    start_closing_price = x.loc[x.index.min(), "closing_price"]
    end_closing_price = x.loc[x.index.max(), "closing_price"]
    return end_closing_price - start_closing_price
result = df.groupby(df["date"].dt.to_period("M")).apply(calculate)
# result
date
2007-12 0.18
2008-01 0.01
Freq: M, dtype: float64

First make sure they are datetime and sorted:
import pandas as pd
df['date'] = pd.to_datetime(df.date)
df = df.sort_values('date')
Groupby
gp = df.groupby([df.date.dt.year.rename('year'), df.date.dt.month.rename('month')])
gp.closing_price.last() - gp.closing_price.first()
#year month
#2007 12 0.18
#2008 1 0.01
#Name: closing_price, dtype: float64
or
gp = df.groupby(pd.Grouper(key='date', freq='1M'))
gp.last() - gp.first()
# closing_price
#date
#2007-12-31 0.18
#2008-01-31 0.01
Resample
gp = df.set_index('date').resample('1M')
gp.last() - gp.first()
# closing_price
#date
#2007-12-31 0.18
#2008-01-31 0.01
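Equivalently, the subtraction can be folded into a single agg call; a minimal sketch, assuming every month in the span has at least one row (an empty bin would make iloc fail):
monthly_change = df.set_index('date')['closing_price'].resample('1M').agg(lambda s: s.iloc[-1] - s.iloc[0])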

Problem: Get first or last date of indexed dataframe
Solution: Resample the index and then extract the data.
lom = pd.Series(x.index, index=x.index).resample('M').last()   # last available date in each month
xlast = x[x.index.isin(lom)]   # append .resample('M').last() to get a month-end index
fom = pd.Series(x.index, index=x.index).resample('M').first()  # first available date in each month
xfirst = x[x.index.isin(fom)]
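Applied to the first question's data, a small sketch (assuming df is that question's frame with a parsed date column):
x = df.set_index('date')['closing_price']
fom = pd.Series(x.index, index=x.index).resample('M').first()
lom = pd.Series(x.index, index=x.index).resample('M').last()
x[x.index.isin(lom)].values - x[x.index.isin(fom)].values  # roughly array([0.18, 0.01])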

Related

Pandas resample MultiIndex dataframe with forward fill

I am trying to resample a MultiIndex dataframe to a less granular frequency (daily to month end) by taking the last valid daily observation in every month.
For example, given the dataframe below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': [pd.to_datetime('2012-03-29')] * 4
                           + [pd.to_datetime('2012-03-30')] * 4
                           + [pd.to_datetime('2012-04-01')] * 4,
                   'groups': [1, 2, 3, 4] * 3,
                   'values': np.random.normal(size=12)})
df = df.set_index(['date', 'groups'])
values
date groups
2012-03-29 1 0.013681
2 0.359522
3 -0.525454
4 -0.282541
2012-03-30 1 0.155501
2 -1.053596
3 0.003049
4 -0.165875
2012-04-01 1 -0.049135
2 2.701785
3 2.240875
4 0.057297
The desired final dataframe is:
values
date groups
2012-03-31 1 0.155501
2 -1.053596
3 0.003049
4 -0.165875
In a regular dataframe (with single index), the desired output can be achieved with df.asfreq('M', method='ffill') as shown below.
df = pd.DataFrame({'date': [pd.to_datetime('2012-03-29')] + pd.date_range('2012-04-01', '2012-04-04').to_list(),
'values':np.random.normal(size=5)})
df = df.set_index('date')
df_monthly = df.asfreq('M', method='ffill')
Where df is:
values
date
2012-03-29 1.988554
2012-04-01 -1.054163
2012-04-02 -1.112537
2012-04-03 0.224515
2012-04-04 0.152175
and df_monthly is:
values
date
2012-03-31 1.988554
Any help is much appreciated. Thanks in advance.
Use:
df_monthly = (df.reset_index(level=1)
                .groupby('groups')[['values']]
                .apply(lambda x: x.asfreq('M', method='ffill'))
                .swaplevel(1, 0))
print (df_monthly)
values
date groups
2012-03-31 1 -2.951662
2 -1.495653
3 -0.948413
4 0.066219
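An alternative sketch with the same result, assuming the index and column names from the question: unstack the groups level so the frame has a plain DatetimeIndex, convert the frequency, then stack back:
df_monthly = df['values'].unstack('groups').asfreq('M', method='ffill').stack().to_frame('values')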

Grouping dates together by year in Pandas

I have a dataset of property prices, currently listed by 'SALE_DATE'. I'd like to be able to count them by year. The dataset looks like this -
SALE_DATE COUNTY SALE_PRICE
0 2010-01-01 Dublin 343000.0
1 2010-01-03 Laois 185000.0
2 2010-01-04 Dublin 438500.0
3 2010-01-04 Meath 400000.0
4 2010-01-04 Kilkenny 160000.0
This is the code I've tried -
by_year = property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
print(by_year)
I think I'm close, but as a complete noob it's quite frustrating!
Thank you for any help you can provide; this site has been awesome so far in finding little tips and tricks to make my life easier
You are close. As you did, use pd.to_datetime to convert SALE_DATE to a datetime column. Then group by the year, using dt.year, which extracts the year from each datetime, and call size() on the result, which counts the rows in each group, i.e. the number of sales per year.
property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
property_prices.groupby(property_prices.SALE_DATE.dt.year).size()
Which prints:
SALE_DATE
2010 5
dtype: int64
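A one-line alternative giving the same counts is value_counts on the year (sorted by index afterwards, since value_counts orders by count):
property_prices['SALE_DATE'].dt.year.value_counts().sort_index()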
import pandas as pd

sample_dict = {'Date': ['2010-01-11', '2020-01-22', '2010-03-12'], 'Price': [1000, 2000, 3500]}
df = pd.DataFrame(sample_dict)
# Create a 'year' column from the Date strings
df['year'] = df.apply(lambda row: row.Date.split('-')[0], axis=1)
# Group by the new column (note: the name is case sensitive)
df1 = df.groupby('year')
# Print the first value in each group
df1.first()
Output:
            Date  Price
year
2010  2010-01-11   1000
2020  2020-01-22   2000

Subtracting values across grouped data frames in Pandas

I have a set of IDs and timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb is a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id', 'timeDeltaMin'])

def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert the Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert the result into the cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
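To avoid grouping twice, the same numbers can be computed with a single groupby and agg; a minimal sketch:
df.groupby('id')['timestamp'].agg(lambda s: (s.max() - s.min()) / pd.Timedelta(minutes=1))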
You can sort by id and timestamp, then group by id and take the difference between the min and max timestamp per group.
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id', 'timestamp']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max'] - result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd

ids = [1, 1, 2, 2, 2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id': ids, 'timestamp': pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().sum(level=0)['timestamp'].dt.seconds / 60)
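One caveat on this last snippet: .dt.seconds keeps only the seconds component of each timedelta and silently drops whole days, so if an id ever spans more than a day, total_seconds() is the safer conversion; a sketch:
print(df.groupby(level=0)['timestamp'].apply(lambda s: s.diff().sum().total_seconds() / 60))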

Propagate dates pandas and interpolate

We have some readily available sales data for certain periods, like 1 week, 1 month ... 1 year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date 1 week, 1 month ... 1 year from now.
Then, I'd like to create another column, df['days_from_now'], holding how many days away these pillars are (1 week would be 7 days, 1 month around 30 days, 1 year around 365 days).
The goal is then to use any day as input to a simple linear_interpolation_method() to obtain sales data for that day (e.g., what are the sales for 4 October 2018? We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta
def create_dates(df):
    df['date'] = [i.date() for i in
                  [d + delt for d, delt in zip([datetime.now()] * 4,
                                               [relativedelta(weeks=1), relativedelta(months=1),
                                                relativedelta(months=3), relativedelta(years=1)])]]
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df

create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function, so that you can call it on any given day and get your results for 1 week, 1 month, etc. from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
              [d + delt for d, delt in zip([datetime.now()] * 4,
                                           [relativedelta(weeks=1), relativedelta(months=1),
                                            relativedelta(months=3), relativedelta(years=1)])]]
Takes a list of the date today (datetime.now()) repeated 4 times, and adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date (i.date() for ...), finally creating a new column using the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward: it simply subtracts those new dates from today's date. The result is a timedelta object, which pandas conveniently formats as "n days".
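For the interpolation goal itself, one possible sketch is numpy's np.interp over the integer day offsets (using the integer days_from_now variant from the note above; the target date is just the question's example, assuming "now" is around the date of this answer):
import numpy as np
df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()]
target = (datetime(2018, 10, 4).date() - datetime.now().date()).days
np.interp(target, df['days_from_now'], df['sales'])  # linear interpolation between the 3M and 1Y pillars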

pandas Grouper changes date value

Based on this thread: Pandas Subset of a Time Series Without Resampling
The goal is to return the latest date in a month (with a value), and return that value.
Sample code:
Date CumReturn
3/31/2017 1
4/3/2017 .99
5/31/2017 1.022
4/4/2017 100
4/28/2017 1.012
5/1/2017 1.011
6/30/2017 1.033
import pandas as pd
df = pd.read_clipboard(parse_dates=['Date'])
df = df.set_index('Date')  # assign the result: set_index does not modify df in place
df
I thought this would work:
df.groupby(pd.Grouper(freq = 'M')).max()
But it returns the dates corresponding to the highest values (CumReturn), rather than the max dates in the index.
df.groupby(pd.Grouper(freq = 'M')).last()
However, the output labels April with its last calendar day rather than the latest date present in the df. pandas assigns the value from April 28 to April 30 and returns this df:
CumReturn
Date
2017-03-31 1.000
2017-04-30 1.012
2017-05-31 1.022
2017-06-30 1.033
What causes this behavior? I assume pandas is just picking the latest date in each month, but that seems odd since those dates aren't present in the original data.
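The thread ends without an answer, but the behavior is consistent with how Grouper labels bins: pd.Grouper(freq='M') names each monthly bin by its month-end timestamp regardless of which dates actually occur inside it, so the values are the true last observations while the labels are normalized month ends. A hedged sketch for keeping the actual last observed date is to group on a period instead, so Date survives as a regular column:
tmp = df.reset_index().sort_values('Date')
tmp.groupby(tmp['Date'].dt.to_period('M')).last()  # the Date column keeps e.g. 2017-04-28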
