Pandas groupby within a date range - python

Consider this sample data created by this code:
import numpy as np
import pandas as pd

np.random.seed(0)
rng = pd.date_range('2017-09-19', periods=1000, freq='D')
randomlist = np.random.choice(1000, 10000, replace=True)
print(f'randomlist length is {len(randomlist)}')
test = pd.DataFrame({'id': randomlist[:len(rng)], 'Date': rng, 'Val': np.random.randn(len(rng))})
The desired output is a groupby on id that sums Val, but only for rows whose Date falls within a particular range. The range is different for each id:
It starts one month after the earliest date for that id, and it ends one year after that start date.
So, for example, if my data appeared this way:
    id        Date       Val
0  684  2017-09-19  0.640472
1  684  2017-10-20 -0.732568
2  501  2017-08-21 -1.141365
3  501  2017-09-22 -0.283020
4  501  2017-09-23  0.725941
5  684  2017-09-24  0.56789
I would want the groupby to only consider the dates for id 684 between 2017-10-19 (i.e. one month later than the earliest date) and 2018-10-19 (i.e. one year after the earliest date plus one month).
I have tried a plain groupby and Grouper to no avail. Neither seems able to limit the rows considered by date. Perhaps I am missing something easy? Thanks for taking a look.
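One possible approach, sketched here on the six example rows rather than the random data: compute each id's earliest date with transform, derive the per-row window bounds from it, filter, and only then group. The variable names (first, start, end, in_window, result) are illustrative, and treating both ends of the window as inclusive is an assumption.

```python
import pandas as pd

# The six example rows from the question
test = pd.DataFrame({
    "id": [684, 684, 501, 501, 501, 684],
    "Date": pd.to_datetime(["2017-09-19", "2017-10-20", "2017-08-21",
                            "2017-09-22", "2017-09-23", "2017-09-24"]),
    "Val": [0.640472, -0.732568, -1.141365, -0.283020, 0.725941, 0.56789],
})

# Earliest date per id, broadcast back onto every row of that id
first = test.groupby("id")["Date"].transform("min")

# Window: one month after the earliest date, lasting one year
start = first + pd.DateOffset(months=1)
end = start + pd.DateOffset(years=1)

# Keep only rows inside their own id's window (inclusive on both ends), then sum
in_window = test["Date"].between(start, end)
result = test[in_window].groupby("id")["Val"].sum()
print(result)
```

An id with no rows inside its window simply does not appear in result; reindexing on the unique ids with a fill value of 0 would be one way to keep it.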

Related

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but only rows for 1-hour before and 1-hour after a specified timestamp, and so only rows within this specified 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter that drops all rows before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do this with a boolean mask:
import pandas as pd
import datetime as dt

data = {"date": ["2011-01-15 03:10:00", "2011-01-15 03:40:00", "2011-01-15 04:10:00",
                 "2011-01-15 04:40:00", "2011-01-15 05:10:00", "2011-01-15 07:10:00"],
        "value": [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
# Parse the strings into datetime64 values so they can be compared
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
date_search = dt.datetime.strptime("2011-01-15 05:20:00", '%Y-%m-%d %H:%M:%S')
# Keep rows later than one hour before, and up to one hour after, the target time
mask = (df['date'] > date_search - dt.timedelta(hours=1)) & (df['date'] <= date_search + dt.timedelta(hours=1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
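As for the SyntaxError in the question itself: a slice like date - hour: date + hour is only valid inside square brackets, so df([...]) cannot parse. Label slicing on a sorted DatetimeIndex goes through .loc instead. A minimal sketch along those lines, where window_around and its hours parameter are hypothetical names and the sample frame is made up for illustration:

```python
import pandas as pd

def window_around(df, when, hours=1):
    """Return the rows within `hours` before and after `when` (hypothetical helper)."""
    ts = pd.Timestamp(when)
    delta = pd.Timedelta(hours=hours)
    # Slicing by label needs a sorted DatetimeIndex
    indexed = df.set_index("timestamp").sort_index()
    # .loc slices on a DatetimeIndex include both endpoints
    return indexed.loc[ts - delta: ts + delta]

# Made-up 1-minute data covering 04:00 to 08:59
df = pd.DataFrame({
    "timestamp": pd.date_range("2011-07-14 04:00", periods=300, freq="min"),
    "value": range(300),
})
out = window_around(df, "2011-07-14 06:15:00")
print(out.index.min(), out.index.max(), len(out))
```

The user only supplies the single timestamp; both bounds are derived from it inside the function.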

How to fill missing observations in time series data

I have a hypothetical time series data frame with some missing observations (the assumption is that the data frame should contain every date in the year with a corresponding value). As the head and tail output shows, some dates and their values are missing (30th Jan and 29th Dec). There are many more such gaps in the data frame, sometimes spanning more than one consecutive date.
Is there a way to detect the missing dates, insert them into the data frame, and fill the corresponding values with a rolling average over a one-week window (this would naturally increase the number of rows in the data frame)? I would appreciate any input.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('D').rolling('7D').mean()
If you need every date in the year, use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
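Note that both snippets replace every value with its 7-day rolling mean, including the dates that were actually observed. If only the inserted dates should be filled, one option (a sketch, assuming a trailing 7-day window is what is wanted) is to compute the rolling mean separately and use it only where the value is missing. The names s, rolling_mean and filled are illustrative:

```python
import pandas as pd

# First three rows from the question; 2020-01-30 is missing
df = pd.DataFrame({"date": ["2020-01-28", "2020-01-29", "2020-01-31"],
                   "value": [25, 32, 45]})
df["date"] = pd.to_datetime(df["date"])

# Insert the missing dates as NaN rows
s = df.set_index("date")["value"].asfreq("D")

# Trailing 7-day mean; NaN entries are skipped when averaging
rolling_mean = s.rolling("7D", min_periods=1).mean()

# Keep observed values, fill only the gaps
filled = s.fillna(rolling_mean)
print(filled)
```

Here 2020-01-30 is filled with the mean of 25 and 32, and the three observed values are left untouched. A gap longer than seven days would still come out as NaN, because its window contains no observations.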

Pandas group values and get mean by date range

I have a DataFrame like this
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
value date
0 64.885 2018-01-11
1 74.839 2018-01-15
2 41.481 2018-01-17
3 22.027 2018-01-17
4 53.747 2018-01-18
... ... ...
514 61.017 2018-12-22
515 68.376 2018-12-21
516 79.079 2018-12-26
517 73.975 2018-12-26
518 76.923 2018-12-26
519 rows × 2 columns
I want to plot value against date, and I am using this:
df.plot( x='date',y='value')
And I get this:
The problem is that this plot has too many fluctuations, and I want to smooth it. My idea is to group the values by date intervals and take the mean. For example, with 10-day intervals, take the mean of the values between July 1 and July 10 and plot it as a point at July 5.
The long way is to get the date range, split it into N ranges with start and end dates, filter the data by each range, calculate the mean, and put the results in another DataFrame.
Is there a shorter way to do that?
P.S. Ignore the peaks.
One thing you could do for instance is to take the rolling mean of the dataframe, using DataFrame.rolling along with mean:
df = df.set_index(df.date).drop('date', axis=1)
df.rolling(3).mean().plot()
For the example dataframe you have, directly plotting the dataframe would result in:
After taking the rolling mean, you would have:
Here I chose a window of 3, but this will depend on how smooth you want it to be.
Based on yatu's answer:
The problem with that answer is that rolling(3) counts rows, not dates. Once the index is converted to Timestamp values and sorted, rolling can take a time-based window instead [ pandas.rolling ]:
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df.sort_index(inplace=True)
df.rolling('10D').mean().plot( ylim=(30,100) , figsize=(16,5), grid=True)
Final results
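The rolling mean above gives one smoothed point per original row. The question's original idea, one point per 10-day interval placed in the middle of the interval, might be sketched with resample instead. The data below is randomly generated for illustration only, and shifting the labels by 5 days to approximate the bin centre is an assumption about what "the point in July 5" should mean:

```python
import numpy as np
import pandas as pd

# Made-up daily data standing in for the question's numpy_data
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", periods=100, freq="D")
df = pd.DataFrame({"date": dates, "value": rng.uniform(20, 80, len(dates))})

# One mean per consecutive 10-day bin; each bin is labelled by its first day
binned = df.set_index("date")["value"].resample("10D").mean()

# Move each label from the start of the bin to roughly its centre
binned.index = binned.index + pd.Timedelta(days=5)
print(binned.head())
```

Calling binned.plot() then draws one point per interval, which is coarser than the rolling version but closer to what the question describes.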

Propagate dates pandas and interpolate

We have some ready available sales data for certain periods, like 1week, 1month...1year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date of 1week, 1month..1year from now.
Then, I'd like to create another column df['days_from_now'], taking how many days are on these pillars (1week would be 7days, 1month would be around 30days..1year around 365days).
The goal is then to use any day as input to a simple linear_interpolation_method() to obtain sales data for that day (e.g., what are sales for 4 October 2018? ---> We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta
def create_dates(df):
    df['date'] = [i.date() for i in
                  [d+delt for d,delt in zip([datetime.now()] * 4 ,
                      [relativedelta(weeks=1), relativedelta(months=1),
                       relativedelta(months=3), relativedelta(years=1)])]]
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df

create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function, so that you can call it on any given day and get your results for 1 week, 1 month, 3 months, and 1 year from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
[d+delt for d,delt in zip([datetime.now()] * 4 ,
[relativedelta(weeks=1), relativedelta(months=1),
relativedelta(months=3), relativedelta(years=1)])]]
Takes a list of the date today (datetime.now()) repeated 4 times, and adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date (i.date() for ...), finally creating a new column using the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward: it simply subtracts today's date from the new dates computed above. The result is a timedelta object, which pandas conveniently formats as "n days".
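For the interpolation part of the question, which the answer above leaves open, one possibility is plain linear interpolation over the integer days_from_now values with numpy.interp. This is only a sketch: interpolate_sales is a hypothetical name, and the day counts are copied from the sample output above rather than recomputed from today's date.

```python
import numpy as np

def interpolate_sales(days_from_now, sales, target_days):
    """Linearly interpolate sales at `target_days` (hypothetical helper)."""
    # np.interp expects the x values (days) in increasing order
    return float(np.interp(target_days, days_from_now, sales))

# Day counts taken from the sample output above (1W, 1M, 3M, 1Y)
days = [7, 30, 91, 365]
sales = [4.75, 5.00, 5.10, 5.75]

# A day 180 days out falls between the 3M and 1Y pillars
value = interpolate_sales(days, sales, 180)
print(value)
```

Be aware that np.interp does not extrapolate: a target before the first pillar or after the last one returns the nearest pillar's sales value.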

pandas Grouper changes date value

Based on this thread: Pandas Subset of a Time Series Without Resampling
The goal is to return the latest date in a month (with a value), and return that value.
Sample data and code:
Date CumReturn
3/31/2017 1
4/3/2017 .99
5/31/2017 1.022
4/4/2017 100
4/28/2017 1.012
5/1/2017 1.011
6/30/2017 1.033
import pandas as pd
df = pd.read_clipboard(parse_dates = ['Date'])
df = df.set_index('Date')
df
I thought this would work:
df.groupby(pd.Grouper(freq = 'M')).max()
But it returns the dates corresponding to the highest values (CumReturn), rather than the max dates in the index.
df.groupby(pd.Grouper(freq = 'M')).last()
However, the output shows that the last day in April is chosen, rather than the latest day in the df. pandas assigns the value from April 28 to April 30, and returns this df:
CumReturn
Date
2017-03-31 1.000
2017-04-30 1.012
2017-05-31 1.022
2017-06-30 1.033
What causes this behavior? I assume pandas is just picking the latest date in each month, but that seems odd since those dates aren't present in the original data.
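The output shown is consistent with how a month-end Grouper labels its bins: each group is named after the end of its calendar month, so 2017-04-30 is a bin label rather than an observed date, and last() simply returns the final row that fell into that bin. To keep the actual last date per month, one option is to group on the month period and take the last row of each group. A sketch, with the dates from the question rewritten in ISO format and last_rows as an illustrative name:

```python
import pandas as pd

# Data from the question, sorted so "last" means "latest date"
df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-03-31", "2017-04-03", "2017-05-31", "2017-04-04",
                            "2017-04-28", "2017-05-01", "2017-06-30"]),
    "CumReturn": [1, 0.99, 1.022, 100, 1.012, 1.011, 1.033],
}).sort_values("Date")

# Group by calendar month, keep the last row of each group with its real date
last_rows = df.groupby(df["Date"].dt.to_period("M")).tail(1).set_index("Date")
print(last_rows)
```

April now appears as 2017-04-28, the latest date actually present in the data, with its own CumReturn of 1.012.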
