I have a dataframe that looks like this:
stuff datetime value
A 1/1/2019 3
A 1/2/2019 4
A 1/3/2019 5
A 1/4/2019 6
...
I want to create a new dataframe that looks like this:
stuff avg_3 avg_4 avg_5
A 3.4 4.5 5.5
B 2.3 4.2 6.1
where avg_3 is the average value over the last 3 days from today, avg_4 the average over the last 4 days, and so on, grouped by stuff.
How do I do that?
My current code:
df.groupby('stuff').apply(lambda x: pd.Series(dict(
    day_3=(x.datetime > datetime.now() - timedelta(days=3)).mean(),
    day_7=(x.datetime > datetime.now() - timedelta(days=7)).mean())))
Thanks in advance
Create the boolean masks before the groupby, add them as new columns with assign, and then groupby with mean:
m1 = df.datetime > pd.Timestamp.now() - pd.Timedelta(days=3)
m2 = df.datetime > pd.Timestamp.now() - pd.Timedelta(days=7)
df = df.assign(day_3=m1, day_7=m2).groupby('stuff')[['day_3', 'day_7']].mean()
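A minimal, self-contained sketch of the same approach with made-up sample data (as in the code above, the mean of each boolean mask is the share of a group's rows that fall inside the window):
import pandas as pd

# hypothetical sample data for illustration only
df = pd.DataFrame({
    'stuff': ['A', 'A', 'A', 'B', 'B'],
    'datetime': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03',
                                '2019-01-02', '2019-01-03']),
    'value': [3, 4, 5, 2, 6],
})

now = pd.Timestamp.now()
m1 = df['datetime'] > now - pd.Timedelta(days=3)   # rows within the last 3 days
m2 = df['datetime'] > now - pd.Timedelta(days=7)   # rows within the last 7 days

out = df.assign(day_3=m1, day_7=m2).groupby('stuff')[['day_3', 'day_7']].mean()
print(out)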
I want to compare a timestamp column of dtype datetime64[ns] with a datetime.date, using only the day and month for the comparison.
df
timestamp last_price
0 2023-01-22 14:15:06.033314 100.0
1 2023-01-25 14:15:06.213591 101.0
2 2023-01-30 14:15:06.313554 102.0
3 2023-03-31 14:15:07.018540 103.0
cu_date = datetime.datetime.now().date()
cu_year = cu_date.year
check_end_date = datetime.datetime.strptime(f'{cu_year}-11-05', '%Y-%m-%d').date()
check_start_date = datetime.datetime.strptime(f'{cu_year}-03-12', '%Y-%m-%d').date()
# this is incorrect as the day can be greater than check_start_date while the month might be less.
daylight_off_df = df.loc[((df.timestamp.dt.month >= check_end_date.month) & (df.timestamp.dt.day >= check_end_date.day)) |
((df.timestamp.dt.month <= check_start_date.month) & (df.timestamp.dt.day <= check_start_date.day))]
daylight_on_df = df.loc[((df.timestamp.dt.month <= check_end_date.month) & (df.timestamp.dt.day <= check_end_date.day)) &
((df.timestamp.dt.month >= check_start_date.month) & (df.timestamp.dt.day >= check_start_date.day))]
I am trying to think up the logic to do this, but failing.
Expected output:
daylight_off_df
timestamp last_price
0 2023-01-22 14:15:06.033314 100.0
1 2023-01-25 14:15:06.213591 101.0
2 2023-01-30 14:15:06.313554 102.0
daylight_on_df
timestamp last_price
3 2023-03-31 14:15:07.018540 103.0
In summary: split the dataframe based on a day-and-month comparison while ignoring the year.
I would break out these values and then just query:
df['day'] = df['timestamp'].dt.day_name()
df['month'] = df['timestamp'].dt.month_name()
then whatever you're looking for:
df.groupby('month').mean(numeric_only=True)
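If the goal is the on/off split itself rather than an aggregate, the broken-out month and day can also be compared as (month, day) tuples; a rough sketch (reusing check_start_date and check_end_date from the question, not code from this answer):
# compare (month, day) pairs so the day only matters within the boundary months
md = list(zip(df['timestamp'].dt.month, df['timestamp'].dt.day))
start = (check_start_date.month, check_start_date.day)
end = (check_end_date.month, check_end_date.day)
on_mask = pd.Series([start <= x <= end for x in md], index=df.index)
daylight_on_df = df[on_mask]
daylight_off_df = df[~on_mask]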
The following expressions could be helpful if you don't want an additional column in your table:
check_end_date.timetuple().tm_yday # returns day of the year
#output 309
check_start_date.timetuple().tm_yday
#output 71
df['timestamp'].dt.is_leap_year.astype(int) #returns 1 if year is a leapyear
#output 0 | 1
df['timestamp'].dt.dayofyear #returns day of the year
#output
#0 22
#1 25
#2 30
#3 90
df['timestamp'].dt.dayofyear.between(a,b) #returns true if day is between a,b
There are a few possible solutions now. I think using between is the nicest-looking one:
daylight_on_df4 = df.loc[df['timestamp'].dt.dayofyear.between(
check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int),
check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int))]
daylight_off_df4 = df.loc[~df['timestamp'].dt.dayofyear.between(
check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int),
check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int))]
Or the code could look like this:
daylight_on_df3 = df.loc[((check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) - df['timestamp'].dt.dayofyear > 0)
& (df['timestamp'].dt.dayofyear - (df['timestamp'].dt.is_leap_year.astype(int) + check_start_date.timetuple().tm_yday) > 0)]
daylight_off_df3 = df.loc[((check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) - df['timestamp'].dt.dayofyear < 0)
| (df['timestamp'].dt.dayofyear - (check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) < 0)]
All daylight_on/off does now is check whether the day of the year falls inside your range or not (with a leap-year adjustment).
This formula would probably have to be rewritten if your start date / end date crossed a year boundary (e.g. 2022-11-19, 2023-02-22), but I think it conveys the general idea.
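For the year-crossing case just mentioned, one possible sketch (my own assumption, not part of the answer above, and ignoring the leap-year offset for brevity) is to flip the test when the start day of year is larger than the end day of year:
def in_window(dayofyear, start_doy, end_doy):
    # True where the day of the year falls inside the window
    if start_doy <= end_doy:
        # ordinary window, e.g. Mar 12 .. Nov 5
        return dayofyear.between(start_doy, end_doy)
    # wrapped window, e.g. Nov 19 .. Feb 22: after the start OR before the end
    return (dayofyear >= start_doy) | (dayofyear <= end_doy)

doy = df['timestamp'].dt.dayofyear
daylight_on_df = df[in_window(doy, check_start_date.timetuple().tm_yday,
                              check_end_date.timetuple().tm_yday)]
daylight_off_df = df[~in_window(doy, check_start_date.timetuple().tm_yday,
                                check_end_date.timetuple().tm_yday)]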
I have a dataframe like this
Event dates Duration
Event1 [1796-12-02, 1796-12-10, 1796-12-11] 9 days
Event2 [1848-03-31, 1848-02-26] 34 days
Event3 [1826-05-20] 0 days
And I would like to add an "Average time in between dates" column which would compute the average difference between consecutive pairs of dates, and look like:
"Average"
4.5
17
0
In the first row, 4.5 comes from the first pair of dates (8 days apart) and the second pair (1 day apart): 8 + 1 = 9, and 9 / 2 = 4.5.
You can use this line (the max(..., 1) guards against events with only one date, which would otherwise divide by zero):
df["Average"] = df.apply(lambda x: float(x["Duration"].replace(" days", "")) / max(len(x["dates"]) - 1, 1), axis=1)
Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates date_list and a dataframe df which, for now, contains one column named Event Date holding the date that an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] > date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge in the previous results. The ffill() fills gaps in the middle of the data, while the final fillna(0) handles gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.ffill().fillna(0)
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> from datetime import date
>>> df = pd.DataFrame({'event_date': [date(2020, 9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03
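Building on that comparison idea, a loop-free sketch for the counting problem itself (reusing df and date_list from the question, and assuming the event dates and the entries of date_list share the same datetime type):
import numpy as np
import pandas as pd

# sort the event dates once, then count events on or before each date with searchsorted
event_dates = np.sort(df['Event Date'].to_numpy())
counts = np.searchsorted(event_dates, np.array(date_list), side='right')
result = pd.DataFrame({'Date': date_list, 'Events': counts})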
How do I calculate, using pandas, the number of weeks between two dates, such as 2019-12-15 and 2019-12-28?
Data:
cw = pd.DataFrame({"lead_date": ["2019-12-28", "2019-12-23"],
                   "Received_date": ["2019-12-15", "2019-12-21"]})
So I could do something like
cw["weeks_between"]= ( cw["lead_date"] - cw["Received_date"]) / 7
The problem is..
For row 1:
it will return 1.85, but that value is wrong because one date is at the beginning of its week and the other at the end of its week.
For row 2:
It will return 0.28, which is also wrong because one date is at the end of its week and the other at the beginning of its week.
So how can I get the number of weeks in between this two dates?
Method 1: Using a list comprehension, dt.to_period and getattr
provided by Jon Clements in comments
This method will work when years change between the compared dates:
cw['weeks_diff'] = (
[getattr(el, 'n', 0)
for el in cw['lead_date'].dt.to_period('W') - cw['Received_date'].dt.to_period('W')]
)
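A minimal end-to-end sketch of Method 1, including the to-datetime conversion it relies on (sample data taken from the question):
import pandas as pd

cw = pd.DataFrame({"lead_date": ["2019-12-28", "2019-12-23"],
                   "Received_date": ["2019-12-15", "2019-12-21"]})

# convert to datetime so the .dt accessor is available
cw["lead_date"] = pd.to_datetime(cw["lead_date"])
cw["Received_date"] = pd.to_datetime(cw["Received_date"])

# subtract weekly periods; getattr(el, 'n', 0) turns each offset into a whole number of weeks
cw["weeks_diff"] = [
    getattr(el, "n", 0)
    for el in cw["lead_date"].dt.to_period("W") - cw["Received_date"].dt.to_period("W")
]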
Method 2: Using week numbers with dt.strftime('%W')
We can use pd.to_datetime to convert your dates to datetime. Then we use the dt.strftime accessor to get the week numbers with %W.
Finally we subtract the two week numbers:
weeks = (cw[['lead_date', 'Received_date']]
.apply(lambda x: pd.to_datetime(x).dt.strftime('%W'))
.replace('NaT', 0)
.astype(int)
)
cw['weeks_diff'] = weeks['lead_date'] - weeks['Received_date']
lead_date Received_date weeks_diff
0 2019-12-28 2019-12-15 2
1 2019-12-23 2019-12-21 1
You need to convert to datetime using pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({ "lead_date" : ["2019-12-28" , "2019-12-23"] ,
"Received_date" : ["2019-12-15" , "2019-12-21" ] })
df['lead_date']=pd.to_datetime(df['lead_date'])
df['Received_date']=pd.to_datetime(df['Received_date'])
Here is the difference in days between "lead_date" and "Received_date"
df['time_between'] =df['lead_date']-df['Received_date']
print(df.head())
lead_date Received_date time_between
0 2019-12-28 2019-12-15 13 days
1 2019-12-23 2019-12-21 2 days
Update: edits below to get the number of weeks. Also added the pandas and numpy imports.
To get 'time_between' column in weeks:
df['time_between']= df['time_between']/np.timedelta64(1,'W')
will yield
lead_date Received_date time_between
0 2019-12-28 2019-12-15 1.857143
1 2019-12-23 2019-12-21 0.285714
Update 2: If you want week-number subtraction rather than the days between, then use:
df['lead_date'] = pd.to_datetime(df['lead_date']).dt.isocalendar().week
df['Received_date'] = pd.to_datetime(df['Received_date']).dt.isocalendar().week
df['time_between'] = df['lead_date'] - df['Received_date']
yields,
lead_date Received_date time_between
0 52 50 2
1 52 51 1
.dt.isocalendar().week returns the ISO week number within the year (the older .dt.week accessor has been removed from recent pandas versions).
I have a set of IDs and timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb is a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id', 'timeDeltaMin'])

def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result into the cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
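An equivalent single-groupby variant (just a stylistic alternative, not taken from the answer above):
# one pass per id: span between the latest and earliest timestamp, in minutes
delta = df.groupby('id')['timestamp'].agg(
    lambda s: (s.max() - s.min()) / pd.Timedelta(minutes=1))
result = delta.rename('delta').reset_index()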
You can sort by id and timestamp, then groupby id, and then find the difference between the min and max timestamp per group.
import numpy as np
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd

ids = [1, 1, 2, 2, 2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id': ids, 'timestamp': pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)

# per id: sum the gaps between consecutive timestamps, then convert to minutes
gaps = df.groupby(level=0).diff()['timestamp']
print(gaps.groupby(level=0).sum().dt.total_seconds() / 60)