Python Optimization - Can I avoid the double for loop? - python

I have a df like the following:
import datetime as dt
import pandas as pd
import pytz
cols = ['utc_datetimes', 'zone_name']
data = [
['2019-11-13 14:41:26,2019-12-18 23:04:12', 'Europe/Stockholm'],
['2019-12-06 21:49:04,2019-12-11 22:52:57,2019-12-18 20:30:58,2019-12-23 18:49:53,2019-12-27 18:34:23,2020-01-07 21:20:51,2020-01-11 17:36:56,2020-01-20 21:45:47,2020-01-30 20:48:49,2020-02-03 21:04:52,2020-02-07 20:05:02,2020-02-10 21:07:21', 'Europe/London']
]
df = pd.DataFrame(data, columns=cols)
print(df)
# utc_datetimes zone_name
# 0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm
# 1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London
I would like to count, in each row's local time, how many of the datetimes fall on a Wednesday and how many fall at night. This is the desired output:
utc_datetimes zone_name nights wednesdays
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 0 1
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 11 2
I've come up with the following double for loop, but it is not as efficient as I'd like it for the sizable df:
# New columns.
df['nights'] = 0
df['wednesdays'] = 0
for row in range(df.shape[0]):
    date_list = df['utc_datetimes'].iloc[row].split(',')
    user_time_zone = df['zone_name'].iloc[row]
    for date in date_list:
        datetime_obj = dt.datetime.strptime(
            date, '%Y-%m-%d %H:%M:%S'
        ).replace(tzinfo=pytz.utc)
        local_datetime = datetime_obj.astimezone(pytz.timezone(user_time_zone))
        # Get day of the week count:
        if local_datetime.weekday() == 2:
            df['wednesdays'].iloc[row] += 1
        # Get time of the day count:
        if (local_datetime.hour >17) & (local_datetime.hour <= 23):
            df['nights'].iloc[row] += 1
Any suggestions will be appreciated :)
P.S. Disregard the definition of 'night'; it is just an example.

One way is to first build a helper df by exploding your utc_datetimes column and then compute the UTC offset (a Timedelta) of each datetime in its zone:
df = pd.DataFrame(data, columns=cols)
s = (df.assign(utc_datetimes=df["utc_datetimes"].str.split(","))
       .explode("utc_datetimes"))
# Read each string as UTC first, then convert to the row's zone to get its offset
s["diff"] = [pd.Timestamp(a, tz="UTC").tz_convert(b).utcoffset()
             for a, b in zip(s["utc_datetimes"], s["zone_name"])]
With this helper df you can calculate the number of wednesdays and nights:
df["wednesdays"] = (pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.day_name().eq("Wednesday").groupby(level=0).sum()
df["nights"] = ((pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.hour>17).groupby(level=0).sum()
print (df)
#
utc_datetimes zone_name wednesdays nights
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 1.0 0.0
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 2.0 11.0
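An alternative sketch, not taken from the answer above: parse the strings as UTC once and let pandas convert each zone's rows with tz_convert, so no offsets have to be computed by hand. It assumes the sample data from the question; the helper names utc, parts and counts are made up for illustration.

```python
import pandas as pd

cols = ['utc_datetimes', 'zone_name']
data = [
    ['2019-11-13 14:41:26,2019-12-18 23:04:12', 'Europe/Stockholm'],
    ['2019-12-06 21:49:04,2019-12-11 22:52:57,2019-12-18 20:30:58,'
     '2019-12-23 18:49:53,2019-12-27 18:34:23,2020-01-07 21:20:51,'
     '2020-01-11 17:36:56,2020-01-20 21:45:47,2020-01-30 20:48:49,'
     '2020-02-03 21:04:52,2020-02-07 20:05:02,2020-02-10 21:07:21',
     'Europe/London'],
]
df = pd.DataFrame(data, columns=cols)

# One row per datetime; the original row label is kept in the index.
s = df.assign(utc_datetimes=df['utc_datetimes'].str.split(',')).explode('utc_datetimes')
s['utc'] = pd.to_datetime(s['utc_datetimes'], utc=True)

# Convert zone by zone: one vectorized tz_convert call per distinct zone.
parts = []
for zone, grp in s.groupby('zone_name'):
    local = grp['utc'].dt.tz_convert(zone)
    parts.append(pd.DataFrame({
        'wednesdays': (local.dt.weekday == 2).to_numpy(),
        'nights': (local.dt.hour > 17).to_numpy(),  # same 'night' rule as the question
    }, index=grp.index))

# Sum the boolean flags back onto the original rows.
counts = pd.concat(parts).groupby(level=0).sum()
df = df.join(counts)
print(df[['zone_name', 'wednesdays', 'nights']])
```

The Python-level loop now runs once per distinct time zone rather than once per datetime, which is where most of the saving would come from on a large df.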

Related

How to create a binary variable based on date ranges

I would like to flag all rows with dates 1 week before and 1 week after a specific holiday to be = 1; = 0 otherwise.
What's the best way to do so? Below is my code, which only flags New Year's Day itself as new_year = 1. What I want is for all 3 rows to have new_year = 1 (since they all fall within 1 week before or after New Year's Day).
Note: I would like the code to work for any holidays (e.g. Thanksgiving, Easter, etc.).
Thank you!
# importing pandas as pd
import numpy as np
import pandas as pd
import holidays
# Creating the dataframe
df = pd.DataFrame({'Date': ['1/1/2019', '1/5/2019', '12/28/2018'],
                   'Event': ['Music', 'Poetry', 'Theatre'],
                   'Cost': [10000, 5000, 15000]})
df['newDate'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
new_year = holidays.HolidayBase()
new_year.append({"2018-01-01": "New Year's Day",
                 "2019-01-01": "New Year's Day"})
df['hol_new_year'] = np.where(df['newDate'] in new_year, 1, 0)
You can use pandas' time series offsets:
ye = pd.tseries.offsets.YearEnd()
yb = pd.tseries.offsets.YearBegin()
d = pd.to_timedelta('1w')
s = df['newDate']
df['hol_new_year'] = (s.between(s-ye-d, s-ye+d)
                      | s.between(s+yb-d, s+yb+d)
                     ).astype(int)
Output (with an extra 2021 row added to the sample to show a non-match):
Date Event Cost newDate hol_new_year
0 1/1/2019 Music 10000 2019-01-01 1
1 1/5/2019 Poetry 5000 2019-01-05 1
2 12/28/2018 Theatre 15000 2018-12-28 1
3 1/15/2021 SO 0 2021-01-15 0
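For arbitrary holidays, one possible generalization (an assumption here, not part of the answer above) is to compare every date against an explicit list of holiday dates using NumPy broadcasting. The holiday_dates list and the extra fourth row are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['1/1/2019', '1/5/2019', '12/28/2018', '6/15/2019'],
                   'Event': ['Music', 'Poetry', 'Theatre', 'Dance'],
                   'Cost': [10000, 5000, 15000, 0]})
df['newDate'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Hypothetical holiday list; any dates can go here (Thanksgiving, Easter, ...).
holiday_dates = pd.to_datetime(['2018-01-01', '2019-01-01']).to_numpy()

# Shape (n_rows, n_holidays): absolute distance from each row to each holiday.
dist = np.abs(df['newDate'].to_numpy()[:, None] - holiday_dates[None, :])

# A row is flagged when at least one holiday lies within 7 days of it.
df['hol_new_year'] = (dist <= np.timedelta64(7, 'D')).any(axis=1).astype(int)
print(df)
```

The intermediate array has one cell per (row, holiday) pair, so this is comfortable for a few dozen holidays but grows with both dimensions.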

Pandas groupby month output is incorrect [duplicate]

My dataset has dates in the European format, and I'm struggling to convert it into the correct format before I pass it through a pd.to_datetime, so for all day < 12, my month and day switch.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted at dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add format.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
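A small sketch of why the explicit format helps: with format='%d/%m/%Y' every string is read day-first, and a string that cannot be a dd/mm/yyyy date raises an error instead of being silently reinterpreted. The sample values below are made up for illustration.

```python
import pandas as pd

dates = pd.Series(['31/03/2018', '01/04/2018', '28/02/2018'])
parsed = pd.to_datetime(dates, format='%d/%m/%Y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())

# A month-first string does not fit the format, so pandas raises a ValueError
# rather than swapping day and month behind your back.
try:
    pd.to_datetime(pd.Series(['04/30/2018']), format='%d/%m/%Y')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```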
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
    {'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30

Is there a quick way for checking whether a date lies within n days(say 7) from a list of dates

I'm working with the following dataset:
Date
2016-01-04
2016-01-05
2016-01-06
2016-01-07
2016-01-08
and a list holidays = ['2016-01-01','2016-01-18'....'2017-11-23','2017-12-25']
Objective: Create a column indicating whether a particular date is within +- 7 days of any holiday present in the list.
Mock output:
Date        Within a week of Holiday
2016-01-04  1
2016-01-05  1
2016-01-06  1
2016-01-07  1
2016-01-08  0
I'm working with a lot of date records and thus trying to find a quick(most optimized) way to do this.
My Current Solution:
One way I figured to do this quickly would be to create another list with only the unique dates for my desired duration(say 2 years). This way, I can implement a simple solution with 2 for loops to check if a date is within +-7days of a holiday, and it wouldn't be computationally heavy as both lists would be relatively small(730 unique dates and ~20 dates in the holiday list).
Once I have my desired list of dates, all I have to do is run a single check on my 'Date' column to see if that date is a part of this new list I created. However, any suggestions to do this even quicker?
Turn holidays into a DataFrame and then merge_asof with direction='nearest' and a tolerance of 6 days, so that holidays both before and after each date are considered (the default direction, 'backward', only looks at earlier holidays):
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                       direction='nearest', tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
Complete Working Example:
import numpy as np
import pandas as pd
holidays = pd.DataFrame(pd.to_datetime(['2016-01-01', '2016-01-18']),
                        columns=['Holiday'])
df = pd.DataFrame({
    'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
             '2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
# merge_asof needs both frames sorted by their key columns
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                       direction='nearest', tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
print(new_df)
new_df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Or turn holidays into a NumPy datetime array, broadcast the subtraction across the 'Date' column, compare the absolute differences to 7 days, and check whether any holiday matches:
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df['Within a week of Holiday'] = (
    abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
Complete Working Example:
import numpy as np
import pandas as pd
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df = pd.DataFrame({
    'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
             '2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
df['Within a week of Holiday'] = (
    abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
print(df)
df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Write a function that checks whether any date within 6 days either side of the given date is in holidays, returning True if so and False otherwise, and apply it to the DataFrame:
import datetime
import pandas as pd
holidays = ['2016-01-01','2016-01-18','2017-11-23','2017-12-25']
def holiday_present(date):
    date = datetime.datetime.strptime(date, '%Y-%m-%d')
    # Offsets -6..6 give a symmetric window; a date exactly 7 days away is not flagged
    for i in range(-6, 7):
        datte = (date - datetime.timedelta(days=i)).strftime('%Y-%m-%d')
        if datte in holidays:
            return True
    return False
data = {
    "Date": [
        "2016-01-04",
        "2016-01-05",
        "2016-01-06",
        "2016-01-07",
        "2016-01-08"]
}
df = pd.DataFrame(data)
df["Within a week of Holiday"] = df["Date"].apply(holiday_present).astype(int)
Output:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Try this:
Sample:
import pandas as pd
df = pd.DataFrame({'Date': {0: '2016-01-04',
                            1: '2016-01-05',
                            2: '2016-01-06',
                            3: '2016-01-07',
                            4: '2016-01-08'}})
Code:
def get_date_range(holidays):
    h = [pd.to_datetime(x) for x in holidays]
    h = [pd.date_range(x - pd.DateOffset(days=6), x + pd.DateOffset(days=6)) for x in h]
    h = [x.strftime('%Y-%m-%d') for y in h for x in y]
    return h
df['Within a week of Holiday'] = df['Date'].isin(get_date_range(holidays))*1
Result:
Out[141]:
0 1
1 1
2 1
3 1
4 0
Name: Within a week of Holiday, dtype: int32
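For very large inputs, here is a sketch that avoids building a rows-by-holidays matrix: sort the holidays once and use np.searchsorted to look only at the nearest holiday on each side of every date. This is a swapped-in technique, not one proposed in the answers above. The sixth sample date is added for illustration, and the names pos, prev_gap and next_gap are made up.

```python
import numpy as np
import pandas as pd

holidays = np.sort(np.array(['2016-01-01', '2016-01-18'], dtype='datetime64[D]'))
df = pd.DataFrame({'Date': pd.to_datetime([
    '2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
    '2016-01-08', '2016-01-12'])})
dates = df['Date'].to_numpy().astype('datetime64[D]')

# pos[i] is the index of the first holiday on or after dates[i].
pos = np.searchsorted(holidays, dates)
last = len(holidays) - 1

# Distance to the neighbouring holiday on each side; clipping keeps the
# indices valid for dates before the first or after the last holiday.
prev_gap = np.abs(dates - holidays[np.clip(pos - 1, 0, last)])
next_gap = np.abs(holidays[np.clip(pos, 0, last)] - dates)

nearest = np.minimum(prev_gap, next_gap)
df['Within a week of Holiday'] = (nearest < np.timedelta64(7, 'D')).astype(int)
print(df)
```

The lookup costs roughly one binary search per date, so memory use stays proportional to the number of dates regardless of how many holidays there are.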

Python Group by minutes in a day

I have log data that spans over 30 days. I'm looking to group the data to see which 15-minute window has the lowest number of events in total over 24 hours. The data is formatted like so:
2021-04-26 19:12:03, upload
2021-04-26 11:32:03, download
2021-04-24 19:14:03, download
2021-04-22 1:9:03, download
2021-04-19 4:12:03, upload
2021-04-07 7:12:03, download
and I'm looking for a result like
19:15:00, 2
11:55:00, 1
7:15:00, 1
4:15:00, 1
1:15:00, 1
currently, I used grouper:
df['date'] = pd.to_datetime(df['date'])
df.groupby(pd.Grouper(key="date",freq='.25H')).Host.count()
and my results look like this:
date
2021-04-08 16:15:00+00:00 1
2021-04-08 16:30:00+00:00 20
2021-04-08 16:45:00+00:00 6
2021-04-08 17:00:00+00:00 6
2021-04-08 17:15:00+00:00 0
..
2021-04-29 18:00:00+00:00 3
2021-04-29 18:15:00+00:00 9
2021-04-29 18:30:00+00:00 0
2021-04-29 18:45:00+00:00 3
2021-04-29 19:00:00+00:00 15
Is there a way to group on just the time of day, ignoring the date?
Do you want something like this?
The idea here: if you're not concerned about the date, you can replace every date with one arbitrary date and then group and count on the time alone.
df['Host'] = 1
df['date'] = df['date'].str.replace(r'(\d{4}-\d{1,2}-\d{1,2})', '2021-04-26', regex=True)
df['date'] = pd.to_datetime(df['date'])
new_df = df.groupby(pd.Grouper(key='date', freq='15min')).agg({'Host': 'sum'}).reset_index()
new_df = new_df.loc[new_df['Host'] != 0]
new_df['date'] = new_df['date'].dt.time
Let's say you want to count events in 5-minute windows. Let df be your pandas DataFrame and extract its timestamp column. Round each time up to the nearest multiple of 5 minutes and add it to a counter map. See the code below.
import collections
import math
timestamp = df["timestamp"]  # assumed to hold 'HH:MM:SS' strings
counter = collections.defaultdict(int)
def get_time(time):
    hh, mm, ss = map(int, time.split(':'))
    total_seconds = hh * 3600 + mm * 60 + ss
    roundup_seconds = math.ceil(total_seconds / (5*60)) * (5*60)
    # I suggest you to try out the above formula on paper for better understanding
    # '5 min' means '5*60 sec' roundup
    new_hh = roundup_seconds // 3600
    roundup_seconds %= 3600
    new_mm = roundup_seconds // 60
    roundup_seconds %= 60
    new_ss = roundup_seconds
    # Zero-pad minutes and seconds so the result reads like '19:15:00'
    return f"{new_hh}:{new_mm:02d}:{new_ss:02d}"  # f-strings for python 3.6 and above
for time in timestamp:
    counter[get_time(time)] += 1
# Now counter will carry counts of rounded time stamp
# I've tested locally and it's same as the output you mentioned.
# Let me know if you need any further help :)
One approach is to use Timedelta instead of datetime, since the comparison only involves the time of day, not the date.
import pandas as pd
df = pd.DataFrame({'time': {0: '2021-04-26 19:12:03', 1: '2021-04-26 11:32:03',
                            2: '2021-04-24 19:14:03', 3: '2021-04-22 1:9:03',
                            4: '2021-04-19 4:12:03', 5: '2021-04-07 7:12:03'},
                   'event': {0: 'upload', 1: 'download', 2: 'download',
                             3: 'download', 4: 'upload', 5: 'download'}})
# Convert To TimeDelta (Ignore Day)
# Split on whitespace instead of slicing the last 8 characters, so that
# unpadded times such as '1:9:03' are not mixed up with the day digits
df['time'] = pd.to_timedelta(df['time'].str.split().str[-1])
# Round each time up to the end of its 15 minute window
df['time'] = df['time'].dt.ceil('15min')
# Get Count of events per 15 minute period
df = df.groupby('time')['event'].count().reset_index()
# Sort
df = df.sort_values(['event', 'time'], ascending=[False, False])
# Remove Day Part of Time Delta (Convert to str)
df['time'] = df['time'].astype(str).str[-8:]
# For Display
print(df.to_string(index=False))
Output:
time event
19:15:00 2
11:45:00 1
07:15:00 1
04:15:00 1
01:15:00 1
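Since the question asks for the window with the fewest events, empty windows matter too. The sketch below labels each event with the end of its 15-minute window and then reindexes over all 96 windows of a day, so that windows with no events show up with a count of 0. It assumes zero-padded timestamps (the sample row '1:9:03' is written here as '01:09:03'); slots, all_slots and counts are illustrative names.

```python
import pandas as pd

# Sample log with zero-padded times (an assumption; unpadded strings may need cleaning first).
df = pd.DataFrame({
    'date': ['2021-04-26 19:12:03', '2021-04-26 11:32:03', '2021-04-24 19:14:03',
             '2021-04-22 01:09:03', '2021-04-19 04:12:03', '2021-04-07 07:12:03'],
    'event': ['upload', 'download', 'download', 'download', 'upload', 'download'],
})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')

# Label each event with the end of its 15-minute window, dropping the date.
slots = df['date'].dt.ceil('15min').dt.strftime('%H:%M:%S')

# All 96 window labels of a day, so empty windows appear with a count of 0.
all_slots = pd.date_range('2000-01-01', periods=96, freq='15min').strftime('%H:%M:%S')
counts = slots.value_counts().reindex(all_slots, fill_value=0)

print(counts[counts > 0].sort_values(ascending=False))
print('a quietest window ends at', counts.idxmin())
```

With real logs many windows may tie for the minimum; counts[counts == counts.min()] would list all of them.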

How to find the difference between two formatted dates in days?

I have a pandas DataFrame with the following content:
df =
start end
01/April 02/May
12/April 12/April
I need to add a column with the difference (in days) between end and start values (end - start).
How can I do it?
I tried the following:
import pandas as pd
df.startdate = pd.datetime(df.start, format='%B/%d')
df.enddate = pd.datetime(df.end, format='%B/%d')
But not sure if this is a right direction.
import pandas as pd
df = pd.DataFrame({"start":["01/April", "12/April"], "end": ["02/May", "12/April"]})
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
df["diff"] = (df["end"] - df["start"])
Output:
end start diff
0 2018-05-02 2018-04-01 31 days
1 2018-04-12 2018-04-12 0 days
This is one way.
df['start'] = pd.to_datetime(df['start']+'/2018', format='%d/%B/%Y')
df['end'] = pd.to_datetime(df['end']+'/2018', format='%d/%B/%Y')
df['diff'] = df['end'] - df['start']
# start end diff
# 0 2018-04-01 2018-05-02 31 days
# 1 2018-04-12 2018-04-12 0 days
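If the difference is needed as a plain integer rather than a Timedelta, the .dt.days accessor can be applied to the result. A minimal sketch, assuming both dates fall in the same year (2018 here, chosen arbitrarily because the strings carry no year) and an English locale for the %B month names:

```python
import pandas as pd

df = pd.DataFrame({'start': ['01/April', '12/April'], 'end': ['02/May', '12/April']})

year = '2018'  # assumed year; the original strings do not carry one
start = pd.to_datetime(df['start'] + '/' + year, format='%d/%B/%Y')
end = pd.to_datetime(df['end'] + '/' + year, format='%d/%B/%Y')

# .dt.days turns the Timedelta column into whole days as integers.
df['diff_days'] = (end - start).dt.days
print(df)
```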
