How to extract and count rows under a certain condition in Python

Assume I have data for two policies, like below.
enroll lapse
A 2010/2/1 2013/1/2
B 2012/3/1 2013/1/4
I would like to list, for each policy, the beginning of every year at which the policy is ongoing:
enroll lapse year
A 2010/2/1 2013/1/2 2011/1/1
A 2010/2/1 2013/1/2 2012/1/1
A 2010/2/1 2013/1/2 2013/1/1
B 2012/3/1 2013/1/4 2013/1/1
and count these ongoing policies.
year num
2011 1
2012 1
2013 2
I guess I must use the query method, but I couldn't figure it out.

You need:
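import pandas as pd
# convert columns to datetimes
df['enroll'] = pd.to_datetime(df['enroll'])
df['lapse'] = pd.to_datetime(df['lapse'])
For each row, apply a function that expands it into one row per year start, reshape the result to a Series, and join it to the original df:
def f(x):
    # number of year starts (January 1) between enroll and lapse
    b = x['lapse'].year - x['enroll'].year
    # 'YS' (year start) is the current name of the alias 'AS', which is deprecated in recent pandas
    return pd.Series(pd.date_range(x['enroll'], periods=b, freq='YS'))
s = df.apply(f, axis=1).stack().reset_index(level=1, drop=True).rename('year')
df = df.join(s)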
print (df)
enroll lapse year
A 2010-02-01 2013-01-02 2011-01-01
A 2010-02-01 2013-01-02 2012-01-01
A 2010-02-01 2013-01-02 2013-01-01
B 2012-03-01 2013-01-04 2013-01-01
Another solution:
#create start year
df['year'] = df['enroll'] + pd.offsets.YearBegin(0)
#count repeating
a = df['lapse'].dt.year - df['enroll'].dt.year
df = df.loc[np.repeat(df.index, a)]
#add year offset
df['a'] = df.groupby(level=0).cumcount()
df["year"] = df.apply(lambda x: x["year"] + pd.offsets.DateOffset(years=x['a']), axis=1)
df = df.drop(columns='a')
print (df)
enroll lapse year
A 2010-02-01 2013-01-02 2011-01-01
A 2010-02-01 2013-01-02 2012-01-01
A 2010-02-01 2013-01-02 2013-01-01
B 2012-03-01 2013-01-04 2013-01-01
And last:
df1 = df.groupby(df['year'].dt.year).size().reset_index(name='num')
print (df1)
year num
0 2011 1
1 2012 1
2 2013 2
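The expand-then-count idea above can also be sketched with DataFrame.explode. This is only an illustration under stated assumptions: the sample frame is rebuilt here from the question's two policies (with ISO-formatted dates), and the names expanded and counts are made up for the example.

```python
import pandas as pd

# Sample data from the question, written as ISO dates (assumption: same two policies)
df = pd.DataFrame(
    {"enroll": ["2010-02-01", "2012-03-01"], "lapse": ["2013-01-02", "2013-01-04"]},
    index=["A", "B"],
)
df["enroll"] = pd.to_datetime(df["enroll"])
df["lapse"] = pd.to_datetime(df["lapse"])

# One list of January 1 timestamps per policy: every year after the
# enrolment year, up to and including the lapse year
df["year"] = pd.Series(
    [
        [pd.Timestamp(year=y, month=1, day=1) for y in range(enroll.year + 1, lapse.year + 1)]
        for enroll, lapse in zip(df["enroll"], df["lapse"])
    ],
    index=df.index,
)

# explode() turns every list element into its own row
expanded = df.explode("year")
expanded["year"] = pd.to_datetime(expanded["year"])

# Count the expanded rows per calendar year
counts = expanded.groupby(expanded["year"].dt.year).size().rename("num")
print(counts)
```

Like the answer above, this counts a policy for every year start between its enrolment year and its lapse year, so a policy that lapses exactly on January 1 would need extra handling.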

First, read your policy data line by line.
enroll lapse
A 2010/2/1 2013/1/2
B 2012/3/1 2013/1/4
Then pass each line to the function count.
The dictionary result might be what you want.
If I have misunderstood your question, please let me know.
import datetime
result = {}
def count(start, end):
    # parse 'YYYY/M/D' strings into dates
    start = datetime.date(*[int(i) for i in start.split('/')])
    end = datetime.date(*[int(i) for i in end.split('/')])
    # add one for every year after the start year, up to the end year
    for i in range(1, end.year - start.year + 1):
        result[start.year + i] = result.get(start.year + i, 0) + 1
count('2010/2/1', '2013/1/2')
count('2012/3/1', '2013/1/4')
print(result)

You can use pd.date_range:
start = pd.Timestamp(year=df['enroll'].dt.year.min() + 1, month=1, day=1)
end = pd.Timestamp(year=df['lapse'].dt.year.max(), month=12, day=31)
# 'YS' (year start) is the current name of the deprecated alias 'AS'
for year in pd.date_range(start=start, end=end, freq='YS'):
    print(year, ((df['enroll'] < year) & (df['lapse'] > year)).sum())
2011-01-01 00:00:00 1
2012-01-01 00:00:00 1
2013-01-01 00:00:00 2
data = {year.year: ((df['enroll'] < year) & (df['lapse'] > year)).sum() for year in pd.date_range(start=start, end=end, freq='YS')}
pd.Series(data)
2011 1
2012 1
2013 2
dtype: int64
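For many policies, the same comparison can be done without a Python loop by broadcasting. The sketch below rebuilds the question's two policies as sample data; the names jan1, active and counts are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumption: same two policies, ISO dates)
df = pd.DataFrame(
    {
        "enroll": pd.to_datetime(["2010-02-01", "2012-03-01"]),
        "lapse": pd.to_datetime(["2013-01-02", "2013-01-04"]),
    }
)

# Every year from the one after the earliest enrolment to the latest lapse year
years = np.arange(df["enroll"].dt.year.min() + 1, df["lapse"].dt.year.max() + 1)
jan1 = pd.to_datetime([f"{y}-01-01" for y in years]).to_numpy()

# Column vectors, so the comparison broadcasts to shape (policies, years)
enroll = df["enroll"].to_numpy()[:, None]
lapse = df["lapse"].to_numpy()[:, None]
active = (enroll < jan1) & (lapse > jan1)

# Sum over policies to get the number of ongoing policies per year
counts = pd.Series(active.sum(axis=0), index=years, name="num")
print(counts)
```

The intermediate boolean array has one cell per (policy, year) pair, so this trades memory for speed.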

Related

How to tackle a dataset that has multiple same date values

I have a large data set from which I'm trying to produce a time series using ARIMA. However, some of the data in the date column has multiple rows with the same date.
The dates were entered this way because the exact date of the event was not known, so unknown dates were entered as the first of that month (biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have the code to randomise the date but I cannot find a way to replace the data with the date ranges.
import random
import time
def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))
def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
The code above is what I use to generate a random date within a date range, but I'm struggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
{
"date": [
"2016-01-01",
"2015-01-01",
"2013-01-01",
"2014-01-01",
"2017-01-01",
"2011-01-01",
],
"value": [10035, 5397, 4567, 4343, 3981, 2049],
}
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by year and month and pick out only the rows whose day equals 1. Then, use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year, month and day equal 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day==1])
FIRST_DAY = 2 # set for the desired range
df_list = []
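for n, g in grouped:
    g = g.copy()  # work on a copy so the assignment below does not touch a view of df
    if n[2]:  # third group key is True only for rows dated the 1st of the month
        last_day = calendar.monthrange(n[0], n[1])[1]  # get last day for this month and year
        g['New_Date'] = g['date'].apply(
            lambda d: d.replace(day=np.random.randint(FIRST_DAY, last_day + 1)))
    else:
        g['New_Date'] = g['date']  # known dates are kept unchanged
    df_list.append(g)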
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
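The same within-month randomisation can be sketched without a group loop by adding a random day offset to each placeholder row. The toy frame, the is_placeholder mask and the new_date column below are assumptions made for illustration. Unlike the FIRST_DAY = 2 setting above, this version allows the 1st itself to be drawn.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy data (hypothetical): three placeholder rows dated the 1st and one known date
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-01-01", "2016-01-01", "2015-02-01", "2015-02-17"]),
        "value": [10035, 5397, 4567, 4343],
    }
)

# Rows dated the 1st are treated as "exact day unknown"
is_placeholder = (df["date"].dt.day == 1).to_numpy()
days_in_month = df["date"].dt.days_in_month.to_numpy()

# Random offset in [0, days_in_month - 1] per row (the upper bound is exclusive)
offsets = rng.integers(0, days_in_month)
# Known dates get an offset of zero, so they stay unchanged
offsets = np.where(is_placeholder, offsets, 0)

df["new_date"] = df["date"] + pd.to_timedelta(offsets, unit="D")
print(df)
```

Because the offset is never larger than the number of days in the month minus one, every randomised date stays inside its original month.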

Categorize a datetime interval based on other datetime interval and put values on columns

I'm dealing with a hard challenge and I don't know how to solve it.
I have a dataframe like this:
Product_Name Start_Time End_Time
Product X 2021-10-20 20:32:00 2021-10-21 03:50:00
Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00
Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00
And I have 3 time ranges, with a category for each one:
A: 05:01 to 14:00
B: 14:01 to 22:00
C: 22:01 to 05:00
What I want to do is calculate how many decimal hours fall into each category (A, B and C) between "Start_Time" and "End_Time", to get something like this:
Product_Name Start_Time End_Time A B C
Product X 2021-10-20 20:30:00 2021-10-21 03:50:00 0.00 1.50 5.82
Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00 2.17 1.98 0.00
Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00 8.98 3.31 6.98
Could you help me work out how to do it?
I'm a real beginner in Python, pandas, etc., and when I first wrote this post I had no idea how to even start coding it.
So I started thinking about it and came up with this code. I'm sure it's not right, but I think it's a start:
start_a = 05:01:00
end_a = 14:00:00
start_b = 14:01:00
end_b = 22:00:00
start_c = 22:01:00
end_c = 05:00:00
if df['Start_Time'] > start_a and df['End_Time'] < end_a:
    df['A'] = ( df['End_Time'] - start_a ) - ( end_a - df['Start_Time'] )
else:
    df['A'] = 0
if df['Start_Time'] > start_b and df['End_Time'] < end_b:
    df['B'] = ( df['End_Time'] - start_b ) - ( end_b - df['Start_Time'] )
else:
    df['B'] = 0
if df['Start_Time'] > start_c and df['End_Time'] < end_c:
    df['C'] = ( df['End_Time'] - start_c ) - ( end_c - df['Start_Time'] )
else:
    df['C'] = 0
Your problem is a lot harder than I thought. One thing that has to be noticed is that the Start_Time and End_Time can have different dates. Furthermore, category C spans over two days. Both of these facts make the code a little bit complicated, but it seems to work.
First, the setup for your problem. I created your data frame and created the variables. Important is that these structures have the correct data types.
import pandas as pd
from io import StringIO
from datetime import datetime, time, date, timedelta
# Create your data frame
data = StringIO("""Product_Name  Start_Time  End_Time
Product X  2021-10-20 20:32:00  2021-10-21 03:50:00
Product Y  2021-10-21 11:50:00  2021-10-21 16:00:00
Product Z  2022-01-11 20:10:00  2022-01-12 15:30:00""")
# Columns are separated by two or more spaces, because the values themselves contain single spaces
df = pd.read_csv(data, sep=r'\s{2,}', engine='python')
# Convert the columns to date time format
df[["Start_Time", "End_Time"]] = df[["Start_Time", "End_Time"]].apply(pd.to_datetime)
# Create the range start and end time as datetime format
start_a = datetime.strptime('05:01:00', '%H:%M:%S')
end_a = datetime.strptime('14:00:00', '%H:%M:%S')
start_b = datetime.strptime('14:01:00', '%H:%M:%S')
end_b = datetime.strptime('22:00:00', '%H:%M:%S')
start_c = datetime.strptime('22:01:00', '%H:%M:%S')
end_c = datetime.strptime('05:00:00', '%H:%M:%S')
Then, I created a function that calculates the hours for your problem. start and end are the times that define one range. The function iterates over the days, beginning one day before Start_Time so that a range which started the previous evening (range C) is still counted, and sums how much of the range falls inside the interval.
def calc_hours(start_time, end_time, start, end):
    # Set range to have date also => allows us to compare to start_time and end_time
    # Begin one day early so a range that started the previous evening is not missed
    first_day = start_time.date() - timedelta(days=1)
    range_start = datetime.combine(first_day, start.time())
    range_end = datetime.combine(first_day, end.time())
    # Special case for range C as end of range is on the next day
    if range_end < range_start:
        range_end = range_end + timedelta(days=1)
    # start_time and end_time can go over one or more days => Iterate over the days and sum the hours in the range
    total_hours = 0.0
    while range_start < end_time:
        # Calculation to get the hours or zero if range is not within time frame
        hours_in_frame = max((min(range_end, end_time) - max(range_start, start_time)).total_seconds(), 0) / 3600
        total_hours += hours_in_frame
        # Increment the day to check if range is in time frame
        range_start = range_start + timedelta(days=1)
        range_end = range_end + timedelta(days=1)
    return total_hours
In order to use the function and add the results to the dataframe, I used the function apply() from pandas. The apply() takes each row of your dataframe and calculates the hours within a range with the previously shown function. This is done for all three ranges.
# Use apply to calculate the hours for each row and each range
df['A'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_a, end_a), axis=1)
df['B'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_b, end_b), axis=1)
df['C'] = df.apply(lambda x: calc_hours(x['Start_Time'], x['End_Time'], start_c, end_c), axis=1)
Output is almost what you wanted, but not rounded to two decimal places:
Product_Name Start_Time End_Time A B C
0 Product X 2021-10-20 20:32:00 2021-10-21 03:50:00 0.000000 1.466667 5.816667
1 Product Y 2021-10-21 11:50:00 2021-10-21 16:00:00 2.166667 1.983333 0.000000
2 Product Z 2022-01-11 20:10:00 2022-01-12 15:30:00 8.983333 3.316667 6.983333
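To get the two-decimal figures from the question, the results only need rounding. Below is a small standalone sketch of the same overlap idea using plain datetime objects. The function name hours_in_window and the WINDOWS dictionary are made up for this example, and it checks each day from the day before the interval starts so that the overnight range C is handled.

```python
from datetime import datetime, time, timedelta

# Category windows from the question; C wraps past midnight
WINDOWS = {
    "A": (time(5, 1), time(14, 0)),
    "B": (time(14, 1), time(22, 0)),
    "C": (time(22, 1), time(5, 0)),
}

def hours_in_window(start, end, win_start, win_end):
    """Hours of [start, end] that fall inside a daily window (hypothetical helper)."""
    total = 0.0
    # Start one day early: an overnight window may have opened the previous evening
    day = start.date() - timedelta(days=1)
    while day <= end.date():
        lo = datetime.combine(day, win_start)
        hi = datetime.combine(day, win_end)
        if hi <= lo:
            hi += timedelta(days=1)  # window ends on the following day
        overlap = (min(hi, end) - max(lo, start)).total_seconds()
        total += max(overlap, 0.0) / 3600
        day += timedelta(days=1)
    return total

# Product Y and Product X from the question, rounded to two decimals
y_a = round(hours_in_window(datetime(2021, 10, 21, 11, 50), datetime(2021, 10, 21, 16, 0), *WINDOWS["A"]), 2)
x_c = round(hours_in_window(datetime(2021, 10, 20, 20, 32), datetime(2021, 10, 21, 3, 50), *WINDOWS["C"]), 2)
# An interval that starts after midnight still overlaps the window opened the evening before
early_c = hours_in_window(datetime(2021, 10, 21, 2, 0), datetime(2021, 10, 21, 4, 0), *WINDOWS["C"])
print(y_a, x_c, early_c)
```

On a DataFrame, the same rounding can be done after the fact with df[['A', 'B', 'C']].round(2).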
Another approach would be to create a Series with all the serial minute numbers of the ranges in question and then intersect them to get the overlapping duration.
Don't have time to provide a full answer but thought I would drop off the idea and you could take it from there.
Create the reference series:
import numpy as np
import pandas as pd
start = pd.Timestamp('22:01')
end = pd.Timestamp('05:00')
if end < start:
    end += pd.Timedelta(days=1)
# minute-of-day number for every minute in range C
rng_c = pd.Series(pd.date_range(start=start, end=end, freq='min'))
drC = rng_c.dt.hour * 60 + rng_c.dt.minute
Create a function to do the intersection and duration calculation:
def intersecting_duration(x):
    rng = pd.Series(pd.date_range(start=x['Start_Time'], end=x['End_Time'], freq='min'))
    min_of_day = rng.dt.hour * 60 + rng.dt.minute
    dur_mins = len(np.intersect1d(min_of_day, drC))
    return 0 if (dur_mins == 0) else (dur_mins - 1) / 60
Then apply it:
df.apply(intersecting_duration, axis=1)
0 5.816667
1 0.000000
2 6.983333
You would need to take it from there.

Is there a quick way for checking whether a date lies within n days(say 7) from a list of dates

I'm working with the following dataset:
Date
2016-01-04
2016-01-05
2016-01-06
2016-01-07
2016-01-08
and a list holidays = ['2016-01-01','2016-01-18'....'2017-11-23','2017-12-25']
Objective: Create a column indicating whether a particular date is within +- 7 days of any holiday present in the list.
Mock output:
Date Within a week of Holiday
2016-01-04 1
2016-01-05 1
2016-01-06 1
2016-01-07 1
2016-01-08 0
I'm working with a lot of date records and thus trying to find a quick(most optimized) way to do this.
My Current Solution:
One way I figured to do this quickly would be to create another list with only the unique dates for my desired duration(say 2 years). This way, I can implement a simple solution with 2 for loops to check if a date is within +-7days of a holiday, and it wouldn't be computationally heavy as both lists would be relatively small(730 unique dates and ~20 dates in the holiday list).
Once I have my desired list of dates, all I have to do is run a single check on my 'Date' column to see if that date is a part of this new list I created. However, any suggestions to do this even quicker?
Turn holidays into a DataFrame and then use merge_asof with a tolerance of 6 days. Pass direction='nearest' so that holidays both before and after each date are considered (the default, 'backward', only looks at earlier holidays). Both frames must be sorted by their key:
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                       direction='nearest', tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
Complete Working Example:
import numpy as np
import pandas as pd
holidays = pd.DataFrame(pd.to_datetime(['2016-01-01', '2016-01-18']),
                        columns=['Holiday'])
df = pd.DataFrame({
    'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
             '2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                       direction='nearest', tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
print(new_df)
new_df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Or turn holidays into a NumPy datetime array, broadcast the subtraction across the 'Date' column, compare the absolute difference to 7 days, and see if there are any matches:
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df['Within a week of Holiday'] = (
    abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
Complete Working Example:
import numpy as np
import pandas as pd
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df = pd.DataFrame({
'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
'2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
df['Within a week of Holiday'] = (
    abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
print(df)
df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
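With many holidays, the broadcast above builds a (holidays x dates) array. A lighter sketch, assuming a non-empty holiday list, sorts the holidays once and looks up each date's neighbouring holidays with np.searchsorted. The helper name within_days is hypothetical.

```python
import numpy as np
import pandas as pd

def within_days(dates, holidays, n_days=7):
    """Return 1 where a date is less than n_days from its nearest holiday (hypothetical helper)."""
    h = np.sort(pd.to_datetime(holidays).to_numpy())
    d = pd.to_datetime(dates).to_numpy()
    # Position where each date would be inserted into the sorted holidays
    idx = np.searchsorted(h, d)
    # Closest holiday at or before, and at or after, each date (clipped at the ends)
    prev_h = h[np.clip(idx - 1, 0, len(h) - 1)]
    next_h = h[np.clip(idx, 0, len(h) - 1)]
    nearest = np.minimum(np.abs(d - prev_h), np.abs(d - next_h))
    return (nearest < np.timedelta64(n_days, "D")).astype(int)

dates = ["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07", "2016-01-08"]
flags = within_days(dates, ["2016-01-01", "2016-01-18"])
print(flags)
```

The strict comparison mirrors the mock output, where a date exactly 7 days from a holiday is flagged 0.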
Make a function that checks every date within +-6 days of the given date and returns True if any of them is in holidays, otherwise False, and apply that function to the DataFrame:
import datetime
import pandas as pd
holidays = ['2016-01-01','2016-01-18','2017-11-23','2017-12-25']
def holiday_present(date):
    date = datetime.datetime.strptime(date, '%Y-%m-%d')
    # symmetric window: 6 days before through 6 days after
    for i in range(-6, 7):
        datte = (date - datetime.timedelta(days=i)).strftime('%Y-%m-%d')
        if datte in holidays:
            return True
    return False
data = {
"Date":[
"2016-01-04",
"2016-01-05",
"2016-01-06",
"2016-01-07",
"2016-01-08"]
}
df= pd.DataFrame(data)
df["Within a week of Holiday"] = df["Date"].apply(holiday_present).astype(int)
Output:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Try this:
Sample:
import pandas as pd
df = pd.DataFrame({'Date': {0: '2016-01-04',
1: '2016-01-05',
2: '2016-01-06',
3: '2016-01-07',
4: '2016-01-08'}})
Code:
def get_date_range(holidays):
    h = [pd.to_datetime(x) for x in holidays]
    h = [pd.date_range(x - pd.DateOffset(6), x + pd.DateOffset(6)) for x in h]
    h = [x.strftime('%Y-%m-%d') for y in h for x in y]
    return h
df['Within a week of Holiday'] = df['Date'].isin(get_date_range(holidays))*1
Result:
Out[141]:
0 1
1 1
2 1
3 1
4 0
Name: Within a week of Holiday, dtype: int32

python time interval overlap duration

My question is similar to Efficient date range overlap calculation in python?, however, I need to calculate the overlap with a full timestamp and not days, but more importantly, I cannot specify a specific date as the overlap, rather only hours.
import pandas as pd
import numpy as np
df = pd.DataFrame({'first_ts': {0: np.datetime64('2020-01-25 07:30:25.435000'),
                                1: np.datetime64('2020-01-25 07:25:00')},
                   'last_ts': {0: np.datetime64('2020-01-25 07:30:25.718000'),
                               1: np.datetime64('2020-01-25 07:25:00')}})
df['start_hour'] = 7
df['start_minute'] = 0
df['end_hour'] = 8
df['end_minute'] = 0
display(df)
How can I calculate the overlap duration of the interval (first_ts, last_ts) with the second interval in milliseconds?
Potentially, I would need to construct a timestamp on each day with the interval defined by the hours and then calculate the overlap.
The idea is to create new Series for the start and end datetimes of the window, taking the dates from the datetime columns, then use numpy.minimum and numpy.maximum, subtract, convert the timedeltas with Series.dt.total_seconds, and multiply by 1000:
s = (df['first_ts'].dt.strftime('%Y-%m-%d ') +
     df['start_hour'].astype(str) + ':' +
     df['start_minute'].astype(str))
e = (df['last_ts'].dt.strftime('%Y-%m-%d ') +
     df['end_hour'].astype(str) + ':' +
     df['end_minute'].astype(str))
s = pd.to_datetime(s, format='%Y-%m-%d %H:%M')
e = pd.to_datetime(e, format='%Y-%m-%d %H:%M')
df['inter'] = ((np.minimum(e, df['last_ts']) -
                np.maximum(s, df['first_ts'])).dt.total_seconds() * 1000)
print (df)
first_ts last_ts start_hour start_minute \
0 2020-01-25 07:30:25.435 2020-01-25 07:30:25.718 7 0
1 2020-01-25 07:25:00.000 2020-01-25 07:25:00.000 7 0
end_hour end_minute inter
0 8 0 283.0
1 8 0 0.0
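Another idea is to use only np.minimum. Note that this compares durations rather than positions, so it is only correct when (first_ts, last_ts) lies entirely inside the window: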
df['inter'] = (np.minimum(df['last_ts'] - df['first_ts'], e - s).dt.total_seconds() * 1000)
print (df)
first_ts last_ts start_hour start_minute \
0 2020-01-25 07:30:25.435 2020-01-25 07:30:25.718 7 0
1 2020-01-25 07:25:00.000 2020-01-25 07:25:00.000 7 0
end_hour end_minute inter
0 8 0 283.0
1 8 0 0.0

Pandas: get the number of days between two dates that fall in a particular month

I have a pandas dataframe with three columns. A start and end date and a month.
I would like to add a column for how many days within the month are between the two dates. I started doing something with apply, the calendar library and some math, but it started to get really complex. I bet pandas has a simple solution, but am struggling to find it.
Input:
import pandas as pd
df1 = pd.DataFrame(data=[['2017-01-01', '2017-06-01', '2016-01-01'],
['2015-03-02', '2016-02-10', '2016-02-01'],
['2011-01-02', '2018-02-10', '2016-03-01']],
columns=['start date', 'end date date', 'Month'])
Desired Output:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31
Here is one solution:
Get the list of dates between the start and end dates with pd.date_range, and then count how many of them have the same year and month as the target month.
def overlap(x):
    md = pd.to_datetime(x.iloc[2])
    cand = [(ad.year, ad.month) for ad in pd.date_range(x.iloc[0], x.iloc[1])]
    return len([c for c in cand if c == (md.year, md.month)])
df1["Days in Month"] = df1.apply(overlap, axis=1)
You'll get:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31
You can convert the columns to datetime with
df1 = df1.apply(pd.to_datetime)
Then find the number of intersecting days with this function
def intersectionDaysInMonth(start, end, month):
    # first day of the following month (also works for December)
    next_month = month + pd.offsets.MonthBegin(1)
    last_day = next_month - pd.Timedelta(days=1)
    # inclusive overlap of [start, end] with [month, last_day]
    overlap = min(end, last_day) - max(start, month) + pd.Timedelta(days=1)
    # no overlap gives a negative value, so clamp it to zero
    return max(overlap, pd.Timedelta(0))
Then apply
df1['Days in Month'] = df1.apply(lambda row: intersectionDaysInMonth(*row).days, axis=1)
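The overlap can also be computed column-wise without apply. This is a sketch that assumes the Month column always holds the first day of the target month, as in the question's sample. The names month_start, month_end, lower and upper are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    data=[["2017-01-01", "2017-06-01", "2016-01-01"],
          ["2015-03-02", "2016-02-10", "2016-02-01"],
          ["2011-01-02", "2018-02-10", "2016-03-01"]],
    columns=["start date", "end date date", "Month"],
).apply(pd.to_datetime)

# First and last day of the target month
month_start = df1["Month"]
month_end = month_start + pd.to_timedelta(month_start.dt.days_in_month - 1, unit="D")

# Clip the [start, end] interval to the month: later of the starts, earlier of the ends
lower = df1["start date"].where(df1["start date"] > month_start, month_start)
upper = df1["end date date"].where(df1["end date date"] < month_end, month_end)

# Inclusive day count; negative values mean no overlap and become 0
df1["Days in Month"] = ((upper - lower).dt.days + 1).clip(lower=0)
print(df1)
```

Both endpoints are counted, which is what gives 10 days for the row ending on 2016-02-10.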
