Trouble with Finding Date Range in Pandas

I have a dataset which has a list of subjects, a start date, and an end date. I'm trying to write a loop so that for each subject I get a list of dates between the start date and end date. I've tried several approaches based on previous posts but am still having issues.
an example of the dataframe:
Participant # Start_Date End_Date
1 23-04-19 25-04-19
An example of the output I want:
Participant # Range
1 23-04-19
1 24-04-19
1 25-04-19
Right now my code looks like this:
subjs_490 = tracksheet_490['Participant #']
for subj_490 in subjs_490:
    temp_a = tracksheet_490[tracksheet_490['Participant #'].isin([subj_490])]
    start = temp_a['Start_Date']
    end = temp_a['End_Date']
    start_dates = pd.to_datetime(pd.Series(start), format='%d-%m-%y')
    end_dates = pd.to_datetime(pd.Series(end), format='%d-%m-%y')
    date_range = pd.date_range(start_dates, end_dates).tolist()
With this method I'm getting the following error:
Cannot convert input [1    2016-05-03 Name: Start_Date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp

The error occurs because pd.date_range expects scalar start and end values, but the code passes it a whole Series. Beyond that, expanding ranges tends to be a slow process. After converting the date columns with pd.to_datetime, you can create the date_range for each row and then explode it to get what you want. Moving 'Participant #' to the index makes sure it's repeated for all rows that are exploded.
df[['Start_Date', 'End_Date']] = df[['Start_Date', 'End_Date']].apply(pd.to_datetime, format='%d-%m-%y')
df = (df.set_index('Participant #')
        .apply(lambda x: pd.date_range(x.Start_Date, x.End_Date), axis=1)  # :( slow
        .rename('Range')
        .explode()
        .reset_index())
Participant # Range
0 1 2019-04-23
1 1 2019-04-24
2 1 2019-04-25
If you can't use explode, another option is to create a separate DataFrame for each row and then concat them all together.
pd.concat([pd.DataFrame({'Participant #': par, 'Range': pd.date_range(start, end)})
           for par, start, end in zip(df['Participant #'], df['Start_Date'], df['End_Date'])],
          ignore_index=True)
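Since the row-wise apply is the slow part, a vectorized alternative is possible: repeat each row by the length of its range, then add a per-row day offset. A minimal sketch, assuming Start_Date and End_Date have already been converted to datetime as above:
import numpy as np
n = (df['End_Date'] - df['Start_Date']).dt.days + 1        # rows each participant expands to
idx = df.index.repeat(n)                                   # original row label for every output row
out = pd.DataFrame({'Participant #': df['Participant #'].to_numpy().repeat(n)})
steps = out.groupby(idx.to_numpy()).cumcount().to_numpy()  # 0, 1, 2, ... within each row's range
out['Range'] = df['Start_Date'].to_numpy().repeat(n) + steps.astype('timedelta64[D]')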

How to add a time sequence column, based on groups

link
Above is a link to an example of a CSV file I am modifying using Python. I need to add a time column that increases by 1 second if the date matches the date of the previous row.
If the date changes, the time starts back over at 8:00:00.
Additionally, if the 'PL Seq' changes from G* to H*, the time also starts back over at 8.
I think I have the logic down; I'm just having a hard time writing it.
Add a column 'Time' to the df.
Set the first 'Time' value to 8:00:00.
Read each row in the df:
if the date value equals the previous row's date value and the first character of the PL Seq value equals the previous row's first character, set the time value to the previous time + 1 second;
else reset the time value to 8:00:00.
*Just a note: I already have the code to change the format of the order #'s and the dates to the goal state.
Current
MODELCHASS,Prod Date,PL Seq
M742-021167,20200917,G0005
M359-020535,20200917,G0010
M742-022095,20200917,G0015
M220-001083,20200918,G0400
M742-022390,20200918,G0405
M907-004747,20200918,H0090
M934-005904,20200918,H0095
Expected
MODELCHASS,Prod Date,PL Seq,Time
M742 021167,2020-09-17T,G0005,8:00:00
M359 020535,2020-09-17T,G0010,8:00:01
M742 022095,2020-09-17T,G0015,8:00:02
M220 001083,2020-09-18T,G0400,8:00:00
M742 022390,2020-09-18T,G0405,8:00:01
M907 004747,2020-09-18T,H0090,8:00:00
M934 005904,2020-09-18T,H0095,8:00:01
@Trenton Can we modify this if H orders have the same date as G orders?
For example:
Current (edit in line 6)
MODELCHASS,Prod Date,PL Seq
M742-021167,20200917,G0005
M359-020535,20200917,G0010
M742-022095,20200917,G0015
M220-001083,20200918,G0400
M742-022390,20200918,G0405
M907-004747,20200917,H0090
M934-005904,20200917,H0095
Expected (edited)
MODELCHASS,Prod Date,PL Seq,Time
M742 021167,2020-09-17T,G0005,8:00:00
M359 020535,2020-09-17T,G0010,8:00:01
M742 022095,2020-09-17T,G0015,8:00:02
M220 001083,2020-09-18T,G0400,8:00:00
M742 022390,2020-09-18T,G0405,8:00:01
M907 004747,2020-09-17T,H0090,8:00:00
M934 005904,2020-09-17T,H0095,8:00:01
Convert the 'Prod Date' column to a datetime
Sort the dataframe by 'Prod Date' and 'PL Seq' so 'df' will be in the same order as time_seq for joining.
The main part of the answer is to create a DateRange list with the .groupby and .apply
.groupby the Prod Date and the first element of 'PL Seq'
df.groupby(['Prod Date', df['PL Seq'].str[0]])
.apply(lambda x: (pd.date_range(start=x.values[0] + pd.Timedelta(hours=8), periods=len(x), freq='s')).time)
For each group, use the first value in x as start: x.values[0]
To this date, add an 8-hour Timedelta to get 08:00:00.
The number of periods is len(x).
The freq is 's', for seconds.
This creates a DateRange, from which the time is extracted with .time
Tested in python 3.10, pandas 1.4.3
import pandas as pd
# setup test dataframe
data = {'MODELCHASS': ['M742-021167', 'M359-020535', 'M742-022095', 'M220-001083', 'M742-022390', 'M907-004747', 'M934-005904'],
        'Prod Date': [20200917, 20200917, 20200917, 20200918, 20200918, 20200918, 20200918],
        'PL Seq': ['G0005', 'G0010', 'G0015', 'G0400', 'G0405', 'H0090', 'H0095']}
df = pd.DataFrame(data)
# convert Prod Date to a datetime column
df['Prod Date'] = pd.to_datetime(df['Prod Date'], format='%Y%m%d')
# sort the dataframe by values so the order will correspond to the groupby order
df = df.sort_values(['Prod Date', 'PL Seq']).reset_index(drop=True)
# groupby Prod Date and the first character of PL Seq
# create a DateRange sequence for each group
# reshape the dataframe
time_seq = (df.groupby(['Prod Date', df['PL Seq'].str[0]])['Prod Date']
              .apply(lambda x: pd.date_range(start=x.values[0] + pd.Timedelta(hours=8), periods=len(x), freq='s').time)
              .reset_index(name='time_seq')
              .explode('time_seq', ignore_index=True))
# join the time_seq column to df
df_new = df.join(time_seq.time_seq)
# display(df_new)
MODELCHASS Prod Date PL Seq time_seq
0 M742-021167 2020-09-17 G0005 08:00:00
1 M359-020535 2020-09-17 G0010 08:00:01
2 M742-022095 2020-09-17 G0015 08:00:02
3 M220-001083 2020-09-18 G0400 08:00:00
4 M742-022390 2020-09-18 G0405 08:00:01
5 M907-004747 2020-09-18 H0090 08:00:00
6 M934-005904 2020-09-18 H0095 08:00:01
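Regarding the follow-up edit: the same code should already cover H orders that share a date with G orders, because the groupby key includes the first character of 'PL Seq', so the G and H rows on 2020-09-17 fall into separate groups that each restart at 08:00:00. A quick check, swapping the edited dates from the comment into the test data above:
# edited test data from the follow-up: the H orders now share 2020-09-17 with the G orders
data['Prod Date'] = [20200917, 20200917, 20200917, 20200918, 20200918, 20200917, 20200917]
df = pd.DataFrame(data)
# rerunning the pipeline above gives the H group 08:00:00 and 08:00:01 on 2020-09-17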

Pandas Advanced: How to get results for customer who has bought at least twice within 5 days of period?

I have been attempting to solve a problem for hours and stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
orderid customerid orderdate
0 10315 ISLAT 1996-09-26
1 10318 ISLAT 1996-10-01
2 10321 ISLAT 1996-10-03
3 10473 ISLAT 1997-03-13
4 10621 ISLAT 1997-08-05
5 10253 HANAR 1996-07-10
6 10541 HANAR 1997-05-19
7 10645 HANAR 1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only one customer (ISLAT) ordered within a 5-day period, and he did it twice.
I would like to get the output in the following format:
Required Output
customerid initial_order_id initial_order_date nextorderid nextorderdate daysbetween
ISLAT 10315 1996-09-26 10318 1996-10-01 5
ISLAT 10318 1996-10-01 10321 1996-10-03 2
First, to be able to count the difference in days, convert orderdate
column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
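To then shape the kept rows into the requested output format, a possible follow-up sketch reusing the same shift idea (column names taken from the question's Required Output):
s = df.sort_values(['customerid', 'orderdate'])
nxt = s.groupby('customerid')[['orderid', 'orderdate']].shift(-1)  # each row's next order
out = s.assign(nextorderid=nxt['orderid'], nextorderdate=nxt['orderdate'])
out['daysbetween'] = (out['nextorderdate'] - out['orderdate']).dt.days
out = (out[out['daysbetween'] <= 5]
       .rename(columns={'orderid': 'initial_order_id', 'orderdate': 'initial_order_date'}))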
It is a bit tricky because there can be any number of purchase pairs within a 5-day window. It is a good use case for leveraging merge_asof, which allows you to do approximate (but not exact) matching of a dataframe with itself.
Input data
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self join on the date, but not exact.
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'), allow_exact_matches=False)
    # Compute difference
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort dataframe by date, oldest buy first (merge_asof needs a sorted key; groupby will not change this order)
df2 = df.sort_values(by='orderdate')
# Create an index for joining
df2 = df2.set_index('orderdate', drop=False)
# Compute puchases pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days<=5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days
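If you want to match the Required Output's column names, a final rename could look like this (a sketch; note that the _second columns hold the earlier, initial orders here):
result = (result.reset_index(drop=True)
                .drop(columns='customerid_second')
                .rename(columns={'customerid_first': 'customerid',
                                 'orderid_second': 'initial_order_id',
                                 'orderdate_second': 'initial_order_date',
                                 'orderid_first': 'nextorderid',
                                 'orderdate_first': 'nextorderdate',
                                 'timedelta': 'daysbetween'}))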
You can create the 'daysbetween' column with sort_values and diff. Then, to get the required layout, join df with a copy of itself that has been grouped by customerid and shifted. Finally, query for the rows where 'daysbetween_next' meets the 5-day condition:
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days
df_final = (df.join(df.groupby('customerid').shift(-1),
                    lsuffix='_initial', rsuffix='_next')
              .drop('daysbetween_initial', axis=1)
              .query('daysbetween_next <= 5 and daysbetween_next >= 0'))
It's quite simple. Let's write down the requirements one at a time and build on them.
First, I guess that the customer has a unique id since it's not specified. We'll use that id for identifying customers.
Second, I assume it does not matter if the customer bought 5 days before or after.
My solution is to use a simple filter. Note that this solution can also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["ID"] == df["ID"].shift(1)]
We create a new DataFrame, namely new_df, with all rows such that the xth row has the same user id as the xth - 1 row (i.e. the previous row).
Now, let's search for purchases within the 5 days by adding that condition to the previous piece of code (note the parentheses, needed because & binds tighter than the comparisons, and .dt.days, which turns the timedelta into a number):
new_df = df[(df["ID"] == df["ID"].shift(1)) & ((df["Date"] - df["Date"].shift(1)).dt.days <= 5)]
This should do the work. I cannot test it right now, so some fixes may be needed. I'll try to test it as soon as I can.
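For completeness, a runnable version of that idea against the question's actual column names (a sketch; the rows must be sorted first so shift(1) compares consecutive orders of the same customer):
df['orderdate'] = pd.to_datetime(df['orderdate'])
df = df.sort_values(['customerid', 'orderdate'])
same_customer = df['customerid'] == df['customerid'].shift(1)    # same customer as previous row
gap_days = (df['orderdate'] - df['orderdate'].shift(1)).dt.days  # days since previous order
repeat_buys = df[same_customer & (gap_days <= 5)]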

Splitting DataFrame into Multiple Frames by Dates Python

I fully understand there are a few versions of this question out there, but none seem to get at the core of my problem. I have a pandas DataFrame with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main DataFrame down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented DataFrame in order to see and plot which words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if start != '30':
        datetime.strptime(start, '%m-%d-%Y')  # validate the format
        end = input("Enter an end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime
dataTime = dateRange()
dataTime2 = dateRange()
def calcForDateRange(dateRangeFrame):
##### LONG FUNCTION####
return word and number
calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates which is expected as I created this as a test. How can I split the Dataframe by increments and run the calculation for each dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g
for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: DataFrame, and no frame reached my function. How can I break this down into 100 or so DataFrames to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I probably will have to somewhere.
Thank you
Let us assume you have a data frame like this:
import numpy as np
import pandas as pd

date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns=["X"])
df['Date'] = date
df.head()
Output:
X Date
0 328 2018-01-01
1 188 2018-01-02
2 709 2018-01-03
3 259 2018-01-04
4 131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)
print(df_dict)
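As a side note, the loop in the question passed dict keys to calcForDateRange, because iterating a dict yields its keys. To run the calculation on each segment, iterate the values instead (a sketch, assuming calcForDateRange accepts a DataFrame):
for key, frame in df_dict.items():
    calcForDateRange(frame)  # pass the DataFrame value, not the key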
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of the period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.Date.ge(d1) & df.Date.lt(d2)]
           for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}

How do I iterate an input list over this function and consequently

I want to iterate a list of input variables over the following function and then return the output as a csv file. I have a big csv file which I first import to create a dataframe. I then want to get a certain part of the dataframe, namely the -10 days and +10 days around a certain date for a certain stock.
The dataframe looks as follows (this is just a small part; in reality it's 100k+ rows, for every day, all stock tickers, for the period 2011 - 2019):
Date Symbol ShortExemptVolume ShortVolume TotalVolume
2011-01-03 AAWW 0.0 28369 78113.0
2011-01-03 AMD 0.0 3183556 8095093.0
2011-01-03 AMRS 0.0 14196 18811.0
2011-01-03 ARAY 0.0 31685 77976.0
2011-01-03 ARCC 0.0 177208 423768.0
The function is as follows. What it does is it filters the dataframe for the stock ticker and then the dates (-10 and +10 days around a given specific date).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'C:\Users\name\document.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    x = -10  # set window range
    y = -(x)
    date_1 = datetime.datetime.strptime(issue_date, "%Y-%m-%d")
    before_date = pricing_date = date_1 + datetime.timedelta(days=x)  # days before issue date
    after_date = date_1 + datetime.timedelta(days=y)
    cond1 = df['Date'] >= before_date
    cond2 = df['Date'] <= after_date
    cond3 = df['Symbol'] == 'stock_ticker'
    short_data = df[cond1 & cond2 & cond3]
    return [short_data]
I have a list with a couple hundred rows that contain a specific stock ticker and issue date, for example like this:
ARAY 4/24/2014
ACET 11/16/2015
ACET 11/16/2015
AEGR 8/15/2014
ATSG 9/29/2017
I would like to iterate the list of stocks and their respective dates over the function and get the output in csv format.
The output should be 20 dates for every row in the input file.
Any tips or help is welcome
Consider building a list of data frames generated from the function and compiling them together with concat. Also, there is no need to separately call to_datetime, as you can use the parse_dates argument in read_csv:
from datetime import datetime as dt  # needed for dt.strptime below

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'C:\Users\name\document.csv', parse_dates=['Date'])
    x = -10  # set window range
    y = -(x)
    date_1 = dt.strptime(issue_date, "%m/%d/%Y")  # MATCH ACCORDING TO INPUT
    before_date = date_1 + datetime.timedelta(days=x)
    after_date = date_1 + datetime.timedelta(days=y)
    cond1 = df['Date'] >= before_date
    cond2 = df['Date'] <= after_date
    cond3 = df['Symbol'] == stock_ticker  # REMOVE SINGLE QUOTES
    short_data = df[cond1 & cond2 & cond3]
    return short_data  # REMOVE LIST BRACKETS
stock_date_list = [['ARAY', '4/24/2014'],
                   ['ACET', '11/16/2015'],
                   ['ACET', '11/16/2015'],
                   ['AEGR', '8/15/2014'],
                   ['ATSG', '9/29/2017']]
# LIST COMPREHENSION ITERATIVELY CALLING FUNCTION
df_list = [get_data(i[1], i[0]) for i in stock_date_list]
# SINGLE DATA FRAME COMPILATION
final_df = pd.concat(df_list, ignore_index=True)
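Finally, to get the result out as a CSV file, as the question asks, one more line does it (the output path here is hypothetical):
final_df.to_csv(r'C:\Users\name\output.csv', index=False)  # hypothetical output path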

check for date and time between two columns in pandas data frame

I have two data frames:
The first date frame is:
import pandas as pd
df1 = pd.DataFrame({'serialNo': ['aaaa', 'bbbb', 'cccc', 'ffff', 'aaaa', 'bbbb', 'aaaa'],
                    'Name': ['Sayonti', 'Ruchi', 'Tony', 'Gowtam', 'Toffee', 'Tom', 'Sayonti'],
                    'testName': [4402, 3747, 5555, 8754, 1234, 9876, 3602],
                    'moduleName': ['singing', 'dance', 'booze', 'vocals', 'drama', 'paint', 'singing'],
                    'endResult': ['WARNING', 'FAILED', 'WARNING', 'FAILED', 'WARNING', 'FAILED', 'WARNING'],
                    'Date': ['2018-10-5', '2018-10-6', '2018-10-7', '2018-10-8', '2018-10-9', '2018-10-10', '2018-10-8'],
                    'Time_df1': ['23:26:39', '22:50:31', '22:15:28', '21:40:19', '21:04:15', '20:29:11', '19:54:03']})
The second data frame is:
df2 = pd.DataFrame({'serialNo': ['aaaa', 'bbbb', 'aaaa', 'ffff', 'xyzy', 'aaaa'],
                    'Food': ['Strawberry', 'Coke', 'Pepsi', 'Nuts', 'Apple', 'Candy'],
                    'Work': ['AP', 'TC', 'OD', 'PU', 'NO', 'PM'],
                    'Date': ['2018-10-1', '2018-10-6', '2018-10-2', '2018-10-3', '2018-10-5', '2018-10-10'],
                    'Time_df2': ['09:00:00', '10:00:00', '11:00:00', '12:00:00', '13:00:00', '14:00:00']})
I am joining the two based on serial number:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
result = pd.merge(df1,df2,on=['serialNo'],how='inner')
Now I want Date_y to lie within 3 days of Date_x, starting from Date_x,
which means Date_x + (1, 2, 3 days) should be Date_y. I can get that as below, but I also want to check the time range, which I do not know how to achieve:
result = result[result.Date_x.sub(result.Date_y).dt.days.between(0,3)]
I want to check the time such that Time_df2 is within 6 hours of the start time Time_df1. Please help?
You could have a column within your dataframe that combines the date and the time. Here's an example of combining a single row in the dataframe:
import datetime

# Combining Date_x and Time_df1
value_1_x = datetime.datetime.combine(result['Date_x'][0].date(),
                                      datetime.datetime.strptime(result['Time_df1'][0], '%H:%M:%S').time())
# Combining Date_y and Time_df2
value_2_y = datetime.datetime.combine(result['Date_y'][0].date(),
                                      datetime.datetime.strptime(result['Time_df2'][0], '%H:%M:%S').time())
Then given two datetime objects, you can simply subtract to find the difference you are looking for:
difference = value_1_x - value_2_y
print(difference)
Which gives the output:
4 days, 14:26:39
My understanding is that you are looking to see if something is within 3 days and 6 hours (or a total of 78 hours). You can convert this to hours easily, and then make the desired comparison:
hours_difference = abs(value_1_x - value_2_y).total_seconds() / 3600.0
print(hours_difference)
Which gives the output:
110.44416666666666
Hope that helps!
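For a whole-frame check rather than a single row, a vectorized sketch of the same idea (assuming the merged result from above; ts_x and ts_y are new helper columns):
result['ts_x'] = pd.to_datetime(result['Date_x'].dt.strftime('%Y-%m-%d') + ' ' + result['Time_df1'])
result['ts_y'] = pd.to_datetime(result['Date_y'].dt.strftime('%Y-%m-%d') + ' ' + result['Time_df2'])
hours = (result['ts_x'] - result['ts_y']).dt.total_seconds() / 3600.0
within_window = result[(hours >= 0) & (hours <= 78)]  # 3 days + 6 hours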
