Split one row into multiple records in Python

I have an input dataframe as follows:
Class  Duration  StudentID  Age  Startdate   Start Time  Enddate     End Time  TimeDifference
5th    XX        20002      5    04/12/2021  17:00:00    04/14/2021  20:00:00  3000
And I would like to split it into three different rows based on the start and end dates, as follows:
Class  Duration  StudentID  Age  Startdate   Start Time  Enddate     End Time  TimeDifference
5th    XX        20002      5    04/12/2021  17:00:00    04/12/2021  23:59:59  360
5th    XX        20002      5    04/13/2021  0:00:00     04/13/2021  23:59:59  1440
5th    XX        20002      5    04/14/2021  0:00:00     04/14/2021  20:00:00  1200
I am trying to do this with Python. Please help.

I get a slightly different value for 'TimeDifference', but this is an approach you can tweak and use.
Step 1:
You can start by using melt() with your id_vars being all your columns except your 'Startdate' and 'Enddate'.
Step 2:
Then you can set your index to be your StartEndDate column, created after melting your dataframe.
Step 3:
Then using reindex() you can add the new row with your missing dates.
Lastly what's left is to calculate the time difference column and rearrange your dataframe to get to your final output.
I assume your dataframe is called df:
# Step 1
ids = [c for c in df.columns if c not in ['Startdate','Enddate']]
new = df.melt(id_vars=ids,value_name = 'StartEndDate').drop('variable',axis=1)
new.loc[new.StartEndDate.isin(df['Startdate'].tolist()),'Start Time'] = "00:00"
print(new)
Class Duration StudentID Age Start Time End Time TimeDifference \
0 5th XX 20002 5 00:00 20:00 3000
1 5th XX 20002 5 17:00 20:00 3000
StartEndDate
0 04/12/2021
1 04/14/2021
# Step 2
new['StartEndDate'] = pd.to_datetime(new['StartEndDate']).dt.date
new.set_index(pd.DatetimeIndex(new.StartEndDate),inplace=True)
# Step 3
final = new.reindex(pd.date_range(new.index.min(),new.index.max()), method='ffill').reset_index()\
.rename({'index':'Startdate'},axis=1).drop('StartEndDate',axis=1)
final['Enddate'] = final['Startdate']
final['TimeDifference'] = (final['End Time'].str[:2].astype(int) - final['Start Time'].str[:2].astype(int))*60
final = final[['Class','Duration','StudentID','Age','Startdate','Start Time','Enddate','End Time','TimeDifference']]
Prints:
Class Duration StudentID Age Startdate Start Time Enddate End Time \
0 5th XX 20002 5 2021-04-12 00:00 2021-04-12 20:00
1 5th XX 20002 5 2021-04-13 00:00 2021-04-13 20:00
2 5th XX 20002 5 2021-04-14 17:00 2021-04-14 20:00
TimeDifference
0 1200
1 1200
2 180
I think some information is missing from your question, so I would suggest running it line by line and making the necessary adjustments to suit your task.

The below code worked for me; df is the source data frame and df1 is the result. (Note: pd.datetime was removed from recent pandas, so datetime.datetime.combine is used instead.)
import datetime
import numpy as np
import pandas as pd

df = data
# One row per calendar day between Startdate and Enddate
df['ReportDate'] = [pd.date_range(x, y) for x, y in zip(df['Startdate'], df['Enddate'])]
df1 = df.explode('ReportDate')
df1.head()
# The first day keeps its original start time; later days start at midnight
df1['RptStart'] = np.where(df1['ReportDate'] == df1['Startdate'], df1['StartTime'], datetime.time(0, 0, 0))
# The last day keeps its original end time; earlier days end at 23:59:59
df1['RptEnd'] = np.where(df1['Enddate'] == df1['ReportDate'], df1['EndTime'], datetime.time(23, 59, 59))
df1['StartDtTm'] = df1.apply(lambda r: datetime.datetime.combine(r['ReportDate'], r['RptStart']), axis=1)
df1['EndDtTm'] = df1.apply(lambda r: datetime.datetime.combine(r['ReportDate'], r['RptEnd']), axis=1)
df1['Duration'] = round((df1['EndDtTm'] - df1['StartDtTm']).dt.total_seconds() / 60)
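As a sketch, the same steps can be run end-to-end on the question's single sample row. The column names without spaces ('StartTime', 'EndTime') and the use of datetime.datetime.combine are assumptions on my part; adjust them to your actual schema.

```python
import datetime

import numpy as np
import pandas as pd

# Sample input built from the question's row (column names are assumed)
data = pd.DataFrame({
    'Startdate': pd.to_datetime(['04/12/2021']),
    'Enddate': pd.to_datetime(['04/14/2021']),
    'StartTime': [datetime.time(17, 0, 0)],
    'EndTime': [datetime.time(20, 0, 0)],
})

df = data.copy()
# One row per calendar day covered by the interval
df['ReportDate'] = [pd.date_range(x, y) for x, y in zip(df['Startdate'], df['Enddate'])]
df1 = df.explode('ReportDate')
# First day keeps its start time; later days start at midnight
df1['RptStart'] = np.where(df1['ReportDate'] == df1['Startdate'],
                           df1['StartTime'], datetime.time(0, 0, 0))
# Last day keeps its end time; earlier days end at 23:59:59
df1['RptEnd'] = np.where(df1['Enddate'] == df1['ReportDate'],
                         df1['EndTime'], datetime.time(23, 59, 59))
df1['StartDtTm'] = df1.apply(lambda r: datetime.datetime.combine(r['ReportDate'], r['RptStart']), axis=1)
df1['EndDtTm'] = df1.apply(lambda r: datetime.datetime.combine(r['ReportDate'], r['RptEnd']), axis=1)
df1['Duration'] = ((df1['EndDtTm'] - df1['StartDtTm']).dt.total_seconds() / 60).round()
print(df1[['ReportDate', 'RptStart', 'RptEnd', 'Duration']])
```

This yields one row per day with durations of roughly 420, 1440 and 1200 minutes (the partial first day rounds 419.98 up to 420, which differs slightly from the 360 in the question).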

Related

Insert new row in a DataFrame if condition is met

I have a dataframe with two date columns:
Brand Start Date Finish Date Check
0 1 2013-03-16 2014-03-02 Consecutive
1 2 2014-03-03 2015-09-05 Consecutive
2 3 2015-12-12 2016-12-12 Non Consecutive
3 4 2017-01-01 2017-06-01 Non consecutive
4 5 2017-06-02 2019-02-20 Consecutive
I created a new column (the check column) that checks whether each row's start date is consecutive with the previous row's finish date, populated with 'Consecutive' and 'Non consecutive'.
I want to insert a new row wherever the check column says 'Non consecutive'. The new row takes as Start Date the previous row's Finish Date + 1 day (consecutive with the previous row) and as Finish Date the current row's Start Date - 1 day (consecutive with the next row). So indexes 2 and 4 will be the new rows:
Brand Start Date Finish Date
0 1 2013-03-16 2014-03-02
1 2 2014-03-03 2015-09-05
2 3 2015-09-06 2015-12-11
3 3 2015-12-12 2016-12-12
4 4 2016-12-13 2016-12-31
5 4 2017-01-01 2017-06-01
6 5 2017-06-02 2019-02-20
How can I achieve this?
from datetime import datetime, timedelta
import pandas as pd

date_format = '%Y-%m-%d'
# Rows flagged non-consecutive have a gap before them
rows = df.index[~df['Consecutive']]
df2 = pd.DataFrame(columns=df.columns, index=rows)
for row in rows:
    df2.loc[row, :] = df.loc[row, :].copy()
    # The gap starts the day after the previous row's finish...
    df2.loc[row, 'Start Date'] = datetime.strftime(
        datetime.strptime(df.loc[row - 1, 'Finish Date'], date_format) + timedelta(days=1), date_format)
    # ...and ends the day before this row's start
    df2.loc[row, 'Finish Date'] = datetime.strftime(
        datetime.strptime(df.loc[row, 'Start Date'], date_format) - timedelta(days=1), date_format)
df3 = pd.concat([df, df2]).sort_values(['Brand', 'Start Date']).reset_index(drop=True)
This uses sorting to put the rows in the correct place. If your df is big the sorting could be the slowest part and you could consider adding the rows one at a time into the correct place see here.
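For reference, here is a minimal self-contained sketch of the same gap-filling idea, working with datetime columns directly instead of string parsing (the two-row frame below is taken from the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'Brand': [2, 3],
    'Start Date': pd.to_datetime(['2014-03-03', '2015-12-12']),
    'Finish Date': pd.to_datetime(['2015-09-05', '2016-12-12']),
})

one_day = pd.Timedelta(days=1)
gaps = []
for i in range(1, len(df)):
    prev_finish = df.loc[i - 1, 'Finish Date']
    start = df.loc[i, 'Start Date']
    # A gap exists when this row's start is not the day after the previous finish
    if start != prev_finish + one_day:
        gaps.append({'Brand': df.loc[i, 'Brand'],
                     'Start Date': prev_finish + one_day,
                     'Finish Date': start - one_day})

out = (pd.concat([df, pd.DataFrame(gaps)])
         .sort_values(['Brand', 'Start Date'])
         .reset_index(drop=True))
print(out)
```

Keeping the dates as datetimes avoids the strptime/strftime round-trips, at the cost of converting the columns up front.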

Inserting flag on occurence of date

I have a pandas dataframe data-
Round Number Date
1 7/4/2018 20:00
1 8/4/2018 16:00
1 8/4/2018 20:00
1 9/4/2018 20:00
Now I want to create a new dataframe which has two columns
['Date' ,'flag']
The Date column will cover the full range of dates in the data dataframe. (In the actual data the dates run from 7/4/2018 8:00:00 PM to 27/05/2018 19:00, so the new dataframe's Date column will hold dates from 1/4/2018 to 30/05/2018: since 7/4/2018 falls in April we include the whole of April, and since 27/05/2018 falls in May we include 1/05/2018 to 30/05/2018.)
In the flag column we put 1 if that particular date was there in the old dataframe.
Output(partial)-
Date Flag
1/4/2018 0
2/4/2018 0
3/4/2018 0
4/4/2018 0
5/4/2018 0
6/4/2018 0
7/4/2018 1
8/4/2018 1
and so on...
I would use np.where() to address this issue. I'm also working on improving the answer by deriving the date range for new_df from old_df.
import pandas as pd
import numpy as np
old_df = pd.DataFrame({'date':['4/7/2018 20:00','4/8/2018 20:00'],'value':[1,2]})
old_df['date'] = pd.to_datetime(old_df['date'],infer_datetime_format=True)
new_df = pd.DataFrame({'date':pd.date_range(start='4/1/2018',end='5/30/2018',freq='d')})
new_df['flag'] = np.where(new_df['date'].dt.date.astype(str).isin(old_df['date'].dt.date.astype(str).tolist()),1,0)
print(new_df.head(10))
Output:
date flag
0 2018-04-01 0
1 2018-04-02 0
2 2018-04-03 0
3 2018-04-04 0
4 2018-04-05 0
5 2018-04-06 0
6 2018-04-07 1
7 2018-04-08 1
8 2018-04-09 0
9 2018-04-10 0
Edit:
Improved version, full code:
import pandas as pd
import numpy as np
old_df = pd.DataFrame({'date':['4/7/2018 20:00','4/8/2018 20:00','5/30/2018 20:00'],'value':[1,2,3]})
old_df['date'] = pd.to_datetime(old_df['date'],infer_datetime_format=True)
# Start from the first day of the earliest month (dates built as dd/mm/yyyy)
if old_df['date'].min().month < 10:
    start_date = pd.to_datetime(
        "01/0" + str(old_df['date'].min().month) + "/" + str(old_df['date'].min().year),
        format='%d/%m/%Y')
else:
    start_date = pd.to_datetime(
        "01/" + str(old_df['date'].min().month) + "/" + str(old_df['date'].min().year),
        format='%d/%m/%Y')
end_date = pd.to_datetime(old_df['date'].max())
new_df = pd.DataFrame({'date':pd.date_range(start=start_date,end=end_date,freq='d')})
new_df['flag'] = np.where(new_df['date'].dt.date.astype(str).isin(old_df['date'].dt.date.astype(str).tolist()),1,0)
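As a hedged alternative to the month-padding branch above, Timestamp.replace can snap the minimum date to the first of its month without any string formatting; this sketch computes the flag column the same way:

```python
import pandas as pd

old_df = pd.DataFrame({'date': ['4/7/2018 20:00', '4/8/2018 20:00', '5/30/2018 20:00'],
                       'value': [1, 2, 3]})
old_df['date'] = pd.to_datetime(old_df['date'])

# First day of the earliest month: no zero-padding branch needed
start_date = old_df['date'].min().replace(day=1, hour=0, minute=0)
end_date = old_df['date'].max()
new_df = pd.DataFrame({'date': pd.date_range(start=start_date, end=end_date, freq='d')})
# Flag dates that appear in the original frame
new_df['flag'] = new_df['date'].dt.date.isin(old_df['date'].dt.date).astype(int)
print(new_df.head(10))
```

This produces 60 daily rows (all of April plus May 1-30) with three flagged dates.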

how to replace missing date from NAT to some date in increasing order in python

I have a date column.
The missing values (NaT in pandas) need to be filled in a loop, each incremented by one day,
that is 1/1/2015, 1/2/2015, 1/3/2015.
Can anyone help me out?
This will add an incremental date to your dataframe.
import pandas as pd
import datetime as dt
ddict = {
    'Date': ['2014-12-29', '2014-12-30', '2014-12-31', '', '', '', ''],
}
data = pd.DataFrame(ddict)
data['Date'] = pd.to_datetime(data['Date'])

def fill_dates(data_frame, date_col='Date'):
    ### Seconds in a day (3600 seconds per hour x 24 hours per day)
    day_s = 3600 * 24
    ### Create timedelta variable for adding 1 day
    _day = dt.timedelta(seconds=day_s)
    ### Get the max non-null date
    max_dt = data_frame[date_col].max()
    ### Get index of missing date values
    NaT_index = data_frame[data_frame[date_col].isnull()].index
    ### Loop through index; set incremental date value; increment variable by 1 day
    for i in NaT_index:
        data_frame.loc[i, date_col] = max_dt + _day
        _day += dt.timedelta(seconds=day_s)

### Execute function
fill_dates(data, 'Date')
Initial data frame:
Date
0 2014-12-29
1 2014-12-30
2 2014-12-31
3 NaT
4 NaT
5 NaT
6 NaT
After running the function:
Date
0 2014-12-29
1 2014-12-30
2 2014-12-31
3 2015-01-01
4 2015-01-02
5 2015-01-03
6 2015-01-04
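When the NaT block sits at the end of the column, as in this example, a loop-free variant of the same idea is possible; this is a sketch, not the original answer's code:

```python
import pandas as pd

data = pd.DataFrame({'Date': pd.to_datetime(
    ['2014-12-29', '2014-12-30', '2014-12-31', None, None, None, None])})

# Number the missing rows 1..n, then offset each from the last valid date
mask = data['Date'].isna()
offsets = pd.to_timedelta(mask.cumsum(), unit='D')
data.loc[mask, 'Date'] = data['Date'].max() + offsets[mask]
print(data)
```

The cumulative sum of the NaT mask gives each missing row its 1-based position after the last valid date, so the fills land on 2015-01-01 through 2015-01-04.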

Count String Values in Column across 30 Minute Time Bins using Pandas

I am looking to determine the count of string values in a column across a 3 month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30-minute intervals (e.g. 0500-0530, 0530-0600) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure how to group the data on anything other than 'hour' and I don't have fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I researched multi-indexing and resampling to some length the past two days but I have been unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to apply a TimeGrouper to the time-of-day component alone, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
.size()
.unstack(fill_value=0))
v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1
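As an alternative sketch, flooring the timestamps to 30 minutes first collapses both grouping steps into a single crosstab (same sample data as the question):

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-06-06 17:00', '2018-06-07 17:30',
                                '2018-06-07 17:33', '2018-06-08 19:00',
                                '2018-06-09 05:27']),
    'stringvalues': ['A', 'B', 'A', 'B', 'A'],
})

# Floor each timestamp to its 30-minute bin, keep only the time of day,
# then cross-tabulate bins against string values
binned = df['datetime'].dt.floor('30min').dt.time
out = pd.crosstab(binned, df['stringvalues'])
print(out)
```

Here 17:33 lands in the 17:30 bin alongside the 17:30 row, matching the two-step output above.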

Day number of a quarter for a given date in pandas

I have create a dataframe of dates as follows:
import pandas as pd
timespan = 366
df = pd.DataFrame({'Date':pd.date_range(pd.Timestamp.today(), periods=timespan).tolist()})
I'm struggling to identify the day number in a quarter. For example
date expected_value
2017-01-01 1 # First day in Q1
2017-01-02 2 # Second day in Q1
2017-02-01 32 # 32nd day in Q1
2017-04-01 1 # First day in Q2
May I have your suggestions? Thank you in advance.
>>> df.assign(
        days_in_quarter=[(date - ts.start_time).days + 1
                         for date, ts in zip(df['Date'],
                                             pd.PeriodIndex(df['Date'], freq='Q'))])
          Date  days_in_quarter
0 2017-01-01 1
1 2017-01-02 2
2 2017-01-03 3
...
363 2017-12-30 91
364 2017-12-31 92
365 2018-01-01 1
This is around 250x faster than Alexander's solution:
df['day_qtr']=(df.Date - pd.PeriodIndex(df.Date,freq='Q').start_time).dt.days + 1
One way is by creating a new df based on dates and a quarter cumcount, then mapping the values onto the real df, i.e.
timespan = 5000
ndf = pd.DataFrame({'Date':pd.date_range('2015-01-01', periods=timespan).tolist()})
ndf['q'] = ndf['Date'].dt.to_period('Q')
ndf['new'] = ndf.groupby('q').cumcount()+1
maps = dict(zip(ndf['Date'].dt.date, ndf['new'].values.tolist()))
Map the values
df['expected'] = df.Date.dt.date.map(maps)
Output:
Date expected
0 2017-09-12 09:42:14.324492 74
1 2017-09-13 09:42:14.324492 75
2 2017-09-14 09:42:14.324492 76
3 2017-09-15 09:42:14.324492 77
4 2017-09-16 09:42:14.324492 78
.
.
143 2018-02-02 09:42:14.324492 33
.
.
201 2018-04-01 09:42:14.324492 1
Hope it helps.
start with:
day_of_year = datetime.now().timetuple().tm_yday
from
Convert Year/Month/Day to Day of Year in Python
You can get the first day of each quarter that way, then subtract it from the given date (and add 1) to get the day of the quarter.
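That bracketing idea can be sketched in plain datetime code; day_of_quarter is a name introduced here for illustration:

```python
from datetime import date

def day_of_quarter(d: date) -> int:
    # The quarter's first month is 1, 4, 7, or 10
    q_start = date(d.year, 3 * ((d.month - 1) // 3) + 1, 1)
    return (d - q_start).days + 1

print(day_of_quarter(date(2017, 2, 1)))  # 32: the 32nd day of Q1
```

This avoids pandas entirely, which can be handy for single dates outside a dataframe.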
