Pandas cumulative sum if between certain times/values - python

I want to insert a new column called total into final_df: for each row of final_df, total should be the sum of value over every row of df whose time range contains that row's start and end. So for example, for the range 01:30 to 02:00 in final_df, both index 0 and index 1 of df span this range, so the total is 15 (10 + 5).
I have two pandas dataframes:
df
import pandas as pd
d = {'start_time': ['01:00','00:00','00:30','02:00'],
'end_time': ['02:00','03:00','01:30','02:30'],
'value': ['10','5','20','5']}
df = pd.DataFrame(data=d)
final_df
final_d = {'start_time': ['00:00', '00:30', '01:00', '01:30', '02:00', '02:30'],
'end_time': ['00:30', '01:00', '01:30', '02:00', '02:30', '03:00']}
final_df = pd.DataFrame(data=final_d)
The output I want for final_df:
start_time end_time total
00:00 00:30 5
00:30 01:00 25
01:00 01:30 35
01:30 02:00 15
02:00 02:30 10
02:30 03:00 5
My try
final_df['total'] = final_df.apply(lambda x: df.loc[(df['start_time'] >= x.start_time) &
(df['end_time'] <= x.end_time), 'value'].sum(), axis=1)
Problem 1
I get the error: TypeError: ("'>=' not supported between instances of 'str' and 'datetime.time'", 'occurred at index 0')
I converted the relevant columns to datetime as follows:
df[['start_time','end_time']] = df[['start_time','end_time']].apply(pd.to_datetime, format='%H:%M')
final_df[['start_time','end_time']] = final_df[['start_time','end_time']].apply(pd.to_datetime, format='%H:%M')
But I don't want to convert to datetime. Is there a way around this?
Problem 2
The sum is not working properly. It's only looking for an exact match of the time range. So the output is:
start_time end_time total
00:00 00:30 0
00:30 01:00 0
01:00 01:30 0
01:30 02:00 0
02:00 02:30 5
02:30 03:00 0

One way to do this without apply could be like this.
df_ = (df.rename(columns={'start_time':1, 'end_time':-1}) # to use in the calculation later
.rename_axis(columns='mult') # mostly cosmetic
.set_index('value').stack() # reshape the data
.reset_index(name='time') # put the index back to columns
)
df_ = (df_.set_index(pd.to_datetime(df_['time'], format='%H:%M')) # to use the resampling technique
.assign(total=lambda x: x['value'].astype(float)*x['mult']) # plus or minus the value depending on start/end
.resample('30min')[['total']].sum() # get the sum at the 30min bounds
.cumsum() # cumulative sum from the beginning
)
# create the column for the merge with the final result
df_['start_time'] = df_.index.strftime('%H:%M')
# merge
final_df = final_df.merge(df_)
and you get
print (final_df)
start_time end_time total
0 00:00 00:30 5.0
1 00:30 01:00 25.0
2 01:00 01:30 35.0
3 01:30 02:00 15.0
4 02:00 02:30 10.0
5 02:30 03:00 5.0
But if you want to use apply, first you need to ensure that the columns have the right dtypes, and then write the inequalities in the reverse order, like:
df['start_time'] = pd.to_datetime(df['start_time'], format='%H:%M')
df['end_time'] = pd.to_datetime(df['end_time'], format='%H:%M')
df['value'] = df['value'].astype(float)
final_df['start_time'] = pd.to_datetime(final_df['start_time'], format='%H:%M')
final_df['end_time'] = pd.to_datetime(final_df['end_time'], format='%H:%M')
final_df.apply(
lambda x: df.loc[(df['start_time'] <= x.start_time) & # note the reversed inequalities
(df['end_time'] >= x.end_time), 'value'].sum(), axis=1)
0 5.0
1 25.0
2 35.0
3 15.0
4 10.0
5 5.0
dtype: float64
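A note on Problem 1: zero-padded 'HH:MM' strings sort lexicographically in chronological order, so the containment test also works on the raw strings with no datetime conversion at all. A minimal sketch, assuming the corrected final_df definition from the question and value cast to float:
import pandas as pd
df = pd.DataFrame({'start_time': ['01:00','00:00','00:30','02:00'],
                   'end_time': ['02:00','03:00','01:30','02:30'],
                   'value': [10.0, 5.0, 20.0, 5.0]})
final_df = pd.DataFrame({'start_time': ['00:00','00:30','01:00','01:30','02:00','02:30'],
                         'end_time': ['00:30','01:00','01:30','02:00','02:30','03:00']})
# '00:30' <= '01:00' holds as a plain string comparison, so no conversion is needed
final_df['total'] = final_df.apply(
    lambda x: df.loc[(df['start_time'] <= x.start_time) &
                     (df['end_time'] >= x.end_time), 'value'].sum(), axis=1)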

Related

Pandas merge with average of second dataframe

I have two pandas dataframes.
Dataframe one has three columns:
name start_time end_time
alice 04:00 05:00
bob 05:00 07:00
Dataframe two has three columns:
time points_1 points_2
04:30 5 4
04:45 8 6
05:30 10 3
06:15 4 7
06:55 1 0
I would like to merge the two dataframes such that the first dataframe now has 5 columns:
name start_time end_time average_point_1 average_point_2
alice 04:00 05:00 6.5 5
bob 05:00 07:00 5 3.33
Where the average_point_1 column consists of the average of points_1 from dataframe two between the start and end time of each row, and similarly for average_point_2. Could someone tell me how to merge the two dataframes like this without having to write an averaging function to build the columns first and then merge?
Try:
#convert all time fields to datetime for merge_asof compatibility
df1["start_time"] = pd.to_datetime(df1["start_time"],format="%H:%M")
df1["end_time"] = pd.to_datetime(df1["end_time"],format="%H:%M")
df2["time"] = pd.to_datetime(df2["time"],format="%H:%M")
#merge both dataframes on time
merged = pd.merge_asof(df2, df1, left_on="time", right_on="start_time")
#groupby and get the average for each name
#(numeric_only=True excludes the datetime 'time' column, which newer pandas would otherwise average)
output = merged.groupby(["name", "start_time", "end_time"],as_index=False).mean(numeric_only=True)
#convert time columns back to strings if needed
output["start_time"] = output["start_time"].dt.strftime("%H:%M")
output["end_time"] = output["end_time"].dt.strftime("%H:%M")
>>> output
name start_time end_time points_1 points_2
0 alice 04:00 05:00 6.5 5.000000
1 bob 05:00 07:00 5.0 3.333333
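One caveat: merge_asof (with the default direction="backward") matches each time to the most recent start_time and never looks at end_time, which is fine here because the two intervals are contiguous. If the intervals could have gaps, a hedged sketch of the extra filter you would need before the groupby, reusing the merged frame from above:
# keep only rows whose time actually falls inside the matched interval
# (unmatched rows have NaT bounds and are dropped; adjust <= vs < for the end bound)
merged = merged[merged["time"] <= merged["end_time"]]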

Reshape dataframe into several columns based on date column

I want to rearrange my example dataframe (df.csv) below based on the date column. Each row represents an hour's data; for instance, for both dates 2002-01-01 and 2002-01-02 there are 5 rows each, every row representing 1 hour.
date,symbol
2002-01-01,A
2002-01-01,A
2002-01-01,A
2002-01-01,B
2002-01-01,A
2002-01-02,B
2002-01-02,B
2002-01-02,A
2002-01-02,A
2002-01-02,A
My expected output is as below.
date,hour1, hour2, hour3, hour4, hour5
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
I have tried the below, as explained here: https://pandas.pydata.org/docs/user_guide/reshaping.html, but it doesn't work in my case because the symbol column contains duplicates.
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv')
pivoted = df.pivot(index="date", columns="symbol")
print(pivoted)
The data does not have timestamps, only the date. However, each row for the same date represents an hourly interval; for instance, the output could also be represented as below:
date,01:00, 02:00, 03:00, 04:00, 05:00
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
where hour1 represents 01:00, hour2 represents 02:00, etc.
You had the correct pivot approach, but you were missing a column 'time', so let's split the datetime into date and time:
s = pd.to_datetime(df['date'])
df['date'] = s.dt.date
df['time'] = s.dt.time
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
Alternatively, to get HH:MM times, use df['time'] = s.dt.strftime('%H:%M')
used input:
date,symbol
2002-01-01 01:00,A
2002-01-01 02:00,A
2002-01-01 03:00,A
2002-01-01 04:00,B
2002-01-01 05:00,A
2002-01-02 01:00,B
2002-01-02 02:00,B
2002-01-02 03:00,A
2002-01-02 04:00,A
2002-01-02 05:00,A
No time in the input!
If you really have no time in the input dates and need to 'invent' increasing ones, you could use groupby.cumcount:
df['time'] = pd.to_datetime(df.groupby('date').cumcount(), format='%H').dt.strftime('%H:%M')
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00 02:00 03:00 04:00 05:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
To label each entry as an hour instead:
k = df.groupby("date").cumcount().add(1).astype(str).radd("hour") # 'hour1', 'hour2', ...
out = df.pivot_table(values='symbol', index='date', columns=k, aggfunc='min')
print(out)
hour1 hour2 hour3 hour4 hour5
date
2002-01-01 A A A B A
2002-01-02 B B A A A
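Essentially the same idea as the pivot_table above, a sketch using set_index + unstack on the same df, in case that reads more naturally:
out = (df.set_index(['date', df.groupby('date').cumcount().add(1)])['symbol']
         .unstack()
         .add_prefix('hour'))
print(out)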
I have an approach for you; I guess it's not the most elegant way since I have to rename both index and columns, but it does the job.
new_cols = ['01:00', '02:00', '03:00', '04:00', '05:00']
df1 = df.loc[df['date']=='2002-01-01', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-01'])
df2 = df.loc[df['date']=='2002-01-02', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-02'])
result = pd.concat([df1,df2])
print(result)
Output:
01:00 02:00 03:00 04:00 05:00
2002-01-01 A A A B A
2002-01-02 B B A A A

Python/Pandas - TypeError when concatenating MultiIndex DataFrames

I have trouble concatenating a list of MultiIndex DataFrames with 2 levels, and adding a third one to distinguish them.
As an example, I have following input data.
import pandas as pd
import numpy as np
# Input data
start = '2020-01-01 00:00+00:00'
end = '2020-01-01 02:00+00:00'
pr1h = pd.period_range(start=start, end=end, freq='1h')
midx1 = pd.MultiIndex.from_tuples([('Sup',1),('Sup',2),('Inf',1),('Inf',2)], names=['Data','Position'])
df1 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
df3 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
midx2 = pd.MultiIndex.from_tuples([('Sup',3),('Inf',3)], names=['Data','Position'])
df2 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
df4 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
So df1 & df2 both hold data for the same tag, 1h; while they share column names at the Data level, they don't have the same column names at the Position level.
df1
Data Sup Inf
Position 1 2 1 2
2020-01-01 00:00 0.660795 0.538452 0.861801 0.502479
2020-01-01 01:00 0.205806 0.847124 0.474861 0.906546
2020-01-01 02:00 0.681480 0.479512 0.631771 0.961844
df2
Data Sup Inf
Position 3 3
2020-01-01 00:00 0.758533 0.672899
2020-01-01 01:00 0.096463 0.304843
2020-01-01 02:00 0.080504 0.990310
Now, df3 and df4 follow the same logic and same column names. To distinguish them from df1 & df2, I want to use a different tag, 2h for instance.
I want to add this third level, named Period, during the call to pd.concat. For this, I am trying to use the keys parameter of pd.concat(). I tried the following code.
df_list = [df1, df2, df3, df4]
period_list = ['1h', '1h', '2h', '2h']
concatenated = pd.concat(df_list, keys=period_list, names=('Period', 'Data', 'Position'), axis=1)
But this raises the following error.
TypeError: int() argument must be a string, a bytes-like object or a number, not 'slice'
Please, any idea what is the correct call for this?
Thanks for your help.
EDIT 05/05
As requested, here is the desired result (copied directly from the answer given; the result obtained from that answer is the one I am looking for).
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.309778 0.597582 0.872392 0.983021 0.659965 0.214953
2020-01-01 01:00 0.467403 0.875744 0.296069 0.131291 0.203047 0.382865
2020-01-01 02:00 0.842818 0.659036 0.595440 0.436354 0.224873 0.114649
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.356250 0.587131 0.149471 0.171239 0.583017 0.232641
2020-01-01 01:00 0.397165 0.637952 0.372520 0.002407 0.556518 0.523811
2020-01-01 02:00 0.548816 0.126972 0.079793 0.235039 0.350958 0.705332
A quick fix would be to use different names in period_list and rename just after the concat. Something like:
df_list = [df1, df2, df3, df4]
period_list = ['1h_a', '1h_b', '2h_a', '2h_b']
concatenated = pd.concat(df_list,
keys=period_list,
names=('Period', 'Data', 'Position'),
axis=1)\
.rename(columns={col:col.split('_')[0] for col in period_list},
level='Period')
print (concatenated)
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.309778 0.597582 0.872392 0.983021 0.659965 0.214953
2020-01-01 01:00 0.467403 0.875744 0.296069 0.131291 0.203047 0.382865
2020-01-01 02:00 0.842818 0.659036 0.595440 0.436354 0.224873 0.114649
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.356250 0.587131 0.149471 0.171239 0.583017 0.232641
2020-01-01 01:00 0.397165 0.637952 0.372520 0.002407 0.556518 0.523811
2020-01-01 02:00 0.548816 0.126972 0.079793 0.235039 0.350958 0.705332
Edit: as speed is a concern, it seems that rename is slow, so you can do:
concatenated = pd.concat(df_list,
keys=period_list,
axis=1)
concatenated.columns = pd.MultiIndex.from_tuples([(col[0].split('_')[0], col[1], col[2])
for col in concatenated.columns],
names=('Period', 'Data', 'Position'), )
Consider an inner concat on similar data frames then run a final concat to bind all together:
concatenated = pd.concat([pd.concat([df1, df2], axis=1),
pd.concat([df3, df4], axis=1)],
keys = ['1h', '2h'],
names=('Period', 'Data', 'Position'),
axis=1)
print(concatenated)
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.189802 0.675083 0.624484 0.781774 0.453101 0.224525
2020-01-01 01:00 0.249818 0.829180 0.190488 0.923107 0.495873 0.278201
2020-01-01 02:00 0.602634 0.494915 0.612672 0.903609 0.426809 0.248981
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.746499 0.385714 0.008561 0.961152 0.988231 0.897454
2020-01-01 01:00 0.643730 0.365023 0.812249 0.291733 0.045417 0.414968
2020-01-01 02:00 0.887567 0.680102 0.978388 0.018501 0.695866 0.679730
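If the period/frame pairing needs to generalize beyond two fixed groups, a hedged sketch that derives the inner concats from the original df_list and period_list (note that pd.concat sorts mapping keys, which is harmless here):
from collections import defaultdict
groups = defaultdict(list)
for period, frame in zip(period_list, df_list):
    groups[period].append(frame)
concatenated = pd.concat(
    {period: pd.concat(frames, axis=1) for period, frames in groups.items()},
    names=('Period', 'Data', 'Position'), axis=1)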

Inserting flag on occurrence of date

I have a pandas dataframe data-
Round Number Date
1 7/4/2018 20:00
1 8/4/2018 16:00
1 8/4/2018 20:00
1 9/4/2018 20:00
Now I want to create a new dataframe which has two columns
['Date' ,'flag']
The Date column will have the range of dates spanned by the data dataframe, expanded to whole months (in the actual data the dates run from 7/4/2018 8:00:00 PM to 27/05/2018 19:00, so the date column in the new dataframe will have dates from 1/4/2018 to 30/05/2018: since 7/4/2018 8:00:00 PM falls in April we include the whole month of April, and similarly, since 27/05/2018 falls in May, we include dates from 1/05/2018 to 30/05/2018).
In the flag column we put 1 if that particular date appears in the old dataframe.
Output (partial):
Date Flag
1/4/2018 0
2/4/2018 0
3/4/2018 0
4/4/2018 0
5/4/2018 0
6/4/2018 0
7/4/2018 1
8/4/2018 1
and so on...
I would use np.where() to address this issue. An improved version further down also derives the date range for new_df from the dates in old_df.
import pandas as pd
import numpy as np
old_df = pd.DataFrame({'date':['4/7/2018 20:00','4/8/2018 20:00'],'value':[1,2]})
old_df['date'] = pd.to_datetime(old_df['date'],infer_datetime_format=True)
new_df = pd.DataFrame({'date':pd.date_range(start='4/1/2018',end='5/30/2018',freq='d')})
new_df['flag'] = np.where(new_df['date'].dt.date.astype(str).isin(old_df['date'].dt.date.astype(str).tolist()),1,0)
print(new_df.head(10))
Output:
date flag
0 2018-04-01 0
1 2018-04-02 0
2 2018-04-03 0
3 2018-04-04 0
4 2018-04-05 0
5 2018-04-06 0
6 2018-04-07 1
7 2018-04-08 1
8 2018-04-09 0
9 2018-04-10 0
Edit:
Improved version, full code:
import pandas as pd
import numpy as np
old_df = pd.DataFrame({'date':['4/7/2018 20:00','4/8/2018 20:00','5/30/2018 20:00'],'value':[1,2,3]})
old_df['date'] = pd.to_datetime(old_df['date'],infer_datetime_format=True)
# snap the start to the first day of the earliest month in the data
start_date = old_df['date'].min().normalize().replace(day=1)
end_date = old_df['date'].max()
new_df = pd.DataFrame({'date':pd.date_range(start=start_date,end=end_date,freq='d')})
new_df['flag'] = np.where(new_df['date'].dt.date.astype(str).isin(old_df['date'].dt.date.astype(str).tolist()),1,0)
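As a side note, the flag line can be written more compactly; a sketch assuming the same old_df and new_df as above (normalize() floors the timestamps to midnight, matching the midnight entries that date_range produces):
new_df['flag'] = new_df['date'].isin(old_df['date'].dt.normalize()).astype(int)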

Pandas - How to extract HH:MM from datetime column in Python?

I just want to extract from my df HH:MM. How do I do it?
Here's a description of the column in the df:
count 810
unique 691
top 2018-07-25 11:14:00
freq 5
Name: datetime, dtype: object
The string value includes a full timestamp. The goal is to parse each row's HH:MM into another df, and then, in a second pass, to extract just the %Y-%m-%d into another df.
Assume the df looks like
print(df)
date_col
0 2018-07-25 11:14:00
1 2018-08-26 11:15:00
2 2018-07-29 11:17:00
#convert from string to datetime
df['date_col'] = pd.to_datetime(df['date_col'])
#to get date only
print(df['date_col'].dt.date)
0 2018-07-25
1 2018-08-26
2 2018-07-29
#to get time:
print(df['date_col'].dt.time)
0 11:14:00
1 11:15:00
2 11:17:00
#to get hour and minute
print(df['date_col'].dt.strftime('%H:%M'))
0 11:14
1 11:15
2 11:17
First convert to datetime:
df['datetime'] = pd.to_datetime(df['datetime'])
Then you can do:
df2['datetime'] = df['datetime'].dt.strftime('%H:%M')
df3['datetime'] = df['datetime'].dt.strftime('%Y-%m-%d')
General solution (not pandas based)
import time
top = '2018-07-25 11:14:00'
time_struct = time.strptime(top, '%Y-%m-%d %H:%M:%S')
short_top = time.strftime('%H:%M', time_struct)
print(short_top)
Output
11:14
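To run the same stdlib parsing over a whole column, a hedged sketch (the hhmm column name is just illustrative, and it assumes every string matches '%Y-%m-%d %H:%M:%S'):
import time
import pandas as pd
df = pd.DataFrame({'datetime': ['2018-07-25 11:14:00', '2018-08-26 11:15:00']})
# parse each string, then format back to HH:MM
df['hhmm'] = df['datetime'].apply(
    lambda s: time.strftime('%H:%M', time.strptime(s, '%Y-%m-%d %H:%M:%S')))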
