Pandas merge with average of second dataframe - python

I have two pandas dataframes.
Dataframe one has three columns:
name   start_time  end_time
alice  04:00       05:00
bob    05:00       07:00
Dataframe two has three columns:
time   points_1  points_2
04:30  5         4
04:45  8         6
05:30  10        3
06:15  4         7
06:55  1         0
I would like to merge the two dataframes such that the first dataframe now has five columns:
name   start_time  end_time  average_point_1  average_point_2
alice  04:00       05:00     6.5              5
bob    05:00       07:00     5                3.33
where the average_point_1 column holds the average of points_1 from dataframe two between the start and end time of each row, and likewise for average_point_2. Could someone tell me how to merge the two dataframes like this without having to write an averaging function to build the columns first and then merge?

Try:
import pandas as pd

# convert all time fields to datetime for merge_asof compatibility
df1["start_time"] = pd.to_datetime(df1["start_time"], format="%H:%M")
df1["end_time"] = pd.to_datetime(df1["end_time"], format="%H:%M")
df2["time"] = pd.to_datetime(df2["time"], format="%H:%M")
# match each df2 row to the most recent start_time at or before its time
merged = pd.merge_asof(df2, df1, left_on="time", right_on="start_time")
# group by name and average the points columns
output = merged.groupby(["name", "start_time", "end_time"], as_index=False).mean()
# convert the time columns back to strings if needed
output["start_time"] = output["start_time"].dt.strftime("%H:%M")
output["end_time"] = output["end_time"].dt.strftime("%H:%M")
>>> output
name start_time end_time points_1 points_2
0 alice 04:00 05:00 6.5 5.000000
1 bob 05:00 07:00 5.0 3.333333
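One caveat: merge_asof matches on start_time alone, so a df2 row whose time falls after one name's end_time but before the next name's start_time would still be averaged into the earlier name. If such gaps can occur in your data, a minimal guard (a sketch to place between the merge and the groupby above) is:
merged = merged[merged["time"] <= merged["end_time"]]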

Related

split one row into multiple record python

I have an input dataframe as follows:
Class  Duration  StudentID  Age  Startdate   Start Time  Enddate     End Time  TimeDifference
5th    XX        20002      5    04/12/2021  17:00:00    04/14/2021  20:00:00  3000
I would like to split it into three rows based on the start and end dates, as follows:
Class  Duration  StudentID  Age  Startdate   Start Time  Enddate     End Time  TimeDifference
5th    XX        20002      5    04/12/2021  17:00:00    04/12/2021  23:59:59  360
5th    XX        20002      5    04/13/2021  0:00:00     04/13/2021  23:59:59  1440
5th    XX        20002      5    04/14/2021  0:00:00     04/14/2021  20:00:00  1200
I am trying with python. Please help.
I get a slightly different value for TimeDifference, but this is an approach you can tweak and use.
Step 1:
You can start by using melt() with id_vars set to all your columns except 'Startdate' and 'Enddate'.
Step 2:
Then you can set your index to the StartEndDate column, created after melting your dataframe.
Step 3:
Then, using reindex(), you can add new rows for the missing dates.
Lastly, what's left is to calculate the TimeDifference column and rearrange your dataframe to get your final output.
I assume your dataframe is called df:
# Step 1
ids = [c for c in df.columns if c not in ['Startdate','Enddate']]
new = df.melt(id_vars=ids,value_name = 'StartEndDate').drop('variable',axis=1)
new.loc[new.StartEndDate.isin(df['Startdate'].tolist()),'Start Time'] = "00:00"
print(new)
Class Duration StudentID Age Start Time End Time TimeDifference \
0 5th XX 20002 5 00:00 20:00 3000
1 5th XX 20002 5 17:00 20:00 3000
StartEndDate
0 04/12/2021
1 04/14/2021
# Step 2
new['StartEndDate'] = pd.to_datetime(new['StartEndDate']).dt.date
new.set_index(pd.DatetimeIndex(new.StartEndDate),inplace=True)
# Step 3
final = new.reindex(pd.date_range(new.index.min(),new.index.max()), method='ffill').reset_index()\
.rename({'index':'Startdate'},axis=1).drop('StartEndDate',axis=1)
final['Enddate'] = final['Startdate']
final['TimeDifference'] = (final['End Time'].str[:2].astype(int) - final['Start Time'].str[:2].astype(int))*60
final = final[['Class','Duration','StudentID','Age','Startdate','Start Time','Enddate','End Time','TimeDifference']]
Prints:
Class Duration StudentID Age Startdate Start Time Enddate End Time \
0 5th XX 20002 5 2021-04-12 00:00 2021-04-12 20:00
1 5th XX 20002 5 2021-04-13 00:00 2021-04-13 20:00
2 5th XX 20002 5 2021-04-14 17:00 2021-04-14 20:00
TimeDifference
0 1200
1 1200
2 180
I think some information is missing from your question, so I would suggest running this line by line and making the necessary adjustments to suit your task.
The code below worked for me; df is the source dataframe and df1 is the result.
import datetime
import numpy as np
import pandas as pd

# one ReportDate per calendar day between Startdate and Enddate, then one row per day
df['ReportDate'] = [pd.date_range(x, y) for x, y in zip(df['Startdate'], df['Enddate'])]
df1 = df.explode('ReportDate')
# keep the original times only on the first and last day
df1['RptStart'] = np.where(df1['ReportDate'] == df1['Startdate'], df1['StartTime'], datetime.time(0, 0, 0))
df1['RptEnd'] = np.where(df1['Enddate'] == df1['ReportDate'], df1['EndTime'], datetime.time(23, 59, 59))
# pd.datetime has been removed from pandas; combine via the datetime module instead
df1['StartDtTm'] = df1.apply(lambda r: datetime.datetime.combine(r['ReportDate'], r['RptStart']), axis=1)
df1['EndDtTm'] = df1.apply(lambda r: datetime.datetime.combine(r['ReportDate'], r['RptEnd']), axis=1)
# duration in minutes
df1['Duration'] = ((df1['EndDtTm'] - df1['StartDtTm']).dt.total_seconds() / 60).round()
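For reference, a minimal input matching the question's row (a sketch that assumes the date columns hold Timestamps and the time columns hold datetime.time objects, which the np.where calls above expect):
import datetime
import pandas as pd

df = pd.DataFrame({'Class': ['5th'], 'Duration': ['XX'], 'StudentID': [20002], 'Age': [5],
                   'Startdate': pd.to_datetime(['04/12/2021']),
                   'StartTime': [datetime.time(17, 0)],
                   'Enddate': pd.to_datetime(['04/14/2021']),
                   'EndTime': [datetime.time(20, 0)]})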

Pandas cumulative sum if between certain times/values

I want to insert a new column called total in final_df which is a cumulative sum of value in df if it occurs between the times in final_df. It sums the values that occur between the start and end in final_df. So, for example, during the time range 01:30 to 02:00 in final_df, both index 0 and 1 in df occur within this range, so the total is 15 (10 + 5).
I have two pandas dataframes:
df
import pandas as pd
d = {'start_time': ['01:00','00:00','00:30','02:00'],
'end_time': ['02:00','03:00','01:30','02:30'],
'value': ['10','5','20','5']}
df = pd.DataFrame(data=d)
final_df
final_d = {'start_time': ['00:00', '00:30', '01:00', '01:30', '02:00', '02:30'],
           'end_time': ['00:30', '01:00', '01:30', '02:00', '02:30', '03:00']}
final_df = pd.DataFrame(data=final_d)
The output I want for final_df:
start_time  end_time  total
00:00       00:30     5
00:30       01:00     25
01:00       01:30     35
01:30       02:00     15
02:30       03:00     10
My try
final_df['total'] = final_df.apply(lambda x: df.loc[(df['start_time'] >= x.start_time) &
(df['end_time'] <= x.end_time), 'value'].sum(), axis=1)
Problem 1
I get the error: TypeError: ("'>=' not supported between instances of 'str' and 'datetime.time'", 'occurred at index 0')
I converted the relevant columns to datetime as follows:
df[['start_time','end_time']] = df[['start_time','end_time']].apply(pd.to_datetime, format='%H:%M')
final_df[['start_time','end_time']] = final_df[['start_time','end_time']].apply(pd.to_datetime, format='%H:%M:%S')
But I don't want to convert to datetime. Is there a way around this?
Problem 2
The sum is not working properly; it only finds exact matches on the time range. So the output is:
start_time  end_time  total
00:00       00:30     0
00:30       01:00     0
01:00       01:30     0
01:30       02:00     0
02:30       03:00     5
One way to avoid apply could be like this:
df_ = (df.rename(columns={'start_time': 1, 'end_time': -1})  # to use in the calculation later
         .rename_axis(columns='mult')  # mostly for aesthetics
         .set_index('value').stack()  # reshape the data
         .reset_index(name='time')  # put the index back as columns
      )
df_ = (df_.set_index(pd.to_datetime(df_['time'], format='%H:%M'))  # to use resampling
          .assign(total=lambda x: x['value'].astype(float)*x['mult'])  # plus or minus the value depending on start/end
          .resample('30T')[['total']].sum()  # get the sum at the 30-minute bounds
          .cumsum()  # cumulative sum from the beginning
      )
# create the column to merge with final_df
df_['start_time'] = df_.index.strftime('%H:%M')
# merge
final_df = final_df.merge(df_)
and you get
print (final_df)
start_time end_time total
0 00:00 00:30 5.0
1 00:30 01:00 25.0
2 01:00 01:30 35.0
3 01:30 02:00 15.0
4 02:00 02:30 10.0
5 02:30 03:00 5.0
But if you want to use apply, first you need to ensure the columns have the right dtypes; also note that you wrote the inequalities in the reverse order:
df['start_time'] = pd.to_datetime(df['start_time'], format='%H:%M')
df['end_time'] = pd.to_datetime(df['end_time'], format='%H:%M')
df['value'] = df['value'].astype(float)
final_df['start_time'] = pd.to_datetime(final_df['start_time'], format='%H:%M')
final_df['end_time'] = pd.to_datetime(final_df['end_time'], format='%H:%M')
final_df.apply(
    lambda x: df.loc[(df['start_time'] <= x.start_time) &  # note the reversed inequalities
                     (df['end_time'] >= x.end_time), 'value'].sum(), axis=1)
0 5.0
1 25.0
2 35.0
3 15.0
4 10.0
5 5.0
dtype: float64
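A side note on Problem 1: zero-padded 'HH:MM' strings compare lexicographically in chronological order, so if you would rather not convert to datetime at all, the same (corrected) apply works on the raw strings; only value needs a numeric dtype. A minimal sketch on the original string columns:
df['value'] = df['value'].astype(float)
final_df['total'] = final_df.apply(
    lambda x: df.loc[(df['start_time'] <= x.start_time) &
                     (df['end_time'] >= x.end_time), 'value'].sum(), axis=1)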

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here, so please excuse my question format:
I have an event time-series dataset covering two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with a column col as well as a datetime column.
You can simply sort by the column and take the first 10 rows:
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and there is indeed an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
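Since the question asks for the 10 lowest per week rather than overall, one option (a sketch, reusing the datetime column from above) is to group by week before taking nsmallest:
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime').groupby(pd.Grouper(freq='W'))['col'].nsmallest(10)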
My bad; so your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events each week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, using n_events as the values (pandas.DataFrame.pivot_table). You'll then have one datetime index and 24 hour columns, with the values denoting the number of events.
...
Date hour0 ... hour8 hour9 hour10 ... hour24
2020-06-06 0 3 3 2 0
...
Then you can resample it to a weekly level, aggregating with sum:
df.resample('W').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the printed output:
for row in df.itertuples():
    print(sorted(row[1:]))
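Putting those steps together, a rough end-to-end sketch (assuming columns named date and n_events, as in the sample above):
import pandas as pd

# hour-of-day column
df['hour'] = df['date'].dt.hour
# one row per day, one column per hour of day, values = number of events
pivot = df.pivot_table(index=df['date'].dt.normalize(), columns='hour',
                       values='n_events', aggfunc='sum', fill_value=0)
# weekly totals for each hour-of-day column
weekly = pivot.resample('W').sum()
# the 10 hour-of-day slots with the fewest events in each week
for week, row in weekly.iterrows():
    print(week.date(), row.nsmallest(10).index.tolist())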

Python/Pandas - TypeError when concatenating MultiIndex DataFrames

I have trouble concatenating a list of MultiIndex DataFrames with two levels while adding a third level to distinguish them.
As an example, I have the following input data.
import pandas as pd
import numpy as np
# Input data
start = '2020-01-01 00:00+00:00'
end = '2020-01-01 02:00+00:00'
pr1h = pd.period_range(start=start, end=end, freq='1h')
midx1 = pd.MultiIndex.from_tuples([('Sup',1),('Sup',2),('Inf',1),('Inf',2)], names=['Data','Position'])
df1 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
df3 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
midx2 = pd.MultiIndex.from_tuples([('Sup',3),('Inf',3)], names=['Data','Position'])
df2 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
df4 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
So df1 & df2 have data for the same tag 1h, and while they have the same column names at the Data level, they don't have the same column names at the Position level.
df1
Data Sup Inf
Position 1 2 1 2
2020-01-01 00:00 0.660795 0.538452 0.861801 0.502479
2020-01-01 01:00 0.205806 0.847124 0.474861 0.906546
2020-01-01 02:00 0.681480 0.479512 0.631771 0.961844
df2
Data Sup Inf
Position 3 3
2020-01-01 00:00 0.758533 0.672899
2020-01-01 01:00 0.096463 0.304843
2020-01-01 02:00 0.080504 0.990310
Now, df3 and df4 follow the same logic and same column names. To distinguish them from df1 & df2, I want to use a different tag, 2h for instance.
I want to add this third level, named Period, during the call to pd.concat. For this, I am trying to use the keys parameter of pd.concat(). I tried the following code.
df_list = [df1, df2, df3, df4]
period_list = ['1h', '1h', '2h', '2h']
concatenated = pd.concat(df_list, keys=period_list, names=('Period', 'Data', 'Position'), axis=1)
But this raises the following error.
TypeError: int() argument must be a string, a bytes-like object or a number, not 'slice'
Please, any idea what the correct call for this is?
Thanks for your help.
EDIT 05/05: As requested, here is the desired result (copied directly from the answer given; the result obtained from that answer is the one I am looking for).
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.309778 0.597582 0.872392 0.983021 0.659965 0.214953
2020-01-01 01:00 0.467403 0.875744 0.296069 0.131291 0.203047 0.382865
2020-01-01 02:00 0.842818 0.659036 0.595440 0.436354 0.224873 0.114649
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.356250 0.587131 0.149471 0.171239 0.583017 0.232641
2020-01-01 01:00 0.397165 0.637952 0.372520 0.002407 0.556518 0.523811
2020-01-01 02:00 0.548816 0.126972 0.079793 0.235039 0.350958 0.705332
A quick fix would be to use unique names in period_list (the duplicated keys appear to be what trips up pd.concat) and rename just after the concat. Something like:
df_list = [df1, df2, df3, df4]
period_list = ['1h_a', '1h_b', '2h_a', '2h_b']
concatenated = pd.concat(df_list,
keys=period_list,
names=('Period', 'Data', 'Position'),
axis=1)\
.rename(columns={col:col.split('_')[0] for col in period_list},
level='Period')
print (concatenated)
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.309778 0.597582 0.872392 0.983021 0.659965 0.214953
2020-01-01 01:00 0.467403 0.875744 0.296069 0.131291 0.203047 0.382865
2020-01-01 02:00 0.842818 0.659036 0.595440 0.436354 0.224873 0.114649
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.356250 0.587131 0.149471 0.171239 0.583017 0.232641
2020-01-01 01:00 0.397165 0.637952 0.372520 0.002407 0.556518 0.523811
2020-01-01 02:00 0.548816 0.126972 0.079793 0.235039 0.350958 0.705332
Edit: as speed is a concern and rename seems to be slow, you can instead rebuild the columns directly:
concatenated = pd.concat(df_list,
keys=period_list,
axis=1)
concatenated.columns = pd.MultiIndex.from_tuples([(col[0].split('_')[0], col[1], col[2])
for col in concatenated.columns],
names=('Period', 'Data', 'Position'), )
Consider an inner concat on the similar data frames, then run a final concat to bind everything together:
concatenated = pd.concat([pd.concat([df1, df2], axis=1),
pd.concat([df3, df4], axis=1)],
keys = ['1h', '2h'],
names=('Period', 'Data', 'Position'),
axis=1)
print(concatenated)
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.189802 0.675083 0.624484 0.781774 0.453101 0.224525
2020-01-01 01:00 0.249818 0.829180 0.190488 0.923107 0.495873 0.278201
2020-01-01 02:00 0.602634 0.494915 0.612672 0.903609 0.426809 0.248981
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.746499 0.385714 0.008561 0.961152 0.988231 0.897454
2020-01-01 01:00 0.643730 0.365023 0.812249 0.291733 0.045417 0.414968
2020-01-01 02:00 0.887567 0.680102 0.978388 0.018501 0.695866 0.679730

Count String Values in Column across 30 Minute Time Bins using Pandas

I am looking to determine the count of string variables in a column across a 3 month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30 minute intervals (ex. 0500-0600, 0600-0630) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure out how to group the data on anything other than 'hour', and I don't get fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I have researched multi-indexing and resampling at some length over the past two days, but I have been unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to apply a TimeGrouper to the time component alone, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
.size()
.unstack(fill_value=0))
v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1
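If you want the index rendered like the desired output (0500, 0530, ...), an optional final step is to format the datetime.time index values:
out = v.groupby(v.index.time).sum()
out.index = [t.strftime('%H%M') for t in out.index]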
