Efficient way to count/sum rows in dataframe based on conditions - python

I'm working with a large flight delay dataset trying to predict the flight delay based on multiple new features. Based on a plane's tailnumber, I want to count the number of flights and sum the total airtime the plane has done in the past X (to be specified) hours/days to create a new "usage" variable.
Example of the data (some columns excluded):
ID tail_num deptimestamp dep_delay distance air_time
2018-11-13-1659_UA2379 N14118 13/11/2018 16:59 -3 2425 334
2018-11-09-180_UA275 N13138 09/11/2018 18:00 -3 2454 326
2018-06-04-1420_9E3289 N304PQ 04/06/2018 14:20 -2 866 119
2018-09-29-1355_WN3583 N8557Q 29/09/2018 13:55 -5 762 108
2018-05-03-815_DL2324 N817DN 03/05/2018 08:15 0 1069 138
2018-01-12-1850_NK347 N635NK 12/01/2018 18:50 100 563 95
2018-09-16-1340_OO4721 N242SY 16/09/2018 13:40 -3 335 61
2018-06-06-1458_DL2935 N351NB 06/06/2018 14:58 1 187 34
2018-06-25-1030_B61 N965JB 25/06/2018 10:30 48 1069 143
2018-12-06-1215_MQ3617 N812AE 06/12/2018 12:15 -9 427 76
Example output for give = 'all' (not based on example data):
2018-12-31-2240_B61443 (1, 152.0, 1076.0, 18.0)
I've written a function, applied to each row, that filters the dataframe for flights with the same tail number within the specified time frame and then gives back either the number of flights/total airtime or a dataframe containing the flights in question. It works but takes a long time (around 3 hours when calculating for a subset of 400k flights while filtering against the entire dataset of over 7m rows). Is there a way to speed this up?
def flightsbefore(ID,
                  give='number',
                  direction='before',
                  seconds=0,
                  minutes=0,
                  hours=0,
                  days=0,
                  weeks=0,
                  months=0,
                  years=0):
    """Takes the ID of a flight and a time unit to return the flights of that plane within that timeframe."""
    tail_num = dfallcities.loc[ID, 'tail_num']
    date = dfallcities.loc[ID].deptimestamp
    # dfallcities1 = dfallcities[(dfallcities.a != -1) & (dfallcities.b != -1)]
    if direction == 'before':
        timeframe = date - datetime.timedelta(seconds=seconds,
                                              minutes=minutes,
                                              hours=hours,
                                              days=days,
                                              weeks=weeks)
        output = dfallcities[(dfallcities.tail_num == tail_num) &
                             (dfallcities.deptimestamp >= timeframe) &
                             (dfallcities.deptimestamp < date)]
    else:
        timeframe = date + datetime.timedelta(seconds=seconds,
                                              minutes=minutes,
                                              hours=hours,
                                              days=days,
                                              weeks=weeks)
        output = dfallcities[(dfallcities.tail_num == tail_num) &
                             (dfallcities.deptimestamp <= timeframe) &
                             (dfallcities.deptimestamp >= date)]

    if give == 'number':
        return output.shape[0]
    elif give == 'all':
        if output.empty:
            prev_delay = 0
        else:
            prev_delay = np.max((output['dep_delay'].iloc[-1], 0))
        return (output.shape[0], output['air_time'].sum(), output['distance'].sum(), prev_delay)
    elif give == 'flights':
        return output.sort_values('deptimestamp')
    else:
        raise ValueError("give must be one of [number, all, flights]")
No errors but simply very slow
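One direction that might avoid the per-row filtering is a per-plane, time-based rolling window. This is only a minimal sketch on a toy frame (the 24h window, the closed='left' behaviour and the toy data are assumptions to verify against your pandas version), not the full dfallcities logic:

import pandas as pd

# Toy frame with the same column names as in the question.
df = pd.DataFrame({
    'tail_num': ['N14118', 'N14118', 'N13138'],
    'deptimestamp': pd.to_datetime(['2018-11-13 04:00',
                                    '2018-11-13 16:59',
                                    '2018-11-09 18:00']),
    'air_time': [120, 334, 326],
})

# Sort each plane's flights by departure time and index by the timestamp,
# then aggregate over a 24h window per tail number in one pass.
df = df.sort_values(['tail_num', 'deptimestamp']).set_index('deptimestamp')
g = df.groupby('tail_num')['air_time']

# closed='left' excludes the current flight, so only strictly earlier
# departures inside the window are counted/summed.
prev_flights = g.rolling('24h', closed='left').count()
prev_air_time = g.rolling('24h', closed='left').sum()

usage = pd.DataFrame({'prev_flights': prev_flights,
                      'prev_air_time': prev_air_time}).fillna(0)
print(usage)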

Related

split login session into shift buckets

I have a table of logins and logouts by user.
The table looks like this but has a few hundred thousand rows:
data = [['aa', '2020-05-31 00:00:01', '2020-05-31 00:00:31'],
['bb','2020-05-31 00:01:01', '2020-05-31 00:02:01'],
['aa','2020-05-31 00:02:01', '2020-05-31 00:06:03'],
['cc','2020-05-31 00:03:01', '2020-05-31 00:04:01'],
['dd','2020-05-31 00:04:01', '2020-05-31 00:34:01'],
['aa', '2020-05-31 00:05:01', '2020-05-31 00:07:31'],
['bb','2020-05-31 00:05:01', '2020-05-31 00:06:01'],
['aa','2020-05-31 22:05:01', '2020-06-01 09:08:03'],
['cc','2020-05-31 22:10:01', '2020-06-01 09:40:01'],
['dd','2020-05-31 00:20:01', '2020-05-31 15:35:01']]
df_test = pd.DataFrame(data, columns=['user_id','login', 'logout'], dtype='datetime64[ns]')
I need to be able to tell how much time each session spent in 4 different shifts:
night (12am to 6 am), morning (6am to 12pm), afternoon (12pm to 6pm), evening(6pm to 12am)
I was able to solve this (code below), but some sessions span multiple days, and if a session starts at 10pm and ends at 9am the next day, my script won't properly allocate the time.
I'm not sure if there is a proper algorithm for this kind of problem in Python.
Here is my code:
shifting = df_test.copy()

# extracting day from each datetime. We will use it to dynamically create shifts for each loop iteration
shifting['day'] = shifting['login'].dt.floor("D")

# adding 4 empty columns to the data, 1 for each shift
shifting['night'] = ''
shifting['morning'] = ''
shifting['afternoon'] = ''
shifting['evening'] = ''

# writing logic to properly split time between shifts if needed
def time_in_shift(start, end, shift_start, shift_end):
    """
    Properly splits time between shifts if needed.
    The logic is as follows: if the user logs in before the shift's start time, the shift's start time takes the place of the login time;
    if the user logs out after the shift's end time, the shift's end time takes the place of the logout time. This logic is not perfect as sessions can span
    multiple days. This function accounts for that by splitting the time equally into 4 if a session is longer than 24h. Need a bit more time to figure out the rest.

    Args:
        start (datetime): login timestamp.
        end (datetime): logout timestamp.
        shift_start (datetime): start time of a shift.
        shift_end (datetime): end time of a shift.

    Returns:
        hours spent in each shift (numeric)
    """
    # first condition: if the session is longer than 24h -> split evenly between 4 shifts
    if (end - start).total_seconds() / 3600 > 24:
        return (end - start).total_seconds() / 3600 / 4
    # if not -> follow the logic outlined in the description of this function
    else:
        if start < shift_start:
            start = shift_start
        if end > shift_end:
            end = shift_end
        # calculating time spent in the session here (in hours)
        time_spent = (end - start).total_seconds() / 3600
        # negative hours means that no time was spent in that shift -> turn to 0
        if time_spent < 0:
            time_spent = 0
        return time_spent

# applying the time_in_shift function to each row of the connections dataset (now shifting)
for i in shifting.index:
    # dynamically creating shifts for each session. Must be done because dates are always different.
    shift_start = (shifting.loc[i, 'day'],
                   shifting.loc[i, 'day'] + timedelta(hours=6),
                   shifting.loc[i, 'day'] + timedelta(hours=12),
                   shifting.loc[i, 'day'] + timedelta(hours=18))
    shift_end = (shift_start[1],
                 shift_start[2],
                 shift_start[3],
                 shift_start[0] + timedelta(days=1))
    # range here corresponds to 4 shifts
    for shift in range(4):
        # storing time in the shift_time variable
        shift_time = time_in_shift(shifting.loc[i, 'login'], shifting.loc[i, 'logout'], shift_start[shift], shift_end[shift])
Please let me know if you know how to do this better.
Thanks in advance!
If I'm understanding correctly, you are trying to bin shift hours?
df = pd.DataFrame(data, columns=['user_id', "login", "logout"], dtype="datetime64[ns]")
df["delta_hours"] = (df["logout"] - df["login"]).dt.seconds / 3600
bins = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]
df = (
    df
    .groupby(["user_id", pd.cut(df["login"].dt.hour, bins=bins, labels=labels, right=False)])["delta_hours"]
    .sum()
    .unstack()
    .rename_axis(None, axis=1)
    .reset_index()
)
print(df)
user_id night morning afternoon evening
0 aa 0.117222 0.0 0.0 11.050556
1 bb 0.033333 0.0 0.0 0.000000
2 cc 0.016667 0.0 0.0 11.500000
3 dd 15.750000 0.0 0.0 0.000000
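For sessions that cross midnight or several shift boundaries, another option is to clip each session against every 6-hour shift interval it overlaps, day by day. A rough, illustrative sketch of that idea (my own, not part of the answer above; assumes the df_test frame from the question):

import pandas as pd

shift_labels = ['night', 'morning', 'afternoon', 'evening']

def hours_per_shift(login, logout):
    """Split one session into hours spent in each shift, across days."""
    out = dict.fromkeys(shift_labels, 0.0)
    day = login.floor('D')
    while day < logout:
        for k, label in enumerate(shift_labels):
            shift_start = day + pd.Timedelta(hours=6 * k)
            shift_end = shift_start + pd.Timedelta(hours=6)
            # overlap of [login, logout] with [shift_start, shift_end]
            overlap = min(logout, shift_end) - max(login, shift_start)
            out[label] += max(overlap.total_seconds(), 0) / 3600
        day += pd.Timedelta(days=1)
    return pd.Series(out)

result = df_test.join(df_test.apply(
    lambda r: hours_per_shift(r['login'], r['logout']), axis=1))
print(result)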

Identify the channels that increase more than 10% against the data of last week

I have a large data frame across different timestamps. Here is my attempt:
all_data = []
for ws in wb.worksheets():
    rows = ws.get_all_values()
    df_all_data = pd.DataFrame.from_records(rows[1:], columns=rows[0])
    all_data.append(df_all_data)
data = pd.concat(all_data)
#Change data type
data['Year'] = pd.DatetimeIndex(data['Week']).year
data['Month'] = pd.DatetimeIndex(data['Week']).month
data['Week'] = pd.to_datetime(data['Week']).dt.date
data['Application'] = data['Application'].astype('str')
data['Function'] = data['Function'].astype('str')
data['Service'] = data['Service'].astype('str')
data['Channel'] = data['Channel'].astype('str')
data['Times of alarms'] = data['Times of alarms'].astype('int')
#Compare Channel values over weeks
subchannel_df = data.pivot_table('Times of alarms', index = 'Week', columns='Channel', aggfunc='sum').fillna(0)
subchannel_df = subchannel_df.sort_index(axis=1)
The data frame I am working on
What I hope to achieve:
add a percentage row (the last row vs the second-to-last row) at the end of the data frame, excluding cases such as division by zero and negative percentages
show those channels which increased more than 10% compared against last week.
I have been trying different methods to achieve this for days. However, I have not managed to do it. Thank you in advance.
You could use the shift function as an equivalent to the LAG window function in SQL to return last week's value, and then perform the calculations at row level. To avoid dividing by zero you can use numpy's where function, which is equivalent to CASE WHEN in SQL. Let's say the column on which you perform the calculations is named "X":
subchannel_df["XLag"] = subchannel_df["X"].shift(periods=1).fillna(0).astype('int')
subchannel_df["ChangePercentage"] = np.where(subchannel_df["XLag"] == 0, 0, (subchannel_df["X"]-subchannel_df["XLag"])/subchannel_df["XLag"])
subchannel_df["ChangePercentage"] = (subchannel_df["ChangePercentage"]*100).round().astype("int")
subchannel_df[subchannel_df["ChangePercentage"]>10]
Output:
Channel X XLag ChangePercentage
Week
2020-06-12 12 5 140
2020-11-15 15 10 50
2020-11-22 20 15 33
2020-12-13 27 16 69
2020-12-20 100 27 270
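If the goal is to flag every channel (every column of subchannel_df) at once rather than a single named column, a hedged sketch along the same lines, starting from the original pivot table before the X/XLag columns are added, might look like this (variable names follow the question and are otherwise assumptions):

import numpy as np
import pandas as pd

last_week = subchannel_df.iloc[-2]
this_week = subchannel_df.iloc[-1]

# Where last week was 0, report 0% change to avoid dividing by zero.
change_pct = np.where(last_week == 0, 0,
                      (this_week - last_week) / last_week * 100)
change_pct = pd.Series(change_pct, index=subchannel_df.columns)

# Channels that increased by more than 10% against last week.
print(change_pct[change_pct > 10].round())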

Finding one to many matches in two Pandas Dataframes

I am attempting to put together a generic matching process for financial data. The goal is to take one set of data with larger transactions and match it to a set of data with smaller transactions. Some are one to many, others are one to one.
There are a few times where it may be reversed, and part of the approach is to feed back the mismatches in inverse order to capture those possible matches.
I have created three different modules that iterate across each other to complete the work, but I am not getting consistent results. I see possible matches in my data that should be picked up but are not.
There are no clear matching criteria either, so the assumption is that if I put the datasets in date order and look for matching values, I want to take the first match since it should be closer to the same timeframe.
I am using Pandas and Itertools, but maybe not in the ideal format. Any help to get consistent matches would be appreciated.
Data examples:
Large Transaction Size:
AID AIssue Date AAmount
1508 3/14/2018 -560
1506 3/27/2018 -35
1500 4/25/2018 5000
Small Transaction Size:
BID BIssue Date BAmount
1063 3/6/2018 -300
1062 3/6/2018 -260
839 3/22/2018 -35
423 4/24/2018 5000
Expected Results
AID AIssue Date AAmount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
1500 4/25/2018 5000 423 4/24/2018 5000
but I usually get
AID AIssue Date AAmount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
with the 5000 not matching. And this is one example, but the positive/negative sign does not appear to be the factor when looking at the larger data set.
When reviewing the unmatched results from each, I see at least one $5000 transaction I would expect to be a 1-1 match and it is not in the results.
def matches(iterable):
    s = list(iterable)
    # Only going to 5 matches to avoid memory overrun on large datasets
    s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))
    return [list(elem) for elem in s]

def one_to_many(dfL, dfS, dID=0, dDT=1, dVal=2):
    # dfL = dataset with larger values
    # dfS = dataset with smaller values
    # dID = column index of ID record
    # dDT = column index of date record
    # dVal = column index of dollar value record
    S = dfS[dfS.columns[dID]].values.tolist()
    S_amount = dfS[dfS.columns[dVal]].values.tolist()
    S = matches(S)
    S_amount = matches(S_amount)
    # get ID of first large record, the ID to be matched in this module
    L = dfL[dfL.columns[dID]].iloc[0]
    # get Value of first large record, this value will be matching criteria
    L_amount = dfL[dfL.columns[dVal]].iloc[0]
    count_of_sets = len(S)
    for a in range(0, count_of_sets):
        list_of_items = S[a]
        list_of_values = S_amount[a]
        if round(sum(list_of_values), 2) == round(L_amount, 2):
            break
    if round(sum(list_of_values), 2) == round(L_amount, 2):
        retVal = list_of_items
    else:
        retVal = [-1]
    return retVal

def iterate_one_to_many(dfLarge, dfSmall, dID=0, dDT=1, dVal=2):
    # dfLarge = dataset with larger values
    # dfSmall = dataset with smaller values
    # dID = column index of ID record
    # dDT = column index of date record
    # dVal = column index of dollar value record
    # returns a list of dataframes [paired matches, unmatched from dfLarge, unmatched from dfSmall]
    dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
    dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()
    end_row = len(dfLarge.columns[dID]) - 1
    matches_master = pd.DataFrame(data=None, columns=dfLarge.columns.append(dfSmall.columns))
    for lg in range(0, end_row):
        sm_match_id = one_to_many(dfLarge, dfSmall)
        lg_match_id = dfLarge[dfLarge.columns[dID]][lg]
        if sm_match_id != [-1]:
            end_of_matches = len(sm_match_id)
            for sm in range(0, end_of_matches):
                if sm == 0:
                    sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
                else:
                    sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
            lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()
            sm_match['Match'] = lg
            lg_match['Match'] = lg
            sm_match.set_index('Match', inplace=True)
            lg_match.set_index('Match', inplace=True)
            matches = lg_match.join(sm_match, how='left')
            matches_master = matches_master.append(matches)
            dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()
    return [matches_master, dfLarge, dfSmall]
IIUC, the match is just to find the transaction in the large DataFrame which is on, or is the closest future transaction to, the date of a transaction in the small one. You can use pandas.merge_asof() to perform a match based on the closest date in the future.
import pandas as pd
# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])
merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
                       right_on='AIssue Date', direction='forward')
merged is now:
BID BAmount BIssue Date AID AAmount AIssue Date
0 1063 -300 2018-03-06 1508 -560 2018-03-14
1 1062 -260 2018-03-06 1508 -560 2018-03-14
2 839 -35 2018-03-22 1506 -35 2018-03-27
3 423 5000 2018-04-24 1500 5000 2018-04-25
If you expect some things to never match, you can also throw in a tolerance to restrict the matches to within a smaller window. That way a missing value in one DataFrame doesn't throw everything off.
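For example, a hypothetical 10-day cap on how far forward a match may look (the exact window is yours to choose):

merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
                       right_on='AIssue Date', direction='forward',
                       tolerance=pd.Timedelta('10D'))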
In my module iterate_one_to_many, I was counting my row length incorrectly. I needed to replace
end_row = len(dfLarge.columns[dID]) - 1
with
end_row = len(dfLarge.index)

Using 3 criteria for a Table Lookup Python

Backstory: I'm fairly new to python, and have only ever done things in MATLAB prior.
I am looking to take a specific value from a table based off of data I have.
The data I have is
Temperatures = [0.8,0.1,-0.8,-1.4,-1.7,-1.5,-2,-1.7,-1.7,-1.3,-0.7,-0.2,0.3,1.4,1.4,1.5,1.2,1,0.9,1.3,1.7,1.7,1.6,1.6]
Hour of the Day =
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
This is all data for a Monday.
My Monday table looks like this:
Temp | Hr0 | Hr1 | Hr2 ...
-15 < t <= -10 | 0.01 | 0.02 | 0.06 ...
-10 < t <= -5 | 0.04 | 0.03 | 0.2 ...
with the temperature bands incrementing by 5 up to 30, and the hours of the day up to 23. The values in the table are constants that I would like to look up based on the temperature and hour.
For example, I'd like to be able to say:
print(monday(1,1)) = 0.01
I would also be doing this for everyday of the week for a mass data analysis, thus the need for it to be efficient.
What I've done so far:
So I have stored all of my tables in dictionaries that look kind of like this:
monday_hr0 = [0.01,0.04, ... ]
So first by column then calling them by the temperature value.
What I have now is a bunch of loops that looks like this:
for i in range(0, 365):
    for j in range(0, 24):
        if Day[i] == monday:
            if hr[i + 24*j] == 0:
                if temp[i] == -15:
                    constant.append(monday_hr1[0])
                ...
            if hr[i + 24*j] == 1:
                if temp[i] == -15:
                    constant.append(monday_hr2[0])
                ...
            ...
        elif Day[i] == tuesday:
            if hr[i + 24*j] == 0:
                if temp[i] == -15:
                    constant.append(tuesday_hr1[0])
                ...
            if hr[i + 24*j] == 1:
                if temp[i] == -15:
                    constant.append(tuesday_hr2[0])
                ...
            ...
        ...
I'm basically saying here if it's a monday, use this table. Then if it's this hour use this column. Then if it's this temperature, use this cell. This is VERY VERY inefficient however.
I'm sure there's a quicker way but I can't wrap my head around it. Thank you very much for your help!
Okay, bear with me here, I'm on mobile. I'll try to write up a solution.
I am assuming the following:
you have a dictionary called day_data which contains the table of data for each day of the week.
you have a dictionary called days which maps 0-6 to a day of the week. 0 is monday, 6 is Sunday.
you have a list of temperatures you want something done with
you have a time of the day you want to use to pick out the appropriate data from your day_data. You want to do this for each day of the year.
We should only have to iterate once through all 365 days and once through each hour of the day.
heat_load_days = {}
for day_index in range(1, 365):
    day = days[day_index % 7]
    # day is now the day of the week.
    data = day_data[day]
    Heat_load = []
    for hour in range(24):
        # still unsure on how to select which temperature row from the data table.
        Heat_load.append(day_data_selected)
    heat_load_days[day] = Heat_load
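For the part about selecting the temperature row, one possibility is to turn the 5-degree bands into bin edges and use np.digitize. A sketch under assumptions: the table layout, temp_edges and day_table are illustrative names, not from the post.

import numpy as np

# Band edges -15, -10, ..., 30; row 0 is (-15, -10], row 1 is (-10, -5], etc.
temp_edges = np.arange(-15, 31, 5)

def lookup(day_table, temperature, hour):
    # digitize(..., right=True) gives the band whose upper edge is >= temperature;
    # temperatures at or below -15 or above 30 would need explicit handling.
    row = int(np.digitize(temperature, temp_edges, right=True)) - 1
    return day_table[row][hour]

# Example: the constant for 0.8 degrees at hour 13 of the (assumed) Monday table:
# constant = lookup(day_data['monday'], 0.8, 13)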

Rolling Average to calculate rainfall intensity

I have some real rainfall data recorded as the date and time, and the accumulated number of tips on a tipping bucket rain-gauge. The tipping bucket represents 0.5mm of rainfall.
I want to cycle through the file and determine the variation in intensity (rainfall/time)
So I need a rolling average over multiple fixed time frames:
So I want to accumulate rainfall until 5 minutes of rain is accumulated and determine the intensity in mm/hour. So if 3mm is recorded in 5 min, that is equal to 3/5*60 = 36mm/hr.
the same rainfall over 10 minutes would be 18mm/hr...
So if I have rainfall over several hours I may need to review at several standard intervals of say: 5, 10,15,20,25,30,45,60 minutes etc...
Also, the data is recorded in reverse order in the raw file, so the earliest time is at the end of the file and the latest time step appears first, after a header:
Looks like... (here 975 - 961 = 14 tips = 7mm of rainfall) average intensity 1.4mm/hr
But between 16:27 and 16:34, 967-961 = 6 tips = 3mm in 7 min = 25.7mm/hour
7424 Figtree (O'Briens Rd)
DATE :hh:mm Accum Tips
8/11/2011 20:33 975
8/11/2011 20:14 974
8/11/2011 20:04 973
8/11/2011 20:00 972
8/11/2011 19:35 971
8/11/2011 18:29 969
8/11/2011 16:44 968
8/11/2011 16:34 967
8/11/2011 16:33 966
8/11/2011 16:32 965
8/11/2011 16:28 963
8/11/2011 16:27 962
8/11/2011 15:30 961
Any suggestions?
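For reference, the basic intensity arithmetic described above as a small helper (just a sketch of the calculation, not tied to the file format):

TIP_MM = 0.5  # each bucket tip is 0.5 mm of rain

def intensity_mm_per_hr(tips_start, tips_end, minutes):
    """Average intensity over one interval, from two cumulative tip counts."""
    rainfall_mm = (tips_end - tips_start) * TIP_MM
    return rainfall_mm / minutes * 60.0

print(intensity_mm_per_hr(961, 967, 5))  # 6 tips = 3 mm in 5 min -> 36.0 mm/hr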
I am not entirely sure what it is that you have a question about.
Do you know how to read out the file? You can do something like:
data = []  # Empty list of counts

# Skip the header
lines = [line.strip() for line in open('data.txt')][2::]
for line in lines:
    print line
    date, hour, count = line.split()
    h, m = hour.split(':')
    t = int(h) * 60 + int(m)  # Compute total minutes
    data.append((t, int(count)))  # Append as tuple
data.reverse()
Since your data is cumulative, you need to subtract each pair of consecutive entries; this is where Python's list comprehensions are really nice.
data = [(t1, d2 - d1) for ((t1,d1), (t2, d2)) in zip(data, data[1:])]
print data
Now we need to loop through and see how many entries are within the last x minutes.
timewindow = 10
for i, (t, count) in enumerate(data):
    # Find the entries that happened within the last [...] minutes
    withinwindow = filter(lambda x: x[0] > t - timewindow, data)
    # now you can print out any kind of stats about these "within window" entries
    print sum(count for (t, count) in withinwindow)
Since the time stamps do not come at regular intervals, you should use interpolation to get the most accurate results. This will make the rolling average easier too. I'm using the Interpolate class from this answer in the code below.
from time import strptime, mktime

totime = lambda x: int(mktime(strptime(x, "%d/%m/%Y %H:%M")))

with open("my_file.txt", "r") as myfile:
    # Skip header
    for line in myfile:
        if line.startswith("DATE"):
            break
    times = []
    values = []
    for line in myfile:
        date, time, value = line.split()
        times.append(totime(" ".join((date, time))))
        values.append(int(value))

times.reverse()
values.reverse()
i = Interpolate(times, values)
Now it's just a matter of choosing your intervals and computing the difference between the endpoints of each interval. Let's create a generator function for that:
def rolling_avg(cumulative_lookup, start, stop, step_size, window_size):
    for t in range(start + window_size, stop, step_size):
        total = cumulative_lookup[t] - cumulative_lookup[t - window_size]
        yield total / window_size
Below I'm printing the number of tips per hour in the previous hour with 10 minute intervals:
start = totime("8/11/2011 15:30")
stop = totime("8/11/2011 20:33")

for avg in rolling_avg(i, start, stop, 600, 3600):
    print avg * 3600
EDIT: Made totime return an int and created the rolling_avg generator.
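The linked Interpolate class is not reproduced above; a minimal linear-interpolation lookup along those lines (my own sketch, not the original class) could be:

from bisect import bisect_left

class Interpolate(object):
    """Piecewise-linear lookup: obj[x] interpolates y between known (x, y) points."""

    def __init__(self, x_list, y_list):
        self.x_list = list(x_list)
        self.y_list = list(y_list)

    def __getitem__(self, x):
        # Clamp to the known range, then interpolate between neighbours.
        if x <= self.x_list[0]:
            return self.y_list[0]
        if x >= self.x_list[-1]:
            return self.y_list[-1]
        j = bisect_left(self.x_list, x)
        x0, x1 = self.x_list[j - 1], self.x_list[j]
        y0, y1 = self.y_list[j - 1], self.y_list[j]
        return y0 + (y1 - y0) * (x - x0) / float(x1 - x0)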
