I have two arrays (data frames actually but the function below deals with arrays)
bh_arr = array of bank holiday dates in UK
sales_dates = Sales dates for a few years (millions of rows, I mean really millions)
I want to know for each date in sales_dates, how many days to the next bank holiday (from bh_arr).
I built a function like the one below. It works, but as is evident from the code it is very wasteful: it calculates all the differences first and only then takes the non-negative minimum.
import numpy as np
import pandas as pd

def get_days_to_bh_arr(sales_dates, bh_arr):
    """
    Subtract all elements of bh_arr from each element of sales_dates.
    Get the min ( > 0) for each element of sales_dates.
    Return that array.
    """
    bh_arr = pd.to_datetime(bh_arr)
    res = []
    for each in sales_dates:
        gg = int(np.min([stuff for stuff in (bh_arr - pd.to_datetime(each)) / np.timedelta64(1, 'D') if stuff >= 0]))
        res.append(gg)
    return np.array(res)
bh_arr = ['2018-03-30', '2018-08-27', '2019-05-27']
sales_dates = ['2018-03-15', '2019-05-22', '2018-02-01', '2018-08-05', '2018-06-21']
get_days_to_bh_arr(sales_dates, bh_arr)
15, 5, 57, 22, 67
In the actual code, I finally make the call like so:
sales['days_to_next_bh'] = get_days_to_bh_arr(sales['full_date'], bh['holiday']).astype(np.int32)
Is there a more efficient way of writing the function (of course there is)?
If not, should I try something else like finding the next date from a sorted 'bh_arr', for each date in 'sales_dates' and only at the end do the subtraction? How would I make that work?
Could I vectorise that instead of looping?
Any guidance would be much appreciated.
In pandas this can be done with pd.merge_asof to bring in the closest future bank holiday for each sale date. Then datetime subtraction gives the days in between.
Because merge_asof requires sorting, we need to reset the index so that we can maintain the original ordering after the merge.
import pandas as pd

df_b = pd.DataFrame({'bh': pd.to_datetime(bh_arr)})
df_s = pd.DataFrame({'sd': pd.to_datetime(sales_dates)})

df_s = (pd.merge_asof(df_s.reset_index().sort_values('sd'),
                      df_b.sort_values('bh'),
                      direction='forward',
                      left_on='sd',
                      right_on='bh')
          .sort_values('index'))

arr = (df_s['bh'] - df_s['sd']).dt.days.to_numpy()
# array([15,  5, 57, 22, 67], dtype=int64)
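As to the follow-up question about using a sorted 'bh_arr' directly: the same forward lookup can be done at the NumPy level with np.searchsorted, doing the subtraction only once at the end. A minimal sketch, assuming (like the original function) that every sale date has some bank holiday on or after it:

import numpy as np
import pandas as pd

def days_to_next_bh(sales_dates, bh_arr):
    # Sort the bank holidays once, then binary-search each sale date into them
    bh = np.sort(pd.to_datetime(bh_arr).values)
    sd = pd.to_datetime(sales_dates).values
    # side='left' so a sale falling on a bank holiday maps to that holiday (0 days)
    idx = np.searchsorted(bh, sd, side='left')
    # Subtract only once, at the end, and convert the timedeltas to whole days
    return ((bh[idx] - sd) / np.timedelta64(1, 'D')).astype(np.int64)

days_to_next_bh(sales_dates, bh_arr)
# array([15,  5, 57, 22, 67])

On millions of rows this avoids the Python-level loop entirely: the holiday list is sorted once and each sale date costs a single binary search.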
Related
I'm working with field change history data which has timestamps for when the field value was changed. In this example, I need to calculate the overall case duration in 'Termination in Progress' status.
The given case was changed from and to this status three times in total:
see screenshot
I need to add up all three durations for this case; in other cases there can be more or fewer than three.
Does anyone know how to calculate that in Python?
Welcome to Stack Overflow!
Based on the limited data you provided, here is a solution that should work, although the code makes some assumptions that could cause errors, so you will want to modify it to suit your needs. I avoided list comprehensions and array math to keep it clear, since you said you're new to Python.
Assumptions:
You're pulling this data into a pandas dataframe
All Old values of "Termination in Progress" have a matching new value for all Case Numbers
import datetime
import pandas as pd
import numpy as np

fp = r'<PATH TO FILE>\\'
f = '<FILENAME>.csv'
data = pd.read_csv(fp + f)

# convert the timestamp to datetime for the time delta calculations later
data['Edit Date'] = pd.to_datetime(data['Edit Date'])

# sort by case number, and by date descending within each case, so Old and New values align properly
data.sort_values(by=['CaseNumber', 'Edit Date'], ascending=[True, False], inplace=True)

# find timestamps where 'Termination in progress' occurs
old_val_ts = data.loc[data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
new_val_ts = data.loc[data['New Value'] == 'Termination in progress']['Edit Date'].to_list()

# loop over the timestamps and calc the time delta
ts_deltas = list()
for i in range(len(old_val_ts)):
    item = old_val_ts[i] - new_val_ts[i]
    ts_deltas.append(item)

# this loop could also be accomplished with list comprehension like this:
# ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]

print('Deltas between groups')
print(ts_deltas)
print()

# sum the time deltas
total_ts_delta = sum(ts_deltas, datetime.timedelta())

print('Total Time Delta')
print(total_ts_delta)
Deltas between groups
[Timedelta('0 days 00:08:00'), Timedelta('0 days 00:06:00'), Timedelta('0 days 02:08:00')]
Total Time Delta
0 days 02:22:00
I've also attached a picture of the solution minus my file path for obvious reasons. Hope this helps. Please remember to mark as correct if this solution works for you. Otherwise let me know what issues you run into.
EDIT:
If you have multiple case numbers you want to look at, you could do it in various ways, but the simplest is to get a list of unique case numbers with data['CaseNumber'].unique(), then iterate over that array, filtering for each case number and appending the total time delta to a new list or a dictionary (not necessarily the most efficient solution, but it will work).
cases_total_td = {}
unique_cases = data['CaseNumber'].unique()

for case in unique_cases:
    temp_data = data[data['CaseNumber'] == case]

    # find timestamps where 'Termination in progress' occurs for this case
    old_val_ts = temp_data.loc[temp_data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
    new_val_ts = temp_data.loc[temp_data['New Value'] == 'Termination in progress']['Edit Date'].to_list()

    # calc the time delta between each matching Old/New pair
    ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]

    # sum the time deltas for this case
    total_ts_delta = sum(ts_deltas, datetime.timedelta())
    cases_total_td[case] = total_ts_delta

print(cases_total_td)
{1005222: Timedelta('0 days 02:22:00')}
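As a side note, a more pandas-idiomatic way to get the same per-case totals (a sketch, assuming the same column names and that every 'Old Value' match has a corresponding 'New Value' match within each case) is to group by case number once instead of filtering inside a loop:

def case_durations(data):
    # Per case: pair the timestamps at which the status was entered (New Value)
    # with those at which it was left (Old Value) and sum the differences
    def total_in_status(grp):
        entered = grp.loc[grp['New Value'] == 'Termination in progress', 'Edit Date'].sort_values()
        left = grp.loc[grp['Old Value'] == 'Termination in progress', 'Edit Date'].sort_values()
        return (left.to_numpy() - entered.to_numpy()).sum()
    return data.groupby('CaseNumber').apply(total_in_status)

# case_durations(data) returns a Series indexed by CaseNumber holding the Timedelta totals.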
It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:
date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...
I want to perform linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be with dates not being float values. I would appreciate help on how to resolve this indexing problem in Pandas.
My current (non-working, but compiling code):
import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm
df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula = 'sales ~ date', data = city_data).fit()
As I vary the city value, I get R^2 = 1 results, which is wrong. I have also attempted index_col=0, parse_dates=True when defining the dataframe df, but without success.
I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!
Note, with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:
'\xef\xbb\xbf2014-04-28'
How does one produce an AIC analysis over all of the non-sales parameters? (e.g. the result might be that sales depend most linearly on date and city).
For this kind of regression, I usually convert the dates or timestamps to an integer number of days since the start of the data.
This does the trick nicely:
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()
The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day of year trend. Picking your own 0 date puts you in control of all that.
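For example, if the day-of-year trend is what you are after, a small sketch of the Jan 1st variant (assuming df['date'] has already been parsed to datetime as above):

# days since Jan 1st of each row's own year, so different years line up
jan1 = pd.to_datetime(df['date'].dt.year.astype(str) + '-01-01')
df['day_of_year'] = (df['date'] - jan1).dt.days

(pandas also provides df['date'].dt.dayofyear directly, which gives 1-based day numbers.)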
There's also evidence that statsmodels supports timeseries from pandas. You may be able to apply this to linear models as well:
http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html
Also, a quick note:
You should be able to read the column names directly out of the csv automatically, as in the sample code I posted. In your example I see there are spaces after the commas in the first line of the csv file, resulting in column names like ' date'. Remove the spaces and automatic csv header reading should just work.
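If you can't edit the file itself, read_csv can also clean this up while reading; skipinitialspace drops the spaces after the delimiters, and the utf-8-sig encoding strips the byte-order mark that shows up as '\xef\xbb\xbf' in your date values:

df = pd.read_csv('gameAct.csv', skipinitialspace=True, encoding='utf-8-sig')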
Get date as floating point year
I prefer a date format that can be understood without context, hence the floating point year representation.
The nice thing here is that the solution works at the numpy level, so it should be fast.
import numpy as np
import pandas as pd


def dt64_to_float(dt64):
    """Convert numpy.datetime64 to year as float, rounded to days.

    Parameters
    ----------
    dt64 : np.datetime64 or np.ndarray(dtype='datetime64[X]')
        date data

    Returns
    -------
    float or np.ndarray(dtype=float)
        Year in floating point representation
    """
    year = dt64.astype('M8[Y]')
    days = (dt64 - year).astype('timedelta64[D]')
    year_next = year + np.timedelta64(1, 'Y')
    days_of_year = (year_next.astype('M8[D]') - year.astype('M8[D]')).astype('timedelta64[D]')
    dt_float = 1970 + year.astype(float) + days / days_of_year
    return dt_float


if __name__ == "__main__":
    dates = np.array(
        ['1970-01-01', '2014-01-01', '2020-12-31', '2019-12-31', '2010-04-28'],
        dtype='datetime64[D]')
    df = pd.DataFrame({
        'date': dates,
        'number': np.arange(5)
    })
    df['date_float'] = dt64_to_float(df['date'].to_numpy())
    print('df:', df, sep='\n')
    print()

    dt64 = np.datetime64("2011-11-11")
    print('dt64:', dt64_to_float(dt64))
output
df:
date number date_float
0 1970-01-01 0 1970.000000
1 2014-01-01 1 2014.000000
2 2020-12-31 2 2020.997268
3 2019-12-31 3 2019.997260
4 2010-04-28 4 2010.320548
dt64: 2011.8602739726027
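Tying this back to the regression question above, the float year can be used directly as a regressor. A sketch, assuming df here is the gameAct.csv frame (with date, city and sales columns) and sm is statsmodels.formula.api as imported in the question:

# hypothetical tie-in: use the float year as the explanatory variable
df['date_float'] = dt64_to_float(pd.to_datetime(df['date']).to_numpy())
result = sm.ols(formula='sales ~ date_float', data=df[df['city'] == 'London']).fit()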
I'm not sure about the specifics of statsmodels, but this post lists all the date/time conversions for Python. They aren't always one-to-one, so it's a reference I used often ;-)
(df['date'] - df['date'].min()).dt.total_seconds()
If your date column is datetime64[ns], subtracting a reference date gives timedeltas, and dt.total_seconds() on those returns the number of seconds as a float.
I am attempting to collect counts of occurrences of an id between two time periods in a dataframe. I have a moderately sized dataframe (about 400 unique ids and just short of 1m rows) containing a time of occurrence and an id for the account which caused the occurrence. I am attempting to get a count of occurrences for multiple time periods (1 hour, 6 hour, 1 day, etc.) prior a specific occurrence and have run into lots of difficulties.
I am using Python 3.7, and for this instance I only have the pandas package loaded. I have tried using for loops, and while that likely would have worked (eventually), I am looking for something a bit more efficient time-wise. I have also tried list comprehensions and have run into some errors that I did not anticipate when dealing with datetime columns. Examples of both are below.
## Sample data
data = {'id':[ 'EAED813857474821E1A61F588FABA345', 'D528C270B80F11E284931A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '7B9C7C02F19711E38C670EDFB82A24A9', '80B409D1EC3D4CC483239D15AAE39F2E', '314EB192F25F11E3B68A0EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', '156097CF030E4519DBDF84419B855E10', 'EE80E4C0B82B11E28C561A7D66640965', 'CA9F2DF6B82011E28C561A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '314EB192F25F11E3B68A0EDFB82A24A9', 'D528C270B80F11E284931A7D66640965', '3A024345C1E94CED8C7E0DA3A96BBDCA', '314EB192F25F11E3B68A0EDFB82A24A9', '47C18B6B38E540508561A9DD52FD0B79', 'B72F6EA5565B49BBEDE0E66B737A8E6B', '47C18B6B38E540508561A9DD52FD0B79', 'B92CB51EFA2611E2AEEF1A7D66640965', '136EDF0536F644E0ADE6F25BB293DD17', '7B9C7C02F19711E38C670EDFB82A24A9', 'C5FAF9ACB88D4B55AB8196DBFFE5B3C0', '1557D4ECEFA74B40C718A4E5425F3ACB', '68D30EE473FE11E49C060EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', 'CAF9D8CD627B422DFE1D587D25FC4035', 'C620D865AEE1412E9F3CA64CB86DC484', '47C18B6B38E540508561A9DD52FD0B79', 'CA9F2DF6B82011E28C561A7D66640965', '06E2501CB81811E290EF1A7D66640965', '68EEE17873FE11E4B5B90AFEF9534BE1', '47C18B6B38E540508561A9DD52FD0B79', '1BFE9CB25AD84B64CC2D04EF94237749', '7B20C2BEB82811E28C561A7D66640965', '261692EA8EE447AEF3804836E4404620', '74D7C3901F234993B4788EFA9E6BEE9E', 'CAF9D8CD627B422DFE1D587D25FC4035', '76AAF82EB8C511E2A76C1A7D66640965', '4BD38D6D44084681AFE13C146542A565', 'B8D27E80B82911E28C561A7D66640965' ], 'datetime':[ "24/06/2018 19:56", "24/05/2018 03:45", "12/01/2019 14:36", "18/08/2018 22:42", "19/11/2018 15:43", "08/07/2017 21:32", "15/05/2017 14:00", "25/03/2019 22:12", "27/02/2018 01:59", "26/05/2019 21:50", "11/02/2017 01:33", "19/11/2017 19:17", "04/04/2019 13:46", "08/05/2019 14:12", "11/02/2018 02:00", "07/04/2018 16:15", "29/10/2016 20:17", "17/11/2018 21:58", "12/05/2017 16:39", "28/01/2016 19:00", "24/02/2019 19:55", "13/06/2019 19:24", "30/09/2016 18:02", "14/07/2018 17:59", "06/04/2018 22:19", "25/08/2017 17:51", "07/04/2019 02:24", "26/05/2018 17:41", "27/08/2014 06:45", "15/07/2016 19:30", "30/10/2016 20:08", "15/09/2018 18:45", "29/01/2018 02:13", "10/09/2014 23:10", "11/05/2017 22:00", "31/05/2019 23:58", "19/02/2019 02:34", "02/02/2019 01:02", "27/04/2018 04:00", "29/11/2017 20:35"]}
import pandas as pd
from datetime import timedelta

df = pd.DataFrame(data)
# the timestamps are day-first strings, so parse them before doing any datetime math
df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True)
df = df.sort_values(['id', 'datetime'], ascending=True)
# for loop attempt
totalAccounts = df['id'].unique()
for account in totalAccounts:
    oneHourCount = 0
    subset = df[df['id'] == account]
    for i in range(len(subset)):
        onehour = subset['datetime'].iloc[i] - timedelta(hours=1)
        for j in range(len(subset)):
            if (subset['datetime'].iloc[j] >= onehour) and (subset['datetime'].iloc[j] < subset['datetime'].iloc[i]):
                oneHourCount += 1
# list comprehension attempt
df['onehour'] = df['datetime'] - timedelta(hours=1)
for account in totalAccounts:
    onehour = sum([1 for x in subset['datetime'] if x >= subset['onehour'] and x < subset['datetime']])
I am getting either 1) incredibly long runtime with the for loop or 2) a ValueError regarding the truth of a series being ambiguous. I know the issue is dealing with the datetimes, and perhaps it is just going to be slow-going, but I want to check here first just to make sure.
So I was able to figure this out using bisection. If you have a similar question please PM me and I'd be more than happy to help.
Solution:
from bisect import bisect_left, bisect_right

# 'keys' is assumed to be the sorted list of timestamps for the current id
left = bisect_left(keys, subset['start_time'].iloc[i])    ## calculated time
right = bisect_right(keys, subset['datetime'].iloc[i])    ## actual time of occurrence
count = len(subset['datetime'][left:right])
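For anyone landing here later, a fuller sketch of that idea (counts_in_window is a hypothetical helper; it assumes df has parsed datetimes and is sorted by id and datetime as above, and the one-hour window is just an example):

from bisect import bisect_left
from datetime import timedelta

def counts_in_window(df, window=timedelta(hours=1)):
    """For each row, count earlier occurrences of the same id inside the window."""
    counts = []
    for _, grp in df.groupby('id', sort=False):
        keys = grp['datetime'].to_list()          # already sorted within each id
        for t in keys:
            left = bisect_left(keys, t - window)  # first occurrence inside the window
            right = bisect_left(keys, t)          # occurrences strictly before t
            counts.append(right - left)
    return counts

# counts align positionally because df is already sorted by id and datetime
df['one_hour_count'] = counts_in_window(df)

Other window sizes (6 hours, 1 day, ...) just change the window argument.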
I am trying to create a dummy file to make some ML predictions afterwards. The input is about 2,000 'routes', and I want to create a dummy that contains year-month-day-hour combinations for 7 days, i.e. 168 rows per route, about 350k rows in total.
The problem I am facing is that pandas becomes terribly slow at appending rows once the frame reaches a certain size.
I am using the following code:
DAYS = [0, 1, 2, 3, 4, 5, 6]
HODS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
ISODOW = {
    1: "monday",
    2: "tuesday",
    3: "wednesday",
    4: "thursday",
    5: "friday",
    6: "saturday",
    7: "sunday"
}

def createMyPredictionDummy(start=datetime.datetime.now(), sourceFile=(utils.mountBasePath + 'routeProperties.csv'), destFile=(utils.outputBasePath + 'ToBePredictedTTimes.csv')):
    '''Generate a dummy file that can be used for predictions'''
    data = ['route', 'someProperties']
    dataFile = data + ['yr', 'month', 'day', 'dow', 'hod']
    # New DataFrame with all required columns
    file = pd.DataFrame(columns=dataFile)
    # Old data frame that has only the target columns
    df = pd.read_csv(sourceFile, converters=convert, delimiter=',')
    df = df[data]
    # Counter - to avoid constant lookup for the length of the DF
    ix = 0
    routes = df['route'].drop_duplicates().tolist()
    # Iterate through all routes and create a row for every route-yr-month-day-hour combination for 7 days --> about 350k rows
    for no, route in enumerate(routes):
        print('Current route is %s which is no. %g out of %g' % (str(route), no + 1, len(routes)))
        routeDF = df.loc[df['route'] == route].iloc[0].tolist()
        for i in range(0, 7):
            tmpDate = start + datetime.timedelta(days=i)
            day = tmpDate.day
            month = tmpDate.month
            year = tmpDate.year
            dow = ISODOW[tmpDate.isoweekday()]
            for hod in HODS:
                file.loc[ix] = routeDF + [year, month, day, dow, hod]  # This is becoming terribly slow
                ix += 1
    file.to_csv(destFile, index=False)
    print('Wrote file')
I think the main problem lies in appending the row with .loc[]. Is there a way to append rows more efficiently?
If you have any other suggestions, I am happy to hear them all!
Thanks and best,
carbee
(this is more of a long comment than an answer, sorry but without example data I can't run much...)
Since it seems that you're adding rows one at a time sequentially (i.e. the dataframe is indexed by integers accessed sequentially) and you always know the order of the columns, you're probably much better off creating a list of lists and then converting it into a DataFrame. That is, define something like file_list = [] and then replace the line file.loc[ix] = ... with:
file_list.append(routeDF + [year, month, day, dow, hod])
In the end, you can then define
file = pd.DataFrame(file_list, columns=dataFile)
If, furthermore, all your data is of a fixed type (e.g. int, depending on what routeDF contains, and if you convert dow only after creating the dataframe), you might do even better by pre-allocating a numpy array and writing into it. But adding elements to a list will almost certainly not be the bottleneck of your code, so this is probably excessive optimization.
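Put together, the loop from the question would look roughly like this (a sketch reusing the names from your function):

file_list = []
for no, route in enumerate(routes):
    routeDF = df.loc[df['route'] == route].iloc[0].tolist()
    for i in range(7):
        tmpDate = start + datetime.timedelta(days=i)
        dow = ISODOW[tmpDate.isoweekday()]
        for hod in HODS:
            # plain list append; no DataFrame reallocation on every row
            file_list.append(routeDF + [tmpDate.year, tmpDate.month, tmpDate.day, dow, hod])

file = pd.DataFrame(file_list, columns=dataFile)
file.to_csv(destFile, index=False)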
Another alternative that minimizes changes to your code: simply preallocate enough space by creating a DataFrame full of NaN instead of a DataFrame with no rows, i.e. change the definition of file to (after moving the drop_duplicates line up):
file = pd.DataFrame(columns=dataFile, index=range(len(routes)*168))
I'm quite sure this is faster than your code, but it might still be slower than the list-of-lists approach above, since pandas won't know which data types to expect until you fill in the data (it might e.g. convert your ints to float, which is not ideal). But again, once you get rid of the continual reallocations caused by growing the DataFrame at each step, this will probably no longer be your bottleneck (the double loop likely will be).
You create an empty dataframe named file and then fill it by appending rows; this seems to be the problem. If you instead turn the function into a generator and build the dataframe in one go:
def createMyPredictionDummy(...):
    ...
    # make it yield a dict of attributes from the for loop
    for hod in HODS:
        yield data

# then use this to create the *file* dataframe outside that function
newDF = pd.DataFrame([r for r in createMyPredictionDummy()])
newDF.to_csv(destFile, index=False)
print('Wrote file')
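Filled in against the code from the question, that generator idea might look something like this (a sketch; generateRows is a hypothetical helper, and each row is yielded as a dict so pandas can infer the columns):

def generateRows(start, df, routes):
    '''Yield one dict per route-day-hour combination instead of growing a DataFrame.'''
    for route in routes:
        routeProps = df.loc[df['route'] == route].iloc[0].to_dict()
        for i in range(7):
            tmpDate = start + datetime.timedelta(days=i)
            for hod in HODS:
                yield {**routeProps,
                       'yr': tmpDate.year, 'month': tmpDate.month, 'day': tmpDate.day,
                       'dow': ISODOW[tmpDate.isoweekday()], 'hod': hod}

newDF = pd.DataFrame(list(generateRows(datetime.datetime.now(), df, routes)))
newDF.to_csv(destFile, index=False)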
I'm trying to write a function that takes a continuous time series and returns a data structure describing any gaps of missing data (e.g. a DF with columns 'start' and 'end'). It seems like a fairly common issue for time series, but despite messing around with groupby, diff, and the like, and exploring SO, I haven't been able to come up with much better than the below.
It's a priority for me that this use vectorized operations to remain efficient. There has got to be a more obvious solution using vectorized operations, hasn't there? Thanks for any help, folks.
import pandas as pd
def get_gaps(series):
    """
    :param series: a continuous time series of data with the index's freq set
    :return: a series where the index is the start of gaps, and the values are
             the ends
    """
    missing = series.isnull()
    different_from_last = missing.diff()
    # any row not missing while the last was is a gap end
    gap_ends = series[~missing & different_from_last].index
    # count the start as different from the last
    different_from_last[0] = True
    # any row missing while the last wasn't is a gap start
    gap_starts = series[missing & different_from_last].index
    # check and remedy if series ends with missing data
    if len(gap_starts) > len(gap_ends):
        gap_ends = gap_ends.append(series.index[-1:] + series.index.freq)
    return pd.Series(index=gap_starts, data=gap_ends)
For the record, Pandas==0.13.1, Numpy==1.8.1, Python 2.7
This problem can be transformed into finding runs of consecutive numbers in a list: find all the indices where the series is null, and if a run such as (3, 4, 5, 6) is all null, you only need to extract the start and end, (3, 6).
import numpy as np
import pandas as pd
from operator import itemgetter
from itertools import groupby

def find_gap(s):
    """Just treat the null positions as a list of integers."""
    nullindex = np.where(s.isnull())[0]
    ranges = []
    # consecutive indices share the same i - x, so groupby splits the runs
    # (Python 2 tuple unpacking in the lambda, matching the Python 2.7 in the question)
    for k, g in groupby(enumerate(nullindex), lambda (i, x): i - x):
        group = map(itemgetter(1), g)
        ranges.append((group[0], group[-1]))
    startgap, endgap = zip(*ranges)
    return pd.Series(endgap, index=startgap)

# create an example
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.Series(data, index=data)
s = s.reindex(xrange(18))

print find_gap(s)
reference : Identify groups of continuous numbers in a list
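Since the question specifically asks for vectorized operations, here is a sketch of the same idea without an explicit Python loop (assuming a Series with a DatetimeIndex and a reasonably recent pandas; the 0.13.1 from the question may need minor adjustments). Note that, unlike get_gaps above, the 'end' here is the last missing timestamp of each run rather than the first valid one after it:

import pandas as pd

def get_gaps_vectorized(series):
    missing = series.isnull()
    # label each consecutive run of equal missing/non-missing values
    run_id = missing.ne(missing.shift()).cumsum()
    # keep only the timestamps that are missing, grouped by their run label
    null_times = series.index.to_series()[missing.values]
    runs = null_times.groupby(run_id[missing.values]).agg(['first', 'last'])
    # gap starts as the index, gap ends as the values
    return pd.Series(runs['last'].values, index=runs['first'].values)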