Pandas/Numpy - Vectorize datetime calculation - python

tl;dr
I need df.dates[iter]-df.dates[initial_fixed] per slice of a dataframe indexed by an item_id in the fastest way possible (for the sake of learning and improving skills... and deadlines).
Also: how do I calculate business hours between these same dates, not just straight wall-clock time? And I need partial days (4.763 days, for example), not just an integer like .days gives.
Hi,
First, I have a dataframe df
item_id dates new_column ... other_irrelevant_columns
101 2020-09-10-08-... FUNCTION -neglected-
101 2020-09-18-17-... FUNCTION -neglected-
101 2020-10-03-11-... FUNCTION -neglected-
107 2017-08-dd-hh-... FUNCTION -neglected-
107 2017-09-dd-hh-... FUNCTION -neglected-
209 2019-01-dd-hh-... FUNCTION -neglected
209 2019-01-dd-hh-... FUNCTION -neglected-
209 2019-01-dd-hh-... FUNCTION -neglected-
209 2019-01-dd-hh-... FUNCTION -neglected-
where the dates column (type = datetime object) is chronological per item_id, so the first instance is the earliest date.
I have over 400,000 rows, and I need to calculate the elapsed time by taking the difference between each datetime and the origin (first) datetime, per item_id. That gives a sequence like:
item_id dates [new_column = elapsed_time] ... other_irrelevant_columns
101 2020-09-10-08-... [dates[0]-dates[0] = 0 days] -neglected- for plotting
101 2020-09-18-17-... [dates[1]-dates[0] = 8.323 days] -neglected-
101 2020-10-03-11-... [dates[2]-dates[0] = 23.56 days] -neglected-
At the moment I'm stuck with a for loop (whose body I think is vectorized) that calculates the total seconds of each timedelta and converts to days as a float:
for id in df.item_id.unique():
    df.elapsed_days[df.item_id == id] = ((df.dates[df.item_id == id] - min(df.dates[df.item_id == id])).dt.total_seconds() / 86400).astype(float)
which is taking forever. Not in the data science spirit. What I'd like to know is a better way to perform this, whether it's using apply() with a lambda or something else; I tried to use digitize and isin() from this guy's article but can't fathom how to bin the item_id to make it work.
Second, I am also interested in a similar duration but over business hours only (8am-6pm, no weekends or holidays in Canada), so that the real time the item is active is measured.
Thanks for any help.

You can use join to do that much faster.
First you need to perform the min as you do in your current code:
tmp = df[['item_id', 'dates']]                           # column filtering
dateMin = tmp.groupby('item_id', as_index=False).min()   # find the earliest date for each item_id
Then you can do the merge:
# Actual merge
indexed_df = df.set_index('item_id')
indexed_dateMin = dateMin.set_index('item_id')
merged = indexed_df.join(indexed_dateMin, lsuffix='_df', rsuffix='_dateMin')
# Vectorized computation
# merged is indexed by item_id, so assign positionally to avoid index misalignment
df['elapsed_days'] = ((merged['dates_df'] - merged['dates_dateMin']).dt.total_seconds() / 86400).to_numpy()
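For comparison, an even shorter route is a groupby transform, which broadcasts the per-item minimum back to every row; this is a sketch assuming the item_id and dates column names from the question:
# Earliest date per item_id, aligned to every row of df
origin = df.groupby('item_id')['dates'].transform('min')
df['elapsed_days'] = (df['dates'] - origin).dt.total_seconds() / 86400
For the business-hours part of the question, numpy's busday_count counts whole business days between two dates (end date excluded) and accepts a custom holidays list; it does not handle the 8am-6pm window or partial days, so treat this only as a rough starting point:
import numpy as np
holidays = []  # assumption: supply Canadian statutory holidays here as 'YYYY-MM-DD' strings
df['business_days'] = np.busday_count(
    origin.values.astype('datetime64[D]'),
    df['dates'].values.astype('datetime64[D]'),
    holidays=holidays,
)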

Related

Improve running time when using inflation method on pandas

I'm trying to get real prices for my data in pandas. Right now, I am just playing with one year's worth of data (3962050 rows) and it took 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years, and it would take too long to wait every time.
Portion of df:
year quarter fare
0 1994 1 213.98
1 1994 1 214.00
2 1994 1 214.00
3 1994 1 214.50
4 1994 1 214.50
import time
import cpi
import pandas as pd

def inflate_column(data, column):
    """
    Adjust for inflation the series of values in column of the
    dataframe data. Using cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    df = data.apply(lambda x: cpi.inflate(x[column], x.year), axis=1)
    print("Inflating process took", time.time() - start_time, " seconds to run")
    return df

df['real_fare'] = inflate_column(df, 'fare')
You have multiple values for each year: you can just call cpi.inflate once per year, store the factors in a dict, and then reuse those values instead of calling cpi.inflate every time.
all_years = df["year"].unique()
dict_years = {}
for year in all_years:
dict_years[year] = cpi.inflate(1.0, year)
df['real_fare'] = # apply here: dict_years[row['year']]*row['fare']
You can fill in the last line using apply, or try to do it in some other way, like df['real_fare'] = df['fare'] * ...
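For completeness, one way to finish that last line without apply, assuming the dict_years lookup built above, is a plain Series.map (just a sketch, not the only option):
# Multiply each fare by the cached inflation factor for its year
df['real_fare'] = df['fare'] * df['year'].map(dict_years)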

Spliting DataFrame into Multiple Frames by Dates Python

I fully understand there are a few versions of this question out there, but none seem to get at the core of my problem. I have a pandas Dataframe with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main Dataframe down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented Dataframe in order to see and plot what words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if (start != '30'):
        datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter an end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime
dataTime = dateRange()
dataTime2 = dateRange()
def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number
calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates which is expected as I created this as a test. How can I split the Dataframe by increments and run the calculation for each dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g
for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: Dataframe with no frame. How can I break this down into 100 or so Dataframes to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I probably will have to somewhere.
Thank you.
Let us assume you have a data frame like this:
date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns = ["X"])
df['Date'] = date
df.head()
Output:
X Date
0 328 2018-01-01
1 188 2018-01-02
2 709 2018-01-03
3 259 2018-01-04
4 131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)
print(df_dict)
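If the dict is only a stepping stone, you could also feed each non-empty group straight into the function from the question; calcForDateRange is the OP's own function here, so this is just a sketch of the wiring using the same example df:
for start, frame in df.groupby(pd.Grouper(key="Date", freq='20D')):
    if not frame.empty:
        calcForDateRange(frame)  # run the tf_idf calculation on this time slice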
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of each period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
           for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}

How to get a value from a pandas core series?

I have a dataframe df in which are the timezones for particular ip numbers:
ip1 ip2 timezone
0 16777215 0
16777216 16777471 +10:00
16777472 16778239 +08:00
16778240 16779263 +11:00
16779264 16781311 +08:00
16781312 16785407 +09:00
...
The first row is valid for the ip numbers from 0 to 16777215, the second from 16777216 to 16777471, and so on.
Now, I go through a folder and want to know the timezone for every file (after I calculate the ip_number of the file).
I use:
time=df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone']
and get my expected output:
1192 +05:30
Name: timezone, dtype: object
But this is a pandas core Series and I just want to have "+05:30".
How do I get this? Or is there another way, instead of df.loc[...], to directly get the value of the timezone column in df?
Just convert it to a list:
list(time)
If you are expecting only one value:
list(time)[0]
Or you can do it earlier:
#for numpy array
time=df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].values
#for list
time=list(df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].values)
To pull the only value out of a Series of size 1, use the Series.item() method:
time = df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].item()
Note that this raises a ValueError if the Series contains more than one item.
Usually pulling single values out of a Series is an anti-pattern. NumPy/Pandas
is built around the idea that applying vectorized functions to large arrays is
going to be much much faster than using a Python loop that processes single
values one at a time.
Given your df and a list of IP numbers, here is a way to find the
corresponding timezone offsets for all the IP numbers with just one call to pd.merge_asof.
import pandas as pd
df = pd.DataFrame({'ip1': [0, 16777216, 16777472, 16778240, 16779264, 16781312],
                   'ip2': [16777215, 16777471, 16778239, 16779263, 16781311, 16785407],
                   'timezone': ['0', '+10:00', '+08:00', '+11:00', '+08:00', '+09:00']})
df1 = df.melt(id_vars=['timezone'], value_name='ip').sort_values(by='ip').drop('variable', axis=1)
ip_nums = [16777473, 16777471, 16778238, 16785406]
df2 = pd.DataFrame({'ip':ip_nums}).sort_values(by='ip')
result = pd.merge_asof(df2, df1)
print(result)
yields
ip timezone
0 16777471 +10:00
1 16777473 +08:00
2 16778238 +08:00
3 16785406 +09:00
Ideally, your next step would be to apply more NumPy/Pandas vectorized functions
to process the whole DataFrame at once. But if you must, you could iterate
through the result DataFrame row-by-row. Still, your code will look a little bit cleaner
since you'll be able to read off ip and corresponding offset easily (and without calling .item()).
for row in result.itertuples():
    print('{} --> {}'.format(row.ip, row.timezone))
# 16777471 --> +10:00
# 16777473 --> +08:00
# 16778238 --> +08:00
# 16785406 --> +09:00

Calculating a max for every X number of lines, how to take leap year into account?

I am trying to take yearly max rainfall data for multiple years of data within one array. I understand how you would need to use a for loop if I wanted to take the max of a single range, and I saw there was a similar question to the problem I'm having. However, I need to take leap years into account!
So for the first year I have 14616 data points from 1960-1965, not including 1965, which contains 2 leap years: 1960 and 1964. A leap year contains 2928 data points and every other year contains 2920 data points.
My first thought was to modify the solution from the similar question, which involved using a for loop as follows (just a straight copy-paste from theirs):
for i, d in enumerate(data_you_want):
    if (i % 600) == 0:
        avg_for_day = np.mean(data_you_want[i - 600:i])
        daily_averages.append(avg_for_day)
Theirs involved taking the average of every 600 lines in their data. I thought there might be a way to just modify this, but I couldn't figure out a way for it to work. If modifying this won't work, is there another way to loop it with leap years taken into account, without completely cutting up the file manually?
Fake data:
import numpy as np
fake = np.random.randint(2, 30, size = 14616)
Use pandas to handle the leap year functionality.
Create timestamps for your data with pandas.date_range().
import pandas as pd
index = pd.date_range(start = '1960-1-1 00:00:00', end = '1964-12-31 23:59:59' , freq='3H')
Then create a DataFrame using the timestamps for the index.
df = pd.DataFrame(data = fake, index = index)
Aggregate by year - taking advantage of the DatetimeIndex flexibilty.
>>> df['1960'].max()
0 29
dtype: int32
>>> df['1960'].mean()
0 15.501366
dtype: float64
>>>
>>> len(df['1960'])
2928
>>> len(df['1961'])
2920
>>> len(df['1964'])
2928
>>>
I just cobbled this together from the Time Series / Date functionality section of the docs. Given pandas' capabilities this looks a bit naive and can probably be improved upon.
Like resampling (using the same DataFrame):
>>> df.resample('A').mean()
0
1960-12-31 15.501366
1961-12-31 15.170890
1962-12-31 15.412329
1963-12-31 15.538699
1964-12-31 15.382514
>>> df.resample('A').max()
0
1960-12-31 29
1961-12-31 29
1962-12-31 29
1963-12-31 29
1964-12-31 29
>>>
>>> r = df.resample('A')
>>> r.agg([np.sum, np.mean, np.std])
0
sum mean std
1960-12-31 45388 15.501366 8.211835
1961-12-31 44299 15.170890 8.117072
1962-12-31 45004 15.412329 8.257992
1963-12-31 45373 15.538699 7.986877
1964-12-31 45040 15.382514 8.178057
>>>
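If you only want the yearly figures and not a resampled time axis, grouping on the calendar year of the DatetimeIndex is an equivalent alternative (a sketch against the same df as above):
# Group rows by the year of the DatetimeIndex
yearly_max = df.groupby(df.index.year).max()
yearly_mean = df.groupby(df.index.year).mean()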
Food for thought:
Time-aware Rolling vs. Resampling

Python - time series alignment and "to date" functions

I have a dataset whose first three columns are Basket ID (unique identifier), Sale amount (dollars) and the date of the transaction. I want to calculate the following columns for each row of the dataset, and I would like to do it in Python:
Previous Sale of the same basket (if any); Sale Count to date for the current basket; Mean To Date for the current basket (if available); Max To Date for the current basket (if available)
Basket Sale Date PrevSale SaleCount MeanToDate MaxToDate
88 $15 3/01/2012 1
88 $30 11/02/2012 $15 2 $23 $30
88 $16 16/08/2012 $30 3 $20 $30
123 $90 18/06/2012 1
477 $77 19/08/2012 1
477 $57 11/12/2012 $77 2 $67 $77
566 $90 6/07/2012 1
I'm pretty new to Python, and I really struggle to find anything to do it in a fancy way. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one within each basket. No clue how to get the MeanToDate and MaxToDate in an efficient way apart from looping... any ideas?
This should do the trick (note: pandas.stats.moments was removed in later pandas versions; an updated version using .expanding() follows below):
from pandas import concat
from pandas.stats.moments import expanding_mean, expanding_count
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),   # cumulative mean
            'MaxToDate': se.cummax(),           # cumulative max
            'SaleCount': expanding_count(se),   # cumulative count
            'Sale': se,                         # simple copy
            'PrevSale': se.shift(1)             # previous sale
        },
        axis=1
    )
# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping/aggregating here.
import pandas as pd
pd.__version__ # u'0.24.2'
from pandas import concat
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),   # cumulative mean
            'MaxToDate': se.expanding().max(),     # cumulative max
            'SaleCount': se.expanding().count(),   # cumulative count
            'Sale': se,                            # simple copy
            'PrevSale': se.shift(1)                # previous sale
        },
        axis=1
    )
###########################
from datetime import datetime
df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})
#########
new_df = df.groupby('Basket').apply(handler).reset_index()
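For larger frames, a fully vectorized variant that avoids groupby.apply altogether is also possible; this is a sketch against the same df as above, relying only on standard cumulative GroupBy methods:
df = df.sort_values(['Basket', 'Date'])
g = df.groupby('Basket')['Sale']
df['PrevSale'] = g.shift(1)                      # previous sale within the basket
df['SaleCount'] = g.cumcount() + 1               # running count, starting at 1
df['MeanToDate'] = g.cumsum() / df['SaleCount']  # running mean = running sum / running count
df['MaxToDate'] = g.cummax()                     # running max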
