PySpark: Sliding windows for selective rows - python

I have a dataframe containing the following 3 columns:
1. ID
2. timestamp
3. IP_Address
The data spans from 2019-07-01 to 2019-09-20. I am trying to aggregate counts of IP_Address over the last 60 days, partitioned by ID, for all the rows within the 20-day period from 2019-09-01 to 2019-09-20.
I have tried using the following window function and it works just fine:
from pyspark.sql import Window
from pyspark.sql.functions import col, count, unix_timestamp

days = lambda i: i * 86400
w = Window.partitionBy('id')\
    .orderBy(unix_timestamp(col('timestamp')))\
    .rangeBetween(start=-days(60), end=Window.currentRow)
df = df.withColumn("ip_counts", count(df.ip_address).over(w))
However, the problem is that it also calculates these aggregations for the period I don't need: 2019-07-01 to 2019-08-31. I could easily filter the results down to the selected period after the calculation, but I want to avoid unnecessary computation since I am dealing with ~3-10 million rows per day.
If I filter the dataframe like this:
import pyspark.sql.functions as F

dates = ('2019-09-01', '2019-09-20')
date_from, date_to = [F.to_date(F.lit(s)).cast("timestamp") for s in dates]
w = Window.partitionBy('id')\
    .orderBy(unix_timestamp(col('timestamp')))\
    .rangeBetween(start=-days(60), end=Window.currentRow)
df = df.where((df.timestamp >= date_from) & (df.timestamp <= date_to))\
    .withColumn("ip_counts", count(df.ip_address).over(w))
then the windows for the rows in this period can no longer see the preceding 60 days of data for those IDs, and therefore the counts are incorrect.
What can I do to compute aggregations only for the rows falling between 2019-09-01 and 2019-09-20, while at the same time making sure that the windows have access to the preceding 60 days of data for each of those aggregations? Thank you so much for your help.

I would first make a new dataframe that keeps only the data from 60 days before the start of the target period onward (i.e. 2019-07-03 to 2019-09-20), then follow your first method on it, and finally keep only the rows falling between 2019-09-01 and 2019-09-20.
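A minimal sketch of that approach (2019-07-03 is simply 60 days before the start of the target window; variable names such as lookback_from and pruned are my own):
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.functions import col, count, unix_timestamp

days = lambda i: i * 86400

# Keep only the data needed to cover the 60-day lookback for the target period.
lookback_from = F.to_date(F.lit('2019-07-03')).cast('timestamp')
target_from = F.to_date(F.lit('2019-09-01')).cast('timestamp')
target_to = F.to_date(F.lit('2019-09-20')).cast('timestamp')

pruned = df.where((df.timestamp >= lookback_from) & (df.timestamp <= target_to))

w = Window.partitionBy('id')\
    .orderBy(unix_timestamp(col('timestamp')))\
    .rangeBetween(-days(60), Window.currentRow)

# Compute the counts on the pruned data, then keep only the target rows.
result = pruned.withColumn('ip_counts', count('ip_address').over(w))\
    .where((col('timestamp') >= target_from) & (col('timestamp') <= target_to))
This way the window function never touches July and August rows except as lookback context for the September rows you actually keep.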

Related

Add hours column to regular list of minutes, group by it, and average the data in Python

I have looked for similar questions, but none seems to be addressing the following challenge. I have a pandas dataframe with a list of minutes and corresponding values, like the following:
minute value
0 454
1 434
2 254
The list is a year-long list, thus counting 60 minutes * 24 hours * 365 days = 525600 observations.
I would like to add a new column called hour, which indeed expresses the hour of the day (assuming minutes 0-59 are 12AM, 60-119 are 1AM, and so forth until the following day, where the sequence restarts).
Then, once the hour column is added, I would like to group observations by it and calculate the average value for every hour of the year, and end up with a dataframe with 24 observations, each expressing the average value of the original data at each hour n.
Using integer and remainder division you can get the hour.
df['hour'] = df['minute']//60%24
If you want other date information it can be useful to use January 1st of some year (not a leap year) as the origin and convert to a datetime. Then you can grab a lot of the date attributes, in this case hour.
df['hour'] = pd.to_datetime(df['minute'], unit='m', origin='2017-01-01').dt.hour
Then for your averages you get the resulting 24 row Series with:
df.groupby('hour')['value'].mean()
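Putting the integer-division step and the groupby together, a short end-to-end sketch on dummy data (the random values are just placeholders):
import numpy as np
import pandas as pd

# One year of minute data: 525600 rows of dummy values.
df = pd.DataFrame({'minute': np.arange(525600),
                   'value': np.random.rand(525600)})

df['hour'] = df['minute'] // 60 % 24              # hour of day, 0-23
hourly_avg = df.groupby('hour')['value'].mean()   # 24-row Series indexed by hour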
Here's a way to do it:
# sample df
df = pd.DataFrame({'minute': np.arange(525600), 'value': np.arange(525600)})
# set time format
df['minute'] = pd.to_timedelta(df['minute'], unit='m')
# calculate mean
df_new = df.groupby(pd.Grouper(key='minute', freq='1H'))['value'].mean().reset_index()
You don't need the hour column explicitly to calculate these values, but if you want it, you can get it with:
df_new['hour'] = pd.to_datetime(df_new['minute']).dt.hour

How to calculate moving average incrementally with daily data added to data frame in pandas?

I have daily data and want to calculate 5 days, 30 days and 90 days moving average per user and write out to a CSV. New data comes in everyday. How do I calculate these averages for the new data only, assuming I will load the data frame with last 89 days data plus today's data.
date user daily_sales 5_days_MA 30_days_MA 90_days_MA
2019-05-01 1 34
2019-05-01 2 20
....
2019-07-18 .....
The number of rows per day is about 1 million. If data for 90 days is too much, 30 days is OK.
You can apply the rolling() method on your dataset if it's in DataFrame format:
your_df['MA_30_days'] = your_df[where_to_apply].rolling(window=30).mean()
If you need a different window over which the moving average is calculated, just change the window parameter. In my example I used mean(), but you can choose some other statistic as well.
This code will create another column named 'MA_30_days' with the calculated moving average in your DataFrame.
You can also collect all the moving averages in another DataFrame by looping over your dataset, and save it to CSV format as you wanted:
your_df.to_csv('filename.csv')
In your case the calculation should consider only the newest data. If you want to perform this on the latest data, just slice it. However, the very first rows will be NaN (depending on the window):
df[where_to_apply].iloc[-90:].rolling(window=30).mean()
This will calculate the moving average on the last 90 rows of a specific column in some df, and the first 29 rows will be NaN. If your latest 90 rows should all be meaningful data, then you can start the calculation earlier than the last 90 rows, depending on the window size.
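Since the averages are needed per user, a minimal sketch combining groupby with rolling (the toy frame mirrors the sample columns; the exact layout of your data is an assumption):
import pandas as pd

# Toy frame mirroring the sample: one row per user per day.
df = pd.DataFrame({'date': pd.to_datetime(['2019-05-01', '2019-05-02', '2019-05-03'] * 2),
                   'user': [1, 1, 1, 2, 2, 2],
                   'daily_sales': [34, 40, 28, 20, 25, 22]})

df = df.sort_values(['user', 'date'])
for window in (5, 30, 90):
    # Rolling mean computed within each user; early rows are NaN until the window fills.
    df[f'{window}_days_MA'] = (df.groupby('user')['daily_sales']
                                 .transform(lambda s: s.rolling(window=window).mean()))
df.to_csv('moving_averages.csv', index=False)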
If the df already contains yesterday's moving average, and just the new day's simple MA is required, I would use this approach:
MAlength = 90
df.loc[day, 'MA'] = (
    (df.loc[day - 1, 'MA'] * MAlength)   # expand yesterday's MA value
    - df.loc[day - MAlength, 'Price']    # remove the oldest price
    + df.loc[day, 'Price']               # add the newest price
) / MAlength                             # re-average
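A runnable sketch of that incremental update on a toy frame, seeded with an ordinary rolling mean (the integer day index, the 3-day window and the column names are assumptions for illustration):
import pandas as pd

MAlength = 3
df = pd.DataFrame({'Price': [10.0, 11.0, 12.0, 13.0, 14.0]})

# Seed the first full-window MA the ordinary way.
df['MA'] = df['Price'].rolling(window=MAlength).mean()

# Update incrementally for each new day; the values match the plain rolling mean.
for day in range(MAlength, len(df)):
    df.loc[day, 'MA'] = (df.loc[day - 1, 'MA'] * MAlength
                         - df.loc[day - MAlength, 'Price']
                         + df.loc[day, 'Price']) / MAlength
print(df)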

Find if there is any holidays between two dates in a large dataset?

I am working on a dataset that has some 26 million rows and 13 columns, including two datetime columns, arr_date and dep_date. I am trying to create a new boolean column that checks whether there are any US holidays between these dates.
I am using an apply function over the entire dataframe, but the execution time is too slow. The code has been running for more than 48 hours now on Google Cloud Platform (24 GB RAM, 4 cores). Is there a faster way to do this?
The dataset looks like this:
Sample data
The code I am using is:
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

df = pd.read_pickle('dataGT70.pkl')
cal = calendar()

def mark_holiday(df):
    df['includes_holiday'] = df.apply(
        lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date'])) > 0
                           and x['num_days'] < 20) else False, axis=1)
    return df

df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

# Create a sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000  # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
                   'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)

# Get a sorted list of holidays since the first start date.
hols = calendar().holidays(df['start_date'].min())

# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
    df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
start_date end_date holiday_in_range
0 2015-07-14 2015-07-31 False
1 2010-12-18 2010-12-30 True # 2010-12-24
2 2013-04-06 2013-04-16 False
3 2013-09-12 2013-09-24 False
4 2017-10-28 2017-10-31 False
5 2013-12-14 2013-12-29 True # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_left(hols, '2013-12-14') would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday is calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date, and holiday_in_range is thus True if the end_date is greater than or equal to this holiday value; otherwise the holiday must fall after this end_date.
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Ah sorry, I just read that you only need a boolean indicating whether there are any holidays in between; this makes it much easier. If that's enough, you just need to perform steps 1-5, then group the DataFrame resulting from step 5 by start/end date and use count as the aggregate function to get the number of holidays in each range. This result you can join to your original dataset similar to step 8 described below, then fill the remaining values with fillna(0). Do something like joined_df['includes_holiday'] = joined_df['joined_count_column'] > 0. After that, you can delete joined_count_column from your DataFrame again, if you like. (A sketch of this boolean-only variant follows after the steps below.)
If you use pandas.merge_asof you could work through these steps (steps 6 and 7 are only necessary if you need all the holidays between start and end in your result DataFrame as well, not just the booleans):
1. Load your holiday records into a DataFrame and index it on the date. The holidays should be one date per line (storing ranges, like 24th-26th for Christmas, in one row would make it much more complex).
2. Create a copy of your dataframe with just the start and end date columns. UPDATE: every start/end date pair should occur only once in it, e.g. by using groupby.
3. Use merge_asof with a reasonable tolerance value (if you join over the start of the period, use direction='forward'; if you use the end date, use direction='backward') and how='inner'.
4. As a result you have a merged DataFrame with your start and end columns plus the date column from your holiday dataframe. You only get records for which a holiday was found within the given tolerance, but later you can merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
5. Then check the joined holiday for each of your records by comparing it with the start and end columns, and remove the holidays which are not in between.
6. Sort the dataframe you obtained from step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True)). Now insert a number column that numbers the holidays within each period (the ones you obtained after step 5) from 1 to ... (for each period starting from 1). This is necessary to use unstack in the next step to get the holidays into columns.
7. Add an index on your dataframe based on period start date, period end date and the count column you inserted in step 6, then use df.unstack(level=-1) on the DataFrame you prepared in steps 1-7. What you now have is a condensed DataFrame with your original periods and the holidays arranged columnwise.
8. Now you only have to merge this DataFrame back to your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left').
The result of this is a file with your original data containing the date ranges, and for each date range the holidays that lie within the period are stored in a separate column each, behind the data. Loosely speaking, the numbering in step 6 assigns the holidays to the columns and has the effect that the holidays are always assigned from right to left to the columns (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6 is probably also a bit tricky, but you can do it, for example, by adding a series filled with a range and then fixing it so the numbering starts at 0 or 1 in each group, either by using shift or by grouping by start/end with aggregate({'idcol': 'min'}) and joining the result back to subtract it from the value assigned by the range sequence.
In all, I think it sounds more complicated than it is, and it should perform quite efficiently, especially if your periods are not that large, because then after step 5 your result set should be much smaller than your original dataframe. But even if that is not the case, it should still be quite efficient, since it can use compiled code.
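A minimal sketch of the boolean-only variant with pandas.merge_asof, condensing steps 1-5 and the boolean test by comparing each row's single next holiday directly with its arrival date (the toy data, the 60-day tolerance and the column names dep_date/arr_date/includes_holiday are assumptions taken from the question and the UPDATE above):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

# Toy data; the real frame would have millions of dep_date/arr_date rows.
df = pd.DataFrame({'dep_date': pd.to_datetime(['2018-12-20', '2019-03-02', '2019-06-28']),
                   'arr_date': pd.to_datetime(['2018-12-27', '2019-03-05', '2019-07-05'])})

hols = pd.DataFrame({'holiday': calendar().holidays('2018-01-01', '2020-12-31')})

# Sort the left side on the asof key, then join each row to the next holiday
# on or after its departure date, within a generous tolerance.
left = df[['dep_date', 'arr_date']].drop_duplicates().sort_values('dep_date')
joined = pd.merge_asof(left, hols, left_on='dep_date', right_on='holiday',
                       direction='forward', tolerance=pd.Timedelta('60D'))

# A holiday lies in the range if the matched holiday is on or before the arrival date.
joined['includes_holiday'] = joined['holiday'].notna() & (joined['holiday'] <= joined['arr_date'])

# Merge the boolean back onto the original data.
df = df.merge(joined[['dep_date', 'arr_date', 'includes_holiday']],
              on=['dep_date', 'arr_date'], how='left')
Because merge_asof walks each sorted key column only once, this avoids the per-row calendar lookups that make the apply version so slow.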

Find Maximum Date within Date Range without filtering in Python

I have a file with one row per EMID per Effective Date. I need to find the maximum Effective date per EMID that occurred before a specific date. For instance, if EMID =1 has 4 rows, one for 1/1/16, one for 10/1/16, one for 12/1/16, and one for 12/2/17, and I choose the date 1/1/17 as my specific date, I'd want to know that 12/1/16 is the maximum date for EMID=1 that occurred before 1/1/17.
I know how to find the maximum date overall by EMID (groupby.max()). I also can filter the file to just dates before 1/1/17 and find the max of the remaining rows. However, ultimately I need the last row before 1/1/17, and then all the rows following 1/1/17, so filtering out the rows that occur after the date isn't optimal, because then I have to do complicated joins to get them back in.
import datetime
import random
import numpy as np
import pandas as pd

# Create dummy data
dummy = pd.DataFrame(columns=['EmID', 'EffectiveDate'])
dummy['EmID'] = [random.randint(1, 10000) for x in range(49999)]
dummy['EffectiveDate'] = [np.random.choice(pd.date_range(datetime.datetime(2016,1,1), datetime.datetime(2018,1,3))) for i in range(49999)]

# Create group by
g = dummy.groupby('EmID')['EffectiveDate']

# This doesn't work, but effectively shows what I'm trying to do
dummy['max_prestart'] = max(dt for dt in g if dt < datetime.datetime(2017,1,1))
I expect the output to be an additional column in my dataframe that has the maximum date that occurred before the specified date.
Using map after selecting the relevant rows:
s = dummy.loc[dummy.EffectiveDate < '2017-01-01'].groupby('EmID').EffectiveDate.max()
dummy['new'] = dummy.EmID.map(s)
Or using transform, falling back to the row's own EffectiveDate where there is no match:
dummy['new'] = dummy.loc[dummy.EffectiveDate < '2017-01-01'].groupby('EmID').EffectiveDate.transform('max')
dummy['new'] = dummy['new'].fillna(dummy.EffectiveDate)
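A small runnable demonstration of the map approach on a toy frame matching the example in the question (the 2017-01-01 cutoff is the date mentioned there):
import pandas as pd

dummy = pd.DataFrame({'EmID': [1, 1, 1, 1, 2],
                      'EffectiveDate': pd.to_datetime(['2016-01-01', '2016-10-01',
                                                       '2016-12-01', '2017-12-02',
                                                       '2017-06-01'])})

s = dummy.loc[dummy.EffectiveDate < '2017-01-01'].groupby('EmID').EffectiveDate.max()
dummy['max_prestart'] = dummy.EmID.map(s)
# EmID 1 gets 2016-12-01 on every row; EmID 2 has no date before the cutoff, so NaT.
print(dummy)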

A Multi-Index Construction for Intraday TimeSeries (10 min price data)

I have a file with intraday prices every ten minutes: [0:41] times in a day, so each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one entry for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded, data):
    '''This function accepts a column called rounded from 'data'.
    The 2nd input 'data' is a dataframe.
    '''
    df = rounded.shift(1)
    idf = data.set_index(['date', 'time'])
    data['diff'] = ['000']
    for i in range(0, len(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1] != 1620:
                    data['diff'] = rounded[i] - df[i]
                else:
                    day += 1
                    time += 2
    data[['date', 'time', 'price', 'II', 'diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
In the traceback I get an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import time
# Create date range with 10 minute intervals, and filter out irrelevant times
times = pd.bdate_range(start=pd.datetime(2012,11,14,0,0,0),end=pd.datetime(2012,11,17,0,0,0), freq='10T')
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
prices = randn(len(filtered_times))
# Create MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = zip(*arrays)
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
prices
date time
2012-11-14 09:30:00 0.696054
09:40:00 -1.263852
09:50:00 0.196662
10:00:00 -0.942375
10:10:00 1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
prices
date time
2012-11-14 09:30:00 NaN
09:40:00 -1.959906
09:50:00 1.460514
10:00:00 -1.139036
10:10:00 2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
You might want to consider using a TimeSeries instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a TimeSeries and perform a similar function to get the differences:
from datetime import datetime

dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date, i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See here for more about time series in pandas.
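For instance, a brief sketch of what resampling buys you once the prices sit on a DatetimeIndex (a tiny stand-in series, not your real data):
import numpy as np
import pandas as pd

# A tiny stand-in for ts: 10-minute prices over one trading day.
idx = pd.date_range('2012-11-14 09:30', '2012-11-14 16:20', freq='10T')
ts = pd.Series(np.random.randn(len(idx)), index=idx)

half_hourly = ts.resample('30T').mean()   # half-hourly mean price
daily_ohlc = ts.resample('1D').ohlc()     # daily open/high/low/close bars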
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007-8/30/2012 with 42 times per day (10-minute intervals). Observe the data cleaning issues:
Here the price data coming from the CSV has length 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should" be means there are data cleaning issues: often we do not have the full set of times that bdate_range would generate (half days in the market, holidays, etc.).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second TimeSeries suggestion seems to still require construction of a datetime index similar to the first one. For example, if I were to use the following two lines to get the actual data of interest:
DF=pd.read_csv('MR10min.csv')
data = DF.set_index(['date','time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one build a datetime index that is completely informed by the actual data available, rather than by a prescribed bdate_range?
Thank you to @MattiJohn and to anyone with an interest in continuing this discussion.
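One way around that TypeError is to parse the index from the rows actually present rather than generating an a priori grid; a hedged sketch, assuming the CSV stores date and time as parseable strings in columns named 'date' and 'time':
import pandas as pd

DF = pd.read_csv('MR10min.csv')  # assumed columns: date, time, price

# Parse the rows that actually exist instead of building a bdate_range grid.
dt_index = pd.DatetimeIndex(pd.to_datetime(DF['date'] + ' ' + DF['time']))
ts = pd.Series(DF['price'].values, index=dt_index)

# Differences restricted to each unique date, as in the groupby approach above.
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
Because the index comes straight from the file, half days and holidays are simply absent rather than mismatched against a prescribed calendar.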
