I have users with movements throughout the year, and I want to create 30-day windows to aggregate the data, counting back from each user's last movement.
So if I have a user with movements on dates:
id  date        value
1   2021-01-30      2
1   2021-02-01      4
1   2021-02-08      7
1   2021-04-15     23
I want to create:
[window 3, from 01/15 to 02/15]
[window 2, from 02/15 to 03/15]
[window 1, from 03/15 to 04/15]
And I almost got it with:
(dfsp.groupBy(["id", F.window("date", "30 days")])
    .agg({'value': 'sum'})
    .orderBy("window")
    .fillna(0))
But I noticed that the windows it generates don't start from the end, and I don't know how that can be done.
So, basically, the final dataframe would be something like:
id  window                                       sum(value)
1   (2021-01-15 00:00:00, 2021-02-15 00:00:00)           13
1   (2021-02-15 00:00:00, 2021-03-15 00:00:00)            0
1   (2021-03-15 00:00:00, 2021-04-15 00:00:00)           23
You could achieve something close to this using applyInPandas and resample. In native Spark it would require more creative coding.
Note: the origin='end' parameter of resample is only available in pandas >= 1.3.0.
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

df = df.withColumn('date', f.col('date').cast(TimestampType()))

def pd_resample(df):
    # Resample each user's movements into 30-day bins anchored at the last date.
    return df.groupby('id').resample('30D', on='date', origin='end').value.sum().reset_index()

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('date', TimestampType(), True),
    StructField('value', IntegerType(), True)
])

df.groupby('id').applyInPandas(pd_resample, schema=schema).show()
+---+-------------------+-----+
| id| date|value|
+---+-------------------+-----+
| 1|2021-02-14 00:00:00| 13|
| 1|2021-03-16 00:00:00| 0|
| 1|2021-04-15 00:00:00| 23|
+---+-------------------+-----+
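For reference, a native Spark alternative could compute each user's last movement date, bucket every row by how many whole 30-day periods it falls before that date, and aggregate. This is only a rough sketch of that idea (my own assumption, not part of the answer above), and unlike the applyInPandas version it will not emit rows for empty windows:

from pyspark.sql import functions as F

# Sketch only: window_idx 0 is the 30 days ending at the user's last movement,
# window_idx 1 the 30 days before that, and so on. Empty windows are simply absent.
last_dates = dfsp.groupBy("id").agg(F.max("date").alias("last_date"))

native = (dfsp.join(last_dates, "id")
    .withColumn("window_idx", F.floor(F.datediff("last_date", "date") / 30))
    .groupBy("id", "window_idx")
    .agg(F.sum("value").alias("sum_value"))
    .orderBy("id", "window_idx"))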
I have a PySpark dataframe with a datetime column containing: 2022-06-01 13:59:58
I would like to transform that datetime value into: 2022-06-01 14:00:58
Is there a way to round the minutes up into the next hour when the minutes are 59?
You can accomplish this using expr or unix_timestamp, adding 1 minute based on the minute of your timestamp value with when-otherwise.
Unix timestamps can be a bit fiddly since they involve the additional step of converting to epoch, but either way the end result is the same for both.
Data Preparation
from io import StringIO
import pandas as pd
import pyspark.sql.functions as F

s = StringIO("""
date_str
2022-03-01 13:59:50
2022-05-20 13:45:50
2022-06-21 16:59:50
2022-10-22 20:59:50
""")

df = pd.read_csv(s, delimiter=',')

# 'sql' is assumed to be the SparkSession / SQLContext used in this answer.
sparkDF = sql.createDataFrame(df)\
    .withColumn('date_parsed', F.to_timestamp(F.col('date_str'), 'yyyy-MM-dd HH:mm:ss'))\
    .drop('date_str')

sparkDF.show()
sparkDF.show()
+-------------------+
| date_parsed|
+-------------------+
|2022-03-01 13:59:50|
|2022-05-20 13:45:50|
|2022-06-21 16:59:50|
|2022-10-22 20:59:50|
+-------------------+
Extracting Minute & Addition
sparkDF = sparkDF.withColumn("date_minute", F.minute("date_parsed"))

sparkDF = sparkDF.withColumn(
    'date_parsed_updated_expr',
    F.when(F.col('date_minute') == 59, F.col('date_parsed') + F.expr('INTERVAL 1 MINUTE'))
     .otherwise(F.col('date_parsed'))
).withColumn(
    'date_parsed_updated_unix',
    F.when(F.col('date_minute') == 59, (F.unix_timestamp(F.col('date_parsed')) + 60).cast('timestamp'))
     .otherwise(F.col('date_parsed'))
)
sparkDF.show()
+-------------------+-----------+------------------------+------------------------+
| date_parsed|date_minute|date_parsed_updated_expr|date_parsed_updated_unix|
+-------------------+-----------+------------------------+------------------------+
|2022-03-01 13:59:50| 59| 2022-03-01 14:00:50| 2022-03-01 14:00:50|
|2022-05-20 13:45:50| 45| 2022-05-20 13:45:50| 2022-05-20 13:45:50|
|2022-06-21 16:59:50| 59| 2022-06-21 17:00:50| 2022-06-21 17:00:50|
|2022-10-22 20:59:50| 59| 2022-10-22 21:00:50| 2022-10-22 21:00:50|
+-------------------+-----------+------------------------+------------------------+
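The same logic can also be written as a single SQL expression if you prefer to avoid the intermediate minute column; this is just a sketch of an equivalent variant (the column name date_parsed_updated is my own):

from pyspark.sql import functions as F

# Equivalent single-expression variant of the when/otherwise logic above.
sparkDF = sparkDF.withColumn(
    'date_parsed_updated',
    F.expr("CASE WHEN minute(date_parsed) = 59 "
           "THEN date_parsed + INTERVAL 1 MINUTE "
           "ELSE date_parsed END")
)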
I have stock data (6000+ stocks, 100+ GB) saved as HDF5 files.
Basically, I am trying to translate this pandas code into PySpark. Ideally, I would love to have the values used for the ranking, as well as the ranks themselves, saved to a file.
import pandas as pd

agg_df = pd.DataFrame()
for stock in stocks:
    df = pd.read_csv(stock)
    df = my_func(df)  # custom function whose output is used for ranking; for simplicity, use standard deviation
    agg_df = pd.concat([agg_df, df], axis=1)  # column-wise concat, one column per stock

agg_df.rank()  # .to_csv() <- would like to save ranks for future use
Each data file has the same schema, like this:
Symbol Open High Low Close Volume
DateTime
2010-09-13 09:30:00 A 29.23 29.25 29.17 29.25 17667.0
2010-09-13 09:31:00 A 29.26 29.34 29.25 29.33 5000.0
2010-09-13 09:32:00 A 29.31 29.36 29.31 29.36 600.0
2010-09-13 09:33:00 A 29.33 29.36 29.30 29.35 7300.0
2010-09-13 09:34:00 A 29.35 29.39 29.31 29.39 3222.0
Desired output (where each number is a rank):
A AAPL MSFT ...etc
DateTime
2010-09-13 09:30:00 1 3 7 ...
2010-09-13 09:31:00 4 5 7 ...
2010-09-13 09:32:00 24 17 99 ...
2010-09-13 09:33:00 7 63 42 ...
2010-09-13 09:34:00 5 4 13 ...
I read other answers about Window and pyspark.sql, but I'm not sure how to apply those to my case, as I kind of need to aggregate by row before ranking (at least in pandas).
Edit 1: After I read the data into an RDD with rdd = sc.parallelize(data.keys).map(data.read_data), rdd becomes a PipelineRDD, which doesn't have a .select() method. 0xDFDFDFDF's example contains all data in one dataframe, but I don't think it's a good idea to append everything to one dataframe to do the computation.
Result: I was finally able to solve it. There were two problems: reading the files and performing the calculations.
Regarding reading the files, I initially loaded them from HDF5 using rdd = sc.parallelize(data.keys).map(data.read_data), which resulted in a PipelineRDD, a collection of pandas dataframes. These needed to be transformed into a Spark dataframe for the solution to work. I ended up converting my HDF5 file to parquet and saving the files to a separate folder. Then, using
sqlContext = pyspark.sql.SQLContext(sc)
rdd_p = sqlContext.read.parquet(r"D:\parq")
I read all the files into a dataframe.
After that, I performed the calculation from the accepted answer. Huge thanks to 0xDFDFDFDF for the help.
Extras:
discussion - https://chat.stackoverflow.com/rooms/214307/discussion-between-biarys-and-0xdfdfdfdf
0xDFDFDFDF solution - https://gist.github.com/0xDFDFDFDF/a93a7e4448abc03f606008c7422784d1
Indeed, window functions will do the trick.
I've created a small mock dataset, which should resemble yours.
columns = ['DateTime', 'Symbol', 'Open', 'High', 'Low', 'Close', 'Volume']
data = [('2010-09-13 09:30:00','A',29.23,29.25,29.17,29.25,17667.0),
('2010-09-13 09:31:00','A',29.26,29.34,29.25,29.33,5000.0),
('2010-09-13 09:32:00','A',29.31,29.36,29.31,29.36,600.0),
('2010-09-13 09:34:00','A',29.35,29.39,29.31,29.39,3222.0),
('2010-09-13 09:30:00','AAPL',39.23,39.25,39.17,39.25,37667.0),
('2010-09-13 09:31:00','AAPL',39.26,39.34,39.25,39.33,3000.0),
('2010-09-13 09:32:00','AAPL',39.31,39.36,39.31,39.36,300.0),
('2010-09-13 09:33:00','AAPL',39.33,39.36,39.30,39.35,3300.0),
('2010-09-13 09:34:00','AAPL',39.35,39.39,39.31,39.39,4222.0),
('2010-09-13 09:34:00','MSFT',39.35,39.39,39.31,39.39,7222.0)]
df = spark.createDataFrame(data, columns)
Now, df.show() will give us this:
+-------------------+------+-----+-----+-----+-----+-------+
| DateTime|Symbol| Open| High| Low|Close| Volume|
+-------------------+------+-----+-----+-----+-----+-------+
|2010-09-13 09:30:00| A|29.23|29.25|29.17|29.25|17667.0|
|2010-09-13 09:31:00| A|29.26|29.34|29.25|29.33| 5000.0|
|2010-09-13 09:32:00| A|29.31|29.36|29.31|29.36| 600.0|
|2010-09-13 09:34:00| A|29.35|29.39|29.31|29.39| 3222.0|
|2010-09-13 09:30:00| AAPL|39.23|39.25|39.17|39.25|37667.0|
|2010-09-13 09:31:00| AAPL|39.26|39.34|39.25|39.33| 3000.0|
|2010-09-13 09:32:00| AAPL|39.31|39.36|39.31|39.36| 300.0|
|2010-09-13 09:33:00| AAPL|39.33|39.36| 39.3|39.35| 3300.0|
|2010-09-13 09:34:00| AAPL|39.35|39.39|39.31|39.39| 4222.0|
|2010-09-13 09:34:00| MSFT|39.35|39.39|39.31|39.39| 7222.0|
+-------------------+------+-----+-----+-----+-----+-------+
Here's the solution, which uses the aforementioned window function for rank(). Some transformation is needed, for which you can use the pivot() function.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
result = (df
.select(
'DateTime',
'Symbol',
f.rank().over(Window().partitionBy('DateTime').orderBy('Volume')).alias('rank')
)
.groupby('DateTime')
.pivot('Symbol')
.agg(f.first('rank'))
.orderBy('DateTime')
)
By calling result.show() you'll get:
+-------------------+----+----+----+
| DateTime| A|AAPL|MSFT|
+-------------------+----+----+----+
|2010-09-13 09:30:00| 1| 2|null|
|2010-09-13 09:31:00| 2| 1|null|
|2010-09-13 09:32:00| 2| 1|null|
|2010-09-13 09:33:00|null| 1|null|
|2010-09-13 09:34:00| 1| 2| 3|
+-------------------+----+----+----+
Make sure you understand the difference between rank(), dense_rank() and row_number() functions, as they behave differently when they encounter equal numbers in a given window - you can find the explanation here.
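To see the difference in action, here is a small toy example (made-up data, not from the question) showing how the three functions treat ties:

from pyspark.sql.window import Window
import pyspark.sql.functions as f

# Made-up data with a tie on Volume.
ties = spark.createDataFrame([(10,), (10,), (20,)], ['Volume'])
w = Window.orderBy('Volume')
ties.select(
    'Volume',
    f.rank().over(w).alias('rank'),              # 1, 1, 3 - ties share a rank, then a gap
    f.dense_rank().over(w).alias('dense_rank'),  # 1, 1, 2 - ties share a rank, no gap
    f.row_number().over(w).alias('row_number'),  # 1, 2, 3 - ties broken arbitrarily
).show()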
I have a data frame that contains the following columns:
ID Scheduled Date
241 10/9/2018
423 9/25/2018
126 9/30/2018
123 8/13/2018
132 8/16/2018
143 10/6/2018
I want to count the total number of IDs by week. Specifically, I want the week to always start on Monday and always end on Sunday.
I achieved this in Jupyter Notebook already:
weekly_count_output = df.resample('W-Mon', on='Scheduled Date', label='left', closed='left').sum().query('count_row > 0')
weekly_count_output = weekly_count_output.reset_index()
weekly_count_output = weekly_count_output[['Scheduled Date', 'count_row']]
weekly_count_output = weekly_count_output.rename(columns = {'count_row': 'Total Count'})
But I don't know how to write the above code in Python PySpark syntax. I want my resulting output to look like this:
Scheduled Date Total Count
8/13/2018 2
9/24/2018 2
10/1/2018 1
10/8/2018 1
Please note the Scheduled Date is always a Monday (indicating beginning of week) and the total count goes from Monday to Sunday of that week.
Thanks to Get Last Monday in Spark for defining the function previous_day.
First, import:
from pyspark.sql.functions import *
from datetime import datetime
Assuming your input data is as in my df (DataFrame):
cols = ['id', 'scheduled_date']
vals = [
(241, '10/09/2018'),
(423, '09/25/2018'),
(126, '09/30/2018'),
(123, '08/13/2018'),
(132, '08/16/2018'),
(143, '10/06/2018')
]
df = spark.createDataFrame(vals, cols)
This is the function definition:
def previous_day(date, dayOfWeek):
    return date_sub(next_day(date, dayOfWeek), 7)

# Converting the string column to a yyyy-MM-dd date string.
df = df.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'MM/dd/yyyy') \
    .cast('timestamp'), 'yyyy-MM-dd'))
df.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-09|
|423| 2018-09-25|
|126| 2018-09-30|
|123| 2018-08-13|
|132| 2018-08-16|
|143| 2018-10-06|
+---+--------------+
# Returns the Monday of the week for each date
df_mon = df.withColumn("scheduled_date", previous_day('scheduled_date', 'monday'))
df_mon.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-08|
|423| 2018-09-24|
|126| 2018-09-24|
|123| 2018-08-13|
|132| 2018-08-13|
|143| 2018-10-01|
+---+--------------+
# You can groupBy and do agg count of 'id'.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id')).orderBy('scheduled_date')
# Reformatting to match your resulting output.
df_mon_grp = df_mon_grp.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', "yyyy-MM-dd") \
.cast('timestamp'), 'MM/dd/yyyy'))
df_mon_grp.show()
+--------------+---------+
|scheduled_date|count(id)|
+--------------+---------+
| 08/13/2018| 2|
| 09/24/2018| 2|
| 10/01/2018| 1|
| 10/08/2018| 1|
+--------------+---------+
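As an aside (not part of the original answer): on Spark 2.3+, date_trunc('week', ...) truncates a timestamp to the Monday of its week, which may let you skip the previous_day helper entirely. A minimal sketch, assuming scheduled_date is the yyyy-MM-dd string produced above (to_timestamp parses that format by default):

from pyspark.sql import functions as F

# Sketch only: group by the Monday of each week using date_trunc.
df_weekly = (df
    .withColumn('week_start', F.date_trunc('week', F.to_timestamp('scheduled_date')))
    .groupBy('week_start')
    .agg(F.count('id').alias('total_count'))
    .orderBy('week_start'))

df_weekly.show()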
I have a dataset with the columns 'user_name', 'mac', and 'dayte' (day). I would like to GROUP BY 'user_name', and for each group build a rolling 30-day window over 'dayte'. In that rolling 30-day period, I would like to count the distinct number of 'mac' values and add that count to my dataframe. Here is a sample of the data.
user_name mac dayte
0 001j 7C:D1 2017-09-15
1 0039711 40:33 2017-07-25
2 0459 F0:79 2017-08-01
3 0459 F0:79 2017-08-06
4 0459 F0:79 2017-08-31
5 0459 78:D7 2017-09-08
6 0459 E0:C7 2017-09-16
7 133833 18:5E 2017-07-27
8 133833 F4:0F 2017-07-31
9 133833 A4:E4 2017-08-07
I have tried solving this with a pandas dataframe.
df['ct_macs'] = df.groupby(['user_name']).rolling('30d', on='dayte').mac.apply(lambda x:len(x.unique()))
But received the error
Exception: cannot handle a non-unique multi-index!
I tried in PySpark, but received an error as well.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# function to calculate number of seconds from number of days
days = lambda i: i * 86400

# convert string timestamp to timestamp type
df = df.withColumn('dayte', df.dayte.cast('timestamp'))

# create window by casting timestamp to long (number of seconds)
w = Window.partitionBy("user_name").orderBy("dayte").rangeBetween(-days(30), 0)

df = df.select("user_name", "mac", "dayte", F.size(F.denseRank().over(w).alias("ct_mac")))
But received the error
Py4JJavaError: An error occurred while calling o464.select.
: org.apache.spark.sql.AnalysisException: Window function dense_rank does not take a frame specification.;
I also tried
df= df.select("user_name","dayte",F.countDistinct(col("mac")).over(w).alias("ct_mac"))
But count distinct over a window is apparently not supported in Spark.
I'm open to a purely SQL approach. In either MySQL or SQL Server, but would prefer Python or Spark.
Pyspark
Window functions are limited in the following ways:
A frame can only be defined by rows and not column values
countDistinct doesn't exist
enumerating functions cannot be used with a frame
Instead you can self join your table.
First let's create the dataframe:
df = sc.parallelize([["001j", "7C:D1", "2017-09-15"], ["0039711", "40:33", "2017-07-25"], ["0459", "F0:79", "2017-08-01"],
["0459", "F0:79", "2017-08-06"], ["0459", "F0:79", "2017-08-31"], ["0459", "78:D7", "2017-09-08"],
["0459", "E0:C7", "2017-09-16"], ["133833", "18:5E", "2017-07-27"], ["133833", "F4:0F", "2017-07-31"],
["133833", "A4:E4", "2017-08-07"]]).toDF(["user_name", "mac", "dayte"])
Now for the join and groupBy:
import pyspark.sql.functions as psf
df.alias("left")\
.join(
df.alias("right"),
(psf.col("left.user_name") == psf.col("right.user_name"))
& (psf.col("right.dayte").between(psf.date_add("left.dayte", -30), psf.col("left.dayte"))),
"leftouter")\
.groupBy(["left." + c for c in df.columns])\
.agg(psf.countDistinct("right.mac").alias("ct_macs"))\
.sort("user_name", "dayte").show()
+---------+-----+----------+-------+
|user_name| mac| dayte|ct_macs|
+---------+-----+----------+-------+
| 001j|7C:D1|2017-09-15| 1|
| 0039711|40:33|2017-07-25| 1|
| 0459|F0:79|2017-08-01| 1|
| 0459|F0:79|2017-08-06| 1|
| 0459|F0:79|2017-08-31| 1|
| 0459|78:D7|2017-09-08| 2|
| 0459|E0:C7|2017-09-16| 3|
| 133833|18:5E|2017-07-27| 1|
| 133833|F4:0F|2017-07-31| 2|
| 133833|A4:E4|2017-08-07| 3|
+---------+-----+----------+-------+
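As an aside (not part of the original answer): on more recent Spark versions, a window-based workaround is usually possible by collecting the distinct macs into a set over a 30-day range frame. This is only a sketch, assuming collect_set over a range frame is supported in your version:

from pyspark.sql import functions as psf
from pyspark.sql.window import Window

days = lambda i: i * 86400

# Range frame covering the previous 30 days, expressed in seconds.
w = (Window.partitionBy("user_name")
     .orderBy(psf.col("dayte").cast("timestamp").cast("long"))
     .rangeBetween(-days(30), 0))

df.withColumn("ct_macs", psf.size(psf.collect_set("mac").over(w))).show()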
Pandas
This works for Python 3:
import pandas as pd
import numpy as np
df["mac"] = pd.factorize(df["mac"])[0]
df.groupby('user_name').rolling('30D', on="dayte").mac.apply(lambda x: len(np.unique(x)))
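One assumption worth stating (not in the original answer): for the '30D' rolling window to work, dayte has to be a datetime dtype and monotonically increasing within each group, e.g.:

import pandas as pd

# Prerequisite sketch: parse 'dayte' and sort before applying the rolling window.
df['dayte'] = pd.to_datetime(df['dayte'])
df = df.sort_values(['user_name', 'dayte'])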
I have an SFrame with the columns Date1 and Date2.
I am trying to use .apply() to find the datediff between Date1 and Date2, but I can't figure out how to use the other argument.
Ideally something like
frame['new_col'] = frame['Date1'].apply(lambda x: datediff(x, frame['Date2']))
You can directly take the difference between the dates in column Date2 and those in Date1 by just subtracting frame['Date1'] from frame['Date2']. That, for some reason, returns the number of seconds between the two dates (only tested with Python's datetime objects), which you can convert into a number of days with simple arithmetic:
from sframe import SFrame
from datetime import datetime, timedelta
mydict = {'Date1':[datetime.now(), datetime.now()+timedelta(2)],
'Date2':[datetime.now()+timedelta(10), datetime.now()+timedelta(17)]}
frame = SFrame(mydict)
frame['new_col'] = (frame['Date2'] - frame['Date1']).apply(lambda x: x//(60*60*24))
Output:
+----------------------------+----------------------------+---------+
| Date1 | Date2 | new_col |
+----------------------------+----------------------------+---------+
| 2016-10-02 21:12:14.712556 | 2016-10-12 21:12:14.712574 | 10.0 |
| 2016-10-04 21:12:14.712567 | 2016-10-19 21:12:14.712576 | 15.0 |
+----------------------------+----------------------------+---------+