Pyspark tumbling window from last movement - python

I have users with movements throughout the year, and I want to create 30-day windows to aggregate the data, counting back from the last movement each user made.
So if I have a user with movements on dates:
id  date        value
1   2021-01-30  2
1   2021-02-01  4
1   2021-02-08  7
1   2021-04-15  23
I want to create:
[window 3, from 01/15 to 02/15]
[window 2, from 02/15 to 03/15]
[window 1, from 03/15 to 04/15]
And I almost got it with:
dfsp.groupBy(["id", F.window("date", "30 days")]) \
    .agg({'value': 'sum'}) \
    .orderBy("window") \
    .fillna(0)
But I noticed that the windows it generates are not anchored to the last movement, and I don't know how that can be done.
So, basically, the final dataframe would be something like:
id  window                                       sum(value)
1   (2021-01-15 00:00:00, 2021-02-15 00:00:00)   13
1   (2021-02-15 00:00:00, 2021-03-15 00:00:00)   0
1   (2021-03-15 00:00:00, 2021-04-15 00:00:00)   23

You could achieve something close to this using applyInPandas and resample. In native Spark it would require more creative coding.
Note: the origin='end' parameter of resample is only available in pandas >= 1.3.0.
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

df = df.withColumn('date', f.col('date').cast(TimestampType()))

def pd_resample(pdf):
    # pdf is the pandas DataFrame for a single id; resample into 30-day bins anchored at the last date
    return pdf.groupby('id').resample('30D', on='date', origin='end').value.sum().reset_index()

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('date', TimestampType(), True),
    StructField('value', IntegerType(), True)
])

df.groupby('id').applyInPandas(pd_resample, schema=schema).show()
+---+-------------------+-----+
| id| date|value|
+---+-------------------+-----+
| 1|2021-02-14 00:00:00| 13|
| 1|2021-03-16 00:00:00| 0|
| 1|2021-04-15 00:00:00| 23|
+---+-------------------+-----+
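If you do want to stay in native Spark, a rough sketch of one way to do it (not from the original answer, and assuming the same df with columns id, date and value) is to bucket each row by how many full 30-day periods separate it from the user's last movement, then rebuild the window bounds from that bucket. Unlike resample, this will not emit the empty 0-sum windows, and the exact boundary inclusivity may differ slightly from the table above.
from pyspark.sql import functions as F

# last movement per user
last_dates = df.groupBy("id").agg(F.max("date").alias("last_date"))

windows_native = (
    df.join(last_dates, "id")
      # number of full 30-day periods between this row and the user's last movement
      .withColumn("bucket", F.floor(F.datediff("last_date", "date") / 30))
      .groupBy("id", "last_date", "bucket")
      .agg(F.sum("value").alias("sum_value"))
      # rebuild the window bounds counting back from the last movement
      .withColumn("window_end", F.expr("date_sub(last_date, cast(bucket * 30 as int))"))
      .withColumn("window_start", F.date_sub("window_end", 30))
      .select("id", "window_start", "window_end", "sum_value")
      .orderBy("id", "window_start")
)
windows_native.show()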

Related

Minutes to Hours on datetime column Pyspark

I have a PySpark dataframe with a datetime column containing values like: 2022-06-01 13:59:58
I would like to transform that datetime value into: 2022-06-01 14:00:58
Is there a way to round the minutes up into the next hour when the minute is 59?
You can accomplish this using either expr or unix_timestamp, adding 1 minute with when-otherwise whenever the minute component of your timestamp is 59.
The unix timestamp route is a bit more fiddly, since it involves an extra conversion to epoch seconds, but the end result is the same either way.
Data Preparation
from io import StringIO
import pandas as pd
import pyspark.sql.functions as F

s = StringIO("""
date_str
2022-03-01 13:59:50
2022-05-20 13:45:50
2022-06-21 16:59:50
2022-10-22 20:59:50
""")

df = pd.read_csv(s, delimiter=',')

sparkDF = sql.createDataFrame(df)\
    .withColumn('date_parsed', F.to_timestamp(F.col('date_str'), 'yyyy-MM-dd HH:mm:ss'))\
    .drop('date_str')
sparkDF.show()
+-------------------+
| date_parsed|
+-------------------+
|2022-03-01 13:59:50|
|2022-05-20 13:45:50|
|2022-06-21 16:59:50|
|2022-10-22 20:59:50|
+-------------------+
Extracting Minute & Addition
sparkDF = sparkDF.withColumn("date_minute", F.minute("date_parsed"))

sparkDF = sparkDF.withColumn('date_parsed_updated_expr',
        F.when(F.col('date_minute') == 59, F.col('date_parsed') + F.expr('INTERVAL 1 MINUTE'))
         .otherwise(F.col('date_parsed'))
    ).withColumn('date_parsed_updated_unix',
        F.when(F.col('date_minute') == 59, (F.unix_timestamp(F.col('date_parsed')) + 60).cast('timestamp'))
         .otherwise(F.col('date_parsed'))
    )
sparkDF.show()
+-------------------+-----------+------------------------+------------------------+
| date_parsed|date_minute|date_parsed_updated_expr|date_parsed_updated_unix|
+-------------------+-----------+------------------------+------------------------+
|2022-03-01 13:59:50| 59| 2022-03-01 14:00:50| 2022-03-01 14:00:50|
|2022-05-20 13:45:50| 45| 2022-05-20 13:45:50| 2022-05-20 13:45:50|
|2022-06-21 16:59:50| 59| 2022-06-21 17:00:50| 2022-06-21 17:00:50|
|2022-10-22 20:59:50| 59| 2022-10-22 21:00:50| 2022-10-22 21:00:50|
+-------------------+-----------+------------------------+------------------------+

Multiple files ranking using pyspark

I have stock data (over 6000+ stocks, 100+ GB) saved as HDF5 file.
Basically, I am trying to translate this pandas code into PySpark. Ideally, I would like both the values used for the ranking and the ranks themselves saved to a file.
agg_df = pd.DataFrame()
for stock in stocks:
    df = pd.read_csv(stock)
    df = my_func(df)  # custom function whose output is used for ranking; for simplicity, think standard deviation
    agg_df = pd.concat([agg_df, df], axis=1)  # column-wise concat (one column per stock)
agg_df.rank()  # .to_csv() <- would like to save ranks for future use
Each data file has the same schema like:
Symbol Open High Low Close Volume
DateTime
2010-09-13 09:30:00 A 29.23 29.25 29.17 29.25 17667.0
2010-09-13 09:31:00 A 29.26 29.34 29.25 29.33 5000.0
2010-09-13 09:32:00 A 29.31 29.36 29.31 29.36 600.0
2010-09-13 09:33:00 A 29.33 29.36 29.30 29.35 7300.0
2010-09-13 09:34:00 A 29.35 29.39 29.31 29.39 3222.0
Desired output (where each number is a rank):
A AAPL MSFT ...etc
DateTime
2010-09-13 09:30:00 1 3 7 ...
2010-09-13 09:31:00 4 5 7 ...
2010-09-13 09:32:00 24 17 99 ...
2010-09-13 09:33:00 7 63 42 ...
2010-09-13 09:34:00 5 4 13 ...
I read other answers about Window and pyspark.sql, but not sure how to apply those to my case as I kind of need to aggregate those by row before ranking (at least in pandas)
Edit 1: After I read the data into an RDD with rdd = sc.parallelize(data.keys).map(data.read_data), rdd becomes a PipelineRDD, which doesn't have a .select() method. 0xDFDFDFDF's example contains all the data in one dataframe, but I don't think it's a good idea to append everything to one dataframe to do the computation.
Result: I was finally able to solve it. There were two problems: reading the files and performing the calculations.
Regarding reading the files, I initially loaded them from HDF5 using rdd = sc.parallelize(data.keys).map(data.read_data), which resulted in a PipelineRDD, i.e. a collection of pandas dataframes. These needed to be transformed into Spark dataframes for the solution to work. I ended up converting my HDF5 files to parquet and saving them to a separate folder. Then, using
sqlContext = pyspark.sql.SQLContext(sc)
rdd_p = sqlContext.read.parquet(r"D:\parq")
I read all the files into a dataframe.
After that I performed the calculation from the accepted answer. Huge thanks to 0xDFDFDFDF for the help.
Extras:
discussion - https://chat.stackoverflow.com/rooms/214307/discussion-between-biarys-and-0xdfdfdfdf
0xDFDFDFDF solution - https://gist.github.com/0xDFDFDFDF/a93a7e4448abc03f606008c7422784d1
Indeed, window functions will do the trick.
I've created a small mock dataset, which should resemble yours.
columns = ['DateTime', 'Symbol', 'Open', 'High', 'Low', 'Close', 'Volume']
data = [('2010-09-13 09:30:00', 'A', 29.23, 29.25, 29.17, 29.25, 17667.0),
        ('2010-09-13 09:31:00', 'A', 29.26, 29.34, 29.25, 29.33, 5000.0),
        ('2010-09-13 09:32:00', 'A', 29.31, 29.36, 29.31, 29.36, 600.0),
        ('2010-09-13 09:34:00', 'A', 29.35, 29.39, 29.31, 29.39, 3222.0),
        ('2010-09-13 09:30:00', 'AAPL', 39.23, 39.25, 39.17, 39.25, 37667.0),
        ('2010-09-13 09:31:00', 'AAPL', 39.26, 39.34, 39.25, 39.33, 3000.0),
        ('2010-09-13 09:32:00', 'AAPL', 39.31, 39.36, 39.31, 39.36, 300.0),
        ('2010-09-13 09:33:00', 'AAPL', 39.33, 39.36, 39.30, 39.35, 3300.0),
        ('2010-09-13 09:34:00', 'AAPL', 39.35, 39.39, 39.31, 39.39, 4222.0),
        ('2010-09-13 09:34:00', 'MSFT', 39.35, 39.39, 39.31, 39.39, 7222.0)]
df = spark.createDataFrame(data, columns)
Now, df.show() will give us this:
+-------------------+------+-----+-----+-----+-----+-------+
| DateTime|Symbol| Open| High| Low|Close| Volume|
+-------------------+------+-----+-----+-----+-----+-------+
|2010-09-13 09:30:00| A|29.23|29.25|29.17|29.25|17667.0|
|2010-09-13 09:31:00| A|29.26|29.34|29.25|29.33| 5000.0|
|2010-09-13 09:32:00| A|29.31|29.36|29.31|29.36| 600.0|
|2010-09-13 09:34:00| A|29.35|29.39|29.31|29.39| 3222.0|
|2010-09-13 09:30:00| AAPL|39.23|39.25|39.17|39.25|37667.0|
|2010-09-13 09:31:00| AAPL|39.26|39.34|39.25|39.33| 3000.0|
|2010-09-13 09:32:00| AAPL|39.31|39.36|39.31|39.36| 300.0|
|2010-09-13 09:33:00| AAPL|39.33|39.36| 39.3|39.35| 3300.0|
|2010-09-13 09:34:00| AAPL|39.35|39.39|39.31|39.39| 4222.0|
|2010-09-13 09:34:00| MSFT|39.35|39.39|39.31|39.39| 7222.0|
+-------------------+------+-----+-----+-----+-----+-------+
Here's the solution, which uses the aforementioned window function for rank(). Some reshaping is needed, for which you can use the pivot() function.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
result = (df
    .select(
        'DateTime',
        'Symbol',
        f.rank().over(Window().partitionBy('DateTime').orderBy('Volume')).alias('rank')
    )
    .groupby('DateTime')
    .pivot('Symbol')
    .agg(f.first('rank'))
    .orderBy('DateTime')
)
By calling result.show() you'll get:
+-------------------+----+----+----+
| DateTime| A|AAPL|MSFT|
+-------------------+----+----+----+
|2010-09-13 09:30:00| 1| 2|null|
|2010-09-13 09:31:00| 2| 1|null|
|2010-09-13 09:32:00| 2| 1|null|
|2010-09-13 09:33:00|null| 1|null|
|2010-09-13 09:34:00| 1| 2| 3|
+-------------------+----+----+----+
Make sure you understand the difference between rank(), dense_rank() and row_number() functions, as they behave differently when they encounter equal numbers in a given window - you can find the explanation here.
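As a quick, hypothetical illustration of that difference on a tiny dataframe with ties (the Volume column name is only kept for continuity with the data above):
from pyspark.sql.window import Window
import pyspark.sql.functions as f

ties = spark.createDataFrame([(10,), (20,), (20,), (30,)], ['Volume'])
w = Window.orderBy('Volume')

ties.select(
    'Volume',
    f.rank().over(w).alias('rank'),              # 1, 2, 2, 4 - ties share a rank, leaving gaps
    f.dense_rank().over(w).alias('dense_rank'),  # 1, 2, 2, 3 - ties share a rank, no gaps
    f.row_number().over(w).alias('row_number')   # 1, 2, 3, 4 - ties broken arbitrarily
).show()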

Python PySpark: Count Number of Rows by Week w/Week Starting on Monday and Ending on Sunday

I have a data frame that contains the following columns:
ID Scheduled Date
241 10/9/2018
423 9/25/2018
126 9/30/2018
123 8/13/2018
132 8/16/2018
143 10/6/2018
I want to count the total number of IDs by week. Specifically, I want the week to always start on Monday and always end on Sunday.
I achieved this in Jupyter Notebook already:
weekly_count_output = df.resample('W-Mon', on='Scheduled Date', label='left', closed='left').sum().query('count_row > 0')
weekly_count_output = weekly_count_output.reset_index()
weekly_count_output = weekly_count_output[['Scheduled Date', 'count_row']]
weekly_count_output = weekly_count_output.rename(columns = {'count_row': 'Total Count'})
But I don't know how to write the above code in Python PySpark syntax. I want my resulting output to look like this:
Scheduled Date Total Count
8/13/2018 2
9/24/2018 2
10/1/2018 1
10/8/2018 1
Please note the Scheduled Date is always a Monday (indicating beginning of week) and the total count goes from Monday to Sunday of that week.
Thanks to Get Last Monday in Spark for defining the function previous_day.
Firstly import,
from pyspark.sql.functions import *
from datetime import datetime
Assuming your input data as in my df (DataFrame)
cols = ['id', 'scheduled_date']
vals = [
    (241, '10/09/2018'),
    (423, '09/25/2018'),
    (126, '09/30/2018'),
    (123, '08/13/2018'),
    (132, '08/16/2018'),
    (143, '10/06/2018')
]
df = spark.createDataFrame(vals, cols)
This is the function defined
def previous_day(date, dayOfWeek):
    return date_sub(next_day(date, dayOfWeek), 7)

# Converting the string column to timestamp.
df = df.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'MM/dd/yyyy')
                                                 .cast('timestamp'), 'yyyy-MM-dd'))
df.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-09|
|423| 2018-09-25|
|126| 2018-09-30|
|123| 2018-08-13|
|132| 2018-08-16|
|143| 2018-10-06|
+---+--------------+
# Returns the Monday of the week containing the date
df_mon = df.withColumn("scheduled_date", previous_day('scheduled_date', 'monday'))
df_mon.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-08|
|423| 2018-09-24|
|126| 2018-09-24|
|123| 2018-08-13|
|132| 2018-08-13|
|143| 2018-10-01|
+---+--------------+
# You can groupBy and do agg count of 'id'.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id')).orderBy('scheduled_date')
# Reformatting to match your resulting output.
df_mon_grp = df_mon_grp.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', "yyyy-MM-dd") \
.cast('timestamp'), 'MM/dd/yyyy'))
df_mon_grp.show()
+--------------+---------+
|scheduled_date|count(id)|
+--------------+---------+
| 08/13/2018| 2|
| 09/24/2018| 2|
| 10/01/2018| 1|
| 10/08/2018| 1|
+--------------+---------+
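As a side note (not part of the original answer): on Spark 2.3+ a shorter route should be date_trunc('week', ...), which truncates a timestamp to the Monday of its week, so the previous_day helper is not strictly needed. A minimal sketch on the same vals/cols data:
from pyspark.sql import functions as F

# fresh dataframe from the original string data
df2 = spark.createDataFrame(vals, cols)

weekly = (df2
    # date_trunc('week', ...) returns the Monday 00:00:00 of the week
    .withColumn('week_start', F.date_trunc('week', F.to_timestamp('scheduled_date', 'MM/dd/yyyy')))
    .groupBy('week_start')
    .agg(F.count('id').alias('total_count'))
    .orderBy('week_start'))

weekly.show()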

Python Rolling 30 day period GROUP By with Count Distinct String

I have a dataset with columns 'user_name', 'mac' and 'dayte' (day). I would like to group by 'user_name' and, for each group, create a rolling 30-day window over 'dayte'. Within that rolling 30-day period, I would like to count the distinct number of 'mac' values and add that count to my dataframe. A sample of the data:
user_name mac dayte
0 001j 7C:D1 2017-09-15
1 0039711 40:33 2017-07-25
2 0459 F0:79 2017-08-01
3 0459 F0:79 2017-08-06
4 0459 F0:79 2017-08-31
5 0459 78:D7 2017-09-08
6 0459 E0:C7 2017-09-16
7 133833 18:5E 2017-07-27
8 133833 F4:0F 2017-07-31
9 133833 A4:E4 2017-08-07
I have tried solving this with a PANDAs dataframe.
df['ct_macs'] = df.groupby(['user_name']).rolling('30d', on='dayte').mac.apply(lambda x:len(x.unique()))
But received the error
Exception: cannot handle a non-unique multi-index!
I tried in PySpark, but received an error as well.
from pyspark.sql import functions as F
#function to calculate number of seconds from number of days
days = lambda i: i * 86400
#convert string timestamp to timestamp type
df= df.withColumn('dayte', df.dayte.cast('timestamp'))
#create window by casting timestamp to long (number of seconds)
w = Window.partitionBy("user_name").orderBy("dayte").rangeBetween(-days(30), 0)
df= df.select("user_name","mac","dayte",F.size(F.denseRank().over(w).alias("ct_mac")))
But received the error
Py4JJavaError: An error occurred while calling o464.select.
: org.apache.spark.sql.AnalysisException: Window function dense_rank does not take a frame specification.;
I also tried
df= df.select("user_name","dayte",F.countDistinct(col("mac")).over(w).alias("ct_mac"))
But count distinct over a window is apparently not supported in Spark.
I'm open to a purely SQL approach. In either MySQL or SQL Server, but would prefer Python or Spark.
Pyspark
Window functions are limited in the following ways:
A frame can only be defined by rows and not column values
countDistinct doesn't exist
enumerating functions cannot be used with a frame
Instead you can self join your table.
First let's create the dataframe:
df = sc.parallelize([["001j", "7C:D1", "2017-09-15"], ["0039711", "40:33", "2017-07-25"], ["0459", "F0:79", "2017-08-01"],
                     ["0459", "F0:79", "2017-08-06"], ["0459", "F0:79", "2017-08-31"], ["0459", "78:D7", "2017-09-08"],
                     ["0459", "E0:C7", "2017-09-16"], ["133833", "18:5E", "2017-07-27"], ["133833", "F4:0F", "2017-07-31"],
                     ["133833", "A4:E4", "2017-08-07"]]).toDF(["user_name", "mac", "dayte"])
Now for the join and groupBy:
import pyspark.sql.functions as psf
df.alias("left")\
.join(
df.alias("right"),
(psf.col("left.user_name") == psf.col("right.user_name"))
& (psf.col("right.dayte").between(psf.date_add("left.dayte", -30), psf.col("left.dayte"))),
"leftouter")\
.groupBy(["left." + c for c in df.columns])\
.agg(psf.countDistinct("right.mac").alias("ct_macs"))\
.sort("user_name", "dayte").show()
+---------+-----+----------+-------+
|user_name| mac| dayte|ct_macs|
+---------+-----+----------+-------+
| 001j|7C:D1|2017-09-15| 1|
| 0039711|40:33|2017-07-25| 1|
| 0459|F0:79|2017-08-01| 1|
| 0459|F0:79|2017-08-06| 1|
| 0459|F0:79|2017-08-31| 1|
| 0459|78:D7|2017-09-08| 2|
| 0459|E0:C7|2017-09-16| 3|
| 133833|18:5E|2017-07-27| 1|
| 133833|F4:0F|2017-07-31| 2|
| 133833|A4:E4|2017-08-07| 3|
+---------+-----+----------+-------+
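As a side note (not part of the original answer): on more recent Spark versions a common workaround for the missing countDistinct over a window is size(collect_set(...)), since collect_set, being an aggregate rather than a ranking function, does accept a range frame. A rough sketch against the same df:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda i: i * 86400  # days expressed in seconds

# order by the timestamp as epoch seconds so rangeBetween can use a 30-day range
w = (Window.partitionBy("user_name")
           .orderBy(F.col("dayte").cast("timestamp").cast("long"))
           .rangeBetween(-days(30), 0))

df.withColumn("ct_macs", F.size(F.collect_set("mac").over(w))).show()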
Pandas
This works for python3
import pandas as pd
import numpy as np
df["mac"] = pd.factorize(df["mac"])[0]
df.groupby('user_name').rolling('30D', on="dayte").mac.apply(lambda x: len(np.unique(x)))

Using .apply() in Sframes to manipulate multiple columns of each row

I have an SFrame with the columns Date1 and Date2.
I am trying to use .apply() to find the datediff between Date1 and Date2, but I can't figure out how to use the other argument.
Ideally something like
frame['new_col'] = frame['Date1'].apply(lambda x: datediff(x,frame('Date2')))
You can directly take the difference between the dates in column Date2 and those in Date1 by simply subtracting frame['Date1'] from frame['Date2']. That, for some reason, returns the number of seconds between the two dates (only tested with Python's datetime objects), which you can convert into a number of days with simple arithmetic:
from sframe import SFrame
from datetime import datetime, timedelta
mydict = {'Date1': [datetime.now(), datetime.now() + timedelta(2)],
          'Date2': [datetime.now() + timedelta(10), datetime.now() + timedelta(17)]}
frame = SFrame(mydict)

frame['new_col'] = (frame['Date2'] - frame['Date1']).apply(lambda x: x // (60 * 60 * 24))
Output:
+----------------------------+----------------------------+---------+
| Date1 | Date2 | new_col |
+----------------------------+----------------------------+---------+
| 2016-10-02 21:12:14.712556 | 2016-10-12 21:12:14.712574 | 10.0 |
| 2016-10-04 21:12:14.712567 | 2016-10-19 21:12:14.712576 | 15.0 |
+----------------------------+----------------------------+---------+
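To address the original question more literally (an apply that needs both columns), SFrame's .apply() on the whole frame passes each row as a dict, so a sketch along these lines should also work, assuming the same frame as above:
# each row arrives as a dict, so both dates are available in one lambda
frame['new_col_apply'] = frame.apply(lambda row: (row['Date2'] - row['Date1']).days)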
