I have a PySpark dataframe with a datetime column containing values like: 2022-06-01 13:59:58
I would like to transform that value into: 2022-06-01 14:00:58
Is there a way to roll the minutes over into the next hour when the minute value is 59?
You can accomplish this using either expr or unix_timestamp: extract the minute from your timestamp and, inside a when-otherwise, add 1 minute whenever it equals 59.
The unix_timestamp route is a bit more fiddly because it involves an extra round trip through epoch seconds, but the end result is the same either way.
Data Preparation
from io import StringIO

import pandas as pd
import pyspark.sql.functions as F

s = StringIO("""
date_str
2022-03-01 13:59:50
2022-05-20 13:45:50
2022-06-21 16:59:50
2022-10-22 20:59:50
""")

df = pd.read_csv(s, delimiter=',')

# `sql` is the active SparkSession
sparkDF = sql.createDataFrame(df)\
             .withColumn('date_parsed', F.to_timestamp(F.col('date_str'), 'yyyy-MM-dd HH:mm:ss'))\
             .drop('date_str')

sparkDF.show()
+-------------------+
| date_parsed|
+-------------------+
|2022-03-01 13:59:50|
|2022-05-20 13:45:50|
|2022-06-21 16:59:50|
|2022-10-22 20:59:50|
+-------------------+
Extracting Minute & Addition
# Extract the minute component, then add 1 minute only where it equals 59
sparkDF = sparkDF.withColumn("date_minute", F.minute("date_parsed"))

sparkDF = sparkDF.withColumn('date_parsed_updated_expr',
              F.when(F.col('date_minute') == 59, F.col('date_parsed') + F.expr('INTERVAL 1 MINUTE'))\
               .otherwise(F.col('date_parsed'))
          ).withColumn('date_parsed_updated_unix',
              F.when(F.col('date_minute') == 59, (F.unix_timestamp(F.col('date_parsed')) + 60).cast('timestamp'))
               .otherwise(F.col('date_parsed'))
          )
sparkDF.show()
+-------------------+-----------+------------------------+------------------------+
| date_parsed|date_minute|date_parsed_updated_expr|date_parsed_updated_unix|
+-------------------+-----------+------------------------+------------------------+
|2022-03-01 13:59:50| 59| 2022-03-01 14:00:50| 2022-03-01 14:00:50|
|2022-05-20 13:45:50| 45| 2022-05-20 13:45:50| 2022-05-20 13:45:50|
|2022-06-21 16:59:50| 59| 2022-06-21 17:00:50| 2022-06-21 17:00:50|
|2022-10-22 20:59:50| 59| 2022-10-22 21:00:50| 2022-10-22 21:00:50|
+-------------------+-----------+------------------------+------------------------+
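If you prefer not to keep the helper date_minute column around, the minute check can also be done inline; a minimal variant of the expr approach above:

sparkDF = sparkDF.withColumn(
    'date_parsed_updated',
    F.when(F.minute('date_parsed') == 59,
           F.col('date_parsed') + F.expr('INTERVAL 1 MINUTE'))
     .otherwise(F.col('date_parsed'))
)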
I have users with movements throughout the year, and I want to aggregate their data into 30-day windows counted backwards from each user's last movement.
So if I have a user with movements on dates:
id  date        value
1   2021-01-30      2
1   2021-02-01      4
1   2021-02-08      7
1   2021-04-15     23
I want to create:
[window 3, from 01/15 to 02/15]
[window 2, from 02/15 to 03/15]
[window 1, from 03/15 to 04/15]
And I almost got it with:
dfsp.groupBy(["id", F.window("date", "30 days")])
.agg({'value':'sum'})
.orderBy("window")
.fillna(0)
But I noticed that the windows it generates aren't anchored to the last movement, and I don't know how that can be done.
So, basically, the final dataframe would be something like:
id  window                                       sum(value)
1   (2021-01-15 00:00:00, 2021-02-15 00:00:00)           13
1   (2021-02-15 00:00:00, 2021-03-15 00:00:00)            0
1   (2021-03-15 00:00:00, 2021-04-15 00:00:00)           23
You could achieve something close to this using applyInPandas and resample. In native Spark it would require more creative coding.
Note: the origin='end' parameter of resample is only available in pandas >= 1.3.0.
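For reference, a sketch of the input this snippet assumes, built from the question's sample (spark being the active SparkSession); it also pulls in the imports the code below relies on:

import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

df = spark.createDataFrame(
    [(1, '2021-01-30', 2), (1, '2021-02-01', 4), (1, '2021-02-08', 7), (1, '2021-04-15', 23)],
    ['id', 'date', 'value']
)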
df = df.withColumn('date', f.col('date').cast(TimestampType()))
def pd_resample(df):
    # Runs once per id group: resample into 30-day bins anchored at the group's last date
    return df.groupby('id').resample('30D', on='date', origin='end').value.sum().reset_index()
schema = StructType([
StructField('id', IntegerType(), True),
StructField('date', TimestampType(), True),
StructField('value', IntegerType(), True)
])
df.groupby('id').applyInPandas(pd_resample, schema=schema).show()
+---+-------------------+-----+
| id| date|value|
+---+-------------------+-----+
| 1|2021-02-14 00:00:00| 13|
| 1|2021-03-16 00:00:00| 0|
| 1|2021-04-15 00:00:00| 23|
+---+-------------------+-----+
I have a data frame that contains the following columns:
ID Scheduled Date
241 10/9/2018
423 9/25/2018
126 9/30/2018
123 8/13/2018
132 8/16/2018
143 10/6/2018
I want to count the total number of IDs by week. Specifically, I want the week to always start on Monday and always end on Sunday.
I achieved this in Jupyter Notebook already:
weekly_count_output = df.resample('W-Mon', on='Scheduled Date', label='left', closed='left').sum().query('count_row > 0')
weekly_count_output = weekly_count_output.reset_index()
weekly_count_output = weekly_count_output[['Scheduled Date', 'count_row']]
weekly_count_output = weekly_count_output.rename(columns = {'count_row': 'Total Count'})
But I don't know how to express this in PySpark. I want my resulting output to look like this:
Scheduled Date Total Count
8/13/2018 2
9/24/2018 2
10/1/2018 1
10/8/2018 1
Please note the Scheduled Date is always a Monday (indicating beginning of week) and the total count goes from Monday to Sunday of that week.
Thanks to Get Last Monday in Spark for defining the function previous_day.
First, the imports:
from pyspark.sql.functions import *
from datetime import datetime
Assuming your input data looks like my df (DataFrame):
cols = ['id', 'scheduled_date']
vals = [
(241, '10/09/2018'),
(423, '09/25/2018'),
(126, '09/30/2018'),
(123, '08/13/2018'),
(132, '08/16/2018'),
(143, '10/06/2018')
]
df = spark.createDataFrame(vals, cols)
This is the function definition:
def previous_day(date, dayOfWeek):
    # Most recent occurrence of `dayOfWeek` on or before `date`
    return date_sub(next_day(date, dayOfWeek), 7)
# Converting the string column to timestamp.
df = df.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'MM/dd/yyyy') \
    .cast('timestamp'), 'yyyy-MM-dd'))
df.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-09|
|423| 2018-09-25|
|126| 2018-09-30|
|123| 2018-08-13|
|132| 2018-08-16|
|143| 2018-10-06|
+---+--------------+
# Snap each date back to the Monday of its week
df_mon = df.withColumn("scheduled_date", previous_day('scheduled_date', 'monday'))
df_mon.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-08|
|423| 2018-09-24|
|126| 2018-09-24|
|123| 2018-08-13|
|132| 2018-08-13|
|143| 2018-10-01|
+---+--------------+
# You can groupBy and do agg count of 'id'.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id')).orderBy('scheduled_date')
# Reformatting to match your resulting output.
df_mon_grp = df_mon_grp.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', "yyyy-MM-dd") \
.cast('timestamp'), 'MM/dd/yyyy'))
df_mon_grp.show()
+--------------+---------+
|scheduled_date|count(id)|
+--------------+---------+
| 08/13/2018| 2|
| 09/24/2018| 2|
| 10/01/2018| 1|
| 10/08/2018| 1|
+--------------+---------+
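As a side note (not from the answer above), if you are on Spark 2.3+ the helper function can potentially be skipped altogether, since date_trunc('week', ...) truncates a timestamp to the Monday of its week. A rough sketch, run against the df whose scheduled_date was already reformatted to 'yyyy-MM-dd' above:

from pyspark.sql.functions import col, count, date_format, date_trunc

df_trunc = (df
    .withColumn('week_start', date_trunc('week', col('scheduled_date').cast('timestamp')))
    .groupBy('week_start')
    .agg(count('id').alias('total_count'))
    .orderBy('week_start')
    .withColumn('week_start', date_format('week_start', 'MM/dd/yyyy')))
df_trunc.show()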
I have a dataset with 'user_name', 'mac', and 'dayte' (day) columns. I would like to group by 'user_name', build a rolling 30-day window over 'dayte' within each group, count the distinct 'mac' values in that window, and add that count to my dataframe. Sample of the data:
user_name mac dayte
0 001j 7C:D1 2017-09-15
1 0039711 40:33 2017-07-25
2 0459 F0:79 2017-08-01
3 0459 F0:79 2017-08-06
4 0459 F0:79 2017-08-31
5 0459 78:D7 2017-09-08
6 0459 E0:C7 2017-09-16
7 133833 18:5E 2017-07-27
8 133833 F4:0F 2017-07-31
9 133833 A4:E4 2017-08-07
I have tried solving this with a pandas DataFrame.
df['ct_macs'] = df.groupby(['user_name']).rolling('30d', on='dayte').mac.apply(lambda x:len(x.unique()))
But received the error
Exception: cannot handle a non-unique multi-index!
I tried in PySpark, but received an error as well.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# function to calculate number of seconds from number of days
days = lambda i: i * 86400
#convert string timestamp to timestamp type
df= df.withColumn('dayte', df.dayte.cast('timestamp'))
#create window by casting timestamp to long (number of seconds)
w = Window.partitionBy("user_name").orderBy("dayte").rangeBetween(-days(30), 0)
df= df.select("user_name","mac","dayte",F.size(F.denseRank().over(w).alias("ct_mac")))
But received the error
Py4JJavaError: An error occurred while calling o464.select.
: org.apache.spark.sql.AnalysisException: Window function dense_rank does not take a frame specification.;
I also tried
df= df.select("user_name","dayte",F.countDistinct(col("mac")).over(w).alias("ct_mac"))
But count distinct over a window is apparently not supported in Spark.
I'm open to a purely SQL approach. In either MySQL or SQL Server, but would prefer Python or Spark.
PySpark
Window functions are limited in the following ways:
A frame can only be defined by rows and not column values
countDistinct doesn't exist
enumerating functions cannot be used with a frame
Instead you can self join your table.
First let's create the dataframe:
df = sc.parallelize([["001j", "7C:D1", "2017-09-15"], ["0039711", "40:33", "2017-07-25"], ["0459", "F0:79", "2017-08-01"],
["0459", "F0:79", "2017-08-06"], ["0459", "F0:79", "2017-08-31"], ["0459", "78:D7", "2017-09-08"],
["0459", "E0:C7", "2017-09-16"], ["133833", "18:5E", "2017-07-27"], ["133833", "F4:0F", "2017-07-31"],
["133833", "A4:E4", "2017-08-07"]]).toDF(["user_name", "mac", "dayte"])
Now for the join and groupBy:
import pyspark.sql.functions as psf
df.alias("left")\
.join(
df.alias("right"),
(psf.col("left.user_name") == psf.col("right.user_name"))
& (psf.col("right.dayte").between(psf.date_add("left.dayte", -30), psf.col("left.dayte"))),
"leftouter")\
.groupBy(["left." + c for c in df.columns])\
.agg(psf.countDistinct("right.mac").alias("ct_macs"))\
.sort("user_name", "dayte").show()
+---------+-----+----------+-------+
|user_name| mac| dayte|ct_macs|
+---------+-----+----------+-------+
| 001j|7C:D1|2017-09-15| 1|
| 0039711|40:33|2017-07-25| 1|
| 0459|F0:79|2017-08-01| 1|
| 0459|F0:79|2017-08-06| 1|
| 0459|F0:79|2017-08-31| 1|
| 0459|78:D7|2017-09-08| 2|
| 0459|E0:C7|2017-09-16| 3|
| 133833|18:5E|2017-07-27| 1|
| 133833|F4:0F|2017-07-31| 2|
| 133833|A4:E4|2017-08-07| 3|
+---------+-----+----------+-------+
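Since you mentioned being open to a purely SQL approach, the same self join can also be expressed in Spark SQL; a sketch, assuming a SparkSession named spark and a temp view name (events) that is just made up here:

df.createOrReplaceTempView("events")

spark.sql("""
    SELECT l.user_name, l.mac, l.dayte, COUNT(DISTINCT r.mac) AS ct_macs
    FROM events l
    LEFT JOIN events r
      ON  l.user_name = r.user_name
      AND to_date(r.dayte) BETWEEN date_sub(to_date(l.dayte), 30) AND to_date(l.dayte)
    GROUP BY l.user_name, l.mac, l.dayte
    ORDER BY l.user_name, l.dayte
""").show()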
Pandas
This works for Python 3:
import pandas as pd
import numpy as np

# assumes `dayte` is a datetime column, as in the question's pandas attempt
df["mac"] = pd.factorize(df["mac"])[0]
df.groupby('user_name').rolling('30D', on="dayte").mac.apply(lambda x: len(np.unique(x)))
I have the following dataframe:
Date Time Quantity
20171003 5:00 2
20171003 5:15 5
....
20171005 5:00 1
20171005 5:15 9
I need to create a new column containing the quantity of the same day of the previous week, that is:
Date Time Quantity Quantity-1
20171003 5:00 2 NaN
20171003 5:15 5 NaN
....
20171005 5:00 1 2
20171005 5:15 9 5
I figured out how to get the same day of the last week by using for example:
last_week = today() + relativedelta(weeks=-1, weekday= now.weekday())
How to apply this to my dataframe?
Thank you in advance!
Does your index have a pattern? If yes, you could use pd.shift(). The periods parameter is the number of rows to shift by. For example, assuming your Time column is always either 5:00 or 5:15 and that you have calendar days, your period would be 7 * 2 = 14:
df['Quantity-1'] = df['Quantity'].shift(14)
If the data is collected with exactly the same number of rows every day, using pd.shift as @EricB mentioned should be perfect.
Alternatively, you can create a new dataframe whose dates are shifted forward by 14 days and merge it back onto the original dataframe on the date and time columns (this assumes you want the quantity from the same time of day 14 days earlier).
import datetime

import pandas as pd

df = pd.DataFrame([
    ['20171003', '5:00', '2'],
    ['20171003', '5:15', '5'],
    ['20171005', '5:00', '1'],
    ['20171005', '5:15', '9'],
    ['20171019', '5:00', '8']],
    columns=['date', 'time', 'quantity'])
df.loc[:, 'date'] = pd.to_datetime(df.date)

df2 = df[['date', 'time', 'quantity']].copy()
df2.loc[:, 'date'] = df2.date + datetime.timedelta(weeks=2)  # shift forward by 2 weeks
df_shift = df.merge(df2, on=['time', 'date'], how='left')
Output of df_shift
+-----------+----+----------+----------+
| date|time|quantity_x|quantity_y|
+-----------+----+----------+----------+
|2017-10-03 |5:00| 2| |
|2017-10-03 |5:15| 5| |
|2017-10-05 |5:00| 1| |
|2017-10-05 |5:15| 9| |
|2017-10-19 |5:00| 8| 1|
+-----------+----+----------+----------+
Adding to @titipata's solution, there is another way to do it without having to merge.
In a nutshell, the approach goes as follows:
Get the datetime 1 day/week/month after the first value.
Starting from that datetime onwards, take the value from 1 day/week/month before.
so for example, if your dataset starts at 1/10/2021 00:00:00 (that's the 1st of October for you Americans)
First:
you will have these values
1 day after: 2/10/2021
1 week after: 8/10/2021
1 month after: 1/11/2021
Second Step
get the following
Previous day values for values starting from 2/10/2021
And so on and so forth
Hope someone finds this helpful
from pandas import DateOffset
def add_past_values(df):
    # expects a 'datetime' column and a 'myvalue' column
    df = df.set_index('datetime')
firstvalue = df.index[0]
#1. get the datetime after 1 day/week/month from the first value
secondday = firstvalue + DateOffset(days = 1)
secondweek = firstvalue + DateOffset(weeks = 1)
secondmonth = firstvalue + DateOffset(months = 1)
#2. starting from that datetime onwards get the value 1 day/week/month before
df.loc[secondday:,'lag_day_1'] = df.loc[df.loc[secondday:].index - DateOffset(days=1),'myvalue'].values
df.loc[secondweek:,'lag_week_1'] = df.loc[df.loc[secondweek:].index - DateOffset(weeks=1),'myvalue'].values
df.loc[secondmonth:,'lag_month_1'] = df.loc[df.loc[secondmonth:].index - DateOffset(months=1),'myvalue'].values
df = df.reset_index()
return df
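For illustration, a hypothetical frame this could run against (the 'datetime' and 'myvalue' names come from the function body; the approach assumes a regular, gap-free index so every shifted timestamp actually exists):

import pandas as pd

example = pd.DataFrame({
    'datetime': pd.date_range('2021-10-01', periods=24 * 62, freq='H'),  # two months of hourly data
    'myvalue': range(24 * 62),
})
example = add_past_values(example)
print(example[['datetime', 'myvalue', 'lag_day_1', 'lag_week_1', 'lag_month_1']].tail())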
I have an SFrame with the columns Date1 and Date2.
I am trying to use .apply() to find the datediff between Date1 and Date2, but I can't figure out how to use the other argument.
Ideally something like
frame['new_col'] = frame['Date1'].apply(lambda x: datediff(x,frame('Date2')))
You can directly take the difference between the dates in column Date2 and those in Date1 by just subtracting frame['Date1'] from frame['Date2']. That, for some reason, returns the number of seconds between the two dates (only tested with Python's datetime objects), which you can convert into a number of days with simple arithmetic:
from sframe import SFrame
from datetime import datetime, timedelta
mydict = {'Date1':[datetime.now(), datetime.now()+timedelta(2)],
'Date2':[datetime.now()+timedelta(10), datetime.now()+timedelta(17)]}
frame = SFrame(mydict)
frame['new_col'] = (frame['Date2'] - frame['Date1']).apply(lambda x: x//(60*60*24))
Output:
+----------------------------+----------------------------+---------+
| Date1 | Date2 | new_col |
+----------------------------+----------------------------+---------+
| 2016-10-02 21:12:14.712556 | 2016-10-12 21:12:14.712574 | 10.0 |
| 2016-10-04 21:12:14.712567 | 2016-10-19 21:12:14.712576 | 15.0 |
+----------------------------+----------------------------+---------+