Python Rolling 30 day period GROUP BY with Count Distinct String

I have a dataset with the columns 'user_name', 'mac' and 'dayte' (day). I would like to group by 'user_name' and then, within each group, build a rolling 30-day window over 'dayte'. For each rolling 30-day period I would like to count the distinct number of 'mac' values and add that count to my dataframe. A sample of the data:
user_name mac dayte
0 001j 7C:D1 2017-09-15
1 0039711 40:33 2017-07-25
2 0459 F0:79 2017-08-01
3 0459 F0:79 2017-08-06
4 0459 F0:79 2017-08-31
5 0459 78:D7 2017-09-08
6 0459 E0:C7 2017-09-16
7 133833 18:5E 2017-07-27
8 133833 F4:0F 2017-07-31
9 133833 A4:E4 2017-08-07
I have tried solving this with a pandas DataFrame:
df['ct_macs'] = df.groupby(['user_name']).rolling('30d', on='dayte').mac.apply(lambda x:len(x.unique()))
But received the error
Exception: cannot handle a non-unique multi-index!
I tried in PySpark, but received an error as well.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# function to calculate the number of seconds in a number of days
days = lambda i: i * 86400

# convert the string timestamp to timestamp type
df = df.withColumn('dayte', df.dayte.cast('timestamp'))

# create the window by casting the timestamp to long (number of seconds)
w = Window.partitionBy("user_name").orderBy("dayte").rangeBetween(-days(30), 0)

df = df.select("user_name", "mac", "dayte", F.size(F.denseRank().over(w).alias("ct_mac")))
But received the error
Py4JJavaError: An error occurred while calling o464.select.
: org.apache.spark.sql.AnalysisException: Window function dense_rank does not take a frame specification.;
I also tried
df = df.select("user_name", "dayte", F.countDistinct(F.col("mac")).over(w).alias("ct_mac"))
But count distinct over a window is apparently not supported in Spark.
I'm open to a purely SQL approach. In either MySQL or SQL Server, but would prefer Python or Spark.

PySpark
Spark window functions are limited in the following ways:
A frame can only be defined by rows, not by column values
countDistinct cannot be used over a window
Ranking (enumerating) functions cannot be used with a frame specification
Instead, you can self-join your table.
First let's create the dataframe:
df = sc.parallelize([
    ["001j", "7C:D1", "2017-09-15"], ["0039711", "40:33", "2017-07-25"], ["0459", "F0:79", "2017-08-01"],
    ["0459", "F0:79", "2017-08-06"], ["0459", "F0:79", "2017-08-31"], ["0459", "78:D7", "2017-09-08"],
    ["0459", "E0:C7", "2017-09-16"], ["133833", "18:5E", "2017-07-27"], ["133833", "F4:0F", "2017-07-31"],
    ["133833", "A4:E4", "2017-08-07"]
]).toDF(["user_name", "mac", "dayte"])
Now for the join and groupBy:
import pyspark.sql.functions as psf
df.alias("left")\
    .join(
        df.alias("right"),
        (psf.col("left.user_name") == psf.col("right.user_name"))
        & (psf.col("right.dayte").between(psf.date_add("left.dayte", -30), psf.col("left.dayte"))),
        "leftouter")\
    .groupBy(["left." + c for c in df.columns])\
    .agg(psf.countDistinct("right.mac").alias("ct_macs"))\
    .sort("user_name", "dayte").show()
+---------+-----+----------+-------+
|user_name| mac| dayte|ct_macs|
+---------+-----+----------+-------+
| 001j|7C:D1|2017-09-15| 1|
| 0039711|40:33|2017-07-25| 1|
| 0459|F0:79|2017-08-01| 1|
| 0459|F0:79|2017-08-06| 1|
| 0459|F0:79|2017-08-31| 1|
| 0459|78:D7|2017-09-08| 2|
| 0459|E0:C7|2017-09-16| 3|
| 133833|18:5E|2017-07-27| 1|
| 133833|F4:0F|2017-07-31| 2|
| 133833|A4:E4|2017-08-07| 3|
+---------+-----+----------+-------+
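As a side note, more recent Spark releases let you approximate this directly with an aggregate over a range window: collect_set accepts a frame, so size(collect_set(...)) behaves like a distinct count. A minimal sketch, assuming df is the Spark DataFrame built above and your Spark version supports aggregate functions over range frames:
from pyspark.sql import functions as psf
from pyspark.sql.window import Window

days = lambda i: i * 86400  # convert days to seconds

w = (Window.partitionBy("user_name")
     .orderBy(psf.col("dayte").cast("timestamp").cast("long"))
     .rangeBetween(-days(30), 0))

# size(collect_set(...)) stands in for the unsupported countDistinct over a window
df_alt = df.withColumn("ct_macs", psf.size(psf.collect_set("mac").over(w)))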
Pandas
This works in Python 3:
import pandas as pd
import numpy as np

# factorize the MAC strings into integers so the rolling apply can handle them
df["mac"] = pd.factorize(df["mac"])[0]
# 'dayte' must be a datetime column for the '30D' offset window to work
df.groupby('user_name').rolling('30D', on="dayte").mac.apply(lambda x: len(np.unique(x)))
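To attach the rolling count back onto the original frame (the "add that to my dataframe" part of the question), a minimal sketch is below; it assumes dayte has been parsed to datetimes, and note that the exact index layout of a groupby-rolling result can vary a little between pandas versions:
import numpy as np
import pandas as pd

df["dayte"] = pd.to_datetime(df["dayte"])          # rolling('30D') needs a datetime column
df = df.sort_values(["user_name", "dayte"])        # the window must be ordered within each group

counts = (df.groupby("user_name")
            .rolling("30D", on="dayte")
            .mac.apply(lambda x: len(np.unique(x))))

# In recent pandas the result is indexed by (user_name, original row label),
# so dropping the group level aligns it with df.
df["ct_macs"] = counts.reset_index(level=0, drop=True)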

Related

Minutes to Hours on datetime column Pyspark

I have a PySpark dataframe with a datetime column containing values such as 2022-06-01 13:59:58.
I would like to transform that datetime value into 2022-06-01 14:00:58.
Is there a way to roll the time over into the next hour when the minute is 59?
You can accomplish this using expr or unix_timestamp, adding 1 minute with when-otherwise whenever the minute of your timestamp value is 59.
The unix_timestamp route is a bit more fiddly, since it involves an extra conversion to epoch seconds, but either way the end result is the same.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
date_str
2022-03-01 13:59:50
2022-05-20 13:45:50
2022-06-21 16:59:50
2022-10-22 20:59:50
""")

df = pd.read_csv(s, delimiter=',')

# 'sql' is the author's SQLContext / SparkSession handle
sparkDF = sql.createDataFrame(df)\
    .withColumn('date_parsed', F.to_timestamp(F.col('date_str'), 'yyyy-MM-dd HH:mm:ss'))\
    .drop('date_str')

sparkDF.show()
+-------------------+
| date_parsed|
+-------------------+
|2022-03-01 13:59:50|
|2022-05-20 13:45:50|
|2022-06-21 16:59:50|
|2022-10-22 20:59:50|
+-------------------+
Extracting Minute & Addition
sparkDF = sparkDF.withColumn("date_minute", F.minute("date_parsed"))

sparkDF = sparkDF.withColumn(
        'date_parsed_updated_expr',
        F.when(F.col('date_minute') == 59, F.col('date_parsed') + F.expr('INTERVAL 1 MINUTE'))
         .otherwise(F.col('date_parsed'))
    ).withColumn(
        'date_parsed_updated_unix',
        F.when(F.col('date_minute') == 59, (F.unix_timestamp(F.col('date_parsed')) + 60).cast('timestamp'))
         .otherwise(F.col('date_parsed'))
    )

sparkDF.show()
+-------------------+-----------+------------------------+------------------------+
| date_parsed|date_minute|date_parsed_updated_expr|date_parsed_updated_unix|
+-------------------+-----------+------------------------+------------------------+
|2022-03-01 13:59:50| 59| 2022-03-01 14:00:50| 2022-03-01 14:00:50|
|2022-05-20 13:45:50| 45| 2022-05-20 13:45:50| 2022-05-20 13:45:50|
|2022-06-21 16:59:50| 59| 2022-06-21 17:00:50| 2022-06-21 17:00:50|
|2022-10-22 20:59:50| 59| 2022-10-22 21:00:50| 2022-10-22 21:00:50|
+-------------------+-----------+------------------------+------------------------+
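The same when/otherwise logic can also be written without materialising the date_minute helper column, computing the minute inline (a compact variant of the expr approach above; the column name date_rounded is just illustrative):
sparkDF = sparkDF.withColumn(
    'date_rounded',
    F.when(F.minute('date_parsed') == 59,
           F.col('date_parsed') + F.expr('INTERVAL 1 MINUTE'))
     .otherwise(F.col('date_parsed'))
)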

Transforming multiindex df into sorted xlsx in python

long-time reader, first-time poster. I've been doing some time tracking for two projects, grouped the data by project and date using pandas, and would like to fill it into an existing Excel template for a client that is sorted by date (y-axis) and project (x-axis). But I'm stumped. I've been struggling to convert the multi-index dataframe into a sorted xlsx file.
Example data I want to sort
|Date | Project | Hours |
|-----------|---------------------------|---------|
|2022-05-09 |Project 1 | 5.50|
|2022-05-09 |Project 1 | 3.75|
|2022-05-11 |Project 2 | 1.50|
|2022-05-11 |Project 2 | 4.75|
etc.
Desired template
|Date |Project 1|Project 2|
|-----------|---------|---------|
|2022-05-09 | 5.5| 3.75|
|2022-05-11 | 4.75| 1.5|
etc...
So far I've tried a very basic iteration using openpyxl that inserts the dates, but I can't figure out how to
a) rearrange the data in pandas so I can simply insert it, or
b) write conditionally in openpyxl for a given date and project.
# code grouping dates and projects
df = df.groupby(["Date", "Project"]).sum("Hours")

r = 10  # below the template headers and where I would start inserting time tracked
for date in df.index:
    sheet.cell(row=r, column=1).value = date
    r += 1
I've trawled StackOverflow for answers but am coming up empty. Thanks for any help you can provide.
I think your data sample is not correct. In the 2nd row, instead of 2022-05-09 | Project 1 | 3.75 it should be 2022-05-09 | Project 2 | 3.75, and the same goes for the 4th row.
As I understand it, your data is in long format and your desired output is in wide format. In this case pd.pivot_table (or the DataFrame method pivot_table) can help:
pd.pivot_table(data=df, index='Date', columns='Project', values='Hours').reset_index()
df.pivot_table(index='Date', columns='Project', values='Hours')
Date         Project 1   Project 2
2022-05-09         5.5        3.75
2022-05-11        4.75         1.5
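For the second half of the question (filling the existing template), a rough sketch with openpyxl is below; the file name, sheet name and cell positions are assumptions to adjust to your template, and the start row matches the r = 10 used above:
from openpyxl import load_workbook

wide = df.pivot_table(index='Date', columns='Project', values='Hours').reset_index()

wb = load_workbook('template.xlsx')   # hypothetical template file
sheet = wb['Timesheet']               # hypothetical sheet name
start_row = 10                        # first row under the template headers

for i, (_, row) in enumerate(wide.iterrows(), start=start_row):
    sheet.cell(row=i, column=1).value = row['Date']
    sheet.cell(row=i, column=2).value = row.get('Project 1')
    sheet.cell(row=i, column=3).value = row.get('Project 2')

wb.save('filled_template.xlsx')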

Multiple files ranking using pyspark

I have stock data (6000+ stocks, 100+ GB) saved as HDF5 files.
Basically, I am trying to translate this pandas code into pyspark. Ideally, I would love to have values used for the ranking, as well as ranks themselves saved to a file.
agg_df = pd.DataFrame()
for stock in stocks:
    df = pd.read_csv(stock)
    df = my_func(df)  # custom function whose output is used for ranking; for simplicity, think standard deviation
    agg_df = pd.concat([agg_df, df], axis=1)  # concatenate column-wise, one column per stock

agg_df.rank()  # .to_csv() <- would like to save the ranks for future use
Each data file has the same schema:
Symbol Open High Low Close Volume
DateTime
2010-09-13 09:30:00 A 29.23 29.25 29.17 29.25 17667.0
2010-09-13 09:31:00 A 29.26 29.34 29.25 29.33 5000.0
2010-09-13 09:32:00 A 29.31 29.36 29.31 29.36 600.0
2010-09-13 09:33:00 A 29.33 29.36 29.30 29.35 7300.0
2010-09-13 09:34:00 A 29.35 29.39 29.31 29.39 3222.0
Desired output (where each number is a rank):
A AAPL MSFT ...etc
DateTime
2010-09-13 09:30:00 1 3 7 ...
2010-09-13 09:31:00 4 5 7 ...
2010-09-13 09:32:00 24 17 99 ...
2010-09-13 09:33:00 7 63 42 ...
2010-09-13 09:34:00 5 4 13 ...
I read other answers about Window and pyspark.sql, but I'm not sure how to apply those to my case, since I seem to need to aggregate the stocks row-wise before ranking (at least in pandas).
Edit 1: After I read the data into an RDD with rdd = sc.parallelize(data.keys).map(data.read_data), rdd becomes a PipelineRDD, which doesn't have a .select() method. 0xDFDFDFDF's example contains all the data in one dataframe, but I don't think it's a good idea to append everything to one dataframe to do the computation.
Result: I was finally able to solve it. There were two problems: reading the files and performing the calculations.
Regarding reading the files, I initially loaded them from HDF5 using rdd = sc.parallelize(data.keys).map(data.read_data), which resulted in a PipelineRDD that was a collection of pandas dataframes. These needed to be transformed into a Spark DataFrame for the solution to work. I ended up converting my HDF5 files to Parquet and saving them to a separate folder. Then, using
sqlContext = pyspark.sql.SQLContext(sc)
rdd_p = sqlContext.read.parquet(r"D:\parq")
I read all the files into a dataframe.
After that I performed the calculation from the accepted answer. Huge thanks to 0xDFDFDFDF for the help.
Extras:
discussion - https://chat.stackoverflow.com/rooms/214307/discussion-between-biarys-and-0xdfdfdfdf
0xDFDFDFDF solution - https://gist.github.com/0xDFDFDFDF/a93a7e4448abc03f606008c7422784d1
Indeed, window functions will do the trick.
I've created a small mock dataset, which should resemble yours.
columns = ['DateTime', 'Symbol', 'Open', 'High', 'Low', 'Close', 'Volume']
data = [
    ('2010-09-13 09:30:00', 'A',    29.23, 29.25, 29.17, 29.25, 17667.0),
    ('2010-09-13 09:31:00', 'A',    29.26, 29.34, 29.25, 29.33,  5000.0),
    ('2010-09-13 09:32:00', 'A',    29.31, 29.36, 29.31, 29.36,   600.0),
    ('2010-09-13 09:34:00', 'A',    29.35, 29.39, 29.31, 29.39,  3222.0),
    ('2010-09-13 09:30:00', 'AAPL', 39.23, 39.25, 39.17, 39.25, 37667.0),
    ('2010-09-13 09:31:00', 'AAPL', 39.26, 39.34, 39.25, 39.33,  3000.0),
    ('2010-09-13 09:32:00', 'AAPL', 39.31, 39.36, 39.31, 39.36,   300.0),
    ('2010-09-13 09:33:00', 'AAPL', 39.33, 39.36, 39.30, 39.35,  3300.0),
    ('2010-09-13 09:34:00', 'AAPL', 39.35, 39.39, 39.31, 39.39,  4222.0),
    ('2010-09-13 09:34:00', 'MSFT', 39.35, 39.39, 39.31, 39.39,  7222.0),
]
df = spark.createDataFrame(data, columns)
Now, df.show() will give us this:
+-------------------+------+-----+-----+-----+-----+-------+
| DateTime|Symbol| Open| High| Low|Close| Volume|
+-------------------+------+-----+-----+-----+-----+-------+
|2010-09-13 09:30:00| A|29.23|29.25|29.17|29.25|17667.0|
|2010-09-13 09:31:00| A|29.26|29.34|29.25|29.33| 5000.0|
|2010-09-13 09:32:00| A|29.31|29.36|29.31|29.36| 600.0|
|2010-09-13 09:34:00| A|29.35|29.39|29.31|29.39| 3222.0|
|2010-09-13 09:30:00| AAPL|39.23|39.25|39.17|39.25|37667.0|
|2010-09-13 09:31:00| AAPL|39.26|39.34|39.25|39.33| 3000.0|
|2010-09-13 09:32:00| AAPL|39.31|39.36|39.31|39.36| 300.0|
|2010-09-13 09:33:00| AAPL|39.33|39.36| 39.3|39.35| 3300.0|
|2010-09-13 09:34:00| AAPL|39.35|39.39|39.31|39.39| 4222.0|
|2010-09-13 09:34:00| MSFT|39.35|39.39|39.31|39.39| 7222.0|
+-------------------+------+-----+-----+-----+-----+-------+
Here's the solution, which uses the aforementioned window function for rank(). Some reshaping is needed, for which you can use the pivot() function.
from pyspark.sql.window import Window
import pyspark.sql.functions as f

result = (df
    .select(
        'DateTime',
        'Symbol',
        f.rank().over(Window().partitionBy('DateTime').orderBy('Volume')).alias('rank')
    )
    .groupby('DateTime')
    .pivot('Symbol')
    .agg(f.first('rank'))
    .orderBy('DateTime')
)
By calling result.show() you'll get:
+-------------------+----+----+----+
| DateTime| A|AAPL|MSFT|
+-------------------+----+----+----+
|2010-09-13 09:30:00| 1| 2|null|
|2010-09-13 09:31:00| 2| 1|null|
|2010-09-13 09:32:00| 2| 1|null|
|2010-09-13 09:33:00|null| 1|null|
|2010-09-13 09:34:00| 1| 2| 3|
+-------------------+----+----+----+
Make sure you understand the difference between the rank(), dense_rank() and row_number() functions, as they behave differently when they encounter equal values in a given window.
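A quick illustration of how the three behave on tied values (toy data, not the stock set above):
from pyspark.sql.window import Window
import pyspark.sql.functions as f

ties = spark.createDataFrame([(10,), (20,), (20,), (30,)], ['v'])
w = Window.orderBy('v')

ties.select(
    'v',
    f.row_number().over(w).alias('row_number'),  # 1, 2, 3, 4 - ties broken arbitrarily
    f.rank().over(w).alias('rank'),              # 1, 2, 2, 4 - gap after the tie
    f.dense_rank().over(w).alias('dense_rank'),  # 1, 2, 2, 3 - no gap
).show()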

Python PySpark: Count Number of Rows by Week w/Week Starting on Monday and Ending on Sunday

I have a data frame that contains the following columns:
ID Scheduled Date
241 10/9/2018
423 9/25/2018
126 9/30/2018
123 8/13/2018
132 8/16/2018
143 10/6/2018
I want to count the total number of IDs by week. Specifically, I want the week to always start on Monday and always end on Sunday.
I achieved this in Jupyter Notebook already:
weekly_count_output = df.resample('W-Mon', on='Scheduled Date', label='left', closed='left').sum().query('count_row > 0')
weekly_count_output = weekly_count_output.reset_index()
weekly_count_output = weekly_count_output[['Scheduled Date', 'count_row']]
weekly_count_output = weekly_count_output.rename(columns = {'count_row': 'Total Count'})
But I don't know how to write the above code in Python PySpark syntax. I want my resulting output to look like this:
Scheduled Date Total Count
8/13/2018 2
9/24/2018 2
10/1/2018 1
10/8/2018 1
Please note the Scheduled Date is always a Monday (indicating the beginning of the week), and the total count covers Monday through Sunday of that week.
Thanks to Get Last Monday in Spark for defining the function previous_day.
First, import:
from pyspark.sql.functions import *
from datetime import datetime
Assuming your input data is as in my df (DataFrame):
cols = ['id', 'scheduled_date']
vals = [
    (241, '10/09/2018'),
    (423, '09/25/2018'),
    (126, '09/30/2018'),
    (123, '08/13/2018'),
    (132, '08/16/2018'),
    (143, '10/06/2018')
]
df = spark.createDataFrame(vals, cols)
This is the function, as defined:
def previous_day(date, dayOfWeek):
    return date_sub(next_day(date, dayOfWeek), 7)

# Convert the string column to a timestamp and back to a 'yyyy-MM-dd' date string.
df = df.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'MM/dd/yyyy')
                                                 .cast('timestamp'), 'yyyy-MM-dd'))
df.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-09|
|423| 2018-09-25|
|126| 2018-09-30|
|123| 2018-08-13|
|132| 2018-08-16|
|143| 2018-10-06|
+---+--------------+
# Snap each date back to the Monday of its week
df_mon = df.withColumn("scheduled_date", previous_day('scheduled_date', 'monday'))
df_mon.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-08|
|423| 2018-09-24|
|126| 2018-09-24|
|123| 2018-08-13|
|132| 2018-08-13|
|143| 2018-10-01|
+---+--------------+
# You can group by 'scheduled_date' and aggregate with a count of 'id'.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id')).orderBy('scheduled_date')

# Reformatting to match your resulting output.
df_mon_grp = df_mon_grp.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'yyyy-MM-dd')
                                                                 .cast('timestamp'), 'MM/dd/yyyy'))
df_mon_grp.show()
+--------------+---------+
|scheduled_date|count(id)|
+--------------+---------+
| 08/13/2018| 2|
| 09/24/2018| 2|
| 10/01/2018| 1|
| 10/08/2018| 1|
+--------------+---------+
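If your Spark version has date_trunc (2.3+), the same weekly count can be written without the previous_day helper, since date_trunc('week', ...) already snaps to the Monday of the week. A sketch, assuming scheduled_date has been parsed to a date/timestamp as above:
from pyspark.sql import functions as F

weekly = (df
          .withColumn('week_start', F.to_date(F.date_trunc('week', F.col('scheduled_date'))))
          .groupBy('week_start')
          .agg(F.count('id').alias('Total Count'))
          .orderBy('week_start'))
weekly.show()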

Using .apply() in Sframes to manipulate multiple columns of each row

I have an SFrame with the columns Date1 and Date2.
I am trying to use .apply() to find the datediff between Date1 and Date2, but I can't figure out how to use the other argument.
Ideally something like
frame['new_col'] = frame['Date1'].apply(lambda x: datediff(x,frame('Date2')))
You can directly take the difference between the dates in column Date2 and those in Date1 by simply subtracting frame['Date1'] from frame['Date2']. That, for some reason, returns the number of seconds between the two dates (only tested with Python's datetime objects), which you can convert into a number of days with simple arithmetic:
from sframe import SFrame
from datetime import datetime, timedelta

mydict = {'Date1': [datetime.now(), datetime.now() + timedelta(2)],
          'Date2': [datetime.now() + timedelta(10), datetime.now() + timedelta(17)]}
frame = SFrame(mydict)

frame['new_col'] = (frame['Date2'] - frame['Date1']).apply(lambda x: x // (60 * 60 * 24))
Output:
+----------------------------+----------------------------+---------+
| Date1 | Date2 | new_col |
+----------------------------+----------------------------+---------+
| 2016-10-02 21:12:14.712556 | 2016-10-12 21:12:14.712574 | 10.0 |
| 2016-10-04 21:12:14.712567 | 2016-10-19 21:12:14.712576 | 15.0 |
+----------------------------+----------------------------+---------+
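If you do want the row-wise .apply() from the original question, SFrame.apply on the whole frame passes each row in as a dict, so (as far as I recall the SFrame API, so treat this as an assumption) something along these lines should also work with the same frame:
# Row-wise apply: each 'row' is a dict keyed by column name.
frame['new_col'] = frame.apply(
    lambda row: (row['Date2'] - row['Date1']).total_seconds() // (60 * 60 * 24)
)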
