I have a PySpark dataframe containing multiple rows for each user:
userId  action  time
1       buy     8 AM
1       buy     9 AM
1       sell    2 PM
1       sell    3 PM
2       sell    10 AM
2       buy     11 AM
2       sell    2 PM
2       sell    3 PM
My goal is to split this dataset into a training and a test set so that, for each userId, N% of the rows go to the training set and the remaining (100-N)% go to the test set. For example, given N=75%, the training set will be
userId  action  time
1       buy     8 AM
1       buy     9 AM
1       sell    2 PM
2       sell    10 AM
2       buy     11 AM
2       sell    2 PM
and the test set will be
userId  action  time
1       sell    3 PM
2       sell    3 PM
Any suggestions? Rows are ordered by the time column, and I don't think Spark's randomSplit will help, since I cannot stratify the split on specific columns.
We had a similar requirement and solved it in the following way:
data = [
    (1, "buy"),
    (1, "buy"),
    (1, "sell"),
    (1, "sell"),
    (2, "sell"),
    (2, "buy"),
    (2, "sell"),
    (2, "sell"),
]
df = spark.createDataFrame(data, ["userId", "action"])
Use a Window to assign serial row numbers per user, and also compute the count of records for each userId; the count is needed to work out how many rows make up N% of each group. (This toy data has no time column, so the window below orders by userId; with real data you would order by time.)
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
window = Window.partitionBy(df["userId"]).orderBy(df["userId"])
df_count = df.groupBy("userId").count().withColumnRenamed("userId", "userId_grp")
df = df.join(df_count, col("userId") == col("userId_grp"), "left").drop("userId_grp")
df = df.select("userId", "action", "count", row_number().over(window).alias("row_number"))
df.show()
+------+------+-----+----------+
|userId|action|count|row_number|
+------+------+-----+----------+
| 1| buy| 4| 1|
| 1| buy| 4| 2|
| 1| sell| 4| 3|
| 1| sell| 4| 4|
| 2| sell| 4| 1|
| 2| buy| 4| 2|
| 2| sell| 4| 3|
| 2| sell| 4| 4|
+------+------+-----+----------+
Filter training records by required percentage:
n = 75
df_train = df.filter(col("row_number") <= col("count") * n / 100)
df_train.show()
+------+------+-----+----------+
|userId|action|count|row_number|
+------+------+-----+----------+
| 1| buy| 4| 1|
| 1| buy| 4| 2|
| 1| sell| 4| 3|
| 2| sell| 4| 1|
| 2| buy| 4| 2|
| 2| sell| 4| 3|
+------+------+-----+----------+
And remaining records go to the test set:
df_test = df.alias("df").join(df_train.alias("tr"),
                              (col("df.userId") == col("tr.userId"))
                              & (col("df.row_number") == col("tr.row_number")),
                              "leftanti")
df_test.show()
+------+------+-----+----------+
|userId|action|count|row_number|
+------+------+-----+----------+
| 1| sell| 4| 4|
| 2| sell| 4| 4|
+------+------+-----+----------+
You can use ntile:
from pyspark.sql.functions import col, expr
ds = ds.withColumn("tile", expr("ntile(4) over (partition by id order by time)"))
The rows where tile=4 form your test set, and tile<4 your train set:
test = ds.filter(col("tile") == 4)
train = ds.filter(col("tile") < 4)
test.show()
+---+------+----+----+
| id|action|time|tile|
+---+------+----+----+
| 1| sell|3 PM| 4|
| 2| sell|3 PM| 4|
+---+------+----+----+
train.show()
+---+------+-----+----+
| id|action| time|tile|
+---+------+-----+----+
| 1| buy| 8 AM| 1|
| 1| buy| 9 AM| 2|
| 1| sell| 2 PM| 3|
| 2| sell|10 AM| 1|
| 2| buy|11 AM| 2|
| 2| sell| 2 PM| 3|
+---+------+-----+----+
Good luck!
I have the following pyspark dataframe:
import pandas as pd

foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+-----+
| id|time|value|
+---+----+-----+
| 1| 1| a|
| 1| 2| a|
| 1| 3| a|
| 1| 4| b|
| 1| 5| b|
| 2| 1| b|
| 2| 2| b|
| 2| 3| c|
| 2| 4| c|
| 2| 5| c|
+---+----+-----+
I would like, for a rolling time window of 3, to calculate the percentage of appearances of all the values, in the value column. The operation should be done by id.
The output dataframe would look something like this:
+---+------------------+------------------+------------------+
| id| perc_a| perc_b| perc_c|
+---+------------------+------------------+------------------+
| 1| 1.0| 0.0| 0.0|
| 1|0.6666666666666666|0.3333333333333333| 0.0|
| 1|0.3333333333333333|0.6666666666666666| 0.0|
| 2| 0.0|0.6666666666666666|0.3333333333333333|
| 2| 0.0|0.3333333333333333|0.6666666666666666|
| 2| 0.0| 0.0| 1.0|
+---+------------------+------------------+------------------+
Explanation of result:
for id=1 and the first window (time=[1,2,3]), the value column contains only a's, so perc_a equals 1.0 and the rest are 0.
for id=1 and the second window (time=[2,3,4]), the value column contains 2 a's and 1 b, so perc_a equals 0.67, perc_b equals 0.33, and perc_c equals 0.
etc.
How could I achieve that in pyspark ?
EDIT
I am using pyspark 2.4
You can use count with a window function.
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('id').orderBy('time').rowsBetween(Window.currentRow, 2)
df = (df.select('id', F.col('time').alias('window'),
                *[(F.count(F.when(F.col('value') == x, 'value')).over(w)
                   / F.count('value').over(w) * 100).alias(f'perc_{x}')
                  for x in ['a', 'b', 'c']])
      .filter(F.col('window') < 4))
Clever answer by @Emma. Expanding on it with a SparkSQL implementation.
The approach is to collect values over the intended sliding row range, i.e. ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING, filter on time < 4, then explode the collected list to count the individual frequencies, and finally pivot to the intended format.
SparkSQL - Collect List
import pandas as pd
from pyspark.sql import SparkSession

sql = SparkSession.builder.getOrCreate()

foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})
sparkDF = sql.createDataFrame(foo)
sparkDF.registerTempTable("INPUT")
sql.sql("""
SELECT
id,
time,
value,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY time
) as window_map,
COLLECT_LIST(value) OVER(PARTITION BY id ORDER BY time
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) as collected_list
FROM INPUT
""").show()
+---+----+-----+----------+--------------+
| id|time|value|window_map|collected_list|
+---+----+-----+----------+--------------+
| 1| 1| a| 1| [a, a, a]|
| 1| 2| a| 2| [a, a, b]|
| 1| 3| a| 3| [a, b, b]|
| 1| 4| b| 4| [b, b]|
| 1| 5| b| 5| [b]|
| 2| 1| b| 1| [b, b, c]|
| 2| 2| b| 2| [b, c, c]|
| 2| 3| c| 3| [c, c, c]|
| 2| 4| c| 4| [c, c]|
| 2| 5| c| 5| [c]|
+---+----+-----+----------+--------------+
SparkSQL - Explode - Frequency Calculation
immDF = sql.sql(
"""
SELECT
id,
time,
exploded_value,
COUNT(*) as value_count
FROM (
SELECT
id,
time,
value,
window_map,
EXPLODE(collected_list) as exploded_value
FROM (
SELECT
id,
time,
value,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY time
) as window_map,
COLLECT_LIST(value) OVER(PARTITION BY id ORDER BY time
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) as collected_list
FROM INPUT
)
WHERE window_map < 4 -- keep only rows whose 3-row window is full
)
GROUP BY 1,2,3
ORDER BY id,time
"""
)
immDF.registerTempTable("IMM_RESULT")
immDF.show()
+---+----+--------------+-----------+
| id|time|exploded_value|value_count|
+---+----+--------------+-----------+
| 1| 1| a| 3|
| 1| 2| b| 1|
| 1| 2| a| 2|
| 1| 3| a| 1|
| 1| 3| b| 2|
| 2| 1| b| 2|
| 2| 1| c| 1|
| 2| 2| b| 1|
| 2| 2| c| 2|
| 2| 3| c| 3|
+---+----+--------------+-----------+
SparkSQL - Pivot
sql.sql("""
SELECT
id,
time,
ROUND(NVL(a,0),2) as perc_a,
ROUND(NVL(b,0),2) as perc_b,
ROUND(NVL(c,0),2) as perc_c
FROM IMM_RESULT
PIVOT (
MAX(value_count)/3 * 100.0
FOR exploded_value IN ('a'
,'b'
,'c'
)
)
""").show()
+---+----+------+------+------+
| id|time|perc_a|perc_b|perc_c|
+---+----+------+------+------+
| 1| 1| 100.0| 0.0| 0.0|
| 1| 2| 66.67| 33.33| 0.0|
| 1| 3| 33.33| 66.67| 0.0|
| 2| 1| 0.0| 66.67| 33.33|
| 2| 2| 0.0| 33.33| 66.67|
| 2| 3| 0.0| 0.0| 100.0|
+---+----+------+------+------+
I have a dataframe like:
id Name Rank Course
1 S1 21 Physics
2 S2 22 Chemistry
3 S3 24 Math
4 S2 22 English
5 S2 22 Social
6 S1 21 Geography
I want to group this dataset over Name, Rank and calculate group number. In pandas, I can easily do:
df['ngrp'] = df.groupby(['Name', 'Rank']).ngroup()
After computing the above, I get the following output:
id Name Rank Course ngrp
1  S1   21   Physics   0
6  S1   21   Geography 0
2  S2   22   Chemistry 1
4  S2   22   English   1
5  S2   22   Social    1
3  S3   24   Math      2
Is there a method in Pyspark that will achieve the same output? I tried the following, but it doesn't seem to work:
from pyspark.sql import Window
w = Window.partitionBy(['Name', 'Rank'])
df.select(['Name', 'Rank'], ['Course'], f.count(['Name', 'Rank']).over(w).alias('ngroup')).show()
You can opt for DENSE_RANK -
Data Preparation
from io import StringIO

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

sql = SparkSession.builder.getOrCreate()

df = pd.read_csv(StringIO("""
id,Name,Rank,Course
1,S1,21,Physics
2,S2,22,Chemistry
3,S3,24,Math
4,S2,22,English
5,S2,22,Social
6,S1,21,Geography
"""),delimiter=',')
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+---+----+----+---------+
| id|Name|Rank| Course|
+---+----+----+---------+
| 1| S1| 21| Physics|
| 2| S2| 22|Chemistry|
| 3| S3| 24| Math|
| 4| S2| 22| English|
| 5| S2| 22| Social|
| 6| S1| 21|Geography|
+---+----+----+---------+
Dense Rank
window = Window.orderBy(['Name','Rank'])
sparkDF = sparkDF.withColumn('ngroup',F.dense_rank().over(window) - 1)
sparkDF.orderBy(['Name','ngroup']).show()
+---+----+----+---------+------+
| id|Name|Rank| Course|ngroup|
+---+----+----+---------+------+
| 6| S1| 21|Geography| 0|
| 1| S1| 21| Physics| 0|
| 4| S2| 22| English| 1|
| 2| S2| 22|Chemistry| 1|
| 5| S2| 22| Social| 1|
| 3| S3| 24| Math| 2|
+---+----+----+---------+------+
Dense Rank - SparkSQL
sparkDF.registerTempTable("TB1")

sql.sql("""
SELECT
ID,
NAME,
RANK,
COURSE,
DENSE_RANK() OVER(ORDER BY NAME,RANK) - 1 as NGROUP
FROM TB1
""").show()
+---+----+----+---------+------+
| ID|NAME|RANK| COURSE|NGROUP|
+---+----+----+---------+------+
| 1| S1| 21| Physics| 0|
| 6| S1| 21|Geography| 0|
| 2| S2| 22|Chemistry| 1|
| 4| S2| 22| English| 1|
| 5| S2| 22| Social| 1|
| 3| S3| 24| Math| 2|
+---+----+----+---------+------+
Imagine you have two datasets df and df2 like the following:
df:
ID Size Condition
1 2 1
2 3 0
3 5 0
4 7 1
df2:
aux_ID Scalar
1 2
3 2
I want to get an output where if the condition of df is 1, we multiply the size times the scalar and then return df with the changed values.
I would want to do this as efficient as possible, perhaps avoiding the join if that's possible.
output_df:
ID Size Condition
1 4 1
2 3 0
3 5 0
4 7 1
Not sure why you would want to avoid joins in the first place; they can be efficient in their own right.
With that said, this can easily be done by merging the two datasets and building a case-when statement on the condition.
Data Preparation
from io import StringIO

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

sql = SparkSession.builder.getOrCreate()

df1 = pd.read_csv(StringIO("""ID,Size,Condition
1,2,1
2,3,0
3,5,0
4,7,1
""")
,delimiter=','
)
df2 = pd.read_csv(StringIO("""aux_ID,Scalar
1,2
3,2
""")
,delimiter=','
)
sparkDF1 = sql.createDataFrame(df1)
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---+----+---------+
| ID|Size|Condition|
+---+----+---------+
| 1| 2| 1|
| 2| 3| 0|
| 3| 5| 0|
| 4| 7| 1|
+---+----+---------+
sparkDF2.show()
+------+------+
|aux_ID|Scalar|
+------+------+
| 1| 2|
| 3| 2|
+------+------+
Case When
finalDF = (sparkDF1.join(sparkDF2,
                         sparkDF1['ID'] == sparkDF2['aux_ID'],
                         'left')
                   .select(sparkDF1['*'],
                           sparkDF2['Scalar'],
                           sparkDF2['aux_ID'])
                   .withColumn('Size_Changed',
                               F.when((F.col('Condition') == 1)
                                      & F.col('aux_ID').isNotNull(),
                                      F.col('Size') * F.col('Scalar'))
                                .otherwise(F.col('Size'))))
finalDF.show()
+---+----+---------+------+------+------------+
| ID|Size|Condition|Scalar|aux_ID|Size_Changed|
+---+----+---------+------+------+------------+
| 1| 2| 1| 2| 1| 4|
| 3| 5| 0| 2| 3| 5|
| 2| 3| 0| null| null| 3|
| 4| 7| 1| null| null| 7|
+---+----+---------+------+------+------------+
You can drop the unnecessary columns; I kept them here for illustration.
I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the columns with shouldMerge as true and add up the numbers.
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: an alternate, slightly more complicated scenario, closer to what I'm trying to solve, where we only aggregate rows with a positive mergeId:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
and the desired output would be:
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer).
The other way is to use when to dynamically create a grouping key: if the mergeId is positive, group on the mergeId; otherwise, use a monotonically_increasing_id to ensure the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f

df.withColumn("uid", f.monotonically_increasing_id())\
    .groupBy(
        f.when(
            f.col("mergeId") > 0,
            f.col("mergeId")
        ).otherwise(f.col("uid")).alias("mergeKey"),
        f.col("mergeId")
    )\
    .agg(f.sum("number").alias("number"))\
    .drop("mergeKey")\
    .show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
    .withColumn(
        "mergeKey",
        f.when(
            f.col("mergeId") > 0,
            f.col("mergeId")
        ).otherwise(f.col("uid"))
    )\
    .show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
You will have to filter out only the rows where shouldMerge is true and aggregate, then union this with all the remaining rows.
import pyspark.sql.functions as functions
df = sqlContext.createDataFrame([
(True, 1),
(True, 1),
(True, 2),
(False, 3),
(False, 1),
], ("shouldMerge", "number"))
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
df = sqlContext.createDataFrame([
(1, 1),
(2, 1),
(1, 2),
(-1, 3),
(-1, 1),
], ("mergeId", "number"))
merge_condition = df["mergeId"] > 0
remaining = ~merge_condition
groupby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(groupby_field)\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)
result.show()
For the first problem posted by the OP:
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.
I have a dataframe in pyspark.
Say it has some columns a, b, c...
I want to group the data into groups whenever the value of a column changes. Say:
A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y
There will be 3 groups: (1x,1y), (0x,0y,0x), (1y,1x,1y), along with the corresponding row data.
If I understand correctly you want to create a distinct group every time column A changes values.
First we'll create a monotonically increasing id to keep the row order as it is:
import pyspark.sql.functions as psf
df = sc.parallelize([[1,'x'],[1,'y'],[0,'x'],[0,'y'],[0,'x'],[1,'y'],[1,'x'],[1,'y']])\
    .toDF(['A', 'B'])\
    .withColumn("rn", psf.monotonically_increasing_id())
df.show()
+---+---+----------+
| A| B| rn|
+---+---+----------+
| 1| x| 0|
| 1| y| 1|
| 0| x| 2|
| 0| y| 3|
| 0| x|8589934592|
| 1| y|8589934593|
| 1| x|8589934594|
| 1| y|8589934595|
+---+---+----------+
Now we'll use a window function to create a column that contains 1 every time column A changes:
from pyspark.sql import Window
w = Window.orderBy('rn')
df = df.withColumn("changed", (df.A != psf.lag('A', 1, 0).over(w)).cast('int'))
+---+---+----------+-------+
| A| B| rn|changed|
+---+---+----------+-------+
| 1| x| 0| 1|
| 1| y| 1| 0|
| 0| x| 2| 1|
| 0| y| 3| 0|
| 0| x|8589934592| 0|
| 1| y|8589934593| 1|
| 1| x|8589934594| 0|
| 1| y|8589934595| 0|
+---+---+----------+-------+
Finally we'll use another window function to allocate different numbers to each group:
df = df.withColumn("group_id", psf.sum("changed").over(w)).drop("rn").drop("changed")
+---+---+--------+
| A| B|group_id|
+---+---+--------+
| 1| x| 1|
| 1| y| 1|
| 0| x| 2|
| 0| y| 2|
| 0| x| 2|
| 1| y| 3|
| 1| x| 3|
| 1| y| 3|
+---+---+--------+
Now you can build your groups.