Spark MinMaxScaler on dataframe

Spark MinMaxScaler on dataframe - python

Assuming I have follwoing dataframe
+---+-----+-------+
|day| time| result|
+---+-----+-------+
| 1| 6 | 0.5 |
| 1| 7 | 10.2 |
| 1| 8 | 5.7 |
| 2| 6 | 11.0 |
| 2| 10 | 22.3 |
+---+-----+-------+
I like to normalize the results per day while keeping the time belonging to each result. I like to use MinMaxScaler I assume I have cast the values to dense vector for each day but how I do I keep time values?

I like to normalize the results (...) I like to use MinMaxScaler
These two requirements are mutually exclusive. MinMaxScaler cannot be used to operate on groups. You can use window functions
from pyspark.sql.functions import min, max, col
from pyspark.sql.window import Window
df = spark.createDataFrame(
[(1, 6, 0.5), (1, 7, 10.2), (1, 8, 5.7), (2, 6, 11.0), (2, 10, 22.3)],
("day", "time", "result"))
w = Window.partitionBy("day")
scaled_result = (col("result") - min("result").over(w)) / (max("result").over(w) - min("result").over(w))
df.withColumn("scaled_result", scaled_result).show()
# +---+----+------+------------------+
# |day|time|result| scaled_result|
# +---+----+------+------------------+
# | 1| 6| 0.5| 0.0|
# | 1| 7| 10.2| 1.0|
# | 1| 8| 5.7|0.5360824742268042|
# | 2| 6| 11.0| 0.0|
# | 2| 10| 22.3| 1.0|
# +---+----+------+------------------+
or group, aggregate and join:
minmax_result = df.groupBy("day").agg(min("result").alias("min_result"), max("result").alias("max_result"))
minmax_result.join(df, ["day"]).select(
"day", "time", "result",
((col("result") - col("min_result")) / (col("max_result") - col("min_result"))).alias("scaled_result")
).show()
# +---+----+------+------------------+
# |day|time|result| scaled_result|
# +---+----+------+------------------+
# | 1| 6| 0.5| 0.0|
# | 1| 7| 10.2| 1.0|
# | 1| 8| 5.7|0.5360824742268042|
# | 2| 6| 11.0| 0.0|
# | 2| 10| 22.3| 1.0|
# +---+----+------+------------------+

Related

filter then count for many different threshold

I want to calculate the number of lines that satisfy a condition on a very large dataframe which can be achieved by
df.filter(col("value") >= thresh).count()
I want to know the result for each threshold in range [1, 10]. Enumerate each threshold then do this action will scan the dataframe for 10 times. It's slow.
If I can achieve it by scanning the df only once?

Create an indicator column for each threshold, then sum:
import random
import pyspark.sql.functions as F
from pyspark.sql import Row
df = spark.createDataFrame([Row(value=random.randint(0,10)) for _ in range(1_000_000)])
df.select([
(F.col("value") >= thresh)
.cast("int")
.alias(f"ind_{thresh}")
for thresh in range(1,11)
]).groupBy().sum().show()
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
# |sum(ind_1)|sum(ind_2)|sum(ind_3)|sum(ind_4)|sum(ind_5)|sum(ind_6)|sum(ind_7)|sum(ind_8)|sum(ind_9)|sum(ind_10)|
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
# | 908971| 818171| 727240| 636334| 545463| 454279| 363143| 272460| 181729| 90965|
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+

Using conditional aggregation with when expressions should do the job.
Here's an example:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,), (2,), (3,), (4,), (4,), (6,), (7,)], ["value"])
count_expr = [
F.count(F.when(F.col("value") >= th, 1)).alias(f"gte_{th}")
for th in range(1, 11)
]
df.select(*count_expr).show()
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
#|gte_1|gte_2|gte_3|gte_4|gte_5|gte_6|gte_7|gte_8|gte_9|gte_10|
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
#| 7| 6| 5| 4| 2| 2| 1| 0| 0| 0|
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+

Using a user-defined function udf from pyspark.sql.functions:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100, size=(20)), columns=['val'])
thres = [90, 80, 30] # these are the thresholds
thres.sort(reverse=True) # list needs to be sorted
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
spark = SparkSession.builder \
.master("local[2]") \
.appName("myApp") \
.getOrCreate()
sparkDF = spark.createDataFrame(df)
myUdf = udf(lambda x: 0 if x>thres[0] else 1 if x>thres[1] else 2 if x>thres[2] else 3)
sparkDF = sparkDF.withColumn("rank", myUdf(sparkDF.val))
sparkDF.show()
# +---+----+
# |val|rank|
# +---+----+
# | 28| 3|
# | 54| 2|
# | 19| 3|
# | 4| 3|
# | 74| 2|
# | 62| 2|
# | 95| 0|
# | 19| 3|
# | 55| 2|
# | 62| 2|
# | 33| 2|
# | 93| 0|
# | 81| 1|
# | 41| 2|
# | 80| 2|
# | 53| 2|
# | 14| 3|
# | 16| 3|
# | 30| 3|
# | 77| 2|
# +---+----+
sparkDF.groupby(['rank']).count().show()
# Out:
# +----+-----+
# |rank|count|
# +----+-----+
# | 3| 7|
# | 0| 2|
# | 1| 1|
# | 2| 10|
# +----+-----+
A value gets rank i if it's strictly greater than thres[i] but smaller or equal thres[i-1]. This should minimize the number of comparisons.
For thres = [90, 80, 30] we have the ranks 0-> [max, 90[, 1-> [90, 80[, 2->[80, 30[, 3->[30, min]

py4JJavaError. pyspark applying function to column

spark = 2.x
New to pyspark.
While encoding date related columns for training DNN keep on facing error mentioned in the title.
from df
day month ...
1 1
2 3
3 1 ...
I am trying to get cos, sine value for each column in order to capture their cyclic nature.
When applying function to column in pyspark udf worked fine until now. But below code doesn't work
def to_cos(x, _max):
return np.sin(2*np.pi*x / _max)
to_cos_udf = udf(to_cos, DecimalType())
df = df.withColumn("month", to_cos_udf("month", 12))
I've tried it with IntegerType and tried it with only one variable def to_cos(x) however none of them seem to work and outputs:
Py4JJavaError: An error occurred while calling 0.24702.showString.

Since you havent shared the entire Stacktrack from the error , not sure what is the actual error which is causing the failure
However by the code snippets you have shared , Firstly you need to update your UDF definition as below -
Will passing arguments to a UDF function using it with lambda is probably the best approach towards it , apart from that you can use partial
Data Preparation
df = pd.DataFrame({
'month':[i for i in range(0,12)],
})
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+-----+
|month|
+-----+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
+-----+
Custom UDF
def to_cos(x,_max):
try:
res = np.sin(2*np.pi*x / _max)
except Exception as e:
res = 0.0
return float(res)
max_cos = 12
to_cos_udf = F.udf(lambda x: to_cos(x,max_cos),FloatType())
sparkDF = sparkDF.withColumn('month_cos',to_cos_udf('month'))
sparkDF.show()
+-----+-------------+
|month| month_cos|
+-----+-------------+
| 0| 0.0|
| 1| 0.5|
| 2| 0.8660254|
| 3| 1.0|
| 4| 0.8660254|
| 5| 0.5|
| 6|1.2246469E-16|
| 7| -0.5|
| 8| -0.8660254|
| 9| -1.0|
| 10| -0.8660254|
| 11| -0.5|
+-----+-------------+
Custom UDF - Partial
from functools import partial
partial_func = partial(to_cos,_max=max_cos)
to_cos_partial_udf = F.udf(partial_func)
sparkDF = sparkDF.withColumn('month_cos',to_cos_partial_udf('month'))
sparkDF.show()
+-----+--------------------+
|month| month_cos|
+-----+--------------------+
| 0| 0.0|
| 1| 0.49999999999999994|
| 2| 0.8660254037844386|
| 3| 1.0|
| 4| 0.8660254037844388|
| 5| 0.49999999999999994|
| 6|1.224646799147353...|
| 7| -0.4999999999999998|
| 8| -0.8660254037844384|
| 9| -1.0|
| 10| -0.8660254037844386|
| 11| -0.5000000000000004|
+-----+--------------------+

Merge 2 spark dataframes with non overlapping columns

I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they aren't overlapping (ie when df2 entry is null df1 entry isn't and vise versa)
I know that if I use join I won't get them on the same column and will instead get 2 "col_name". I just want it on the one column. How do I do this? Thanks

Try this-
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")

You could do this:
mydf = df1.copy() #make copy of first array
idx = np.where(df1['col_name'].values == 'null')[0] #get indices of null
val = df2['col_name'].values[idx] #get values from df2 where df1 is null
mydf['col_name'][idx] = val #assign those values in mydf
mydf #print mydf

you should be able to utilize the coalesce function to achieve this.
df1 = df1.withColumnRenamed("col_name", "col_name_a")
df2 = df2.withColumnRenamed("col_name", "col_name_b")
joinedDF = renamedDF1.join(renamedDF2, "id")
joinedDF = joinedDF.withColumn(
"col_name",
coalesce(joinedDF.col("col_name_a"), joinedDF.col("col_name_b"))
)

Reshape pyspark dataframe to show moving window of item interactions

I have a large pyspark dataframe of subject interactions in long format--each row describes a subject interacting with some item of interest, along with a timestamp and a rank-order for that subject's interaction (i.e., first interaction is 1, second is 2, etc.). Here's a few rows:
+----------+---------+----------------------+--------------------+
| date|itemId |interaction_date_order| userId|
+----------+---------+----------------------+--------------------+
|2019-07-23| 10005880| 1|37 |
|2019-07-23| 10005903| 2|37 |
|2019-07-23| 10005903| 3|37 |
|2019-07-23| 12458442| 4|37 |
|2019-07-26| 10005903| 5|37 |
|2019-07-26| 12632813| 6|37 |
|2019-07-26| 12632813| 7|37 |
|2019-07-26| 12634497| 8|37 |
|2018-11-24| 12245677| 1|5 |
|2018-11-24| 12245677| 1|5 |
|2019-07-29| 12541871| 2|5 |
|2019-07-29| 12541871| 3|5 |
|2019-07-30| 12626854| 4|5 |
|2019-08-31| 12776880| 5|5 |
|2019-08-31| 12776880| 6|5 |
+----------+---------+----------------------+--------------------+
I need to reshape these data such that, for each subject, a row has a length-5 moving window of interactions. So then, something like this:
+------+--------+--------+--------+--------+--------+
|userId| i-2 | i-1 | i | i+1 | i+2|
+------+--------+--------+--------+--------+--------+
|37 |10005880|10005903|10005903|12458442|10005903|
|37 |10005903|10005903|12458442|10005903|12632813|
Does anyone have suggestions for how I might do this?

Import spark and everything
from pyspark.sql import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
Create your dataframe
columns = '| date|itemId |interaction_date_order| userId|'.split('|')
lines = '''2019-07-23| 10005880| 1|37 |
2019-07-23| 10005903| 2|37 |
2019-07-23| 10005903| 3|37 |
2019-07-23| 12458442| 4|37 |
2019-07-26| 10005903| 5|37 |
2019-07-26| 12632813| 6|37 |
2019-07-26| 12632813| 7|37 |
2019-07-26| 12634497| 8|37 |
2018-11-24| 12245677| 1|5 |
2018-11-24| 12245677| 2|5 |
2019-07-29| 12541871| 3|5 |
2019-07-29| 12541871| 4|5 |
2019-07-30| 12626854| 5|5 |
2019-08-31| 12776880| 6|5 |
2019-08-31| 12776880| 7|5 |'''
Interaction = Row("date", "itemId", "interaction_date_order", "userId")
interactions = []
for line in lines.split('\n'):
column_values = line.split('|')
interaction = Interaction(column_values[0], int(column_values[1]), int(column_values[2]), int(column_values[3]))
interactions.append(interaction)
df = spark.createDataFrame(interactions)
now we have
df.show()
+----------+--------+----------------------+------+
| date| itemId|interaction_date_order|userId|
+----------+--------+----------------------+------+
|2019-07-23|10005880| 1| 37|
|2019-07-23|10005903| 2| 37|
|2019-07-23|10005903| 3| 37|
|2019-07-23|12458442| 4| 37|
|2019-07-26|10005903| 5| 37|
|2019-07-26|12632813| 6| 37|
|2019-07-26|12632813| 7| 37|
|2019-07-26|12634497| 8| 37|
|2018-11-24|12245677| 1| 5|
|2018-11-24|12245677| 2| 5|
|2019-07-29|12541871| 3| 5|
|2019-07-29|12541871| 4| 5|
|2019-07-30|12626854| 5| 5|
|2019-08-31|12776880| 6| 5|
|2019-08-31|12776880| 7| 5|
+----------+--------+----------------------+------+
Create a window and collect itemId with count
from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window() \
.partitionBy('userId') \
.orderBy('interaction_date_order') \
.rowsBetween(Window.currentRow, Window.currentRow+4)
df2 = df.withColumn("itemId_list", F.collect_list('itemId').over(window))
df2 = df2.withColumn("itemId_count", F.count('itemId').over(window))
df_final = df2.where(df2['itemId_count'] == 5)
now we have
df_final.show()
+----------+--------+----------------------+------+--------------------+------------+
| date| itemId|interaction_date_order|userId| itemId_list|itemId_count|
+----------+--------+----------------------+------+--------------------+------------+
|2018-11-24|12245677| 1| 5|[12245677, 122456...| 5|
|2018-11-24|12245677| 2| 5|[12245677, 125418...| 5|
|2019-07-29|12541871| 3| 5|[12541871, 125418...| 5|
|2019-07-23|10005880| 1| 37|[10005880, 100059...| 5|
|2019-07-23|10005903| 2| 37|[10005903, 100059...| 5|
|2019-07-23|10005903| 3| 37|[10005903, 124584...| 5|
|2019-07-23|12458442| 4| 37|[12458442, 100059...| 5|
+----------+--------+----------------------+------+--------------------+------------+
Final touch
df_final2 = (df_final
.withColumn('i-2', df_final['itemId_list'][0])
.withColumn('i-1', df_final['itemId_list'][1])
.withColumn('i', df_final['itemId_list'][2])
.withColumn('i+1', df_final['itemId_list'][3])
.withColumn('i+2', df_final['itemId_list'][4])
.select('userId', 'i-2', 'i-1', 'i', 'i+1', 'i+2')
)
df_final2.show()
+------+--------+--------+--------+--------+--------+
|userId| i-2| i-1| i| i+1| i+2|
+------+--------+--------+--------+--------+--------+
| 5|12245677|12245677|12541871|12541871|12626854|
| 5|12245677|12541871|12541871|12626854|12776880|
| 5|12541871|12541871|12626854|12776880|12776880|
| 37|10005880|10005903|10005903|12458442|10005903|
| 37|10005903|10005903|12458442|10005903|12632813|
| 37|10005903|12458442|10005903|12632813|12632813|
| 37|12458442|10005903|12632813|12632813|12634497|
+------+--------+--------+--------+--------+--------+

Combine two rows in Pyspark if a condition is met

I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the columns with shouldMerge as true and add up the numbers.
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: Alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate positive numbers:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
shouldMerge | number
1 | 3
2 | 1
-1 | 3
-1 | 1

IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to #shanmuga's answer).
Other way would be use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
.groupBy(
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey"),
f.col("mergeId")
)\
.agg(f.sum("number").alias("number"))\
.drop("mergeKey")\
.show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.

You will have to filter out only the rows where should merge is true and aggregate. then union this with all the remaining rows.
import pyspark.sql.functions as functions
df = sqlContext.createDataFrame([
(True, 1),
(True, 1),
(True, 2),
(False, 3),
(False, 1),
], ("shouldMerge", "number"))
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
df = sqlContext.createDataFrame([
(1, 1),
(2, 1),
(1, 2),
(-1, 3),
(-1, 1),
], ("mergeId", "number"))
merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
grouby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(grouby_field)\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
result.show()

The first problem posted by the OP.
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Spark MinMaxScaler on dataframe - python

Related

filter then count for many different threshold

py4JJavaError. pyspark applying function to column

Merge 2 spark dataframes with non overlapping columns

Reshape pyspark dataframe to show moving window of item interactions

Combine two rows in Pyspark if a condition is met

Categories

Resources