I have a dictionary:
dict = {10: 1, 50: 2, 200: 3, 500: 4}
And a Dask DataFrame:
+---+---+
| a| b|
+---+---+
| 1| 24|
| 1| 49|
| 2|125|
| 3|400|
+---+---+
I want to groupBy a and get the minimum b value. After that, I want to check which dict key is closest to b and create a new column with the dict value.
As an example, when b=24, the closest key is 10, so I want to assign the value 1.
This is the result I am expecting:
+---+---+-------+
| a| b|closest|
+---+---+-------+
| 1| 24| 1|
| 1| 49| 2|
| 2|125| 3|
| 3|400| 4|
+---+---+-------+
I have found something similar with PySpark. I have not been able to make it run, but it apparently runs for other people. Sharing it anyway for reference.
df = spark.createDataFrame(
    [
        (1, 24),
        (1, 49),
        (2, 125),
        (3, 400)
    ],
    ["a", "b"]
)

dict = {10: 1, 50: 2, 200: 3, 500: 4}

def func(value, dict):
    closest_key = (
        value if value in dict else builtins.min(
            dict.keys(), key=lambda k: builtins.abs(k - value)
        )
    )
    score = dict.get(closest_key)
    return score

df = (
    df.groupby('a')
    .agg(
        min('b')
    )
).withColumn('closest', func('b', dict))
From what I understand, in the Spark version the calculation was done per row, and I have not been able to replicate that.
Instead of thinking of a row-wise operation, you can think of it as a partition-wise operation. If my interpretation is off, you can still use this sample I wrote for the most part with a few tweaks.
I will show a solution with Fugue that lets you just define your logic in Pandas, and then bring it to Dask. This will return a Dask DataFrame.
First some setup. Note that df is a Pandas DataFrame; it is meant to represent a smaller sample you can test on:
import pandas as pd
import dask.dataframe as dd
import numpy as np
_dict = {10: 1, 50: 2, 200: 3, 500: 4}
df = pd.DataFrame({"a": [1,1,2,3], "b":[24,49,125,400]})
ddf = dd.from_pandas(df, npartitions=2)
Then we define the logic. This is written to handle one partition, so everything in column a will already be the same value.
def logic(df: pd.DataFrame) -> pd.DataFrame:
    # handles the logic for 1 group. all values in a are the same
    min_b = df['b'].min()
    keys = np.array(list(_dict.keys()))
    # closest taken from https://stackoverflow.com/a/10465997/11163214
    closest = keys[np.abs(keys - min_b).argmin()]
    closest_val = _dict[closest]
    df = df.assign(closest=closest_val)
    return df
We can test this on Pandas:
logic(df.loc[df['a'] == 1])
and we'll get:
a b closest
0 1 24 1
1 1 49 1
Then we can bring it to Dask with Fugue; we just need to call the transform function:
from fugue import transform
ddf = transform(ddf,
                logic,
                schema="*,closest:int",
                partition={"by": "a"},
                engine="dask")
ddf.compute()
This can take in either Pandas or Dask DataFrames and will output the Dask DataFrame because we specified the "dask" engine. There is also a "spark" engine if you want a Spark DataFrame.
Schema is a requirement for distributed computing so we specify the output schema here. We also partition by column a.
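If you would rather stay in plain Dask without Fugue, roughly the same effect can be had with groupby().apply() and an explicit meta. This is only a sketch, under the assumption that shuffling each group into a single partition is acceptable; the index handling may differ slightly from the Fugue version:
# Reuse the same per-group `logic` function; meta describes the output columns and dtypes.
result = ddf.groupby("a").apply(
    logic,
    meta={"a": "int64", "b": "int64", "closest": "int64"},
)
result.compute()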
So here is another approach for you, friend: this will return a NumPy array, but it will be faster than Spark, and you can easily reindex it.
import numpy as np
a = pydf.toPandas().to_numpy()  # Convert the Spark DataFrame to a NumPy array via pandas
a = a[:, 1]  # Grabs your b column
np.select([a <= 10, a <= 50, a <= 200, a <= 500], [1, 2, 3, 4], a)  # Pick the first matching threshold and fill it with the mapped value
Related
I have a pyspark dataframe, and I want to use two of its columns to output a dictionary.
input pyspark dataframe:
col1|col2|col3
v | 3 | a
d | 2 | b
q | 9 | g
output:
dict = {'v': 3, 'd': 2, 'q': 9}
how should I do this efficiently?
I believe you can achieve it by converting the DF (with only the two columns you want) to rdd:
data_rdd = data.select(['col1', 'col2']).rdd
create an RDD containing key-value pairs from both columns using the rdd.map function:
kp_rdd = data_rdd.map(lambda row : (row[0],row[1]))
and then collect as map:
dict = kp_rdd.collectAsMap()
That's the main idea; sorry, I don't have an instance of PySpark running right now to test it.
Given your example, after selecting the applicable columns and converting to an rdd, collectAsMap will accomplish the desired dictionary without any additional steps:
df.select('col1', 'col2').rdd.collectAsMap()
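For completeness, a minimal end-to-end sketch (column names taken from the question, and assuming an existing SparkSession named spark):
# Build the example frame from the question and collect the two columns as a dict.
df = spark.createDataFrame(
    [("v", 3, "a"), ("d", 2, "b"), ("q", 9, "g")],
    ["col1", "col2", "col3"],
)
result = df.select("col1", "col2").rdd.collectAsMap()
print(result)  # {'v': 3, 'd': 2, 'q': 9}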
A few different options here depending on the format needed ... check this out ... I am using the structured API ... if you need to persist, then either save as a JSON dict or preserve the schema with Parquet.
from pyspark.sql.functions import to_json
from pyspark.sql.functions import create_map
from pyspark.sql.functions import col
df = spark.createDataFrame(
    [('v', 3, 'a'),
     ('d', 2, 'b'),
     ('q', 9, 'g')],
    ["c1", "c2", "c3"])
mapDF = df.select(create_map(col("c1"), col("c2")).alias("mapper"))
mapDF.show(3)
+--------+
| mapper|
+--------+
|[v -> 3]|
|[d -> 2]|
|[q -> 9]|
+--------+
dictDF = df.select(to_json(create_map(col("c1"), col("c2")).alias("mapper")).alias("dict"))
dictDF.show()
+-------+
| dict|
+-------+
|{"v":3}|
|{"d":2}|
|{"q":9}|
+-------+
keyValueDF = df.selectExpr("(c1, c2) as keyValueDict").select(to_json(col("keyValueDict")).alias("keyValueDict"))
keyValueDF.show()
+-----------------+
| keyValueDict|
+-----------------+
|{"c1":"v","c2":3}|
|{"c1":"d","c2":2}|
|{"c1":"q","c2":9}|
+-----------------+
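If the goal is a single map over the whole frame rather than one entry per row, a possible sketch (Spark 2.4+, since it relies on map_from_entries; not part of the original answer) is:
from pyspark.sql.functions import collect_list, map_from_entries, struct

# Aggregate all (c1, c2) pairs into one map column, then pull it back as a Python dict.
whole_map = (
    df.agg(map_from_entries(collect_list(struct("c1", "c2"))).alias("mapper"))
      .first()["mapper"]
)
print(whole_map)  # {'v': 3, 'd': 2, 'q': 9}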
Based on previous questions: 1, 2. Suppose I have the following dataframe:
df = spark.createDataFrame(
[(1, "a", 23.0), (3, "B", -23.0)],
("x1", "x2", "x3"))
And I want to add a new column x4, but I have the values in a Python list instead, e.g. x4_ls = [35.0, 32.0]. Is there a best way to add the new column to the Spark dataframe? (Note that I use Spark 2.1.)
Output should be something like:
## +---+---+-----+----+
## | x1| x2| x3| x4|
## +---+---+-----+----+
## | 1| a| 23.0|35.0|
## | 3| B|-23.0|32.0|
## +---+---+-----+----+
I can also transform my list to a dataframe with df_x4 = spark.createDataFrame([Row(**{'x4': x}) for x in x4_ls]) (but I don't know how to concatenate the dataframes together).
Thanks to Gaurav Dhama for a great answer! I made a few changes to his solution. Here is my solution, which joins the two dataframes together on an added row_num column.
from pyspark.sql import Row

def flatten_row(r):
    r_ = r.features.asDict()
    r_.update({'row_num': r.row_num})
    return Row(**r_)

def add_row_num(df):
    df_row_num = df.rdd.zipWithIndex().toDF(['features', 'row_num'])
    df_out = df_row_num.rdd.map(lambda x: flatten_row(x)).toDF()
    return df_out

df = add_row_num(df)
df_x4 = add_row_num(df_x4)

df_concat = df.join(df_x4, on='row_num').drop('row_num')
We can concatenate on the basis of row numbers as follows. Suppose we have two dataframes df and df_x4:
def addrownum(df):
    dff = df.rdd.zipWithIndex().toDF(['features', 'rownum'])
    # .rdd is needed on Spark 2.x, where DataFrame no longer exposes .map
    odf = dff.rdd.map(lambda x: tuple(x.features) + tuple([x.rownum])).toDF(df.columns + ['rownum'])
    return odf

df1 = addrownum(df)
df2 = addrownum(df_x4)

outputdf = df1.join(df2, df1.rownum == df2.rownum).drop(df1.rownum).drop(df2.rownum)
## outputdf
## +---+---+-----+----+
## | x1| x2| x3| x4|
## +---+---+-----+----+
## | 1| a| 23.0|35.0|
## | 3| B|-23.0|32.0|
## +---+---+-----+----+
outputdf is your required output dataframe
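As a possible alternative sketch (not from the answers above), row_number over monotonically_increasing_id can produce the same join key, assuming the incidental ordering it yields is acceptable and keeping in mind that an unpartitioned window pulls everything through a single partition:
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

def with_row_num(sdf):
    # Window ordered by a monotonically increasing id; gives each row a 1-based index.
    w = Window.orderBy(monotonically_increasing_id())
    return sdf.withColumn("row_num", row_number().over(w))

outputdf = (
    with_row_num(df)
    .join(with_row_num(df_x4), on="row_num")
    .drop("row_num")
)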
I have this python code that runs locally in a pandas dataframe:
df_result = pd.DataFrame(df
                         .groupby('A')
                         .apply(lambda x: myFunction(zip(x.B, x.C), x.name)))
I would like to run this in PySpark, but having trouble dealing with pyspark.sql.group.GroupedData object.
I've tried the following:
sparkDF
.groupby('A')
.agg(myFunction(zip('B', 'C'), 'A'))
which returns
KeyError: 'A'
I presume because 'A' is no longer a column and I can't find the equivalent for x.name.
And then
sparkDF
.groupby('A')
.map(lambda row: Row(myFunction(zip('B', 'C'), 'A')))
.toDF()
but get the following error:
AttributeError: 'GroupedData' object has no attribute 'map'
Any suggestions would be really appreciated!
Since Spark 2.3 you can use pandas_udf. GROUPED_MAP takes Callable[[pandas.DataFrame], pandas.DataFrame] or in other words a function which maps from Pandas DataFrame of the same shape as the input, to the output DataFrame.
For example if data looks like this:
df = spark.createDataFrame(
[("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2)],
("key", "value1", "value2")
)
and you want to compute the average of the pairwise min between value1 and value2, you have to define the output schema:
from pyspark.sql.types import *
schema = StructType([
StructField("key", StringType()),
StructField("avg_min", DoubleType())
])
pandas_udf:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def g(df):
    result = pd.DataFrame(df.groupby(df.key).apply(
        lambda x: x.loc[:, ["value1", "value2"]].min(axis=1).mean()
    ))
    result.reset_index(inplace=True, drop=False)
    return result
and apply it:
df.groupby("key").apply(g).show()
+---+-------+
|key|avg_min|
+---+-------+
| b| -1.5|
| a| -0.5|
+---+-------+
Excluding schema definition and decorator, your current Pandas code can be applied as-is.
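Applied to the original question, the GROUPED_MAP wrapper could look roughly like this. This is only a sketch: myFunction is the asker's placeholder, the column A is assumed to be a string, and myFunction is assumed to return a single float per group.
out_schema = StructType([
    StructField("A", StringType()),
    StructField("result", DoubleType())
])

@pandas_udf(out_schema, functionType=PandasUDFType.GROUPED_MAP)
def apply_my_function(pdf):
    key = pdf["A"].iloc[0]                      # plays the role of x.name in pandas
    value = myFunction(zip(pdf["B"], pdf["C"]), key)
    return pd.DataFrame([[key, value]], columns=["A", "result"])

sparkDF.groupby("A").apply(apply_my_function)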
Since Spark 2.4.0 there is also GROUPED_AGG variant, which takes Callable[[pandas.Series, ...], T], where T is a primitive scalar:
import numpy as np
@pandas_udf(DoubleType(), functionType=PandasUDFType.GROUPED_AGG)
def f(x, y):
    return np.minimum(x, y).mean()
which can be used with standard group_by / agg construct:
df.groupBy("key").agg(f("value1", "value2").alias("avg_min")).show()
+---+-------+
|key|avg_min|
+---+-------+
| b| -1.5|
| a| -0.5|
+---+-------+
Please note that neither GROUPED_MAP nor GROUPED_AGG pandas_udf behaves the same way as UserDefinedAggregateFunction or Aggregator; they are closer to groupByKey or window functions with an unbounded frame. Data is shuffled first, and only after that is the UDF applied.
For optimized execution you should implement Scala UserDefinedAggregateFunction and add Python wrapper.
See also User defined function to be applied to Window in PySpark?
What you are trying to do is write a UDAF (User Defined Aggregate Function) as opposed to a UDF (User Defined Function). UDAFs are functions that work on data grouped by a key. Specifically, they need to define how to merge multiple values in the group in a single partition, and then how to merge the results across partitions for each key. There is currently no way in Python to implement a UDAF; they can only be implemented in Scala.
But you can work around it in Python. You can use collect_set to gather your grouped values and then use a regular UDF to do what you want with them. The only caveat is that collect_set only works on primitive values, so you will need to encode them down to a string.
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, collect_list, concat_ws, udf
def myFunc(data_list):
    for val in data_list:
        b, c = val.split(',')
        # do something
    return <whatever>

myUdf = udf(myFunc, StringType())

df.withColumn('data', concat_ws(',', col('B'), col('C'))) \
  .groupBy('A').agg(collect_list('data').alias('data')) \
  .withColumn('data', myUdf('data'))
Use collect_set if you want deduping. Also, if you have lots of values for some of your keys, this will be slow because all values for a key will need to be collected in a single partition somewhere on your cluster. If your end result is a value you build by combining the values per key in some way (for example summing them) it might be faster to implement it using the RDD aggregateByKey method which lets you build an intermediate value for each key in a partition before shuffling data around.
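To make that last point concrete, here is a minimal aggregateByKey sketch, under the assumption that the end result per key is just a running sum of the B values:
# Per-partition: fold each B value into a local sum; across partitions: add the sums.
sums_by_key = (
    df.select("A", "B").rdd
      .map(lambda row: (row["A"], row["B"]))
      .aggregateByKey(0,
                      lambda acc, b: acc + b,   # merge a value into the partition-local accumulator
                      lambda a1, a2: a1 + a2)   # merge accumulators across partitions
)
sums_by_key.collect()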
EDIT: 11/21/2018
Since this answer was written, pyspark added support for UDAFs using Pandas. There are some nice performance improvements when using Pandas UDFs and UDAFs over straight Python functions with RDDs. Under the hood it vectorizes the columns (batches the values from multiple rows together to optimize processing and compression). Take a look here for a better explanation, or look at user6910411's answer below for an example.
I am going to extend the above answer.
You can implement the same logic as pandas.groupby().apply in PySpark using @pandas_udf, which is a vectorized method and faster than a plain UDF.
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd
df3 = spark.createDataFrame([('a', 1, 0), ('a', -1, 42), ('b', 3, -1),
('b', 10, -2)], ('key', 'value1', 'value2'))
from pyspark.sql.types import *
schema = StructType([StructField('key', StringType()),
StructField('avg_value1', DoubleType()),
StructField('avg_value2', DoubleType()),
StructField('sum_avg', DoubleType()),
StructField('sub_avg', DoubleType())])
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def g(df):
    gr = df['key'].iloc[0]
    x = df.value1.mean()
    y = df.value2.mean()
    w = df.value1.mean() + df.value2.mean()
    z = df.value1.mean() - df.value2.mean()
    return pd.DataFrame([[gr] + [x] + [y] + [w] + [z]])
df3.groupby('key').apply(g).show()
You will get below result:
+---+----------+----------+-------+-------+
|key|avg_value1|avg_value2|sum_avg|sub_avg|
+---+----------+----------+-------+-------+
| b| 6.5| -1.5| 5.0| 8.0|
| a| 0.0| 21.0| 21.0| -21.0|
+---+----------+----------+-------+-------+
So you can do more calculations between other fields in the grouped data and add them to the dataframe in list format.
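For these particular aggregates, the same result can also be reached with built-in functions and no pandas UDF at all, which is usually faster (a sketch):
from pyspark.sql import functions as F

df3.groupBy("key").agg(
    F.avg("value1").alias("avg_value1"),
    F.avg("value2").alias("avg_value2"),
    (F.avg("value1") + F.avg("value2")).alias("sum_avg"),
    (F.avg("value1") - F.avg("value2")).alias("sub_avg"),
).show()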
Another extension, new in PySpark version 3.0.0:
applyInPandas
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
                           ("id", "v"))

def mean_func(key, pdf):
    # key is a tuple of one numpy.int64, which is the value
    # of 'id' for the current group
    return pd.DataFrame([key + (pdf.v.mean(),)])

df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()
results in:
+---+---+
| id| v|
+---+---+
| 1|1.5|
| 2|6.0|
+---+---+
for further details see: https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html
I am having difficulty converting an RDD of the following structure to a dataframe in Spark using Python.
df1=[['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]]
After converting, my dataframe should look like the following:
usr1 usr2
itm1 2.0 NaN
itm2 NaN 3.0
itm22 NaN 6.0
itm3 3.0 5.0
I was initially thinking of converting the above RDD structure to the following:
df1={'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22':6}}
Then use Python's pandas module, pand = pd.DataFrame(dat2), and then convert the pandas dataframe back to a Spark dataframe using spark_df = context.createDataFrame(pand). However, I believe that by doing this I am converting an RDD to a non-RDD object and then converting back to an RDD, which is not correct. Can someone please help me out with this problem?
With data like this:
rdd = sc.parallelize([
['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]
])
flatten the records:
def to_record(kvs):
    user, *vs = kvs  # For Python 2.x use standard indexing / splicing
    for item, value in vs:
        yield user, item, value
records = rdd.flatMap(to_record)
convert to DataFrame:
df = records.toDF(["user", "item", "value"])
pivot:
result = df.groupBy("item").pivot("user").sum()
result.show()
## +-----+----+----+
## | item|usr1|usr2|
## +-----+----+----+
## | itm1| 2|null|
## | itm2|null| 3|
## | itm3| 3| 5|
## |itm22|null| 6|
## +-----+----+----+
Note: Spark DataFrames are designed to handle long and relatively thin data. If you want to generate wide contingency table, DataFrames won't be useful, especially if data is dense and you want to keep separate column per feature.
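If the wide, pandas-style frame from the question is what you actually need and the pivoted result is small enough to fit in driver memory, one option (a sketch, not part of the original answer) is to collect it and reshape locally:
# Collect the small pivoted result to the driver and index it by item with pandas.
wide = result.toPandas().set_index("item").sort_index()
print(wide)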
As a simplified example, I have a dataframe "df" with columns "col1,col2" and I want to compute a row-wise maximum after applying a function to each column :
def f(x):
    return (x + 1)
max_udf=udf(lambda x,y: max(x,y), IntegerType())
f_udf=udf(f, IntegerType())
df2=df.withColumn("result", max_udf(f_udf(df.col1),f_udf(df.col2)))
So if df:
col1 col2
1 2
3 0
Then
df2:
col1 col2 result
1 2 3
3 0 4
The above doesn't seem to work and produces "Cannot evaluate expression: PythonUDF#f..."
I'm absolutely positive "f_udf" works just fine on my table, and the main issue is with the max_udf.
Without creating extra columns or using basic map/reduce, is there a way to do the above entirely using dataframes and udfs? How should I modify "max_udf"?
I've also tried:
max_udf=udf(max, IntegerType())
which produces the same error.
I've also confirmed that the following works:
df2=(df.withColumn("temp1", f_udf(df.col1))
.withColumn("temp2", f_udf(df.col2))
df2=df2.withColumn("result", max_udf(df2.temp1,df2.temp2))
Why is it that I can't do these in one go?
I would like to see an answer that generalizes to any function "f_udf" and "max_udf."
I had a similar problem and found the solution in the answer to this stackoverflow question
To pass multiple columns or a whole row to a UDF, use a struct:
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType
df = sqlContext.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
count_empty_columns = udf(lambda row: len([x for x in row if x == None]), IntegerType())
new_df = df.withColumn("null_count", count_empty_columns(struct([df[x] for x in df.columns])))
new_df.show()
returns:
+----+----+----------+
| a| b|null_count|
+----+----+----------+
|null|null| 2|
| 1|null| 1|
|null| 2| 1|
+----+----+----------+
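Applied back to the original question, the same struct trick lets a single UDF see both columns at once. This is only a sketch: row_max is just an illustrative name, and the +1 step is folded into the UDF.
row_max = udf(lambda row: max(row[0] + 1, row[1] + 1), IntegerType())
df2 = df.withColumn("result", row_max(struct(df["col1"], df["col2"])))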
UserDefinedFunction throws an error when accepting UDFs as its arguments.
You can modify max_udf as below to make it work.
df = sc.parallelize([(1, 2), (3, 0)]).toDF(["col1", "col2"])
max_udf = udf(lambda x, y: max(x + 1, y + 1), IntegerType())
df2 = df.withColumn("result", max_udf(df.col1, df.col2))
Or
def f_udf(x):
    return (x + 1)
max_udf = udf(lambda x, y: max(x, y), IntegerType())
## f_udf=udf(f, IntegerType())
df2 = df.withColumn("result", max_udf(f_udf(df.col1), f_udf(df.col2)))
Note:
The second approach is valid if and only if internal functions (here f_udf) generate valid SQL expressions.
It works here because f_udf(df.col1) and f_udf(df.col2) are evaluated as Column<b'(col1 + 1)'> and Column<b'(col2 + 1)'> respectively, before being passed to max_udf. It wouldn't work with an arbitrary function.
It wouldn't work if we try for example something like this:
from math import exp
df.withColumn("result", max_udf(exp(df.col1), exp(df.col2)))
The best way to handle this is to escape the pyspark.sql.DataFrame representation and use pyspark.RDDs via pyspark.sql.Row.asDict() and [pyspark.RDD.map()](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html#pyspark.RDD.map).
import typing
# Save yourself some pain and always import these things: functions as F and types as T
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Row, SparkSession, SQLContext
spark = (
    SparkSession.builder.appName("Stack Overflow Example")
    .getOrCreate()
)
sc = spark.sparkContext
# sqlContet is needed sometimes to create DataFrames from RDDs
sqlContext = SQLContext(sc)
df = sc.parallelize([Row(**{"a": "hello", "b": 1, "c": 2}), Row(**{"a": "goodbye", "b": 2, "c": 1})]).toDF(["a", "b", "c"])
def to_string(record: dict) -> Row:
    """Create a readable string representation of the record"""
    record["readable"] = f'Word: {record["a"]} A: {record["b"]} B: {record["c"]}'
    return Row(**record)
# Apply the function with a map after converting the Row to a dict
readable_rdd = df.rdd.map(lambda x: x.asDict()).map(to_string)
# Test the function without running the entire DataFrame through it
print(readable_rdd.first())
# This results in: Row(a='hello', b=1, c=2, readable='Word: hello A: 1 B: 2')
# Sometimes you can use `toDF()` to get a dataframe
readable_df = readable_rdd.toDF()
readable_df.show()
# +-------+---+---+--------------------+
# | a| b| c| readable|
# +-------+---+---+--------------------+
# | hello| 1| 2|Word: hello A: 1 ...|
# |goodbye| 2| 1|Word: goodbye A: ...|
# +-------+---+---+--------------------+
# Sometimes you have to use createDataFrame with a specified schema
schema = T.StructType(
    [
        T.StructField("a", T.StringType(), True),
        T.StructField("b", T.IntegerType(), True),
        T.StructField("c", T.IntegerType(), True),  # c holds integers in the example rows
        T.StructField("readable", T.StringType(), True),
    ]
)
# This is more reliable, you should use it in production!
readable_df = sqlContext.createDataFrame(readable_rdd, schema)
readable_df.show()
# +-------+---+---+--------------------+
# | a| b| c| readable|
# +-------+---+---+--------------------+
# | hello| 1| 2|Word: hello A: 1 ...|
# |goodbye| 2| 1|Word: goodbye A: ...|
# +-------+---+---+--------------------+
Sometimes RDD.map() functions can't use certain Python libraries because mappers get serialized and so you need to partition the data into enough partitions to occupy all the cores of the cluster and then use pyspark.RDD.mapPartition() to process an entire partition (just an Iterable of dicts) at a time. This enables you to instantiate an expensive object once - like a spaCy Language model - and apply it to one record at a time without recreating it.
def to_string_partition(partition: typing.Iterable[dict]) -> typing.Iterable[Row]:
    """Add a readable string form to an entire partition"""
    # Instantiate expensive objects here
    # Apply these objects' methods here
    for record in partition:
        record["readable"] = f'Word: {record["a"]} A: {record["b"]} B: {record["c"]}'
        yield Row(**record)
readable_rdd = df.rdd.map(lambda x: x.asDict()).mapPartitions(to_string_partition)
print(readable_rdd.first())
# Row(a='hello', b=1, c=2, readable='Word: hello A: 1 B: 2')
# mapPartitions are more likely to require a specified schema
schema = T.StructType(
    [
        T.StructField("a", T.StringType(), True),
        T.StructField("b", T.IntegerType(), True),
        T.StructField("c", T.IntegerType(), True),  # c holds integers in the example rows
        T.StructField("readable", T.StringType(), True),
    ]
)
# This is more reliable, you should use it in production!
readable_df = sqlContext.createDataFrame(readable_rdd, schema)
readable_df.show()
# +-------+---+---+--------------------+
# | a| b| c| readable|
# +-------+---+---+--------------------+
# | hello| 1| 2|Word: hello A: 1 ...|
# |goodbye| 2| 1|Word: goodbye A: ...|
# +-------+---+---+--------------------+
The DataFrame APIs are good because they allow SQL-like operations to be faster, but sometimes you need the power of direct Python without any limitations and it will greatly benefit your analytics practice to learn to employ RDDs. You can group records for example and then evaluate the entire group in RAM, just so long as it fits - which you can arrange by altering the partition key and limiting workers/increasing their RAM.
import numpy as np

def median_b(x):
    """Process a group and determine the median value"""
    key = x[0]
    values = x[1]
    # Get the median value
    m = np.median([record["b"] for record in values])
    # Return a Row of the median for each group
    return Row(**{"a": key, "median_b": m})

median_b_rdd = df.rdd.map(lambda x: x.asDict()).groupBy(lambda x: x["a"]).map(median_b)
median_b_rdd.first()
# Row(a='hello', median_b=1.0)
Below is some useful code, made to create any new column by simply calling a top-level business rule, completely isolated from the heavy technical Spark details (no need to spend money or feel dependent on Databricks libraries anymore).
My advice: in your organization, try to do things simply and cleanly, for the benefit of top-level data users:
def createColumnFromRule(df, columnName, ruleClass, ruleName, inputColumns=None, inputValues=None, columnType=None):
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    def _getSparkClassType(shortType):
        defaultSparkClassType = "StringType"
        typesMapping = {
            "bigint"    : "LongType",
            "binary"    : "BinaryType",
            "boolean"   : "BooleanType",
            "byte"      : "ByteType",
            "date"      : "DateType",
            "decimal"   : "DecimalType",
            "double"    : "DoubleType",
            "float"     : "FloatType",
            "int"       : "IntegerType",
            "integer"   : "IntegerType",
            "long"      : "LongType",
            "numeric"   : "NumericType",
            "string"    : defaultSparkClassType,
            "timestamp" : "TimestampType"
        }
        sparkClassType = None
        try:
            sparkClassType = typesMapping[shortType]
        except:
            sparkClassType = defaultSparkClassType
        return sparkClassType

    if (columnType != None): sparkClassType = _getSparkClassType(columnType)
    else: sparkClassType = "StringType"

    aUdf = eval("F.udf(ruleClass." + ruleName + ", T." + sparkClassType + "())")

    columns = None
    values = None
    if (inputColumns != None): columns = F.struct([df[column] for column in inputColumns])
    if (inputValues != None): values = F.struct([F.lit(value) for value in inputValues])

    # Call the rule
    if (inputColumns != None and inputValues != None): df = df.withColumn(columnName, aUdf(columns, values))
    elif (inputColumns != None): df = df.withColumn(columnName, aUdf(columns, F.lit(None)))
    elif (inputValues != None): df = df.withColumn(columnName, aUdf(F.lit(None), values))
    # Create a Null column otherwise
    else:
        if (columnType != None):
            df = df.withColumn(columnName, F.lit(None).cast(columnType))
        else:
            df = df.withColumn(columnName, F.lit(None))

    # Return the resulting dataframe
    return df
Usage example:
# Define your business rule (you can get columns and values)
class CustomerRisk:
    def churnRisk(self, columns=None, values=None):
        isChurnRisk = False
        # ... Rule implementation starts here
        if (values != None):
            if (values[0] == "FORCE_CHURN=true"): isChurnRisk = True
        if (isChurnRisk == False and columns != None):
            if (columns["AGE"] <= 25): isChurnRisk = True
        # ...
        return isChurnRisk

# Execute the rule; it will create your new column in one line of code, that's all, easy isn't it?
# And look how to pass columns and values, it's really easy!
df = createColumnFromRule(df, columnName="CHURN_RISK", ruleClass=CustomerRisk(), ruleName="churnRisk", columnType="boolean", inputColumns=["NAME", "AGE", "ADDRESS"], inputValues=["FORCE_CHURN=true", "CHURN_RISK=100%"])