Py4JJavaError when applying a function to a column in PySpark (Python)

Spark version: 2.x
I'm new to PySpark. While encoding date-related columns for training a DNN, I keep running into the error mentioned in the title.
From df:
day  month  ...
  1      1
  2      3
  3      1  ...
I am trying to get the cosine and sine value for each column in order to capture their cyclic nature. Applying a function to a column with a PySpark UDF has worked fine until now, but the code below doesn't work:
def to_cos(x, _max):
    return np.sin(2*np.pi*x / _max)

to_cos_udf = udf(to_cos, DecimalType())
df = df.withColumn("month", to_cos_udf("month", 12))
I've tried it with IntegerType as the return type, and tried a one-argument version def to_cos(x), but neither works; the output is:
Py4JJavaError: An error occurred while calling o24702.showString.

Since you haven't shared the entire stack trace of the error, it's not clear what is actually causing the failure.
However, based on the code snippets you have shared, you first need to update your UDF definition as below.
Passing arguments to a UDF by wrapping it in a lambda is probably the simplest approach; apart from that, you can use functools.partial.
Data Preparation
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

df = pd.DataFrame({
    'month': [i for i in range(0, 12)],
})
sparkDF = sql.createDataFrame(df)  # `sql` is the SQLContext / SparkSession handle
sparkDF.show()
+-----+
|month|
+-----+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
+-----+
Custom UDF
def to_cos(x, _max):
    try:
        res = np.sin(2*np.pi*x / _max)
    except Exception as e:
        res = 0.0
    return float(res)

max_cos = 12
to_cos_udf = F.udf(lambda x: to_cos(x, max_cos), FloatType())

sparkDF = sparkDF.withColumn('month_cos', to_cos_udf('month'))
sparkDF.show()
+-----+-------------+
|month| month_cos|
+-----+-------------+
| 0| 0.0|
| 1| 0.5|
| 2| 0.8660254|
| 3| 1.0|
| 4| 0.8660254|
| 5| 0.5|
| 6|1.2246469E-16|
| 7| -0.5|
| 8| -0.8660254|
| 9| -1.0|
| 10| -0.8660254|
| 11| -0.5|
+-----+-------------+
Custom UDF - Partial
from functools import partial

partial_func = partial(to_cos, _max=max_cos)
to_cos_partial_udf = F.udf(partial_func)

sparkDF = sparkDF.withColumn('month_cos', to_cos_partial_udf('month'))
sparkDF.show()
+-----+--------------------+
|month| month_cos|
+-----+--------------------+
| 0| 0.0|
| 1| 0.49999999999999994|
| 2| 0.8660254037844386|
| 3| 1.0|
| 4| 0.8660254037844388|
| 5| 0.49999999999999994|
| 6|1.224646799147353...|
| 7| -0.4999999999999998|
| 8| -0.8660254037844384|
| 9| -1.0|
| 10| -0.8660254037844386|
| 11| -0.5000000000000004|
+-----+--------------------+
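For completeness, two more options that are not part of the answer above (a sketch, reusing the to_cos function, sparkDF, and imports from the snippets above). UDF arguments must be columns or column names, so a bare 12 cannot be passed directly; wrapping the constant in F.lit keeps the original two-argument signature. Alternatively, for plain sine/cosine encodings you can skip the Python UDF entirely and use Spark's built-in trigonometric functions, which avoids UDF serialization overhead.

import math

# Option 1: keep the two-argument UDF and pass the constant as a literal column
to_cos_udf2 = F.udf(to_cos, FloatType())
sparkDF = sparkDF.withColumn('month_cos', to_cos_udf2('month', F.lit(12)))

# Option 2: no UDF at all -- use the built-in F.sin on a column expression
sparkDF = sparkDF.withColumn('month_sin_native', F.sin(2 * math.pi * F.col('month') / 12))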

Related

Python string matching with Spark dataframe

I have a spark dataframe
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
1 | CA | appel | 100
2 | CA | orange| 20
3 | CA | berry | 10
I want to get rows where fruits are apple or orange. So I use Spark SQL:
SELECT * FROM table WHERE fruit LIKE '%apple%' OR fruit LIKE '%orange%';
It returns
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
2 | CA | orange| 20
But it is supposed to return
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
1 | CA | appel | 100
2 | CA | orange| 20
as row 1 is just a misspelling.
So I plan on using fuzzywuzzy for string matching.
I know that
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
print(fuzz.partial_ratio('apple', 'apple')) -> 100
print(fuzz.partial_ratio('apple', 'appel')) -> 83
But I am not sure how to apply this to a column in the dataframe to get the relevant rows.
Since you are interested in implementing fuzzy matching as a filter, you must first decide on a threshold for how similar you want the matches to be.
Approach 1
For your fuzzywuzzy import, this could be 80 for the purposes of this demonstration (adjust based on your needs). You could then implement a UDF to apply your imported fuzzy-matching code, e.g.
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.BooleanType())
def is_fuzzy_match(field_value, search_value, threshold=80):
    from fuzzywuzzy import fuzz
    return fuzz.partial_ratio(field_value, search_value) > threshold
Then apply your udf as a filter on your dataframe
df = (
    df.where(
        is_fuzzy_match(F.col("fruit"), F.lit("apple")) |
        is_fuzzy_match(F.col("fruit"), F.lit("orange"))
    )
)
Approach 2: Recommended
However, UDFs can be expensive when executed on Spark, and Spark already implements the levenshtein function, which is also useful here. You may want to start by reading more about how the Levenshtein distance accomplishes fuzzy matching.
With this approach, and using a threshold of 3, your code could look like this:
from pyspark.sql import functions as F

df = df.where(
    (
        F.levenshtein(
            F.col("fruit"),
            F.lit("apple")
        ) < 3
    ) |
    (
        F.levenshtein(
            F.col("fruit"),
            F.lit("orange")
        ) < 3
    )
)
df.show()
+---+----+------+--------+
| id|city| fruit|quantity|
+---+----+------+--------+
| 0| CA| apple| 300|
| 1| CA| appel| 100|
| 2| CA|orange| 20|
+---+----+------+--------+
For debugging purposes, the result of the levenshtein function is included below:
df.withColumn("diff",
F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).show()
+---+----+------+--------+----+
| id|city| fruit|quantity|diff|
+---+----+------+--------+----+
| 0| CA| apple| 300| 0|
| 1| CA| appel| 100| 2|
| 2| CA|orange| 20| 5|
| 3| CA| berry| 10| 5|
+---+----+------+--------+----+
Update 1
In response to additional sample data provided by the OP in the comments:
If I have a fruit like kashmir apple and want it to match with apple
Approach 3
You could try the following approach and adjust the threshold as desired.
Since you are interested in catching a possibly misspelled fruit anywhere in the text, you can apply the Levenshtein distance to every space-separated piece of the fruit name. The functions below (not UDFs, but plain column-expression helpers written this way for readability) implement this approach: matches_fruit_ratio estimates how much of a match is found, while matches_fruit takes the maximum matches_fruit_ratio over the pieces of the fruit name split by a space. (Note that F.transform with a Python lambda requires Spark 3.1+.)
from pyspark.sql import functions as F

def matches_fruit_ratio(fruit_column, fruit_search, threshold=0.3):
    return (F.length(fruit_column) - F.levenshtein(
        fruit_column,
        F.lit(fruit_search)
    )) / F.length(fruit_column)

def matches_fruit(fruit_column, fruit_search, threshold=0.6):
    return F.array_max(F.transform(
        F.split(fruit_column, " "),
        lambda fruit_piece: matches_fruit_ratio(fruit_piece, fruit_search)
    )) >= threshold
This can be used as follows:
df = df.where(
    matches_fruit(
        F.col("fruit"),
        "apple"
    ) | matches_fruit(
        F.col("fruit"),
        "orange"
    )
)
df.show()
+---+----+-------------+--------+
| id|city| fruit|quantity|
+---+----+-------------+--------+
| 0| CA| apple| 300|
| 1| CA| appel| 100|
| 2| CA| orange| 20|
| 4| CA| apply berry| 3|
| 5| CA| apple berry| 1|
| 6| CA|kashmir apple| 5|
| 7| CA|kashmir appel| 8|
+---+----+-------------+--------+
For debugging purposes, I have added extra sample data and output columns for the different components of each function, to demonstrate how it may be used:
df.withColumn("length",
F.length(
"fruit"
)
).withColumn("levenshtein",
F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).withColumn("length - levenshtein",
F.length(
"fruit"
) - F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).withColumn(
"matches_fruit_ratio",
matches_fruit_ratio(
F.col("fruit"),
"apple"
)
).withColumn(
"matches_fruit_values_before_threshold",
F.array_max(F.transform(
F.split("fruit"," "),
lambda fruit_piece : matches_fruit_ratio(fruit_piece,"apple")
))
).withColumn(
"matches_fruit",
matches_fruit(
F.col("fruit"),
"apple"
)
).show()
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| id|city| fruit|quantity|length|levenshtein|length - levenshtein|matches_fruit_ratio|matches_fruit_values_before_threshold|matches_fruit|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| 0| CA| apple| 300| 5| 0| 5| 1.0| 1.0| true|
| 1| CA| appel| 100| 5| 2| 3| 0.6| 0.6| true|
| 2| CA| orange| 20| 6| 5| 1|0.16666666666666666| 0.16666666666666666| false|
| 3| CA| berry| 10| 5| 5| 0| 0.0| 0.0| false|
| 4| CA| apply berry| 3| 11| 6| 5|0.45454545454545453| 0.8| true|
| 5| CA| apple berry| 1| 11| 6| 5|0.45454545454545453| 1.0| true|
| 6| CA|kashmir apple| 5| 13| 8| 5|0.38461538461538464| 1.0| true|
| 7| CA|kashmir appel| 8| 13| 10| 3|0.23076923076923078| 0.6| true|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
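If you do need fuzzywuzzy itself, a vectorized pandas UDF is usually cheaper than a row-at-a-time UDF, since it is evaluated once per batch of rows rather than once per row. A minimal sketch, assuming Spark 3.0+, pyarrow, and fuzzywuzzy available on the executors (the function name and the hard-coded threshold of 80 are just for illustration):

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.pandas_udf(T.BooleanType())
def fuzzy_matches_apple(fruit: pd.Series) -> pd.Series:
    # Runs on a whole pandas Series (one batch of rows) at a time
    from fuzzywuzzy import fuzz
    return fruit.apply(lambda v: fuzz.partial_ratio(v, "apple") > 80)

df.where(fuzzy_matches_apple(F.col("fruit"))).show()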

Merge 2 Spark dataframes with non-overlapping columns

I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they don't overlap (i.e. when the df2 entry is null, the df1 entry isn't, and vice versa).
I know that if I use join I won't get them in the same column and will instead get two "col_name" columns. I just want it in one column. How do I do this? Thanks.
Try this-
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
You could do this (note: this assumes pandas DataFrames, e.g. obtained via toPandas(), with the missing entries stored as literal 'null' strings; it will not work directly on Spark DataFrames):
mydf = df1.copy()  # make a copy of the first dataframe
idx = np.where(df1['col_name'].values == 'null')[0]  # indices where df1 is null
val = df2['col_name'].values[idx]  # values from df2 where df1 is null
mydf['col_name'][idx] = val  # assign those values in mydf
mydf  # print mydf
You should be able to use the coalesce function to achieve this:
from pyspark.sql.functions import coalesce

renamedDF1 = df1.withColumnRenamed("col_name", "col_name_a")
renamedDF2 = df2.withColumnRenamed("col_name", "col_name_b")

joinedDF = renamedDF1.join(renamedDF2, "id")
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(joinedDF["col_name_a"], joinedDF["col_name_b"])
)
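Another option, not from the answers above (a sketch, assuming both dataframes have exactly the id and col_name columns shown): since the non-null values never overlap, you can also union the two frames and keep the first non-null value per id.

from pyspark.sql import functions as F

merged = (
    df1.unionByName(df2)
       .groupBy("id")
       .agg(F.first("col_name", ignorenulls=True).alias("col_name"))
       .orderBy("id")
)
merged.show()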

Reshape pyspark dataframe to show moving window of item interactions

I have a large PySpark dataframe of subject interactions in long format: each row describes a subject interacting with some item of interest, along with a timestamp and a rank order for that subject's interactions (i.e., the first interaction is 1, the second is 2, etc.). Here are a few rows:
+----------+---------+----------------------+--------------------+
| date|itemId |interaction_date_order| userId|
+----------+---------+----------------------+--------------------+
|2019-07-23| 10005880| 1|37 |
|2019-07-23| 10005903| 2|37 |
|2019-07-23| 10005903| 3|37 |
|2019-07-23| 12458442| 4|37 |
|2019-07-26| 10005903| 5|37 |
|2019-07-26| 12632813| 6|37 |
|2019-07-26| 12632813| 7|37 |
|2019-07-26| 12634497| 8|37 |
|2018-11-24| 12245677| 1|5 |
|2018-11-24| 12245677| 1|5 |
|2019-07-29| 12541871| 2|5 |
|2019-07-29| 12541871| 3|5 |
|2019-07-30| 12626854| 4|5 |
|2019-08-31| 12776880| 5|5 |
|2019-08-31| 12776880| 6|5 |
+----------+---------+----------------------+--------------------+
I need to reshape these data such that, for each subject, a row has a length-5 moving window of interactions. So then, something like this:
+------+--------+--------+--------+--------+--------+
|userId| i-2 | i-1 | i | i+1 | i+2|
+------+--------+--------+--------+--------+--------+
|37 |10005880|10005903|10005903|12458442|10005903|
|37 |10005903|10005903|12458442|10005903|12632813|
Does anyone have suggestions for how I might do this?
Import spark and everything
from pyspark.sql import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
Create your dataframe
columns = '| date|itemId |interaction_date_order| userId|'.split('|')
lines = '''2019-07-23| 10005880| 1|37 |
2019-07-23| 10005903| 2|37 |
2019-07-23| 10005903| 3|37 |
2019-07-23| 12458442| 4|37 |
2019-07-26| 10005903| 5|37 |
2019-07-26| 12632813| 6|37 |
2019-07-26| 12632813| 7|37 |
2019-07-26| 12634497| 8|37 |
2018-11-24| 12245677| 1|5 |
2018-11-24| 12245677| 2|5 |
2019-07-29| 12541871| 3|5 |
2019-07-29| 12541871| 4|5 |
2019-07-30| 12626854| 5|5 |
2019-08-31| 12776880| 6|5 |
2019-08-31| 12776880| 7|5 |'''
Interaction = Row("date", "itemId", "interaction_date_order", "userId")

interactions = []
for line in lines.split('\n'):
    column_values = line.split('|')
    interaction = Interaction(column_values[0], int(column_values[1]), int(column_values[2]), int(column_values[3]))
    interactions.append(interaction)

df = spark.createDataFrame(interactions)
now we have
df.show()
+----------+--------+----------------------+------+
| date| itemId|interaction_date_order|userId|
+----------+--------+----------------------+------+
|2019-07-23|10005880| 1| 37|
|2019-07-23|10005903| 2| 37|
|2019-07-23|10005903| 3| 37|
|2019-07-23|12458442| 4| 37|
|2019-07-26|10005903| 5| 37|
|2019-07-26|12632813| 6| 37|
|2019-07-26|12632813| 7| 37|
|2019-07-26|12634497| 8| 37|
|2018-11-24|12245677| 1| 5|
|2018-11-24|12245677| 2| 5|
|2019-07-29|12541871| 3| 5|
|2019-07-29|12541871| 4| 5|
|2019-07-30|12626854| 5| 5|
|2019-08-31|12776880| 6| 5|
|2019-08-31|12776880| 7| 5|
+----------+--------+----------------------+------+
Create a window and collect itemId with count
from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window() \
    .partitionBy('userId') \
    .orderBy('interaction_date_order') \
    .rowsBetween(Window.currentRow, Window.currentRow + 4)
df2 = df.withColumn("itemId_list", F.collect_list('itemId').over(window))
df2 = df2.withColumn("itemId_count", F.count('itemId').over(window))
df_final = df2.where(df2['itemId_count'] == 5)
now we have
df_final.show()
+----------+--------+----------------------+------+--------------------+------------+
| date| itemId|interaction_date_order|userId| itemId_list|itemId_count|
+----------+--------+----------------------+------+--------------------+------------+
|2018-11-24|12245677| 1| 5|[12245677, 122456...| 5|
|2018-11-24|12245677| 2| 5|[12245677, 125418...| 5|
|2019-07-29|12541871| 3| 5|[12541871, 125418...| 5|
|2019-07-23|10005880| 1| 37|[10005880, 100059...| 5|
|2019-07-23|10005903| 2| 37|[10005903, 100059...| 5|
|2019-07-23|10005903| 3| 37|[10005903, 124584...| 5|
|2019-07-23|12458442| 4| 37|[12458442, 100059...| 5|
+----------+--------+----------------------+------+--------------------+------------+
Final touch
df_final2 = (df_final
    .withColumn('i-2', df_final['itemId_list'][0])
    .withColumn('i-1', df_final['itemId_list'][1])
    .withColumn('i', df_final['itemId_list'][2])
    .withColumn('i+1', df_final['itemId_list'][3])
    .withColumn('i+2', df_final['itemId_list'][4])
    .select('userId', 'i-2', 'i-1', 'i', 'i+1', 'i+2')
)
df_final2.show()
+------+--------+--------+--------+--------+--------+
|userId| i-2| i-1| i| i+1| i+2|
+------+--------+--------+--------+--------+--------+
| 5|12245677|12245677|12541871|12541871|12626854|
| 5|12245677|12541871|12541871|12626854|12776880|
| 5|12541871|12541871|12626854|12776880|12776880|
| 37|10005880|10005903|10005903|12458442|10005903|
| 37|10005903|10005903|12458442|10005903|12632813|
| 37|10005903|12458442|10005903|12632813|12632813|
| 37|12458442|10005903|12632813|12632813|12634497|
+------+--------+--------+--------+--------+--------+
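An equivalent approach, not from the answer above (a sketch that reuses the df built earlier): instead of collecting a list over a window and filtering on its length, you can use F.lead to pull the next four itemIds onto each row and drop the rows that do not have a full window.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('userId').orderBy('interaction_date_order')

df_windowed = (
    df.select(
        'userId',
        F.col('itemId').alias('i-2'),
        F.lead('itemId', 1).over(w).alias('i-1'),
        F.lead('itemId', 2).over(w).alias('i'),
        F.lead('itemId', 3).over(w).alias('i+1'),
        F.lead('itemId', 4).over(w).alias('i+2'),
    )
    .dropna()  # rows near the end of each user's history have no full window
)
df_windowed.show()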

Sort or orderBy in pyspark showing strange output

I am trying to sort values in my PySpark dataframe, but it shows strange output. Instead of sorting by the entire number, it sorts by the first digit of the number.
I have tried both the sort and orderBy methods; both give the same result.
sdf=spark.read.csv("dummy.txt", header=True)
sdf.sort('1',ascending=False).show()
I get the following output:
+---+
|  1|
+---+
| 98|
| 9|
| 8|
| 76|
| 7|
| 68|
| 6|
| 54|
| 5|
| 43|
| 4|
| 35|
| 34|
| 34|
| 3|
| 2|
| 2|
| 2|
| 10|
+---+
Can anyone explain this behavior?
Since your column contains String data, the values are compared character by character (lexicographically) rather than as numbers, which is why "98" sorts above "9" and "10" comes last.
So you can cast the column to a numeric type and then apply orderBy to achieve the required result:
>>> df
DataFrame[Numb: string]
>>> df.show()
+----+
|Numb|
+----+
| 20|
| 19|
| 1|
| 200|
| 60|
+----+
>>> df.orderBy(df.Numb.cast('int'),ascending=False).show()
+----+
|Numb|
+----+
| 200|
| 60|
| 20|
| 19|
| 1|
+----+
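Applied to the original example (a sketch, assuming the same single-column dummy.txt from the question, whose column is named '1'): either cast at sort time, or let Spark infer numeric types when reading the CSV.

from pyspark.sql import functions as F

# Cast the string column to int just for ordering
sdf.orderBy(F.col('1').cast('int'), ascending=False).show()

# Or infer the schema when reading, so the column is numeric from the start
sdf = spark.read.csv("dummy.txt", header=True, inferSchema=True)
sdf.sort('1', ascending=False).show()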

Pyspark | Separate the string / int values from the dataframe

I have a Spark Dataframe as below:
+---------+
|col_str_1|
+---------+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| a|
| b|
| c|
| d|
| e|
| f|
| g|
| h|
| 1|
| 2|
| 3.0|
+---------+
I want to separate the string / int / float values based on the request.
For example:
If the request is for STRING, the returned DF must look like below:
+---------+
|col_str_1|
+---------+
| a|
| b|
| c|
| d|
| e|
| f|
| g|
| h|
+---------+
If the request is for INTEGER, the returned DF must look like below:
+---------+
|col_str_1|
+---------+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 1|
| 2|
+---------+
I tried the steps below:
>> df = sqlContext.sql('select * from --db--.vt_prof_test')
>> columns = df.columns[0]
>> df.select(columns).????
How do I proceed further, using either filter or map? Can anyone help me out?
You can go for a UDF:
import pyspark.sql.functions as F

df = sqlContext.sql('select * from --db--.vt_prof_test')

REQUEST = 'STRING'
request_bc = sc.broadcast(REQUEST)

def check_value(val):
    if request_bc.value == 'STRING':
        try:
            val = int(val)
            return None
        except:
            return val
    if request_bc.value == 'INTEGER':
        try:
            val = int(val)
            return val
        except:
            return None

check_udf = F.udf(lambda x: check_value(x))

df = df.select(check_udf(F.col('col_str_1')).alias('col_str_1')).dropna()
Set the REQUEST parameter according to the need.
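A UDF-free alternative (a sketch, assuming the single col_str_1 column shown above): filter with regular expressions via rlike, which keeps the work inside the JVM instead of round-tripping through Python.

from pyspark.sql import functions as F

# Rows that are pure integers
int_df = df.where(F.col('col_str_1').rlike(r'^[0-9]+$'))

# Rows that are pure alphabetic strings
str_df = df.where(F.col('col_str_1').rlike(r'^[A-Za-z]+$'))

# Rows that are floats such as 3.0
float_df = df.where(F.col('col_str_1').rlike(r'^[0-9]+\.[0-9]+$'))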
