I am trying to figure out how to translate my Pandas-utilising function to PySpark.
I have a Pandas DataFrame like this:
+---+----+
|num| val|
+---+----+
| 1| 0.0|
| 2| 0.0|
| 3|48.6|
| 4|49.0|
| 5|48.7|
| 6|49.1|
| 7|74.5|
| 8|48.7|
| 9| 0.0|
| 10|49.0|
| 11| 0.0|
| 12| 0.0|
+---+----+
The code in the snippet below is fairly simple. It goes forward until it finds a non-zero value; if there is none, it goes backward for the same purpose.
def next_non_zero(data, i, column):
    for j in range(i+1, len(data[column])):
        res = data[column].iloc[j]
        if res != 0:
            return res
    for j in range(i-1, 0, -1):
        res = data[column].iloc[j]
        if res != 0:
            return res

def fix_zero(data, column):
    for i, row in data.iterrows():
        if (row[column] == 0):
            data.at[i, column] = next_non_zero(data, i, column)
So as a result I expect to see
+---+----+
|num| val|
+---+----+
| 1|48.6|
| 2|48.6|
| 3|48.6|
| 4|49.0|
| 5|48.7|
| 6|49.1|
| 7|74.5|
| 8|48.7|
| 9|49.0|
| 10|49.0|
| 11|49.0|
| 12|49.0|
+---+----+
So I do understand that in PySpark I have to create a new column with the desired result and replace an existing column using withColumn() for example. However, I do not understand how to properly iterate through a DataFrame.
I am trying to use functions over Window:
my_window = Window.partitionBy().orderBy('num')
df = df.withColumn('new_val', F.when(df.val == 0, F.lead(df.val).over(my_window))
                               .otherwise(F.lag(df.val).over(my_window)))
Obviously, it does not provide me with the desired result as it iterates only once.
So I tried to write a recursive UDF like
def fix_zero(param):
    return F.when(F.lead(param).over(my_window) != 0, F.lead(param).over(my_window)) \
            .otherwise(fix_zero(F.lead(param).over(my_window)))
spark_udf = udf(fix_zero, DoubleType())
df = df.withColumn('new_val', F.when(df.val!=0, df.val).otherwise(fix_zero('val')))
I got
RecursionError: maximum recursion depth exceeded in comparison
I suspect that this is because I pass into the recursion not a row but the result of lead().
Anyway, I am totally stuck on this hurdle at the moment and would deeply appreciate any advice.
There is a way with Window to go through all preceding (or all following) rows until you reach a non-null value.
So my first step was to replace all 0 values with null.
Recreating your dataframe:
values = [
    (1, 0.0),
    (2, 0.0),
    (3, 48.6),
    (4, 49.0),
    (5, 48.7),
    (6, 49.1),
    (7, 74.5),
    (8, 48.7),
    (9, 0.0),
    (10, 49.0),
    (11, 0.0),
    (12, 0.0)
]
df = spark.createDataFrame(values, ['num','val'])
Replacing 0s with null
from pyspark.sql.functions import when, lit, col
df = df.withColumn('val_null', when(col('val') != 0.0, col('val')))
Then define the windows which, combined with first/last and ignorenulls, will allow us to get the last non-null value before the row and the first non-null value after the row.
from pyspark.sql import Window
from pyspark.sql.functions import last,first,coalesce
# ordering by num makes the forward/backward fill deterministic
windowForward = Window.orderBy('num').rowsBetween(Window.unboundedPreceding, Window.currentRow)
ffilled_column = last(df['val_null'], ignorenulls=True).over(windowForward)

windowBackward = Window.orderBy('num').rowsBetween(Window.currentRow, Window.unboundedFollowing)
bfilled_column = first(df['val_null'], ignorenulls=True).over(windowBackward)
# creating new columns in df
df = df.withColumn('ffill', ffilled_column).withColumn('bfill', bfilled_column)
# take bfill (first non-null value after the row) if it is not null, otherwise fall back to ffill
df = df.withColumn('val_full', coalesce('bfill', 'ffill'))
Using this technique we arrive at your expected output in column 'val_full'
+---+----+--------+-----+-----+--------+
|num| val|val_null|ffill|bfill|val_full|
+---+----+--------+-----+-----+--------+
| 1| 0.0| null| null| 48.6| 48.6|
| 2| 0.0| null| null| 48.6| 48.6|
| 3|48.6| 48.6| 48.6| 48.6| 48.6|
| 4|49.0| 49.0| 49.0| 49.0| 49.0|
| 5|48.7| 48.7| 48.7| 48.7| 48.7|
| 6|49.1| 49.1| 49.1| 49.1| 49.1|
| 7|74.5| 74.5| 74.5| 74.5| 74.5|
| 8|48.7| 48.7| 48.7| 48.7| 48.7|
| 9| 0.0| null| 48.7| 49.0| 49.0|
| 10|49.0| 49.0| 49.0| 49.0| 49.0|
| 11| 0.0| null| 49.0| null| 49.0|
| 12| 0.0| null| 49.0| null| 49.0|
+---+----+--------+-----+-----+--------+
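If you only need the original two columns back, you can keep just the filled value afterwards (reusing the col import from above), e.g.:
df = df.select('num', col('val_full').alias('val'))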
Spark version: 2.x
New to PySpark.
While encoding date-related columns for training a DNN, I keep facing the error mentioned in the title.
From df:
day  month  ...
  1      1
  2      3
  3      1  ...
I am trying to get the cosine and sine value for each column in order to capture their cyclic nature.
Applying a function to a column with a PySpark UDF has worked fine until now, but the code below doesn't work:
def to_cos(x, _max):
    return np.sin(2*np.pi*x / _max)

to_cos_udf = udf(to_cos, DecimalType())
df = df.withColumn("month", to_cos_udf("month", 12))
I've tried it with IntegerType and with only one variable (def to_cos(x)), but neither seems to work, and it outputs:
Py4JJavaError: An error occurred while calling o24702.showString.
Since you haven't shared the entire stack trace from the error, I'm not sure what is actually causing the failure.
However, based on the code snippets you have shared, you first need to update your UDF definition as below -
Passing arguments to a UDF by wrapping it in a lambda is probably the best approach; apart from that, you can use partial.
Data Preparation
import pandas as pd
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType

sql = SparkSession.builder.getOrCreate()  # Spark session; named sql to match the snippets below

df = pd.DataFrame({
    'month': [i for i in range(0, 12)],
})

sparkDF = sql.createDataFrame(df)
sparkDF.show()
+-----+
|month|
+-----+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
+-----+
Custom UDF
def to_cos(x, _max):
    try:
        res = np.sin(2*np.pi*x / _max)
    except Exception as e:
        res = 0.0
    return float(res)

max_cos = 12

to_cos_udf = F.udf(lambda x: to_cos(x, max_cos), FloatType())

sparkDF = sparkDF.withColumn('month_cos', to_cos_udf('month'))
sparkDF.show()
+-----+-------------+
|month| month_cos|
+-----+-------------+
| 0| 0.0|
| 1| 0.5|
| 2| 0.8660254|
| 3| 1.0|
| 4| 0.8660254|
| 5| 0.5|
| 6|1.2246469E-16|
| 7| -0.5|
| 8| -0.8660254|
| 9| -1.0|
| 10| -0.8660254|
| 11| -0.5|
+-----+-------------+
Custom UDF - Partial
from functools import partial
partial_func = partial(to_cos,_max=max_cos)
to_cos_partial_udf = F.udf(partial_func)
sparkDF = sparkDF.withColumn('month_cos',to_cos_partial_udf('month'))
sparkDF.show()
+-----+--------------------+
|month| month_cos|
+-----+--------------------+
| 0| 0.0|
| 1| 0.49999999999999994|
| 2| 0.8660254037844386|
| 3| 1.0|
| 4| 0.8660254037844388|
| 5| 0.49999999999999994|
| 6|1.224646799147353...|
| 7| -0.4999999999999998|
| 8| -0.8660254037844384|
| 9| -1.0|
| 10| -0.8660254037844386|
| 11| -0.5000000000000004|
+-----+--------------------+
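By default udf returns strings, which is why the partial version above prints the full double repr. If you want a float column here as well, you can pass the return type explicitly (a small variation on the snippet above, reusing FloatType):
to_cos_partial_udf = F.udf(partial_func, FloatType())
sparkDF = sparkDF.withColumn('month_cos', to_cos_partial_udf('month'))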
I'm trying to compare two columns that are in two different data frames, and if I find a match I return the value 1, else None -
df1 =
df2 =
df1 (Expected_Output) =
I have tried the below code -
def getImpact(row):
    match = df2.filter(df2.second_key == row)
    if match.count() > 0:
        return 1
    return None

udf_sol = udf(lambda x: getImpact(x), IntegerType())
df1 = df1.withColumn('impact', udf_sol(df1.first_key))
But I am getting the below error -
TypeError: cannot pickle '_thread.RLock' object
Can anyone help me to achieve the expected output as shown above?
Thanks
Assuming first_key and second_key are unique, you can opt for a join across the dataframes -
More examples and explanation can be found here
from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import Window

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)  # used below to create the dataframes
data_list1 = [
("abcd","Key1")
,("jkasd","Key2")
,("oigoa","Key3")
,("ad","Key4")
,("bas","Key5")
,("lkalsjf","Key6")
,("bsawva","Key7")
]
data_list2 = [
("cashj","Key1",10)
,("ax","Key11",12)
,("safa","Key5",21)
,("safasf","Key6",78)
,("vasv","Key3",4)
,("wgaga","Key8",0)
,("saasfas","Key7",10)
]
sparkDF1 = sql.createDataFrame(data_list1,['data','first_key'])
sparkDF2 = sql.createDataFrame(data_list2,['temp_data','second_key','frinks'])
>>> sparkDF1.show()
+-------+---------+
| data|first_key|
+-------+---------+
| abcd| Key1|
| jkasd| Key2|
| oigoa| Key3|
| ad| Key4|
| bas| Key5|
|lkalsjf| Key6|
| bsawva| Key7|
+-------+---------+
>>> sparkDF2.show()
+---------+----------+------+
|temp_data|second_key|frinks|
+---------+----------+------+
| cashj| Key1| 10|
| ax| Key11| 12|
| safa| Key5| 21|
| safasf| Key6| 78|
| vasv| Key3| 4|
| wgaga| Key8| 0|
| saasfas| Key7| 10|
+---------+----------+------+
#### Joining the dataframes on common columns
finalDF = sparkDF1.join(
sparkDF2
,(sparkDF1['first_key'] == sparkDF2['second_key'])
,'left'
).select(sparkDF1['*'],sparkDF2['frinks']).orderBy('frinks')
### Identifying impact if the frinks value is Null or Not
finalDF = finalDF.withColumn('impact',F.when(F.col('frinks').isNull(),0).otherwise(1))
>>> finalDF.show()
+-------+---------+------+------+
| data|first_key|frinks|impact|
+-------+---------+------+------+
| jkasd| Key2| null| 0|
| ad| Key4| null| 0|
| oigoa| Key3| 4| 1|
| abcd| Key1| 10| 1|
| bsawva| Key7| 10| 1|
| bas| Key5| 21| 1|
|lkalsjf| Key6| 78| 1|
+-------+---------+------+------+
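If you only need the flag next to the original df1 columns, you can drop frinks at the end, e.g.:
finalDF = finalDF.select('data', 'first_key', 'impact')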
import numpy as np
df1['final']= np.where(df1['first_key']==df2['second_key'],'1','None')
Here I need to find the exponential moving average in a Spark dataframe:
Table:
ab = spark.createDataFrame(
[(1,"1/1/2020", 41.0,0.5, 0.5 ,1, '10.22'),
(1,"10/3/2020",24.0,0.3, 0.7 ,2, '' ),
(1,"21/5/2020",32.0,0.4, 0.6 ,3, '' ),
(2,"3/1/2020", 51.0,0.22, 0.78,1, '34.78'),
(2,"10/5/2020",14.56,0.333,0.66,2, '' ),
(2,"30/9/2020",17.0,0.66, 0.34,3, '' )],["CID","date","A","B","C","Row","SMA"] )
ab.show()
+---+---------+-----+-----+----+---+-----+
|CID| date| A| B| C| Row| SMA|
+---+---------+-----+-----+----+---+-----+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| |
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| |
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| |
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |
+---+---------+-----+-----+----+---+-----+
Expected Output :
+---+---------+-----+-----+----+---+-----+----------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+---------+-----+-----+----+---+-----+----------+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| | 14.354|
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| | 21.4124|
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| | 28.04674|
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |20.7558916|
+---+---------+-----+-----+----+---+-----+----------+
Logic:
For every customer,
    if Row == 1 then SMA as EMA
    else ( C * LAG(EMA) + A * B ) as EMA
The problem here is that a freshly calculated value of a previous row is used as input for the current row. That means that it is not possible to parallelize the calculations for a single customer.
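For example, plugging in the numbers from the question for CID 1:
EMA(row 2) = C * EMA(row 1) + A * B = 0.7 * 10.22 + 24.0 * 0.3 = 14.354
EMA(row 3) = 0.6 * 14.354 + 32.0 * 0.4 = 21.4124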
For Spark 3.0+, it is possible to get the required result with a pandas udf using grouped map
import pyspark.sql.functions as F
import pyspark.sql.types as T

ab = spark.createDataFrame(
    [(1,"1/1/2020", 41.0,0.5, 0.5 ,1, '10.22'),
     (1,"10/3/2020",24.0,0.3, 0.7 ,2, '' ),
     (1,"21/5/2020",32.0,0.4, 0.6 ,3, '' ),
     (2,"3/1/2020", 51.0,0.22, 0.78,1, '34.78'),
     (2,"10/5/2020",14.56,0.333,0.66,2, '' ),
     (2,"30/9/2020",17.0,0.66, 0.34,3, '' )],
    ["CID","date","A","B","C","Row","SMA"]) \
    .withColumn("SMA", F.col('SMA').cast(T.DoubleType())) \
    .withColumn("date", F.to_date(F.col("date"), "d/M/yyyy"))
import pandas as pd
def calc(df: pd.DataFrame):
    # df is a pandas.DataFrame
    df = df.sort_values('date').reset_index(drop=True)
    df.loc[0, 'EMA'] = df.loc[0, 'SMA']
    for i in range(1, len(df)):
        df.loc[i, 'EMA'] = df.loc[i, 'C'] * df.loc[i-1, 'EMA'] + \
                           df.loc[i, 'A'] * df.loc[i, 'B']
    return df
ab.groupBy("CID").applyInPandas(calc,
schema = "CID long, date date, A double, B double, C double, Row long, SMA double, EMA double")\
.show()
Output:
+---+----------+-----+-----+----+---+-----+------------------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+----------+-----+-----+----+---+-----+------------------+
| 1|2020-01-01| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|2020-03-10| 24.0| 0.3| 0.7| 2| null| 14.354|
| 1|2020-05-21| 32.0| 0.4| 0.6| 3| null|21.412399999999998|
| 2|2020-01-03| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|2020-05-10|14.56|0.333|0.66| 2| null| 27.80328|
| 2|2020-09-30| 17.0| 0.66|0.34| 3| null| 20.6731152|
+---+----------+-----+-----+----+---+-----+------------------+
The idea is to use a Pandas dataframe for each group. This Pandas dataframe contains all values of the current partition and is ordered by date. During the iteration over the Pandas dataframe we can now access the value of EMA of the previous row (which is not possible for a Spark dataframe).
There are some caveats:
- all rows of one partition should fit into the memory of a single executor; partial aggregation is not possible here
- iterating over a Pandas dataframe is discouraged
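If you are still on Spark 2.3/2.4, where applyInPandas is not available, a similar result should be achievable with the older grouped-map pandas UDF API (a sketch only, reusing the calc logic and schema from above; calc_ema and ema_schema are names introduced here, and this decorator style is deprecated in Spark 3.x):
from pyspark.sql.functions import pandas_udf, PandasUDFType

ema_schema = "CID long, date date, A double, B double, C double, Row long, SMA double, EMA double"

@pandas_udf(ema_schema, PandasUDFType.GROUPED_MAP)
def calc_ema(df):
    # same iteration as calc above: sort by date and carry the previous EMA forward
    df = df.sort_values('date').reset_index(drop=True)
    df.loc[0, 'EMA'] = df.loc[0, 'SMA']
    for i in range(1, len(df)):
        df.loc[i, 'EMA'] = df.loc[i, 'C'] * df.loc[i-1, 'EMA'] + df.loc[i, 'A'] * df.loc[i, 'B']
    return df

ab.groupBy("CID").apply(calc_ema).show()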
I have the below dataframe and I want to update the rows dynamically with some values.
input_frame.show()
+----------+----------+---------+
|student_id|name |timestamp|
+----------+----------+---------+
| s1|testuser | t1|
| s1|sampleuser| t2|
| s2|test123 | t1|
| s2|sample123 | t2|
+----------+----------+---------+
input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'),sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
I want to update the 'test' column with some values and then apply a filter with partial matches on the column. But concatenating to a null column results in a null column again. How can I do this?
Use concat_ws, like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, when, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")

# This won't work
df = df.withColumn("concat", concat(df.a, df.b))

# This won't work
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))

# Do it like this
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()
gives:
+----+----+------+-------------+---------+
| a| b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
| 1| 2| 12| 12| 12|
| 2|null| null| null| 2|
| 3| 4| 34| 34| 34|
| 4| 5| 45| 45| 45|
|null| 6| null| null| 6|
+----+----+------+-------------+---------+
Note specifically that casting a NULL column to string doesn't work as you might wish, and will result in the whole concatenated value being NULL if any column is null.
There's no nice way of dealing with more complicated scenarios, but note that you can use a when statement inside a concat if you're willing to suffer the verbosity of it, like this:
df.withColumn("concat_custom", concat(
when(df.a.isNull(), lit('_')).otherwise(df.a),
when(df.b.isNull(), lit('_')).otherwise(df.b))
)
To get, eg:
+----+----+-------------+
| a| b|concat_custom|
+----+----+-------------+
| 1| 2| 12|
| 2|null| 2_|
| 3| 4| 34|
| 4| 5| 45|
|null| 6| _6|
+----+----+-------------+
You can use the coalesce function, which returns the first of its arguments that is not null, and provide a literal in the second place, which will be used in case the column has a null value.
df = df.withColumn("concat", concat(coalesce(df.a, lit('')), coalesce(df.b, lit(''))))
You can fill null values with empty strings:
import pyspark.sql.functions as f
from pyspark.sql.types import *
data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
data = data.withColumn('test', f.lit(None).cast(StringType()))
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))
Is that what you were looking for?
I am trying to generate an additional column in a dataframe with auto-incrementing values based on a global value. However, all the rows are generated with the same value and the value is not incrementing.
Here is the code
def autoIncrement():
    global rec
    if (rec == 0): rec = 1
    else: rec = rec + 1
    return int(rec)

rec = 14
UDF
autoIncrementUDF = udf(autoIncrement, IntegerType())
df1 = hiveContext.sql("select id,name,location,state,datetime,zipcode from demo.target")
df1.withColumn("id2", autoIncrementUDF()).show()
Here is the result df
+---+------+--------+----------+-------------------+-------+---+
| id| name|location| state| datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00| NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00| NULL| 15|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00| NULL| 15|
| 30|Manish| Gurgoan| Gujarat|2018-03-27 11:00:00| NULL| 15|
+---+------+--------+----------+-------------------+-------+---+
But I am expecting the below result
+---+------+--------+----------+-------------------+-------+---+
| id| name|location| state| datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00| NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00| NULL| 16|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00| NULL| 17|
| 30|Manish| Gurgoan| Gujarat|2018-03-27 11:00:00| NULL| 18|
+---+------+--------+----------+-------------------+-------+---+
Any help is appreciated.
Global variables are bound to a Python process. A UDF may be executed in parallel on different workers across the cluster, and it should be deterministic.
You should use monotonically_increasing_id() function from pyspark.sql.functions module.
Check the docs for more info.
You should be careful because this function is dynamic and not sticky:
How do I add a persistent column of row ids to Spark DataFrame?
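For example, a minimal sketch (assuming the df1 from your snippet; ordering by datetime in the row_number variant is only an assumption about the order you want):
from pyspark.sql import functions as F, Window

# globally unique ids, but not consecutive and not starting at a chosen value
df1 = df1.withColumn("id2", F.monotonically_increasing_id())

# if consecutive ids starting at 15 are needed, a row_number over an explicit
# ordering works, at the cost of pulling all rows into a single window partition
w = Window.orderBy("datetime")
df1 = df1.withColumn("id2", F.row_number().over(w) + 14)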