Dataframe Join Null-Safe Condition Use - python

I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:
dfA:
# +----+----+
# |col1|col2|
# +----+----+
# | a|null|
# | b| 0|
# | c| 0|
# +----+----+
dfB:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x|
# | b| 0| x|
# +----+----+----+
The dataframes can be created with this script:
dfA = spark.createDataFrame(
    [
        ('a', None),
        ('b', '0'),
        ('c', '0')
    ],
    ('col1', 'col2')
)
dfB = spark.createDataFrame(
    [
        ('a', None, 'x'),
        ('b', '0', 'x')
    ],
    ('col1', 'col2', 'col3')
)
Join call:
dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()
Result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null|null| <- col3 should be x
# | b| 0| x|
# | c| 0|null|
# +----+----+----+
Expected result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x| <-
# | b| 0| x|
# | c| 0|null|
# +----+----+----+
It works if I set col2 in the first row to anything other than null, but I need to support null values.
I tried using a join condition that compares with null-safe equals, as outlined in this post, like so:
cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
Result of null-safe join:
# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# | a|null| a|null| x|
# | b| 0| b| 0| x|
# | c| 0|null|null|null|
# +----+----+----+----+----+
This retains the duplicate columns, though. I'm still looking for a way to achieve the expected result at the end of the join.

A simple solution would be to select the columns that you want to keep. This will let you specify which source dataframe they should come from as well as avoid the duplicate column issue.
dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()

This fails because col1 in orderBy is ambiguous. You should reference a specific source dataframe, for example dfA:
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()

If you have to join dataframes on columns containing null values in PySpark, use eqNullSafe in the join condition so that null matches null. For more examples, see https://knowledges.co.in/how-to-use-eqnullsafe-in-pyspark-for-null-values/
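Putting the two answers together, here is a sketch (using the same dfA, dfB and cond as above) that should produce the expected result:
cond = dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2)
(dfA.join(dfB, cond, how='left')
    .select(dfA.col1, dfA.col2, dfB.col3)  # keep a single copy of each column
    .orderBy(dfA.col1)                     # reference dfA explicitly to avoid ambiguity
    .show())
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+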

Related

Pyspark add row based on a condition

I have the below dataframe structure:
A  B           C
1  open        01.01.22 10:05:04
1  In-process  01.01.22 10:07:02
I need to insert a row before the 'open' row. So I need to check whether the status is open, and then add a new row before it with the same values in the other columns, except column C, which should have 1 hour subtracted. How can this be achieved using PySpark?
Instead of "insert a row" – which is a non-trivial issue to solve –, think about it as "union dataset"
Assuming this is your dataset
df = spark.createDataFrame([
    (1, 'open', '01.01.22 10:05:04'),
    (1, 'In process', '01.01.22 10:07:02'),
], ['a', 'b', 'c'])
+---+----------+-----------------+
| a| b| c|
+---+----------+-----------------+
| 1| open|01.01.22 10:05:04|
| 1|In process|01.01.22 10:07:02|
+---+----------+-----------------+
Based on your rule, we can construct another dataset like this
from pyspark.sql import functions as F
df_new = (df
    .where(F.col('b') == 'open')
    .withColumn('b', F.lit('Before open'))
    .withColumn('c', F.to_timestamp('c', 'dd.MM.yy HH:mm:ss'))  # convert text to timestamp with custom date format
    .withColumn('c', F.col('c') - F.expr('interval 1 hour'))    # subtract 1 hour
    .withColumn('c', F.from_unixtime(F.unix_timestamp('c'), 'dd.MM.yy HH:mm:ss'))  # revert to custom date format
)
+---+-----------+-----------------+
| a| b| c|
+---+-----------+-----------------+
| 1|Before open|01.01.22 09:05:04|
+---+-----------+-----------------+
Now you just need to union them together, and sort if you want to "see" it
(df
    .union(df_new)
    .orderBy('a', 'c')
    .show()
)
+---+-----------+-----------------+
| a| b| c|
+---+-----------+-----------------+
| 1|Before open|01.01.22 09:05:04|
| 1| open|01.01.22 10:05:04|
| 1| In process|01.01.22 10:07:02|
+---+-----------+-----------------+
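If the two dataframes could ever end up with a different column order, unionByName (Spark 2.3+) is a safer alternative to the positional union; a minimal sketch of the same step:
(df
    .unionByName(df_new)   # columns are matched by name instead of by position
    .orderBy('a', 'c')
    .show()
)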

How to get Weighted Average for a column in pyspark

Here I need to find the exponential moving average in a Spark dataframe:
Table:
ab = spark.createDataFrame(
    [(1, "1/1/2020",  41.0,  0.5,   0.5,  1, '10.22'),
     (1, "10/3/2020", 24.0,  0.3,   0.7,  2, ''     ),
     (1, "21/5/2020", 32.0,  0.4,   0.6,  3, ''     ),
     (2, "3/1/2020",  51.0,  0.22,  0.78, 1, '34.78'),
     (2, "10/5/2020", 14.56, 0.333, 0.66, 2, ''     ),
     (2, "30/9/2020", 17.0,  0.66,  0.34, 3, ''     )],
    ["CID", "date", "A", "B", "C", "Row", "SMA"])
ab.show()
+---+---------+-----+-----+----+---+-----+
|CID| date| A| B| C| Row| SMA|
+---+---------+-----+-----+----+---+-----+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| |
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| |
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| |
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |
+---+---------+-----+-----+----+---+-----+
Expected Output :
+---+---------+-----+-----+----+---+-----+----------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+---------+-----+-----+----+---+-----+----------+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| | 14.354|
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| | 21.4124|
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| | 28.04674|
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |20.7558916|
+---+---------+-----+-----+----+---+-----+----------+
Logic:
For every customer:
    if Row == 1 then
        SMA as EMA
    else
        (C * LAG(EMA) + A * B) as EMA
The problem here is that a freshly calculated value of a previous row is used as input for the current row. That means that it is not possible to parallelize the calculations for a single customer.
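For example, for CID 1 the recurrence unrolls like this (values taken from the table above):
ema_row1 = 10.22                        # Row 1: EMA = SMA
ema_row2 = 0.7 * ema_row1 + 24.0 * 0.3  # = 14.354
ema_row3 = 0.6 * ema_row2 + 32.0 * 0.4  # = 21.4124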
For Spark 3.0+, it is possible to get the required result with a pandas udf using grouped map
from pyspark.sql import functions as F, types as T

ab = spark.createDataFrame(
    [(1, "1/1/2020",  41.0,  0.5,   0.5,  1, '10.22'),
     (1, "10/3/2020", 24.0,  0.3,   0.7,  2, ''     ),
     (1, "21/5/2020", 32.0,  0.4,   0.6,  3, ''     ),
     (2, "3/1/2020",  51.0,  0.22,  0.78, 1, '34.78'),
     (2, "10/5/2020", 14.56, 0.333, 0.66, 2, ''     ),
     (2, "30/9/2020", 17.0,  0.66,  0.34, 3, ''     )],
    ["CID", "date", "A", "B", "C", "Row", "SMA"]) \
    .withColumn("SMA", F.col('SMA').cast(T.DoubleType())) \
    .withColumn("date", F.to_date(F.col("date"), "d/M/yyyy"))
import pandas as pd

def calc(df: pd.DataFrame):
    # df is a pandas.DataFrame
    df = df.sort_values('date').reset_index(drop=True)
    df.loc[0, 'EMA'] = df.loc[0, 'SMA']
    for i in range(1, len(df)):
        df.loc[i, 'EMA'] = df.loc[i, 'C'] * df.loc[i - 1, 'EMA'] + \
                           df.loc[i, 'A'] * df.loc[i, 'B']
    return df

ab.groupBy("CID").applyInPandas(calc,
    schema="CID long, date date, A double, B double, C double, Row long, SMA double, EMA double") \
    .show()
Output:
+---+----------+-----+-----+----+---+-----+------------------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+----------+-----+-----+----+---+-----+------------------+
| 1|2020-01-01| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|2020-03-10| 24.0| 0.3| 0.7| 2| null| 14.354|
| 1|2020-05-21| 32.0| 0.4| 0.6| 3| null|21.412399999999998|
| 2|2020-01-03| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|2020-05-10|14.56|0.333|0.66| 2| null| 27.80328|
| 2|2020-09-30| 17.0| 0.66|0.34| 3| null| 20.6731152|
+---+----------+-----+-----+----+---+-----+------------------+
The idea is to use a Pandas dataframe for each group. This Pandas dataframe contains all values of the current partition and is ordered by date. During the iteration over the Pandas dataframe we can now access the value of EMA of the previous row (which is not possible for a Spark dataframe).
There are some caveats:
- all rows of one partition should fit into the memory of a single executor; partial aggregation is not possible here
- iterating over a Pandas dataframe is discouraged

Pyspark replace NaN with NULL

I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.
I tried something like this:
some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)
But I got the following error:
ValueError: value should be a float, int, long, string, bool or dict
So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0.
df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
| a| b|
+----+---+
| 1|NaN|
|null|1.0|
+----+---+
df = df.replace(float('nan'), None)
df.show()
+----+----+
| a| b|
+----+----+
| 1|null|
|null| 1.0|
+----+----+
You can use the .replace function to change NaN to null values in one line of code.
I finally found the answer after Googling around a bit.
df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
| a| b|
+----+---+
| 1|NaN|
|null|1.0|
+----+---+
import pyspark.sql.functions as F
columns = df.columns
for column in columns:
    df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))
sqlContext.registerDataFrameAsTable(df, "df2")
sql('select * from df2').show()
+----+----+
| a| b|
+----+----+
| 1|null|
|null| 1.0|
+----+----+
It doesn't use na.fill(), but it accomplished the same result, so I'm happy.

Encode a column with integer in pyspark

I have to encode a column in a big DataFrame in PySpark (Spark 2.0). All the values are almost unique (about 1000 million values).
The best choice could be StringIndexer, but for some reason it always fails and kills my Spark session.
Can I somehow write a function like this:
id_dict = dict()

def indexer(x):
    id_dict.setdefault(x, len(id_dict))
    return id_dict[x]
And map it over the DataFrame, with id_dict saving its items()? Will this dict be synced on each executor?
I need all this for preprocessing tuples ('x', 3, 5) for a spark.mllib ALS model.
Thank you.
StringIndexer keeps all labels in memory, so if values are almost unique, it just won't scale.
You can take the unique values, sort them and add an id, which is expensive but more robust in this case:
from pyspark.sql.functions import monotonically_increasing_id
df = spark.createDataFrame(["a", "b", "c", "a", "d"], "string").toDF("value")
indexer = (df.select("value").distinct()
           .orderBy("value")
           .withColumn("label", monotonically_increasing_id()))
df.join(indexer, ["value"]).show()
# +-----+-----------+
# |value| label|
# +-----+-----------+
# | d|25769803776|
# | c|17179869184|
# | b| 8589934592|
# | a| 0|
# | a| 0|
# +-----+-----------+
Note that labels are not consecutive and can differ from run to run or can change if spark.sql.shuffle.partitions changes. If it is not acceptable you'll have to use RDDs:
from operator import itemgetter
indexer = (df.select("value").distinct()
           .rdd.map(itemgetter(0)).zipWithIndex()
           .toDF(["value", "label"]))
df.join(indexer, ["value"]).show()
# +-----+-----+
# |value|label|
# +-----+-----+
# | d| 0|
# | c| 1|
# | b| 2|
# | a| 3|
# | a| 3|
# +-----+-----+
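As a rough sketch of how an indexer like this could feed the ALS preprocessing mentioned in the question (the ratings dataframe and its column names here are assumptions for illustration, not part of the original post):
from operator import itemgetter

# hypothetical ratings data: (string user id, item id, rating)
ratings = spark.createDataFrame(
    [("x", 3, 5.0), ("y", 7, 1.0), ("x", 7, 2.0)], ["user", "item", "rating"])

# build an integer label per distinct string user, as above
user_indexer = (ratings.select("user").distinct()
                .rdd.map(itemgetter(0)).zipWithIndex()
                .toDF(["user", "user_id"]))

# swap the string column for its integer label before handing the tuples to ALS
(ratings.join(user_indexer, ["user"])
        .select("user_id", "item", "rating")
        .show())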

Spark update dataframe with where condition

I have 2 dataframes in Spark (PySpark)
DF_A
col1  col2  col3
a     1     100
b     2     300
c     3     500
d     4     700

DF_B
col1  col3
a     150
b     350
c     0
d     650
I want to update DF_A.col3 with the value of DF_B.col3 wherever it is present.
Currently I am doing
df_new = df_a.join(df_b, df_a.col1 == df_b.col1, 'inner')
And it is giving me col1 twice and col3 twice in df_new.
Now I have to drop the duplicate columns and show 0 where there is no match. What is a better way of doing this, without using UDFs?
If I understand your question correctly, you are trying to perform the following operation on the dataframes:
UPDATE table_a A, table_b B SET A.col3 = B.col3 WHERE A.col1 = B.col1;
If a key is not present in B, then col3 should be 0 (cf. comments).
a = [("a",1,100),("b",2,300),("c",3,500),("d",4,700)]
b = [("a",150),("b",350),("d",650)]
df_a = spark.createDataFrame(a,["col1","col2","col3"])
df_b = spark.createDataFrame(b,["col1","col3"])
df_a.show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a| 1| 100|
# | b| 2| 300|
# | c| 3| 500|
# | d| 4| 700|
# +----+----+----+
df_b.show() # I have removed an entry for the purpose of the demo.
# +----+----+
# |col1|col3|
# +----+----+
# | a| 150|
# | b| 350|
# | d| 650|
# +----+----+
You'll need to perform an outer join followed by a coalesce:
from pyspark.sql import functions as F
df_a.withColumnRenamed('col3', 'col3_a') \
    .join(df_b.withColumnRenamed('col3', 'col3_b'), on='col1', how='outer') \
    .withColumn("col3", F.coalesce('col3_b', F.lit(0))) \
    .drop('col3_a', 'col3_b') \
    .show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | d| 4| 650|
# | c| 3| 0|
# | b| 2| 350|
# | a| 1| 150|
# +----+----+----+
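If, instead of 0, you wanted to keep DF_A's original col3 whenever the key is absent from DF_B (the other common reading of "update wherever present"), swap the fallback in the coalesce; a minimal sketch under that assumption:
(df_a.withColumnRenamed('col3', 'col3_a')
     .join(df_b.withColumnRenamed('col3', 'col3_b'), on='col1', how='left')
     .withColumn('col3', F.coalesce('col3_b', 'col3_a'))  # fall back to DF_A's value
     .drop('col3_a', 'col3_b')
     .show())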
