I have two dataframes in Spark (PySpark):

DF_A
col1 col2 col3
a    1    100
b    2    300
c    3    500
d    4    700

DF_B
col1 col3
a    150
b    350
c    0
d    650
I want to update DF_A.col3 with the value of DF_B.col3 wherever it is present.
Currently I am doing:
df_new = df_a.join(df_b, df_a.col1 == df_b.col1, 'inner')
but that gives me col1 twice and col3 twice in df_new, and I then have to drop the duplicate columns while still ending up with 0 where there is no match. What is a better way of doing this, without using UDFs?
If I understand your question correctly, you are trying to perform the following operation on the dataframes:
UPDATE table_a A, table_b B SET A.col3 = B.col3 WHERE A.col1 = B.col1;
with col3 set to 0 when the key is not present in B (cf. comments).
a = [("a",1,100),("b",2,300),("c",3,500),("d",4,700)]
b = [("a",150),("b",350),("d",650)]
df_a = spark.createDataFrame(a,["col1","col2","col3"])
df_b = spark.createDataFrame(b,["col1","col3"])
df_a.show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a| 1| 100|
# | b| 2| 300|
# | c| 3| 500|
# | d| 4| 700|
# +----+----+----+
df_b.show() # I have removed an entry for the purpose of the demo.
# +----+----+
# |col1|col3|
# +----+----+
# | a| 150|
# | b| 350|
# | d| 650|
# +----+----+
You'll need to perform an outer join followed by a coalesce:
from pyspark.sql import functions as F
df_a.withColumnRenamed('col3', 'col3_a') \
    .join(df_b.withColumnRenamed('col3', 'col3_b'), on='col1', how='outer') \
    .withColumn("col3", F.coalesce('col3_b', F.lit(0))) \
    .drop('col3_a', 'col3_b') \
    .show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | d| 4| 650|
# | c| 3| 0|
# | b| 2| 350|
# | a| 1| 150|
# +----+----+----+
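If you only want to keep the rows of df_a (an outer join would also add rows that exist only in df_b), a left join works the same way. A minimal sketch, reusing df_a and df_b from above:
from pyspark.sql import functions as F

# Left join keeps exactly the rows of df_a; a missing match in df_b becomes 0.
df_new = df_a.drop('col3') \
    .join(df_b.withColumnRenamed('col3', 'col3_b'), on='col1', how='left') \
    .withColumn('col3', F.coalesce('col3_b', F.lit(0))) \
    .drop('col3_b')
df_new.show()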
So, I have a pyspark dataframe organised in this way:
ID  timestamp  value1  value2
1   1          a       x
2   1          a       y
1   2          b       x
2   2          b       y
1   3          c       y
2   3          d       y
1   4          l       y
2   4          s       y
and let's say that the timestamp is the number of days since the beginning of time. What I'd like to do is, for each line, to collect into a list the values from up to x days back for the current ID, so as to have:
ID  timestamp  value1  value2  list_value_1
1   1          a       x       a
2   1          a       y       a
1   2          b       x       a,b
2   2          b       y       a,b
1   3          c       y       a,b,c
2   3          d       y       a,b,d
1   4          l       y       b,c,l
2   4          s       y       b,d,s
I imagine I should do that with a Window but I'm not sure on how to proceed (I'm quite bad with Windows for some reason).
You can do a collect_list over a Window between the current row and the two preceding rows, and combine the list into a comma-separated string using concat_ws:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'list_value_1',
    F.concat_ws(
        ',',
        F.collect_list('value1').over(
            Window.partitionBy('ID').orderBy('timestamp').rowsBetween(-2, 0)
        )
    )
)
df2.show()
+---+---------+------+------+------------+
| ID|timestamp|value1|value2|list_value_1|
+---+---------+------+------+------------+
| 1| 1| a| x| a|
| 1| 2| b| x| a,b|
| 1| 3| c| y| a,b,c|
| 1| 4| l| y| b,c,l|
| 2| 1| a| y| a|
| 2| 2| b| y| a,b|
| 2| 3| d| y| a,b,d|
| 2| 4| s| y| b,d,s|
+---+---------+------+------+------------+
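Note that rowsBetween(-2, 0) counts rows, not days. If 'up to x days back' should instead be interpreted on the timestamp value itself, the same pattern works with rangeBetween; a minimal sketch, assuming the timestamp column is numeric (days) and x = 2:
from pyspark.sql import functions as F, Window

# Rows whose timestamp lies within [current - 2, current], per ID.
w = Window.partitionBy('ID').orderBy('timestamp').rangeBetween(-2, 0)
df3 = df.withColumn('list_value_1', F.concat_ws(',', F.collect_list('value1').over(w)))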
I have a list of historical values for a device setting and a dataframe with timestamps.
I need to create a new column in my dataframe based on comparing the dataframe's timestamp column with the timestamp of each setting value in my list.
settings_history = [[1, '2021-01-01'], [2, '2021-01-12']]
dataframe = df.withColumn(
    'setting_col',
    when(col('device_timestamp') <= settings_history[0][1], settings_history[0][0])
    .when(col('device_timestamp') <= settings_history[1][1], settings_history[1][0])
)
The number of entries in the settings_history array is dynamic and I need to find a way to implement something like above, but I get a syntax error. Also, I have tried to use a for loop in my withColumn function, but that didn't work either.
My raw dataframe has values like:
device_timestamp
2020-05-21
2020-12-19
2021-01-03
2021-01-11
My goal is to have something like:
device_timestamp setting_col
2020-05-21 1
2020-12-19 1
2021-01-03 2
2021-01-11 2
I'm using Databricks on Azure for my work.
You can use reduce to chain the when conditions together:
from functools import reduce
from pyspark.sql.functions import col, when

settings_history = [[1, '2021-01-01'], [2, '2021-01-12']]

new_col = reduce(
    lambda c, history: c.when(col('device_timestamp') <= history[1], history[0]),
    settings_history[1:],
    when(col('device_timestamp') <= settings_history[0][1], settings_history[0][0]),
)
dataframe = df.withColumn('setting_col', new_col)
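If a device_timestamp can fall after the last entry in settings_history, the chained when returns null for it. An otherwise clause can supply a default; a hedged sketch that assumes such rows should keep the most recent setting value:
# Assumption: timestamps beyond the last history entry get the latest setting.
new_col_with_default = new_col.otherwise(settings_history[-1][0])
dataframe = df.withColumn('setting_col', new_col_with_default)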
A helper like the when_expression function below will be useful in this case: it builds a chained when condition from whatever pairs you provide in the settings_array list.
import pandas as pd
from pyspark.sql import functions as F
def when_expression(settings_array):
    when_condition = None
    for a, b in settings_array:
        if when_condition is None:
            when_condition = F.when(F.col('device_timestamp') <= a, F.lit(b))
        else:
            when_condition = when_condition.when(F.col('device_timestamp') <= a, F.lit(b))
    return when_condition

settings_array = [
    [2, 3],     # if <= 2 make it 3
    [5, 7],     # if <= 5 make it 7
    [10, 100],  # if <= 10 make it 100
]
df = pd.DataFrame({'device_timestamp': range(10)})
df = spark.createDataFrame(df)
df.show()
when_condition = when_expression(settings_array)
print(when_condition)
df = df.withColumn('setting_col', when_condition)
df.show()
Output:
+----------------+
|device_timestamp|
+----------------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+----------------+
Column<b'CASE WHEN (device_timestamp <= 2) THEN 3 WHEN (device_timestamp <= 5) THEN 7 WHEN (device_timestamp <= 10) THEN 100 END'>
+----------------+-----------+
|device_timestamp|setting_col|
+----------------+-----------+
| 0| 3|
| 1| 3|
| 2| 3|
| 3| 7|
| 4| 7|
| 5| 7|
| 6| 100|
| 7| 100|
| 8| 100|
| 9| 100|
+----------------+-----------+
I need to compute an exponential moving average (EMA) in a Spark dataframe.
Table:
ab = spark.createDataFrame(
[(1,"1/1/2020", 41.0,0.5, 0.5 ,1, '10.22'),
(1,"10/3/2020",24.0,0.3, 0.7 ,2, '' ),
(1,"21/5/2020",32.0,0.4, 0.6 ,3, '' ),
(2,"3/1/2020", 51.0,0.22, 0.78,1, '34.78'),
(2,"10/5/2020",14.56,0.333,0.66,2, '' ),
(2,"30/9/2020",17.0,0.66, 0.34,3, '' )],["CID","date","A","B","C","Row","SMA"] )
ab.show()
+---+---------+-----+-----+----+---+-----+
|CID| date| A| B| C| Row| SMA|
+---+---------+-----+-----+----+---+-----+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| |
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| |
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| |
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |
+---+---------+-----+-----+----+---+-----+
Expected Output :
+---+---------+-----+-----+----+---+-----+----------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+---------+-----+-----+----+---+-----+----------+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| | 14.354|
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| | 21.4124|
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| | 28.04674|
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |20.7558916|
+---+---------+-----+-----+----+---+-----+----------+
Logic:
For every customer:
    if Row == 1 then SMA as EMA
    else (C * LAG(EMA) + A * B) as EMA
The problem here is that a freshly calculated value of a previous row is used as input for the current row. That means that it is not possible to parallelize the calculations for a single customer.
For Spark 3.0+, it is possible to get the required result with a grouped-map pandas UDF via applyInPandas:
from pyspark.sql import functions as F, types as T

ab = spark.createDataFrame(
    [(1, "1/1/2020",  41.0,  0.5,   0.5,  1, '10.22'),
     (1, "10/3/2020", 24.0,  0.3,   0.7,  2, ''),
     (1, "21/5/2020", 32.0,  0.4,   0.6,  3, ''),
     (2, "3/1/2020",  51.0,  0.22,  0.78, 1, '34.78'),
     (2, "10/5/2020", 14.56, 0.333, 0.66, 2, ''),
     (2, "30/9/2020", 17.0,  0.66,  0.34, 3, '')],
    ["CID", "date", "A", "B", "C", "Row", "SMA"]) \
    .withColumn("SMA", F.col('SMA').cast(T.DoubleType())) \
    .withColumn("date", F.to_date(F.col("date"), "d/M/yyyy"))
import pandas as pd

def calc(df: pd.DataFrame) -> pd.DataFrame:
    # df holds all rows of one CID group as a pandas DataFrame
    df = df.sort_values('date').reset_index(drop=True)
    df.loc[0, 'EMA'] = df.loc[0, 'SMA']
    for i in range(1, len(df)):
        df.loc[i, 'EMA'] = df.loc[i, 'C'] * df.loc[i - 1, 'EMA'] + \
                           df.loc[i, 'A'] * df.loc[i, 'B']
    return df
ab.groupBy("CID").applyInPandas(
    calc,
    schema="CID long, date date, A double, B double, C double, Row long, SMA double, EMA double"
).show()
Output:
+---+----------+-----+-----+----+---+-----+------------------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+----------+-----+-----+----+---+-----+------------------+
| 1|2020-01-01| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|2020-03-10| 24.0| 0.3| 0.7| 2| null| 14.354|
| 1|2020-05-21| 32.0| 0.4| 0.6| 3| null|21.412399999999998|
| 2|2020-01-03| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|2020-05-10|14.56|0.333|0.66| 2| null| 27.80328|
| 2|2020-09-30| 17.0| 0.66|0.34| 3| null| 20.6731152|
+---+----------+-----+-----+----+---+-----+------------------+
The idea is to use a Pandas dataframe for each group. This Pandas dataframe contains all values of the current partition and is ordered by date. During the iteration over the Pandas dataframe we can now access the value of EMA of the previous row (which is not possible for a Spark dataframe).
There are some caveats:
- all rows of one group (partition) have to fit into the memory of a single executor; partial aggregation is not possible here
- iterating over a pandas DataFrame row by row is generally discouraged
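If you are on Spark 2.3 or 2.4 (before applyInPandas was available), the same calc function can be wrapped in a grouped-map pandas UDF and applied with GroupedData.apply; a sketch using that older, now deprecated API:
from pyspark.sql import functions as F

ema_schema = "CID long, date date, A double, B double, C double, Row long, SMA double, EMA double"

# Grouped-map pandas UDF: each CID group is passed to calc as a pandas DataFrame.
@F.pandas_udf(ema_schema, F.PandasUDFType.GROUPED_MAP)
def calc_udf(pdf):
    return calc(pdf)

ab.groupBy("CID").apply(calc_udf).show()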
Say I have two dataframes, A and B:

A              B
| a | b | c |  | a |
| 1 | 2 | 3 |  | 1 |
I want to filter the contents of dataframe A based on the values in column a of dataframe B. The equivalent WHERE clause in SQL is like this:
WHERE NOT (A.a IN (SELECT a FROM B))
How can I achieve this?
To keep all the rows of the left table that have a match in the right table, you can use a leftsemi join. In this case you only want to keep rows when there is no match in the right table, so use a leftanti join:
df = spark.createDataFrame([(1,2,3),(2,3,4)], ["a","b","c"])
df2 = spark.createDataFrame([(1,2)], ["a","b"])
df.join(df2,'a','leftanti').show()
df
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
df2
+---+---+
| a| b|
+---+---+
| 1| 2|
+---+---+
result
+---+---+---+
| a| b| c|
+---+---+---+
| 2| 3| 4|
+---+---+---+
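For comparison, a leftsemi join on the same data keeps only the rows of df that do have a match in df2; a quick sketch:
df.join(df2, 'a', 'leftsemi').show()
# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |  1|  2|  3|
# +---+---+---+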
Hope this helps!
I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:
dfA:
# +----+----+
# |col1|col2|
# +----+----+
# | a|null|
# | b| 0|
# | c| 0|
# +----+----+
dfB:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x|
# | b| 0| x|
# +----+----+----+
The dataframes can be created with this script:
dfA = spark.createDataFrame(
[
('a', None),
('b', '0'),
('c', '0')
],
('col1', 'col2')
)
dfB = spark.createDataFrame(
[
('a', None, 'x'),
('b', '0', 'x')
],
('col1', 'col2', 'col3')
)
Join call:
dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()
Result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null|null| <- col3 should be x
# | b| 0| x|
# | c| 0|null|
# +----+----+----+
Expected result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x| <-
# | b| 0| x|
# | c| 0|null|
# +----+----+----+
It works if I set the first row, col2 to anything other than null, but I need to support null values.
I tried using a condition to compare using null-safe equals as outlined in this post like so:
cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
Result of null-safe join:
# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# | a|null| a|null| x|
# | b| 0| b| 0| x|
# | c| 0|null|null|null|
# +----+----+----+----+----+
This retains the duplicate columns, though; I'm still looking for a way to achieve the expected result at the end of the join.
A simple solution would be to select the columns that you want to keep. This will let you specify which source dataframe they should come from as well as avoid the duplicate column issue.
dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()
This fails because col1 in orderBy is ambiguous. You should reference a specific source, for example dfA:
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
If you need to join dataframes whose join columns contain null values in PySpark, use eqNullSafe in the join condition so that null values match each other (available on Column since Spark 2.3). For more examples see https://knowledges.co.in/how-to-use-eqnullsafe-in-pyspark-for-null-values/
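Putting the pieces together, a minimal sketch that reuses dfA, dfB and the null-safe condition from the question and drops the duplicate columns via select:
# Null-safe join plus an explicit select to avoid duplicate col1/col2.
cond = dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2)

dfA.join(dfB, cond, how='left') \
   .select(dfA.col1, dfA.col2, dfB.col3) \
   .orderBy(dfA.col1) \
   .show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x|
# | b| 0| x|
# | c| 0|null|
# +----+----+----+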