spark outer join with source - python

I am relatively new to Spark, and I was wondering if I could get the source DataFrame of the key used in an outer join.
Let's say I have 3 DataFrames:
DF 1
+-----+----+
|item1| key|
+-----+----+
|Item1|key1|
|Item2|key2|
|Item3|key3|
|Item4|key4|
|Item5|key5|
+-----+----+
DF2
+-----+----+
|item2| key|
+-----+----+
| t1|key1|
| t2|key2|
| t3|key6|
| t4|key7|
| t5|key8|
+-----+----+
DF3
+-----+-----+
|item3| key|
+-----+-----+
| t1| key1|
| t2| key2|
| t3| key8|
| t4| key9|
| t5|key10|
+-----+-----+
I want to do a full outer join on these 3 dataframes and include a new column to indicate the source of each key.
E.g.:
+-----+-----+-----+-----+------+
| key|item1|item2|item3|source|
+-----+-----+-----+-----+------+
| key8| null| t5| t3| DF2|
| key5|Item5| null| null| DF1|
| key7| null| t4| null| DF2|
| key3|Item3| null| null| DF1|
| key6| null| t3| null| DF2|
| key1|Item1| t1| t1| DF1|
| key4|Item4| null| null| DF1|
| key2|Item2| t2| t2| DF1|
| key9| null| null| t4| DF3|
|key10| null| null| t5| DF3|
+-----+-----+-----+-----+------+
Is there any way to achieve this?

I'd do something like this:
from pyspark.sql.functions import col, lit, coalesce, when

df1 = spark.createDataFrame(
    [("Item1", "key1"), ("Item2", "key2"), ("Item3", "key3"),
     ("Item4", "key4"), ("Item5", "key5")],
    ["item1", "key"])

df2 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key6"),
     ("t4", "key7"), ("t5", "key8")],
    ["item2", "key"])

df3 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key8"),
     ("t4", "key9"), ("t5", "key10")],
    ["item3", "key"])

df1.join(df2, ["key"], "outer").join(df3, ["key"], "outer").withColumn(
    "source",
    coalesce(
        when(col("item1").isNotNull(), "df1"),
        when(col("item2").isNotNull(), "df2"),
        when(col("item3").isNotNull(), "df3")))
Result is:
## +-----+-----+-----+-----+------+
## | key|item1|item2|item3|source|
## +-----+-----+-----+-----+------+
## | key8| null| t5| t3| df2|
## | key5|Item5| null| null| df1|
## | key7| null| t4| null| df2|
## | key3|Item3| null| null| df1|
## | key6| null| t3| null| df2|
## | key1|Item1| t1| t1| df1|
## | key4|Item4| null| null| df1|
## | key2|Item2| t2| t2| df1|
## | key9| null| null| t4| df3|
## |key10| null| null| t5| df3|
## +-----+-----+-----+-----+------+
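An alternative sketch, not from the original answer: tag each DataFrame with a literal source column before the joins, then coalesce the tags in priority order (df1 first). This assumes the same df1, df2, df3 defined above and produces the uppercase labels from the question's expected output.

from pyspark.sql.functions import coalesce, lit

# Sketch: carry an explicit per-DataFrame tag through the joins,
# then keep the first non-null tag as the source.
tagged = (df1.withColumn("src1", lit("DF1"))
          .join(df2.withColumn("src2", lit("DF2")), ["key"], "outer")
          .join(df3.withColumn("src3", lit("DF3")), ["key"], "outer")
          .withColumn("source", coalesce("src1", "src2", "src3"))
          .drop("src1", "src2", "src3"))
tagged.show()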

Related

How to join two pyspark dataframes in python on a condition while changing column value on match?

I have two dataframes like this:
df1 = spark.createDataFrame([(1, 11, 1999, 1999, None), (2, 22, 2000, 2000, 44), (3, 33, 2001, 2001,None)], ['id', 't', 'year','new_date','rev_t'])
df2 = spark.createDataFrame([(2, 44, 2022, 2022,None), (2, 55, 2001, 2001, 88)], ['id', 't', 'year','new_date','rev_t'])
df1.show()
df2.show()
+---+---+----+--------+-----+
| id| t|year|new_date|rev_t|
+---+---+----+--------+-----+
| 1| 11|1999| 1999| null|
| 2| 22|2000| 2000| 44|
| 3| 33|2001| 2001| null|
+---+---+----+--------+-----+
+---+---+----+--------+-----+
| id| t|year|new_date|rev_t|
+---+---+----+--------+-----+
| 2| 44|2022| 2022| null|
| 2| 55|2001| 2001| 88|
+---+---+----+--------+-----+
I want to join them in a way that if df2.t == df1.rev_t then update new_date to df2.year in the result dataframe.
So it should look like this:
+---+---+----+--------+-----+
| id| t|year|new_date|rev_t|
+---+---+----+--------+-----+
| 1| 11|1999| 1999| null|
| 2| 22|2000| 2022| 44|
| 2| 44|2022| 2022| null|
| 2| 55|2001| 2001| 88|
| 3| 33|2001| 2001| null|
+---+---+----+--------+-----+
To update a column of df1 from df2, use a left join plus the coalesce function on the column you want to update, in this case new_date.
From your expected output, it appears you also want to add the rows from df2, so union the join result with df2:
from pyspark.sql import functions as F

result = (df1.join(df2.selectExpr("t as rev_t", "new_date as df2_new_date"), ["rev_t"], "left")
          .withColumn("new_date", F.coalesce("df2_new_date", "new_date"))
          .select(*df1.columns)
          .union(df2)
          )
result.show()
#+---+---+----+--------+-----+
#| id| t|year|new_date|rev_t|
#+---+---+----+--------+-----+
#| 1| 11|1999| 1999| null|
#| 3| 33|2001| 2001| null|
#| 2| 22|2000| 2022| 44|
#| 2| 44|2022| 2022| null|
#| 2| 55|2001| 2001| 88|
#+---+---+----+--------+-----+
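If you want the rows in the same order as the expected output, a small follow-up (not part of the original answer) is to sort the union before showing it:

result.orderBy("id", "t").show()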

Merge 2 Spark dataframes with non-overlapping columns

I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they aren't overlapping (i.e. when the df2 entry is null the df1 entry isn't, and vice versa).
I know that if I use join I won't get them in the same column and will instead get two "col_name" columns. I just want it in one column. How do I do this? Thanks
Try this:
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
You could do this (note: this approach operates on pandas DataFrames, not Spark DataFrames):
import numpy as np

mydf = df1.copy()                                    # make a copy of the first dataframe
idx = np.where(df1['col_name'].values == 'null')[0]  # get indices where df1 is null
val = df2['col_name'].values[idx]                    # get values from df2 where df1 is null
mydf['col_name'][idx] = val                          # assign those values in mydf
mydf                                                 # print mydf
You should be able to utilize the coalesce function to achieve this.
from pyspark.sql.functions import coalesce

renamedDF1 = df1.withColumnRenamed("col_name", "col_name_a")
renamedDF2 = df2.withColumnRenamed("col_name", "col_name_b")
joinedDF = renamedDF1.join(renamedDF2, "id")
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(joinedDF["col_name_a"], joinedDF["col_name_b"])
)
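If you don't need the renamed columns afterwards, a small follow-up is to drop them:

joinedDF = joinedDF.drop("col_name_a", "col_name_b")
joinedDF.show()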

How to concatenate to a null column in pyspark dataframe

I have the below dataframe and I want to update the rows dynamically with some values.
input_frame.show()
+----------+----------+---------+
|student_id|name |timestamp|
+----------+----------+---------+
| s1|testuser | t1|
| s1|sampleuser| t2|
| s2|test123 | t1|
| s2|sample123 | t2|
+----------+----------+---------+
input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'),sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
I want to update the 'test' column with some values and apply a filter with partial matches on the column. But concatenating to a null column results in a null column again. How can I do this?
Use concat_ws, like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")

# This won't work
df = df.withColumn("concat", concat(df.a, df.b))
# This won't work either
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))
# Do it like this
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()
gives:
+----+----+------+-------------+---------+
| a| b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
| 1| 2| 12| 12| 12|
| 2|null| null| null| 2|
| 3| 4| 34| 34| 34|
| 4| 5| 45| 45| 45|
|null| 6| null| null| 6|
+----+----+------+-------------+---------+
Note specifically that casting a NULL column to string doesn't help: concat will still return NULL for that row if any of its inputs is null.
There's no nice way of dealing with more complicated scenarios, but note that you can use a when statement inside a concat if you're willing to suffer the verbosity of it, like this:
df.withColumn("concat_custom", concat(
when(df.a.isNull(), lit('_')).otherwise(df.a),
when(df.b.isNull(), lit('_')).otherwise(df.b))
)
To get, e.g.:
+----+----+-------------+
| a| b|concat_custom|
+----+----+-------------+
| 1| 2| 12|
| 2|null| 2_|
| 3| 4| 34|
| 4| 5| 45|
|null| 6| _6|
+----+----+-------------+
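The difference comes down to null semantics: concat returns NULL as soon as any of its arguments is NULL, whereas concat_ws (concatenate with separator) simply skips NULL arguments, which is why it is the usual choice when some inputs may be missing.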
You can use the coalesce function, which returns the first of its arguments that is not null, and provide a literal in the second place, which will be used when the column has a null value.
df = df.withColumn("concat", concat(coalesce(df.a, lit('')), coalesce(df.b, lit(''))))
You can fill null values with empty strings:
import pyspark.sql.functions as f
from pyspark.sql.types import *
data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
data = data.withColumn('test', f.lit(None).cast(StringType()))
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))
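Note that display() is Databricks-specific; outside Databricks, a roughly equivalent sketch uses show():

data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')).show()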
Is that what you were looking for?

How to convert a string semicolon-separated column to MapType in pyspark?

Sample of data:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customtargeting |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|nocid=no;store=2007;tppid=45c566dd-00d7-4193-b5c7-17843c2764e9 |
|nocid=no;store=3084;tppid=4cd36fde-c59a-41d2-a2b4-b731b6cfbe05 |
|nocid=no;tppid=c688c1be-a9c5-47a2-8c09-aef175a19847 |
|nocid=yes;search=washing liquid;store=3060 |
|pos=top;tppid=278bab7b-d40b-4783-8f89-bef94a9f5150 |
|pos=top;tppid=00bb87fa-f3f5-4b0e-bbf8-16079a1a5efe |
|nocid=no;shelf=cleanser-toner-and-face-mask;store=2019;tppid=84006d41-eb63-4ae1-8c3c-3ac9436d446c |
|pos=top;tppid=ed02b037-066b-46bd-99e6-d183160644a2 |
|nocid=yes;search=salad;store=3060 |
|pos=top;nocid=no;store=2882;tppid=164563e4-8e5c-4366-a5a8-438ffb10da9d |
|nocid=yes;search=beer;store=3060 |
|nocid=no;search=washing capsules;store=5528;tppid=4f9b99eb-65ff-4fbc-b11c-b0552b7f158d |
|pos=right;tppid=ddb54247-a5c9-40a0-9f99-8412d8542b4c |
|nocid=yes;search=bedding;store=3060 |
|pos=top |
|pos=mpu1;keywords=helium canisters;keywords=tesco.com;keywords=helium canisters reviews;keywords=tesco;keywords=helium canisters uk;keywords=balloons;pagetype=category|
I want to convert a PySpark dataframe column to a map type. The column can contain any number of key-value pairs, its type is string, and for some keys there are multiple values, which I want to collect into an array as the value for that key.
Try this:
import pyspark.sql.functions as F
from pyspark.sql.types import *

def convert_to_json(_str):
    _split_str = [tuple(x.split('=')) for x in _str.split(';') if len(tuple(x.split('='))) == 2]
    _json = {}
    for k, v in _split_str:
        if k in _json:
            _json[k].append(v)
        else:
            _json[k] = [v]
    return _json

convert_udf = F.udf(convert_to_json, MapType(StringType(), ArrayType(StringType())))
df = df.withColumn('customtargeting', convert_udf('customtargeting'))

print(df.schema)
print(df.limit(5).collect())
This gives you the schema and output as,
StructType(List(StructField(
customtargeting,MapType(StringType,ArrayType(StringType,true),true),true)))
[Row(customtargeting={u'store': [u'2007'], u'tppid': [u'45c566dd-00d7-4193-b5c7-17843c2764e9'], u'nocid': [u'no']}),
Row(customtargeting={u'store': [u'3084'], u'tppid': [u'4cd36fde-c59a-41d2-a2b4-b731b6cfbe05'], u'nocid': [u'no']}),
Row(customtargeting={u'nocid': [u'no'], u'tppid': [u'c688c1be-a9c5-47a2-8c09-aef175a19847']}),
Row(customtargeting={u'search': [u'washing liquid'], u'nocid': [u'yes'], u'store': [u'3060']}),
Row(customtargeting={u'pos': [u'top'], u'tppid': [u'278bab7b-d40b-4783-8f89-bef94a9f5150']})]
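Once the column is a MapType, you can pull individual keys out directly; a small usage sketch (assuming the df from above):

df.select(F.col('customtargeting').getItem('store').alias('store_values')).show(5, truncate=False)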
If you want to separate the columns and create a new dataframe, you can use pandas features. Find my solution below:
>>> import pandas as pd
>>>
>>> rdd = sc.textFile('/home/ali/text1.txt')
>>> rdd.first()
'nocid=no;store=2007;tppid=45c566dd-00d7-4193-b5c7-17843c2764e9'
>>> rddMap = rdd.map(lambda x: x.split(';'))
>>> rddMap.first()
['nocid=no', 'store=2007', 'tppid=45c566dd-00d7-4193-b5c7-17843c2764e9']
>>>
>>> df1 = pd.DataFrame()
>>> for rdd in rddMap.collect():
... a = {i.split('=')[0]:i.split('=')[1] for i in rdd}
... df2 = pd.DataFrame([a], columns=a.keys())
... df1 = pd.concat([df1, df2])
...
>>> df = spark.createDataFrame(df1.astype(str)).replace('nan',None)
>>> df.show()
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+
|keywords|nocid|pagetype| pos| search| shelf|store| tppid|
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+
| null| no| null| null| null| null| 2007|45c566dd-00d7-419...|
| null| no| null| null| null| null| 3084|4cd36fde-c59a-41d...|
| null| no| null| null| null| null| null|c688c1be-a9c5-47a...|
| null| yes| null| null| washing liquid| null| 3060| null|
| null| null| null| top| null| null| null|278bab7b-d40b-478...|
| null| null| null| top| null| null| null|00bb87fa-f3f5-4b0...|
| null| no| null| null| null|cleanser-toner-an...| 2019|84006d41-eb63-4ae...|
| null| null| null| top| null| null| null|ed02b037-066b-46b...|
| null| yes| null| null| salad| null| 3060| null|
| null| no| null| top| null| null| 2882|164563e4-8e5c-436...|
| null| yes| null| null| beer| null| 3060| null|
| null| no| null| null|washing capsules| null| 5528|4f9b99eb-65ff-4fb...|
| null| null| null|right| null| null| null|ddb54247-a5c9-40a...|
| null| yes| null| null| bedding| null| 3060| null|
| null| null| null| top| null| null| null| null|
|balloons| null|category| mpu1| null| null| null| null|
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+

Auto-incrementing pyspark dataframe column values

I am trying to generate an additional column in a dataframe with auto-incrementing values based on a global variable. However, all the rows are generated with the same value and the value is not incrementing.
Here is the code:
def autoIncrement():
    global rec
    if (rec == 0): rec = 1
    else: rec = rec + 1
    return int(rec)

rec = 14
UDF
autoIncrementUDF = udf(autoIncrement, IntegerType())
df1 = hiveContext.sql("select id,name,location,state,datetime,zipcode from demo.target")
df1.withColumn("id2", autoIncrementUDF()).show()
Here is the result df
+---+------+--------+----------+-------------------+-------+---+
| id| name|location| state| datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00| NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00| NULL| 15|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00| NULL| 15|
| 30|Manish| Gurgoan| Gujarat|2018-03-27 11:00:00| NULL| 15|
+---+------+--------+----------+-------------------+-------+---+
But I am expecting the below result:
+---+------+--------+----------+-------------------+-------+---+
| id| name|location| state| datetime|zipcode|id2|
+---+------+--------+----------+-------------------+-------+---+
| 20|pankaj| Chennai| TamilNadu|2018-03-26 11:00:00| NULL| 15|
| 10|geetha| Newyork|New Jersey|2018-03-27 10:00:00| NULL| 16|
| 25| pawan| Chennai| TamilNadu|2018-03-27 11:25:00| NULL| 17|
| 30|Manish| Gurgoan| Gujarat|2018-03-27 11:00:00| NULL| 18|
+---+------+--------+----------+-------------------+-------+---+
Any help is appreciated.
Global variables are bound to a single Python process, but a UDF may be executed in parallel on different workers across the cluster, and should be deterministic.
You should use the monotonically_increasing_id() function from the pyspark.sql.functions module.
Check the docs for more info.
You should be careful because the generated IDs are dynamic and not stable across recomputations:
How do I add an persistent column of row ids to Spark DataFrame?
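If you actually need consecutive values starting from a known offset (15 in your example), one hedged sketch is row_number over a window plus a literal offset; note that a window without partitionBy pulls all rows into a single partition, so this is only sensible for small data:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# Sketch only: a global window gives consecutive numbers but moves all rows
# to one partition; order by whatever column defines the sequence you want.
w = Window.orderBy("datetime")
df1.withColumn("id2", row_number().over(w) + 14).show()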
