pyspark dataframe parent child hierarchy issue - python

In continuation with the issue: pyspark dataframe withColumn command not working
I have a input dataframe: df_input (updated df_input)
|comment|inp_col|inp_val|
|11 |a |a1 |
|12 |a |a2 |
|12 |f |&a |
|12 |a |f9 |
|15 |b |b3 |
|16 |b |b4 |
|17 |c |&b |
|17 |c |c5 |
|17 |d |&c |
|17 |d |d6 |
|17 |e |&d |
|17 |e |e7 |
If you see the inp_col and inp_val is having a hierarchy and it can be n number with the root value. Here the parent value are "b" and "a".
Now, as per my requirement I have to replace the child values starting with "&" to its corresponding values.
I have tried in iterating over the list of values starting with '&' values in inp_val column and replacing with list of values over every iteration.
But, it didn't get worked. I'm facing issue how to get the list with parent and child list values.
tried code:
list_1 = [row['inp_val'] for row in tst.select(tst.inp_val).where(tst.inp_val.substr(0, 1) == '&').collect()]
# removing the '&' at every starting of the list values
list_2 = [list_val[1:] for list_val in list_1]
tst_1 = tst.withColumn("val_extract", when(tst.inp_val.substr(0, 1) == '&', regexp(tst.inp_val, "&", "")).otherwise(tst.inp_val))
for val in list_2:
df_leaf = tst_1.select(tst_1.val_extract).where(tst_1.inp_col == val)
list_3 = [row['val_extract'] for row in df_leaf.collect()]
tst_1 = tst_1.withColumn('bool', when(tst_1.val_extract == val, 'True').otherwise('False'))
tst_1 = tst_1.withColumn('val_extract', when(tst_1.bool == 'True', str(list_3)).otherwise(tst_1.val_extract)).drop('bool')
Updated Expected Output:
|comment|inp_col|inp_val|inp_extract |
|11 |a |a1 |['a1'] |
|12 |a |a2 |['a2'] |
|12 |f |&a |['a1, 'a2'] |
|12 |f |f9 |['f9'] |
|15 |b |b3 |['b3'] |
|16 |b |b4 |['b4'] |
|17 |c |&b |['b3', 'b4'] |
|18 |c |c5 |['c5'] |
|19 |d |&c |['b3', 'b4', 'c5'] |
|20 |d |d6 |['d6'] |
|21 |e |&d |['b3', 'b4', 'c5', 'd6'] |
|22 |e |e7 |['e7'] |
After that I can try and do explode to get multiple row. But, the aove output is what we require and not able to get certain percent result.

If you really want to avoid using graphs and your case is not more complex than shown above, try this.
from pyspark.sql import functions as F
df.show() #sampledataframe
#+-------+---------+---------+
#|comment|input_col|input_val|
#+-------+---------+---------+
#| 11| a| a1|
#| 12| a| a2|
#| 12| f| &a|
#| 12| f| f9|
#| 15| b| b3|
#| 16| b| b4|
#| 17| c| &b|
#| 17| c| c5|
#| 17| d| &c|
#| 17| d| d6|
#| 17| e| &d|
#| 17| e| e7|
#+-------+---------+---------+
df1=df.join(df.groupBy("input_col").agg(F.collect_list("input_val").alias("y1"))\
.withColumnRenamed("input_col","x1"),F.expr("""input_val rlike x1"""),'left')\
.withColumn("new_col", F.when(F.expr("""substring(input_val,0,1)!""")!=F.lit('&'), F.array("input_val"))\
.otherwise(F.col("y1"))).drop("x1","y1")
df2=df1.join(df1.selectExpr("input_val as input_val1","new_col as new_col1"), F.expr("""array_contains(new_col,input_val1) and\
substring(input_val1,0,1)=='&'"""),'left')
df2.join(df2.selectExpr("input_val1 as val2","new_col1 as col2")\
.dropna(),F.expr("""array_contains(new_col1,val2)"""),'left')\
.withColumn("inp_extract", F.when(F.expr("""substring(input_val,0,1)!='&'"""), F.col("new_col"))\
.otherwise(F.expr("""filter(concat(\
coalesce(new_col,array()),\
coalesce(new_col1,array()),\
coalesce(col2, array()))\
,x-> x is not null and substring(x,0,1)!='&')""")))\
.select("comment","input_col","input_val",F.array_sort("inp_extract").alias("inp_extract")).show()
#+-------+---------+---------+----------------+
#|comment|input_col|input_val| inp_extract|
#+-------+---------+---------+----------------+
#| 11| a| a1| [a1]|
#| 12| a| a2| [a2]|
#| 12| f| &a| [a1, a2]|
#| 12| f| f9| [f9]|
#| 15| b| b3| [b3]|
#| 16| b| b4| [b4]|
#| 17| c| &b| [b3, b4]|
#| 17| c| c5| [c5]|
#| 17| d| &c| [b3, b4, c5]|
#| 17| d| d6| [d6]|
#| 17| e| &d|[b3, b4, c5, d6]|
#| 17| e| e7| [e7]|
#+-------+---------+---------+----------------+

You can join the data frame to itself to get this.
input :
df.show()
+-------+-------+---------+
|comment|inp_col|input_val|
+-------+-------+---------+
| 11| a| a1|
| 12| a| a2|
| 13| f| &a|
| 14| b| b3|
| 15| b| b4|
| 16| d| &b|
+-------+-------+---------+
import pyspark.sql.functions as F
df.createOrReplaceTempView("df1")
df.withColumn("input_val", F.regexp_replace(F.col("input_val"), "&", "")).createOrReplaceTempView("df2")
spark.sql("""select * from (select coalesce(df2.comment,df1.comment) as comment ,
coalesce(df2.inp_col,df1.inp_col) as inp_col,
coalesce(df2.input_val,df2.input_val) as input_val ,
case when df1.input_val is not null then df1.input_val else df2.input_val end as output
from df1 full outer join df2 on df2.input_val = df1.inp_col) where input_val is not null order by comment """).show()
Output
+-------+-------+---------+------+
|comment|inp_col|input_val|output|
+-------+-------+---------+------+
| 11| a| a1| a1|
| 12| a| a2| a2|
| 13| f| a| a1|
| 13| f| a| a2|
| 14| b| b3| b3|
| 15| b| b4| b4|
| 16| d| b| b3|
| 16| d| b| b4|
+-------+-------+---------+------+

Related

Joining two RDD's in PySpark?

I asked a previous question regarding two RDD's I was trying to join together, here were the two RDD's:
+------+---+
| _1| _2|
+------+---+
|Python| 36|
| C| 6|
| C#| 8|
+------+---+
+------+---+
| _1| _2|
+------+---+
|Python| 10|
| C| 1|
| C#| 1|
+------+---+
After executing the following line of code on both RDD's, this was the result:
joined_rdd = rdd1.join(rdd2).map(lambda x: (x[0], *x[1]))
+------+---+---+
| _1| _2| _3|
+------+---+---+
|Python| 36| 10|
| C| 6| 1|
| C#| 8| 1|
+------+---+---+
This was exactly what I wanted, BUT, if I would like to join another RDD to this 3 column joined_rdd, how might I do that? The code I used originally does not work and I've tried every variation and cannot seem to obtain the result I want, here is what I want it to look like:
rdd3
+------+---+
| _1| _2|
+------+---+
|Python| 8|
| C| 15|
| C#|100|
+------+---+
After joining with joined_rdd:
final_joined_rdd
+------+---+---+---+
| _1| _2| _3| _3|
+------+---+---+---+
|Python| 36| 10| 8|
| C| 6| 1| 15|
| C#| 8| 1|100|
+------+---+---+---+
Any help to achieve this result would be appreciated, thanks!
Note: I cannot convert these RDD's to data frame and then join because the RDD's are really just DStream objects.

extract substring before first occurrence and substring after last occurrence of a delimiter in Pyspark

I have a data frame like below in pyspark
df = spark.createDataFrame(
[
('14_100_00','A',25),
('13_100_00','B',24),
('15_100_00','A',20),
('150_100','C',21),
('16','A',20),
('1634_100_00_01','B',22),
('1_100_00','C',23),
('18_100_00','D',24)],("rust", "name", "value"))
df.show()
+--------------+----+-----+
| rust|name|value|
+--------------+----+-----+
| 14_100_00| A| 25|
| 13_100_00| B| 24|
| 15_100_00| A| 20|
| 150_100| C| 21|
| 16| A| 20|
|1634_100_00_01| B| 22|
| 1_100_00| C| 23|
| 18_100_00| D| 24|
+--------------+----+-----+
I am trying to create a new column using the rust column using below conditions
1) extract anything before 1st underscore
2) extract anything after the last underscore
3) concatenate the above two values using tilda(~)
If no underscores in the column then have column as is
I have tried like below
from pyspark.sql.functions import substring_index
df1 = df.select("*", f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1)).alias("extract"))
df1.show()
+--------------+----+-----+-------+
| rust|name|value|extract|
+--------------+----+-----+-------+
| 14_100_00| A| 25| 14~00|
| 13_100_00| B| 24| 13~00|
| 15_100_00| A| 20| 15~00|
| 150_100| C| 21|150~100|
| 16| A| 20| 16~16|
|1634_100_00_01| B| 22|1634~01|
| 1_100_00| C| 23| 1~00|
| 18_100_00| D| 24| 18~00|
+--------------+----+-----+-------+
expected result:
+--------------+----+-----+-------+
| rust|name|value|extract|
+--------------+----+-----+-------+
| 14_100_00| A| 25| 14~00|
| 13_100_00| B| 24| 13~00|
| 15_100_00| A| 20| 15~00|
| 150_100| C| 21|150~100|
| 16| A| 20| 16|
|1634_100_00_01| B| 22|1634~01|
| 1_100_00| C| 23| 1~00|
| 18_100_00| D| 24| 18~00|
+--------------+----+-----+-------+
How can I achieve what I want
Use the instr function to determine whether the rust column contains _, and then use the when function to process.
df1 = df.select("*",
f.when(f.instr(df.rust, '_') > 0,
f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1))
)
.otherwise(df.rust)
.alias("extract")
)

Pyspark show the column that has the lowest value for each row

I have the following dataframe
df_old_list= [
{ "Col1":"0", "Col2" : "7","Col3": "8", "Col4" : "","Col5": "20"},
{"Col1":"5", "Col2" : "5","Col3": "5", "Col4" : "","Col5": "28"},
{ "Col1":"-1", "Col2" : "-1","Col3": "13", "Col4" : "","Col5": "83"},
{"Col1":"-1", "Col2" : "6","Col3": "6", "Col4" : "","Col5": "18"},
{ "Col1":"5", "Col2" : "4","Col3": "2", "Col4" : "","Col5": "84"},
{ "Col1":"0", "Col2" : "0","Col3": "14", "Col4" : "7","Col5": "86"}
]
spark = SparkSession.builder.getOrCreate()
df_old_list = spark.createDataFrame(Row(**x) for x in df_old_list)
df_old_list.show()
+----+----+----+----+----+
|Col1|Col2|Col3|Col4|Col5|
+----+----+----+----+----+
| 0| 7| 8| | 20|
| 5| 5| 5| | 28|
| -1| -1| 13| | 83|
| -1| 6| 6| | 18|
| 5| 4| 2| | 84|
| 0| 0| 14| 7| 86|
+----+----+----+----+----+
I want to get the lowest value from each column for each row.
This is what i was able to achieve so far
df1=df_old_list.selectExpr("*","array_sort(split(concat_ws(',',*),','))[0] lowest_col")
df1.show()
+----+----+----+----+----+----------+
|Col1|Col2|Col3|Col4|Col5|lowest_col|
+----+----+----+----+----+----------+
| 0| 7| 8| | 20| |
| 5| 5| 5| | 28| |
| -1| -1| 13| | 83| |
| -1| 6| 6| | 18| |
| 5| 4| 2| | 84| |
| 0| 0| 14| 7| 86| 0|
+----+----+----+----+----+----------+
The problem is Col4 is blank and therefore its not able to compute the lowest value.
What i am looking for is something like this :
where i get the lowest value irrespective of blank columns and also if there are more than one lowest numbers then that column fields names should be shown in the lowest_col_title as concatenated.
+-----------------+----------+----+----+----+----+----+
|lowest_cols_title|lowest_col|Col1|Col2|Col3|Col4|Col5|
+-----------------+----------+----+----+----+----+----+
| Col1| 0| 0| 7| 8| | 20|
| Col1;Col2;Col3| 5| 5| 5| 5| | 28|
| Col1;Col2| -1| -1| -1| 13| | 83|
| Col1| -1| -1| 6| 6| | 18|
| Col3| 5| 5| 4| 2| | 84|
| Col1;Col2| 0| 0| 0| 14| 7| 86|
+-----------------+----------+----+----+----+----+----+
You can use pyspark.sql.functions.least
Returns the least value of the list of column names, skipping null
values. This function takes at least 2 parameters. It will return null
iff all parameters are null.
Once we have the minimum column we can compare the min value against all columns and create another column.
Create DataFrame:
from pyspark.sql import Row
from pyspark.sql.functions import col,least,when,array,concat_ws
df_old_list= [
{ "Col1":"0", "Col2" : "7","Col3": "8", "Col4" : "","Col5": "20"}, {"Col1":"5", "Col2" : "5","Col3": "5", "Col4" : "","Col5": "28"},
{ "Col1":"-1", "Col2" : "-1","Col3": "13", "Col4" : "","Col5": "83"}, {"Col1":"-1", "Col2" : "6","Col3": "6", "Col4" : "","Col5": "18"},
{ "Col1":"5", "Col2" : "4","Col3": "2", "Col4" : "","Col5": "84"}, { "Col1":"0", "Col2" : "0","Col3": "14", "Col4" : "7","Col5": "86"}]
df = spark.createDataFrame(Row(**x) for x in df_old_list)
from pyspark.sql.functions import least, when
Calculate row wise minimum and all columns which have the minimum value.
collist = df.columns
min_ = least(*[
when(col(c) == "", float("inf")).otherwise(col(c).cast('int'))
for c in df.columns
]).alias("lowest_col")
df = df.select("*", min_)
df = df.select("*",concat_ws(";",array([
when(col(c)==col("lowest_col") ,c).otherwise(None)
for c in collist
])).alias("lowest_cols_title") )
df.show(10,False)
Output:
+----+----+----+----+----+----------+-----------------+
|Col1|Col2|Col3|Col4|Col5|lowest_col|lowest_cols_title|
+----+----+----+----+----+----------+-----------------+
|0 |7 |8 | |20 |0.0 |Col1 |
|5 |5 |5 | |28 |5.0 |Col1;Col2;Col3 |
|-1 |-1 |13 | |83 |-1.0 |Col1;Col2 |
|-1 |6 |6 | |18 |-1.0 |Col1 |
|5 |4 |2 | |84 |2.0 |Col3 |
|0 |0 |14 |7 |86 |0.0 |Col1;Col2 |
+----+----+----+----+----+----------+-----------------+
There are several ways to avoid your empty cols.
The problem is that your columns have string type:
df_old_list.dtypes
Out[13]:
[('Col1', 'string'),
('Col2', 'string'),
('Col3', 'string'),
('Col4', 'string'),
('Col5', 'string')]
So you can just cast them to int:
df_old_list = df_old_list.withColumn('Col1', F.col('Col1').cast('int'))
df_old_list = df_old_list.withColumn('Col2', F.col('Col2').cast('int'))
df_old_list = df_old_list.withColumn('Col3', F.col('Col3').cast('int'))
df_old_list = df_old_list.withColumn('Col4', F.col('Col4').cast('int'))
df_old_list = df_old_list.withColumn('Col5', F.col('Col5').cast('int'))
df_old_list.show()
+----+----+----+----+----+
|Col1|Col2|Col3|Col4|Col5|
+----+----+----+----+----+
| 0| 7| 8|null| 20|
| 5| 5| 5|null| 28|
| -1| -1| 13|null| 83|
| -1| 6| 6|null| 18|
| 5| 4| 2|null| 84|
| 0| 0| 14| 7| 86|
+----+----+----+----+----+
Now your code results in following:
df1=df_old_list.selectExpr("*","array_sort(split(concat_ws(',',*),','))[0] lowest_col")
...:
...: df1.show()
+----+----+----+----+----+----------+
|Col1|Col2|Col3|Col4|Col5|lowest_col|
+----+----+----+----+----+----------+
| 0| 7| 8|null| 20| 0|
| 5| 5| 5|null| 28| 28|
| -1| -1| 13|null| 83| -1|
| -1| 6| 6|null| 18| -1|
| 5| 4| 2|null| 84| 2|
| 0| 0| 14| 7| 86| 0|
+----+----+----+----+----+----------+
You can replace empty values with some big int, for example
df_old_list = df_old_list.replace('', str((1 << 31) - 1))
df_old_list.show()
+----+----+----+----------+----+
|Col1|Col2|Col3| Col4|Col5|
+----+----+----+----------+----+
| 0| 7| 8|2147483647| 20|
| 5| 5| 5|2147483647| 28|
| -1| -1| 13|2147483647| 83|
| -1| 6| 6|2147483647| 18|
| 5| 4| 2|2147483647| 84|
| 0| 0| 14| 7| 86|
+----+----+----+----------+----+
And then do your manipulation:
df1=df_old_list.selectExpr("*","array_sort(split(concat_ws(',',*),','))[0] lowest_col")
df1.show()
+----+----+----+----------+----+----------+
|Col1|Col2|Col3| Col4|Col5|lowest_col|
+----+----+----+----------+----+----------+
| 0| 7| 8|2147483647| 20| 0|
| 5| 5| 5|2147483647| 28|2147483647|
| -1| -1| 13|2147483647| 83| -1|
| -1| 6| 6|2147483647| 18| -1|
| 5| 4| 2|2147483647| 84| 2|
| 0| 0| 14| 7| 86| 0|
+----+----+----+----------+----+----------+
You can notice that your way of finding min value comes to false result in the second row. So I can also recommend you to try to make this manipulation with transposing your dataframe and finding min value in column rather than in row. This can be done with pandas_udf very fast if you use arrow.
Using #venky__ answer, I found a solution for you:
from pyspark.sql import functions as F
join_key = df_old_list.columns
min_ = F.least(
*[F.when(F.col(c).isNull() | (F.col(c) == ""), float("inf")).otherwise(F.col(c).cast('int'))
for c in join_key]
).alias("lowest_col")
df_with_lowest_col = df_old_list.select("*", min_.cast('int'))
df_exploded = df_old_list.withColumn(
'vars_and_vals',
F.explode(F.array(
*(F.struct(F.lit(c).alias('var'), F.col(c).alias('val')) for c in join_key)
)))
cols = join_key + [F.col("vars_and_vals")[x].alias(x) for x in ['var', 'val']]
df_exploded = df_exploded.select(*cols)
df = df_exploded.join(df_with_lowest_col, join_key)
df = df.filter('val = lowest_col')
df_with_col_names = df.groupby(*join_key).agg(
F.array_join(F.collect_list('var'), ';').alias('lowest_cols_title')
)
res_df = df_with_lowest_col.join(df_with_col_names, join_key)
result:
res_df.show()
+----+----+----+----+----+----------+-----------------+
|Col1|Col2|Col3|Col4|Col5|lowest_col|lowest_cols_title|
+----+----+----+----+----+----------+-----------------+
| 0| 0| 14| 7| 86| 0| Col1;Col2|
| -1| 6| 6| | 18| -1| Col1|
| 5| 5| 5| | 28| 5| Col1;Col2;Col3|
| 0| 7| 8| | 20| 0| Col1|
| -1| -1| 13| | 83| -1| Col1;Col2|
| 5| 4| 2| | 84| 2| Col3|
+----+----+----+----+----+----------+-----------------+
It looks complicated and can be optimized I think but it works.

Merge 2 spark dataframes with non overlapping columns

I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they aren't overlapping (ie when df2 entry is null df1 entry isn't and vise versa)
I know that if I use join I won't get them on the same column and will instead get 2 "col_name". I just want it on the one column. How do I do this? Thanks
Try this-
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
You could do this:
mydf = df1.copy() #make copy of first array
idx = np.where(df1['col_name'].values == 'null')[0] #get indices of null
val = df2['col_name'].values[idx] #get values from df2 where df1 is null
mydf['col_name'][idx] = val #assign those values in mydf
mydf #print mydf
you should be able to utilize the coalesce function to achieve this.
df1 = df1.withColumnRenamed("col_name", "col_name_a")
df2 = df2.withColumnRenamed("col_name", "col_name_b")
joinedDF = renamedDF1.join(renamedDF2, "id")
joinedDF = joinedDF.withColumn(
"col_name",
coalesce(joinedDF.col("col_name_a"), joinedDF.col("col_name_b"))
)

pyspark dataframe withColumn command not working

I have a input dataframe: df_input (updated df_input)
|comment|inp_col|inp_val|
|11 |a |a1 |
|12 |a |a2 |
|15 |b |b3 |
|16 |b |b4 |
|17 |c |&b |
|17 |c |c5 |
|17 |d |&c |
|17 |d |d6 |
|17 |e |&d |
|17 |e |e7 |
I want to replace the variable in inp_val column to its value. I have tried with the below code to create a new column.
Taken the list of values which starts with '&'
df_new = df_inp.select(inp_val).where(df.inp_val.substr(0, 1) == '&')
Now I'm iterating over the list to replace the '&' column value data to it original list.
for a in [row[inp_val] for row in df_new.collect()]
df_inp = df_inp.withColumn
(
'new_col',
when(df.inp_val.substr(0, 1) == '&',
[row[inp_val] for row in df.select(df.inp_val).where(df.inp_col == a[1:]).collect()])
.otherwise(df.inp_val)
)
But, I'm getting error as below:
Java.lang.RuntimeException: Unsupported literal tpe class java.util.ArrayList [[5], [6]]
Basically I want the output as below. Please check and let me know where is the error???.
I was thinking that two type of datatype values I'm trying to insert as per the above code??
Updated lines of code:
tst_1 = tst.withColumn("col3_extract", when(tst.col3.substr(0, 1) == '&', regexp_replace(tst.col3, "&", "")).otherwise(""))
# Select which values need to be replaced; withColumnRenamed will also solve spark self join issues
# The substring search can also be done using regex function
tst_filter=tst.where(~F.col('col3').contains('&')).withColumnRenamed('col2','col2_collect')
# For the selected data, perform a collect list
tst_clct = tst_filter.groupby('col2_collect').agg(F.collect_list('col3').alias('col3_collect'))
#%% Join the main table with the collected list
tst_join = tst_1.join(tst_clct,on=tst_1.col3_extract==tst_clct.col2_collect,how='left').drop('col2_collect')
#%% In the column3 replace the values such as a, b
tst_result = tst_join.withColumn("result",F.when(~F.col('col3').contains('&'),F.array(F.col('col3'))).otherwise(F.col('col3_collect')))
But, the above code doesn't work on the multiple iterations
Updated Expected Output:
|comment|inp_col|inp_val|new_col |
|11 |a |a1 |['a1'] |
|12 |a |a2 |['a2'] |
|15 |b |b3 |['b3'] |
|16 |b |b4 |['b4'] |
|17 |c |&b |['b3', 'b4'] |
|18 |c |c5 |['c5'] |
|19 |d |&c |['b3', 'b4', 'c5'] |
|20 |d |d6 |['d6'] |
|21 |e |&d |['b3', 'b4', 'c5', 'd6'] |
|22 |e |e7 |['e7'] |
Try this, self-join with collected list on rlike join condition is the way to go.
df.show() #sampledataframe
#+-------+---------+---------+
#|comment|input_col|input_val|
#+-------+---------+---------+
#| 11| a| 1|
#| 12| a| 2|
#| 15| b| 5|
#| 16| b| 6|
#| 17| c| &b|
#| 17| c| 7|
#+-------+---------+---------+
df.join(df.groupBy("input_col").agg(F.collect_list("input_val").alias("y1"))\
.withColumnRenamed("input_col","x1"),F.expr("""input_val rlike x1"""),'left')\
.withColumn("new_col", F.when(F.col("input_val").cast("int").isNotNull(), F.array("input_val"))\
.otherwise(F.col("y1"))).drop("x1","y1").show()
#+-------+---------+---------+-------+
#|comment|input_col|input_val|new_col|
#+-------+---------+---------+-------+
#| 11| a| 1| [1]|
#| 12| a| 2| [2]|
#| 15| b| 5| [5]|
#| 16| b| 6| [6]|
#| 17| c| &b| [5, 6]|
#| 17| c| 7| [7]|
#+-------+---------+---------+-------+
You can simply use regex_replace like this:
df.withColumn("new_col", regex_replace(col("inp_val"), "&", ""))
Can you tryout this solution. Your approach may run into whole lot of problems.
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window
#Test data
tst = sqlContext.createDataFrame([(1,'a','3'),(1,'a','4'),(1,'b','5'),(1,'b','7'),(2,'c','&b'),(2,'c','&a'),(2,'d','&b')],schema=['col1','col2','col3'])
# extract the special character out
tst_1 = tst.withColumn("col3_extract",F.substring(F.col('col3'),2,1))
# Selecct which values need to be replaced; withColumnRenamed will also solve spark self join issues
# The substring search can also be done using regex function
tst_filter=tst.where(~F.col('col3').contains('&')).withColumnRenamed('col2','col2_collect')
# For the selected data, perform a collect list
tst_clct = tst_filter.groupby('col2_collect').agg(F.collect_list('col3').alias('col3_collect'))
#%% Join the main table with the collected list
tst_join = tst_1.join(tst_clct,on=tst_1.col3_extract==tst_clct.col2_collect,how='left').drop('col2_collect')
#%% In the column3 replace the values such as a, b
tst_result = tst_join.withColumn("result",F.when(~F.col('col3').contains('&'),F.array(F.col('col3'))).otherwise(F.col('col3_collect')))
Results :
+----+----+----+------------+------------+------+
|col1|col2|col3|col3_extract|col3_collect|result|
+----+----+----+------------+------------+------+
| 2| c| &a| a| [3, 4]|[3, 4]|
| 2| c| &b| b| [7, 5]|[7, 5]|
| 2| d| &b| b| [7, 5]|[7, 5]|
| 1| a| 3| | null| [3]|
| 1| a| 4| | null| [4]|
| 1| b| 5| | null| [5]|
| 1| b| 7| | null| [7]|
+----+----+----+------------+------------+------+

Categories

Resources