pyspark dataframe withColumn command not working - python

I have a input dataframe: df_input (updated df_input)
|comment|inp_col|inp_val|
|11 |a |a1 |
|12 |a |a2 |
|15 |b |b3 |
|16 |b |b4 |
|17 |c |&b |
|17 |c |c5 |
|17 |d |&c |
|17 |d |d6 |
|17 |e |&d |
|17 |e |e7 |
I want to replace each variable in the inp_val column with its value. I have tried the below code to create a new column.
First, take the list of values which start with '&':
df_new = df_inp.select('inp_val').where(df_inp.inp_val.substr(0, 1) == '&')
Now I'm iterating over that list to replace each '&' value with its original list of values:
for a in [row['inp_val'] for row in df_new.collect()]:
    df_inp = df_inp.withColumn(
        'new_col',
        when(df_inp.inp_val.substr(0, 1) == '&',
             [row['inp_val'] for row in df_inp.select(df_inp.inp_val).where(df_inp.inp_col == a[1:]).collect()])
        .otherwise(df_inp.inp_val)
    )
But I'm getting the error below:
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[5], [6]]
Basically I want the output as below. Please check and let me know where the error is.
I think the problem is that I'm trying to insert two different data types with the above code?
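For reference, when()/otherwise() only accepts Columns or scalar literals, so a plain Python list has to be wrapped first. A minimal sketch, assuming a single collected list named replacement_vals (a hypothetical name):
from pyspark.sql import functions as F
replacement_vals = ['b3', 'b4']  # list collected on the driver for one iteration
# Wrap each element in lit() and combine with array() to get a Column literal.
array_literal = F.array(*[F.lit(v) for v in replacement_vals])
df_inp = df_inp.withColumn(
    'new_col',
    F.when(df_inp.inp_val.substr(0, 1) == '&', array_literal)
     .otherwise(F.array(df_inp.inp_val))
)
This only fixes the literal error; it still applies the same list to every '&' row, so it does not solve the per-row lookup by itself.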
Updated lines of code:
tst_1 = tst.withColumn("col3_extract", when(tst.col3.substr(0, 1) == '&', regexp_replace(tst.col3, "&", "")).otherwise(""))
# Select which values need to be replaced; withColumnRenamed will also solve spark self join issues
# The substring search can also be done using regex function
tst_filter=tst.where(~F.col('col3').contains('&')).withColumnRenamed('col2','col2_collect')
# For the selected data, perform a collect list
tst_clct = tst_filter.groupby('col2_collect').agg(F.collect_list('col3').alias('col3_collect'))
#%% Join the main table with the collected list
tst_join = tst_1.join(tst_clct,on=tst_1.col3_extract==tst_clct.col2_collect,how='left').drop('col2_collect')
#%% In the column3 replace the values such as a, b
tst_result = tst_join.withColumn("result",F.when(~F.col('col3').contains('&'),F.array(F.col('col3'))).otherwise(F.col('col3_collect')))
But the above code doesn't work when the references are nested over multiple levels (for example &d refers to d, whose values themselves contain &c).
Updated Expected Output:
|comment|inp_col|inp_val|new_col |
|11 |a |a1 |['a1'] |
|12 |a |a2 |['a2'] |
|15 |b |b3 |['b3'] |
|16 |b |b4 |['b4'] |
|17 |c |&b |['b3', 'b4'] |
|18 |c |c5 |['c5'] |
|19 |d |&c |['b3', 'b4', 'c5'] |
|20 |d |d6 |['d6'] |
|21 |e |&d |['b3', 'b4', 'c5', 'd6'] |
|22 |e |e7 |['e7'] |

Try this; a self-join with the collected list on an rlike join condition is the way to go.
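(The sample dataframe shown below can be built like this; the exact construction is an assumption, only the displayed contents matter.)
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(11, 'a', '1'), (12, 'a', '2'), (15, 'b', '5'),
     (16, 'b', '6'), (17, 'c', '&b'), (17, 'c', '7')],
    ['comment', 'input_col', 'input_val'])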
df.show() #sampledataframe
#+-------+---------+---------+
#|comment|input_col|input_val|
#+-------+---------+---------+
#| 11| a| 1|
#| 12| a| 2|
#| 15| b| 5|
#| 16| b| 6|
#| 17| c| &b|
#| 17| c| 7|
#+-------+---------+---------+
df.join(df.groupBy("input_col").agg(F.collect_list("input_val").alias("y1"))\
.withColumnRenamed("input_col","x1"),F.expr("""input_val rlike x1"""),'left')\
.withColumn("new_col", F.when(F.col("input_val").cast("int").isNotNull(), F.array("input_val"))\
.otherwise(F.col("y1"))).drop("x1","y1").show()
#+-------+---------+---------+-------+
#|comment|input_col|input_val|new_col|
#+-------+---------+---------+-------+
#| 11| a| 1| [1]|
#| 12| a| 2| [2]|
#| 15| b| 5| [5]|
#| 16| b| 6| [6]|
#| 17| c| &b| [5, 6]|
#| 17| c| 7| [7]|
#+-------+---------+---------+-------+

You can simply use regexp_replace like this:
df.withColumn("new_col", regexp_replace(col("inp_val"), "&", ""))

Can you try out this solution? Your approach may run into a whole lot of problems.
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window
#Test data
tst = sqlContext.createDataFrame([(1,'a','3'),(1,'a','4'),(1,'b','5'),(1,'b','7'),(2,'c','&b'),(2,'c','&a'),(2,'d','&b')],schema=['col1','col2','col3'])
# extract the referenced key (the character after the '&')
tst_1 = tst.withColumn("col3_extract",F.substring(F.col('col3'),2,1))
# Select which values need to be replaced; withColumnRenamed will also solve spark self join issues
# The substring search can also be done using regex function
tst_filter=tst.where(~F.col('col3').contains('&')).withColumnRenamed('col2','col2_collect')
# For the selected data, perform a collect list
tst_clct = tst_filter.groupby('col2_collect').agg(F.collect_list('col3').alias('col3_collect'))
#%% Join the main table with the collected list
tst_join = tst_1.join(tst_clct,on=tst_1.col3_extract==tst_clct.col2_collect,how='left').drop('col2_collect')
#%% In the column3 replace the values such as a, b
tst_result = tst_join.withColumn("result",F.when(~F.col('col3').contains('&'),F.array(F.col('col3'))).otherwise(F.col('col3_collect')))
Results :
+----+----+----+------------+------------+------+
|col1|col2|col3|col3_extract|col3_collect|result|
+----+----+----+------------+------------+------+
| 2| c| &a| a| [3, 4]|[3, 4]|
| 2| c| &b| b| [7, 5]|[7, 5]|
| 2| d| &b| b| [7, 5]|[7, 5]|
| 1| a| 3| | null| [3]|
| 1| a| 4| | null| [4]|
| 1| b| 5| | null| [5]|
| 1| b| 7| | null| [7]|
+----+----+----+------------+------------+------+

Related

extract substring before first occurrence and substring after last occurrence of a delimiter in Pyspark

I have a data frame like below in pyspark
df = spark.createDataFrame(
    [('14_100_00', 'A', 25),
     ('13_100_00', 'B', 24),
     ('15_100_00', 'A', 20),
     ('150_100', 'C', 21),
     ('16', 'A', 20),
     ('1634_100_00_01', 'B', 22),
     ('1_100_00', 'C', 23),
     ('18_100_00', 'D', 24)],
    ("rust", "name", "value"))
df.show()
+--------------+----+-----+
| rust|name|value|
+--------------+----+-----+
| 14_100_00| A| 25|
| 13_100_00| B| 24|
| 15_100_00| A| 20|
| 150_100| C| 21|
| 16| A| 20|
|1634_100_00_01| B| 22|
| 1_100_00| C| 23|
| 18_100_00| D| 24|
+--------------+----+-----+
I am trying to create a new column from the rust column using the below conditions:
1) extract anything before the 1st underscore
2) extract anything after the last underscore
3) concatenate the above two values using a tilde (~)
If there are no underscores in the column then keep the column as is.
I have tried the below:
from pyspark.sql import functions as f
df1 = df.select("*", f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1)).alias("extract"))
df1.show()
+--------------+----+-----+-------+
| rust|name|value|extract|
+--------------+----+-----+-------+
| 14_100_00| A| 25| 14~00|
| 13_100_00| B| 24| 13~00|
| 15_100_00| A| 20| 15~00|
| 150_100| C| 21|150~100|
| 16| A| 20| 16~16|
|1634_100_00_01| B| 22|1634~01|
| 1_100_00| C| 23| 1~00|
| 18_100_00| D| 24| 18~00|
+--------------+----+-----+-------+
expected result:
+--------------+----+-----+-------+
| rust|name|value|extract|
+--------------+----+-----+-------+
| 14_100_00| A| 25| 14~00|
| 13_100_00| B| 24| 13~00|
| 15_100_00| A| 20| 15~00|
| 150_100| C| 21|150~100|
| 16| A| 20| 16|
|1634_100_00_01| B| 22|1634~01|
| 1_100_00| C| 23| 1~00|
| 18_100_00| D| 24| 18~00|
+--------------+----+-----+-------+
How can I achieve what I want?
Use the instr function to determine whether the rust column contains _, and then use the when function to process it:
df1 = df.select("*",
                f.when(f.instr(df.rust, '_') > 0,
                       f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1)))
                 .otherwise(df.rust)
                 .alias("extract")
                )

GroupBy a dataframe records and display all columns with PySpark

I have the following dataframe
dataframe - columnA, columnB, columnC, columnD, columnE
I want to groupBy columnC and then take the max value of columnE:
dataframe.select('*').groupBy('columnC').max('columnE')
expected output
dataframe - columnA, columnB, columnC, columnD, columnE
Real output
dataframe - columnC, columnE
Why are all the columns of the dataframe not displayed as expected?
For Spark version >= 3.0.0 you can use max_by to select the additional columns.
import random
from pyspark.sql import functions as F
#create some testdata
df = spark.createDataFrame(
    [[random.randint(1, 3)] + random.sample(range(0, 30), 4) for _ in range(10)],
    schema=["columnC", "columnB", "columnA", "columnD", "columnE"]) \
    .select("columnA", "columnB", "columnC", "columnD", "columnE")
df.groupBy("columnC") \
    .agg(F.max("columnE"),
         F.expr("max_by(columnA, columnE) as columnA"),
         F.expr("max_by(columnB, columnE) as columnB"),
         F.expr("max_by(columnD, columnE) as columnD")) \
    .show()
For the testdata
+-------+-------+-------+-------+-------+
|columnA|columnB|columnC|columnD|columnE|
+-------+-------+-------+-------+-------+
| 25| 20| 2| 0| 2|
| 14| 2| 2| 24| 6|
| 26| 13| 3| 2| 1|
| 5| 24| 3| 19| 17|
| 22| 5| 3| 14| 21|
| 24| 5| 1| 8| 4|
| 7| 22| 3| 16| 20|
| 6| 17| 1| 5| 7|
| 24| 22| 2| 8| 3|
| 4| 14| 1| 16| 11|
+-------+-------+-------+-------+-------+
the result is
+-------+------------+-------+-------+-------+
|columnC|max(columnE)|columnA|columnB|columnD|
+-------+------------+-------+-------+-------+
| 1| 11| 4| 14| 16|
| 3| 21| 22| 5| 14|
| 2| 6| 14| 2| 24|
+-------+------------+-------+-------+-------+
What you want to achieve can be done with a window function, not groupBy:
partition your data by columnC
order the data within each partition by columnE descending (rank)
filter out your desired result.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
from pyspark.sql.functions import col
windowSpec = Window.partitionBy("columnC").orderBy(col("columnE").desc())
expectedDf = df.withColumn("rank", rank().over(windowSpec)) \
    .filter(col("rank") == 1)
You might want to restructure your question.

Merge 2 spark dataframes with non overlapping columns

I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they aren't overlapping (i.e. when a df2 entry is null the df1 entry isn't, and vice versa).
I know that if I use a join I won't get them in the same column and will instead get two "col_name" columns. I just want it in one column. How do I do this? Thanks
Try this-
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
You could do this (note that this approach works on pandas DataFrames, not Spark DataFrames):
import numpy as np
mydf = df1.copy() # make a copy of the first dataframe
idx = np.where(df1['col_name'].isnull())[0] # get positions where df1 is null
val = df2['col_name'].values[idx] # get the values from df2 at those positions
mydf['col_name'][idx] = val # assign those values in mydf
mydf # print mydf
You should be able to use the coalesce function to achieve this.
from pyspark.sql.functions import coalesce
renamedDF1 = df1.withColumnRenamed("col_name", "col_name_a")
renamedDF2 = df2.withColumnRenamed("col_name", "col_name_b")
joinedDF = renamedDF1.join(renamedDF2, "id")
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(joinedDF["col_name_a"], joinedDF["col_name_b"])
)

pyspark dataframe parent child hierarchy issue

In continuation of the issue: pyspark dataframe withColumn command not working
I have a input dataframe: df_input (updated df_input)
|comment|inp_col|inp_val|
|11 |a |a1 |
|12 |a |a2 |
|12 |f |&a |
|12 |f |f9 |
|15 |b |b3 |
|16 |b |b4 |
|17 |c |&b |
|17 |c |c5 |
|17 |d |&c |
|17 |d |d6 |
|17 |e |&d |
|17 |e |e7 |
As you can see, inp_col and inp_val form a hierarchy that can be any number of levels deep from the root values. Here the parent values are "a" and "b".
Now, as per my requirement, I have to replace the child values starting with "&" with their corresponding values.
I have tried iterating over the list of values starting with '&' in the inp_val column and replacing them with a list of values on every iteration.
But it didn't work. I'm facing an issue with how to build the list containing both the parent and child values.
tried code:
from pyspark.sql.functions import when, regexp_replace
list_1 = [row['inp_val'] for row in tst.select(tst.inp_val).where(tst.inp_val.substr(0, 1) == '&').collect()]
# remove the '&' at the start of every list value
list_2 = [list_val[1:] for list_val in list_1]
tst_1 = tst.withColumn("val_extract", when(tst.inp_val.substr(0, 1) == '&', regexp_replace(tst.inp_val, "&", "")).otherwise(tst.inp_val))
for val in list_2:
    df_leaf = tst_1.select(tst_1.val_extract).where(tst_1.inp_col == val)
    list_3 = [row['val_extract'] for row in df_leaf.collect()]
    tst_1 = tst_1.withColumn('bool', when(tst_1.val_extract == val, 'True').otherwise('False'))
    tst_1 = tst_1.withColumn('val_extract', when(tst_1.bool == 'True', str(list_3)).otherwise(tst_1.val_extract)).drop('bool')
Updated Expected Output:
|comment|inp_col|inp_val|inp_extract |
|11 |a |a1 |['a1'] |
|12 |a |a2 |['a2'] |
|12 |f |&a |['a1', 'a2'] |
|12 |f |f9 |['f9'] |
|15 |b |b3 |['b3'] |
|16 |b |b4 |['b4'] |
|17 |c |&b |['b3', 'b4'] |
|18 |c |c5 |['c5'] |
|19 |d |&c |['b3', 'b4', 'c5'] |
|20 |d |d6 |['d6'] |
|21 |e |&d |['b3', 'b4', 'c5', 'd6'] |
|22 |e |e7 |['e7'] |
After that I can try explode to get multiple rows. But the above output is what we require, and I'm not able to get even a partially correct result.
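(A minimal sketch of that explode step, assuming the final dataframe with the inp_extract array column is called df_result; both names are hypothetical:)
from pyspark.sql import functions as F
# Produce one output row per element of the inp_extract array.
df_exploded = df_result.withColumn('inp_extract', F.explode('inp_extract'))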
If you really want to avoid using graphs and your case is not more complex than shown above, try this.
from pyspark.sql import functions as F
df.show() #sampledataframe
#+-------+---------+---------+
#|comment|input_col|input_val|
#+-------+---------+---------+
#| 11| a| a1|
#| 12| a| a2|
#| 12| f| &a|
#| 12| f| f9|
#| 15| b| b3|
#| 16| b| b4|
#| 17| c| &b|
#| 17| c| c5|
#| 17| d| &c|
#| 17| d| d6|
#| 17| e| &d|
#| 17| e| e7|
#+-------+---------+---------+
df1=df.join(df.groupBy("input_col").agg(F.collect_list("input_val").alias("y1"))\
.withColumnRenamed("input_col","x1"),F.expr("""input_val rlike x1"""),'left')\
.withColumn("new_col", F.when(F.expr("""substring(input_val,0,1)!""")!=F.lit('&'), F.array("input_val"))\
.otherwise(F.col("y1"))).drop("x1","y1")
df2=df1.join(df1.selectExpr("input_val as input_val1","new_col as new_col1"), F.expr("""array_contains(new_col,input_val1) and\
substring(input_val1,0,1)=='&'"""),'left')
df2.join(df2.selectExpr("input_val1 as val2","new_col1 as col2")\
.dropna(),F.expr("""array_contains(new_col1,val2)"""),'left')\
.withColumn("inp_extract", F.when(F.expr("""substring(input_val,0,1)!='&'"""), F.col("new_col"))\
.otherwise(F.expr("""filter(concat(\
coalesce(new_col,array()),\
coalesce(new_col1,array()),\
coalesce(col2, array()))\
,x-> x is not null and substring(x,0,1)!='&')""")))\
.select("comment","input_col","input_val",F.array_sort("inp_extract").alias("inp_extract")).show()
#+-------+---------+---------+----------------+
#|comment|input_col|input_val| inp_extract|
#+-------+---------+---------+----------------+
#| 11| a| a1| [a1]|
#| 12| a| a2| [a2]|
#| 12| f| &a| [a1, a2]|
#| 12| f| f9| [f9]|
#| 15| b| b3| [b3]|
#| 16| b| b4| [b4]|
#| 17| c| &b| [b3, b4]|
#| 17| c| c5| [c5]|
#| 17| d| &c| [b3, b4, c5]|
#| 17| d| d6| [d6]|
#| 17| e| &d|[b3, b4, c5, d6]|
#| 17| e| e7| [e7]|
#+-------+---------+---------+----------------+
You can join the data frame to itself to get this.
input :
df.show()
+-------+-------+---------+
|comment|inp_col|input_val|
+-------+-------+---------+
| 11| a| a1|
| 12| a| a2|
| 13| f| &a|
| 14| b| b3|
| 15| b| b4|
| 16| d| &b|
+-------+-------+---------+
import pyspark.sql.functions as F
df.createOrReplaceTempView("df1")
df.withColumn("input_val", F.regexp_replace(F.col("input_val"), "&", "")).createOrReplaceTempView("df2")
spark.sql("""select * from (select coalesce(df2.comment,df1.comment) as comment ,
coalesce(df2.inp_col,df1.inp_col) as inp_col,
coalesce(df2.input_val,df2.input_val) as input_val ,
case when df1.input_val is not null then df1.input_val else df2.input_val end as output
from df1 full outer join df2 on df2.input_val = df1.inp_col) where input_val is not null order by comment """).show()
Output
+-------+-------+---------+------+
|comment|inp_col|input_val|output|
+-------+-------+---------+------+
| 11| a| a1| a1|
| 12| a| a2| a2|
| 13| f| a| a1|
| 13| f| a| a2|
| 14| b| b3| b3|
| 15| b| b4| b4|
| 16| d| b| b3|
| 16| d| b| b4|
+-------+-------+---------+------+

How to concatenate to a null column in pyspark dataframe

I have the below dataframe and I want to update the rows dynamically with some values.
input_frame.show()
+----------+----------+---------+
|student_id|name |timestamp|
+----------+----------+---------+
| s1|testuser | t1|
| s1|sampleuser| t2|
| s2|test123 | t1|
| s2|sample123 | t2|
+----------+----------+---------+
input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'),sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
I want to update the 'test' column with some values and then apply a filter with partial matches on that column. But concatenating to the null column results in a null column again. How can we do this?
Use concat_ws, like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, when
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")
# This won't work
df = df.withColumn("concat", concat(df.a, df.b))
# This won't work
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))
# Do it like this
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()
gives:
+----+----+------+-------------+---------+
| a| b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
| 1| 2| 12| 12| 12|
| 2|null| null| null| 2|
| 3| 4| 34| 34| 34|
| 4| 5| 45| 45| 45|
|null| 6| null| null| 6|
+----+----+------+-------------+---------+
Note specifically that casting a NULL column to string doesn't work as you wish; concat will return NULL for the whole value if any of its inputs is null.
There's no nice way of dealing with more complicated scenarios, but note that you can use a when statement inside a concat if you're willing to suffer the verbosity of it, like this:
df.withColumn("concat_custom", concat(
when(df.a.isNull(), lit('_')).otherwise(df.a),
when(df.b.isNull(), lit('_')).otherwise(df.b))
)
To get, eg:
+----+----+-------------+
| a| b|concat_custom|
+----+----+-------------+
| 1| 2| 12|
| 2|null| 2_|
| 3| 4| 34|
| 4| 5| 45|
|null| 6| _6|
+----+----+-------------+
You can use the coalesce function, which returns the first of its arguments that is not null, and provide a literal in the second place, which will be used in case the column has a null value.
from pyspark.sql.functions import coalesce, concat, lit
df = df.withColumn("concat", concat(coalesce(df.a, lit('')), coalesce(df.b, lit(''))))
You can fill null values with empty strings:
import pyspark.sql.functions as f
from pyspark.sql.types import *
data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
data = data.withColumn('test', f.lit(None).cast(StringType()))
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))
Is that what you were looking for?
