Merge duplicated records in pyspark dataframe

Merge duplicated records in pyspark dataframe - python

I have a pyspark dataframe that have duplicated ids. There are missing values in some of the records and differences in the "Time" field among the duplicated ids.
+-------------+------------------------+-------------------------+---------------------------------+
|id |Time |Type |Status|
+-------------+------------------------+-------------------------+---------------------------------+
|1 |2020-03-01 | | |
|1 |2020-03-01 |A |Single |
|1 |2020-03-01 |A | |
|2 |2020-02-01 |C | Double |
|2 |2020-02-25 | | Double |
+-------------+------------------------+-------------------------+---------------------------------+
How can I merge the info in every field and make them into one record? And if there is there are difference "time"value, how can i just choose the most recent one? The ideal dataframe looks like this:
+-------------+------------------------+-------------------------+---------------------------------+
|id |Time |Type |Status|
+-------------+------------------------+-------------------------+---------------------------------+
|1 |2020-03-01 |A | Single |
|2 |2020-02-25 |C | Double |
+-------------+------------------------+-------------------------+-------------------------------
And please note that I have around 100 fields in this dataframe, not just the four I am showing.

Try by grouping by id then get the max(time) by converting to date type to get most recent one.
Example:
df.show()
#+---+----------+----+------+
#| id| Time|Type|Status|
#+---+----------+----+------+
#| 1|2020-03-01| | |
#| 1|2020-03-01| A|Single|
#| 1|2020-03-01| A| |
#| 2|2020-02-01| C|Double|
#| 2|2020-02-25| |Double|
#+---+----------+----+------+
from pyspark.sql.functions import *
expr=[max(to_date(f'{i}')).alias("Time") for i in df.columns if i == 'Time'] +[max(f'{i}').alias(f'{i}') for i in df.columns if i not in ['id','Time']]
#[Column<b'max(to_date(`Time`)) AS `Time`'>, Column<b'max(Type) AS `Type`'>, Column<b'max(Status) AS `Status`'>]
df.groupBy("id").agg(*expr).show()
#+---+----------+----+------+
#| id| Time|Type|Status|
#+---+----------+----+------+
#| 1|2020-03-01| A|Single|
#| 2|2020-02-25| C|Double|
#+---+----------+----+------+
If you want to get first/last based on Type,Status values then
when_expr=['id'] + [when(length(f'{i}') ==0 ,lit(None)).otherwise(col(f'{i}')).alias(f'{i}') for i in df.columns if i not in ['id']]
df1=df.select(*when_expr)
df1.show()
#+---+----------+----+------+
#| id| Time|Type|Status|
#+---+----------+----+------+
#| 1|2020-03-01|null| null|
#| 1|2020-03-01| A|Single|
#| 1|2020-03-01| A| null|
#| 2|2020-02-01| C|Double|
#| 2|2020-02-25|null|Double|
#+---+----------+----+------+
#using first function
expr=[max(to_date(f'{i}')).alias("Time") for i in df.columns if i == 'Time'] +[first(f'{i}',True).alias(f'{i}') for i in df.columns if i not in ['id','Time']]
#using last function
expr=[max(to_date(f'{i}')).alias("Time") for i in df.columns if i == 'Time'] +[last(f'{i}',True).alias(f'{i}') for i in df.columns if i not in ['id','Time']]
df1.groupBy("id").agg(*expr).show()
#+---+----------+----+------+
#| id| Time|Type|Status|
#+---+----------+----+------+
#| 1|2020-03-01| A|Single|
#| 2|2020-02-25| C|Double|
#+---+----------+----+------+

Related

Pyspark : Return all column names of max values

I have a DataFrame like this :
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType, StringType
#import numpy as np
data = [(("ID1", 3, 5,5)), (("ID2", 4, 5,6)), (("ID3", 3, 3,3))]
df = sqlContext.createDataFrame(data, ["ID", "colA", "colB","colC"])
df.show()
cols = df.columns
maxcol = f.udf(lambda row: cols[row.index(max(row)) +1], StringType())
maxDF = df.withColumn("Max_col", maxcol(f.struct([df[x] for x in df.columns[1:]])))
maxDF.show(truncate=False)
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1| 3| 5| 5|
|ID2| 4| 5| 6|
|ID3| 3| 3| 3|
+---+----+----+----+
+---+----+----+----+-------+
|ID |colA|colB|colC|Max_col|
+---+----+----+----+-------+
|ID1|3 |5 |5 |colB |
|ID2|4 |5 |6 |colC |
|ID3|3 |3 |3 |colA |
+---+----+----+----+-------+
I want to return all column names of max values in case there are ties, how can I achieve this in pyspark like this :
+---+----+----+----+--------------+
|ID |colA|colB|colC|Max_col |
+---+----+----+----+--------------+
|ID1|3 |5 |5 |colB,colC |
|ID2|4 |5 |6 |colC |
|ID3|3 |3 |3 |colA,ColB,ColC|
+---+----+----+----+--------------+
Thank you

Seems like a udf solution.
iterate over the columns you have (pass them as an input to the class) and perform a python operations to get the max and check who has the same value. return a list (aka array) of the column names.
#udf(returnType=ArrayType(StringType()))
def collect_same_max():
...
Or, maybe if it doable you can try use the transform function
from https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html

Data Engineering, I would say. See code and logic below
new =(df.withColumn('x',F.array(*[F.struct(F.lit(x).alias('col'),F.col(x).alias('num')) for x in df.columns if x!='ID']))#Create Struct Column of columns and their values
.selectExpr('ID','colA','colB','colC', 'inline(x)')#Explode struct column
.withColumn('z', first('num').over(Window.partitionBy('ID').orderBy(F.desc('num'))))#Create column with max value for each id
.where(col('num')==col('z'))#isolate max values in each id
.groupBy(['ID','colA','colB','colC']).agg(F.collect_list('col').alias('col'))#combine max columns into list
)
+---+----+----+----+------------------+
| ID|colA|colB|colC| Max_col|
+---+----+----+----+------------------+
|ID1| 3| 5| 5| [colB, colC]|
|ID2| 4| 5| 6| [colC]|
|ID3| 3| 3| 3|[colA, colB, colC]|
+---+----+----+----+------------------+

How to get Weighted Average for a column in pyspark

Here i need to find exponential moving average in spark dataframe :
Table :
ab = spark.createDataFrame(
[(1,"1/1/2020", 41.0,0.5, 0.5 ,1, '10.22'),
(1,"10/3/2020",24.0,0.3, 0.7 ,2, '' ),
(1,"21/5/2020",32.0,0.4, 0.6 ,3, '' ),
(2,"3/1/2020", 51.0,0.22, 0.78,1, '34.78'),
(2,"10/5/2020",14.56,0.333,0.66,2, '' ),
(2,"30/9/2020",17.0,0.66, 0.34,3, '' )],["CID","date","A","B","C","Row","SMA"] )
ab.show()
+---+---------+-----+-----+----+---+-----+
|CID| date| A| B| C| Row| SMA|
+---+---------+-----+-----+----+---+-----+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| |
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| |
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| |
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |
+---+---------+-----+-----+----+---+-----+
Expected Output :
+---+---------+-----+-----+----+---+-----+----------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+---------+-----+-----+----+---+-----+----------+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| | 14.354|
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| | 21.4124|
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| | 28.04674|
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |20.7558916|
+---+---------+-----+-----+----+---+-----+----------+
Logic :
For every customer
if row == 1 then
SMA as EMA
else ( C * LAG(EMA) + A * B ) as EMA

The problem here is that a freshly calculated value of a previous row is used as input for the current row. That means that it is not possible to parallelize the calculations for a single customer.
For Spark 3.0+, it is possible to get the required result with a pandas udf using grouped map
ab = spark.createDataFrame(
[(1,"1/1/2020", 41.0,0.5, 0.5 ,1, '10.22'),
(1,"10/3/2020",24.0,0.3, 0.7 ,2, '' ),
(1,"21/5/2020",32.0,0.4, 0.6 ,3, '' ),
(2,"3/1/2020", 51.0,0.22, 0.78,1, '34.78'),
(2,"10/5/2020",14.56,0.333,0.66,2, '' ),
(2,"30/9/2020",17.0,0.66, 0.34,3, '' )],\
["CID","date","A","B","C","Row","SMA"] ) \
.withColumn("SMA", F.col('SMA').cast(T.DoubleType())) \
.withColumn("date", F.to_date(F.col("date"), "d/M/yyyy"))
import pandas as pd
def calc(df: pd.DataFrame):
# df is a pandas.DataFrame
df = df.sort_values('date').reset_index(drop=True)
df.loc[0, 'EMA'] = df.loc[0, 'SMA']
for i in range(1, len(df)):
df.loc[i, 'EMA'] = df.loc[i, 'C'] * df.loc[i-1, 'EMA'] + \
df.loc[i, 'A'] * df.loc[i, 'B']
return df
ab.groupBy("CID").applyInPandas(calc,
schema = "CID long, date date, A double, B double, C double, Row long, SMA double, EMA double")\
.show()
Output:
+---+----------+-----+-----+----+---+-----+------------------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+----------+-----+-----+----+---+-----+------------------+
| 1|2020-01-01| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|2020-03-10| 24.0| 0.3| 0.7| 2| null| 14.354|
| 1|2020-05-21| 32.0| 0.4| 0.6| 3| null|21.412399999999998|
| 2|2020-01-03| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|2020-05-10|14.56|0.333|0.66| 2| null| 27.80328|
| 2|2020-09-30| 17.0| 0.66|0.34| 3| null| 20.6731152|
+---+----------+-----+-----+----+---+-----+------------------+
The idea is to use a Pandas dataframe for each group. This Pandas dataframe contains all values of the current partition and is ordered by date. During the iteration over the Pandas dataframe we can now access the value of EMA of the previous row (which is not possible for a Spark dataframe).
There are some caveats:
all rows of one partition should fit into the memory of a single executor. Partial aggregation is not possible here
iterating over a Pandas dataframe is discouraged

Pyspark remove strings (or imputate them) from numeric columns

I have some strings in a numeric columns. Like 1, 2, 3, 4, 'lol', 6 ...
I just wanna del this rows. How can I del them?
.cast did not return NaN. I wrote function, but it takes too much time (unreal), and it didn't work anyway...
from pyspark.sql.types import *
def is_digit(val):
try:
is_num = str(val).replace(".", "", 1).isdigit() if val else False
return is_num
except:
return False
is_digit_udf = F.udf(is_digit, BooleanType())
It's so stupid :(((

Use .rlike() function to filter only the rows that are not having numbers by specifying regex.
Example:
df.show()
+----+
| val|
+----+
| 1|
| 2|
| 3|
| lol|
| 6|
+----+
df.filter(col("val").rlike("[^a-zA-Z]")).show()
#or using [^\d] regex
df.filter(~col("val").rlike("[^\d]")).show()
#+---+
#|val|
#+---+
#| 1|
#| 2|
#| 3|
#| 6|
#+---+

As far as I understand, you want to delete the rows which contain not a numeric value, right? Then simply cast and filter out the values.
df.printSchema()
df.show(10, False)
root
|-- num: string (nullable = true)
+---+
|num|
+---+
|1 |
|2 |
|3 |
|lol|
|5 |
+---+
import pyspark.sql.functions as f
df.withColumn("num", f.col("num").cast("int")) \
.filter(f.col("num").isNotNull()) \
.show(10, False)
+---+
|num|
+---+
|1 |
|2 |
|3 |
|5 |
+---+

pyspark dataframe withColumn command not working

I have a input dataframe: df_input (updated df_input)
|comment|inp_col|inp_val|
|11 |a |a1 |
|12 |a |a2 |
|15 |b |b3 |
|16 |b |b4 |
|17 |c |&b |
|17 |c |c5 |
|17 |d |&c |
|17 |d |d6 |
|17 |e |&d |
|17 |e |e7 |
I want to replace the variable in inp_val column to its value. I have tried with the below code to create a new column.
Taken the list of values which starts with '&'
df_new = df_inp.select(inp_val).where(df.inp_val.substr(0, 1) == '&')
Now I'm iterating over the list to replace the '&' column value data to it original list.
for a in [row[inp_val] for row in df_new.collect()]
df_inp = df_inp.withColumn
(
'new_col',
when(df.inp_val.substr(0, 1) == '&',
[row[inp_val] for row in df.select(df.inp_val).where(df.inp_col == a[1:]).collect()])
.otherwise(df.inp_val)
)
But, I'm getting error as below:
Java.lang.RuntimeException: Unsupported literal tpe class java.util.ArrayList [[5], [6]]
Basically I want the output as below. Please check and let me know where is the error???.
I was thinking that two type of datatype values I'm trying to insert as per the above code??
Updated lines of code:
tst_1 = tst.withColumn("col3_extract", when(tst.col3.substr(0, 1) == '&', regexp_replace(tst.col3, "&", "")).otherwise(""))
# Select which values need to be replaced; withColumnRenamed will also solve spark self join issues
# The substring search can also be done using regex function
tst_filter=tst.where(~F.col('col3').contains('&')).withColumnRenamed('col2','col2_collect')
# For the selected data, perform a collect list
tst_clct = tst_filter.groupby('col2_collect').agg(F.collect_list('col3').alias('col3_collect'))
#%% Join the main table with the collected list
tst_join = tst_1.join(tst_clct,on=tst_1.col3_extract==tst_clct.col2_collect,how='left').drop('col2_collect')
#%% In the column3 replace the values such as a, b
tst_result = tst_join.withColumn("result",F.when(~F.col('col3').contains('&'),F.array(F.col('col3'))).otherwise(F.col('col3_collect')))
But, the above code doesn't work on the multiple iterations
Updated Expected Output:
|comment|inp_col|inp_val|new_col |
|11 |a |a1 |['a1'] |
|12 |a |a2 |['a2'] |
|15 |b |b3 |['b3'] |
|16 |b |b4 |['b4'] |
|17 |c |&b |['b3', 'b4'] |
|18 |c |c5 |['c5'] |
|19 |d |&c |['b3', 'b4', 'c5'] |
|20 |d |d6 |['d6'] |
|21 |e |&d |['b3', 'b4', 'c5', 'd6'] |
|22 |e |e7 |['e7'] |

Try this, self-join with collected list on rlike join condition is the way to go.
df.show() #sampledataframe
#+-------+---------+---------+
#|comment|input_col|input_val|
#+-------+---------+---------+
#| 11| a| 1|
#| 12| a| 2|
#| 15| b| 5|
#| 16| b| 6|
#| 17| c| &b|
#| 17| c| 7|
#+-------+---------+---------+
df.join(df.groupBy("input_col").agg(F.collect_list("input_val").alias("y1"))\
.withColumnRenamed("input_col","x1"),F.expr("""input_val rlike x1"""),'left')\
.withColumn("new_col", F.when(F.col("input_val").cast("int").isNotNull(), F.array("input_val"))\
.otherwise(F.col("y1"))).drop("x1","y1").show()
#+-------+---------+---------+-------+
#|comment|input_col|input_val|new_col|
#+-------+---------+---------+-------+
#| 11| a| 1| [1]|
#| 12| a| 2| [2]|
#| 15| b| 5| [5]|
#| 16| b| 6| [6]|
#| 17| c| &b| [5, 6]|
#| 17| c| 7| [7]|
#+-------+---------+---------+-------+

You can simply use regex_replace like this:
df.withColumn("new_col", regex_replace(col("inp_val"), "&", ""))

Can you tryout this solution. Your approach may run into whole lot of problems.
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window
#Test data
tst = sqlContext.createDataFrame([(1,'a','3'),(1,'a','4'),(1,'b','5'),(1,'b','7'),(2,'c','&b'),(2,'c','&a'),(2,'d','&b')],schema=['col1','col2','col3'])
# extract the special character out
tst_1 = tst.withColumn("col3_extract",F.substring(F.col('col3'),2,1))
# Selecct which values need to be replaced; withColumnRenamed will also solve spark self join issues
# The substring search can also be done using regex function
tst_filter=tst.where(~F.col('col3').contains('&')).withColumnRenamed('col2','col2_collect')
# For the selected data, perform a collect list
tst_clct = tst_filter.groupby('col2_collect').agg(F.collect_list('col3').alias('col3_collect'))
#%% Join the main table with the collected list
tst_join = tst_1.join(tst_clct,on=tst_1.col3_extract==tst_clct.col2_collect,how='left').drop('col2_collect')
#%% In the column3 replace the values such as a, b
tst_result = tst_join.withColumn("result",F.when(~F.col('col3').contains('&'),F.array(F.col('col3'))).otherwise(F.col('col3_collect')))
Results :
+----+----+----+------------+------------+------+
|col1|col2|col3|col3_extract|col3_collect|result|
+----+----+----+------------+------------+------+
| 2| c| &a| a| [3, 4]|[3, 4]|
| 2| c| &b| b| [7, 5]|[7, 5]|
| 2| d| &b| b| [7, 5]|[7, 5]|
| 1| a| 3| | null| [3]|
| 1| a| 4| | null| [4]|
| 1| b| 5| | null| [5]|
| 1| b| 7| | null| [7]|
+----+----+----+------------+------------+------+

How to concatenate to a null column in pyspark dataframe

I have a below dataframe and I wanted to update the rows dynamically with some values
input_frame.show()
+----------+----------+---------+
|student_id|name |timestamp|
+----------+----------+---------+
| s1|testuser | t1|
| s1|sampleuser| t2|
| s2|test123 | t1|
| s2|sample123 | t2|
+----------+----------+---------+
input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'),sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
I want to update the 'test' column with some values and apply the filter with partial matches on the column. But concatenating to null column resulting in a null column again. How can we do this?

use concat_ws, like this:
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")
# This won't work
df = df.withColumn("concat", concat(df.a, df.b))
# This won't work
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))
# Do it like this
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()
gives:
+----+----+------+-------------+---------+
| a| b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
| 1| 2| 12| 12| 12|
| 2|null| null| null| 2|
| 3| 4| 34| 34| 34|
| 4| 5| 45| 45| 45|
|null| 6| null| null| 6|
+----+----+------+-------------+---------+
Note specifically that casting a NULL column to string doesn't work as you wish, and will result in the entire row being NULL if any column is null.
There's no nice way of dealing with more complicated scenarios, but note that you can use a when statement in side a concat if you're willing to
suffer the verboseness of it, like this:
df.withColumn("concat_custom", concat(
when(df.a.isNull(), lit('_')).otherwise(df.a),
when(df.b.isNull(), lit('_')).otherwise(df.b))
)
To get, eg:
+----+----+-------------+
| a| b|concat_custom|
+----+----+-------------+
| 1| 2| 12|
| 2|null| 2_|
| 3| 4| 34|
| 4| 5| 45|
|null| 6| _6|
+----+----+-------------+

You can use the coalesce function, which returns first of its arguments which is not null, and provide a literal in the second place, which will be used in case the column has a null value.
df = df.withColumn("concat", concat(coalesce(df.a, lit('')), coalesce(df.b, lit(''))))

You can fill null values with empty strings:
import pyspark.sql.functions as f
from pyspark.sql.types import *
data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
data = data.withColumn('test', f.lit(None).cast(StringType()))
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))
Is that what you were looking for?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Merge duplicated records in pyspark dataframe - python

Related

Pyspark : Return all column names of max values

How to get Weighted Average for a column in pyspark

Pyspark remove strings (or imputate them) from numeric columns

pyspark dataframe withColumn command not working

How to concatenate to a null column in pyspark dataframe

Categories

Resources