I have two dataframes which look like this:
df1(20M rows):
+-----------------+--------------------+--------+----------+
|               id|        geolocations|     lat|      long|
+-----------------+--------------------+--------+----------+
|4T1BF1FK1HU376566|kkxyDbypwQ????uGs...|   30.60|    -98.39|
|4T1BF1FK1HU376566|i~nyD~~xvQA??????...|   30.55|    -98.27|
|4T1BF1FK1HU376566|}etyDzqxvQb#Sy#zB...|   30.58|    -98.27|
|JTNB11HK6J3000405|kkxyDbypwQ????uGs...|   30.60|    -98.39|
|JTNB11HK6J3000405|i~nyD~~xvQA??????...|   30.55|    -98.27|
df2(50 rows):
+---------+-----------+--------------------+
| lat| long| state|
+---------+-----------+--------------------+
|63.588753|-154.493062| Alaska|
|32.318231| -86.902298| Alabama|
| 35.20105| -91.831833| Arkansas|
|34.048928|-111.093731| Arizona|
I want to add a new column 'state' to df1 by comparing the lat-long values in df1 and df2. Joining these dataframes on exact lat-long values returns zero records, so I am using a threshold and performing the join with it:
threshold = F.lit(3)

def lat_long_approximation(col1, col2, threshold):
    return F.abs(col1 - col2) < threshold

df3 = df1.join(F.broadcast(df2),
               lat_long_approximation(df1.lat, df2.lat, threshold) &
               lat_long_approximation(df1.long, df2.long, threshold))
This is taking a long time. Can anyone help me optimise this join, or suggest a better approach that avoids the separate function (lat_long_approximation)?
You can use between. I am not sure about the performance.
threshold = 10  # for test

df1.join(F.broadcast(df2),
         df1.lat.between(df2.lat - threshold, df2.lat + threshold) &
         df1.long.between(df2.long - threshold, df2.long + threshold),
         "left").show()
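If you also want each row to end up with exactly one state (the nearest one) rather than every state within the threshold, one possible follow-up, which is not part of the original answer, is to rank the matched states per row and keep the first. This is only a sketch; it assumes a simple squared-degree distance is good enough for ranking:

from pyspark.sql import functions as F, Window

threshold = 3
states = df2.withColumnRenamed("lat", "s_lat").withColumnRenamed("long", "s_long")

candidates = df1.join(
    F.broadcast(states),
    (F.abs(df1.lat - states.s_lat) < threshold) &
    (F.abs(df1.long - states.s_long) < threshold),
    "left")

# rank the candidate states for each df1 row by squared degree distance and keep the closest
dist = (F.col("lat") - F.col("s_lat")) ** 2 + (F.col("long") - F.col("s_long")) ** 2
w = Window.partitionBy("id", "geolocations", "lat", "long").orderBy(dist)

df3 = (candidates
       .withColumn("rn", F.row_number().over(w))
       .filter("rn = 1")
       .drop("rn", "s_lat", "s_long"))

Broadcasting df2 keeps this cheap, since it only has 50 rows.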
I have two dataframes that are essentially the same, but come from two different sources. In my first dataframe the p_user_id field is a longType, the date_of_birth field is a dateType, and the rest of the fields are stringType. In my second dataframe everything is stringType. I first check the row count for both dataframes based on p_user_id (that is my unique identifier).
DF1:
+--------------+
|test1_racounts|
+--------------+
| 418895|
+--------------+
DF2:
+---------+
|d_tst_rac|
+---------+
| 418915|
+---------+
Then if there is a difference in the row count I run a check on which p_user_id values are in one dataframe and not the other.
p_user_tst_rac.subtract(rac_p_user_df).show(100, truncate=0)
Gives me this result:
+---------+
|p_user_id|
+---------+
|661520 |
|661513 |
|661505 |
|661461 |
|661501 |
|661476 |
|661478 |
|661468 |
|661479 |
|661464 |
|661467 |
|661474 |
|661484 |
|661495 |
|661499 |
|661486 |
|661502 |
|661506 |
|661517 |
+---------+
My issue comes into play when I try to pull the rest of the corresponding fields for the difference. I want those fields so that I can do a manual search in the DB and application to see if something was overlooked. When I add the rest of the columns, my result grows to far more than the 20-row difference. What is a better way to run the match and get the corresponding data?
Full code scope:
#racs in mysql
my_rac = spark.read.parquet("/Users/mysql.parquet")
my_rac.printSchema()
my_rac.createOrReplaceTempView('my_rac')
d_rac = spark.sql('''select distinct * from my_rac''')
d_rac.createOrReplaceTempView('d_rac')
spark.sql('''select count(*) as test1_racounts from d_rac''').show()
rac_p_user_df = spark.sql('''select
cast(p_user_id as string) as p_user_id
, record_id
, contact_last_name
, contact_first_name
from d_rac''')
#mssql_rac
sql_rac = spark.read.csv("/Users/mzn293/Downloads/kavi-20211116.csv")
#sql_rac.printSchema()
sql_rac.createOrReplaceTempView('sql_rac')
d_sql_rac = spark.sql('''select distinct
_c0 as p_user_id
, _c1 as record_id
, _c4 as contact_last_name
, _c5 as contact_first_name
from sql_rac''')
d_sql_rac.createOrReplaceTempView('d_sql_rac')
spark.sql('''select count(*) as d_tst_rac from d_sql_rac''').show()
dist_sql_rac = spark.sql('''select * from d_sql_rac''')
dist_sql_rac.subtract(rac_p_user_df).show(100, truncate=0)
With this I get far more than a 20-row difference. Furthermore, I feel there is a better way to get my result, but I'm not sure what I'm missing to get the data for just those 20 rows instead of 100-plus rows.
The easiest way in this case is to use an anti join.
df_diff = df1.join(df2, df1.p_user_id == df2.p_user_id, "leftanti")
This will give you all records that exist in df1 but have no matching record in df2.
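Because a left anti join keeps every column of the left dataframe, you can also pull the remaining fields for the mismatched rows directly. A rough sketch reusing the dataframe names from the question (it assumes both frames carry the full column set and that p_user_id is cast to the same type on both sides):

# rows present in the MySQL extract but missing from the MSSQL extract
missing_from_sql = rac_p_user_df.join(dist_sql_rac, on="p_user_id", how="leftanti")

# rows present in the MSSQL extract but missing from the MySQL extract
missing_from_mysql = dist_sql_rac.join(rac_p_user_df, on="p_user_id", how="leftanti")

missing_from_sql.select("p_user_id", "record_id", "contact_last_name",
                        "contact_first_name").show(100, truncate=False)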
I have some data in two tables: one table is a list of dates (with other fields) running from 1st Jan 2014 until yesterday; the other table contains a year's worth of numeric data (coefficients / metric data) for 2020.
A left join between the two datasets, keyed on the date table, brings back all the dates, with metric values populated only for 2020 and null for the rest, as expected.
What I want to do is populate the history back to 2014 (and into the future) with the 2020 data, using a -364 day mapping.
For example
#+----------+-----------+
#|date |metric |
#+----------+-----------+
#|03/02/2018|null |
#|04/02/2018|null |
#|05/02/2018|null |
#|06/02/2018|null |
#|07/02/2018|null |
#|08/02/2018|null |
#|09/02/2018|null |
#|10/02/2018|null |
#|.... | |
#|02/02/2019|null |
#|03/02/2019|null |
#|04/02/2019|null |
#|05/02/2019|null |
#|06/02/2019|null |
#|07/02/2019|null |
#|08/02/2019|null |
#|09/02/2019|null |
#|... |... |
#|01/02/2020|0.071957531|
#|02/02/2020|0.086542975|
#|03/02/2020|0.023767137|
#|04/02/2020|0.109725808|
#|05/02/2020|0.005774458|
#|06/02/2020|0.056242301|
#|07/02/2020|0.086208715|
#|08/02/2020|0.010676928|
This is what I am trying to achieve:
#+----------+-----------+
#|date |metric |
#+----------+-----------+
#|03/02/2018|0.071957531|
#|04/02/2018|0.086542975|
#|05/02/2018|0.023767137|
#|06/02/2018|0.109725808|
#|07/02/2018|0.005774458|
#|08/02/2018|0.056242301|
#|09/02/2018|0.086208715|
#|10/02/2018|0.010676928|
#|.... | |
#|02/02/2019|0.071957531|
#|03/02/2019|0.086542975|
#|04/02/2019|0.023767137|
#|05/02/2019|0.109725808|
#|06/02/2019|0.005774458|
#|07/02/2019|0.056242301|
#|08/02/2019|0.086208715|
#|09/02/2019|0.010676928|
#|... |... |
#|01/02/2020|0.071957531|
#|02/02/2020|0.086542975|
#|03/02/2020|0.023767137|
#|04/02/2020|0.109725808|
#|05/02/2020|0.005774458|
#|06/02/2020|0.056242301|
#|07/02/2020|0.086208715|
#|08/02/2020|0.010676928|
Worth noting that I may eventually have to go back further than 2014, so a dynamic approach to the population would help!
I'm doing this in Databricks, so I can use various languages, but I wanted to focus on Python / PySpark / SQL for solutions.
Any help would be appreciated.
Thanks.
CT
First, create new month and year columns:
df_with_month = df.withColumn("month", f.month(f.to_timestamp("date", "dd/MM/yyyy"))) \
                  .withColumn("year", f.year(f.to_timestamp("date", "dd/MM/yyyy")))
with import pyspark.sql.functions as f
Create a new DataFrame with 2020's data:
df_2020 = df_with_month.filter(f.col("year") == 2020) \
                       .withColumnRenamed("metric", "new_metric")
Join the results on the month:
df_with_metrics = df_with_month.join(df_2020, df_with_month.month == df_2020.month, "left") \
                               .drop("metric") \
                               .withColumnRenamed("new_metric", "metric")
You can do a self join using the condition that the date difference is a multiple of 364 days:
import pyspark.sql.functions as F
df2 = df.join(
df.toDF('date2', 'metric2'),
F.expr("""
datediff(to_date(date, 'dd/MM/yyyy'), to_date(date2, 'dd/MM/yyyy')) % 364 = 0
and
to_date(date, 'dd/MM/yyyy') <= to_date(date2, 'dd/MM/yyyy')
""")
).select(
'date',
F.coalesce('metric', 'metric2').alias('metric')
).filter('metric is not null')
df2.show(999)
+----------+-----------+
| date| metric|
+----------+-----------+
|03/02/2018|0.071957531|
|04/02/2018|0.086542975|
|05/02/2018|0.023767137|
|06/02/2018|0.109725808|
|07/02/2018|0.005774458|
|08/02/2018|0.056242301|
|09/02/2018|0.086208715|
|10/02/2018|0.010676928|
|02/02/2019|0.071957531|
|03/02/2019|0.086542975|
|04/02/2019|0.023767137|
|05/02/2019|0.109725808|
|06/02/2019|0.005774458|
|07/02/2019|0.056242301|
|08/02/2019|0.086208715|
|09/02/2019|0.010676928|
|01/02/2020|0.071957531|
|02/02/2020|0.086542975|
|03/02/2020|0.023767137|
|04/02/2020|0.109725808|
|05/02/2020|0.005774458|
|06/02/2020|0.056242301|
|07/02/2020|0.086208715|
|08/02/2020|0.010676928|
+----------+-----------+
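Since this is a non-equi self join, Spark compares every pair of rows. If the date table grows large, one possible refinement (just a sketch, under the assumption that only 2020 actually carries metric values) is to restrict and broadcast the right-hand side before joining:

import pyspark.sql.functions as F

# keep only the 2020 rows on the right-hand side of the join
df_2020 = df.filter(F.year(F.to_date('date', 'dd/MM/yyyy')) == 2020).toDF('date2', 'metric2')

df2 = df.join(
    F.broadcast(df_2020),
    F.expr("""
        datediff(to_date(date2, 'dd/MM/yyyy'), to_date(date, 'dd/MM/yyyy')) % 364 = 0
        and to_date(date, 'dd/MM/yyyy') <= to_date(date2, 'dd/MM/yyyy')
    """)
).select('date', F.coalesce('metric', 'metric2').alias('metric'))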
First you can add the timestamp column:
df = df.select(F.to_timestamp("date", "dd/MM/yyyy").alias('ts'), '*')
Then you can join on equal month and day:
cond = [F.dayofmonth(F.col('left.ts')) == F.dayofmonth(F.col('right.ts')),
F.month(F.col('left.ts')) == F.month(F.col('right.ts'))]
df.select('ts', 'date').alias('left').\
join(df.filter(F.year('ts')==2020).select('ts', 'metric').alias('right'), cond)\
.orderBy(F.col('left.ts')).drop('ts').show()
I have a pyspark dataframe df.
I want to apply the following formula to it, partitioned by type:
revised W(t) = current of W(t) * 2 + revised of W(t-1) * 3
For the first row, since there is no previous revised value, it will be
revised W1 (week) = current of W1 * 2 + 0
For the remaining weeks, for example,
revised W2 (week) = current of W2 * 2 + revised of W1 * 3
Expected output:
How do we do this in Spark or SQL? Can we use Window.currentRow and Window.unboundedPreceding along with a window function, or do we need to write a custom udf?
If it is not possible in pyspark, do we resort to pandas or loops? Please suggest.
The formula for the revised column can be rewritten in closed form as
revised W(t) = 2*current W(t) + 2*3*current W(t-1) + 2*3^2*current W(t-2) + ... + 2*3^(t-1)*current W(1)
This formula can be evaluated with the help of a window function and a udf.
from pyspark.sql import functions as F, types as T
from pyspark.sql import Window

# define a window that contains all rows for a type, ordered by week
w = Window.partitionBy(df.type).orderBy(df["week"].asc()) \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
#collect all values of past "current" values into an array
df = df.withColumn("tmp", F.collect_list(df["current"]).over(w))
#and reverse the array
df = df.withColumn("tmp", F.reverse(df["tmp"]))
# define a udf that applies the formula above to the array
calc = F.udf(lambda a: 2*sum([int(value) * (3 ** (index)) \
for index, value in enumerate(a)]), T.LongType())
#run the calculation and drop the intermediate array
df = df.withColumn("revised", calc(df["tmp"])).drop("tmp")
df.show(truncate=False)
prints
+--------+----+-------+-------+
|type |week|current|revised|
+--------+----+-------+-------+
|COMPUTER|w1 |100 |200 |
|COMPUTER|w2 |200 |1000 |
|COMPUTER|w3 |300 |3600 |
|COMPUTER|w4 |400 |11600 |
|COMPUTER|w5 |500 |35800 |
|SYSTEM |w1 |120 |240 |
|SYSTEM |w2 |150 |1020 |
|SYSTEM |w3 |250 |3560 |
|SYSTEM |w4 |450 |11580 |
|SYSTEM |w5 |500 |35740 |
+--------+----+-------+-------+
Here is a way without a udf. It is a bit convoluted, but you can try it.
First, I calculate an index i starting from zero and collect it as a list up to the current row. Second, I gather the current values into a list in the same way.
The key point is that the index array is not in the order I want, so to put it in descending order I use the array_sort function with a custom comparator.
After that, I zip the two arrays into an array of structs with arrays_zip and, following the closed-form formula from @werner's answer, aggregate the revised value.
I am a bit worried that with more rows the arrays could get long and cause memory problems, but at this sample size it works.
from pyspark.sql import Window
from pyspark.sql.functions import arrays_zip, collect_list, expr, row_number

w = Window.partitionBy('type').orderBy('week')
df2 = df.withColumn('i', collect_list(row_number().over(w) - 1).over(w)) \
.withColumn('i', expr('array_sort(i, (left, right) -> case when left < right then 1 when left > right then -1 else 0 end)')) \
.withColumn('w', collect_list('current').over(w)) \
.withColumn('array', arrays_zip('i', 'w')) \
.withColumn('revised', expr('aggregate(array, 0D, (acc, x) -> acc + x.w * 2 * power(3, x.i))')) \
.select(*df.columns, 'revised')
df2.show()
+--------+----+-------+-------+
| type|week|current|revised|
+--------+----+-------+-------+
|COMPUTER| w1| 100.0| 200.0|
|COMPUTER| w2| 200.0| 1000.0|
|COMPUTER| w3| 300.0| 3600.0|
|COMPUTER| w4| 400.0|11600.0|
|COMPUTER| w5| 500.0|35800.0|
| SYSTEM| w1| 120.0| 240.0|
| SYSTEM| w2| 150.0| 1020.0|
| SYSTEM| w3| 250.0| 3560.0|
| SYSTEM| w4| 450.0|11580.0|
| SYSTEM| w5| 500.0|35740.0|
+--------+----+-------+-------+
For each set of coordinates in a pyspark dataframe, I need to find the closest set of coordinates in another dataframe.
I have one pyspark dataframe with coordinate data like so (dataframe a):
+------------------+-------------------+
| latitude_deg| longitude_deg|
+------------------+-------------------+
| 40.07080078125| -74.93360137939453|
| 38.704022| -101.473911|
| 59.94919968| -151.695999146|
| 34.86479949951172| -86.77030181884766|
| 35.6087| -91.254898|
| 34.9428028| -97.8180194|
And another like so (dataframe b); only a few rows are shown for clarity:
+-----+------------------+-------------------+
|ident| latitude_deg| longitude_deg|
+-----+------------------+-------------------+
| 00A| 30.07080078125| -24.93360137939453|
| 00AA| 56.704022| -120.473911|
| 00AK| 18.94919968| -109.695999146|
| 00AL| 76.86479949951172| -67.77030181884766|
| 00AR| 10.6087| -87.254898|
| 00AS| 23.9428028| -10.8180194|
Is it possible to somehow merge the dataframes so that the result has the closest ident from dataframe b for each row in dataframe a:
+------------------+-------------------+-------------+
| latitude_deg| longitude_deg|closest_ident|
+------------------+-------------------+-------------+
| 40.07080078125| -74.93360137939453| 12A|
| 38.704022| -101.473911| 14BC|
| 59.94919968| -151.695999146| 278A|
| 34.86479949951172| -86.77030181884766| 56GH|
| 35.6087| -91.254898| 09HJ|
| 34.9428028| -97.8180194| 09BV|
What I have tried so far:
I have defined a pyspark UDF to calculate the haversine distance between two pairs of coordinates.
udf_get_distance = F.udf(get_distance)
It works like this:
df = (df.withColumn("ABS_DISTANCE", udf_get_distance(
    df.latitude_deg_a, df.longitude_deg_a,
    df.latitude_deg_b, df.longitude_deg_b)
))
I'd appreciate any kind of help. Thanks so much
You need to do a crossJoin first, something like this:
joined_df = source_df1.crossJoin(source_df2)
Then you can call your udf as you mentioned, generate a row number based on the distance, and keep the closest row:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

rwindow = Window.partitionBy("latitude_deg_a", "longitude_deg_a").orderBy("ABS_DISTANCE")
udf_result_df = joined_df.withColumn("ABS_DISTANCE", udf_get_distance(
    joined_df.latitude_deg_a, joined_df.longitude_deg_a,
    joined_df.latitude_deg_b, joined_df.longitude_deg_b)) \
    .withColumn("rownum", row_number().over(rwindow)) \
    .filter("rownum = 1")
Note: add a return type to your udf.
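For reference, get_distance itself is not shown in the question; a minimal haversine implementation that such a udf might wrap, registered with an explicit return type, could look roughly like this (the argument order and kilometre units are assumptions):

import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def get_distance(lat_a, lon_a, lat_b, lon_b):
    # haversine great-circle distance in kilometres
    r = 6371.0
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = math.radians(lat_b - lat_a)
    d_lambda = math.radians(lon_b - lon_a)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

udf_get_distance = F.udf(get_distance, DoubleType())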
I'm trying to concatenate two dataframes and write the resulting dataframe to an Excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what I'm doing wrong. I thought providing the index=False argument at every Excel call would eliminate the issue, but it has not.
Hopefully you can see the image, if not please let me know.
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
As the other users also wrote, I don't see the index in your image either, because in that case you would have an output like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
If you pass the index=False option, the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
| | |
which looks like your case. Your problem could be related to the concatenation and the transposed frame.
Did you check your temporary dataframe before exporting it?
You might want to check whether pandas imports the time column as a time index.
If you want to delete those time columns, you could use df.drop and pass a list of columns to it, e.g. df.drop(columns=df.columns[:3]). Does this maybe solve your problem?
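To make that concrete, here is a small self-contained sketch (the frames are made up) showing why mismatched row labels during concat can look like an extra index being appended, and how resetting the index avoids it:

import pandas as pd

# two small frames standing in for df and the transposed test frame
df = pd.DataFrame({"A": [1, 2, 3]}, index=[10, 11, 12])
test = pd.DataFrame({"B": ["x", "y", "z"]})

# with mismatched indexes, concat(axis=1) aligns on the row labels and
# produces extra NaN rows, which then show up in the Excel output
bad = pd.concat([df, test], axis=1)                            # 6 rows

# resetting the index first keeps the rows side by side
good = pd.concat([df.reset_index(drop=True), test], axis=1)    # 3 rows

good.to_excel("c.xlsx", index=False)  # index=False drops the row labels from the file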