Spark solution for the below problem (Transform, Pivot, CrossJoin) - Python

This is the input dataframe:
df1_input = spark.createDataFrame([
    ("P1", "A", "B", "C"),
    ("P1", "D", "E", "F"),
    ("P1", "G", "H", "I"),
    ("P1", "J", "K", "L")], ["Person", "L1B", "B2E", "J3A"])
df1_input.show()
+------+---+---+---+
|Person|L1B|B2E|J3A|
+------+---+---+---+
| P1| A| B| C|
| P1| D| E| F|
| P1| G| H| I|
| P1| J| K| L|
+------+---+---+---+
The dataframe below gives the corresponding descriptions:
df1_item_details = spark.createDataFrame([
    ("L1B", "item Desc1", "A", "Detail Desc1"),
    ("L1B", "item Desc1", "D", "Detail Desc2"),
    ("L1B", "item Desc1", "G", "Detail Desc3"),
    ("L1B", "item Desc1", "J", "Detail Desc4"),
    ("B2E", "item Desc2", "B", "Detail Desc5"),
    ("B2E", "item Desc2", "E", "Detail Desc6"),
    ("B2E", "item Desc2", "H", "Detail Desc7"),
    ("B2E", "item Desc2", "K", "Detail Desc8"),
    ("J3A", "item Desc3", "C", "Detail Desc9"),
    ("J3A", "item Desc3", "F", "Detail Desc10"),
    ("J3A", "item Desc3", "I", "Detail Desc11"),
    ("J3A", "item Desc3", "L", "Detail Desc12")], ["Item", "Item Desc", "Detail", "Detail Desc"])
df1_item_details.show()
+----+----------+------+-------------+
|Item| Item Desc|Detail| Detail Desc|
+----+----------+------+-------------+
| L1B|item Desc1| A| Detail Desc1|
| L1B|item Desc1| D| Detail Desc2|
| L1B|item Desc1| G| Detail Desc3|
| L1B|item Desc1| J| Detail Desc4|
| B2E|item Desc2| B| Detail Desc5|
| B2E|item Desc2| E| Detail Desc6|
| B2E|item Desc2| H| Detail Desc7|
| B2E|item Desc2| K| Detail Desc8|
| J3A|item Desc3| C| Detail Desc9|
| J3A|item Desc3| F|Detail Desc10|
| J3A|item Desc3| I|Detail Desc11|
| J3A|item Desc3| L|Detail Desc12|
+----+----------+------+-------------+
Below is some standard information that needs to be appended to every row of the final output:
df1_stdColumns = spark.createDataFrame([
    ("School", "BMM"),
    ("College", "MSRIT"),
    ("Workplace1", "Blr"),
    ("Workplace2", "Chn")], ["StdKey", "StdVal"])
df1_stdColumns.show()
+----------+------+
| StdKey|StdVal|
+----------+------+
| School| BMM|
| College| MSRIT|
|Workplace1| Blr|
|Workplace2| Chn|
+----------+------+
The expected output would look like below:
+--------+-----+---------------+-----+---------------+-----+----------------+--------+---------+------------+------------+
| Person | L1B | item Desc1    | B2E | item Desc2    | J3A | item Desc3     | School | College | Workplace1 | Workplace2 |
+--------+-----+---------------+-----+---------------+-----+----------------+--------+---------+------------+------------+
| P1     | A   | Detail Desc1  | B   | Detail Desc5  | C   | Detail Desc9   | BMM    | MSRIT   | Blr        | Chn        |
| P1     | D   | Detail Desc2  | E   | Detail Desc6  | F   | Detail Desc10  | BMM    | MSRIT   | Blr        | Chn        |
| P1     | G   | Detail Desc3  | H   | Detail Desc7  | I   | Detail Desc11  | BMM    | MSRIT   | Blr        | Chn        |
| P1     | J   | Detail Desc4  | K   | Detail Desc8  | L   | Detail Desc12  | BMM    | MSRIT   | Blr        | Chn        |
+--------+-----+---------------+-----+---------------+-----+----------------+--------+---------+------------+------------+
Could someone suggest an optimal Spark way of doing this? The input dataset size is in the millions, and the code I currently have runs for around 10 hours and is not optimal. I am looking for performant Spark (Python / Scala / SQL) code if possible.
Edit: below is the code that I have. It works, but takes forever to finish when the input volume is in the millions.
from pyspark.sql.functions import monotonically_increasing_id
import pyspark.sql.functions as F
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql.types import StructType, StructField, LongType
from pyspark.sql import DataFrame
from typing import Iterable
#Databricks runtime 7.3 on spark 3.0.1 which supports AQE
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "1")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "10000")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum","1")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "1")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "10KB")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "1B")
df1_input=df1_input.withColumn("RecordId", monotonically_increasing_id())
df1_input_2=df1_input
#Custom function to do transpose
def melt(df: DataFrame, id_vars: Iterable[str], value_vars: Iterable[str],
         var_name: str = "variable", value_name: str = "Value") -> DataFrame:
    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)
df9_ColsList=melt(df1_stdColumns, id_vars=['StdKey'], value_vars=df1_stdColumns.columns).filter("variable <>'StdKey'")
df9_ColsList=df9_ColsList.groupBy("variable").pivot("StdKey").agg(F.first("value")).drop("variable")
df1_input_2=melt(df1_input_2, id_vars=['RecordId','Person'], value_vars=df1_input_2.columns).filter("variable != 'Person'").filter("variable != 'RecordId'").withColumnRenamed('variable','Name')
df_prevStepInputItemDets=(df1_input_2.join(df1_item_details,(df1_input_2.Name == df1_item_details.Item) & (df1_input_2.Value == df1_item_details.Detail)))
# Since pivot performs better if the columns are known in advance, sacrificing a collect to get them.
# (Pivot without providing the column list was performing worse.)
CurrStagePivotCols_tmp = df1_item_details.select("Item Desc").rdd.flatMap(lambda x: x).collect()
# De-duplicate while preserving order
CurrStagePivotCols = list(dict.fromkeys(CurrStagePivotCols_tmp))
df_prevStepInputItemDets = (df_prevStepInputItemDets
    .groupBy('RecordId', "Person")
    .pivot("Item Desc", CurrStagePivotCols)
    # .pivot("Item Desc")
    .agg(F.first("Detail Desc"))).drop("RecordId")
#combine codes and descriptions
#Add rowNumber to both dataframes so that they can be merged side-by-side
def add_rowNum(sdf):
    new_schema = StructType(sdf.schema.fields + [StructField("RowNum", LongType(), False)])
    return sdf.rdd.zipWithIndex().map(lambda row: row[0] + (row[1],)).toDF(schema=new_schema)
ta = df1_input.alias('ta')
tb = df_prevStepInputItemDets.alias('tb')
ta = add_rowNum(ta)
tb = add_rowNum(tb)
df9_code_desc = tb.join(ta.drop("Katashiki"), on="RowNum",how='inner').drop("RowNum")
#CrossJoin to plaster standard columns
df9_final=df9_code_desc.crossJoin(df9_ColsList).drop("RecordId")
display(df9_final)
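One way to speed this up (a sketch, not a drop-in replacement): since df1_item_details and df1_stdColumns are tiny lookup tables compared to df1_input, they can be collected once and turned into literal maps, so each input row maps straight to an output row with no melt, no pivot on the large dataframe, and no zipWithIndex re-join. The sketch below assumes each (Item, Detail) pair maps to exactly one Detail Desc and that both lookup tables comfortably fit in driver memory; the helper names are purely illustrative, and a final select may be needed to match the exact column order of the expected output.
from itertools import chain
import pyspark.sql.functions as F
from pyspark.sql.functions import broadcast

item_cols = [c for c in df1_input.columns if c not in ("Person", "RecordId")]

# Collect the small detail lookup once and build per-item literal maps
details = df1_item_details.collect()
desc_of_item = {r["Item"]: r["Item Desc"] for r in details}  # e.g. "L1B" -> "item Desc1"
detail_map = {}
for r in details:
    detail_map.setdefault(r["Item"], {})[r["Detail"]] = r["Detail Desc"]

df = df1_input
for c in item_cols:
    # Literal map column: Detail value -> Detail Desc, looked up per row without a join
    m = F.create_map(*[F.lit(x) for x in chain(*detail_map[c].items())])
    df = df.withColumn(desc_of_item[c], m[F.col(c)])

# Pivot the standard key/value pairs into a single row and broadcast the cross join
std_row = df1_stdColumns.groupBy().pivot("StdKey").agg(F.first("StdVal"))
df_final = df.crossJoin(broadcast(std_row))
display(df_final)
Because everything here is either a narrow per-row expression or a broadcast of a one-row dataframe, the large input is never shuffled, which is usually what dominates a multi-hour runtime.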

Related

Create new Data frame from an existing one in pyspark

I created this dataframe with PySpark from a txt file that includes search queries and user IDs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .csv("/content/drive/MyDrive/my_data.txt")
df.select("AnonID", "Query").show()
And it looks like this:
+------+--------------------+
|AnonID| Query|
+------+--------------------+
| 142| rentdirect.com|
| 142|www.prescriptionf...|
| 142| staple.com|
| 142| staple.com|
| 142|www.newyorklawyer...|
| 142|www.newyorklawyer...|
| 142| westchester.gov|
| 142| space.comhttp|
| 142| dfdf|
| 142| dfdf|
| 142| vaniqa.comh|
| 142| www.collegeucla.edu|
| 142| www.elaorg|
| 142| 207 ad2d 530|
| 142| 207 ad2d 530|
| 142| broadway.vera.org|
| 142| broadway.vera.org|
| 142| vera.org|
| 142| broadway.vera.org|
| 142| frankmellace.com|
| 142| ucs.ljx.com|
| 142| attornyleslie.com|
| 142|merit release app...|
| 142| www.bonsai.wbff.org|
| 142| loislaw.com|
| 142| rapny.com|
| 142| whitepages.com|
| 217| lottery|
| 217| lottery|
| 217| ameriprise.com|
| 217| susheme|
| 217| united.com|
| 217| mizuno.com|
| 217|p; .; p;' p; ' ;'...|
| 217|p; .; p;' p; ' ;'...|
| 217|asiansexygoddess.com|
| 217| buddylis|
| 217|bestasiancompany.com|
| 217| lottery|
| 217| lottery|
| 217| ask.com|
| 217| weather.com|
| 217| wellsfargo.com|
| 217|www.tabiecummings...|
| 217| wanttickets.com|
| 217| yahoo.com|
| 217| -|
| 217| www.ngo-quen.org|
| 217| -|
| 217| vietnam|
+------+--------------------+
What I want is for each user ID to be a row and each query to be in its own column.
+------+------------+---------
|ID | 1 | 2 | 3 .......
+------+------------+---------
|142| query1|query2| query3
|217| query1|query2| query3
|993| query1|query2| query3
|1268| query1|query2| query3
|1326| query1|query2| query3
.
.
.
I tried to switch between rows and columns with the help of a search I did on Google, but I didn't succeed.
You can group the dataframe by AnonID, and then pivot the Query column to create new columns for each unique query:
import pyspark.sql.functions as F
df = df.groupBy("AnonID").pivot("Query").agg(F.first("Query"))
If you have a lot of distinct values, try collecting the queries into a list instead:
df = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Queries"))
You can then rename the columns to 1, 2, 3, etc.:
df = df.toDF("AnonID", *[str(i + 1) for i in range(len(df.columns) - 1)])
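If you go the collect_list route, here is a small sketch (assuming the maximum number of queries per user is modest enough to become columns) that expands the collected array into numbered columns; the helper names are illustrative:
import pyspark.sql.functions as F

grouped = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Queries"))
# The longest query list determines how many numbered columns are needed
max_len = grouped.agg(F.max(F.size("Queries"))).first()[0]
result = grouped.select(
    "AnonID",
    *[grouped["Queries"][i].alias(str(i + 1)) for i in range(max_len)]
)
result.show()
Users with fewer queries than max_len simply get nulls in the trailing columns.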

Date difference in years in PySpark dataframe

I come from a Pandas background and am new to Spark. I have a dataframe which has id, dob and age as columns. I want to get the age of the user from their dob (in some cases the age column is NULL).
+----+------+----------+
| id | age | dob |
+----+------+----------+
| 1 | 24 | NULL |
| 2 | 25 | NULL |
| 3 | NULL | 1/1/1973 |
| 4 | NULL | 6/6/1980 |
| 5 | 46 | |
| 6 | NULL | 1/1/1971 |
+----+------+----------+
I want a new column which will calculate age from dob and current date.
I tried this, but am not getting any results from it:
df.withColumn("diff",
              datediff(to_date(lit("01-06-2020")),
                       to_date(unix_timestamp('dob', "dd-MM-yyyy").cast("timestamp")))).show()
You need to compute the date difference and convert the result to years, something like this:
from pyspark.sql.functions import col, when, floor, datediff, current_date, to_date

df.withColumn('diff',
              when(col('age').isNull(),
                   floor(datediff(current_date(), to_date(col('dob'), 'M/d/yyyy')) / 365.25))
              .otherwise(col('age'))).show()
Which produces:
+---+----+--------+----+
| id| age| dob|diff|
+---+----+--------+----+
| 1| 24| null| 24|
| 2| 25| null| 25|
| 3|null|1/1/1973| 47|
| 4|null|6/6/1980| 39|
| 5| 46| null| 46|
| 6|null|1/1/1971| 49|
+---+----+--------+----+
It preserves the age column where not null and computes the difference (in days) between dob and today where age is null. The result is then converted to years (by dividing by 365.25; you may want to confirm this) then floored.
I believe it is more appropriate to use months_between when it comes to a difference in years; use datediff only if you need the difference in days.
Approach:
val data =
  """
    | id | age | dob
    | 1  | 24  |
    | 2  | 25  |
    | 3  |     | 1/1/1973
    | 4  |     | 6/6/1980
    | 5  | 46  |
    | 6  |     | 1/1/1971
  """.stripMargin
val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()
val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---+----+--------+
* |id |age |dob |
* +---+----+--------+
* |1 |24 |null |
* |2 |25 |null |
* |3 |null|1/1/1973|
* |4 |null|6/6/1980|
* |5 |46 |null |
* |6 |null|1/1/1971|
* +---+----+--------+
*
* root
* |-- id: integer (nullable = true)
* |-- age: integer (nullable = true)
* |-- dob: string (nullable = true)
*/
Find age
df.withColumn("diff",
  coalesce(col("age"),
    round(months_between(current_date(), to_date(col("dob"), "d/M/yyyy"), true).divide(12), 2)
  )
).show()
/**
* +---+----+--------+-----+
* | id| age| dob| diff|
* +---+----+--------+-----+
* | 1| 24| null| 24.0|
* | 2| 25| null| 25.0|
* | 3|null|1/1/1973|47.42|
* | 4|null|6/6/1980|39.99|
* | 5| 46| null| 46.0|
* | 6|null|1/1/1971|49.42|
* +---+----+--------+-----+
*/
Round to 0 decimal places if you want the age as a whole number.
Using months_between as in the answer above, but with a different approach:
in my table, I don't have an 'age' column yet;
for rounding down to full years I use .cast('int').
from pyspark.sql import functions as F

df = df.withColumn('age', (F.months_between(F.current_date(), F.col('dob')) / 12).cast('int'))
If the system date is in UTC and your locale is different, a separate date function may be needed:
from pyspark.sql import functions as F

def current_local_date():
    return F.from_utc_timestamp(F.current_timestamp(), 'Europe/Riga').cast('date')

df = df.withColumn('age', (F.months_between(current_local_date(), F.col('dob')) / 12).cast('int'))

Pyspark merge 2 dataframes without losing data

I'm looking to join 2 PySpark dataframes without losing any data. The easiest way to explain is to show an example. I might even count them together and sort. If there is a null in the desktop or phone column, it should equal 0 in the output.
I tried:
desktop_df.join(phone_df, on='query') \
    .fillna(0).orderBy("desktop", ascending=False) \
    .show(20)
(doesn't have the total column yet, so I'm ordering it by the desktop count)
But this approach doesn't seem to be working; it doesn't show zeros at all.
desktop_df:
query |desktop|
----------------
query1 | 12 |
----------------
query2 | 23 |
----------------
query3 | 8 |
----------------
query4 | 11 |
----------------
query6 | 45 |
----------------
query9 | 89 |
phone_df:
query | phone |
----------------
query1 | 21 |
----------------
query2 | 33 |
----------------
query4 | 11 |
----------------
query5 | 55 |
----------------
query6 | 45 |
----------------
query7 | 1234 |
----------------
query8 | 4321 |
----------------
query10| 10 |
----------------
query11| 1 |
Output I'm looking for:
query | desktop| phone | total |
--------------------------------
query8 | 0 | 4321 | 4321 |
--------------------------------
query7 | 0 | 1234 | 1234 |
--------------------------------
query6 | 45 | 45 | 90 |
--------------------------------
query9 | 89 | 0 | 89 |
--------------------------------
query2 | 23 | 33 | 56 |
--------------------------------
query5 | 0 | 55 | 55 |
--------------------------------
query1 | 12 | 21 | 33 |
--------------------------------
query4 | 11 | 11 | 22 |
--------------------------------
query10| 0 | 10 | 10 |
--------------------------------
query3 | 8 | 0 | 8 |
--------------------------------
query11| 0 | 1 | 1 |
Solutions that produce the wanted output:
from pyspark.sql.functions import col

df = desktop_df.join(phone_df, on=["query"], how='fullouter') \
    .fillna(0) \
    .withColumn("total", col("desktop") + col("phone"))
df.show(200)
or
from pyspark.sql.functions import lit
from pyspark.sql.functions import col
from pyspark.sql.functions import max

desktop_df = df.filter("hwType == 'DESKTOP'").groupby("query").count().orderBy("count", ascending=False).withColumnRenamed('count', 'desktop')
phone_df = df.filter("hwType == 'PHONE'").groupby("query").count().orderBy("count", ascending=False).withColumnRenamed('count', 'phone')
# add missing column to each dataframe
desktop_df = desktop_df.withColumn('phone', lit(0)).select('query', 'desktop', 'phone')
phone_df = phone_df.withColumn('desktop', lit(0)).select('query', 'desktop', 'phone')
# union all and agg to select max value
phone_df.unionAll(desktop_df) \
    .groupBy('query').agg(max(col('desktop')).alias('desktop'), max(col('phone')).alias('phone')) \
    .withColumn('total', col('desktop') + col('phone')) \
    .orderBy(col('total').desc()) \
    .show()
Maybe try a full join on the query column, and find the total by adding the column values.
df = desktop_df.join(phone_df, desktop_df.query == phone_df.query, "full") \
    .select(desktop_df.query, "count1", "count2") \
    .fillna(0) \
    .withColumn("total", col("count1") + col("count2"))
You can do that using unionAll and then groupBy.
Example:
from pyspark.sql.functions import lit, col, max

desktop_data = [("query1", 12), ("query2", 23), ("query3", 8),
                ("query4", 11), ("query6", 45), ("query9", 89)]
phone_data = [("query1", 21), ("query2", 33), ("query4", 11), ("query5", 55), ("query6", 45),
              ("query7", 1234), ("query8", 4321), ("query10", 10), ("query11", 1)]
desktop_df = spark.createDataFrame(desktop_data, ['query', 'count1'])
phone_df = spark.createDataFrame(phone_data, ['query', 'count2'])
# add missing column to each dataframe
desktop_df = desktop_df.withColumn('count2', lit(0)).select('query', 'count1', 'count2')
phone_df = phone_df.withColumn('count1', lit(0)).select('query', 'count1', 'count2')
# union all and agg to select max value
phone_df.unionAll(desktop_df) \
    .groupBy('query').agg(max(col('count1')).alias('count1'), max(col('count2')).alias('count2')) \
    .withColumn('total', col('count1') + col('count2')) \
    .orderBy(col('total').desc()) \
    .show()
+-------+------+------+-----+
| query|count1|count2|total|
+-------+------+------+-----+
| query8| 0| 4321| 4321|
| query7| 0| 1234| 1234|
| query6| 45| 45| 90|
| query9| 89| 0| 89|
| query2| 23| 33| 56|
| query5| 0| 55| 55|
| query1| 12| 21| 33|
| query4| 11| 11| 22|
|query10| 0| 10| 10|
| query3| 8| 0| 8|
|query11| 0| 1| 1|
+-------+------+------+-----+

Find column names of interconnected row values - Spark

I have a Spark dataframe that adheres to the following structure:
+------+-----------+-----------+-----------+------+
|ID | Name1 | Name2 | Name3 | Y |
+------+-----------+-----------+-----------+------+
| 1 | A,1 | B,1 | C,4 | B |
| 2 | D,2 | E,2 | F,8 | D |
| 3 | G,5 | H,2 | I,3 | H |
+------+-----------+-----------+-----------+------+
For every row I want to find the column in which the value of Y appears as the first element. So, ideally, I want to retrieve a list like: [Name2, Name1, Name2].
I am not sure whether it would work to first convert to an RDD, then use a map function, and convert the result back to a DataFrame.
Any ideas are welcome.
You can probably try this piece of code:
df.show()
+---+-----+-----+-----+---+
| ID|Name1|Name2|Name3| Y|
+---+-----+-----+-----+---+
| 1| A,1| B,1| C,4| B|
| 2| D,2| E,2| F,8| D|
| 3| G,5| H,2| I,3| H|
+---+-----+-----+-----+---+
from pyspark.sql import functions as F

name_cols = ["Name1", "Name2", "Name3"]
# Build a chained when() expression, one branch per name column
cond = F
for col in name_cols:
    cond = cond.when(F.split(F.col(col), ',').getItem(0) == F.col("Y"), col)
df.withColumn("whichName", cond).show()
+---+-----+-----+-----+---+---------+
| ID|Name1|Name2|Name3| Y|whichName|
+---+-----+-----+-----+---+---------+
| 1| A,1| B,1| C,4| B| Name2|
| 2| D,2| E,2| F,8| D| Name1|
| 3| G,5| H,2| I,3| H| Name2|
+---+-----+-----+-----+---+---------+

PySpark Dataframe identify distinct value on one column based on duplicate values in other columns

I have a PySpark dataframe like the one below, where c1, c2, c3, c4, c5, c6 are the columns:
+----------------------------+
|c1 | c2 | c3 | c4 | c5 | c6 |
|----------------------------|
| a | x | y | z | g | h |
| b | m | f | l | n | o |
| c | x | y | z | g | h |
| d | m | f | l | n | o |
| e | x | y | z | g | i |
+----------------------------+
I want to extract the c1 values of rows which have the same c2, c3, c4, c5 values but different c1 values.
For example, the 1st, 3rd & 5th rows have the same values for c2, c3, c4 & c5 but different c1 values, so the output should be a, c & e.
(update)
Similarly, the 2nd & 4th rows have the same values for c2, c3, c4 & c5 but different c1 values, so the output should also contain b & d.
How can I obtain such result ? I have tried applying groupby but I don't understand how to obtain distinct values for c1.
UPDATE:
Output should be a Dataframe of c1 values
# +-------+
# |c1_dups|
# +-------+
# |  a,c,e|
# |    b,d|
# +-------+
My Approach:
m = data.groupBy('c2','c3','c4','c5')
but I don't understand how to retrieve the values in m. I'm new to PySpark dataframes, hence very confused.
This is actually very simple; let's create some data first:
schema = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6']
rdd = sc.parallelize(["a,x,y,z,g,h", "b,x,y,z,l,h", "c,x,y,z,g,h", "d,x,f,y,g,i", "e,x,y,z,g,i"]) \
    .map(lambda x: x.split(","))
df = sqlContext.createDataFrame(rdd, schema)
# +---+---+---+---+---+---+
# | c1| c2| c3| c4| c5| c6|
# +---+---+---+---+---+---+
# | a| x| y| z| g| h|
# | b| x| y| z| l| h|
# | c| x| y| z| g| h|
# | d| x| f| y| g| i|
# | e| x| y| z| g| i|
# +---+---+---+---+---+---+
Now the fun part: you just need to import some functions, then group by and explode as follows:
from pyspark.sql.functions import *

dupes = (df.groupBy('c2', 'c3', 'c4', 'c5')
         # collect the c1 values as a list and count them at the same time
         .agg(collect_list('c1').alias("c1s"), count('c1').alias("count"))
         # keep only the groups that actually have duplicates
         .filter(col('count') > 1))
df2 = dupes.select(explode("c1s").alias("c1_dups"))
df2.show()
# +-------+
# |c1_dups|
# +-------+
# | a|
# | c|
# | e|
# +-------+
I hope this answers your question.
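If you want the comma-joined format shown in the UPDATE (one row per duplicate group rather than one row per c1 value), a small follow-up sketch on the same dupes dataframe could look like this; it just joins the collected list into a single string:
from pyspark.sql.functions import concat_ws

# One row per duplicate group, with the c1 values joined by commas
df3 = dupes.select(concat_ws(",", "c1s").alias("c1_dups"))
df3.show()
# +-------+
# |c1_dups|
# +-------+
# |  a,c,e|
# +-------+
(With the sample data used in this answer only one group has duplicates, so only a,c,e appears; with the data from the question you would also get a b,d row.)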
