Pyspark merge 2 dataframes without losing data - python

I'm looking to join 2 PySpark DataFrames without losing any data. The easiest way to explain is with an example. I might also count them together and sort; if a value is null in the desktop or phone column, it should become 0 in the output.
I tried:
desktop_df.join(phone_df, on='query')\
.fillna(0).orderBy("desktop", ascending=False)\
.show(20)
(It doesn't have the total column yet, so I'm ordering by desktop.)
But this approach doesn't seem to be working: it doesn't show any zeros at all.
desktop_df:
query  | desktop
-------+--------
query1 | 12
query2 | 23
query3 | 8
query4 | 11
query6 | 45
query9 | 89
phone_df:
query   | phone
--------+------
query1  | 21
query2  | 33
query4  | 11
query5  | 55
query6  | 45
query7  | 1234
query8  | 4321
query10 | 10
query11 | 1
Output I'm looking for:
query   | desktop | phone | total
--------+---------+-------+------
query8  | 0       | 4321  | 4321
query7  | 0       | 1234  | 1234
query6  | 45      | 45    | 90
query9  | 89      | 0     | 89
query2  | 23      | 33    | 56
query5  | 0       | 55    | 55
query1  | 12      | 21    | 33
query4  | 11      | 11    | 22
query10 | 0       | 10    | 10
query3  | 8       | 0     | 8
query11 | 0       | 1     | 1
Working solutions:
from pyspark.sql.functions import col

df = desktop_df.join(phone_df, on=["query"], how='fullouter') \
    .fillna(0) \
    .withColumn("total", col("desktop") + col("phone"))
df.show(200)
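If you also want the rows ordered by the new total column, as in the expected output, sort before showing; a small variation on the same join (same DataFrames and column names assumed):

from pyspark.sql.functions import col

# full outer join keeps queries that appear in only one of the DataFrames
result = (desktop_df.join(phone_df, on=["query"], how='fullouter')
          .fillna(0)                                          # missing desktop/phone counts become 0
          .withColumn("total", col("desktop") + col("phone"))
          .orderBy(col("total").desc()))
result.show(200)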
or
from pyspark.sql.functions import lit, col, max

desktop_df = df.filter("hwType == 'DESKTOP'").groupby("query").count() \
    .orderBy("count", ascending=False).withColumnRenamed('count', 'desktop')
phone_df = df.filter("hwType == 'PHONE'").groupby("query").count() \
    .orderBy("count", ascending=False).withColumnRenamed('count', 'phone')

# add the missing column to each dataframe
desktop_df = desktop_df.withColumn('phone', lit(0)).select('query', 'desktop', 'phone')
phone_df = phone_df.withColumn('desktop', lit(0)).select('query', 'desktop', 'phone')

# union both and aggregate with max to keep the real count per query
phone_df.unionAll(desktop_df) \
    .groupBy('query') \
    .agg(max(col('desktop')).alias('desktop'), max(col('phone')).alias('phone')) \
    .withColumn('total', col('desktop') + col('phone')) \
    .orderBy(col('total').desc()) \
    .show()
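As a side note, the same numbers can be produced without splitting df in two at all, by pivoting on hwType; a sketch, assuming df has the query and hwType columns used above:

from pyspark.sql.functions import col, count

# count rows per (query, hwType) and turn the two hardware types into columns
(df.groupBy("query")
   .pivot("hwType", ["DESKTOP", "PHONE"])
   .agg(count("query"))
   .na.fill(0)                                   # queries missing one hardware type get 0
   .withColumnRenamed("DESKTOP", "desktop")
   .withColumnRenamed("PHONE", "phone")
   .withColumn("total", col("desktop") + col("phone"))
   .orderBy(col("total").desc())
   .show(200))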

Maybe try a full outer join on the query column, then find "total" by adding the two count columns:
from pyspark.sql.functions import col

df = desktop_df.join(phone_df, on="query", how="full") \
    .fillna(0) \
    .withColumn("total", col("desktop") + col("phone"))

You can do that using unionAll and then groupBy.
Example:
from pyspark.sql.functions import lit, col, max

desktop_data = [("query1", 12), ("query2", 23), ("query3", 8),
                ("query4", 11), ("query6", 45), ("query9", 89)]
phone_data = [("query1", 21), ("query2", 33), ("query4", 11), ("query5", 55), ("query6", 45),
              ("query7", 1234), ("query8", 4321), ("query10", 10), ("query11", 1)]

desktop_df = spark.createDataFrame(desktop_data, ['query', 'count1'])
phone_df = spark.createDataFrame(phone_data, ['query', 'count2'])

# add missing column to each dataframe
desktop_df = desktop_df.withColumn('count2', lit(0)).select('query', 'count1', 'count2')
phone_df = phone_df.withColumn('count1', lit(0)).select('query', 'count1', 'count2')

# union all and agg to select max value
phone_df.unionAll(desktop_df) \
    .groupBy('query').agg(max(col('count1')).alias('count1'), max(col('count2')).alias('count2')) \
    .withColumn('total', col('count1') + col('count2')) \
    .orderBy(col('total').desc()) \
    .show()
+-------+------+------+-----+
|  query|count1|count2|total|
+-------+------+------+-----+
| query8|     0|  4321| 4321|
| query7|     0|  1234| 1234|
| query6|    45|    45|   90|
| query9|    89|     0|   89|
| query2|    23|    33|   56|
| query5|     0|    55|   55|
| query1|    12|    21|   33|
| query4|    11|    11|   22|
|query10|     0|    10|   10|
| query3|     8|     0|    8|
|query11|     0|     1|    1|
+-------+------+------+-----+
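Note that on Spark 2.0+ unionAll is just a deprecated alias for union, so the same pipeline can also be written with union (reusing the DataFrames built above; max is imported under another name here only to avoid shadowing Python's builtin):

from pyspark.sql.functions import col, max as max_

phone_df.union(desktop_df) \
    .groupBy('query').agg(max_(col('count1')).alias('count1'), max_(col('count2')).alias('count2')) \
    .withColumn('total', col('count1') + col('count2')) \
    .orderBy(col('total').desc()) \
    .show()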

Related

Pyspark: replace row value by another column with the same name

I have a PySpark dataframe, df, as below:
| D1 | D2 | D3 |Out|
| 2 | 4 | 5 |D2 |
| 5 | 8 | 4 |D3 |
| 3 | 7 | 8 |D1 |
I would like to add a Result column that, for each row, takes the value from the column whose name appears in that row's Out column.
| D1 | D2 | D3 |Out|Result|
| 2 | 4 | 5 |D2 |4 |
| 5 | 8 | 4 |D3 |4 |
| 3 | 7 | 8 |D1 |3 |
df_lag = df.rdd.map(lambda row: row + (row[row.Out],)).toDF(df.columns + ["Result"])
I have tried the code above and it produces the result, but when I try to save to CSV it keeps failing with the error "Job aborted due to......", so I would like to ask if there is any other method that could obtain the same result. Thanks!
You can use chained when statements generated dynamically from the column names using reduce:
from functools import reduce
import pyspark.sql.functions as F
df2 = df.withColumn(
    'Result',
    reduce(
        lambda x, y: x.when(F.col('Out') == y, F.col(y)),
        df.columns[:-1],
        F
    )
)
df2.show()
+---+---+---+---+------+
| D1| D2| D3|Out|Result|
+---+---+---+---+------+
|  2|  4|  5| D2|     4|
|  5|  8|  4| D3|     4|
|  3|  7|  8| D1|     3|
+---+---+---+---+------+
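For the three columns in this example, the reduce call simply builds the when chain you would otherwise write by hand; it is roughly equivalent to:

import pyspark.sql.functions as F

# same logic with the when() chain written out explicitly
df2 = df.withColumn(
    'Result',
    F.when(F.col('Out') == 'D1', F.col('D1'))
     .when(F.col('Out') == 'D2', F.col('D2'))
     .when(F.col('Out') == 'D3', F.col('D3'))
)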

Date difference in years in PySpark dataframe

I come from a Pandas background and am new to Spark. I have a dataframe with id, dob, and age as columns. I want to get the age of the user from the dob (in some cases the age column is NULL).
+----+------+----------+
| id | age  | dob      |
+----+------+----------+
| 1  | 24   | NULL     |
| 2  | 25   | NULL     |
| 3  | NULL | 1/1/1973 |
| 4  | NULL | 6/6/1980 |
| 5  | 46   |          |
| 6  | NULL | 1/1/1971 |
+----+------+----------+
I want a new column which will calculate age from dob and current date.
I tried this, but I'm not getting any results from it:
from pyspark.sql.functions import datediff, lit, to_date, unix_timestamp

df.withColumn("diff",
              datediff(to_date(lit("01-06-2020")),
                       to_date(unix_timestamp('dob', "dd-MM-yyyy").cast("timestamp")))).show()
You need to compute the date difference and convert the result to years, something like this:
from pyspark.sql.functions import col, when, floor, datediff, current_date, to_date

df.withColumn('diff',
              when(col('age').isNull(),
                   floor(datediff(current_date(), to_date(col('dob'), 'M/d/yyyy')) / 365.25))
              .otherwise(col('age'))).show()
Which produces:
+---+----+--------+----+
| id| age|     dob|diff|
+---+----+--------+----+
|  1|  24|    null|  24|
|  2|  25|    null|  25|
|  3|null|1/1/1973|  47|
|  4|null|6/6/1980|  39|
|  5|  46|    null|  46|
|  6|null|1/1/1971|  49|
+---+----+--------+----+
It preserves the age column where it is not null and computes the difference in days between dob and today where age is null. That difference is then converted to years by dividing by 365.25 (you may want to adjust this) and floored.
I believe it is more appropriate to use months_between when it comes to a difference in years; use datediff only if you need the difference in days.
Approach (shown in Scala):
val data =
"""
| id | age | dob
| 1 | 24 |
| 2 | 25 |
| 3 | | 1/1/1973
| 4 | | 6/6/1980
| 5 | 46 |
| 6 | | 1/1/1971
""".stripMargin
val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()
val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---+----+--------+
* |id |age |dob     |
* +---+----+--------+
* |1  |24  |null    |
* |2  |25  |null    |
* |3  |null|1/1/1973|
* |4  |null|6/6/1980|
* |5  |46  |null    |
* |6  |null|1/1/1971|
* +---+----+--------+
*
* root
* |-- id: integer (nullable = true)
* |-- age: integer (nullable = true)
* |-- dob: string (nullable = true)
*/
Find age
df.withColumn("diff",
coalesce(col("age"),
round(months_between(current_date(),to_date(col("dob"), "d/M/yyyy"),true).divide(12),2)
)
).show()
/**
* +---+----+--------+-----+
* | id| age|     dob| diff|
* +---+----+--------+-----+
* |  1|  24|    null| 24.0|
* |  2|  25|    null| 25.0|
* |  3|null|1/1/1973|47.42|
* |  4|null|6/6/1980|39.99|
* |  5|  46|    null| 46.0|
* |  6|null|1/1/1971|49.42|
* +---+----+--------+-----+
*/
Round to 0 decimal places if you want the age as a whole number.
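A PySpark version of the same coalesce + months_between idea might look like this (a sketch, assuming the column names and the d/M/yyyy date format from the question):

from pyspark.sql import functions as F

df.withColumn(
    "diff",
    F.coalesce(
        F.col("age").cast("double"),   # keep the existing age where present
        F.round(F.months_between(F.current_date(), F.to_date(F.col("dob"), "d/M/yyyy")) / 12, 2)
    )
).show()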
Using months_between like in this answer, but with a different approach:
- in my table, I don't have an 'age' column yet;
- for rounding down to full years I use .cast('int').
from pyspark.sql import functions as F

df = df.withColumn('age', (F.months_between(F.current_date(), F.col('dob')) / 12).cast('int'))
If the system date is in UTC and your locale is different, a separate date function may be needed:
from pyspark.sql import functions as F

def current_local_date():
    return F.from_utc_timestamp(F.current_timestamp(), 'Europe/Riga').cast('date')

df = df.withColumn('age', (F.months_between(current_local_date(), F.col('dob')) / 12).cast('int'))

Concat multiple string rows for each unique ID by a particular order

I want to create a table where each row is a unique ID and the Place and City columns contain all the places and cities a person visited, ordered by the date of visit, using either PySpark or Hive.
df.groupby("ID").agg(F.concat_ws("|",F.collect_list("Place")))
does the concatenation, but I am unable to order it by the date. Also, I have to repeat this step separately for each column.
I also tried using a window function as mentioned in this post (collect_list by preserving order based on another variable),
but it throws an error: java.lang.UnsupportedOperationException: 'collect_list(...)' is not supported in a window operation.
I want to :
1- order the concatenated columns in order of the date travelled
2- do this step for multiple columns
Data
| ID | Date | Place | City |
| 1 | 2017 | UK | Birm |
| 2 | 2014 | US | LA |
| 1 | 2018 | SIN | Sin |
| 1 | 2019 | MAL | KL |
| 2 | 2015 | US | SF |
| 3 | 2019 | UK | Lon |
Expected
| ID | Place | City |
| 1 | UK,SIN,MAL | Birm,Sin,KL |
| 2 | US,US | LA,SF |
| 3 | UK | Lon |
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy('ID').orderBy('Date')
# Input data frame
>>> df.show()
+---+----+-----+----+
| ID|Date|Place|City|
+---+----+-----+----+
|  1|2017|   UK|Birm|
|  2|2014|   US|  LA|
|  1|2018|  SIN| Sin|
|  1|2019|  MAL|  KL|
|  2|2015|   US|  SF|
|  3|2019|   UK| Lon|
+---+----+-----+----+
>>> df2 = df.withColumn("Place",F.collect_list("Place").over(w)).withColumn("City",F.collect_list("City").over(w)).groupBy("ID").agg(F.max("Place").alias("Place"), F.max("City").alias("City"))
# Values collected as lists
>>> df2.show()
+---+--------------+---------------+
| ID|         Place|           City|
+---+--------------+---------------+
|  3|          [UK]|          [Lon]|
|  1|[UK, SIN, MAL]|[Birm, Sin, KL]|
|  2|      [US, US]|       [LA, SF]|
+---+--------------+---------------+
# If you want the values as strings
>>> df2.withColumn("Place", F.concat_ws(" ", "Place")).withColumn("City", F.concat_ws(" ", "City")).show()
+---+----------+-----------+
| ID|     Place|       City|
+---+----------+-----------+
|  3|        UK|        Lon|
|  1|UK SIN MAL|Birm Sin KL|
|  2|     US US|      LA SF|
+---+----------+-----------+
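To get the comma-separated strings shown in the expected output, just change the separator in that last step (reusing df2 and F from above):
>>> df2.withColumn("Place", F.concat_ws(",", "Place")).withColumn("City", F.concat_ws(",", "City")).show()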

How to create a UDF that creates a new column AND modifies an existing column

I have a dataframe like this:
id | color
---| -----
1 | red-dark
2 | green-light
3 | red-light
4 | blue-sky
5 | green-dark
I would like to create a UDF such that my dataframe becomes:
id | color | shade
---| ----- | -----
1 | red | dark
2 | green | light
3 | red | light
4 | blue | sky
5 | green | dark
I've written a UDF for this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_function(data_str):
    return ",".join(data_str.split("-"))

my_function_udf = udf(my_function, StringType())

# apply the UDF
df = df.withColumn("shade", my_function_udf(df['color']))
However, this doesn't transform the dataframe as I intended; instead it turns it into:
id | color | shade
---| ---------- | -----
1 | red-dark | red,dark
2 | green-light | green,light
3 | red-light | red,light
4 | blue-sky | blue,sky
5 | green-dark | green,dark
How can I transform the dataframe as I want it in pyspark?
What I tried, based on the suggested question:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

schema = ArrayType(StructType([
    StructField("color", StringType(), False),
    StructField("shade", StringType(), False)
]))

color_shade_udf = udf(
    lambda s: [tuple(s.split("-"))],
    schema
)

df = df.withColumn("colorshade", color_shade_udf(df['color']))
#Gives the following
id | color | colorshade
---| ---------- | -----
1 | red-dark | [{"color":"red","shade":"dark"}]
2 | green-dark | [{"color":"green","shade":"dark"}]
3 | red-light | [{"color":"red","shade":"light"}]
4 | blue-sky | [{"color":"blue","shade":"sky"}]
5 | green-dark | [{"color":"green","shade":"dark"}]
I feel like I am getting closer
You can use the built-in function split():
from pyspark.sql.functions import split, col
df.withColumn("arr", split(df.color, "\\-")) \
.select("id",
col("arr")[0].alias("color"),
col("arr")[1].alias("shade")) \
.drop("arr") \
.show()
+---+-----+-----+
| id|color|shade|
+---+-----+-----+
|  1|  red| dark|
|  2|green|light|
|  3|  red|light|
|  4| blue|  sky|
|  5|green| dark|
+---+-----+-----+
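If you prefer to stay with the UDF approach from the question, another option is to return a single struct (rather than an array of one struct) and then pull its fields out into separate columns; a sketch along those lines:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("color", StringType(), False),
    StructField("shade", StringType(), False)
])

# the UDF returns a (color, shade) struct for each input string
color_shade_udf = udf(lambda s: tuple(s.split("-")), schema)

df.withColumn("cs", color_shade_udf(df["color"])) \
  .select("id", col("cs.color").alias("color"), col("cs.shade").alias("shade")) \
  .show()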

Pyspark Dataframe: Transforming unique elements in rows to columns

I have a Pyspark Dataframe in the following format:
+------------+---------+
| date | query |
+------------+---------+
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 2 |
| 2011-08-12 | Query 3 |
| 2011-08-12 | Query 3 |
| 2011-08-13 | Query 1 |
+------------+---------+
And I need to transform it to turn each unique query into a column, grouped by date, and insert the count of each query in the rows of the dataframe. I expect the output to be like this:
+------------+---------+---------+---------+
| date | Query 1 | Query 2 | Query 3 |
+------------+---------+---------+---------+
| 2011-08-11 | 2 | 1 | 0 |
| 2011-08-12 | 0 | 0 | 2 |
| 2011-08-13 | 1 | 0 | 0 |
+------------+---------+---------+---------+
I am trying to use this answer as example, but I don't quite understand the code, especially the return statement in the make_row function.
Is there a way to count the queries while transforming the DataFrame?
Maybe something like
import pyspark.sql.functions as func

grouped = (df
           .map(lambda row: (row.date, (row.query, func.count(row.query))))  # Just an example. Not sure how to do this.
           .groupByKey())
It is a dataframe with potentially hundreds of thousands of rows and queries, so I prefer the RDD version over the options that use a .collect()
Thank you!
You can use groupBy.pivot with count as the aggregation function:
from pyspark.sql.functions import count
df.groupBy('date').pivot('query').agg(count('query')).na.fill(0).orderBy('date').show()
+--------------------+-------+-------+-------+
|                date|Query 1|Query 2|Query 3|
+--------------------+-------+-------+-------+
|2011-08-11 00:00:...|      2|      1|      0|
|2011-08-12 00:00:...|      0|      0|      2|
|2011-08-13 00:00:...|      1|      0|      0|
+--------------------+-------+-------+-------+
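If the number of distinct queries is large, you can also pass the list of pivot values explicitly, which avoids the extra job Spark otherwise runs to discover them (the list below is just the three queries from the example):

from pyspark.sql.functions import count

queries = ['Query 1', 'Query 2', 'Query 3']  # known pivot values
df.groupBy('date').pivot('query', queries).agg(count('query')).na.fill(0).orderBy('date').show()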
