I come from a Pandas background and am new to Spark. I have a dataframe with id, dob, and age columns. I want to get the age of the user from his dob (in some cases the age column is NULL).
+----+------+----------+
| id | age | dob |
+----+------+----------+
| 1 | 24 | NULL |
| 2 | 25 | NULL |
| 3 | NULL | 1/1/1973 |
| 4 | NULL | 6/6/1980 |
| 5 | 46 | |
| 6 | NULL | 1/1/1971 |
+----+------+----------+
I want a new column that calculates the age from dob and the current date.
I tried this, but I'm not getting any results from it:
df.withColumn("diff",
datediff(to_date(lit("01-06-2020")),
to_date(unix_timestamp('dob', "dd-MM-yyyy").cast("timestamp")))).show()
You need to compute the date difference and convert the result to years, something like this:
from pyspark.sql.functions import col, when, floor, datediff, current_date, to_date

df.withColumn('diff',
    when(col('age').isNull(),
         floor(datediff(current_date(), to_date(col('dob'), 'M/d/yyyy')) / 365.25))
    .otherwise(col('age'))).show()
Which produces:
+---+----+--------+----+
| id| age| dob|diff|
+---+----+--------+----+
| 1| 24| null| 24|
| 2| 25| null| 25|
| 3|null|1/1/1973| 47|
| 4|null|6/6/1980| 39|
| 5| 46| null| 46|
| 6|null|1/1/1971| 49|
+---+----+--------+----+
It preserves the age column where it is not null and computes the difference (in days) between dob and today where age is null. The result is then converted to years (by dividing by 365.25, a rough correction for leap years; you may want to confirm this is accurate enough for your needs) and floored.
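For example, assuming a run date of 2020-06-01, id 3 gives datediff(2020-06-01, 1973-01-01) = 17318 days, and 17318 / 365.25 ≈ 47.4, which floors to 47 as shown in the output above.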
I believe it is more appropriate to use months_between when it comes to year differences; use datediff only when you need the difference in days.
Approach:
import org.apache.spark.sql.functions._
import spark.implicits._

val data =
"""
| id | age | dob
| 1 | 24 |
| 2 | 25 |
| 3 | | 1/1/1973
| 4 | | 6/6/1980
| 5 | 46 |
| 6 | | 1/1/1971
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---+----+--------+
* |id |age |dob |
* +---+----+--------+
* |1 |24 |null |
* |2 |25 |null |
* |3 |null|1/1/1973|
* |4 |null|6/6/1980|
* |5 |46 |null |
* |6 |null|1/1/1971|
* +---+----+--------+
*
* root
* |-- id: integer (nullable = true)
* |-- age: integer (nullable = true)
* |-- dob: string (nullable = true)
*/
Find the age:
df.withColumn("diff",
coalesce(col("age"),
round(months_between(current_date(),to_date(col("dob"), "d/M/yyyy"),true).divide(12),2)
)
).show()
/**
* +---+----+--------+-----+
* | id| age| dob| diff|
* +---+----+--------+-----+
* | 1| 24| null| 24.0|
* | 2| 25| null| 25.0|
* | 3|null|1/1/1973|47.42|
* | 4|null|6/6/1980|39.99|
* | 5| 46| null| 46.0|
* | 6|null|1/1/1971|49.42|
* +---+----+--------+-----+
*/
Round it to 0 decimal places if you want the age as a whole number.
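For reference, a rough PySpark equivalent of the same months_between approach (a sketch, assuming the dob strings parse with the d/M/yyyy pattern used above):

from pyspark.sql import functions as F

df.withColumn(
    'diff',
    F.coalesce(
        F.col('age').cast('double'),  # keep the existing age where present
        F.round(F.months_between(F.current_date(), F.to_date(F.col('dob'), 'd/M/yyyy'), True) / 12, 2)
    )
).show()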
Using months_between like in this answer, but with a different approach:
in my table, I don't have an 'age' column yet;
for rounding down to full years I use .cast('int').
from pyspark.sql import functions as F
df = df.withColumn('age', (F.months_between(F.current_date(), F.col('dob')) / 12).cast('int'))
If the system date is in UTC and your local time zone is different, a separate date function may be needed:
from pyspark.sql import functions as F
def current_local_date():
return F.from_utc_timestamp(F.current_timestamp(), 'Europe/Riga').cast('date')
df = df.withColumn('age', (F.months_between(current_local_date(), F.col('dob')) / 12).cast('int'))
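Note that months_between expects dob to be a date or timestamp column. If dob is stored as a string in the question's M/d/yyyy format, parse it first; a minimal sketch:

from pyspark.sql import functions as F

# parse the string dob before computing the age (M/d/yyyy pattern taken from the question)
df = df.withColumn(
    'age',
    (F.months_between(F.current_date(), F.to_date(F.col('dob'), 'M/d/yyyy')) / 12).cast('int')
)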
I have a PySpark dataframe, df, as below:
| D1 | D2 | D3 |Out|
| 2 | 4 | 5 |D2 |
| 5 | 8 | 4 |D3 |
| 3 | 7 | 8 |D1 |
I would like to add a Result column whose value, for each row, is taken from the column whose name appears in that row's "Out" column. The desired output:
| D1 | D2 | D3 |Out|Result|
| 2 | 4 | 5 |D2 |4 |
| 5 | 8 | 4 |D3 |4 |
| 3 | 7 | 8 |D1 |3 |
df_lag=df.rdd.map(lambda row: row + (row[row.Out],)).toDF(df.columns + ["Result"])
I have tried the code above and it produces the result, but when I try to save to CSV it keeps failing with the error "Job aborted due to......", so I would like to ask if there is another method that could obtain the same result. Thanks!
You can use chained when statements generated dynamically from the column names using reduce:
from functools import reduce
import pyspark.sql.functions as F
df2 = df.withColumn(
'Result',
reduce(
lambda x, y: x.when(F.col('Out') == y, F.col(y)),
df.columns[:-1],
F
)
)
df2.show()
+---+---+---+---+------+
| D1| D2| D3|Out|Result|
+---+---+---+---+------+
| 2| 4| 5| D2| 4|
| 5| 8| 4| D3| 4|
| 3| 7| 8| D1| 3|
+---+---+---+---+------+
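For reference, the reduce call builds the same thing you would get by writing the chain out by hand, roughly:

import pyspark.sql.functions as F

# equivalent hand-written chain (D1, D2, D3 come from df.columns[:-1])
df2 = df.withColumn(
    'Result',
    F.when(F.col('Out') == 'D1', F.col('D1'))
     .when(F.col('Out') == 'D2', F.col('D2'))
     .when(F.col('Out') == 'D3', F.col('D3'))
)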
I have a PySpark dataframe that contains N columns of integers. Some of the fields might be null as well.
For example:
+---+-----+-----+
| id| f_1 | f_2 |
+---+-----+-----+
| 1| null| null|
| 2|123 | null|
| 3|124 |127 |
+---+-----+-----+
What I want is to combine all f-prefixed columns into a pyspark array in a new column. For example:
+---+---------+
| id| combined|
+---+---------+
| 1| [] |
| 2|[123] |
| 3|[124,127]|
+---+---------+
The closest I have managed to get is this:
features_filtered = features.select(F.concat(* features.columns[1:]).alias('combined'))
which returns null (I assume due to the nulls in the initial dataframe).
From what I searched I would like to use .coalesce() or maybe .fillna() to handle/remove nulls but I haven't managed to make it work.
My main requirements are that I would like the newly created column to be of type Array and that I don't want to enumerate all the column names that I need to concat.
In PySpark this can be done as:
import pyspark.sql.functions as f
from pyspark.sql.functions import expr

df = df.withColumn("combined_array", f.array(*[i for i in df.columns if i.startswith('f')])) \
    .withColumn("combined", expr('''FILTER(combined_array, x -> x is not null)'''))
Try this (it's in Scala, but it can be implemented in Python with minimal change).
Load the data:
import org.apache.spark.sql.functions._
import spark.implicits._

val data =
"""
|id| f_1 | f_2
| 1| null| null
| 2|123 | null
| 3|124 |127
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- id: integer (nullable = true)
* |-- f_1: integer (nullable = true)
* |-- f_2: integer (nullable = true)
*
* +---+----+----+
* |id |f_1 |f_2 |
* +---+----+----+
* |1 |null|null|
* |2 |123 |null|
* |3 |124 |127 |
* +---+----+----+
*/
Convert it to an array:
df.withColumn("array", array(df.columns.filter(_.startsWith("f")).map(col): _*))
.withColumn("combined", expr("FILTER(array, x -> x is not null)"))
.show(false)
/**
* +---+----+----+----------+----------+
* |id |f_1 |f_2 |array |combined |
* +---+----+----+----------+----------+
* |1 |null|null|[,] |[] |
* |2 |123 |null|[123,] |[123] |
* |3 |124 |127 |[124, 127]|[124, 127]|
* +---+----+----+----------+----------+
*/
I am trying to create a new column in my df called indexCP using a Window. I want to take the previous value of indexCP * (current_df['return'] + 1); if there is no previous indexCP, use 100 * (current_df['return'] + 1).
column_list = ["id","secname"]
windowval = (Window.partitionBy(column_list).orderBy(col('calendarday').cast("timestamp").cast("long")).rangeBetween(Window.unboundedPreceding, 0))
spark_df = spark_df.withColumn('indexCP', when(spark_df["PreviousYearUnique"] == spark_df["yearUnique"], 100 * (current_df['return']+1)).otherwise(last('indexCP').over(windowval) * (current_df['return']+1)))
When I run the above code I get the error "AnalysisException: cannot resolve 'indexCP' given input columns", which I believe is saying you can't reference a column that has not been created yet, but I am unsure how to fix it.
Starting Data Frame
## +---+-----------+----------+------------------+
## | id|calendarday| secName| return|
## +---+-----------+----------+------------------+
## | 1|2015-01-01 | 1| 0.0076|
## | 1|2015-01-02 | 1| 0.0026|
## | 1|2015-01-01 | 2| 0.0016|
## | 1|2015-01-02 | 2| 0.0006|
## | 2|2015-01-01 | 3| 0.0012|
## | 2|2015-01-02 | 3| 0.0014|
## +---+-----------+----------+------------------+
New Data Frame IndexCP added
## +---+-----------+--------+---------+------------+
## | id|calendarday| secName| return| IndexCP|
## +---+-----------+--------+---------+------------+
## | 1|2015-01-01 | 1| 0.0076| 100.76|(1st 100*(return+1))
## | 1|2015-01-02 | 1| 0.0026| 101.021976|(2nd 100.76*(return+1))
## | 2|2015-01-01 | 2| 0.0016| 100.16|(1st 100*(return+1))
## | 2|2015-01-02 | 2| 0.0006| 100.220096|(2nd 100.16*(return+1))
## | 3|2015-01-01 | 3| 0.0012| 100.12 |(1st 100*(return+1))
## | 3|2015-01-02 | 3| 0.0014| 100.260168|(2nd 100.12*(return+1))
## +---+-----------+--------+---------+------------+
EDIT: This should be the final answer; I've extended the example data by one more row per secName.
What you're looking for is a rolling product function using your formula of IndexCP * (current_return + 1).
First you need to collect all the existing returns into an array column (windowed), and then fold over that array. This can be done with the Spark SQL aggregate higher-order function, for example:
column_list = ["id","secname"]
windowval = (
Window.partitionBy(column_list)
.orderBy(f.col('calendarday').cast("timestamp"))
.rangeBetween(Window.unboundedPreceding, 0)
)
df1.show()
+---+-----------+-------+------+
| id|calendarday|secName|return|
+---+-----------+-------+------+
| 1| 2015-01-01| 1|0.0076|
| 1| 2015-01-02| 1|0.0026|
| 1| 2015-01-03| 1|0.0014|
| 2| 2015-01-01| 2|0.0016|
| 2| 2015-01-02| 2|6.0E-4|
| 2| 2015-01-03| 2| 0.0|
| 3| 2015-01-01| 3|0.0012|
| 3| 2015-01-02| 3|0.0014|
+---+-----------+-------+------+
# f.collect_list(...) gets all your returns - this must be windowed
# cast(1 as double) is your base of 1 to begin with
# (acc, x) -> acc * (1 + x) is your formula translated to Spark SQL
# where acc is the accumulated value and x is the incoming value
df1.withColumn(
"rolling_returns",
f.collect_list("return").over(windowval)
).withColumn("IndexCP",
100 * f.expr("""
aggregate(
rolling_returns,
cast(1 as double),
(acc, x) -> acc * (1+x))
""")
).orderBy("id", "calendarday").show(truncate=False)
+---+-----------+-------+------+------------------------+------------------+
|id |calendarday|secName|return|rolling_returns |IndexCP |
+---+-----------+-------+------+------------------------+------------------+
|1 |2015-01-01 |1 |0.0076|[0.0076] |100.76 |
|1 |2015-01-02 |1 |0.0026|[0.0076, 0.0026] |101.021976 |
|1 |2015-01-03 |1 |0.0014|[0.0076, 0.0026, 0.0014]|101.16340676640002|
|2 |2015-01-01 |2 |0.0016|[0.0016] |100.16000000000001|
|2 |2015-01-02 |2 |6.0E-4|[0.0016, 6.0E-4] |100.220096 |
|2 |2015-01-03 |2 |0.0 |[0.0016, 6.0E-4, 0.0] |100.220096 |
|3 |2015-01-01 |3 |0.0012|[0.0012] |100.12 |
|3 |2015-01-02 |3 |0.0014|[0.0012, 0.0014] |100.26016800000002|
+---+-----------+-------+------+------------------------+------------------+
Explanation: the starting value must be 1, and the multiplier of 100 must be on the outside of the expression; otherwise you start drifting by a factor of 100 above the expected returns.
I have verified the values now adhere to your formula, for instance for secName == 1 and id == 1:
100 * ((1.0026 * (0.0076 + 1)) * (0.0014 + 1)) = 101.1634067664
Which is indeed correct according to the formula (acc, x) -> acc * (1+x). Hope this helps!
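As an aside, here is a sketch of an alternative that avoids collecting arrays by working in log space (it assumes every return is greater than -1):

import pyspark.sql.functions as f

# product of (1 + return) over the window, computed as exp(sum(log(1 + return)))
df1.withColumn(
    'IndexCP',
    100 * f.exp(f.sum(f.log1p(f.col('return'))).over(windowval))
).orderBy('id', 'calendarday').show(truncate=False)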
I'm looking to join 2 PySpark dataframes without losing any data. The easiest way to explain is with an example. I would also like to add the counts together and sort. If a value is null in the desktop or phone column, it should equal 0 in the output.
I tried:
desktop_df.join(phone_df, on='query')\
.fillna(0).orderBy("desktop", ascending=False)\
.show(20)
(it doesn't have the total column yet, so I'm ordering it by the desktop count)
But this approach doesn't seem to be working: it doesn't show the zero rows at all.
desktop_df:
query |desktop|
----------------
query1 | 12 |
----------------
query2 | 23 |
----------------
query3 | 8 |
----------------
query4 | 11 |
----------------
query6 | 45 |
----------------
query9 | 89 |
phone_df:
query | phone |
----------------
query1 | 21 |
----------------
query2 | 33 |
----------------
query4 | 11 |
----------------
query5 | 55 |
----------------
query6 | 45 |
----------------
query7 | 1234 |
----------------
query8 | 4321 |
----------------
query10| 10 |
----------------
query11| 1 |
Output I'm looking for:
query | desktop| phone | total |
--------------------------------
query8 | 0 | 4321 | 4321 |
--------------------------------
query7 | 0 | 1234 | 1234 |
--------------------------------
query6 | 45 | 45 | 90 |
--------------------------------
query9 | 89 | 0 | 89 |
--------------------------------
query2 | 23 | 33 | 56 |
--------------------------------
query5 | 0 | 55 | 55 |
--------------------------------
query1 | 12 | 21 | 33 |
--------------------------------
query4 | 11 | 11 | 22 |
--------------------------------
query10| 0 | 10 | 10 |
--------------------------------
query3 | 8 | 0 | 8 |
--------------------------------
query11| 0 | 1 | 1 |
Solutions that worked:
df = desktop_df.join(phone_df, on=["query"], how='fullouter').fillna(0).withColumn("total", col("desktop") + col("phone"))
df.show(200)
or
from pyspark.sql.functions import lit
from pyspark.sql.functions import col
from pyspark.sql.functions import max
desktop_df = df.filter("hwType == 'DESKTOP'").groupby("query").count().orderBy("count", ascending=False).withColumnRenamed('count','desktop')
phone_df = df.filter("hwType == 'PHONE'").groupby("query").count().orderBy("count", ascending=False).withColumnRenamed('count','phone')
# add missing column to each dataframe
desktop_df = desktop_df.withColumn('phone', lit(0)).select('query', 'desktop', 'phone')
phone_df = phone_df.withColumn('desktop', lit(0)).select('query', 'desktop', 'phone')
# union all and agg to select the max value per query
(phone_df.unionAll(desktop_df)
    .groupBy('query').agg(max(col('desktop')).alias('desktop'), max(col('phone')).alias('phone'))
    .withColumn('total', col('desktop') + col('phone'))
    .orderBy(col('total').desc())
    .show())
Alternatively, try a full outer join on the query column and find the total by adding the column values:
df = desktop_df.join(phone_df, desktop_df.query == phone_df.query, "full").select(desktop_df.query, "desktop", "phone").fillna(0).withColumn("total", col("desktop") + col("phone"))
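Note that with a full outer join, desktop_df.query is null for queries that exist only in phone_df, so selecting it drops those keys. Joining on the column name (as in the first solution above) or coalescing the two key columns avoids that; a sketch:

from pyspark.sql.functions import coalesce, col

df = (desktop_df.join(phone_df, desktop_df.query == phone_df.query, "full")
      .select(coalesce(desktop_df.query, phone_df.query).alias("query"), "desktop", "phone")
      .fillna(0)
      .withColumn("total", col("desktop") + col("phone")))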
You can do that using unionAll and then groupBy.
Example:
desktop_data = [("query1", 12), ("query2", 23), ("query3", 8),
("query4", 11), ("query6", 45), ("query9", 89)]
phone_data = [("query1", 21), ("query2", 33), ("query4", 11), ("query5", 55), ("query6", 45),
("query7", 1234), ("query8", 4321), ("query10", 10), ("query11", 1)]
desktop_df = spark.createDataFrame(desktop_data, ['query', 'count1'])
phone_df = spark.createDataFrame(phone_data, ['query', 'count2'])
# add missing column to each dataframe
desktop_df = desktop_df.withColumn('count2', lit(0)).select('query', 'count1', 'count2')
phone_df = phone_df.withColumn('count1', lit(0)).select('query', 'count1', 'count2')
# union all and agg to select max value
phone_df.unionAll(desktop_df) \
.groupBy('query').agg(max(col('count1')).alias('count1'), max(col('count2')).alias('count2')) \
.withColumn('total', col('count1') + col('count2')) \
.orderBy(col('total').desc()) \
.show()
+-------+------+------+-----+
| query|count1|count2|total|
+-------+------+------+-----+
| query8| 0| 4321| 4321|
| query7| 0| 1234| 1234|
| query6| 45| 45| 90|
| query9| 89| 0| 89|
| query2| 23| 33| 56|
| query5| 0| 55| 55|
| query1| 12| 21| 33|
| query4| 11| 11| 22|
|query10| 0| 10| 10|
| query3| 8| 0| 8|
|query11| 0| 1| 1|
+-------+------+------+-----+
I have a Spark dataframe that adheres to the following structure:
+------+-----------+-----------+-----------+------+
|ID | Name1 | Name2 | Name3 | Y |
+------+-----------+-----------+-----------+------+
| 1 | A,1 | B,1 | C,4 | B |
| 2 | D,2 | E,2 | F,8 | D |
| 3 | G,5 | H,2 | I,3 | H |
+------+-----------+-----------+-----------+------+
For every row I want to find in which column the value of Y is denoted as the first element. So, ideally I want to retrieve a list like: [Name2,Name1,Name2].
I am not sure whether and how it would work to first convert to an RDD, use a map function, and then convert the result back to a DataFrame.
Any ideas are welcome.
You can probably try this piece of code:
df.show()
+---+-----+-----+-----+---+
| ID|Name1|Name2|Name3| Y|
+---+-----+-----+-----+---+
| 1| A,1| B,1| C,4| B|
| 2| D,2| E,2| F,8| D|
| 3| G,5| H,2| I,3| H|
+---+-----+-----+-----+---+
from pyspark.sql import functions as F
name_cols = ["Name1", "Name2", "Name3"]
# start from the functions module so the first .when() call creates the Column,
# then keep chaining .when(...) for each candidate column
cond = F
for c in name_cols:
    cond = cond.when(F.split(F.col(c), ',').getItem(0) == F.col("Y"), c)
df.withColumn("whichName", cond).show()
+---+-----+-----+-----+---+---------+
| ID|Name1|Name2|Name3| Y|whichName|
+---+-----+-----+-----+---+---------+
| 1| A,1| B,1| C,4| B| Name2|
| 2| D,2| E,2| F,8| D| Name1|
| 3| G,5| H,2| I,3| H| Name2|
+---+-----+-----+-----+---+---------+
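Another way to sketch the same thing is to build one when per column and take the first non-null match with coalesce (same column names assumed):

from pyspark.sql import functions as F

name_cols = ["Name1", "Name2", "Name3"]
df.withColumn(
    "whichName",
    F.coalesce(*[
        F.when(F.split(F.col(c), ',').getItem(0) == F.col("Y"), F.lit(c))
        for c in name_cols
    ])
).show()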