PySpark: using a window to create a field from a previously created field's value - python

I am trying to create a new column in my df called indexCP using a Window. I want to take the previous indexCP value * (current_df['return'] + 1); if there is no previous indexCP, use 100 * (current_df['return'] + 1).
column_list = ["id", "secname"]
windowval = (Window.partitionBy(column_list)
             .orderBy(col('calendarday').cast("timestamp").cast("long"))
             .rangeBetween(Window.unboundedPreceding, 0))
spark_df = spark_df.withColumn(
    'indexCP',
    when(spark_df["PreviousYearUnique"] == spark_df["yearUnique"],
         100 * (current_df['return'] + 1))
    .otherwise(last('indexCP').over(windowval) * (current_df['return'] + 1)))
When I run the above code I get the error "AnalysisException: cannot resolve 'indexCP' given input columns", which I believe means you can't reference a column that has not been created yet, but I am unsure how to fix it.
Starting Data Frame
## +---+-----------+-------+------+
## | id|calendarday|secName|return|
## +---+-----------+-------+------+
## |  1| 2015-01-01|      1|0.0076|
## |  1| 2015-01-02|      1|0.0026|
## |  1| 2015-01-01|      2|0.0016|
## |  1| 2015-01-02|      2|0.0006|
## |  2| 2015-01-01|      3|0.0012|
## |  2| 2015-01-02|      3|0.0014|
## +---+-----------+-------+------+
New Data Frame with IndexCP added
## +---+-----------+-------+------+----------+
## | id|calendarday|secName|return|   IndexCP|
## +---+-----------+-------+------+----------+
## |  1| 2015-01-01|      1|0.0076|    100.76| (1st: 100 * (return+1))
## |  1| 2015-01-02|      1|0.0026|101.021976| (2nd: 100.76 * (return+1))
## |  2| 2015-01-01|      2|0.0016|    100.16| (1st: 100 * (return+1))
## |  2| 2015-01-02|      2|0.0006|100.220096| (2nd: 100.16 * (return+1))
## |  3| 2015-01-01|      3|0.0012|    100.12| (1st: 100 * (return+1))
## |  3| 2015-01-02|      3|0.0014|100.260168| (2nd: 100.12 * (return+1))
## +---+-----------+-------+------+----------+

EDIT: This should be the final answer; I've extended the sample data by another row per secName.
What you're looking for is a rolling product function using your formula of IndexCP * (current_return + 1).
First you need to collect all of the preceding returns into an array over a window, and then fold that array with the Spark SQL aggregate function:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

column_list = ["id", "secname"]
windowval = (
    Window.partitionBy(column_list)
    .orderBy(f.col('calendarday').cast("timestamp"))
    .rangeBetween(Window.unboundedPreceding, 0)
)
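For reference, the df1 used below can be built from the sample rows like this (a sketch; it assumes an active SparkSession named spark and keeps calendarday as a string, which the window spec casts to a timestamp):

df1 = spark.createDataFrame(
    [
        (1, "2015-01-01", 1, 0.0076),
        (1, "2015-01-02", 1, 0.0026),
        (1, "2015-01-03", 1, 0.0014),
        (2, "2015-01-01", 2, 0.0016),
        (2, "2015-01-02", 2, 0.0006),
        (2, "2015-01-03", 2, 0.0),
        (3, "2015-01-01", 3, 0.0012),
        (3, "2015-01-02", 3, 0.0014),
    ],
    ["id", "calendarday", "secName", "return"],
)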
df1.show()
+---+-----------+-------+------+
| id|calendarday|secName|return|
+---+-----------+-------+------+
| 1| 2015-01-01| 1|0.0076|
| 1| 2015-01-02| 1|0.0026|
| 1| 2015-01-03| 1|0.0014|
| 2| 2015-01-01| 2|0.0016|
| 2| 2015-01-02| 2|6.0E-4|
| 2| 2015-01-03| 2| 0.0|
| 3| 2015-01-01| 3|0.0012|
| 3| 2015-01-02| 3|0.0014|
+---+-----------+-------+------+
# f.collect_list(...) gets all your returns - this must be windowed
# cast(1 as double) is your base of 1 to begin with
# (acc, x) -> acc * (1 + x) is your formula translated to Spark SQL
# where acc is the accumulated value and x is the incoming value
df1.withColumn(
    "rolling_returns",
    f.collect_list("return").over(windowval)
).withColumn(
    "IndexCP",
    100 * f.expr("""
        aggregate(
            rolling_returns,
            cast(1 as double),
            (acc, x) -> acc * (1 + x))
    """)
).orderBy("id", "calendarday").show(truncate=False)
+---+-----------+-------+------+------------------------+------------------+
|id |calendarday|secName|return|rolling_returns |IndexCP |
+---+-----------+-------+------+------------------------+------------------+
|1 |2015-01-01 |1 |0.0076|[0.0076] |100.76 |
|1 |2015-01-02 |1 |0.0026|[0.0076, 0.0026] |101.021976 |
|1 |2015-01-03 |1 |0.0014|[0.0076, 0.0026, 0.0014]|101.16340676640002|
|2 |2015-01-01 |2 |0.0016|[0.0016] |100.16000000000001|
|2 |2015-01-02 |2 |6.0E-4|[0.0016, 6.0E-4] |100.220096 |
|2 |2015-01-03 |2 |0.0 |[0.0016, 6.0E-4, 0.0] |100.220096 |
|3 |2015-01-01 |3 |0.0012|[0.0012] |100.12 |
|3 |2015-01-02 |3 |0.0014|[0.0012, 0.0014] |100.26016800000002|
+---+-----------+-------+------+------------------------+------------------+
Explanation: the starting value of the fold must be 1, and the multiplier of 100 must sit outside the aggregate expression; otherwise each step would compound the 100 and the result would drift far above the expected returns.
I have verified that the values now adhere to your formula, for instance for secName == 1 and id == 1:
100 * (0.0076 + 1) * (0.0026 + 1) * (0.0014 + 1) = 101.1634067664
Which is indeed correct according to the formula (acc, x) -> acc * (1+x). Hope this helps!
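As an aside, if you are on Spark 3.1 or newer, f.product can replace the collect_list/aggregate pair entirely (a sketch reusing the same windowval; f.product is not available on older versions):

df1.withColumn(
    "IndexCP",
    # cumulative product of (1 + return) per id/secname, scaled by the base of 100
    100 * f.product(f.col("return") + 1).over(windowval)
).orderBy("id", "calendarday").show(truncate=False)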

Related

Pyspark create batch number column based on account

Suppose I have a pyspark dataframe with a number of unique account values, each of which has a different number of entries, like so:
+--------+-----+--------+--------+
| account| col1|    col2|    col3|
+--------+-----+--------+--------+
|  325235|   59|      -6|  625.64|
|  325235|   23|    -282|  923.47|
|  325235|   77|-1310.89| 3603.48|
|  245623|  120|    1.53| 1985.63|
|  245623|  106|     -12| 1985.06|
|  658567|   84|     -12|  194.67|
+--------+-----+--------+--------+
I want to specify a batch size and assign multiple accounts to the same batch based on that size. Let's suppose I choose batch size = 2; then the output should be the following:
+--------+-----+--------+--------+------------+
| account| col1|    col2|    col3|batch_number|
+--------+-----+--------+--------+------------+
|  325235|   59|      -6|  625.64|           1|
|  325235|   23|    -282|  923.47|           1|
|  325235|   77|-1310.89| 3603.48|           1|
|  245623|  120|    1.53| 1985.63|           1|
|  245623|  106|     -12| 1985.06|           1|
|  658567|   84|     -12|  194.67|           2|
+--------+-----+--------+--------+------------+
I can then do a groupby on the batch_number column and have multiple accounts per batch. Here is my working code, but it is too slow since I am doing a toPandas().
# Get unique accounts in the source data
accounts = [row.account for row in source_data.select("account").distinct().collect()]
# Find the number of batches; the last batch will have size = remainder
num_batches, remainder = divmod(len(accounts), batchsize)
# Create a batch dataframe where a batch number is assigned to each account
batches = [i for _ in range(batchsize) for i in range(1, int(num_batches) + 1)] + [num_batches + 1 for i in range(remainder)]
batch_df = pd.DataFrame({"account": accounts, "batch_number": batches}, columns=["account", "batch_number"]).set_index("account")
# Add a zero column for batch number to the source data, which will be populated below
source_data = source_data.withColumn("batch_number", lit(0))
# Map batch numbers of accounts back into the source data
source_data_p = source_data.toPandas()
for ind in source_data_p.index:
    source_data_p.at[ind, "batch_number"] = batch_df.at[source_data_p.at[ind, "account"], "batch_number"]
# Convert the mapped pandas df back to a spark df
batched_df = sqlcontext.createDataFrame(source_data_p)
I would ideally like to get rid of the toPandas() call, and do the mapping in pyspark. I have seen a few related posts, like this one: How to batch up items from a PySpark DataFrame, but this doesn't fit into the flow of my code, so I will have to re-write the whole project just to implement this.
From what I understand, you can use an indexer (StringIndexer from pyspark.ml, or any other way of numbering the distinct accounts) and then floor division:
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

n = 2
idx = StringIndexer(inputCol="account", outputCol="batch_number")
(idx.fit(df).transform(df)
    .withColumn("batch_number", F.floor(F.col("batch_number") / n) + 1)).show()
+-------+----+--------+-------+------------+
|account|col1| col2| col3|batch_number|
+-------+----+--------+-------+------------+
| 325235| 59| -6.0| 625.64| 1|
| 325235| 23| -282.0| 923.47| 1|
| 325235| 77|-1310.89|3603.48| 1|
| 245623| 120| 1.53|1985.63| 1|
| 245623| 106| -12.0|1985.06| 1|
| 658567| 84| -12.0| 194.67| 2|
+-------+----+--------+-------+------------+
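If you would rather avoid the ML dependency, a dense_rank over an unpartitioned window gives the same grouping (a sketch, assuming the dataframe is called df; note that an unpartitioned window funnels all rows through a single partition, so it only suits a moderate number of rows):

from pyspark.sql import functions as F, Window

n = 2
w = Window.orderBy("account")
# dense_rank numbers the distinct accounts 1, 2, 3, ...; floor division groups them n at a time
df.withColumn("batch_number", F.floor((F.dense_rank().over(w) - 1) / n) + 1).show()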

Date difference in years in PySpark dataframe

I come from a Pandas background and am new to Spark. I have a dataframe with id, dob, and age columns. I want to get the age of the user from their dob (in some cases the age column is NULL).
+----+------+----------+
| id | age | dob |
+----+------+----------+
| 1 | 24 | NULL |
| 2 | 25 | NULL |
| 3 | NULL | 1/1/1973 |
| 4 | NULL | 6/6/1980 |
| 5 | 46 | |
| 6 | NULL | 1/1/1971 |
+----+------+----------+
I want a new column which will calculate age from dob and current date.
I tried this, but I'm not getting any results from it:
df.withColumn("diff",
datediff(to_date(lit("01-06-2020")),
to_date(unix_timestamp('dob', "dd-MM-yyyy").cast("timestamp")))).show()
You need to compute the date difference and convert the result to years, something like this:
from pyspark.sql.functions import col, when, floor, datediff, current_date, to_date

df.withColumn('diff',
              when(col('age').isNull(),
                   floor(datediff(current_date(), to_date(col('dob'), 'M/d/yyyy')) / 365.25))
              .otherwise(col('age'))).show()
Which produces:
+---+----+--------+----+
| id| age| dob|diff|
+---+----+--------+----+
| 1| 24| null| 24|
| 2| 25| null| 25|
| 3|null|1/1/1973| 47|
| 4|null|6/6/1980| 39|
| 5| 46| null| 46|
| 6|null|1/1/1971| 49|
+---+----+--------+----+
It preserves the age column where not null and computes the difference (in days) between dob and today where age is null. The result is then converted to years (by dividing by 365.25; you may want to confirm this) then floored.
I believe it is more appropriate to use months_between when it comes to a difference in years; datediff should be used only if you need the difference in days.
Approach (Scala):
val data =
"""
| id | age | dob
| 1 | 24 |
| 2 | 25 |
| 3 | | 1/1/1973
| 4 | | 6/6/1980
| 5 | 46 |
| 6 | | 1/1/1971
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---+----+--------+
* |id |age |dob |
* +---+----+--------+
* |1 |24 |null |
* |2 |25 |null |
* |3 |null|1/1/1973|
* |4 |null|6/6/1980|
* |5 |46 |null |
* |6 |null|1/1/1971|
* +---+----+--------+
*
* root
* |-- id: integer (nullable = true)
* |-- age: integer (nullable = true)
* |-- dob: string (nullable = true)
*/
Find age
df.withColumn("diff",
coalesce(col("age"),
round(months_between(current_date(),to_date(col("dob"), "d/M/yyyy"),true).divide(12),2)
)
).show()
/**
* +---+----+--------+-----+
* | id| age| dob| diff|
* +---+----+--------+-----+
* | 1| 24| null| 24.0|
* | 2| 25| null| 25.0|
* | 3|null|1/1/1973|47.42|
* | 4|null|6/6/1980|39.99|
* | 5| 46| null| 46.0|
* | 6|null|1/1/1971|49.42|
* +---+----+--------+-----+
*/
Round it to 0 decimal places if you want the age as a whole number.
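For completeness, a PySpark sketch of the same coalesce/months_between idea (assuming the dob strings use the d/M/yyyy format shown above):

from pyspark.sql import functions as F

# keep the existing age where present, otherwise derive it from dob
df.withColumn(
    "diff",
    F.coalesce(
        F.col("age").cast("double"),
        F.round(F.months_between(F.current_date(), F.to_date(F.col("dob"), "d/M/yyyy")) / 12, 2),
    ),
).show()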
Using months_between like in this answer, but with a different approach:
in my table, I don't have an 'age' column yet;
for rounding down to full years I use .cast('int').
from pyspark.sql import functions as F

df = df.withColumn('age', (F.months_between(F.current_date(), F.col('dob')) / 12).cast('int'))
If the system date is in UTC and your locale is different, a separate date function may be needed:
from pyspark.sql import functions as F

def current_local_date():
    return F.from_utc_timestamp(F.current_timestamp(), 'Europe/Riga').cast('date')

df = df.withColumn('age', (F.months_between(current_local_date(), F.col('dob')) / 12).cast('int'))

Giving Same Number If Same Group

I have a dataframe like this:
+----+------+
|name|value |
+----+------+
| x | down|
| y |normal|
| z | down|
| x |normal|
| y | down|
+----+------+
If the names are the same, I want to assign them the same number (1, 2, 3, ...). The new column must look like this:
+----+------+------+
|name|value |newCol|
+----+------+------+
| x|down | 1|
| y|normal| 2|
| z|down | 3|
| x|normal| 1|
| y|down | 2|
+----+------+------+
win = Window.partitionBy("name").orderBy("name")
print("value")
dp_df_classification_agg_join = dp_df_classification_agg_join.withColumn("newCol",count("name").over(win))
First, replace the count("name") function with the dense_rank() function.
Then, replace win = Window.partitionBy("name").orderBy("name") with win = Window.partitionBy().orderBy("name"), as in the sketch below.
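Putting both changes together (a minimal sketch reusing the names from the question):

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# an empty partitionBy() puts all rows into a single partition, so use with care on large data
win = Window.partitionBy().orderBy("name")
dp_df_classification_agg_join = dp_df_classification_agg_join.withColumn("newCol", dense_rank().over(win))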

Adding column with rolling latest prior in PySpark

I have a pyspark dataframe with a list of customers, days, and transaction types.
+----------+-----+------+
| Customer | Day | Type |
+----------+-----+------+
| A | 2 | X11 |
| A | 4 | X2 |
| A | 9 | Y4 |
| A | 11 | X1 |
| B | 3 | Y4 |
| B | 7 | X1 |
+----------+-----+------+
I'd like to create a column that has "most recent X type" for each customer, like so:
+----------+-----+------+-------------+
| Customer | Day | Type | MostRecentX |
+----------+-----+------+-------------+
| A | 2 | X11 | X11 |
| A | 4 | X2 | X2 |
| A | 9 | Y4 | X2 |
| A | 11 | X1 | X1 |
| B | 3 | Y4 | - |
| B | 7 | X1 | X1 |
+----------+-----+------+-------------+
So for the X types it just takes the one from the current row, but for the Y type it takes the type from the most recent X row for that member (and if there isn't one, it gets a blank or something). I imagine I need some sort of window function, but I'm not very familiar with PySpark.
You can achieve this by taking the last value of Type that starts with the letter "X" over a Window that partitions by Customer and orders by Day. Specify the Window to start at the beginning of the partition and stop at the current row.
from pyspark.sql import Window
from pyspark.sql.functions import col, last, when
w = Window.partitionBy("Customer").orderBy("Day").rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn(
    "MostRecentX",
    last(when(col("Type").startswith("X"), col("Type")), ignorenulls=True).over(w)
)
df.show()
#+--------+---+----+-----------+
#|Customer|Day|Type|MostRecentX|
#+--------+---+----+-----------+
#| A| 2| X11| X11|
#| A| 4| X2| X2|
#| A| 9| Y4| X2|
#| A| 11| X1| X1|
#| B| 3| Y4| null|
#| B| 7| X1| X1|
#+--------+---+----+-----------+
The trick here is to use when to return the Type column only if it starts with "X". By default, when will return null. Then we can use last with ignorenulls=True to get the value for MostRecentX.
If you want to replace the null with "-" as shown in your question, just call fillna on the MostRecentX column:
df.fillna("-", subset=["MostRecentX"]).show()
#+--------+---+----+-----------+
#|Customer|Day|Type|MostRecentX|
#+--------+---+----+-----------+
#| A| 2| X11| X11|
#| A| 4| X2| X2|
#| A| 9| Y4| X2|
#| A| 11| X1| X1|
#| B| 3| Y4| -|
#| B| 7| X1| X1|
#+--------+---+----+-----------+

Find column names of interconnected row values - Spark

I have a Spark dataframe that adheres to the following structure:
+------+-----------+-----------+-----------+------+
|ID | Name1 | Name2 | Name3 | Y |
+------+-----------+-----------+-----------+------+
| 1 | A,1 | B,1 | C,4 | B |
| 2 | D,2 | E,2 | F,8 | D |
| 3 | G,5 | H,2 | I,3 | H |
+------+-----------+-----------+-----------+------+
For every row I want to find the column in which the value of Y appears as the first element. So, ideally I want to retrieve a list like: [Name2, Name1, Name2].
I am not sure whether it would work to first convert to an RDD, use a map function, and then convert the result back to a DataFrame.
Any ideas are welcome.
You can try this piece of code:
df.show()
+---+-----+-----+-----+---+
| ID|Name1|Name2|Name3| Y|
+---+-----+-----+-----+---+
| 1| A,1| B,1| C,4| B|
| 2| D,2| E,2| F,8| D|
| 3| G,5| H,2| I,3| H|
+---+-----+-----+-----+---+
from pyspark.sql import functions as F

name_cols = ["Name1", "Name2", "Name3"]
# start from the functions module so the first .when call resolves to F.when(...)
cond = F
for col in name_cols:
    cond = cond.when(F.split(F.col(col), ',').getItem(0) == F.col("Y"), col)

df.withColumn("whichName", cond).show()
+---+-----+-----+-----+---+---------+
| ID|Name1|Name2|Name3| Y|whichName|
+---+-----+-----+-----+---+---------+
| 1| A,1| B,1| C,4| B| Name2|
| 2| D,2| E,2| F,8| D| Name1|
| 3| G,5| H,2| I,3| H| Name2|
+---+-----+-----+-----+---+---------+
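An equivalent variant builds the same logic with a list comprehension and coalesce (a sketch; it relies on when without otherwise returning null for the columns that don't match):

from pyspark.sql import functions as F

name_cols = ["Name1", "Name2", "Name3"]
df.withColumn(
    "whichName",
    # the first non-null column-name literal wins, i.e. the first column whose prefix equals Y
    F.coalesce(*[
        F.when(F.split(F.col(c), ',').getItem(0) == F.col("Y"), F.lit(c))
        for c in name_cols
    ]),
).show()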
