Retrieving a dynamic pyspark lag offset value from another dataframe - python
I am using pyspark 2.1. Below are my input dataframes. I am stuck on taking dynamic offset values from a different dataframe; please help.
df1=
category value
1 3
2 2
4 5
df2
category year month weeknumber lag_attribute runs
1 0 0 0 0 2
1 2019 1 1 1 0
1 2019 1 2 2 0
1 2019 1 3 3 0
1 2019 1 4 4 1
1 2019 1 5 5 2
1 2019 1 6 6 3
1 2019 1 7 7 4
1 2019 1 8 8 5
1 2019 1 9 9 6
2 0 0 0 9 0
2 2018 1 1 2 0
2 2018 1 2 3 2
2 2018 1 3 4 3
2 2018 1 3 5 4
As shown in the example above, df1 is my lookup table holding the offset values: for category 1 the offset is 3 and for category 2 the offset is 2.
In df2, runs is my expected output column. For every category, the offset looked up from df1 tells how far lag_attribute should be lagged: the offset for category 1 is 3, so lag_attribute is shifted down by 3 rows to produce runs, which is why runs repeats the lag_attribute values three rows later.
I tried the code below but it didn't work. Please help.
df1=df1.registerTempTable("df1")
df2=df2.registerTempTable("df2")
sqlCtx.sql("select st.category,st.Year,st.Month,st.weekyear,st.lag_attribute,LAG(st.lag_attribute,df1.value, 0) OVER (PARTITION BY st.cagtegory ORDER BY st.Year,st.Month,st.weekyear) as return_test from df1 st,df2 lkp where df1.category=df2.category")
Please help me to cross this hurdle
lag takes a column object and an integer (a plain Python integer), as shown in the function's signature:
Signature: psf.lag(col, count=1, default=None)
The value for count must be a Python integer; it cannot be a pyspark column object, so you cannot feed it a per-row value coming from another dataframe directly. There are workarounds though; let's start with the sample data:
df1 = spark.createDataFrame([[1, 3],[2, 2],[4, 5]], ["category", "value"])
df2 = spark.createDataFrame([[1, 0, 0, 0, 0, 2],[1, 2019, 1, 1, 1, 0],[1, 2019, 1, 2, 2, 0],[1, 2019, 1, 3, 3, 0],
[1, 2019, 1, 4, 4, 1],[1, 2019, 1, 5, 5, 2],[1, 2019, 1, 6, 6, 3],[1, 2019, 1, 7, 7, 4],
[1, 2019, 1, 8, 8, 5],[1, 2019, 1, 9, 9, 6],[2, 0, 0, 0, 9, 0],[2, 2018, 1, 1, 2, 0],
[2, 2018, 1, 2, 3, 2],[2, 2018, 1, 3, 4, 3],[2, 2018, 1, 3, 5, 4]],
["category", "year", "month", "weeknumber", "lag_attribute", "runs"])
What you could do, if df1 is not too big (meaning a small number of categories, with potentially many rows per category), is collect df1 into a list and build an if-elif-elif... condition (a chained when expression) from its values:
list1 = df1.collect()  # df1 is small, so collecting it to the driver is fine
sc.broadcast(list1)    # optional: list1 is only used on the driver to build the expression below
import pyspark.sql.functions as psf
from pyspark.sql import Window
w = Window.partitionBy("category").orderBy("year", "month", "weeknumber")
# builds psf.when(...).when(...)... with one branch per (category, lag) pair in list1
cond = eval('psf' + ''.join(['.when(df2.category == ' + str(c) + ', psf.lag("lag_attribute", ' + str(l) + ', 0).over(w))' for c, l in list1]))
Note: this assumes c and l are integers; if they are stored as strings, quote the category value and cast the lag to an int (the count argument of lag must remain a Python integer):
cond = eval('psf' + ''.join(['.when(df2.category == "' + str(c) + '", psf.lag("lag_attribute", ' + str(int(l)) + ', 0).over(w))' for c, l in list1]))
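If you would rather avoid eval, here is a minimal sketch (using the same list1 and window w as above) that builds an equivalent chained when expression directly:

# build the same chained when expression without eval: one branch per (category, lag) pair
cond = None
for c, l in list1:
    branch = psf.lag("lag_attribute", int(l), 0).over(w)
    cond = psf.when(df2.category == c, branch) if cond is None else cond.when(df2.category == c, branch)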
Now we can apply the condition:
df2.select("*", cond.alias("return_test")).show()
+--------+----+-----+----------+-------------+----+-----------+
|category|year|month|weeknumber|lag_attribute|runs|return_test|
+--------+----+-----+----------+-------------+----+-----------+
| 1| 0| 0| 0| 0| 2| 0|
| 1|2019| 1| 1| 1| 0| 0|
| 1|2019| 1| 2| 2| 0| 0|
| 1|2019| 1| 3| 3| 0| 0|
| 1|2019| 1| 4| 4| 1| 1|
| 1|2019| 1| 5| 5| 2| 2|
| 1|2019| 1| 6| 6| 3| 3|
| 1|2019| 1| 7| 7| 4| 4|
| 1|2019| 1| 8| 8| 5| 5|
| 1|2019| 1| 9| 9| 6| 6|
| 2| 0| 0| 0| 9| 0| 0|
| 2|2018| 1| 1| 2| 0| 0|
| 2|2018| 1| 2| 3| 2| 9|
| 2|2018| 1| 3| 4| 3| 2|
| 2|2018| 1| 3| 5| 4| 3|
+--------+----+-----+----------+-------------+----+-----------+
If df1 is big (many distinct categories), you can instead self-join df2 on a shifted row number:
First we'll bring the values from df1 to df2 using a join:
df = df2.join(df1, "category")
If df1 is not too big, you should broadcast it:
import pyspark.sql.functions as psf
df = df2.join(psf.broadcast(df1), "category")
Now we'll enumerate the rows in each partition and build a lag column:
from pyspark.sql import Window
w = Window.partitionBy("category").orderBy("year", "month", "weeknumber")
left = df.withColumn('rn', psf.row_number().over(w))  # row number within each category
# shift the row number by the category's offset so that joining back on rn performs the lag
right = left.select(left.category, (left.rn + left.value).alias("rn"), left.lag_attribute.alias("return_test"))
left.join(right, ["category", "rn"], "left")\
.na.fill(0)\
.sort("category", "rn").show()
+--------+---+----+-----+----------+-------------+----+-----+-----------+
|category| rn|year|month|weeknumber|lag_attribute|runs|value|return_test|
+--------+---+----+-----+----------+-------------+----+-----+-----------+
| 1| 1| 0| 0| 0| 0| 2| 3| 0|
| 1| 2|2019| 1| 1| 1| 0| 3| 0|
| 1| 3|2019| 1| 2| 2| 0| 3| 0|
| 1| 4|2019| 1| 3| 3| 0| 3| 0|
| 1| 5|2019| 1| 4| 4| 1| 3| 1|
| 1| 6|2019| 1| 5| 5| 2| 3| 2|
| 1| 7|2019| 1| 6| 6| 3| 3| 3|
| 1| 8|2019| 1| 7| 7| 4| 3| 4|
| 1| 9|2019| 1| 8| 8| 5| 3| 5|
| 1| 10|2019| 1| 9| 9| 6| 3| 6|
| 2| 1| 0| 0| 0| 9| 0| 2| 0|
| 2| 2|2018| 1| 1| 2| 0| 2| 0|
| 2| 3|2018| 1| 2| 3| 2| 2| 9|
| 2| 4|2018| 1| 3| 4| 3| 2| 2|
| 2| 5|2018| 1| 3| 5| 4| 2| 3|
+--------+---+----+-----+----------+-------------+----+-----+-----------+
Note: there is a problem with your expected runs values; for category=2, for instance, they are lagged by only 1 instead of 2. Also, some rows share the same ordering key (e.g. the last two rows of your sample dataframe df2 have the same category, year, month and weeknumber); since shuffling is involved, such ties can give you different results every time you run the code.
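One way to make the ordering deterministic (a sketch, not part of the original answer; it assumes the original row position is an acceptable tie-breaker) is to add a unique id column to df2 and include it as the last ordering key in whichever window you use:

import pyspark.sql.functions as psf
from pyspark.sql import Window

df2 = df2.withColumn("row_id", psf.monotonically_increasing_id())  # unique, increasing id per row
w = Window.partitionBy("category").orderBy("year", "month", "weeknumber", "row_id")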
Related
How to sort Pandas crosstab columns by sum of values
I have a crosstab with 4 rows and multiple columns, containing numeric values (the number of dataset elements at the crossing of two factors). I want to sort the columns of the crosstab by the sum of values in each column. E.g. I have:

ct = pd.crosstab(df_flt_reg['experience'], df_flt_reg['region'])

      |  a|  b|  c|  d|  e|
0     |  1|  0|  7|  3|  6|
1     |  2|  4|  1|  5|  4|
2     |  3|  5|  0|  7|  2|
3     |  1|  3|  1|  9|  1|
(sum) |  7| 12|  9| 24| 13|   # this row doesn't exist, written here to make the logic clear

What I want:

      |  d|  e|  b|  c|  a|
0     |  3|  6|  0|  7|  1|
1     |  5|  4|  4|  1|  2|
2     |  7|  2|  5|  0|  3|
3     |  9|  1|  3|  1|  1|
(sum) | 24| 13| 12|  9|  7|

I only succeeded in sorting the columns by their names (in alphabetical order), but that's not what I need. I tried summing the values separately, building a list of properly ordered indexes and passing it to crosstab.sort_values() via the "by" parameter, but it was sorted in alphabetical order again. I also tried to create a new "sum" row, but only managed to create a new column. So I humbly ask for the community's help.
Calculate the sum and sort the values. Once you have the sorted series, get the index and reorder your columns with it.

sorted_df = ct[ct.sum().sort_values(ascending=False).index]

   d  e  b  c  a
0  3  6  0  7  1
1  5  4  4  1  2
2  7  2  5  0  3
3  9  1  3  1  1
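A minimal end-to-end sketch of the same idea, using a small made-up df_flt_reg (the data values here are assumptions for illustration, not the asker's data):

import pandas as pd

# toy data with the same shape of problem: two categorical factors
df_flt_reg = pd.DataFrame({
    "experience": [0, 0, 1, 1, 2, 3, 3],
    "region":     ["a", "d", "d", "e", "b", "d", "e"],
})

ct = pd.crosstab(df_flt_reg["experience"], df_flt_reg["region"])
# ct.sum() sums each column; sort_values() orders the column labels by that sum,
# and indexing ct with the resulting index reorders the columns.
sorted_ct = ct[ct.sum().sort_values(ascending=False).index]
print(sorted_ct)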
best way to perform one-hot encoding and explode column with prefix pyspark
Trying to replicate pandas code in pyspark 2.x. Say I have a dataframe as follows:

   age education   country
0   22         A    Canada
1   34         B  Mongolia
2   55         A      Peru
3   44         C     Korea

Usually in pandas I would scale the numerical columns and one-hot encode the categorical ones to get:

   age  education_A  education_B  education_C  country_Canada  country_Mongolia  ...
0  0              1            0            0               1                 0
1  0.3            0            1            0               0                 0
2  1              1            0            0               0                 0   ...
3  0.7            0            0            1             ...               ...

In pyspark I've done:

str_indexer1 = StringIndexer(inputCol="education", outputCol=education+"_si", handleInvalid="skip")
str_indexer2 = StringIndexer(inputCol="country", outputCol=country+"_si", handleInvalid="skip")
mod_df = str_indexer1.fit(df).transform(df)
mod_df = str_indexer2.fit(df).transform(mod_df)
ohe1 = OneHotEncoder(inputCol="education", outputCol=education+"_ohe")
ohe2 = OneHotEncoder(inputCol="country", outputCol=country+"_ohe")
ohe1.fit(mod_df).transform(mod_df)

This gives me:

   age education   country  education_si  country_si  education_ohe
0  0           A    Canada             1           1      (1,0,0,0)
1  0.3         B  Mongolia             2           2      (0,1,0,0)
2  1           A      Peru             1           3      (1,0,0,0)
3  0.7         C     Korea             3           4      (0,0,1,0)

From here I cannot find out how to explode education_ohe into education_A, etc. How can I do this, and is there a more efficient way to perform the one-hot encoding and scaling on a large dataframe?
There is a built-in OneHotEncoder in pyspark, but I could not get it to produce true one-hot encoded columns. That being said, the following code will get the desired result. Using the following dataframe:

df.show()
+---+---------+--------+
|age|education| country|
+---+---------+--------+
| 22|        A|  Canada|
| 34|        B|Mongolia|
| 55|        A|    Peru|
| 44|        C|   Korea|
+---+---------+--------+

You can select all the distinct column values and iteratively create additional columns:

variables_dict = {}
for col in df.columns:
    variables_dict[col] = [row[col] for row in df.distinct().collect()]

for col in variables_dict.keys():
    for val in variables_dict[col]:
        df = df.withColumn("{}_{}".format(col, val), functions.when((df[col] == val), 1).otherwise(0))

df.show()
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
|age|education| country|age_22|age_44|age_55|age_34|education_A|education_C|education_B|country_Canada|country_Korea|country_Peru|country_Mongolia|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
| 22|        A|  Canada|     1|     0|     0|     0|          1|          0|          0|             1|            0|           0|               0|
| 34|        B|Mongolia|     0|     0|     0|     1|          0|          0|          1|             0|            0|           0|               1|
| 55|        A|    Peru|     0|     0|     1|     0|          1|          0|          0|             0|            0|           1|               0|
| 44|        C|   Korea|     0|     1|     0|     0|          0|          1|          0|             0|            1|           0|               0|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+

You can then use the same variables_dict to apply the one-hot encoding to another dataframe:

df2.show()
+---+---------+--------+
|age|education| country|
+---+---------+--------+
| 22|        A|  Canada|
| 66|        B|Mongolia|
| 55|        D|    Peru|
| 44|        C|   China|
+---+---------+--------+

The dataframe above has column values that have not been seen before, i.e. (66, D, China). We can use the following code to transform df2 so that it has columns identical to df:

for col in variables_dict.keys():
    for val in variables_dict[col]:
        df2 = df2.withColumn("{}_{}".format(col, val), functions.when((df2[col] == val), 1).otherwise(0))

+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
|age|education| country|age_22|age_44|age_55|age_34|education_A|education_C|education_B|country_Canada|country_Korea|country_Peru|country_Mongolia|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
| 22|        A|  Canada|     1|     0|     0|     0|          1|          0|          0|             1|            0|           0|               0|
| 66|        B|Mongolia|     0|     0|     0|     0|          0|          0|          1|             0|            0|           0|               1|
| 55|        D|    Peru|     0|     0|     1|     0|          0|          0|          0|             0|            0|           1|               0|
| 44|        C|   China|     0|     1|     0|     0|          0|          1|          0|             0|            0|           0|               0|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
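A small follow-up sketch (not part of the original answer): the same expansion can be generated in a single select, which keeps the query plan flatter than chaining one withColumn call per generated column. It assumes df still holds only the original columns (run it instead of the withColumn loop) and reuses the variables_dict built above:

from pyspark.sql import functions

# one 0/1 column expression per (source column, distinct value) pair
ohe_cols = [
    functions.when(functions.col(c) == v, 1).otherwise(0).alias("{}_{}".format(c, v))
    for c, vals in variables_dict.items()
    for v in vals
]
df_ohe = df.select(df.columns + ohe_cols)
df_ohe.show()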
Grouped window operation in pyspark: restart sum by condition [duplicate]
I have this dataframe:

+---+----+---+
|  A|   B|  C|
+---+----+---+
|  0|null|  1|
|  1| 3.0|  0|
|  2| 7.0|  0|
|  3|null|  1|
|  4| 4.0|  0|
|  5| 3.0|  0|
|  6|null|  1|
|  7|null|  1|
|  8|null|  1|
|  9| 5.0|  0|
| 10| 2.0|  0|
| 11|null|  1|
+---+----+---+

What I need to do is a cumulative sum of the values in column C until the next value is zero. Expected output:

+---+----+---+----+
|  A|   B|  C|   D|
+---+----+---+----+
|  0|null|  1|   1|
|  1| 3.0|  0|   0|
|  2| 7.0|  0|   0|
|  3|null|  1|   1|
|  4| 4.0|  0|   0|
|  5| 3.0|  0|   0|
|  6|null|  1|   1|
|  7|null|  1|   2|
|  8|null|  1|   3|
|  9| 5.0|  0|   0|
| 10| 2.0|  0|   0|
| 11|null|  1|   1|
+---+----+---+----+

To reproduce the dataframe:

from pyspark.shell import sc
from pyspark.sql import Window
from pyspark.sql.functions import lag, when, sum

x = sc.parallelize([
    [0, None], [1, 3.], [2, 7.], [3, None], [4, 4.], [5, 3.],
    [6, None], [7, None], [8, None], [9, 5.], [10, 2.], [11, None]])
x = x.toDF(['A', 'B'])

# Transform null values into "1"
x = x.withColumn('C', when(x.B.isNull(), 1).otherwise(0))
Create a temporary column (grp) that increments a counter each time column C is equal to 0 (the reset condition) and use this as a partitioning column for your cumulative sum.

import pyspark.sql.functions as f
from pyspark.sql import Window

x.withColumn(
    "grp",
    f.sum((f.col("C") == 0).cast("int")).over(Window.orderBy("A"))
).withColumn(
    "D",
    f.sum(f.col("C")).over(Window.partitionBy("grp").orderBy("A"))
).drop("grp").show()
#+---+----+---+---+
#|  A|   B|  C|  D|
#+---+----+---+---+
#|  0|null|  1|  1|
#|  1| 3.0|  0|  0|
#|  2| 7.0|  0|  0|
#|  3|null|  1|  1|
#|  4| 4.0|  0|  0|
#|  5| 3.0|  0|  0|
#|  6|null|  1|  1|
#|  7|null|  1|  2|
#|  8|null|  1|  3|
#|  9| 5.0|  0|  0|
#| 10| 2.0|  0|  0|
#| 11|null|  1|  1|
#+---+----+---+---+
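One caveat worth noting (not part of the original answer): Window.orderBy("A") without a partitionBy pulls all rows into a single partition, which is fine for small data but can become a bottleneck. If the data has a natural grouping key, partition both windows on it; a sketch, where "some_key" is a hypothetical grouping column:

import pyspark.sql.functions as f
from pyspark.sql import Window

w_grp = Window.partitionBy("some_key").orderBy("A")  # "some_key" is hypothetical
x.withColumn(
    "grp", f.sum((f.col("C") == 0).cast("int")).over(w_grp)
).withColumn(
    "D", f.sum(f.col("C")).over(Window.partitionBy("some_key", "grp").orderBy("A"))
).drop("grp").show()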
PySpark - Append previous and next row to current row
Let's say I have a PySpark data frame like so:

1 0 1 0
0 0 1 1
0 1 0 1

How can I append the previous and the next row to the current row as new columns, like so:

1 0 1 0 0 0 0 0 0 0 1 1
0 0 1 1 1 0 1 0 0 1 0 1
0 1 0 1 0 0 1 1 0 0 0 0

I'm familiar with the .withColumn() method for adding columns, but am not sure what I would put in this field. The "0 0 0 0" are placeholder values, because there are no prior or subsequent rows before and after those rows.
You can use pyspark.sql.functions.lead() and pyspark.sql.functions.lag(), but first you need a way to order your rows. If you don't already have a column that determines the order, you can create one using pyspark.sql.functions.monotonically_increasing_id(). Then use this in conjunction with a Window function.

For example, if you had the following DataFrame df:

df.show()
#+---+---+---+---+
#|  a|  b|  c|  d|
#+---+---+---+---+
#|  1|  0|  1|  0|
#|  0|  0|  1|  1|
#|  0|  1|  0|  1|
#+---+---+---+---+

You could do:

from pyspark.sql import Window
import pyspark.sql.functions as f

cols = df.columns
df = df.withColumn("id", f.monotonically_increasing_id())
df.select(
    "*",
    *([f.lag(f.col(c), default=0).over(Window.orderBy("id")).alias("prev_"+c) for c in cols] +
      [f.lead(f.col(c), default=0).over(Window.orderBy("id")).alias("next_"+c) for c in cols])
).drop("id").show()
#+---+---+---+---+------+------+------+------+------+------+------+------+
#|  a|  b|  c|  d|prev_a|prev_b|prev_c|prev_d|next_a|next_b|next_c|next_d|
#+---+---+---+---+------+------+------+------+------+------+------+------+
#|  1|  0|  1|  0|     0|     0|     0|     0|     0|     0|     1|     1|
#|  0|  0|  1|  1|     1|     0|     1|     0|     0|     1|     0|     1|
#|  0|  1|  0|  1|     0|     0|     1|     1|     0|     0|     0|     0|
#+---+---+---+---+------+------+------+------+------+------+------+------+
How to set new flag based on condition in pyspark?
I have two data frames like below.

df = spark.createDataFrame(sc.parallelize([[1,1,2],[1,2,9], [2,1,2],[2,2,1],
                                           [4,1,5],[4,2,6], [5,1,3],[5,2,8]]),
                           ["sid","cid","Cr"])
df.show()
+---+---+---+
|sid|cid| Cr|
+---+---+---+
|  1|  1|  2|
|  1|  2|  9|
|  2|  1|  2|
|  2|  2|  1|
|  4|  1|  5|
|  4|  2|  6|
|  5|  1|  3|
|  5|  2|  8|
+---+---+---+

Next I have created df1 like below.

df1 = spark.createDataFrame(sc.parallelize([[1,1],[1,2],[1,3], [2,1],[2,2],[2,3],
                                            [4,1],[4,2],[4,3], [5,1],[5,2],[5,3]]),
                            ["sid","cid"])
df1.show()
+---+---+
|sid|cid|
+---+---+
|  1|  1|
|  1|  2|
|  1|  3|
|  2|  1|
|  2|  2|
|  2|  3|
|  4|  1|
|  4|  2|
|  4|  3|
|  5|  1|
|  5|  2|
|  5|  3|
+---+---+

Now I want my final output to look like below: if a (sid, cid) pair from df1 is present in df, i.e. (df1.sid==df.sid)&(df1.cid==df.cid), then the flag value is 1, else 0, and the missing Cr values will be 0.

+---+---+---+----+
|sid|cid| Cr|flag|
+---+---+---+----+
|  1|  1|  2|   1|
|  1|  2|  9|   1|
|  1|  3|  0|   0|
|  2|  1|  2|   1|
|  2|  2|  1|   1|
|  2|  3|  0|   0|
|  4|  1|  5|   1|
|  4|  2|  6|   1|
|  4|  3|  0|   0|
|  5|  1|  3|   1|
|  5|  2|  8|   1|
|  5|  3|  0|   0|
+---+---+---+----+

Please help me with this.
With data:

from pyspark.sql.functions import col, when, lit, coalesce

df = spark.createDataFrame(
    [(1, 1, 2), (1, 2, 9), (2, 1, 2), (2, 2, 1), (4, 1, 5), (4, 2, 6), (5, 1, 3), (5, 2, 8)],
    ("sid", "cid", "Cr"))

df1 = spark.createDataFrame(
    [[1,1],[1,2],[1,3], [2,1],[2,2],[2,3],[4,1],[4,2],[4,3],[5,1],[5,2],[5,3]],
    ["sid","cid"])

outer join:

joined = (df.alias("df")
    .join(
        df1.alias("df1"),
        (col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
        "rightouter"))

and select:

joined.select(
    col("df1.*"),
    coalesce(col("Cr"), lit(0)).alias("Cr"),
    col("df.sid").isNotNull().cast("integer").alias("flag")
).orderBy("sid", "cid").show()
# +---+---+---+----+
# |sid|cid| Cr|flag|
# +---+---+---+----+
# |  1|  1|  2|   1|
# |  1|  2|  9|   1|
# |  1|  3|  0|   0|
# |  2|  1|  2|   1|
# |  2|  2|  1|   1|
# |  2|  3|  0|   0|
# |  4|  1|  5|   1|
# |  4|  2|  6|   1|
# |  4|  3|  0|   0|
# |  5|  1|  3|   1|
# |  5|  2|  8|   1|
# |  5|  3|  0|   0|
# +---+---+---+----+
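An equivalent formulation (a sketch, not from the original answer) is a left join starting from df1, which avoids the explicit join condition and the coalesce; it assumes Cr is never null in df, as in the sample data:

from pyspark.sql.functions import col

result = (df1.join(df, ["sid", "cid"], "left")
             .withColumn("flag", col("Cr").isNotNull().cast("integer"))  # matched rows have a non-null Cr
             .na.fill({"Cr": 0})                                         # unmatched rows get Cr = 0
             .orderBy("sid", "cid"))
result.show()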