Pyspark explode list creating column with index in list - python

So I have a question regarding pyspark. I have a dataframe that looks like this:
+---+------------+
| id|        list|
+---+------------+
|  2|[3, 5, 4, 2]|
|  3|[4, 5, 3, 2]|
+---+------------+
I would like to explode each list into multiple rows while keeping, in a separate column, the position each element had in the list. The result should look like this:
+---+--------+----+
| id|listitem|rank|
+---+--------+----+
|  2|       3|   1|
|  2|       5|   2|
|  2|       4|   3|
|  2|       2|   4|
|  3|       4|   1|
|  3|       5|   2|
|  3|       3|   3|
|  3|       2|   4|
+---+--------+----+
The rank column holds index + 1, i.e. the position each element had in the list. Any suggestions on the most efficient way to achieve this?

You can use the posexplode() or posexplode_outer() function to get the desired result.
from pyspark.sql.functions import col, posexplode_outer

df = spark.createDataFrame([(2, [3, 5, 4, 2]), (3, [4, 5, 3, 2])], ["id", "list"])
df.select('id', posexplode_outer('list').alias('rank', 'listitem')) \
  .withColumn('rank', col('rank') + 1).show()
+---+----+--------+
| id|rank|listitem|
+---+----+--------+
| 2| 1| 3|
| 2| 2| 5|
| 2| 3| 4|
| 2| 4| 2|
| 3| 1| 4|
| 3| 2| 5|
| 3| 3| 3|
| 3| 4| 2|
+---+----+--------+
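If you prefer a single SQL expression, the same result can be written with selectExpr; a minimal sketch, assuming the same df as above and that the generator multi-column alias posexplode(list) as (pos, listitem) is available in your Spark version:
# posexplode() emits the element index and the element itself;
# pos is 0-based, so add 1 to get the rank
df.selectExpr("id", "posexplode(list) as (pos, listitem)") \
  .selectExpr("id", "listitem", "pos + 1 as rank") \
  .show()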

Related

Calculate percentages of occurrences by rolling window in pyspark

I have the following pyspark dataframe:
import pandas as pd

foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+-----+
| id|time|value|
+---+----+-----+
| 1| 1| a|
| 1| 2| a|
| 1| 3| a|
| 1| 4| b|
| 1| 5| b|
| 2| 1| b|
| 2| 2| b|
| 2| 3| c|
| 2| 4| c|
| 2| 5| c|
+---+----+-----+
For a rolling time window of 3, I would like to calculate the percentage of appearances of each value in the value column. The operation should be done per id.
The output dataframe would look something like this:
+---+------------------+------------------+------------------+
| id| perc_a| perc_b| perc_c|
+---+------------------+------------------+------------------+
| 1| 1.0| 0.0| 0.0|
| 1|0.6666666666666666|0.3333333333333333| 0.0|
| 1|0.3333333333333333|0.6666666666666666| 0.0|
| 2| 0.0|0.6666666666666666|0.3333333333333333|
| 2| 0.0|0.3333333333333333|0.6666666666666666|
| 2| 0.0| 0.0| 1.0|
+---+------------------+------------------+------------------+
Explanation of the result:
for id=1 and the first window (time=[1,2,3]), the value column contains only a's, so perc_a is 100% (1.0) and the rest are 0.
for id=1 and the second window (time=[2,3,4]), the value column contains 2 a's and 1 b, so perc_a is 66.6%, perc_b is 33.3% and perc_c is 0.
etc.
How could I achieve that in pyspark?
EDIT
I am using pyspark 2.4
You can use count with a window function.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id').orderBy('time').rowsBetween(Window.currentRow, 2)
df = (foo_df.select('id', F.col('time').alias('window'),
                    *[(F.count(F.when(F.col('value') == x, 'value')).over(w)
                       / F.count('value').over(w) * 100).alias(f'perc_{x}')
                      for x in ['a', 'b', 'c']])
            .filter(F.col('time') < 4))
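A small follow-up on the output format, assuming the snippet above: the expression multiplies by 100, so the result comes out on a 0-100 scale like the SparkSQL output further below. Dropping the * 100 reproduces the fractional form shown in the expected output:
# same computation without the "* 100", giving 0-1 fractions
df_frac = (foo_df.select('id', 'time',
                         *[(F.count(F.when(F.col('value') == x, 'value')).over(w)
                            / F.count('value').over(w)).alias(f'perc_{x}')
                           for x in ['a', 'b', 'c']])
                 .filter(F.col('time') < 4))
df_frac.show()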
Clever answer by @Emma. Expanding on it with a SparkSQL implementation.
The approach is to collect the values over the intended sliding row range (ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING), filter on time < 4, then explode the collected list to count the individual frequencies, and finally pivot the result into the intended format.
SparkSQL - Collect List
foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})
sparkDF = sql.createDataFrame(foo)
sparkDF.registerTempTable("INPUT")

sql.sql("""
SELECT
    id,
    time,
    value,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) as window_map,
    COLLECT_LIST(value) OVER (PARTITION BY id ORDER BY time
                              ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) as collected_list
FROM INPUT
""").show()
+---+----+-----+----------+--------------+
| id|time|value|window_map|collected_list|
+---+----+-----+----------+--------------+
| 1| 1| a| 1| [a, a, a]|
| 1| 2| a| 2| [a, a, b]|
| 1| 3| a| 3| [a, b, b]|
| 1| 4| b| 4| [b, b]|
| 1| 5| b| 5| [b]|
| 2| 1| b| 1| [b, b, c]|
| 2| 2| b| 2| [b, c, c]|
| 2| 3| c| 3| [c, c, c]|
| 2| 4| c| 4| [c, c]|
| 2| 5| c| 5| [c]|
+---+----+-----+----------+--------------+
SparkSQL - Explode - Frequency Calculation
immDF = sql.sql(
    """
    SELECT
        id,
        time,
        exploded_value,
        COUNT(*) as value_count
    FROM (
        SELECT
            id,
            time,
            value,
            window_map,
            EXPLODE(collected_list) as exploded_value
        FROM (
            SELECT
                id,
                time,
                value,
                ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) as window_map,
                COLLECT_LIST(value) OVER (PARTITION BY id ORDER BY time
                                          ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) as collected_list
            FROM INPUT
        )
        WHERE window_map < 4 -- keep only the full 3-row windows
    )
    GROUP BY 1,2,3
    ORDER BY id,time
    """
)
immDF.registerTempTable("IMM_RESULT")
immDF.show()
+---+----+--------------+-----------+
| id|time|exploded_value|value_count|
+---+----+--------------+-----------+
| 1| 1| a| 3|
| 1| 2| b| 1|
| 1| 2| a| 2|
| 1| 3| a| 1|
| 1| 3| b| 2|
| 2| 1| b| 2|
| 2| 1| c| 1|
| 2| 2| b| 1|
| 2| 2| c| 2|
| 2| 3| c| 3|
+---+----+--------------+-----------+
SparkSQL - Pivot
sql.sql("""
SELECT
id,
time,
ROUND(NVL(a,0),2) as perc_a,
ROUND(NVL(b,0),2) as perc_b,
ROUND(NVL(c,0),2) as perc_c
FROM IMM_RESULT
PIVOT (
MAX(value_count)/3 * 100.0
FOR exploded_value IN ('a'
,'b'
,'c'
)
)
""").show()
+---+----+------+------+------+
| id|time|perc_a|perc_b|perc_c|
+---+----+------+------+------+
| 1| 1| 100.0| 0.0| 0.0|
| 1| 2| 66.67| 33.33| 0.0|
| 1| 3| 33.33| 66.67| 0.0|
| 2| 1| 0.0| 66.67| 33.33|
| 2| 2| 0.0| 33.33| 66.67|
| 2| 3| 0.0| 0.0| 100.0|
+---+----+------+------+------+

How to write for loop or episode in Pyspark

I have a dataframe, and I am trying to write a for loop on it.
+-----+----------+----------+----------+----+---------------+
|   ID|   from_dt|     To_dt|row_number|diff|negetive_or_not|
+-----+----------+----------+----------+----+---------------+
|11111|2020-07-30|2020-07-31|         1|  -2|              0|
|11111|2020-08-02|2020-08-11|         2|   4|              1|
|11111|2020-08-07|2020-08-08|         3|  -4|              0|
|11111|2020-08-12|2020-08-18|         4|   1|              1|
|11111|2020-08-17|2020-08-19|         5|   0|              1|
|11111|2020-08-19|2020-08-22|         6|   2|              1|
|11111|2020-08-20|2020-08-24|         7|  -1|              0|
|11111|2020-08-25|2020-08-27|         8|   0|              1|
|11111|2020-08-27|2020-08-31|         9|-999|              0|
+-----+----------+----------+----------+----+---------------+
The goal is to determine the episode: if a segment starts with negetive_or_not = 0, that row alone is an episode; if it starts with 1, the episode runs until (and including) the first 0.
Here is the ideal output
+-----+----------+----------+----------+----+---------------+-------+
|   ID|   from_dt|     To_dt|row_number|diff|negetive_or_not|Episode|
+-----+----------+----------+----------+----+---------------+-------+
|11111|2020-07-30|2020-07-31|         1|  -2|              0|      1|
|11111|2020-08-02|2020-08-11|         2|   4|              1|      2|
|11111|2020-08-07|2020-08-08|         3|  -4|              0|      2|
|11111|2020-08-12|2020-08-18|         4|   1|              1|      3|
|11111|2020-08-17|2020-08-19|         5|   0|              1|      3|
|11111|2020-08-19|2020-08-22|         6|   2|              1|      3|
|11111|2020-08-20|2020-08-24|         7|  -1|              0|      3|
|11111|2020-08-25|2020-08-27|         8|   0|              1|      4|
|11111|2020-08-27|2020-08-31|         9|-999|              0|      4|
|22222|2020-07-30|2020-07-31|         1|  -2|              0|      1|
|22222|2020-08-02|2020-08-11|         2|   4|              1|      2|
|22222|2020-08-07|2020-08-08|         3|  -4|              0|      2|
+-----+----------+----------+----------+----+---------------+-------+
I tried to use case when and rank, such as
case when negetive_or_not = 0 then "eps1" else "eps2" end, but neither worked.
df2 = df.selectExpr('*').withColumn("Episode", lead(col("to_dt")).over(Window.partitionBy("patient_id").orderBy(col("negetive_or_not"))))
I also tried to write a for loop in pyspark, but I have difficulty turning the dataframe into a list; any suggestions will be highly appreciated.
The approach is:
1. Flag the rows where the value should change.
2. Generate a new value at each flagged row.
3. Forward-fill the nulls.
from pyspark.sql.functions import col, lag, last, lit, row_number, when
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('row_number')

df.withColumn('flag', when(((col('negetive_or_not') == 1) & (lag('negetive_or_not').over(w) == 0))
                           | lag('negetive_or_not').over(w).isNull(),
                           lit('change')).otherwise(lit('no'))).\
    withColumn('ep', when(col('flag') == 'change',
                          row_number().over(Window.partitionBy('ID', 'flag').orderBy('row_number')))).\
    withColumn('Episode', last('ep', ignorenulls=True).over(w)).\
    drop('flag', 'ep').show()
+-----+----------+----------+----------+----+---------------+-------+
| ID| from_dt| To_dt|row_number|diff|negetive_or_not|Episode|
+-----+----------+----------+----------+----+---------------+-------+
|11111|2020-07-30|2020-07-31| 1| -2| 0| 1|
|11111|2020-08-02|2020-08-11| 2| 4| 1| 2|
|11111|2020-08-07|2020-08-08| 3| -4| 0| 2|
|11111|2020-08-12|2020-08-18| 4| 1| 1| 3|
|11111|2020-08-17|2020-08-19| 5| 0| 1| 3|
|11111|2020-08-19|2020-08-22| 6| 2| 1| 3|
|11111|2020-08-20|2020-08-24| 7| -1| 0| 3|
|11111|2020-08-25|2020-08-27| 8| 0| 1| 4|
|11111|2020-08-27|2020-08-31| 9|-999| 0| 4|
+-----+----------+----------+----------+----+---------------+-------+
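An alternative sketch that avoids the forward fill: mark each row that starts a new episode (the first row of the partition, or a 1 that follows a 0) and take a running sum of that marker. This assumes the same df and column names as above.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('row_number')

# 1 where a new episode begins, 0 otherwise
df_flagged = df.withColumn(
    'is_start',
    F.when(F.lag('negetive_or_not').over(w).isNull()
           | ((F.col('negetive_or_not') == 1) & (F.lag('negetive_or_not').over(w) == 0)), 1)
     .otherwise(0))

# running sum of the start markers yields the episode number
df_flagged.withColumn('Episode', F.sum('is_start').over(w)).drop('is_start').show()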

Replace characters in column names in pyspark data frames

I have a dataframe like below in Pyspark
df = spark.createDataFrame([(2, 'john', 1, 1),
                            (2, 'john', 1, 2),
                            (3, 'pete', 8, 3),
                            (3, 'pete', 8, 4),
                            (5, 'steve', 9, 5)],
                           ['id', '/na/me', 'val/ue', 'rank/'])
df.show()
+---+------+------+-----+
| id|/na/me|val/ue|rank/|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
Now I want to replace / in the column names with an underscore _; but if the / is at the start or end of a column name, it should simply be removed, not replaced with _.
I have done like below
for name in df.schema.names:
    df = df.withColumnRenamed(name, name.replace('/', '_'))

>>> df
DataFrame[id: bigint, _na_me: string, val_ue: bigint, rank_: bigint]
>>> df.show()
+---+------+------+-----+
| id|_na_me|val_ue|rank_|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
How can I achieve my desired result which is below
+---+------+------+-----+
| id| na_me|val_ue| rank|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
Try a regular-expression replace (re.sub) on the Python side.
import re

cols = [re.sub(r'(^_|_$)', '', f.replace("/", "_")) for f in df.columns]

df = spark.createDataFrame([(2, 'john', 1, 1),
                            (2, 'john', 1, 2),
                            (3, 'pete', 8, 3),
                            (3, 'pete', 8, 4),
                            (5, 'steve', 9, 5)],
                           ['id', '/na/me', 'val/ue', 'rank/'])

df.toDF(*cols).show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
#or using a for loop on schema.names
for name in df.schema.names:
    df = df.withColumnRenamed(name, re.sub(r'(^_|_$)', '', name.replace('/', '_')))
df.show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
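A small hedged variant, in case a column name could legitimately begin or end with an underscore: strip the leading/trailing slashes first, then replace the remaining ones, so only the slashes are ever touched.
import re

original_cols = ['id', '/na/me', 'val/ue', 'rank/']   # the slashed names from the question
cols = [re.sub(r'^/+|/+$', '', c).replace('/', '_') for c in original_cols]
print(cols)   # ['id', 'na_me', 'val_ue', 'rank']
# df.toDF(*cols) then renames the columns exactly as in the answer above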

How can we set a flag for the last occurrence of a value in a column of a Pyspark Dataframe

Requirement: for each consumer, flag the last occurrence of loyal = 1 with 1; all other rows get 0.
Input:
+-----------+----------+----------+-------+-----+---------+-------+---+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|
+-----------+----------+----------+-------+-----+---------+-------+---+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5|
| 11| 2|1156854301| MMVVV| 0|4/17/2020| 5| 5|
| 12| 1|1156854302| VVVVM| 1| 3/6/2020| 1| 3|
| 12| 1|1156854303| VVVVV| 1| 3/7/2020| 2| 3|
| 12| 2|1156854304| MVVVV| 1|3/16/2020| 3| 3|
+-----------+----------+----------+-------+-----+---------+-------+---+
df = spark.createDataFrame(
    [('11', '1', '1152397078', 'VVVVM', 1, '3/5/2020', 1, 5),
     ('11', '1', '1152944770', 'VVVVV', 1, '3/6/2020', 2, 5),
     ('11', '1', '1153856408', 'VVVVV', 1, '3/15/2020', 3, 5),
     ('11', '2', '1155884040', 'MVVVV', 1, '4/2/2020', 4, 5),
     ('11', '2', '1156854301', 'MMVVV', 0, '4/17/2020', 5, 5),
     ('12', '1', '1156854302', 'VVVVM', 1, '3/6/2020', 1, 3),
     ('12', '1', '1156854303', 'VVVVV', 1, '3/7/2020', 2, 3),
     ('12', '2', '1156854304', 'MVVVV', 1, '3/16/2020', 3, 3)],
    ["consumer_id", "product_id", "TRX_ID", "pattern", "loyal", "trx_date", "row_num", "mx"])
df.show()
Output required:
Note: the Flag only marks, per consumer, the last row whose loyal value is 1.
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5| 0|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5| 0|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5| 0|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5| 1|
| 11| 2|1156854301| MMVVV| 0|4/17/2020| 5| 5| 0|
| 12| 1|1156854302| VVVVM| 1| 3/6/2020| 1| 3| 0|
| 12| 1|1156854303| VVVVV| 1| 3/7/2020| 2| 3| 0|
| 12| 2|1156854304| MVVVV| 1|3/16/2020| 3| 3| 1|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
What I tried:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w2 = Window().partitionBy("consumer_id").orderBy('row_num')
df = spark.sql("""select * from inter_table""")
df = df.withColumn("Flag",F.when(F.last(F.col('loyal') == 1).over(w),1).otherwise(0))
There are two scenarios here:
1. loyal = 1 followed by a 0 (see row_num 4 for consumer_id 11).
2. loyal = 1 with no following row (see row_num 3 for consumer_id 12).
To add to Murtaza's answer, we can extend the condition so it also covers the second scenario, where lead() returns null.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

window = Window.partitionBy('consumer_id').orderBy('row_num')
df.withColumn('Flag', f.when((f.col('loyal') == 1)
                             & ((f.lead(f.col('loyal')).over(window) == 0)
                                | f.lead(f.col('loyal')).over(window).isNull()),
                             f.lit('1')).otherwise(f.lit('0'))).show()
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5| 0|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5| 0|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5| 0|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5| 1|
| 11| 2|1156854300| MMVVV| 0|4/17/2020| 5| 5| 0|
| 12| 1|1156854300| VVVVM| 1| 3/6/2020| 1| 4| 0|
| 12| 1|1156854300| VVVVV| 1| 3/7/2020| 2| 4| 0|
| 12| 2|1156854300| MVVVV| 1|3/16/2020| 3| 4| 0|
| 12| 1|1156854300| MVVVV| 1| 4/3/2020| 4| 4| 1|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
Try this.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("product_id").orderBy('row_num')
df.withColumn("flag", F.when((F.col("loyal")==1)\
&(F.lead("loyal").over(w)==0),F.lit(1))\
.otherwise(F.lit(0))).show()
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
#|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|flag|
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
#| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5| 0|
#| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5| 0|
#| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5| 0|
#| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5| 1|
#| 11| 2|1156854300| MMVVV| 0|4/17/2020| 5| 5| 0|
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
UPDATE:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("consumer_id").orderBy('row_num')
lead = F.lead("loyal").over(w)
df.withColumn("Flag", F.when((F.col("loyal") == 1)
                             & ((lead == 0) | lead.isNull()), F.lit(1))
               .otherwise(F.lit(0))).show()
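Another way to express "last occurrence of loyal = 1 per consumer", sketched here without lead(): compute the largest row_num among the loyal rows over the whole consumer partition and flag the row that matches it. This assumes the same df and column names as above.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_all = Window.partitionBy("consumer_id")

# row_num of the last row with loyal = 1 within each consumer
last_loyal_row = F.max(F.when(F.col("loyal") == 1, F.col("row_num"))).over(w_all)

df.withColumn("Flag", F.when(F.col("row_num") == last_loyal_row, 1).otherwise(0)).show()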

AnalysisException: "cannot resolve 'df2.*' give input columns Pyspark?

I have created two data frames like below.
df = spark.createDataFrame(
    [(1, 1, 2, 4), (1, 2, 9, 5), (2, 1, 2, 1), (2, 2, 1, 2),
     (4, 1, 5, 2), (4, 2, 6, 3), (5, 1, 3, 3), (5, 2, 8, 4)],
    ("sid", "cid", "Cr", "rank"))
df1 = spark.createDataFrame(
    [[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [4, 1], [4, 2],
     [4, 3], [5, 2], [5, 3], [5, 3], [3, 4]],
    ["sid", "cid"])
Because of some requirement I created a sqlContext and a temporary view, like below.
df.createOrReplaceTempView("temp")
df2=sqlContext.sql("select sid,cid,cr,rank from temp")
Then I do a left join based on a condition:
joined = (df2.alias("df")
.join(
df1.alias("df1"),
(col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
"left"))
joined.show()
+---+---+---+----+----+----+
|sid|cid| cr|rank| sid| cid|
+---+---+---+----+----+----+
| 5| 1| 3| 3|null|null|
| 1| 1| 2| 4| 1| 1|
| 4| 2| 6| 3| 4| 2|
| 5| 2| 8| 4| 5| 2|
| 2| 2| 1| 2| 2| 2|
| 4| 1| 5| 2| 4| 1|
| 1| 2| 9| 5| 1| 2|
| 2| 1| 2| 1| 2| 1|
+---+---+---+----+----+----+
Then finally I execute the code below:
final = joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
Then I get an error like this:
AnalysisException: "cannot resolve 'df2.*' given input columns 'cr, sid, sid, cid, cid, rank';"
But my expected output should be:
+---+---+---+----+----+
|sid|cid| Cr|rank|flag|
+---+---+---+----+----+
| 1| 1| 2| 4| 0|
| 1| 2| 9| 5| 0|
| 2| 1| 2| 1| 0|
| 2| 2| 1| 2| 0|
| 4| 1| 5| 2| 0|
| 4| 2| 6| 3| 0|
| 5| 1| 3| 3| 1|
| 5| 2| 8| 4| 0|
+---+---+---+----+----+
The mistake is here:
joined = (df2.alias("df")
          .join(df1.alias("df1"),
                (col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
                "left"))
joined.show()
The DataFrame is aliased as "df", but the final select refers to "df2.*". Either alias it as df2.alias("df2"), or keep the alias and use joined.select(col("df.*"), ...).
The complete solution is:
joined = (df2.alias("df2")
          .join(df1.alias("df1"),
                (col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
                "left"))
joined.show()

final = joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
