I have created two DataFrames as shown below.
df = spark.createDataFrame(
    [(1, 1, 2, 4), (1, 2, 9, 5), (2, 1, 2, 1), (2, 2, 1, 2),
     (4, 1, 5, 2), (4, 2, 6, 3), (5, 1, 3, 3), (5, 2, 8, 4)],
    ("sid", "cid", "Cr", "rank"))
df1 = spark.createDataFrame(
    [[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [4, 1], [4, 2], [4, 3], [5, 2], [5, 3], [5, 3], [3, 4]],
    ["sid", "cid"])
Because of a requirement I created a sqlContext and a temporary view, like below.
df.createOrReplaceTempView("temp")
df2=sqlContext.sql("select sid,cid,cr,rank from temp")
Then I am doing a left join based on a condition:
from pyspark.sql.functions import col

joined = (df2.alias("df")
          .join(
              df1.alias("df1"),
              (col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
              "left"))
joined.show()
+---+---+---+----+----+----+
|sid|cid| cr|rank| sid| cid|
+---+---+---+----+----+----+
| 5| 1| 3| 3|null|null|
| 1| 1| 2| 4| 1| 1|
| 4| 2| 6| 3| 4| 2|
| 5| 2| 8| 4| 5| 2|
| 2| 2| 1| 2| 2| 2|
| 4| 1| 5| 2| 4| 1|
| 1| 2| 9| 5| 1| 2|
| 2| 1| 2| 1| 2| 1|
+---+---+---+----+----+----+
Then finally I am executing the code below:
final = joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
Then I get an error like the one below:
"AnalysisException: "cannot resolve 'df2.*' give input columns 'cr, sid, sid, cid, cid, rank';"
But my expected output should be:
+---+---+---+----+----+
|sid|cid| Cr|rank|flag|
+---+---+---+----+----+
| 1| 1| 2| 4| 0|
| 1| 2| 9| 5| 0|
| 2| 1| 2| 1| 0|
| 2| 2| 1| 2| 0|
| 4| 1| 5| 2| 0|
| 4| 2| 6| 3| 0|
| 5| 1| 3| 3| 1|
| 5| 2| 8| 4| 0|
+---+---+---+----+----+
The mistake is here: the DataFrame is aliased as "df", but the later select refers to "df2.*".
joined = (df2.alias("df")
          .join(
              df1.alias("df1"),
              (col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
              "left"))
joined.show()
So we should either use df2.alias("df2"), or keep the "df" alias and select col("df.*") instead (see the sketch after the complete solution below).
The complete solution is:
joined = (df2.alias("df2")
          .join(
              df1.alias("df1"),
              (col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
              "left"))
joined.show()

final = joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
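Alternatively, keeping the "df" alias and selecting col("df.*") works just as well. A minimal sketch of that variant, assuming the same df2 and df1 as above:

from pyspark.sql.functions import col

# Keep the original "df" alias and reference it consistently in the select.
joined = (df2.alias("df")
          .join(
              df1.alias("df1"),
              (col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
              "left"))

final = joined.select(
    col("df.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
final.show()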
So I have a question regarding pyspark. I have a dataframe that looks like this:
+---+------------+
| id|        list|
+---+------------+
|  2|[3, 5, 4, 2]|
|  3|[4, 5, 3, 2]|
+---+------------+
And I would like to explode the lists into multiple rows, keeping the position each element had in the list in a separate column. The result should look like this:
+---+--------+----+
| id|listitem|rank|
+---+--------+----+
|  2|       3|   1|
|  2|       5|   2|
|  2|       4|   3|
|  2|       2|   4|
|  3|       4|   1|
|  3|       5|   2|
|  3|       3|   3|
|  3|       2|   4|
+---+--------+----+
The rank column holds the position (index + 1) each element had in the list. Any suggestions on the most efficient way to achieve this?
You can use the posexplode() or posexplode_outer() function to get the desired result.
from pyspark.sql.functions import col, posexplode_outer

df = spark.createDataFrame([(2, [3, 5, 4, 2]), (3, [4, 5, 3, 2])], ["id", "list"])
df.select('id', posexplode_outer('list').alias('rank', 'listitem')) \
  .withColumn('rank', col('rank') + 1).show()
+---+----+--------+
| id|rank|listitem|
+---+----+--------+
| 2| 1| 3|
| 2| 2| 5|
| 2| 3| 4|
| 2| 4| 2|
| 3| 1| 4|
| 3| 2| 5|
| 3| 3| 3|
| 3| 4| 2|
+---+----+--------+
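The same result can also be expressed with Spark SQL's generator syntax through selectExpr. A minimal sketch, assuming the same df as above and that your Spark version accepts the multi-column generator alias in selectExpr:

# posexplode yields (pos, col); alias them as (rank, listitem) and shift rank by 1.
df.selectExpr("id", "posexplode(list) as (rank, listitem)") \
  .withColumn("rank", col("rank") + 1).show()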
I have two dataframes as follows.
I want to add the column val_1 from dataframe df_b to dataframe df_a as a new column, based on the condition df_a.col_p == df_b.id.
df_a = sqlContext.createDataFrame(
    [(1412, 31, 1), (2422, 21, 1), (4223, 22, 2), (2244, 43, 1),
     (1235, 54, 1), (4126, 12, 5), (2342, 44, 1)],
    ["idx", "col_n", "col_p"])
df_a.show()
+----+-----+-----+
| idx|col_n|col_p|
+----+-----+-----+
|1412| 31| 1|
|2422| 21| 1|
|4223| 22| 2|
|2244| 43| 1|
|1235| 54| 1|
|4126| 12| 5|
|2342| 44| 1|
+----+-----+-----+
df_b = sqlContext.createDataFrame(
    [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 1, 1), (5, 2, 1), (6, 2, 2)],
    ["id", "val_1", "val_2"])
df_b.show()
+---+-----+-----+
| id|val_1|val_2|
+---+-----+-----+
| 1| 1| 1|
| 2| 1| 1|
| 3| 1| 2|
| 4| 1| 1|
| 5| 2| 1|
| 6| 2| 2|
+---+-----+-----+
Expected output
+----+-----+-----+-----+
| idx|col_n|col_p|val_1|
+----+-----+-----+-----+
|1412| 31| 1| 1|
|2422| 21| 1| 1|
|4223| 22| 2| 1|
|2244| 43| 1| 1|
|1235| 54| 1| 1|
|4126| 12| 5| 2|
|2342| 44| 1| 1|
+----+-----+-----+-----+
My code
cond = (df_a.col_p == df_b.id)
df_a_new = df_a.join(df_b, cond, how='full').withColumn('val_new', F.when(cond, df_b.val_1))
df_a_new = df_a_new.drop(*['id', 'val_1', 'val_2'])
df_a_new = df_a_new.filter(df_a_new.idx.isNotNull())
df_a_new.show()
How can I get the expected output with the correct row order?
You can assign an increasing index to df_a and sort by that index after joining. Also I'd suggest doing a left join rather than a full join.
import pyspark.sql.functions as F

df_a_new = df_a.withColumn('index', F.monotonically_increasing_id()) \
               .join(df_b, df_a.col_p == df_b.id, 'left') \
               .orderBy('index') \
               .select('idx', 'col_n', 'col_p', 'val_1')
df_a_new.show()
+----+-----+-----+-----+
| idx|col_n|col_p|val_1|
+----+-----+-----+-----+
|1412| 31| 1| 1|
|2422| 21| 1| 1|
|4223| 22| 2| 1|
|2244| 43| 1| 1|
|1235| 54| 1| 1|
|4126| 12| 5| 2|
|2342| 44| 1| 1|
+----+-----+-----+-----+
You need to create your own index (e.g. with monotonically_increasing_id) and sort on it again after joining.
There is no way to join while preserving order in Spark, as the rows are partitioned before joining and lose their order before being combined.
Refer: Can Dataframe joins in Spark preserve order?
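If you need strictly consecutive positions (monotonically_increasing_id is increasing but not consecutive), one alternative is to attach the index through the RDD API. A minimal sketch, assuming the same df_a and df_b as above; the indexed and result names are just for illustration:

from pyspark.sql import Row

# zipWithIndex assigns consecutive positions; rebuild Rows with an extra 'index' field.
indexed = df_a.rdd.zipWithIndex() \
    .map(lambda pair: Row(**pair[0].asDict(), index=pair[1])) \
    .toDF()

result = (indexed.join(df_b, indexed.col_p == df_b.id, 'left')
                 .orderBy('index')
                 .select('idx', 'col_n', 'col_p', 'val_1'))
result.show()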
I have a dataframe, and I am trying to write a for loop on it.
+-----+----------+----------+----------+----+---------------+
|   ID|   from_dt|     To_dt|row_number|diff|negetive_or_not|
+-----+----------+----------+----------+----+---------------+
|11111|2020-07-30|2020-07-31|         1|  -2|              0|
|11111|2020-08-02|2020-08-11|         2|   4|              1|
|11111|2020-08-07|2020-08-08|         3|  -4|              0|
|11111|2020-08-12|2020-08-18|         4|   1|              1|
|11111|2020-08-17|2020-08-19|         5|   0|              1|
|11111|2020-08-19|2020-08-22|         6|   2|              1|
|11111|2020-08-20|2020-08-24|         7|  -1|              0|
|11111|2020-08-25|2020-08-27|         8|   0|              1|
|11111|2020-08-27|2020-08-31|         9|-999|              0|
+-----+----------+----------+----------+----+---------------+
The goal is to determine the episode: if negetive_or_not starts with 0, that 0 is an episode on its own; if it starts with 1, the episode runs until it hits a 0 (inclusive).
Here is the ideal output
+-----+----------+----------+----------+----+---------------+-------+
|   ID|   from_dt|     To_dt|row_number|diff|negetive_or_not|Episode|
+-----+----------+----------+----------+----+---------------+-------+
|11111|2020-07-30|2020-07-31|         1|  -2|              0|      1|
|11111|2020-08-02|2020-08-11|         2|   4|              1|      2|
|11111|2020-08-07|2020-08-08|         3|  -4|              0|      2|
|11111|2020-08-12|2020-08-18|         4|   1|              1|      3|
|11111|2020-08-17|2020-08-19|         5|   0|              1|      3|
|11111|2020-08-19|2020-08-22|         6|   2|              1|      3|
|11111|2020-08-20|2020-08-24|         7|  -1|              0|      3|
|11111|2020-08-25|2020-08-27|         8|   0|              1|      4|
|11111|2020-08-27|2020-08-31|         9|-999|              0|      4|
|22222|2020-07-30|2020-07-31|         1|  -2|              0|      1|
|22222|2020-08-02|2020-08-11|         2|   4|              1|      2|
|22222|2020-08-07|2020-08-08|         3|  -4|              0|      2|
+-----+----------+----------+----------+----+---------------+-------+
I tried to use case when and rank, such as
case when negetive_or_not = 0 then "eps1" else "eps2" end
and
df2 = df.selectExpr('*').withColumn("Episode", lead(col("to_dt")).over(Window.partitionBy("patient_id").orderBy(col("negetive_or_not"))))
but neither is working. I also tried to write a for loop in PySpark, but I have difficulty converting the dataframe into a list; any suggestions will be highly appreciated.
The approach goes as follows:
1. First, flag the rows where the value should change.
2. Based on the flag, generate a new value.
3. Forward-fill the nulls.
from pyspark.sql.functions import col, lag, last, lit, row_number, when
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('row_number')

df.withColumn('flag', when(((col('negetive_or_not') == 1) & (lag('negetive_or_not').over(w) == 0))
                           | (lag('negetive_or_not').over(w).isNull()), lit('change')).otherwise(lit('no'))) \
  .withColumn('ep', when(col('flag') == 'change', row_number().over(Window.partitionBy('ID', 'flag').orderBy('row_number')))) \
  .withColumn('Episode', last('ep', ignorenulls=True).over(w)) \
  .drop('flag', 'ep').show()
+-----+----------+----------+----------+----+---------------+-------+
| ID| from_dt| To_dt|row_number|diff|negetive_or_not|Episode|
+-----+----------+----------+----------+----+---------------+-------+
|11111|2020-07-30|2020-07-31| 1| -2| 0| 1|
|11111|2020-08-02|2020-08-11| 2| 4| 1| 2|
|11111|2020-08-07|2020-08-08| 3| -4| 0| 2|
|11111|2020-08-12|2020-08-18| 4| 1| 1| 3|
|11111|2020-08-17|2020-08-19| 5| 0| 1| 3|
|11111|2020-08-19|2020-08-22| 6| 2| 1| 3|
|11111|2020-08-20|2020-08-24| 7| -1| 0| 3|
|11111|2020-08-25|2020-08-27| 8| 0| 1| 4|
|11111|2020-08-27|2020-08-31| 9|-999| 0| 4|
+-----+----------+----------+----------+----+---------------+-------+
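A more compact variant of the same idea (a sketch using the same column names as above) is to mark where a new episode starts and take a running sum of that marker, which numbers the episodes directly:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('row_number')

# A new episode starts on the first row of each ID, or when negetive_or_not flips from 0 to 1.
starts = (F.lag('negetive_or_not').over(w).isNull()
          | ((F.col('negetive_or_not') == 1) & (F.lag('negetive_or_not').over(w) == 0)))
df.withColumn('Episode', F.sum(starts.cast('int')).over(w)).show()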
I have a dataframe like below in PySpark.
df = spark.createDataFrame([(2, 'john', 1, 1),
                            (2, 'john', 1, 2),
                            (3, 'pete', 8, 3),
                            (3, 'pete', 8, 4),
                            (5, 'steve', 9, 5)],
                           ['id', '/na/me', 'val/ue', 'rank/'])
df.show()
+---+------+------+-----+
| id|/na/me|val/ue|rank/|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
Now, in this dataframe, I want to replace / in the column names with an underscore _. But if the / is at the start or end of the column name, it should just be removed, not replaced with _.
I have done it like below:
for name in df.schema.names:
    df = df.withColumnRenamed(name, name.replace('/', '_'))
>>> df
DataFrame[id: bigint, _na_me: string, val_ue: bigint, rank_: bigint]
>>>df.show()
+---+------+------+-----+
| id|_na_me|val_ue|rank_|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
How can I achieve my desired result, which is shown below?
+---+------+------+-----+
| id| na_me|val_ue| rank|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
Try a regular expression replace (re.sub) on the Python side.
import re

df = spark.createDataFrame([(2, 'john', 1, 1),
                            (2, 'john', 1, 2),
                            (3, 'pete', 8, 3),
                            (3, 'pete', 8, 4),
                            (5, 'steve', 9, 5)],
                           ['id', '/na/me', 'val/ue', 'rank/'])

cols = [re.sub(r'(^_|_$)', '', f.replace("/", "_")) for f in df.columns]
df.toDF(*cols).show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
# or using a for loop over schema.names
for name in df.schema.names:
    df = df.withColumnRenamed(name, re.sub(r'(^_|_$)', '', name.replace('/', '_')))
df.show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
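Since the slashes only ever need to be removed at the ends and replaced in the middle, the same renaming can also be done without a regex. A minimal equivalent sketch, assuming df still has the original slashed column names:

# Strip any leading/trailing "/" first, then turn the remaining "/" into "_".
new_cols = [name.strip('/').replace('/', '_') for name in df.columns]
df.toDF(*new_cols).show()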
Requirement: when a row holds the last occurrence of loyal with value 1, set the flag to 1, else 0.
Input:
+-----------+----------+----------+-------+-----+---------+-------+---+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|
+-----------+----------+----------+-------+-----+---------+-------+---+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5|
| 11| 2|1156854301| MMVVV| 0|4/17/2020| 5| 5|
| 12| 1|1156854302| VVVVM| 1| 3/6/2020| 1| 3|
| 12| 1|1156854303| VVVVV| 1| 3/7/2020| 2| 3|
| 12| 2|1156854304| MVVVV| 1|3/16/2020| 3| 3|
+-----------+----------+----------+-------+-----+---------+-------+---+
df = spark.createDataFrame(
    [('11', '1', '1152397078', 'VVVVM', 1, '3/5/2020', 1, 5),
     ('11', '1', '1152944770', 'VVVVV', 1, '3/6/2020', 2, 5),
     ('11', '1', '1153856408', 'VVVVV', 1, '3/15/2020', 3, 5),
     ('11', '2', '1155884040', 'MVVVV', 1, '4/2/2020', 4, 5),
     ('11', '2', '1156854301', 'MMVVV', 0, '4/17/2020', 5, 5),
     ('12', '1', '1156854302', 'VVVVM', 1, '3/6/2020', 1, 3),
     ('12', '1', '1156854303', 'VVVVV', 1, '3/7/2020', 2, 3),
     ('12', '2', '1156854304', 'MVVVV', 1, '3/16/2020', 3, 3)],
    ["consumer_id", "product_id", "TRX_ID", "pattern", "loyal", "trx_date", "row_num", "mx"])
df.show()
Output Required:
Note: the Flag only checks whether a row holds the last loyal value of 1 and sets the flag there.
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5| 0|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5| 0|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5| 0|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5| 1|
| 11| 2|1156854301| MMVVV| 0|4/17/2020| 5| 5| 0|
| 12| 1|1156854302| VVVVM| 1| 3/6/2020| 1| 3| 0|
| 12| 1|1156854303| VVVVV| 1| 3/7/2020| 2| 3| 0|
| 12| 2|1156854304| MVVVV| 1|3/16/2020| 3| 3| 1|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
What I tried:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w2 = Window().partitionBy("consumer_id").orderBy('row_num')
df = spark.sql("""select * from inter_table""")
df = df.withColumn("Flag",F.when(F.last(F.col('loyal') == 1).over(w),1).otherwise(0))
Here there are two scenarios:
1. loyal = 1 followed by a 0 (for your reference, row_num 4 for consumer_id 11).
2. loyal = 1 with no following row (for your reference, row_num 3 for consumer_id 12).
To add to Murtaza's answer, we can extend the condition to also cover your second scenario, where the following value (lead) is null:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

window = Window.partitionBy('consumer_id').orderBy('row_num')

df.withColumn('Flag', f.when((f.col('loyal') == 1)
                             & ((f.lead(f.col('loyal')).over(window) == 0)
                                | (f.lead(f.col('loyal')).over(window).isNull())), f.lit('1'))
                       .otherwise(f.lit('0'))).show()
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5| 0|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5| 0|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5| 0|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5| 1|
| 11| 2|1156854300| MMVVV| 0|4/17/2020| 5| 5| 0|
| 12| 1|1156854300| VVVVM| 1| 3/6/2020| 1| 4| 0|
| 12| 1|1156854300| VVVVV| 1| 3/7/2020| 2| 4| 0|
| 12| 2|1156854300| MVVVV| 1|3/16/2020| 3| 4| 0|
| 12| 1|1156854300| MVVVV| 1| 4/3/2020| 4| 4| 1|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
Try this.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("product_id").orderBy('row_num')
df.withColumn("flag", F.when((F.col("loyal")==1)\
&(F.lead("loyal").over(w)==0),F.lit(1))\
.otherwise(F.lit(0))).show()
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
#|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|flag|
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
#| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5| 0|
#| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5| 0|
#| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5| 0|
#| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5| 1|
#| 11| 2|1156854300| MMVVV| 0|4/17/2020| 5| 5| 0|
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
UPDATE:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("consumer_id").orderBy('row_num')
lead = F.lead("loyal").over(w)

df.withColumn("Flag", F.when((F.col("loyal") == 1)
                             & ((lead == 0) | (lead.isNull())), F.lit(1))
                       .otherwise(F.lit(0))).show()
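Run against the sample df created above, this should reproduce the required Flag column from the expected output (row order may vary):

#+-----------+----------+----------+-------+-----+---------+-------+---+----+
#|consumer_id|product_id|    TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
#|         11|         1|1152397078|  VVVVM|    1| 3/5/2020|      1|  5|   0|
#|         11|         1|1152944770|  VVVVV|    1| 3/6/2020|      2|  5|   0|
#|         11|         1|1153856408|  VVVVV|    1|3/15/2020|      3|  5|   0|
#|         11|         2|1155884040|  MVVVV|    1| 4/2/2020|      4|  5|   1|
#|         11|         2|1156854301|  MMVVV|    0|4/17/2020|      5|  5|   0|
#|         12|         1|1156854302|  VVVVM|    1| 3/6/2020|      1|  3|   0|
#|         12|         1|1156854303|  VVVVV|    1| 3/7/2020|      2|  3|   0|
#|         12|         2|1156854304|  MVVVV|    1|3/16/2020|      3|  3|   1|
#+-----------+----------+----------+-------+-----+---------+-------+---+----+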