I have a dataframe like below in PySpark:
df = spark.createDataFrame([(2,'john',1,1),
                            (2,'john',1,2),
                            (3,'pete',8,3),
                            (3,'pete',8,4),
                            (5,'steve',9,5)],
                           ['id','/na/me','val/ue', 'rank/'])
df.show()
+---+------+------+-----+
| id|/na/me|val/ue|rank/|
+---+------+------+-----+
|  2|  john|     1|    1|
|  2|  john|     1|    2|
|  3|  pete|     8|    3|
|  3|  pete|     8|    4|
|  5| steve|     9|    5|
+---+------+------+-----+
Now in this dataframe I want to replace the / in the column names with an underscore _. But if the / comes at the start or end of the column name, then just remove the / and don't replace it with _.
I have done it like below:
for name in df.schema.names:
    df = df.withColumnRenamed(name, name.replace('/', '_'))
>>> df
DataFrame[id: bigint, _na_me: string, val_ue: bigint, rank_: bigint]
>>> df.show()
+---+------+------+-----+
| id|_na_me|val_ue|rank_|
+---+------+------+-----+
|  2|  john|     1|    1|
|  2|  john|     1|    2|
|  3|  pete|     8|    3|
|  3|  pete|     8|    4|
|  5| steve|     9|    5|
+---+------+------+-----+
How can I achieve my desired result, which is below?
+---+------+------+-----+
| id| na_me|val_ue| rank|
+---+------+------+-----+
|  2|  john|     1|    1|
|  2|  john|     1|    2|
|  3|  pete|     8|    3|
|  3|  pete|     8|    4|
|  5| steve|     9|    5|
+---+------+------+-----+
Try a regular expression replace (re.sub) the Python way.
import re
cols=[re.sub(r'(^_|_$)','',f.replace("/","_")) for f in df.columns]
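# with the column names above, cols should now be ['id', 'na_me', 'val_ue', 'rank']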
df = spark.createDataFrame([(2,'john',1,1),
                            (2,'john',1,2),
                            (3,'pete',8,3),
                            (3,'pete',8,4),
                            (5,'steve',9,5)],
                           ['id','/na/me','val/ue', 'rank/'])
df.toDF(*cols).show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#|  2| john|     1|   1|
#|  2| john|     1|   2|
#|  3| pete|     8|   3|
#|  3| pete|     8|   4|
#|  5|steve|     9|   5|
#+---+-----+------+----+
#or using a for loop on schema.names
for name in df.schema.names:
    df = df.withColumnRenamed(name, re.sub(r'(^_|_$)','',name.replace('/', '_')))
df.show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#|  2| john|     1|   1|
#|  2| john|     1|   2|
#|  3| pete|     8|   3|
#|  3| pete|     8|   4|
#|  5|steve|     9|   5|
#+---+-----+------+----+
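Equivalently, since the stray underscores here can only come from a leading or trailing /, plain str.strip('_') works too (a minimal sketch without re, assuming no column name legitimately starts or ends with an underscore):
df = df.toDF(*[c.replace('/', '_').strip('_') for c in df.columns])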
Related
So I have a question regarding pyspark. I have a dataframe that looks like this:
+---+------------+
| id|        list|
+---+------------+
|  2|[3, 5, 4, 2]|
|  3|[4, 5, 3, 2]|
+---+------------+
And I would like to explode the lists into multiple rows, keeping the position each element had in the list in a separate column. The result should look like this:
+---+------------+------------+
| id|    listitem|        rank|
+---+------------+------------+
|  2|           3|           1|
|  2|           5|           2|
|  2|           4|           3|
|  2|           2|           4|
|  3|           4|           1|
|  3|           5|           2|
|  3|           3|           3|
|  3|           2|           4|
+---+------------+------------+
The rank column holds index + 1, i.e. the 1-based position each element had in the list. Any suggestions on the most optimal code to achieve this?
You can use the posexplode() or posexplode_outer() function to get the desired result.
from pyspark.sql.functions import posexplode_outer, col

df = spark.createDataFrame([(2, [3, 5, 4, 2]), (3, [4, 5, 3, 2])], ["id", "list"])
df.select('id', posexplode_outer('list').alias('rank', 'listitem')) \
    .withColumn('rank', col('rank') + 1).show()
+---+----+--------+
| id|rank|listitem|
+---+----+--------+
|  2|   1|       3|
|  2|   2|       5|
|  2|   3|       4|
|  2|   4|       2|
|  3|   1|       4|
|  3|   2|       5|
|  3|   3|       3|
|  3|   4|       2|
+---+----+--------+
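If you prefer the SQL expression syntax, the same thing can be written with selectExpr (a sketch, assuming the same df as above):
from pyspark.sql.functions import col

df.selectExpr('id', 'posexplode_outer(list) as (rank, listitem)') \
    .withColumn('rank', col('rank') + 1).show()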
How do I duplicate each row of my original dataframe and then add dataframe 2, so that my final output is as shown below? I am writing this in Python with a PySpark dataframe.
What you want is a cross join:
result = df1.crossJoin(df2)
result.show()
#+------+--------+------+-------+------------+-----------------+
#|  name| address|salary|bonus %|allowances %|employee category|
#+------+--------+------+-------+------------+-----------------+
#|   Tom| Chicago| 75000|      5|           5|           onsite|
#|   Tom| Chicago| 75000|     10|          10|        off shore|
#|Martha|New york| 80000|      5|           5|           onsite|
#|Martha|New york| 80000|     10|          10|        off shore|
#|Samuel| Phoenix| 90000|      5|           5|           onsite|
#|Samuel| Phoenix| 90000|     10|          10|        off shore|
#|   Rom|  Dallas| 65000|      5|           5|           onsite|
#|   Rom|  Dallas| 65000|     10|          10|        off shore|
#+------+--------+------+-------+------------+-----------------+
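For reference, the question does not show the two input dataframes; here is a minimal sketch that reproduces the output above (column names and values are inferred from that output, so treat them as assumptions):
df1 = spark.createDataFrame(
    [('Tom', 'Chicago', 75000), ('Martha', 'New york', 80000),
     ('Samuel', 'Phoenix', 90000), ('Rom', 'Dallas', 65000)],
    ['name', 'address', 'salary'])
df2 = spark.createDataFrame(
    [(5, 5, 'onsite'), (10, 10, 'off shore')],
    ['bonus %', 'allowances %', 'employee category'])
Since df1 has 4 rows and df2 has 2, crossJoin pairs every row of df1 with every row of df2 and returns 4 x 2 = 8 rows.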
I have a dataframe, and I am trying to write a for loop on it.
+-----+----------+----------+----------+----+---------------+
|   ID|   from_dt|     To_dt|row_number|diff|negetive_or_not|
+-----+----------+----------+----------+----+---------------+
|11111|2020-07-30|2020-07-31|         1|  -2|              0|
|11111|2020-08-02|2020-08-11|         2|   4|              1|
|11111|2020-08-07|2020-08-08|         3|  -4|              0|
|11111|2020-08-12|2020-08-18|         4|   1|              1|
|11111|2020-08-17|2020-08-19|         5|   0|              1|
|11111|2020-08-19|2020-08-22|         6|   2|              1|
|11111|2020-08-20|2020-08-24|         7|  -1|              0|
|11111|2020-08-25|2020-08-27|         8|   0|              1|
|11111|2020-08-27|2020-08-31|         9|-999|              0|
+-----+----------+----------+----------+----+---------------+
The goal is to determine the episode. If a row has negetive_or_not = 0 on its own, it is an episode; if a run starts with negetive_or_not = 1, then everything up to and including the next 0 is one episode.
Here is the ideal output:
+-----+----------+----------+----------+----+---------------+-------+
|   ID|   from_dt|     To_dt|row_number|diff|negetive_or_not|Episode|
+-----+----------+----------+----------+----+---------------+-------+
|11111|2020-07-30|2020-07-31|         1|  -2|              0|      1|
|11111|2020-08-02|2020-08-11|         2|   4|              1|      2|
|11111|2020-08-07|2020-08-08|         3|  -4|              0|      2|
|11111|2020-08-12|2020-08-18|         4|   1|              1|      3|
|11111|2020-08-17|2020-08-19|         5|   0|              1|      3|
|11111|2020-08-19|2020-08-22|         6|   2|              1|      3|
|11111|2020-08-20|2020-08-24|         7|  -1|              0|      3|
|11111|2020-08-25|2020-08-27|         8|   0|              1|      4|
|11111|2020-08-27|2020-08-31|         9|-999|              0|      4|
|22222|2020-07-30|2020-07-31|         1|  -2|              0|      1|
|22222|2020-08-02|2020-08-11|         2|   4|              1|      2|
|22222|2020-08-07|2020-08-08|         3|  -4|              0|      2|
+-----+----------+----------+----------+----+---------------+-------+
I tried to use case when and rank, such as
case when negetive_or_not = 0 then 'eps1' else 'eps2' end, but both are not working.
df2 = df.selectExpr('*') .withColumn("Episode",lead(col("to_dt")).over(Window.partitionBy("patient_id").orderBy(col("negetive_or_not"))))
I also tried to write a for loop in PySpark, but I have difficulty converting the dataframe into a list; any suggestions will be highly appreciated.
The approach goes as follows:
1. First, flag the rows where the value should change.
2. Based on the flag, generate a new value.
3. Forward-fill the nulls.
from pyspark.sql.functions import col, lag, last, lit, row_number, when
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('row_number')
df.withColumn('flag', when(((col('negetive_or_not')==1) & (lag('negetive_or_not').over(w)==0))
                           | (lag('negetive_or_not').over(w).isNull()), lit('change')).otherwise(lit('no'))).\
    withColumn('ep', when(col('flag')=='change', row_number().over(Window.partitionBy('ID','flag').orderBy('row_number')))).\
    withColumn('Episode', last('ep', ignorenulls=True).over(w)).\
    drop('flag','ep').show()
+-----+----------+----------+----------+----+---------------+-------+
|   ID|   from_dt|     To_dt|row_number|diff|negetive_or_not|Episode|
+-----+----------+----------+----------+----+---------------+-------+
|11111|2020-07-30|2020-07-31|         1|  -2|              0|      1|
|11111|2020-08-02|2020-08-11|         2|   4|              1|      2|
|11111|2020-08-07|2020-08-08|         3|  -4|              0|      2|
|11111|2020-08-12|2020-08-18|         4|   1|              1|      3|
|11111|2020-08-17|2020-08-19|         5|   0|              1|      3|
|11111|2020-08-19|2020-08-22|         6|   2|              1|      3|
|11111|2020-08-20|2020-08-24|         7|  -1|              0|      3|
|11111|2020-08-25|2020-08-27|         8|   0|              1|      4|
|11111|2020-08-27|2020-08-31|         9|-999|              0|      4|
+-----+----------+----------+----------+----+---------------+-------+
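An alternative sketch of the same idea (assuming the same df and column names): mark each row that starts a new episode, then take a running sum of that marker, which replaces the forward-fill step.
from pyspark.sql.functions import col, lag, when, sum as sum_
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('row_number')
df.withColumn('new_ep', when(lag('negetive_or_not').over(w).isNull()
                             | ((col('negetive_or_not') == 1) & (lag('negetive_or_not').over(w) == 0)), 1)
              .otherwise(0)) \
  .withColumn('Episode', sum_('new_ep').over(w)) \
  .drop('new_ep').show()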
Requirement: here, when a row is the last occurrence of loyal with value 1, then set the flag to 1, else 0.
Input:
+-----------+----------+----------+-------+-----+---------+-------+---+
|consumer_id|product_id|    TRX_ID|pattern|loyal| trx_date|row_num| mx|
+-----------+----------+----------+-------+-----+---------+-------+---+
|         11|         1|1152397078|  VVVVM|    1| 3/5/2020|      1|  5|
|         11|         1|1152944770|  VVVVV|    1| 3/6/2020|      2|  5|
|         11|         1|1153856408|  VVVVV|    1|3/15/2020|      3|  5|
|         11|         2|1155884040|  MVVVV|    1| 4/2/2020|      4|  5|
|         11|         2|1156854301|  MMVVV|    0|4/17/2020|      5|  5|
|         12|         1|1156854302|  VVVVM|    1| 3/6/2020|      1|  3|
|         12|         1|1156854303|  VVVVV|    1| 3/7/2020|      2|  3|
|         12|         2|1156854304|  MVVVV|    1|3/16/2020|      3|  3|
+-----------+----------+----------+-------+-----+---------+-------+---+
df = spark.createDataFrame(
    [('11','1','1152397078','VVVVM',1,'3/5/2020',1,5),
     ('11','1','1152944770','VVVVV',1,'3/6/2020',2,5),
     ('11','1','1153856408','VVVVV',1,'3/15/2020',3,5),
     ('11','2','1155884040','MVVVV',1,'4/2/2020',4,5),
     ('11','2','1156854301','MMVVV',0,'4/17/2020',5,5),
     ('12','1','1156854302','VVVVM',1,'3/6/2020',1,3),
     ('12','1','1156854303','VVVVV',1,'3/7/2020',2,3),
     ('12','2','1156854304','MVVVV',1,'3/16/2020',3,3)],
    ["consumer_id","product_id","TRX_ID","pattern","loyal","trx_date","row_num","mx"])
df.show()
Output Required:
Note: Here Flag only checks whether this is the last loyal value containing 1, and sets the flag accordingly.
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|consumer_id|product_id|    TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|         11|         1|1152397078|  VVVVM|    1| 3/5/2020|      1|  5|   0|
|         11|         1|1152944770|  VVVVV|    1| 3/6/2020|      2|  5|   0|
|         11|         1|1153856408|  VVVVV|    1|3/15/2020|      3|  5|   0|
|         11|         2|1155884040|  MVVVV|    1| 4/2/2020|      4|  5|   1|
|         11|         2|1156854301|  MMVVV|    0|4/17/2020|      5|  5|   0|
|         12|         1|1156854302|  VVVVM|    1| 3/6/2020|      1|  3|   0|
|         12|         1|1156854303|  VVVVV|    1| 3/7/2020|      2|  3|   0|
|         12|         2|1156854304|  MVVVV|    1|3/16/2020|      3|  3|   1|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
What I tried:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w2 = Window().partitionBy("consumer_id").orderBy('row_num')
df = spark.sql("""select * from inter_table""")
df = df.withColumn("Flag",F.when(F.last(F.col('loyal') == 1).over(w),1).otherwise(0))
Here there are two scenarios:
1. Value 1 followed by a 0 (for your reference, row_num 4 for consumer_id 11)
2. Value 1 with no following row (for your reference, row_num 3 for consumer_id 12)
To add to Murtaza's answer: we can extend the Flag condition to also cover your second scenario, where the following value is null.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

window = Window.partitionBy('Consumer_id').orderBy('row_num')
df.withColumn('Flag', f.when((f.col('loyal')==1)
                             & ((f.lead(f.col('loyal')).over(window)==0)
                                | (f.lead(f.col('loyal')).over(window).isNull())),
                             f.lit('1')).otherwise(f.lit('0'))).show()
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|consumer_id|product_id|    TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|         11|         1|1152397078|  VVVVM|    1| 3/5/2020|      1|  5|   0|
|         11|         1|1152944770|  VVVVV|    1| 3/6/2020|      2|  5|   0|
|         11|         1|1153856408|  VVVVV|    1|3/15/2020|      3|  5|   0|
|         11|         2|1155884040|  MVVVV|    1| 4/2/2020|      4|  5|   1|
|         11|         2|1156854300|  MMVVV|    0|4/17/2020|      5|  5|   0|
|         12|         1|1156854300|  VVVVM|    1| 3/6/2020|      1|  4|   0|
|         12|         1|1156854300|  VVVVV|    1| 3/7/2020|      2|  4|   0|
|         12|         2|1156854300|  MVVVV|    1|3/16/2020|      3|  4|   0|
|         12|         1|1156854300|  MVVVV|    1| 4/3/2020|      4|  4|   1|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
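One thing to note: f.lit('1') and f.lit('0') make Flag a string column. If you want a numeric flag instead, plain integer literals work too, for example (same logic, assuming the same df):
from pyspark.sql import functions as f
from pyspark.sql.window import Window

window = Window.partitionBy('consumer_id').orderBy('row_num')
df.withColumn('Flag', f.when((f.col('loyal') == 1)
                             & ((f.lead('loyal').over(window) == 0)
                                | (f.lead('loyal').over(window).isNull())), 1).otherwise(0)).show()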
Try this.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("product_id").orderBy('row_num')
df.withColumn("flag", F.when((F.col("loyal")==1)\
&(F.lead("loyal").over(w)==0),F.lit(1))\
.otherwise(F.lit(0))).show()
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
#|consumer_id|product_id|    TRX_ID|pattern|loyal| trx_date|row_num| mx|flag|
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
#|         11|         1|1152397078|  VVVVM|    1| 3/5/2020|      1|  5|   0|
#|         11|         1|1152944770|  VVVVV|    1| 3/6/2020|      2|  5|   0|
#|         11|         1|1153856408|  VVVVV|    1|3/15/2020|      3|  5|   0|
#|         11|         2|1155884040|  MVVVV|    1| 4/2/2020|      4|  5|   1|
#|         11|         2|1156854300|  MMVVV|    0|4/17/2020|      5|  5|   0|
#+-----------+----------+----------+-------+-----+---------+-------+---+----+
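Partitioning by product_id mixes rows from different consumers and misses the case where a consumer's streak ends on their last row (where lead() is null). The update below partitions by consumer_id instead and also treats a null lead() as the end of the streak.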
UPDATE:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("consumer_id").orderBy('row_num')
lead = F.lead("loyal").over(w)
df.withColumn("Flag", F.when((F.col("loyal")==1)
                             & ((lead==0) | (lead.isNull())), F.lit(1))
                       .otherwise(F.lit(0))).show()
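With the sample df created above, this should reproduce the required output: Flag becomes 1 on row_num 4 for consumer_id 11 (loyal is 1 and the next row's loyal is 0) and on row_num 3 for consumer_id 12 (loyal is 1 and there is no next row), and 0 everywhere else.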
I have created two data frames like below.
df = spark.createDataFrame(
    [(1, 1, 2, 4), (1, 2, 9, 5), (2, 1, 2, 1), (2, 2, 1, 2), (4, 1, 5, 2), (4, 2, 6, 3), (5, 1, 3, 3), (5, 2, 8, 4)],
    ("sid", "cid", "Cr", "rank"))
df1 = spark.createDataFrame(
    [[1,1],[1,2],[1,3],[2,1],[2,2],[2,3],[4,1],[4,2],[4,3],[5,2],[5,3],[5,3],[3,4]],
    ["sid","cid"])
Because of some requirement, I created a SQLContext and a temporary view, like below.
df.createOrReplaceTempView("temp")
df2=sqlContext.sql("select sid,cid,cr,rank from temp")
Then I am doing a left join based on some condition:
joined = (df2.alias("df")
.join(
df1.alias("df1"),
(col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
"left"))
joined.show()
+---+---+---+----+----+----+
|sid|cid| cr|rank| sid| cid|
+---+---+---+----+----+----+
|  5|  1|  3|   3|null|null|
|  1|  1|  2|   4|   1|   1|
|  4|  2|  6|   3|   4|   2|
|  5|  2|  8|   4|   5|   2|
|  2|  2|  1|   2|   2|   2|
|  4|  1|  5|   2|   4|   1|
|  1|  2|  9|   5|   1|   2|
|  2|  1|  2|   1|   2|   1|
+---+---+---+----+----+----+
Then finally I am executing the below code:
final=joined.select(
col("df2.*"),
col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
Then I am getting an error like below:
"AnalysisException: "cannot resolve 'df2.*' give input columns 'cr, sid, sid, cid, cid, rank';"
But my expected output should be:
+---+---+---+----+----+
|sid|cid| Cr|rank|flag|
+---+---+---+----+----+
|  1|  1|  2|   4|   0|
|  1|  2|  9|   5|   0|
|  2|  1|  2|   1|   0|
|  2|  2|  1|   2|   0|
|  4|  1|  5|   2|   0|
|  4|  2|  6|   3|   0|
|  5|  1|  3|   3|   1|
|  5|  2|  8|   4|   0|
+---+---+---+----+----+
The mistake is:
joined = (df2.alias("df")
.join(
df1.alias("df1"),
(col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
"left"))
joined.show()
Here we should either use df2.alias("df2"), or keep the alias "df" and select with joined.select(col("df.*"), ...).
The complete solution is:
joined = (df2.alias("df2")
.join(
df1.alias("df1"),
(col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
"left"))
joined.show()
final=joined.select(
col("df2.*"),
col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")