I have a dataframe with the following structure:
user_id | country | event |
1       | CA      | 1     |
2       | USA     | 1     |
and I want to add a new column with a period range (0-n) to get something like this:
user_id | country | event | period |
1       | CA      | 1     | 1      |
1       | CA      | 1     | 2      |
1       | CA      | 1     | ...    |
1       | CA      | 1     | n      |
2       | USA     | 1     | 1      |
2       | USA     | 1     | 2      |
2       | USA     | 1     | ...    |
2       | USA     | 1     | n      |
As I understand it, this should involve some window function and withColumn:
w = Window.partitionBy(['user_id', 'country', 'event'])
df = df.withColumn('period', (???).over(w))
How can I add the new column and, at the same time, the new rows over some range?
First use spark.range() to create a second DataFrame containing the periods. For example, with n=3:
n = 3
periods = spark.range(1, n+1).withColumnRenamed("id", "period")
periods.show()
#+------+
#|period|
#+------+
#| 1|
#| 2|
#| 3|
#+------+
Now crossJoin this with df to get the desired output:
df = df.crossJoin(periods)
df.show()
#+-------+-------+-----+------+
#|user_id|country|event|period|
#+-------+-------+-----+------+
#| 1| CA| 1| 1|
#| 1| CA| 1| 2|
#| 1| CA| 1| 3|
#| 2| USA| 1| 1|
#| 2| USA| 1| 2|
#| 2| USA| 1| 3|
#+-------+-------+-----+------+
Note that range does not materialize the DataFrame up front, and the small periods DataFrame is broadcast, so the Cartesian product will not be expensive:
df.explain()
#== Physical Plan ==
#BroadcastNestedLoopJoin BuildRight, Cross
#:- Scan ExistingRDD[user_id#0,country#1,event#2]
#+- BroadcastExchange IdentityBroadcastMode
# +- *(1) Project [id#31L AS period#33L]
# +- *(1) Range (1, 4, step=1, splits=2)
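As a side note, on Spark 2.4+ the same result can be produced without a cross join by generating the period values inline with sequence and exploding them. This is only a minimal sketch, assuming the same df and n as above:
from pyspark.sql import functions as F

# One array [1, ..., n] per row, then one output row per array element
df_alt = df.withColumn("period", F.explode(F.sequence(F.lit(1), F.lit(n))))
df_alt.show()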
Related question:
I'm trying to concatenate two different DataFrames, but I haven't been able to do it so far.
I would like to do the following.
user_ids
+-------+
|user_id|
+-------+
| 96885|
| 58620|
| 56483|
| 65282|
| 61336|
| 63337|
| 56484|
| 36890|
| 57420|
| 54434|
| 101512|
| 58622|
| 67561|
| 55617|
| 58623|
| 54435|
| 230572|
| 58624|
| 30034|
| 55639|
+-------+
prediction
+----------+
|prediction|
+----------+
| 1|
| 1|
| 0|
| 0|
| 1|
| 0|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 0|
| 0|
| 1|
| 1|
| 0|
| 1|
| 1|
| 1|
+----------+
What I would like to have is another DataFrame like the following:
+-------+----------+
|user_id|prediction|
+-------+----------+
|  96885|         1|
|  58620|         1|
|  56483|         0|
|  65282|         0|
|  61336|         1|
|  63337|         0|
|  56484|         1|
|  36890|         1|
|  57420|         1|
|  54434|         1|
| 101512|         1|
|  58622|         1|
|  67561|         0|
|  55617|         0|
|  58623|         1|
|  54435|         1|
| 230572|         0|
|  58624|         1|
|  30034|         1|
|  55639|         1|
+-------+----------+
Using pandas, I could do something like
user_ids['prediction'] = prediction
But I need to do it in a distributed way using PySpark. How can I do that?
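No answer is included here, but a common approach is to attach a positional index to each side and join on it. This is only a sketch, and it assumes both DataFrames have the same number of rows and that their current order is the pairing you want:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Ordering by monotonically_increasing_id() keeps the current order but pulls
# all rows into a single partition, so this is only suitable for small data
w = Window.orderBy(F.monotonically_increasing_id())

ids_indexed = user_ids.withColumn("row_idx", F.row_number().over(w))
pred_indexed = prediction.withColumn("row_idx", F.row_number().over(w))

# Join on the positional index and drop it to get the side-by-side result
result = ids_indexed.join(pred_indexed, on="row_idx").drop("row_idx")
result.show()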
I have a DataFrame in PySpark which looks like this:
|Id1 | id2  | row | grp |
|12  | 1234 | 1   | 1   |
|23  | 1123 | 2   | 1   |
|45  | 2343 | 3   | 2   |
|65  | 2345 | 1   | 2   |
|67  | 3456 | 2   | 2   |
I need to retrieve the value of id2 corresponding to row = 1 and update all id2 values within each grp to that value.
This should be the final result:
|Id1 | id2  | row | grp |
|12  | 1234 | 1   | 1   |
|23  | 1234 | 2   | 1   |
|45  | 2345 | 3   | 2   |
|65  | 2345 | 1   | 2   |
|67  | 2345 | 2   | 2   |
I tried doing something like df.groupby('grp').sort('row').first('id2')
But apparently sort and orderBy don't work with groupBy in PySpark.
Any idea how to go about this?
Very similar to Steven's answer, but without using .rowsBetween.
You basically create a Window for each grp, then sort the rows by row and pick the first id2 for each grp.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.partitionBy('grp').orderBy('row')
df = df.withColumn('id2', F.first('id2').over(w))
df.show()
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234| 1| 1|
| 23|1234| 2| 1|
| 65|2345| 1| 2|
| 67|2345| 2| 2|
| 45|2345| 3| 2|
+---+----+---+---+
Try this:
from pyspark.sql import functions as F, Window as W

df.withColumn(
    "id2",
    F.first("id2").over(
        W.partitionBy("grp")
        .orderBy("row")
        .rowsBetween(W.unboundedPreceding, W.currentRow)
    ),
).show()
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
In PySpark, I have a DataFrame in this format:
CODE | TITLE | POSITION
A | per | 1
A | eis | 3
A | fon | 4
A | dat | 5
B | jem | 2
B | neu | 3
B | tri | 5
B | nok | 6
and I want to get this:
CODE | TITLE | POSITION
A | per | 1
A | eis | 2
A | fon | 3
A | dat | 4
B | jem | 1
B | neu | 2
B | tri | 3
B | nok | 4
The idea is that the POSITION column should start at 1 with no gaps within each CODE. For example, for CODE A, position 2 is missing, so 3 becomes 2, 4 becomes 3, and 5 becomes 4.
How can we do that in PySpark?
Thank you for your help.
With a slightly simpler dataframe
df.show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
| A| AA| 1|
| A| BB| 3|
| A| CC| 4|
| A| DD| 5|
| B| EE| 2|
| B| FF| 3|
| B| GG| 5|
| B| HH| 6|
+----+-----+--------+
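For a self-contained run, that sample dataframe can be built like this (a sketch, assuming an active SparkSession named spark):
# Build the simplified sample data shown above
df = spark.createDataFrame(
    [('A', 'AA', 1), ('A', 'BB', 3), ('A', 'CC', 4), ('A', 'DD', 5),
     ('B', 'EE', 2), ('B', 'FF', 3), ('B', 'GG', 5), ('B', 'HH', 6)],
    ['CODE', 'TITLE', 'POSITION'])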
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
df.withColumn('POSITION', row_number().over(Window.partitionBy('CODE').orderBy('POSITION'))).show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
| B| EE| 1|
| B| FF| 2|
| B| GG| 3|
| B| HH| 4|
| A| AA| 1|
| A| BB| 2|
| A| CC| 3|
| A| DD| 4|
+----+-----+--------+
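As a side note, if several titles could share the same POSITION value and should then receive the same new position, dense_rank could be swapped in for row_number. A sketch under that assumption:
from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

df.withColumn('POSITION', dense_rank().over(Window.partitionBy('CODE').orderBy('POSITION'))).show()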
I need to count the occurrences of consecutive repeated values in a PySpark DataFrame, as shown below.
In short, while the value stays the same the count keeps adding up, and when the value changes the count resets. I need this count in a column.
What I have:
+------+
| val |
+------+
| 0 |
| 0 |
| 0 |
| 1 |
| 1 |
| 2 |
| 2 |
| 2 |
| 3 |
| 3 |
| 3 |
| 3 |
+------+
What I need:
+------+-----+
| val |ocurr|
+------+-----+
| 0 | 0 |
| 0 | 1 |
| 0 | 2 |
| 1 | 0 |
| 1 | 1 |
| 2 | 0 |
| 2 | 1 |
| 2 | 2 |
| 3 | 0 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
+------+-----+
Use when and lag to group runs of the same consecutive value, then use row_number to get the counts. You should have a proper ordering column; my temporary ordering column id (built with monotonically_increasing_id) is not ideal because it is not guaranteed to preserve the input order.
from pyspark.sql.functions import *
from pyspark.sql import Window

df = spark.createDataFrame([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 0, 0, 0], 'int').toDF('val')

# id:    temporary ordering column (increasing, but not necessarily consecutive)
# group: running sum that increases by 1 whenever val differs from the previous row
# order: 0-based row number within each group, i.e. the desired count
w1 = Window.orderBy('id')
w2 = Window.partitionBy('group').orderBy('id')

df.withColumn('id', monotonically_increasing_id()) \
  .withColumn('group', sum(when(col('val') == lag('val', 1, 1).over(w1), 0).otherwise(1)).over(w1)) \
  .withColumn('order', row_number().over(w2) - 1) \
  .orderBy('id').show()
+---+---+-----+-----+
|val| id|group|order|
+---+---+-----+-----+
| 0| 0| 1| 0|
| 0| 1| 1| 1|
| 0| 2| 1| 2|
| 1| 3| 2| 0|
| 1| 4| 2| 1|
| 2| 5| 3| 0|
| 2| 6| 3| 1|
| 2| 7| 3| 2|
| 3| 8| 4| 0|
| 3| 9| 4| 1|
| 3| 10| 4| 2|
| 3| 11| 4| 3|
| 0| 12| 5| 0|
| 0| 13| 5| 1|
| 0| 14| 5| 2|
+---+---+-----+-----+
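If the data already comes with a reliable ordering column, say a timestamp named ts (hypothetical here), it is safer to order the windows by that instead of the generated id. A minimal sketch:
from pyspark.sql import functions as F, Window

# ts is a hypothetical, pre-existing ordering column on df
w1 = Window.orderBy('ts')
w2 = Window.partitionBy('group').orderBy('ts')

result = df \
    .withColumn('group', F.sum(F.when(F.col('val') == F.lag('val', 1, 1).over(w1), 0).otherwise(1)).over(w1)) \
    .withColumn('ocurr', F.row_number().over(w2) - 1)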
For every row in a PySpark DataFrame, I am trying to get a value from the first preceding row that satisfies a certain condition.
That is, if my dataframe looks like this:
X | Flag
1 | 1
2 | 0
3 | 0
4 | 0
5 | 1
6 | 0
7 | 0
8 | 0
9 | 1
10 | 0
I want output that looks like this:
X | Lag_X | Flag
1 | NULL | 1
2 | 1 | 0
3 | 1 | 0
4 | 1 | 0
5 | 1 | 1
6 | 5 | 0
7 | 5 | 0
8 | 5 | 0
9 | 5 | 1
10 | 9 | 0
I thought I could do this with the lag function and a WindowSpec; unfortunately, WindowSpec doesn't support .filter or .when, so this does not work:
conditional_window = Window().orderBy('X').filter(df['Flag'] == 1)
df = df.withColumn('lag_x', f.lag(df['X'], 1).over(conditional_window))
It seems like this should be simple, but I have been racking my brain trying to find a solution, so any help with this would be greatly appreciated.
The question is old, but I thought the answer might help others. Here is a working solution using window and lag functions.
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.functions import when
from pyspark.sql import SparkSession
# Get or create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create the DataFrame
a = spark.createDataFrame(
    [(1, 1),
     (2, 0),
     (3, 0),
     (4, 0),
     (5, 1),
     (6, 0),
     (7, 0),
     (8, 0),
     (9, 1),
     (10, 0)],
    ['X', 'Flag'])
# Use a window function
win = Window.orderBy("X")
# Condition: the value of "Flag" on the preceding row is not 0
condition = F.lag(F.col("Flag"), 1).over(win) != 0
# Add a new column: when the condition is true, take X - 1, which here equals
# the value of "X" on the previous row because X increases by 1
a = a.withColumn("Flag_X", F.when(condition, F.col("X") - 1))
Now, we obtain a DataFrame as shown below
+---+----+------+
| X|Flag|Flag_X|
+---+----+------+
| 1| 1| null|
| 2| 0| 1|
| 3| 0| null|
| 4| 0| null|
| 5| 1| null|
| 6| 0| 5|
| 7| 0| null|
| 8| 0| null|
| 9| 1| null|
| 10| 0| 9|
+---+----+------+
To fill the null values:
a = a.withColumn("Flag_X",
F.last(F.col("Flag_X"), ignorenulls=True)\
.over(win))
So the final DataFrame is as required:
+---+----+------+
| X|Flag|Flag_X|
+---+----+------+
| 1| 1| null|
| 2| 0| 1|
| 3| 0| 1|
| 4| 0| 1|
| 5| 1| 1|
| 6| 0| 5|
| 7| 0| 5|
| 8| 0| 5|
| 9| 1| 5|
| 10| 0| 9|
+---+----+------+
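As a follow-up, the two steps can also be collapsed into a single expression by keeping X only on the flagged rows and carrying the last non-null value forward over the strictly preceding rows. A sketch, reusing the DataFrame a from above and producing the Lag_X column named as in the question:
from pyspark.sql import functions as F
from pyspark.sql import Window

# Frame: all rows strictly before the current one, ordered by X
prev = Window.orderBy("X").rowsBetween(Window.unboundedPreceding, -1)

# Keep X only where Flag == 1, then take the most recent non-null value
a = a.withColumn(
    "Lag_X",
    F.last(F.when(F.col("Flag") == 1, F.col("X")), ignorenulls=True).over(prev),
)
a.select("X", "Lag_X", "Flag").show()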