Apply a function over a column in a group in PySpark dataframe - python

I have a PySpark dataframe like this,
+---+---+---+
|id_|  p|  a|
+---+---+---+
|  1|  4| 12|
|  1|  3| 14|
|  1| -7| 16|
|  1|  5| 11|
|  1|-20| 90|
|  1|  5|120|
|  2| 11|267|
|  2|-98|124|
|  2|-87|120|
|  2| -1| 44|
|  2|  5|  1|
|  2|  7| 23|
+---+---+---+
I also have a Python function like this:
import numpy as np

def fun(x):
    # Running total of x, reset to 0 whenever it drops below zero
    total = 0
    result = np.empty_like(x)
    for i, y in enumerate(x):
        total += y
        if total < 0:
            total = 0
        result[i] = total
    return result
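For reference, this is what the function produces on each group of p values from the sample data (plain Python, assuming fun and numpy as defined above are in scope):
print(fun([4, 3, -7, 5, -20, 5]).tolist())     # [4, 7, 0, 5, 0, 5]
print(fun([11, -98, -87, -1, 5, 7]).tolist())  # [11, 0, 0, 0, 5, 12]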
I want to group the PySpark dataframe on column id_ and apply the function fun over the column p.
I want to do something like
spark_df.groupBy('id_')['p'].apply(fun)
I am currently doing this with a pandas UDF with the help of PyArrow, which is not time-efficient enough for my application.
The result I am looking for is:
[4, 7, 0, 5, 0, 5, 11, 0, 0, 0, 5, 12]
This is the resultant dataframe I am looking for:
+---+---+---+
|id_|  p|  a|
+---+---+---+
|  1|  4| 12|
|  1|  7| 14|
|  1|  0| 16|
|  1|  5| 11|
|  1|  0| 90|
|  1|  5|120|
|  2| 11|267|
|  2|  0|124|
|  2|  0|120|
|  2|  0| 44|
|  2|  5|  1|
|  2| 12| 23|
+---+---+---+
Is there a direct way to do this with the PySpark APIs themselves?
I can aggregate column p into a list with collect_list after grouping on id_, apply a UDF over that, and then use explode to get column p as I need it in the result dataframe.
But how do I retain the other columns of my dataframe?

Yes, you can convert the above Python function to a PySpark UDF.
Since you are returning an array of integers, it is important to specify the return type as ArrayType(IntegerType()).
Below is the code:
import numpy as np
from pyspark.sql.functions import udf, collect_list
from pyspark.sql.types import ArrayType, IntegerType

@udf(returnType=ArrayType(IntegerType()))
def fun(x):
    total = 0
    result = np.empty_like(x)
    for i, y in enumerate(x):
        total += y
        if total < 0:
            total = 0
        result[i] = total
    return result.tolist()  # Convert the NumPy array to a Python list
Since the input to your UDF must be a list, let's group the data on 'id_' and collect the values of 'p' into arrays.
df = df.groupBy('id_').agg(collect_list('p'))
df = df.toDF('id_', 'p_') # Assign a new alias name 'p_'
df.show(truncate=False)
Grouped data:
+---+------------------------+
|id_|p_                      |
+---+------------------------+
|1  |[4, 3, -7, 5, -20, 5]   |
|2  |[11, -98, -87, -1, 5, 7]|
+---+------------------------+
Next, we apply the UDF to this data:
df.select('id_', fun(df.p_)).show(truncate=False)
Output:
+---+--------------------+
|id_|fun(p_) |
+---+--------------------+
|1 |[4, 7, 0, 5, 0, 5] |
|2 |[11, 0, 0, 0, 5, 12]|
+---+--------------------+
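If you need one value per row instead of one array per group, the UDF output can be exploded; a minimal sketch building on the df and fun above, where posexplode also carries each value's position within its group:
from pyspark.sql.functions import posexplode

df.select('id_', posexplode(fun(df.p_)).alias('pos', 'p_new')).show()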

I managed to achieve the result I need with the following steps.
My DataFrame looks like this:
+---+---+---+
|id_| p| a|
+---+---+---+
| 1| 4| 12|
| 1| 3| 14|
| 1| -7| 16|
| 1| 5| 11|
| 1|-20| 90|
| 1| 5|120|
| 2| 11|267|
| 2|-98|124|
| 2|-87|120|
| 2| -1| 44|
| 2| 5| 1|
| 2| 7| 23|
+---+---+---+
I group the dataframe on id_, collect the column I want to apply the function to into a list using collect_list, and apply the function like this:
import pyspark.sql.functions as F

agg_df = df.groupBy('id_').agg(F.collect_list('p').alias('collected_p'))
agg_df = agg_df.withColumn('new', fun('collected_p'))  # fun is the ArrayType UDF defined above
I now want to merge agg_df back into my original dataframe. For that I first use explode to turn the values in column new into rows.
agg_df = agg_df.withColumn('exploded', F.explode('new'))
In order to merge, I use monotonically_increasing_id to generate an id for both the original dataframe and agg_df. From that I derive an idx for each dataframe, because monotonically_increasing_id won't produce the same values on both dataframes.
from pyspark.sql import Window

agg_df = agg_df.withColumn('id_mono', F.monotonically_increasing_id())
df = df.withColumn('id_mono', F.monotonically_increasing_id())

w = Window.partitionBy(F.lit(0)).orderBy('id_mono')
df = df.withColumn('idx', F.row_number().over(w))
agg_df = agg_df.withColumn('idx', F.row_number().over(w))

df = df.join(agg_df.select('idx', 'exploded'), ['idx']).drop('id_mono', 'idx')
+---+---+---+--------+
|id_| p| a|exploded|
+---+---+---+--------+
| 1| 4| 12| 4|
| 1| 3| 14| 7|
| 1| -7| 16| 0|
| 1| 5| 11| 5|
| 1|-20| 90| 0|
| 1| 5|120| 5|
| 2| 11|267| 11|
| 2|-98|124| 0|
| 2|-87|120| 0|
| 2| -1| 44| 0|
| 2| 5| 1| 5|
| 2| 7| 23| 12|
+---+---+---+--------+
I am not sure this is a straightforward way to do it. It would be great if anyone could suggest optimizations for this.
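One possible simplification, assuming Spark 2.4+ (for arrays_zip) and the ArrayType UDF fun from the answer above: collect the whole row per group so the other columns travel along, which avoids the monotonically_increasing_id join entirely. As with any collect_list, the element order inside each group is only as reliable as the ordering of the input. A sketch:
import pyspark.sql.functions as F

# Collect p and a together so they stay aligned within each group
agg = (df.groupBy('id_')
         .agg(F.collect_list(F.struct('p', 'a')).alias('rows'))
         .withColumn('new', fun(F.col('rows.p'))))  # 'rows.p' is the array of p values per group

# Zip the original rows with the computed values and explode back to one row per element
result = (agg.select('id_', F.explode(F.arrays_zip('rows', 'new')).alias('z'))
             .select('id_',
                     F.col('z.rows.p').alias('p'),
                     F.col('z.rows.a').alias('a'),
                     F.col('z.new').alias('exploded')))
result.show()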

Related

add "group" column to a pyspark dataframe

I need to add a new column to the following dataframe so that rows that have the same value in column "column_1" get the same numerical value, starting with 1, in a new column called "group"
```
# +---+--------+
# | id|column_1|
# +---+--------+
# |  0|       a|
# |  7|       a|
# |  1|       c|
# |  2|       d|
# |  3|       e|
# |  4|       a|
# | 10|       c|
# | 12|       b|
# +---+--------+
```
And I want:
```
# +---+--------+-----+
# | id|column_1|group|
# +---+--------+-----+
# |  0|       a|    1|
# |  7|       a|    1|
# |  1|       c|    3|
# |  2|       d|    4|
# |  3|       e|    5|
# |  4|       a|    1|
# | 10|       c|    3|
# | 12|       b|    2|
# +---+--------+-----+
```
from pyspark.sql import functions as F
from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

df_expl = df_expl.withColumn("aux", F.lit(1))  # constant helper column so dense_rank ranks over the whole dataframe
windowSpec = Window.partitionBy("aux").orderBy("column_1")
df_expl = df_expl.withColumn("group", dense_rank().over(windowSpec))
df_expl = df_expl.drop("aux")

how to work in multiline in dataframe spark and update column in multiline [duplicate]

This question already has answers here:
PySpark - get row number for each row in a group
(2 answers)
Closed 2 years ago.
In PySpark, I have a dataframe in this format:
CODE | TITLE | POSITION
A | per | 1
A | eis | 3
A | fon | 4
A | dat | 5
B | jem | 2
B | neu | 3
B | tri | 5
B | nok | 6
and I want to have this:
CODE | TITLE | POSITION
A | per | 1
A | eis | 2
A | fon | 3
A | dat | 4
B | jem | 1
B | neu | 2
B | tri | 3
B | nok | 4
The idea is that the POSITION column should start at 1 within each CODE. For example, for CODE A it starts at 1 but position 2 is missing, so I need to turn 3 into 2, 4 into 3, and 5 into 4.
How can we do that in PySpark?
Thank you for your help.
With a slightly simpler dataframe:
df.show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
| A| AA| 1|
| A| BB| 3|
| A| CC| 4|
| A| DD| 5|
| B| EE| 2|
| B| FF| 3|
| B| GG| 5|
| B| HH| 6|
+----+-----+--------+
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
df.withColumn('POSITION', row_number().over(Window.partitionBy('CODE').orderBy('POSITION'))).show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
| B| EE| 1|
| B| FF| 2|
| B| GG| 3|
| B| HH| 4|
| A| AA| 1|
| A| BB| 2|
| A| CC| 3|
| A| DD| 4|
+----+-----+--------+
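Note that the partitions come back in arbitrary order above (B before A); if the display order matters, an explicit sort can be appended to the same snippet, for example:
df.withColumn('POSITION', row_number().over(Window.partitionBy('CODE').orderBy('POSITION'))) \
  .orderBy('CODE', 'POSITION').show()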

Count occurrences in pyspark dataframe

I need to count the occurrences of repeated values in a PySpark dataframe as shown.
In short, while the value stays the same the count keeps incrementing; when the value changes, the count resets. And I need it to be in a column.
What I have:
+------+
| val |
+------+
| 0 |
| 0 |
| 0 |
| 1 |
| 1 |
| 2 |
| 2 |
| 2 |
| 3 |
| 3 |
| 3 |
| 3 |
+------+
What I need:
+------+-----+
| val |ocurr|
+------+-----+
| 0 | 0 |
| 0 | 1 |
| 0 | 2 |
| 1 | 0 |
| 1 | 1 |
| 2 | 0 |
| 2 | 1 |
| 2 | 2 |
| 3 | 0 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
+------+-----+
Use the when and lag functions to group the same consecutive values, and use row_number to get the counts. You should have an appropriate ordering column; my temporary ordering column id is not ideal because it is not guaranteed to preserve the original row order.
df = spark.createDataFrame([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 0, 0, 0], 'int').toDF('val')

from pyspark.sql.functions import *
from pyspark.sql import Window

w1 = Window.orderBy('id')
w2 = Window.partitionBy('group').orderBy('id')

df.withColumn('id', monotonically_increasing_id()) \
  .withColumn('group', sum(when(col('val') == lag('val', 1, 1).over(w1), 0).otherwise(1)).over(w1)) \
  .withColumn('order', row_number().over(w2) - 1) \
  .orderBy('id').show()
+---+---+-----+-----+
|val| id|group|order|
+---+---+-----+-----+
| 0| 0| 1| 0|
| 0| 1| 1| 1|
| 0| 2| 1| 2|
| 1| 3| 2| 0|
| 1| 4| 2| 1|
| 2| 5| 3| 0|
| 2| 6| 3| 1|
| 2| 7| 3| 2|
| 3| 8| 4| 0|
| 3| 9| 4| 1|
| 3| 10| 4| 2|
| 3| 11| 4| 3|
| 0| 12| 5| 0|
| 0| 13| 5| 1|
| 0| 14| 5| 2|
+---+---+-----+-----+
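If the real data has a reliable ordering column (here a hypothetical ts column, not present in the sample), it can replace the generated id directly; a sketch reusing the imports and logic above:
# 'ts' is a hypothetical ordering column, e.g. a timestamp
w1 = Window.orderBy('ts')
w2 = Window.partitionBy('group').orderBy('ts')

df.withColumn('group', sum(when(col('val') == lag('val', 1, 1).over(w1), 0).otherwise(1)).over(w1)) \
  .withColumn('ocurr', row_number().over(w2) - 1) \
  .orderBy('ts').show()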

Get maximum value in pyspark, insert it and drop duplicates

I am stuck solving my problem.
What I have:
A PySpark dataframe that looks like:
+----+---------+---------+
| id | country | counter |
+====+=========+=========+
| A | RU | 1 |
+----+---------+---------+
| B | EN | 2 |
+----+---------+---------+
| A | IQ | 1 |
+----+---------+---------+
| C | RU | 3 |
+----+---------+---------+
| D | FR | 5 |
+----+---------+---------+
| B | FR | 5 |
+----+---------+---------+
For each id, I need to keep the country with the maximum counter (or any one of them if they are equal) and delete all the other duplicates.
So it should look like:
+----+---------+---------+
| id | country | counter |
+====+=========+=========+
| A | RU | 1 |
+----+---------+---------+
| C | RU | 3 |
+----+---------+---------+
| D | FR | 5 |
+----+---------+---------+
| B | FR | 5 |
+----+---------+---------+
Can anyone help me?
You can first drop duplicates based on id and counter, then take the max over a window partitioned by id, and finally filter the rows where counter equals the maximum value.
If the order of id is to be retained, we need a monotonically increasing id to be assigned so we can sort later:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id')
out = (df.withColumn('idx', F.monotonically_increasing_id())
         .drop_duplicates(['id', 'counter'])
         .withColumn('Maximum', F.max(F.col('counter')).over(w))
         .filter("counter == Maximum")
         .orderBy('idx')
         .drop(*['idx', 'Maximum']))
out.show()
+---+-------+-------+
| id|country|counter|
+---+-------+-------+
| A| RU| 1|
| C| RU| 3|
| D| FR| 5|
| B| FR| 5|
+---+-------+-------+
If the order of id is not a concern, same logic but no additional id is required:
from pyspark.sql.window import Window

w = Window.partitionBy('id')
out1 = (df.drop_duplicates(['id', 'counter'])
          .withColumn('Maximum', F.max(F.col('counter')).over(w))
          .filter("counter == Maximum")
          .drop('Maximum'))
out1.show()
+---+-------+-------+
| id|country|counter|
+---+-------+-------+
| B| FR| 5|
| D| FR| 5|
| C| RU| 3|
| A| RU| 1|
+---+-------+-------+
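A shorter alternative pattern (not from the answer above, just a common approach, reusing the F and Window imports): rank the rows per id by counter descending and keep the first one; ties break arbitrarily, which matches the "or any if equal" requirement. A sketch:
w2 = Window.partitionBy('id').orderBy(F.desc('counter'))
out2 = (df.withColumn('rn', F.row_number().over(w2))
          .filter('rn = 1')
          .drop('rn'))
out2.show()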

PySpark, kind of groupby, considering sequence

I have this kind of dataset:
+------+------+------+
| Time | Tool | Hole |
+------+------+------+
| 1 | A | H1 |
| 2 | A | H2 |
| 3 | B | H3 |
| 4 | A | H4 |
| 5 | A | H5 |
| 6 | B | H6 |
+------+------+------+
The expected result is the following: It's a kind of temporal aggregation of my data, where the sequence is important.
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A | 1 | 2 |
| B | 3 | 3 |
| A | 4 | 5 |
| B | 6 | 6 |
+------+-----------+---------+
The current result, with a groupBy statement, doesn't match my expectation, as the sequence is not considered.
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A | 1 | 5 |
| B | 3 | 5 |
+------+-----------+---------+
rdd = rdd.groupby(['tool']).agg(min(rdd.time).alias('minTMSP'),
                                max(rdd.time).alias('maxTMSP'))
I tried going through a window function, but without any result so far... Any idea how I could handle this use case in PySpark?
We can use the lag function and Window class to check if the entry in each row has changed with regard to its previous row. We can then calculate the cumulative sum using this same Window to find our column to group by. From that point on it is straightforward to find the minimum and maximum times per group.
Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([(1, 'A'), (2, 'A'), (3, 'B'), (4, 'A'), (5, 'A'), (6, 'B')],
                           schema=['Time', 'Tool'])
w = Window.partitionBy().orderBy('Time')

df2 = (df.withColumn('Tool_lag', F.lag(df['Tool']).over(w))
         .withColumn('equal', F.when(F.col('Tool') == F.col('Tool_lag'), F.lit(0)).otherwise(F.lit(1)))
         .withColumn('group', F.sum(F.col('equal')).over(w))
         .groupBy('Tool', 'group')
         .agg(F.min(F.col('Time')).alias('start'),
              F.max(F.col('Time')).alias('end'))
         .drop('group'))
df2.show()
Output:
+----+-----+---+
|Tool|start|end|
+----+-----+---+
| A| 1| 2|
| B| 3| 3|
| A| 4| 5|
| B| 6| 6|
+----+-----+---+
